What is Reinforcement Learning?

Reinforcement Learning (RL) is one of the hottest research topics in the fields of Artificial Intelligence (AI) and Machine Learning (ML). With recent accomplishments such as DeepMind's AlphaGo and OpenAI Five, RL is paving the way for future AI applications.

So, what exactly is it? RL is a branch of ML, distinct from supervised and unsupervised learning, that focuses on intelligent agents taking actions in an environment and receiving rewards as feedback, with the aim of maximizing cumulative reward. This kind of sequential decision making solves problems through the automatic learning of optimal decisions over time, using a trial-and-error approach.

Throughout this article, we will cover the characteristics of RL, how it works, some of its core components, and problems found within it. To view a complete list of RL terminology and resources used within my articles, visit the RL Terminology & Resources Cheatsheet article.

Article Contents

  1. Characteristics of Reinforcement Learning
  2. How Reinforcement Learning Works
  3. History and States
  4. Types of Environments
  5. Components of a Reinforcement Learning Agent
  6. Categories of RL Agents
  7. Challenges of Reinforcement Learning

Characteristics of Reinforcement Learning

Figure 1.1. Disciplines of RL

RL encompasses a wide range of fields, causing it to exhibit different characteristics when compared against other ML paradigms. These characteristics include:

  1. Algorithms require no supervision, only a reward signal.
  2. Agents interact with their environment by taking actions, and those actions affect the subsequent data they receive.
  3. Agents must often take many actions in their environment before learning whether the reward received is positive or negative, producing a delayed feedback loop.
  4. The steps taken are sequential, so it is not immediately clear which actions were responsible for a given reward.
  5. Time plays a critical role in how the agents perform.
  6. RL algorithms don't use independent and identically distributed (i.i.d.) data. Instead, agents produce a sequence of actions and observations that correlate with each other.

These characteristics allow RL to be used in a wide range of applications, such as robotics for industrial automation, controlling drones and aircraft, managing investment portfolios, controlling power stations, or even optimising chemical reactions.

How Reinforcement Learning Works

RL assists in solving many real-world problems by using an agent, the component that decides what actions to take within a given environment. Each agent has one goal, based on the following reward hypothesis:

“All goals can be described by the maximisation of expected cumulative reward.”

To achieve this goal, an agent must select actions that maximise total future reward. While this sounds simple, it can be difficult because the actions taken may have long-term consequences. Additionally, if the reward the agent obtains is delayed, the positive or negative impact of an action won't become clear until a series of actions has been taken. Because of this, agents may be required to sacrifice immediate reward to gain more long-term reward in the future. The reward discussed is a scalar feedback signal, denoted by \(R_t\), that indicates how well an agent is doing at a given timestep \(t\).


An RL algorithm has two main components: the agent and the environment. The agent represents an AI system, and the environment represents a specific type of world, such as a game board.

These components exchange three primary signals: a state, an action, and a reward. Each serves a unique purpose. The state signal acts as a snapshot of the current state of the environment. The action signal is the action the agent takes at each timestep of the environment. And lastly, the reward signal provides feedback to the agent based on the previous action taken within the environment.

Figure 1.2. RL process cycle

Figure 1.2 represents a typical RL algorithm with both the agent and environment. The cycle starts with the agent receiving the first frame of the environment, state \(S_0\), where the agent then takes an action \(A_0\). After this, the environment transitions to a new state, \(S_1\), and returns a positive or negative reward \(R_1\) to the agent. This process repeats until a termination condition is met. In this example, the output would be a sequence of state, action, reward, and next state: \([S_0, A_0, R_1, S_1]\).

For simplicity and to solidify our understanding, let's break this down into a list. At each timestep \(t\), the agent:

  • Executes an action \(A_t\) in the environment
  • Receives an observation \(O_t\) from the environment
  • Receives a scalar reward \(R_t\) from the environment

While the environment:

  • Receives an action \(A_t\) from the agent
  • Emits an observation \(O_t\) to the agent
  • Emits a scalar reward \(R_t\) to the agent
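
To make this loop concrete, here is a minimal sketch of the agent-environment cycle in Python. The coin-guessing environment and all of its names are hypothetical, invented purely for illustration, not taken from any RL library:

```python
import random

class CoinFlipEnv:
    """Hypothetical toy environment: guess a coin flip, reward +1 if correct."""
    def __init__(self, episode_length=5, seed=0):
        self.episode_length = episode_length
        self.rng = random.Random(seed)
        self.t = 0

    def reset(self):
        self.t = 0
        return "start"  # initial observation S_0

    def step(self, action):
        coin = self.rng.choice(["heads", "tails"])
        reward = 1 if action == coin else 0   # scalar reward R_{t+1}
        self.t += 1
        done = self.t >= self.episode_length  # termination condition
        return coin, reward, done             # next observation, reward, done flag

def random_agent(observation):
    """A trivial agent that ignores the observation and guesses randomly."""
    return random.choice(["heads", "tails"])

env = CoinFlipEnv()
obs = env.reset()
trajectory = []  # the history of (S_t, A_t, R_{t+1}) tuples
done = False
while not done:
    action = random_agent(obs)
    next_obs, reward, done = env.step(action)
    trajectory.append((obs, action, reward))
    obs = next_obs

print(len(trajectory))  # 5 — one entry per timestep
```

The `reset`/`step` split mirrors the cycle in Figure 1.2: the environment emits an observation and reward, and the agent responds with an action until termination.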

History and States

As seen in the Figure 1.2 example, RL agents use a sequence of observations, actions, and rewards called the history, which stores all the observable variables up to time \(t\), denoted by equation 1.

\(\begin{equation} H_t = A_1, O_1, R_1, ..., A_t, O_t, R_t \end{equation}\)


The action the agent takes next depends on the observable history. Similarly, the environment also reviews the observable history to select the next observation/reward based on the agent's action. Unfortunately, because the history contains everything that has happened up to a given timestep, it can be very computationally expensive to look through. Instead, states are used, providing a summary of information to determine what happens next. Formally, states are a function of the history, denoted in equation 2.

\(\begin{equation} S_t = f(H_t) \end{equation}\)
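
As a rough illustration of \(S_t = f(H_t)\), the Python sketch below represents the history as a list of (action, observation, reward) tuples and shows two possible state functions: one that keeps only the latest observation and one that keeps everything. All names and values are hypothetical:

```python
# The history: a list of (action, observation, reward) tuples up to time t.
history = [
    ("left", "wall", 0),
    ("right", "open", 0),
    ("forward", "goal", 1),
]

def last_observation_state(history):
    """A common choice of state function f: keep only the latest observation."""
    return history[-1][1]

def full_history_state(history):
    """The other extreme: the state is the entire history (expensive to store)."""
    return tuple(history)

print(last_observation_state(history))  # goal
```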


Types of States

In practice, states often involve only looking at the last observation of the history. With this in mind, there are three types of states: environment, agent, and information.

The environment state, denoted by \(S_t^e\), is the environment’s private representation used to determine what observation/reward is picked next. This state is only visible by the environment and cannot be seen by the agent.

The agent state, denoted by \(S_t^a\), is the agent’s internal representation of the world around it. This state is used to capture exactly what the agent has seen and done so far, which is then used to determine the agent’s next action. Additionally, this is the information that is used by RL algorithms to successfully train the agents, which can be any function of history, using the same formula in equation 2.

The information state, or Markov state, contains all the useful information from the history. For a state \(S_t\) to be Markov, the probability of the next state \(S_{t+1}\), conditioned on the current state \(S_t\), must be equal to the probability of the next state conditioned on all of the previous states. This is denoted in equation 3.

\(\begin{equation} \mathbb{P}[S_{t+1} | S_t] = \mathbb{P}[S_{t+1} | S_1, ..., S_t] \end{equation}\)


A Markov state is extremely powerful because the latest state alone provides a sufficient statistic of the future. Once a Markov state is known, the rest of the history provides no additional information about what happens next, making it redundant, so it is beneficial to discard it to reduce computational cost. With this in mind, the Markov property from equation 3 can be expressed as the independence relation in equation 4.

\(\begin{equation} H_{1:t} \rightarrow S_t \rightarrow H_{t+1:\infty} \end{equation}\)


Types of Environments

Another critical component to RL algorithms is the type of environment used. There are two main types: fully observable and partially observable.

Fully Observable

Fully observable environments allow agents to directly observe the environment state, giving them the ability to see everything that the environment sees. This type of environment is identified mathematically within equation 5.

\(\begin{equation} O_t = S_t^a = S_t^e \end{equation}\)


This type of environment representation is the main formalism for RL, known as a Markov Decision Process (MDP). MDPs are extremely powerful and crucial to most RL algorithms but are unfortunately out of the scope of this article. For a full deep dive on MDPs, please check out the MDPs article series.

On the other hand, it is also crucial to understand that not every problem can be fully observable, which requires a different type of environment, one that is partially observable.

Partially Observable

Partially observable environments only allow agents to indirectly observe the environment state, providing them with just the information relevant to the required task. For example, a robot that uses a camera to see must localize itself within its environment, and a poker-playing agent can only observe its own cards and the public cards visible to all players on the table.

These types of environments are known as partially observable Markov decision processes (POMDPs), mathematically represented in equation 6.

\(\begin{equation} S_t^a \neq S_t^e \end{equation}\)


For these environments to successfully solve problems, their agents are required to construct their own state representation \(S_t^a\) of the environment around them. For example:

  • The agent could remember everything it has seen: \(S_t^a = H_t\)
  • Build beliefs of the environment state using a Bayesian approach: \(S_t^a = (\mathbb{P}[S_t^e = s^1], ..., \mathbb{P}[S_t^e = s^n])\)
  • Use a Recurrent Neural Network: \(S_t^a = \sigma(S_{t-1}^a W_s + O_t W_o)\)
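
The recurrent update in the last bullet can be sketched in a few lines of NumPy. This is purely illustrative: the weights are randomly initialised rather than learned, and the dimensions are arbitrary:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
state_dim, obs_dim = 4, 3
W_s = 0.1 * rng.normal(size=(state_dim, state_dim))  # recurrent weights (random, not trained)
W_o = 0.1 * rng.normal(size=(obs_dim, state_dim))    # observation weights

s = np.zeros(state_dim)  # initial agent state S_0^a
for _ in range(10):
    o = rng.normal(size=obs_dim)    # stand-in for the observation O_t
    s = sigmoid(s @ W_s + o @ W_o)  # S_t^a = sigma(S_{t-1}^a W_s + O_t W_o)

print(s.shape)  # (4,)
```

The agent state stays a fixed-size vector no matter how long the history grows, which is exactly why this representation is attractive for POMDPs.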

Components of a Reinforcement Learning Agent

Agents act as the controller of RL algorithms and are fundamental in their success. In this section, we will explore the main components of RL agents.

Every RL agent has up to three main components: a policy, a value function, and, optionally, a model.


Policy

A policy represents the agent’s brain in the form of a function, determining how the agent makes its decisions when selecting actions at each timestep. An agent’s goal is to find the optimal policy \(\pi^*\) that maximizes the expected return over each state and action pair. Policies come in two forms:

  • Deterministic - follows a given function that takes in a state \(s\) to get some action \(a\), denoted in equation 7. This type of policy acts as a map from state to action.

\(\begin{equation} a = \pi(s) \end{equation}\)


  • Stochastic - allows an agent to make random exploratory decisions to see more of the state space by sampling actions from a probability distribution conditioned on the given state, denoted in equation 8. This type of policy provides a probability distribution over the set of actions for a given state.

\(\begin{equation} \pi(a \; | \; s) = \mathbb{P}[A = a \; | \; S = s] \end{equation}\)
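
Both forms can be sketched in Python. The states, actions, and probabilities below are hypothetical, chosen only to show the distinction:

```python
import random

ACTIONS = ["left", "right"]

def deterministic_policy(state):
    """a = pi(s): a fixed map from state to action."""
    return "right" if state == "cliff_on_left" else "left"

def stochastic_policy(state):
    """pi(a | s): sample an action from a state-dependent distribution."""
    if state == "cliff_on_left":
        weights = [0.1, 0.9]  # mostly move right, away from the cliff
    else:
        weights = [0.5, 0.5]  # no preference: explore both directions
    return random.choices(ACTIONS, weights=weights)[0]

print(deterministic_policy("cliff_on_left"))       # right
print(stochastic_policy("open_field") in ACTIONS)  # True
```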


Value Function

A value function is a prediction of future reward used to evaluate the effectiveness of each future state. Therefore, this function assists in selecting the best action for each state, denoted in equation 9. Value functions use a discount rate \(\gamma\) between 0 and 1, where a value close to 0 encourages agents to care more about short-term reward, and a value close to 1 incentivizes them to care more about long-term reward.

\(\begin{equation} v_\pi(s) = \mathbb{E}_{\pi}[R_t + \gamma R_{t+1} + \gamma^2 R_{t+2} \; + \; ... \; | \; S_t = s] \end{equation}\)
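
As a quick illustration of the discount rate, here is a small Python sketch that computes the discounted return of a hypothetical reward sequence for two values of \(\gamma\):

```python
def discounted_return(rewards, gamma):
    """G_t = R_t + gamma * R_{t+1} + gamma^2 * R_{t+2} + ..."""
    g = 0.0
    for r in reversed(rewards):  # fold from the end: g = r + gamma * g
        g = r + gamma * g
    return g

rewards = [1.0, 1.0, 1.0, 10.0]  # a large reward arrives only at the end

# A far-sighted agent (gamma near 1) still values the delayed reward highly,
# while a myopic agent (gamma near 0) mostly sees the immediate rewards.
print(discounted_return(rewards, gamma=0.99))
print(discounted_return(rewards, gamma=0.1))
```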



Model

A model is an optional component representing the agent's view of how the environment works, where it tries to foresee what will happen next. Models have two parts: transitions \(\mathcal{P}\) that predict the next state and rewards \(\mathcal{R}\) that anticipate the next immediate reward. Both have mathematical representations, denoted in equation 10.

\(\begin{equation} \mathcal{P}_{ss'}^{a} = \mathbb{P}[S' = s' \; | \; S = s, A = a] \end{equation}\)

\(\begin{equation} \mathcal{R}_{s}^{a} = \mathbb{E}[R \; | \; S = s, A = a] \end{equation}\)


\(\mathcal{P}_{ss'}^{a}\) represents a state transition model that gives the probability of reaching the next state given the current state and action, and \(\mathcal{R}_s^a\) represents a reward model that gives the expected reward for the current state and action.
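
A tabular version of such a model can be sketched as plain Python dictionaries. The states, actions, probabilities, and rewards below are made up for illustration:

```python
# A tabular model over hypothetical states "s0"/"s1" and action "go":
# P[s][a] maps next states to probabilities, R[s][a] is the expected reward.
P = {
    "s0": {"go": {"s1": 0.8, "s0": 0.2}},
    "s1": {"go": {"s1": 1.0}},
}
R = {
    "s0": {"go": 1.0},
    "s1": {"go": 0.0},
}

def transition_prob(s, a, s_next):
    """P_{ss'}^a = P[S' = s' | S = s, A = a]"""
    return P[s][a].get(s_next, 0.0)

def expected_reward(s, a):
    """R_s^a = E[R | S = s, A = a]"""
    return R[s][a]

print(transition_prob("s0", "go", "s1"))  # 0.8
print(expected_reward("s0", "go"))        # 1.0
```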

Categories of RL Agents

When creating an RL algorithm, it's crucial to understand what type of agent is required to complete the task at hand. Unfortunately, there isn't a universal agent that solves every problem. Instead, agents are divided into three main areas, each containing individual sub-categories. There are seven sub-categories in total, which can be combined in various ways to solve different problems. Typically, an agent uses one sub-category from each area.

Model-Free vs Model-Based

The first area includes model-free and model-based agents. Both of these use a policy and/or a value function, either with or without a model of the environment. More specifically:

  • Model-free agents do not build a model of their environment or its reward dynamics. Instead, they map observations/states directly to actions.
  • Model-based agents, however, rely on the model of the environment to predict the next state and reward. These agents will either know the model perfectly or learn it explicitly.

Value-Based vs Policy-Based vs Actor-Critic

The second area focuses on the functionality of the agents, which can be value-based, policy-based, or actor-critic, where:

  • Value-based agents have no policy but a value function.
  • Policy-based agents have a policy but no value function.
  • And, Actor-Critic agents use both a policy and a value function.

On-Policy vs Off-Policy

The third area focuses on the type of policy improvement/evaluation method an agent uses. These can be either on-policy or off-policy:

  • On-Policy methods evaluate or improve a policy via the latest learned version of that policy.
  • Off-Policy methods evaluate or improve a policy using a different data source produced by a separate policy from the target one.

Challenges of Reinforcement Learning

RL presents its own set of challenges when creating agents that can successfully navigate an environment. Specifically, there are three problem pairs: learning and planning, the exploration and exploitation trade-off, and prediction and control.

Learning & Planning

In the learning problem, agents are initially unaware of the environment around them and must become familiar with it by interacting with it. Through continuous interaction, the agent improves its policy and begins to maximise its cumulative reward.

The planning problem provides the agent with a model of the environment, allowing it to know all its rules. Instead of interacting with the environment, the agent performs computations on the model, without external interaction, to improve its policy.

While both are different problem sets, they can be linked together. First, the agent learns how the environment works and then plans the best way to solve it.

Exploration vs Exploitation Trade-off

For an RL agent to be effective, it must find a balance between exploration and exploitation. Exploration involves the agent discovering more of the environment by trying random actions. While exploring, it sacrifices rewards it already knows about in the hope of finding even greater ones. Exploitation, however, focuses on the agent using its known information to maximise its immediate reward.
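
A common (though not the only) way to balance the two is an epsilon-greedy rule: explore with probability \(\epsilon\), otherwise exploit the best-known action. A minimal sketch, assuming a hypothetical list of estimated action values:

```python
import random

def epsilon_greedy(q_values, epsilon, rng=random):
    """With probability epsilon, explore with a random action;
    otherwise exploit the action with the highest estimated value."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))                         # explore
    return max(range(len(q_values)), key=q_values.__getitem__)      # exploit

q = [0.1, 0.5, 0.2]  # hypothetical value estimates for three actions
print(epsilon_greedy(q, epsilon=0.0))  # 1 — pure exploitation picks the best-known action
```

Setting \(\epsilon = 0\) gives a purely exploiting agent, while \(\epsilon = 1\) gives a purely exploring one; practical agents sit somewhere in between, often decaying \(\epsilon\) over time.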

Prediction & Control

The prediction and control problems are the final set of distinctions that are important in RL algorithms. Prediction involves the evaluation of how well the agent performs in future states, given the current policy. For example, if an agent was to walk forward, how much reward would it receive?

Comparatively, the control problem focuses on finding the optimal policy to gain the most future reward. For example, which direction should the agent walk to get the most reward? Typically, the prediction problem is solved first to solve the control problem.
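
As a toy illustration of the two problems, the sketch below first estimates the value of each fixed choice by averaging sampled rewards (prediction), then picks the better one (control). The environment and its payout probabilities are invented for the example:

```python
import random

rng = random.Random(0)

def rollout_reward(direction):
    """Hypothetical one-step environment: 'north' pays off 90% of the time,
    'south' only 20% of the time."""
    p = 0.9 if direction == "north" else 0.2
    return 1.0 if rng.random() < p else 0.0

# Prediction: evaluate each fixed policy by averaging sampled returns.
value_north = sum(rollout_reward("north") for _ in range(1000)) / 1000
value_south = sum(rollout_reward("south") for _ in range(1000)) / 1000

# Control: choose the direction with the higher predicted value.
best = "north" if value_north > value_south else "south"
print(best)  # north
```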