reinforcement learning
15th June 2021

RL: Terminology & Resources Cheatsheet

Reinforcement Learning (RL) is an expansive and growing subfield within Machine Learning with a variety of important terminology. Through my RL journey, I often found myself looking for a cheatsheet of vocabulary and resources that I could refer back to whenever needed.

This article aims to do just that! It focuses on critical terminology, fundamental in becoming a RL engineer, and resources that I have found beneficial throughout my journey, including books, courses, articles, and academic papers.

Furthermore, it is important to note that this article will be regularly updated with new terminology and resources when applicable.

Article Contents

  1. Terminology
  2. Resources


Agent: an AI system or controller used to navigate a given environment.

Action: a command that an agent uses during each timestep of its environment, denoted \(A_t\).

Environment: a represention of a specific world, such as a game board.

Fully Observable Environemnt: allows agents to directly observe the environment state.

Partially Observable Environment: agents indirectly observe the environment, only viewing information related to the required task.

Reward: a positive or negative scalar feedback signal provided to the agent after it has interacted with its environment, denoted \(R_t\).

Reward Hypothesis: every agents goal is to select actions to maximise total future reward.

State: acts as a snapshot of the current environment, denoted \(S_t\).

State Space: a set of all possible states within a Markov Process.

Environment State: environment's private representation used to determine what observation/reward to pick next, denoted \(S_t^e\).

Agent State: agent's internal representation of the world around it. Captures exactly what the agent has seen and done so far and used to determine the agent's next action, denoted \(S_t^a\).

Information/Markov State: contains all useful information from the history, denoted \(O_t\).

History: a stored sequence of observations, actions, and rewards, representing a list of observable variables up to time \(t\).

Policy: represents the agent's brain, determining how it makes decisions to select its actions. Can be deterministic or stochastic.

Determinstic Policy: acts as a map from state to action.

Stochasic Policy: agents make random exploratory decisions to see more of the state space.

Value Function: predicts future reward used to evaluate the effectives of each future state.

Model: represents the agent's view of how the environment works, where the agent attempts to foresee what happens next.

Model-Free: agents that don't build a model of their enviroment or contain rewards and instead directly connect observations/states to actions.

Model-Based: agents that try to predict the next state and reward in its environments.

Value-Based: agents have a value function and no policy.

Policy-Based: agents have a policy and no value function.

Actor-Critic: agents use both a policy and a value function.

Learning Problem: agents are initially unaware of the environment around them and must become familiar with it through interactions.

Planning Problem: agents have a model of the environment and know all its rules, where they perform computations (without external interaction) to improve its policy.

Exploration: agents discover more about their environment through random actions, substituing immediate rewards for even greater ones.

Exploitation: agents focus on abusing the known information to maximise its immediate reward.

Prediction Problem: involves evaluating the performance of agents in their future states, given the current policy.

Control Problem: focuses on finding an optimal policy to gain the most future reward.

Markov Process: also known as a Markov Chain, are mathematical systems consisting of observable states that can switch from one state to another based on a set of probabilistic rules. Typically, represented as a tuple of two elements \(〈S, \mathcal{P}〉\), where \(S\) is a finite set of states and \(\mathcal{P}\) is a state transition matrix.

Makrov Property: also known as a Markov State, is used to capture all the relevant information from the history, where:

\(\begin{equation} \mathbb{P}[S_{t+1} \; | \;S_t] = \mathbb{P}[S_{t+1} \; | \; S_1, \cdots, S_t] \end{equation}\)

Markov Reward Process: or MRP for short, are an extension of Makrov Processes that contain two additional components: rewards and a discount factor. Forms a tuple of four components \(〈S, \mathcal{P}, \mathcal{R}, \gamma〉\), where \(S\) is a finite set of states, \(\mathcal{P}\) is a state transition matrix, \(\mathcal{R}\) is a reward function, and \(\gamma\) is a discount rate.

Makrov Decision Process: or MDP for short, are an extension of Markov Reward Processes (MRPs) that introduces decisions (actions). Uses the same tuple as MRPs with the addition of actions, denoted by \(〈S, \mathcal{A}, \mathcal{P}, \mathcal{R}, \gamma〉\). MDPs are an environment which contains a finite set of states that are Markov, designed to describe the agent/environment interaction setting.

State Transition Matrix: a square matrix of size \(N \times N\), where \(N\) is the number of states in a model, that stores every Markov Chain's state probabilities, as a row, for each possible transition assigned to them. Denoted by \(\mathcal{P}\), where:

\(\begin{equation} \mathcal{P}_{ss'} = \mathbb{P}[S_{t+1} = s' \; | \; S_t = s] \end{equation}\)

Reward Function: a mathematical function used to explain how much immediate reward \(R\) an agent receives within a given state \(S\), where:

\(\begin{equation} \mathcal{R}_s = \mathbb{E}[R_{t+1} \; | S_t = s] \end{equation}\)

Return: denoted by \(G_t\), representing the total reward over a given number of timesteps \(t\), with \(T\) as the final timestep. Calculated by combining multiple rewards together up to timestep \(t\). Typically used in episodic tasks, where:

\(\begin{equation} G_t = R_{t+1} + R_{t+2} + R_{t+3} + \cdots + R_T \end{equation}\)

Discounted Return: an extension of return that includes the use of a discount factor that is used to maximize the total expected discounted return over a given number of timesteps \(t\). Typically used in continuing tasks, where:

\(\begin{equation} G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots = \sum\limits^{\infty}_{k = 0} \gamma^k R_{t+k+1} \end{equation}\)

Episodic Tasks: a finite number of states that an agent performs in that concludes in a terminal state.

Continuing Tasks: a never-ending set of states that doesn't have a terminal state.

Value Function: a mathematical function that is used to estimate 'how good' it is for an agent to be in a given state \(s\) by calculating a future expected return.

Optimal Value Function: the best performing value function, denoted with a *, calculated through taking the maximum value function over all policies generated within an environment. Consists of two separate formulas for both the state-value and action-value functions.

State-Value Function: the same as a standard value function but includes a policy to enable agent decision-making. Formally defined as the expected return starting from state \(s\) and then following the policy \(\pi\), where:

\(\begin{equation} v_\pi(s) = \mathbb{E}_\pi[G_t \; | \; S_t = s] \end{equation}\)

Optimal State-Value Function: the best performing state-value function, where:

\(\begin{equation} v_*(s) = \max\limits_\pi v_\pi(s) \end{equation}\)

Action-Value Function: the expected return starting from state \(s\), while taking an action \(a\), and then following the policy \(\pi\), where:

\(\begin{equation} q_\pi(s, a) = \mathbb{E}_\pi[G_t \; | \; S_t = s, A_t = a] \end{equation}\)

Optimal Action-Value Function: the best performing action-value function, where:

\(\begin{equation} q_*(s, a) = \max\limits_\pi q_\pi(s, a) \end{equation}\)

Policy: denoted by \(\pi\), used to define the behaviour of RL agents and helps them to choose which actions to take within their environment. A policy can either be stochastic or determinstic.

Optimal Policy: presents the best behaviour an agent can have within an environment. These policies can be found by defining a partial ordering over all policies, denoted by \(\pi_*\). Can be found by maximising over \(q_*(s, a)\), where:

\(\begin{equation} \pi_*(a|s) = \left\{ \begin{array}{@{} l c @{}} 1 & if \; a = {\arg\max\limits_{a \in A}} \; q_*(s,a) \\ 0 & otherwise \end{array}\right. \end{equation}\)

Stochastic Policy: a probability distribution over a set of actions, used for every state, allowing agents to select actions within a state based on a probability for a given action, where:

\(\begin{equation} \pi(a|s) = \mathbb{P}[A_t = a \; | \; S_t = s] \end{equation}\)

Determinstic Policy: used to map states to actions, preventing uncertainty when taking actions, \(\pi(s) = a\).

Bellman Equation: a decompressed version of a value function, split into two parts: an immediate reward \(R_{t+1}\) and the discounted value of the successor state \(\gamma v(S_{t+1})\). The equation is used to express a relationship between the value of a state and the values of its successor state, used in conjunction with backup diagrams to visualise a one or two-step look ahead from one state to a possible successor state, where:

\(\begin{equation} v(s) = \mathbb{E}[R_{t+1} + \gamma v(S_{t+1}) \; | \; S_t = s] \end{equation}\)

Bellman Expectation Equations: a decomposition of the state-value \(v_\pi(s)\) and action-value \(q_\pi(s, a)\) functions, similar to the Bellman Equation, where:

\(\begin{equation} v_\pi(s) = \mathbb{E}_\pi[R_{t+1} + \gamma v_\pi(S_{t+1}) \; | \; S_t = s] \end{equation}\)

\(\begin{equation} q_\pi(s, a) = \mathbb{E}_\pi[R_{t+1} + \gamma q_\pi(S_{t+1}, A_{t+1}) \; | \; S_t = s, A_t = a] \end{equation}\)

Bellman Optimality Equations: similar to the Bellman Expectation Equations, but instead of taking the average, we take the maximum (best) action value that an agent can take, where:

\(\begin{equation} v_*(s) = \max\limits_{a} \mathcal{R}^a_s + \gamma \sum\limits_{s' \in S} \mathcal{P}^a_{ss'} v_*(s') \end{equation}\)

\(\begin{equation} q_*(s, a) = \mathcal{R}^a_s + \gamma \sum\limits_{s' \in S} \mathcal{P}^a_{ss'} \max\limits_{a'} q_*(s', a') \end{equation}\)

Dynamic Programming: or DP for short, is a method for solving complex problems by breaking them down into subproblems. The subproblems are then solved and combined to complete the whole complex problem. DP can be used to solves problems that have two properties: an optimal substructure and overlapping subproblems.

Principle of Optimality Theorem: A policy \(\pi(a|s)\) achieves the optimal value from state \(s\) when the value of that policy is the optimal value function \(v_\pi(s) = v_*(s)\). However, this can only be true if any state \(s'\) is reachable from \(s\) when the policy \(\pi\) followed achieves the optimal value from state \(s'\), mathematically: \(v_\pi(s') = v_*(s')\).

Optimal Substructure: explains that complex problems are decomposable into multiple solvable subproblems, where the optimal solutions to those subproblems can determine the optimal solution to the overall problem. Requires a principle of optimality.

Overlapping subproblems: involves recursive subproblems that can be cached and reused to solve complex problems, typically through divide-and-conquer algorithms.

Bellman's Curse of Dimensionality: involves an expoential growth of the number of states, given the number of state variables \(n = |S|\), which can be very computational expensive for a single backup.

Sample Backups: used for large MDPs, which take a sample of rewards and transitions \(〈S, A, R, S'〉\), instead of a reward function \(\mathcal{R}\) and the MDPs transition dynamics \(\mathcal{P}\). Advantageous for breaking the curse of dimensionality, setting a constant cost of backups that are independent on \(n = |S|\). Also, in model-free algorithms, it removes the requirement to know the model of the environment.

Policy Evaluation: focuses on evaluating a fixed policy \(v_\pi\), and can be used to solve prediction problems. Given an MDP \(〈S, \mathcal{A}, \mathcal{P}, \mathcal{R}, \gamma〉\) and a policy \(\pi\), the Bellman Expectation Equation is used and converted into an iterative update to use as a mechanism for evaluating the policy, and generate a true value function \(v_\pi\).

Policy Improvement: an approach used to find the best possible policy within a given MDP using a feedback process, where \(\pi' \ge \pi\). Uses the policy improvement theorem to identify the best policy \(q_\pi(s, \pi'(s)) \ge v_\pi(s)\).

Policy Iteration: a combination of policy evaluation and improvement to achieve an optimal policy \(\pi_*\) and satisfy the Bellman optimality equation.

Random Policy: a simple policy an agent follows to take any possible action at random. For example: a uniform random policy would consist of every action having the same probability chance. If there were four actions, each action would have a 25% chance of being taken.

Greey Policy: a policy used by agents to always take the best action an agent can take within a given state. This policy only reflects the short-term by perform a one-step look-ahead for \(v_\pi\). A greedy policy is created mathematically by:

\(\begin{equation} \pi'(s) = arg \max\limits_{a \in A} \; q_\pi(s, a) \end{equation}\)

Value Iteration: a special case of policy iteration, where policy evaluation is stopped after just one sweep of the state space. Uses an update rule version of the Bellman optimality equation. Intuitively, this type of algorithm starts at the terminal state of an MDP and works backwards to determine the optimal to that state to create the optimal policy \(\pi_*\). Mathematically:

\(\begin{equation} v_{k+1}(s) = \max\limits_{a \in A} \left(\mathcal{R}^a_s + \gamma \sum\limits_{s' \in S} \mathcal{P}^a_{ss'} \; v_k(s') \right) \end{equation}\)

\(\begin{equation} v_{k+1} = \max\limits_{a \in A} \mathcal{R}^a + \gamma \mathcal{P}^a v_k \end{equation}\)

Synchronous Dynamic Programming: the basic type of dynamic programming algorithms that require sweeps over an entire MDPs state space. Can be computationally expensive if the state space is large.

Asynchronous Dynamic Programming: a form of in-place iterative algorithms that perform in systematic sweeps of the state space. These types of algorithms backup the values of states individually in any order, proceeding with available state values. Can significantly reduce computation overhead, in comparsion to synchronous methods, and still guarantees convergence to optimal values. Uses three methods to assist in selecting states to update: in-place dynamic programming, prioritised sweeping, and real-time dynamic programming.

In-place Dynamic Programming: used to store a single copy of an MDPs value function for all states in the state space. Mathematically:

\(\begin{equation} v(s) \rightarrow \max\limits_{a \in A} \left(\mathcal{R}^a_s + \gamma \sum\limits_{s' \in S} \mathcal{P}^a_{ss'} \; v(s') \right) \end{equation}\)

Prioritised Sweeping: provides a measure of importance for each state within a state space. Uses the magnitude of the Bellman error to guide state selection by utilising a priority queue that determines which states are more important than others. Mathematically:

\(\begin{equation} \left| \max\limits_{a \in A} \left(\mathcal{R}^a_s + \gamma \sum\limits_{s' \in S} \mathcal{P}^a_{ss'} \; v(s') \right) - v(s) \right| \end{equation}\)

Real-time Dynamic Programming: a method that focuses on choosing states relevant to the agent (ones it has visited) to solve a given MDP. Works in conjunction with an iterative DP algorithm running to receive live samples of information based on the agent's experiences. The experiences are then used to guide the selection of states. After each time-step, we receive \(S_t, A_t, R_{t+1}\). When backing up the state \(S_t\), combined with an in-place calculation, we can identify the most relevant state. Mathematically:

\(\begin{equation} v(S_t) \rightarrow \max\limits_{a \in A} \left(\mathcal{R}^a_{S_t} + \gamma \sum\limits_{s' \in S} \mathcal{P}^a_{S_t s'} \; v(s') \right) \end{equation}\)