reinforcement learning
28th June 2021

MDPs – Part 1: Markov Process (MP)

Markov Decision Processes (MDPs) formally describe an environment for Reinforcement Learning (RL) in which a decision-maker (the agent) interacts with a fully observable environment. Formally, almost all RL problems can be framed as MDPs. Examples include optimal control problems, which focus on continuous MDPs; partially observable problems, specifically those convertible to MDPs; and bandit problems, which consist of single-state MDPs.

This article is the first of a three-part series, broken down into the core components of MDPs. This one focuses on Markov Processes (MPs). The second extends MPs with rewards, turning them into Markov Reward Processes (MRPs), found here. The third and final article adds actions to those MRPs to create and complete the MDP, located here.

Article Contents

  1. The Markov Process (MP)
  2. State Transition Matrix

The Markov Process (MP)

One of the simplest and most fundamental components of MDPs is the MP, also known as a Markov chain. Markov chains, named after Andrey Markov, are mathematical systems consisting of observable states (a situation or set of values). The states within these systems can switch from one to another according to specified probabilistic rules. Formally, a Markov process is a memoryless random process consisting of a sequence of random states, \(S_1, S_2, \cdots\), with the Markov property. It is represented as a tuple of two elements, \(\langle S, \mathcal{P} \rangle\), where:

  • \(S\) - a finite set of states.
  • \(\mathcal{P}\) - a state transition probability matrix.

All possible states of a system form a set called the state space. In MPs, this set of states must be finite (but can be extremely large), and observations of it form a sequence known as a chain. For example, suppose we have two states for the weather in a city, sunny (S) or rainy (R), which make up the state space. Observing these states over time forms a chain of states, such as \([S, R, S, S, R, \cdots]\), known as the history.

For a system to be an MP, it needs to fulfil the Markov Property, also known as a Markov State. The Markov Property states that the future depends only on the present state and not on the history that led to it. More formally:

“Markov states capture all the relevant information from the history.”

In RL algorithms, statistically, if a state is Markov, then the probability of what happens next in an agent's environment, \(S_{t+1}\), depends only on the current state the agent is in, \(S_t\), and not on everything that came before it, as denoted in equation 1.

\(\begin{equation} \mathbb{P}[S_{t+1}|S_t] = \mathbb{P}[S_{t+1}|S_1, \cdots, S_t] \end{equation}\)
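Equation 1 can be sketched in a few lines of Python. The two abstract states and their transition probabilities below are made up purely for illustration:

```python
import random

# A minimal sketch of a memoryless random process. The two abstract
# states 'a' and 'b' and their probabilities are illustrative only.
P = {"a": {"a": 0.5, "b": 0.5},
     "b": {"a": 0.3, "b": 0.7}}

def step(state, history=None):
    # The Markov property in code: the distribution of the next state
    # is a function of `state` alone; `history` is deliberately unused.
    states, probs = zip(*P[state].items())
    return random.choices(states, weights=probs)[0]

chain = ["a"]
for _ in range(5):
    chain.append(step(chain[-1], history=chain))
print(chain)
```

Passing the full `history` into `step` changes nothing: only the last state influences the sample, which is exactly what equation 1 asserts.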


State Transition Matrix

Every state within a Markov chain has at least two possible transitions (except for terminal states): one to itself and one to another state. Each transition has a probability assigned to it, and every state's transition probabilities are stored within a state transition matrix (STM), denoted by \(\mathcal{P}\). An STM is a square matrix of size \(N \times N\), where \(N\) is the number of states in the model (Markov chain), denoted in equation 2, and each row (state) must sum to 1.

\(\begin{equation} \mathcal{P} = \left[ \begin{array}{ccc} \mathcal{P}_{11} & \cdots & \mathcal{P}_{1n} \\ \vdots & \ddots & \vdots \\ \mathcal{P}_{n1} & \cdots & \mathcal{P}_{nn} \end{array} \right] \end{equation}\)


Each row contains the probabilities that the system transitions from one state to every other state (including itself), defined as the transition probabilities from a state \(s\) to all successor states \(s'\). Statistically, a Markov state \(s\) and successor state \(s'\) have a state transition probability: the probability that the Markov state will turn into the successor state, denoted in equation 3.

\(\begin{equation} \mathcal{P}_{ss'} = \mathbb{P}[S_{t+1} = s'|S_t = s] \end{equation}\)
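The two constraints on an STM (square shape, rows summing to 1) are easy to check in code. A sketch using a hypothetical 3-state matrix, written in plain Python to avoid dependencies:

```python
# A hypothetical 3-state transition matrix: entry P[i][j] is the
# probability P[S_{t+1} = j | S_t = i]. Values are illustrative only.
P = [
    [0.7, 0.2, 0.1],
    [0.0, 0.5, 0.5],
    [0.0, 0.0, 1.0],  # an absorbing (terminal) state: only loops to itself
]

# An STM must be square (N x N)...
N = len(P)
assert all(len(row) == N for row in P)

# ...and each row must be a probability distribution summing to 1.
assert all(abs(sum(row) - 1.0) < 1e-9 for row in P)
```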


The state transition matrix provides a complete structure to the Markov problem, where given a state, it is easy to determine how likely we are to end up in another state. Using the example from earlier, the transition matrix for sunny/rainy could look like this:

        Sunny   Rainy
Sunny   0.8     0.2
Rainy   0.1     0.9

Table 1.1. Weather example state transition matrix

In table 1.1, if the day is sunny, there is an 80% chance that it will stay sunny the next day and a 20% chance it will be rainy. Similarly, if it is a rainy day, there is a 10% probability that the weather will be sunny the following day and a 90% probability that it stays rainy.
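One useful consequence of the STM is that multiplying it by itself chains one-day transitions into a two-day forecast. A sketch in plain Python (the `matmul` helper is an illustrative assumption, written out to avoid external dependencies):

```python
# Table 1.1 as nested lists; row and column order: [Sunny, Rainy].
P = [[0.8, 0.2],
     [0.1, 0.9]]

def matmul(A, B):
    # Plain-Python matrix product: result[i][j] = sum_k A[i][k] * B[k][j].
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

# P @ P holds the two-step transition probabilities, e.g. the chance
# of sun in two days given sun today: 0.8*0.8 + 0.2*0.1 ≈ 0.66.
P2 = matmul(P, P)
print(P2[0][0])
```

Each row of the two-step matrix still sums to 1, so it is itself a valid STM.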

Figure 1.1. Weather MP model example

The weather example visualised in figure 1.1 contains nodes corresponding to system states and edges labelled with probabilities, representing the possible transitions from state to state. Furthermore, if the probability of a transition is 0%, the corresponding edge is not drawn, and terminal states are displayed as squares rather than circles.

In practice, the exact transition matrix is rarely known. Instead, system state observations are divided into sample episodes. For example:

  • Sunny → Sunny → Rainy → Rainy → Rainy → Sunny
  • Sunny → Rainy → Rainy → Sunny
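Episodes like the two above can be sampled directly from table 1.1's probabilities. A minimal sketch, where the fixed episode length and the `sample_episode` helper are illustrative assumptions:

```python
import random

# Table 1.1's transition probabilities for the sunny/rainy chain.
P = {"Sunny": {"Sunny": 0.8, "Rainy": 0.2},
     "Rainy": {"Sunny": 0.1, "Rainy": 0.9}}

def sample_episode(start, length):
    # Repeatedly sample the next state from the current state's row.
    episode = [start]
    while len(episode) < length:
        states, probs = zip(*P[episode[-1]].items())
        episode.append(random.choices(states, weights=probs)[0])
    return episode

print(" → ".join(sample_episode("Sunny", 6)))
```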

Episodes play a critical role in RL algorithms and are discussed further within the next article in the series. To continue with this series, check out the second article on Markov Reward Processes (MRPs).