MDPs – Part 2: Markov Reward Processes (MRPs)

This article is the second in a three-part series broken down into the core components of Markov Decision Processes (MDPs). This one focuses on Markov Reward Processes (MRPs). If you missed the first article on Markov Processes (MPs), please find it here. Alternatively, for the third article, on the final component of MDPs, click here.

Article Contents

  1. Markov Reward Process (MRP)
  2. Return
  3. Value Functions
  4. Bellman Equation

Markov Reward Process (MRP)

MRPs are an extension of MPs that contain two additional components: a reward function and a discount factor.

Formally, MRPs are a tuple \(〈S,\mathcal{P},\mathcal{R},\gamma〉\), where:

  • \(S\) - a finite set of states.
  • \(\mathcal{P}\) - a state transition probability matrix, \(\begin{equation}\mathcal{P}_{ss{'}} = \mathbb{P}[S_{t+1} = s{'} \; | \; S_t = s]\end{equation}\)
  • \(\mathcal{R}\) - a reward function, \(\begin{equation} \mathcal{R}_s = \mathbb{E}[R_{t+1} \; | \; S_t = s]\end{equation}\)
  • \(\gamma\) – a discount rate, \(\gamma \in [0,1]\)

Reward functions are used to explain how much immediate reward \(\mathcal{R}\) an agent receives within a given state \(S\).


Return (\(G_t\)) is the total reward over a given number of timesteps \(t\), represented in equation 1.1, where \(T\) is the final timestep. In general, this is the simplest case for return. The formula makes sense in applications which have a natural notion of final timestep, such as sequences that terminate, more formally known as episodic tasks.

\(\begin{equation} G_t = R_{t+1} + R_{t+2} + R_{t+3} + \cdots + R_T \end{equation}\)


\(\begin{equation} G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots = \sum_{k=0}^{\infty}{\gamma^k R_{t+k+1}} \end{equation}\)


In many cases, RL environment interactions don’t naturally break into identifiable episodes. Instead, they go on continuously without limit, formally known as continuing tasks. The formula highlighted in 1.1 can be problematic for these types of tasks due to the finite limitation. MRPs include a discount rate within the return to accommodate this.

Overall, the discounted return (equation 1.2) encourages agents to select actions to maximize the sum of the discounted rewards it receives over the future. In short, it is used to maximize the total expected discounted return over a given number of timesteps \(t\).

It does this by using a discount rate \(\gamma \), which determines the present value of future rewards. If \(\gamma = 0\), the agent is ‘myopic’ in that it is only concerned with maximizing immediate rewards. However, as \(\gamma \) approaches 1, the agent takes future rewards into account more strongly and becomes more farsighted.

Discounted return can be beneficial for these reasons:

  • It is mathematically convenient.
  • It assists in avoiding infinite returns in cyclic MPs.
  • It accounts for uncertainty about the future.
  • It provides a better reflection of human/animal behaviour, as immediate rewards are typically more favoured.

Value Functions

Almost all RL algorithms involve estimating value functions. Denoted by \(v(s)\), these functions estimate how good it is for an agent to be in a given state \(s\). This notion of ‘how good’ relates to the future expected return. In MRPs, the value function is the expected return starting from state \(s\), outlined in equation 2.

\(\begin{equation} v(s) = \mathbb{E}[G_t \; | \; S_t = s] \end{equation}\)


The value functions purpose is to provide a clear indication of how much reward an agent receives starting at a given state until a terminal state is reached.

Using a slightly more complex example than the weather one, highlighted in figure 1.1, we can see how rewards work in a state transition graph. The model focuses on an office worker that contains a state space of the following states:

  • Home – not at the office
  • Computer – working at the office
  • Coffee – drinking coffee at the office
  • Chatting – with colleagues at the office
Figure 1.1. Office worker example

The model shows that the office worker always starts his day from the home state and begins his shift with coffee. Additionally, the workday always ends in the home state from the computer state. There are no exceptions to these rules. With this in mind, the state transition matrix for the graph is as follows:

Home Coffee Chat Computer
Home 0.6 0.4 0 0
Coffee 0 0.1 0.7 0.2
Chat 0 0.2 0.5 0.3
Computer 0.2 0.2 0.1 0.5

Table 1.1. Office worker state transition matrix

Some example episodes may look like the following:

  • Home → Coffee → Coffee → Chat → Chat → Coffee → Computer → Computer → Home
  • Home → Home → Coffee → Computer → Home
  • Home → Coffee → Chat → Coffee → Computer → Home

Using these example episodes, we can calculate their expected return. With a \(\gamma = 0.8\), we would have the following:

\(v_1\) = 1 - (1 * 0.8) + (0 * 0.64) - (2 * 0.51) + (0 * 0.41) + (3 * 0.33) + (5 * 0.26) + (2 * 0.21) = 1.89

\(v_1\) = 1 + (1 * 0.8) + (3 * 0.64) + (2 * 0.51) = 4.74

\(v_1\) = 1 + (0 * 0.8) + (0 * 0.64) + (3 * 0.51) + (2 * 0.41) = 3.35

Bellman Equation

The value function can be decomposed into two parts:

  • Immediate reward \(R_{t+1}\)
  • And the discounted value of the successor state \(\gamma v(S_{t+1})\)

\(\begin{equation} \begin{aligned}[t] v(s) &= \mathbb{E}[G_t \; | \; S_t = s] \\[5pt] &= \mathbb{E}[R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots \; | \; S_t = s] \\[5pt] &= \mathbb{E}[R_{t+1} + \gamma (R_{t+2} + \gamma R_{t+3} + \cdots) \; | \; S_t = s] \\[5pt] &= \mathbb{E}[R_{t+1} + \gamma G_{t+1} \; | \; S_t = s] \\[5pt] &= \mathbb{E}[R_{t+1} + \gamma v(S_{t+1}) \; | \; S_t = s] \end{aligned} \end{equation}\)


Unrolling the value function (equation 3), we get what is known as the Bellman equation, which expresses a relationship between the value of a state and the values of its successor state. Backup diagrams (figure 1.2) are commonly used in conjunction with the Bellman equation, visualising the idea of looking ahead from one state to a possible successor state and to provide graphical summaries of RL algorithms.

Figure 1.2. Backup diagram

Each circle within figure 1.2 represents a state. Starting from state \(s\), the root node at the top, the agent could move to several next states \(s{'}\), and receive a reward \(r\). The value of the state \(s\) is the reward received upon leaving that state, plus a discounted average over the possible successor states, where the value for each of these successor states gets multiplied by the probability that the agent enters that successor state.

The Bellman equation averages over all the state possibilities, weighting each by its probability of occurring and explains that the value of the start state must equal the (discounted) value of the expected next state, plus the reward expected along the way. It can be expressed using matrices, where \(v\) represents a column vector within one entry per state, such that:

\(\begin{equation} v = \mathcal{R} + \gamma \mathcal{P} v \end{equation}\)

\(\begin{equation} \left[ \begin{array}{c} v(1) \\ \vdots \\ v(n) \end{array}\right] = \left[\begin{array}{c} \mathcal{R}_1 \\ \vdots \\ \mathcal{R}_n \end{array}\right] + \gamma \left[\begin{array}{ccc} \mathcal{P}_{11} & \cdots & \mathcal{P}_{1n} \\ \vdots & & \\ \mathcal{P}_{n1} & \cdots & \mathcal{P}_{nn} \end{array}\right] \left[\begin{array}{c} v(1) \\ \vdots \\ v(n) \end{array} \right] \end{equation}\)


Equation 4 explains that the value function \(v\) for each state is equal to the reward function \(\mathcal{R}\) plus gamma \(\gamma \) multiplied by the dot product of the state transition probability matrix \(\mathcal{P}\) and the initial value function \(v\), creating a new value function using the Bellman equation.

Interestingly, the Bellman equation is linear, allowing it to be solved directly, as seen in equation 5, where \(I\) represents an identity matrix. The computational complexity of this equation is \(O(n^3)\) for n states, preventing it from being a practical solution with large MRPs and making it only a suitable solution to small MRPs.

\(\begin{equation} \begin{aligned}[t] v &= \mathcal{R} + \gamma \mathcal{P} v \\[5pt] (I - \gamma \mathcal{P}) &= \mathcal{R} \\[5pt] v &= (I - \gamma \mathcal{P})^{-1} \mathcal{R} \end{aligned} \end{equation}\)


Some iterative methods for large MRPs include:

  • Dynamic programming
  • Monte-Carlo evaluation
  • Temporal-difference learning

Now that we have an understanding of MRPs, we can add the final component: decisions (actions), discussed in the third and final article Markov Decision Processes (MDPs).