 ## Monte-Carlo Prediction

In my previous article, RL: Dynamic Programming, we discussed Dynamic Programming (DP) and how it is used with model-based agents to solve known Markov Decision Processes (MDPs).

This article is the first of a multi-part series that focuses on model-free agents, arranged into individual articles for prediction and control algorithms. Model-Free Prediction (MFP) focuses on estimating the value function of an unknown MDP. While its counterpart, Model-Free Control (MFC), focuses on optimising the unknown MDPs value function. For more information on MFC, check out the article series!

## Monte-Carlo Learning

Monte-Carlo (MC) methods are the first type of learning method that we will discuss for estimating value functions and discovering optimal policies for an unknown MDP. While these methods are not always the most efficient, they are extremely effective at solving real-world RL problems and are widely used today.

Agents that use these types of methods learn directly from complete episodes of experiences sequence of states, actions, and rewards from actual (or simulated) interaction with the environment. Being model-free, they require no knowledge of the environment’s dynamics, but can still attain optimal behaviour. Due to only taking sample experiences, Monte-Carlo methods are a way to solve RL problems based on averaging sample returns, typically only for episodic tasks, where experiences are divided into episodes that eventually terminate. On the completion of an episode, the value estimates and policies are then updated.

A great example environment for Monte-Carlo Learning is the game blackjack, refer to the Monte-Carlo Learning project for more details.

### MC Prediction

MC Prediction is a form of policy evaluation, where the goal is to learn the value-function $$v_\pi$$ from episodes of experiences under policy $$\pi$$ (equation 1.1.1).

$$\begin{equation} S_1, A_1, R_2, \cdots, S_k \sim \pi \end{equation}$$

(1.1.1)

Commonly, value functions are calculated using the discounted expected return (equation 1.1.2a). However, MC Prediction uses the empirical mean return instead. There are two approaches to calculating this: first-visit and every-visit.

$$\begin{equation} v_\pi(s) = \mathbb{E}_\pi [G_t | S_t = s] \end{equation}$$

(1.1.2a)

$$\begin{equation} G_t = R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{T-1} R_T \end{equation}$$

(1.1.2b)

First-Visit Policy Evaluation focuses on averaging the returns for the first visit of each state $$s$$ in each episode. Comparatively, Every-Visit Policy Evaluation takes the average return of every visit to each state $$s$$ within each episode. In practice, they use the same functionality except for the above theoretical properties. Every-visit extends naturally to function approximation and eligibility traces, as discussed in the Temporal-Difference Learning section.

To understand the process, we will focus on a First-Visit Policy Evaluation. For each episode, the algorithm performs three steps:

1. Increment counter for the state that the agent has visited for the first time, $$N(s) \leftarrow N(s) + 1$$
2. Increment the total return for all states in the episode after a first visit, $$S(s) \leftarrow S(s) + G_t$$
3. Estimate the value function using the mean return across all episodes, $$V(s) = S(s) / N(s)$$

By the law of large numbers, the sequence of averages of these estimates converges to a true value function, mathematically in equation 1.1.3. With first-visit, each average is an unbiased estimate with the standard deviation of its error falling by $$\frac{1}{\sqrt{n}}$$, where $$n$$ is the number of returns averaged. Every-visit is more complex, but its estimates also converge to the value function $$v_\pi(s)$$.

$$\begin{equation} V(s) \rightarrow v_\pi(s) \; as \; N(s) \rightarrow \infty \end{equation}$$

(1.1.3)

### Incremental Mean

MC is a type of online algorithm that serially processes an input piece-by-piece. With this specific algorithm, we use what is called an incremental mean, represented mathematically in equation 1.2.1.

\begin{equation} \begin{aligned} \mu_k &= \frac{1}{k} \sum^k_{j=1} x_j \\ & = \frac{1}{k} \left( x_k + \sum^{k-1}_{j=1} x_j \right) \\ & = \frac{1}{k} (x_k + (k - 1) \mu_{k-1}) \\ & = \mu_{k-1} + \frac{1}{k} (x_k - \mu_{k-1}) \end{aligned} \end{equation}

(1.2.1)

At the top of equation 1.2.1, we can see the full calculation for the mean when we have $$k$$ elements, where the remaining parts express alternative methods. In the second equation, we sum the numbers up to $$k - 1$$ values, add the $$k$$th element to the result, and multiply it by 1 divided by $$k$$ elements.

Rearranging the terms, we retrieve the third and fourth parts of the set of equations, where the third part takes $$k - 1$$ and multiplies it by the mean of the elements up to that number $$\mu_{k - 1}$$, plus the $$k$$th element to the result and then multiplies that by 1 divided by $$k$$ elements.

Lastly, the fourth part is a simple reposition of the components, which contains an error term $$(x_k - \mu_{k - 1})$$. The second component $$\mu_{k - 1}$$, represents the estimate of what the value will be, and $$x_k$$ represents the new value. Using the final formula, we update the mean of the elements (up to $$k - 1$$), denoted $$\mu_{k - 1}$$, by a small amount in the direction of the error.

Using an Every-Visit Policy Evaluation, we can gain a better understanding of the incremental mean. Using incremental updates to the value function $$V(s)$$ after each episode $$\{S_1, A_1, R_2, \cdots, S_T\}$$, we can iterate through each state $$S_t$$ with a return of $$G_t$$, and:

• Increment the counter for the state that the agent has visited, $$N(S_t) \leftarrow N(S_t + 1)$$
• Then update the value function based on the error of the return that was thought to be received $$V(S_t)$$ against the return that has been received $$G_t$$ by a small amount in the direction of the error, formulated:

$$\begin{equation} V(S_t) \leftarrow V(S_t) + \frac{1}{N(S_t)} (G_t = V(S_t)) \end{equation}$$

(1.3.1)

One optimisation to the Monte-Carlo method includes tracking a running mean, which is beneficial in non-stationary problems and acts as a way to forget old episodes, providing an increase in computational performance. The optimisation requires an alpha $$\alpha$$ parameter (a discount rate between 0 and 1) included in the value function estimation, depicted in equation 1.3.2. The approach is called constant-$$\alpha$$ MC.

$$\begin{equation} V(S_t) \leftarrow V(S_t) + \alpha (G_t - V(S_t)) \end{equation}$$

(1.3.2)

## More On Model-Free Prediction

Monte-Carlo isn't the only model-free method in RL. I discuss alternate algorithms in my TD Prediction & TD($$\lambda$$) article. Furthermore, I discuss the differences between the two methods in my MC & TD Comparison article.