In my previous article, Monte-Carlo Prediction, we discussed MC Learning methods and how they estimate value functions of unknown Markov Decision Processes (MDPs). In this article, we discuss another type of method: Temporal-Difference (TD) Learning.

This article is the second of a multi-part series that focuses on model-free agents, arranged into individual articles for prediction and control algorithms. Model-Free Prediction (MFP) focuses on estimating the value function of an unknown MDP, while its counterpart, Model-Free Control (MFC), focuses on optimising the unknown MDP's value function. For more information on MFC, check out the article series!

## Temporal-Difference Learning

Temporal-Difference (TD) methods learn directly from incomplete episodes of experience by combining Monte-Carlo (MC) and dynamic programming (DP) concepts. During an episode, a TD agent updates its value function estimate from its experience so far, allowing it to learn while the episode is still in progress. Like MC, TD is a model-free method; like DP, it partially updates estimates based on other learned approximations in real-time (bootstrapping). Throughout the remainder of this article, we will discuss different TD methods, how they solve Temporal Credit Assignment (TCA) problems, and the differences between TD, MC and DP.

### Temporal Credit Assignment Problem

Before discussing the TCA problem, we need to understand the Credit Assignment (CA) problem. CA problems are tasks of determining which of the actions taken deserve credit for the reward received. RL assists in solving this problem by using agents to find an optimal set of actions that maximises reward.

The TCA problem is an extension of the CA problem that requires solutions across time, where an agent needs to find the best policy across multiple timesteps. Agents that solve these problems can learn in real-time by updating a policy during the progression of a task, which is extremely useful in real-world scenarios. In MC, agents have no awareness of time-critical events, such as hitting a moving target with a rifle or manoeuvring a car out of the way of oncoming traffic. With the added element of time, TCA problems give agents the opportunity to learn the importance of such event timing.

### TD Prediction

Like MC Prediction, TD Prediction's goal is to learn the value function \(v_\pi\) from experience under policy \(\pi\). The simplest algorithm is TD(0), which performs a modified incremental Every-Visit MC Prediction. The MC approach updates the estimated value \(V(S_t)\) towards the actual return \(G_t\) (equation 1.2.1). Comparatively, TD(0) updates the estimated value \(V(S_t)\) toward the estimated return \(R_{t+1} + \gamma V(S_{t+1})\), reflected in equation 1.2.2, and formally known as the TD update rule.

\(\begin{equation} V(S_t) \leftarrow V(S_t) + \alpha \bigg( G_t - V(S_t) \bigg) \end{equation}\)

(1.2.1)

\(\begin{equation} V(S_t) \leftarrow V(S_t) + \alpha \bigg( R_{t+1} + \gamma V(S_{t+1}) - V(S_t) \bigg) \end{equation}\)

(1.2.2)

The estimated return consists of two parts:

- The immediate return, \(R_{t+1}\)
- Plus, the discounted value of the next step, \(\gamma V(S_{t+1})\)

Furthermore, equation 1.2.2 has two critical components that have specific terminology:

- \(R_{t+1} + \gamma V(S_{t+1})\) is called the TD target, which the agent aims to achieve.
- \(\delta_t = R_{t+1} + \gamma V(S_{t+1}) - V(S_t)\) is called the TD error, representing the difference between the current and target estimates.

TD methods are powerful because they update the value function after every timestep (real-time). Additionally, they can learn without a terminal outcome and work in continuing (non-terminating) environments. In comparison, MC methods require completing an entire episode before updating the value function and can only work within episodic (terminating) environments.
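To make the update rule concrete, here is a minimal sketch of tabular TD(0) prediction on a small random-walk environment. The environment, function name, and hyperparameters are illustrative assumptions, not from this article: five non-terminal states, stepping off the right end gives reward +1, off the left end reward 0.

```python
import random

def td0_prediction(episodes, alpha=0.1, gamma=1.0, n_states=5):
    """Tabular TD(0) on a toy random walk. Each in-episode transition
    applies V(S_t) <- V(S_t) + alpha * (R + gamma * V(S_t+1) - V(S_t))."""
    V = [0.0] * n_states
    for _ in range(episodes):
        s = n_states // 2                        # start in the middle state
        while True:
            s_next = s + random.choice([-1, 1])  # random step left or right
            if s_next < 0:                       # terminated left: reward 0
                V[s] += alpha * (0.0 - V[s])
                break
            if s_next >= n_states:               # terminated right: reward +1
                V[s] += alpha * (1.0 - V[s])
                break
            # TD(0) update using the bootstrapped target (equation 1.2.2)
            V[s] += alpha * (gamma * V[s_next] - V[s])
            s = s_next
    return V

random.seed(0)
values = td0_prediction(episodes=2000)
```

Note how each update happens inside the episode loop; an MC method would instead wait for `break` before touching `V`.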

For example, imagine an agent driving a car that crashes into the one ahead of it. Using MC learning, the agent must wait until the crash to receive a negative reward. With TD learning, however, the agent receives a negative reward as it gets too close to the car ahead, allowing it to slow down before crashing.

## TD(\(\lambda\))

TD(\(\lambda\)) is a popular algorithm that uses a mechanism called eligibility traces (ETs), where \(\lambda\) is the trace-decay parameter. ETs can be combined with TD methods, such as Q-Learning or Sarsa, to obtain a general approach for learning more efficiently.

There are two ways to view ETs: theoretically and mechanistically. The theoretical view works as a bridge between TD and MC methods, producing a family of algorithms spanning from one-step TD to MC methods. Comparatively, the mechanistic view focuses on ETs as temporary records of an occurrence of an event, such as a state visit or action taken. Both have more formal names: the theoretical view is called forward-view, and the mechanistic view is called backward-view.

Throughout this section, we focus on the prediction problem and how ETs help estimate the value function \(v_\pi\). For more information on the control problem, refer to the article series.

### \(n\)-step

\(n\)-step is one of the fundamental theoretical methods that produce a range of algorithms between one-step TD and MC. Recall that TD(0) performs a single backup on the next reward, and MC performs a backup for each state within the entire sequence of observed rewards until the end of the episode. By expanding TD(0), we can create two-step, three-step or even \(n\)-step backup methods, visualised in figure 2.1.1.

Figure 2.1.1. Temporal-Difference Learning look ahead variants

To understand this mathematically, let’s consider the following \(n\)-step returns for \(n = 1, 2, \infty\), seen in equation 2.1.1. Notice how each additional step simply extends the standard return formula.

\(\begin{align} n &= 1 \; \; \; (TD) \quad \; G_t^{(1)} = R_{t+1} + \gamma V(S_{t+1}) \\ n &= 2 \; \qquad \qquad \; G_t^{(2)} = R_{t+1} + \gamma R_{t+2} + \gamma^2 V(S_{t+2}) \\ &\vdots \; \qquad \qquad \qquad \; \; \vdots \\ n &= \infty \; \; (MC) \quad G_t^{(\infty)} = R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{T-1} R_T \end{align}\)

(2.1.1)

Both MC (\(n = \infty\)) and TD methods use the traditional discounted return formula, summing \(n\) steps of actual rewards. However, the TD methods have one minor addition - they add the discounted value estimate of the state reached after \(n\) steps, accounting for the absence of the remaining future rewards. Equation 2.1.2 displays the complete formula for \(n\)-step TD returns.

\(\begin{equation} G_t^{(n)} = R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{n-1} R_{t+n} + \gamma^n V(S_{t+n}) \end{equation}\)

(2.1.2)

\(\begin{equation} V(S_t) \leftarrow V(S_t) + \alpha \left( G_t^{(n)} - V(S_t) \right) \end{equation}\)

(2.1.3)

Equation 2.1.3 expresses the \(n\)-step TD learning formula, where the TD target is \(G_t^{(n)}\), and the TD error is \(G_t^{(n)} - V(S_t)\). Unfortunately, \(n\)-step TD methods are inconvenient to implement. When computing \(n\)-step returns, agents must wait \(n\) steps before observing the rewards and states. For large \(n\), this can prove problematic, especially when solving the control problem. Therefore, the method is used mainly in theory, to help understand related methods.
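Equation 2.1.2 can be sketched as a small helper. The function name and example values below are illustrative assumptions:

```python
def n_step_return(rewards, V, s_next, n, gamma=0.9):
    """Compute G_t^(n) as in equation 2.1.2: n actual rewards plus the
    discounted value estimate of the state reached after n steps.
    `rewards` holds R_{t+1}..R_{t+n}; `s_next` is S_{t+n}."""
    assert len(rewards) == n
    G = sum(gamma**k * r for k, r in enumerate(rewards))  # discounted rewards
    return G + gamma**n * V[s_next]                       # bootstrap on the nth state

# With n = 1 this reduces to the TD(0) target R + gamma * V(S_{t+1}):
V = {0: 0.5, 1: 0.8}
g1 = n_step_return([1.0], V, s_next=1, n=1)   # 1.0 + 0.9 * 0.8 = 1.72
```

Larger `n` simply appends more observed rewards before the bootstrap term, approaching the MC return as `n` grows.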

### Forward-View TD(\(\lambda\))

Moving onto more theoretical views of ETs, we can begin to understand what is computed by methods that use them. As seen in the previous section, \(n\)-step methods are one tool for obtaining returns, and on their own, they can prove computationally inefficient. Instead, we can combine multiple \(n\)-step methods and average the \(n\)-step returns, creating a new range of algorithms. For example, we perform backups towards a target using half a two-step return and half a four-step return, depicted in equation 2.2.1.

\(\begin{equation} \frac{1}{2} G^{(2)} + \frac{1}{2} G^{(4)} \end{equation}\)

(2.2.1)

Furthermore, we can combine any set of returns that we can average in the same way, even when using an infinite set. The only requirement is that the weights of the \(n\)-step components are positive values that sum to 1 (100%). The composite return creates an error reduction property similar to the individual \(n\)-step returns, providing guaranteed convergence.

Formally, backups that average component backups are called complex backups. Backup diagrams for these consist of the component’s backups with a horizontal line above and the weighting fractions below them. Figure 2.2.1 presents the backup diagram of the example mentioned above.

Figure 2.2.1. A complex backup diagram containing half a two-step return and half a four-step return.

TD(\(\lambda\)) provides a method for averaging complex backups that contains all the \(n\)-step backups. Each is weighted proportionally to \(\lambda^{n-1}\), where \(\lambda \in [0,1]\), and normalised by a factor of \(1 - \lambda\), ensuring the weights sum to 1 (see figure 2.2.2). The result is a backup towards a return formally called the \(\lambda\)-return, mathematically described in equation 2.2.2.

\(\begin{equation} G_t^\lambda = (1 - \lambda) \sum\limits_{n=1}^\infty \lambda^{n-1} G_t^{(n)} \end{equation}\)

(2.2.2)

Changing the value of \(\lambda\) produces different behaviour. For example, when \(\lambda = 1\), performing backups according to the \(\lambda\)-return produces the same result as constant-\(\alpha\) MC. However, when \(\lambda = 0\), the \(\lambda\)-return reduces to the one-step return, identical to TD(0).
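Equation 2.2.2 can be checked numerically; the helper names and values below are illustrative assumptions. Truncating the infinite sum at a finite number of steps leaves weights that sum to just under 1, and \(\lambda = 0\) places all the weight on the one-step return:

```python
def lambda_weights(lam, n_max):
    """Weights (1 - lambda) * lambda**(n-1) from equation 2.2.2,
    truncated at n_max steps."""
    return [(1 - lam) * lam**(n - 1) for n in range(1, n_max + 1)]

def lambda_return(n_step_returns, lam):
    """G_t^lambda = (1 - lambda) * sum_n lambda**(n-1) * G_t^(n),
    truncated at the number of available n-step returns."""
    return (1 - lam) * sum(lam**(n - 1) * g
                           for n, g in enumerate(n_step_returns, start=1))

w = lambda_weights(0.8, 50)
total = sum(w)                               # approaches 1 as n_max grows
g = lambda_return([2.0, 5.0, 7.0], 0.0)      # lam = 0 recovers the 1-step return
```

The geometric weighting is what makes the later backward-view implementation cheap: each weight is the previous one scaled by \(\lambda\), so no per-\(n\) bookkeeping is needed.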

Figure 2.2.2. Backup diagrams for TD(\(\lambda\)), where \(\lambda\) enables different functionality. The diagram shows backup examples of one-step, two-step, three-step, and \(\infty\)-steps.

The TD(\(\lambda\)) algorithm uses geometric weighting to increase computational efficiency: the weighting requires no memory of individual \(n\)-step returns, so nothing extra needs to be stored or recomputed, enabling TD(\(\lambda\)) to run at the same computational cost as TD(0). To incorporate this, we replace the TD target \(G_t^{(n)}\) in equation 2.1.3 with the \(\lambda\)-return \(G_t^\lambda\) as the new method for performing backups, formally creating the forward-view update (equation 2.2.3).

\(\begin{equation} V(S_t) \leftarrow V(S_t) + \alpha \bigg( G_t^\lambda - V(S_t) \bigg) \end{equation}\)

(2.2.3)

Forward-view TD(\(\lambda\)) updates the value function towards the \(\lambda\)-return by looking into the future to compute \(G_t^\lambda\). During the process, when each state is visited, we look at the future rewards and determine the best approach for combining them. After updating the first state, we move on to the next, never returning to the previous one. Future states are viewed and processed repeatedly, but only once from each vantage point that precedes them (figure 2.2.3).

Figure 2.2.3. An example of the forward-view, where state updates are decided by looking at future rewards and states.

### Backward-View TD(\(\lambda\))

Backward-view uses eligibility traces (ETs) to store occurrences of events, such as a state visit or action taken, and provides intuition for developing TD algorithms. The trace decides whether the associated events are eligible for learning changes, bridging the gap between events and training information.

An ET for state \(s\) at time \(t\) is a non-negative number, denoted \(E_t(s) \in \mathbb{R}^+\). During each step, the ETs of all non-visited states decay by \(\gamma \lambda\) (equation 2.3.1), where \(\gamma\) is the discount rate, and \(\lambda \in [0,1]\) is called the trace-decay parameter.

\(\begin{equation} E_t(s) = \gamma \lambda E_{t-1}(s) \end{equation}\)

(2.3.1)

For the state visited at timestep \(t\), the ET decays as normal but adds an indicator function, denoted \(\boldsymbol{1}\) (equation 2.3.2). The indicator function takes a state at timestep \(t\) as a parameter and returns 1 if it is visited or 0 otherwise.

\(\begin{equation} E_t(s) = \gamma \lambda E_{t-1}(s) + \boldsymbol{1}(S_t = s) \end{equation}\)

(2.3.2)

This type of eligibility trace is formally called an accumulating trace. Accumulating traces accumulate each time the state is visited, then gradually decrease (fade) when it is no longer visited, illustrated in figure 2.3.1.

Recently visited states are stored as a simple record that decays by \(\gamma \lambda\). Each stored trace indicates the degree to which its state is eligible for learning changes when a reinforcing event occurs, or more precisely, a TD error. Equation 2.3.3 shows the one-step TD error: the immediate reward \(R_{t+1}\), plus the discounted value estimate of the next state \(\gamma V(S_{t+1})\), minus the current estimate \(V(S_t)\).

\(\begin{equation} \delta_t = R_{t+1} + \gamma V(S_{t+1}) - V(S_t) \end{equation}\)

(2.3.3)

Backward-view TD(\(\lambda\)) stores an ET for every state \(s\), then updates the estimated value function \(V(s)\) for each state in proportion to the TD error \(\delta_t\) and eligibility trace \(E_t(s)\), mathematically represented in equation 2.3.4. The TD error and ET are multiplied by \(\alpha\), a learning rate between \([0,1]\) that determines the agent's learning speed. Formally, this equation represents the backward-view TD update.

\(\begin{equation} V(s) \leftarrow V(s) + \alpha \delta_t E_t(s) \end{equation}\)

(2.3.4)
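A single backward-view step combining the trace decay, trace increment, TD error, and value update can be sketched as follows. The function name and example values are illustrative assumptions:

```python
def backward_td_lambda_step(V, E, s, r, s_next, alpha, gamma, lam):
    """One backward-view TD(lambda) step: decay every trace by
    gamma*lambda, add 1 to the visited state's trace (accumulating
    variant, equation 2.3.2), then move every state's value by
    alpha * delta * E(s) (equation 2.3.4)."""
    delta = r + gamma * V[s_next] - V[s]   # one-step TD error (equation 2.3.3)
    for state in E:
        E[state] *= gamma * lam            # decay all traces (equation 2.3.1)
    E[s] += 1.0                            # indicator bump for the visited state
    for state in V:
        V[state] += alpha * delta * E[state]
    return delta

V = {0: 0.0, 1: 0.0}
E = {0: 0.0, 1: 0.0}
d = backward_td_lambda_step(V, E, s=0, r=1.0, s_next=1,
                            alpha=0.5, gamma=0.9, lam=0.8)
```

Every state with a non-zero trace shares in the update, which is how credit propagates backwards to earlier states in real-time.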

To better understand the backward view, let us examine different values of \(\lambda\). At the lowest value, 0, only the current state is updated, creating the same update behaviour as TD(0) (equation 2.3.5).

\(\begin{equation} V(S_t) \leftarrow V(S_t) + \alpha \delta_t \end{equation}\)

(2.3.5)

At the highest value, 1, we create the TD(1) algorithm, where the credit given to earlier states decays by \(\gamma\) per step, achieving behaviour similar to MC. Standard MC methods are limited to episodic tasks, but TD(1) significantly increases their range of applicability by extending the functionality to discounted continuing ones. Furthermore, recall from the previous article that MC methods only update their estimates at the end of an episode; TD(1) learns incrementally, allowing it to alter the agent's behaviour within the same episode, which proves more valuable.

Concerning the performance of TD(\(\lambda\)) methods, accumulating traces have a single flaw: when \(\lambda > 0.9\) and \(\alpha > 0.5\), the algorithm becomes unstable. There are two variations of ETs that accommodate this limitation: Dutch and replacing traces. Each variation decays the traces of non-visited states in the same way but differs in how it increments visited states.

The first variation is the replacing trace. Consider a state that is revisited before its trace has decayed to zero. With an accumulating trace, the revisit increments the trace, producing a value larger than 1. With a replacing trace, it is instead reset to 1 (equation 2.3.6). In the special case of TD(1), this type of trace closely reflects a First-Visit MC method.

\(\begin{equation} E_t(S_t) = 1 \end{equation}\)

(2.3.6)

In the second variation, the Dutch trace, we perform an intermediate between accumulating and replacing traces, controlled by the step-size \(\alpha\) (equation 2.3.7). As \(\alpha\) approaches zero, it becomes an accumulating trace; when \(\alpha = 1\), it becomes a replacing trace.

\(\begin{equation} E_t(S_t) = (1 - \alpha) \gamma \lambda E_{t-1}(S_t) + 1 \end{equation}\)

(2.3.7)

Both variants increase performance and provide more robust algorithms than accumulating traces. Dutch trace, in particular, can achieve performance that closely approximates the \(\lambda\)-return algorithm.
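The three trace variants differ only in how the visited state's trace is incremented, which a single hypothetical helper can summarise (the function name and example values are assumptions):

```python
def update_trace(e_prev, visited, gamma, lam, alpha=0.1, kind="accumulating"):
    """Update one state's eligibility trace. Non-visited states always
    decay by gamma*lambda; visited states differ per variant
    (equations 2.3.2, 2.3.6 and 2.3.7)."""
    decayed = gamma * lam * e_prev
    if not visited:
        return decayed
    if kind == "accumulating":
        return decayed + 1.0               # can grow beyond 1 on revisits
    if kind == "replacing":
        return 1.0                         # reset regardless of the old trace
    if kind == "dutch":
        return (1 - alpha) * decayed + 1.0 # interpolates between the two
    raise ValueError(f"unknown trace kind: {kind}")

# Revisiting a state whose trace is still 0.5, with gamma * lam = 0.9:
acc = update_trace(0.5, True, gamma=1.0, lam=0.9)                     # 1.45
rep = update_trace(0.5, True, gamma=1.0, lam=0.9, kind="replacing")   # 1.0
```

Note that the Dutch update with \(\alpha = 0\) matches the accumulating line exactly, and with \(\alpha = 1\) the decayed term vanishes, matching the replacing line, as the text describes.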