Value-Based RL

Basic Concepts

Value Functions

Let be a given policy.

Definition (Value Function)

We define the state-value function, or value function for short, as follows:

where indicates that actions are sampled by .

This is the expected return obtained if we start in state and follow to choose actions in a continuing task.

Definition (State-Action Value Function)

Similarly, we define the state-action value function, also known as the -function as follows:

This quantity represents the expected return if we start by taking action in state , and then follow to choose actions thereafter.

Definition (Advantage Function)

The advantage function is

This tells us the benefit of taking action in state , then switching to policy , relative to the baseline return of always following . Note that can be both positive and negative, and due to a useful equality:

Bellman’s Equations

Definition (Optimal Policy)

Suppose is a policy such that for all and all policies . Then is an optimal policy.

There can be multiple optimal policies for the same MDP, but by definition, their value functions must be the same, and are denoted by and respectively. We call the optimal state-value function, and the optimal action-value function.

Tip

Any finite MDP must have at least one deterministic optimal policy.

Definition

A fundamental result about the optimal value function is Bellman’s optimality equations:

Conversely, the optimal value functions are the only solutions that satisfy the equations.

This recursion will play a central role in many RL algorithms.

Given a value function, the discrepancy between the left and right hand side of the Bellman optimality equations are called the Bellman error or Bellman residual.

Definition (Bellman Operator)

We can define the Bellman operator given an MDP and policy as a function that takes a value function and derives a new value function that satisfies:

This reduces the Bellman error. Applying the Bellman operator to a state is called the Bellman backup. Iterating this process converges to the optimal value function.

Given the optimal value function, we can derive an optimal policy using

Definition

The problem for solving for , , or is called policy optimization. In contrast, Solving for or for a given policy is called policy evaluation.

Solving for the Optimal Policy in a Known World Model

Computing the optimal value function is known as the prediction problem, and computing the optimal policy is known as the control problem when the MDP is known.

Value Iteration

Definition (Value Iteration)

Starting from an initial value function estimate , value iteration (VI) iteratively updates the estimate by

A fundamental property of the above is that the update is a contraction: it can be verified that

In other words, every iteration will reduce the maximum value function error by a constant factor. will converge to , after which an optimal policy can be extracted using Equation 8.

Caution

In value iteration, we compute and for all possible states , averaging over all possible next states at each iteration.

Policy Iteration

Policy iteration is an iterative algorithm for computing , that searches in the space of deterministic policies until converging to an optimal policy. Each iteration consists of two steps: policy evaluation and policy improvement. The state values and policy are initialized arbitrarily. The algorithm then alternates between iteratively computing the state values for the current policy (policy evaluation) and improving the policy (policy improvement).

Definition (Policy Iteration)

Policy Evaluation: We sweep through the states, updating their values

Policy Improvement: To update the policy, we greedily choose the action that maximizes the value for each state

Starting from an initial policy , policy iteration alternates between policy evaluation and improvement :

The algorithm stops at iteration , if the policy is greedy with respect to its own value function . Since there are at most deterministic policies, and every iteration strictly improves the policy, the algorithm must converge after finite iterations.

Value Iteration vs Policy Iteration

Value iteration and policy iteration are both methods for finding an optimal policy, but they organize the computation differently. In policy iteration, the expectation over actions appears during policy evaluation because we are asking: “what is the value of this state if I follow the current policy ?” The policy improvement step then greedily updates the policy using those evaluated values.

In value iteration, the update uses a max over actions because it directly applies the optimal Bellman backup. Rather than fully evaluating a fixed policy before improving it, value iteration repeatedly asks: “if I acted optimally from this state onward, what value estimate would I get?” This combines policy evaluation and policy improvement into a single update: the expectation is still over the next state , but the choice over actions is replaced by the maximizing action.

Intuitively, policy iteration alternates between understanding the current policy and improving it, while value iteration repeatedly pushes the value function toward optimality. Once value iteration has converged to , the optimal policy can be recovered by acting greedily with respect to that value function.

Value Function Learning Using Samples from the World Model

We now assume that the agent only has access to , and the goal is to use these samples to estimate optimal value function and -function, even without explicitly knowing the MDP dynamics. This is often called learning as opposed to planning since the latter requires access to an explicit world model.

Monte Carlo Estimation

Definition (Monte Carlo Estimation)

Recall that is the sum of expected (discounted) rewards from state if we follow policy . A simple way to estimate this is to rollout the policy, and then compute the average sum of discounted rewards.

where is the learning rate, and the term in the brackets is the error term. We can use a similar technique to estimate by simply starting the rollout with action .

We can use MC estimation of , together with policy iteration, to learn an optimal policy:

Definition (Monte Carlo Control)

  • At iteration , compute a new improved policy using where is approximated using MC estimation.
  • This update can be applied to all the states visited on the sampled trajectory

Caution

  • To ensure this converges to the optimal policy, we need to collect data for every pair (at least in the tabular case).
  • MC methods are limited to episodic tasks since it must unroll to the end of the episode before updating
  • They are tend to be very high variance, since it has to unroll many trajectories whose returns are a sum of many random rewards

Temporal Difference (TD) Learning

The idea of temporal difference or TD learning is to incrementally reduce the Bellman error for sampled states or state-actions based on transitions instead of long trajectories.

Definition (Temporal Difference Learning)

Suppose we are to learn the value function for a fixed policy . Given a state transition , where , we change the estimate so that it moves towards the target value :

where is the learning rate. The term is known as the TD error.

A more general form of the TD update for parametric value function representations is

TD methods update their estimates based in part of other estimates, which is known as bootstrapping.

  • TD methods do not require a model of the environment, unlike DP methods.
  • TD methods are naturally implemented in an online, incremental fashion, whereas MC methods need to wait until the end of an episode.
  • TD methods are much less susceptible to exploratory actions taken affecting previous values, as TD methods learn from each transition regardless of what subsequent actions are taken.

It has also been shown that for any fixed policy , TD will converge to , in the mean for a constant step-size parameter.

Combining TD and MC Learning

Given a trajectory, TD estimates the return from by on-step lookahead , where the return from the time is replaced by its value function estimate. In contrast, MC waits until the end of the episode, then uses . It is possible to interpolate between these by performing an -step return, and then using the value function to approximate the return for the rest of the trajectory.

Definition ( -step return)

The corresponding -step version of the TD update becomes

Definition ( -return)

Rather than picking a specific look ahead value , we can take a weighted average of all possible values, with a single parameter by using

We can now use inside the TD update instead of in Equation 18, which is known as .

  • If , the -return becomes equal to the regular MC return .
  • If , the -return becomes equal to the one-step return , so standard TD learning is also called learning.

This also gives us the following recursive relationship

which we initialize with .

SARSA: On Policy TD Learning

TD learning is for policy evaluation, as it estimates the value function for a fixed policy (i.e. it computes ). In order to find an optimal policy , we may use the algorithm as a building block inside generalized policy iteration. In this case, its more convenient to work with the action-value function , and a policy that is greedy with respect to .

The agent follows in every step to choose actions, and upon a transition , the TD update rule is

where is the action the agent will take in state . This converges to . After is updated (for policy evaluation), also changes accordingly so it is greedy with respect to (for policy improvement).

Convergence

In order for SARSA to converge to , every state-action pair must be visited infinitely often, at least in the tabular case, since the algorithm only updates for the that it visits. One way to ensure this condition is to use a greedy in the limit with infinite exploration policy. An example is the -greedy policy, with vanishing to 0 gradually.

Sources

  • Murphy, K. (2025). Reinforcement Learning: An Overview. Chapter 2.