Q-Learning: Off-Policy TD Policy Learning

SARSA is an on-policy algorithm, which it means it learns the -function for the policy it is currently using, which is typically not the optimal policy, because of the need to perform exploration. However, with a simple modification, we can convert this to an off-policy algorithm that learns , even if a suboptimal or exploratory policy is used to choose actions.

Tabular Q Learning

Suppose we modify SARSA by replacing the sampled action with a greedy action . This results in the following update when a transition happens

This is the update rule of Q-learning for the tabular case.

Since it is off-policy, the method can use triples coming from any data source, such as older versions of the policy. If every state-action pair is visited infinitely often, the algorithm provably converges to in the tabular case, with properly decayed learning rates.

For terminal states, we know that for all actions . Consequently, for the optimal value function, we have for all terminal states. When performing online learning, we don’t usually know which states are terminal. Therefore, we assume that whenever we take a step in the environment, we get an additional binary indicator that tells us if is terminal. In this case, we set the target value in -learning to yielding the modified update rule

Q Learning with Function Approximation

To make Q learning work with high-dimensional state spaces, we have to replace the tabular representation with a parametric approximation, denoted . We can update this function using one or more steps of SGD on the following loss function

Since nonlinear functions need to be trained on minibatches of data, we compute the average loss over multiple randomly sampled experience tuples to get

Algorithm 8 Q Learning

Input: Environment state ss, network parameters w0\mathbf{w}_0, replay buffer D=\mathcal{D} = \emptyset, discount factor γ\gamma, step learning rate η\eta, policy π0\pi_0

for iteration k=0,1,k=0,1,\dots do

for environment step s=0,1,,S1s = 0,1,\dots,S-1 do

Sample action aπk(as)a \sim \pi_k(a \mid s)

Interact with environment (s,r)=(s',r) = env.step(aa)

Update buffer DD{(s,a,s,r)}\mathcal{D} \gets \mathcal{D} \cup \{(s,a,s',r)\}

end for

wk,0wk\mathbf{w}_{k,0} \gets \mathbf{w}_k

for gradient step g=0,1,,G1g=0,1,\dots,G-1 do

Sample batch BD\mathcal{B} \subset \mathcal{D}

Compute Error L(B,wk,g)=1B(s,a,r,s)B[Qwk,g(s,a)(r+γ maxa Qwk(s,a))]2\mathcal{L}(\mathcal{B},\mathbf{w}_{k,g}) = \frac{1}{|\mathcal{B}|} \sum_{(s,a,r,s') \in \mathcal{B}} [ Q_{\mathbf{w}_{k,g}}(s,a) - (r + \gamma \ \mathop{\mathrm{max}}_{a'} \ Q_{\mathbf{w}_k}(s',a') ) ]^2

Update parameters wk,gwk,gηwk,gL(B,wk,g)\mathbf{w}_{k,g} \gets \mathbf{w}_{k,g} - \eta \nabla_{\mathbf{w}_{k,g}} \mathcal{L}(\mathcal{B}, \mathbf{w}_{k,g})

end for

end for

The Deadly Triad

The problem with the naive Q learning objective function in Equation 3 is that it can lead to instability, since the target we are regressing towards uses the same parameters as the function we are updating. So the network is chasing its own tail. Although this is fine for tabular models, it can fail for nonlinear models.

In general, an RL algorithm can become unstable when it has three components:

  • Function approximation (as in neural networks)
  • Bootstrapped value function estimation (i.e. TD-like methods instead of MC)
  • Off-policy learning (where actions are sampled from some distribution other than the policy being optimized)

DQN

The deep Q-network or DQN paper used neural nets to represent the function, but performed a smaller number of gradient updates per iteration. Furthermore, they proposed to modify the target value when fitting the Q function in order to avoid instabilities during training such as to use a target network and experience replay. 1

Experience Replay

Since Q learning is an off-policy method, we can update the Q function using any data source. This is particularly important when we use nonlinear function approximation, which often needs a lot of data for model fitting. A natural source of data is data collected earlier in the trajectory. This is called an experience replay buffer, which stores transition tuples into a buffer and uniformly samples from it during training.

Success

This can improve sample efficiency due to being able to use the same sample multiple times. It also improves stability in training by reducing the correlation of the data samples that the network is trained on, since the training tuples do not have to come from adjacent moments in time.

Caution

Experience replay requires the use of off-policy learning methods since the training data is sampled from older versions of the policy, not the current policy.

Prioritized Experience Replay

It is possible to replace the unform sampling from the buffer with one that favors more important transitions tuples that may be more informative about .

Definition (Prioritized Experience Replay)

Consider the TD error for the th tuple

Define the priority of as

where determine the degree of prioritization, with corresponding to uniform sampling. Then the probability of sampling is

Intuition

The TD error measures how surprising transition is under the current estimate of . If is large, then the network’s prediction is far from the bootstrapped target , so replaying this transition is likely to produce a larger and more useful update. If is small, then the transition is already well explained by the current value function, so sampling it again is less informative.

The priority is therefore a softened measure of learning potential. The term ensures every transition has nonzero priority, so old or currently low-error transitions are not completely ignored. The exponent controls how strongly sampling focuses on high-error transitions: when , all priorities become equal and replay is uniform; as increases, transitions with larger TD errors are sampled more often.

Target Network

One heuristic solution to the deadly triad, proposed in the DQN paper, is to use a frozen target network computed at an earlier iteration to define the target value for the DQN updates, rather than trying to chase a constantly moving target. Specifically, we maintain an extra copy of the -network, , with the same structure as . This new -network is used to compute bootstrapping targets

for training . we can periodically set , usually after a few episodes, where the stop gradient operator is used to prevent auto-differentiation propagating gradients back to .

Alternatively, we can use an exponentially moving average of EMA of the weights, i.e., at step we use ., where the forgetting factor ensures slowly catches up with .

Two Time-Scale Methods

Another approach is to update the target value in the TD update more quickly than the value function itself. This is known as two timescale optimization,.

Layer Norm

More recently, its been proved 2 that nonlinear TD learning can be made to converge, even in the off-policy setting if three simple conditions on the critic -network are satisfied:

  • The final layer weights are bounded (e.g. using normalization or AdamW, which has been shown to solve an constrained optimization problem, ensuring the parameters remain bounded by )
  • The penultimate layer is sufficiently wide to represent the true function
  • The input to the critic has bounded norm (e.g. using LayerNorm)

Maximization Bias

Standard -learning suffers from a problem known as the optimizer’s curse, or maximization bias. The issue comes from the fact that the max operator does not commute with expectation:

for a collection of random variables . In words, if each action value estimate is noisy, then taking the maximum over those noisy estimates tends to select the action whose estimate was pushed upward by random error.

This matters in -learning because the TD target uses

Even if each estimate is unbiased on its own, the maximization step tends to prefer actions with unusually large positive estimation error. As a result, the target can become systematically too optimistic, causing -learning to overestimate action values.

Example

Suppose we have two samples and . Then, , and so . However, and . Even though both actions have true expected value of 0, the expected maximum noisy estimate is positive.

Double Q-Learning

One solution to avoid the maximization bias is to use two separate Q-functions, and , one for selecting the greedy action and the other for estimating the corresponding -value. In particular, upon seeing a transition , we perform the following update for

Here, uses to choose the best action, but uses to evaluate it, and vise versa. This technique is called double Q-learning. 3

Double DQN

Combining double Q learning with deep Q networks results in double DQN. This modifies Equation 12 to its gradient form, and then the current network is used for action proposals and the target network for action evaluation. Thus, the training target becomes

DQN Extensions

Q Learning for Continuous Actions

Q learning is not directly applicable to continuous actions due to the need to compute the argmax over actions. One option used by NFQCA is to replace the optimization over continuous actions by a function that approximates the greedy action. This became the basis for the DDPG algorithm.

Dueling DQN

The Dueling DQN method learns a value function and an advantage function, and derives the Q function rather than learning it directly. This is helpful when there are many actions with similar Q-values, since the advantage focuses on the differences in value relative to a shared baseline. 4

We define a network with output heads, which computes for the actions, and for the last output. We can then derive

However, this naive decomposition is not identifiable. For any state-dependent scalar , we can rewrite

Thus the same -function can be represented by infinitely many different pairs . The network therefore has no unique way to decide what should be stored in the value stream and what should be stored in the advantage stream.

To remove this ambiguity, we impose a normalization constraint on the advantage values. One option is to subtract the maximum advantage, yielding

If , then

So after normalization, the value stream represents the value of the best action, and the normalized advantage of the best action is zero. This anchors the decomposition and makes the roles of and well-defined.

Tip

In practice, the max is usually replaced by an average,

which enforces that the advantages average to zero and is typically easier to optimize.

Noisy Nets and Exploration

Standard DQN relies on the epsilon-greedy strategy to perform exploration. However, this will explore equally in all states, whereas we would like the amount of exploration to be state dependent to reflect the amount of uncertainty in the outcomes of trying each action in that state due to lack of knowledge. An early approach for this, known as noisy nets, added random noise to the network weights for encouraging exploration which is temporally consistent within episodes.

Multi-step DQN

In TD$(n)$, the bias introduced via bootstrapping was reduced by unrolling the value computation for MC steps, and then plugging into the value at the end. We can apply this to the DQN by defining the target

This can be implemented for episodic environments by storing experiences tuples of the form

where if the trajectory ended at any point during the -step rollout.

Caution

Theoretically this method is only valid if all the intermediate actions are sampled from the current optimal policy derived from , as opposed to some behavior policy, such as epsilon greedy or some samples from the replay buffer from an old policy. In practice, we can just restrict sampling to recent samples from the replay buffer, making the resulting method approximately on-policy.

Rainbow

The Rainbow method combined several improvements to the vanilla DQN method, as listed below. At the time it was published (2018), this produces SOTA results n the Atari-200M benchmark. Each improvement gives diminishing returns. 5

Sources and Additional Readings

  • Murphy, K. (2025). Reinforcement Learning: An Overview. Chapter 2.

Footnotes

  1. Playing Atari with Deep Reinforcement Learning (DQN)

  2. Simplifying Deep Temporal Difference Learning

  3. Deep Reinforcement Learning with Double Q-learning

  4. Dueling DQN

  5. Rainbow: Combining Improvements in Deep Reinforcement Learning