Off-Policy Methods

In many cases, it is useful to train a policy using data collected from a behavior policy $\pi_b(a \mid s)$ that differs from the target policy $\pi(a \mid s)$ being learned. Examples include:

  • Data collected from earlier trials or parallel workers (with different parameters $\theta'$) and stored in a replay buffer
  • Demonstration data from human experts

This is known as off-policy RL. It can be much more sample efficient than on-policy methods, since it can use data from multiple sources.

Caution

Off-policy methods are more complicated than on-policy methods. The basic difficulty is that the target policy we want to learn may want to try an action in a state that has not been experienced before in the existing data, so there is no way to predict the outcome of this new $(s, a)$ pair.

To tackle this problem, we generally assume that the target policy is not too different from the behavior policy, so that the ratio $\pi(a \mid s) / \pi_b(a \mid s)$ is bounded. In the online setting, we can ensure this property by using conservative updates to the policy. Alternatively, we can use policy gradient methods with various regularization methods.

Policy Evaluation using Importance Sampling

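Assume we have a dataset of the form $\mathcal{D} = \{\tau^{(i)}\}_{1 \le i \le n}$, where each trajectory is a sequence $\tau^{(i)} = (s_0^{(i)}, a_0^{(i)}, r_0^{(i)}, s_1^{(i)}, \dots)$, where the actions are sampled according to a behavior policy $\pi_b$, and the reward and next states are sampled according to the reward and transition models.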

Goal

We want to use this offline dataset to evaluate the performance of some target policy $\pi$. This is called off-policy evaluation or OPE.

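If the trajectories $\tau^{(i)}$ were sampled from $\pi$, we could use the standard Monte Carlo estimate

$$
\hat{J}(\pi) = \frac{1}{n} \sum_{i=1}^{n} \sum_{t=0}^{T-1} \gamma^t r_t^{(i)}
$$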

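However, since the trajectories are sampled from $\pi_b$, we use importance sampling (IS) to correct for the distribution mismatch. This gives

$$
\hat{J}_{\mathrm{IS}}(\pi) = \frac{1}{n} \sum_{i=1}^{n} \frac{p(\tau^{(i)} \mid \pi)}{p(\tau^{(i)} \mid \pi_b)} \sum_{t=0}^{T-1} \gamma^t r_t^{(i)}
$$
^is-policy-evaluator
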
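It can be shown that $\mathbb{E}_{\pi_b}[\hat{J}_{\mathrm{IS}}(\pi)] = J(\pi)$, that is, $\hat{J}_{\mathrm{IS}}(\pi)$ is unbiased, provided that $p(\tau \mid \pi_b) > 0$ whenever $p(\tau \mid \pi) > 0$. The importance ratio $p(\tau \mid \pi) / p(\tau \mid \pi_b)$ compensates for the fact that the data is sampled from $\pi_b$ and not $\pi$. Because the initial-state, transition, and reward terms are the same under both policies, they cancel, and the ratio simplifies as follows:

$$
\frac{p(\tau \mid \pi)}{p(\tau \mid \pi_b)}
= \frac{p(s_0) \prod_{t=0}^{T-1} \pi(a_t \mid s_t)\, p_S(s_{t+1} \mid s_t, a_t)\, p_R(r_t \mid s_t, a_t, s_{t+1})}{p(s_0) \prod_{t=0}^{T-1} \pi_b(a_t \mid s_t)\, p_S(s_{t+1} \mid s_t, a_t)\, p_R(r_t \mid s_t, a_t, s_{t+1})}
= \prod_{t=0}^{T-1} \frac{\pi(a_t \mid s_t)}{\pi_b(a_t \mid s_t)}
$$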

Tip

This simplification makes it easy to apply IS, as long as the target and behavior policies are known. If the behavior policy is unknown, we can estimate it from $\mathcal{D}$, and replace $\pi_b$ by its estimate $\hat{\pi}_b$.
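As a minimal sketch of the trajectory-wise IS estimator, assuming both policies are available as probability functions: the names `is_estimate`, `target_prob`, and `behavior_prob`, and the `(state, action, reward)` trajectory layout, are hypothetical choices made for this illustration.

```python
def is_estimate(trajectories, target_prob, behavior_prob, gamma=0.99):
    """Trajectory-wise importance-sampling estimate of J(pi).

    trajectories: list of trajectories, each a list of (state, action, reward).
    target_prob(s, a) and behavior_prob(s, a): action probabilities under the
    target policy pi and the behavior policy pi_b (hypothetical interfaces).
    """
    total = 0.0
    for trajectory in trajectories:
        ratio = 1.0          # product of per-step policy ratios (dynamics cancel)
        discounted_return = 0.0
        for t, (s, a, r) in enumerate(trajectory):
            ratio *= target_prob(s, a) / behavior_prob(s, a)
            discounted_return += gamma ** t * r
        total += ratio * discounted_return
    return total / len(trajectories)


# Toy one-step problem: action 0 pays 1, action 1 pays 0.
data = [[(0, 0, 1.0)], [(0, 1, 0.0)]]            # collected by a uniform pi_b
estimate = is_estimate(
    data,
    target_prob=lambda s, a: [0.8, 0.2][a],       # pi prefers action 0
    behavior_prob=lambda s, a: 0.5,
)
```

In this toy case the target policy picks the rewarding action with probability 0.8, and the reweighted estimate recovers that expected return from uniformly collected data.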

For convenience, we define the per-step importance ratio at time $t$ as

$$
\rho_t(\tau) = \frac{\pi(a_t \mid s_t)}{\pi_b(a_t \mid s_t)}
$$

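We can reduce the variance of the importance-sampling estimator $\hat{J}_{\mathrm{IS}}$ (Equation 2) by noting that the reward $r_t$ is independent of the trajectory beyond time $t$, so it only needs to be weighted by the ratios up to time $t$. This leads to a per-decision importance sampling variant:

$$
\hat{J}_{\mathrm{PDIS}}(\pi) = \frac{1}{n} \sum_{i=1}^{n} \sum_{t=0}^{T-1} \left( \prod_{t'=0}^{t} \rho_{t'}(\tau^{(i)}) \right) \gamma^t r_t^{(i)}
$$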
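The per-decision variant can be sketched in the same style. This is an illustrative sketch only: `pdis_estimate` and its arguments are hypothetical names, and trajectories are assumed to be lists of `(state, action, reward)` tuples.

```python
def pdis_estimate(trajectories, target_prob, behavior_prob, gamma=0.99):
    """Per-decision importance-sampling estimate of J(pi).

    Each reward r_t is weighted only by the ratios of the actions taken up to
    and including time t, rather than by the full-trajectory ratio.
    """
    total = 0.0
    for trajectory in trajectories:
        cumulative_ratio = 1.0
        for t, (s, a, r) in enumerate(trajectory):
            cumulative_ratio *= target_prob(s, a) / behavior_prob(s, a)
            total += cumulative_ratio * gamma ** t * r
    return total / len(trajectories)


# One two-step trajectory collected by a uniform behavior policy.
data = [[(0, 0, 1.0), (0, 1, 2.0)]]
estimate = pdis_estimate(
    data,
    target_prob=lambda s, a: [0.8, 0.2][a],
    behavior_prob=lambda s, a: 0.5,
    gamma=0.5,
)
```

Here the first reward is weighted by the first ratio only (1.6), while the second reward is weighted by the product of both ratios (0.64), which is where the variance reduction relative to the trajectory-wise estimator comes from.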

Off-Policy Actor Critic Methods

Learning the Critic using V-Trace

V-trace is a method to estimate the value function for a target policy using off-policy data. First, consider the $n$-step target value for $V(s_i)$ in the on-policy case:

$$
V_i = V(s_i) + \sum_{t=i}^{i+n-1} \gamma^{t-i} \delta_t
$$

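where we define $\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$ as the TD error at time $t$. To extend this to the off-policy case, we use the per-step importance ratio trick. However, to bound the variance of the estimator, we truncate the IS weights. In particular, we define

$$
c_t = \min\left( \bar{c}, \frac{\pi(a_t \mid s_t)}{\pi_b(a_t \mid s_t)} \right), \qquad
\rho_t = \min\left( \bar{\rho}, \frac{\pi(a_t \mid s_t)}{\pi_b(a_t \mid s_t)} \right)
$$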

At first glance these look redundant, but they play different roles:

  • $\rho_t$ corrects the local TD error $\delta_t$ at time $t$. It answers: if action $a_t$ was sampled from $\pi_b$, how much should the one-step update at time $t$ count when estimating the value of the target policy $\pi$?
  • $c_t$ controls how strongly later TD errors are allowed to propagate backward to earlier states through the trace. It answers: how much should information from future time steps keep flowing back through this off-policy trajectory?

So $\rho_t$ changes the actual correction term at time $t$, while $c_t$ changes the amount of credit propagation across multiple steps. This is why it is useful to give them different clipping thresholds.
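A small sketch of the two truncated weights, assuming each is the per-step ratio $\pi(a_t \mid s_t) / \pi_b(a_t \mid s_t)$ clipped from above at its own threshold (the same convention as the `torch.clamp(rhos, max=...)` calls in the implementation later in this section). The function name `truncated_is_weights` is hypothetical.

```python
import math


def truncated_is_weights(target_log_probs, behavior_log_probs,
                         rho_bar=1.0, c_bar=1.0):
    """Return (rhos, cs): per-step ratios clipped at rho_bar and c_bar."""
    rhos, cs = [], []
    for log_p_target, log_p_behavior in zip(target_log_probs, behavior_log_probs):
        # Work in log space, then exponentiate, to avoid dividing tiny numbers.
        ratio = math.exp(log_p_target - log_p_behavior)
        rhos.append(min(rho_bar, ratio))  # weight on the local TD error
        cs.append(min(c_bar, ratio))      # weight on backward propagation
    return rhos, cs


# Raw ratios are roughly 3.0 and 0.2; only the first one gets clipped.
rhos, cs = truncated_is_weights(
    target_log_probs=[math.log(0.9), math.log(0.1)],
    behavior_log_probs=[math.log(0.3), math.log(0.5)],
    rho_bar=2.0,
    c_bar=1.0,
)
```

With different thresholds, the same raw ratio of about 3 becomes 2 for the TD-error correction but only 1 for the trace, while the ratio of 0.2 passes through both unchanged.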

To see where the target comes from, start from the on-policy decomposition

$$
V_i = V(s_i) + \sum_{t=i}^{i+n-1} \gamma^{t-i} \delta_t
$$

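If the data were generated by $\pi_b$ instead of $\pi$, the unbiased correction for the TD error at time $t$ would multiply $\delta_t$ by the cumulative importance ratio

$$
\prod_{t'=i}^{t} \frac{\pi(a_{t'} \mid s_{t'})}{\pi_b(a_{t'} \mid s_{t'})}
$$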

But this full product is exactly what causes the variance to explode. V-trace therefore makes a more conservative choice:

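  • it keeps a single-step correction $\rho_t$ on the TD error $\delta_t$
  • it uses the clipped factors $c_i, \dots, c_{t-1}$ only to determine how far that corrected error is propagated backward

This yields

$$
v_i = V(s_i) + \sum_{t=i}^{i+n-1} \gamma^{t-i} \left( \prod_{t'=i}^{t-1} c_{t'} \right) \rho_t \delta_t
$$

or equivalently

$$
v_i = V(s_i) + \rho_i \delta_i + \gamma c_i \rho_{i+1} \delta_{i+1} + \gamma^2 c_i c_{i+1} \rho_{i+2} \delta_{i+2} + \cdots
$$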

Intuition

Each future TD error $\delta_t$ tries to update not only $V(s_t)$, but also the values of earlier states $s_i$ for $i < t$.

  • $\rho_t$ says how much we trust the content of the correction at time $t$.
  • $c_t$ says how much we trust passing that correction through the trace to earlier states.

If we used the same unclipped importance ratio everywhere, then one unlikely off-policy action could make the entire multi-step target explode. V-trace avoids this by separating “how much should I correct this TD error?” from “how far backward should this corrected error propagate?”

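Recall that the truncation levels $\bar{c}$ and $\bar{\rho}$ are hyperparameters. We then define the V-trace target value for $V(s_i)$ as

$$
v_i = V(s_i) + \sum_{t=i}^{i+n-1} \gamma^{t-i} \left( \prod_{t'=i}^{t-1} c_{t'} \right) \rho_t \delta_t
$$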

Note that we can compute these targets recursively using

$$
v_i = V(s_i) + \rho_i \delta_i + \gamma c_i \left( v_{i+1} - V(s_{i+1}) \right)
$$

The product of the weights $c_i \cdots c_{t-1}$ (known as the trace) measures how much a temporal difference $\delta_t$ at time $t$ impacts the update of the value function at an earlier time $i$. If the policies are very different, the variance of this product will be large. So the truncation parameter $\bar{c}$ is used to reduce the variance.
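To check that the backward recursion and the explicit trace-weighted sum agree, here is a minimal sketch in plain Python. It assumes the already-truncated weights $\rho_t$ and $c_t$ are given as lists, that `values` carries one extra bootstrap entry for the state after the last step, and that there are no episode boundaries inside the rollout; the function names are hypothetical.

```python
def vtrace_targets(rewards, values, rhos, cs, gamma=0.99):
    """V-trace targets via the backward recursion.

    values has len(rewards) + 1 entries; the last one is the bootstrap value.
    """
    T = len(rewards)
    vs = list(values)  # vs[T] stays equal to the bootstrap value
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        vs[t] = (values[t] + rhos[t] * delta
                 + gamma * cs[t] * (vs[t + 1] - values[t + 1]))
    return vs[:T]


def vtrace_target_direct(i, rewards, values, rhos, cs, gamma=0.99):
    """The same target for state i, written as an explicit sum over the trace."""
    target, trace = values[i], 1.0
    for t in range(i, len(rewards)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        target += gamma ** (t - i) * trace * rhos[t] * delta
        trace *= cs[t]  # extend the product c_i * ... * c_t for the next step
    return target


rewards = [1.0, 0.0, 2.0]
values = [0.5, 0.2, -0.1, 0.3]
rhos = [0.7, 1.0, 0.4]
cs = [0.7, 1.0, 0.4]
recursive = vtrace_targets(rewards, values, rhos, cs, gamma=0.9)
direct = [vtrace_target_direct(i, rewards, values, rhos, cs, gamma=0.9)
          for i in range(len(rewards))]
```

When all weights equal 1, the TD errors telescope and the target reduces to the ordinary discounted return plus the discounted bootstrap value.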

The use of the target $\rho_t \delta_t$ rather than $\delta_t$ means we are evaluating the value function for a policy that is somewhere between $\pi_b$ and $\pi$.

  • For $\bar{\rho} = \infty$ (i.e. no truncation), we converge to the value function $V^{\pi}$.
  • For $\bar{\rho} \to 0$, we converge to the value function $V^{\pi_b}$.

Tip

It has been shown that $\bar{c} = 1$ and $\bar{\rho} = 1$ work well in practice [1].

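If $\bar{c} = \bar{\rho}$, then $c_t = \rho_t$. This gives rise to the simplified form

$$
v_i = V(s_i) + \sum_{j=0}^{n-1} \gamma^{j} \left( \prod_{m=0}^{j} c_{i+m} \right) \delta_{i+j}
$$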

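We can use the above V-trace targets to learn an approximate value function $V_\phi$ by minimizing the usual squared-error loss

$$
L(\phi) = \mathbb{E}_{t \sim \mathcal{D}} \left[ \left( v_t - V_\phi(s_t) \right)^2 \right]
$$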

Learning the Actor

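To update the actor using an off-policy estimate of the policy gradient, we start by defining the objective to be the expected value of the new policy, where states are drawn from the behavior policy’s (discounted) state distribution $p^{\gamma}_{\pi_b}(s)$, but the actions are drawn from the target policy $\pi_\theta$:

$$
J_{\pi_b}(\pi_\theta) = \sum_s p^{\gamma}_{\pi_b}(s)\, V^{\pi_\theta}(s) = \sum_s p^{\gamma}_{\pi_b}(s) \sum_a \pi_\theta(a \mid s)\, Q^{\pi_\theta}(s, a)
$$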

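Differentiating this and ignoring the term $\nabla_\theta Q^{\pi_\theta}(s, a)$, as previously suggested [2], gives rise to an approximate off-policy policy gradient using a one-step IS correction ratio:

$$
\nabla_\theta J_{\pi_b}(\pi_\theta) \approx \sum_s p^{\gamma}_{\pi_b}(s) \sum_a \nabla_\theta \pi_\theta(a \mid s)\, Q^{\pi_\theta}(s, a)
= \mathbb{E}_{p^{\gamma}_{\pi_b}(s)\, \pi_b(a \mid s)} \left[ \frac{\pi_\theta(a \mid s)}{\pi_b(a \mid s)} \nabla_\theta \log \pi_\theta(a \mid s)\, Q^{\pi_\theta}(s, a) \right]
$$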

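In practice, we can approximate $Q^{\pi_\theta}(s_t, a_t)$ by $q_t = r_t + \gamma v_{t+1}$, where $v_{t+1}$ is the V-trace estimate for state $s_{t+1}$. If we use $V_\phi(s_t)$ as a baseline to reduce the variance, we get the following gradient estimate for the policy:

$$
\nabla_\theta J(\theta) \approx \mathbb{E}_{t \sim \mathcal{D}} \left[ \rho_t \nabla_\theta \log \pi_\theta(a_t \mid s_t) \left( r_t + \gamma v_{t+1} - V_\phi(s_t) \right) \right]
$$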
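The actor update can be sketched as a weighted log-likelihood loss: each step's log-probability is weighted by $\rho_t (r_t + \gamma v_{t+1} - V(s_t))$, which is held constant during differentiation. This is an illustrative sketch in plain Python; `policy_gradient_weights` and `surrogate_loss` are hypothetical names, and `values` and `vs` are assumed to carry one extra bootstrap entry.

```python
def policy_gradient_weights(rewards, values, vs, rhos, gamma=0.99):
    """Per-step weights rho_t * (r_t + gamma * v_{t+1} - V(s_t)).

    values: critic outputs V(s_0..s_T); vs: V-trace targets v_0..v_T, where
    the final entry of each is the bootstrap value for the last state.
    """
    return [
        rho * (r + gamma * vs[t + 1] - values[t])
        for t, (r, rho) in enumerate(zip(rewards, rhos))
    ]


def surrogate_loss(target_log_probs, weights):
    """Loss whose gradient is minus the policy-gradient estimate.

    In an autodiff framework the weights would be detached so that gradients
    flow only through the log-probabilities.
    """
    n = len(weights)
    return -sum(w * lp for w, lp in zip(weights, target_log_probs)) / n


weights = policy_gradient_weights(
    rewards=[1.0, 0.0],
    values=[0.5, 0.4, 0.0],
    vs=[0.9, 0.3, 0.0],
    rhos=[1.0, 0.5],
    gamma=0.9,
)
loss = surrogate_loss([-1.0, -2.0], weights)
```

A positive weight pushes up the probability of the sampled action, and a negative weight pushes it down, scaled by how far the behavior policy is from the target policy on that step.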

We can also replace the 1-step IS-weighted TD error with an IS-weighted GAE value by modifying generalized advantage estimation to replace $\lambda$ with $\lambda_t = \lambda \min(1, \rho_t)$.

Algorithm

Algorithm 6 Actor Critic (Off-Policy)

Input: learning rates $\alpha_\theta$, $\alpha_\phi$, discount $\gamma$, GAE parameter $\lambda$, truncation level $\bar{\rho}$

Initialize actor $\pi_\theta$, critic $V_\phi$, replay buffer $\mathcal{D}$

Initialize target critic $V_{\phi'} \gets V_\phi$

for episode $= 1$ to $M$ do

Sample initial state $s_0$

Initialize empty episode buffer $\mathcal{E} \gets []$

for $t = 0$ to $T-1$ do

Sample action $a_t \sim \pi_b(\cdot \mid s_t)$ from the behavior policy

Execute $a_t$, observe $s_{t+1}$ and $r_t$

Store $(s_t, a_t, r_t, s_{t+1}, \pi_b(a_t \mid s_t))$ in $\mathcal{E}$

$s_t \gets s_{t+1}$

end for

Store trajectory $\mathcal{E}$ in replay buffer $\mathcal{D}$

if enough data in $\mathcal{D}$ then

Sample batch of trajectories $\{\mathcal{E}_i\}$ from $\mathcal{D}$

for each trajectory $\mathcal{E}_i$ in batch do

for $t = T-1, \dots, 0$ do // Calculate advantage estimates using GAE, starting from $\hat{A}_T = 0$

$\delta_t \gets r_t + \gamma V_{\phi'}(s_{t+1}) - V_\phi(s_t)$

$\hat{A}_t \gets \delta_t + \gamma \lambda \hat{A}_{t+1}$

end for

Update critic

$L(\phi) \gets \frac{1}{T} \sum_{t=0}^{T-1} \left( r_t + \gamma V_{\phi'}(s_{t+1}) - V_\phi(s_t) \right)^2$

$\phi \gets \phi - \alpha_\phi \nabla_\phi L(\phi)$

Update actor

$J(\theta) \gets \frac{1}{T} \sum_{t=0}^{T-1} \rho_t \log \pi_\theta(a_t \mid s_t)\, \hat{A}_t$

where $\rho_t = \min\left( \bar{\rho}, \frac{\pi_\theta(a_t \mid s_t)}{\pi_b(a_t \mid s_t)} \right)$ is treated as a constant when differentiating

$\theta \gets \theta + \alpha_\theta \nabla_\theta J(\theta)$

end for

Update target network $\phi'$

end if

end for
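The backward advantage loop in the algorithm above can be sketched in plain Python. This is a minimal illustration, assuming the critic and target-critic values have already been evaluated along the trajectory; `gae_advantages`, `values`, and `next_values` are hypothetical names.

```python
def gae_advantages(rewards, values, next_values, gamma=0.99, lam=0.95):
    """Generalized advantage estimates computed backward in time.

    values[t] stands for the critic output at s_t, and next_values[t] for
    the target-critic output at s_{t+1}.
    """
    T = len(rewards)
    advantages = [0.0] * T
    running = 0.0  # plays the role of A_hat at time T, initialized to 0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * next_values[t] - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages


advantages = gae_advantages(
    rewards=[1.0, 1.0],
    values=[0.0, 0.0],
    next_values=[0.0, 0.0],
    gamma=0.5,
    lam=0.5,
)
```

With zero value estimates both TD errors equal 1, so the last advantage is 1 and the first adds the discounted, $\lambda$-weighted contribution of the second.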

Off-Policy Improvement Methods

Policy improvement methods, such as PPO, can be extended to the off-policy case. The key insight is to realize that we can generalize the lower bound to any reference policy:

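The reference policy can be any previous policy, or a convex combination of them. In particular, if $\pi_k$ is the current policy, we can consider the reference policy to be $\pi_{\text{ref}} = \sum_{i=1}^{M} \nu_i \pi_{k-i}$, where $0 \le \nu_i \le 1$ and $\sum_{i} \nu_i = 1$ are mixture weights. We can approximate the expectation by sampling from the replay buffer, which contains samples from older policies. That is, sampling $(s, a)$ from the reference policy's distribution can be implemented by first drawing $i \sim \nu$ and then drawing $(s, a)$ from the data collected by $\pi_{k-i}$.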
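A minimal sketch of this two-stage sampling, assuming the replay buffer is organized as one list of transitions per older policy and that the mixture weights are given as a list. The names `sample_from_reference`, `buffers`, and `nu` are hypothetical.

```python
import random


def sample_from_reference(buffers, nu, rng=random):
    """Draw one transition from a mixture of older policies.

    buffers[j] holds transitions collected by the policy that is j + 1
    versions old; nu[j] is its mixture weight (the weights sum to 1).
    Returns (age, transition).
    """
    # Stage 1: pick which older policy to sample from, according to nu.
    j = rng.choices(range(len(buffers)), weights=nu, k=1)[0]
    # Stage 2: pick a transition uniformly from that policy's data.
    return j + 1, rng.choice(buffers[j])


rng = random.Random(0)
buffers = [["a1", "a2"], ["b1"]]  # placeholder transitions
age, transition = sample_from_reference(buffers, nu=[0.0, 1.0], rng=rng)
```

Keeping the policy age alongside each sampled transition is convenient because the importance ratio for that sample must use the probabilities of the specific older policy that generated it.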

Note

In the off-policy case, we separate two roles that were the same in ordinary PPO/TRPO:

  • the reference / behavior policy $\pi_{\text{ref}}$ tells us where the replayed data came from
  • the current policy $\pi_k$ tells us which policy we are currently trying to improve

So the importance ratio is written using $\pi_{\text{ref}}$, because it corrects for the fact that actions were sampled from older policies. But the advantage remains $A^{\pi_k}$, because we still want to judge whether an action is good or bad relative to the current policy we are updating.

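To compute the advantage function $A^{\pi_k}$ from off-policy data, we can adapt the V-trace method (Equation 10) to get

$$
A^{\pi_k}_{\text{trace}}(s_t, a_t) = \delta_t + \sum_{j=1}^{n-1} \gamma^{j} \left( \prod_{m=1}^{j} c_{t+m} \right) \delta_{t+j}
$$

where $\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$, and

$$
c_t = \min\left( \bar{c}, \frac{\pi_k(a_t \mid s_t)}{\pi_{k-i}(a_t \mid s_t)} \right)
$$

is the truncated importance sampling ratio.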

To compute the penalty term from off-policy data, we need to choose between the PPO and TRPO approaches.

For PPO, we can derive an off-policy version (known as Generalized PPO)

Caution

This is more delicate than ordinary PPO in practice. In on-policy PPO, the rollout data and the advantage estimates are both tied to the same recent policy, so the surrogate objective is relatively trustworthy. In the off-policy case, both the importance ratios and the advantage estimates must compensate for stale data from older policies, so errors in either one can make the update noisy or biased. This is why off-policy policy-improvement methods typically need conservative clipping, good value estimates, and replay data that is not too far from the current policy.

Implementations

Off-Policy Actor Critic: IMPALA

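An example of an off-policy AC method is IMPALA, which stands for Importance Weighted Actor-Learner Architecture. This uses shared parameters $\theta$ for the policy and value function (with different output heads), and adds an entropy bonus to ensure the policy remains stochastic. We end up with the objective function

$$
\mathcal{L}(\theta) = \mathbb{E}_{t \sim \mathcal{D}} \left[ \lambda_{\text{TD}} \left( V_\theta(s_t) - v_t \right)^2 - \lambda_{\text{PG}}\, \rho_t \left( r_t + \gamma v_{t+1} - V_\theta(s_t) \right) \log \pi_\theta(a_t \mid s_t) - \lambda_{\text{ent}}\, \mathbb{H}\left( \pi_\theta(\cdot \mid s_t) \right) \right]
$$

which is minimized with the V-trace targets $v_t$ and the advantage weights treated as constants. The coefficients $\lambda_{\text{TD}}$, $\lambda_{\text{PG}}$, and $\lambda_{\text{ent}}$ weight the value, policy, and entropy terms.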

The only difference from standard A2C is that we need to store the behavior-policy probability of each action, $\pi_b(a_t \mid s_t)$, in addition to $(s_t, a_t, r_t, s_{t+1})$ in the dataset $\mathcal{D}$, so that we can compute the importance ratio $\rho_t$ (Equation 7).

For a learning-oriented implementation, it is easiest to write a synchronous version: collect a rollout using a slightly stale copy of the current policy (the behavior policy), then update the current model using the V-trace targets computed from that rollout. This is not the full distributed IMPALA architecture, but it shows the core update clearly.

import copy
 
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
from torch.distributions import Categorical
 
 
class ActorCriticNetwork(nn.Module):
    def __init__(self, state_dim, action_dim, hidden_dim=128):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Linear(state_dim, hidden_dim),
            nn.ReLU(),
        )
        self.policy_head = nn.Linear(hidden_dim, action_dim)
        self.value_head = nn.Linear(hidden_dim, 1)
 
    def forward(self, state):
        if isinstance(state, np.ndarray):
            state = torch.from_numpy(state)
        if state.ndim == 1:
            state = state.unsqueeze(0)
        state = state.float()
 
        features = self.backbone(state)
        logits = self.policy_head(features)
        state_value = self.value_head(features).squeeze(-1)
        return logits, state_value
 
 
def collect_rollout(env, behavior_model, state, rollout_length: int):
    states = []
    actions = []
    rewards = []
    dones = []
    behavior_log_probs = []
 
    for _ in range(rollout_length):
        with torch.no_grad():
            logits, _ = behavior_model(state)
            dist = Categorical(logits=logits)
            action = dist.sample()
 
        next_state, reward, terminated, truncated, _ = env.step(action.item())
        done = terminated or truncated
 
        states.append(torch.as_tensor(state, dtype=torch.float32))
        actions.append(action.squeeze(0))
        rewards.append(torch.tensor(reward, dtype=torch.float32))
        dones.append(torch.tensor(float(done), dtype=torch.float32))
        behavior_log_probs.append(dist.log_prob(action).squeeze(0))
 
        state = next_state
        if done:
            state, _ = env.reset()
 
    batch = {
        "states": torch.stack(states),
        "actions": torch.stack(actions),
        "rewards": torch.stack(rewards),
        "dones": torch.stack(dones),
        "behavior_log_probs": torch.stack(behavior_log_probs),
    }
    return batch, state
 
 
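def compute_vtrace_targets(
    rewards,
    dones,
    values,
    target_log_probs,
    behavior_log_probs,
    gamma: float = 0.99,
    rho_bar: float = 1.0,
    c_bar: float = 1.0,
):
    """Compute V-trace value targets and policy-gradient advantages.

    `rewards`, `dones`, and both log-prob tensors have length T. `values` has
    length T + 1 because its last entry is the bootstrap value of the state
    that follows the rollout.
    """
    with torch.no_grad():
        # Per-step importance ratios pi(a_t | s_t) / pi_b(a_t | s_t), in log space.
        log_rhos = target_log_probs - behavior_log_probs
        rhos = log_rhos.exp()
        clipped_rhos = torch.clamp(rhos, max=rho_bar)  # rho_t = min(rho_bar, ratio)
        clipped_cs = torch.clamp(rhos, max=c_bar)  # c_t = min(c_bar, ratio)

        vs = torch.zeros_like(values)
        vs[-1] = values[-1]  # beyond the rollout, the target is the bootstrap value

        # Backward recursion:
        #   v_t = V(s_t) + rho_t * delta_t + gamma * c_t * (v_{t+1} - V(s_{t+1}))
        # The (1 - done) factor switches off bootstrapping at episode boundaries.
        for t in reversed(range(len(rewards))):
            delta = clipped_rhos[t] * (
                rewards[t] + gamma * (1 - dones[t]) * values[t + 1] - values[t]
            )
            vs[t] = values[t] + delta + (
                gamma * (1 - dones[t]) * clipped_cs[t] * (vs[t + 1] - values[t + 1])
            )

        # Actor advantage: rho_t * (r_t + gamma * v_{t+1} - V(s_t)).
        pg_advantages = clipped_rhos * (
            rewards + gamma * (1 - dones) * vs[1:] - values[:-1]
        )
        return vs[:-1], pg_advantages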
 
 
def train_impala(
    env,
    num_iterations: int,
    gamma: float = 0.99,
    lr: float = 3e-4,
    rollout_length: int = 32,
    value_coef: float = 0.5,
    entropy_coef: float = 0.01,
    rho_bar: float = 1.0,
    c_bar: float = 1.0,
):
    state_dim = env.observation_space.shape[0]
    action_dim = env.action_space.n
 
    model = ActorCriticNetwork(state_dim, action_dim)
    optimizer = optim.Adam(model.parameters(), lr=lr)
 
    state, _ = env.reset()
 
    for iteration in range(num_iterations):
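        # In real IMPALA, actors run asynchronously and may lag behind the learner.
        # Here the snapshot is refreshed every iteration and the learner takes one
        # gradient step per rollout, so the importance ratios equal 1 (up to
        # floating-point error). Refresh the snapshot less often to emulate lag.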
        behavior_model = copy.deepcopy(model)
        batch, state = collect_rollout(env, behavior_model, state, rollout_length)
 
        states = batch["states"]
        actions = batch["actions"]
        rewards = batch["rewards"]
        dones = batch["dones"]
        behavior_log_probs = batch["behavior_log_probs"]
 
        with torch.no_grad():
            _, bootstrap_value = model(state)
 
        logits, values = model(states)
        values = torch.cat([values, bootstrap_value], dim=0)
 
        dist = Categorical(logits=logits)
        target_log_probs = dist.log_prob(actions)
        entropy = dist.entropy().mean()
 
        vtrace_targets, pg_advantages = compute_vtrace_targets(
            rewards,
            dones,
            values,
            target_log_probs,
            behavior_log_probs,
            gamma=gamma,
            rho_bar=rho_bar,
            c_bar=c_bar,
        )
 
        policy_loss = -(target_log_probs * pg_advantages.detach()).mean()
        value_loss = (values[:-1] - vtrace_targets.detach()).pow(2).mean()
        loss = policy_loss + value_coef * value_loss - entropy_coef * entropy
 
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    # Return the trained network so the caller can evaluate or save it.
    return model

Sources

  • Murphy, K. (2025). Reinforcement Learning: An Overview. Chapter 3.

Footnotes

  1. IMPALA: Scalable Distributed Deep-RL with Importance Weighted Actor-Learner Architectures

  2. Off-Policy Actor-Critic