In many cases, it is useful to train a policy using data collected from a behavior policy $\pi_b(a \mid s)$ that differs from the target policy $\pi(a \mid s)$ being learned. Examples include:
Data collected from earlier trials or parallel workers (with different parameters $\theta'$) and stored in a replay buffer
Demonstration data from human experts
This is known as off-policy RL. It can be much more sample efficient than on-policy methods, since it can reuse data from multiple sources.
Caution
Off-policy methods are more complicated than on-policy methods. The basic difficulty is that the target policy we want to learn may want to try an action in a state that has not been experienced in the existing data, so there is no way to predict the outcome of this new $(s, a)$ pair.
To tackle this problem, we generally assume that the target policy is not too different from the behavior policy, so that the ratio $\pi(a \mid s) / \pi_b(a \mid s)$ is bounded. In the online setting, we can ensure this property by using conservative updates to the policy. Alternatively, we can use policy gradient methods with various regularization methods.
Policy Evaluation using Importance Sampling
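Assume we have a dataset of the form $\mathcal{D} = \{\tau^{(i)}\}_{1 \le i \le n}$, where each trajectory is a sequence $\tau^{(i)} = (s_0^{(i)}, a_0^{(i)}, r_0^{(i)}, s_1^{(i)}, \ldots)$, the actions are sampled according to a behavior policy $\pi_b$, and the rewards and next states are sampled according to the reward and transition models.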
Goal
We want to use this offline dataset to evaluate the performance of some target policy $\pi$. This is called off-policy evaluation or OPE.
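If the trajectories $\tau^{(i)}$ were sampled from $\pi$, we could use the standard Monte Carlo estimate
$$\hat{J}(\pi) = \frac{1}{n} \sum_{i=1}^{n} \sum_{t=0}^{T-1} \gamma^t r_t^{(i)}$$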
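However, since the trajectories are sampled from $\pi_b$, we use importance sampling (IS) to correct for the distribution mismatch. This gives
$$\hat{J}_{\text{IS}}(\pi) = \frac{1}{n} \sum_{i=1}^{n} \frac{p(\tau^{(i)} \mid \pi)}{p(\tau^{(i)} \mid \pi_b)} \sum_{t=0}^{T-1} \gamma^t r_t^{(i)}$$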
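It can be shown that $\mathbb{E}_{\pi_b}[\hat{J}_{\text{IS}}(\pi)] = J(\pi)$, that is, $\hat{J}_{\text{IS}}(\pi)$ is unbiased, provided that $p(\tau \mid \pi_b) > 0$ whenever $p(\tau \mid \pi) > 0$. The importance ratio $p(\tau \mid \pi) / p(\tau \mid \pi_b)$ compensates for the fact that the data is sampled from $\pi_b$ and not $\pi$. The initial-state, transition, and reward terms appear in both the numerator and the denominator and cancel, so the ratio simplifies as follows:
$$\frac{p(\tau \mid \pi)}{p(\tau \mid \pi_b)} = \prod_{t=0}^{T-1} \frac{\pi(a_t \mid s_t)}{\pi_b(a_t \mid s_t)}$$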
Tip
This simplification makes it easy to apply IS, as long as the target and behavior policies are known. If the behavior policy is unknown, we can estimate it from $\mathcal{D}$, and replace $\pi_b$ by its estimate $\hat{\pi}_b$.
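For convenience, we define the per-step importance ratio at time $t$ as
$$\rho_t(\tau) \triangleq \frac{\pi(a_t \mid s_t)}{\pi_b(a_t \mid s_t)}$$
We can reduce the variance of the estimator (Equation 2) by noting that the reward $r_t$ is independent of the trajectory beyond time $t$. This leads to a per-decision importance sampling variant:
$$\hat{J}_{\text{PDIS}}(\pi) = \frac{1}{n} \sum_{i=1}^{n} \sum_{t=0}^{T-1} \left( \prod_{t' \le t} \rho_{t'}(\tau^{(i)}) \right) \gamma^t r_t^{(i)}$$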
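The two estimators can be compared side by side in a few lines of plain Python. This is a minimal sketch, not a reference implementation: it assumes each trajectory is stored as a list of `(reward, pi_prob, pi_b_prob)` tuples, and the function names `is_estimate` and `pdis_estimate` are hypothetical.

```python
import math


def is_estimate(trajectories, gamma):
    """Trajectory-level IS: weight each full return by the product of all ratios.

    Each trajectory is a list of (reward, pi_prob, pi_b_prob) tuples, where
    pi_prob = pi(a_t | s_t) and pi_b_prob = pi_b(a_t | s_t). (Assumed layout.)
    """
    total = 0.0
    for traj in trajectories:
        weight = math.prod(p / pb for _, p, pb in traj)
        ret = sum(gamma**t * r for t, (r, _, _) in enumerate(traj))
        total += weight * ret
    return total / len(trajectories)


def pdis_estimate(trajectories, gamma):
    """Per-decision IS: reward r_t is weighted only by the ratios up to time t."""
    total = 0.0
    for traj in trajectories:
        cum_ratio = 1.0
        for t, (r, p, pb) in enumerate(traj):
            cum_ratio *= p / pb
            total += cum_ratio * gamma**t * r
    return total / len(trajectories)


# Toy data: one two-step trajectory collected under pi_b.
data = [[(1.0, 0.8, 0.4), (2.0, 0.1, 0.5)]]
print(is_estimate(data, gamma=0.5), pdis_estimate(data, gamma=0.5))
```

When the target and behavior probabilities coincide, every ratio is 1 and both functions reduce to the plain Monte Carlo average of discounted returns.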
Off-Policy Actor Critic Methods
Learning the Critic using V-Trace
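V-trace is a method to estimate the value function for a target policy $\pi$ using off-policy data. First, consider the $n$-step target value for $V(s_i)$ in the on-policy case:
$$V_i = V(s_i) + \sum_{t=i}^{i+n-1} \gamma^{t-i} r_t + \gamma^n V(s_{i+n}) - V(s_i) = V(s_i) + \sum_{t=i}^{i+n-1} \gamma^{t-i} \delta_t$$
where we define $\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$ as the TD error at time $t$. To extend this to the off-policy case, we use the per-step importance ratio trick. However, to bound the variance of the estimator, we truncate the IS weights. In particular, we define
$$c_t = \min\left( \bar{c}, \frac{\pi(a_t \mid s_t)}{\pi_b(a_t \mid s_t)} \right), \qquad \rho_t = \min\left( \bar{\rho}, \frac{\pi(a_t \mid s_t)}{\pi_b(a_t \mid s_t)} \right)$$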
At first glance these look redundant, but they play different roles:
$\rho_t$ corrects the local TD error at time $t$. It answers: if action $a_t$ was sampled from $\pi_b$, how much should the one-step update at time $t$ count when estimating the target policy?
$c_t$ controls how strongly later TD errors are allowed to propagate backward to earlier states through the trace. It answers: how much should information from future time steps keep flowing back through this off-policy trajectory?
So $\rho_t$ changes the actual correction term at time $t$, while $c_t$ changes the amount of credit propagation across multiple steps. This is why it is useful to give them different clipping thresholds.
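As a small illustration of the two truncated weights, the sketch below computes them from stored log-probabilities. The function name `truncated_weights` and the choice to work in log space are assumptions made for this example.

```python
import math


def truncated_weights(target_log_probs, behavior_log_probs, rho_bar=1.0, c_bar=1.0):
    """Return (rhos, cs): the same raw ratio pi/pi_b, clipped at two thresholds."""
    ratios = [
        math.exp(lt - lb) for lt, lb in zip(target_log_probs, behavior_log_probs)
    ]
    rhos = [min(rho_bar, r) for r in ratios]  # scales the local TD error
    cs = [min(c_bar, r) for r in ratios]      # scales backward propagation
    return rhos, cs


# Two steps: the first action is 3x more likely under the target policy,
# the second is 5x less likely.
rhos, cs = truncated_weights(
    [math.log(0.9), math.log(0.1)],
    [math.log(0.3), math.log(0.5)],
    rho_bar=2.0,
    c_bar=1.0,
)
print(rhos, cs)
```

With different thresholds, the first step keeps a larger local correction (`rho_bar=2.0`) while its contribution to the trace is still capped at 1.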
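To see where the target comes from, start from the on-policy decomposition
$$V_i = V(s_i) + \sum_{t=i}^{i+n-1} \gamma^{t-i} \delta_t$$
If the data were generated by $\pi_b$ instead of $\pi$, the unbiased correction for the TD error at time $t$ would multiply $\delta_t$ by the cumulative importance ratio
$$\prod_{t'=i}^{t} \frac{\pi(a_{t'} \mid s_{t'})}{\pi_b(a_{t'} \mid s_{t'})}$$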
But this full product is exactly what causes the variance to explode. V-trace therefore makes a more conservative choice:
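it keeps a single-step correction $\rho_t$ on the TD error $\delta_t$
it uses the clipped factors $c_i, \ldots, c_{t-1}$ only to determine how far that corrected error is propagated backward
This yields
$$v_i = V(s_i) + \sum_{t=i}^{i+n-1} \gamma^{t-i} \left( \prod_{t'=i}^{t-1} c_{t'} \right) \rho_t \delta_t$$
or equivalently, writing out the first few terms,
$$v_i = V(s_i) + \rho_i \delta_i + \gamma c_i \rho_{i+1} \delta_{i+1} + \gamma^2 c_i c_{i+1} \rho_{i+2} \delta_{i+2} + \cdots$$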
Intuition
Each future TD error $\delta_t$ tries to update not only $V(s_t)$, but also earlier states $s_i$ for $i < t$.
$\rho_t$ says how much we trust the content of the correction at time $t$.
The product $c_i \cdots c_{t-1}$ says how much we trust passing that correction through the trace to earlier states.
If we used the same unclipped importance ratio everywhere, then one unlikely off-policy action could make the entire multi-step target explode. V-trace avoids this by separating “how much should I correct this TD error?” from “how far backward should this corrected error propagate?”
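The effect of one unlikely action on a multi-step product can be seen with a toy sequence of ratios. This is an illustrative sketch only: the numbers are made up, and the helper name `cumulative_weights` is hypothetical.

```python
def cumulative_weights(ratios, c_bar=None):
    """Running products of per-step weights.

    With c_bar=None the raw ratios are multiplied; otherwise each ratio is
    first truncated to min(c_bar, ratio), as in the V-trace trace.
    """
    out, prod = [], 1.0
    for r in ratios:
        prod *= r if c_bar is None else min(c_bar, r)
        out.append(prod)
    return out


# Made-up ratios: the third action was very unlikely under the behavior policy.
ratios = [1.2, 0.9, 25.0, 1.1, 0.8]
raw = cumulative_weights(ratios)                  # jumps to about 27 at step 2
clipped = cumulative_weights(ratios, c_bar=1.0)   # never exceeds 1
print(raw)
print(clipped)
```

With `c_bar=1.0` the clipped product can only shrink over time, so no single step can inflate the weight on later TD errors.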
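Recall that $\bar{c}$ and $\bar{\rho}$ in the definitions of $c_t$ and $\rho_t$ are hyperparameters. We then define the V-trace target value for $V(s_i)$ as
$$v_i = V(s_i) + \sum_{t=i}^{i+n-1} \gamma^{t-i} \left( \prod_{t'=i}^{t-1} c_{t'} \right) \rho_t \delta_t$$
Note that we can compute these targets recursively using
$$v_i = V(s_i) + \rho_i \delta_i + \gamma c_i \left( v_{i+1} - V(s_{i+1}) \right)$$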
The product of the weights $c_i \cdots c_{t-1}$ (known as the trace) measures how much a temporal difference $\delta_t$ at time $t$ impacts the update of the value function at an earlier time $i$. If the policies are very different, the variance of this product will be large, so the truncation parameter $\bar{c}$ is used to reduce the variance.
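To check that the explicit sum and the backward recursion agree, here is a minimal pure-Python sketch. It assumes `values` has one more entry than `rewards` (the last entry is the bootstrap value), ignores episode termination, and takes the already truncated weights `rhos` and `cs` as inputs; the function names are hypothetical.

```python
def vtrace_sum(i, rewards, values, rhos, cs, gamma):
    """v_i = V(s_i) + sum_t gamma^(t-i) * (c_i ... c_{t-1}) * rho_t * delta_t."""
    v, trace = values[i], 1.0
    for t in range(i, len(rewards)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        v += gamma ** (t - i) * trace * rhos[t] * delta
        trace *= cs[t]  # extend the trace by one more step
    return v


def vtrace_recursive(rewards, values, rhos, cs, gamma):
    """Backward pass: v_t = V_t + rho_t*delta_t + gamma*c_t*(v_{t+1} - V_{t+1})."""
    T = len(rewards)
    vs = list(values)  # vs[T] = V(s_T) acts as the bootstrap value
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        vs[t] = values[t] + rhos[t] * delta + gamma * cs[t] * (vs[t + 1] - values[t + 1])
    return vs[:T]


rewards = [1.0, 0.0, 2.0]
values = [0.5, 1.0, -0.5, 0.25]  # V(s_0), ..., V(s_3)
rhos = [1.0, 0.4, 1.0]
cs = [1.0, 0.4, 0.7]
targets = vtrace_recursive(rewards, values, rhos, cs, gamma=0.9)
print(targets)
```

The recursion costs one pass over the trajectory, whereas evaluating the sum separately for every start index is quadratic in the trajectory length.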
The use of the target $\rho_t \delta_t$ rather than $\delta_t$ means we are evaluating the value function for a policy that is somewhere between $\pi_b$ and $\pi$.
For $\bar{\rho} = \infty$ (i.e. no truncation), we converge to the value function $V^{\pi}$.
For $\bar{\rho} \to 0$, we converge to the value function $V^{\pi_b}$.
Tip
It has been shown that $\bar{c} = 1$ and $\bar{\rho} = 1$ work well in practice. [1]
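If $\bar{c} = \bar{\rho}$, then $c_t = \rho_t$. This gives rise to the simplified form
$$v_i = V(s_i) + \sum_{t=i}^{i+n-1} \gamma^{t-i} \left( \prod_{t'=i}^{t} c_{t'} \right) \delta_t$$
We can use the above V-trace targets to learn an approximate value function $V_\phi$ by minimizing the usual $\ell_2$ loss
$$\mathcal{L}(\phi) = \mathbb{E}_{t \sim \mathcal{D}} \left[ \left( v_t - V_\phi(s_t) \right)^2 \right]$$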
Learning the Actor
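To update the actor using an off-policy estimate of the policy gradient, we start by defining the objective to be the expected value of the new policy, where states are drawn from the behavior policy’s state distribution, but the actions are drawn from the target policy
$$J_{\pi_b}(\pi_\theta) = \sum_s p^{\gamma}_{\pi_b}(s) V^{\pi}(s) = \sum_s p^{\gamma}_{\pi_b}(s) \sum_a \pi_\theta(a \mid s) Q^{\pi}(s, a)$$
Differentiating this and ignoring the term $\nabla_\theta Q^{\pi}(s, a)$, as previously suggested [2], gives rise to an approximate off-policy policy gradient using a one-step IS correction ratio
$$\nabla_\theta J_{\pi_b}(\pi_\theta) \approx \sum_s \sum_a p^{\gamma}_{\pi_b}(s) \nabla_\theta \pi_\theta(a \mid s) Q^{\pi}(s, a) = \mathbb{E}_{p^{\gamma}_{\pi_b}(s)\, \pi_b(a \mid s)} \left[ \frac{\pi_\theta(a \mid s)}{\pi_b(a \mid s)} \nabla_\theta \log \pi_\theta(a \mid s) Q^{\pi}(s, a) \right]$$
In practice, we can approximate $Q^{\pi}(s_t, a_t)$ by $q_t = r_t + \gamma v_{t+1}$, where $v_{t+1}$ is the V-trace estimate for state $s_{t+1}$. If we use $V_\phi(s_t)$ as a baseline to reduce the variance, we get the following gradient estimate for the policy
$$\nabla_\theta J(\theta) \approx \mathbb{E}_{t \sim \mathcal{D}} \left[ \rho_t \nabla_\theta \log \pi_\theta(a_t \mid s_t) \left( r_t + \gamma v_{t+1} - V_\phi(s_t) \right) \right]$$
We can also replace the 1-step IS-weighted TD error $\rho_t (r_t + \gamma v_{t+1} - V_\phi(s_t))$ with an IS-weighted GAE value by modifying the generalized advantage estimation to replace $\lambda$ with $\lambda_t = \lambda \min(1, \rho_t)$.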
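For a single transition, the one-step IS-weighted gradient term can be written out by hand. The sketch below assumes a softmax policy over a small discrete action set parameterized directly by its logits (so the gradient of the log-probability is the one-hot vector minus the probabilities); the names `softmax` and `actor_grad_sample` are hypothetical.

```python
import math


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def actor_grad_sample(logits, action, behavior_prob, reward, v_next, v_curr,
                      gamma=0.99, rho_bar=1.0):
    """One-sample estimate rho_t * grad log pi(a_t|s_t) * (r_t + gamma*v_{t+1} - V(s_t)).

    The gradient is taken with respect to the logits of a softmax policy,
    and rho_t is treated as a constant (no gradient flows through it).
    """
    probs = softmax(logits)
    rho = min(rho_bar, probs[action] / behavior_prob)
    advantage = reward + gamma * v_next - v_curr
    # d/d logits of log softmax(logits)[action] = one_hot(action) - probs
    return [
        rho * advantage * ((1.0 if k == action else 0.0) - p)
        for k, p in enumerate(probs)
    ]


grad = actor_grad_sample([0.0, 0.0], action=0, behavior_prob=0.5,
                         reward=1.0, v_next=0.0, v_curr=0.0)
print(grad)
```

A positive advantage pushes the logit of the sampled action up and the others down, scaled by the truncated ratio.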
Algorithm
Algorithm 6 Actor Critic (Off-Policy)
```
Input: learning rates αθ, αϕ, discount γ, GAE parameter λ, truncation level ρ̄
Initialize actor πθ, critic Vϕ, replay buffer D
Initialize target critic Vϕ′ ← Vϕ
for episode = 1 to M do
    Sample initial state s0
    Initialize empty episode buffer E ← []
    for t = 0 to T−1 do
        Sample action at ∼ πb(⋅∣st)
        Execute at, observe st+1 and rt
        Store (st, at, rt, st+1, πb(at∣st)) in E
        st ← st+1
    end for
    Store trajectory E in replay buffer D
    if enough data in D then
        Sample batch of trajectories {Ei} from D
        for each trajectory Ei in batch do
            ÂT ← 0
            for t = T−1, …, 0 do    // Calculate advantage estimates using GAE
                δt ← rt + γVϕ′(st+1) − Vϕ(st)
                Ât ← δt + γλÂt+1
            end for
            Update critic
                L(ϕ) ← (1/T) ∑_{t=0}^{T−1} (rt + γVϕ′(st+1) − Vϕ(st))²
                ϕ ← ϕ − αϕ ∇ϕ L(ϕ)
            Update actor
                J(θ) ← (1/T) ∑_{t=0}^{T−1} ρt log πθ(at∣st) Ât
                where ρt = min(ρ̄, πθ(at∣st) / πb(at∣st)) is treated as a constant
                θ ← θ + αθ ∇θ J(θ)
        end for
        Update target network ϕ′
    end if
end for
```
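The backward GAE loop in the algorithm above can be sketched in plain Python. This is an illustration under stated assumptions: `values[t]` stands for the online critic's estimate at the current state, `next_values[t]` for the target critic's estimate at the next state, the recursion starts from an advantage of zero beyond the last step, and the function name `gae_advantages` is hypothetical.

```python
def gae_advantages(rewards, values, next_values, gamma=0.99, lam=0.95):
    """Backward recursion A_t = delta_t + gamma * lam * A_{t+1}, with A_T = 0."""
    advantages = [0.0] * len(rewards)
    running = 0.0  # plays the role of A_T = 0 before the loop starts
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * next_values[t] - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages


# Two-step toy trajectory with a critic that predicts zero everywhere.
adv = gae_advantages([1.0, 1.0], [0.0, 0.0], [0.0, 0.0], gamma=0.5, lam=1.0)
print(adv)
```

Setting `lam=0` recovers the one-step TD error, while `lam=1` accumulates all discounted TD errors to the end of the trajectory.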
Off-Policy Improvement Methods
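Policy improvement methods, such as PPO, can be extended to the off-policy case. The key insight is to realize that we can generalize the lower bound to any reference policy $\pi_{\text{ref}}$:
$$J(\pi) - J(\pi_k) \ge \frac{1}{1-\gamma} \mathbb{E}_{p^{\gamma}_{\pi_{\text{ref}}}(s)\, \pi_{\text{ref}}(a \mid s)} \left[ \frac{\pi(a \mid s)}{\pi_{\text{ref}}(a \mid s)} A^{\pi_k}(s, a) \right] - \frac{2 \gamma C^{\pi, \pi_k}}{(1-\gamma)^2} \mathbb{E}_{p^{\gamma}_{\pi_{\text{ref}}}(s)} \left[ \mathrm{TV}\left( \pi(\cdot \mid s), \pi_{\text{ref}}(\cdot \mid s) \right) \right]$$
where $C^{\pi, \pi_k}$ is the same constant as in the on-policy bound. The reference policy can be any previous policy, or a convex combination of them. In particular, if $\pi_k$ is the current policy, we can consider the reference policy to be $\pi_{\text{ref}} = \sum_i \nu_i \pi_{k-i}$, where $0 \le \nu_i \le 1$ and $\sum_i \nu_i = 1$ are mixture weights. We can approximate the expectation by sampling from the replay buffer, which contains samples from older policies. That is, $(s, a) \sim p^{\gamma}_{\pi_{\text{ref}}}$ can be implemented by $i \sim \nu$ and $(s, a) \sim p^{\gamma}_{\pi_{k-i}}$.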
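The two-stage sampling scheme can be sketched with the standard library. The buffer layout is an assumption made for this example: `buffers[i]` is taken to hold state-action pairs generated by the policy that is `i` updates old, and the function name `sample_from_reference` is hypothetical.

```python
import random


def sample_from_reference(buffers, nu, rng):
    """Draw (s, a) from the mixture reference policy.

    buffers[i] holds pairs generated by the policy i updates ago (assumed layout),
    and nu[i] is its mixture weight. First pick which old policy to use, then
    pick one of its stored samples uniformly.
    """
    i = rng.choices(range(len(buffers)), weights=nu, k=1)[0]
    return i, rng.choice(buffers[i])


rng = random.Random(0)
buffers = [
    [("s0", "a0")],                 # data from the current policy
    [("s1", "a1"), ("s2", "a2")],   # data from the previous policy
]
idx, pair = sample_from_reference(buffers, nu=[0.5, 0.5], rng=rng)
print(idx, pair)
```

Returning the index alongside the sample matters in practice, because the importance ratio must use the probabilities of the specific old policy that generated the pair.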
Note
In the off-policy case, we separate two roles that were the same in ordinary PPO/TRPO:
the reference / behavior policy $\pi_{\text{ref}}$ tells us where the replayed data came from
the current policy $\pi_k$ tells us which policy we are currently trying to improve
So the importance ratio is written using $\pi_{\text{ref}}$, because it corrects for the fact that actions were sampled from older policies. But the advantage remains $A^{\pi_k}$, because we still want to judge whether an action is good or bad relative to the current policy we are updating.
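To compute the advantage function $A^{\pi_k}$ from off-policy data, we can adapt the V-trace method (Equation 10) to get
$$A^{\pi_k}_{\text{trace}}(s_t, a_t) = \delta_t + \sum_{j=1}^{n-1} \gamma^j \left( \prod_{m=1}^{j} c_{t+m} \right) \delta_{t+j}$$
where $\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$, and
$$c_t = \min\left( \bar{c}, \frac{\pi_k(a_t \mid s_t)}{\pi_{k-i}(a_t \mid s_t)} \right)$$
is the truncated importance sampling ratio.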
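A minimal sketch of a trace-corrected advantage follows, under the assumption that it has the form $\delta_t + \sum_{j \ge 1} \gamma^j (c_{t+1} \cdots c_{t+j}) \delta_{t+j}$, with the TD errors and truncated ratios precomputed. The function name `trace_advantage` and the argument layout are hypothetical.

```python
def trace_advantage(t, deltas, cs, gamma, n):
    """Advantage at step t from up to n TD errors, weighted by truncated ratios.

    The first TD error is used as is; the TD error j steps later is scaled by
    gamma**j times the product c_{t+1} * ... * c_{t+j}.
    """
    adv, trace = deltas[t], 1.0
    for j in range(1, n):
        if t + j >= len(deltas):
            break  # ran off the end of the stored trajectory
        trace *= cs[t + j]
        adv += gamma**j * trace * deltas[t + j]
    return adv


deltas = [1.0, 2.0, 3.0]
cs = [9.0, 0.5, 1.0]  # cs[0] is never used when t = 0
print(trace_advantage(0, deltas, cs, gamma=0.5, n=3))
```

Note that the ratio at step `t` itself does not enter: the advantage is conditioned on the action actually taken there, and only later actions need correcting.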
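To compute the total variation (TV) penalty term from off-policy data, we need to choose between the PPO and TRPO approaches.
For PPO, we can derive an off-policy version (known as Generalized PPO)
$$\pi_{k+1} = \arg\max_{\pi} \mathbb{E}_{i \sim \nu} \left[ \mathbb{E}_{(s, a) \sim p^{\gamma}_{\pi_{k-i}}} \left[ \min\left( \rho_{k-i}(s, a) A^{\pi_k}(s, a),\ \tilde{\rho}_{k-i}(s, a) A^{\pi_k}(s, a) \right) \right] \right]$$
where $\rho_{k-i}(s, a) = \pi(a \mid s) / \pi_{k-i}(a \mid s)$ and $\tilde{\rho}_{k-i}(s, a) = \mathrm{clip}(\rho_{k-i}(s, a), l, u)$, with $l = \pi_k(a \mid s) / \pi_{k-i}(a \mid s) - \epsilon$ and $u = \pi_k(a \mid s) / \pi_{k-i}(a \mid s) + \epsilon$.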
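A per-sample sketch of such a clipped surrogate is given below. It assumes the clipping interval is centered on the ratio between the current policy and the old data-generating policy, rather than on 1 as in ordinary PPO; the function name `geppo_term` and the default `eps=0.2` are hypothetical choices for illustration.

```python
def geppo_term(pi_new, pi_cur, pi_old, advantage, eps=0.2):
    """Clipped surrogate for one (s, a) pair drawn from an older policy.

    pi_new: probability under the candidate policy being optimized
    pi_cur: probability under the current policy pi_k
    pi_old: probability under the policy that generated the sample
    """
    ratio = pi_new / pi_old
    center = pi_cur / pi_old  # the clip interval is centered here, not at 1
    clipped = min(max(ratio, center - eps), center + eps)
    return min(ratio * advantage, clipped * advantage)


# On-policy data (pi_old == pi_cur) reduces to the usual PPO clip around 1.
print(geppo_term(pi_new=0.6, pi_cur=0.4, pi_old=0.4, advantage=1.0))
# Stale data: the interval is centered on pi_cur / pi_old = 2 instead.
print(geppo_term(pi_new=0.6, pi_cur=0.4, pi_old=0.2, advantage=1.0))
```

Centering the interval on the current-to-old ratio means the clip still limits how far the candidate policy moves from the current policy, even when the sample came from a much older one.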
Caution
This is more delicate than ordinary PPO in practice. In on-policy PPO, the rollout data and the advantage estimates are both tied to the same recent policy, so the surrogate objective is relatively trustworthy. In the off-policy case, both the importance ratios and the advantage estimates must compensate for stale data from older policies, so errors in either one can make the update noisy or biased. This is why off-policy policy-improvement methods typically need conservative clipping, good value estimates, and replay data that is not too far from the current policy.
Implementations
Off-Policy Actor Critic: IMPALA
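An example of an off-policy AC method is IMPALA, which stands for Importance Weighted Actor-Learner Architecture. This uses shared parameters $\phi$ for the policy and value function (different output heads), and adds an entropy bonus to ensure the policy remains stochastic. We end up with the objective function
$$\mathcal{L}(\phi) = \mathbb{E}_{t \sim \mathcal{D}} \left[ \lambda_{\text{TD}} \left( V_\phi(s_t) - v_t \right)^2 - \lambda_{\text{PG}}\, \rho_t \left( r_t + \gamma v_{t+1} - V_\phi(s_t) \right) \log \pi_\phi(a_t \mid s_t) - \lambda_{\text{ent}}\, \mathbb{H}\left( \pi_\phi(\cdot \mid s_t) \right) \right]$$
where the $\lambda$ coefficients weight the three terms, and the targets $v_t$, the weight $\rho_t$, and the advantage term are treated as constants when differentiating.
The only difference from standard A2C is that we need to store the probabilities of each action $\pi_b(a_t \mid s_t)$, in addition to $(s_t, a_t, r_t, s_{t+1})$, in the dataset $\mathcal{D}$, which can be used to compute the importance ratio $\rho_t$ (Equation 7).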
For a learning-oriented implementation, it is easiest to write a synchronous version: collect a rollout using a slightly stale copy of the current policy (the behavior policy), then update the current model using the V-trace targets computed from that rollout. This is not the full distributed IMPALA architecture, but it shows the core update clearly.
```python
import copy

import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
from torch.distributions import Categorical


class ActorCriticNetwork(nn.Module):
    """Shared backbone with separate policy and value heads."""

    def __init__(self, state_dim, action_dim, hidden_dim=128):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Linear(state_dim, hidden_dim),
            nn.ReLU(),
        )
        self.policy_head = nn.Linear(hidden_dim, action_dim)
        self.value_head = nn.Linear(hidden_dim, 1)

    def forward(self, state):
        if isinstance(state, np.ndarray):
            state = torch.from_numpy(state)
        if state.ndim == 1:
            state = state.unsqueeze(0)  # add a batch dimension
        state = state.float()
        features = self.backbone(state)
        logits = self.policy_head(features)
        state_value = self.value_head(features).squeeze(-1)
        return logits, state_value


def collect_rollout(env, behavior_model, state, rollout_length: int):
    """Run the behavior policy for a fixed number of steps (Gymnasium-style env)."""
    states, actions, rewards, dones, behavior_log_probs = [], [], [], [], []

    for _ in range(rollout_length):
        with torch.no_grad():
            logits, _ = behavior_model(state)
            dist = Categorical(logits=logits)
            action = dist.sample()

        next_state, reward, terminated, truncated, _ = env.step(action.item())
        # Simplification: time-limit truncation is treated like termination.
        done = terminated or truncated

        states.append(torch.as_tensor(state, dtype=torch.float32))
        actions.append(action.squeeze(0))
        rewards.append(torch.tensor(reward, dtype=torch.float32))
        dones.append(torch.tensor(float(done), dtype=torch.float32))
        # Store log pi_b(a_t | s_t) so the learner can form importance ratios.
        behavior_log_probs.append(dist.log_prob(action).squeeze(0))

        state = next_state
        if done:
            state, _ = env.reset()

    batch = {
        "states": torch.stack(states),
        "actions": torch.stack(actions),
        "rewards": torch.stack(rewards),
        "dones": torch.stack(dones),
        "behavior_log_probs": torch.stack(behavior_log_probs),
    }
    return batch, state


def compute_vtrace_targets(
    rewards,
    dones,
    values,
    target_log_probs,
    behavior_log_probs,
    gamma: float = 0.99,
    rho_bar: float = 1.0,
    c_bar: float = 1.0,
):
    """Return V-trace targets and IS-weighted advantages.

    `values` has length T + 1; its last entry is the bootstrap value.
    """
    with torch.no_grad():
        log_rhos = target_log_probs - behavior_log_probs
        rhos = log_rhos.exp()
        clipped_rhos = torch.clamp(rhos, max=rho_bar)  # rho_t
        clipped_cs = torch.clamp(rhos, max=c_bar)      # c_t

        vs = torch.zeros_like(values)
        vs[-1] = values[-1]
        # Backward recursion: v_t = V_t + rho_t * delta_t + gamma * c_t * (v_{t+1} - V_{t+1})
        for t in reversed(range(len(rewards))):
            delta = clipped_rhos[t] * (
                rewards[t] + gamma * (1 - dones[t]) * values[t + 1] - values[t]
            )
            vs[t] = values[t] + delta + (
                gamma * (1 - dones[t]) * clipped_cs[t] * (vs[t + 1] - values[t + 1])
            )

        # Actor advantage: rho_t * (r_t + gamma * v_{t+1} - V(s_t))
        pg_advantages = clipped_rhos * (
            rewards + gamma * (1 - dones) * vs[1:] - values[:-1]
        )
    return vs[:-1], pg_advantages


def train_impala(
    env,
    num_iterations: int,
    gamma: float = 0.99,
    lr: float = 3e-4,
    rollout_length: int = 32,
    value_coef: float = 0.5,
    entropy_coef: float = 0.01,
    rho_bar: float = 1.0,
    c_bar: float = 1.0,
    sync_every: int = 4,
):
    state_dim = env.observation_space.shape[0]
    action_dim = env.action_space.n
    model = ActorCriticNetwork(state_dim, action_dim)
    optimizer = optim.Adam(model.parameters(), lr=lr)
    state, _ = env.reset()

    # In real IMPALA, actors run asynchronously and may lag behind the learner.
    # Here we emulate that with a frozen behavior-policy snapshot. `sync_every`
    # is an added knob: copying the model on every iteration would make the
    # behavior and target policies identical, so every ratio would equal 1.
    behavior_model = copy.deepcopy(model)

    for iteration in range(num_iterations):
        if iteration % sync_every == 0:
            behavior_model.load_state_dict(model.state_dict())

        batch, state = collect_rollout(env, behavior_model, state, rollout_length)
        states = batch["states"]
        actions = batch["actions"]
        rewards = batch["rewards"]
        dones = batch["dones"]
        behavior_log_probs = batch["behavior_log_probs"]

        # Bootstrap from the state that follows the last stored transition.
        with torch.no_grad():
            _, bootstrap_value = model(state)

        logits, values = model(states)
        values = torch.cat([values, bootstrap_value], dim=0)
        dist = Categorical(logits=logits)
        target_log_probs = dist.log_prob(actions)
        entropy = dist.entropy().mean()

        vtrace_targets, pg_advantages = compute_vtrace_targets(
            rewards,
            dones,
            values,
            target_log_probs,
            behavior_log_probs,
            gamma=gamma,
            rho_bar=rho_bar,
            c_bar=c_bar,
        )

        policy_loss = -(target_log_probs * pg_advantages.detach()).mean()
        value_loss = (values[:-1] - vtrace_targets.detach()).pow(2).mean()
        loss = policy_loss + value_coef * value_loss - entropy_coef * entropy

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    return model
```
Sources
Murphy, K. (2025). Reinforcement Learning: An Overview. Chapter 3.