Policy Improvement Methods

Policy improvement methods are policy gradient methods that try to make each update safer than a vanilla gradient step. The central idea is: improve the policy, but do not move so far that the data collected under the old policy stops being informative.

Vanilla policy gradients say: find actions that had high estimated advantage, then increase their probability. The problem is that a neural network update changes the policy at all states simultaneously. A step that looks good locally in parameter space can still hurt performance globally by changing the state distribution too much, or by making the importance weights too large.

Policy improvement methods address this by combining two ingredients:

  • A surrogate objective that estimates whether the new policy looks better than the old one on data sampled from the old policy.
  • A proximity constraint or penalty that prevents the new policy from drifting too far from the old one in a single update.

TRPO and PPO are the two canonical ways of implementing this idea. TRPO enforces the small-step idea explicitly with a trust-region constraint. PPO replaces that hard constrained optimization problem with a simpler objective that discourages large updates.

Policy Improvement Lower Bound

Let be the current policy at step , and let be any other policy (e.g. a candidate new one). Let be the normalized discounted state visitation distribution for . Let be the advantage function. Finally, let the total variation distance between two distributions be given by

Then, one can show that

where

In the above, is a surrogate objective, and the second term is a penalty term.

Intuition

Read the bound as follows:

  • The first term asks: if I evaluate the new policy using states visited by the old policy, and reweight actions by the likelihood ratio , do actions with positive advantage become more likely and actions with negative advantage become less likely? If the advantage was positive but the new policy makes good actions less likely, then this hurts.
  • If that were the only term, the optimizer could exploit errors in the advantage estimates by changing the policy too aggressively.
  • The second term says that if the new policy differs too much from the old one, the surrogate objective stops being reliable.
  • So the whole theorem formalizes a simple idea: small policy changes are predictable; large policy changes are not.

If we can optimize this lower bound (or a stochastic approximation based on samples from the current policy ), we can guarantee monotonic policy improvement (in expectation) at each step. We will replace this objective with a trust-region update that is easier to optimize:

This constraint bounds the worst-case performance decline at each update. The overall procedure becomes an approximate policy improvement method.

Another way to think about it is that we are doing policy search in two stages:

  1. Ask which action probabilities should go up or down based on the current batch of experience.
  2. Only move a limited distance, so the policy after the update is still close enough to the one that generated the batch.

TRPO and PPO differ mainly in how Step 2 is enforced.

Why Not Just Add a Distance Penalty to Actor-Critic?

At first glance, policy improvement methods can seem like they are doing something very simple: take an ordinary actor-critic or A2C style objective, then add an auxiliary term that penalizes the new policy for moving too far from the old one. This is a reasonable idea, and in practice it is often used as a baseline. However, the policy-improvement viewpoint gives a bit more structure than that naive modification.

The main issue is that on-policy policy-gradient updates are computed using data sampled from the old policy. If the policy changes too much in one update, then the action probabilities, state visitation distribution, and advantage estimates used to justify the update may no longer be representative of the new policy. So the problem is not just “large updates are bad” in an abstract regularization sense. The problem is that the local policy-gradient signal becomes unreliable when the new policy drifts too far from the one that generated the data.

This is why the policy-improvement perspective is useful:

  • It ties the update to a surrogate objective that is explicitly evaluated under the old policy’s state distribution.
  • It motivates controlling distance in policy space (for example via KL or TV between action distributions), rather than in raw parameter space.
  • It explains why the distance control should be coupled to the update itself, instead of being treated as an unrelated auxiliary loss.

In other words, a naive extra penalty says “prefer not to move too far,” whereas policy-improvement methods say “only trust this improvement estimate while the new policy remains close enough to the old one.” The latter is more informative, because it explains why closeness matters and what notion of closeness is relevant.

So this line of methods is worth studying not because it adds regularization in a completely new way, but because it gives a principled answer to three questions:

  • What objective are we actually improving with data from the old policy?
  • What notion of distance should define a “small” update?
  • Why should small updates make the optimization more reliable?

Trust Region Policy Optimization (TRPO)

TRPO is an on-policy actor critic policy gradient method that implements an approximation to the constrained problem above. 1 The core idea is to replace an ordinary gradient step with the largest safe step inside a local trust region.

First, it leverages the fact that if

then also satisfies the constraint with . So TRPO replaces the TV constraint with an average KL constraint, which is easier to work with.

Next, it takes a first-order approximation of the surrogate objective around the current parameters :

where . Finally, it considers a second-order expansion of the KL term to get the approximate constraint:

where is the Fisher information matrix (equivalently, the local curvature of the KL divergence around ). We then use the update

where is the natural gradient, and the step size is initialized to . In practice, we compute by approximately solving the linear system using conjugate gradient methods, which just requires matrix vector multiplies. We then use a backtracking line search procedure to ensure the trust region is satisfied.

Intuition

TRPO is trying to answer: among all parameter updates that keep the new policy close to the old one in KL divergence, which one improves the surrogate objective the most?

  • The ordinary gradient points in the direction of steepest ascent in parameter space.
  • But parameter space is a poor measure of how much the distribution changed; a small Euclidean move can produce a large behavioral change.
  • The Fisher matrix defines a geometry in which distance corresponds to change in the policy distribution.
  • The natural gradient therefore points in the direction of steepest improvement per unit of policy change, not per unit of raw parameter change.

So TRPO can be summarized as: take a policy-gradient step, but measure the step in policy space rather than parameter space.

In practice, TRPO is stable but somewhat cumbersome:

  • It requires solving a constrained optimization problem approximately at every update.
  • It usually uses one update per batch of fresh on-policy data.
  • The line search and conjugate-gradient solve make it more complex to implement than standard SGD/Adam training.

Proximal Policy Optimization (PPO)

PPO is an on-policy actor critic policy gradient method, and a simplification of TRPO. It keeps the same high-level idea, but instead of solving a constrained optimization problem, it uses an objective that automatically suppresses updates that move the policy too far. 2

We start by noting the following result

This holds provided the support of is contained in the support of at every state. We then use the following update

^ppo-update
where is the likelihood ratio, and .

The ratio is the key quantity. It measures how much more or less likely the new policy makes an action that was actually sampled under the old policy:

  • If , the action became more likely.
  • If , it became less likely.
  • Multiplying by the advantage says whether that change is desirable.

Without clipping, PPO would optimize the same basic surrogate as other importance-weighted policy-gradient methods. The problem is that very large ratios can dominate the objective and lead to destructive updates. Clipping prevents the optimizer from receiving extra reward for pushing the ratio far beyond a small interval around .

The pseudocode for PPO (with GAE) is basically identical to Generalized Advantage Estimation, except the policy loss has the form instead of , and we perform multiple policy updates per rollout for increased sample efficiency.

Intuition

PPO says: if an action had positive advantage, increase its probability, but stop rewarding that increase once it becomes too large. If an action had negative advantage, decrease its probability, but again stop pushing once the change becomes too large.

This is why the objective uses a min with a clipped term. The clipped objective is a pessimistic version of the ordinary surrogate:

  • If the update is modest, PPO behaves like a standard policy-gradient update.
  • If the update becomes too aggressive, PPO acts as if the extra improvement does not count.

So PPO does not forbid a large step the way TRPO does, but it removes much of the incentive to take one.

Algorithm 7 PPO with GAE

Initialize parameters ϕ\mathbf{\phi}, environment state ss

for iteration k=1,2,k=1,2,\dots do

ϕoldϕ\mathbf{\phi}_{\mathrm{old}} \gets \mathbf{\phi}

(s1,a1,r1,,sT,aT,rT,sT+1)=(s_1,a_1,r_1,\dots,s_T,a_T,r_T,s_{T+1}) = rollout(s,πϕolds, \pi_{\mathbf{\phi}_{\mathrm{old}}})

vt=Vϕ(st)v_t = V_{\mathbf{\phi}}(s_t) for 1 : T+11 \ : \ T+1

(A1:T,y1:T)=(A_{1:T}, y_{1:T}) = GAE(r1:T,v1:T+1,γ,λr_{1:T}, v_{1:T+1}, \gamma, \lambda)

for m=1m=1 to MM do

ρt=πϕ(atst)πϕold(atst)\rho_t = \frac{\pi_{\mathbf{\phi}}(a_t \mid s_t)}{\pi_{\mathbf{\phi}_{\mathrm{old}}}(a_t \mid s_t)} for t=1:Tt=1:T

ρ~t=\tilde{\rho}_t = clip(ρt,1ϵ,1+ϵ\rho_t, 1-\epsilon, 1+\epsilon) for t=1:Tt=1:T

L(ϕ)=1Tt=1T[λTD(Vϕ(st)yt)2λPGmin(ρtAt,ρ~tAt)λentH(πϕ(st))]\mathcal{L}(\mathbf{\phi}) = \frac{1}{T} \sum_{t=1}^T [\lambda_{\mathrm{TD}} (V_{\mathbf{\phi}}(s_t) - y_t)^2 - \lambda_{\mathrm{PG}} \min(\rho_t A_t, \tilde{\rho}_t A_t) - \lambda_{\mathrm{ent}} \mathbb{H}(\pi_{\mathbf{\phi}}(\bullet \mid s_t))]

ϕϕηϕL(ϕ)\mathbf{\phi} \gets \mathbf{\phi} - \eta \nabla_{\mathbf{\phi}} \mathcal{L}(\mathbf{\phi})

end for

end for

function GAE(r1:T,v1:T+1,γ,λr_{1:T}, v_{1:T+1}, \gamma, \lambda)

AT+1=0A_{T+1} = 0

for t=Tt=T down to 11 do

δt=rt+γvt+1vt\delta_t = r_t + \gamma v_{t+1} - v_t

At=δt+γλAt+1A_t = \delta_t + \gamma \lambda A_{t+1} // advantage

yt=At+vty_t = A_t + v_t // return

end for

return (A1:T,y1:T)(A_{1:T}, y_{1:T})

end function

Simplified Form of the Clipping Term

From OpenAI, they propose a simplified form of the clipping term, which is easier to understand. First, we rewrite the loss (inside the expectation of the update formula for ) in shorthand form as

where and . If the advantage is positive (), the action was better than expected, so we want to increase the probability , but not too much (so the new policy stays close to the previous one). The term restricts to be in . The outer selects the smaller of and this clipped value. If , then , so the operator will naturally select without needing the lower clip. So in this case, we use a loss of the form . If the advantage is negative (), a similar argument shows we can use a loss of the form . We combine these two as

where

The simplified form makes the asymmetry very clear:

  • If , the optimizer wants to increase the probability of the sampled action, but PPO caps how much benefit it gets once .
  • If , the optimizer wants to decrease the probability of the sampled action, but PPO caps how much benefit it gets once .

So clipping is not about forcing to stay in the interval at all times. It is about making the objective flat outside the trusted region in the direction that would otherwise keep pushing the update.

PPO for Diffusion Policies

The PPO algorithm relies on being able to compute the likelihood of a candidate action. This is difficult to do for diffusion policies. In Diffusion Policy Policy Optimization, they propose DPPO, which is PPO for diffusion models. The key idea is to treat each step of the diffusion process as a step of an inner MDP, nested inside the outer (main) MDP, and apply PPO to the combined system. Unfortunately, this results in very long horizon tasks. 3

TRPO vs PPO

Both methods start from the same policy-improvement principle, but they make different engineering tradeoffs.

MethodMain mechanismStrengthWeakness
TRPOHard KL trust regionStronger monotonic-improvement motivationMore expensive and harder to implement
PPOClipped surrogate objectiveSimple, scalable, works well with SGD/Adam and minibatchesWeaker theoretical guarantee than TRPO

In words:

  • TRPO is closer to the theory.
  • PPO is closer to the way deep learning systems are usually optimized in practice.
  • Both are still on-policy methods: once the policy has changed enough, old data is no longer trustworthy and a fresh rollout is needed.

If you only want the main takeaway, it is this:

  • Policy gradients tell us which direction should improve the policy.
  • Policy improvement methods add a notion of how far it is safe to move in that direction.
  • TRPO enforces that safety with a trust-region constraint.
  • PPO enforces it approximately by clipping the objective.

Implementations

PPO

import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
from torch.distributions import Categorical
 
class ActorCriticNetwork(nn.Module):
    def __init__(self, state_dim, action_dim, hidden_dim=128):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Linear(state_dim, hidden_dim),
            nn.ReLU(),
        )
        self.policy_head = nn.Linear(hidden_dim, action_dim)
        self.value_head = nn.Linear(hidden_dim, 1)
 
    def forward(self, state):
        if isinstance(state, np.ndarray):
            state = torch.from_numpy(state)
        if state.ndim == 1:
            state = state.unsqueeze(0)
        state = state.float()
 
        features = self.backbone(state)
        logits = self.policy_head(features)
        state_value = self.value_head(features).squeeze(-1)
        return logits, state_value
 
def compute_gae(rewards, values, dones, gamma: float = 0.99, lam: float = 0.95):
    advantages = []
    gae = torch.tensor(0.0)
 
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] * (1 - dones[t]) - values[t]
        gae = delta + gamma * lam * (1 - dones[t]) * gae
        advantages.insert(0, gae)
 
    advantages = torch.stack(advantages)
    returns = advantages + values[:-1]
    return advantages, returns
 
def collect_rollout(env, model, state, rollout_length: int):
    states = []
    actions = []
    rewards = []
    dones = []
    old_log_probs = []
    values = []
 
    for _ in range(rollout_length):
        logits, value = model(state)
        dist = Categorical(logits=logits)
        action = dist.sample()
 
        next_state, reward, terminated, truncated, _ = env.step(action.item())
        done = terminated or truncated
 
        states.append(torch.as_tensor(state, dtype=torch.float32))
        actions.append(action.squeeze(0))
        rewards.append(torch.tensor(reward, dtype=torch.float32))
        dones.append(torch.tensor(float(done), dtype=torch.float32))
        old_log_probs.append(dist.log_prob(action).squeeze(0).detach())
        values.append(value.squeeze(0).detach())
 
        state = next_state
        if done:
            state, _ = env.reset()
 
    with torch.no_grad():
        _, bootstrap_value = model(state)
        values.append(bootstrap_value.squeeze(0))
 
    batch = {
        "states": torch.stack(states),
        "actions": torch.stack(actions),
        "rewards": torch.stack(rewards),
        "dones": torch.stack(dones),
        "old_log_probs": torch.stack(old_log_probs),
        "values": torch.stack(values),
    }
    return batch, state
 
def train_ppo(
    env,
    num_iterations: int,
    gamma: float = 0.99,
    lam: float = 0.95,
    lr: float = 3e-4,
    rollout_length: int = 128,
    clip_eps: float = 0.2,
    update_epochs: int = 4,
    minibatch_size: int = 64,
    value_coef: float = 0.5,
    entropy_coef: float = 0.01,
):
    state_dim = env.observation_space.shape[0]
    action_dim = env.action_space.n
 
    model = ActorCriticNetwork(state_dim, action_dim)
    optimizer = optim.Adam(model.parameters(), lr=lr)
 
    state, _ = env.reset()
 
    for iteration in range(num_iterations):
        batch, state = collect_rollout(env, model, state, rollout_length)
 
        advantages, returns = compute_gae(
            batch["rewards"],
            batch["values"],
            batch["dones"],
            gamma=gamma,
            lam=lam,
        )
 
        # PPO usually normalizes advantages within each rollout batch.
        advantages = (advantages - advantages.mean()) / (advantages.std() + 1e-8)
 
        states = batch["states"]
        actions = batch["actions"]
        old_log_probs = batch["old_log_probs"]
        returns = returns.detach()
        advantages = advantages.detach()
 
        batch_size = states.shape[0]
        for _ in range(update_epochs):
            permutation = torch.randperm(batch_size)
 
            for start in range(0, batch_size, minibatch_size):
                idx = permutation[start : start + minibatch_size]
 
                mb_states = states[idx]
                mb_actions = actions[idx]
                mb_old_log_probs = old_log_probs[idx]
                mb_returns = returns[idx]
                mb_advantages = advantages[idx]
 
                logits, values = model(mb_states)
                dist = Categorical(logits=logits)
                new_log_probs = dist.log_prob(mb_actions)
                entropy = dist.entropy().mean()
 
                ratio = (new_log_probs - mb_old_log_probs).exp()
                unclipped = ratio * mb_advantages
                clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * mb_advantages
                policy_loss = -torch.min(unclipped, clipped).mean()
 
                value_loss = (values - mb_returns).pow(2).mean()
                loss = policy_loss + value_coef * value_loss - entropy_coef * entropy
 
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()

Sources

  • Murphy, K. (2025). Reinforcement Learning: An Overview. Chapter 3.

Footnotes

  1. Trust Region Policy Optimization

  2. Constrained Policy Optimization

  3. Diffusion Policy Policy Optimization