Policy Gradient Methods

Value-based RL methods have several disadvantages:

  • They can be difficult to apply in continuous action spaces
  • They may diverge if function approximation is used
  • The training of $Q$, often based on TD-style updates, is not directly related to the expected return garnered by the learned policy
  • They learn deterministic policies

Policy gradient methods offer an alternative approach that directly optimizes the parameters of the policy so as to maximize its expected return. These methods often benefit from estimating a value or advantage function to reduce the variance in the policy search process.

The parametric policy is denoted as $\pi_\theta(a \mid s)$, which is usually some neural network. For discrete actions, the final layer is usually passed through a softmax function and then into a categorical distribution. For continuous actions, we typically use a Gaussian output layer (potentially clipped to a suitable range such as $[-1, 1]$), although it is also possible to use more expressive distributions, such as diffusion models (a diffusion model used as a policy is known as a diffusion policy).
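As a rough illustration of the two output heads described above, the following sketch samples a discrete action from a softmax/categorical head and a continuous action from a clipped Gaussian head. The function names (`softmax`, `sample_discrete`, `sample_continuous`) and the hard-coded logits, mean, and standard deviation are hypothetical stand-ins for the outputs of a policy network; they are not taken from any particular library.

```python
import math
import random

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sample_discrete(logits, rng):
    """Sample an action index from the categorical distribution softmax(logits)."""
    probs = softmax(logits)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

def sample_continuous(mean, std, rng, low=-1.0, high=1.0):
    """Sample from a Gaussian and clip the result to [low, high]."""
    a = rng.gauss(mean, std)
    return min(max(a, low), high)

rng = random.Random(0)
logits = [2.0, 0.5, -1.0]          # assumed network output for a discrete policy
probs = softmax(logits)
a_disc = sample_discrete(logits, rng)
a_cont = sample_continuous(0.3, 2.0, rng)  # assumed mean and std from the network
```

Note that clipping changes the distribution of the executed action (probability mass piles up at the boundaries), so this is only one simple way to respect action bounds.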

Likelihood Ratio Estimate

Definition

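We define the value of a policy as

$$J(\theta) = \mathbb{E}_{p_\theta(\tau)}\left[R(\tau)\right] \tag{1}$$

where $R(\tau) = \sum_{t=1}^{T} \gamma^{t-1} r_t$ is the return along the trajectory, and $p_\theta(\tau)$ is the distribution over trajectories induced by the policy (and world model):

$$p_\theta(\tau) = p(s_1) \prod_{t=1}^{T} \pi_\theta(a_t \mid s_t)\, p(s_{t+1} \mid s_t, a_t) \tag{2}$$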

The goal is to compute $\nabla_\theta J(\theta)$, that is, how the expected return changes when we slightly change the policy parameters.

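Starting from Equation 1,

$$\nabla_\theta J(\theta) = \nabla_\theta \int p_\theta(\tau)\, R(\tau)\, d\tau = \int \nabla_\theta p_\theta(\tau)\, R(\tau)\, d\tau$$

Multiplying and dividing by $p_\theta(\tau)$ gives

$$\nabla_\theta J(\theta) = \int p_\theta(\tau)\, \frac{\nabla_\theta p_\theta(\tau)}{p_\theta(\tau)}\, R(\tau)\, d\tau = \mathbb{E}_{p_\theta(\tau)}\left[\frac{\nabla_\theta p_\theta(\tau)}{p_\theta(\tau)}\, R(\tau)\right]$$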

This is known as the likelihood ratio estimator.

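The key identity is the log derivative trick:

$$\nabla_\theta \log f(\theta) = \frac{\nabla_\theta f(\theta)}{f(\theta)}$$

or equivalently,

$$\nabla_\theta f(\theta) = f(\theta)\, \nabla_\theta \log f(\theta)$$

Applying this with $f(\theta) = p_\theta(\tau)$, we get

$$\nabla_\theta J(\theta) = \mathbb{E}_{p_\theta(\tau)}\left[\nabla_\theta \log p_\theta(\tau)\, R(\tau)\right] \tag{4}$$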

This is useful because the expectation in Equation 4 can be estimated with Monte Carlo samples, i.e. by rolling out the current policy in the environment.

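To simplify $\nabla_\theta \log p_\theta(\tau)$, note from Equation 2 that only the policy terms depend on $\theta$; the initial state distribution and environment dynamics do not. Therefore,

$$\log p_\theta(\tau) = \log p(s_1) + \sum_{t=1}^{T} \log \pi_\theta(a_t \mid s_t) + \sum_{t=1}^{T} \log p(s_{t+1} \mid s_t, a_t)$$

and so

$$\nabla_\theta \log p_\theta(\tau) = \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)$$

Hence

$$\nabla_\theta J(\theta) = \mathbb{E}_{p_\theta(\tau)}\left[\sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, R(\tau)\right] \tag{6}$$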

This is often referred to as the score function estimator or SFE.

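An unbiased Monte Carlo estimator based on $N$ sampled trajectories $\tau^{(n)} \sim p_\theta(\tau)$ is

$$\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{n=1}^{N} \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta\big(a_t^{(n)} \mid s_t^{(n)}\big)\, R\big(\tau^{(n)}\big)$$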

Intuition

  • The objective $J(\theta)$ is an average over trajectories, but the trajectories themselves depend on the policy parameters $\theta$.
  • Directly differentiating through the random trajectory distribution is awkward.
  • The likelihood ratio trick rewrites the derivative of the distribution as a derivative of a log-probability, which is much easier to estimate from samples.
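  • Each sampled trajectory contributes a term of the form $R(\tau)\, \nabla_\theta \log p_\theta(\tau)$, so trajectories with a large positive return push the policy toward making their actions more likely, while trajectories with a negative return push it away from them.
  • Because $\nabla_\theta \log p_\theta(\tau)$ becomes a sum over time, the update decomposes into per-action terms $\nabla_\theta \log \pi_\theta(a_t \mid s_t)$.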

Another way to read Equation 6 is:

If a trajectory produced a large return, increase the log-probability of the actions that generated it. If a trajectory produced a poor return, decrease their log-probability.

This is the core idea behind policy gradient methods. In practice, this estimator has high variance, which is why later methods introduce value functions and baselines to make the updates more stable.
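The score function estimator can be sanity-checked on a problem small enough to have a closed-form gradient. The sketch below assumes a one-step problem (a bandit, so $T = 1$) with a softmax policy over three actions whose logits are the parameters $\theta$; in that setting $\nabla_\theta \log \pi_\theta(a)$ is the one-hot vector of $a$ minus the probability vector, and the exact gradient is $\pi_i (r_i - J)$. The reward values, parameter values, and function names are illustrative assumptions.

```python
import math
import random

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def exact_gradient(theta, rewards):
    """Closed form for a softmax bandit: dJ/dtheta_i = pi_i * (r_i - J)."""
    pi = softmax(theta)
    J = sum(p * r for p, r in zip(pi, rewards))
    return [pi[i] * (rewards[i] - J) for i in range(len(theta))]

def sfe_gradient(theta, rewards, num_samples, rng):
    """Average of R(tau) * grad log pi(a) over sampled one-step trajectories."""
    pi = softmax(theta)
    n = len(theta)
    grad = [0.0] * n
    for _ in range(num_samples):
        a = rng.choices(range(n), weights=pi, k=1)[0]
        for i in range(n):
            score_i = (1.0 if i == a else 0.0) - pi[i]  # d/dtheta_i log pi(a)
            grad[i] += rewards[a] * score_i / num_samples
    return grad

theta = [0.2, -0.1, 0.4]        # assumed policy logits
rewards = [1.0, 0.0, -1.0]      # assumed deterministic reward per action
exact = exact_gradient(theta, rewards)
estimate = sfe_gradient(theta, rewards, 20000, random.Random(0))
max_error = max(abs(e - g) for e, g in zip(exact, estimate))
```

With many samples the estimate approaches the exact gradient, but the per-sample terms are noisy, which is the variance issue the following sections address.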

Note

For a higher-level comparison of how vanilla policy gradients connect to actor-critic methods, deterministic policy gradients, and DDPG, see From Policy Gradients to DDPG.

Caution

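Here $J(\theta)$ is the value of the policy, not a loss function. So the goal is to maximize $J(\theta)$, not minimize it. Consequently, policy gradient methods use stochastic gradient ascent updates of the form

$$\theta \leftarrow \theta + \eta\, \nabla_\theta J(\theta)$$

with step size $\eta > 0$, which is equivalent to applying standard gradient descent to the loss $\mathcal{L}(\theta) = -J(\theta)$.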
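The equivalence between ascent on the objective and descent on its negation can be seen on a toy scalar objective. The quadratic `J` below is an assumption chosen only because its maximizer (at 2) is obvious; it is not a policy objective.

```python
def J(theta):
    """Toy concave objective with its maximum at theta = 2."""
    return -(theta - 2.0) ** 2

def grad_J(theta):
    return -2.0 * (theta - 2.0)

def grad_loss(theta):
    """Gradient of the loss L(theta) = -J(theta)."""
    return -grad_J(theta)

eta = 0.1
theta_ascent = 0.0
theta_descent = 0.0
for _ in range(100):
    theta_ascent = theta_ascent + eta * grad_J(theta_ascent)        # ascent on J
    theta_descent = theta_descent - eta * grad_loss(theta_descent)  # descent on -J
```

Both iterates follow the same path and converge to the maximizer, which is why libraries that only provide minimizers are typically given $-J$ as the loss.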

Variance Reduction using Reward-to-go

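The likelihood ratio estimator can have high variance, since we are sampling entire trajectories. Fortunately, we can reduce the variance using the temporal/causal structure of the problem. In particular, from Equation 6 we have

$$\nabla_\theta J(\theta) = \mathbb{E}_{p_\theta(\tau)}\left[\left(\sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\right)\left(\sum_{k=1}^{T} \gamma^{k-1} r_k\right)\right]$$

Expanding the product inside the expectation, we have

$$\nabla_\theta J(\theta) = \mathbb{E}_{p_\theta(\tau)}\left[\sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\left(\cancel{\sum_{k=1}^{t-1} \gamma^{k-1} r_k} + \sum_{k=t}^{T} \gamma^{k-1} r_k\right)\right]$$

The crossed out terms are not literally zero for each trajectory. Rather, their expectation is zero, because rewards from times $k < t$ depend only on past states and actions, and therefore do not depend on the future action $a_t$. More precisely, for $k < t$,

$$\mathbb{E}_{p_\theta(\tau)}\left[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\, \gamma^{k-1} r_k\right] = \mathbb{E}\left[\gamma^{k-1} r_k\; \mathbb{E}_{a_t \sim \pi_\theta(\cdot \mid s_t)}\left[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\right]\right] = 0$$

since $\mathbb{E}_{a \sim \pi_\theta(\cdot \mid s)}\left[\nabla_\theta \log \pi_\theta(a \mid s)\right] = \nabla_\theta \sum_a \pi_\theta(a \mid s) = \nabla_\theta 1 = 0$.

Intuitively, once a past reward has already happened, changing the probability of a future action cannot affect that past reward. Plugging in this simplified expression, we get

$$\nabla_\theta J(\theta) = \mathbb{E}_{p_\theta(\tau)}\left[\sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \sum_{k=t}^{T} \gamma^{k-1} r_k\right] = \mathbb{E}_{p_\theta(\tau)}\left[\sum_{t=1}^{T} \gamma^{t-1}\, G_t\, \nabla_\theta \log \pi_\theta(a_t \mid s_t)\right]$$

where $G_t$ is the reward-to-go

$$G_t = \sum_{k=t}^{T} \gamma^{k-t} r_k = r_t + \gamma r_{t+1} + \cdots + \gamma^{T-t} r_T$$

Note that the reward-to-go of a state-action pair $(s_t, a_t)$ can be considered as a single sample approximation of the state-action value function $Q^{\pi_\theta}(s_t, a_t)$. Averaging over such samples gives

$$\nabla_\theta J(\theta) = \mathbb{E}_{p_\theta(\tau)}\left[\sum_{t=1}^{T} \gamma^{t-1}\, Q^{\pi_\theta}(s_t, a_t)\, \nabla_\theta \log \pi_\theta(a_t \mid s_t)\right] \tag{14}$$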

Intuition

  • The estimator in Equation 6 uses the full trajectory return $R(\tau)$ to weight every score term $\nabla_\theta \log \pi_\theta(a_t \mid s_t)$.
  • But rewards from times $k < t$ happened before action $a_t$ was chosen, so they cannot contain any information about whether $a_t$ was a good or bad action.
  • Those past rewards therefore act like extra noise when multiplied by the score term at time $t$.
  • Reward-to-go removes exactly those noisy past-reward terms and keeps only the rewards that can still be affected by action $a_t$.
  • This does not change the expected gradient, because the removed terms have zero expectation: $\mathbb{E}\left[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\, r_k\right] = 0$ for $k < t$.
  • So this is an unbiased variance reduction step derived from causality: we are not inventing a correction term, we are dropping terms that contribute noise but no signal.
  • Intuitively, if action $a_t$ only influences rewards from time $t$ onward, then using the full return asks that action to also “explain” rewards from the past. Reward-to-go stops doing that, so the estimator fluctuates less from trajectory to trajectory.
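To make the points above concrete, here is a small empirical comparison of the full-return and reward-to-go estimators. It is a sketch under strong simplifying assumptions: a stateless two-step task, a single scalar parameter $\theta$ with $\pi_\theta(a = 0) = \sigma(\theta)$, and made-up rewards of 6 and 5 for the two actions. All function names are hypothetical. List index `t` in the code is 0-based, so `gamma ** t` corresponds to $\gamma^{t-1}$ in the text.

```python
import random

def rewards_to_go(rewards, gamma):
    """G_t = sum_{k>=t} gamma^(k-t) r_k, computed by a backward recursion."""
    G = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        G[t] = running
    return G

def sample_episode(p, action_rewards, horizon, rng):
    """Stateless toy task: action 0 is chosen with probability p at every step."""
    scores, rewards = [], []
    for _ in range(horizon):
        a = 0 if rng.random() < p else 1
        # d/dtheta log pi(a) for p = sigmoid(theta): (1 - p) if a == 0, else -p.
        scores.append(1.0 - p if a == 0 else -p)
        rewards.append(action_rewards[a])
    return scores, rewards

def estimators(scores, rewards, gamma):
    """Per-trajectory gradient estimates: full return versus reward-to-go."""
    G = rewards_to_go(rewards, gamma)
    g_full = sum(scores) * G[0]  # G[0] is the full discounted return R(tau)
    g_rtg = sum((gamma ** t) * G[t] * scores[t] for t in range(len(scores)))
    return g_full, g_rtg

def mean(xs):
    return sum(xs) / len(xs)

def variance(xs):
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

rng = random.Random(0)
p, gamma = 0.5, 0.9              # p = sigmoid(0), i.e. theta = 0
full, rtg = [], []
for _ in range(20000):
    scores, rewards = sample_episode(p, [6.0, 5.0], 2, rng)
    g_full, g_rtg = estimators(scores, rewards, gamma)
    full.append(g_full)
    rtg.append(g_rtg)

# For this toy task J = (1 + gamma) * (5 + p), so dJ/dtheta = (1 + gamma) * p * (1 - p).
true_grad = (1 + gamma) * p * (1 - p)
```

In this example both sample means should land near `true_grad`, while the reward-to-go samples have noticeably smaller variance, because the dropped term (the second action's score times the first reward) is pure noise.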

The Policy Gradient Theorem

We now turn to the infinite horizon setting.

For this section, we switch to the more standard infinite-horizon indexing convention $t = 0, 1, 2, \ldots$.

Definition

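We define the discounted state visitation measure as follows:

$$\rho^\gamma_\pi(s) = \sum_{t=0}^{\infty} \gamma^t \sum_{s_0} p_0(s_0)\, p^\pi(s_0 \to s, t) = \sum_{t=0}^{\infty} \gamma^t\, p^\pi_t(s)$$

where $p^\pi(s_0 \to s, t)$ is the probability of going from $s_0$ to $s$ in $t$ steps, and $p^\pi_t(s)$ is the marginal probability of being in state $s$ at time $t$ (after each episodic reset).

Note that $\rho^\gamma_\pi$ is a measure of time spent in non-terminal states, but it is not a probability measure since it is not normalized. However, we can define a normalized version of the measure by noting that $\sum_{t=0}^{\infty} \gamma^t = \frac{1}{1-\gamma}$ for $\gamma < 1$.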

Definition

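Hence, the normalized discounted state visitation distribution is given by (note the change from $\rho$ to $p$)

$$p^\gamma_\pi(s) = (1-\gamma)\, \rho^\gamma_\pi(s) = (1-\gamma) \sum_{t=0}^{\infty} \gamma^t\, p^\pi_t(s)$$

We can convert the normalized distribution back to the measure using

$$\rho^\gamma_\pi(s) = \frac{1}{1-\gamma}\, p^\gamma_\pi(s)$$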

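The key observation is that Equation 14, written with the indexing $t = 0, 1, 2, \ldots$, has the form

$$\mathbb{E}_{p_\theta(\tau)}\left[\sum_{t=0}^{\infty} \gamma^t f(s_t, a_t)\right]$$

for the choice

$$f(s, a) = Q^{\pi_\theta}(s, a)\, \nabla_\theta \log \pi_\theta(a \mid s)$$

We can rewrite this by first expanding the expectation over trajectories into a sum over time marginals:

$$\mathbb{E}_{p_\theta(\tau)}\left[\sum_{t=0}^{\infty} \gamma^t f(s_t, a_t)\right] = \sum_{t=0}^{\infty} \gamma^t \sum_s p^\pi_t(s) \sum_a \pi_\theta(a \mid s)\, f(s, a) = \sum_s \rho^\gamma_\pi(s) \sum_a \pi_\theta(a \mid s)\, f(s, a)$$

We abbreviate the last weighted sum as $\mathbb{E}_{\rho^\gamma_\pi(s)\, \pi_\theta(a \mid s)}\left[f(s, a)\right]$, even though $\rho^\gamma_\pi$ is not normalized. Using this notation, we can rewrite Equation 14 in terms of expectations over states rather than over full trajectories:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\rho^\gamma_\pi(s)\, \pi_\theta(a \mid s)}\left[Q^{\pi_\theta}(s, a)\, \nabla_\theta \log \pi_\theta(a \mid s)\right]$$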

This is known as the policy gradient theorem.
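The change of viewpoint from trajectories to state visitation can be checked numerically on a tiny problem. The sketch below assumes a made-up two-state, two-action MDP without terminal states (`P0`, `PI`, `TRANS`) and an arbitrary test function `F` standing in for $f(s, a)$; none of these come from the text. It computes the discounted visitation measure by propagating the time marginals, then compares the state-visitation sum with a Monte Carlo average of discounted sums along sampled trajectories.

```python
import random

GAMMA = 0.8
P0 = [1.0, 0.0]                          # initial state distribution p_0(s)
PI = [[0.7, 0.3], [0.4, 0.6]]            # PI[s][a] = pi(a | s)
TRANS = [[[0.9, 0.1], [0.2, 0.8]],       # TRANS[s][a][s2] = p(s2 | s, a)
         [[0.5, 0.5], [0.1, 0.9]]]
F = [[1.0, 0.0], [0.5, 2.0]]             # arbitrary test function f(s, a)

def visitation_measure(horizon):
    """rho(s) = sum_t gamma^t p_t(s), truncated after `horizon` steps."""
    p_t = list(P0)
    rho = [0.0, 0.0]
    for t in range(horizon):
        for s in range(2):
            rho[s] += (GAMMA ** t) * p_t[s]
        nxt = [0.0, 0.0]
        for s in range(2):
            for a in range(2):
                for s2 in range(2):
                    nxt[s2] += p_t[s] * PI[s][a] * TRANS[s][a][s2]
        p_t = nxt
    return rho

def draw(probs, rng):
    """Sample index 0 or 1 from a two-element probability vector."""
    return 0 if rng.random() < probs[0] else 1

def rollout_sum(horizon, rng):
    """One sample of sum_t gamma^t f(s_t, a_t) along a truncated trajectory."""
    s = draw(P0, rng)
    total = 0.0
    for t in range(horizon):
        a = draw(PI[s], rng)
        total += (GAMMA ** t) * F[s][a]
        s = draw(TRANS[s][a], rng)
    return total

rho = visitation_measure(200)
normalized = [(1.0 - GAMMA) * r for r in rho]
state_view = sum(rho[s] * PI[s][a] * F[s][a] for s in range(2) for a in range(2))

rng = random.Random(0)
n = 4000
trajectory_view = sum(rollout_sum(60, rng) for _ in range(n)) / n
```

In this continuing example the total mass of `rho` is $1/(1-\gamma)$, `normalized` sums to one, and the two views agree up to Monte Carlo and truncation error.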

Note

A common alternative motivation for the policy gradient theorem starts by writing the objective in terms of the policy’s state visitation distribution. In that view, differentiating $J(\theta)$ appears to require differentiating both the policy $\pi_\theta$ and the induced occupancy measure $\rho^\gamma_\pi$, which is awkward because changing the policy also changes which states are visited. In our derivation above, this difficulty is hidden inside the trajectory distribution $p_\theta(\tau)$, and the likelihood-ratio trick already gives a valid gradient estimator in trajectory form. The policy gradient theorem then shows that this same gradient can be rewritten as an expectation over visited state-action pairs, without needing an explicit derivative of the state distribution. This alternate form is what makes later actor-critic style methods natural: we can estimate $Q^{\pi_\theta}(s, a)$ or $A^{\pi_\theta}(s, a)$ locally at sampled states instead of reasoning directly about entire trajectories.

Intuition

  • Equation 14 says: sample a whole trajectory, then add up a contribution from each visited state-action pair.
  • The policy gradient theorem says: instead of thinking in terms of whole trajectories, we can think in terms of how often the policy visits each state, discounted over time.
  • The quantity $\rho^\gamma_\pi(s)$ collects exactly that: the discounted amount of occupancy of state $s$ under policy $\pi_\theta$.
  • So the trajectory expectation and the state-visitation expectation are just two views of the same weighted average.
  • The trajectory view groups terms by episode; the policy-gradient-theorem view groups the same terms by state-action pair.
  • We use this reformulation because it is cleaner conceptually and leads directly to actor-critic methods: if we can estimate $Q^{\pi_\theta}(s, a)$ and sample states from the on-policy visitation distribution, we do not need to reason about full trajectories explicitly.

Variance Reduction using a Baseline

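In practice, estimating the policy gradient using Equation 6 can have a high variance. A baseline function $b(s)$ can be used for variance reduction to get

$$\nabla_\theta J(\theta) = \mathbb{E}_{\rho^\gamma_\pi(s)\, \pi_\theta(a \mid s)}\left[\nabla_\theta \log \pi_\theta(a \mid s)\, \big(Q^{\pi_\theta}(s, a) - b(s)\big)\right]$$

Any function $b(s)$ that depends only on the state $s$, and not on the sampled action $a$, is a valid baseline. The reason is that subtracting such a term does not change the expectation of the policy gradient estimator:

$$\mathbb{E}_{\pi_\theta(a \mid s)}\left[\nabla_\theta \log \pi_\theta(a \mid s)\, b(s)\right] = b(s) \sum_a \nabla_\theta \pi_\theta(a \mid s) = b(s)\, \nabla_\theta \sum_a \pi_\theta(a \mid s) = b(s)\, \nabla_\theta 1 = 0$$

(for continuous actions, replace the sum by an integral). Therefore,

$$\mathbb{E}_{\rho^\gamma_\pi(s)\, \pi_\theta(a \mid s)}\left[\nabla_\theta \log \pi_\theta(a \mid s)\, \big(Q^{\pi_\theta}(s, a) - b(s)\big)\right] = \mathbb{E}_{\rho^\gamma_\pi(s)\, \pi_\theta(a \mid s)}\left[\nabla_\theta \log \pi_\theta(a \mid s)\, Q^{\pi_\theta}(s, a)\right]$$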

So, as with reward-to-go, this is another unbiased variance reduction step: we subtract a term with zero expectation in order to reduce noise while preserving the expected gradient.
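The zero-expectation argument can be verified exactly for a single state with a softmax policy, by enumerating the actions instead of sampling. In the sketch below the logits, the Q-values, and the function names are illustrative assumptions; the baseline is set to the policy-weighted average of the Q-values, which is one possible state-dependent choice.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def score(pi, a):
    """grad_theta log pi(a) for a softmax policy with logits theta: onehot(a) - pi."""
    return [(1.0 if i == a else 0.0) - pi[i] for i in range(len(pi))]

def mean_and_variance(pi, q_values, baseline):
    """Exact mean and total variance of score(a) * (Q(a) - b) under a ~ pi."""
    n = len(pi)
    mean = [0.0] * n
    second_moment = 0.0
    for a in range(n):
        g = [x * (q_values[a] - baseline) for x in score(pi, a)]
        for i in range(n):
            mean[i] += pi[a] * g[i]
        second_moment += pi[a] * sum(x * x for x in g)
    variance = second_moment - sum(m * m for m in mean)
    return mean, variance

pi = softmax([0.5, 0.0, -0.5])       # assumed logits for one state
q_values = [6.0, 5.0, 4.0]           # assumed Q(s, a) for that state
baseline = sum(p * q for p, q in zip(pi, q_values))

# The expected score is zero, which is why any b(s) can be subtracted.
expected_score = [sum(pi[a] * score(pi, a)[i] for a in range(3)) for i in range(3)]
mean_plain, var_plain = mean_and_variance(pi, q_values, 0.0)
mean_base, var_base = mean_and_variance(pi, q_values, baseline)
```

Here the two means coincide up to floating-point rounding, while the variance with the baseline is far smaller because the large common offset in the Q-values no longer multiplies the score.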

Tip

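A common choice for the baseline is $b(s) = V^{\pi_\theta}(s)$. This is useful because $V^{\pi_\theta}(s)$ captures the average quality of state $s$, so $Q^{\pi_\theta}(s, a) - V^{\pi_\theta}(s)$ measures whether action $a$ was better or worse than usual in that state. Since $V^{\pi_\theta}(s)$ and $Q^{\pi_\theta}(s, a)$ are strongly correlated, subtracting $V^{\pi_\theta}(s)$ often reduces variance substantially.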

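Note that $A^{\pi_\theta}(s, a) = Q^{\pi_\theta}(s, a) - V^{\pi_\theta}(s)$ is the advantage function. In the finite horizon case, we get

$$\nabla_\theta J(\theta) = \mathbb{E}_{p_\theta(\tau)}\left[\sum_{t=1}^{T} \gamma^{t-1}\, A^{\pi_\theta}(s_t, a_t)\, \nabla_\theta \log \pi_\theta(a_t \mid s_t)\right]$$

We can also apply a baseline to the reward-to-go formulation from Equation 14

$$\nabla_\theta J(\theta) = \mathbb{E}_{p_\theta(\tau)}\left[\sum_{t=1}^{T} \gamma^{t-1}\, \big(G_t - b(s_t)\big)\, \nabla_\theta \log \pi_\theta(a_t \mid s_t)\right]$$

We can derive analogous baselines for the infinite horizon case, defined in terms of $\rho^\gamma_\pi$.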
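For the finite-horizon reward-to-go form with a baseline, the per-step weight that multiplies each score term can be computed in a single backward pass. This is a minimal sketch: `baselined_weights` is a hypothetical helper, and the `baselines` list stands in for values $b(s_t)$ that would normally come from a learned value function. List index `t` is 0-based, so `gamma ** t` plays the role of $\gamma^{t-1}$.

```python
def baselined_weights(rewards, baselines, gamma):
    """Return gamma^(t-1) * (G_t - b(s_t)) for every step of one trajectory."""
    weights = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running          # reward-to-go G_t
        weights[t] = (gamma ** t) * (running - baselines[t])
    return weights

# Assumed toy trajectory: three rewards and a constant baseline of 1.
weights = baselined_weights([1.0, 0.0, 2.0], [1.0, 1.0, 1.0], 0.5)
```

A single-trajectory gradient estimate is then the sum over steps of each weight times the score $\nabla_\theta \log \pi_\theta(a_t \mid s_t)$ at that step; a zero weight, as at the second step here, means that step contributes nothing to the update.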

Policy Gradient Algorithms

Sources

  • Murphy, K. (2025). Reinforcement Learning: An Overview. Chapter 3.