Actor-Critic Methods

An actor-critic method is a policy gradient method that augments the policy with a learned value estimator. The actor is the policy we want to improve, and the critic is the value function used to tell the actor whether its behavior was good or bad.

The main motivation is variance reduction. In vanilla policy gradients, the actor is trained using sampled returns, which can be noisy. Actor-critic methods replace those raw Monte Carlo targets with learned estimates such as a TD residual, a value baseline, or an advantage estimate. This usually makes learning more stable and more sample efficient, at the cost of some bias when the critic is inaccurate.

In short:

  • the actor chooses actions
  • the critic evaluates states or state-action pairs
  • the actor uses the critic’s signal to update the policy

Bootstrapping in TD updates lets the critic learn more efficiently than pure Monte Carlo estimation. It also enables fully online, incremental updates that do not have to wait for the end of a trajectory.
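As a rough illustration of the bootstrapped, online critic update described above, here is a minimal tabular TD(0) sketch. The function name `td0_update`, the step size `alpha`, and the toy transition are assumptions made for illustration; they do not come from the source.

```python
def td0_update(V, s, r, s_next, done, alpha=0.1, gamma=0.99):
    """One online TD(0) step on a tabular value estimate V (a dict)."""
    # Bootstrapped target: reward plus the discounted estimate of the next state.
    bootstrap = 0.0 if done else gamma * V.get(s_next, 0.0)
    delta = r + bootstrap - V.get(s, 0.0)      # TD residual
    V[s] = V.get(s, 0.0) + alpha * delta       # move V(s) toward the target
    return delta

# Hypothetical single transition A -> B with reward 1.
V = {}
delta = td0_update(V, s="A", r=1.0, s_next="B", done=False)
```

The update uses only one transition, so it can be applied after every environment step rather than at the end of an episode.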

Core Update Pattern

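Most stochastic actor-critic methods still use the same policy-gradient structure:

$$\nabla_\theta J(\theta) \approx \mathbb{E}\left[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\, \Psi_t\right]$$

where $\Psi_t$ is some learned or partially learned estimate of how good action $a_t$ was. Common choices include: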

  • a TD residual
  • an advantage estimate
  • GAE

So the critic is not just an unrelated auxiliary network. It is the component that provides the training signal used to weight the policy gradient.
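For concreteness, the three choices of weighting term listed above are commonly written as follows. The notation ($V_w$ for the learned value function, $\hat{Q}$ for some return-based action-value estimate, discount $\gamma$, and GAE parameter $\lambda$) is assumed here rather than taken from the source.

```latex
\begin{align*}
\delta_t &= r_t + \gamma V_w(s_{t+1}) - V_w(s_t)
  && \text{(TD residual)} \\
\hat{A}_t &= \hat{Q}(s_t, a_t) - V_w(s_t)
  && \text{(advantage estimate)} \\
\hat{A}_t^{\mathrm{GAE}(\gamma,\lambda)} &= \sum_{l=0}^{\infty} (\gamma\lambda)^{l}\, \delta_{t+l}
  && \text{(GAE)}
\end{align*}
```

Setting $\lambda = 0$ in the last line recovers the one-step TD residual, while $\lambda = 1$ gives a Monte Carlo return minus the value baseline.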

Architectural Issues

Caution

It is common to use a single neural network for both the actor and the critic, with different output heads: a scalar output for the value function and a vector output for the policy. However, it has been argued that separate networks can work better (at least when using MLPs/CNNs), since the actor and critic need to extract different kinds of features. [1]
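To make the shared-network layout concrete, here is a minimal sketch in plain Python of one shared hidden layer feeding a policy head and a value head. The class name `SharedActorCritic`, the layer sizes, and the initialization range are hypothetical choices for illustration, and no training code is shown.

```python
import math
import random

def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    total = sum(e)
    return [v / total for v in e]

def init_matrix(rng, rows, cols):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

class SharedActorCritic:
    """Hypothetical tiny network: one shared trunk, two output heads."""

    def __init__(self, obs_dim, hidden_dim, n_actions, seed=0):
        rng = random.Random(seed)
        self.W_trunk = init_matrix(rng, hidden_dim, obs_dim)     # shared features
        self.W_policy = init_matrix(rng, n_actions, hidden_dim)  # actor head: logits
        self.W_value = init_matrix(rng, 1, hidden_dim)           # critic head: scalar

    def forward(self, obs):
        h = [math.tanh(v) for v in matvec(self.W_trunk, obs)]
        probs = softmax(matvec(self.W_policy, h))   # vector output for the policy
        value = matvec(self.W_value, h)[0]          # scalar output for the value
        return probs, value

net = SharedActorCritic(obs_dim=4, hidden_dim=8, n_actions=3)
probs, value = net.forward([0.1, -0.2, 0.3, 0.0])
```

The separate-network alternative would give the actor and the critic their own trunk matrices instead of sharing `W_trunk`.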

Representative Methods

  • A2C: A synchronous actor-critic method that uses a critic to form lower-variance policy updates from short rollouts.
  • Generalized Advantage Estimation (GAE): Less a separate actor-critic family than a better advantage estimator, often paired with actor-critic methods.
  • Policy Improvement Methods: Trust-region and proximal methods such as PPO often still use an actor-critic architecture, but add an explicit mechanism to keep policy updates small.
  • Deterministic Policy Gradient Methods: Replace a stochastic actor with a deterministic one, so the actor is updated by differentiating the critic through the chosen action.
  • DDPG and TD3: Deep deterministic actor-critic methods for continuous control.
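As an example of how one of the listed pieces works in practice, the following sketch computes GAE advantages for a single short rollout. The function name `compute_gae`, the default `gamma` and `lam` values, and the assumption that the rollout contains no terminal state are all illustrative choices, not taken from the source.

```python
def compute_gae(rewards, values, last_value, gamma=0.99, lam=0.95):
    """GAE advantages for one rollout with no terminal state inside it.

    rewards[t] and values[t] refer to step t; last_value is the critic's
    estimate for the state reached after the final step (the bootstrap).
    """
    advantages = [0.0] * len(rewards)
    next_value = last_value
    running = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * next_value - values[t]  # TD residual
        running = delta + gamma * lam * running              # discounted sum of residuals
        advantages[t] = running
        next_value = values[t]
    return advantages

# With gamma = lam = 1 and a zero critic, GAE reduces to reward-to-go.
adv = compute_gae(rewards=[1.0, 1.0], values=[0.0, 0.0], last_value=0.0,
                  gamma=1.0, lam=1.0)
```

With `lam=0.0` the same function returns the one-step TD residuals, which is the low-variance, higher-bias end of the trade-off.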

From Policy Gradients to Deterministic Actor-Critic

These methods can feel unrelated when studied one at a time, but they are all trying to answer the same question:

How should we change the policy parameters so that the expected return increases?

The main differences are not about having different ultimate goals. They are about how the method answers two practical questions:

  1. What signal tells us whether an action was good or bad?
  2. How do we propagate that signal back to the policy parameters?

Path 1: Vanilla Policy Gradient

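In Policy Gradient Methods, we keep the policy stochastic and write

$$a_t \sim \pi_\theta(\cdot \mid s_t)$$

The update has the form

$$\nabla_\theta J(\theta) \approx \mathbb{E}\left[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\, \Psi_t\right]$$

where $\Psi_t$ might be the full return, reward-to-go, or an advantage estimate.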

The logic is simple:

  • sample an action from the policy
  • observe how good that sampled action was
  • make that action more or less likely in the future

So stochastic policy gradients update the log-probability of the sampled action.
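The three steps above can be sketched for a state-free softmax policy over a vector of logits. This is a toy, bandit-style illustration: the names `reward_to_go`, `grad_log_softmax`, and `reinforce_step`, and the learning rate, are assumptions for this example.

```python
import math

def reward_to_go(rewards, gamma=0.99):
    out, running = [0.0] * len(rewards), 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        out[t] = running
    return out

def softmax(logits):
    m = max(logits)
    e = [math.exp(v - m) for v in logits]
    total = sum(e)
    return [v / total for v in e]

def grad_log_softmax(logits, action):
    """Gradient of log pi(action) w.r.t. the logits: one_hot(action) - probs."""
    probs = softmax(logits)
    return [(1.0 if i == action else 0.0) - p for i, p in enumerate(probs)]

def reinforce_step(logits, actions, rewards, lr=0.1, gamma=0.99):
    """Weight each sampled action's score function by its reward-to-go."""
    returns = reward_to_go(rewards, gamma)
    new_logits = list(logits)
    for a, g in zip(actions, returns):
        score = grad_log_softmax(logits, a)
        new_logits = [l + lr * g * s for l, s in zip(new_logits, score)]
    return new_logits

# One sampled action (index 1) that received reward 1.
logits = reinforce_step([0.0, 0.0], actions=[1], rewards=[1.0])
```

After the step, the logit of the rewarded action has risen and the other has fallen, so the sampled action becomes more likely.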

Path 2: Actor-Critic

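Actor-critic keeps the same basic stochastic-policy update, but replaces raw Monte Carlo returns with a learned estimator such as $V_w(s)$ or $Q_w(s, a)$, where $w$ denotes the critic parameters. This provides a lower-variance signal like

$$\delta_t = r_t + \gamma V_w(s_{t+1}) - V_w(s_t)$$

The key point is that the actor update is still of the form

$$\nabla_\theta \log \pi_\theta(a_t \mid s_t)\, \delta_t$$

So the critic is still supplying a scalar weighting term for a stochastic policy-gradient update.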
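A minimal sketch of one online actor-critic step, assuming a tabular critic and a tabular softmax actor, shows the critic's TD residual acting as that scalar weight. The function name `actor_critic_step`, the two learning rates, and the two-state toy setup are hypothetical.

```python
import math

def softmax(logits):
    m = max(logits)
    e = [math.exp(v - m) for v in logits]
    total = sum(e)
    return [v / total for v in e]

def actor_critic_step(theta, V, s, a, r, s_next, done,
                      actor_lr=0.1, critic_lr=0.1, gamma=0.99):
    """One online update; theta[s] is a list of action logits, V[s] a float."""
    bootstrap = 0.0 if done else gamma * V[s_next]
    delta = r + bootstrap - V[s]          # TD residual from the critic
    V[s] += critic_lr * delta             # critic: TD(0) update
    probs = softmax(theta[s])
    for i in range(len(theta[s])):        # actor: delta * grad log pi(a | s)
        score = (1.0 if i == a else 0.0) - probs[i]
        theta[s][i] += actor_lr * delta * score
    return delta

theta = {0: [0.0, 0.0], 1: [0.0, 0.0]}
V = {0: 0.0, 1: 0.0}
delta = actor_critic_step(theta, V, s=0, a=1, r=1.0, s_next=1, done=False)
```

A positive residual raises the probability of the sampled action, and a negative one lowers it, exactly as a sampled return would in the vanilla update.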

The Big Fork: What Does the Policy Output?

At this point, the family splits depending on what kind of object the policy produces.

For a stochastic policy, the policy outputs a distribution

$$\pi_\theta(a \mid s)$$

so the natural update is to adjust the probabilities of sampled actions.

For a deterministic policy, the policy outputs a single action

$$a = \mu_\theta(s)$$

Now there is no log-probability term. Instead, the actor is updated by asking how the critic value changes if the chosen action moves slightly.
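The difference in what the policy returns can be seen in a small sketch. Both policies below are hypothetical one-parameter linear models; the Gaussian form of the stochastic policy is an assumption chosen for illustration.

```python
import math
import random

def gaussian_policy(state, theta, log_std, rng):
    """Stochastic policy: sample an action and return its log-probability."""
    mean = theta * state
    std = math.exp(log_std)
    action = rng.gauss(mean, std)
    log_prob = (-0.5 * ((action - mean) / std) ** 2
                - log_std - 0.5 * math.log(2.0 * math.pi))
    return action, log_prob

def deterministic_policy(state, theta):
    """Deterministic policy: the output is the action itself."""
    return theta * state

rng = random.Random(0)
a_stoch, logp = gaussian_policy(state=2.0, theta=0.5, log_std=0.0, rng=rng)
a_det = deterministic_policy(state=2.0, theta=0.5)
```

Only the stochastic version has a `log_prob` to differentiate; the deterministic version has to get its learning signal from somewhere else, namely the critic.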

Path 3: Deterministic Policy Gradient

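With a deterministic actor $\mu_\theta$ and critic $Q_w$, the objective can be written as

$$J(\theta) = \mathbb{E}_{s}\left[Q_w(s, \mu_\theta(s))\right]$$

The actor is then updated by differentiating through the action itself:

$$\nabla_\theta J(\theta) = \mathbb{E}_{s}\left[\nabla_a Q_w(s, a)\big|_{a = \mu_\theta(s)}\, \nabla_\theta \mu_\theta(s)\right]$$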

This is the main conceptual difference:

  • in stochastic actor-critic, the critic says whether a sampled action should become more or less likely
  • in deterministic policy gradients, the critic tells the actor how to move the action itself in action space
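A toy example may help show the chain rule at work. It assumes a linear actor `mu(theta, s) = theta * s` and a fixed, hand-written critic `Q(s, a) = -(a - 2*s)**2`, both invented for illustration; in a real method the critic would be learned.

```python
def mu(theta, s):
    return theta * s                      # deterministic linear actor

def dq_da(s, a):
    # Gradient of the toy critic Q(s, a) = -(a - 2*s)**2 with respect to a.
    return -2.0 * (a - 2.0 * s)

def dpg_step(theta, states, lr=0.05):
    grad = 0.0
    for s in states:
        a = mu(theta, s)
        grad += dq_da(s, a) * s           # dQ/da * dmu/dtheta, where dmu/dtheta = s
    return theta + lr * grad / len(states)  # gradient ascent on the mean critic value

theta = 0.0
for _ in range(200):
    theta = dpg_step(theta, states=[1.0, -1.0, 2.0])
```

Under this toy critic the best action is always `2 * s`, so the actor parameter should approach 2 without any action ever being sampled or any log-probability being computed.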

DQN Analogy

The connection to DQN is helpful here. In DQN with discrete actions, we can compute a value for each action and pick the best one. In continuous action spaces, we cannot enumerate all possible actions. Deterministic actor-critic methods replace that search with local optimization:

  • the actor proposes an action
  • the critic evaluates that action
  • the critic gradient tells the actor how to nudge the action toward a better one
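The contrast can be sketched as follows: enumeration with an argmax for discrete actions, versus repeated small nudges along the critic's gradient for a continuous action. The critic `toy_q` is hypothetical, and the finite-difference gradient is a stand-in for automatic differentiation. Note that a deterministic actor-critic method updates the actor's parameters rather than the action directly; moving the action itself here is only meant to show the local search.

```python
def greedy_discrete(q_values):
    """DQN-style: enumerate every action and take the argmax."""
    return max(range(len(q_values)), key=lambda a: q_values[a])

def toy_q(s, a):
    return -(a - s) ** 2                  # hypothetical critic; best action is a = s

def nudge_action(q, s, a, step=0.1, eps=1e-4, n_steps=100):
    """Continuous case: follow a finite-difference estimate of dQ/da."""
    for _ in range(n_steps):
        grad = (q(s, a + eps) - q(s, a - eps)) / (2.0 * eps)
        a += step * grad                  # move the proposed action uphill
    return a

best_index = greedy_discrete([0.1, 0.7, 0.3])
best_action = nudge_action(toy_q, s=1.5, a=0.0)
```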

Path 4: DDPG

DDPG takes the deterministic policy gradient idea and combines it with practical stabilization tricks from DQN:

  • a replay buffer
  • a target critic
  • a target actor

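The critic is trained by TD learning, while the actor is trained to maximize the critic’s output:

$$\max_\theta \; \mathbb{E}_{s \sim \mathcal{D}}\left[Q_w(s, \mu_\theta(s))\right]$$

Here $\mathcal{D}$ is the replay buffer, $\mu_\theta$ is the actor, and $Q_w$ is the critic.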

So DDPG is not changing the goal of policy optimization. It is choosing a particular answer to the same two questions from above:

  1. What says an action is good: a learned action-value critic $Q_w(s, a)$.
  2. How is that signal propagated to the actor: by differentiating the critic with respect to the action, and then the action with respect to the actor parameters.
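The three stabilization pieces listed above can be sketched in a few lines. The class and function names, the `tau` value, and the toy target networks (plain Python callables) are assumptions for illustration; this is not a full DDPG implementation and omits exploration noise and the gradient steps themselves.

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size store of past transitions, sampled uniformly at random."""

    def __init__(self, capacity):
        self.data = deque(maxlen=capacity)

    def add(self, s, a, r, s_next, done):
        self.data.append((s, a, r, s_next, done))

    def sample(self, batch_size, rng=random):
        return rng.sample(list(self.data), batch_size)

def td_target(r, s_next, done, target_actor, target_critic, gamma=0.99):
    """Critic target: r + gamma * Q'(s', mu'(s')), no bootstrap at terminals."""
    if done:
        return r
    return r + gamma * target_critic(s_next, target_actor(s_next))

def soft_update(target_params, online_params, tau=0.005):
    """Polyak averaging: target <- tau * online + (1 - tau) * target."""
    return [tau * w + (1.0 - tau) * wt
            for wt, w in zip(target_params, online_params)]

buffer = ReplayBuffer(capacity=100)
buffer.add(s=0.0, a=0.5, r=1.0, s_next=1.0, done=False)
batch = buffer.sample(1)
y = td_target(1.0, 1.0, False,
              target_actor=lambda s: 0.5 * s,      # hypothetical target actor
              target_critic=lambda s, a: s + a,    # hypothetical target critic
              gamma=0.9)
new_target = soft_update([0.0, 1.0], [1.0, 1.0], tau=0.1)
```

The target networks change slowly because of the small `tau`, which keeps the TD target from chasing the online critic too closely.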

Summary Table

| Method | Policy type | Main signal | How the actor updates |
| --- | --- | --- | --- |
| REINFORCE | Stochastic | Return / reward-to-go | Make sampled good actions more likely |
| A2C / actor-critic | Stochastic | TD error / advantage / GAE | Same policy-gradient form, but with a learned low-variance scalar estimator |
| Policy Improvement Methods | Stochastic | Advantage plus trust-region / clipped surrogate control | Improve the policy while limiting how far one update can move it |
| Deterministic Policy Gradient Methods | Deterministic | Gradient of the critic with respect to the action, $\nabla_a Q_w(s, a)$ | Move the actor output in the direction that increases critic value |
| DDPG / TD3 | Deterministic | Learned critic with replay + targets | Practical deep RL versions of deterministic policy gradients |

Core Intuition

All of these methods are trying to improve a policy with respect to expected return. The deeper fork is not simply “with or without a critic.” It is:

Are we learning by reweighting sampled actions under a probability distribution, or by directly differentiating a critic through the action chosen by the policy?

That is the conceptual step from vanilla policy gradients to deterministic actor-critic methods.

Sources

  • Murphy, K. (2025). Reinforcement Learning: An Overview. Chapter 3.

Footnotes

  1. Studying the Interplay Between the Actor and Critic Representations in Reinforcement Learning