Actor Critic Methods

An actor-critic method is a policy gradient method that augments the policy with a learned value estimator. The actor is the policy we want to improve, and the critic is the value function used to tell the actor whether its behavior was good or bad.

The main motivation is variance reduction. In vanilla policy gradients, the actor is trained using sampled returns, which can be noisy. Actor-critic methods replace those raw Monte Carlo targets with learned estimates such as a TD residual, value baseline, or advantage estimate. This usually makes learning more stable and more sample efficient.

In short:

the actor chooses actions
the critic evaluates states or state-action pairs
the actor uses the critic’s signal to update the policy

The use of bootstrapping in TD updates allows for more efficient learning of the critic than pure Monte Carlo estimation. It also enables fully online, incremental updates that do not need to wait until the end of the trajectory.
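A minimal sketch of the online, bootstrapped critic update described above. It assumes a tabular critic stored in a dict; the names `td0_update`, `alpha`, and `gamma` are illustrative and do not come from the source.

```python
def td0_update(V, s, r, s_next, done, alpha=0.1, gamma=0.99):
    """One online TD(0) step on a tabular critic V (dict: state -> value).

    Returns the TD residual so the caller can reuse it as an actor signal.
    """
    # Bootstrapped target: reward plus the discounted estimate of the next
    # state. Terminal states contribute no future value.
    target = r + (0.0 if done else gamma * V.get(s_next, 0.0))
    td_error = target - V.get(s, 0.0)
    # Move V(s) a small step toward the target. This happens immediately,
    # without waiting for the end of the trajectory.
    V[s] = V.get(s, 0.0) + alpha * td_error
    return td_error


V = {}
delta = td0_update(V, s="A", r=1.0, s_next="B", done=False)
```

Each transition is consumed as soon as it is observed, which is what makes the update incremental.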

Core Update Pattern

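Most stochastic actor-critic methods still use the same policy-gradient structure:

$$\nabla_\theta J(\theta) \approx \mathbb{E}\left[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\, \Psi_t\right]$$

where $\Psi_t$ is some learned or partially learned estimate of how good action $a_t$ was. Common choices include: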

a TD residual
an advantage estimate
GAE

So the critic is not just an unrelated auxiliary network. It is the component that provides the training signal used to weight the policy gradient.
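As a rough illustration of how the critic turns into a training signal, the sketch below computes GAE advantages for one trajectory segment from rewards and critic values. The function name, argument layout, and default `gamma`/`lam` values are assumptions, and the segment is assumed to contain no episode boundary.

```python
def gae_advantages(rewards, values, last_value, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation for one trajectory segment.

    rewards[t] and values[t] are aligned per step. last_value is the critic's
    estimate for the state after the final step (0.0 if it is terminal).
    """
    advantages = [0.0] * len(rewards)
    next_value = last_value
    running = 0.0
    # Work backwards so each advantage reuses the one that follows it.
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * next_value - values[t]  # TD residual
        running = delta + gamma * lam * running
        advantages[t] = running
        next_value = values[t]
    return advantages


one_step = gae_advantages([1.0, 1.0, 1.0], [0.0, 0.0, 0.0], 0.0, gamma=1.0, lam=0.0)
monte_carlo = gae_advantages([1.0, 1.0, 1.0], [0.0, 0.0, 0.0], 0.0, gamma=1.0, lam=1.0)
```

With `lam=0` the result reduces to the one-step TD residual, and with `lam=1` it becomes the discounted return minus the value baseline, so the single parameter interpolates between the first two choices in the list.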

Architectural Issues

Caution

It is common to use a single neural network for both the actor and critic, with different output heads: a scalar output for the value function and a vector output for the policy. However, it has been argued that it can be better to use separate networks for the actor and critic (at least when using MLPs/CNNs), since they need to extract different kinds of features. ¹

Representative Methods

A2C: A synchronous actor-critic method that uses a critic to form lower-variance policy updates from short rollouts.
Generalized Advantage Estimation (GAE): Not a separate actor-critic family by itself so much as a better advantage estimator, often paired with actor-critic methods.
Policy Improvement Methods: Trust-region and proximal methods such as PPO often still use an actor-critic architecture, but add an explicit mechanism to keep policy updates small.
Deterministic Policy Gradient Methods: Replace the stochastic actor with a deterministic one, so the actor is updated by differentiating the critic through the chosen action.
DDPG and TD3: Deep deterministic actor-critic methods for continuous control.
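To make the "keep policy updates small" mechanism of proximal methods concrete, here is a minimal sketch of the clipped surrogate used by PPO for a single sample. The function name and the default `clip_eps=0.2` are illustrative assumptions; `ratio` stands for the new-policy probability of the action divided by the old-policy probability.

```python
def clipped_surrogate(ratio, advantage, clip_eps=0.2):
    """Per-sample PPO-style objective: min(r * A, clip(r, 1-eps, 1+eps) * A)."""
    clipped_ratio = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
    # Taking the minimum makes the objective pessimistic: the policy gets no
    # extra credit for moving the ratio beyond the clip range.
    return min(ratio * advantage, clipped_ratio * advantage)


gain_capped = clipped_surrogate(1.5, 1.0)     # good action, ratio already large
loss_capped = clipped_surrogate(0.5, -1.0)    # bad action, ratio already small
unclipped = clipped_surrogate(1.5, -1.0)      # bad action made more likely
```

The advantage passed in is still produced by a critic, which is why these methods sit inside the actor-critic family.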

From Policy Gradients to Deterministic Actor-Critic

These methods can feel unrelated when studied one at a time, but they are all trying to answer the same question:

How should we change the policy parameters so that the expected return increases?

The main differences are not about having different ultimate goals. They are about how the method answers two practical questions:

What signal tells us whether an action was good or bad?
How do we propagate that signal back to the policy parameters?

Path 1: Vanilla Policy Gradient

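In Policy Gradient Methods, we keep the policy stochastic and write

$$\pi_\theta(a \mid s)$$

The update has the form

$$\nabla_\theta J(\theta) \approx \mathbb{E}\left[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\, \Psi_t\right]$$

where $\Psi_t$ might be the full return, reward-to-go, or an advantage estimate.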

The logic is simple:

sample an action from the policy
observe how good that sampled action was
make that action more or less likely in the future

So stochastic policy gradients update the log-probability of the sampled action.
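The three steps above can be sketched for a softmax policy over a handful of actions. This is a simplified, state-independent setting (effectively a bandit); the function names and the `signal` argument, which stands in for the return, reward-to-go, or advantage, are assumptions.

```python
import math


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def grad_log_prob(logits, action):
    """Gradient of log pi(action) with respect to the softmax logits."""
    probs = softmax(logits)
    return [(1.0 if i == action else 0.0) - p for i, p in enumerate(probs)]


def policy_gradient_step(logits, action, signal, lr=0.1):
    """Ascend signal * grad log pi(action) and return the new logits."""
    g = grad_log_prob(logits, action)
    return [x + lr * signal * gi for x, gi in zip(logits, g)]


logits = [0.0, 0.0, 0.0]
before = softmax(logits)[1]
after_good = softmax(policy_gradient_step(logits, action=1, signal=2.0))[1]
after_bad = softmax(policy_gradient_step(logits, action=1, signal=-2.0))[1]
```

A positive signal raises the probability of the sampled action and a negative one lowers it. The other actions change only because the probabilities must sum to one.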

Path 2: Actor-Critic

Actor-critic keeps the same basic stochastic-policy update, but replaces raw Monte Carlo returns with a learned estimator such as $V_w(s)$ or $Q_w(s, a)$. This provides a lower-variance signal like

$$\delta_t = r_t + \gamma V_w(s_{t+1}) - V_w(s_t)$$

The key point is that the actor update is still of the form

$$\nabla_\theta \log \pi_\theta(a_t \mid s_t)\, \delta_t$$

So the critic is still supplying a scalar weighting term for a stochastic policy-gradient update.
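A minimal sketch of one online actor-critic step, assuming a tabular softmax actor (`theta[s]` holds action logits) and a tabular critic (`V[s]`). All names and learning rates are illustrative assumptions rather than anything prescribed by the source.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def actor_critic_step(theta, V, s, a, r, s_next, done,
                      actor_lr=0.1, critic_lr=0.1, gamma=0.99):
    """One online update of a tabular softmax actor and a tabular critic."""
    # Critic signal: one-step TD residual, with no bootstrap at terminals.
    target = r + (0.0 if done else gamma * V[s_next])
    delta = target - V[s]
    # Critic update: move V(s) toward the bootstrapped target.
    V[s] += critic_lr * delta
    # Actor update: the usual log-probability gradient, weighted by delta
    # instead of a raw Monte Carlo return.
    probs = softmax(theta[s])
    for i in range(len(theta[s])):
        grad_log_pi = (1.0 if i == a else 0.0) - probs[i]
        theta[s][i] += actor_lr * delta * grad_log_pi
    return delta


theta = {0: [0.0, 0.0], 1: [0.0, 0.0]}
V = {0: 0.0, 1: 0.0}
delta = actor_critic_step(theta, V, s=0, a=1, r=1.0, s_next=1, done=False)
```

The only difference from a plain policy-gradient step is where the scalar weight comes from: here it is computed from the critic after a single transition.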

The Big Fork: What Does the Policy Output?

At this point, the family splits depending on what kind of object the policy produces.

For a stochastic policy, the policy outputs a distribution

$$\pi_\theta(a \mid s)$$

so the natural update is to adjust the probabilities of sampled actions.

For a deterministic policy, the policy outputs a single action

$$a = \mu_\theta(s)$$

Now there is no log-probability term. Instead, the actor is updated by asking how the critic value changes if the chosen action moves slightly.

Path 3: Deterministic Policy Gradient

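With a deterministic actor, the objective can be written as

$$J(\theta) = \mathbb{E}_{s}\left[Q^{\mu_\theta}(s, \mu_\theta(s))\right]$$

The actor is then updated by differentiating through the action itself:

$$\nabla_\theta J(\theta) = \mathbb{E}_{s}\left[\nabla_\theta \mu_\theta(s)\, \nabla_a Q^{\mu_\theta}(s, a)\big|_{a = \mu_\theta(s)}\right]$$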

This is the main conceptual difference:

in stochastic actor-critic, the critic says whether a sampled action should become more or less likely
in deterministic policy gradients, the critic tells the actor how to move the action itself in action space
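The chain rule behind the deterministic update can be sketched with a toy scalar problem. The linear actor `a = theta * s` and the fixed quadratic critic `Q(s, a) = -(a - 2*s)**2` are assumptions chosen so the gradients can be written by hand; in practice the critic would be learned.

```python
def actor(theta, s):
    """Deterministic policy: returns a single action, not a distribution."""
    return theta * s


def critic_grad_action(s, a):
    """dQ/da for the toy critic Q(s, a) = -(a - 2*s)**2 (an assumed stand-in)."""
    return -2.0 * (a - 2.0 * s)


def dpg_step(theta, states, lr=0.05):
    """Ascend the batch average of dQ/da * da/dtheta."""
    grad = 0.0
    for s in states:
        a = actor(theta, s)
        dq_da = critic_grad_action(s, a)  # how the critic value changes with a
        da_dtheta = s                     # how the action changes with theta
        grad += dq_da * da_dtheta
    return theta + lr * grad / len(states)


theta = 0.0
states = [-1.0, 0.5, 1.0, 2.0]
for _ in range(200):
    theta = dpg_step(theta, states)
```

No log-probability appears anywhere: the actor parameter moves toward `2.0`, the value at which `actor(theta, s)` matches the action the toy critic prefers in every state.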

DQN Analogy

The connection to DQN is helpful here. In DQN with discrete actions, we can compute a value for each action and pick the best one. In continuous action spaces, we cannot enumerate all possible actions. Deterministic actor-critic methods replace that search with local optimization:

the actor proposes an action
the critic evaluates that action
the critic gradient tells the actor how to nudge the action toward a better one
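The contrast can be sketched in a few lines. `greedy_discrete` enumerates actions as DQN does, while `nudge_action` follows a critic gradient in a continuous action space. The toy critic `Q(a) = -(a - 1.5)**2` for a fixed state and both function names are assumptions, and the action is nudged directly here for illustration, whereas an actor network would learn to output the improved action.

```python
def greedy_discrete(q_values):
    """DQN-style selection: enumerate every action and take the best one."""
    return max(range(len(q_values)), key=lambda a: q_values[a])


def nudge_action(a, dq_da, step=0.1, iters=100):
    """Continuous stand-in for the argmax: follow the critic gradient locally."""
    for _ in range(iters):
        a = a + step * dq_da(a)
    return a


best_discrete = greedy_discrete([0.1, 0.7, 0.3])
# Toy critic for a fixed state: Q(a) = -(a - 1.5)**2, so dQ/da = -2 * (a - 1.5).
best_continuous = nudge_action(0.0, lambda a: -2.0 * (a - 1.5))
```

The discrete case needs one value per action, while the continuous case only needs the critic's slope at the current proposal.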

Path 4: DDPG

DDPG takes the deterministic policy gradient idea and combines it with practical stabilization tricks from DQN:

a replay buffer
a target critic
a target actor

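The critic is trained by TD learning, while the actor is trained to maximize the critic’s output:

$$\max_\theta \; \mathbb{E}_{s \sim \mathcal{D}}\left[Q_w(s, \mu_\theta(s))\right]$$

where $\mathcal{D}$ is the replay buffer.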

So DDPG is not changing the goal of policy optimization. It is choosing a particular answer to the same two questions from above:

What says an action is good: a learned action-value critic $Q_w(s, a)$.
How is that signal propagated to the actor: by differentiating the critic with respect to the action, and then the action with respect to the actor parameters.
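A minimal sketch of the three stabilization pieces listed above: a replay buffer, and target copies of the actor and critic that are used to build the TD target. The class and function names, the transition layout, and the soft-update rate `tau` are illustrative assumptions; networks are represented by plain callables and flat parameter lists.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-size store of transitions; the oldest entries are dropped first."""

    def __init__(self, capacity):
        self.data = deque(maxlen=capacity)

    def add(self, transition):
        self.data.append(transition)

    def sample(self, batch_size):
        return random.sample(list(self.data), batch_size)


def soft_update(target_params, online_params, tau=0.005):
    """Polyak averaging: target <- (1 - tau) * target + tau * online."""
    return [(1.0 - tau) * t + tau * o
            for t, o in zip(target_params, online_params)]


def critic_target(r, s_next, done, target_actor, target_critic, gamma=0.99):
    """TD target y = r + gamma * Q'(s', mu'(s')), with no bootstrap at terminals."""
    if done:
        return r
    return r + gamma * target_critic(s_next, target_actor(s_next))


buffer = ReplayBuffer(capacity=3)
for i in range(5):
    buffer.add((float(i), 0.0, 1.0, float(i + 1), False))  # (s, a, r, s', done)
batch = buffer.sample(2)
y = critic_target(1.0, 2.0, False,
                  target_actor=lambda s: 0.5 * s,
                  target_critic=lambda s, a: s + a)
new_target = soft_update([0.0, 1.0], [1.0, 1.0], tau=0.1)
```

The critic regresses toward `y` on sampled batches, and the slowly moving target parameters keep that regression target from shifting as quickly as the online networks do.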

Summary Table

Method	Policy type	Main signal	How the actor updates
REINFORCE	Stochastic	Return / reward-to-go	Make sampled good actions more likely
A2C / actor-critic	Stochastic	TD error / advantage / GAE	Same policy-gradient form, but with a learned low-variance scalar estimator
Policy Improvement Methods	Stochastic	Advantage plus trust-region / clipped surrogate control	Improve the policy while limiting how far one update can move it
Deterministic Policy Gradient Methods	Deterministic	Critic gradient $\nabla_a Q(s, a)$	Move the actor output in the direction that increases critic value
DDPG / TD3	Deterministic	Learned critic with replay + targets	Practical deep RL versions of deterministic policy gradients

Core Intuition

All of these methods are trying to improve a policy with respect to expected return. The deeper fork is not simply “with or without a critic.” It is:

Are we learning by reweighting sampled actions under a probability distribution, or by directly differentiating a critic through the action chosen by the policy?

That is the conceptual step from vanilla policy gradients to deterministic actor-critic methods.

Sources

Murphy, K. (2025). Reinforcement Learning: An Overview. Chapter 3.

¹ Studying the Interplay Between the Actor and Critic Representations in Reinforcement Learning

Jake Tuero
