Actor-Critic Methods
An actor-critic method is a policy gradient method that augments the policy with a learned value estimator. The actor is the policy we want to improve, and the critic is the value function used to tell the actor whether its behavior was good or bad.
The main motivation is variance reduction. In vanilla policy gradients, the actor is trained using sampled returns, which can be noisy. Actor-critic methods replace or correct those raw Monte Carlo targets with learned quantities such as a TD residual, a value baseline, or an advantage estimate. This usually makes learning more stable and more sample-efficient, at the cost of some bias when the critic is inaccurate.
In short:
- the actor chooses actions
- the critic evaluates states or state-action pairs
- the actor uses the critic’s signal to update the policy
The use of bootstrapping in TD updates allows for more efficient learning of the critic than pure Monte Carlo estimation. It also enables fully online, incremental updates that do not need to wait until the end of the trajectory.
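As a minimal sketch of the online, bootstrapped critic update described above, assuming a tabular value function stored in a dict. The names `td0_update`, `alpha`, and `gamma` are illustrative and do not come from the original text.

```python
def td0_update(V, s, r, s_next, done, alpha=0.1, gamma=0.99):
    """One online TD(0) update of a tabular critic V (dict: state -> value)."""
    # Bootstrapped target: reward plus the discounted estimate of the next state.
    target = r + (0.0 if done else gamma * V.get(s_next, 0.0))
    delta = target - V.get(s, 0.0)          # TD residual
    V[s] = V.get(s, 0.0) + alpha * delta    # move V(s) a small step toward the target
    return delta


V = {}
# A two-step episode: A -> B (reward 0), then B -> terminal (reward 1).
d1 = td0_update(V, "A", 0.0, "B", done=False)
d2 = td0_update(V, "B", 1.0, None, done=True)
```

Each call uses only a single transition, so the critic can be updated after every step instead of waiting for the episode to finish.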
Core Update Pattern
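Most stochastic actor-critic methods still use the same policy-gradient structure:
$$\nabla_\theta J(\theta) \approx \mathbb{E}\left[\sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, \Psi_t\right]$$
where $\Psi_t$ is a scalar weight supplied by the critic, typically one of:
- a TD residual, $\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$
- an advantage estimate, $\hat{A}(s_t, a_t) \approx Q(s_t, a_t) - V(s_t)$
- GAE, $\hat{A}_t^{\text{GAE}} = \sum_{l \ge 0} (\gamma\lambda)^l\, \delta_{t+l}$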
So the critic is not just an unrelated auxiliary network. It is the component that provides the training signal used to weight the policy gradient.
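To make the critic-derived weights concrete, here is a minimal sketch that computes one-step TD residuals and GAE advantages for a short rollout. It assumes the critic's value estimates are already available as a list; the function names and the toy numbers are illustrative.

```python
def td_residuals(rewards, values, gamma=0.99):
    """delta_t = r_t + gamma * V(s_{t+1}) - V(s_t).

    `values` has len(rewards) + 1 entries; the last one is the bootstrap
    value of the final state (0.0 if that state is terminal).
    """
    return [r + gamma * values[t + 1] - values[t] for t, r in enumerate(rewards)]


def gae(rewards, values, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation: a (gamma * lam)-weighted sum of TD residuals."""
    deltas = td_residuals(rewards, values, gamma)
    advantages = [0.0] * len(deltas)
    running = 0.0
    for t in reversed(range(len(deltas))):
        running = deltas[t] + gamma * lam * running
        advantages[t] = running
    return advantages


rewards = [1.0, 1.0, 1.0]
values = [0.5, 0.5, 0.5, 0.0]   # toy critic estimates; terminal bootstrap is 0
deltas = td_residuals(rewards, values, gamma=1.0)
adv_td = gae(rewards, values, gamma=1.0, lam=0.0)   # reduces to one-step TD residuals
adv_mc = gae(rewards, values, gamma=1.0, lam=1.0)   # reduces to return minus V(s_t)
```

Setting `lam=0` recovers the one-step TD residual and `lam=1` recovers the Monte Carlo return minus the baseline, so `lam` interpolates between the low-variance and low-bias ends.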
Architectural Issues
Caution
It is common to use a single neural network for both the actor and the critic, with different output heads: a scalar output for the value function and a vector output for the policy. However, it has been argued that it can be better to use separate networks for the actor and critic (at least when using MLPs/CNNs), since they need to extract different kinds of features. [1]
Representative Methods
- A2C: A synchronous actor-critic method that uses a critic to form lower-variance policy updates from short rollouts.
- Generalized Advantage Estimation (GAE): Not a separate actor-critic family by itself so much as a better advantage estimator, often paired with actor-critic methods.
- Policy Improvement Methods: Trust-region and proximal methods such as PPO often still use an actor-critic architecture, but add an explicit mechanism to keep policy updates small.
- Deterministic Policy Gradient Methods: Replace the stochastic actor with a deterministic one, so the actor is updated by differentiating the critic through the chosen action.
- DDPG and TD3: Deep deterministic actor-critic methods for continuous control.
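The "explicit mechanism to keep policy updates small" can take several forms. As one illustration, here is a minimal sketch of PPO's clipped surrogate for a single sample. It assumes `ratio` is the new-to-old probability ratio for the sampled action and `advantage` comes from the critic; the clip range `eps=0.2` is an illustrative default, not a value taken from this note.

```python
def ppo_clip_objective(ratio, advantage, eps=0.2):
    """Per-sample clipped surrogate: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    clipped_ratio = max(1.0 - eps, min(ratio, 1.0 + eps))
    # Taking the min removes any incentive to push the ratio outside the clip range.
    return min(ratio * advantage, clipped_ratio * advantage)


gain_capped = ppo_clip_objective(1.5, 2.0)     # positive advantage, ratio too large
loss_kept = ppo_clip_objective(0.5, -1.0)      # negative advantage, ratio too small
unchanged = ppo_clip_objective(1.0, 2.0)       # ratio inside the clip range
```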
From Policy Gradients to Deterministic Actor-Critic
These methods can feel unrelated when studied one at a time, but they are all trying to answer the same question:
How should we change the policy parameters so that the expected return increases?
The main differences are not about having different ultimate goals. They are about how the method answers two practical questions:
- What signal tells us whether an action was good or bad?
- How do we propagate that signal back to the policy parameters?
Path 1: Vanilla Policy Gradient
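In Policy Gradient Methods, we keep the policy stochastic and write
$$a_t \sim \pi_\theta(\cdot \mid s_t)$$
The update has the form
$$\nabla_\theta J(\theta) \approx \mathbb{E}\left[\sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, G_t\right]$$
where
$$G_t = \sum_{k \ge t} \gamma^{k-t} r_k$$
is the sampled discounted reward-to-go from step $t$.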
The logic is simple:
- sample an action from the policy
- observe how good that sampled action was
- make that action more or less likely in the future
So stochastic policy gradients update the log-probability of the sampled action.
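The three steps above can be sketched for a softmax policy over a handful of discrete actions. This is a minimal illustration under the assumption that the policy parameters are the action logits themselves; the helper names are hypothetical.

```python
import math


def softmax(logits):
    m = max(logits)                                  # subtract max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def rewards_to_go(rewards, gamma=0.99):
    """Discounted return from each step to the end of the episode."""
    out, running = [], 0.0
    for r in reversed(rewards):
        running = r + gamma * running
        out.append(running)
    return out[::-1]


def reinforce_step(logits, action, ret, lr=0.1):
    """theta += lr * return * grad log pi(action), with grad = one_hot(action) - pi."""
    probs = softmax(logits)
    grad_log_pi = [(1.0 if i == action else 0.0) - p for i, p in enumerate(probs)]
    return [th + lr * ret * g for th, g in zip(logits, grad_log_pi)]


logits = [0.0, 0.0]
before = softmax(logits)[1]
logits = reinforce_step(logits, action=1, ret=2.0)   # action 1 was followed by a good return
after = softmax(logits)[1]
returns = rewards_to_go([1.0, 1.0, 1.0], gamma=1.0)
```

A positive return raises the probability of the sampled action, and a negative return would lower it by the same mechanism.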
Path 2: Actor-Critic
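Actor-critic keeps the same basic stochastic-policy update, but replaces raw Monte Carlo returns with a learned estimator such as
$$\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$$
The key point is that the actor update is still of the form
$$\nabla_\theta J(\theta) \approx \mathbb{E}\left[\sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, \Psi_t\right]$$
with the critic-derived estimate $\Psi_t$ (for example $\delta_t$) taking the place of the sampled return.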
So the critic is still supplying a scalar weighting term for a stochastic policy-gradient update.
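A minimal sketch of a one-step actor-critic update, assuming a tabular critic and a tabular softmax actor. The function `actor_critic_step`, the dict-based parameter layout, and the learning rates are illustrative choices, not something specified in the text.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def actor_critic_step(theta, V, s, a, r, s_next, done,
                      lr_actor=0.1, lr_critic=0.1, gamma=0.99):
    """theta: dict state -> list of action logits; V: dict state -> value."""
    target = r + (0.0 if done else gamma * V[s_next])
    delta = target - V[s]                 # TD residual computed by the critic
    V[s] += lr_critic * delta             # critic update (TD learning)
    probs = softmax(theta[s])
    for i, p in enumerate(probs):         # actor update: grad log pi weighted by delta
        grad_log_pi = (1.0 if i == a else 0.0) - p
        theta[s][i] += lr_actor * delta * grad_log_pi
    return delta


theta = {"s0": [0.0, 0.0]}
V = {"s0": 0.0}
delta = actor_critic_step(theta, V, "s0", a=0, r=1.0, s_next=None, done=True)
```

The actor line is the same score-function update as in REINFORCE; only the scalar weight has changed from a sampled return to the critic's TD residual.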
The Big Fork: What Does the Policy Output?
At this point, the family splits depending on what kind of object the policy produces.
For a stochastic policy, the policy outputs a distribution
$$\pi_\theta(a \mid s)$$
so the natural update is to adjust the probabilities of sampled actions.
For a deterministic policy, the policy outputs a single action
$$a = \mu_\theta(s)$$
Now there is no log-probability term. Instead, the actor is updated by asking how the critic value changes if the chosen action moves slightly.
Path 3: Deterministic Policy Gradient
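With a deterministic actor $a = \mu_\theta(s)$, the objective can be written as
$$J(\theta) = \mathbb{E}_{s}\left[Q(s, \mu_\theta(s))\right]$$
The actor is then updated by differentiating through the action itself:
$$\nabla_\theta J(\theta) \approx \mathbb{E}_{s}\left[\nabla_\theta \mu_\theta(s)\, \nabla_a Q(s, a)\big|_{a = \mu_\theta(s)}\right]$$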
This is the main conceptual difference:
- in stochastic actor-critic, the critic says whether a sampled action should become more or less likely
- in deterministic policy gradients, the critic tells the actor how to move the action itself in action space
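The chain rule behind the deterministic update can be checked on a toy problem. This sketch assumes a scalar linear actor `mu(theta, s) = theta * s` and a known, hand-written critic `Q(s, a) = -(a - 2s)^2`, both hypothetical and chosen so the best action is `a = 2s` and the best parameter is `theta = 2`.

```python
def mu(theta, s):
    """Deterministic actor: maps a state to a single action."""
    return theta * s


def dq_da(s, a):
    """Gradient of the toy critic Q(s, a) = -(a - 2s)^2 with respect to the action."""
    return -2.0 * (a - 2.0 * s)


def dmu_dtheta(s):
    """Gradient of the actor output with respect to its parameter."""
    return s


def dpg_step(theta, states, lr=0.1):
    # Chain rule: dJ/dtheta = mean over states of dmu/dtheta * dQ/da at a = mu(s).
    grad = sum(dmu_dtheta(s) * dq_da(s, mu(theta, s)) for s in states) / len(states)
    return theta + lr * grad   # gradient ascent on the critic value


theta = 0.0
states = [0.5, 1.0, 1.5]
for _ in range(200):
    theta = dpg_step(theta, states)
```

No log-probability appears anywhere: the actor parameter moves only because the critic's value changes when the action moves.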
DQN Analogy
The connection to DQN is helpful here. In DQN with discrete actions, we can compute a value for each action and pick the best one. In continuous action spaces, we cannot enumerate all possible actions. Deterministic actor-critic methods replace that search with local optimization:
- the actor proposes an action
- the critic evaluates that action
- the critic gradient $\nabla_a Q(s, a)$ tells the actor how to nudge the action toward a better one
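The contrast between enumerating actions and locally optimizing one can be sketched as follows. The discrete case picks the argmax over a list of action values; the continuous case takes a small gradient step on the action. The critic `q` is a hypothetical closed-form function used only for illustration.

```python
def greedy_discrete(q_values):
    """DQN-style selection: enumerate every action and take the best one."""
    return max(range(len(q_values)), key=lambda a: q_values[a])


def q(s, a):
    """Toy continuous-action critic whose best action is a = 2s (an assumption)."""
    return -(a - 2.0 * s) ** 2


def dq_da(s, a):
    return -2.0 * (a - 2.0 * s)


def nudge_action(s, a, step=0.1, n_steps=1):
    """Local optimization: follow the critic gradient in action space."""
    for _ in range(n_steps):
        a += step * dq_da(s, a)
    return a


best_discrete = greedy_discrete([0.1, 0.7, 0.3])
proposed = 0.0                               # the actor's initial proposal in state s = 1
improved = nudge_action(1.0, proposed)       # one step along the critic gradient
```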
Path 4: DDPG
DDPG takes the deterministic policy gradient idea and combines it with practical stabilization tricks from DQN:
- a replay buffer
- a target critic
- a target actor
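The critic is trained by TD learning, while the actor is trained to maximize the critic’s output:
$$J(\theta) = \mathbb{E}_{s \sim \mathcal{D}}\left[Q_w(s, \mu_\theta(s))\right]$$
where $\mathcal{D}$ is the replay buffer, $Q_w$ is the learned critic, and $\mu_\theta$ is the deterministic actor.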
So DDPG is not changing the goal of policy optimization. It is choosing a particular answer to the same two questions from above:
- What says an action is good: a learned action-value critic $Q_w(s, a)$.
- How is that signal propagated to the actor: by differentiating the critic with respect to the action, and then the action with respect to the actor parameters.
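The three stabilization pieces can be sketched in plain Python. This is a simplified illustration, not a full DDPG implementation: networks are stood in for by ordinary functions and parameter lists, and the soft (Polyak) target update with rate `tau` is the scheme commonly used with DDPG, which this note does not itself specify. All names are hypothetical.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-size store of past transitions, sampled uniformly at random."""

    def __init__(self, capacity, seed=0):
        self.data = deque(maxlen=capacity)
        self.rng = random.Random(seed)

    def add(self, transition):
        self.data.append(transition)

    def sample(self, batch_size):
        return self.rng.sample(list(self.data), batch_size)


def td_target(r, s_next, done, target_actor, target_critic, gamma=0.99):
    """Critic target y = r + gamma * Q'(s', mu'(s')), with no bootstrap at terminals."""
    if done:
        return r
    return r + gamma * target_critic(s_next, target_actor(s_next))


def soft_update(target_params, online_params, tau=0.005):
    """Move each target parameter a fraction tau toward its online counterpart."""
    return [(1.0 - tau) * t + tau * o for t, o in zip(target_params, online_params)]


# Stand-in target networks (assumptions for the example).
target_actor = lambda s: 2.0 * s
target_critic = lambda s, a: 1.0 - (a - 2.0 * s) ** 2

buf = ReplayBuffer(capacity=100)
for i in range(3):
    buf.add((float(i), 0.0, 0.5, float(i + 1), False))   # (s, a, r, s_next, done)
batch = buf.sample(2)
y = td_target(0.5, 1.0, False, target_actor, target_critic, gamma=0.9)
new_target = soft_update([0.0], [1.0], tau=0.1)
```

The critic would be regressed toward `y` on each sampled batch, while the actor follows the critic's action-gradient; the slowly moving targets keep `y` from shifting too quickly.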
Summary Table
| Method | Policy type | Main signal | How the actor updates |
|---|---|---|---|
| REINFORCE | Stochastic | Return / reward-to-go | Make sampled good actions more likely |
| A2C / actor-critic | Stochastic | TD error / advantage / GAE | Same policy-gradient form, but with a learned low-variance scalar estimator |
| Policy Improvement Methods | Stochastic | Advantage plus trust-region / clipped surrogate control | Improve the policy while limiting how far one update can move it |
| Deterministic Policy Gradient Methods | Deterministic | Critic action-gradient $\nabla_a Q(s, a)$ | Move the actor output in the direction that increases critic value |
| DDPG / TD3 | Deterministic | Learned critic with replay + targets | Practical deep RL versions of deterministic policy gradients |
Core Intuition
All of these methods are trying to improve a policy with respect to expected return. The deeper fork is not simply “with or without a critic.” It is:
Are we learning by reweighting sampled actions under a probability distribution, or by directly differentiating a critic through the action chosen by the policy?
That is the conceptual step from vanilla policy gradients to deterministic actor-critic methods.
Sources
- Murphy, K. (2025). Reinforcement Learning: An Overview. Chapter 3.