Deterministic Policy Gradient Methods

Note

For the conceptual transition from vanilla policy gradients to deterministic policy gradients, see From Policy Gradients to DDPG.

Here, we consider an actor-critic method that uses a deterministic policy, which predicts a unique action for each state, so $a_t = \mu_{\mathbf{\theta}}(s_t)$ rather than $a_t \sim \pi_{\mathbf{\theta}}(\cdot \mid s_t)$. The actor is trained to match the optimal action from $\arg\max_a Q_{\mathbf{w}}(s, a)$. Thus, we can think of the resulting method as a version of DQN designed for continuous actions.

The benefit of using a deterministic policy, as opposed to a stochastic policy, is that we can modify the policy gradient method so that it works off-policy using data from a replay buffer, without needing importance sampling ratios. In addition, the feedback signal for learning is based on the vector-valued gradient of the critic with respect to the action, $\nabla_a Q_{\mathbf{w}}(s, a)$, which tells the actor how to change the action to increase the predicted value. This is often a richer signal for the actor than a single scalar weighting term such as a reward, TD error, or advantage estimate.

Note

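A stochastic policy gradient update has the form $\Psi_t \nabla_{\mathbf{\theta}} \log \pi_{\mathbf{\theta}}(a_t \mid s_t)$, where $\Psi_t$ is a scalar weight such as a return, TD error, or advantage estimate, so the actor only gets a scalar judgment of the sampled action: roughly, whether that sampled action should become more likely or less likely in the future. By contrast, a deterministic policy gradient uses $\nabla_a Q_{\mathbf{w}}(s, a)$, which gives a direction in action space for how the actor should change its output.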

This is analogous to the role of $\arg\max_a Q_{\mathbf{w}}(s, a)$ in DQN. In the discrete case, we can evaluate several candidate actions and choose the one with the largest Q-value. In a continuous action space, we cannot enumerate all possible actions, so instead we use the gradient of the critic with respect to the action to locally adjust the actor toward actions with larger predicted value. In that sense, the critic is not just saying “this sampled action was good/bad”; it is also saying “move the action a little in this direction.” This is particularly useful in continuous action spaces.
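To make the "move the action a little in this direction" idea concrete, here is a minimal sketch with a hand-written toy critic, $Q(s, a) = -\lVert a - a^*(s) \rVert^2$, whose best action is known in closed form. The names `best_action`, `toy_critic`, `toy_critic_action_grad`, and the choice $a^*(s) = 2s$ are assumptions made up for illustration; a learned critic would supply this gradient through automatic differentiation instead.

```python
import numpy as np

def best_action(state):
    # Hypothetical optimum, used only to construct the toy critic.
    return 2.0 * state

def toy_critic(state, action):
    # Q(s, a) = -||a - a*(s)||^2, maximized at a = a*(s).
    return -np.sum((action - best_action(state)) ** 2)

def toy_critic_action_grad(state, action):
    # Analytic gradient of the toy critic with respect to the action.
    return -2.0 * (action - best_action(state))

def improve_action(state, action, step_size=0.1, num_steps=50):
    # Gradient ascent in action space, following the critic's local advice.
    for _ in range(num_steps):
        action = action + step_size * toy_critic_action_grad(state, action)
    return action

state = np.array([0.5, -1.0])
initial_action = np.zeros(2)
final_action = improve_action(state, initial_action)
```

No action is ever enumerated or sampled here: each step only needs the local slope of the critic at the current action.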

Caution

A deterministic policy does not explore by itself, since it always outputs the same action for a given state. In practice, algorithms such as DDPG therefore add external exploration noise during data collection.
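A minimal sketch of such external exploration noise, assuming zero-mean Gaussian noise and a symmetric action range of $[-1, 1]$. The helper name `noisy_action` and the numbers are illustrative assumptions; other noise processes can be substituted.

```python
import numpy as np

def noisy_action(deterministic_action, noise_std, low, high, rng):
    # Add zero-mean Gaussian noise, then clip back into the valid action range.
    noise = rng.normal(0.0, noise_std, size=deterministic_action.shape)
    return np.clip(deterministic_action + noise, low, high)

rng = np.random.default_rng(0)
mu_s = np.array([0.9, -0.2])  # stand-in for the actor output at one state
samples = np.stack(
    [noisy_action(mu_s, 0.1, -1.0, 1.0, rng) for _ in range(1000)]
)
```

The noise is used only when collecting data. The actor update itself still differentiates through the noise-free output $\mu_{\mathbf{\theta}}(s)$.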

Deterministic Policy Gradient Theorem

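For a deterministic policy $\mu_{\mathbf{\theta}}$, the value objective can be written in terms of the critic as

$$J(\mu_{\mathbf{\theta}}) = \mathbb{E}_{s_0 \sim p_0}\left[ Q^{\mu_{\mathbf{\theta}}}(s_0, \mu_{\mathbf{\theta}}(s_0)) \right],$$

where $p_0$ is the initial state distribution and $Q^{\mu_{\mathbf{\theta}}}$ is the true action-value function of $\mu_{\mathbf{\theta}}$.

The deterministic policy gradient theorem says that

$$\begin{aligned}
\nabla_{\mathbf{\theta}} J(\mu_{\mathbf{\theta}})
&= \mathbb{E}_{s \sim \rho_{\mu_{\mathbf{\theta}}}}\left[ \nabla_{\mathbf{\theta}} Q^{\mu_{\mathbf{\theta}}}(s, \mu_{\mathbf{\theta}}(s)) \right] \\
&= \mathbb{E}_{s \sim \rho_{\mu_{\mathbf{\theta}}}}\left[ \nabla_{\mathbf{\theta}} \mu_{\mathbf{\theta}}(s) \, \nabla_a Q^{\mu_{\mathbf{\theta}}}(s, a) \big|_{a = \mu_{\mathbf{\theta}}(s)} \right],
\end{aligned}$$

where $\rho_{\mu_{\mathbf{\theta}}}$ is the discounted state distribution of $\mu_{\mathbf{\theta}}$. In the first line, the gradient acts only through the action argument $\mu_{\mathbf{\theta}}(s)$, with the critic held fixed as a function of $(s, a)$.

The second line is just the chain rule applied to the composition

$$\mathbf{\theta} \;\mapsto\; a = \mu_{\mathbf{\theta}}(s) \;\mapsto\; Q^{\mu_{\mathbf{\theta}}}(s, a).$$

If we define $a = \mu_{\mathbf{\theta}}(s)$, then for a fixed state $s$ we have

$$\underbrace{\nabla_{\mathbf{\theta}} Q^{\mu_{\mathbf{\theta}}}(s, \mu_{\mathbf{\theta}}(s))}_{N \times 1} = \underbrace{\nabla_{\mathbf{\theta}} \mu_{\mathbf{\theta}}(s)}_{N \times M} \; \underbrace{\nabla_a Q^{\mu_{\mathbf{\theta}}}(s, a) \big|_{a = \mu_{\mathbf{\theta}}(s)}}_{M \times 1}.$$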

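Here $\nabla_{\mathbf{\theta}} \mu_{\mathbf{\theta}}(s)$ is the $N \times M$ Jacobian matrix, whose $j$-th column is the gradient of the $j$-th action component with respect to $\mathbf{\theta}$, and $N$ and $M$ are the dimensions of $\mathbf{\theta}$ and $a$, respectively. Intuitively, the actor first asks “if I perturb the parameters, how does my chosen action move?” and then the critic asks “if I perturb the action, how does the predicted return change?” Multiplying these two sensitivities gives the direction in parameter space that most increases the value.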
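The chain-rule product can be checked numerically on a small example. The sketch below assumes a linear actor $a = W s$ (so the parameters are the entries of $W$) and a hand-written quadratic critic. Both are illustrative choices, not part of DDPG itself. For this actor, the Jacobian-times-gradient product reduces to an outer product, which is compared against central finite differences.

```python
import numpy as np

rng = np.random.default_rng(0)
state_dim, action_dim = 3, 2
W = rng.normal(size=(action_dim, state_dim))  # actor parameters theta
B = rng.normal(size=(action_dim, state_dim))  # fixed critic coefficients
s = rng.normal(size=state_dim)

def actor(W, s):
    return W @ s

def critic(s, a):
    # Toy critic Q(s, a) = -0.5 * ||a||^2 + (B s) . a
    return -0.5 * a @ a + (B @ s) @ a

def critic_action_grad(s, a):
    return -a + B @ s

def dpg_gradient(W, s):
    # Chain rule: dQ/dW[i, j] = (dQ/da_i) * (da_i/dW[i, j]) = (dQ/da_i) * s_j.
    grad_a = critic_action_grad(s, actor(W, s))
    return np.outer(grad_a, s)

def finite_difference_gradient(W, s, eps=1e-6):
    grad = np.zeros_like(W)
    for i in range(W.shape[0]):
        for j in range(W.shape[1]):
            W_plus, W_minus = W.copy(), W.copy()
            W_plus[i, j] += eps
            W_minus[i, j] -= eps
            q_plus = critic(s, actor(W_plus, s))
            q_minus = critic(s, actor(W_minus, s))
            grad[i, j] = (q_plus - q_minus) / (2 * eps)
    return grad

analytic = dpg_gradient(W, s)
numeric = finite_difference_gradient(W, s)
```

In a deep learning framework the same product is obtained implicitly by backpropagating through the critic into the actor, so the Jacobian is never formed explicitly.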

As a limiting case, if we consider a family of stochastic policies that becomes increasingly concentrated around $\mu_{\mathbf{\theta}}(s)$, then the standard stochastic policy gradient approaches the deterministic policy gradient above as the policy noise goes to zero.
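This limit can be illustrated in one dimension. The sketch below assumes a single fixed state, a Gaussian policy $a \sim \mathcal{N}(\theta, \sigma^2)$, and a made-up critic $Q(a) = \sin(a)$. The score-function expectation is evaluated with Gauss-Hermite quadrature rather than sampling, so the comparison is not blurred by Monte Carlo noise.

```python
import numpy as np

def critic(a):
    # Hypothetical 1-D critic for a single fixed state.
    return np.sin(a)

def deterministic_gradient(theta):
    # Actor mu_theta = theta, so d mu / d theta = 1 and the
    # deterministic policy gradient is dQ/da evaluated at a = theta.
    return np.cos(theta)

def stochastic_gradient(theta, sigma, num_nodes=60):
    # Score-function gradient E[ d/dtheta log N(a; theta, sigma^2) * Q(a) ]
    # with a = theta + sigma * z and z ~ N(0, 1), so the score is z / sigma.
    z, weights = np.polynomial.hermite_e.hermegauss(num_nodes)
    weights = weights / np.sqrt(2.0 * np.pi)  # normalize to an N(0, 1) expectation
    return np.sum(weights * (z / sigma) * critic(theta + sigma * z))

theta = 0.3
sigmas = (1.0, 0.5, 0.1, 0.01)
gaps = [
    abs(stochastic_gradient(theta, sigma) - deterministic_gradient(theta))
    for sigma in sigmas
]
```

For this critic the stochastic gradient equals $\cos(\theta)\, e^{-\sigma^2/2}$ in closed form, so the gap to the deterministic gradient $\cos(\theta)$ shrinks monotonically as $\sigma \to 0$.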

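Note that this gradient integrates over states, but not over sampled actions. This removes one source of sampling variance, since the actor does not need to sample an action and then weight it by a scalar advantage estimate. However, a deterministic policy does not explore by itself, so in practice we collect data using a stochastic behavior policy and reuse those transitions from a replay buffer. If $\rho_b$ denotes the discounted state distribution of the behavior policy, we can define the off-policy objective

$$J_b(\mu_{\mathbf{\theta}}) = \mathbb{E}_{s \sim \rho_b}\left[ Q^{\mu_{\mathbf{\theta}}}(s, \mu_{\mathbf{\theta}}(s)) \right].$$

The off-policy deterministic policy gradient has the same chain-rule form, but with expectation taken over states from the replay distribution:

$$\nabla_{\mathbf{\theta}} J_b(\mu_{\mathbf{\theta}}) \approx \mathbb{E}_{s \sim \rho_b}\left[ \nabla_{\mathbf{\theta}} \mu_{\mathbf{\theta}}(s) \, \nabla_a Q^{\mu_{\mathbf{\theta}}}(s, a) \big|_{a = \mu_{\mathbf{\theta}}(s)} \right].$$

The approximation drops a term that depends on $\nabla_{\mathbf{\theta}} Q^{\mu_{\mathbf{\theta}}}(s, a)$, which is hard to estimate.

The important point is that we still differentiate the critic with respect to the action chosen by the current actor, even though the states themselves may have come from older policies. In practice, we replace $Q^{\mu_{\mathbf{\theta}}}$ with a learned critic $Q_{\mathbf{w}}$ and fit it using TD learning. This gives the updates

$$\begin{aligned}
\delta &= r + \gamma Q_{\mathbf{w}}(s', \mu_{\mathbf{\theta}}(s')) - Q_{\mathbf{w}}(s, a), \\
\mathbf{w} &\gets \mathbf{w} + \eta_{\mathbf{w}} \, \delta \, \nabla_{\mathbf{w}} Q_{\mathbf{w}}(s, a), \\
\mathbf{\theta} &\gets \mathbf{\theta} + \eta_{\mathbf{\theta}} \, \nabla_{\mathbf{\theta}} \mu_{\mathbf{\theta}}(s) \, \nabla_a Q_{\mathbf{w}}(s, a) \big|_{a = \mu_{\mathbf{\theta}}(s)}.
\end{aligned}$$

So we learn both a state-action critic $Q_{\mathbf{w}}$ and an actor $\mu_{\mathbf{\theta}}$. The actor update avoids importance sampling because the deterministic policy gradient involves no expectation over actions from the behavior policy. The critic update avoids it because the Q-learning-style TD target bootstraps from the current actor's action at $s'$ rather than from the action the behavior policy took next.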
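The TD-error, critic, and actor updates can be sketched for a single transition using linear function approximators. Everything concrete here is an assumption for illustration: the scalar action, the feature map `features`, the linear actor $\mu_{\mathbf{\theta}}(s) = \mathbf{\theta}^\top s$, and the learning rates.

```python
import numpy as np

def features(s, a):
    # Hypothetical critic features phi(s, a) for a scalar action.
    return np.concatenate([s, [a, a * a]])

def actor(theta, s):
    return float(theta @ s)  # mu_theta(s), linear in the state

def critic(w, s, a):
    return float(w @ features(s, a))  # Q_w(s, a), linear in the features

def critic_action_grad(w, a):
    # d/da of w . [s, a, a^2]: only the last two features depend on a.
    return w[-2] + 2.0 * w[-1] * a

def dpg_step(theta, w, transition, gamma=0.99, lr_actor=1e-2, lr_critic=1e-2):
    s, a, r, s_next = transition
    # TD error, bootstrapping with the current actor's action at s_next.
    delta = r + gamma * critic(w, s_next, actor(theta, s_next)) - critic(w, s, a)
    # Semi-gradient TD update of the critic (grad_w Q_w = phi(s, a)).
    new_w = w + lr_critic * delta * features(s, a)
    # Actor update: grad_theta mu_theta(s) = s, times dQ/da at a = mu_theta(s).
    a_pi = actor(theta, s)
    new_theta = theta + lr_actor * s * critic_action_grad(w, a_pi)
    return new_theta, new_w, delta

theta = np.array([0.1, -0.2])
w = np.array([0.0, 0.0, 1.0, -0.5])
transition = (np.array([1.0, 0.5]), 0.3, 1.0, np.array([0.9, 0.4]))
theta, w, delta = dpg_step(theta, w, transition)
```

Note that the critic is evaluated at the logged action `a` when computing the TD error, but at the actor's own action `a_pi` when computing the actor gradient.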

DDPG

DDPG is an off-policy actor-critic policy gradient algorithm. ¹

The DDPG (deep deterministic policy gradient) algorithm combines deterministic policy gradients with the stabilizing tricks from DQN: a replay buffer and slowly updated target networks. The critic is trained to satisfy a Bellman target, and the actor is trained to choose actions that maximize the critic.

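Writing the actor objective as a loss, we therefore minimize the negative critic value

$$\mathcal{L}_{\mathbf{\theta}} = -\mathbb{E}_{s \sim \mathcal{D}}\left[ Q_{\mathbf{w}}(s, \mu_{\mathbf{\theta}}(s)) \right],$$

where the loss is averaged over states drawn from a replay buffer $\mathcal{D}$. This is just gradient ascent on $Q_{\mathbf{w}}(s, \mu_{\mathbf{\theta}}(s))$ written in loss-minimization form.

The critic minimizes the 1-step TD loss, as in Q-learning,

$$\mathcal{L}_{\mathbf{w}} = \mathbb{E}_{(s, a, r, s', d) \sim \mathcal{D}}\left[ \big( Q_{\mathbf{w}}(s, a) - \mathrm{stopgrad}(y) \big)^2 \right],$$

where the sample $(s, a, r, s', d)$ is drawn from a replay buffer and the bootstrap target $y$ is treated as a constant during differentiation. To keep this target stable, DDPG uses both a target critic $Q_{\overline{\mathbf{w}}}$ and a target actor $\mu_{\overline{\mathbf{\theta}}}$, so the TD target is

$$y = r + \gamma (1 - d) \, Q_{\overline{\mathbf{w}}}(s', \mu_{\overline{\mathbf{\theta}}}(s')),$$

where $d = 1$ if $s'$ is terminal and $d = 0$ otherwise.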
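As a small worked example of the TD target and critic loss on a minibatch, the sketch below uses made-up numbers. The array `next_q_values` stands in for the target critic evaluated at the target actor's action, and `critic_predictions` for the online critic's outputs. Both are assumptions for illustration.

```python
import numpy as np

def td_targets(rewards, dones, next_q_values, gamma=0.99):
    # y = r + gamma * (1 - d) * Q_target(s', mu_target(s')).
    # Terminal transitions (d = 1) do not bootstrap.
    return rewards + gamma * (1.0 - dones) * next_q_values

rewards = np.array([1.0, 0.0, -1.0])
dones = np.array([0.0, 1.0, 0.0])
next_q_values = np.array([10.0, 10.0, 2.0])
y = td_targets(rewards, dones, next_q_values)

critic_predictions = np.array([10.0, 1.0, 0.0])
# Mean squared TD error; y is a plain array here, so it is constant by construction.
critic_loss = np.mean((critic_predictions - y) ** 2)
```

The second transition is terminal, so its target is just the reward, regardless of the bootstrap value.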

The target networks change slowly, which helps prevent the critic from chasing a rapidly moving bootstrap target.
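How slowly the targets move can be seen from a minimal sketch of the soft update $\overline{\mathbf{w}} \gets \rho \overline{\mathbf{w}} + (1 - \rho) \mathbf{w}$. Plain Python lists stand in for network weights, and the value $\rho = 0.995$ is an assumed, typical choice rather than a prescribed one.

```python
def polyak_update(target, online, rho):
    # target <- rho * target + (1 - rho) * online, applied elementwise.
    return [rho * t + (1.0 - rho) * o for t, o in zip(target, online)]

online = [1.0, -2.0]  # stand-in for freshly updated network weights
target = [0.0, 0.0]   # stand-in for the target network weights
rho = 0.995

history = []
for step in range(1000):
    target = polyak_update(target, online, rho)
    history.append(target[0])
```

With the online weights held fixed, the target closes the gap geometrically: after $n$ updates it has covered a fraction $1 - \rho^n$ of the distance, so the effective time scale is roughly $1 / (1 - \rho)$ updates.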

Algorithm 2 DDPG

Initialize actor parameters $\mathbf{\theta}$ , critic parameters $\mathbf{w}$

Initialize target actor $\overline{\mathbf{\theta}} \gets \mathbf{\theta}$ , target critic $\overline{\mathbf{w}} \gets \mathbf{w}$

Initialize replay buffer $\mathcal{D} \gets \emptyset$

repeat

sample starting state $s_0$ of a new episode

for $t=0,1,2,\dots$ do

$a_t \gets \mu_{\mathbf{\theta}}(s_t) + \varepsilon_t$ // exploration noise

$(s_{t+1}, r_t, \mathrm{done}) \gets$ env.step( $s_t, a_t$ )

$\mathcal{D} \gets \mathcal{D} \cup \{(s_t, a_t, r_t, s_{t+1}, \mathrm{done})\}$

Sample minibatch $\mathcal{B} \subset \mathcal{D}$

for each $(s,a,r,s',d) \in \mathcal{B}$ do

$y \gets r + \gamma (1-d) Q_{\overline{\mathbf{w}}}(s', \mu_{\overline{\mathbf{\theta}}}(s'))$ // TD target

end for

$\mathcal{L}_{\mathbf{w}} \gets \frac{1}{|\mathcal{B}|} \sum_{(s,a,r,s',d) \in \mathcal{B}} \Big( Q_{\mathbf{w}}(s,a) - \mathrm{stopgrad}(y) \Big)^2$

$\mathbf{w} \gets \mathbf{w} - \eta_{\mathbf{w}} \nabla_{\mathbf{w}} \mathcal{L}_{\mathbf{w}}$ // critic

$\mathcal{L}_{\mathbf{\theta}} \gets -\frac{1}{|\mathcal{B}|} \sum_{s \in \mathcal{B}} Q_{\mathbf{w}}(s, \mu_{\mathbf{\theta}}(s))$

$\mathbf{\theta} \gets \mathbf{\theta} - \eta_{\mathbf{\theta}} \nabla_{\mathbf{\theta}} \mathcal{L}_{\mathbf{\theta}}$ // actor

$\overline{\mathbf{w}} \gets \rho \overline{\mathbf{w}} + (1-\rho) \mathbf{w}$ // target critic

$\overline{\mathbf{\theta}} \gets \rho \overline{\mathbf{\theta}} + (1-\rho) \mathbf{\theta}$ // target actor

if $\mathrm{done}$ then

break

end if

end for

until converged

Implementation

import random
from collections import deque
 
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
 
 
class Actor(nn.Module):
    def __init__(self, state_dim, action_dim, hidden_dim=256, action_limit=1.0):
        super().__init__()
        self.action_limit = action_limit
        self.network = nn.Sequential(
            nn.Linear(state_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, action_dim),
            nn.Tanh(),
        )
 
    def forward(self, state):
        if isinstance(state, np.ndarray):
            state = torch.from_numpy(state)
        if state.ndim == 1:
            state = state.unsqueeze(0)
        state = state.float()
        return self.action_limit * self.network(state)
 
 
class Critic(nn.Module):
    def __init__(self, state_dim, action_dim, hidden_dim=256):
        super().__init__()
        self.network = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )
 
    def forward(self, state, action):
        if isinstance(state, np.ndarray):
            state = torch.from_numpy(state)
        if isinstance(action, np.ndarray):
            action = torch.from_numpy(action)
        if state.ndim == 1:
            state = state.unsqueeze(0)
        if action.ndim == 1:
            action = action.unsqueeze(0)
        state = state.float()
        action = action.float()
        return self.network(torch.cat([state, action], dim=-1)).squeeze(-1)
 
 
class ReplayBuffer:
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)
 
    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))
 
    def sample(self, batch_size):
        batch = random.sample(self.buffer, batch_size)
        states, actions, rewards, next_states, dones = zip(*batch)
        return (
            torch.tensor(np.array(states), dtype=torch.float32),
            torch.tensor(np.array(actions), dtype=torch.float32),
            torch.tensor(rewards, dtype=torch.float32),
            torch.tensor(np.array(next_states), dtype=torch.float32),
            torch.tensor(dones, dtype=torch.float32),
        )
 
    def __len__(self):
        return len(self.buffer)
 
 
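def soft_update(target, source, tau):
    # Polyak averaging: target <- (1 - tau) * target + tau * source.
    # Here tau plays the role of (1 - rho) in Algorithm 2.
    with torch.no_grad():
        for target_param, source_param in zip(target.parameters(), source.parameters()):
            target_param.mul_(1 - tau).add_(source_param, alpha=tau)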
 
 
def train(
    env,
    num_episodes: int,
    gamma: float = 0.99,
    actor_lr: float = 1e-4,
    critic_lr: float = 1e-3,
    tau: float = 0.005,
    batch_size: int = 64,
    noise_std: float = 0.1,
):
    state_dim = env.observation_space.shape[0]
    action_dim = env.action_space.shape[0]
    action_limit = float(env.action_space.high[0])
 
    actor = Actor(state_dim, action_dim, action_limit=action_limit)
    critic = Critic(state_dim, action_dim)
    target_actor = Actor(state_dim, action_dim, action_limit=action_limit)
    target_critic = Critic(state_dim, action_dim)
 
    target_actor.load_state_dict(actor.state_dict())
    target_critic.load_state_dict(critic.state_dict())
 
    actor_optimizer = optim.Adam(actor.parameters(), lr=actor_lr)
    critic_optimizer = optim.Adam(critic.parameters(), lr=critic_lr)
    replay_buffer = ReplayBuffer()
 
    for episode in range(num_episodes):
        state, _ = env.reset()
 
        while True:
            with torch.no_grad():
                action = actor(state).squeeze(0).numpy()
            noise = np.random.normal(0.0, noise_std, size=action_dim)
            action = np.clip(action + noise, -action_limit, action_limit)
 
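            next_state, reward, terminated, truncated, _ = env.step(action)
            done = terminated or truncated
            # Store `terminated`, not `done`: a time-limit truncation should
            # still bootstrap from next_state in the TD target.
            replay_buffer.add(state, action, reward, next_state, terminated)
            state = next_state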
 
            if len(replay_buffer) >= batch_size:
                states, actions, rewards, next_states, dones \
                     = replay_buffer.sample(batch_size)
 
                with torch.no_grad():
                    next_actions = target_actor(next_states)
                    td_target = rewards + (
                        gamma * (1 - dones) 
                              * target_critic(next_states, next_actions)
                    )
 
                critic_loss = F.mse_loss(critic(states, actions), td_target)
                critic_optimizer.zero_grad()
                critic_loss.backward()
                critic_optimizer.step()
 
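                # This backward pass also fills the critic's gradients. That is
                # harmless: only the actor optimizer steps here, and the critic's
                # gradients are zeroed before its next update.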
                actor_loss = -critic(states, actor(states)).mean()
                actor_optimizer.zero_grad()
                actor_loss.backward()
                actor_optimizer.step()
 
                soft_update(target_actor, actor, tau)
                soft_update(target_critic, critic, tau)
 
            if done:
                break
 
    return actor

Twin Delayed DDPG (TD3)

TD3 is an off-policy actor-critic policy gradient algorithm. ²

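The TD3 algorithm extends DDPG in three main ways. First, it uses target policy smoothing, in which clipped noise is added to the target action, to encourage the critic to generalize across similar actions:

$$\tilde{a}' = \mu_{\overline{\mathbf{\theta}}}(s') + \mathrm{clip}(\varepsilon, -c, c), \qquad \varepsilon \sim \mathcal{N}(\mathbf{0}, \sigma^2 \mathbf{I}).$$

Second, it uses clipped double Q-learning, where two critics $Q_{\mathbf{w}_1}$ and $Q_{\mathbf{w}_2}$ are learned and the target values for TD learning are defined using the smaller of the two target-critic estimates, which counteracts overestimation bias:

$$y = r + \gamma (1 - d) \min_{i = 1, 2} Q_{\overline{\mathbf{w}}_i}(s', \tilde{a}').$$

Third, it uses delayed policy updates, in which the actor and the target networks are updated less often than the critics (for example, once every two critic updates), so that the policy is only updated after the value function has had time to stabilize.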
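The three modifications can be sketched as a target computation plus an update schedule. This is an illustrative NumPy sketch, not a full TD3 implementation: the lambda "networks" are hypothetical stand-ins, and the defaults `noise_std=0.2`, `noise_clip=0.5`, and `policy_delay=2` are commonly used values assumed here.

```python
import numpy as np

def td3_targets(rewards, dones, next_states, target_actor, target_critic_1,
                target_critic_2, rng, gamma=0.99, noise_std=0.2,
                noise_clip=0.5, action_limit=1.0):
    # Target policy smoothing: perturb the target action with clipped noise.
    next_actions = target_actor(next_states)
    noise = np.clip(rng.normal(0.0, noise_std, size=next_actions.shape),
                    -noise_clip, noise_clip)
    next_actions = np.clip(next_actions + noise, -action_limit, action_limit)
    # Clipped double Q-learning: bootstrap from the smaller target estimate.
    q_min = np.minimum(target_critic_1(next_states, next_actions),
                       target_critic_2(next_states, next_actions))
    return rewards + gamma * (1.0 - dones) * q_min

def should_update_actor(step, policy_delay=2):
    # Delayed policy updates: one actor update per `policy_delay` critic updates.
    return step % policy_delay == 0

# Hypothetical stand-ins for the target networks, for illustration only.
target_actor = lambda s: np.tanh(s.sum(axis=-1, keepdims=True))
target_critic_1 = lambda s, a: s.sum(axis=-1) + a[:, 0]
target_critic_2 = lambda s, a: s.sum(axis=-1) - a[:, 0] + 0.5

rng = np.random.default_rng(0)
next_states = np.array([[0.1, 0.2], [0.0, -0.3]])
rewards = np.array([1.0, 0.5])
dones = np.array([0.0, 1.0])
y = td3_targets(rewards, dones, next_states, target_actor,
                target_critic_1, target_critic_2, rng)
schedule = [should_update_actor(t) for t in range(4)]
```

Both critics are regressed toward the same target `y`, while the actor is typically updated using only the first critic.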

Sources

Murphy, K. (2025). Reinforcement Learning: An Overview. Chapter 3.

Jake Tuero
