Deterministic Policy Gradient Methods
Note
For the conceptual transition from vanilla policy gradients to deterministic policy gradients, see From Policy Gradients to DDPG.
Here, we consider an actor-critic method that uses a deterministic policy, which predicts a unique action for each state, so $a_t = \mu_\theta(s_t)$ rather than $a_t \sim \pi_\theta(\cdot \mid s_t)$.
The benefit of using a deterministic policy, as opposed to a stochastic policy, is that we can modify the policy gradient method so that it works off-policy using data from a replay buffer, without needing importance sampling ratios. In addition, the feedback signal for learning is based on the vector-valued gradient of the critic with respect to the action, $\nabla_a Q(s, a)$, which is more informative than a scalar return or advantage estimate.
Note
A stochastic policy gradient update has the form $\nabla_\theta \log \pi_\theta(a \mid s)\, \hat{A}(s, a)$, where $\hat{A}(s, a)$ is a scalar advantage estimate, so the actor only gets a scalar judgment of the sampled action: roughly, whether that sampled action should become more likely or less likely in the future. By contrast, a deterministic policy gradient uses $\nabla_a Q(s, a)$, which gives a direction in action space for how the actor should change its output. This is analogous to the role of $\arg\max_a Q(s, a)$ in DQN. In the discrete case, we can evaluate several candidate actions and choose the one with the largest Q-value. In a continuous action space, we cannot enumerate all possible actions, so instead we use the gradient of the critic with respect to the action to locally adjust the actor toward actions with larger predicted value. In that sense, the critic is not just saying "this sampled action was good/bad"; it is also saying "move the action a little in this direction." This is particularly useful in continuous action spaces.
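To make the contrast concrete, here is a minimal sketch in plain Python. It assumes a hypothetical one-dimensional toy critic, `critic(s, a) = -(a - 2s)^2`, whose action-gradient can be written by hand; neither function is part of DDPG itself. Repeatedly stepping the action along the critic's action-gradient moves it toward the action the critic rates highest, without enumerating candidate actions.

```python
def critic(s, a):
    """Toy critic (an assumption for illustration): the best action is a = 2 * s."""
    return -(a - 2.0 * s) ** 2


def critic_action_grad(s, a):
    """Hand-written dQ/da for the toy critic above."""
    return -2.0 * (a - 2.0 * s)


def improve_action(s, a, step_size=0.1, num_steps=50):
    """Follow the critic's action-gradient starting from an initial action guess."""
    for _ in range(num_steps):
        a = a + step_size * critic_action_grad(s, a)
    return a


s = 1.5
a_final = improve_action(s, a=0.0)  # approaches the critic's preferred action 2 * s
```

With a neural critic, the same quantity would typically come from automatic differentiation rather than a hand-written derivative.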
Caution
A deterministic policy does not explore by itself, since it always outputs the same action for a given state. In practice, algorithms such as DDPG therefore add external exploration noise during data collection.
Deterministic Policy Gradient Theorem
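For a deterministic policy $\mu_\theta$, let $J(\mu_\theta)$ denote the expected discounted return, let $\rho_{\mu_\theta}$ denote the discounted state distribution induced by the policy, and let $Q_{\mu_\theta}(s, a)$ denote its action-value function.
The deterministic policy gradient theorem says that

$$
\begin{aligned}
\nabla_\theta J(\mu_\theta)
&= \mathbb{E}_{s \sim \rho_{\mu_\theta}}\left[ \nabla_\theta Q_{\mu_\theta}(s, \mu_\theta(s)) \right] \\
&= \mathbb{E}_{s \sim \rho_{\mu_\theta}}\left[ \nabla_\theta \mu_\theta(s)\, \nabla_a Q_{\mu_\theta}(s, a) \big|_{a = \mu_\theta(s)} \right],
\end{aligned}
$$

where the gradient in the first line acts only through the action argument of the critic.
The second line is just the chain rule applied to the composition $\theta \mapsto \mu_\theta(s) \mapsto Q_{\mu_\theta}(s, \mu_\theta(s))$.
If we define $a = \mu_\theta(s)$, then for a fixed state $\nabla_\theta Q_{\mu_\theta}(s, \mu_\theta(s)) = \nabla_\theta \mu_\theta(s)\, \nabla_a Q_{\mu_\theta}(s, a)$.
Here $\nabla_\theta \mu_\theta(s)$ is the $N \times M$ matrix of partial derivatives $\partial [\mu_\theta(s)]_j / \partial \theta_i$ (the transposed Jacobian of the actor), where $N$ is the number of actor parameters and $M$ is the action dimension, and $\nabla_a Q_{\mu_\theta}(s, a)$ is an $M$-vector, so the product is an $N$-vector with the same shape as $\theta$.
As a limiting case, if we consider a family of stochastic policies that becomes increasingly concentrated around $\mu_\theta(s)$, the stochastic policy gradient converges to the deterministic policy gradient as the policy variance goes to zero.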
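The chain-rule form of the theorem can be checked numerically for a single state. The sketch below is only an illustration under stated assumptions: a hypothetical affine actor `actor(theta, s) = theta[0] + theta[1] * s` and a hypothetical toy critic `critic(s, a) = -(a - 2s)^2`, both chosen so the derivatives can be written by hand. It compares the product of the actor Jacobian and the critic's action-gradient against a finite-difference gradient of the composed function.

```python
def actor(theta, s):
    """Hypothetical affine actor with a scalar action."""
    return theta[0] + theta[1] * s


def critic(s, a):
    """Hypothetical toy critic, treated as a fixed function of (s, a)."""
    return -(a - 2.0 * s) ** 2


def dpg_gradient(theta, s):
    """Chain rule: (d mu / d theta) * (dQ / da) evaluated at a = mu_theta(s)."""
    a = actor(theta, s)
    dq_da = -2.0 * (a - 2.0 * s)   # gradient of the critic w.r.t. the action
    dmu_dtheta = [1.0, s]          # derivative of the actor output w.r.t. each parameter
    return [d * dq_da for d in dmu_dtheta]


def finite_difference_gradient(theta, s, eps=1e-6):
    """Central differences of theta -> Q(s, mu_theta(s)), used as a reference."""
    grad = []
    for i in range(len(theta)):
        up, down = list(theta), list(theta)
        up[i] += eps
        down[i] -= eps
        grad.append((critic(s, actor(up, s)) - critic(s, actor(down, s))) / (2 * eps))
    return grad


theta = [0.5, -1.0]
s = 1.5
analytic = dpg_gradient(theta, s)
numeric = finite_difference_gradient(theta, s)
```

The two gradients agree up to finite-difference error, which is the single-state version of the second line of the theorem.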
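Note that this gradient integrates over states, but not over sampled actions. This removes one source of sampling variance, since the actor does not need to sample an action and then weight it by a scalar advantage estimate. However, a deterministic policy does not explore by itself, so in practice we collect data using a stochastic behavior policy $\pi_b$ (for example, the actor's output plus noise) and store the resulting transitions in a replay buffer. Let $\rho_{\pi_b}$ denote the state distribution of this data.
The off-policy deterministic policy gradient has the same chain-rule form, but with expectation taken over states from the replay distribution:

$$
\nabla_\theta J_b(\mu_\theta) \approx \mathbb{E}_{s \sim \rho_{\pi_b}}\left[ \nabla_\theta \mu_\theta(s)\, \nabla_a Q_{\mu_\theta}(s, a) \big|_{a = \mu_\theta(s)} \right].
$$

The important point is that we still differentiate the critic with respect to the action chosen by the current actor, even though the states themselves may have come from older policies. In practice, we replace the true action-value function $Q_{\mu_\theta}$ with a learned critic $Q_w \approx Q_{\mu_\theta}$, trained by TD learning.
So we learn both a state-action critic $Q_w(s, a)$ and a deterministic actor $\mu_\theta(s)$.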
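The off-policy actor update can be sketched as follows. This is an illustration under assumptions, not the DDPG implementation: the actor is a hypothetical one-parameter map `actor(theta, s) = theta * s`, the critic is a fixed toy function with a hand-written action-gradient, and the replay data is made up. The stored actions from the older behavior policy are never used in the actor update; only the stored states are, and the action is recomputed by the current actor.

```python
def actor(theta, s):
    """Hypothetical one-parameter actor."""
    return theta * s


def critic_action_grad(s, a):
    """dQ/da for a toy critic Q(s, a) = -(a - 2s)^2, standing in for a learned critic."""
    return -2.0 * (a - 2.0 * s)


def actor_step(theta, replay_states, lr=0.05):
    """One gradient-ascent step on the average critic value over replay states."""
    grad = 0.0
    for s in replay_states:
        a = actor(theta, s)                   # action of the CURRENT actor
        grad += s * critic_action_grad(s, a)  # (d mu / d theta) * (dQ / da)
    return theta + lr * grad / len(replay_states)


# Made-up replay data: (state, action taken by an older, noisy behavior policy).
replay = [(0.5, 0.9), (1.0, -0.3), (2.0, 1.7)]
states = [s for s, _stored_action in replay]

theta = 0.0
for _ in range(200):
    theta = actor_step(theta, states)
```

For this toy critic the best policy is `a = 2 * s`, so `theta` should approach 2 regardless of which actions happen to be stored in the buffer.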
DDPG
DDPG is an off-policy actor-critic policy gradient algorithm. 1
The DDPG (deep deterministic policy gradient) algorithm combines deterministic policy gradients with the stabilizing tricks from DQN: a replay buffer and slowly updated target networks. The critic is trained to satisfy a Bellman target, and the actor is trained to choose actions that maximize the critic.
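Writing the actor objective as a loss, we therefore minimize the negative critic value

$$
\mathcal{L}_\theta(s) = -Q_w(s, \mu_\theta(s)),
$$

where the loss is averaged over states $s$ drawn from the replay buffer.
The critic minimizes the 1-step TD loss, as in Q-learning,

$$
\mathcal{L}_w(s, a, r, s') = \left[ Q_w(s, a) - \left( r + \gamma\, Q_{\bar{w}}(s', \mu_{\bar{\theta}}(s')) \right) \right]^2,
$$

where the sample $(s, a, r, s')$ is drawn from the replay buffer, and $Q_{\bar{w}}$ and $\mu_{\bar{\theta}}$ are the target critic and target actor.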
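As a small worked example of the critic's TD loss, the sketch below computes targets and the mean squared error for a made-up minibatch in plain Python. The numbers and function names are assumptions for illustration; `next_q_values` stands in for the target critic evaluated at the target actor's action. Following the implementation later in this section, the bootstrap term is masked with `(1 - done)` so that terminal transitions use the reward alone.

```python
def td_targets(rewards, next_q_values, dones, gamma):
    """One-step TD targets: r + gamma * (1 - done) * Q_target(s', mu_target(s'))."""
    return [
        r + gamma * (1.0 - d) * q_next
        for r, q_next, d in zip(rewards, next_q_values, dones)
    ]


def critic_loss(q_values, targets):
    """Mean squared TD error over the minibatch."""
    return sum((q - y) ** 2 for q, y in zip(q_values, targets)) / len(q_values)


# Made-up minibatch of two transitions; the second one ends its episode.
rewards = [1.0, 0.0]
next_q_values = [10.0, 5.0]  # stand-ins for the target critic's outputs
dones = [0.0, 1.0]

targets = td_targets(rewards, next_q_values, dones, gamma=0.9)
loss = critic_loss([8.0, 1.0], targets)  # current critic predictions vs. targets
```

The targets are treated as constants: in a framework with automatic differentiation, no gradient should flow through them into the target networks.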
The target networks change slowly, which helps prevent the critic from chasing a rapidly moving bootstrap target.
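To see how slowly a target network moves, consider a minimal sketch of the soft (Polyak) update on a single scalar parameter, written in plain Python. The helper name `soft_update_values` and the numbers are assumptions for illustration. If the online parameter stays fixed, the gap between target and online parameter shrinks by a factor of `(1 - tau)` per update.

```python
def soft_update_values(target, source, tau):
    """Polyak averaging: target <- (1 - tau) * target + tau * source."""
    return [(1.0 - tau) * t + tau * s for t, s in zip(target, source)]


tau = 0.005
target = [0.0]   # target-network parameter
source = [1.0]   # online-network parameter, held fixed here

for _ in range(100):
    target = soft_update_values(target, source, tau)

remaining_gap = source[0] - target[0]  # equals (1 - tau) ** 100 up to rounding
```

With `tau = 0.005`, roughly 60% of the initial gap is still left after 100 updates, which is why the bootstrap target changes much more slowly than the online critic.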
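Algorithm 2 DDPG
Initialize actor parameters $\theta$, critic parameters $w$
Initialize target actor $\bar{\theta} \leftarrow \theta$, target critic $\bar{w} \leftarrow w$
Initialize replay buffer $\mathcal{D} \leftarrow \emptyset$
repeat
    $s \leftarrow$ sample starting state of a new episode
    for $t = 1, 2, \ldots$ do
        $a \leftarrow \text{clip}(\mu_\theta(s) + \epsilon,\ a_{\min},\ a_{\max})$, $\epsilon \sim \mathcal{N}(0, \sigma^2 I)$ // exploration noise
        $(s', r, d) \leftarrow$ env.step($a$)
        $\mathcal{D} \leftarrow \mathcal{D} \cup \{(s, a, r, s', d)\}$
        $s \leftarrow s'$
        Sample minibatch $\mathcal{B} \subset \mathcal{D}$
        for each $(s_i, a_i, r_i, s'_i, d_i) \in \mathcal{B}$ do
            $y_i \leftarrow r_i + \gamma (1 - d_i)\, Q_{\bar{w}}(s'_i, \mu_{\bar{\theta}}(s'_i))$ // TD target
        end for
        $w \leftarrow w - \eta_w \nabla_w \frac{1}{|\mathcal{B}|} \sum_i \left( Q_w(s_i, a_i) - y_i \right)^2$ // critic
        $\theta \leftarrow \theta + \eta_\theta \nabla_\theta \frac{1}{|\mathcal{B}|} \sum_i Q_w(s_i, \mu_\theta(s_i))$ // actor
        $\bar{w} \leftarrow (1 - \tau)\, \bar{w} + \tau\, w$ // target critic
        $\bar{\theta} \leftarrow (1 - \tau)\, \bar{\theta} + \tau\, \theta$ // target actor
        if the episode has ended then
            break
        end if
    end for
until converged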
Implementation
import random
from collections import deque
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F


class Actor(nn.Module):
    """Deterministic policy: maps a state to an action in [-action_limit, action_limit]."""

    def __init__(self, state_dim, action_dim, hidden_dim=256, action_limit=1.0):
        super().__init__()
        self.action_limit = action_limit
        self.network = nn.Sequential(
            nn.Linear(state_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, action_dim),
            nn.Tanh(),  # squash to (-1, 1) before scaling by action_limit
        )

    def forward(self, state):
        # Accept NumPy arrays and unbatched states for convenience.
        if isinstance(state, np.ndarray):
            state = torch.from_numpy(state)
        if state.ndim == 1:
            state = state.unsqueeze(0)
        state = state.float()
        return self.action_limit * self.network(state)


class Critic(nn.Module):
    """State-action value function Q(s, a); returns one scalar per batch element."""

    def __init__(self, state_dim, action_dim, hidden_dim=256):
        super().__init__()
        self.network = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, state, action):
        # Accept NumPy arrays and unbatched inputs for convenience.
        if isinstance(state, np.ndarray):
            state = torch.from_numpy(state)
        if isinstance(action, np.ndarray):
            action = torch.from_numpy(action)
        if state.ndim == 1:
            state = state.unsqueeze(0)
        if action.ndim == 1:
            action = action.unsqueeze(0)
        state = state.float()
        action = action.float()
        # Concatenate state and action, then drop the trailing singleton dimension.
        return self.network(torch.cat([state, action], dim=-1)).squeeze(-1)


class ReplayBuffer:
    """Fixed-capacity FIFO buffer of (state, action, reward, next_state, done) tuples."""

    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniformly sample transitions and stack each field into a float32 tensor.
        batch = random.sample(self.buffer, batch_size)
        states, actions, rewards, next_states, dones = zip(*batch)
        return (
            torch.tensor(np.array(states), dtype=torch.float32),
            torch.tensor(np.array(actions), dtype=torch.float32),
            torch.tensor(rewards, dtype=torch.float32),
            torch.tensor(np.array(next_states), dtype=torch.float32),
            torch.tensor(dones, dtype=torch.float32),
        )

    def __len__(self):
        return len(self.buffer)


def soft_update(target, source, tau):
    """Polyak averaging: target <- (1 - tau) * target + tau * source."""
    for target_param, source_param in zip(target.parameters(), source.parameters()):
        target_param.data.mul_(1 - tau).add_(tau * source_param.data)


def train(
    env,
    num_episodes: int,
    gamma: float = 0.99,
    actor_lr: float = 1e-4,
    critic_lr: float = 1e-3,
    tau: float = 0.005,
    batch_size: int = 64,
    noise_std: float = 0.1,
):
    state_dim = env.observation_space.shape[0]
    action_dim = env.action_space.shape[0]
    # Assumes symmetric action bounds that are the same in every dimension.
    action_limit = float(env.action_space.high[0])

    actor = Actor(state_dim, action_dim, action_limit=action_limit)
    critic = Critic(state_dim, action_dim)
    target_actor = Actor(state_dim, action_dim, action_limit=action_limit)
    target_critic = Critic(state_dim, action_dim)
    # Target networks start as exact copies of the online networks.
    target_actor.load_state_dict(actor.state_dict())
    target_critic.load_state_dict(critic.state_dict())

    actor_optimizer = optim.Adam(actor.parameters(), lr=actor_lr)
    critic_optimizer = optim.Adam(critic.parameters(), lr=critic_lr)
    replay_buffer = ReplayBuffer()

    for episode in range(num_episodes):
        state, _ = env.reset()
        while True:
            # Act with the current actor plus Gaussian exploration noise.
            with torch.no_grad():
                action = actor(state).squeeze(0).numpy()
            noise = np.random.normal(0.0, noise_std, size=action_dim)
            action = np.clip(action + noise, -action_limit, action_limit)

            next_state, reward, terminated, truncated, _ = env.step(action)
            done = terminated or truncated
            # Store `terminated` rather than `done`: a time-limit truncation is not
            # a true terminal state, so the TD target should still bootstrap there.
            replay_buffer.add(state, action, reward, next_state, terminated)
            state = next_state

            if len(replay_buffer) >= batch_size:
                states, actions, rewards, next_states, dones = replay_buffer.sample(
                    batch_size
                )

                # TD target from the target networks; no gradients flow through it.
                with torch.no_grad():
                    next_actions = target_actor(next_states)
                    td_target = rewards + (
                        gamma * (1 - dones)
                        * target_critic(next_states, next_actions)
                    )

                critic_loss = F.mse_loss(critic(states, actions), td_target)
                critic_optimizer.zero_grad()
                critic_loss.backward()
                critic_optimizer.step()

                # The actor loss also backpropagates into the critic's parameters.
                # Those gradients are never applied: only the actor optimizer steps
                # here, and critic_optimizer.zero_grad() clears them before the
                # next critic update.
                actor_loss = -critic(states, actor(states)).mean()
                actor_optimizer.zero_grad()
                actor_loss.backward()
                actor_optimizer.step()

                soft_update(target_actor, actor, tau)
                soft_update(target_critic, critic, tau)

            if done:
                break

Twin Delayed DDPG (TD3)
TD3 is an off-policy actor-critic policy gradient algorithm. 2
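The TD3 algorithm extends DDPG in three main ways. First, it uses target policy smoothing, in which clipped noise is added to the target action, to encourage generalization across similar actions:

$$
\tilde{a}' = \text{clip}\left( \mu_{\bar{\theta}}(s') + \text{clip}(\epsilon, -c, c),\ a_{\min},\ a_{\max} \right), \qquad \epsilon \sim \mathcal{N}(0, \sigma^2 I).
$$

Second, it uses clipped double Q-learning, which trains two critics and bootstraps from the smaller of the two target critics to reduce overestimation bias; the target values for TD learning are defined using

$$
y = r + \gamma \min_{i = 1, 2} Q_{\bar{w}_i}(s', \tilde{a}').
$$

Third, it uses delayed policy updates, in which the actor and the target networks are updated less often than the critics (for example, once every two critic updates), so that the policy is only updated after the value function has stabilized.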
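The three modifications can be sketched together in plain Python. This is a minimal illustration under assumptions, not a full TD3 implementation: actions are scalars, `q1_target` and `q2_target` are hypothetical callables standing in for the two target critics evaluated at the next state, and the default values for `noise_std`, `noise_clip`, and `policy_delay` are illustrative choices.

```python
import random


def clip(x, low, high):
    return max(low, min(high, x))


def td3_target(reward, done, next_action, q1_target, q2_target, rng,
               gamma=0.99, noise_std=0.2, noise_clip=0.5, action_limit=1.0):
    """TD target with target policy smoothing and clipped double Q-learning.

    next_action is the target actor's action at the next state; q1_target and
    q2_target map an action to each target critic's value at that next state.
    """
    # Target policy smoothing: perturb the target action with clipped noise.
    noise = clip(rng.gauss(0.0, noise_std), -noise_clip, noise_clip)
    smoothed_action = clip(next_action + noise, -action_limit, action_limit)
    # Clipped double Q-learning: bootstrap from the smaller of the two critics.
    q_min = min(q1_target(smoothed_action), q2_target(smoothed_action))
    return reward + gamma * (1.0 - done) * q_min


def count_updates(num_steps, policy_delay=2):
    """Delayed policy updates: one actor update per `policy_delay` critic updates."""
    critic_updates, actor_updates = 0, 0
    for step in range(1, num_steps + 1):
        critic_updates += 1
        if step % policy_delay == 0:
            actor_updates += 1  # target networks would also be updated here
    return critic_updates, actor_updates


# Toy target critics (hypothetical): the first is uniformly more optimistic.
q1 = lambda a: 1.0 - a * a
q2 = lambda a: 0.5 - a * a

y = td3_target(1.0, 0.0, 0.3, q1, q2, rng=random.Random(0))
updates = count_updates(10)
```

Because the target uses the minimum of the two critics, it can never exceed the target that either critic would produce on its own for the same smoothed action.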
Sources
- Murphy, K. (2025). Reinforcement Learning: An Overview. Chapter 3.