Soft Actor Critic

The soft actor-critic (SAC) algorithm is an off-policy actor-critic method based on the maximum entropy RL framework. It is an instance of KL control in which the variational posterior policy is parameterized but the prior policy is fixed to the uniform distribution. Thus SAC has only an E step (implemented via SGD) and no M step. Its updates are expressed in terms of the soft-Q function, which is defined below.

SAC Objective

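We can write the maximum entropy objective for the E step, in slightly modified notation, as

$$J_{\text{maxent}}(\pi) = \mathbb{E}_{\pi}\Big[ \sum_{t \ge 0} \gamma^{t} \big( r(s_t, a_t) + \alpha\, \mathbb{H}(\pi(\cdot \mid s_t)) \big) \Big],$$

where $\mathbb{H}(\pi(\cdot \mid s)) = -\mathbb{E}_{a \sim \pi(\cdot \mid s)}[\log \pi(a \mid s)]$ is the policy entropy and $\alpha > 0$ is the temperature that weights it against the reward.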

Note that the entropy term makes the objective easier to optimize, and encourages exploration. To optimize this, we can perform a policy evaluation step, and then a policy improvement step.
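To make the role of the entropy term concrete, the following minimal sketch evaluates a one-step version of the objective, expected reward plus a temperature-weighted entropy bonus, for a single state with two discrete actions. The function names, the rewards, and the two policies are hypothetical values chosen for illustration.

```python
import math

def entropy(probs):
    """Shannon entropy, in nats, of a discrete action distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def maxent_objective(probs, rewards, alpha):
    """One-step objective E_pi[r] + alpha * H(pi) for a single state."""
    expected_reward = sum(p * r for p, r in zip(probs, rewards))
    return expected_reward + alpha * entropy(probs)

rewards = [1.0, 0.9]   # hypothetical rewards for two actions
greedy = [1.0, 0.0]    # deterministic policy: always pick the best action
mixed = [0.6, 0.4]     # stochastic policy

for alpha in (0.0, 0.5):
    print(alpha,
          maxent_objective(greedy, rewards, alpha),
          maxent_objective(mixed, rewards, alpha))
```

With a temperature of zero the deterministic policy scores higher, whereas with a temperature of 0.5 the stochastic policy wins because its entropy bonus outweighs the small loss in expected reward.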

Policy Evaluation Step

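Given a fixed policy $\pi$, we define the soft state-value and soft action-value functions, with entropy temperature $\alpha$, by

$$V^{\pi}(s) = \mathbb{E}_{a \sim \pi(\cdot \mid s)}\big[ Q^{\pi}(s,a) - \alpha \log \pi(a \mid s) \big]$$

and

$$Q^{\pi}(s,a) = r(s,a) + \gamma\, \mathbb{E}_{s' \sim p(\cdot \mid s,a)}\big[ V^{\pi}(s') \big].$$

Combining these gives the soft Bellman equation

$$Q^{\pi}(s,a) = r(s,a) + \gamma\, \mathbb{E}_{s' \sim p(\cdot \mid s,a),\, a' \sim \pi(\cdot \mid s')}\big[ Q^{\pi}(s',a') - \alpha \log \pi(a' \mid s') \big].$$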
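As a rough illustration of soft policy evaluation, the sketch below iterates the soft Bellman backup on a tiny tabular MDP until the values settle. The tabular setting, the helper names, and the toy MDP are assumptions made for the example; SAC itself uses neural critics rather than tables.

```python
import math

def soft_value(Q, pi, alpha):
    """V(s) = sum_a pi(a|s) * (Q(s,a) - alpha * log pi(a|s))."""
    return [
        sum(p * (q - alpha * math.log(p)) for q, p in zip(Q[s], pi[s]) if p > 0.0)
        for s in range(len(Q))
    ]

def soft_policy_evaluation(P, R, pi, alpha, gamma, iters=200):
    """Repeatedly apply the soft Bellman backup for a fixed policy pi.

    P[s][a][s2] are transition probabilities and R[s][a] are rewards.
    """
    n_states, n_actions = len(R), len(R[0])
    Q = [[0.0] * n_actions for _ in range(n_states)]
    for _ in range(iters):
        V = soft_value(Q, pi, alpha)
        Q = [
            [R[s][a] + gamma * sum(P[s][a][s2] * V[s2] for s2 in range(n_states))
             for a in range(n_actions)]
            for s in range(n_states)
        ]
    return Q, soft_value(Q, pi, alpha)

# Hypothetical toy MDP: one state, two actions, zero reward, uniform policy.
P = [[[1.0], [1.0]]]
R = [[0.0, 0.0]]
pi = [[0.5, 0.5]]
Q, V = soft_policy_evaluation(P, R, pi, alpha=1.0, gamma=0.5)
```

Because the reward is zero here, all of the value comes from the entropy bonus: the fixed point satisfies $V = \alpha \log 2 + \gamma V$, so $V = \alpha \log 2 / (1 - \gamma) = 2 \log 2$ for these settings.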

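SAC approximates $Q^{\pi}$ with neural critics $Q_{\mathbf{w}_i}$, $i = 1, \ldots, N$. For a replay sample $(s, a, r, s')$, we form the target

$$y(r, s'; \overline{\mathbf{w}}_{1:N}, \mathbf{\theta}) = r + \gamma \big[ Q_{\overline{\mathbf{w}}}(s', \tilde{a}') - \alpha \log \pi_{\mathbf{\theta}}(\tilde{a}' \mid s') \big],$$

where $\tilde{a}' \sim \pi_{\mathbf{\theta}}(\cdot \mid s')$ and $Q_{\overline{\mathbf{w}}}(s', \tilde{a}')$ is the estimate formed from the target critics $Q_{\overline{\mathbf{w}}_1}, \ldots, Q_{\overline{\mathbf{w}}_N}$ (commonly their minimum, which counteracts overestimation).

Each critic is then fit by squared Bellman error regression, with $y_j = y(r_j, s_j'; \overline{\mathbf{w}}_{1:N}, \mathbf{\theta})$:

$$\mathcal{L}(\mathbf{w}_i) = \frac{1}{|\mathcal{B}|} \sum_{(s_j, a_j, r_j, s_j') \in \mathcal{B}} \big( Q_{\mathbf{w}_i}(s_j, a_j) - \mathrm{stopgrad}(y_j) \big)^2 .$$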

Intuition

The critic is trying to predict the total future reward of taking action $a$ in state $s$, but under the soft objective it also includes the future entropy bonus. So the target is the usual one-step TD target, except that the next-state bootstrap value is reduced by $\alpha \log \pi_{\mathbf{\theta}}(\tilde{a}' \mid s')$, where $\tilde{a}'$ is the next action sampled from the policy, which rewards policies that remain stochastic.
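The target and the regression loss can be sketched for scalar inputs as follows. This is an illustrative sketch, not a reference implementation: the function names are hypothetical, and aggregating the target critics with a minimum is an assumption borrowed from common SAC implementations.

```python
def soft_td_target(r, q_next_targets, logp_next, alpha, gamma):
    """y = r + gamma * (min_i Q_target_i(s', a') - alpha * log pi(a'|s')).

    q_next_targets holds each target critic's value at (s', a'), where a'
    was sampled from the current policy at s'. Using min is an assumption.
    """
    return r + gamma * (min(q_next_targets) - alpha * logp_next)

def critic_loss(q_preds, targets):
    """Mean squared Bellman error; targets are treated as constants."""
    return sum((q - y) ** 2 for q, y in zip(q_preds, targets)) / len(q_preds)

# Hypothetical numbers for one replay sample and two target critics.
y = soft_td_target(r=1.0, q_next_targets=[2.0, 3.0], logp_next=-1.0,
                   alpha=0.2, gamma=0.9)
loss = critic_loss(q_preds=[2.0], targets=[y])
```

Since the sampled action has a log-probability of -1, the entropy term adds $0.2$ to the bootstrap value before discounting, so an unlikely next action raises the target.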

Policy Improvement Step

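Once the critic has estimated the soft action values, we improve the policy by maximizing the expected soft value

$$\max_{\mathbf{\theta}}\; \mathbb{E}_{s \sim p^{\gamma},\, a \sim \pi_{\mathbf{\theta}}(\cdot \mid s)}\big[ Q^{\pi}(s,a) - \alpha \log \pi_{\mathbf{\theta}}(a \mid s) \big],$$

where $p^{\gamma}$ is the discounted state distribution.

In practice, SAC replaces $Q^{\pi}$ by the learned critics and replaces the discounted state distribution by replay-buffer samples. Using the ensemble average

$$\hat{Q}(s,a) \triangleq \frac{1}{N} \sum_{i=1}^{N} Q_{\mathbf{w}_i}(s,a),$$

the actor objective for a minibatch $\mathcal{B}$ becomes

$$J(\mathbf{\theta}) = \frac{1}{|\mathcal{B}|} \sum_{s \in \mathcal{B}} \big( \hat{Q}(s, \tilde{a}_{\mathbf{\theta}}(s)) - \alpha \log \pi_{\mathbf{\theta}}(\tilde{a}_{\mathbf{\theta}}(s) \mid s) \big), \quad \tilde{a}_{\mathbf{\theta}}(s) \sim \pi_{\mathbf{\theta}}(\cdot \mid s).$$

We then perform gradient ascent on $J(\mathbf{\theta})$. This is exactly the quantity optimized by update_actor in the pseudocode.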
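The minibatch actor objective can be estimated as in the sketch below, which assumes a one-dimensional Gaussian policy and draws the action as mean plus scaled noise (a reparameterized sample). The helper names, the quadratic critics, and the state-independent policy are hypothetical. The sketch only evaluates the objective; a real implementation would obtain its gradient by automatic differentiation through the sampled action.

```python
import math
import random

def gaussian_log_prob(a, mean, std):
    """Log density of a 1-D Gaussian N(mean, std^2) evaluated at a."""
    z = (a - mean) / std
    return -0.5 * z * z - math.log(std) - 0.5 * math.log(2.0 * math.pi)

def actor_objective(states, critics, policy, alpha, rng):
    """Minibatch estimate: mean of Q_hat(s, a) - alpha * log pi(a|s)."""
    total = 0.0
    for s in states:
        mean, std = policy(s)
        a = mean + std * rng.gauss(0.0, 1.0)  # reparameterized action sample
        q_hat = sum(q(s, a) for q in critics) / len(critics)  # ensemble average
        total += q_hat - alpha * gaussian_log_prob(a, mean, std)
    return total / len(states)

# Hypothetical toy setup: two quadratic critics, state-independent policy.
critics = [lambda s, a: -(a - 1.0) ** 2, lambda s, a: -(a - 1.2) ** 2]
policy = lambda s: (1.0, 0.5)
J = actor_objective([0.0] * 256, critics, policy, alpha=0.2,
                    rng=random.Random(0))
```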

Intuition

The actor prefers actions that the critic thinks have high long-term value, but it is also rewarded for keeping high entropy. So SAC does not simply become greedy with respect to $Q$; it becomes greedy with respect to the soft $Q$, balancing reward-seeking and exploration.
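For a single state with discrete actions, this trade-off has a closed form: the distribution that maximizes expected value plus temperature-weighted entropy is a softmax of the action values divided by the temperature. The sketch below checks this numerically. The discrete setting and the action values are assumptions for illustration, since SAC is typically applied with continuous actions and a parameterized policy.

```python
import math

def softmax_policy(q_values, alpha):
    """pi(a) proportional to exp(Q(a) / alpha), computed stably."""
    m = max(q_values)
    weights = [math.exp((q - m) / alpha) for q in q_values]
    total = sum(weights)
    return [w / total for w in weights]

def soft_objective(probs, q_values, alpha):
    """E_pi[Q(a) - alpha * log pi(a)] for a single state."""
    return sum(p * (q - alpha * math.log(p))
               for p, q in zip(probs, q_values) if p > 0.0)

q_values = [1.0, 0.5, 0.0]   # hypothetical critic outputs for three actions
alpha = 0.5
soft = softmax_policy(q_values, alpha)
greedy = [1.0, 0.0, 0.0]
uniform = [1.0 / 3.0] * 3

for name, probs in (("soft", soft), ("greedy", greedy), ("uniform", uniform)):
    print(name, soft_objective(probs, q_values, alpha))
```

The softmax policy scores higher than both the greedy and the uniform policy, and as the temperature shrinks toward zero it concentrates on the highest-valued action.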

Algorithm

Algorithm 11 SAC

Initialize environment state $s$, policy parameters $\mathbf{\theta}$, $N$ critic parameters $\mathbf{w}_i$, target parameters $\overline{\mathbf{w}}_i$, replay buffer $\mathcal{D}$, discount factor $\gamma$, entropy temperature $\alpha$, EMA rate $\rho$, step sizes $\eta_{\mathbf{w}}$, $\eta_{\mathbf{\theta}}$

repeat

Sample action $a \sim \pi_{\mathbf{\theta}}(\cdot \mid s)$

Take step $a$ in $s$, observe $s', r$

Add $(s,a,r,s')$ to $\mathcal{D}$

$s \gets s'$

for $G$ updates do

Sample minibatch $\mathcal{B} = \{(s_j,a_j,r_j,s_j')\}$ from $\mathcal{D}$

$(\mathbf{w}_{1:N}, \overline{\mathbf{w}}_{1:N}) =$ update_critic($\mathbf{\theta}, \mathbf{w}_{1:N}, \overline{\mathbf{w}}_{1:N}, \mathcal{B}$)

end for

Sample minibatch $\mathcal{B} = \{(s_j,a_j,r_j,s_j')\}$ from $\mathcal{D}$

$\mathbf{\theta} =$ update_actor($\mathbf{\theta}, \mathbf{w}_{1:N}, \mathcal{B}$)

until converged

procedure update_critic($\mathbf{\theta}, \mathbf{w}_{1:N}, \overline{\mathbf{w}}_{1:N}, \mathcal{B}$)

Let $(s_j, a_j, r_j, s_j')_{j=1}^{B} = \mathcal{B}$

$y_j = y(r_j, s_j'; \overline{\mathbf{w}}_{1:N}, \mathbf{\theta})$ for $j = 1:B$

for $i = 1:N$ do

$\mathcal{L}(\mathbf{w}_i) = \frac{1}{|\mathcal{B}|} \sum_{(s,a,r,s')_j \in \mathcal{B}} \big( Q_{\mathbf{w}_i}(s_j,a_j) - \mathrm{stopgrad}(y_j) \big)^2$

$\mathbf{w}_i \gets \mathbf{w}_i - \eta_{\mathbf{w}} \nabla \mathcal{L}(\mathbf{w}_i)$ // Descent

$\overline{\mathbf{w}}_i \gets \rho \overline{\mathbf{w}}_i + (1-\rho)\mathbf{w}_i$ // Update target networks

end for

return $\mathbf{w}_{1:N}$, $\overline{\mathbf{w}}_{1:N}$

end procedure

procedure update_actor($\mathbf{\theta}, \mathbf{w}_{1:N}, \mathcal{B}$)

$\hat{Q}(s,a) \triangleq \frac{1}{N} \sum_{i=1}^{N} Q_{\mathbf{w}_i}(s,a)$ // Average critic

$J(\mathbf{\theta}) = \frac{1}{|\mathcal{B}|} \sum_{s \in \mathcal{B}} \big( \hat{Q}(s, \tilde{a}_{\mathbf{\theta}}(s)) - \alpha \log \pi_{\mathbf{\theta}}(\tilde{a}_{\mathbf{\theta}}(s) \mid s) \big), \ \tilde{a}_{\mathbf{\theta}}(s) \sim \pi_{\mathbf{\theta}}(\cdot \mid s)$

$\mathbf{\theta} \gets \mathbf{\theta} + \eta_{\mathbf{\theta}} \nabla J(\mathbf{\theta})$ // Ascent

return $\mathbf{\theta}$

end procedure
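The target-network line in the critic update is an exponential moving average of the online critic parameters. A minimal sketch, assuming the parameters are stored as a flat list of floats (the function name and the numbers are hypothetical):

```python
def ema_update(target_params, online_params, rho):
    """Return rho * target + (1 - rho) * online, elementwise."""
    return [rho * t + (1.0 - rho) * w
            for t, w in zip(target_params, online_params)]

target = [0.0, 0.0]    # hypothetical target-critic parameters
online = [1.0, -2.0]   # hypothetical online-critic parameters
for _ in range(3):
    target = ema_update(target, online, rho=0.9)
```

With the online parameters held fixed, after $k$ updates the target has moved a fraction $1 - \rho^{k}$ of the way toward them, so an EMA rate close to 1 makes the regression targets change slowly.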

Sources