The soft actor-critic (SAC) algorithm is an off-policy actor-critic method based on the maximum entropy RL framework. It is an instance of KL control in which the variational posterior policy is parameterized but the prior policy is fixed to the uniform distribution. Thus SAC has only an E step (implemented via SGD) and no M step. SAC maximizes an entropy-regularized objective expressed in terms of the soft Q-function, which is defined below.
Note that the entropy term makes the objective easier to optimize and encourages exploration. To optimize this objective, we alternate a policy evaluation step with a policy improvement step.
Policy Evaluation Step
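Given a fixed policy $\pi$, we define the soft state-value and soft action-value functions by
$$V^{\pi}_{\text{soft}}(s) = \mathbb{E}_{a \sim \pi(\cdot|s)}\left[ Q^{\pi}_{\text{soft}}(s,a) - \alpha \log \pi(a|s) \right]$$
and
$$Q^{\pi}_{\text{soft}}(s,a) = R(s,a) + \gamma\, \mathbb{E}_{s' \sim p(\cdot|s,a)}\left[ V^{\pi}_{\text{soft}}(s') \right],$$
where $\alpha > 0$ is the temperature that weights the entropy term.
Combining these gives the soft Bellman equation
$$Q^{\pi}_{\text{soft}}(s,a) = R(s,a) + \gamma\, \mathbb{E}_{s' \sim p(\cdot|s,a),\, a' \sim \pi(\cdot|s')}\left[ Q^{\pi}_{\text{soft}}(s',a') - \alpha \log \pi(a'|s') \right].$$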
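To make the policy evaluation step concrete, here is a minimal sketch that iterates the soft Bellman backup on a small tabular MDP. It assumes the standard maximum-entropy form V(s) = E_a[Q(s,a) - alpha log pi(a|s)] with temperature `alpha`. The function names and the one-state MDP are hypothetical, chosen only for illustration; SAC itself uses function approximation rather than tables.

```python
import math

def soft_value(Q, pi, alpha):
    """Soft state values: V(s) = sum_a pi(a|s) * (Q(s,a) - alpha * log pi(a|s))."""
    return [
        sum(p * (q - alpha * math.log(p)) for p, q in zip(pi_s, q_s) if p > 0.0)
        for pi_s, q_s in zip(pi, Q)
    ]

def soft_policy_evaluation(R, P, pi, alpha, gamma, iters=500):
    """Iterate the soft Bellman backup for a fixed policy on a tabular MDP.

    R[s][a] is the reward, P[s][a][s2] the transition probability,
    and pi[s][a] the probability of action a in state s.
    """
    n_states, n_actions = len(R), len(R[0])
    Q = [[0.0] * n_actions for _ in range(n_states)]
    for _ in range(iters):
        V = soft_value(Q, pi, alpha)
        # Q(s,a) = R(s,a) + gamma * sum_s2 P(s2|s,a) * V(s2)
        Q = [
            [R[s][a] + gamma * sum(P[s][a][s2] * V[s2] for s2 in range(n_states))
             for a in range(n_actions)]
            for s in range(n_states)
        ]
    return Q, soft_value(Q, pi, alpha)

# Hypothetical one-state, two-action MDP with a uniform policy.
R = [[1.0, 0.0]]
P = [[[1.0], [1.0]]]
pi = [[0.5, 0.5]]
Q, V = soft_policy_evaluation(R, P, pi, alpha=0.2, gamma=0.9)
```

In this toy case the fixed point can be checked by hand: V = 0.5 + gamma V + alpha log 2, so V = (0.5 + alpha log 2) / (1 - gamma), which is larger than the ordinary value 0.5 / (1 - gamma) by the accumulated entropy bonus.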
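SAC approximates $Q^{\pi}_{\text{soft}}$ with neural critics $Q_{w_i}$, $i = 1, \ldots, N$. For a replay sample $(s, a, r, s')$, we form the target
$$y = r + \gamma \left( Q_{\bar{w}}(s', a') - \alpha \log \pi_\theta(a'|s') \right),$$
where $a' \sim \pi_\theta(\cdot|s')$ is sampled from the current policy, $\alpha$ is the entropy temperature, and $Q_{\bar{w}}(s', a')$ aggregates the target critics $Q_{\bar{w}_i}$ (standard SAC takes the minimum over $i$ to limit overestimation).
Each critic is then fit by squared Bellman error regression over a minibatch $\mathcal{B}$ drawn from the replay buffer:
$$L(w_i) = \frac{1}{|\mathcal{B}|} \sum_{(s,a,r,s') \in \mathcal{B}} \left( Q_{w_i}(s,a) - y \right)^2,$$
with the target $y$ treated as a constant, so that no gradient flows through it.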
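The target and the regression loss can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the reference implementation: the function names are hypothetical, the target critics are aggregated by their minimum (the clipped double-Q convention of standard SAC), and `alpha` denotes the entropy temperature.

```python
def sac_target(r, q_next_targets, logp_next, alpha, gamma):
    """One-step soft TD target for a single transition (s, a, r, s').

    q_next_targets: target-critic values at (s', a'), with a' sampled
                    from the current policy at s'.
    logp_next:      log-probability of a' under the current policy.
    """
    # Assumption: aggregate the target critics by taking their minimum.
    q_next = min(q_next_targets)
    return r + gamma * (q_next - alpha * logp_next)

def critic_loss(q_preds, targets):
    """Mean squared Bellman error of one critic over a minibatch.

    The targets are plain numbers here, which mirrors treating them as
    constants (no gradient) in an autodiff implementation.
    """
    return sum((q - y) ** 2 for q, y in zip(q_preds, targets)) / len(targets)

# Hypothetical transition: reward 1.0, two target critics, log pi(a'|s') = -1.0.
y = sac_target(1.0, [2.0, 3.0], -1.0, alpha=0.5, gamma=0.9)
loss = critic_loss([1.0, 2.0], [0.0, 4.0])
```

Because the log-probability is negative here, subtracting `alpha * logp_next` raises the bootstrap value from 2.0 to 2.5, so the target is 1.0 + 0.9 * 2.5.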
Intuition
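The critic is trying to predict the total future reward of taking action $a$ in state $s$, but under the soft objective this prediction also includes the future entropy bonus. So the target is the usual one-step TD target, except that the next-state bootstrap value is reduced by $\alpha \log \pi_\theta(a'|s')$, where $a'$ is the next action sampled from the policy and $\alpha$ is the entropy temperature. This term is larger when the policy assigns high probability to $a'$, so subtracting it rewards policies that remain stochastic.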
Policy Improvement Step
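Once the critic has estimated the soft action values, we improve the policy by maximizing the expected soft value
$$J(\theta) = \mathbb{E}_{s \sim p^{\gamma}_{\pi_\theta},\, a \sim \pi_\theta(\cdot|s)}\left[ Q^{\pi}_{\text{soft}}(s,a) - \alpha \log \pi_\theta(a|s) \right],$$
where $p^{\gamma}_{\pi_\theta}$ is the discounted state distribution and $\alpha$ is the entropy temperature.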
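In practice, SAC replaces $Q^{\pi}_{\text{soft}}$ by the learned critics and replaces the discounted state distribution by replay-buffer samples. Using the ensemble average
$$\hat{Q}(s,a) = \frac{1}{N} \sum_{i=1}^{N} Q_{w_i}(s,a),$$
the actor objective for a minibatch $\mathcal{B}$ becomes
$$J(\theta) = \frac{1}{|\mathcal{B}|} \sum_{s \in \mathcal{B}} \left[ \hat{Q}(s, \tilde{a}) - \alpha \log \pi_\theta(\tilde{a}|s) \right], \qquad \tilde{a} \sim \pi_\theta(\cdot|s),$$
where $\alpha$ is the entropy temperature.
We then perform gradient ascent on $J(\theta)$. This is exactly the quantity optimized by update_actor in the pseudocode.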
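As a rough illustration of the minibatch actor objective, the sketch below evaluates it from precomputed numbers: the critic values at freshly sampled actions and the corresponding log-probabilities. The function name and inputs are hypothetical, and `alpha` is the entropy temperature. A real implementation would compute these quantities with an autodiff library and differentiate through the sampled action, which this plain-Python version does not attempt.

```python
def actor_objective(q_values, logps, alpha):
    """Minibatch estimate of the actor objective J(theta).

    q_values[j][i]: critic i evaluated at (s_j, a_j), where a_j is sampled
                    from the current policy at s_j.
    logps[j]:       log-probability of a_j under the current policy.
    """
    total = 0.0
    for q_row, logp in zip(q_values, logps):
        q_hat = sum(q_row) / len(q_row)  # ensemble average over the critics
        total += q_hat - alpha * logp    # value plus entropy bonus
    return total / len(logps)

# Hypothetical minibatch of two states and two critics.
J = actor_objective([[1.0, 3.0], [2.0, 2.0]], [-1.0, -2.0], alpha=0.5)
```

Both states have the same averaged critic value (2.0), but the second sampled action is less likely under the policy, so it contributes a larger entropy bonus to the objective.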
Intuition
The actor prefers actions that the critic thinks have high long-term value, but it is also rewarded for keeping high entropy. So SAC does not simply become greedy with respect to the critic estimate $\hat{Q}$; it becomes greedy with respect to the soft, entropy-regularized objective, balancing reward-seeking and exploration.
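This trade-off can be seen exactly in a simplified setting. Assume a single state with a finite action set (an assumption made only for illustration; SAC is usually applied to continuous actions). The distribution that maximizes the sum over actions of pi(a) (Q(a) - alpha log pi(a)) is then the softmax of Q / alpha, and the maximum equals alpha times the log-sum-exp of Q / alpha. The helper names below are hypothetical.

```python
import math

def soft_greedy_policy(q, alpha):
    """Softmax of q / alpha, the maximizer of the entropy-regularized objective."""
    m = max(q)  # subtract the max for numerical stability
    weights = [math.exp((qa - m) / alpha) for qa in q]
    z = sum(weights)
    return [w / z for w in weights]

def soft_objective(pi, q, alpha):
    """Expected value plus alpha times entropy for a discrete distribution pi."""
    return sum(p * (qa - alpha * math.log(p)) for p, qa in zip(pi, q) if p > 0.0)

q = [1.0, 0.0]
alpha = 0.5
pi_soft = soft_greedy_policy(q, alpha)
j_soft = soft_objective(pi_soft, q, alpha)           # softmax policy
j_greedy = soft_objective([1.0, 0.0], q, alpha)      # deterministic argmax
j_uniform = soft_objective([0.5, 0.5], q, alpha)     # maximum-entropy policy
```

The softmax policy scores higher than both the purely greedy policy and the uniform policy. As `alpha` shrinks toward zero, the softmax concentrates on the highest-valued action and the greedy policy is recovered.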
Algorithm
Algorithm 11 SAC
Initialize environment state $s$, policy parameters $\theta$, $N$ critic parameters $w_i$, target parameters $\bar{w}_i$, replay buffer $\mathcal{D}$, discount factor $\gamma$, EMA rate $\rho$, step sizes $\eta_w$, $\eta_\pi$