This note discusses an approach to policy optimization that reduces it to probabilistic inference, known as control as inference. The primary advantage of this approach is that it enables policy learning from off-policy data while avoiding the need for (potentially high variance) importance sampling corrections. This is because the inference approach takes expectations with respect to $d_q(s)$ instead of $d_\pi(s)$, where $q$ is an auxiliary distribution, $\pi$ is the policy being optimized, and $d$ is the state visitation measure. It also lets us apply a large toolkit of methods for probabilistic modeling and inference to RL problems.
The core of these methods is a probabilistic graphical model: an MDP augmented with new binary variables $O_t$, called optimality variables, which indicate whether the action at time $t$ is optimal or not. We assume these have the following probability distribution:
$$p(O_t = 1 \mid s_t, a_t) \propto \exp\left(\eta^{-1} G(s_t, a_t)\right)$$
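where $\eta > 0$ is a temperature parameter, and $G(s, a)$ is some quality function such as $G(s, a) = R(s, a)$, or $G(s, a) = Q(s, a)$, or $G(s, a) = A(s, a)$. For brevity, we will write $p(O = 1 \mid \cdot)$ to denote the probability of the event that $O_t = 1$ for all time steps.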
Deterministic Case (Planning and Control as Inference)
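Our goal is to find trajectories that are optimal. That is, we would like to find the mode of (or posterior samples from) the following distribution over trajectories $\tau = (s_1, a_1, \ldots, s_T, a_T, s_{T+1})$:
$$p(\tau \mid O = 1) \propto p(\tau, O = 1) = p(s_1) \prod_{t=1}^{T} \pi(a_t \mid s_t)\, p(s_{t+1} \mid s_t, a_t)\, p(O_t = 1 \mid s_t, a_t)$$
where $\pi$ is the policy.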
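We start by considering the deterministic case, where $p(s_{t+1} \mid s_t, a_t)$ is either 1 or 0, depending on whether the transition is feasible or not. In this case, rather than learning a policy that maps states to actions, we just need to learn a plan (a specific sequence of actions $a_{1:T}$) for each starting state $s_1$. This is equivalent to a shortest path problem, i.e., we want to maximize
$$p(a_{1:T} \mid s_1, O = 1) \propto \prod_{t=1}^{T} \pi(a_t \mid s_t) \exp\left(\eta^{-1} G(s_t, a_t)\right)$$
where each $s_{t+1}$ is the unique feasible successor of $(s_t, a_t)$. With a uniform action prior $\pi$, this amounts to maximizing the total quality $\sum_{t=1}^{T} G(s_t, a_t)$.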
The MAP sequence of actions, which we denote $a^*_{1:T}(s_1)$, is the optimal open loop plan. It is called open loop because the agent does not need to observe the state: $s_t$ is uniquely determined by $s_1$ and $a_{1:t-1}$, both of which are known. Computing this trajectory is known as the control as inference problem. Such open loop planning problems can be solved using model predictive control methods.
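To make the shortest-path view concrete, here is a minimal sketch that finds the MAP open loop plan by brute-force enumeration. The chain environment, the functions `step` and `G`, and the uniform action prior are all hypothetical choices made for illustration; they are not taken from the source, and enumeration is only practical for very short horizons.

```python
import itertools

# Hypothetical deterministic chain: states 0..3, actions 0 (stay) and 1 (move right).
def step(s, a):
    return min(s + a, 3)

# Hypothetical quality function G(s, a): moving costs 0.1,
# except the move from state 2 into state 3, which pays 1.0.
def G(s, a):
    if a == 1 and s == 2:
        return 1.0
    return -0.1 if a == 1 else 0.0

def log_score(s1, actions, eta=1.0):
    """Unnormalized log p(a_{1:T} | s_1, O=1), assuming a uniform action prior."""
    s, total = s1, 0.0
    for a in actions:
        total += G(s, a) / eta
        s = step(s, a)  # the next state is fully determined, so no observation is needed
    return total

def map_plan(s1, T, eta=1.0):
    """Enumerate all 2^T action sequences and return the highest-scoring one."""
    return max(itertools.product([0, 1], repeat=T),
               key=lambda acts: log_score(s1, acts, eta))

plan = map_plan(0, T=3)
```

With these illustrative numbers, the only plan that reaches the rewarding transition within three steps is to move right at every step.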
Stochastic Case (Policy Learning as Variational Inference)
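In the stochastic case, we want to learn a policy $\pi$ which maps states to actions, and which generates a distribution over trajectories that are optimal. Thus, we define the objective as
$$\max_{\pi} \log p_\pi(O = 1) = \max_{\pi} \log \int p_\pi(\tau, O = 1)\, d\tau$$
where we define
$$p_\pi(\tau, O = 1) = p(s_1) \prod_{t=1}^{T} \pi(a_t \mid s_t)\, p(s_{t+1} \mid s_t, a_t)\, p(O_t = 1 \mid s_t, a_t)$$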
The difficulty is not writing down $p_\pi(\tau, O = 1)$, but rather reasoning about the posterior over trajectories conditioned on all optimality variables being 1, namely $p_\pi(\tau \mid O = 1)$. This posterior favors trajectories with high reward, but computing it exactly requires marginalizing over all trajectories. To obtain a tractable objective, we introduce a variational distribution $q(\tau)$ whose role is to approximate this posterior. We assume $q(\tau)$ factors in the same way:
$$q(\tau) = p(s_1) \prod_{t=1}^{T} q(a_t \mid s_t)\, p(s_{t+1} \mid s_t, a_t)$$
Even though $q(\tau)$ has the same factorization as the trajectory distribution induced by $\pi$, it is not the same distribution unless $q(a \mid s) = \pi(a \mid s)$. The key modeling choice is that $q(\tau)$ keeps the true dynamics model $p(s_{t+1} \mid s_t, a_t)$ fixed and only changes the action distribution through $q(a_t \mid s_t)$. Conditioning on optimality should make us prefer better actions, but it should not change our belief about how the environment transitions. This is one way to avoid the optimism bias that can arise if we sample from an unconstrained $q(\tau)$.
Intuition
To see this, suppose $O = 1$ is the event that we win the lottery. We do not want conditioning on this outcome to influence our belief in the probability of chance events, which is governed by the dynamics $p(s_{t+1} \mid s_t, a_t)$ and not by the action distribution $q(a_t \mid s_t)$.
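Now we note the following identity, which follows from $p_\pi(\tau \mid O = 1) = p_\pi(\tau, O = 1) / p_\pi(O = 1)$:
$$D_{\mathrm{KL}}\left(q(\tau) \,\|\, p_\pi(\tau \mid O = 1)\right) = \mathbb{E}_{q(\tau)}\left[\log q(\tau) - \log p_\pi(\tau, O = 1)\right] + \log p_\pi(O = 1)$$
Going back to the original objective, we can multiply and divide by $q(\tau)$ inside the integral to express it as an expectation under $q$:
$$\log p_\pi(O = 1) = \log \int q(\tau)\, \frac{p_\pi(\tau, O = 1)}{q(\tau)}\, d\tau = \log \mathbb{E}_{q(\tau)}\left[\frac{p_\pi(\tau, O = 1)}{q(\tau)}\right]$$
Hence, by Jensen's inequality,
$$\log p_\pi(O = 1) \geq \mathbb{E}_{q(\tau)}\left[\log p_\pi(\tau, O = 1) - \log q(\tau)\right]$$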
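This gives the lower bound. Separately, if we rearrange the KL identity above, we get the exact decomposition
$$\log p_\pi(O = 1) = J(\pi, q) + D_{\mathrm{KL}}\left(q(\tau) \,\|\, p_\pi(\tau \mid O = 1)\right)$$
where $J(\pi, q)$ is defined as
$$J(\pi, q) = \mathbb{E}_{q(\tau)}\left[\log p_\pi(\tau, O = 1) - \log q(\tau)\right] = \mathbb{E}_{q(\tau)}\left[\sum_{t=1}^{T} \eta^{-1} G(s_t, a_t) - D_{\mathrm{KL}}\left(q(\cdot \mid s_t) \,\|\, \pi(\cdot \mid s_t)\right)\right] + \text{const}$$
The initial-state and dynamics terms cancel because $q(\tau)$ and $p_\pi(\tau, O = 1)$ share them, and the additive constant comes from the normalizer of $p(O_t = 1 \mid s_t, a_t)$.
Since $D_{\mathrm{KL}} \geq 0$, we see that $J(\pi, q) \leq \log p_\pi(O = 1)$. Hence, $J(\pi, q)$ is called the evidence lower bound or ELBO. We can define the policy learning task as maximizing the ELBO, subject to the constraints that $q(a \mid s)$ and $\pi(a \mid s)$ are distributions that integrate to 1 across actions for all states.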
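The decomposition of the log evidence into an ELBO plus a KL term can be checked numerically on a tiny example. The sketch below uses a single time step with three actions and no state, and it assumes $p(O = 1 \mid a) = \exp(\eta^{-1} G(a))$ exactly (normalizer equal to 1, with $G \leq 0$) so that no additive constant appears. All numbers are made up for illustration.

```python
import math

eta = 0.5
G = [0.0, -1.0, -2.0]    # hypothetical quality values, chosen <= 0 so exp(G/eta) <= 1
pi = [0.2, 0.3, 0.5]     # policy, acting as the prior over actions
q = [0.6, 0.3, 0.1]      # an arbitrary variational distribution over actions

lik = [math.exp(g / eta) for g in G]             # p(O=1 | a), assumed equal to exp(G/eta)
joint = [p * l for p, l in zip(pi, lik)]         # p_pi(a, O=1)
evidence = sum(joint)                            # p_pi(O=1)
log_evidence = math.log(evidence)
posterior = [j / evidence for j in joint]        # p_pi(a | O=1)

# ELBO: E_q[log p_pi(a, O=1) - log q(a)]
elbo = sum(qa * (math.log(j) - math.log(qa)) for qa, j in zip(q, joint))
# KL(q || posterior)
kl = sum(qa * (math.log(qa) - math.log(pa)) for qa, pa in zip(q, posterior))

gap = log_evidence - (elbo + kl)   # should be zero up to floating point error
```

The gap between the log evidence and the ELBO is exactly the KL divergence from $q$ to the posterior, so the bound is tight only when $q$ matches the posterior.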
Intuition
We can think of $J(\pi, q)$ as scoring trajectories proposed by the variational policy $q$ according to two criteria. First, it prefers trajectories whose actions lead to high values of $G(s_t, a_t)$, so it rewards behavior that looks optimal under the control-as-inference model. Second, it penalizes $q$ for moving too far away from the policy $\pi$ through the KL term. Thus, maximizing $J(\pi, q)$ means finding a trajectory distribution that concentrates on high-quality behavior while still remaining compatible with the current policy and the true dynamics.
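To extend this to the infinite-horizon discounted case, we define $d_q(s)$ as the unnormalized discounted distribution over states
$$d_q(s) = \sum_{t=1}^{\infty} \gamma^{t-1}\, q(s_t = s)$$
where $\gamma \in [0, 1)$ is the discount factor and $q(s_t = s)$ is the marginal probability of being in state $s$ at time $t$ under $q(\tau)$.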
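Replacing the finite-horizon sum with an expectation under the discounted visitation measure gives the constrained objective
$$\max_{\pi, q} J(\pi, q) = \mathbb{E}_{d_q(s)}\left[\mathbb{E}_{q(a \mid s)}\left[\eta^{-1} G(s, a)\right] - D_{\mathrm{KL}}\left(q(\cdot \mid s) \,\|\, \pi(\cdot \mid s)\right)\right] \quad \text{s.t.} \quad \int q(a \mid s)\, da = 1, \;\; \int \pi(a \mid s)\, da = 1 \;\; \forall s$$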
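As a rough numerical sketch of the discounted visitation measure, the snippet below accumulates discounted state marginals for a hypothetical two-state chain. It assumes the convention $d_q(s) = \sum_{t \geq 1} \gamma^{t-1} q(s_t = s)$ with trajectories starting at $t = 1$. The chain, the matrix `P` (state-to-state transitions after averaging over the actions chosen by $q$), and the truncation horizon are illustrative choices, not part of the source.

```python
def discounted_visitation(mu, P, gamma, horizon=2000):
    """Approximate d(s) = sum_{t>=1} gamma^(t-1) * Pr(s_t = s), truncated at `horizon` steps."""
    n = len(mu)
    d = [0.0] * n
    p_t = list(mu)   # marginal distribution of s_t, starting with t = 1
    w = 1.0          # discount weight gamma^(t-1)
    for _ in range(horizon):
        for s in range(n):
            d[s] += w * p_t[s]
        # Push the marginal one step forward through the chain.
        p_t = [sum(p_t[s] * P[s][s2] for s in range(n)) for s2 in range(n)]
        w *= gamma
    return d

mu = [1.0, 0.0]                 # hypothetical initial state distribution
P = [[0.9, 0.1],                # P[s][s2] = sum_a q(a|s) p(s2|s,a), assumed given
     [0.0, 1.0]]
gamma = 0.9
d = discounted_visitation(mu, P, gamma)
```

Because the measure is unnormalized, its total mass is $1/(1 - \gamma)$ rather than 1, which is why expectations under $d_q$ are sometimes rescaled by $(1 - \gamma)$.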
There are two main ways to solve this optimization problem, which are called EM control and KL control.
EM Control
One can optimize the above objective using the Expectation Maximization (EM) algorithm, which monotonically increases a lower bound on its objective.
In the E step, we maximize $J(\pi, q)$ with respect to a non-parametric representation of the variational posterior $q$, while holding the parametric prior $\pi_\theta$ fixed at its value $\pi_{\theta_{k-1}}$ from the previous, $(k-1)$th, iteration, to get $q_k$.
In the M step, we then maximize $J(\pi, q)$ with respect to $\theta$, holding the variational posterior fixed at $q_k$, to get the updated policy $\pi_{\theta_k}$.
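The alternation between the two steps can be sketched on a single-state problem with three actions. This is an illustration under simplifying assumptions rather than the method of any particular algorithm: the E step uses the standard closed-form maximizer of $\mathbb{E}_q[\eta^{-1} G] - D_{\mathrm{KL}}(q \,\|\, \pi)$, namely $q \propto \pi \exp(\eta^{-1} G)$, and the M step assumes a tabular policy that can copy $q$ exactly, whereas a parametric policy would instead be fit to $q$ by gradient steps. The values in `G` are made up.

```python
import math

def J(q, pi, G, eta):
    """One-state objective: E_q[G/eta] - KL(q || pi)."""
    return sum(qa * (g / eta - math.log(qa / pa))
               for qa, pa, g in zip(q, pi, G) if qa > 0)

def e_step(pi, G, eta):
    """Closed-form maximizer over q with pi held fixed: q proportional to pi * exp(G/eta)."""
    w = [pa * math.exp(g / eta) for pa, g in zip(pi, G)]
    z = sum(w)
    return [x / z for x in w]

def m_step(q):
    """Tabular M step (an assumption): the policy that minimizes KL(q || pi) is pi = q."""
    return list(q)

G = [1.0, 0.5, 0.0]       # hypothetical quality values for three actions
eta = 1.0
pi = [1 / 3, 1 / 3, 1 / 3]
history = []              # value of J after every E step and every M step
for k in range(5):
    q = e_step(pi, G, eta)
    history.append(J(q, pi, G, eta))
    pi = m_step(q)
    history.append(J(q, pi, G, eta))
```

In this toy setting the recorded values of $J$ never decrease, and the policy gradually concentrates on the action with the largest quality value.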
KL Control (Maximum Entropy RL)
In KL control, we only optimize the variational posterior $q$, holding the prior $\pi$ fixed. Thus, we only have an E step. In addition, we represent $q$ parametrically as $q_\theta$, instead of the non-parametric approach used by EM.
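If the prior $\pi$ is uniform and we use $G(s, a) = R(s, a)$, then the objective (Equation 10) becomes
$$J(q) = \mathbb{E}_{d_q(s)}\left[\mathbb{E}_{q(a \mid s)}\left[\eta^{-1} R(s, a)\right] - N\left(q(\cdot \mid s)\right) - C\right]$$
where $N(q(\cdot \mid s)) = \sum_a q(a \mid s) \log q(a \mid s)$ is the negative entropy function and $C$ is a constant (for a finite action set $\mathcal{A}$, $C = \log |\mathcal{A}|$). This is called the maximum entropy RL objective.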
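The step from the KL penalty to an entropy bonus rests on a simple identity: for a uniform prior over a finite action set, $D_{\mathrm{KL}}(q \,\|\, \text{uniform}) = N(q) + \log |\mathcal{A}|$, where $N(q)$ is the negative entropy. The following sketch checks this for one arbitrary, made-up distribution over three actions.

```python
import math

def kl(q, p):
    """KL(q || p) for discrete distributions given as lists."""
    return sum(qa * math.log(qa / pa) for qa, pa in zip(q, p) if qa > 0)

def neg_entropy(q):
    """Negative entropy: sum_a q(a) log q(a)."""
    return sum(qa * math.log(qa) for qa in q if qa > 0)

q = [0.7, 0.2, 0.1]                    # arbitrary example distribution over actions
uniform = [1 / len(q)] * len(q)        # uniform prior over the same actions
C = math.log(len(q))                   # the constant, log |A|

lhs = kl(q, uniform)
rhs = neg_entropy(q) + C
```

Since the constant does not depend on $q$, minimizing the KL penalty against a uniform prior is the same as maximizing the entropy of $q$.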
Sources
Murphy, K. (2025). Reinforcement Learning: An Overview. Chapter 3.