Model-Based RL

Model-free approaches to RL typically need a lot of interactions with the environment to achieve good performance. For example, state of the methods for Atari need millions of frames, while humans can achieve the same performance in minutes.

One approach to increase the sample efficiency is model-based RL (MBRL). In the simplest approach to MBRL, we first learn the state transition dynamics model , also called a world model, and the reward function , using some offline trajectory data, and then use these models to compute a policy. It can be shown that the complexity of learning the dynamics model is less than the sample complexity of learning the policy 1.

Caution

The above two stage approach, or first learning a model and then planning with it, can suffer from problems such as the policy may query the model at a state for which no data has been collected, so prediction can be unreliable. To get better results, we have to interleave the model learning and the policy learning, so that one helps the other.

There are two main ways to perform MBRL

  • Decision-time planning or model predictive control, which use the model to choose the next action by searching over possible future trajectories. Then each trajectory is scored, pick the action corresponding to the best one, take a step in the environment, and repeat.
  • Background planning, which uses the current model and policy to rollout imaginary trajectories, and to use this data to improve the policy using model-free RL

Decision-Time (Online) Planning

Receding Horizon Control (RHC)

In receding horizon control, we plan from the current state to a maximum fixed depth of . We then take the first action based on this future planning, observe the new state , and then replan. This can be slow since it needs to perform a search at each step, but it can give good results. Some examples include:

  • Forward Search: Starting at the current state, all actions are considered, then all corresponding next states, continuing until all transitions up to depth . The reward associated with each edge is computed. At the leaves, the value function is computed (using a learned offline method), and the path with the highest score is used to apply the corresponding first action.
  • Branch and Bound: We try to avoid the exponential complexity of forward search by pruning paths that we determine are suboptimal. To do this, we need to know a lower bound on the value function , and an upper bound on the action value function . At each state node , we examine the actions in decreasing order of their upper bound. If we find an action where is less than the current best lower bound, we prune this branch of the tree. Otherwise we expand it and explore below. We continue this until we hit a leaf node (at maximum depth), in which case we return the lower bound .
  • Sparse Sampling: To speed up forward search and branch and bound, we can sample a subset of possible next states for each action.
  • Heuristic Search: We start with a value function which we use to initialize the value function . We then perform Monte Carlo rollouts starting from the root node . At each state node, we pick the greedy action with respect to the current , i.e. . We then update , and sample the next state . We repeat this process until the max depth, and then return the greedy action applied to the root node.

Monte Carlo Tree Search (MCTS)

Monte Carlo Tree Search is a receding horizon control procedure. It can be applied to any MDP, but some of its most famous applications are to games.

Sequential Monte Carlo (SMC) for Online Planning

Although MCTS is powerful, it is inherently serial, and can be complicated to apply to continuous action spaces. A different approach is to view online planning through the RL as Inference lens, and then use sequential Monte Carlo (SMC) to approximately sample good future trajectories.

The high-level goal is to maintain a population of candidate futures and gradually concentrate that population on trajectories that look promising under the learned model. For a planning horizon , we can define a target distribution over trajectories starting from the current state as

where , is the learned dynamics model, and

is a one-step model-based estimate of the advantage. The exponential tilt means that trajectories whose steps have larger estimated advantage receive larger probability mass.

Intuition

MCTS builds a search tree and repeatedly focuses computation on promising branches. SMC does something similar, but with a set of weighted trajectory samples instead of an explicit tree. Each particle is a candidate future. After every imagined step, particles that look better under the model get larger weights, weak particles get smaller weights, and resampling duplicates the promising particles while discarding poor ones. By the end, the first actions attached to the surviving particles tell us which action looks best right now.

The resulting empirical trajectory distribution is

which induces an empirical distribution over the next action,

where is the first action of particle .

To approximately sample from , SMC extends each particle one step at a time using a proposal distribution. A simple and natural choice is to use the model and current policy as the proposal,

Under this choice, the proposal already accounts for the policy and dynamics factors, so the incremental importance weight is just

where . More generally, if the proposal differs from the target dynamics or action distribution, the weight update also includes the usual target-to-proposal correction ratio.

In practice, we periodically resample when the effective sample size becomes too small, meaning that the normalized weights have become concentrated on only a few particles even though we still store particles. This prevents nearly all of the mass from collapsing onto a few particles. After resampling, the particle set is unweighted again, so we reset the weights to a constant and continue rolling the particles forward.

Algorithm 4 SMC with Receding Horizon Control (RHC)

procedure SMC-RHC(st,π,T^,R^,V,d,ηs_t, \pi, \hat{\mathcal{T}}, \hat{R}, V, d, \eta)

Initialize particles {s0(n)=st}n=1N\{s_0^{(n)} = s_t\}_{n=1}^N

Initialize weights {w0(n)=1}n=1N\{w_0^{(n)} = 1\}_{n=1}^N

for k=0,1,,d1k=0,1,\dots,d-1 do

Sample {ak(n)π(sk(n))}n=1N\{a_k^{(n)} \sim \pi(\cdot \mid s_k^{(n)})\}_{n=1}^N

Sample {sk+1(n)T^(sk(n),ak(n))}n=1N\{s_{k+1}^{(n)} \sim \hat{\mathcal{T}}(\cdot \mid s_k^{(n)}, a_k^{(n)})\}_{n=1}^N

Compute {A^k(n)=R^(sk(n),ak(n))+γV(sk+1(n))V(sk(n))}n=1N\{\hat{A}_k^{(n)} = \hat{R}(s_k^{(n)}, a_k^{(n)}) + \gamma V(s_{k+1}^{(n)}) - V(s_k^{(n)})\}_{n=1}^N

Update {wk+1(n)=wk(n)exp(A^k(n)/η)}n=1N\{w_{k+1}^{(n)} = w_k^{(n)} \exp(\hat{A}_k^{(n)} / \eta)\}_{n=1}^N

if effective sample size is too small then

Resample complete particle histories with probabilities proportional to {wk+1(n)}n=1N\{w_{k+1}^{(n)}\}_{n=1}^N

Reset {wk+1(n)=1}n=1N\{w_{k+1}^{(n)} = 1\}_{n=1}^N

end if

end for

Normalize weights w(n)=wd(n)/n=1Nwd(n)\overline{w}^{(n)} = w_d^{(n)} / \sum_{n'=1}^N w_d^{(n')}

Let a0(n)a_0^{(n)} be the first action in particle history nn

return q^(ast)=n=1Nw(n)δ(aa0(n))\hat{q}(a \mid s_t) = \sum_{n=1}^N \overline{w}^{(n)} \delta(a - a_0^{(n)})

end procedure

Model Predictive Control (MPC)

Model Predictive Control (MPC) is an open loop version of receding horizon. In particular, at each step, it solves for the sequence of subsequent actions that is most likely to achieve high expected reward

where is the dynamics model. It then returns as the best action, takes a step, then replans.

Crucially, the future actions are chosen without knowing what the future states are; this is what is meant by open loop. This can be much faster than interleaving the search for actions and future states. However, it can also lead to suboptimal decisions. Nevertheless, the fact that we replan at each step can reduce the harms of this approximation, making the method quite popular for some problems, especially ones where the dynamics are deterministic. Some examples include:

  • Random shooting methods: Sample many candidate action sequences, simulate them with the model, and choose the sequence with the highest predicted return. 2
  • Cross-Entropy Method (CEM) planning: Iteratively sample action sequences, keep the elite ones, and refit the sampling distribution toward better plans. 3
  • Model Predictive Path Integral (MPPI) control: Use reward-weighted averaging of sampled control sequences to update the current plan. 4
  • Iterative LQR / Differential Dynamic Programming (iLQR / DDP): Locally optimize a nominal action sequence by repeatedly linearizing the dynamics and quadratizing the objective.
  • PETS: A model-based RL method that learns an ensemble dynamics model and then uses MPC, often with CEM, to choose actions online. 5

Background (Offline) Planning

Online planning can be slow. Fortunately, we can amortize the planning process into a reactive policy. TO do this, we can use the model to generate synthetic trajectories in the background (while executing the current policy), and use this imaginary data to train the policy. This is called background planning.

One example of this is Dyna 6, which trains a policy and model in parallel. In this case, the policy is trained on both real and imaginary data. That is, we define

where is the data from the real environment, and is the rollout using from the imaginary model. At each step of Dyna, the agent collects new data from the environment and adds it to a real replay buffer. This is then used to do an off-policy update. It also updates its world model given the real data. Then it simulates imaginary data, starting from a previously visited state, and rolling out the current policy in the learned model. The imaginary data is then added to the imaginary replay buffer and used by an on-policy learning algorithm. This process continues until the agent runs out of time and must take the next step in the environment.

Algorithm 5 Tabular Dyna-Q

Initialize data buffer D\mathcal{D}, Q(s,a)=0Q(s,a)=0, and World model M^(s,a)=0\hat{M}(s,a)=0

repeat

// Collect real data from environment

aa \gets epsilon-greedy action

Take action aa in environment, observe (s,r)(s',r)

add (s,a,r,s)(s,a,r,s') to buffer D\mathcal{D}

// Upate policy

Q(s,a)Q(s,a)+η[r+γmaxaQ(s,a)Q(s,a)]Q(s,a) \gets Q(s,a) + \eta[r + \gamma \max_{a'} Q(s',a') - Q(s,a)]

// Update model on real data

M^(s,a)=(s,r)\hat{M}(s,a) = (s',r)

sss \gets s'

for n=1:N do // Update policy on imaginary data

Select (s,a)(s,a) from D\mathcal{D}

(s,r)=M^(s,a)(s',r) = \hat{M}(s,a)

Q(s,a)Q(s,a)+η[r+γmaxaQ(s,a)Q(s,a)]Q(s,a) \gets Q(s,a) + \eta[r + \gamma \max_{a'} Q(s',a') - Q(s,a)]

end for

until converged

Sources

  • Murphy, K. (2025). Reinforcement Learning: An Overview. Chapter 4.

Footnotes

  1. On Representation Complexity of Model-based and Model-free Reinforcement Learning

  2. Fast Direct Multiple Shooting Algorithms for Optimal Robot Control

  3. A Tutorial on the Cross Entropy Method

  4. Model Predictive Path Integral Control using Covariance Variable Importance Sampling

  5. Deep Reinforcement Learning in a Handful of Trials using Probabilistic Dynamics Models

  6. Integrated Architectures for Learning, Planning, and Reacting Based on Approximating Dynamic Programming