Model-Based RL
Model-free approaches to RL typically need many interactions with the environment to achieve good performance. For example, state-of-the-art methods for Atari need millions of frames, while humans can reach the same performance in minutes.
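One approach to increasing sample efficiency is model-based RL (MBRL). In the simplest approach to MBRL, we first learn the state transition dynamics model $p(s' \mid s, a)$ (together with the reward function, if it is unknown) from data, and then use the learned model to plan or to compute a policy.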
Caution
The two-stage approach above, of first learning a model and then planning with it, can run into problems. For example, the policy may query the model at a state for which no data has been collected, so the prediction can be unreliable. To get better results, we have to interleave model learning and policy learning, so that each helps the other.
There are two main ways to perform MBRL:
- Decision-time planning, or model predictive control, which uses the model to choose the next action by searching over possible future trajectories. We score each trajectory, pick the action corresponding to the best one, take a step in the environment, and repeat.
- Background planning, which uses the current model and policy to roll out imaginary trajectories, and uses this data to improve the policy with model-free RL.
Decision-Time (Online) Planning
Receding Horizon Control (RHC)
In receding horizon control, we plan from the current state $s_t$ up to a maximum depth $d$, execute the first action of the best plan found, observe the next state, and then replan. The search can be carried out in several ways:
- Forward Search: Starting at the current state, all actions are considered, then all corresponding next states, continuing until all transitions up to depth $d$ have been expanded. The reward associated with each edge is computed. At the leaves, the value function is evaluated (using an estimate learned offline), and the first action on the path with the highest score is applied.
- Branch and Bound: We try to avoid the exponential complexity of forward search by pruning paths that we determine are suboptimal. To do this, we need to know a lower bound on the value function, $\underline{V}(s)$, and an upper bound on the action value function, $\overline{Q}(s, a)$. At each state node $s$, we examine the actions in decreasing order of their upper bound. If we find an action $a$ where $\overline{Q}(s, a)$ is less than the current best lower bound, we prune this branch of the tree. Otherwise we expand it and explore below. We continue this until we hit a leaf node (at maximum depth), in which case we return the lower bound $\underline{V}(s)$.
- Sparse Sampling: To speed up forward search and branch and bound, we can sample a subset of $m$ possible next states for each action.
- Heuristic Search: We start with a heuristic value function $V_0(s)$, which we use to initialize the value function $V(s)$. We then perform Monte Carlo rollouts starting from the root node $s_t$. At each state node $s$, we pick the greedy action with respect to the current $V$, i.e. $a = \arg\max_{a'} \left[ R(s, a') + \gamma \sum_{s'} p(s' \mid s, a') V(s') \right]$. We then update $V(s)$ with the corresponding backed-up value, and sample the next state $s' \sim p(\cdot \mid s, a)$. We repeat this process until the max depth, and then return the greedy action applied to the root node.
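The forward search strategy in the list above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a reference implementation: the tabular model format (`model[state][action]` as a list of `(prob, next_state, reward)` tuples), the toy two-state MDP, and the names `forward_search`, `TOY_MODEL`, and `leaf_value` are all hypothetical.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical tabular model: model[s][a] = [(prob, next_state, reward), ...]
Model = Dict[int, Dict[int, List[Tuple[float, int, float]]]]


def forward_search(model: Model, state: int, depth: int, gamma: float,
                   leaf_value: Callable[[int], float]) -> Tuple[Optional[int], float]:
    """Exhaustive lookahead to `depth`; returns (best first action, its value)."""
    if depth == 0:
        # At the leaves we fall back on a value estimate (assumed learned offline).
        return None, leaf_value(state)
    best_action, best_value = None, float("-inf")
    for action, transitions in model[state].items():
        q = 0.0
        for prob, next_state, reward in transitions:
            _, v_next = forward_search(model, next_state, depth - 1, gamma, leaf_value)
            q += prob * (reward + gamma * v_next)
        if q > best_value:
            best_action, best_value = action, q
    return best_action, best_value


# Toy MDP (an assumption for illustration): state 1 pays reward 2 for staying put.
TOY_MODEL: Model = {
    0: {0: [(1.0, 0, 0.0)], 1: [(0.5, 1, 1.0), (0.5, 0, 0.0)]},
    1: {0: [(1.0, 1, 2.0)], 1: [(1.0, 0, 0.0)]},
}

action, value = forward_search(TOY_MODEL, state=0, depth=2, gamma=0.9,
                               leaf_value=lambda s: 0.0)
print(action, value)  # action 1, value close to 1.625
```

The cost grows exponentially with `depth`, which is what branch and bound and sparse sampling try to reduce.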
Monte Carlo Tree Search (MCTS)
Monte Carlo Tree Search is a receding horizon control procedure. It can be applied to any MDP, but some of its most famous applications are to games.
Sequential Monte Carlo (SMC) for Online Planning
Although MCTS is powerful, it is inherently serial, and can be complicated to apply to continuous action spaces. A different approach is to view online planning through the RL as Inference lens, and then use sequential Monte Carlo (SMC) to approximately sample good future trajectories.
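The high-level goal is to maintain a population of candidate futures and gradually concentrate that population on trajectories that look promising under the learned model. For a planning horizon $H$ starting from the current state $s_t$, we target a distribution over trajectories $\tau = (s_t, a_t, s_{t+1}, \ldots, s_{t+H})$ of the form

$$q(\tau \mid s_t) \propto \prod_{h=t}^{t+H-1} \pi(a_h \mid s_h)\, p(s_{h+1} \mid s_h, a_h)\, \exp\big(\hat{A}(s_h, a_h, s_{h+1})\big)$$

where $\hat{A}(s_h, a_h, s_{h+1})$ is a one-step model-based estimate of the advantage, for example $r_h + \gamma V(s_{h+1}) - V(s_h)$. The exponential tilt means that trajectories whose steps have larger estimated advantage receive larger probability mass.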
Intuition
MCTS builds a search tree and repeatedly focuses computation on promising branches. SMC does something similar, but with a set of weighted trajectory samples instead of an explicit tree. Each particle is a candidate future. After every imagined step, particles that look better under the model get larger weights, weak particles get smaller weights, and resampling duplicates the promising particles while discarding poor ones. By the end, the first actions attached to the surviving particles tell us which action looks best right now.
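The resulting empirical trajectory distribution is

$$\hat{q}(\tau) = \sum_{n=1}^{N} \bar{w}^n\, \delta_{\tau^n}(\tau)$$

which induces an empirical distribution over the next action,

$$\hat{q}(a_t) = \sum_{n=1}^{N} \bar{w}^n\, \mathbb{I}(a_t^n = a_t)$$

where $\tau^n$ is the trajectory of particle $n$, $a_t^n$ is its first action, and $\bar{w}^n = w^n / \sum_{m=1}^{N} w^m$ is its normalized weight.

To approximately sample from the target trajectory distribution, we extend each particle one step at a time, using the current policy and the learned dynamics model as the proposal: $a_h^n \sim \pi(\cdot \mid s_h^n)$ and $s_{h+1}^n \sim p(\cdot \mid s_h^n, a_h^n)$. Under this choice, the proposal already accounts for the policy and dynamics factors, so the incremental importance weight is just the exponentiated advantage, giving the update

$$w^n \leftarrow w^n \exp\big(\hat{A}(s_h^n, a_h^n, s_{h+1}^n)\big)$$

where $\hat{A}$ is the one-step advantage estimate and $n$ indexes the particles.

In practice, we periodically resample when the effective sample size becomes too small, meaning that the normalized weights have become concentrated on only a few particles even though we still store $N$ of them.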
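The text does not give a formula or a threshold for the effective sample size. A common estimate is the inverse of the sum of squared normalized weights, and a common heuristic is to resample when it drops below half the number of particles. Both are assumptions in the small sketch below, as is the helper name `effective_sample_size`.

```python
def effective_sample_size(weights):
    """Common ESS estimate: 1 / sum of squared normalized weights."""
    total = sum(weights)
    normalized = [w / total for w in weights]
    return 1.0 / sum(w * w for w in normalized)


# Uniform weights: every particle counts, so the ESS equals the particle count.
uniform_ess = effective_sample_size([1.0, 1.0, 1.0, 1.0])

# One dominant particle: the ESS falls to about 1.92 out of 4.
skewed_ess = effective_sample_size([0.7, 0.1, 0.1, 0.1])

# Assumed heuristic threshold: resample when ESS < N / 2.
needs_resampling = skewed_ess < 0.5 * 4
print(uniform_ess, skewed_ess, needs_resampling)
```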
Algorithm 4 SMC with Receding Horizon Control (RHC)
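procedure SMC-RHC($s_t$, $\pi$, $p$, $\hat{A}$, $H$, $N$)
  Initialize particles $s_t^n = s_t$ for $n = 1, \ldots, N$
  Initialize weights $w^n = 1$ for $n = 1, \ldots, N$
  for $h = t, \ldots, t+H-1$ do
    Sample $a_h^n \sim \pi(\cdot \mid s_h^n)$
    Sample $s_{h+1}^n \sim p(\cdot \mid s_h^n, a_h^n)$
    Compute $\hat{A}_h^n = \hat{A}(s_h^n, a_h^n, s_{h+1}^n)$
    Update $w^n \leftarrow w^n \exp(\hat{A}_h^n)$
    if effective sample size is too small then
      Resample complete particle histories with probabilities proportional to $w^n$
      Reset $w^n = 1$
    end if
  end for
  Normalize weights $\bar{w}^n = w^n / \sum_{m} w^m$
  Let $a_t^n$ be the first action in particle history $n$
  return $\hat{q}(a_t) = \sum_{n} \bar{w}^n\, \mathbb{I}(a_t^n = a_t)$
end procedure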
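The procedure above can be sketched in plain Python as follows. This is a rough illustration under several assumptions that are not in the source: the advantage estimate is taken to be `r + gamma * V(s') - V(s)`, resampling is multinomial and triggered when the ESS falls below a fraction `ess_frac` of the particles, and the toy 1D environment and all function names are hypothetical. Because only the first action is returned, each particle stores its current state and first action rather than its full history.

```python
import math
import random
from collections import defaultdict


def normalize(log_w):
    """Turn log-weights into normalized weights (max-subtracted for stability)."""
    m = max(log_w)
    w = [math.exp(lw - m) for lw in log_w]
    total = sum(w)
    return [x / total for x in w]


def smc_rhc(state, policy_sample, model_sample, reward_fn, value_fn,
            horizon=5, num_particles=200, gamma=0.99, ess_frac=0.5, seed=0):
    rng = random.Random(seed)
    states = [state] * num_particles
    first_actions = [None] * num_particles
    log_w = [0.0] * num_particles  # log of w^n = 1
    for h in range(horizon):
        for n in range(num_particles):
            s = states[n]
            a = policy_sample(s, rng)         # proposal: current policy
            s_next = model_sample(s, a, rng)  # proposal: learned dynamics
            # Assumed one-step advantage estimate; added in log space.
            log_w[n] += reward_fn(s, a, s_next) + gamma * value_fn(s_next) - value_fn(s)
            states[n] = s_next
            if h == 0:
                first_actions[n] = a
        w_bar = normalize(log_w)
        ess = 1.0 / sum(x * x for x in w_bar)
        if ess < ess_frac * num_particles:
            # Multinomial resampling, then reset the weights to 1.
            idx = rng.choices(range(num_particles), weights=w_bar, k=num_particles)
            states = [states[i] for i in idx]
            first_actions = [first_actions[i] for i in idx]
            log_w = [0.0] * num_particles
    # Empirical distribution over the first action.
    action_probs = defaultdict(float)
    for a, w in zip(first_actions, normalize(log_w)):
        action_probs[a] += w
    return dict(action_probs)


# Toy problem (assumed): walk on the integers, reward is minus distance to 3.
def toy_policy(s, rng):
    return rng.choice([-1, 1])

def toy_model(s, a, rng):
    return s + a

def toy_reward(s, a, s_next):
    return -abs(s_next - 3)

def toy_value(s):
    return 0.0

probs = smc_rhc(0, toy_policy, toy_model, toy_reward, toy_value)
print(probs)  # most of the mass should sit on action +1
```

The inner loop over particles is independent across `n`, which is the property that makes this approach easier to parallelize than tree search.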
Model Predictive Control (MPC)
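Model Predictive Control (MPC) is an open-loop version of receding horizon control. In particular, at each step, it solves for the sequence of subsequent actions that is most likely to achieve high expected reward,

$$a_{t:t+H-1}^* = \arg\max_{a_{t:t+H-1}} \mathbb{E}\left[ \sum_{h=t}^{t+H-1} R(s_h, a_h) \right]$$

where $H$ is the planning horizon and the expectation is over the future states $s_{t+1:t+H}$ drawn from the dynamics model, given the current state $s_t$ and the action sequence.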
Crucially, the future actions are chosen without knowing what the future states are; this is what is meant by open loop. This can be much faster than interleaving the search for actions and future states. However, it can also lead to suboptimal decisions. Nevertheless, the fact that we replan at each step can reduce the harms of this approximation, making the method quite popular for some problems, especially ones where the dynamics are deterministic. Some examples include:
- Random shooting methods: Sample many candidate action sequences, simulate them with the model, and choose the sequence with the highest predicted return. 2
- Cross-Entropy Method (CEM) planning: Iteratively sample action sequences, keep the elite ones, and refit the sampling distribution toward better plans. 3
- Model Predictive Path Integral (MPPI) control: Use reward-weighted averaging of sampled control sequences to update the current plan. 4
- Iterative LQR / Differential Dynamic Programming (iLQR / DDP): Locally optimize a nominal action sequence by repeatedly linearizing the dynamics and quadratizing the objective.
- PETS: A model-based RL method that learns an ensemble dynamics model and then uses MPC, often with CEM, to choose actions online. 5
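As an illustration of the sampling-based planners in the list above, here is a minimal Cross-Entropy Method planner for a scalar action. It is a sketch under assumptions, not any particular published implementation: the sampling distribution is an independent Gaussian per time step, actions are clipped to `[a_min, a_max]`, the model is deterministic, and the toy dynamics and all names are hypothetical.

```python
import random
import statistics


def rollout_return(state, actions, model_step, reward_fn):
    """Simulate an open-loop action sequence with the model and sum the rewards."""
    total = 0.0
    for a in actions:
        next_state = model_step(state, a)
        total += reward_fn(state, a, next_state)
        state = next_state
    return total


def cem_plan(state, model_step, reward_fn, horizon=5, pop=64, n_elite=8,
             iters=5, a_min=-1.0, a_max=1.0, seed=0):
    rng = random.Random(seed)
    mean = [0.0] * horizon
    std = [1.0] * horizon
    for _ in range(iters):
        # Sample candidate action sequences from the current Gaussian.
        candidates = [
            [min(a_max, max(a_min, rng.gauss(mean[h], std[h]))) for h in range(horizon)]
            for _ in range(pop)
        ]
        # Keep the elite sequences with the highest predicted return.
        candidates.sort(
            key=lambda seq: rollout_return(state, seq, model_step, reward_fn),
            reverse=True,
        )
        elites = candidates[:n_elite]
        # Refit the sampling distribution to the elites.
        for h in range(horizon):
            column = [seq[h] for seq in elites]
            mean[h] = statistics.fmean(column)
            std[h] = statistics.pstdev(column) + 1e-3  # floor keeps sampling alive
    return mean


# Toy model (assumed): x' = x + a, reward is minus squared distance to 3.
def toy_step(x, a):
    return x + a

def toy_reward(x, a, x_next):
    return -(x_next - 3.0) ** 2

plan = cem_plan(0.0, toy_step, toy_reward)
first_action = plan[0]  # MPC executes only this action, then replans
print(first_action)
```

Setting `iters=1` and returning the single best candidate instead of the refitted mean would reduce this to random shooting.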
Background (Offline) Planning
Online planning can be slow. Fortunately, we can amortize the planning process into a reactive policy. To do this, we can use the model to generate synthetic trajectories in the background (while executing the current policy), and use this imaginary data to train the policy. This is called background planning.
One example of this is Dyna 6, which trains a policy and model in parallel. In this case, the policy is trained on both real and imaginary data. That is, we define
where
Algorithm 5 Tabular Dyna-Q
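Initialize data buffer $\mathcal{D} = \emptyset$, $Q(s, a) = 0$, and world model $\hat{M}(s, a) = 0$
repeat
  // Collect real data from environment
  $a$ = epsilon-greedy action with respect to $Q(s, \cdot)$
  Take action $a$ in environment, observe $(s', r)$
  Add $(s, a, r, s')$ to buffer $\mathcal{D}$
  // Update policy on real data
  $Q(s, a) \leftarrow Q(s, a) + \eta \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]$
  // Update model on real data
  $\hat{M}(s, a) \leftarrow (s', r)$
  $s \leftarrow s'$
  for n=1:N do // Update policy on imaginary data
    Select $(\tilde{s}, \tilde{a})$ from $\mathcal{D}$
    $(\tilde{s}', \tilde{r}) = \hat{M}(\tilde{s}, \tilde{a})$
    $Q(\tilde{s}, \tilde{a}) \leftarrow Q(\tilde{s}, \tilde{a}) + \eta \left[ \tilde{r} + \gamma \max_{a'} Q(\tilde{s}', a') - Q(\tilde{s}, \tilde{a}) \right]$
  end for
until converged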
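A runnable Python sketch of tabular Dyna-Q follows, using only the standard library. Several details are assumptions made for illustration: the environment is deterministic and episodic, the world model is a dictionary keyed by the visited `(state, action)` pairs (so it doubles as the buffer to sample from), and the chain environment, hyperparameters, and function names are hypothetical.

```python
import random


def dyna_q(env_step, n_states, n_actions, start_state, num_steps=2000,
           planning_steps=10, eps=0.1, lr=0.5, gamma=0.95, seed=0):
    rng = random.Random(seed)
    Q = [[0.0] * n_actions for _ in range(n_states)]
    model = {}  # (s, a) -> (r, s_next, done); deterministic world model

    def q_update(s, a, r, s_next, done):
        target = r if done else r + gamma * max(Q[s_next])
        Q[s][a] += lr * (target - Q[s][a])

    s = start_state
    for _ in range(num_steps):
        # Collect real data with an epsilon-greedy action (random tie-breaking).
        if rng.random() < eps:
            a = rng.randrange(n_actions)
        else:
            best = max(Q[s])
            a = rng.choice([i for i, q in enumerate(Q[s]) if q == best])
        r, s_next, done = env_step(s, a)
        # Update policy on real data.
        q_update(s, a, r, s_next, done)
        # Update model on real data.
        model[(s, a)] = (r, s_next, done)
        # Update policy on imaginary data replayed from the model.
        for _ in range(planning_steps):
            (ps, pa), (pr, pn, pd) = rng.choice(list(model.items()))
            q_update(ps, pa, pr, pn, pd)
        s = start_state if done else s_next
    return Q


# Toy chain (assumed): states 0..4, action 0 = left, 1 = right, reward 1 at state 4.
def chain_step(s, a):
    s_next = max(0, s - 1) if a == 0 else s + 1
    done = s_next == 4
    return (1.0 if done else 0.0), s_next, done


Q = dyna_q(chain_step, n_states=5, n_actions=2, start_state=0)
greedy = [max(range(2), key=lambda a: Q[s][a]) for s in range(4)]
print(greedy)  # the greedy policy should move right in every non-terminal state
```

Setting `planning_steps=0` recovers ordinary Q-learning, which makes it easy to compare how many real environment steps each variant needs.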
Sources
- Murphy, K. (2025). Reinforcement Learning: An Overview. Chapter 4.
Footnotes
- On Representation Complexity of Model-based and Model-free Reinforcement Learning
- Fast Direct Multiple Shooting Algorithms for Optimal Robot Control
- Model Predictive Path Integral Control using Covariance Variable Importance Sampling
- Deep Reinforcement Learning in a Handful of Trials using Probabilistic Dynamics Models
- Integrated Architectures for Learning, Planning, and Reacting Based on Approximating Dynamic Programming