Monte Carlo Tree Search

Monte Carlo Tree Search (MCTS) is a search method that incrementally builds a partial search tree by sampling trajectories through the state space. Instead of relying on a hand-designed evaluation function at every node, it estimates the value of actions from simulated outcomes and uses those estimates to decide where to spend more search effort.

MCTS is especially useful when the state space is too large for exhaustive search, but it is cheap enough to generate legal actions and simulate or evaluate future states. It is most closely associated with game-tree search, but the same idea applies more broadly to sequential decision problems, planning, and black-box optimization.

MCTS can be applied in several settings:

  • Single-agent search, where the goal is to find a high-value trajectory or a solution state.
  • Two-player zero-sum games, where rewards for one player are losses for the other.
  • Multi-agent games, where each player may have a separate payoff.
  • Deterministic dynamics, where each action has a fixed successor state.
  • Stochastic dynamics, where a transition model or simulator samples the next state.
  • Perfect-information games, where the full state is visible to all players.
  • Imperfect-information games, using variants such as information-set MCTS.
  • Markov decision processes, where MCTS performs online planning from the current state.

Intuition

MCTS converts search into an allocation problem. Every simulation is a small experiment. The algorithm repeatedly asks: which part of the tree is promising enough to exploit, and which part is uncertain enough that another sample would be useful?

Basic Theory

Monte Carlo Tree Search is an iterative algorithm that searches the state space and builds statistical evidence about the decisions available in particular states. Each node stores statistics such as a visit count and an estimated value. After many iterations, the root action is chosen from these accumulated statistics rather than from a single rollout.

The common node statistics are:

  • : number of times state has been visited.
  • : number of times action has been selected from state .
  • : estimated value of selecting action in state .
  • : total backed-up reward accumulated for action from state .

Usually,

The four phases are:

  1. Selection: Starting from the root, recursively choose actions according to a tree policy until reaching a node that is not fully expanded or is terminal.
  2. Expansion: Add one or more child nodes corresponding to previously untried actions.
  3. Simulation: Estimate the value of the new node, often by running a rollout policy until a terminal state or by calling a value function.
  4. Backpropagation: Propagate the simulation result back through the selected path, updating visit counts and value estimates.

Algorithm 14 Monte Carlo Tree Search

Input: MDP MM, base QQ-function QQ, Iterations KK

Output: updated QQ-function QQ

k0k \gets 0

while k<Kk < K do

nn \gets Select(s0s_0)

cc \gets Expand(nn)

GG \gets Simulate(cc)

Backpropagate(nn, action of cc, QQ, GG)

kk+1k \gets k + 1

end while

function Select(ss)

while ss is fully expanded and not terminal do

Select action aa to apply in ss according to Tree Policy

Choose output ss' according to pS(ss)p_{S}(s' \mid s)

sss \gets s'

end while

return ss

end function

function Expand(ss)

Randomly sample aa (untired action)

Expand one outcome spS(s)s' \sim p_{S}(\bullet \mid s)

return ss'

end function

function Simulate(ss)

while ss not terminal do

Select action aa to apply in ss according to Default Policy

Choose output ss' according to pS(ss)p_{S}(s' \mid s)

sss \gets s'

end while

return Reward rr for state ss

end function

function Backpropagation(ss, aa, QQ, GG)

repeat

N(s,a)N(s,a)+1N(s,a) \gets N(s,a) + 1

Gr+γGG \gets r + \gamma G

Q(s,a)Q(s,a)+1N(s,a)[GQ(s,a)]Q(s,a) \gets Q(s,a) + \frac{1}{N(s,a)}{[G - Q(s,a)]}

ss \gets parent of ss

aa \gets parent action of ss

until s=s0s = s_0

end function

The search tree is asymmetric: promising regions receive many visits, while unpromising regions may remain shallow. This is a major difference from algorithms such as breadth-first search, which expand uniformly by depth, and best-first search, which relies on an explicit evaluation function over frontier nodes.

Choosing the Final Action

During search, actions are selected to balance exploration and exploitation. At the end of search, the action chosen at the root is usually selected differently. Common choices are:

  • Robust child: choose , the most visited action.
  • Max child: choose , the action with the largest empirical value.
  • Secure child: combine value and uncertainty, useful when avoiding low-confidence actions matters.

The robust child is common because the visit count reflects the tree policy’s accumulated evidence about which actions deserved search effort.

Tree Policy

The tree policy controls action selection inside the already-built portion of the tree. It is the part of MCTS responsible for the exploration-exploitation tradeoff.

Definition (UCT)

Upper Confidence bounds applied to Trees (UCT) selects an action by

where is the empirical mean value of action , is the parent visit count, is the action visit count, and controls exploration.

The first term exploits actions with high estimated value. The second term explores actions with low visit count. As grows, the exploration bonus for that action shrinks. As grows, actions that have not kept up with the parent visit count become attractive again.

UCT comes from the UCB1 algorithm for the multi-armed bandit problem. At a fixed state , each action can be viewed as an arm. Pulling an arm corresponds to selecting that action and receiving a sampled return from the rest of the search. UCB1 chooses arms by adding an optimism bonus to the empirical mean.

Intuition

UCT is optimistic under uncertainty. An action with a mediocre value estimate may still be selected if it has been sampled only a few times, because its true value might be better than the current estimate suggests.

The confidence term can be motivated by concentration inequalities such as Hoeffding’s inequality. If returns are bounded, then the empirical mean is likely to be close to the true mean after enough samples. The uncertainty shrinks on the order of , which gives the exploration term its square-root form. A central-limit-theorem view gives a similar intuition: empirical means become approximately normal around their true mean, with standard error decreasing like .

The usual UCT assumptions are idealized:

  • Rewards are bounded, often in or .
  • More samples from an action give an increasingly accurate estimate of its value.
  • The simulator or rollout process produces representative samples of future outcomes.
  • The value being backed up is stationary enough for averages to be meaningful.

In a tree, the bandit problems at different nodes are coupled. Sampling one action changes value estimates deeper in the tree, which changes future returns for ancestors. UCT nevertheless works well in practice because the same optimism principle recursively allocates search effort toward actions that are either good or insufficiently understood.

Backups

For single-agent maximization, the backed-up value is usually the sampled return from the trajectory. If a simulation from node produces return , the update can be written as

For two-player zero-sum games, values are usually stored from the perspective of the player to move or from a fixed root player’s perspective. The sign convention matters. If values are stored from the current player’s perspective, the backed-up value alternates sign when moving up the tree.

Caution

A common source of MCTS bugs is mixing value perspectives. The value estimate, tree policy, and backup rule must agree on whether is measured from the root player’s perspective or the current player’s perspective.

Advantages and Disadvantages

MCTS has several advantages:

  • It does not require a handcrafted heuristic evaluation function.
  • It can be stopped at any time and still return the best action found so far.
  • It focuses computation on promising parts of the tree.
  • It naturally handles delayed rewards because values are backed up from complete or partially evaluated trajectories.
  • It can work with a generative simulator without requiring an explicit transition table.
  • It can combine search with learned policies and value functions.

MCTS also has important limitations:

  • Vanilla rollouts can be noisy, making value estimates require many simulations.
  • Large branching factors make it difficult to visit each action enough times.
  • Long horizons delay reward information and increase variance.
  • Pure random rollouts may be uninformative in domains where valid behavior requires structure.
  • Theoretical guarantees usually depend on bounded rewards and enough visits, which may be unrealistic under a finite compute budget.
  • Transpositions, stochasticity, and imperfect information require additional bookkeeping or modified algorithms.

The branching factor is one of the main practical bottlenecks. If a node has thousands of legal actions, vanilla UCT initially spends many iterations trying each action once before meaningfully comparing them. This is why MCTS is often paired with policy priors, action pruning, or progressive widening.

Warning

MCTS is not automatically better than heuristic search. If a strong admissible heuristic is available and optimality matters, methods such as A* may be more appropriate. MCTS is most attractive when good heuristic evaluation is difficult but simulation or approximate value estimation is available.

Extensions

Many MCTS variants modify one of the four phases or change the tree policy to handle cases where vanilla UCT is poorly matched to the domain.

Reward Scaling

The UCT exploration constant is sensitive to the scale of the rewards. If rewards are assumed to lie in but the actual returns are much larger, the exploitation term can dominate the exploration bonus. If the rewards are very small, exploration can dominate for too long.

Common fixes include:

  • Normalize returns to a known range such as or .
  • Tune the exploration constant for the reward scale.
  • Use empirical value normalization when the reward range is unknown.
  • Replace random rollouts with a value function that is trained or calibrated to the desired range.

For single-player optimization problems, rewards may be unbounded or strongly depth-dependent. In that case, the naive empirical mean may be a poor comparison across nodes. It is often better to define a bounded utility, normalize objective values, or search over transformed returns.

Prior Knowledge

Prior knowledge can be injected into MCTS through the rollout policy, the tree policy, or the initialization of node statistics. A good prior policy can focus search on plausible actions before the value estimates are reliable.

Examples include:

  • Biasing rollouts toward domain-specific actions.
  • Initializing new edges with pseudo-counts and prior values.
  • Pruning obviously invalid or dominated actions.
  • Ordering expansion so likely actions are considered first.
  • Using a learned policy to provide action probabilities.

This is closely related to the motivation behind policy-guided methods such as Levin Tree Search and Policy Guided Heuristic Search, but MCTS uses the policy to allocate sampling effort rather than to define a deterministic best-first expansion order.

RAVE

Rapid Action Value Estimation (RAVE) shares value information across different parts of a trajectory. The idea is that in some domains, especially games like Go, an action that is good later in a rollout may also be informative about playing that action earlier from a related state.

RAVE maintains an additional estimate based on whether an action appeared anywhere later in the simulation, not only whether it was selected immediately from the node. Early in search, this can provide lower-variance estimates. Later in search, the algorithm gradually shifts back toward ordinary MCTS statistics, which are more specific to the node.

Intuition

RAVE says: if a move was useful somewhere in this simulated future, that is weak evidence that the move may be useful here too. This evidence is biased, but it can be helpful before ordinary visit counts are large.

Progressive Widening

Progressive widening is used when the action space is very large or continuous. Instead of expanding every legal action immediately, the number of available children grows slowly with the parent visit count. A typical rule allows at most

children at state , where and .

This lets the algorithm repeatedly evaluate a manageable subset of actions before adding more. Progressive widening is common in continuous control, stochastic planning, and language-generation settings with very large action spaces.

PUCT

Predictor + UCT (PUCT) modifies the exploration term using a prior policy . A common form is

The prior controls which actions receive early exploration. Actions with high prior probability get a larger exploration bonus, but the bonus still decays as grows.

Intuition

UCT treats all unvisited actions symmetrically. PUCT says that if a policy network already thinks some actions are more plausible, the search should spend its early simulations there instead of spreading effort uniformly.

PUCT is one of the key ingredients in systems such as AlphaGo Zero and AlphaZero, where a neural network supplies both a policy prior and a value estimate.

Imperfect Information

In imperfect-information games, the true state may not be known to the acting player. A naive approach samples a possible hidden state and runs ordinary MCTS, but this can produce strategies that depend on information the player should not have.

Information-set MCTS groups states that are indistinguishable to the player into information sets. The tree is then built over information sets rather than fully observed states. This better matches the decision problem faced by the player, though it introduces complications such as strategy fusion and nonlocality.

Applications

MCTS is most useful when a problem has a large sequential decision tree, delayed rewards, and a simulator or approximate evaluator.

Applications include:

  • Board games such as Go, Hex, Chess, and Shogi.
  • General game playing, where domain-specific heuristics may not be available.
  • Online planning in stochastic environments.
  • Program synthesis and theorem proving when partial solutions can be sampled and checked.
  • Combinatorial optimization with expensive or delayed objective feedback.
  • Model-based reinforcement learning.

AlphaGo, AlphaZero, and MuZero

AlphaGo combined MCTS with neural networks that supplied policy priors and value estimates. The policy network made search much more selective, while the value network reduced dependence on noisy random rollouts.

AlphaZero simplified the setup by learning from self-play without human expert games. Its MCTS used a neural network that returned both a policy prior and a value estimate . Search improved the policy by producing stronger visit-count targets, and the neural network was trained to imitate those improved policies while predicting game outcomes.

MuZero removed the need for an explicit environment model. Instead, it learned a latent dynamics model that could be searched by MCTS. This made the same planning idea applicable even when the true transition dynamics were unknown or too complex to model directly.

Language Models

MCTS can be applied to language models by treating partial text as a search node and next-token choices as actions. However, this is not always a good match for ordinary text generation. If the objective is simply high model likelihood, beam search is usually simpler and more hardware-efficient.

MCTS becomes more interesting when there is an external reward, verifier, or tool feedback that is not captured by next-token probability alone. Examples include:

  • Code generation with unit tests.
  • Mathematical reasoning with checkable answers.
  • Tool-use planning with environment feedback.
  • Theorem proving with a proof checker.
  • Agentic tasks where intermediate actions affect a persistent state.

In these settings, a language model can act as the prior policy over actions, while MCTS allocates additional computation toward branches that produce high verifier or reward scores.

Compared with Levin Tree Search, MCTS is more appropriate when the search needs to estimate and compare downstream action values. MCTS asks which action has the highest expected return after additional sampling. Levin tree search instead asks which complete or partial trajectory should be cheapest to discover under the language-model policy. This makes LTS a natural alternative when the task is a first-solution problem with a verifier, such as code generation with unit tests or theorem proving with a proof checker. In those cases, the LM policy already defines a path probability, and the main objective is to find any verified solution rather than to estimate a smooth value function over actions.

Rule of Thumb

Use Levin Tree Search when the problem is best viewed as policy-guided first-solution search. Use MCTS when the problem is best viewed as value-guided planning with rewards, rollouts, or learned value estimates. See Search-Based Decoding for Language Models for a comparison with beam search.

Sources