VII · Reinforcement Learning & RoboticsAdvanced

Monte Carlo Tree Search

A planning algorithm that builds a search tree by simulating random rollouts from each candidate action, using the aggregate results to estimate action values - the algorithm that powered AlphaGo's superhuman performance in the ancient game of Go.

Added May 18, 2026 · 3 min read

MCTS represents a fundamentally different paradigm from policy-based or value-based RL: it combines learning (the trained neural networks) with search at decision time, enabling deliberate planning rather than purely reactive action. This planning-at-inference approach is increasingly relevant as large language models begin to incorporate explicit reasoning steps and search, bringing MCTS concepts into the domain of language model inference. Understanding MCTS explains the architecture behind AlphaGo, AlphaZero, and the emerging area of LLM reasoning with search.

Monte Carlo Tree Search (MCTS) is a decision-time planning algorithm for sequential decision problems. Rather than computing a complete policy in advance, MCTS focuses its computation on the current decision: given the current state, which action leads to the best long-term outcome? It answers this by building a search tree rooted at the current state, sampling from possible futures, and using the simulated outcomes to guide action selection.

MCTS operates in four phases that repeat until a computational budget (time or iterations) is exhausted. Selection: starting from the root (current state), traverse the tree by selecting child nodes according to a selection policy that balances exploration and exploitation (UCT - Upper Confidence bound for Trees - is the standard). Expansion: when the traversal reaches a node not yet fully expanded (unexplored actions remain), add one or more new child nodes. Simulation (rollout): from the newly expanded node, simulate a complete game or episode using a fast rollout policy (often random) to reach a terminal state. Backpropagation: propagate the outcome (win/loss/reward) back up the path from the terminal state to the root, updating value estimates along the way.

After enough iterations, the tree contains statistically reliable estimates of each action's long-term value, and the agent takes the action with the highest estimated value from the root. Crucially, MCTS naturally focuses computation on the most promising subtrees - actions that appear unpromising receive few simulations, while the frontier of the search near the best candidate actions receives extensive simulation.

AlphaGo combined MCTS with deep neural networks in a breakthrough architecture. A policy network (trained from human games and refined by RL) guides the selection phase toward promising moves rather than random rollouts, dramatically improving search efficiency. A value network (trained by self-play RL) provides position evaluations that supplement or replace rollout simulations. This combination - MCTS with neural network-guided search - achieved the first superhuman performance in Go.

AlphaZero and MuZero extended this framework. AlphaZero learns everything (policy and value networks) from self-play with no human data. MuZero learns the dynamics model itself, enabling MCTS without knowing the game rules in advance. These systems achieved superhuman performance across Go, Chess, Shogi, and various video games.

Beyond games, MCTS principles appear in LLM reasoning: tree-of-thought prompting explores multiple reasoning branches similarly to how MCTS explores game trees, and some RL for reasoning systems use MCTS-like search over reasoning steps during training.

Analogy

A chess player mentally evaluating candidate moves by playing out each variation in their head for a few moves. If a variation looks promising, they explore it further; if it leads to a bad position quickly, they abandon it. After considering several variations to various depths, they play the move that seemed to lead to the best outcomes across their simulations. MCTS automates and scales this process: it explores the game tree proportional to promise, using random or network-guided simulations to estimate position values.

Real-world example

AlphaGo plays a professional Go match. On each move, MCTS runs for a fixed time budget (a few seconds). The policy network suggests promising moves; MCTS selects among them using UCT, expanding the tree in the most promising directions. After millions of simulated continuations - each guided by the policy network and evaluated by the value network - MCTS returns the move visited most often from the root (the most consistently high-value move). This move is played on the board. Repeat for every move in the game. The result: superhuman Go performance.

Why it matters

MCTS represents a fundamentally different paradigm from policy-based or value-based RL: it combines learning (the trained neural networks) with search at decision time, enabling deliberate planning rather than purely reactive action. This planning-at-inference approach is increasingly relevant as large language models begin to incorporate explicit reasoning steps and search, bringing MCTS concepts into the domain of language model inference. Understanding MCTS explains the architecture behind AlphaGo, AlphaZero, and the emerging area of LLM reasoning with search.

In the news

AI Tools Accelerate GPU Kernel Optimization
NVIDIA Dev Blog, arXiv CS.AI · 2d ago

Related concepts

Actor-Critic

A reinforcement learning architecture that combines a policy network (the actor, which decides which actions to take) with a value network (the critic, which evaluates how good the current state is) - reducing gradient variance and enabling more stable learning than pure policy gradient.

Exploration vs Exploitation

The central dilemma of reinforcement learning: whether to exploit currently known good strategies to collect reward, or explore unknown actions that might reveal even better strategies - a tradeoff with no universally correct answer.

World Models

Learned neural network representations of an environment's dynamics - enabling an RL agent to simulate future outcomes in its "mind" and plan ahead without additional real-world experience, dramatically improving sample efficiency.

← Back to concepts