Monte Carlo Tree Search for AI Planning
When AI agents face complex decisions with many possible paths forward, how do they choose wisely? Monte Carlo Tree Search (MCTS) offers an elegant answer: explore possible futures through simulation, learn from those simulations, and gradually focus on the most promising paths. This algorithm powers game-playing AI, robotic planning systems, and increasingly, language model reasoning.
Concept Introduction
Simple Explanation
Imagine you’re lost in a maze with many branching paths. You could:
- Explore randomly - Try paths at random, hoping to get lucky
- Plan everything - Map out every possible route before moving (impossible in complex mazes)
- Smart exploration - Try promising paths more often, remember what you learned, gradually focus on the best routes
MCTS uses the third approach. It runs simulations, learns from them, and balances exploring new possibilities with exploiting what it has learned works well. Over time, it builds a “search tree” representing the most promising paths through the problem space.
Technical Detail
Monte Carlo Tree Search is a best-first search algorithm that builds a decision tree incrementally through randomized simulation. Each iteration consists of four phases:
1. Selection: Starting from the root, traverse the tree using a selection policy (typically UCT, the Upper Confidence Bound applied to Trees rule) that balances exploration and exploitation until reaching a leaf node.
2. Expansion: Add one or more child nodes to the leaf, representing possible next actions.
3. Simulation (Rollout): From the new node, simulate the remainder of the trajectory using a default policy (often random) until reaching a terminal state.
4. Backpropagation: Update the value estimates of all nodes in the path from the expanded node back to the root based on the simulation outcome.
The key insight: you don’t need perfect information about all possible futures. By running many simulations and intelligently balancing exploration with exploitation, MCTS converges toward optimal actions even in enormous search spaces.
Historical & Theoretical Context
Origins
MCTS emerged in 2006 from two independent discoveries: Rémi Coulom’s work on Go-playing programs and the UCT (Upper Confidence Bound applied to Trees) algorithm by Kocsis and Szepesvári. The method revolutionized computer Go, a game with a search space so vast (roughly 10^170 legal board positions) that traditional minimax approaches failed.
Before MCTS, the strongest computer Go programs played at only a weak amateur level. After MCTS, they became competitive with human experts, culminating in AlphaGo’s 2016 victory over Lee Sedol, a watershed moment in AI.
Theoretical Foundation
MCTS draws on:
Multi-armed bandit theory: The UCT selection formula is derived from the UCB1 algorithm for multi-armed bandits, which provably converges to optimal arm selection.
Monte Carlo methods: Using randomized sampling to estimate properties of deterministic systems, developed in the 1940s for nuclear physics simulations.
Anytime algorithms: MCTS is an “anytime” algorithm—it can return a reasonable answer at any point but improves with more computation time.
The UCT formula balances exploitation (choosing actions with high observed value) and exploration (trying less-explored actions):
UCT = exploitation + exploration
= (wins/visits) + c * sqrt(ln(parent_visits) / node_visits)
Where c is an exploration constant (typically √2). This formula has a theoretical guarantee: as the number of simulations grows, the probability of selecting a suboptimal action at the root goes to zero.
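To make the numbers concrete, here is an illustrative calculation (the visit counts are invented for the example). Consider a node with 10 wins in 20 visits whose parent has been visited 100 times, using c = √2 ≈ 1.414:

UCT = 10/20 + 1.414 * sqrt(ln(100) / 20) ≈ 0.5 + 0.68 ≈ 1.18

A sibling with the same 50% win rate but only 5 visits scores about 0.5 + 1.414 * sqrt(ln(100) / 5) ≈ 1.86, so the less-explored sibling is selected first.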
Algorithms & Math
The MCTS Algorithm (Pseudocode)
function MCTS(root_state, iterations):
    root = Node(state=root_state)
    for i in range(iterations):
        node = root
        state = copy(root_state)

        # 1. SELECTION
        while node is fully expanded and not terminal:
            node = node.select_child_UCT()
            state = state.apply(node.action)

        # 2. EXPANSION
        if not node.is_terminal():
            node = node.expand()  # Add a new child
            state = state.apply(node.action)

        # 3. SIMULATION (Rollout)
        while not state.is_terminal():
            action = random_action(state)
            state = state.apply(action)
        reward = state.get_reward()

        # 4. BACKPROPAGATION
        while node is not None:
            node.visits += 1
            node.value += reward
            node = node.parent

    return root.best_child()  # Return action with most visits
UCT Selection Formula
import math

def uct_value(node, c=math.sqrt(2)):
    """Calculate UCT value for node selection."""
    if node.visits == 0:
        return float('inf')  # Unvisited nodes have priority
    exploitation = node.value / node.visits
    exploration = c * math.sqrt(math.log(node.parent.visits) / node.visits)
    return exploitation + exploration
Key Parameters
Exploration constant (c): Controls exploration vs. exploitation tradeoff. Higher c means more exploration. Theoretical optimum is √2, but practical values range from 0.5 to 2.0 depending on the domain.
Simulation policy: How to select actions during rollout. Options include:
- Random: Fast but low quality
- Heuristic-guided: Uses domain knowledge to bias simulations (see the sketch after this list)
- Learned policy: Uses a neural network (as in AlphaGo)
Expansion strategy: When to add children. Common approaches:
- Add one child per iteration (conservative)
- Add all children at once (aggressive)
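As an illustrative sketch of a heuristic-guided simulation policy (not part of the original implementation, and assuming the TicTacToeState interface defined in the Practical Application section below), a rollout can take an immediately winning move when one exists and otherwise move randomly:

import random

def heuristic_rollout_action(state):
    """Rollout policy sketch: play an immediately winning move if one exists,
    otherwise fall back to a uniformly random move."""
    actions = state.get_legal_actions()
    for action in actions:
        # Look one step ahead: does this move win the game for the player to move?
        if state.apply_action(action).get_winner() == state.current_player:
            return action
    return random.choice(actions)

Plugging such a policy into the simulation phase usually raises rollout quality at a modest computational cost.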
Design Patterns & Architectures
The Standard MCTS Loop
MCTS fits naturally into agent architectures as a planning module:
Perception → State Representation → MCTS Planning → Action Execution → Environment
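A minimal sketch of that pipeline, reusing the Tic-Tac-Toe classes implemented later in this article (the loop structure is illustrative, not a prescribed API):

# Observe the state, plan with MCTS, act, repeat until the episode ends.
state = TicTacToeState()  # perception / state representation
while not state.is_terminal():
    # Planning: run MCTS from the current state for the player to move
    action = mcts_search(state, iterations=500, player=state.current_player)
    # Action execution: the environment transitions to the next state
    state = state.apply_action(action)
print("Winner:", state.get_winner())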
Integration with Neural Networks (AlphaGo Style)
Modern MCTS implementations combine tree search with neural networks:
Node Selection:
- Use a neural network to predict action probabilities and the state value
- Incorporate these predictions into the UCT selection rule

Simulation:
- Replace random rollouts with neural network value estimates
- No need to simulate to a terminal state
This combination is extremely powerful: the neural network provides informed priors, while MCTS refines these through search.
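A sketch of how the network prior can enter selection, in the spirit of AlphaGo’s PUCT rule (the attribute name child.prior and the constant c_puct are illustrative assumptions, not a specific library’s API):

import math

def puct_value(child, parent_visits, c_puct=1.5):
    """PUCT-style score: mean observed value plus a prior-weighted exploration bonus.
    child.prior is the network's predicted probability of taking this action."""
    q = child.value / child.visits if child.visits > 0 else 0.0
    u = c_puct * child.prior * math.sqrt(parent_visits) / (1 + child.visits)
    return q + u

Compared with plain UCT, actions with a high prior are explored early, while low-prior actions are largely pruned away.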
MCTS for Language Model Reasoning
Recent work applies MCTS to LLM reasoning:
State: Partial reasoning chain or code being generated
Actions: Next reasoning steps or code tokens
Simulation: Complete the reasoning/code and evaluate correctness
Reward: Whether the final answer is correct
This enables LLMs to search over reasoning paths rather than generating a single path greedily.
Practical Application
Implementing MCTS for a Simple Game
Let’s implement MCTS for Tic-Tac-Toe:
import math
import random
from copy import deepcopy

class TicTacToeState:
    def __init__(self):
        self.board = [[' ' for _ in range(3)] for _ in range(3)]
        self.current_player = 'X'

    def get_legal_actions(self):
        """Return list of (row, col) for empty squares."""
        actions = []
        for r in range(3):
            for c in range(3):
                if self.board[r][c] == ' ':
                    actions.append((r, c))
        return actions

    def apply_action(self, action):
        """Return new state with action applied."""
        new_state = deepcopy(self)
        r, c = action
        new_state.board[r][c] = self.current_player
        new_state.current_player = 'O' if self.current_player == 'X' else 'X'
        return new_state

    def is_terminal(self):
        """Check if game is over."""
        return self.get_winner() is not None or not self.get_legal_actions()

    def get_winner(self):
        """Return 'X', 'O', or None."""
        # Check rows, columns, diagonals
        for r in range(3):
            if self.board[r][0] == self.board[r][1] == self.board[r][2] != ' ':
                return self.board[r][0]
        for c in range(3):
            if self.board[0][c] == self.board[1][c] == self.board[2][c] != ' ':
                return self.board[0][c]
        if self.board[0][0] == self.board[1][1] == self.board[2][2] != ' ':
            return self.board[0][0]
        if self.board[0][2] == self.board[1][1] == self.board[2][0] != ' ':
            return self.board[0][2]
        return None

    def get_reward(self, player):
        """Return reward from perspective of player: +1 win, 0 draw, -1 loss."""
        winner = self.get_winner()
        if winner == player:
            return 1
        elif winner is None:
            return 0  # Draw
        else:
            return -1
class MCTSNode:
    def __init__(self, state, parent=None, action=None):
        self.state = state
        self.parent = parent
        self.action = action  # Action that led to this state
        self.children = []
        self.visits = 0
        self.value = 0
        self.untried_actions = state.get_legal_actions()

    def is_fully_expanded(self):
        return len(self.untried_actions) == 0

    def is_terminal(self):
        return self.state.is_terminal()

    def expand(self):
        """Add a new child node."""
        action = self.untried_actions.pop()
        new_state = self.state.apply_action(action)
        child = MCTSNode(new_state, parent=self, action=action)
        self.children.append(child)
        return child

    def select_child_uct(self, c=math.sqrt(2)):
        """Select child with highest UCT value."""
        return max(self.children, key=lambda child: child.uct_value(c))

    def uct_value(self, c):
        if self.visits == 0:
            return float('inf')
        return (self.value / self.visits) + c * math.sqrt(math.log(self.parent.visits) / self.visits)

    def best_child(self):
        """Return child with most visits (most reliable)."""
        return max(self.children, key=lambda child: child.visits)
def mcts_search(root_state, iterations=1000, player='X'):
    """Run MCTS and return the best action for `player`."""
    root = MCTSNode(root_state)
    for _ in range(iterations):
        node = root
        state = deepcopy(root_state)

        # SELECTION
        while node.is_fully_expanded() and not node.is_terminal():
            node = node.select_child_uct()
            state = node.state

        # EXPANSION
        if not node.is_terminal():
            node = node.expand()
            state = node.state

        # SIMULATION (random rollout)
        while not state.is_terminal():
            action = random.choice(state.get_legal_actions())
            state = state.apply_action(action)

        # BACKPROPAGATION
        reward = state.get_reward(player)
        while node is not None:
            node.visits += 1
            # Credit each node from the perspective of the player who moved into it,
            # so the search assumes the opponent also plays its best moves.
            mover = 'O' if node.state.current_player == 'X' else 'X'
            node.value += reward if mover == player else -reward
            node = node.parent

    return root.best_child().action

# Usage
state = TicTacToeState()
best_action = mcts_search(state, iterations=1000, player='X')
print(f"Best move: {best_action}")
Using MCTS for LLM Reasoning (Sketch)
from typing import List, TypedDict

import openai

class ReasoningState(TypedDict):
    problem: str
    reasoning_chain: List[str]
    answer: str | None
    is_correct: bool | None

class MCTSReasoningNode:
    """MCTS node for LLM reasoning paths."""

    def __init__(self, state: ReasoningState, parent=None):
        self.state = state
        self.parent = parent
        self.children = []
        self.visits = 0
        self.value = 0

    def get_possible_next_steps(self) -> List[str]:
        """Use LLM to generate possible next reasoning steps."""
        prompt = f"""Problem: {self.state['problem']}
Current reasoning:
{chr(10).join(self.state['reasoning_chain'])}
Generate 3 different possible next reasoning steps:"""
        response = openai.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content": prompt}],
            n=3
        )
        return [choice.message.content for choice in response.choices]

    def simulate_to_completion(self) -> int:
        """Complete reasoning and return a reward: 1 if the answer is correct, else 0."""
        # Simplified: use LLM to complete reasoning
        prompt = f"""Problem: {self.state['problem']}
Reasoning so far:
{chr(10).join(self.state['reasoning_chain'])}
Complete the reasoning and provide final answer:"""
        response = openai.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content": prompt}]
        )
        answer = response.choices[0].message.content
        # Check correctness: check_answer is a placeholder; a real implementation
        # needs ground truth or a verifier (e.g., unit tests for code, a grader for math)
        is_correct = check_answer(answer, self.state['problem'])
        return 1 if is_correct else 0

def mcts_reasoning(problem: str, iterations: int = 10):
    """Use MCTS to search for a correct reasoning path."""
    initial_state = ReasoningState(
        problem=problem,
        reasoning_chain=[],
        answer=None,
        is_correct=None
    )
    root = MCTSReasoningNode(initial_state)

    # Standard MCTS loop (adapted for LLM reasoning), elided here for brevity
    for _ in range(iterations):
        # ... MCTS phases ...
        pass

    # Return best reasoning path (best_child() would pick the most-visited child,
    # as in the Tic-Tac-Toe MCTSNode above)
    return root.best_child().state['reasoning_chain']
Comparisons & Tradeoffs
MCTS vs Minimax
Minimax:
- Requires perfect game tree traversal
- Needs evaluation function for non-terminal states
- Depth-limited in practice
- Deterministic
MCTS:
- Anytime algorithm (returns best answer so far)
- No evaluation function needed
- Asymmetric tree growth (explores promising areas more)
- Stochastic (uses randomness)
Tradeoff: Use MCTS for large search spaces where you can’t enumerate all possibilities; use minimax for small, well-understood games where perfect play is computable.
MCTS vs Greedy Search
Greedy (e.g., LLM token-by-token generation):
- Fast (O(n) for sequence length n)
- No backtracking
- Can get stuck in local optima
MCTS:
- Slower (O(iterations × simulation_depth))
- Explores multiple paths
- Finds better global solutions
Tradeoff: Use greedy when speed matters and local decisions are usually good. Use MCTS when decisions have long-term consequences and you can afford computation time.
Pure MCTS vs Neural-Guided MCTS (AlphaGo Style)
Pure MCTS:
- No training data needed
- Simpler implementation
- Requires many simulations for quality
Neural-Guided:
- Needs training data and trained model
- Much more sample efficient
- Achieves superhuman performance
Tradeoff: Pure MCTS for quick prototypes and domains without training data. Neural-guided for production systems with available training resources.
Latest Developments & Research
MCTS for LLM Reasoning (2023-2025)
Recent papers apply MCTS to improve LLM reasoning:
“Tree of Thoughts” (Yao et al., 2023): Uses MCTS-like search over reasoning steps, with the LLM evaluating intermediate states. Shows significant improvements on mathematical reasoning and planning tasks.
“Self-Refine with MCTS” (2025): Combines iterative self-refinement with tree search, allowing LLMs to explore multiple refinement paths and select the best.
Multi-Agent MCTS
Extending MCTS to multi-agent settings where agents have partial information:
Information Set MCTS: Handles imperfect information games (like poker) by grouping indistinguishable states.
Decentralized MCTS: Multiple agents run MCTS independently, coordinating through communication.
MCTS with Learned Rollout Policies
Instead of random rollouts, train neural networks to simulate high-quality continuations:
Policy networks: Trained to mimic expert play, dramatically improving simulation quality
Value networks: Directly estimate state value, eliminating the need for rollouts entirely
This approach (used in AlphaGo Zero) achieves superhuman performance with less domain knowledge.
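As a sketch of how the simulation phase changes (value_net and encode are hypothetical stand-ins for a trained model and a state encoder, not a real API):

def evaluate_leaf(state, value_net, encode):
    """Replace the random rollout with a single value-network evaluation.
    The returned scalar plays the same role as the rollout reward in backpropagation."""
    return value_net.predict(encode(state))  # e.g., an estimated value in [-1, 1]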
Cross-Disciplinary Insight
MCTS connects to decision theory and optimal stopping problems: when do you stop searching and commit to an action? MCTS implicitly solves this through the anytime property—you can stop whenever time runs out.
From cognitive science, MCTS resembles human prospective thinking: we imagine possible futures, learn from those mental simulations, and gradually focus on promising options. Humans don’t exhaustively search all possibilities; we adaptively sample like MCTS.
Daily Challenge
Task: Implement MCTS for a simple planning problem
Create an MCTS solver for the “Shortest Path with Obstacles” problem:
class GridWorld:
    def __init__(self, size=10, obstacles=None):
        self.size = size
        self.obstacles = obstacles or set()
        self.position = (0, 0)
        self.goal = (size - 1, size - 1)

    def get_actions(self):
        """Return valid moves: up, down, left, right."""
        # Your implementation
        pass

    def apply_action(self, action):
        """Return new GridWorld state after action."""
        pass

    def is_terminal(self):
        """Check if goal reached."""
        pass

# Your task:
# 1. Implement MCTS for GridWorld
# 2. Compare MCTS with random walk and greedy approaches
# 3. Visualize the search tree after N iterations
Requirements:
- Implement all four MCTS phases
- Use UCT for selection
- Random rollout for simulation
- Track number of simulations needed to find goal
Bonus Challenge: Implement a learned rollout policy that uses value estimation to guide simulations toward the goal, comparing performance to random rollouts.
References & Further Reading
Foundational Papers
- “Bandit Based Monte-Carlo Planning” (Kocsis & Szepesvári, 2006) - UCT algorithm
- “Efficient Selectivity and Backup Operators in Monte-Carlo Tree Search” (Coulom, 2006) - MCTS for Go
- “Mastering the Game of Go with Deep Neural Networks and Tree Search” (Silver et al., 2016) - AlphaGo
- “Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm” (Silver et al., 2017) - AlphaZero
Recent MCTS for LLMs
- “Tree of Thoughts: Deliberate Problem Solving with Large Language Models” (Yao et al., 2023)
- “Reasoning with Language Model is Planning with World Model” (Hao et al., 2023)
Implementations
- pymcts: Python MCTS library with a clean implementation of standard MCTS
- AlphaZero: DeepMind’s open-source implementation
- OpenSpiel: Framework for multi-agent RL including MCTS variants
Tutorials
- MCTS Tutorial - Comprehensive guide with interactive examples
- AlphaGo Paper Explained - Video walkthrough
- MCTS in Python - Step-by-step implementation
Monte Carlo Tree Search transforms the impossible task of exhaustively searching all possibilities into a tractable learning problem: sample intelligently, learn from simulations, and focus on what works. As AI agents tackle increasingly complex planning problems—from game playing to mathematical reasoning to robotic manipulation—MCTS provides a principled way to search vast spaces efficiently. When combined with modern neural networks, it achieves superhuman performance while remaining conceptually simple: explore, simulate, learn, repeat.