Monte Carlo Tree Search for AI Planning

When AI agents face complex decisions with many possible paths forward, how do they choose wisely? Monte Carlo Tree Search (MCTS) offers an elegant answer: explore possible futures through simulation, learn from those simulations, and gradually focus on the most promising paths. This algorithm powers game-playing AI, robotic planning systems, and increasingly, language model reasoning.

Concept Introduction

Simple Explanation

Imagine you’re lost in a maze with many branching paths. You could:

  1. Explore randomly - Try paths at random, hoping to get lucky
  2. Plan everything - Map out every possible route before moving (impossible in complex mazes)
  3. Smart exploration - Try promising paths more often, remember what you learned, gradually focus on the best routes

MCTS uses the third approach. It runs simulations, learns from them, and balances exploring new possibilities with exploiting what it has learned works well. Over time, it builds a “search tree” representing the most promising paths through the problem space.

Technical Detail

Monte Carlo Tree Search is a best-first search algorithm that builds a decision tree incrementally through randomized simulation. Each iteration consists of four phases:

1. Selection: Starting from the root, traverse the tree using a selection policy (typically UCT - Upper Confidence Bound for Trees) that balances exploration and exploitation until reaching a leaf node.

2. Expansion: Add one or more child nodes to the leaf, representing possible next actions.

3. Simulation (Rollout): From the new node, simulate the remainder of the trajectory using a default policy (often random) until reaching a terminal state.

4. Backpropagation: Update the value estimates of all nodes in the path from the expanded node back to the root based on the simulation outcome.

The key insight: you don’t need perfect information about all possible futures. By running many simulations and intelligently balancing exploration with exploitation, MCTS converges toward optimal actions even in enormous search spaces.

Historical & Theoretical Context

Origins

MCTS emerged in 2006 from two independent lines of work: Rémi Coulom’s research on Go-playing programs and the UCT (Upper Confidence Bound applied to Trees) algorithm of Kocsis and Szepesvári. The method revolutionized computer Go, a game whose search space is so vast (roughly 10^170 legal board positions) that traditional minimax approaches failed.

Before MCTS, the strongest computer Go programs played at only a weak amateur level. With MCTS they reached strong amateur strength, and combined with deep neural networks they eventually surpassed professionals, culminating in AlphaGo’s 2016 victory over Lee Sedol, a watershed moment in AI.

Theoretical Foundation

MCTS draws on:

Multi-armed bandit theory: The UCT selection formula is derived from the UCB1 algorithm for multi-armed bandits, which provably converges to optimal arm selection.

Monte Carlo methods: Using randomized sampling to estimate properties of deterministic systems, developed in the 1940s for nuclear physics simulations.

Anytime algorithms: MCTS is an “anytime” algorithm—it can return a reasonable answer at any point but improves with more computation time.

The UCT formula balances exploitation (choosing actions with high observed value) and exploration (trying less-explored actions):

UCT = exploitation + exploration
    = (wins/visits) + c * sqrt(ln(parent_visits) / node_visits)

Where c is an exploration constant (typically √2). This formula has theoretical guarantees: with infinite samples, it converges to the optimal action.
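
To make the formula concrete, here is a small worked example with made-up statistics (the numbers are purely illustrative): the parent has been visited 14 times, child A has 6 wins in 10 visits, and child B has 3 wins in 4 visits.

import math

def uct(wins, visits, parent_visits, c=math.sqrt(2)):
    """UCT score = average reward + exploration bonus."""
    return wins / visits + c * math.sqrt(math.log(parent_visits) / visits)

print(uct(6, 10, 14))  # ~1.33: well explored, moderate win rate
print(uct(3, 4, 14))   # ~1.90: less explored, higher win rate -> selected next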

Algorithms & Math

The MCTS Algorithm (Pseudocode)

function MCTS(root_state, iterations):
    root = Node(state=root_state)
    
    for i in range(iterations):
        node = root
        state = copy(root_state)
        
        # 1. SELECTION
        while node is fully expanded and not terminal:
            node = node.select_child_UCT()
            state = state.apply(node.action)
        
        # 2. EXPANSION
        if not node.is_terminal():
            node = node.expand()  # Add a new child
            state = state.apply(node.action)
        
        # 3. SIMULATION (Rollout)
        while not state.is_terminal():
            action = random_action(state)
            state = state.apply(action)
        
        reward = state.get_reward()
        
        # 4. BACKPROPAGATION
        while node is not None:
            node.visits += 1
            node.value += reward
            node = node.parent
    
    return root.best_child()  # Return action with most visits

UCT Selection Formula

import math

def uct_value(node, c=math.sqrt(2)):
    """Calculate UCT value for node selection."""
    if node.visits == 0:
        return float('inf')  # Unvisited nodes have priority
    
    exploitation = node.value / node.visits
    exploration = c * math.sqrt(math.log(node.parent.visits) / node.visits)
    
    return exploitation + exploration

Key Parameters

Exploration constant (c): Controls exploration vs. exploitation tradeoff. Higher c means more exploration. Theoretical optimum is √2, but practical values range from 0.5 to 2.0 depending on the domain.

Simulation policy: How to select actions during rollout. Options include uniformly random moves (fast and unbiased, but noisy), domain-specific heuristics such as "always take an immediate win", and learned policies (covered later); a minimal heuristic rollout is sketched after this list.

Expansion strategy: When to add children. Common approaches are expanding a single untried child per iteration (used in the implementation below) or expanding all children at once; progressive widening caps the number of children in very large or continuous action spaces.
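
As a concrete illustration of a non-random simulation policy, here is a minimal heuristic rollout sketch for Tic-Tac-Toe: take an immediately winning move if one exists, otherwise play randomly. It assumes the TicTacToeState API defined in the Practical Application section below.

import random

def heuristic_rollout_action(state):
    """Rollout policy sketch: play an immediate win if available, else random."""
    actions = state.get_legal_actions()
    for action in actions:
        # apply_action returns a new state; check whether this move wins
        if state.apply_action(action).get_winner() == state.current_player:
            return action
    return random.choice(actions)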

Design Patterns & Architectures

The Standard MCTS Loop

MCTS fits naturally into agent architectures as a planning module:

Perception → State Representation → MCTS Planning → Action Execution → Environment

Integration with Neural Networks (AlphaGo Style)

Modern MCTS implementations combine tree search with neural networks:

Node Selection:
  Use neural network to predict action probabilities and state value
  Incorporate these predictions into UCT selection

Simulation:
  Replace random rollouts with neural network value estimates
  No need to simulate to terminal state

This combination is extremely powerful: the neural network provides informed priors, while MCTS refines these through search.
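
One common way to fold the network’s predictions into selection is the PUCT rule used by AlphaGo-style systems: the mean action value plus a prior-weighted exploration bonus. A minimal sketch follows; the node fields prior, value_sum, and visits are assumptions of this illustration rather than part of the implementation later in this post.

import math

def puct_score(child, parent_visits, c_puct=1.5):
    """PUCT: Q (mean value) + U (prior-weighted exploration bonus).
    child.prior is the policy network's probability for the move into child."""
    q = child.value_sum / child.visits if child.visits > 0 else 0.0
    u = c_puct * child.prior * math.sqrt(parent_visits) / (1 + child.visits)
    return q + u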

MCTS for Language Model Reasoning

Recent work applies MCTS to LLM reasoning:

State: Partial reasoning chain or code being generated

Actions: Next reasoning steps or code tokens

Simulation: Complete the reasoning/code and evaluate correctness

Reward: Whether the final answer is correct

This enables LLMs to search over reasoning paths rather than generating a single path greedily.

Practical Application

Implementing MCTS for a Simple Game

Let’s implement MCTS for Tic-Tac-Toe:

import math
import random
from copy import deepcopy

class TicTacToeState:
    def __init__(self):
        self.board = [[' ' for _ in range(3)] for _ in range(3)]
        self.current_player = 'X'
    
    def get_legal_actions(self):
        """Return list of (row, col) for empty squares."""
        actions = []
        for r in range(3):
            for c in range(3):
                if self.board[r][c] == ' ':
                    actions.append((r, c))
        return actions
    
    def apply_action(self, action):
        """Return new state with action applied."""
        new_state = deepcopy(self)
        r, c = action
        new_state.board[r][c] = self.current_player
        new_state.current_player = 'O' if self.current_player == 'X' else 'X'
        return new_state
    
    def is_terminal(self):
        """Check if game is over."""
        return self.get_winner() is not None or not self.get_legal_actions()
    
    def get_winner(self):
        """Return 'X', 'O', or None."""
        # Check rows, columns, diagonals
        for r in range(3):
            if self.board[r][0] == self.board[r][1] == self.board[r][2] != ' ':
                return self.board[r][0]
        for c in range(3):
            if self.board[0][c] == self.board[1][c] == self.board[2][c] != ' ':
                return self.board[0][c]
        if self.board[0][0] == self.board[1][1] == self.board[2][2] != ' ':
            return self.board[0][0]
        if self.board[0][2] == self.board[1][1] == self.board[2][0] != ' ':
            return self.board[0][2]
        return None
    
    def get_reward(self, player):
        """Return reward from perspective of player."""
        winner = self.get_winner()
        if winner == player:
            return 1
        elif winner is None:
            return 0
        else:
            return -1

class MCTSNode:
    def __init__(self, state, parent=None, action=None):
        self.state = state
        self.parent = parent
        self.action = action  # Action that led to this state
        self.children = []
        self.visits = 0
        self.value = 0
        self.untried_actions = state.get_legal_actions()
    
    def is_fully_expanded(self):
        return len(self.untried_actions) == 0
    
    def is_terminal(self):
        return self.state.is_terminal()
    
    def expand(self):
        """Add a new child node."""
        action = self.untried_actions.pop()
        new_state = self.state.apply_action(action)
        child = MCTSNode(new_state, parent=self, action=action)
        self.children.append(child)
        return child
    
    def select_child_uct(self, c=math.sqrt(2)):
        """Select child with highest UCT value."""
        return max(self.children, key=lambda child: child.uct_value(c))
    
    def uct_value(self, c):
        if self.visits == 0:
            return float('inf')
        return (self.value / self.visits) + c * math.sqrt(math.log(self.parent.visits) / self.visits)
    
    def best_child(self):
        """Return child with most visits (most reliable)."""
        return max(self.children, key=lambda child: child.visits)

def mcts_search(root_state, iterations=1000, player='X'):
    """Run MCTS and return best action."""
    root = MCTSNode(root_state)
    
    for _ in range(iterations):
        node = root
        state = deepcopy(root_state)
        
        # SELECTION
        while node.is_fully_expanded() and not node.is_terminal():
            node = node.select_child_uct()
            state = node.state
        
        # EXPANSION
        if not node.is_terminal():
            node = node.expand()
            state = node.state
        
        # SIMULATION (random rollout)
        while not state.is_terminal():
            action = random.choice(state.get_legal_actions())
            state = state.apply_action(action)
        
        # BACKPROPAGATION
        # Credit each node from the perspective of the player who made the move
        # into it, so the search also models the opponent as playing well.
        while node is not None:
            node.visits += 1
            mover = node.parent.state.current_player if node.parent else player
            node.value += state.get_reward(mover)
            node = node.parent
    
    return root.best_child().action

# Usage
state = TicTacToeState()
best_action = mcts_search(state, iterations=1000, player='X')
print(f"Best move: {best_action}")

Using MCTS for LLM Reasoning (Sketch)

from typing import List, TypedDict
import openai

class ReasoningState(TypedDict):
    problem: str
    reasoning_chain: List[str]
    answer: str | None
    is_correct: bool | None

class MCTSReasoningNode:
    """MCTS node for LLM reasoning paths."""
    
    def __init__(self, state: ReasoningState, parent=None):
        self.state = state
        self.parent = parent
        self.children = []
        self.visits = 0
        self.value = 0
    
    def get_possible_next_steps(self) -> List[str]:
        """Use LLM to generate possible next reasoning steps."""
        prompt = f"""Problem: {self.state['problem']}
Current reasoning:
{chr(10).join(self.state['reasoning_chain'])}

Generate 3 different possible next reasoning steps:"""
        
        response = openai.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content": prompt}],
            n=3
        )
        
        return [choice.message.content for choice in response.choices]
    
    def simulate_to_completion(self) -> int:
        """Complete the reasoning with the LLM and return the rollout reward
        (1 if the final answer is correct, 0 otherwise)."""
        # Simplified: use the LLM to complete the reasoning in one shot
        prompt = f"""Problem: {self.state['problem']}
Reasoning so far:
{chr(10).join(self.state['reasoning_chain'])}

Complete the reasoning and provide final answer:"""
        
        response = openai.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content": prompt}]
        )
        
        answer = response.choices[0].message.content
        # check_answer is a placeholder; a real implementation needs ground
        # truth, unit tests, or a verifier model to score the answer
        is_correct = check_answer(answer, self.state['problem'])
        return 1 if is_correct else 0

def mcts_reasoning(problem: str, iterations: int = 10):
    """Use MCTS to search for correct reasoning path."""
    initial_state = ReasoningState(
        problem=problem,
        reasoning_chain=[],
        answer=None,
        is_correct=None
    )
    
    root = MCTSReasoningNode(initial_state)
    
    # Standard MCTS loop (adapted for LLM reasoning)
    for _ in range(iterations):
        # ... MCTS phases ...
        pass
    
    # Return the best reasoning path; best_child (the most-visited child) would
    # be defined just as in the game example above
    return root.best_child().state['reasoning_chain']

Comparisons & Tradeoffs

MCTS vs Minimax

Minimax: Exhaustively searches the game tree to a fixed depth, requires a hand-crafted evaluation function for non-terminal positions, and guarantees optimal play within that horizon, but becomes intractable when the branching factor is large.

MCTS: Samples trajectories instead of enumerating them, needs only a simulator rather than an evaluation function, is an anytime method, and concentrates effort on promising moves, so it scales to enormous branching factors.

Tradeoff: MCTS is better for large search spaces where you can’t enumerate all possibilities. Minimax is better for small, well-understood games where perfect play is computable.

MCTS vs Greedy Decision-Making

Greedy (e.g., LLM token-by-token generation): Commits to the locally best choice at every step, is extremely fast, but can never recover from an early mistake.

MCTS: Considers many alternative continuations before committing and can revise early choices in light of simulated outcomes, at the cost of far more computation per decision.

Tradeoff: Use greedy when speed matters and local decisions are usually good. Use MCTS when decisions have long-term consequences and you can afford the computation time.

Pure MCTS vs Neural-Guided MCTS (AlphaGo Style)

Pure MCTS: Requires no training and works out of the box in any domain with a simulator, but random rollouts give noisy value estimates and the search is slower to focus.

Neural-Guided: A policy network supplies informed priors and a value network shortens or replaces rollouts, giving far stronger results, but it needs training data (or self-play) and substantial compute.

Tradeoff: Pure MCTS for quick prototypes and domains without training data. Neural-guided for production systems with available training resources.

Latest Developments & Research

MCTS for LLM Reasoning (2024-2025)

Recent papers apply MCTS to improve LLM reasoning:

“Tree of Thoughts” (Yao et al., 2023): Searches over trees of reasoning steps, with the LLM evaluating intermediate states. Shows significant improvements on mathematical reasoning and planning tasks.

“Self-Refine with MCTS” (2025): Combines iterative self-refinement with tree search, allowing LLMs to explore multiple refinement paths and select the best.

Multi-Agent MCTS

Extending MCTS to multi-agent settings where agents have partial information:

Information Set MCTS: Handles imperfect information games (like poker) by grouping indistinguishable states.

Decentralized MCTS: Multiple agents run MCTS independently, coordinating through communication.

MCTS with Learned Rollout Policies

Instead of random rollouts, train neural networks to simulate high-quality continuations:

Policy networks: Trained to mimic expert play, dramatically improving simulation quality.

Value networks: Directly estimate state value, eliminating the need for rollouts entirely.

This approach (used in AlphaGo Zero) achieves superhuman performance with less domain knowledge.

Cross-Disciplinary Insight

MCTS connects to decision theory and optimal stopping problems: when do you stop searching and commit to an action? MCTS implicitly solves this through the anytime property—you can stop whenever time runs out.

From cognitive science, MCTS resembles human prospective thinking: we imagine possible futures, learn from those mental simulations, and gradually focus on promising options. Humans don’t exhaustively search all possibilities; we adaptively sample like MCTS.

Daily Challenge

Task: Implement MCTS for a simple planning problem

Create an MCTS solver for the “Shortest Path with Obstacles” problem:

class GridWorld:
    def __init__(self, size=10, obstacles=None):
        self.size = size
        self.obstacles = obstacles or set()
        self.position = (0, 0)
        self.goal = (size-1, size-1)
    
    def get_actions(self):
        """Return valid moves: up, down, left, right."""
        # Your implementation
        pass
    
    def apply_action(self, action):
        """Return new GridWorld state after action."""
        pass
    
    def is_terminal(self):
        """Check if goal reached."""
        pass

# Your task:
# 1. Implement MCTS for GridWorld
# 2. Compare MCTS with random walk and greedy approaches
# 3. Visualize the search tree after N iterations

Requirements: Complete the three numbered tasks in the skeleton above; your MCTS agent should reach the goal reliably despite obstacles, and the visualization should show visit counts for the explored nodes.

Bonus Challenge: Implement a learned rollout policy that uses value estimation to guide simulations toward the goal, comparing performance to random rollouts.

References & Further Reading

Foundational Papers

Coulom, R. (2006). "Efficient Selectivity and Backup Operators in Monte-Carlo Tree Search." Computers and Games.

Kocsis, L., & Szepesvári, C. (2006). "Bandit Based Monte-Carlo Planning." ECML.

Silver, D., et al. (2016). "Mastering the Game of Go with Deep Neural Networks and Tree Search." Nature.

Silver, D., et al. (2017). "Mastering the Game of Go Without Human Knowledge." Nature.

Recent MCTS for LLMs

Yao, S., et al. (2023). "Tree of Thoughts: Deliberate Problem Solving with Large Language Models." NeurIPS.

Monte Carlo Tree Search transforms the impossible task of exhaustively searching all possibilities into a tractable learning problem: sample intelligently, learn from simulations, and focus on what works. As AI agents tackle increasingly complex planning problems—from game playing to mathematical reasoning to robotic manipulation—MCTS provides a principled way to search vast spaces efficiently. When combined with modern neural networks, it achieves superhuman performance while remaining conceptually simple: explore, simulate, learn, repeat.