Monte Carlo Tree Search for Agent Planning

Introduction: Search in the Unknown

Monte Carlo Tree Search (MCTS) is an algorithm that makes intelligent decisions even when you can’t evaluate all possibilities. Unlike traditional search algorithms that require full knowledge of the problem space, MCTS explores partially, learns from simulations, and progressively focuses on promising paths.

In simple terms: MCTS is like planning your next move in chess by mentally “playing out” random games from different possible moves, then choosing the move that led to wins most often.

For practitioners: MCTS combines tree search with statistical sampling to balance exploration (trying new things) and exploitation (doing what works). It’s particularly powerful when the search space is too large for exhaustive search and when you can simulate outcomes.

Historical & Theoretical Context

MCTS emerged from the intersection of game theory, reinforcement learning, and computer Go research in the early 2000s. Before MCTS, computer Go was considered intractable—the branching factor (~250 moves per position) made traditional minimax search infeasible.

Key developments:

  • 2006: Rémi Coulom coins the term "Monte Carlo Tree Search" and uses it in his Go program Crazy Stone.
  • 2006: Kocsis and Szepesvári introduce UCT (Upper Confidence bounds applied to Trees), giving MCTS its standard selection rule.
  • 2016: DeepMind's AlphaGo combines MCTS with deep neural networks and defeats Lee Sedol.
  • 2017-2018: AlphaZero generalizes the approach to chess and shogi using self-play alone.

Theoretical foundation:
MCTS is grounded in the multi-armed bandit problem: how do you allocate limited trials among multiple options to maximize reward? The Upper Confidence Bound (UCB) formula provides the mathematical answer, balancing exploration and exploitation.

The MCTS Algorithm

MCTS builds a search tree incrementally through four phases repeated in a loop:

1. Selection

Start at the root and traverse the tree by selecting child nodes according to a selection policy (typically UCB1) until reaching a node that hasn’t been fully expanded.

2. Expansion

Add one or more child nodes representing possible actions from the current state.

3. Simulation (Rollout)

From the new node, simulate a complete episode using a rollout policy (often random or semi-random) until reaching a terminal state.

4. Backpropagation

Propagate the simulation result back up the tree, updating statistics (visit counts and values) for each node along the path.

After many iterations, the action from the root with the best statistics is selected.

Mathematical Foundation: UCB1 Selection

The UCB1 (Upper Confidence Bound) formula balances exploitation and exploration:

UCB1(node) = (wins / visits) + C * sqrt(ln(parent_visits) / visits)

Where:

  • wins / visits is the node's average reward so far (the exploitation term); the pseudocode below uses value / visits
  • visits is the number of times this node has been visited
  • parent_visits is the number of times the node's parent has been visited
  • C is the exploration constant (sqrt(2) ≈ 1.41 is a common default)

Intuition:

The first term favors children that have performed well so far; the second term grows for children that have been visited rarely relative to their parent. Unvisited nodes take priority (the score is treated as infinite when visits = 0), and as a node accumulates visits its exploration bonus shrinks, so the search gradually concentrates on the best-performing branches.
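
A quick worked example (the numbers are invented purely for illustration): suppose the root has been visited 100 times and has two children.

import math

def ucb1(value, visits, parent_visits, c=1.41):
    return value / visits + c * math.sqrt(math.log(parent_visits) / visits)

# Child A: strong average reward, already heavily explored.
print(round(ucb1(value=60, visits=80, parent_visits=100), 2))  # 1.09  (0.75 + 0.34)
# Child B: weaker average reward, barely explored. Higher UCB1, so it is selected next.
print(round(ucb1(value=10, visits=20, parent_visits=100), 2))  # 1.18  (0.50 + 0.68)
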

Pseudocode

import math
import random

class MCTSNode:
    def __init__(self, state, parent=None, action=None):
        self.state = state
        self.parent = parent
        self.action = action  # action that led from the parent to this node
        self.children = []
        self.visits = 0
        self.value = 0.0      # cumulative reward from simulations through this node
        self.untried_actions = state.get_legal_actions()
    
    def ucb1(self, c=1.41):
        if self.visits == 0:
            return float('inf')
        exploitation = self.value / self.visits
        exploration = c * math.sqrt(math.log(self.parent.visits) / self.visits)
        return exploitation + exploration
    
    def select_child(self):
        return max(self.children, key=lambda child: child.ucb1())
    
    def expand(self):
        action = self.untried_actions.pop()
        next_state = self.state.take_action(action)
        child = MCTSNode(next_state, parent=self, action=action)
        self.children.append(child)
        return child
    
    def simulate(self):
        state = self.state
        while not state.is_terminal():
            action = random.choice(state.get_legal_actions())
            state = state.take_action(action)
        return state.get_reward()
    
    def backpropagate(self, reward):
        self.visits += 1
        self.value += reward
        if self.parent:
            self.parent.backpropagate(reward)

def mcts_search(root_state, num_iterations):
    root = MCTSNode(root_state)
    
    for _ in range(num_iterations):
        node = root
        
        # Selection
        while not node.untried_actions and node.children:
            node = node.select_child()
        
        # Expansion
        if node.untried_actions:
            node = node.expand()
        
        # Simulation
        reward = node.simulate()
        
        # Backpropagation
        node.backpropagate(reward)
    
    # Return the action leading to the most-visited child (most robust choice)
    return max(root.children, key=lambda c: c.visits).action
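
Note that mcts_search only assumes a state object exposing get_legal_actions(), take_action(), is_terminal(), and get_reward(). As a sanity check, here is a tiny invented domain (the CountdownState name and its rules are made up for illustration) that plugs straight into the search above:

class CountdownState:
    """Toy domain: count down from 10 by steps of 1, 2, or 3; fewer steps means a higher reward."""
    def __init__(self, total=10, steps=0):
        self.total = total
        self.steps = steps

    def get_legal_actions(self):
        return [a for a in (1, 2, 3) if a <= self.total]

    def take_action(self, action):
        return CountdownState(self.total - action, self.steps + 1)

    def is_terminal(self):
        return self.total == 0

    def get_reward(self):
        return 1.0 / self.steps  # reaching zero in fewer steps scores higher

best_first_move = mcts_search(CountdownState(), num_iterations=200)
print(best_first_move)  # usually 3: the largest step is the best first move here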

Design Patterns & Architectures

Pattern 1: MCTS as Planning Module

In AI agent architectures, MCTS often serves as the planning component in a planner-executor pattern:

Perception → State Representation → MCTS Planning → Action Execution → Environment
                                         ↑                                    ↓
                                         └────────── Feedback Loop ──────────┘
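
A minimal sketch of that loop, assuming a hypothetical environment object with observe() and execute() methods (both names are invented here) and the mcts_search function defined earlier:

def run_agent(env, max_steps=20, planning_iterations=100):
    """Replan with MCTS after every executed action (receding-horizon style)."""
    for _ in range(max_steps):
        state = env.observe()                             # Perception -> state representation
        if state.is_terminal():
            break
        action = mcts_search(state, planning_iterations)  # MCTS planning
        env.execute(action)                               # Action execution, environment feedback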

Pattern 2: MCTS + Learned Heuristics (AlphaGo Style)

Modern applications combine MCTS with neural networks:

class NeuralMCTSNode(MCTSNode):
    def __init__(self, state, parent, policy_network, value_network):
        super().__init__(state, parent)
        self.policy_net = policy_network
        self.value_net = value_network
    
    def expand(self):
        # Use the policy network to prioritize which action gets expanded next.
        # Sort in ascending order of prior probability so that pop() in the base
        # class (which removes the last element) expands the most promising action first.
        action_probs = self.policy_net.predict(self.state)
        self.untried_actions = sorted(
            self.untried_actions,
            key=lambda a: action_probs[a]
        )
        return super().expand()
    
    def simulate(self):
        # Replace random rollout with value network estimate
        return self.value_net.predict(self.state)

Architecture:

  • A policy network proposes promising actions, narrowing which branches get expanded.
  • A value network (or a fast rollout) estimates how good a position is without playing it out to the end.
  • MCTS ties them together: the policy guides expansion, the value estimate replaces or supplements random rollouts, and visit statistics still decide the final move.

This combines the intuition of neural networks with the lookahead reasoning of search.
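
One further detail worth knowing: AlphaGo-style systems typically replace the UCB1 bonus with a PUCT-style term weighted by the policy network's prior. A sketch, assuming each node stores a prior attribute set from the policy network at expansion time (that attribute is not part of the classes above):

import math

def puct_score(child, c_puct=1.5):
    """PUCT-style score: exploitation plus a prior-weighted exploration bonus."""
    exploitation = child.value / child.visits if child.visits else 0.0
    exploration = c_puct * child.prior * math.sqrt(child.parent.visits) / (1 + child.visits)
    return exploitation + exploration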

Practical Application: LLM Agent Planning

Here’s how to use MCTS for planning in an LLM-based agent:

import openai
from typing import List, Tuple

class LLMState:
    def __init__(self, context: str, goal: str, history: List[str]):
        self.context = context
        self.goal = goal
        self.history = history  # Actions taken so far
    
    def get_legal_actions(self) -> List[str]:
        """Use LLM to generate possible next actions"""
        prompt = f"""
        Context: {self.context}
        Goal: {self.goal}
        Actions taken: {self.history}
        
        Generate 3-5 possible next actions to achieve the goal.
        Format: one action per line.
        """
        # Note: uses the legacy (pre-1.0) openai SDK interface
        response = openai.ChatCompletion.create(
            model="gpt-4",
            messages=[{"role": "user", "content": prompt}],
            temperature=0.8
        )
        actions = response.choices[0].message.content.strip().split('\n')
        return actions
    
    def take_action(self, action: str) -> 'LLMState':
        """Simulate taking an action"""
        new_context = f"{self.context}\nAction: {action}"
        new_history = self.history + [action]
        return LLMState(new_context, self.goal, new_history)
    
    def is_terminal(self) -> bool:
        """Check if goal is achieved or max depth reached"""
        if len(self.history) >= 10:  # Max planning depth
            return True
        # Use LLM to judge goal completion
        prompt = f"""
        Context: {self.context}
        Goal: {self.goal}
        Actions: {self.history}
        
        Has the goal been achieved? Answer YES or NO.
        """
        response = openai.ChatCompletion.create(
            model="gpt-4",
            messages=[{"role": "user", "content": prompt}],
            temperature=0
        )
        return "YES" in response.choices[0].message.content.upper()
    
    def get_reward(self) -> float:
        """Evaluate how good this outcome is"""
        prompt = f"""
        Context: {self.context}
        Goal: {self.goal}
        Actions: {self.history}
        
        Rate how well these actions achieved the goal.
        Score from 0.0 (failed) to 1.0 (perfect success).
        Return only the number.
        """
        response = openai.ChatCompletion.create(
            model="gpt-4",
            messages=[{"role": "user", "content": prompt}],
            temperature=0
        )
        try:
            return float(response.choices[0].message.content.strip())
        except ValueError:
            return 0.0  # fall back if the model replies with non-numeric text

# Usage
initial_state = LLMState(
    context="User wants to book a flight from NYC to LA",
    goal="Successfully book a flight",
    history=[]
)

best_action = mcts_search(initial_state, num_iterations=50)
print(f"Recommended first action: {best_action}")

How this works:

  1. MCTS explores different action sequences
  2. LLM generates possible actions at each state
  3. LLM evaluates terminal states (goal achievement)
  4. MCTS statistics identify the most promising first action

Advantages:

  • No separate training phase: the LLM supplies action generation and evaluation out of the box.
  • Genuine lookahead: several multi-step plans are weighed before the agent commits to its first action.
  • Anytime behavior: more iterations yield more reliable statistics, so you can trade compute for plan quality.
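
One practical caveat: every expansion and rollout step above triggers LLM calls, so the iteration count translates directly into cost and latency. A simple mitigation, sketched here with an invented _action_cache helper, is to memoize action generation so repeated visits to the same state reuse a single call:

_action_cache = {}

def cached_get_legal_actions(state: 'LLMState') -> List[str]:
    """Memoize LLM action generation keyed by goal and action history."""
    key = (state.goal, tuple(state.history))
    if key not in _action_cache:
        _action_cache[key] = state.get_legal_actions()
    return _action_cache[key]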

Comparisons & Tradeoffs

MCTS vs. Minimax (Alpha-Beta Pruning)

Aspect     | MCTS                                     | Minimax (alpha-beta)
-----------|------------------------------------------|----------------------------------------
Best for   | Large branching factors, imperfect info  | Smaller spaces, perfect info
Evaluation | Statistical sampling                     | Requires heuristic evaluation function
Anytime    | Yes (improves with more time)            | No (must complete to depth limit)
Bias       | Unbiased (explores randomly)             | Biased by heuristic quality

MCTS vs. Reinforcement Learning

Aspect       | MCTS                             | RL
-------------|----------------------------------|-------------------------------
Training     | No training needed               | Requires training phase
Model        | Requires simulator/model         | Can be model-free
Planning     | Plans at decision time           | Policy pre-learned
Adaptability | Adapts immediately to new rules  | Needs retraining

When to use MCTS:

  • You have a simulator or model of the environment that you can roll forward.
  • The branching factor is too large for exhaustive search.
  • You can afford compute at decision time and want anytime behavior.
  • The rules or goals may change, so a pre-trained policy would go stale.

When to use RL instead:

  • No reliable simulator is available, so you need model-free learning from experience.
  • Decisions must be near-instant at runtime, leaving no budget for planning at decision time.
  • You can invest in a training phase and amortize it over many decisions.

Latest Developments & Research

MuZero (DeepMind, 2020)

MuZero extends AlphaZero by learning the environment model itself:

  • It learns a representation function, a dynamics function, and a prediction (policy/value) function from experience.
  • MCTS runs inside this learned latent model rather than a hand-built simulator.
  • It matched AlphaZero's strength in Go, chess, and shogi, and also mastered Atari games where the rules are not given.

Significance: MCTS no longer requires a known simulator—it can plan in learned latent spaces.

Sampled MuZero & Stochastic MCTS

Recent work extends MCTS to stochastic environments and very large action spaces:

  • Sampled MuZero plans over a sampled subset of actions, making search practical for large or continuous action spaces.
  • Stochastic MuZero adds chance nodes (via learned afterstates) so planning handles environments with random outcomes.

MCTS for LLM Reasoning (2024-2025)

Researchers are applying MCTS to improve LLM reasoning:

  • Treating individual reasoning steps or tool calls as actions and searching over them (Tree-of-Thoughts-style search).
  • Replacing random rollouts with a verifier or reward model as the value estimate.
  • Using visit statistics to rank candidate solutions instead of committing to a single chain of thought.

Benchmark improvements:
Search-augmented reasoning methods generally report gains over single-pass prompting on math and code benchmarks, though the size of the improvement depends heavily on the model and the compute budget.

Open problems:

  • Cost: each node expansion and evaluation can require several LLM calls.
  • Reward reliability: having a model grade its own partial solutions is noisy and can be systematically biased.
  • Action granularity: whether an "action" should be a token, a sentence, a reasoning step, or a tool call is still an open design choice.

Cross-Disciplinary Insight: Cognitive Science

MCTS mirrors mental simulation in human cognition:

  • People rarely enumerate every possible future; they imagine a handful of plausible ones.
  • Imagined outcomes are evaluated and fed back into preferences over the current options.
  • Deliberation concentrates on options that keep looking promising, much like UCB-driven selection.

Research connection:
Neuroscience shows the brain’s hippocampus “replays” experiences during rest, similar to MCTS simulations consolidating value estimates. This suggests MCTS captures something fundamental about how intelligent systems plan.

Daily Challenge

Implement MCTS for a text adventure game:

Create a simple text-based game state (e.g., navigating rooms, picking up items, achieving a goal). Implement MCTS to find an optimal action sequence.

Requirements:

  1. Define a GameState class with methods:

    • get_legal_actions()
    • take_action(action)
    • is_terminal()
    • get_reward()
  2. Implement the four MCTS phases

  3. Run 100 iterations and print the recommended action

  4. Bonus: Compare MCTS’s choice to random search (picking random actions until goal achieved). How many fewer steps does MCTS take on average?

Extension: Try integrating an LLM to generate actions or evaluate states instead of hand-coding them.

References & Further Reading

Foundational Papers:

  • Coulom, R. (2006). "Efficient Selectivity and Backup Operators in Monte-Carlo Tree Search." Computers and Games.
  • Kocsis, L., & Szepesvári, C. (2006). "Bandit Based Monte-Carlo Planning." ECML. (Introduces UCT.)
  • Browne, C., et al. (2012). "A Survey of Monte Carlo Tree Search Methods." IEEE Transactions on Computational Intelligence and AI in Games.

AlphaGo/AlphaZero:

  • Silver, D., et al. (2016). "Mastering the Game of Go with Deep Neural Networks and Tree Search." Nature.
  • Silver, D., et al. (2018). "A General Reinforcement Learning Algorithm that Masters Chess, Shogi, and Go through Self-Play." Science. (AlphaZero.)

Recent Extensions:

  • Schrittwieser, J., et al. (2020). "Mastering Atari, Go, Chess and Shogi by Planning with a Learned Model." Nature. (MuZero.)
  • Hubert, T., et al. (2021). "Learning and Planning in Complex Action Spaces." ICML. (Sampled MuZero.)
  • Antonoglou, I., et al. (2022). "Planning in Stochastic Environments with a Learned Model." ICLR. (Stochastic MuZero.)

Books:

  • Sutton, R. S., & Barto, A. G. (2018). Reinforcement Learning: An Introduction (2nd ed.). MIT Press. (Chapter 8 covers planning and MCTS.)