Monte Carlo Tree Search for Agent Planning

Introduction: Search in the Unknown

Monte Carlo Tree Search (MCTS) is an algorithm that makes intelligent decisions even when you can’t evaluate all possibilities. Unlike traditional search algorithms that require full knowledge of the problem space, MCTS explores partially, learns from simulations, and progressively focuses on promising paths.

In simple terms: MCTS is like planning your next move in chess by mentally “playing out” random games from different possible moves, then choosing the move that led to wins most often.

For practitioners: MCTS combines tree search with statistical sampling to balance exploration (trying new things) and exploitation (doing what works). It’s particularly powerful when the search space is too large for exhaustive search and when you can simulate outcomes.

Historical & Theoretical Context

MCTS emerged from the intersection of game theory, reinforcement learning, and computer Go research in the early 2000s. Before MCTS, computer Go was considered intractable—the branching factor (~250 moves per position) made traditional minimax search infeasible.

Key developments:

  • 2006: Rémi Coulom coins the term "Monte Carlo Tree Search" and uses it in his Go program Crazy Stone.
  • 2006: Kocsis and Szepesvári introduce UCT (Upper Confidence bounds applied to Trees), giving MCTS its standard selection rule.
  • 2016: DeepMind's AlphaGo combines MCTS with deep neural networks and defeats Lee Sedol.
  • 2017-2018: AlphaZero generalizes the approach to chess and shogi using self-play alone.

Theoretical foundation:
MCTS is grounded in the multi-armed bandit problem: how do you allocate limited trials among multiple options to maximize reward? The Upper Confidence Bound (UCB) formula provides the mathematical answer, balancing exploration and exploitation.

The MCTS Algorithm

MCTS builds a search tree incrementally through four phases repeated in a loop:

1. Selection

Start at the root and traverse the tree by selecting child nodes according to a selection policy (typically UCB1) until reaching a node that hasn’t been fully expanded.

2. Expansion

Add one or more child nodes representing possible actions from the current state.

3. Simulation (Rollout)

From the new node, simulate a complete episode using a rollout policy (often random or semi-random) until reaching a terminal state.

4. Backpropagation

Propagate the simulation result back up the tree, updating statistics (visit counts and values) for each node along the path.

After many iterations, the action from the root with the best statistics is selected.

Mathematical Foundation: UCB1 Selection

The UCB1 (Upper Confidence Bound) formula balances exploitation and exploration:

UCB1(node) = (wins / visits) + C * sqrt(ln(parent_visits) / visits)

Where:

  • wins / visits is the node's average reward so far (the exploitation term); the pseudocode below uses value / visits
  • visits is the number of times this node has been visited
  • parent_visits is the number of times the node's parent has been visited
  • C is the exploration constant (sqrt(2) ≈ 1.41 is a common default)

Intuition:

The first term favors children that have performed well so far; the second term grows for children that have been visited rarely relative to their parent. Unvisited nodes take priority (the score is treated as infinite when visits = 0), and as a node accumulates visits its exploration bonus shrinks, so the search gradually concentrates on the best-performing branches.
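
A quick worked example (the numbers are invented purely for illustration): suppose the root has been visited 100 times and has two children.

import math

def ucb1(value, visits, parent_visits, c=1.41):
    return value / visits + c * math.sqrt(math.log(parent_visits) / visits)

# Child A: strong average reward, already heavily explored.
print(round(ucb1(value=60, visits=80, parent_visits=100), 2))  # 1.09  (0.75 + 0.34)
# Child B: weaker average reward, barely explored. Higher UCB1, so it is selected next.
print(round(ucb1(value=10, visits=20, parent_visits=100), 2))  # 1.18  (0.50 + 0.68)
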

Pseudocode

import math
import random

class MCTSNode:
    def __init__(self, state, parent=None, action=None):
        self.state = state
        self.parent = parent
        self.action = action  # action that led from the parent to this node
        self.children = []
        self.visits = 0
        self.value = 0.0      # cumulative reward from simulations through this node
        self.untried_actions = state.get_legal_actions()
    
    def ucb1(self, c=1.41):
        if self.visits == 0:
            return float('inf')
        exploitation = self.value / self.visits
        exploration = c * math.sqrt(math.log(self.parent.visits) / self.visits)
        return exploitation + exploration
    
    def select_child(self):
        return max(self.children, key=lambda child: child.ucb1())
    
    def expand(self):
        action = self.untried_actions.pop()
        next_state = self.state.take_action(action)
        child = MCTSNode(next_state, parent=self, action=action)
        self.children.append(child)
        return child
    
    def simulate(self):
        state = self.state
        while not state.is_terminal():
            action = random.choice(state.get_legal_actions())
            state = state.take_action(action)
        return state.get_reward()
    
    def backpropagate(self, reward):
        self.visits += 1
        self.value += reward
        if self.parent:
            self.parent.backpropagate(reward)

def mcts_search(root_state, num_iterations):
    root = MCTSNode(root_state)
    
    for _ in range(num_iterations):
        node = root
        
        # Selection
        while not node.untried_actions and node.children:
            node = node.select_child()
        
        # Expansion
        if node.untried_actions:
            node = node.expand()
        
        # Simulation
        reward = node.simulate()
        
        # Backpropagation
        node.backpropagate(reward)
    
    # Return the action leading to the most-visited child (most robust choice)
    return max(root.children, key=lambda c: c.visits).action
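
Note that mcts_search only assumes a state object exposing get_legal_actions(), take_action(), is_terminal(), and get_reward(). As a sanity check, here is a tiny invented domain (the CountdownState name and its rules are made up for illustration) that plugs straight into the search above:

class CountdownState:
    """Toy domain: count down from 10 by steps of 1, 2, or 3; fewer steps means a higher reward."""
    def __init__(self, total=10, steps=0):
        self.total = total
        self.steps = steps

    def get_legal_actions(self):
        return [a for a in (1, 2, 3) if a <= self.total]

    def take_action(self, action):
        return CountdownState(self.total - action, self.steps + 1)

    def is_terminal(self):
        return self.total == 0

    def get_reward(self):
        return 1.0 / self.steps  # reaching zero in fewer steps scores higher

best_first_move = mcts_search(CountdownState(), num_iterations=200)
print(best_first_move)  # usually 3: the largest step is the best first move here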

Design Patterns & Architectures

Pattern 1: MCTS as Planning Module

In AI agent architectures, MCTS often serves as the planning component in a planner-executor pattern:

Perception → State Representation → MCTS Planning → Action Execution → Environment
                                         ↑                                    ↓
                                         └────────── Feedback Loop ──────────┘
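
A minimal sketch of that loop, assuming a hypothetical environment object with observe() and execute() methods (both names are invented here) and the mcts_search function defined earlier:

def run_agent(env, max_steps=20, planning_iterations=100):
    """Replan with MCTS after every executed action (receding-horizon style)."""
    for _ in range(max_steps):
        state = env.observe()                             # Perception -> state representation
        if state.is_terminal():
            break
        action = mcts_search(state, planning_iterations)  # MCTS planning
        env.execute(action)                               # Action execution, environment feedback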

Pattern 2: MCTS + Learned Heuristics (AlphaGo Style)

Modern applications combine MCTS with neural networks:

class NeuralMCTSNode(MCTSNode):
    def __init__(self, state, parent, policy_network, value_network):
        super().__init__(state, parent)
        self.policy_net = policy_network
        self.value_net = value_network
    
    def expand(self):
        # Use the policy network to prioritize which action gets expanded next.
        # Sort in ascending order of prior probability so that pop() in the base
        # class (which removes the last element) expands the most promising action first.
        action_probs = self.policy_net.predict(self.state)
        self.untried_actions = sorted(
            self.untried_actions,
            key=lambda a: action_probs[a]
        )
        return super().expand()
    
    def simulate(self):
        # Replace random rollout with value network estimate
        return self.value_net.predict(self.state)

Architecture:

  • A policy network proposes promising actions, narrowing which branches get expanded.
  • A value network (or a fast rollout) estimates how good a position is without playing it out to the end.
  • MCTS ties them together: the policy guides expansion, the value estimate replaces or supplements random rollouts, and visit statistics still decide the final move.

This combines the intuition of neural networks with the lookahead reasoning of search.
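
One further detail worth knowing: AlphaGo-style systems typically replace the UCB1 bonus with a PUCT-style term weighted by the policy network's prior. A sketch, assuming each node stores a prior attribute set from the policy network at expansion time (that attribute is not part of the classes above):

import math

def puct_score(child, c_puct=1.5):
    """PUCT-style score: exploitation plus a prior-weighted exploration bonus."""
    exploitation = child.value / child.visits if child.visits else 0.0
    exploration = c_puct * child.prior * math.sqrt(child.parent.visits) / (1 + child.visits)
    return exploitation + exploration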

Practical Application: LLM Agent Planning

Here’s how to use MCTS for planning in an LLM-based agent:

import openai
from typing import List, Tuple

class LLMState:
    def __init__(self, context: str, goal: str, history: List[str]):
        self.context = context
        self.goal = goal
        self.history = history  # Actions taken so far
    
    def get_legal_actions(self) -> List[str]:
        """Use LLM to generate possible next actions"""
        prompt = f"""
        Context: {self.context}
        Goal: {self.goal}
        Actions taken: {self.history}
        
        Generate 3-5 possible next actions to achieve the goal.
        Format: one action per line.
        """
        # Note: uses the legacy (pre-1.0) openai SDK interface
        response = openai.ChatCompletion.create(
            model="gpt-4",
            messages=[{"role": "user", "content": prompt}],
            temperature=0.8
        )
        actions = response.choices[0].message.content.strip().split('\n')
        return actions
    
    def take_action(self, action: str) -> 'LLMState':
        """Simulate taking an action"""
        new_context = f"{self.context}\nAction: {action}"
        new_history = self.history + [action]
        return LLMState(new_context, self.goal, new_history)
    
    def is_terminal(self) -> bool:
        """Check if goal is achieved or max depth reached"""
        if len(self.history) >= 10:  # Max planning depth
            return True
        # Use LLM to judge goal completion
        prompt = f"""
        Context: {self.context}
        Goal: {self.goal}
        Actions: {self.history}
        
        Has the goal been achieved? Answer YES or NO.
        """
        response = openai.ChatCompletion.create(
            model="gpt-4",
            messages=[{"role": "user", "content": prompt}],
            temperature=0
        )
        return "YES" in response.choices[0].message.content.upper()
    
    def get_reward(self) -> float:
        """Evaluate how good this outcome is"""
        prompt = f"""
        Context: {self.context}
        Goal: {self.goal}
        Actions: {self.history}
        
        Rate how well these actions achieved the goal.
        Score from 0.0 (failed) to 1.0 (perfect success).
        Return only the number.
        """
        response = openai.ChatCompletion.create(
            model="gpt-4",
            messages=[{"role": "user", "content": prompt}],
            temperature=0
        )
        try:
            return float(response.choices[0].message.content.strip())
        except ValueError:
            return 0.0  # fall back if the model replies with non-numeric text

# Usage
initial_state = LLMState(
    context="User wants to book a flight from NYC to LA",
    goal="Successfully book a flight",
    history=[]
)

best_action = mcts_search(initial_state, num_iterations=50)
print(f"Recommended first action: {best_action}")

How this works:

  1. MCTS explores different action sequences
  2. LLM generates possible actions at each state
  3. LLM evaluates terminal states (goal achievement)
  4. MCTS statistics identify the most promising first action

Advantages:

  • No separate training phase: the LLM supplies action generation and evaluation out of the box.
  • Genuine lookahead: several multi-step plans are weighed before the agent commits to its first action.
  • Anytime behavior: more iterations yield more reliable statistics, so you can trade compute for plan quality.
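
One practical caveat: every expansion and rollout step above triggers LLM calls, so the iteration count translates directly into cost and latency. A simple mitigation, sketched here with an invented _action_cache helper, is to memoize action generation so repeated visits to the same state reuse a single call:

_action_cache = {}

def cached_get_legal_actions(state: 'LLMState') -> List[str]:
    """Memoize LLM action generation keyed by goal and action history."""
    key = (state.goal, tuple(state.history))
    if key not in _action_cache:
        _action_cache[key] = state.get_legal_actions()
    return _action_cache[key]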

Comparisons & Tradeoffs

MCTS vs. Minimax (Alpha-Beta Pruning)

Aspect     | MCTS                                     | Minimax (alpha-beta)
-----------|------------------------------------------|----------------------------------------
Best for   | Large branching factors, imperfect info  | Smaller spaces, perfect info
Evaluation | Statistical sampling                     | Requires heuristic evaluation function
Anytime    | Yes (improves with more time)            | No (must complete to depth limit)
Bias       | Unbiased (explores randomly)             | Biased by heuristic quality

MCTS vs. Reinforcement Learning

Aspect       | MCTS                             | RL
-------------|----------------------------------|-------------------------------
Training     | No training needed               | Requires training phase
Model        | Requires simulator/model         | Can be model-free
Planning     | Plans at decision time           | Policy pre-learned
Adaptability | Adapts immediately to new rules  | Needs retraining

When to use MCTS:

  • You have a simulator or model of the environment that you can roll forward.
  • The branching factor is too large for exhaustive search.
  • You can afford compute at decision time and want anytime behavior.
  • The rules or goals may change, so a pre-trained policy would go stale.

When to use RL instead:

  • No reliable simulator is available, so you need model-free learning from experience.
  • Decisions must be near-instant at runtime, leaving no budget for planning at decision time.
  • You can invest in a training phase and amortize it over many decisions.

Latest Developments & Research

MuZero (DeepMind, 2020)

MuZero extends AlphaZero by learning the environment model itself:

  • It learns a representation function, a dynamics function, and a prediction (policy/value) function from experience.
  • MCTS runs inside this learned latent model rather than a hand-built simulator.
  • It matched AlphaZero's strength in Go, chess, and shogi, and also mastered Atari games where the rules are not given.

Significance: MCTS no longer requires a known simulator—it can plan in learned latent spaces.

Sampled MuZero & Stochastic MCTS

Recent work extends MCTS to stochastic environments and very large action spaces:

  • Sampled MuZero plans over a sampled subset of actions, making search practical for large or continuous action spaces.
  • Stochastic MuZero adds chance nodes (via learned afterstates) so planning handles environments with random outcomes.

MCTS for LLM Reasoning (2024-2025)

Researchers are applying MCTS to improve LLM reasoning:

  • Treating individual reasoning steps or tool calls as actions and searching over them (Tree-of-Thoughts-style search).
  • Replacing random rollouts with a verifier or reward model as the value estimate.
  • Using visit statistics to rank candidate solutions instead of committing to a single chain of thought.

Benchmark improvements:
Search-augmented reasoning methods generally report gains over single-pass prompting on math and code benchmarks, though the size of the improvement depends heavily on the model and the compute budget.

Open problems:

  • Cost: each node expansion and evaluation can require several LLM calls.
  • Reward reliability: having a model grade its own partial solutions is noisy and can be systematically biased.
  • Action granularity: whether an "action" should be a token, a sentence, a reasoning step, or a tool call is still an open design choice.

Cross-Disciplinary Insight: Cognitive Science

MCTS mirrors mental simulation in human cognition:

  • People rarely enumerate every possible future; they imagine a handful of plausible ones.
  • Imagined outcomes are evaluated and fed back into preferences over the current options.
  • Deliberation concentrates on options that keep looking promising, much like UCB-driven selection.

Research connection:
Neuroscience shows the brain’s hippocampus “replays” experiences during rest, similar to MCTS simulations consolidating value estimates. This suggests MCTS captures something fundamental about how intelligent systems plan.

Daily Challenge

Implement MCTS for a text adventure game:

Create a simple text-based game state (e.g., navigating rooms, picking up items, achieving a goal). Implement MCTS to find an optimal action sequence.

Requirements:

  1. Define a GameState class with methods:

    • get_legal_actions()
    • take_action(action)
    • is_terminal()
    • get_reward()
  2. Implement the four MCTS phases

  3. Run 100 iterations and print the recommended action

  4. Bonus: Compare MCTS’s choice to random search (picking random actions until goal achieved). How many fewer steps does MCTS take on average?

Extension: Try integrating an LLM to generate actions or evaluate states instead of hand-coding them.

References & Further Reading

Foundational Papers:

  • Coulom, R. (2006). "Efficient Selectivity and Backup Operators in Monte-Carlo Tree Search." Computers and Games.
  • Kocsis, L., & Szepesvári, C. (2006). "Bandit Based Monte-Carlo Planning." ECML. (Introduces UCT.)
  • Browne, C., et al. (2012). "A Survey of Monte Carlo Tree Search Methods." IEEE Transactions on Computational Intelligence and AI in Games.

AlphaGo/AlphaZero:

  • Silver, D., et al. (2016). "Mastering the Game of Go with Deep Neural Networks and Tree Search." Nature.
  • Silver, D., et al. (2018). "A General Reinforcement Learning Algorithm that Masters Chess, Shogi, and Go through Self-Play." Science. (AlphaZero.)

Recent Extensions:

  • Schrittwieser, J., et al. (2020). "Mastering Atari, Go, Chess and Shogi by Planning with a Learned Model." Nature. (MuZero.)
  • Hubert, T., et al. (2021). "Learning and Planning in Complex Action Spaces." ICML. (Sampled MuZero.)
  • Antonoglou, I., et al. (2022). "Planning in Stochastic Environments with a Learned Model." ICLR. (Stochastic MuZero.)

Books:

  • Sutton, R. S., & Barto, A. G. (2018). Reinforcement Learning: An Introduction (2nd ed.). MIT Press. (Chapter 8 covers planning and MCTS.)