Self-Play and Competitive Learning for AI Agents
Concept Introduction
Simple Explanation: Self-play is when an AI agent improves by playing against copies of itself, learning from both wins and losses without needing human opponents or labeled data. Imagine a chess player who gets better by analyzing games they play against their past selves—each version slightly stronger than the last.
Technical Detail: Self-play is a reinforcement learning strategy where agents generate their own training data by competing or interacting with copies of themselves at various skill levels. The agent explores the state-action space, discovers strategies through trial and error, and iteratively improves its policy based on outcomes. This approach is particularly powerful for sequential decision-making tasks where optimal strategies aren’t known in advance and must be discovered through exploration.
Historical & Theoretical Context
Origins: The Game-Playing Revolution
Self-play’s modern prominence began with TD-Gammon (Gerald Tesauro, 1992), a neural network that learned to play backgammon at expert level entirely through self-play, using temporal difference learning. TD-Gammon started with random play and, through millions of self-play games, discovered strategies that surprised human experts.
The technique exploded into public consciousness with AlphaGo (DeepMind, 2016), which used a combination of supervised learning from human games and self-play reinforcement learning to defeat world champion Lee Sedol at Go—a game with more possible positions than atoms in the observable universe. AlphaGo’s successor, AlphaGo Zero (2017), removed even the supervised learning component, learning purely from self-play and tabula rasa initialization, yet became stronger than its predecessor.
AlphaZero (2017) generalized the approach to chess and shogi, mastering all three games (Go, chess, shogi) from scratch through self-play alone, reaching superhuman performance in less than 24 hours of training.
Theoretical Foundation: Minimax, Nash Equilibrium, and Evolutionary Dynamics
Self-play connects to several theoretical frameworks:
Minimax Theory: In two-player zero-sum games, self-play approximates minimax reasoning: each agent tries to maximize its own reward, which is the same as minimizing the opponent's. Over time, the learned strategies tend to converge toward Nash equilibria (a minimal worked example follows this list).
Nash Equilibrium: In multi-agent systems, self-play can discover equilibrium strategies where no agent can improve by unilaterally changing its policy. This is particularly relevant for competitive scenarios.
Evolutionary Algorithms: Self-play resembles evolutionary dynamics where agents with successful strategies “survive” and propagate, while unsuccessful strategies die out. Population-based methods often maintain diverse agent populations that compete.
Curriculum Learning: Self-play naturally implements curriculum learning—as the agent improves, opponents become more challenging, providing appropriately difficult training signals throughout the learning process.
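To make the equilibrium claim concrete, here is a small self-contained sketch of two copies of the same learner playing rock-paper-scissors against each other. It uses regret matching, a simple no-regret algorithm rather than the policy-gradient machinery used later in this lesson, and the function names are illustrative; the time-averaged strategies drift toward the game's unique mixed Nash equilibrium (1/3, 1/3, 1/3).

import numpy as np

# Row/column order: rock, paper, scissors. Entry [i, j] is the payoff to a player
# who plays i against an opponent who plays j (+1 win, -1 loss, 0 tie).
PAYOFF = np.array([[ 0, -1,  1],
                   [ 1,  0, -1],
                   [-1,  1,  0]])

def regret_matching_self_play(iterations=20000, seed=0):
    rng = np.random.default_rng(seed)
    regrets = [np.zeros(3), np.zeros(3)]
    strategy_sums = [np.zeros(3), np.zeros(3)]
    for _ in range(iterations):
        # Each player converts its positive regrets into a mixed strategy
        strategies = []
        for p in range(2):
            positive = np.maximum(regrets[p], 0.0)
            strategy = positive / positive.sum() if positive.sum() > 0 else np.full(3, 1 / 3)
            strategy_sums[p] += strategy
            strategies.append(strategy)
        actions = [rng.choice(3, p=strategies[p]) for p in range(2)]
        # Regret for each action = its payoff against the opponent's move, minus the
        # payoff of the action actually taken (the game is symmetric, so both players
        # can read their payoffs from the same matrix)
        for p, q in ((0, 1), (1, 0)):
            counterfactual = PAYOFF[:, actions[q]]
            regrets[p] = regrets[p] + (counterfactual - counterfactual[actions[p]])
    # Time-averaged strategies approximate the Nash equilibrium
    return [s / iterations for s in strategy_sums]

print(regret_matching_self_play())  # both averages approach [0.333, 0.333, 0.333]

The algorithm here is deliberately simple, but the phenomenon is the one that matters for this lesson: pure self-play pressure, with no external teacher, is enough to push strategies toward equilibrium play.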
Algorithms & Math
Basic Self-Play Algorithm (Pseudocode)
Initialize agent policy π₀ randomly
For iteration t = 1 to T:
    Generate N games by having πₜ play against πₜ
    Collect trajectory data: (s, a, r, s')
    For each trajectory:
        Compute returns Gₜ (discounted cumulative rewards)
        Compute policy gradient: ∇θ J(θ) = E[∇θ log π(a|s) * Gₜ]
    Update policy: θₜ₊₁ = θₜ + α * ∇θ J(θ)
    Optionally: add πₜ to a population of past policies
    Periodically: evaluate πₜ against benchmark opponents
Return final policy π_T
Mathematical Formulation
The objective in self-play reinforcement learning is to find a policy π that maximizes expected return when playing against itself:
J(θ) = E_{τ ~ π_θ vs π_θ}[∑ₜ γᵗ r(sₜ, aₜ)]
Where:
- τ is a trajectory (sequence of states and actions)
- γ is the discount factor
- r(sₜ, aₜ) is the reward at time t
- π_θ is the policy parameterized by θ
Policy Gradient Update:
∇θ J(θ) = E_τ[∑ₜ ∇θ log π_θ(aₜ|sₜ) * Aₜ]
Where Aₜ is the advantage function (how much better action aₜ was compared to average).
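For reference, here is a small, self-contained helper (a sketch; the function names are illustrative) that computes the discounted returns Gₜ and a baseline-subtracted advantage estimate Aₜ = Gₜ − V(sₜ) used in the update above:

import numpy as np

def discounted_returns(rewards, gamma=0.99):
    """G_t = r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ..., computed backwards."""
    returns = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

def advantages(rewards, values, gamma=0.99):
    """Simple advantage estimate A_t = G_t - V(s_t), using a value-function baseline."""
    return discounted_returns(rewards, gamma) - np.asarray(values)

# Example: a 4-step trajectory whose only reward is a win at the end
rewards = [0.0, 0.0, 0.0, 1.0]
values = [0.4, 0.5, 0.6, 0.8]            # hypothetical value-function estimates
print(discounted_returns(rewards))       # [0.970299, 0.9801, 0.99, 1.0]
print(advantages(rewards, values))       # returns minus baseline

Subtracting the baseline does not change the expected gradient, but it substantially reduces its variance, which is why most practical self-play systems use an advantage rather than the raw return.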
AlphaZero’s Self-Play Loop
AlphaZero combines Monte Carlo Tree Search (MCTS) with neural network policy and value functions:
Self-Play Game Generation:
- For each state s, run MCTS guided by policy π and value v
- MCTS simulations use neural network to evaluate positions
- Select moves proportional to visit counts in search tree
- Record (state, MCTS policy, game outcome) tuples
Neural Network Training:
- Policy head: Minimize cross-entropy between MCTS policy and network policy
- Value head: Minimize MSE between game outcome z and network value v(s)
L(θ) = -π_MCTS · log(π_θ) + (z - v_θ(s))² + c·||θ||²
Iteration:
- Use updated network to generate new self-play games
- Repeat until convergence or time limit
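As a concrete reference, here is a minimal PyTorch sketch of the combined loss above. It assumes the network exposes policy logits and a scalar value prediction; the function and argument names are illustrative rather than AlphaZero's actual code, and in practice the L2 term is usually handled by the optimizer's weight_decay.

import torch
import torch.nn.functional as F

def alphazero_loss(policy_logits, value_pred, mcts_policy, outcome,
                   l2_coeff=1e-4, parameters=None):
    """Cross-entropy to the MCTS visit distribution + MSE to the game outcome + L2 penalty."""
    # Policy term: -pi_MCTS . log(pi_theta)
    log_probs = F.log_softmax(policy_logits, dim=-1)
    policy_loss = -(mcts_policy * log_probs).sum(dim=-1).mean()
    # Value term: (z - v_theta(s))^2
    value_loss = F.mse_loss(value_pred.squeeze(-1), outcome)
    # Regularization term: c * ||theta||^2
    l2 = 0.0
    if parameters is not None:
        l2 = l2_coeff * sum((p ** 2).sum() for p in parameters)
    return policy_loss + value_loss + l2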
Design Patterns & Architectures
Pattern 1: Symmetric Self-Play (Zero-Sum Games)
Use Case: Board games, competitive strategy games, adversarial scenarios
Architecture:
- Single policy network
- Agents are identical copies
- Outcome is win/loss/draw
- Training signal: agent learns to beat itself
Pros: Simple, efficient, guarantees opponents of equal skill
Cons: Can converge to local equilibria; may develop strategies that exploit self-specific weaknesses
Pattern 2: Population-Based Self-Play
Use Case: Complex strategy games, preventing overfitting to single opponent
Architecture:
- Maintain population of past policy versions
- Agent plays against randomly sampled past opponents
- Preserves diversity of strategies
Implementation Detail:
import copy
import random

class PolicyPopulation:
    """Rolling buffer of frozen policy snapshots to sample opponents from."""
    def __init__(self, max_size=100):
        self.policies = []
        self.max_size = max_size

    def add_policy(self, policy):
        # Store a deep copy so later training does not mutate the snapshot
        self.policies.append(copy.deepcopy(policy))
        if len(self.policies) > self.max_size:
            self.policies.pop(0)  # FIFO: drop the oldest snapshot

    def sample_opponent(self):
        return random.choice(self.policies)
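A sketch of how the population plugs into a training loop. The play_game and update hooks below are placeholders standing in for game generation and a policy-gradient step (see the Tic-Tac-Toe example later in this lesson), not a particular library's API.

def play_game(agent, opponent): ...   # placeholder: generate one self-play game
def update(agent, trajectory): ...    # placeholder: apply a gradient step

policy = object()                     # stands in for an actual policy network
population = PolicyPopulation(max_size=10)
population.add_policy(policy)         # seed the population with the initial policy

for episode in range(10000):
    opponent = population.sample_opponent()
    trajectory = play_game(policy, opponent)  # main agent vs. a sampled past self
    update(policy, trajectory)                # only the main policy is trained
    if episode % 100 == 0:
        population.add_policy(policy)         # periodically freeze a snapshot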
Pattern 3: League Training (AlphaStar)
Use Case: Highly complex domains requiring strategy diversity
Architecture:
- Main agents (continually trained)
- League opponents (frozen snapshots of past main agents)
- Exploiter agents (trained to beat current main agent)
Why It Works: Prevents cycling between strategies, maintains broad competency across strategy space
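To make the opponent-selection step concrete: AlphaStar's league used prioritized fictitious self-play, in which league members the main agent still loses to are sampled more often. A simplified sketch follows; the exact weighting function is illustrative (AlphaStar uses several weighting schemes).

import random

def sample_league_opponent(opponents, win_rates):
    """Sample a frozen league member, weighting hard opponents more heavily.

    win_rates[i] is the main agent's current win rate against opponents[i].
    """
    # (1 - win_rate)^2 emphasizes opponents we still lose to; the small constant
    # keeps already-beaten opponents from disappearing entirely
    weights = [(1.0 - w) ** 2 + 1e-3 for w in win_rates]
    return random.choices(opponents, weights=weights, k=1)[0]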
Pattern 4: Cooperative Self-Play
Use Case: Multi-agent coordination, team-based scenarios
Architecture:
- Multiple agents with shared or individual policies
- Reward shared across team
- Agents learn coordination through self-play
Example: Training soccer-playing agents where multiple copies coordinate to score goals
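A minimal sketch of the shared-reward idea (the helper name is illustrative): every agent on the team is credited with the same team-level outcome, so each individual policy is reinforced toward whatever helped the team.

def assign_team_rewards(team_trajectories, team_reward):
    """Broadcast one team-level outcome to every agent's trajectory (shared credit)."""
    labeled = []
    for agent_trajectory in team_trajectories:
        # Each (state, action, log_prob) step receives the same team reward signal
        labeled.append([(s, a, lp, team_reward) for (s, a, lp) in agent_trajectory])
    return labeled

This shared signal is what makes coordination learnable, but it also creates the multi-agent credit assignment problem discussed under Open Problems below.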
Practical Application
Example: Training a Simple Game-Playing Agent with Self-Play
Let’s implement self-play for a simple game (Tic-Tac-Toe) using policy gradients:
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
import random
# Simple neural network policy for Tic-Tac-Toe
class TicTacToePolicy(nn.Module):
def __init__(self):
super().__init__()
self.fc1 = nn.Linear(9, 64)
self.fc2 = nn.Linear(64, 64)
self.fc3 = nn.Linear(64, 9) # Output: action logits for 9 positions
def forward(self, state):
x = torch.relu(self.fc1(state))
x = torch.relu(self.fc2(x))
return self.fc3(x)
def get_action(self, state, valid_actions):
"""Sample action from policy, masking invalid moves"""
logits = self.forward(state)
# Mask invalid actions
mask = torch.full_like(logits, float('-inf'))
mask[valid_actions] = 0
logits = logits + mask
probs = torch.softmax(logits, dim=-1)
action = torch.multinomial(probs, 1).item()
return action, torch.log(probs[action])
class TicTacToeGame:
def __init__(self):
self.reset()
def reset(self):
self.board = [0] * 9 # 0: empty, 1: player 1, -1: player 2
self.current_player = 1
return self.get_state()
def get_state(self):
"""Return board from current player's perspective"""
return torch.FloatTensor([x * self.current_player for x in self.board])
def valid_actions(self):
return [i for i, x in enumerate(self.board) if x == 0]
def step(self, action):
if self.board[action] != 0:
return self.get_state(), -1, True # Invalid move loses
self.board[action] = self.current_player
# Check win
win_patterns = [
[0,1,2], [3,4,5], [6,7,8], # Rows
[0,3,6], [1,4,7], [2,5,8], # Columns
[0,4,8], [2,4,6] # Diagonals
]
for pattern in win_patterns:
if all(self.board[i] == self.current_player for i in pattern):
return self.get_state(), 1, True # Win
# Check draw
if 0 not in self.board:
return self.get_state(), 0, True # Draw
# Switch player
self.current_player *= -1
return self.get_state(), 0, False # Continue
def self_play_episode(policy1, policy2):
"""Play one game and collect trajectories"""
game = TicTacToeGame()
trajectories = [[], []] # One for each player
state = game.get_state()
done = False
player_idx = 0
while not done:
policy = policy1 if player_idx == 0 else policy2
valid = game.valid_actions()
action, log_prob = policy.get_action(state, valid)
trajectories[player_idx].append((state, action, log_prob))
next_state, reward, done = game.step(action)
state = next_state
player_idx = 1 - player_idx
# Assign rewards: winner gets +1, loser -1, draw 0
if reward == 1: # Last player won
rewards = [-1, 1] if player_idx == 0 else [1, -1]
else:
rewards = [reward, reward]
return trajectories, rewards
def train_self_play(episodes=5000):
    policy = TicTacToePolicy()
    optimizer = optim.Adam(policy.parameters(), lr=0.001)
    for episode in range(episodes):
        # Self-play: the same policy controls both players
        trajectories, rewards = self_play_episode(policy, policy)
        # REINFORCE update: sum both players' losses and take a single gradient step.
        # (Stepping the optimizer between two separate backward passes would modify
        # parameters in place that the second player's graph still needs.)
        loss = 0
        for player_idx in range(2):
            for state, action, log_prob in trajectories[player_idx]:
                loss = loss - log_prob * rewards[player_idx]  # policy gradient
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        if episode % 500 == 0:
            print(f"Episode {episode}: Training...")
    return policy
# Train the agent
trained_policy = train_self_play()
print("Training complete!")
Using LLM Agents with Self-Play
For more complex domains, we can use LLMs as policies:
from langchain_openai import ChatOpenAI
from langchain_core.prompts import PromptTemplate

class LLMAgent:
    def __init__(self, model="gpt-4"):
        # gpt-4 is a chat model, so use the chat wrapper rather than the completion API
        self.llm = ChatOpenAI(model=model, temperature=0.7)
self.template = PromptTemplate(
input_variables=["game_state", "history"],
template="""
You are playing a strategic game.
Current game state: {game_state}
Game history: {history}
Choose your next move and explain your reasoning.
Output format: MOVE: <move> | REASONING: <reasoning>
"""
)
def get_action(self, game_state, history):
prompt = self.template.format(
game_state=game_state,
history=history
)
        response = self.llm.invoke(prompt).content
# Parse response to extract move
move = self.parse_move(response)
return move
def parse_move(self, response):
# Extract move from LLM response
# Implementation depends on game format
pass
# Self-play with LLM agents
def llm_self_play(agent1, agent2, num_games=10):
    results = {"agent1_wins": 0, "agent2_wins": 0, "draws": 0}
    for game_num in range(num_games):
        game = TicTacToeGame()  # Or a more complex game
        history = []
        done = False
        while not done:
            current_agent = agent1 if game.current_player == 1 else agent2
            mover = game.current_player  # step() switches players, so record the mover now
            state_str = str(game.board)
            move = current_agent.get_action(state_str, history)
            state, reward, done = game.step(move)
            history.append(f"Player {mover}: {move}")
        # On a win, step() returns before switching players, so current_player is the winner
        if reward == 1:
            results["agent1_wins" if game.current_player == 1 else "agent2_wins"] += 1
        else:
            results["draws"] += 1  # draws (and invalid moves) are lumped together here
    return results
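Assuming parse_move has been implemented for the chosen game and an OpenAI API key is configured in the environment, a short LLM-vs-LLM match could be run like this:

agent_a = LLMAgent(model="gpt-4")
agent_b = LLMAgent(model="gpt-4")
print(llm_self_play(agent_a, agent_b, num_games=5))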
Comparisons & Tradeoffs
Self-Play vs. Supervised Learning
Supervised Learning:
- Pros: Faster initial learning, leverages human expertise
- Cons: Limited by quality of training data, can’t exceed human performance, requires labeled data
Self-Play:
- Pros: Discovers novel strategies, can exceed human performance, no labeled data needed
- Cons: Slower initial progress, can converge to local optima, computationally expensive
Hybrid Approach: AlphaGo used supervised learning to bootstrap from human games, then improved via self-play. This combines fast initial learning with open-ended discovery.
Symmetric vs. Population-Based Self-Play
Symmetric (Single Policy):
- Pros: Simple, memory-efficient, always has matched opponent
- Cons: Can cycle between strategies, vulnerable to self-exploitation
Population-Based:
- Pros: More robust, prevents overfitting to self, maintains strategy diversity
- Cons: Requires more memory, more complex implementation
When to Use Each: Symmetric works well for simple games with clear optimal strategies. Population-based is essential for complex domains where strategy diversity matters (StarCraft, Dota).
Scalability Challenges
Self-play is computationally expensive:
- AlphaGo Zero: 4.9 million self-play games, 3 days on TPUs
- OpenAI Five (Dota 2): roughly 180 years of gameplay against itself every day, about 45,000 years of self-play in total
- AlphaStar (StarCraft II): up to 200 years of real-time gameplay per agent over the course of training
Optimizations:
- Prioritized experience replay (train on interesting games)
- Distributed training (many workers generating games simultaneously)
- Model-based RL (learn environment model, plan in imagination)
Latest Developments & Research
Emergent Complexity and Multi-Agent Dynamics (2022-2025)
Recent research shows that self-play in multi-agent environments leads to emergent complexity and tool use:
- Cicero (Meta, 2022): Combined strategic reasoning with natural-language negotiation in Diplomacy; its planning component was trained with self-play RL regularized toward human play, while its dialogue model was trained on human games
- Multi-Agent LLM Systems (2024-2025): Researchers are exploring self-play for training LLM agents in collaborative and competitive scenarios, discovering that agents develop specialized roles and communication protocols
Self-Play for Reasoning and Tool Use
New work applies self-play beyond games:
- Self-Taught Reasoner (STaR, 2022) and its successors: LLMs improve reasoning by generating candidate rationales, keeping the ones that lead to correct answers, and fine-tuning on them, in effect self-play for logical reasoning
- Constitutional AI-style adversarial training: one agent generates harmful or adversarial prompts while another learns to refuse them safely, a form of automated red-teaming via self-play
Open Problems
- Credit Assignment in Long Horizons: How do we attribute success or failure to specific decisions in games lasting thousands of steps?
- Preventing Mode Collapse: Self-play can converge to narrow, non-general strategies; how do we maintain exploration?
- Sample Efficiency: Self-play requires massive compute; can we learn faster?
- Transfer Learning: Agents trained via self-play often don't transfer well to slightly different environments
- Human Alignment: Self-play discovers superhuman strategies, but are they aligned with human values and preferences?
Recent Benchmarks
- NetHack Learning Environment (2020, actively used 2025): Complex roguelike game testing long-horizon RL
- MeltingPot (DeepMind, 2023): Multi-agent benchmark emphasizing social dynamics
- Wordle and Connections (LLM Self-Play, 2024): Testing language models via self-competitive word games
Cross-Disciplinary Insight: Biology and Cultural Evolution
Self-play mirrors biological evolution: organisms that survive pass on genes, while unsuccessful ones don’t. The environment (other organisms) creates selective pressure, just as self-play creates training pressure.
Evolutionary Game Theory provides mathematical models for this. For example, Evolutionary Stable Strategies (ESS) describe strategies that, if adopted by a population, can’t be invaded by alternative strategies—exactly what self-play often converges to.
In cultural evolution, ideas (memes) compete for adoption: unsuccessful ones die out while successful ones spread. Self-play in multi-agent systems resembles cultural evolution where strategies compete, successful ones propagate, and the population becomes increasingly sophisticated.
Neuroscience Connection: The brain might use self-play internally—mental simulation where we imagine outcomes of actions before taking them, learning from imagined successes and failures. Offline RL with learned models resembles this “imagination-based” learning.
Daily Challenge
Challenge: Implement Miniature League Training
Build a simple population-based self-play system for Tic-Tac-Toe (or Connect-Four if you want more complexity):
- Population Management: Maintain a population of the 10 most recent policy versions
- Training Loop:
- Train the main agent against randomly sampled opponents from the population
- Every 100 episodes, add current policy to population
- Track win rate against each historical opponent
- Evaluation: After training, evaluate final policy against random, first-version, and mid-training versions
- Analysis Question: Does population-based training lead to more robust policies than pure symmetric self-play? Do you see different emergent strategies?
Bonus: Visualize the evolution of strategy. Can you identify qualitative shifts in play style over training?
Expected Time: 20-30 minutes
References & Further Reading
Foundational Papers:
- Tesauro, G. (1995). “Temporal Difference Learning and TD-Gammon.” Communications of the ACM
- Silver, D. et al. (2016). “Mastering the game of Go with deep neural networks and tree search.” Nature
- Silver, D. et al. (2017). “Mastering the game of Go without human knowledge.” Nature
- Silver, D. et al. (2017). “Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm.” arXiv
Modern Applications:
- Berner, C. et al. (2019). “Dota 2 with Large Scale Deep Reinforcement Learning.” arXiv (OpenAI Five)
- Vinyals, O. et al. (2019). “Grandmaster level in StarCraft II using multi-agent reinforcement learning.” Nature (AlphaStar)
- Meta Fundamental AI Research (2022). “Human-level play in the game of Diplomacy by combining language models with strategic reasoning.” Science (Cicero)
Recent Developments:
- Chen, Z. et al. (2024). “Self-Play Fine-Tuning Converts Weak Language Models to Strong Language Models.” arXiv
- Bai, Y. et al. (2022). “Constitutional AI: Harmlessness from AI Feedback.” arXiv
Code & Frameworks:
- OpenAI Gym (now maintained as Gymnasium): gymnasium.farama.org
- PettingZoo (multi-agent environments): pettingzoo.farama.org
- RLlib (scalable RL): docs.ray.io/en/latest/rllib
Blog Posts:
- DeepMind Blog: AlphaGo, AlphaZero, MuZero series
- OpenAI Blog: Dota 2, competitive self-play
- Andrej Karpathy: “Deep Reinforcement Learning: Pong from Pixels”