Hierarchical Reinforcement Learning for Multi-Agent Systems

Introduction: Breaking Down Complex Agent Behaviors

Imagine programming a team of robots to build a house. Each robot needs to coordinate with others, learn from experience, and break down the massive goal of “build a house” into manageable subtasks like “lay foundation,” “frame walls,” and “install wiring.” How do you structure learning and decision-making for such complex, multi-agent tasks?

Hierarchical Reinforcement Learning (HRL) provides an elegant solution by organizing agent behavior into multiple levels of abstraction. Instead of learning one monolithic policy mapping every possible state to actions, HRL structures policies into hierarchies where high-level policies choose goals and low-level policies determine how to achieve them.

For software engineers building AI agents, HRL offers a principled way to decompose complex problems, enable transfer learning across tasks, and coordinate multiple agents pursuing shared goals.

Historical & Theoretical Context

Origins of Hierarchical RL

The foundations emerged in the 1990s as researchers grappled with RL’s “curse of dimensionality”—as state spaces grow, flat RL approaches require exponentially more samples to learn effective policies.

Key developments:

  • Feudal reinforcement learning (Dayan & Hinton, 1993): manager-worker hierarchies with reward and information hiding
  • The options framework (Sutton, Precup & Singh, 1999): temporally extended actions formalized within a semi-MDP
  • MAXQ value function decomposition (Dietterich, 2000): recursive decomposition of the value function over a task hierarchy

Why Hierarchy Matters in Multi-Agent Systems

In multi-agent contexts, hierarchy provides:

  1. Scalability: Agents can coordinate on high-level strategies while independently solving low-level execution details
  2. Modularity: Low-level skills learned for one task transfer to others (e.g., “navigate to location” works for both “gather resources” and “deliver payload”)
  3. Interpretability: Hierarchical structures are easier to understand and debug than monolithic policies
  4. Sample efficiency: Learning reusable sub-policies requires fewer samples than learning from scratch for each new task

Algorithms & Math

The Options Framework

An option is a tuple o = ⟨I, π, β⟩, where:

  • I ⊆ S is the initiation set: the states in which the option may be invoked
  • π is the intra-option policy that selects primitive actions while the option is running
  • β(s) is the termination condition: the probability that the option ends in state s

The agent chooses among options using a high-level policy. Once an option is selected, its internal policy π governs actions until termination β.
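
To make the abstraction concrete, here is a minimal sketch of how an option could be represented in code. The Option container and the run_option helper (with its assumed env_step(state, action) → (next_state, reward, done) callable) are illustrative, not a standard API:

import random
from dataclasses import dataclass
from typing import Callable

@dataclass
class Option:
    """An option o = <I, pi, beta> as a plain data container."""
    initiation: Callable[[object], bool]    # I: may this option start in state s?
    policy: Callable[[object], int]         # pi: primitive action to take in state s
    termination: Callable[[object], float]  # beta: probability of terminating in state s

def run_option(env_step, state, option, max_steps=100):
    """Execute one option to completion (a single semi-MDP step)."""
    total_reward, steps = 0.0, 0
    while steps < max_steps:
        action = option.policy(state)
        state, reward, done = env_step(state, action)
        total_reward += reward
        steps += 1
        if done or random.random() < option.termination(state):
            break
    return state, total_reward, steps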

MAXQ Value Function Decomposition

MAXQ decomposes the value function hierarchically. For a parent task M with subtasks {M₁, M₂, …, Mₙ}, the value of invoking subtask Mᵢ in state s is:

Q(M, s, Mᵢ) = V(Mᵢ, s) + C(M, s, Mᵢ)

Where:

  • V(Mᵢ, s) is the expected cumulative reward earned while executing subtask Mᵢ starting from state s
  • C(M, s, Mᵢ) is the completion function: the expected reward for finishing the parent task M after Mᵢ terminates

The decomposition continues recursively for subtasks. This enables:

  • Subtask value functions and policies to be reused across different parent tasks
  • Each subtask to be learned over its own, smaller state abstraction, improving sample efficiency
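
A minimal sketch of the recursion, assuming each task node exposes is_primitive and children attributes and that the learned V and C estimates live in dictionaries (all names here are illustrative):

def q_value(task, state, subtask, V, C):
    """Q(task, s, subtask) = V(subtask, s) + C(task, s, subtask)."""
    return v_value(subtask, state, V, C) + C[(task, state, subtask)]

def v_value(task, state, V, C):
    """Primitive tasks read a learned one-step estimate; composite tasks
    evaluate to the best decomposed Q over their child subtasks."""
    if task.is_primitive:
        return V[(task, state)]
    return max(q_value(task, state, child, V, C) for child in task.children)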

Feudal Reinforcement Learning (FRL)

In Feudal RL, a manager agent sets goals g for worker agents. The manager maximizes long-term reward, while workers maximize immediate reward for achieving g.

Manager policy: π_manager(s) → g
Worker policy: π_worker(s, g) → a

The system uses two levels of temporal abstraction:

  • The manager operates on a coarser timescale, emitting a new goal only every k steps (or when the current goal is achieved)
  • The worker operates at every environment step, choosing primitive actions that make progress toward the current goal
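
A sketch of the resulting two-timescale control loop; manager_policy, worker_policy, and the goal interval k are assumptions for illustration, and the environment follows the Gymnasium step API used in the code example later in this article:

def feudal_episode(env, manager_policy, worker_policy, k=10, max_steps=500):
    """Manager emits a goal every k steps; the worker acts at every step."""
    obs, _ = env.reset()
    goal, total_reward = None, 0.0
    for t in range(max_steps):
        if t % k == 0:                     # manager's coarser timescale
            goal = manager_policy(obs)     # g = pi_manager(s)
        action = worker_policy(obs, goal)  # a = pi_worker(s, g)
        obs, reward, terminated, truncated, _ = env.step(action)
        total_reward += reward
        if terminated or truncated:
            break
    return total_reward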

Recent Algorithmic Advances: Feudal Networks for HRL (FuN)

Feudal Networks (Vezhnevets et al., 2017) implement feudal RL using deep neural networks:

  • A manager network operates at a lower temporal resolution and emits goal directions in a learned latent state space
  • A worker network outputs primitive actions and receives an intrinsic reward for moving the latent state in the direction of the manager’s goal (measured by cosine similarity)

The manager learns which goals lead to environmental rewards, while workers learn to efficiently reach any goal.
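
For intuition, here is a sketch of the worker’s intrinsic reward in FuN: the average cosine similarity between recent latent-state displacements and the manager’s goal vectors over a horizon c. The latent state_seq and goal_seq arrays are assumed given by the network:

import numpy as np

def fun_intrinsic_reward(state_seq, goal_seq, c=10):
    """r_t = (1/c) * sum_i cos(s_t - s_{t-i}, g_{t-i}) for i = 1..c."""
    t = len(state_seq) - 1
    sims = []
    for i in range(1, min(c, t) + 1):
        diff = state_seq[t] - state_seq[t - i]   # latent displacement over i steps
        g = goal_seq[t - i]                      # goal the manager set i steps ago
        denom = np.linalg.norm(diff) * np.linalg.norm(g) + 1e-8
        sims.append(float(diff @ g) / denom)
    return sum(sims) / max(len(sims), 1)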

Design Patterns & Architectures

Pattern 1: Goal-Conditioned Policies

Structure agents with goal-conditioned value functions: Q(s, a, g) estimates value of taking action a in state s given goal g.

Benefits:

  • A single policy generalizes across many goals rather than requiring one policy per task
  • Experience can be relabeled with alternative goals after the fact (as in hindsight experience replay), so even failed episodes provide learning signal
  • A higher-level controller can direct behavior simply by choosing which goal to feed the policy

Implementation: Add goal embedding as input to your policy network alongside state observation.
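
A minimal sketch in PyTorch (which stable-baselines3 already depends on); the network simply concatenates the state and goal vectors before the shared layers:

import torch
import torch.nn as nn

class GoalConditionedQNet(nn.Module):
    """Q(s, a, g): one Q-value per discrete action, conditioned on the goal."""
    def __init__(self, state_dim, goal_dim, num_actions, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + goal_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, num_actions),
        )

    def forward(self, state, goal):
        # Goal embedding is concatenated with the state observation
        return self.net(torch.cat([state, goal], dim=-1))

# Example: Q-values for a 3-dim state, 2-dim goal, 5 discrete actions
q_net = GoalConditionedQNet(state_dim=3, goal_dim=2, num_actions=5)
q_values = q_net(torch.randn(1, 3), torch.randn(1, 2))  # shape (1, 5)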

Pattern 2: Temporal Abstraction via Options

Define reusable skills as options. For example, in a warehouse robot system:

High-level options:

  • navigate_to_shelf(shelf_id): move to the shelf holding a target item
  • pick_item: grasp the item at the current location
  • deliver_to_station: carry the held item to the packing station

Low-level primitive actions:

  • move one grid cell up, down, left, or right
  • pick up or put down an item

The agent learns when to invoke each option and how to execute each option’s internal policy.
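
The “when to invoke” part can be learned with SMDP Q-learning over options. A tabular sketch, where Q is a NumPy array indexed by (state, option), k is the number of primitive steps the option ran, and reward is the discounted return accumulated during that run (all names illustrative):

import numpy as np

def smdp_q_update(Q, state, option, reward, next_state, k, alpha=0.1, gamma=0.99):
    """Q(s, o) <- Q(s, o) + alpha * [r + gamma^k * max_o' Q(s', o') - Q(s, o)]"""
    target = reward + (gamma ** k) * np.max(Q[next_state])
    Q[state, option] += alpha * (target - Q[state, option])
    return Q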

Pattern 3: Multi-Agent Hierarchical Coordination

For multi-agent systems, use hierarchical structures where:

  • A high-level coordinator (centralized, or one agent acting as leader) assigns roles or goals to individual agents
  • Each agent runs its own low-level policy on local observations to pursue its assigned goal
  • Coordination happens on the slower, high-level timescale, while low-level execution stays decentralized

This pattern appears in team sports AI, swarm robotics, and multi-robot industrial systems.
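
As a concrete (if simplistic) stand-in for the high-level coordinator, a centralized manager could greedily assign each item to the nearest unassigned robot; a learned policy or an optimal assignment solver could replace this later:

import numpy as np

def assign_goals(robot_positions, item_positions):
    """Greedy assignment: each item goes to the closest robot that is still free."""
    assignments, taken = {}, set()
    for item_idx, item in enumerate(item_positions):
        if len(taken) == len(robot_positions):
            break  # no free robots left
        dists = [np.linalg.norm(np.asarray(robot) - np.asarray(item))
                 if robot_idx not in taken else np.inf
                 for robot_idx, robot in enumerate(robot_positions)]
        robot_idx = int(np.argmin(dists))
        assignments[robot_idx] = item_idx
        taken.add(robot_idx)
    return assignments  # maps robot index -> assigned item index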

Practical Application: Hierarchical Multi-Agent Warehouse

Let’s build a simplified hierarchical RL system for warehouse robots using Python and stable-baselines3.

Code Example

import gymnasium as gym  # stable-baselines3 (>= 2.0) expects the Gymnasium API
from gymnasium import spaces
import numpy as np
from stable_baselines3 import PPO
from stable_baselines3.common.vec_env import DummyVecEnv

class WarehouseEnv(gym.Env):
    """Simplified warehouse environment for HRL demonstration."""
    
    def __init__(self, grid_size=10, num_items=3):
        super().__init__()
        self.grid_size = grid_size
        self.num_items = num_items
        
        # State: robot position (x, y), current goal item index
        self.observation_space = spaces.Box(
            low=0, high=grid_size, shape=(3,), dtype=np.float32
        )
        
        # Actions: move up/down/left/right, pick item
        self.action_space = spaces.Discrete(5)
        
        self.reset()
    
    def reset(self, seed=None, options=None):
        super().reset(seed=seed)  # seeds the env RNG per the Gymnasium API
        self.robot_pos = np.array([0, 0])
        self.item_positions = np.random.randint(0, self.grid_size, size=(self.num_items, 2))
        self.current_goal_idx = 0
        self.items_collected = 0
        return self._get_obs(), {}  # Gymnasium reset returns (observation, info)
    
    def _get_obs(self):
        return np.array([
            self.robot_pos[0],
            self.robot_pos[1],
            self.current_goal_idx
        ], dtype=np.float32)
    
    def step(self, action):
        # Execute action
        if action < 4:  # Movement
            moves = [[-1, 0], [1, 0], [0, -1], [0, 1]]
            self.robot_pos = np.clip(
                self.robot_pos + moves[action], 0, self.grid_size - 1
            )
        
        # Reward structure: hierarchical intrinsic + extrinsic
        reward = 0
        done = False
        
        # Intrinsic reward: progress toward current goal
        goal_pos = self.item_positions[self.current_goal_idx]
        distance = np.linalg.norm(self.robot_pos - goal_pos)
        
        # Extrinsic reward: collecting items
        if action == 4 and distance < 1.0:  # Pick action near item
            reward = 10
            self.items_collected += 1
            self.current_goal_idx = (self.current_goal_idx + 1) % self.num_items
            if self.items_collected >= self.num_items:
                done = True
                reward = 50  # Bonus for completing all
        else:
            # Intrinsic reward: small penalty for distance to goal
            reward = -0.01 * distance
        
        # Gymnasium step API: (obs, reward, terminated, truncated, info)
        return self._get_obs(), reward, done, False, {}
    
    def render(self, mode='human'):
        pass


class HierarchicalAgent:
    """Two-level hierarchical agent: Manager sets goals, Worker executes."""
    
    def __init__(self, env):
        self.env = env
        
        # Manager: learns to sequence goals (which item to fetch next)
        # For simplicity, we'll use a simple goal-selection policy
        
        # Worker: learns goal-conditioned policy to reach items
        self.worker = PPO("MlpPolicy", env, verbose=1)
    
    def train_worker(self, total_timesteps=50000):
        """Train worker policy to achieve goals."""
        print("Training worker policy...")
        self.worker.learn(total_timesteps=total_timesteps)
    
    def evaluate(self, num_episodes=10):
        """Evaluate trained hierarchical agent."""
        print("\nEvaluating agent...")
        for episode in range(num_episodes):
            obs = self.env.reset()
            total_reward = 0.0
            done = False
            steps = 0
            
            while not done and steps < 200:
                action, _ = self.worker.predict(obs, deterministic=True)
                obs, rewards, dones, _ = self.env.step(action)
                # VecEnv returns batched arrays; unwrap the single environment
                total_reward += float(rewards[0])
                done = bool(dones[0])
                steps += 1
            
            print(f"Episode {episode + 1}: Total Reward = {total_reward:.2f}, Steps = {steps}")


# Usage
if __name__ == "__main__":
    # Create environment
    env = DummyVecEnv([lambda: WarehouseEnv(grid_size=10, num_items=3)])
    
    # Create hierarchical agent
    agent = HierarchicalAgent(env)
    
    # Train worker policy
    agent.train_worker(total_timesteps=50000)
    
    # Evaluate
    agent.evaluate(num_episodes=5)

Explanation

  1. Environment: Simple warehouse grid where a robot must collect items sequentially
  2. Hierarchy:
    • Manager (implicit): Goal sequence is item collection order
    • Worker: Learns to navigate to current goal item and pick it
  3. Reward structure: Combines intrinsic (progress toward current goal) and extrinsic (collecting items) rewards
  4. Scalability: Because the worker learns to reach goals in general, it transfers across different goal sequences

Extending to Multi-Agent

For multiple robots, create separate worker policies for each agent, with a centralized manager assigning goals to minimize conflicts:

# Sketch only: CentralManager, the per-agent env instances, and
# HierarchicalAgent.execute are placeholders, not defined above.
class MultiAgentWarehouse:
    def __init__(self, num_agents=3):
        # One worker policy per robot, each wrapping its own environment instance
        self.agents = [HierarchicalAgent(env) for _ in range(num_agents)]
        self.manager = CentralManager(num_agents)  # Assigns goals to agents to avoid conflicts
    
    def step(self, state):
        # Manager observes the global state and assigns one goal per agent
        goals = self.manager.assign_goals(state)
        
        # Each agent executes its low-level policy toward its assigned goal
        for agent, goal in zip(self.agents, goals):
            agent.execute(goal)
Comparisons & Tradeoffs

HRL vs Flat RL

Advantages of HRL:

  • Better sample efficiency on long-horizon, sparse-reward tasks, because exploration happens over subgoals rather than primitive actions
  • Reusable sub-policies that transfer across tasks
  • More interpretable and debuggable behavior

Disadvantages:

  • Extra design burden: someone must choose (or learn) the hierarchy, subgoals, and termination conditions
  • A poorly chosen hierarchy can exclude the optimal policy (hierarchical suboptimality)
  • More components and hyperparameters to tune

HRL vs End-to-End Deep RL

Recent end-to-end deep RL (e.g., PPO, SAC) can learn complex behaviors without explicit hierarchy. However:

  • End-to-end methods often struggle with long horizons and sparse rewards
  • The resulting policies are harder to interpret and to reuse across tasks
  • Sample requirements grow quickly with task complexity

When to use each:

  • Prefer flat, end-to-end RL for short-horizon tasks with dense rewards and no obvious subtask structure
  • Prefer HRL when tasks decompose naturally into subtasks, horizons are long, rewards are sparse, or skills must transfer across tasks and agents

Latest Developments & Research

Director (Hafner et al., 2022)

Director learns hierarchical behaviors inside a learned world model: a high-level manager policy proposes goals in the model’s latent space, and a low-level worker policy learns to reach them, which makes long-horizon tasks tractable directly from pixels.

A related recent direction uses large language models as high-level managers: the LLM proposes natural-language subgoals, and low-level RL policies learn grounded execution. The key insight is that LLMs encode common-sense world knowledge, which makes them effective at setting meaningful goals.

Skill Discovery via Intrinsic Motivation

Recent work (DIAYN, VIC, APT) automatically discovers reusable skills without hand-designed hierarchies. These methods maximize mutual information between skills and states, learning diverse behaviors that can be composed hierarchically.

Application: Pre-train diverse skills on unlabeled data, then use them as options for downstream tasks.
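
As a flavor of how these objectives turn into rewards, here is a sketch of a DIAYN-style intrinsic reward, assuming a learned skill discriminator log_q_z_given_s(state, skill) that returns log q(z | s) and a uniform prior over num_skills skills:

import numpy as np

def diayn_intrinsic_reward(log_q_z_given_s, state, skill, num_skills):
    """r = log q(z | s) - log p(z): reward states that reveal which skill is running."""
    log_p_z = -np.log(num_skills)  # uniform prior over skills
    return log_q_z_given_s(state, skill) - log_p_z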

Multi-Agent HRL in StarCraft and Dota

DeepMind’s AlphaStar (StarCraft II) and OpenAI Five (Dota 2) illustrate how structure helps at this scale:

  • AlphaStar conditions its policy on a high-level strategy statistic (for example, a build order sampled from human games), which steers low-level unit control
  • OpenAI Five is trained largely end-to-end, but coordinates five agents through a shared “team spirit” coefficient that trades off individual and team reward

Even where the hierarchy is implicit rather than an explicit HRL algorithm, separating strategic choices from low-level execution was essential for training on tasks of this complexity.

Cross-Disciplinary Insight: From Organizations to Agents

HRL mirrors how human organizations structure work. Companies use hierarchies: executives set strategy, managers translate strategy into team objectives, and individual contributors execute concrete tasks.

Similarly, HRL agents use hierarchies for scalable decision-making. This connection suggests insights from organizational theory (delegation, incentive alignment, communication protocols) may inform multi-agent HRL design.

Conversely, HRL algorithms offer lessons for organizational design—how to structure objectives, when centralization vs decentralization works, and how to incentivize alignment across hierarchical levels.

Daily Challenge: Implement a 3-Level Hierarchy

Challenge: Extend the warehouse example to a 3-level hierarchy:

  1. Executive layer: Decides which warehouse zone to prioritize (Zone A, B, or C)
  2. Manager layer: Assigns items within the chosen zone to worker robots
  3. Worker layer: Navigates and collects assigned items

Tasks:

  1. Modify the environment to have 3 zones with items in each
  2. Implement an executive policy (can be rule-based initially) that selects zones
  3. Implement a manager that assigns items to workers within the selected zone
  4. Train the worker policy to collect items
  5. Evaluate: Does the 3-level hierarchy reduce training time compared to flat RL?

Bonus: Implement option discovery—can your system automatically learn reusable navigation skills without manual specification?

References & Further Reading

Foundational Papers

  1. Sutton, R. S., Precup, D., & Singh, S. (1999). “Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning.” Artificial Intelligence.

  2. Dietterich, T. G. (2000). “Hierarchical reinforcement learning with the MAXQ value function decomposition.” Journal of Artificial Intelligence Research.

  3. Dayan, P., & Hinton, G. E. (1993). “Feudal reinforcement learning.” NIPS.

Modern Approaches

  1. Vezhnevets, A. S., et al. (2017). “FeUdal Networks for Hierarchical Reinforcement Learning.” ICML. arXiv:1703.01161

  2. Nachum, O., et al. (2018). “Data-Efficient Hierarchical Reinforcement Learning.” NeurIPS. arXiv:1805.08296

Skill Discovery

  1. Eysenbach, B., et al. (2018). “Diversity is All You Need: Learning Skills without a Reward Function.” ICLR. arXiv:1802.06070

Multi-Agent HRL

  1. Vinyals, O., et al. (2019). “Grandmaster level in StarCraft II using multi-agent reinforcement learning.” Nature.


Closing Thought: Hierarchical RL represents a shift from learning single behaviors to learning structured systems of behaviors. As AI agents tackle increasingly complex real-world problems—from multi-robot warehouses to autonomous research assistants—hierarchical approaches will be essential for scalable, interpretable, and efficient agent programming.