Hierarchical Reinforcement Learning for Multi-Agent Systems
Introduction: Breaking Down Complex Agent Behaviors
Imagine programming a team of robots to build a house. Each robot needs to coordinate with others, learn from experience, and break down the massive goal of “build a house” into manageable subtasks like “lay foundation,” “frame walls,” and “install wiring.” How do you structure learning and decision-making for such complex, multi-agent tasks?
Hierarchical Reinforcement Learning (HRL) provides an elegant solution by organizing agent behavior into multiple levels of abstraction. Instead of learning one monolithic policy mapping every possible state to actions, HRL structures policies into hierarchies where high-level policies choose goals and low-level policies determine how to achieve them.
For software engineers building AI agents, HRL offers a principled way to decompose complex problems, enable transfer learning across tasks, and coordinate multiple agents pursuing shared goals.
Historical & Theoretical Context
Origins of Hierarchical RL
The foundations emerged in the 1990s as researchers grappled with RL’s “curse of dimensionality”—as state spaces grow, flat RL approaches require exponentially more samples to learn effective policies.
Key developments:
Options Framework (Sutton et al., 1999): Introduced “options”—temporally extended actions that execute multi-step behaviors. An option has an initiation set (states where it can start), a policy (how to act), and a termination condition.
MAXQ (Dietterich, 2000): Proposed decomposing Markov Decision Processes (MDPs) into task hierarchies. Each subtask is solved as an independent MDP, with value functions combining hierarchically.
Feudal RL (Dayan & Hinton, 1993): Introduced a manager-worker structure where managers set goals and workers execute them, inspired by feudal social structures.
Why Hierarchy Matters in Multi-Agent Systems
In multi-agent contexts, hierarchy provides:
- Scalability: Agents can coordinate on high-level strategies while independently solving low-level execution details
- Modularity: Low-level skills learned for one task transfer to others (e.g., “navigate to location” works for both “gather resources” and “deliver payload”)
- Interpretability: Hierarchical structures are easier to understand and debug than monolithic policies
- Sample efficiency: Learning reusable sub-policies requires fewer samples than learning from scratch for each new task
Algorithms & Math
The Options Framework
An option is a tuple: o = ⟨I, π, β⟩
- I ⊆ S: Initiation set (states where option can start)
- π: S × A → [0,1]: Option policy (how to act while executing option)
- β: S → [0,1]: Termination condition (probability of stopping in each state)
The agent chooses among options using a high-level policy. Once an option is selected, its internal policy π chooses actions until the termination condition β fires (stochastically, with probability β(s) in each state).
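A minimal sketch of this structure in code (the Option container and run_option helper are illustrative, not a specific library's API; the environment is assumed to follow the Gymnasium step interface):

from dataclasses import dataclass
from typing import Any, Callable
import random

@dataclass
class Option:
    """An option o = <I, pi, beta> as a simple container."""
    can_start: Callable[[Any], bool]       # I: is the state in the initiation set?
    policy: Callable[[Any], int]           # pi: action to take while the option runs
    termination: Callable[[Any], float]    # beta: probability of stopping in a state

def run_option(env, state, option):
    """Execute an option's internal policy until its termination condition fires."""
    assert option.can_start(state), "option chosen outside its initiation set"
    total_reward = 0.0
    while True:
        state, reward, terminated, truncated, _ = env.step(option.policy(state))
        total_reward += reward
        if terminated or truncated or random.random() < option.termination(state):
            return state, total_reward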
MAXQ Value Function Decomposition
MAXQ decomposes the value function hierarchically. For a parent task with subtasks {M₁, M₂, …, Mₙ}, the value of invoking subtask Mᵢ in state s is:
Q(parent, s, Mᵢ) = V(Mᵢ, s) + C(parent, s, Mᵢ)
Where:
- V(Mᵢ, s): Expected reward for completing subtask Mᵢ starting from state s
- C(parent, s, Mᵢ): Completion function, the expected reward for finishing the parent task after Mᵢ completes
The decomposition continues recursively for subtasks. This enables:
- Modular learning: Subtasks are learned independently
- Transfer: Learned subtasks work across multiple parent tasks
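To make the recursion concrete, here is a minimal sketch of how the decomposed value could be evaluated; the task objects (with is_primitive, subtasks, expected_reward) and the learned completion table are illustrative assumptions, not part of a specific library:

def value(task, state):
    """V(M, s): expected reward for completing task M from state s."""
    if task.is_primitive:
        return task.expected_reward(state)  # base case: a primitive action's one-step reward
    return max(q_value(task, state, sub) for sub in task.subtasks)

def q_value(parent, state, subtask):
    """Q(parent, s, M) = V(M, s) + C(parent, s, M)."""
    # parent.completion is a learned table of completion values C, keyed by (state, subtask)
    return value(subtask, state) + parent.completion[(state, subtask)]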
Feudal Reinforcement Learning (FRL)
In Feudal RL, a manager agent sets goals g for worker agents. The manager is trained to maximize long-term environment reward, while workers are rewarded for achieving the goal g their manager currently assigns.
Manager policy: π_manager(s) → g
Worker policy: π_worker(s, g) → a
The system uses two levels of temporal abstraction:
- Manager: Selects goals at coarse time intervals (e.g., every 10 steps)
- Workers: Execute actions at fine time intervals to achieve current goal
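A minimal sketch of this two-timescale loop (manager_policy, worker_policy, and the 10-step goal interval are illustrative assumptions; the environment is assumed to follow the Gymnasium step API):

def hierarchical_rollout(env, manager_policy, worker_policy, horizon=200, goal_interval=10):
    """Manager refreshes the goal every `goal_interval` steps; the worker acts every step."""
    state, _ = env.reset()
    goal = manager_policy(state)
    for t in range(horizon):
        if t % goal_interval == 0:            # coarse timescale: pick a new goal
            goal = manager_policy(state)
        action = worker_policy(state, goal)   # fine timescale: goal-conditioned action
        state, reward, terminated, truncated, _ = env.step(action)
        if terminated or truncated:
            break
    return state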
Recent Algorithmic Advances: Feudal Networks for HRL (FuN)
Feudal Networks (Vezhnevets et al., 2017) implement feudal RL using deep neural networks:
- Manager network: Outputs goal embeddings g_t at coarse timescale
- Worker network: Conditions on goal g_t to produce actions a_t
- Intrinsic rewards: Workers receive intrinsic rewards based on progress toward g_t, independent of environment rewards
The manager learns which goals lead to environmental rewards, while workers learn to efficiently reach any goal.
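As a simplified sketch, the worker's intrinsic reward can be computed as the cosine similarity between the recent state change and the manager's goal direction (FuN averages this over a horizon c and works in a learned latent space; the one-step version below is an illustration):

import numpy as np

def worker_intrinsic_reward(prev_state, state, goal, eps=1e-8):
    """Reward the worker for moving in the direction the manager's goal points."""
    delta = state - prev_state                     # how the (latent) state actually changed
    cos = np.dot(delta, goal) / (np.linalg.norm(delta) * np.linalg.norm(goal) + eps)
    return float(cos)                              # in [-1, 1]; +1 means moving exactly toward g_t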
Design Patterns & Architectures
Pattern 1: Goal-Conditioned Policies
Structure agents with goal-conditioned value functions: Q(s, a, g) estimates value of taking action a in state s given goal g.
Benefits:
- Single policy handles multiple goals
- Enables Hindsight Experience Replay (HER): relabel failed trajectories with achieved goals to accelerate learning
Implementation: Add goal embedding as input to your policy network alongside state observation.
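A minimal PyTorch sketch of both ideas: the goal is concatenated to the state before it enters the Q-network, and HER relabels a failed transition with the goal that was actually achieved (the network sizes and the relabeling helper are illustrative assumptions, not a specific library's API):

import torch
import torch.nn as nn

class GoalConditionedQ(nn.Module):
    """Q(s, a, g): the goal vector is concatenated to the state observation."""
    def __init__(self, state_dim, goal_dim, num_actions, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + goal_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, num_actions),
        )

    def forward(self, state, goal):
        return self.net(torch.cat([state, goal], dim=-1))

def her_relabel(s, a, s_next, achieved_goal):
    """Hindsight relabeling: treat the goal that was actually achieved as the intended one."""
    return (s, a, 1.0, s_next, achieved_goal)  # reward = 1.0 because the relabeled goal was reached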
Pattern 2: Temporal Abstraction via Options
Define reusable skills as options. For example, in a warehouse robot system:
High-level options:
- “Fetch item X”
- “Deliver to location Y”
- “Recharge battery”
Low-level primitive actions:
- Move forward, turn, grasp, etc.
The agent learns when to invoke each option and how to execute each option’s internal policy.
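For example, a "Recharge battery" option could be expressed with the Option container sketched earlier, assuming a dict-style state and a hypothetical navigate_toward helper:

# Illustrative "Recharge battery" option (navigate_toward is a hypothetical helper)
recharge = Option(
    can_start=lambda s: s["battery"] < 0.3,                      # I: only when battery is low
    policy=lambda s: navigate_toward(s, s["charger_pos"]),       # pi: head toward the charger
    termination=lambda s: 1.0 if s["battery"] > 0.95 else 0.0,   # beta: stop once fully charged
)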
Pattern 3: Multi-Agent Hierarchical Coordination
For multi-agent systems, use hierarchical structures where:
- Central coordinator: Assigns high-level goals to agents
- Individual agents: Autonomously determine how to achieve assigned goals
- Inter-agent communication: Agents negotiate conflicts and share relevant state information
This pattern appears in team sports AI, swarm robotics, and multi-robot industrial systems.
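As a concrete sketch of the coordinator's job, a conflict-free goal assignment can be computed with the Hungarian algorithm, matching agents to goals so total travel distance is minimized (the assign_goals function is illustrative, not part of any HRL library):

import numpy as np
from scipy.optimize import linear_sum_assignment

def assign_goals(agent_positions, goal_positions):
    """Assign each agent a distinct goal, minimizing total travel distance."""
    # Cost matrix: Euclidean distance from every agent to every goal
    costs = np.linalg.norm(
        agent_positions[:, None, :] - goal_positions[None, :, :], axis=-1
    )
    agent_idx, goal_idx = linear_sum_assignment(costs)
    return dict(zip(agent_idx, goal_idx))  # {agent index: assigned goal index}

# Example: three robots, three pick-up locations on the grid
agents = np.array([[0.0, 0.0], [5.0, 5.0], [9.0, 0.0]])
goals = np.array([[8.0, 1.0], [1.0, 1.0], [4.0, 6.0]])
print(assign_goals(agents, goals))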
Practical Application: Hierarchical Multi-Agent Warehouse
Let’s build a simplified hierarchical RL system for warehouse robots using Python and stable-baselines3.
Code Example
# Note: stable-baselines3 >= 2.0 expects Gymnasium-style environments
import gymnasium as gym
from gymnasium import spaces
import numpy as np
from stable_baselines3 import PPO
from stable_baselines3.common.vec_env import DummyVecEnv
class WarehouseEnv(gym.Env):
    """Simplified warehouse environment for HRL demonstration."""

    def __init__(self, grid_size=10, num_items=3):
        super().__init__()
        self.grid_size = grid_size
        self.num_items = num_items
        # State: robot position (x, y), current goal item index
        self.observation_space = spaces.Box(
            low=0, high=grid_size, shape=(3,), dtype=np.float32
        )
        # Actions: move up/down/left/right, pick item
        self.action_space = spaces.Discrete(5)
        self.reset()

    def reset(self, seed=None, options=None):
        super().reset(seed=seed)
        self.robot_pos = np.array([0, 0])
        self.item_positions = np.random.randint(
            0, self.grid_size, size=(self.num_items, 2)
        )
        self.current_goal_idx = 0
        self.items_collected = 0
        # Gymnasium API: reset returns (observation, info)
        return self._get_obs(), {}

    def _get_obs(self):
        return np.array([
            self.robot_pos[0],
            self.robot_pos[1],
            self.current_goal_idx
        ], dtype=np.float32)

    def step(self, action):
        # Execute action
        if action < 4:  # Movement
            moves = [[-1, 0], [1, 0], [0, -1], [0, 1]]
            self.robot_pos = np.clip(
                self.robot_pos + moves[action], 0, self.grid_size - 1
            )

        # Reward structure: hierarchical intrinsic + extrinsic
        reward = 0.0
        done = False

        # Distance to the current goal item (drives the intrinsic reward)
        goal_pos = self.item_positions[self.current_goal_idx]
        distance = np.linalg.norm(self.robot_pos - goal_pos)

        # Extrinsic reward: collecting items
        if action == 4 and distance < 1.0:  # Pick action near item
            reward = 10.0
            self.items_collected += 1
            self.current_goal_idx = (self.current_goal_idx + 1) % self.num_items
            if self.items_collected >= self.num_items:
                done = True
                reward += 50.0  # Bonus for completing all items
        else:
            # Intrinsic reward: small penalty proportional to distance from the goal
            reward = -0.01 * distance

        # Gymnasium API: (observation, reward, terminated, truncated, info)
        return self._get_obs(), reward, done, False, {}

    def render(self):
        pass
class HierarchicalAgent:
    """Two-level hierarchical agent: Manager sets goals, Worker executes."""

    def __init__(self, env):
        self.env = env
        # Manager: sequences goals (which item to fetch next).
        # For simplicity, the goal sequence here is fixed by the environment.
        # Worker: learns a goal-conditioned policy to reach items.
        self.worker = PPO("MlpPolicy", env, verbose=1)

    def train_worker(self, total_timesteps=50000):
        """Train worker policy to achieve goals."""
        print("Training worker policy...")
        self.worker.learn(total_timesteps=total_timesteps)

    def evaluate(self, num_episodes=10):
        """Evaluate trained hierarchical agent."""
        print("\nEvaluating agent...")
        for episode in range(num_episodes):
            obs = self.env.reset()
            total_reward = 0.0
            done = False
            steps = 0
            while not done and steps < 200:
                action, _ = self.worker.predict(obs, deterministic=True)
                # VecEnv returns batched arrays; unpack the single environment
                obs, rewards, dones, _ = self.env.step(action)
                total_reward += float(rewards[0])
                done = bool(dones[0])
                steps += 1
            print(f"Episode {episode + 1}: Total Reward = {total_reward:.2f}, Steps = {steps}")
# Usage
if __name__ == "__main__":
    # Create environment
    env = DummyVecEnv([lambda: WarehouseEnv(grid_size=10, num_items=3)])

    # Create hierarchical agent
    agent = HierarchicalAgent(env)

    # Train worker policy
    agent.train_worker(total_timesteps=50000)

    # Evaluate
    agent.evaluate(num_episodes=5)
Explanation
- Environment: Simple warehouse grid where a robot must collect items sequentially
- Hierarchy:
- Manager (implicit): Goal sequence is item collection order
- Worker: Learns to navigate to current goal item and pick it
- Reward structure: Combines intrinsic (progress toward current goal) and extrinsic (collecting items) rewards
- Scalability: Because the worker learns to reach whichever goal is currently active, the same policy transfers across different goal sequences
Extending to Multi-Agent
For multiple robots, create separate worker policies for each agent, with a centralized manager assigning goals to minimize conflicts:
class MultiAgentWarehouse:
    """Sketch: a central manager assigns goals; each agent's worker executes them."""

    def __init__(self, env_fns, num_agents=3):
        # One hierarchical (worker) agent per robot
        self.agents = [HierarchicalAgent(env_fn()) for env_fn in env_fns]
        # CentralManager is a placeholder for any goal-assignment policy
        self.manager = CentralManager(num_agents)

    def step(self, state):
        # Manager assigns one goal per agent (e.g., to minimize conflicts)
        goals = self.manager.assign_goals(state)
        # Each agent executes its own policy toward its assigned goal
        # (execute() is a placeholder for running the worker toward `goal`)
        for agent, goal in zip(self.agents, goals):
            agent.execute(goal)
Comparisons & Tradeoffs
HRL vs Flat RL
Advantages of HRL:
- Sample efficiency: Reusable sub-policies reduce learning time
- Transfer: Skills generalize across tasks
- Interpretability: Hierarchy makes decision-making clearer
Disadvantages:
- Complexity: More moving parts to tune and debug
- Suboptimality: Hierarchical constraints may prevent globally optimal solutions
- Design effort: Requires thoughtful decomposition of tasks
HRL vs End-to-End Deep RL
Recent end-to-end deep RL (e.g., PPO, SAC) can learn complex behaviors without explicit hierarchy. However:
- Sample efficiency: HRL typically learns faster with fewer samples
- Generalization: Learned sub-policies often transfer to new tasks
- Scale: End-to-end RL often struggles in very large state/action spaces where hierarchical decomposition helps
When to use each:
- HRL: Complex, structured tasks with natural subtasks (robotics, strategy games, multi-agent systems)
- Flat RL: Simpler tasks, abundant simulation samples available, tasks without clear hierarchical structure
Latest Developments & Research
LLMs as High-Level Managers
A growing line of work (e.g., SayCan; Ahn et al., 2022) uses large language models as high-level managers in hierarchical agents: the LLM proposes natural language goals or subtask sequences, and low-level policies learn to achieve them.
Key insight: LLMs encode common-sense world knowledge, making them effective managers for setting meaningful goals. Low-level RL policies learn grounded execution.
Skill Discovery via Intrinsic Motivation
Recent work (DIAYN, VIC, APT) automatically discovers reusable skills without hand-designed hierarchies. These methods maximize mutual information between skills and states, learning diverse behaviors that can be composed hierarchically.
Application: Pre-train diverse skills through unsupervised, reward-free interaction, then reuse them as options for downstream tasks.
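As one concrete example, DIAYN's intrinsic reward is log q(z | s) − log p(z): the agent is rewarded when a learned discriminator can tell which skill z produced the current state (the discriminator network below is an illustrative stand-in):

import torch
import torch.nn.functional as F

def diayn_reward(discriminator, state, skill_id, num_skills):
    """Intrinsic reward r = log q(z|s) - log p(z), assuming a uniform prior over skills."""
    logits = discriminator(state)                         # discriminator predicts which skill produced s
    log_q = F.log_softmax(logits, dim=-1)[skill_id]       # log q(z|s)
    log_p = -torch.log(torch.tensor(float(num_skills)))   # log p(z) = log(1/K) under a uniform prior
    return (log_q - log_p).item()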
Multi-Agent HRL in StarCraft and Dota
DeepMind’s AlphaStar (StarCraft II) and OpenAI Five (Dota 2) both had to master decision-making at two levels of abstraction:
- Macro-level: Strategic decisions (economy management, army composition)
- Micro-level: Unit control (positioning, attack timing)
Neither system relies on an explicit options-style hierarchy, but this macro/micro structure is exactly the kind of problem hierarchical methods target, and it is a major motivation for multi-agent HRL research at this scale.
Cross-Disciplinary Insight: From Organizations to Agents
HRL mirrors how human organizations structure work. Companies use hierarchies:
- Executives: Set strategic goals
- Middle managers: Decompose into departmental objectives
- Workers: Execute specific tasks
Similarly, HRL agents use hierarchies for scalable decision-making. This connection suggests insights from organizational theory (delegation, incentive alignment, communication protocols) may inform multi-agent HRL design.
Conversely, HRL algorithms offer lessons for organizational design—how to structure objectives, when centralization vs decentralization works, and how to incentivize alignment across hierarchical levels.
Daily Challenge: Implement a 3-Level Hierarchy
Challenge: Extend the warehouse example to a 3-level hierarchy:
- Executive layer: Decides which warehouse zone to prioritize (Zone A, B, or C)
- Manager layer: Assigns items within the chosen zone to worker robots
- Worker layer: Navigates and collects assigned items
Tasks:
- Modify the environment to have 3 zones with items in each
- Implement an executive policy (can be rule-based initially) that selects zones
- Implement a manager that assigns items to workers within the selected zone
- Train the worker policy to collect items
- Evaluate: Does the 3-level hierarchy reduce training time compared to flat RL?
Bonus: Implement option discovery—can your system automatically learn reusable navigation skills without manual specification?
References & Further Reading
Foundational Papers
- Sutton, R. S., Precup, D., & Singh, S. (1999). “Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning.” Artificial Intelligence.
- Dietterich, T. G. (2000). “Hierarchical reinforcement learning with the MAXQ value function decomposition.” Journal of Artificial Intelligence Research.
- Dayan, P., & Hinton, G. E. (1993). “Feudal reinforcement learning.” NIPS.
Modern Approaches
- Vezhnevets, A. S., et al. (2017). “FeUdal Networks for Hierarchical Reinforcement Learning.” ICML. arXiv:1703.01161
- Nachum, O., et al. (2018). “Data-Efficient Hierarchical Reinforcement Learning.” NeurIPS. arXiv:1805.08296
Skill Discovery
- Eysenbach, B., et al. (2018). “Diversity is All You Need: Learning Skills without a Reward Function.” ICLR. arXiv:1802.06070
Multi-Agent HRL
- Vinyals, O., et al. (2019). “Grandmaster level in StarCraft II using multi-agent reinforcement learning.” Nature.
GitHub Repositories
- HRL Baselines: https://github.com/facebookresearch/hrl
- Feudal Networks Implementation: https://github.com/dmakian/feudal_networks
Books
- Sutton, R. S., & Barto, A. G. (2018). Reinforcement Learning: An Introduction (2nd ed.). MIT Press. (Freely available online.)
Closing Thought: Hierarchical RL represents a shift from learning single behaviors to learning structured systems of behaviors. As AI agents tackle increasingly complex real-world problems—from multi-robot warehouses to autonomous research assistants—hierarchical approaches will be essential for scalable, interpretable, and efficient agent programming.