Reinforcement Learning from Human Feedback: How AI Agents Learn What We Really Want
Reinforcement Learning from Human Feedback (RLHF)
Concept Introduction
Simple Terms: RLHF is a training method that teaches AI agents to do what humans actually want, not just what we can write in code. Instead of programming every rule, we show the AI examples of good and bad behavior, and it learns to optimize for human preferences.
Technical Detail: RLHF is a machine learning technique that combines reinforcement learning with human preference data to train AI models. The process involves three stages: (1) supervised fine-tuning on demonstration data, (2) training a reward model from human comparisons, and (3) using reinforcement learning (typically PPO - Proximal Policy Optimization) to optimize the model’s policy to maximize the learned reward function.
Historical & Theoretical Context
RLHF originated from the alignment problem in AI research: how do we ensure AI systems do what we intend, especially when our intentions are complex and hard to specify formally?
Early Work (2017-2019): OpenAI and DeepMind pioneered RLHF for training agents in Atari games and robotic tasks. Rather than specifying reward functions manually (which often led to reward hacking), they learned rewards from human preferences between pairs of behaviors.
Breakthrough (2020-2022): OpenAI applied RLHF to language models, resulting in InstructGPT and ChatGPT. This transformed LLMs from next-token predictors into helpful assistants that follow instructions and align with human values.
Theoretical Foundation: RLHF is grounded in inverse reinforcement learning (IRL), the problem of inferring reward functions from observed behavior, and in preference learning, which draws on economics and decision theory. The key insight is that humans are better at comparing outcomes than at specifying absolute reward values.
The RLHF Algorithm
Stage 1: Supervised Fine-Tuning (SFT)
Start with a pre-trained model and fine-tune it on high-quality demonstrations.
Input: Pre-trained model M, demonstration dataset D = {(prompt, response)}
Output: Fine-tuned model M_SFT
For each (prompt, response) in D:
    Loss = CrossEntropy(M(prompt), response)
    Update M to minimize Loss
This gives the model a strong baseline for the types of outputs we want.
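A minimal sketch of this step with Hugging Face Transformers is shown below. The model name and the single demonstration pair are placeholders, and in practice the loss is usually masked on the prompt tokens.

# SFT sketch: causal-LM cross-entropy on (prompt, response) pairs (placeholder model and data)
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

demos = [("Explain RLHF in one sentence.", "RLHF trains models to match human preferences.")]

for prompt, response in demos:
    # Labels equal the input ids, so the model is trained to reproduce the demonstration
    batch = tokenizer(prompt + " " + response, return_tensors="pt")
    loss = model(**batch, labels=batch["input_ids"]).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()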
Stage 2: Reward Model Training
Collect human comparison data and train a model to predict human preferences.
Input: Model M_SFT, comparison dataset C = {(prompt, response_a, response_b, preference)}
Output: Reward model R
For each (prompt, r_a, r_b, pref) in C:
    score_a = R(prompt, r_a)
    score_b = R(prompt, r_b)
    # pref marks the winner: winner = a if a was preferred, otherwise b
    Loss = -log(sigmoid(score_winner - score_loser))
    Update R to minimize Loss
The reward model learns to assign scores that correlate with human preferences.
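In PyTorch this pairwise (Bradley-Terry) objective is only a few lines; the sketch below assumes a reward_model callable that returns a scalar score for a (prompt, response) pair.

# Pairwise reward-model loss sketch (assumes reward_model returns a scalar tensor)
import torch.nn.functional as F

def reward_model_loss(reward_model, prompt, chosen, rejected):
    score_chosen = reward_model(prompt, chosen)
    score_rejected = reward_model(prompt, rejected)
    # Maximize the log-probability that the chosen response outranks the rejected one
    return -F.logsigmoid(score_chosen - score_rejected)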
Stage 3: Policy Optimization with PPO
Use the reward model to fine-tune the model via reinforcement learning.
Input: Model M_SFT, reward model R
Output: Optimized model M_RLHF

Initialize the policy M_RLHF from M_SFT and keep a frozen reference copy M_ref = M_SFT
For each prompt p:
    response = M_RLHF(p)
    reward = R(p, response)
    KL_penalty = KL_divergence(M_RLHF(p), M_ref(p))
    total_reward = reward - β * KL_penalty
    Update M_RLHF to maximize total_reward using PPO

The KL penalty keeps the policy from drifting too far from the reference model, avoiding reward over-optimization and mode collapse.
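The per-sample computation can be sketched as follows, assuming you already have the reward model score and the summed log-probabilities of the sampled response under the policy and the frozen reference.

# KL-penalized reward sketch (reward_score and log-prob sums are assumed inputs)
def penalized_reward(reward_score, policy_logprob_sum, ref_logprob_sum, beta=0.1):
    # Simple KL estimate for the sampled response: difference of summed log-probs
    kl_estimate = policy_logprob_sum - ref_logprob_sum
    return reward_score - beta * kl_estimate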
Design Patterns & Architecture
The Three-Model Pattern
RLHF typically involves three models:
- Policy Model (the agent being trained)
- Reward Model (learned from human feedback)
- Reference Model (the original model, used for KL penalty)
This architecture separates concerns: reward learning from policy optimization.
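A sketch of that setup with Transformers and TRL's create_reference_model helper is shown below; the reward-model checkpoint name is a placeholder.

# Three-model setup sketch (checkpoint names are placeholders)
from transformers import AutoModelForSequenceClassification
from trl import AutoModelForCausalLMWithValueHead, create_reference_model

policy = AutoModelForCausalLMWithValueHead.from_pretrained("gpt2")  # policy being trained
reference = create_reference_model(policy)                          # frozen copy for the KL penalty
reward_model = AutoModelForSequenceClassification.from_pretrained(  # scalar-score reward model
    "my-org/reward-model", num_labels=1
)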
Human-in-the-Loop Training
RLHF exemplifies the human-in-the-loop pattern where humans provide ongoing feedback rather than one-time labels. This is essential for aligning with complex, context-dependent human values.
Iterative Refinement
RLHF is rarely one-shot. Teams iterate:
- Deploy model → Collect user interactions → Label preferences → Retrain reward model → Update policy → Repeat
This creates a continuous improvement loop common in modern AI agent systems.
Practical Application: Simple RLHF with TRL
Here’s a minimal RLHF implementation using the TRL (Transformer Reinforcement Learning) library:
from transformers import AutoTokenizer
from trl import PPOTrainer, PPOConfig, AutoModelForCausalLMWithValueHead
import torch

# 1. Load pre-trained model (the value head serves as PPO's critic)
model_name = "gpt2"
model = AutoModelForCausalLMWithValueHead.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

# 2. Define reward function (in practice, this would be a trained reward model)
def simple_reward_fn(responses):
    """Reward longer, coherent responses (toy example)"""
    rewards = []
    for response in responses:
        # Simple heuristic: reward based on length and absence of repetition
        length_score = min(len(response.split()) / 20, 1.0)
        unique_words = len(set(response.lower().split()))
        total_words = len(response.split())
        diversity_score = unique_words / max(total_words, 1)
        reward = (length_score + diversity_score) / 2
        rewards.append(torch.tensor(reward))
    return rewards

# 3. Configure PPO (batch_size must match the number of samples passed to step())
config = PPOConfig(
    model_name=model_name,
    learning_rate=1e-5,
    batch_size=1,
    mini_batch_size=1,
)

# 4. Initialize trainer (TRL creates a frozen reference copy of the model internally)
ppo_trainer = PPOTrainer(
    config=config,
    model=model,
    tokenizer=tokenizer,
)

# 5. Training loop
prompts = [
    "Explain quantum computing to a beginner:",
    "What is the meaning of life?",
    "How do neural networks work?",
]

for epoch in range(3):
    for prompt in prompts:
        # Generate a response and keep only the newly generated tokens
        inputs = tokenizer(prompt, return_tensors="pt")
        query_ids = inputs["input_ids"][0]
        outputs = model.generate(
            **inputs, max_new_tokens=50, do_sample=True,
            pad_token_id=tokenizer.eos_token_id,
        )
        response_ids = outputs[0][len(query_ids):]
        response = tokenizer.decode(response_ids, skip_special_tokens=True)

        # Get reward from the (toy) reward function
        rewards = simple_reward_fn([response])

        # PPO update (simplified - real runs use larger batches)
        stats = ppo_trainer.step([query_ids], [response_ids], rewards)
        print(f"Epoch {epoch}, Prompt: {prompt[:30]}..., Reward: {rewards[0].item():.3f}")

print("RLHF training complete!")
Note: This is a simplified example. Production RLHF involves:
- Training a proper reward model on thousands of human comparisons
- More sophisticated PPO implementation with advantage estimation
- Careful hyperparameter tuning (KL coefficient, learning rate schedules)
- Distributed training infrastructure
Comparisons & Tradeoffs
RLHF vs. Supervised Fine-Tuning
Supervised Fine-Tuning: Direct, simple, requires high-quality demonstrations. Struggles with subjective tasks where there’s no single correct answer.
RLHF: Handles subjective preferences, learns from comparisons (easier for humans than absolute ratings), but more complex to implement and requires more compute.
RLHF vs. Constitutional AI
Constitutional AI (Anthropic’s approach) uses AI-generated feedback, judged against a set of written principles, rather than human feedback for each decision. It scales more easily but may capture nuanced human values less faithfully.
RLHF vs. Imitation Learning
Imitation learning copies expert behavior directly. RLHF goes beyond imitation by optimizing for outcomes humans prefer, potentially surpassing the demonstrators.
Limitations
- Reward Model Quality: RLHF is only as good as the reward model. If human labelers disagree or have biases, those propagate.
- Computational Cost: Training reward models and running PPO is expensive (10-100x more compute than SFT alone).
- Reward Hacking: Models can exploit flaws in the reward model (e.g., generating verbose but uninformative responses if verbosity is rewarded).
- Scalability: Requires ongoing human feedback, which is expensive and slow.
Latest Developments & Research
RLAIF: Reinforcement Learning from AI Feedback (2023-2024)
Recent work shows that AI models can generate preference labels instead of humans, dramatically reducing costs. Anthropic’s Constitutional AI and Google’s RLAIF study (Lee et al., 2023) demonstrate this approach.
Key Paper: “RLAIF: Scaling Reinforcement Learning from Human Feedback with AI Feedback” (2023)
Direct Preference Optimization (DPO, 2023)
Researchers at Stanford developed DPO, which skips the reward model and PPO entirely, directly optimizing the policy from preference data. This simplifies the pipeline and reduces compute.
Key Innovation: DPO reparameterizes the RL objective to enable direct supervised learning from preferences, making RLHF more accessible.
# DPO loss (simplified); inputs are the summed log-probs of each full response
import torch

def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    # Log-ratios of the trained policy against the frozen reference model
    log_ratio_chosen = policy_logp_chosen - ref_logp_chosen
    log_ratio_rejected = policy_logp_rejected - ref_logp_rejected
    # Widen the margin between preferred and dispreferred responses
    loss = -torch.nn.functional.logsigmoid(beta * (log_ratio_chosen - log_ratio_rejected))
    return loss.mean()
Process Supervision (2024)
Rather than rating final outcomes, recent work provides feedback on intermediate reasoning steps. This helps models develop better problem-solving processes, not just lucky guesses.
Application: OpenAI’s work on training models to solve math problems shows that process supervision outperforms outcome supervision for complex reasoning.
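As a purely illustrative sketch (not any lab's actual implementation), the difference comes down to where the feedback signal attaches:

# Hypothetical contrast: outcome supervision scores only the final answer,
# process supervision scores every reasoning step
def outcome_reward(final_answer, correct_answer):
    return 1.0 if final_answer == correct_answer else 0.0

def process_reward(step_labels):
    # step_labels: per-step ratings of the reasoning chain (1 = valid step, 0 = flawed step)
    return sum(step_labels) / max(len(step_labels), 1)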
Open Problems
- Sample Efficiency: RLHF requires enormous amounts of data. Can we get similar alignment with 100x less feedback?
- Multimodal RLHF: Extending RLHF to images, video, and robotics (not just text).
- Robustness: Models that remain aligned even when users try to jailbreak them.
- Value Pluralism: How do we align AI with diverse human values that sometimes conflict?
Cross-Disciplinary Insight: RLHF and Behavioral Economics
RLHF draws heavily from behavioral economics and decision theory. The key insight, that humans are better at expressing relative preferences than absolute valuations, mirrors findings in economics about revealed preferences.
Daniel Kahneman’s Work: Prospect theory shows that human judgments are reference-dependent; people are far more consistent when making relative comparisons than when giving absolute ratings. RLHF exploits this by asking “which is better?” rather than “rate this 1-10.”
Social Choice Theory: RLHF faces similar challenges to voting systems: how do you aggregate diverse preferences into a single coherent policy? Arrow’s impossibility theorem suggests perfect aggregation is impossible, which explains why RLHF models sometimes make strange choices—they’re navigating inherently contradictory human preferences.
Implications: Engineers building RLHF systems should study decision theory to understand the fundamental limits of preference aggregation and design more robust feedback collection strategies.
Daily Challenge: Implement Preference Collection
Challenge: Build a simple web app that collects human preferences between AI-generated responses.
Requirements:
- Display a prompt and two AI-generated responses (use any LLM API)
- Let users click which response they prefer
- Store preferences in a JSON file
- After collecting 20+ preferences, analyze: Are there patterns in what humans prefer?
Extension: Train a tiny reward model (e.g., a 2-layer neural network) to predict human preferences from response features (length, readability score, sentiment).
Time: 30-45 minutes
Learning Goal: Experience the human labeling bottleneck and understand why reward model quality is crucial.
References & Further Reading
Foundational Papers
- “Deep Reinforcement Learning from Human Preferences” (Christiano et al., 2017) - Original RLHF work on Atari and simulated robotics [arXiv:1706.03741]
- “Training language models to follow instructions with human feedback” (Ouyang et al., 2022) - InstructGPT paper [arXiv:2203.02155]
- “Direct Preference Optimization” (Rafailov et al., 2023) - DPO paper [arXiv:2305.18290]
Practical Resources
- TRL Library: https://github.com/huggingface/trl - Transformers + Reinforcement Learning
- OpenAI Spinning Up in Deep RL: https://spinningup.openai.com/en/latest/algorithms/ppo.html - PPO tutorial
- Anthropic’s RLHF Explanation: https://www.anthropic.com/index/core-views-on-alignment-research
Recent Research
- “Constitutional AI: Harmlessness from AI Feedback” (Bai et al., 2022) - AI-generated feedback approach
- “Let’s Verify Step by Step” (Lightman et al., 2023) - Process supervision for reasoning
- “RLAIF: Scaling Reinforcement Learning from Human Feedback with AI Feedback” (Lee et al., 2023) - Reducing human labeling
Blogs & Tutorials
- Hugging Face RLHF Tutorial: Comprehensive guide with code examples
- Chip Huyen’s RLHF Blog: Practical considerations for production RLHF
- Sebastian Raschka’s AI Papers: Summaries of latest RLHF research