Retrieval-Augmented Generation for AI Agents

Introduction: Simple to Technical

The Simple Version:

Imagine you’re taking an open-book exam instead of a closed-book one. Rather than trying to memorize everything, you can look up relevant information when you need it. Retrieval-Augmented Generation (RAG) gives AI agents this same capability—instead of relying purely on training data, they can fetch relevant information from external sources before generating responses.

The Technical Version:

RAG is an architectural pattern that combines neural retrieval systems with generative language models. When an agent receives a query, it first retrieves relevant documents or passages from an external knowledge base, then conditions its generation on both the query and the retrieved context. This augmentation approach addresses fundamental limitations of pure language models: knowledge cutoffs, hallucination, and inability to access private or updated information.

The architecture typically consists of three components: a retriever (often a dense vector encoder such as a BERT-based bi-encoder or OpenAI's text-embedding-ada-002), a knowledge base (a vector database storing document embeddings), and a generator (typically an LLM such as GPT-4 or Claude). The retriever finds relevant documents, which are then injected into the generator's prompt along with the original query.

Historical & Theoretical Context

RAG emerged from research on open-domain question answering, where systems needed to find answers across large corpora. Early systems like DrQA (Chen et al., 2017) combined TF-IDF retrieval with a neural reading-comprehension model. The modern RAG paradigm crystallized with Facebook AI's "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks" paper (Lewis et al., 2020).

The key theoretical insight is that generation and knowledge storage can be separated. Traditional language models compress knowledge into parameters during training—an expensive, static process. RAG externalizes knowledge into retrievable documents, making it dynamic, updateable, and auditable.

This connects to the classical AI distinction between parametric and non-parametric memory. Parametric memory (model weights) is fixed after training, while non-parametric memory (external knowledge bases) can grow and change. RAG effectively gives agents hybrid memory systems.

Core Algorithm & Architecture

The RAG Algorithm

1. INPUT: User query q

2. ENCODE:
   query_embedding = encode(q)  # Convert query to dense vector

3. RETRIEVE:
   documents = vector_db.search(query_embedding, top_k=5)
   # Find k most similar documents using cosine similarity

4. AUGMENT:
   context = concatenate(documents)
   augmented_prompt = f"Given this context:\n{context}\n\nQuestion: {q}\nAnswer:"

5. GENERATE:
   response = llm.generate(augmented_prompt)

6. OUTPUT: response

Key Architectural Variations

Naive RAG: Simple retrieve-then-generate. Retrieval happens once before generation.

Iterative RAG: Multiple retrieve-generate cycles. The agent can retrieve additional information based on partial generations.

Adaptive RAG: Agent decides when to retrieve. Not all queries require external knowledge.

Agentic RAG: RAG integrated into a full agent loop with tools, memory, and planning. Retrieval becomes one of many tools the agent can invoke.
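
The Adaptive and Agentic variants both hinge on a retrieval decision step. Below is a minimal, illustrative sketch of that routing logic; the llm and retriever objects are placeholders for whatever client and vector store you use, not a specific library API.

def answer_adaptively(query: str, llm, retriever) -> str:
    """Adaptive RAG: retrieve only when the query needs external knowledge."""
    decision = llm.generate(
        "Does answering this question require looking up external facts? "
        f"Reply YES or NO.\n\nQuestion: {query}"
    )
    if decision.strip().upper().startswith("YES"):
        # Knowledge-intensive query: ground the answer in retrieved documents
        docs = retriever.search(query, top_k=5)
        prompt = f"Context:\n{docs}\n\nQuestion: {query}\nAnswer:"
    else:
        # Conversational or pure-reasoning query: answer from parametric memory
        prompt = query
    return llm.generate(prompt)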

Mathematical Foundation

The generative probability in RAG is conditioned on both the query and retrieved documents:

P(answer | query) ≈ P(answer | query, retrieved_docs)

More formally, RAG marginalizes over possible retrieved documents:

P(y | x) = Σ_z P(y | x, z) P(z | x)

Where x is the input query, y is the generated output, and z is a retrieved document treated as a latent variable; in practice the sum runs over only the top-k retrieved documents rather than the entire corpus.

The retriever models the retrieval distribution using embedding similarity:

P(z | x) ∝ exp(sim(encode(x), encode(z)))

Where sim is typically cosine similarity between dense vectors.
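
As a concrete illustration (with made-up toy vectors standing in for encode(x) and encode(z)), the retrieval distribution can be computed as a softmax over cosine similarities:

import numpy as np

def cosine_sim(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

query_emb = np.array([0.2, 0.8, 0.1])              # encode(x)
doc_embs = np.array([
    [0.1, 0.9, 0.0],                                # encode(z_1)
    [0.7, 0.1, 0.6],                                # encode(z_2)
    [0.3, 0.6, 0.2],                                # encode(z_3)
])

sims = np.array([cosine_sim(query_emb, d) for d in doc_embs])
p_z_given_x = np.exp(sims) / np.exp(sims).sum()     # softmax over similarities
print(p_z_given_x)                                  # most mass on the closest document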

Design Patterns & Practical Implementation

Pattern 1: Basic RAG Pipeline

from typing import List
from openai import OpenAI
import chromadb

class BasicRAGAgent:
    def __init__(self, collection_name: str):
        # Initialize the OpenAI client and the vector database
        self.openai_client = OpenAI()
        self.chroma_client = chromadb.Client()
        self.collection = self.chroma_client.get_or_create_collection(
            name=collection_name
        )
        
    def add_documents(self, documents: List[str], ids: List[str]):
        """Index documents into vector database"""
        self.collection.add(
            documents=documents,
            ids=ids
        )
    
    def retrieve(self, query: str, top_k: int = 3) -> List[str]:
        """Retrieve most relevant documents"""
        results = self.collection.query(
            query_texts=[query],
            n_results=top_k
        )
        return results['documents'][0]
    
    def generate(self, query: str, context: List[str]) -> str:
        """Generate answer using retrieved context"""
        context_str = "\n\n".join(context)
        
        prompt = f"""Based on the following context, answer the question.
        
Context:
{context_str}

Question: {query}

Answer:"""
        
        response = self.openai_client.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content": prompt}]
        )
        
        return response.choices[0].message.content
    
    def query(self, question: str) -> str:
        """Main RAG pipeline"""
        # Retrieve relevant documents
        docs = self.retrieve(question)
        
        # Generate answer with context
        answer = self.generate(question, docs)
        
        return answer

# Usage
agent = BasicRAGAgent("knowledge_base")

# Index knowledge
agent.add_documents(
    documents=[
        "Python was created by Guido van Rossum in 1991.",
        "Python is dynamically typed and garbage-collected.",
        "Python emphasizes code readability with significant whitespace."
    ],
    ids=["doc1", "doc2", "doc3"]
)

# Query
result = agent.query("Who created Python?")
print(result)

Pattern 2: Iterative RAG with LangChain

from typing import List

from langchain.agents import Tool, AgentExecutor, create_react_agent
from langchain_core.prompts import PromptTemplate
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_community.vectorstores import Chroma

class IterativeRAGAgent:
    def __init__(self, docs: List[str]):
        # Create vector store
        self.vectorstore = Chroma.from_texts(
            docs,
            OpenAIEmbeddings()
        )
        
        # Create retrieval tool
        retrieval_tool = Tool(
            name="Knowledge Base",
            func=self._retrieve,
            description="Useful for looking up specific facts and information"
        )
        
        # Initialize agent
        llm = ChatOpenAI(temperature=0)
        tools = [retrieval_tool]
        
        # ReAct prompts must expose {tools}, {tool_names}, and {agent_scratchpad}
        prompt = PromptTemplate.from_template("""
        Answer the following question using the available tools.
        You can retrieve information multiple times if needed.

        You have access to the following tools:
        {tools}

        Use the following format:
        Thought: reason about what to do next
        Action: the tool to use, one of [{tool_names}]
        Action Input: the input to the tool
        Observation: the tool's result
        ... (Thought/Action/Action Input/Observation can repeat)
        Thought: I now know the final answer
        Final Answer: the answer to the original question

        Question: {input}
        {agent_scratchpad}
        """)
        
        agent = create_react_agent(llm, tools, prompt)
        self.executor = AgentExecutor(agent=agent, tools=tools, verbose=True)
    
    def _retrieve(self, query: str) -> str:
        """Retrieve documents for a query"""
        docs = self.vectorstore.similarity_search(query, k=2)
        return "\n".join([d.page_content for d in docs])
    
    def query(self, question: str) -> str:
        """Query with iterative retrieval"""
        result = self.executor.invoke({"input": question})
        return result['output']
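
A quick usage sketch (assumes OPENAI_API_KEY is set in the environment):

# Example usage
agent = IterativeRAGAgent(docs=[
    "The Eiffel Tower is located in Paris.",
    "Paris is the capital of France.",
])
print(agent.query("In which country is the Eiffel Tower?"))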

Comparisons & Tradeoffs

RAG vs Fine-Tuning

RAG Advantages:

• Knowledge can be updated by editing documents, with no retraining
• Answers can cite their sources, making outputs auditable
• Works with private or rapidly changing data
• Reduces hallucination on facts outside the training data

Fine-Tuning Advantages:

• Better at shaping style, format, and task-specific behavior
• No retrieval infrastructure or added latency at inference time
• Knowledge is available even when retrieval fails or returns nothing relevant

Hybrid Approach: Fine-tune for task behavior, use RAG for knowledge.

RAG vs Long Context Windows

Modern LLMs like Claude have 200K+ token contexts. Why use RAG?

RAG wins when:

• The corpus is far larger than any context window
• Knowledge changes frequently or must stay current
• You need citations back to specific source documents
• Cost and latency matter, since sending only relevant passages is cheaper than sending everything

Long context wins when:

• The entire document set fits comfortably in the context window
• The task requires reasoning over a whole document rather than isolated passages
• Retrieval quality is hard to guarantee and missing a passage is costly

Latest Research & Developments

Self-RAG (Asai et al., 2023)

Self-RAG trains models to decide when to retrieve and to critique their own outputs. The model generates special reflection tokens indicating:

• whether retrieval is needed for the next segment
• whether each retrieved passage is relevant to the query
• whether the generated output is supported by the retrieved evidence
• how useful the overall response is

This adaptive approach reduces unnecessary retrieval and improves answer quality.

FLARE: Forward-Looking Active Retrieval (Jiang et al., 2023)

FLARE retrieves information based on what the model is about to generate, not just the initial query. It generates sentence-by-sentence, checking confidence, and retrieves when uncertainty is high. This prevents the issue where initial retrieval misses information needed later in generation.
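
A pseudo-code sketch of the idea, in the same spirit as the FLARE paper; the helper functions here are hypothetical placeholders, not the authors' implementation:

# Pseudo-code (generate_next_sentence, token_confidence, retrieve are placeholders)
def flare_generate(query, confidence_threshold=0.8, max_sentences=10):
    context = retrieve(query)                  # initial retrieval
    answer = ""
    for _ in range(max_sentences):
        draft = generate_next_sentence(query, context, answer)
        if draft is None:                      # model signalled completion
            break
        if token_confidence(draft) < confidence_threshold:
            # Low confidence: use the draft as a forward-looking query,
            # fetch fresh evidence, and regenerate the sentence
            context += retrieve(draft)
            draft = generate_next_sentence(query, context, answer)
        answer += draft
    return answer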

RAG-Fusion (Multiple Query Variants, 2024)

RAG-Fusion generates multiple query variants from the original question, retrieves for each variant, then combines and reranks results. This overcomes limitations of single query formulations.

# Pseudo-code
original_query = "What are transformers in ML?"
variants = [
    "Explain transformer architecture",
    "How do transformers work in machine learning?",
    "What is the transformer model in deep learning?"
]

all_rankings = []
for variant in variants:
    ranked_docs = retrieve(variant)   # ranked list of documents for this variant
    all_rankings.append(ranked_docs)

# Reciprocal Rank Fusion combines the per-variant rankings into one list
reranked_docs = reciprocal_rank_fusion(all_rankings)
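
A minimal implementation of reciprocal_rank_fusion consistent with the sketch above (k=60 is the conventional smoothing constant):

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists: score(d) = sum over lists of 1 / (k + rank of d)."""
    scores = {}
    for ranked in rankings:
        for rank, doc in enumerate(ranked, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    # Highest fused score first
    return sorted(scores, key=scores.get, reverse=True)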

Agentic RAG Systems (2025)

Current research integrates RAG into full agent frameworks where retrieval is one tool among many. Systems like LangGraph enable agents to dynamically decide when to retrieve, what to retrieve, and how to combine retrieved information with other tool calls.

The trend is toward agents that can:

• decide at each step whether retrieval is needed at all
• rewrite or decompose the query before retrieving
• route queries to the most appropriate knowledge source
• combine retrieved evidence with the outputs of other tools, citing sources along the way

Cross-Disciplinary Insights

Information Retrieval (IR)

RAG inherits decades of IR research. Concepts like TF-IDF, BM25, and inverted indices inform modern retrieval systems. The shift to dense retrieval (neural embeddings) parallels the shift from symbolic to neural AI.
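
For contrast with dense retrieval, here is a small sparse-retrieval example using BM25 (via the rank_bm25 package, one of several options):

from rank_bm25 import BM25Okapi

corpus = [
    "Python was created by Guido van Rossum.",
    "BM25 is a classic ranking function from information retrieval.",
    "Dense retrieval encodes text into embedding vectors.",
]
tokenized_corpus = [doc.lower().split() for doc in corpus]
bm25 = BM25Okapi(tokenized_corpus)

query = "who created python"
print(bm25.get_top_n(query.split(), corpus, n=2))   # lexical ranking, no embeddings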

Cognitive Science

RAG mirrors human memory systems. Humans don’t store all knowledge in immediate access (working memory) but retrieve from long-term memory as needed. RAG’s separation of parametric (model) and non-parametric (retrieved) knowledge resembles this architecture.

Database Systems

Vector databases for RAG borrow from traditional database indexing, approximate nearest neighbor search, and caching strategies. Understanding database query optimization helps optimize RAG retrieval.
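
For example, approximate nearest neighbor indexes trade a little recall for large speedups. A small sketch with FAISS's HNSW index (one common choice; the random vectors stand in for real embeddings):

import numpy as np
import faiss

dim = 64
doc_vectors = np.random.rand(10_000, dim).astype("float32")   # stand-in embeddings

index = faiss.IndexHNSWFlat(dim, 32)      # HNSW graph with 32 links per node
index.add(doc_vectors)

query_vector = np.random.rand(1, dim).astype("float32")
distances, ids = index.search(query_vector, 5)                # approximate top-5
print(ids)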

Daily Challenge

Build a Multi-Source RAG Agent

Create an agent that can answer questions by retrieving from multiple knowledge sources:

  1. Implement a RAG agent with two vector stores:

    • One for technical documentation
    • One for recent news articles
  2. The agent should:

    • Decide which knowledge base(s) to query
    • Retrieve from multiple sources when needed
    • Cite which source each fact came from
  3. Test with questions like:

    • “What’s the syntax for Python decorators?” (documentation)
    • “What’s the latest news on AI regulation?” (news)
    • “How do recent AI developments affect Python library design?” (both)
  4. Bonus: Add a reranking step that orders retrieved documents by relevance before feeding to the LLM.

Time estimate: 20-30 minutes

Hint: Use source metadata in your vector store to track which collection each document came from, and include this in the context you provide to the LLM.
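
For instance, with a Chroma collection like the one in Pattern 1 (the document, collection, and source names below are placeholders):

collection.add(
    documents=["PEP 318 added decorator syntax for functions."],
    metadatas=[{"source": "technical_docs"}],
    ids=["docs-001"],
)

results = collection.query(query_texts=["Python decorators"], n_results=3)
for doc, meta in zip(results["documents"][0], results["metadatas"][0]):
    print(f"[{meta['source']}] {doc}")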

References & Further Reading

Foundational Papers:

• Chen, D., Fisch, A., Weston, J., & Bordes, A. (2017). Reading Wikipedia to Answer Open-Domain Questions (DrQA). ACL 2017.
• Lewis, P., et al. (2020). Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. NeurIPS 2020.

Recent Advances:

• Asai, A., et al. (2023). Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection.
• Jiang, Z., et al. (2023). Active Retrieval Augmented Generation (FLARE). EMNLP 2023.

Practical Resources:

• LangChain documentation (agents, retrievers, and vector stores, as used in Pattern 2)
• Chroma documentation (vector database used in the examples above)
