Vector Databases and Semantic Search for AI Agents

In the evolution of AI agents, memory isn’t just about storing information - it’s about retrieving the right information at the right time. Vector databases and semantic search form the foundation of agent memory systems, enabling agents to find relevant context even when the exact keywords don’t match. This article explores how vector representations transform agent capabilities.

Concept Introduction

Simple Explanation

Imagine organizing a library where instead of alphabetical order or categories, books are arranged by meaning. Books about similar topics sit near each other, even if they use completely different words. When someone asks a question, you instantly know which shelf holds the most relevant books because you understand what they mean, not just what words they used.

Vector databases work exactly like this for AI agents. They convert text, images, or other data into mathematical representations (vectors) that capture meaning. Similar meanings produce similar vectors, allowing agents to find relevant information through semantic similarity rather than keyword matching.

Technical Detail

A vector database stores embeddings - high-dimensional numerical representations (commonly 384 to 1536 dimensions) where semantic similarity corresponds to geometric proximity. These embeddings are produced by neural networks trained to map similar meanings to nearby points in vector space.
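To make "geometric proximity" concrete, here is a minimal cosine-similarity sketch in NumPy; the 4-dimensional vectors are toy values, not real model output:

import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity: near 1.0 means same direction, near 0.0 means unrelated."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy "embeddings" (real ones have hundreds of dimensions)
cat_vec    = np.array([0.8, 0.1, 0.3, 0.0])
kitten_vec = np.array([0.7, 0.2, 0.4, 0.1])
car_vec    = np.array([0.0, 0.9, 0.1, 0.8])

print(cosine(cat_vec, kitten_vec))  # high: semantically close
print(cosine(cat_vec, car_vec))     # low: semantically distant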

When an agent needs to retrieve information:

  1. Encode the query: Convert the user’s question into a vector using the same embedding model
  2. Similarity search: Find vectors in the database closest to the query vector (typically using cosine similarity or Euclidean distance)
  3. Retrieve and rank: Return the top-k most similar items, often with metadata filtering
  4. Contextualize: Pass retrieved information to the LLM as context for generating responses

This enables semantic search: finding “machine learning algorithms” when the user asks about “AI training methods,” even though the exact words differ.
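A minimal sketch of those four steps end to end; embed() and llm() are caller-supplied placeholders for whatever embedding model and LLM client you use, not a specific API:

import numpy as np

def retrieve_and_answer(query, store, embed, llm, top_k=3):
    """store: list of (text, vector) pairs; embed/llm: caller-supplied callables."""
    q = embed(query)                                          # 1. encode the query
    scored = [(float(np.dot(q, v) / (np.linalg.norm(q) * np.linalg.norm(v))), text)
              for text, v in store]                           # 2. cosine similarity search
    top = sorted(scored, reverse=True)[:top_k]                # 3. retrieve and rank top-k
    context = "\n".join(text for _, text in top)
    return llm(f"Context:\n{context}\n\nQuestion: {query}")   # 4. contextualize for the LLM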

Historical & Theoretical Context

Origins

Vector representations of words emerged in the 1990s with approaches like Latent Semantic Analysis (LSA). The breakthrough came with Word2Vec (Mikolov et al., 2013), which showed that simple neural networks could learn embeddings where vector arithmetic reflected semantic relationships: king - man + woman ≈ queen.

Sentence and document embeddings followed with models like Doc2Vec, Universal Sentence Encoder, and eventually transformer-based embeddings from BERT (2018) and its successors. These captured not just word meaning but compositional semantics - how words combine to form meaning.

Vector databases as specialized infrastructure emerged around 2020 as organizations needed to search millions or billions of embeddings efficiently. Traditional databases couldn’t handle high-dimensional similarity search at scale.

Theoretical Foundation

Vector space models (VSM) from information retrieval theory provide the mathematical foundation. The key insight: represent documents and queries as vectors in a shared space where similarity in meaning corresponds to proximity.

This connects to cognitive science theories of semantic memory, where human memory is thought to organize concepts in associative networks based on similarity and relatedness rather than symbolic categories.

Design Patterns & Architectures

The Retrieval-Augmented Generation (RAG) Pattern

The dominant architecture for agent memory:

User Query → Embedding Model → Vector Search → Top-k Results → LLM Context → Generated Response

Agents using RAG don't store all knowledge in model parameters. Instead, they retrieve relevant information from vector databases at query time. This keeps the agent's knowledge current without retraining, grounds responses in retrievable sources, and makes it possible to cite where an answer came from, as sketched below.
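A hedged sketch of how retrieved chunks are typically folded into the prompt; the chunk shape ({'text': ..., 'source': ...}) and the template wording are illustrative assumptions, not a fixed format:

def build_rag_prompt(question: str, chunks: list) -> str:
    """chunks: retrieved items assumed to look like {'text': ..., 'source': ...}."""
    context = "\n\n".join(
        f"[{i + 1}] {c['text']} (source: {c.get('source', 'unknown')})"
        for i, c in enumerate(chunks)
    )
    return (
        "Answer using only the context below. Cite sources by number.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )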

Hybrid Search Pattern

Combines semantic search with traditional keyword search:

Query → [Semantic Search + Keyword Search] → Rank Fusion → Top Results

This addresses limitations of pure semantic search (which can miss specific entities or technical terms) and pure keyword search (which misses paraphrases and synonyms).
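One common way to implement the rank-fusion step is Reciprocal Rank Fusion (RRF); the sketch below assumes each retriever has already produced a ranked list of document IDs:

def reciprocal_rank_fusion(semantic_ids, keyword_ids, k=60, top_n=10):
    """Fuse two ranked lists of document IDs; k=60 is the constant commonly used with RRF."""
    scores = {}
    for ranked in (semantic_ids, keyword_ids):
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

# A document ranked well by both retrievers rises to the top
print(reciprocal_rank_fusion(["a", "b", "c"], ["c", "a", "d"]))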

Hierarchical Memory Pattern

Organizes agent memory across multiple vector stores with different granularities - for example, recent conversation turns, summarized session history, and a long-term knowledge base.

Agents query these in sequence, starting with recent context and expanding to broader knowledge as needed.
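A minimal sketch of that sequence, assuming hypothetical tiers that each expose a search(query, top_k) method returning (score, text, metadata) tuples sorted by score, like the VectorMemory class shown below:

def hierarchical_recall(query, tiers, top_k=3, good_enough=0.8):
    """tiers: memory stores ordered narrowest first (recent turns, summaries, knowledge base)."""
    results = []
    for tier in tiers:
        hits = tier.search(query, top_k=top_k)
        results.extend(hits)
        if hits and hits[0][0] >= good_enough:   # strong match in a narrow tier: stop expanding
            break
    return sorted(results, key=lambda r: r[0], reverse=True)[:top_k]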

Practical Application

Building a Simple Vector Memory System

import openai
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

class VectorMemory:
    def __init__(self, embedding_model="text-embedding-3-small"):
        self.embedding_model = embedding_model
        self.memories = []  # List of (text, vector, metadata)
        
    def embed(self, text: str) -> np.ndarray:
        """Convert text to embedding vector."""
        response = openai.embeddings.create(
            model=self.embedding_model,
            input=text
        )
        return np.array(response.data[0].embedding)
    
    def store(self, text: str, metadata: dict = None):
        """Store a new memory."""
        vector = self.embed(text)
        self.memories.append({
            'text': text,
            'vector': vector,
            'metadata': metadata or {}
        })
    
    def search(self, query: str, top_k: int = 3) -> list:
        """Find most similar memories to query."""
        query_vector = self.embed(query)
        
        # Calculate similarities
        similarities = []
        for memory in self.memories:
            similarity = cosine_similarity(
                query_vector.reshape(1, -1),
                memory['vector'].reshape(1, -1)
            )[0][0]
            similarities.append((similarity, memory))
        
        # Return top-k most similar
        similarities.sort(reverse=True, key=lambda x: x[0])
        return [(score, mem['text'], mem['metadata']) 
                for score, mem in similarities[:top_k]]


# Usage example
memory = VectorMemory()

# Store information
memory.store(
    "The user prefers dark mode and sans-serif fonts",
    metadata={'category': 'preferences', 'date': '2025-11-14'}
)
memory.store(
    "Machine learning models require large training datasets",
    metadata={'category': 'knowledge', 'topic': 'ML'}
)
memory.store(
    "The user's last project was a web app built with React",
    metadata={'category': 'history', 'date': '2025-11-13'}
)

# Search for relevant information
query = "What does the user like for UI design?"
results = memory.search(query, top_k=2)

for score, text, metadata in results:
    print(f"Score: {score:.3f} | {text}")
    print(f"Metadata: {metadata}\n")

Output (example; exact scores vary by embedding model):

Score: 0.782 | The user prefers dark mode and sans-serif fonts
Metadata: {'category': 'preferences', 'date': '2025-11-14'}

Score: 0.431 | The user's last project was a web app built with React
Metadata: {'category': 'history', 'date': '2025-11-13'}

Using a Production Vector Database (Pinecone Example)

import hashlib

import openai
from pinecone import Pinecone, ServerlessSpec

# Initialize Pinecone
pc = Pinecone(api_key="your-api-key")

# Create or connect to index
index_name = "agent-memory"
if index_name not in pc.list_indexes().names():
    pc.create_index(
        name=index_name,
        dimension=1536,  # text-embedding-3-small dimension
        metric="cosine",
        spec=ServerlessSpec(cloud="aws", region="us-east-1")
    )

index = pc.Index(index_name)

def store_memory(text: str, metadata: dict):
    """Store memory in Pinecone."""
    # Get embedding
    response = openai.embeddings.create(
        model="text-embedding-3-small",
        input=text
    )
    vector = response.data[0].embedding
    
    # Store in Pinecone (hashlib gives a stable ID so the same text maps to
    # the same record across runs, unlike Python's per-process salted hash())
    index.upsert(vectors=[{
        'id': "mem_" + hashlib.sha256(text.encode()).hexdigest()[:16],
        'values': vector,
        'metadata': {'text': text, **metadata}
    }])

def search_memory(query: str, top_k: int = 3, filter_dict: dict = None):
    """Search memories with optional metadata filtering."""
    # Get query embedding
    response = openai.embeddings.create(
        model="text-embedding-3-small",
        input=query
    )
    query_vector = response.data[0].embedding
    
    # Search Pinecone
    results = index.query(
        vector=query_vector,
        top_k=top_k,
        include_metadata=True,
        filter=filter_dict
    )
    
    return [(match.score, match.metadata['text'], match.metadata) 
            for match in results.matches]

# Usage
store_memory(
    "User asked about quantum computing on 2025-11-14",
    metadata={'category': 'conversation', 'date': '2025-11-14', 'topic': 'quantum'}
)

# Search with filter
results = search_memory(
    "What did we discuss about physics?",
    filter_dict={'category': 'conversation'}
)

Comparisons & Tradeoffs

Vector Search Strengths:

  - Finds relevant results even when the wording differs (synonyms, paraphrases, related concepts)
  - Works across languages and, with multimodal embeddings, across data types
  - Degrades gracefully: near-matches still surface when there is no exact match

Vector Search Weaknesses:

  - Can miss exact identifiers, acronyms, product codes, and rare technical terms
  - Quality is bounded by the embedding model and its training data
  - Embedding every document adds cost and indexing latency

Solution: Hybrid search combining both approaches gives the best results.

Embedding Model Selection

Smaller Models (384-768 dimensions):

  - Faster to embed, cheaper to store, lower query latency at scale
  - May lose nuance on subtle or highly domain-specific queries

Larger Models (1024-1536 dimensions):

  - Higher semantic fidelity, especially for long or nuanced text
  - Greater storage, memory, and compute cost per vector

Tradeoff: Start with smaller models; upgrade only if semantic quality is insufficient.
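If you use OpenAI's text-embedding-3 models, the dimensions parameter lets you request shorter vectors directly; this sketch assumes that model family (other providers expose different controls, so check their documentation):

import openai

def embed_short(text: str, dims: int = 512) -> list:
    """Request a reduced-dimension embedding to cut storage and search cost."""
    response = openai.embeddings.create(
        model="text-embedding-3-small",
        input=text,
        dimensions=dims,   # supported by the text-embedding-3 family
    )
    return response.data[0].embedding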

Indexing Strategies

Flat (Brute Force) Search:

  - Compares the query against every stored vector
  - Exact results and trivial to implement
  - Cost grows linearly, so it suits small to medium collections

Approximate Nearest Neighbor (ANN) Indexes:

  - Structures such as HNSW graphs, IVF partitions, and product quantization
  - Sub-linear query time; scales to millions or billions of vectors
  - Returns approximate, not guaranteed-exact, neighbors

Tradeoff: ANN methods trade slight accuracy (typically 95-99% recall) for massive speed improvements, essential for millions of vectors.
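As an illustration of the flat-versus-ANN choice, here is a small sketch using FAISS; the vectors are random stand-ins, and unit-normalizing them makes L2 ranking match cosine ranking:

import numpy as np
import faiss

dim = 384
vectors = np.random.rand(10_000, dim).astype("float32")
faiss.normalize_L2(vectors)            # unit vectors: L2 order matches cosine order

flat = faiss.IndexFlatL2(dim)          # exact brute-force search, O(n) per query
flat.add(vectors)

hnsw = faiss.IndexHNSWFlat(dim, 32)    # approximate graph-based index (M=32 links per node)
hnsw.add(vectors)

query = np.random.rand(1, dim).astype("float32")
faiss.normalize_L2(query)
exact_d, exact_ids = flat.search(query, 5)     # ground-truth top-5
approx_d, approx_ids = hnsw.search(query, 5)   # fast, approximate top-5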

Latest Developments & Research

Matryoshka Embeddings

Recent research (Kusupati et al., 2022) introduced “Matryoshka Representation Learning” where embeddings are designed to be truncatable - you can use just the first 256 dimensions instead of all 1536, with graceful quality degradation.

This enables adaptive search: use short vectors for an initial filtering pass and full vectors only for the top candidates. Systems using this approach can cut search cost severalfold with little loss in accuracy.
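A sketch of that two-stage idea, assuming the embeddings were trained Matryoshka-style (so their leading dimensions are meaningful on their own) and are already unit-normalized:

import numpy as np

def truncate_and_renormalize(vecs: np.ndarray, dims: int = 256) -> np.ndarray:
    """Keep only the leading dimensions, then re-normalize for cosine search."""
    short = vecs[:, :dims]
    return short / np.linalg.norm(short, axis=1, keepdims=True)

def two_stage_search(query_vec, doc_vecs, shortlist=100, top_k=5, dims=256):
    """Coarse pass on truncated vectors, exact rescoring on full vectors."""
    q_short = truncate_and_renormalize(query_vec[None, :], dims)[0]
    d_short = truncate_and_renormalize(doc_vecs, dims)
    candidates = np.argsort(d_short @ q_short)[::-1][:shortlist]
    full_scores = doc_vecs[candidates] @ query_vec
    return candidates[np.argsort(full_scores)[::-1][:top_k]]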

Multi-Vector Representations

Instead of one vector per document, multi-vector systems such as ColBERT (Khattab & Zaharia, 2020) store a vector for each token or sentence, enabling more fine-grained matching. This improves precision for long documents where only small sections are relevant to a query.

Late Interaction Models

Rather than collapsing each query and document into a single vector, late interaction models keep token-level embeddings and compute similarity at query time with a lightweight interaction such as ColBERT's MaxSim, where each query token matches its best document token and the matches are summed. This achieves retrieval quality approaching reranking models at close to the speed of traditional vector search.
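A toy NumPy sketch of MaxSim-style scoring; the token embeddings here are random stand-ins, whereas real systems use contextualized token embeddings from the model:

import numpy as np

def maxsim_score(query_tokens: np.ndarray, doc_tokens: np.ndarray) -> float:
    """query_tokens: (q, d), doc_tokens: (n, d), rows unit-normalized.
    Each query token takes its best-matching document token; matches are summed."""
    sim = query_tokens @ doc_tokens.T       # (q, n) token-to-token similarities
    return float(sim.max(axis=1).sum())

q = np.random.randn(4, 128);  q /= np.linalg.norm(q, axis=1, keepdims=True)
d = np.random.randn(50, 128); d /= np.linalg.norm(d, axis=1, keepdims=True)
print(maxsim_score(q, d))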

Learned Sparse Retrieval

Methods like SPLADE combine dense vector semantics with sparse keyword signals by learning which terms to activate for each document. This gets benefits of both approaches in a unified representation, showing strong results on standard IR benchmarks.
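Scoring in learned sparse retrieval reduces to a dot product over the few vocabulary terms each side activates; a toy sketch with made-up term weights:

def sparse_score(query_terms: dict, doc_terms: dict) -> float:
    """query_terms / doc_terms: {term: learned weight}; most of the vocabulary is zero."""
    return sum(w * doc_terms.get(term, 0.0) for term, w in query_terms.items())

# The model can activate related terms the text never used (expansion), e.g. "training"
query = {"machine": 1.2, "learning": 1.5, "training": 0.4}
doc = {"machine": 0.9, "learning": 1.1, "dataset": 0.8, "training": 0.7}
print(sparse_score(query, doc))  # 1.2*0.9 + 1.5*1.1 + 0.4*0.7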

Cross-Disciplinary Insight

Vector databases connect to neuroscience models of human memory. The hippocampus is thought to create compressed representations of experiences, enabling recall through pattern completion - remarkably similar to embedding-based retrieval.

From information theory, vector search can be viewed as lossy compression: embeddings compress text into fixed-size representations that preserve semantic information while discarding surface details. The quality of this compression determines retrieval effectiveness.

Daily Challenge

Task: Build a conversational agent with persistent memory

Create an agent that remembers previous conversations and retrieves relevant context:

class MemoryAgent:
    def __init__(self):
        self.vector_memory = VectorMemory()
        self.conversation_history = []
    
    def chat(self, user_message: str) -> str:
        """
        1. Store the user message in memory
        2. Search memory for relevant context
        3. Build prompt with retrieved context + conversation history
        4. Generate response using LLM
        5. Store response in memory
        6. Return response
        """
        # Your implementation here
        pass
    
    def remember(self, key_point: str, metadata: dict = None):
        """Explicitly store important information."""
        pass
    
    def recall(self, query: str) -> list:
        """Retrieve relevant memories."""
        pass

Requirements: chat() should store each user message, retrieve the top-k relevant memories, include them (plus recent conversation history) in the prompt, generate a response with the LLM, and store that response; remember() and recall() should wrap VectorMemory.store() and VectorMemory.search().

Bonus Challenge: Add memory consolidation - periodically summarize old memories into more compressed forms to prevent database bloat while retaining key information.
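One possible shape for the consolidation step, assuming an llm_summarize() helper you supply (any summarization call will do) and the VectorMemory class from earlier:

def consolidate(memory: VectorMemory, llm_summarize, max_items: int = 200, batch: int = 20):
    """When memory grows past max_items, fold the oldest entries into one summary record."""
    if len(memory.memories) <= max_items:
        return
    oldest = memory.memories[:batch]
    summary = llm_summarize("\n".join(m['text'] for m in oldest))  # caller-supplied LLM call
    memory.memories = memory.memories[batch:]                      # drop the raw entries...
    memory.store(summary, metadata={'category': 'consolidated'})   # ...keep the gist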

References & Further Reading

Papers

  - Mikolov, T., et al. (2013). Efficient Estimation of Word Representations in Vector Space. (Word2Vec)
  - Devlin, J., et al. (2018). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding.
  - Khattab, O., & Zaharia, M. (2020). ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT.
  - Formal, T., et al. (2021). SPLADE: Sparse Lexical and Expansion Model for First Stage Ranking.
  - Kusupati, A., et al. (2022). Matryoshka Representation Learning.

Vector databases and semantic search transform AI agents from stateless responders to systems with persistent, semantically-organized memory. As embedding models improve and vector databases become more sophisticated, agents will remember more, search faster, and retrieve exactly what they need - bringing us closer to truly intelligent assistants.