Vector Search and Embedding Spaces for AI Agents
Concept Introduction
Simple Explanation
Imagine trying to find similar concepts in a massive library. Instead of searching by exact word matches, you want to find documents that are conceptually related—even if they use different words. Vector search makes this possible by converting text, images, or other data into lists of numbers (vectors) that capture meaning. Similar concepts end up close together in this numeric space, allowing AI agents to find relevant information based on meaning, not just keywords.
Technical Detail
Vector search operates on embedding spaces—high-dimensional mathematical spaces where semantic relationships are encoded as geometric relationships. Text, images, code, or other data are transformed into dense vectors (typically 768-4096 dimensions) via neural network encoders. These vectors preserve semantic similarity: semantically similar inputs produce vectors with small cosine distances or Euclidean distances.
For AI agents, this enables semantic memory retrieval, contextual decision-making, and knowledge-grounded responses. Instead of exact database queries, agents perform approximate nearest neighbor (ANN) search to find relevant context from large knowledge bases in milliseconds.
Historical & Theoretical Context
Origins
Vector representations of words emerged from distributional semantics in linguistics (1950s-1990s), formalized as “You shall know a word by the company it keeps” (Firth, 1957). Early computational approaches like Latent Semantic Analysis (LSA, 1988) used matrix factorization to create word vectors.
The modern embedding era began with Word2Vec (Mikolov et al., 2013) and GloVe (Pennington et al., 2014), which learned dense word embeddings capturing semantic relationships: king - man + woman ≈ queen. Sentence and document embeddings followed with Doc2Vec (2014) and Sentence-BERT (2019).
Transformer models (Vaswani et al., 2017) revolutionized embeddings. BERT (2018) introduced contextual embeddings, where the same word receives a different vector depending on its context. Specialized embedding models followed, such as OpenAI’s text-embedding-ada-002 (2022) and open-source alternatives optimized for retrieval tasks.
Relation to AI Agents
AI agents need memory and knowledge beyond their training data. Vector search enables:
Retrieval-Augmented Generation (RAG): Agents retrieve relevant documents before generating responses, grounding answers in current information.
Episodic Memory: Agents store past interactions as vectors, retrieving similar situations when making decisions.
Tool Selection: Agents embed tool descriptions and queries, finding the right tool for each task.
Multi-Modal Grounding: Agents can search across text, images, and code simultaneously using shared embedding spaces.
Algorithms & Math
Embedding Generation
Given input text $x$, an embedding model $f$ produces a vector:
$$\mathbf{v} = f(x) \in \mathbb{R}^d$$
where $d$ is the embedding dimension (e.g., 768, 1536, 4096).
For transformers like BERT, this typically uses the [CLS] token representation or mean pooling:
$$\mathbf{v} = \text{MeanPool}(\text{Transformer}(x))$$
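As a concrete sketch, the open-source sentence-transformers library performs exactly this encode-and-pool step; the model name and input texts below are illustrative choices, not recommendations:
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # d = 384
texts = ["How do I reset my password?", "Steps to recover account access"]

# encode() runs the transformer and mean-pools token states into one vector per text
vectors = model.encode(texts, normalize_embeddings=True)
print(vectors.shape)  # (2, 384)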
Similarity Measurement
Cosine Similarity: Measures angle between vectors, normalized to [-1, 1]:
$$\text{sim}(\mathbf{v}_1, \mathbf{v}_2) = \frac{\mathbf{v}_1 \cdot \mathbf{v}_2}{\|\mathbf{v}_1\| \, \|\mathbf{v}_2\|}$$
Euclidean Distance: Direct geometric distance:
$$d(\mathbf{v}_1, \mathbf{v}_2) = \|\mathbf{v}_1 - \mathbf{v}_2\|_2 = \sqrt{\sum_{i=1}^{d} (v_{1,i} - v_{2,i})^2}$$
Dot Product: Unnormalized similarity (faster but scale-sensitive):
$$\text{score}(\mathbf{v}_1, \mathbf{v}_2) = \mathbf{v}_1 \cdot \mathbf{v}_2$$
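For reference, a minimal numpy sketch of all three measures (the vectors are illustrative):
import numpy as np

v1 = np.array([0.2, 0.7, 0.1])
v2 = np.array([0.3, 0.6, 0.2])

dot = np.dot(v1, v2)                                      # unnormalized score
cosine = dot / (np.linalg.norm(v1) * np.linalg.norm(v2))  # in [-1, 1]
euclidean = np.linalg.norm(v1 - v2)                       # geometric distance
Note that for unit-normalized vectors the three measures agree on ranking: cosine similarity equals the dot product, and Euclidean distance is a monotonic function of both.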
Approximate Nearest Neighbor (ANN) Search
Exact nearest neighbor search scales linearly with database size—impractical for millions of vectors. ANN algorithms trade slight accuracy for massive speedup:
HNSW (Hierarchical Navigable Small World): Builds a multi-layer graph where each node connects to nearby vectors. Search starts at top layer and progressively refines at lower layers. Typical complexity: O(log N).
IVF (Inverted File Index): Clusters vectors into partitions using k-means. Search queries only relevant partitions. Complexity: O(√N) with good clustering.
Product Quantization: Compresses vectors by dividing into subvectors and quantizing separately. Reduces memory and accelerates distance calculations.
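A hedged sketch of an IVF index in FAISS illustrates the cluster-then-probe idea; the dimension, nlist, and nprobe values are illustrative, not tuned recommendations:
import numpy as np
import faiss

d = 384                                             # embedding dimension
xb = np.random.rand(100_000, d).astype("float32")   # stand-in database vectors

nlist = 1024                              # number of k-means partitions
quantizer = faiss.IndexFlatL2(d)          # exact index over cluster centroids
index = faiss.IndexIVFFlat(quantizer, d, nlist)
index.train(xb)                           # k-means clustering of the database
index.add(xb)

index.nprobe = 16                         # partitions scanned per query (recall/speed knob)
distances, ids = index.search(xb[:1], 5)  # top-5 approximate neighbors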
Design Patterns & Architectures
RAG (Retrieval-Augmented Generation) Pattern
User Query → Embedding → Vector Search → Retrieved Context → LLM → Response
The agent embeds the query, finds relevant documents, and passes them to the LLM as context, grounding responses in retrieved knowledge.
Hybrid Search Pattern
Combine vector search (semantic) with keyword search (exact match):
semantic_results = vector_search(query_embedding, top_k=50)
keyword_results = bm25_search(query_text, top_k=50)
final_results = rerank(combine(semantic_results, keyword_results))
This balances semantic understanding with precise term matching.
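One concrete (and deliberately simple) way to implement the combine-and-rerank step above is reciprocal rank fusion; semantic_results and keyword_results are assumed to be ranked lists of document ids, and k=60 is the conventional smoothing constant:
def reciprocal_rank_fusion(result_lists, k=60):
    scores = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results):
            # Each list contributes more weight to its top-ranked documents
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)  # best fused score first

final_results = reciprocal_rank_fusion([semantic_results, keyword_results])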
Multi-Vector Memory Pattern
Agents store different information types in separate vector stores:
- Episodic memory: Past conversations and actions
- Semantic memory: Facts and knowledge
- Procedural memory: Tool descriptions and usage examples
Each retrieval queries appropriate memory stores, mimicking human memory systems.
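A minimal routing sketch, assuming three store objects exposing the similarity_search() interface used elsewhere in this article and a caller that already knows which memory type it needs:
memory_stores = {
    "episodic": episodic_store,      # past conversations and actions
    "semantic": semantic_store,      # facts and knowledge
    "procedural": procedural_store,  # tool descriptions and usage examples
}

def retrieve_from_memory(query: str, memory_type: str, k: int = 3):
    return memory_stores[memory_type].similarity_search(query, k=k)

# e.g., ground a tool-usage question in procedural memory
docs = retrieve_from_memory("how to call the web search tool", "procedural")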
Hierarchical Retrieval Pattern
For massive knowledge bases, use two-stage retrieval (sketched after this list):
- Coarse retrieval: Find relevant documents/chunks (fast, approximate)
- Fine-grained retrieval: Rerank with more expensive model or cross-encoder
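A hedged sketch of the two-stage pattern, using a cross-encoder from sentence-transformers for the reranking step; the model name is illustrative, and the vector_store object is assumed to expose the similarity_search() interface used in the examples that follow:
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def hierarchical_retrieve(query: str, coarse_k: int = 50, final_k: int = 5):
    # Stage 1: fast, approximate recall from the vector store
    candidates = vector_store.similarity_search(query, k=coarse_k)
    # Stage 2: precise (query, passage) scoring with the cross-encoder
    scores = reranker.predict([(query, doc.page_content) for doc in candidates])
    ranked = sorted(zip(scores, candidates), key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in ranked[:final_k]]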
Practical Application
Python Example with LangChain and FAISS
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import FAISS
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.document_loaders import TextLoader
# Load and chunk documents
loader = TextLoader("knowledge_base.txt")
documents = loader.load()
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200
)
chunks = text_splitter.split_documents(documents)
# Create embeddings and vector store
embeddings = OpenAIEmbeddings()
vector_store = FAISS.from_documents(chunks, embeddings)
# Agent retrieval function
def retrieve_context(query: str, k: int = 3):
    """Retrieve the top-k most relevant chunks for a query."""
    results = vector_store.similarity_search(query, k=k)
    return "\n\n".join(doc.page_content for doc in results)
# Usage in agent loop
user_query = "How do I implement error handling in async functions?"
context = retrieve_context(user_query)
prompt = f"Context:\n{context}\n\nQuestion: {user_query}\n\nAnswer:"
# Pass prompt to LLM...
Using in CrewAI Agent
from crewai import Agent, Task, Crew
from langchain.vectorstores import Chroma
from langchain.embeddings import HuggingFaceEmbeddings
# Setup vector store
embeddings = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")
vector_store = Chroma(
    persist_directory="./agent_memory",
    embedding_function=embeddings
)
# Define a vector search function; wrap it as a tool object
# (e.g., with a tool decorator or a LangChain Tool wrapper) before use
def vector_search_tool(query: str) -> str:
    """Return the five chunks most similar to the query."""
    docs = vector_store.similarity_search(query, k=5)
    return "\n".join(d.page_content for d in docs)

# Create agent with vector memory
research_agent = Agent(
    role="Research Assistant",
    goal="Answer questions using retrieved knowledge",
    backstory="Expert at finding relevant information",
    tools=[],  # add the wrapped vector search tool here
    verbose=True
)

# Create task using retrieval
task = Task(
    description="Research: What are best practices for agent memory?",
    agent=research_agent,
    expected_output="Comprehensive answer with sources"
)

# Assemble and run the crew
crew = Crew(agents=[research_agent], tasks=[task])
result = crew.kickoff()
Comparisons & Tradeoffs
Vector Search vs. Keyword Search
Vector Search:
- Pros: Semantic understanding, handles synonyms, multilingual
- Cons: Computationally expensive, requires embeddings, approximate results
Keyword Search (BM25):
- Pros: Exact matches, fast, interpretable
- Cons: No semantic understanding, vocabulary mismatch issues
Best Practice: Use hybrid search combining both approaches.
Embedding Model Choices
OpenAI text-embedding-ada-002:
- Pros: High quality, 1536 dimensions, good generalization
- Cons: API cost, proprietary, latency
Open-Source (all-MiniLM-L6-v2):
- Pros: Free, fast, local deployment, 384 dimensions
- Cons: Lower quality than frontier models
Domain-Specific Models:
- Pros: Optimized for specific tasks (code, legal, medical)
- Cons: Less generalizable
Vector Database Scalability
FAISS (Facebook AI Similarity Search):
- Pros: Extremely fast, in-memory, good for < 10M vectors
- Cons: library only (no server or metadata filtering); persistence and distribution must be handled manually
Pinecone:
- Pros: Managed, scalable, metadata filtering
- Cons: Cost, vendor lock-in
Weaviate/Qdrant:
- Pros: Open-source, scalable, feature-rich
- Cons: More complex setup
Latest Developments & Research
Matryoshka Embeddings (2022-2024)
Recent research enables “nested” embeddings where the first N dimensions capture coarse meaning, allowing dynamic dimensionality. This reduces storage and search costs while maintaining quality.
Paper: “Matryoshka Representation Learning” (Kusupati et al., 2022)
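The core trick is easy to sketch: truncate the embedding to its first N dimensions and re-normalize. This only works when the model was trained with Matryoshka-style losses, so treat the snippet as illustrative:
import numpy as np

def truncate_embedding(v: np.ndarray, n_dims: int) -> np.ndarray:
    prefix = v[:n_dims]
    return prefix / np.linalg.norm(prefix)  # re-normalize for cosine search

full = np.random.rand(1536).astype("float32")  # stand-in for an MRL-trained embedding
coarse = truncate_embedding(full, 256)         # ~6x smaller index entry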
Late Interaction Models (ColBERT, 2020; ColBERTv2, 2022)
Instead of a single vector per document, these models keep one vector per token and compute token-level interactions (MaxSim) at retrieval time. This substantially improves accuracy at a moderate increase in storage and compute cost.
ColBERTv2 achieves state-of-the-art retrieval quality while compressing its token-level index by roughly 6-10x relative to the original ColBERT via residual compression.
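The token-level interaction is the MaxSim operator, which is simple to express in numpy; the token matrices here are stand-ins for trained per-token embeddings:
import numpy as np

def maxsim_score(query_tokens: np.ndarray, doc_tokens: np.ndarray) -> float:
    """query_tokens: (m, d), doc_tokens: (n, d), rows unit-normalized."""
    sims = query_tokens @ doc_tokens.T    # (m, n) token-level similarities
    return float(sims.max(axis=1).sum())  # best doc token per query token, summed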
Multimodal Embeddings (CLIP and Extensions)
Models like OpenAI’s CLIP (2021) and Google’s PaLI create shared embedding spaces for text and images. Recent extensions add audio, video, and 3D representations, enabling agents to search across modalities.
Application: Agents retrieving relevant images, diagrams, or videos to support text responses.
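A hedged sketch of cross-modal scoring with the public CLIP checkpoint on Hugging Face; the image path and captions are illustrative:
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("diagram.png")  # illustrative local file
texts = ["a database architecture diagram", "a photo of a cat"]

inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)
print(outputs.logits_per_image)  # image-text similarity, one score per caption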
Benchmarks
BEIR (2021, updated 2024): Benchmark for diverse retrieval tasks. Current SOTA: ~56% NDCG@10 averaged across tasks.
MTEB (Massive Text Embedding Benchmark, 2023): Evaluates embeddings across 56 datasets and 8 tasks. Tracks leaderboard of best models.
Cross-Disciplinary Insight
Neuroscience Connection
Human memory doesn’t store exact records—it stores compressed, semantic representations that are retrieved associatively. Vector search mirrors this: instead of exact lookup, we retrieve by similarity to cues.
Hippocampus and place cells: Neurons fire for specific locations, creating spatial maps. Vector embeddings create semantic “maps” where concepts have locations in high-dimensional space.
Memory consolidation: Humans replay and reorganize memories during sleep. Some agent systems implement similar “replay” mechanisms, reembedding and reorganizing stored experiences to improve future retrieval.
Information Retrieval Theory
Vector search formalizes the “conceptual similarity” intuition from library science and information retrieval. While traditional IR relied on term frequency and document structure, embeddings learn these patterns from data, discovering latent semantic structure automatically.
Daily Challenge / Thought Exercise
Coding Challenge (30 minutes):
Build a simple agent that maintains a vector memory of conversations and retrieves relevant past exchanges (a starter skeleton follows the steps below):
- Create a list to store conversation turns with embeddings
- When user asks a question, retrieve the 2 most similar past turns
- Include retrieved context in the prompt to the LLM
- Test with a multi-turn conversation where later questions reference earlier topics
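A starter skeleton, not a solution; the embedding model and the flat in-memory list are one reasonable choice among many:
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
memory = []  # (text, vector) pairs, one per conversation turn

def remember(turn: str):
    memory.append((turn, model.encode(turn, normalize_embeddings=True)))

def recall(query: str, k: int = 2):
    q = model.encode(query, normalize_embeddings=True)
    ranked = sorted(memory, key=lambda pair: float(np.dot(q, pair[1])), reverse=True)
    return [text for text, _ in ranked[:k]]

# TODO: on each user message, call recall(), splice the results into the LLM
# prompt, then remember() both the user message and the model's reply.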
Thought Exercise:
Consider an agent that needs to search across documents in English, Spanish, and Chinese. Should you:
- Use separate embedding models for each language?
- Use a single multilingual model?
- Translate everything to English first?
What are the tradeoffs in accuracy, cost, and latency?
References & Further Reading
Papers
- “Attention Is All You Need” (Vaswani et al., 2017): Transformer foundations
- “BERT: Pre-training of Deep Bidirectional Transformers” (Devlin et al., 2018)
- “Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks” (Reimers & Gurevych, 2019)
- “Approximate Nearest Neighbor Negative Contrastive Learning for Dense Text Retrieval” (Xiong et al., 2020)
- “ColBERTv2: Effective and Efficient Retrieval via Lightweight Late Interaction” (Santhanam et al., 2022)
Blog Posts
- Pinecone Learning Center: Vector databases explained
- Weaviate blog: RAG architectures and best practices
- LangChain documentation: Retrieval patterns
GitHub Repositories
- FAISS: https://github.com/facebookresearch/faiss
- Sentence Transformers: https://github.com/UKPLab/sentence-transformers
- LangChain: https://github.com/langchain-ai/langchain
- Weaviate: https://github.com/weaviate/weaviate