# Retrieval-Augmented Generation (RAG)
Ground LLM responses in external knowledge by retrieving relevant documents and injecting them into the prompt context.
## Overview
Retrieval-Augmented Generation (RAG) is one of the most important patterns in production AI systems. Instead of relying solely on the LLM’s frozen training data, RAG retrieves relevant documents from an external knowledge base and injects them into the prompt as context.
This gives the LLM access to up-to-date, domain-specific, or proprietary information without fine-tuning.
## When to Use
- Your LLM needs access to current or private data not in its training set
- You want to reduce hallucinations by grounding responses in sources
- You have a knowledge base (docs, FAQs, wikis) to make searchable
- You need attributable answers (cite sources)
## Architecture
```mermaid
flowchart LR
    Q[User Query] --> E[Embedding Model]
    E --> VS[Vector Store Search]
    KB[(Knowledge Base)] --> VS
    VS --> RC[Relevant Chunks]
    RC --> P[Prompt Builder]
    Q --> P
    P --> LLM[LLM]
    LLM --> A[Answer + Sources]
```
## How It Works
1. **Indexing Phase (offline):** Split documents into chunks, generate embeddings, store in a vector database
2. **Query Phase (online):** Embed the user query, search for similar chunks, build a prompt with retrieved context, send to LLM
## Key Components
| Component | Purpose | Examples |
|---|---|---|
| Chunker | Split documents into retrievable pieces | RecursiveCharacterTextSplitter, sentence-based |
| Embedding Model | Convert text to dense vectors | OpenAI text-embedding-3-small, Cohere Embed |
| Vector Store | Fast similarity search | Pinecone, Chroma, Weaviate, pgvector |
| Reranker | Re-score results for relevance | Cohere Rerank, cross-encoders |
| Prompt Builder | Assemble context + query | Template with retrieved chunks |
## Implementation
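The two phases above can be sketched end to end with an in-memory store. This is a minimal sketch, not a production pipeline: `embed` is a toy bag-of-words stand-in for a real embedding model (e.g. OpenAI text-embedding-3-small), the `VectorStore` replaces Pinecone/Chroma/etc., and the final LLM call is omitted; only the retrieval and prompt-building steps are shown.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Toy "embedding": sparse token counts. A real pipeline would call an
    # embedding model here; the pipeline shape is the same either way.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

class VectorStore:
    def __init__(self):
        self.entries = []  # (embedding, chunk_text) pairs

    def add(self, chunk: str) -> None:
        self.entries.append((embed(chunk), chunk))

    def search(self, query: str, k: int = 2) -> list[str]:
        q = embed(query)
        ranked = sorted(self.entries, key=lambda e: cosine(q, e[0]), reverse=True)
        return [chunk for _, chunk in ranked[:k]]

def build_prompt(query: str, chunks: list[str]) -> str:
    context = "\n".join(f"- {c}" for c in chunks)
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

# Indexing phase (offline)
store = VectorStore()
store.add("RAG retrieves documents and injects them into the prompt.")
store.add("Fine-tuning updates model weights on new data.")

# Query phase (online): this prompt would now be sent to the LLM.
question = "How does RAG work?"
prompt = build_prompt(question, store.search(question, k=1))
```

Swapping in a real embedding model and vector store changes only `embed` and `VectorStore`; the retrieve-then-generate flow stays identical.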
## Gotchas & Best Practices
**Chunk size is a tradeoff.** Too small and chunks lose context; too large and they dilute relevance and waste tokens. Start with 256-512 tokens and 50-100 tokens of overlap, then benchmark different sizes on your data.
**Queries and documents are shaped differently.** Queries are short and question-shaped; documents are long and declarative, so their embeddings can sit far apart. Consider query expansion or HyDE (Hypothetical Document Embeddings) to bridge this gap.
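The HyDE idea fits in a few lines: ask the LLM to write a hypothetical answer passage, then embed that passage instead of the raw question, since a declarative pseudo-document sits closer in embedding space to real documents. Here `llm` is a placeholder for any prompt-to-text callable; no specific client library is assumed.

```python
def hyde_query(question: str, llm) -> str:
    # HyDE: generate a hypothetical answer document for the question.
    # The caller embeds the returned passage for retrieval, NOT the
    # original question. `llm` is any callable: prompt -> completion text.
    prompt = f"Write a short passage that answers: {question}"
    return llm(prompt)
```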
**Retrieval quality caps answer quality.** If retrieval fails, the LLM can't compensate. Always measure retrieval recall@k separately from end-to-end quality; a reranker can dramatically improve precision.
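Recall@k is simple to compute once you have labeled relevant chunks per query; a minimal sketch:

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    # Fraction of the relevant chunk IDs that appear in the top-k results.
    # Computed per query, then averaged over an evaluation set.
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    return hits / len(relevant) if relevant else 0.0
```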
**Combine embeddings with metadata filters.** Don't rely on embeddings alone: filter on metadata (date, category, access level) to improve relevance and enforce permissions.
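The key design point is ordering: filter on metadata before similarity scoring, so embeddings never surface documents the user is not permitted to see. A sketch, using hypothetical entry dicts with an `access_level` field and word-overlap as a stand-in for vector similarity:

```python
def filtered_search(entries: list[dict], query: str,
                    user_clearance: int, k: int = 3) -> list[dict]:
    # `entries` are {"text": str, "access_level": int} dicts (illustrative
    # schema). Permission filtering happens BEFORE scoring.
    allowed = [e for e in entries if e["access_level"] <= user_clearance]
    q = set(query.lower().split())
    score = lambda e: len(q & set(e["text"].lower().split()))
    return sorted(allowed, key=score, reverse=True)[:k]
```

Most vector stores (Pinecone, Weaviate, pgvector) support this pattern natively via metadata/WHERE filters applied at query time.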
**Make answers citable.** Always include document sources in the prompt and instruct the LLM to cite them. This makes answers verifiable and builds trust.
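One common way to do this: number each retrieved chunk and label it with its source, then instruct the model to cite by number so citations can be mapped back to documents. A sketch (the prompt wording is illustrative):

```python
def build_cited_prompt(question: str, chunks: list[tuple[str, str]]) -> str:
    # `chunks` are (source_id, text) pairs from retrieval.
    context = "\n".join(
        f"[{i}] ({src}) {text}" for i, (src, text) in enumerate(chunks, 1)
    )
    return (
        "Answer the question using only the sources below. "
        "Cite each claim with its source number, e.g. [1].\n\n"
        f"Sources:\n{context}\n\nQuestion: {question}"
    )
```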
## Variations
- Naive RAG — Basic retrieve-then-generate
- Advanced RAG — Adds reranking, query rewriting, hybrid search
- Modular RAG — Composable pipeline with routing, caching, and fallback strategies
- Graph RAG — Uses knowledge graphs instead of (or alongside) vector search
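The hybrid search mentioned under Advanced RAG needs a way to merge keyword results (e.g. BM25) with vector results. Reciprocal rank fusion (RRF) is a common, score-free choice; a minimal sketch, with `k=60` as the constant commonly used in the RRF literature:

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    # Each inner list is a ranked list of doc IDs from one retriever
    # (keyword, vector, ...). Each occurrence contributes 1/(k + rank),
    # so documents ranked highly by multiple retrievers rise to the top.
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, 1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

RRF needs no score normalization across retrievers, which is why it is popular for fusing BM25 and cosine-similarity rankings that live on different scales.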