
Retrieval-Augmented Generation (RAG)

Ground LLM responses in external knowledge by retrieving relevant documents and injecting them into the prompt context.

Tags: rag, embeddings, vector-search, knowledge-base, grounding

Overview

Retrieval-Augmented Generation (RAG) is one of the most important patterns in production AI systems. Instead of relying solely on the LLM’s frozen training data, RAG retrieves relevant documents from an external knowledge base and injects them into the prompt as context.

This gives the LLM access to up-to-date, domain-specific, or proprietary information without fine-tuning.

When to Use

  • Your LLM needs access to current or private data not in its training set
  • You want to reduce hallucinations by grounding responses in sources
  • You have a knowledge base (docs, FAQs, wikis) to make searchable
  • You need attributable answers (cite sources)

Architecture

```mermaid
flowchart LR
    Q[User Query] --> E[Embedding Model]
    E --> VS[Vector Store Search]
    KB[(Knowledge Base)] --> VS
    VS --> RC[Relevant Chunks]
    RC --> P[Prompt Builder]
    Q --> P
    P --> LLM[LLM]
    LLM --> A[Answer + Sources]
```

How It Works

  1. Indexing Phase (offline): Split documents into chunks, generate embeddings, store in a vector database
  2. Query Phase (online): Embed the user query, search for similar chunks, build a prompt with retrieved context, send to LLM
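The two phases can be sketched end to end. This is a toy illustration only: the bag-of-words `embed` and the in-memory list stand in for a real embedding model and vector database.

```python
import math
import re
from collections import Counter

def embed(text):
    # Toy "embedding": bag-of-words term counts (stand-in for a real model)
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    # Cosine similarity between two sparse term-count vectors
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# 1. Indexing phase (offline): embed each chunk and store the pair
chunks = [
    "The refund window is 30 days from purchase.",
    "Shipping takes 5-7 business days within the US.",
]
index = [(chunk, embed(chunk)) for chunk in chunks]

# 2. Query phase (online): embed the query, pick the closest chunk,
#    and inject it into the prompt as context
query = "How long is the refund window?"
best_chunk, _ = max(index, key=lambda pair: cosine(embed(query), pair[1]))
prompt = f"Context:\n{best_chunk}\n\nQuestion: {query}\nAnswer using only the context."
```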

Key Components

| Component | Purpose | Examples |
| --- | --- | --- |
| Chunker | Split documents into retrievable pieces | RecursiveCharacterTextSplitter, sentence-based |
| Embedding Model | Convert text to dense vectors | OpenAI text-embedding-3-small, Cohere Embed |
| Vector Store | Fast similarity search | Pinecone, Chroma, Weaviate, pgvector |
| Reranker | Re-score results for relevance | Cohere Rerank, cross-encoders |
| Prompt Builder | Assemble context + query | Template with retrieved chunks |

Implementation

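A minimal pipeline might look like the following. The `embed_fn` and `llm_fn` callables are placeholders for your embedding model and LLM client, and the in-memory store stands in for a real vector database; the demo at the bottom uses a hashed bag-of-words embedding purely for illustration.

```python
import math
import re

class SimpleRAG:
    """Minimal RAG pipeline. embed_fn and llm_fn are injected so any
    provider can be plugged in (both names are placeholders)."""

    def __init__(self, embed_fn, llm_fn):
        self.embed_fn = embed_fn
        self.llm_fn = llm_fn
        self.store = []  # list of (chunk, vector) pairs

    def index(self, chunks):
        # Indexing phase: embed each chunk and keep it alongside its vector
        for chunk in chunks:
            self.store.append((chunk, self.embed_fn(chunk)))

    @staticmethod
    def _cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norm_a = math.sqrt(sum(x * x for x in a))
        norm_b = math.sqrt(sum(y * y for y in b))
        return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

    def retrieve(self, query, k=3):
        # Query phase: rank stored chunks by similarity to the query vector
        qv = self.embed_fn(query)
        ranked = sorted(self.store, key=lambda cv: self._cosine(qv, cv[1]), reverse=True)
        return [chunk for chunk, _ in ranked[:k]]

    def answer(self, query, k=3):
        context = "\n---\n".join(self.retrieve(query, k))
        prompt = (
            "Answer using only the context below. Cite the chunk you used.\n\n"
            f"Context:\n{context}\n\nQuestion: {query}"
        )
        return self.llm_fn(prompt)

# Demo with a toy hashed bag-of-words embedding and an echo "LLM"
def toy_embed(text):
    vec = [0.0] * 256
    for tok in re.findall(r"[a-z0-9]+", text.lower()):
        vec[hash(tok) % 256] += 1.0
    return vec

rag = SimpleRAG(toy_embed, llm_fn=lambda prompt: prompt)
rag.index(["The refund window is 30 days.", "Shipping takes 5-7 business days."])
top = rag.retrieve("how long is the refund window", k=1)
```

In production you would swap `toy_embed` for a real embedding API and `llm_fn` for a chat-completion call; the pipeline shape stays the same.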

Gotchas & Best Practices

🚨 Chunk Size Matters

Too small and chunks lose context; too large and they dilute relevance and waste tokens. Start with chunks of 256-512 tokens and 50-100 tokens of overlap, then benchmark different sizes on your own data.
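A sliding-window chunker with overlap can be sketched like this, using word tokens as a rough proxy for model tokens:

```python
def chunk(tokens, size=256, overlap=64):
    """Fixed-size sliding window with overlap. `tokens` is any sequence;
    word tokens are used here as a proxy for model tokens."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # last window already covers the tail
    return chunks
```

The overlap means each boundary sentence appears in two chunks, so a fact split across a boundary is still retrievable from at least one of them.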

🚨 Embedding Query ≠ Document

Queries are short and question-shaped; documents are long and declarative. Consider using query expansion or HyDE (Hypothetical Document Embeddings) to bridge this gap.
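The HyDE idea is small enough to sketch: ask the LLM for a hypothetical answer passage, then embed that passage instead of the raw query, so the vector being searched is document-shaped. The `llm_fn`, `embed_fn`, and `search_fn` callables are placeholders for your own providers.

```python
def hyde_retrieve(query, llm_fn, embed_fn, search_fn, k=5):
    """HyDE sketch: embed a hypothetical answer instead of the raw query.
    llm_fn, embed_fn, and search_fn are placeholder callables."""
    # Generate a document-shaped passage that *would* answer the query
    hypothetical = llm_fn(f"Write a short passage that answers: {query}")
    # Search with the passage's embedding, not the query's
    return search_fn(embed_fn(hypothetical), k=k)
```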

⚠️ Retrieval Quality is Everything

If retrieval fails, the LLM can’t compensate. Always measure retrieval recall@k separately from end-to-end quality. A reranker can dramatically improve precision.
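Recall@k is straightforward to compute once you have a labeled set of relevant documents per query:

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of the relevant documents that appear in the top-k results."""
    if not relevant:
        return 0.0
    hits = len(set(retrieved[:k]) & set(relevant))
    return hits / len(relevant)
```

Tracking this metric per query surface (FAQ vs. long docs, say) tells you whether a bad answer came from retrieval or from generation.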

💡 Add Metadata Filtering

Don’t rely on embeddings alone. Combine with metadata filters (date, category, access level) to improve relevance and enforce permissions.
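The pattern is pre-filter, then rank: discard candidates whose metadata fails the filter, and run similarity search only over the survivors. Most vector databases apply such filters natively; this sketch shows the logic with an in-memory store of (vector, metadata) pairs.

```python
def filtered_search(store, qvec, filters, similarity, k=3):
    """Metadata pre-filter followed by similarity ranking.
    `store` holds (vector, metadata) pairs; `similarity` is any
    scoring callable (placeholder for your distance function)."""
    candidates = [
        (vec, meta) for vec, meta in store
        if all(meta.get(key) == val for key, val in filters.items())
    ]
    candidates.sort(key=lambda vm: similarity(qvec, vm[0]), reverse=True)
    return candidates[:k]

# Demo: restrict search to one team's documents
store = [
    ([1.0, 0.0], {"team": "billing", "doc": "refunds.md"}),
    ([0.9, 0.1], {"team": "support", "doc": "contact.md"}),
    ([0.0, 1.0], {"team": "billing", "doc": "pricing.md"}),
]
dot = lambda a, b: sum(x * y for x, y in zip(a, b))
hits = filtered_search(store, [1.0, 0.0], {"team": "billing"}, dot, k=1)
```

Because filtering happens before ranking, a document the user may not access can never leak into the prompt, no matter how similar its embedding is.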

💡 Include Source Attribution

Always include document sources in the prompt and instruct the LLM to cite them. This makes answers verifiable and builds trust.
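One way to wire this in is to number each chunk with its source in the prompt, so the model can cite by index:

```python
def build_cited_prompt(query, chunks):
    """Number each chunk with its source so the model can cite [n].
    `chunks` is a list of (text, source) pairs."""
    numbered = "\n".join(
        f"[{i}] ({source}) {text}" for i, (text, source) in enumerate(chunks, 1)
    )
    return (
        "Answer from the context only and cite sources as [n]. "
        "If the context is insufficient, say so.\n\n"
        f"Context:\n{numbered}\n\nQuestion: {query}"
    )
```

The bracketed indices can then be mapped back to source links when rendering the answer.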

Variations

  • Naive RAG — Basic retrieve-then-generate
  • Advanced RAG — Adds reranking, query rewriting, hybrid search
  • Modular RAG — Composable pipeline with routing, caching, and fallback strategies
  • Graph RAG — Uses knowledge graphs instead of (or alongside) vector search
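The hybrid search mentioned under Advanced RAG is often implemented with reciprocal rank fusion (RRF), which merges a keyword ranking (e.g. BM25) and a vector ranking without needing to reconcile their score scales; a common sketch:

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists (e.g. BM25 + vector search) by RRF:
    score(doc) = sum over lists of 1 / (k + rank). k=60 is the
    commonly used smoothing constant."""
    scores = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, 1):  # ranks are 1-based
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

Because only ranks are used, a document that appears near the top of both lists beats one that dominates a single list, which is exactly the behavior hybrid search wants.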

Further Reading