RAG Engineering

From naive retrieval to production-grade RAG systems: chunking, embeddings, vector search, re-ranking, and advanced patterns for retrieval-augmented generation.

RAG is not about putting more data into the prompt. It's about putting the RIGHT data into the prompt. Production RAG requires careful engineering of every stage, from how you chunk documents to how you rank and present retrieved context. This guide covers all of it.

Chapter 01 · Foundations
RAG Mental Model: How Retrieval Actually Works

The fundamental problem with LLMs is not that they're stupid; it's that they don't know your data. They were trained on the public internet up to some cutoff date. They haven't read your internal docs, your product specs, your customer tickets. RAG is how you give them that knowledge at runtime, without retraining.

Retrieval-Augmented Generation (RAG) is an architecture pattern where you enhance LLM responses by retrieving relevant information from external data sources and injecting it into the prompt at inference time. The model generates its response grounded in this retrieved context, not just its pre-trained knowledge.

Figure: The RAG pipeline, from question to grounded answer. User query ("What's our refund policy?") → Retriever (vector search, BM25 / hybrid, re-ranking) over the knowledge base (docs, PDFs, DBs, APIs) → Context builder (assemble prompt) → LLM (system + context + query) → Grounded answer ("30-day refund..."). The LLM never sees your full knowledge base; it sees a few retrieved chunks relevant to this specific query. This is what makes RAG scalable: any amount of data, fixed context-window cost per query.
The Key Insight

RAG separates knowledge storage from knowledge access. Your knowledge base can be petabytes. Your context window is 128K tokens. Retrieval is the bridge: it selects the few thousand tokens most relevant to this particular query, every time.

RAG = Search + Prompt Engineering. Retrieval decides WHAT the model sees. The prompt decides HOW the model uses it.

This separation is critical to understand:

Even perfect retrieval fails with poor prompts

You find the right documents, but the prompt doesn't instruct the model to cite sources, stay grounded, or say "I don't know." The model hallucinates or ignores the context entirely.

Even perfect prompts fail with bad retrieval

Your prompt is flawless, but the retriever returns irrelevant chunks. The model faithfully answers from garbage context, producing a well-formatted wrong answer.

The Implication

A production RAG system must optimize both layers together. Retrieval quality without prompt quality is wasted effort. Prompt quality without retrieval quality is a beautiful facade over bad data. Every chapter in this guide addresses one or both.
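
The two layers can be made concrete with a toy sketch. Everything here is illustrative: the three-document knowledge base is invented, word overlap stands in for real vector or hybrid search, and the prompt wording is one possible phrasing rather than a recommended template. No LLM is called; the sketch stops at the assembled prompt.

```python
import re

# Toy knowledge base (invented for illustration).
KNOWLEDGE_BASE = [
    {"source": "refunds.md", "text": "Refunds are available within 30 days of purchase."},
    {"source": "shipping.md", "text": "Standard shipping takes 3 to 5 business days."},
    {"source": "accounts.md", "text": "Reset your password from the account settings page."},
]


def tokenize(text: str) -> set[str]:
    """Lowercase the text and return its set of word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query: str, k: int = 2) -> list[dict]:
    """Retrieval layer: decides WHAT the model sees.

    Ranks chunks by word overlap with the query (a stand-in for
    vector search) and drops chunks with no overlap at all.
    """
    query_tokens = tokenize(query)
    scored = [(len(query_tokens & tokenize(doc["text"])), doc) for doc in KNOWLEDGE_BASE]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for score, doc in scored[:k] if score > 0]


def build_prompt(query: str, chunks: list[dict]) -> str:
    """Prompt layer: decides HOW the model uses the retrieved chunks."""
    context = "\n\n".join(f"[{c['source']}]\n{c['text']}" for c in chunks)
    return (
        "Answer using only the context below and cite the source in brackets. "
        "If the context does not contain the answer, say \"I don't know.\"\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}"
    )


query = "Are refunds available after purchase?"
prompt = build_prompt(query, retrieve(query))
print(prompt)
```

Swapping `retrieve` for a better retriever or `build_prompt` for a stricter template changes one layer without touching the other, which is why the two can be evaluated separately.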

LLMs have powerful language understanding and generation capabilities, but they ship with three fundamental limitations that RAG directly addresses:

Knowledge Cutoff

Training data stops at some date. GPT-4o's cutoff is ~October 2023. Anything after that (new products, policy changes, recent events) is invisible to the base model.

  • RAG injects current documents at inference

No Access to Private Data

Your internal docs, customer records, proprietary research: none of it is in the model. It was never trained on your company's Confluence, Notion, or databases.

  • RAG retrieves from your data sources

Hallucination Risk

Without grounding, the model can confidently generate plausible-sounding but wrong information. It's completing text patterns, not checking facts.

  • RAG grounds answers in source documents

Figure: What the LLM knows vs. what it doesn't. LLM training data: public internet, books, Wikipedia, code repos. Your private data: internal docs, customer records, post-cutoff info. Without RAG, the LLM can't access your data. RAG bridges the gap: it retrieves relevant private data and injects it into the LLM's context at query time.

The most common question: "Should I fine-tune or use RAG?" This is not an either/or: they solve different problems and can be combined. Here's how to decide:

| Dimension | RAG | Fine-Tuning | Both Combined |
| --- | --- | --- | --- |
| What it changes | What the model sees (context) | How the model behaves (weights) | Both knowledge and behaviour |
| Knowledge updates | Instant: update docs, immediate effect | Requires re-training (~hours/days) | Mixed: fast for the RAG portion |
| Cost structure | Higher per-query (retrieval + longer prompts) | Higher upfront, lower per-query | Highest upfront, balanced query cost |
| Factual accuracy | High: grounded in source docs | Lower: can still hallucinate | Highest with both |
| Style/format control | Limited: prompt engineering only | Strong: embedded in weights | Best of both |
| Complexity | Retrieval pipeline, chunking, indexing | Data curation, training, evaluation | Both complexities combined |

✅ Choose RAG when...

  • Knowledge changes frequently (docs, policies, products)
  • You need source attribution ("this came from doc X")
  • Dataset is large (>100K documents)
  • You can't afford training compute
  • Factual accuracy is critical (legal, medical, support)

✅ Choose Fine-Tuning when...

  • You need consistent style/tone/format
  • Domain has specialized vocabulary the base model struggles with
  • Knowledge is stable and won't change often
  • Latency is critical (no retrieval overhead)
  • You have high-quality training data (>1K examples)

The Hybrid Sweet Spot

In practice, many production systems use RAG with a fine-tuned model: fine-tune for domain style and vocabulary, use RAG for factual grounding. Example: a legal assistant fine-tuned on case law writing style, with RAG retrieving relevant precedents. The fine-tuning doesn't add factual knowledge; it teaches the model how to write like a lawyer.

A production RAG system has 6 core stages. Each stage has multiple design choices, and getting all of them right is what separates "RAG that works in demos" from "RAG that works in production."

Figure: The 6 stages of a RAG pipeline; each is an engineering problem. Offline pipeline (indexing): ① Ingestion (load docs, parse formats, extract text; Ch 02) → ② Chunking (split into pieces, size + overlap, semantic units; Ch 02) → ③ Embedding (text → vectors, 768-3072 dims, semantic meaning; Ch 03) → ④ Vector storage (indexed, searchable, persistent: Pinecone, Qdrant, pgvector; Ch 04). Online pipeline (query time): user question → ⑤ Retrieval (vector search, hybrid with BM25, re-rank top-k; Ch 05-06) → ⑥ Generation (context + query → LLM → answer; Ch 07). Each stage can fail independently: Chapter 08 covers failure modes and evaluation, and Chapters 09-10 cover advanced patterns (GraphRAG, CRAG) and production operations.

Ingestion & Chunking (Ch 2)

Load documents from disparate sources, parse various formats (PDF, HTML, Markdown, Notion, Confluence), split into retrievable chunks with appropriate size and overlap.

  • Loaders for 50+ formats
  • Semantic vs fixed-size chunking
  • Metadata extraction

Embedding & Storage (Ch 3-4)

Convert text chunks to dense vector representations. Store in specialized vector databases with efficient similarity search indices.

  • OpenAI, Cohere, BGE models
  • HNSW, IVF indexing
  • Scaling to millions of vectors

Retrieval & Generation (Ch 5-7)

At query time: find relevant chunks via vector similarity + keyword search, re-rank for precision, construct an optimized prompt, generate grounded response.

  • Hybrid search strategies
  • Cross-encoder re-ranking
  • Context window management

Retrieval finds relevant chunks. But the LLM doesn't understand "relevance"; it only processes tokens. This means how you assemble those chunks into the prompt matters enormously.

Ordering matters

Important chunks buried in the middle get ignored (the "lost-in-the-middle" problem). Put the most relevant content first or last.

Formatting matters

Chunks dumped as raw text confuse the model. Adding source labels, separators, and structure helps the LLM parse context correctly.

Instruction placement matters

Where you place "answer only from context" relative to the chunks changes how strictly the model follows it.

Bad context construction destroys good retrieval

You retrieved the perfect 5 chunks. But you stuffed them unformatted between a vague system prompt and the user query. The model skips to chunk #3 (the least relevant), hallucinates details from #1, and ignores #2 entirely. In practice, context construction often determines final answer quality more than retrieval itself. Chapter 7 covers this in depth.
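
The three points above (ordering, formatting, instruction placement) can be sketched in a few lines. This is one possible layout under stated assumptions, not a canonical template: chunks arrive ranked best-first, the `[Source N: ...]` label and `---` separator are arbitrary choices, and repeating the grounding instruction after the context is a design choice to test on your own data.

```python
def order_for_context(ranked_chunks: list[dict]) -> list[dict]:
    """Place the strongest chunks at the edges of the context.

    Input is ranked best-first. Ranks 1, 3, 5, ... fill the front and
    ranks 2, 4, ... fill the back in reverse, so the least relevant
    chunks end up in the middle.
    """
    front = ranked_chunks[0::2]
    back = ranked_chunks[1::2][::-1]
    return front + back


def format_context(ranked_chunks: list[dict]) -> str:
    """Label every chunk with its source and separate chunks clearly."""
    blocks = []
    for i, chunk in enumerate(order_for_context(ranked_chunks), start=1):
        blocks.append(f"[Source {i}: {chunk['source']}]\n{chunk['text']}")
    return "\n---\n".join(blocks)


def build_prompt(question: str, ranked_chunks: list[dict]) -> str:
    """Put the grounding instruction before the context and repeat it after."""
    return (
        "Answer only from the sources below. Cite sources by number. "
        "If they do not contain the answer, say \"I don't know.\"\n\n"
        f"{format_context(ranked_chunks)}\n\n"
        f"Question: {question}\n"
        "Remember: answer only from the sources above."
    )


# Hypothetical retrieval output, ranked best-first.
ranked = [{"source": f"doc{i}.md", "text": f"chunk ranked {i}"} for i in range(1, 6)]
print(build_prompt("What is the refund policy?", ranked))
```

With five ranked chunks the context order becomes ranks 1, 3, 5, 4, 2: the best chunk opens the context, the second best closes it, and the weakest sits in the middle.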

User queries are often vague, incomplete, or poorly structured. A single word like "refund" is not a good search query. Production systems improve retrieval by transforming queries before search.

❌ Raw user query

refund

Too vague: retrieves everything mentioning "refund" with no intent clarity.

✅ Rewritten queries

What is the refund policy?

How many days for refund eligibility?

Specific and intent-clear: retrieves the right documents.

Techniques include: rewriting for clarity, expanding keywords, generating multiple search queries from one question, and using the LLM itself to reformulate. Chapters 5 and 7 cover these strategies in detail.
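
As a sketch of the "multiple search queries from one question" technique: assume the rewrites already exist (in practice an LLM would produce them) and that `search` is any function returning a ranked list of document IDs. Both names are hypothetical. The merge below is a simple round-robin interleave with deduplication, chosen for brevity; production systems often use a score-based fusion instead.

```python
from typing import Callable


def multi_query_retrieve(
    queries: list[str],
    search: Callable[[str], list[str]],
    k: int = 5,
) -> list[str]:
    """Run every rewritten query and merge the ranked results.

    Results are interleaved rank by rank (all rank-1 hits first, then
    all rank-2 hits, and so on) and duplicates are dropped.
    """
    result_lists = [search(q) for q in queries]
    longest = max((len(results) for results in result_lists), default=0)

    merged: list[str] = []
    seen: set[str] = set()
    for rank in range(longest):
        for results in result_lists:
            if rank < len(results) and results[rank] not in seen:
                seen.add(results[rank])
                merged.append(results[rank])
    return merged[:k]


# Hypothetical search backend: a lookup table standing in for a retriever.
FAKE_INDEX = {
    "What is the refund policy?": ["refund-policy.md", "returns-faq.md"],
    "How many days for refund eligibility?": ["returns-faq.md", "refund-window.md"],
}

rewrites = list(FAKE_INDEX)
print(multi_query_retrieve(rewrites, lambda q: FAKE_INDEX.get(q, [])))
```

Each rewrite contributes documents the others miss, and deduplication keeps the merged list from wasting context on repeats.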

You can build a working RAG demo in 20 lines of code. You can also watch it fail spectacularly in production. The gap between "it works on my laptop" and "it works for 1000 users on real data" is filled with engineering challenges that naive implementations ignore.

| Naive Approach | What Goes Wrong | Production Solution |
| --- | --- | --- |
| Fixed 500-token chunks | Splits mid-sentence, loses context, breaks tables | Semantic chunking, document-aware splitting |
| Top-3 vector results | Misses relevant docs, returns duplicates | Hybrid search + re-ranking + deduplication |
| Stuff all chunks in prompt | Lost-in-the-middle, context overflow, cost explosion | Compression, map-reduce, hierarchical retrieval |
| No metadata filtering | Retrieves outdated docs, wrong department's info | Metadata filters, access control, freshness ranking |
| "Answer from these docs" | Hallucinations when docs don't contain answer | "I don't know" instruction, citation enforcement |
| Embed once, never update | Stale index, deleted docs still retrieved | Incremental indexing, TTL, sync pipelines |

The "80% Accuracy Trap"

Naive RAG often achieves 80% accuracy in testing: good enough to demo, not good enough to deploy. The 20% failure cases are where users lose trust: wrong answers stated confidently, outdated information, obviously missed documents. Production RAG is about eliminating that 20%, and that requires engineering every stage of the pipeline.

RAG is not free. Each query involves multiple steps (embedding lookup, vector search, optional re-ranking, larger prompt construction), each adding latency and cost.

Latency

RAG adds roughly 100–500 ms over a direct LLM call: embedding (20–50 ms), vector search (5–20 ms), re-ranking (50–200 ms), plus slower generation from longer prompts.

Token Cost

Retrieved context adds 2K–10K tokens per query. At GPT-4o rates, that's $0.005–$0.025 per query just for context, multiplied by thousands of daily queries.

Scaling Difficulty

Naive implementations that retrieve too many chunks, skip caching, and use the most expensive model for every query become slow and expensive at scale.

The Fix

Production systems control costs by: limiting retrieved chunks (5 not 20), compressing context (extract relevant sentences), caching results (semantic cache for repeated queries), and routing simple queries to cheaper models. Chapter 10 covers production optimization in depth.
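
The token-cost figures above follow from simple arithmetic, sketched here. The input price of $2.50 per million tokens is an assumption (it is the rate implied by the $0.005–$0.025 range for 2K–10K tokens); check current pricing before relying on it. The daily query volume is also hypothetical.

```python
def context_cost_per_query(context_tokens: int, usd_per_million_input_tokens: float) -> float:
    """Cost in USD of the retrieved context alone, for one query."""
    return context_tokens / 1_000_000 * usd_per_million_input_tokens


ASSUMED_PRICE = 2.50      # USD per 1M input tokens (assumption, see above)
DAILY_QUERIES = 5_000     # hypothetical traffic

for tokens in (2_000, 10_000):
    per_query = context_cost_per_query(tokens, ASSUMED_PRICE)
    per_day = per_query * DAILY_QUERIES
    print(f"{tokens:>6} context tokens: ${per_query:.4f}/query, ${per_day:.2f}/day")
```

Cutting retrieved chunks from 20 to 5 scales `context_tokens` down by the same factor, which is why chunk limits and context compression are the first levers to pull.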

You can't improve what you can't measure. RAG quality is measured in two independent dimensions: retrieval quality (did we find the right documents?) and generation quality (did we produce a correct, grounded answer?).

Retrieval Metrics

Recall@K: Of all relevant docs, what % did we retrieve in top-K?

Precision@K: Of retrieved docs, what % are actually relevant?

MRR: Mean Reciprocal Rank, i.e. how high is the first relevant result?

NDCG: Normalized Discounted Cumulative Gain, a rank quality score

Generation Metrics

Faithfulness: Is the answer grounded in retrieved context? (no hallucination)

Answer Relevance: Does the answer address the user's question?

Context Relevance: Was the retrieved context actually useful?

RAGAS: Framework combining faithfulness + relevance metrics

Figure: The RAG quality equation; both retrieval and generation must succeed. Retrieval quality (find the right docs: Recall, Precision, MRR) × generation quality (use docs correctly: faithfulness, relevance) = RAG quality (end-to-end correctness, user satisfaction). For example, 90% retrieval × 90% generation = 81% end-to-end: the stages compound, so both must be optimized.
Evaluation Is Non-Negotiable

Production RAG systems require automated evaluation pipelines. Build a golden test set of (query, relevant_docs, expected_answer) tuples. Run retrieval metrics after every indexing change. Run generation metrics after every prompt change. Chapter 8 covers this in depth.
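
The retrieval metrics are small enough to implement directly. A minimal sketch, assuming each golden-set entry is reduced to a ranked list of retrieved document IDs plus the set of relevant IDs (the document IDs below are made up):

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of all relevant docs that appear in the top-k results."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the top-k slots that hold a relevant doc."""
    if k <= 0:
        return 0.0
    return sum(1 for doc_id in retrieved[:k] if doc_id in relevant) / k


def reciprocal_rank(retrieved: list[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant result, or 0.0 if none was retrieved."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs: list[tuple[list[str], set[str]]]) -> float:
    """Average reciprocal rank over (retrieved, relevant) pairs, one per query."""
    if not runs:
        return 0.0
    return sum(reciprocal_rank(retrieved, relevant) for retrieved, relevant in runs) / len(runs)


# Two hypothetical golden-set queries.
runs = [
    (["d3", "d1", "d7", "d2"], {"d1", "d2"}),  # first relevant hit at rank 2
    (["d9"], {"d1"}),                          # relevant doc never retrieved
]
retrieved, relevant = runs[0]
print(recall_at_k(retrieved, relevant, k=3))     # 1 of 2 relevant docs in top 3
print(precision_at_k(retrieved, relevant, k=3))  # 1 of 3 slots is relevant
print(mean_reciprocal_rank(runs))
```

Running these after every indexing change turns "retrieval feels worse" into a number that can gate a deploy.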

High retrieval accuracy does NOT guarantee good answers

If irrelevant chunks dominate the context, the model will still generate incorrect responses. A retriever with 90% recall but poor precision floods the context with noise, and the LLM faithfully summarizes that noise. Retrieval quality and generation quality must be optimized together.

RAG is powerful, but it's not a universal solution. Some problems are better solved with other approaches, and forcing RAG where it doesn't fit leads to complex systems that underperform simpler alternatives.

| Situation | Use RAG? | Better Alternative |
| --- | --- | --- |
| Domain-specific style/vocabulary | No | Fine-tuning teaches the model how to speak, not what to say |
| General knowledge questions | No | The base LLM already knows this; RAG adds latency and cost for no benefit |
| Creative writing tasks | No | RAG constrains creativity; use the model's generative capabilities directly |
| Ultra-low latency (<100ms) | Often No | Retrieval adds 50–200ms minimum; consider pre-computed responses or caching |
| Highly structured data (SQL databases) | Sometimes | Text-to-SQL may be more accurate than embedding rows as text chunks |
| Document-grounded factual Q&A | Yes | RAG is the right tool for this job |
| Knowledge that changes frequently | Yes | RAG shines when knowledge is dynamic |
| Source attribution required | Yes | RAG naturally supports citation since sources are explicit |

| Misconception | Reality | Practical Implication |
| --- | --- | --- |
| "RAG = vector search" | RAG = retrieval + augmentation + generation | Don't neglect re-ranking, context construction, and prompt engineering |
| "More chunks = better" | More noise in context = worse answers | Quality over quantity: use re-ranking to select the best 3-5 chunks |
| "Embeddings capture everything" | Embeddings miss keywords, numbers, exact matches | Hybrid search (vector + BM25) outperforms pure vector in most cases |
| "One chunking strategy fits all" | Optimal chunking depends on doc type and query type | Test different strategies on your actual data with your actual queries |
| "Set and forget" | RAG systems drift as data and queries change | Continuous evaluation, monitoring, and re-indexing are required |
| "RAG eliminates hallucination" | RAG reduces but doesn't eliminate hallucination | Use citation prompting, faithfulness scoring, and "I don't know" instructions |

Chapter 01: Key Takeaways

  • RAG = Retrieval-Augmented Generation: inject relevant external documents into LLM context at inference time
  • RAG solves three LLM problems: knowledge cutoff, private data access, and hallucination grounding
  • RAG vs Fine-tuning: RAG changes what the model sees (context), fine-tuning changes how it behaves (weights)
  • The 6-stage pipeline: Ingestion → Chunking → Embedding → Storage → Retrieval → Generation
  • Naive RAG (~80% accuracy) is not production RAG: the 20% failure cases destroy user trust
  • Measure both retrieval quality (did we find the right docs?) and generation quality (did we answer correctly?)
  • RAG is NOT always the answer: consider fine-tuning for style, base LLM for general knowledge, SQL for structured data

Chapter 02 · Data Preparation
Data Ingestion & Chunking: Breaking Documents Into Retrievable Units

Chunking is where most RAG systems silently fail. A bad chunking strategy doesn't throw errors; it just returns irrelevant results that the LLM confidently uses to generate wrong answers. Get chunking wrong, and nothing downstream can fix it.

Before you can retrieve, you need to ingest. The ingestion pipeline transforms raw documents (PDFs, web pages, Notion exports, database dumps) into indexed, searchable chunks. Each step has failure modes that propagate downstream.

Figure: The ingestion pipeline; each stage has decisions that affect retrieval quality. Sources (PDFs, HTML, Markdown, docs, APIs, DBs, Notion, Confluence) → ① Load (document loaders, parse format; 50+ formats) → ② Clean (remove boilerplate, normalize text) → ③ Chunk (split into pieces, size + strategy; the critical stage) → ④ Enrich (add metadata, extract entities: source, date, tags) → ⑤ Embed (text → vectors; Chapter 3). Bad decisions compound: a PDF loader that misses tables → chunks with gibberish → embeddings for nonsense → irrelevant retrieval. Test each stage independently before building the full pipeline.

Your knowledge lives in dozens of formats. Each format has quirks that affect text extraction quality. Use the right loader for each source, and always validate output before proceeding.

| Format | Recommended Loader | Watch Out For | Quality |
| --- | --- | --- | --- |
| PDF | pypdf, pdfplumber, unstructured | Scanned PDFs need OCR; tables often parse badly | Variable |
| HTML / Web | BeautifulSoup, trafilatura, unstructured | Nav/footer pollution; JavaScript-rendered content | Good |
| Markdown | Native text, MarkdownLoader | Code blocks need special handling; images are lost | Excellent |
| Word / DOCX | python-docx, unstructured | Embedded images, track changes, comments | Good |
| PowerPoint | python-pptx, unstructured | Layout is lost; speaker notes often missed | Fair |
| Notion | Notion API, notion-to-md | Nested blocks, databases need flattening | Good |
| Confluence | Confluence REST API, Atlassian SDK | Macros, embeds, permissions filtering | Fair |
| SQL Database | SQLAlchemy, custom extractors | Schema matters more than raw data; denormalize first | Good |

The PDF Problem

PDFs are the worst format for RAG. They're designed for printing, not parsing. A table in a PDF might render as "Column1 Column2 Row1Val1 Row1Val2 Row2Val1...", which is semantically meaningless. Solutions: use pdfplumber for tables, run OCR on scanned docs (Tesseract, Azure Doc Intelligence), or convert to Markdown before chunking.

Production pattern: Unified document loading with LangChain

    from pathlib import Path

    from langchain_core.documents import Document
    from langchain_community.document_loaders import (
        PyPDFLoader,
        UnstructuredHTMLLoader,
        TextLoader,
        NotionDirectoryLoader,
        ConfluenceLoader,
    )

    # Extension -> loader class. The Notion and Confluence loaders need their
    # own arguments (an export directory, API credentials), so they are not
    # part of this path-based dispatch.
    LOADERS = {
        ".pdf": PyPDFLoader,
        ".html": UnstructuredHTMLLoader,
        ".txt": TextLoader,
        ".md": TextLoader,
    }

    def load_document(path: str) -> list[Document]:
        """Load any supported format, auto-detected by file extension."""
        ext = Path(path).suffix.lower()
        loader_cls = LOADERS.get(ext)
        if loader_cls is None:
            raise ValueError(f"Unsupported format: {ext}")
        return loader_cls(path).load()

Chunking determines what your retriever can find. Too large, and irrelevant content dilutes the signal. Too small, and context is lost. The right strategy depends on your document type and query patterns.

Fixed-Size Chunking

Split by character/token count with overlap. Simple, predictable, format-agnostic.

  • Chunk size: 500–1000 tokens
  • Overlap: 50–200 tokens (10–20%)
  • Pro: Easy to implement
  • Con: Splits mid-sentence/idea

Recursive Character Splitting

Try splitting by paragraph, then sentence, then character. Preserves document structure better.

  • Separators: ["\n\n", "\n", ". ", " "]
  • Respects natural boundaries
  • Pro: Better semantic units
  • Con: Chunk sizes vary

Semantic Chunking

Use embedding similarity to find natural breakpoints. Split where meaning shifts.

  • Compute sentence embeddings
  • Split at similarity drops
  • Pro: Meaning-preserving
  • Con: 10–100× slower, more complex

Figure: How different chunking strategies handle the same document. The original document has four semantic sections (introduction, main concept, related subtopic, conclusion). ① Fixed-size (200 chars) splits mid-word ("Introduction paragraph abo" / "ut the topic. ..."). ② Recursive (by paragraph, then sentence) respects sentence boundaries. ③ Semantic (by meaning shifts) keeps meaning groups together.

| Strategy | Best For | Chunk Size | Speed | Quality |
| --- | --- | --- | --- | --- |
| Fixed-size | Uniform docs, quick prototyping | 512–1024 tokens | Fast | Fair |
| Recursive character | General purpose, most use cases | 500–1000 tokens | Fast | Good |
| Semantic | High-value docs, precision critical | Variable | Slow | Excellent |
| Document-aware (markdown headers) | Structured docs with clear sections | Section-based | Medium | Excellent |
| Sentence-window | Dense technical content | 3–5 sentences | Fast | Good |
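
The first two strategies can be sketched without any library. This is a simplified illustration, not the implementation used by LangChain or any other splitter: tokens are approximated by whitespace-separated words, sizes are tiny so the behaviour is visible, and the recursive splitter drops the separator at each cut (so a chunk cut at ". " loses its trailing period).

```python
def fixed_size_chunks(tokens: list[str], size: int, overlap: int) -> list[list[str]]:
    """Split a token list into windows of `size` that share `overlap` tokens."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):  # last window reached the end
            break
    return chunks


def recursive_split(text: str, max_chars: int,
                    separators: tuple[str, ...] = ("\n\n", "\n", ". ", " ")) -> list[str]:
    """Split on the coarsest separator that occurs, recursing on oversized parts."""
    if len(text) <= max_chars:
        return [text] if text.strip() else []
    for i, sep in enumerate(separators):
        if sep not in text:
            continue
        chunks, current = [], ""
        for part in text.split(sep):
            candidate = part if not current else current + sep + part
            if len(candidate) <= max_chars:
                current = candidate  # keep packing parts into this chunk
                continue
            if current:
                chunks.append(current)
            if len(part) > max_chars:
                # Part is still too big: retry with the finer separators.
                chunks.extend(recursive_split(part, max_chars, separators[i + 1:]))
                current = ""
            else:
                current = part
        if current:
            chunks.append(current)
        return chunks
    # No separator left: fall back to a hard character split.
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]


words = [f"w{i}" for i in range(10)]
print(fixed_size_chunks(words, size=4, overlap=1))

doc = "Para one is short.\n\nPara two has a first sentence. And a second sentence."
print(recursive_split(doc, max_chars=40))
```

The fixed-size version cuts wherever the count lands, while the recursive version only cuts at paragraph or sentence boundaries here, which is the difference the table above summarizes as "Fair" versus "Good".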

Chunk size is a tradeoff between precision and context. There's no universal answer: it depends on query type, document structure, and embedding model capabilities.

Smaller Chunks (100–300 tokens)

✓ Pro: Higher precision for specific queries

✓ Pro: Less noise in retrieved context

✓ Pro: Better for exact match questions

✗ Con: May lose surrounding context

✗ Con: More chunks = more vectors = higher cost

Best for: FAQ, definitions, code snippets

Larger Chunks (500–1500 tokens)

✓ Pro: Preserves full context

✓ Pro: Better for complex reasoning

✓ Pro: Fewer vectors to store/search

✗ Con: May include irrelevant content

✗ Con: Lower precision for specific queries

Best for: Analysis, summaries, narratives

The Empirical Answer

Don't guess; test. Create 50–100 (query, expected_doc) pairs from your actual data. Run retrieval with chunk sizes 256, 512, 1024, 2048. Measure Recall@5. The winner varies by dataset: we've seen 256 win for support tickets and 1024 win for research papers. Your optimal size is the one that maximizes recall on your queries.

Figure: Chunk size vs. retrieval quality, a typical tradeoff curve. X-axis: chunk size in tokens (128 to 4096). Y-axis: retrieval quality (0% to 100%). Curves: precision and recall. The sweet spot varies by dataset.

Production RAG systems often use more sophisticated chunking patterns that decouple what you search from what you retrieve. These add complexity but can significantly improve quality.

Parent-Child Chunking

Store small chunks for precise matching, but retrieve their larger parent for context.

    # Search: small chunk (256 tokens)
    "The refund policy is 30 days"

    # Retrieve: parent chunk (1024 tokens)
    "RETURNS AND REFUNDS\n\nThe refund policy is 30 days from purchase date. To initiate a refund, contact support with your order number. Refunds are processed within 5–7 business days..."

  • Best of both worlds: precision + context
  • Requires document ID linking

Sentence-Window

Embed individual sentences, but retrieve surrounding sentences as context.

    # Index: single sentence
    "The API rate limit is 100 req/min."

    # Retrieve: sentence + window (±2)
    "Our API uses rate limiting to ensure fair usage. The API rate limit is 100 req/min. Exceeding this triggers a 429 error. Use exponential backoff..."

  • Very precise matching
  • More setup complexity
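
Both patterns reduce to the same bookkeeping: index a small unit, keep a pointer back to the larger unit, and return the larger unit at query time. A minimal in-memory sketch, assuming word-based child chunks and pre-split sentences (the matching step itself, normally a vector search, is left out):

```python
def build_parent_child_index(parents: dict[str, str], child_size: int) -> list[dict]:
    """Split each parent into small word-based children that keep a parent_id link."""
    index = []
    for parent_id, text in parents.items():
        words = text.split()
        for start in range(0, len(words), child_size):
            index.append({
                "parent_id": parent_id,
                "text": " ".join(words[start:start + child_size]),
            })
    return index


def retrieve_parent(matched_child: dict, parents: dict[str, str]) -> str:
    """Search matched a small child; hand the LLM the full parent instead."""
    return parents[matched_child["parent_id"]]


def sentence_window(sentences: list[str], hit_index: int, window: int = 2) -> str:
    """Return the matched sentence plus up to `window` neighbours on each side."""
    lo = max(0, hit_index - window)
    hi = min(len(sentences), hit_index + window + 1)
    return " ".join(sentences[lo:hi])


# Parent-child: the child at position 1 matched, the whole parent is returned.
parents = {"doc_1": "The refund policy is 30 days from purchase date. Contact support to start."}
children = build_parent_child_index(parents, child_size=5)
print(retrieve_parent(children[1], parents))

# Sentence-window: sentence 1 matched, its neighbours come along.
sentences = [
    "Our API uses rate limiting to ensure fair usage.",
    "The API rate limit is 100 req/min.",
    "Exceeding this triggers a 429 error.",
    "Use exponential backoff.",
]
print(sentence_window(sentences, hit_index=1, window=1))
```

The `parent_id` field is the "document ID linking" cost mentioned above: every child must carry it, and the parent store must stay in sync with the child index.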
Document-aware chunking for Markdown

    from langchain.text_splitter import MarkdownHeaderTextSplitter

    headers_to_split = [
        ("#", "h1"),
        ("##", "h2"),
        ("###", "h3"),
    ]

    splitter = MarkdownHeaderTextSplitter(
        headers_to_split_on=headers_to_split,
        strip_headers=False,  # Keep headers in chunk
    )

    chunks = splitter.split_text(markdown_doc)
    # Each chunk includes its header hierarchy as metadata
    # {"h1": "User Guide", "h2": "Authentication", "h3": "API Keys"}

Headers become metadata: filter by section, include hierarchy in context, maintain document structure.

Metadata enables filtering before semantic search, dramatically improving precision. Every chunk should carry metadata that answers: where did this come from, when, and who can see it?

| Metadata Field | Example Values | Use Case |
| --- | --- | --- |
| source | pricing-faq.pdf, api-docs/auth.md | Filter by document, show citations |
| date_created | 2026-04-01 | Freshness ranking, exclude outdated |
| date_modified | 2026-04-20 | Sync detection, re-indexing triggers |
| department | engineering, sales, legal | Access control, relevance filtering |
| doc_type | faq, policy, tutorial, reference | Match intent to doc type |
| chunk_index | 0, 1, 2, ... | Reconstruct document order |
| parent_id | doc_123 | Link chunks to parent documents |
| section_title | Authentication > API Keys | Hierarchical filtering, better context |

Pre-filter, Don't Post-filter

Vector databases support metadata filtering during search, not after. Use it: filter={"department": "engineering", "date_modified": {"$gte": "2026-01-01"}}. Filtering before search is vastly more efficient than retrieving 100 results and filtering to 5.
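
The semantics of that filter can be shown with a small in-memory sketch. It is not how a vector database executes filters (real engines apply them inside the index), and it only supports equality plus the `$gte` operator from the example above; the chunk data and vectors are invented. ISO-8601 date strings are compared as plain strings, which works because they sort chronologically.

```python
def matches(metadata: dict, flt: dict) -> bool:
    """True if the metadata satisfies every condition in the filter."""
    for field, condition in flt.items():
        value = metadata.get(field)
        if isinstance(condition, dict):
            # Range condition; only "$gte" is handled in this sketch.
            if "$gte" in condition and (value is None or value < condition["$gte"]):
                return False
        elif value != condition:
            return False
    return True


def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def filtered_search(chunks: list[dict], query_vec: list[float], flt: dict, k: int) -> list[dict]:
    """Pre-filter by metadata, then rank only the surviving chunks by similarity."""
    candidates = [c for c in chunks if matches(c["metadata"], flt)]
    candidates.sort(key=lambda c: dot(c["vector"], query_vec), reverse=True)
    return candidates[:k]


chunks = [
    {"id": "a", "vector": [1.0, 0.0], "metadata": {"department": "engineering", "date_modified": "2026-02-01"}},
    {"id": "b", "vector": [1.0, 0.0], "metadata": {"department": "sales", "date_modified": "2026-03-01"}},
    {"id": "c", "vector": [1.0, 0.0], "metadata": {"department": "engineering", "date_modified": "2025-06-01"}},
    {"id": "d", "vector": [0.5, 0.5], "metadata": {"department": "engineering", "date_modified": "2026-04-20"}},
]
flt = {"department": "engineering", "date_modified": {"$gte": "2026-01-01"}}
print([c["id"] for c in filtered_search(chunks, [1.0, 0.0], flt, k=5)])
```

Chunks `b` and `c` score as well as `a` on similarity alone, so a post-filter over a small top-k could easily return nothing useful; filtering first guarantees every ranked candidate is eligible.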

Chapter 02: Key Takeaways

  • The ingestion pipeline: Load → Clean → Chunk → Enrich → Embed, and each stage has failure modes
  • PDFs are problematic: use specialized loaders (pdfplumber) and validate extraction quality
  • Chunking strategies: fixed-size (simple), recursive (general purpose), semantic (highest quality)
  • Chunk size is a tradeoff: smaller = higher precision, larger = better context, so test on your data
  • Advanced patterns: parent-child (search small, retrieve large) and sentence-window (precise + context)
  • Metadata enables pre-filtering: extract source, date, department, doc_type at ingestion time
  • Document-aware chunking (Markdown headers) preserves structure and enables hierarchical filtering

Chapter 03 · Representation
Embeddings & Representation: Turning Text Into Vectors

Embeddings are how machines "understand" text. A good embedding model compresses semantic meaning into a vector such that similar meanings are close together in vector space. The choice of embedding model fundamentally determines what your RAG system can and cannot retrieve.

An embedding is a fixed-length vector (array of numbers) that represents the semantic meaning of text. Embedding models are trained so that semantically similar text produces similar vectors, enabling "meaning-based" search rather than keyword matching.

Figure: How embeddings enable semantic search; similar meanings cluster together. Text inputs ("How do I reset my password?", "I forgot my login credentials", "What's your refund policy?", "What are the pricing tiers?") pass through an embedding model (text-embedding-3, 768–3072 dims) into vector space. "reset password" and "forgot credentials" land a small distance apart in a login cluster, a large distance from "refund policy" and "pricing tiers". Cosine similarity between vectors = semantic similarity between texts.

Why This Works

Embedding models learn from billions of text pairs. During training, they learn that "password reset" and "forgot credentials" appear in similar contexts, so they map to nearby vectors. At search time, we embed the query and find the closest document vectors. No keyword matching required.
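
The "find the closest document vectors" step is a cosine-similarity ranking. A minimal sketch with hand-made 3-dimensional vectors (invented for illustration; real embeddings come from a model and have hundreds or thousands of dimensions):

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """cos(a, b) = (a . b) / (|a| * |b|); returns 0.0 for a zero vector."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def nearest(query_vec: list[float], doc_vecs: dict[str, list[float]], k: int = 1) -> list[str]:
    """Return the k document keys whose vectors are most similar to the query."""
    ranked = sorted(doc_vecs, key=lambda key: cosine_similarity(query_vec, doc_vecs[key]), reverse=True)
    return ranked[:k]


# Toy vectors: the two login-related texts point in a similar direction.
doc_vecs = {
    "I forgot my login credentials": [0.9, 0.1, 0.0],
    "What's your refund policy?": [0.0, 0.9, 0.2],
    "What are the pricing tiers?": [0.1, 0.3, 0.9],
}
query_vec = [1.0, 0.2, 0.0]  # stands in for "How do I reset my password?"
print(nearest(query_vec, doc_vecs, k=1))
```

Because cosine similarity depends only on direction, a long document and a short query can still score as close neighbours if they point the same way in the embedding space.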

There are dozens of embedding models available. They differ in quality (MTEB benchmarks), dimensionality (affects storage/speed), cost, and whether they can run locally. Here are the ones that matter in 2026:

| Model | Dimensions | MTEB Avg | Cost | Where | Best For |
| --- | --- | --- | --- | --- | --- |
| OpenAI text-embedding-3-large | 3072 (or 256–1536) | 64.6 | $0.13/1M tokens | API | Production default, scalable |
| OpenAI text-embedding-3-small | 1536 (or 256–512) | 62.3 | $0.02/1M tokens | API | Budget production |
| Cohere embed-v3 | 1024 | 64.5 | $0.10/1M tokens | API | Multilingual, input types |
| Voyage-2 | 1024 | 65.4 | $0.10/1M tokens | API | Legal, code, finance |
| BGE-large-en-v1.5 | 1024 | 64.2 | Free | Local | Self-hosted, privacy |
| E5-mistral-7b-instruct | 4096 | 66.6 | Free | Local (GPU) | Highest quality OSS |
| GTE-Qwen2-7B-instruct | 3584 | 67.2 | Free | Local (GPU) | SOTA open-source |
| all-MiniLM-L6-v2 | 384 | 56.3 | Free | Local (CPU) | Prototyping only |

Production (API)

Use OpenAI text-embedding-3-large or Cohere embed-v3. Battle-tested, no GPU infra needed, good latency. Cost is usually negligible vs LLM costs.

Privacy-Required

Use BGE-large or E5-mistral. Run locally, no data leaves your infrastructure. BGE runs on CPU; E5/GTE need GPU but are higher quality.

Multilingual

Use Cohere embed-v3 (100+ languages) or multilingual-e5-large. Don't assume English models work for other languages; they don't.

Higher dimensions capture more nuance but cost more to store and search. OpenAI's embedding-3 models support dimension reduction via Matryoshka Representation Learning: you can truncate vectors without retraining.

Low Dimensions (256–512)

✓ Storage: 1M vectors × 512 dims × 4 bytes = 2 GB

✓ Speed: Faster similarity calculations

✓ Cost: Lower vector DB costs

✗ Quality: May lose subtle distinctions

Best for: Large scale (>10M docs), cost-sensitive

High Dimensions (1536–3072)

✓ Quality: Captures nuanced meaning

✓ Precision: Better for similar documents

✗ Storage: 1M × 3072 × 4 bytes = 12 GB

✗ Speed: Slower search (but still fast)

Best for: High precision needs, <1M docs
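
The storage figures above are raw float32 vector sizes and can be reproduced with one line of arithmetic. The sketch below counts only the vectors themselves; index structures such as HNSW graphs and stored metadata add overhead on top, which is not modelled here.

```python
def raw_vector_storage_gb(num_vectors: int, dims: int, bytes_per_value: int = 4) -> float:
    """Raw storage in GB (10^9 bytes) for float32 vectors, excluding index overhead."""
    return num_vectors * dims * bytes_per_value / 1e9


for dims in (256, 512, 1536, 3072):
    print(f"1M vectors x {dims:>4} dims: {raw_vector_storage_gb(1_000_000, dims):.2f} GB")
```

Going from 3072 to 256 dimensions divides the raw footprint by 12, which is where the storage saving from Matryoshka truncation comes from.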

Dimension reduction with OpenAI embedding-3

    from openai import OpenAI

    client = OpenAI()

    # Full dimensions (3072)
    response = client.embeddings.create(
        model="text-embedding-3-large",
        input="Your text here",
    )
    full_vector = response.data[0].embedding  # 3072 dims

    # Reduced dimensions (256), via API parameter
    response = client.embeddings.create(
        model="text-embedding-3-large",
        input="Your text here",
        dimensions=256,  # Matryoshka truncation
    )
    small_vector = response.data[0].embedding  # 256 dims

    # Same cost, 12× less storage, ~2–5% quality drop

Queries and documents are fundamentally different: queries are short questions, documents are long answers. Some embedding models handle this asymmetry explicitly, and they perform significantly better for retrieval tasks.

❌ Symmetric embedding (naive)

Same embedding model, same prompt for queries and documents. Works okay, but not optimal.

    # Same embedding for both:
    embed("What is the refund policy?")
    embed("Our refund policy allows...")
    # May not align well

✅ Asymmetric embedding (better)

Different instruction prefixes for queries vs documents. Models trained for this (E5, BGE, Cohere) perform ~5–10% better.

    # E5 format:
    embed("query: What is the refund policy?")
    embed("passage: Our refund policy allows...")

    # Cohere input_type parameter:
    embed(text, input_type="search_query")
    embed(text, input_type="search_document")

Always Check Documentation

Each model has its own conventions. BGE (English v1.5) prepends the instruction "Represent this sentence for searching relevant passages: " to queries only, with no prefix on passages. E5 uses "query: " and "passage: ". Cohere uses an API parameter. Using the wrong format can reduce retrieval quality by 10–15%, so read the model card.

| Practice | Why | Implementation |
| --- | --- | --- |
| Batch your API calls | 50–100× faster than one-by-one | Send up to 2048 texts per API call (OpenAI limit) |
| Cache embeddings | Don't re-embed unchanged docs | Hash document content, store in cache, skip if exists |
| Normalize vectors | Required when the index uses dot product as a stand-in for cosine similarity | Most APIs return normalized vectors, but verify |
| Same model everywhere | Don't mix models | Query and doc embeddings must use the same model |
| Prepend chunk metadata | Context helps embedding quality | "Title: X \| Section: Y \| Content: Z" |
| Handle max length | Models truncate silently | Check model's max tokens (usually 512–8192) |

Production embedding pipeline

    import hashlib

    from openai import OpenAI

    client = OpenAI()
    BATCH_SIZE = 100
    cache = {}  # In production: Redis or similar

    def get_content_hash(text: str) -> str:
        # MD5 is used only as a cache key here, not for security.
        return hashlib.md5(text.encode()).hexdigest()

    def embed_batch(texts: list[str]) -> list[list[float]]:
        """Embed with caching and batching."""
        results = [None] * len(texts)
        to_embed = []  # (index, text, hash) triples

        # Check cache
        for i, text in enumerate(texts):
            h = get_content_hash(text)
            if h in cache:
                results[i] = cache[h]
            else:
                to_embed.append((i, text, h))

        # Batch embed uncached
        for batch_start in range(0, len(to_embed), BATCH_SIZE):
            batch = to_embed[batch_start:batch_start + BATCH_SIZE]
            response = client.embeddings.create(
                model="text-embedding-3-large",
                input=[t[1] for t in batch],
            )
            for j, emb in enumerate(response.data):
                idx, _, h = batch[j]
                results[idx] = emb.embedding
                cache[h] = emb.embedding  # Cache for next time

        return results

General-purpose embeddings work well for most text. But for specialized domains (legal, medical, code), fine-tuning on your data can improve retrieval quality significantly: 10–30% gains are common.

✅ Fine-tune embeddings when...

• Domain has specialized vocabulary (medical, legal, code)
• General embeddings fail on your evaluation set
• You have (query, relevant_doc) pairs for training
• Retrieval quality is critical for production
• You can afford retraining when the base model updates

❌ Don't fine-tune when...

• General-purpose embeddings work well enough
• You don't have labeled training data
• You need to iterate quickly (fine-tuning is slow)
• Domain is general knowledge / common English
• Using an API-only model (can't fine-tune OpenAI embeds)

How to Fine-tune

Use sentence-transformers with a contrastive (triplet) loss. You need (anchor, positive, negative) triplets: the anchor is a query, the positive is a relevant doc, and the negative is an irrelevant doc. Train for 1–3 epochs with a learning rate of 2e-5. Evaluate on a held-out test set; if Recall@5 improves by more than 5%, deploy the fine-tuned model.

Fine-tuning Method | Training Data Needed | Quality Gain | Effort
Contrastive fine-tuning | 1K–10K (query, doc) pairs | +10–30% recall | Medium
Matryoshka fine-tuning | 1K–10K pairs | Maintains quality at low dims | Medium
Adapter layers (LoRA) | 500–2K pairs | +5–15% recall | Low
Hard negative mining | Requires iterative labeling | +15–25% over random negatives | High

MTEB (Massive Text Embedding Benchmark) is the industry standard for comparing embedding models. It evaluates models across 56+ datasets in retrieval, classification, clustering, and more. But MTEB alone isn't enough: you must test on your own data.

MTEB Leaderboard

Check huggingface.co/spaces/mteb/leaderboard for current rankings. Focus on the Retrieval category for RAG use cases.

Your Own Eval Set

Create 50–100 (query, relevant_docs) pairs from your actual data. Measure Recall@5 and MRR. The model that wins on MTEB may not win on your domain.

Compare Fairly

Same chunking, same index, same queries. Only change the embedding model. Run 3 times and average the results. Statistical significance matters: with only 50–100 queries, a difference of a point or two may be noise.
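The two metrics named above, Recall@5 and MRR, are simple to compute once each query has a list of retrieved IDs and a set of relevant IDs. A minimal sketch, assuming document IDs are strings; the function names are illustrative.

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int = 5) -> float:
    """Fraction of the relevant docs that appear in the top-k results."""
    if not relevant:
        raise ValueError("relevant set must not be empty")
    hits = len(set(retrieved[:k]) & relevant)
    return hits / len(relevant)

def reciprocal_rank(retrieved: list[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant result, or 0.0 if none was retrieved."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0

def evaluate(runs: list[tuple[list[str], set[str]]], k: int = 5) -> dict[str, float]:
    """Average Recall@k and MRR over (retrieved_ids, relevant_ids) pairs."""
    n = len(runs)
    return {
        f"recall@{k}": sum(recall_at_k(r, rel, k) for r, rel in runs) / n,
        "mrr": sum(reciprocal_rank(r, rel) for r, rel in runs) / n,
    }
```

To compare two embedding models, run `evaluate` once per model on the same eval set and compare the two dictionaries.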

MTEB Hacking

Some models are effectively "trained on the test set": they've seen MTEB datasets during development. Their MTEB scores look great, but they don't generalize. Always validate on your own held-out data before committing to a model in production.

∑ Chapter 03 — Key Takeaways

  • Embeddings turn text into vectors where similar meanings are close together, enabling semantic search
  • Top models: OpenAI text-embedding-3-large (API), BGE-large (local), Cohere embed-v3 (multilingual)
  • Dimensionality tradeoff: higher = better quality, lower = cheaper storage; use Matryoshka truncation to choose
  • Use asymmetric embeddings: different prefixes for queries vs documents (E5: "query:", "passage:")
  • Best practices: batch API calls, cache embeddings, prepend metadata to chunks
  • Fine-tune for specialized domains: 10–30% gains possible with 1K+ training pairs
  • MTEB is useful but not definitive; test on your own data before choosing a production model
04
Chapter 04 · Infrastructure
Vector Storage & Indexing: Storing and Searching at Scale

You've got embeddings. Now where do you put them? Vector databases are purpose-built for storing millions of vectors and finding the nearest neighbors in milliseconds. The choice of database and index type determines your latency, accuracy, cost, and operational complexity.

You could store vectors in PostgreSQL as plain arrays. But with 10 million 1536-dimensional vectors, an exact search compares the query against every one of them (roughly 15 billion multiply-adds per query), which is far too slow for interactive use. Vector databases use specialized indices that trade perfect accuracy for 100–1000× faster search.

Brute-force search vs. ANN (Approximate Nearest Neighbor) search
[Figure: Brute force (exact) compares the query to all 10M vectors: O(n) = 10M comparisons, slow. ANN index (approximate) jumps to a nearby cluster and searches ~1000 candidates: sublinear, fast.]
The ANN Tradeoff

ANN is approximate: it might miss the actual closest neighbor and return the 2nd or 3rd closest instead. In practice, with good tuning, you get 95–99% recall (on average, 95–99% of the true top-k neighbors are returned) at 100× the speed. For RAG, this tradeoff is almost always worth it.
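ANN recall is measured against an exact search over the same data. A minimal sketch in plain Python; `exact_top_k` and `ann_recall` are illustrative names, and the approximate IDs would come from whatever index you are tuning.

```python
def exact_top_k(query, vectors, k):
    """Brute-force search: score every vector, return indices of the k best."""
    scores = [sum(q * v for q, v in zip(query, vec)) for vec in vectors]
    order = sorted(range(len(vectors)), key=lambda i: scores[i], reverse=True)
    return order[:k]

def ann_recall(exact_ids, approx_ids):
    """Fraction of the true top-k that the approximate index also returned."""
    if not exact_ids:
        raise ValueError("exact_ids must not be empty")
    return len(set(exact_ids) & set(approx_ids)) / len(exact_ids)
```

Averaging `ann_recall` over a few hundred sample queries gives the recall figure to watch while adjusting index parameters.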

The vector database market has exploded. Here's an honest comparison of the major options. There's no single "best" choice, only tradeoffs for your use case.

Database | Type | Best For | Scale | Cost | Complexity
Pinecone | Managed | Production, serverless, no ops | Billions | $70+/mo | Very low
Qdrant | Both | Self-hosted + cloud, flexible | Billions | Free–$$$ | Medium
Weaviate | Both | Built-in ML, hybrid search | Billions | Free–$$$ | Medium
Milvus | Self-hosted | Massive scale, on-prem | Billions+ | Free (infra) | High
pgvector | Extension | Already using Postgres | ~5M vectors | Free | Low
Chroma | Embedded | Prototyping, local dev | ~1M vectors | Free | Very low
FAISS | Library | Research, custom pipelines | Billions | Free | High

Start Here: Managed

Use Pinecone or Qdrant Cloud to start. Zero ops, scales automatically, free tiers available. Focus on your RAG logic, not infrastructure.

Already on Postgres?

Use pgvector. No new infrastructure, same ops model, works up to ~5M vectors. Beyond that, consider a dedicated vector DB.

Privacy/On-Prem Required

Self-host Qdrant or Milvus. Both are production-ready. Qdrant is simpler; Milvus scales larger but needs more ops work.

ANN search works by building an index: a data structure that lets the search skip most vectors. The two dominant index types are HNSW and IVF. Most modern vector databases default to HNSW.

HNSW (Hierarchical Navigable Small World)

How it works: Builds a multi-layer graph where each node connects to nearby neighbors. Search starts at the top layer and descends to find the approximate nearest neighbors.

✓ Pros: Fast search (1–10ms), high recall, no training needed, incrementally updatable

✗ Cons: High memory usage (stores graph edges), slow build time for very large datasets

Best for: Most RAG use cases, <100M vectors

IVF (Inverted File Index)

How it works: Clusters vectors into buckets around centroids learned via k-means. Search probes only the buckets nearest the query.

✓ Pros: Lower memory, compresses well with PQ (product quantization), good for very large scale

✗ Cons: Requires a training phase, slower search than HNSW, updating is expensive

Best for: 100M+ vectors, memory-constrained, batch-build scenarios

HNSW index structure: multi-layer graph for fast navigation
[Figure: three stacked graph layers, Layer 2, Layer 1, and Layer 0 (all vectors). Search path: start at the top layer, descend to find nearest neighbors.]

Parameter | What It Controls | Higher Value | Lower Value
M (HNSW) | Edges per node | Better recall, more memory | Less memory, lower recall
ef_construction | Build-time search width | Better index quality, slower build | Faster build, lower quality
ef_search | Query-time search width | Better recall, slower queries | Faster queries, lower recall
nlist (IVF) | Number of clusters | More granular, slower build | Faster build, coarser search
nprobe (IVF) | Clusters to search | Better recall, slower queries | Faster queries, lower recall
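To make the `nlist` / `nprobe` rows concrete, here is a toy IVF index in plain Python. It is a teaching sketch, not how a real library implements IVF: the `ToyIVF` class is hypothetical, and the centroids are passed in by hand where a real index would learn them with k-means.

```python
def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

class ToyIVF:
    """Toy inverted-file index; nlist is the number of centroids supplied."""

    def __init__(self, centroids):
        self.centroids = centroids
        self.buckets = [[] for _ in centroids]  # one bucket per centroid

    def _nearest_centroids(self, vec, n):
        order = sorted(range(len(self.centroids)),
                       key=lambda c: sq_dist(vec, self.centroids[c]))
        return order[:n]

    def add(self, doc_id, vec):
        # Each vector is stored in the bucket of its closest centroid
        bucket = self._nearest_centroids(vec, 1)[0]
        self.buckets[bucket].append((doc_id, vec))

    def search(self, query, k, nprobe=1):
        # Only vectors in the nprobe closest buckets are compared
        candidates = []
        for c in self._nearest_centroids(query, nprobe):
            candidates.extend(self.buckets[c])
        candidates.sort(key=lambda item: sq_dist(query, item[1]))
        return [doc_id for doc_id, _ in candidates[:k]]
```

A query that falls near a bucket boundary can miss true neighbors stored in the adjacent bucket when `nprobe=1`; raising `nprobe` recovers them at the cost of scanning more candidates, which is the recall/latency tradeoff the table describes.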

Here's how to set up the most common vector databases. The Pinecone and Qdrant examples use Python, the pgvector example is plain SQL, and all of them store 1536-dimensional embeddings with metadata.

Pinecone (managed, serverless)

from pinecone import Pinecone, ServerlessSpec

pc = Pinecone(api_key="YOUR_KEY")

# Create index
pc.create_index(
    name="my-rag-index",
    dimension=1536,
    metric="cosine",
    spec=ServerlessSpec(cloud="aws", region="us-east-1"),
)

# Upsert vectors with metadata
index = pc.Index("my-rag-index")
index.upsert(vectors=[
    {
        "id": "doc1",
        "values": embedding_vector,  # list of 1536 floats
        "metadata": {"source": "faq.pdf", "date": "2026-01"},
    }
])

Qdrant (self-hosted or cloud)

from qdrant_client import QdrantClient
from qdrant_client.models import Distance, PointStruct, VectorParams

client = QdrantClient("localhost", port=6333)

# Create collection
client.create_collection(
    collection_name="my-rag",
    vectors_config=VectorParams(size=1536, distance=Distance.COSINE),
)

# Upsert with payload (metadata)
client.upsert(
    collection_name="my-rag",
    points=[
        PointStruct(id=1, vector=embedding_vector, payload={"source": "faq.pdf"})
    ],
)

pgvector (PostgreSQL extension)

-- Enable extension
CREATE EXTENSION IF NOT EXISTS vector;

-- Create table with vector column
CREATE TABLE documents (
    id SERIAL PRIMARY KEY,
    content TEXT,
    embedding vector(1536),  -- pgvector type
    source TEXT,
    created_at TIMESTAMP
);

-- Create HNSW index for fast search
CREATE INDEX ON documents
USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 64);

-- Query: find 5 nearest neighbors
-- (<=> is cosine distance, so similarity = 1 - distance)
SELECT id, content, source,
       1 - (embedding <=> '[0.1, 0.2, ...]') AS similarity
FROM documents
ORDER BY embedding <=> '[0.1, 0.2, ...]'
LIMIT 5;

Vector databases scale differently than traditional databases. Memory is usually the bottleneck, not disk. Here's what to expect (memory figures cover raw float32 vectors only; an HNSW graph adds overhead on top):

Scale | Vectors | Memory (1536d) | Typical Cost | Latency (p99)
Small | <100K | <600 MB | Free tier / $20/mo | <10ms
Medium | 100K–1M | 600 MB–6 GB | $50–200/mo | <20ms
Large | 1M–10M | 6–60 GB | $200–500/mo | 20–50ms
Very Large | 10M–100M | 60–600 GB | $500–2000/mo | 50–100ms
Massive | >100M | >600 GB | $2000+/mo + ops | 100ms+ (sharded)

Cost-Saving Strategies

1. Reduce dimensions: 256d instead of 1536d = 6× less memory.
2. Quantization: Store int8 instead of float32 = 4× less memory (some quality loss).
3. Tiered storage: Keep hot vectors in memory, cold vectors in disk-backed storage.
4. Aggressive deduplication: Remove near-duplicate chunks before indexing.
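The memory column and the first two savings above follow from one multiplication. A small sketch of the arithmetic; the function names are illustrative, and the result covers raw vector storage only (index structures and metadata come on top).

```python
def vector_memory_bytes(num_vectors: int, dims: int, bytes_per_value: int = 4) -> int:
    """Raw storage for the vectors alone (float32 = 4 bytes, int8 = 1 byte)."""
    return num_vectors * dims * bytes_per_value

def to_gb(num_bytes: int) -> float:
    """Convert bytes to decimal gigabytes."""
    return num_bytes / 1e9

baseline = vector_memory_bytes(1_000_000, 1536)                  # float32, 1536d
reduced = vector_memory_bytes(1_000_000, 256)                    # float32, 256d
quantized = vector_memory_bytes(1_000_000, 1536, bytes_per_value=1)  # int8, 1536d
```

Here `baseline` is about 6.1 GB, `reduced` is one sixth of that, and `quantized` is one quarter, matching the 6× and 4× figures in the list.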

∑ Chapter 04 — Key Takeaways

  • Vector databases use ANN indices (not brute force) to search millions of vectors in milliseconds
  • Top choices: Pinecone (managed), Qdrant (flexible), pgvector (if already on Postgres)
  • HNSW is the default index type: fast search, high recall, good for <100M vectors
  • Key parameters: M (edges), ef_construction (build quality), ef_search (query recall)
  • Memory scales with vectors: 1M × 1536d × 4 bytes ≈ 6 GB; dimension reduction helps
  • Start with managed services; self-host only when you need privacy or have ops capacity
05
Chapter 05 · Core Mechanics
Retrieval Strategies: Finding What Matters

Retrieval is where RAG succeeds or fails. You can have perfect embeddings and a fast vector database, but if your retrieval strategy doesn't find the right documents, the LLM will confidently answer with irrelevant context. This chapter covers how to actually get retrieval right.

There are fundamentally two ways to find relevant documents: dense retrieval (semantic similarity via embeddings) and sparse retrieval (keyword matching via inverted indices). Both have strengths; the best systems use both.

Dense Retrieval (Vector Search)

How it works: Embed the query, find the nearest document embeddings by cosine similarity

✓ Strengths: Semantic understanding ("car" matches "automobile"), handles paraphrase

✗ Weaknesses: Misses exact keywords, struggles with numbers, acronyms, rare terms

Example: "Python web framework" retrieves docs about Flask even if the doc never uses the words "Python", "web", or "framework"

Sparse Retrieval (BM25/TF-IDF)

How it works: Match query terms against an inverted index and score each document by term frequency and term rarity (BM25)

✓ Strengths: Exact keyword match, numbers, codes, acronyms, rare terms

✗ Weaknesses: No semantic understanding ("car" doesn't match "automobile")

Example: "ERR-4592" finds the exact error code; vector search might miss it

Dense vs Sparse retrieval: each finds different documents
[Figure: results for the query "Python web app". Dense (vector) returns "Flask framework tutorial", "Building websites with Django", and "Web development in Python", but not "Error code PY-4592 fix". Sparse (BM25) returns "Python app deployment", "Web scraping with Python", and "Python web server basics", but not "Flask framework tutorial". Hybrid returns the best of both: Flask framework tutorial, Python app deployment, Building websites with Django, Web development in Python.]
The Empirical Result

Benchmarks consistently show that hybrid search (dense + sparse) outperforms either alone by 5–15% on recall. Dense captures paraphrase and semantic similarity; sparse captures exact terms the embeddings might miss. Use both.

Hybrid search runs both dense and sparse retrieval, then combines the results. The key question: how do you merge two ranked lists? The most common approach is Reciprocal Rank Fusion (RRF).

Reciprocal Rank Fusion (RRF)

$$\mathrm{RRF\_score}(d) = \sum_{i} \frac{1}{k + \mathrm{rank}_i(d)}$$

k = 60 (constant); rank_i(d) is document d's rank (starting at 1) in retriever i. Higher score = better.

Hybrid search with RRF: production pattern

from rank_bm25 import BM25Okapi
import numpy as np

def hybrid_search(query: str, docs: list[str], embeddings: np.ndarray,
                  embed_fn, top_k: int = 10, alpha: float = 0.5):
    """
    Combine dense (vector) and sparse (BM25) retrieval using weighted RRF.
    alpha: weight for dense vs sparse (0.5 = equal weight)
    """
    # Dense retrieval. The dot product equals cosine similarity only
    # when embeddings and query_emb are L2-normalized.
    query_emb = embed_fn(query)
    dense_scores = np.dot(embeddings, query_emb)
    dense_ranks = np.argsort(-dense_scores)  # descending

    # Sparse retrieval (BM25). In production, build the BM25 index once,
    # not on every query.
    tokenized_docs = [doc.lower().split() for doc in docs]
    bm25 = BM25Okapi(tokenized_docs)
    sparse_scores = bm25.get_scores(query.lower().split())
    sparse_ranks = np.argsort(-sparse_scores)

    # RRF fusion with 1-based ranks
    k = 60  # standard RRF constant
    rrf_scores = {}
    for rank, doc_idx in enumerate(dense_ranks, start=1):
        rrf_scores[doc_idx] = rrf_scores.get(doc_idx, 0.0) + alpha / (k + rank)
    for rank, doc_idx in enumerate(sparse_ranks, start=1):
        rrf_scores[doc_idx] = rrf_scores.get(doc_idx, 0.0) + (1 - alpha) / (k + rank)

    # Sort by RRF score
    ranked = sorted(rrf_scores.items(), key=lambda x: x[1], reverse=True)
    return [docs[idx] for idx, _ in ranked[:top_k]]

Hybrid Method | How It Works | Best When
RRF (Reciprocal Rank Fusion) | Sum of 1/(k+rank) across retrievers | Default choice, no tuning needed
Weighted Linear | α × dense_score + (1−α) × sparse_score | When you can tune α on your data
Cascade | Sparse first for recall, dense re-rank | When sparse is faster and latency matters
Learned Fusion | Train a model to combine scores | When you have labeled data + resources
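The "Weighted Linear" row needs one step the formula leaves implicit: dense and sparse scores live on different scales (cosine similarity vs unbounded BM25), so they should be normalized before mixing. A minimal sketch, assuming min-max normalization, which is one common choice rather than the only one; the function names are illustrative.

```python
def min_max(scores: list[float]) -> list[float]:
    """Rescale scores to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]

def weighted_fusion(dense: list[float], sparse: list[float],
                    alpha: float = 0.5) -> list[float]:
    """alpha * dense + (1 - alpha) * sparse on normalized per-document scores."""
    if len(dense) != len(sparse):
        raise ValueError("score lists must cover the same documents")
    d, s = min_max(dense), min_max(sparse)
    return [alpha * x + (1 - alpha) * y for x, y in zip(d, s)]
```

Unlike RRF, this keeps score magnitudes, so a document that wins by a wide margin in one retriever keeps that advantage; the price is that `alpha` has to be tuned on your eval set.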

User queries are often poor search queries: too short, ambiguous, or phrased as questions when documents are statements. Query enhancement transforms the user's question into something more likely to retrieve relevant documents.

Query Expansion

Add related terms to the query. "Python web" → "Python web Flask Django framework app"

  • Use an LLM to generate synonyms
  • Or use WordNet/embeddings
  • Increases recall, may hurt precision

HyDE (Hypothetical Doc)

Generate a hypothetical answer, embed that instead of the question.

  • Questions embed differently than docs
  • Generated answer is closer to real docs
  • +10–15% recall on some benchmarks

Multi-Query

Generate 3–5 variations of the query, retrieve with each, merge results.

  • "refund policy" + "return items" + "money back guarantee"
  • Captures more angles
  • 3–5× retrieval cost (one search per variation)

HyDE implementation: generate hypothetical document, then embed

def hyde_retrieval(query: str, llm, embed_fn, index, top_k=5):
    """
    HyDE: Hypothetical Document Embeddings
    1. Generate a hypothetical answer to the query
    2. Embed that answer (not the question)
    3. Search for similar documents

    llm, embed_fn, and index are placeholders for your own clients.
    """
    # Step 1: Generate hypothetical document
    prompt = f"""Write a short passage that would answer this question:

Question: {query}

Passage:"""
    hypothetical_doc = llm.generate(prompt, max_tokens=200)

    # Step 2: Embed the hypothetical answer (not the question!)
    hyde_embedding = embed_fn(hypothetical_doc)

    # Step 3: Search with the hypothetical doc's embedding
    results = index.search(hyde_embedding, top_k=top_k)
    return results

# Why this works: Questions are phrased differently than answers.
# "What is the refund policy?" embeds far from "Our refund policy is..."
# HyDE generates "Our refund policy allows 30-day returns..." which
# embeds close to actual policy documents.

HyDE Caveat

HyDE adds an LLM call to every query: 100–500ms of latency plus cost. It works best when: (1) question-answer asymmetry is high, (2) latency is not critical, (3) you've measured real improvement on your data. Don't blindly apply it; A/B test.
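The multi-query pattern described above reduces to "search once per variation, then fuse the ranked lists". The sketch below fuses with RRF; `multi_query_retrieve` and the `search_fn` signature are hypothetical, and generating the variations (typically with an LLM) is left out.

```python
from typing import Callable

def multi_query_retrieve(
    variants: list[str],
    search_fn: Callable[[str, int], list[str]],  # assumed: (query, top_k) -> ranked doc ids
    top_k: int = 5,
    k: int = 60,
) -> list[str]:
    """Run one search per query variant and merge the ranked lists with RRF."""
    scores: dict[str, float] = {}
    for variant in variants:
        for rank, doc_id in enumerate(search_fn(variant, top_k), start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort by fused score; ties are broken by doc id for deterministic output
    ranked = sorted(scores, key=lambda d: (-scores[d], d))
    return ranked[:top_k]
```

Documents that show up for several variations accumulate score and rise to the top, which is why this tends to help short or ambiguous queries.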

Sometimes you know constraints before search: only documents from this year, only from the engineering team, only product docs. Metadata filtering narrows the search space, improving both precision and speed.

Filter Type | Example | Use Case
Equality | department = "engineering" | User belongs to a specific team
Range | date >= "2025-01-01" | Only recent documents
In-list | source IN ["faq", "docs"] | Only certain document types
Boolean AND | public = true AND lang = "en" | Multiple constraints
Geo | location NEAR (lat, lon, 10km) | Location-aware retrieval

Pinecone filtering

results = index.query(
    vector=query_embedding,
    top_k=10,
    filter={
        "department": {"$eq": "engineering"},
        # Range operators such as $gte compare numbers, so store dates
        # as numeric timestamps (here: 2025-01-01 00:00 UTC)
        "date_ts": {"$gte": 1735689600},
    },
    include_metadata=True,
)

Qdrant filtering

from qdrant_client.models import FieldCondition, Filter, MatchValue

# Newer qdrant-client releases favor query_points over search;
# check the version you have installed.
results = client.search(
    collection_name="docs",
    query_vector=query_embedding,
    limit=10,
    query_filter=Filter(must=[
        FieldCondition(key="department", match=MatchValue(value="engineering"))
    ]),
)

Pre-filter vs Post-filter

Pre-filter (apply the filter before or during the ANN search) should be the default: every result it returns satisfies the filter. Post-filter (search everything, then drop non-matching results) is acceptable only when the filter is loose and most vectors pass it; with a selective filter it can return far fewer than top-k results. When a filter leaves only a small candidate set (say, <100 vectors), skip ANN and run an exact search over those candidates. Most vector DBs support efficient pre-filtering.
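The difference between the two filtering orders is easiest to see on a tiny in-memory corpus. A minimal sketch with exact search standing in for ANN; the document schema (`id`, `vector`, `dept`) and function names are made up for illustration.

```python
def score(query, vec):
    return sum(q * v for q, v in zip(query, vec))

def post_filter_search(query, docs, predicate, top_k):
    """Rank everything, keep top_k, then filter: may return fewer than top_k."""
    ranked = sorted(docs, key=lambda d: score(query, d["vector"]), reverse=True)
    return [d["id"] for d in ranked[:top_k] if predicate(d)]

def pre_filter_search(query, docs, predicate, top_k):
    """Filter first, then rank the survivors."""
    allowed = [d for d in docs if predicate(d)]
    ranked = sorted(allowed, key=lambda d: score(query, d["vector"]), reverse=True)
    return [d["id"] for d in ranked[:top_k]]

docs = [
    {"id": 1, "vector": [1.0], "dept": "sales"},
    {"id": 2, "vector": [0.9], "dept": "sales"},
    {"id": 3, "vector": [0.8], "dept": "eng"},
    {"id": 4, "vector": [0.1], "dept": "eng"},
]
is_eng = lambda d: d["dept"] == "eng"
```

With `top_k=2`, the post-filter version returns nothing because both top hits belong to sales, while the pre-filter version returns the two engineering documents.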

Real applications have multiple document types: FAQs, documentation, support tickets, code. Each may benefit from different chunking, embeddings, or ranking. Multi-index retrieval queries multiple specialized indices and merges results.

Multi-index architecture: route query to specialized indices
[Figure: User Query → Router (classify intent) → FAQ Index (short, direct answers), Docs Index (technical reference), Code Index (code snippets, API) → Merger (RRF / re-rank) → Merged Results]
✅ When to use multi-index
  • Documents have very different structures (FAQ vs manuals)
  • Different chunking strategies needed
  • Different embedding models work better for each type
  • Access control varies by doc type
❌ When single index is fine
  • Documents are homogeneous
  • Same chunking works everywhere
  • The added complexity has no measured benefit
  • Small corpus (<100K docs)
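The router in the multi-index architecture can start as a handful of keyword rules before graduating to an LLM or a trained classifier. The sketch below is deliberately simple; the index names, the keyword lists, and `route_query` are all hypothetical.

```python
import re

# Hypothetical routing rules; a production router might use an LLM
# or a trained intent classifier instead.
ROUTES = [
    ("code", re.compile(r"\b(function|class|api|endpoint|error|exception|sdk)\b", re.I)),
    ("faq", re.compile(r"\b(refund|pricing|price|cancel|billing|shipping)\b", re.I)),
]
DEFAULT_INDEX = "docs"

def route_query(query: str) -> list[str]:
    """Return the indices to search: every matching route, or the default."""
    matched = [name for name, pattern in ROUTES if pattern.search(query)]
    return matched or [DEFAULT_INDEX]
```

Returning a list rather than a single index lets ambiguous queries fan out to several indices, whose results the merger then fuses.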
Optimization | Expected Gain | Effort | When to Use
Add BM25 hybrid search | +5–15% recall | Low | Almost always (default recommendation)
Cross-encoder re-ranking | +5–20% precision | Medium | When precision matters more than latency
Multi-query retrieval | +5–10% recall | Low–Medium | Short/ambiguous queries
HyDE | +5–15% recall | Medium | High question-doc asymmetry
Metadata pre-filtering | Variable (precision) | Low | When you have useful metadata
Better chunking | +10–30% recall | Medium | Before other optimizations
Better embedding model | +5–15% recall | Medium | If current model underperforms on eval

Optimization Order

Before adding fancy techniques, get the basics right:

1. Good chunking (Ch 2)
2. Good embeddings (Ch 3)
3. Hybrid search (this chapter)
4. Re-ranking (Ch 6)

Only then consider HyDE, multi-query, or multi-index. Measure each change on your eval set: many "optimizations" don't help on a given dataset.

∑ Chapter 05 — Key Takeaways

  • Dense retrieval (vectors) captures semantic similarity; sparse retrieval (BM25) captures exact keywords
  • Hybrid search combining both outperforms either alone; use RRF to merge ranked lists
  • Query enhancement: HyDE (hypothetical docs), multi-query (variations), query expansion (synonyms)
  • Metadata filtering narrows the search space before ANN, making retrieval faster and more precise
  • Multi-index architectures help when document types are very different
  • Optimization order: chunking → embeddings → hybrid search → re-ranking → advanced techniques
  • Always measure on your eval set; not all optimizations help all datasets
06
Chapter 06 · Quality
Ranking & Re-Ranking: From Retrieved to Relevant

Retrieval gives you candidates. Re-ranking gives you the right candidates in the right order. A cross-encoder re-ranker looking at 20 retrieved chunks and selecting the best 5 can improve answer quality more than any other single optimization in the RAG pipeline.

Embedding-based retrieval is fast but shallow: it encodes the query and each document separately, so the model never reads them side by side. Re-ranking takes the top-k candidates and applies a more powerful (but slower) model that reads query + document jointly.

Two-stage retrieval: fast recall first, then precise re-ranking
[Figure: Query → Stage 1: Retriever (bi-encoder / BM25; search 1M docs → top 50; fast, ~10ms) → Stage 2: Re-ranker (cross-encoder; score 50 → pick best 5; precise, ~100ms) → Top 5 Results (highly relevant, well-ordered, ready for LLM context). Bi-encoder: fast but approximate. Cross-encoder: slow but accurate. Use both. Re-ranking typically improves precision@5 by 15–30%.]
Bi-encoder (Stage 1)

How: Encode query and document independently, then compare

Speed: ~1–10ms to search 1M pre-indexed docs

Quality: Good recall, moderate precision

Use: Initial retrieval from full corpus

Cross-encoder (Stage 2)

How: Feed [query + document] together through a transformer

Speed: ~5ms per document pair

Quality: Much higher precision

Use: Re-rank top 20–50 candidates

Model | Type | Quality | Speed | Cost | Best For
Cohere Rerank v3 | API | Excellent | ~100ms / 50 docs | $1/1K searches | Production default
Voyage Reranker | API | Excellent | ~100ms / 50 docs | $0.05/1K | Cost-effective API option
BGE-reranker-v2-m3 | Local | Very good | ~200ms / 50 docs (GPU) | Free | Self-hosted, multilingual
cross-encoder/ms-marco | Local | Good | ~300ms / 50 docs (GPU) | Free | Prototyping, English only
ColBERT v2 | Local | Very good | ~50ms / 50 docs | Free | Late interaction, fast
Jina Reranker v2 | Both | Very good | ~100ms / 50 docs | Free / API | Multilingual, long docs

Re-ranking with Cohere: production pattern

import cohere

co = cohere.Client("YOUR_API_KEY")

def rerank_results(query: str, documents: list[str], top_k: int = 5):
    """Re-rank retrieved documents using Cohere Rerank."""
    response = co.rerank(
        model="rerank-english-v3.0",
        query=query,
        documents=documents,
        top_n=top_k,
        return_documents=True,
    )
    # Results come back sorted by relevance score
    return [
        {
            "text": r.document.text,
            "score": r.relevance_score,  # 0.0 to 1.0
            "index": r.index,            # position in the input list
        }
        for r in response.results
    ]

# Usage: retrieve 50 with vector search, re-rank to top 5
# (vector_search is a placeholder for your own retriever)
candidates = vector_search(query, top_k=50)
best = rerank_results(query, candidates, top_k=5)

Re-ranking with local cross-encoder (sentence-transformers)

from sentence_transformers import CrossEncoder

# Load model once at startup
reranker = CrossEncoder("BAAI/bge-reranker-v2-m3")

def rerank_local(query: str, docs: list[str], top_k: int = 5):
    """Re-rank using local cross-encoder model."""
    pairs = [[query, doc] for doc in docs]
    scores = reranker.predict(pairs)
    # Sort by score descending
    ranked = sorted(zip(docs, scores), key=lambda x: x[1], reverse=True)
    return [doc for doc, _ in ranked[:top_k]]

Research shows LLMs pay more attention to information at the beginning and end of the context window, but tend to miss information in the middle. This "lost in the middle" effect means document ordering matters.

LLM attention by position: the "lost in the middle" effect
[Figure: U-shaped curve of LLM attention against position in the context window: high at the start, low in the middle, high at the end.]
Best-first ordering

Put the most relevant document first. Simple, effective for most cases.

Interleaved ordering

Alternate: #1 first, #2 last, #3 second, #4 second-to-last. This places the strongest results at both attention peaks and leaves the weakest in the middle.

Fewer, better chunks

Re-rank to top 3–5 instead of stuffing 10+. Less middle = less lost. Quality over quantity.

Practical Impact

In benchmarks, placing the answer in the middle vs at the start reduces accuracy by 10–20%. The fix: re-rank aggressively (top 3–5 only), put best results first, and keep total context short. More context ≠ better answers.
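The interleaved ordering described above (#1 first, #2 last, #3 second, #4 second-to-last) is a two-line slice operation. A minimal sketch; `interleave_for_context` is an illustrative name and the input is assumed to be sorted best-first.

```python
def interleave_for_context(ranked: list[str]) -> list[str]:
    """Reorder best-first results so the strongest sit at both ends.

    Ranks 1, 3, 5, ... fill the front in order; ranks 2, 4, 6, ... fill
    the back in reverse, leaving the weakest results in the middle.
    """
    front = ranked[0::2]        # ranks 1, 3, 5, ...
    back = ranked[1::2][::-1]   # ranks ..., 6, 4, 2
    return front + back
```

For five chunks ranked d1 to d5 this yields d1, d3, d5, d4, d2: the top two sit at the edges and the lowest-ranked chunk lands in the middle.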

If your top 5 results are 5 chunks from the same document saying the same thing, you waste context and miss other relevant information. Diversity ranking ensures retrieved results cover different aspects of the query.

Strategy | How It Works | When to Use
MMR (Maximal Marginal Relevance) | Balance relevance to query vs diversity from already-selected docs | Default choice: simple, effective
Source deduplication | Max 2 chunks per source document | When same doc is over-represented
Similarity threshold | Remove results with cosine sim >0.95 to each other | When near-duplicates are common
Category diversity | Ensure mix of doc types (FAQ + docs + code) | Multi-index systems

Maximal Marginal Relevance (MMR)

$$\mathrm{MMR}(d) = \lambda \cdot \mathrm{Sim}(q, d) - (1 - \lambda) \cdot \max_{d_j \in S} \mathrm{Sim}(d, d_j)$$

S is the set of already-selected documents. λ = 0.5–0.7 typical. Higher λ = more relevance, lower λ = more diversity.

MMR implementation

import numpy as np

def mmr_rerank(query_emb, doc_embs, docs, top_k=5, lambda_=0.6):
    """Select diverse, relevant documents using MMR.

    Assumes query_emb and the rows of doc_embs are L2-normalized,
    so dot products are cosine similarities.
    """
    selected = []
    remaining = list(range(len(docs)))

    # Precompute similarities
    q_sim = np.dot(doc_embs, query_emb)   # relevance scores
    d_sim = np.dot(doc_embs, doc_embs.T)  # pairwise doc similarities

    for _ in range(top_k):
        if not remaining:
            break
        if not selected:
            # First pick: most relevant
            best = max(remaining, key=lambda i: q_sim[i])
        else:
            # MMR: balance relevance vs diversity
            mmr_scores = {}
            for i in remaining:
                max_sim_to_selected = max(d_sim[i][j] for j in selected)
                mmr_scores[i] = lambda_ * q_sim[i] - (1 - lambda_) * max_sim_to_selected
            best = max(remaining, key=lambda i: mmr_scores[i])
        selected.append(best)
        remaining.remove(best)

    return [docs[i] for i in selected]
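The "Source deduplication" row in the strategy table above (max 2 chunks per source document) is simpler than MMR and can run right after re-ranking. A minimal sketch, assuming each chunk is a dict with a `source` key; that schema and the function name are illustrative.

```python
from collections import Counter

def cap_chunks_per_source(chunks: list[dict], max_per_source: int = 2) -> list[dict]:
    """Keep at most max_per_source chunks from each source, preserving rank order."""
    seen: Counter = Counter()
    kept = []
    for chunk in chunks:
        source = chunk["source"]
        if seen[source] < max_per_source:
            kept.append(chunk)
            seen[source] += 1
    return kept
```

Because the input is already sorted by relevance, the chunks that survive are the best ones from each source.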

∑ Chapter 06 — Key Takeaways

  • Use a two-stage architecture: fast bi-encoder retrieval (top 50), then cross-encoder re-ranking (top 5)
  • Top re-rankers: Cohere Rerank v3 (API), BGE-reranker-v2 (local), ColBERT v2 (fast local)
  • Re-ranking typically improves precision@5 by 15–30%, the highest-ROI optimization after hybrid search
  • Lost in the middle: LLMs miss info in the middle of context; put best results first, use fewer chunks
  • MMR diversity ranking avoids redundant results by balancing relevance (λ) vs diversity (1−λ)
  • Deduplicate by source: max 2 chunks per document prevents one doc dominating context
07
Chapter 07 · Prompting
Context Construction & Prompting: What Goes Into the LLM

You've retrieved the right documents and ranked them well. Now comes the final mile: how you assemble the prompt determines whether the LLM uses that context correctly. Bad context construction turns great retrieval into mediocre answers.

Every RAG prompt has four parts. Getting each part right, and getting the ordering right, is what separates good RAG from great RAG.

The four parts of a RAG prompt: order matters

① SYSTEM INSTRUCTION
"You are a helpful assistant. Answer based ONLY on the provided context. If unsure, say 'I don't know'."

② RETRIEVED CONTEXT
[Source: pricing-faq.pdf | Section: Refund Policy]
"Our refund policy allows returns within 30 days of purchase. To initiate a refund, contact support..."
[Source: terms.pdf | Section: Cancellation]
"Subscriptions can be cancelled at any time..."

③ USER QUERY
"What's the refund policy?"

④ OUTPUT FORMAT
"Cite sources using [Source: filename]. Be concise."

Production RAG prompt template

SYSTEM_PROMPT = """You are a helpful assistant for {company_name}.

INSTRUCTIONS:
- Answer ONLY using the provided context below
- If the context doesn't contain the answer, say "I don't have information about that in my knowledge base"
- Cite sources using the [Source N] labels shown in the context
- Be concise and direct
- Do NOT make up information not in the context

CONTEXT:
{retrieved_context}
"""

USER_PROMPT = """Question: {user_query}

Answer (cite sources):"""

def build_rag_prompt(query, chunks, company="Acme Corp"):
    # Format retrieved chunks with source metadata.
    # Each chunk is expected to expose .text and a .metadata dict.
    context_parts = []
    for i, chunk in enumerate(chunks):
        source = chunk.metadata.get("source", "unknown")
        section = chunk.metadata.get("section", "")
        header = f"[Source {i+1}: {source}"
        if section:
            header += f" | {section}"
        header += "]"
        context_parts.append(f"{header}\n{chunk.text}")

    context = "\n\n".join(context_parts)
    system = SYSTEM_PROMPT.format(
        company_name=company,
        retrieved_context=context,
    )
    user = USER_PROMPT.format(user_query=query)
    return system, user

Citations serve two purposes: they let users verify answers and they reduce hallucination by forcing the model to ground statements in specific sources. Without citations, you can't tell if the model invented something.

Inline Citations

"The refund window is 30 days [Source 1]. For subscriptions, cancellation is immediate [Source 2]."

  • Easiest to implement
  • Clear source per claim
  • Works with any LLM

Footnote Citations

"The refund window is 30 days¹. Cancel anytime²."
¹ pricing-faq.pdf   ² terms.pdf

  • Cleaner reading flow
  • Needs post-processing
  • Better for long answers

Quote-based Citations

"As stated in the FAQ: 'returns within 30 days of purchase' (pricing-faq.pdf)"

  • Verifiable quotes
  • Highest trust
  • Longer responses

Citation Hallucination

LLMs can hallucinate citations: they'll cite "[Source 3]" even if only 2 sources exist, or attribute information to the wrong source. Always validate citations programmatically: check that cited source numbers exist, and optionally verify that the claim appears in the cited chunk using string matching or semantic similarity.
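Both checks mentioned above can start as a regular expression and a substring test. A minimal sketch, assuming inline citations of the form `[Source N]`; the function names are illustrative, and the quote check is deliberately crude (exact text modulo case and whitespace).

```python
import re

CITATION_RE = re.compile(r"\[Source (\d+)\]")

def find_invalid_citations(answer: str, num_sources: int) -> list[int]:
    """Return cited source numbers that do not exist (valid range: 1..num_sources)."""
    cited = {int(n) for n in CITATION_RE.findall(answer)}
    return sorted(n for n in cited if not 1 <= n <= num_sources)

def _normalize(text: str) -> str:
    return " ".join(text.lower().split())

def quote_is_supported(quote: str, chunk_text: str) -> bool:
    """True if the quoted span appears in the chunk, ignoring case and whitespace."""
    return _normalize(quote) in _normalize(chunk_text)
```

A non-empty result from `find_invalid_citations` is a reasonable trigger for regenerating the answer or stripping the bad citation before showing it to the user.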

Even with 128K-token context windows, more context is not always better. Cost increases linearly, latency increases, and the "lost in the middle" effect gets worse. Smart context management is about using the window efficiently.

Strategy | How It Works | Token Savings | Quality Impact
Fewer, better chunks | Re-rank to top 3–5 instead of 10+ | 50–70% fewer tokens | Often improves quality
Chunk compression | Use LLM to summarize each chunk before insertion | 60–80% fewer tokens | May lose details
Relevant sentence extraction | Extract only sentences relevant to query from each chunk | 50–70% fewer tokens | Preserves key info
Token budget allocation | Set max tokens per chunk (e.g., 500), truncate overflow | Predictable | May cut important context
Map-reduce for large corpus | Summarize each chunk separately, then combine summaries | Can handle unlimited docs | Multiple LLM calls, higher latency

❌ Don't: Stuff everything

Retrieving 20 chunks × 500 tokens = 10K tokens of context. Most of it is noise. Cost: $0.03 per query with GPT-4o (which works out to about $3 per 1M input tokens). At 10K queries/day = $300/day just for context.

Quality: LLM drowns in irrelevant text, misses the answer, or picks the wrong chunk.

✅ Do: Curate aggressively

Re-rank to top 5 chunks × 500 tokens = 2.5K tokens. Cost: $0.0075 per query. At 10K queries/day = $75/day. 4× cheaper.

Quality: Less noise, LLM focuses on best content, answers more accurately.
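The cost comparison above is plain arithmetic and worth keeping as a function you can rerun when prices or chunk sizes change. In this sketch the $3 per 1M input tokens is an assumption back-derived from the article's own figures ($0.03 for 10K tokens), not a quoted price list.

```python
PRICE_PER_1M_INPUT_TOKENS = 3.00  # USD; assumption implied by the figures above

def context_cost_per_query(num_chunks: int, tokens_per_chunk: int,
                           price_per_1m: float = PRICE_PER_1M_INPUT_TOKENS) -> float:
    """Input-token cost of the retrieved context for a single query."""
    return num_chunks * tokens_per_chunk * price_per_1m / 1_000_000

def daily_cost(cost_per_query: float, queries_per_day: int) -> float:
    return cost_per_query * queries_per_day

stuffed = context_cost_per_query(20, 500)   # 10K tokens of context
curated = context_cost_per_query(5, 500)    # 2.5K tokens of context
```

At 10K queries per day, `daily_cost(stuffed, 10_000)` comes to about $300 and `daily_cost(curated, 10_000)` to about $75, the same 4× gap as in the text.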

The hardest part of RAG isn't answering questions. It's knowing when not to answer. When retrieved context doesn't contain the answer, the LLM should say "I don't know" rather than hallucinate.

❌ Bad: No abstention instruction

"Answer the user's question using the context."

# LLM will ALWAYS answer, even when the
# context doesn't contain the answer.
# It fills the gap with hallucination.

✅ Good: Explicit abstention

"Answer ONLY using the provided context. If the context does not contain enough information to answer, respond with: 'I don't have information about that in my knowledge base. Please contact support@company.com for help.'"

# LLM knows it's OK to not answer.
# Provides fallback action.

Confidence Scoring

For production systems, combine prompt-based abstention with a retrieval confidence check: if the best re-ranked score is below a threshold (e.g., 0.3), don't even send the query to the LLM; return a canned "I can't help with that" response. This saves tokens and rules out hallucination on those queries. Tune the threshold on your eval set, since score scales differ between re-rankers.
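The confidence gate described above is a guard clause in front of the LLM call. A minimal sketch; `answer_or_abstain` and `generate_fn` are hypothetical names, and the 0.3 threshold is the article's example value, not a recommendation for any particular re-ranker.

```python
FALLBACK = "I can't help with that."

def answer_or_abstain(scored_chunks, generate_fn, threshold=0.3):
    """Abstain without calling the LLM when retrieval confidence is low.

    scored_chunks: list of (text, score) tuples sorted best-first.
    generate_fn: assumed callable that takes the chunk texts and returns an answer.
    """
    if not scored_chunks or scored_chunks[0][1] < threshold:
        return FALLBACK  # skip the LLM call entirely
    return generate_fn([text for text, _ in scored_chunks])
```

Logging every abstention alongside the query gives you a ready-made list of gaps in the knowledge base.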

In conversation, follow-up questions reference earlier context: "What about for enterprise plans?" requires knowing the previous question was about pricing. Query rewriting transforms follow-ups into standalone queries for retrieval.

💬 User asks: "What's the refund policy?"
🔍 Retrieve: find refund docs
💬 Follow-up: "What about enterprise?"
✏️ Rewrite: "enterprise refund policy"
🔍 Retrieve: find enterprise docs
✏️
Query rewriting for multi-turn RAG
REWRITE_PROMPT = """Given the conversation history, rewrite the last user message
as a standalone search query. Include all necessary context from the conversation.

Chat history:
{history}

Last message: {current_message}

Standalone query:"""


def rewrite_query(history: list, current: str, llm) -> str:
    """Rewrite follow-up question as standalone query.

    `llm` is any client object exposing generate(prompt, max_tokens=...) -> str.
    """
    # Only the last four messages are included, to keep the rewrite prompt short.
    formatted_history = "\n".join(
        f"{m['role']}: {m['content']}" for m in history[-4:]
    )
    prompt = REWRITE_PROMPT.format(
        history=formatted_history,
        current_message=current,
    )
    return llm.generate(prompt, max_tokens=100)


# Example:
# History: "What's the refund policy?" → "30 days..."
# Current: "What about enterprise?"
# Rewritten: "What is the refund policy for enterprise plans?"
Rewrite Cost

Query rewriting adds an LLM call per turn (~50–100ms, ~100 tokens). For simple applications, first check whether the user message is self-contained (using heuristics like "does it contain a noun?") and only rewrite when it's clearly a follow-up. Don't rewrite "How do I reset my password?"; it's already standalone.
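One cheap way to implement such a pre-check is sketched below. It does not do noun detection (that would need a part-of-speech tagger); instead it swaps in a simpler, hypothetical heuristic based on follow-up openers, referential pronouns, and very short messages. The word lists and the three-word cutoff are illustrative assumptions, not tuned values.

```python
import re

# Hypothetical signals that a message leans on earlier turns.
FOLLOW_UP_OPENERS = ("what about", "how about", "and ", "also ", "same for")
REFERENTIAL_WORDS = {"it", "that", "this", "those", "these", "they", "them"}


def looks_like_follow_up(message, has_history):
    """Return True when the message probably needs rewriting before retrieval."""
    if not has_history:
        return False  # first turn: nothing to resolve against
    text = message.strip().lower()
    if text.startswith(FOLLOW_UP_OPENERS):
        return True
    words = re.findall(r"[a-z']+", text)
    if len(words) <= 3:
        return True  # very short messages rarely stand alone
    return any(word in REFERENTIAL_WORDS for word in words)
```

A false positive only costs one extra rewrite call, while a false negative sends an ambiguous query to the retriever, so it is reasonable to bias a heuristic like this toward rewriting.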

∑ Chapter 07 — Key Takeaways

  • RAG prompts have 4 parts: system instruction → retrieved context → user query → output format
  • Citation prompting forces grounding (inline, footnote, or quote-based); always validate citations programmatically
  • Less context is often better: re-rank to top 3–5 chunks, avoid stuffing 10+ into the prompt
  • Teach the model to say "I don't know" with explicit abstention instructions + retrieval confidence thresholds
  • Lost in the middle: put best results first, keep context short, consider interleaved ordering
  • Multi-turn RAG needs query rewriting: transform follow-ups into standalone retrieval queries
  • Context window management: fewer, better chunks = 4× cheaper and better quality than stuffing everything
08
Chapter 08 · Quality Assurance
Failure Modes & Evaluation: Why RAG Breaks and How to Measure

Every RAG system that "works in demos" eventually breaks in production. The difference between a toy and a product is knowing how it fails, measuring how often, and fixing it systematically. This chapter is about building that feedback loop.

RAG failures fall into two categories: retrieval failures (wrong documents found) and generation failures (wrong answer produced from correct documents). Each requires different fixes.

RAG failure taxonomy: retrieval vs generation failures
RAG FAILURE MODES
RETRIEVAL FAILURES
  ① Missing content: doc not in knowledge base
  ② Wrong chunks: irrelevant results retrieved
  ③ Stale content: outdated docs still indexed
  ④ Chunking artifacts: answer split across chunks
  Fix: better chunking, hybrid search, metadata
  Fix: content audits, sync pipelines
GENERATION FAILURES
  ⑤ Hallucination: invents info not in context
  ⑥ Wrong synthesis: misinterprets correct docs
  ⑦ Refuses to answer: says "I don't know" when it should answer
  Fix: better prompts, citation enforcement
  Fix: faithfulness scoring, confidence calibration
Failure Mode | Symptom | Root Cause | Fix
① Missing content | "I don't know" on answerable questions | Document not ingested, format parsing failed | Content coverage audit, loader validation
② Wrong chunks | Confident answer from wrong topic | Poor embeddings, no metadata filter, bad chunking | Hybrid search, re-ranking, metadata filters
③ Stale content | Outdated information returned | No re-indexing pipeline, deleted docs still indexed | TTL, incremental sync, freshness ranking
④ Chunking artifacts | Partial/incoherent answers | Answer split across chunk boundary | Parent-child chunks, larger overlap, semantic chunking
⑤ Hallucination | Facts not in any retrieved document | LLM uses parametric knowledge instead of context | Citation enforcement, faithfulness scoring
⑥ Wrong synthesis | Misinterprets or contradicts source | Context too noisy, conflicting chunks, poor prompt | Fewer chunks, better re-ranking, explicit instructions
⑦ Over-refusal | "I don't know" when answer exists in context | Abstention threshold too high, overly cautious prompt | Calibrate confidence threshold, tune prompt
🎯
Recall@K

Of all relevant documents, what fraction did we find in the top-K?

Recall@K = |relevant ∩ retrieved@K| / |relevant|
K=5 typical. Target: >0.85 for production.
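The Recall@K formula translates directly into a few lines of Python. This sketch assumes documents are identified by hashable IDs and that at least one relevant document is known for the query; `recall_at_k` is an illustrative name.

```python
def recall_at_k(retrieved_ids, relevant_ids, k=5):
    """Fraction of the relevant documents that appear in the top-k results."""
    relevant = set(relevant_ids)
    if not relevant:
        raise ValueError("Recall is undefined when no relevant documents are labeled")
    top_k = set(retrieved_ids[:k])
    return len(top_k & relevant) / len(relevant)


# Two of the three relevant docs ("d1", "d2") appear in the top 5; "d4" was missed.
example_recall = recall_at_k(["d1", "d7", "d3", "d9", "d2"], {"d1", "d2", "d4"}, k=5)
```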
📊
MRR (Mean Reciprocal Rank)

On average, how high is the first relevant result?

MRR = (1/N) × Σ (1 / rank_i)
rank_i = position of first relevant doc for query i. MRR=1.0 means the first relevant doc is always at rank 1.
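A short sketch of the MRR computation, under the common convention (assumed here) that a query with no relevant result in its ranked list contributes 0 to the sum:

```python
def mean_reciprocal_rank(ranked_lists, relevant_sets):
    """Average of 1/rank of the first relevant doc across queries."""
    if not ranked_lists:
        raise ValueError("Need at least one query")
    total = 0.0
    for ranked, relevant in zip(ranked_lists, relevant_sets):
        for rank, doc_id in enumerate(ranked, start=1):
            if doc_id in relevant:
                total += 1.0 / rank
                break  # only the first relevant hit counts
    return total / len(ranked_lists)


# Query 1: hit at rank 1; query 2: hit at rank 3; query 3: no hit.
example_mrr = mean_reciprocal_rank(
    [["a", "b", "c"], ["x", "y", "z"], ["p", "q"]],
    [{"a"}, {"z"}, {"missing"}],
)
```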
📏
NDCG (Normalized DCG)

How good is the ordering of results? Penalizes relevant docs at low positions.

  • NDCG=1.0 = perfect ranking
  • Considers graded relevance (not just binary)
  • Standard for search quality evaluation
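For concreteness, here is a sketch of NDCG@K using the linear-gain form of DCG (gain divided by log2 of position + 1). Some implementations use an exponential gain (2^rel − 1) instead, so treat the exact variant as an assumption. The optional `all_gains` argument is a hypothetical convenience for when you have relevance grades for documents that were not retrieved.

```python
import math


def dcg(gains):
    """Discounted cumulative gain: each gain is divided by log2(position + 1)."""
    return sum(gain / math.log2(position + 1)
               for position, gain in enumerate(gains, start=1))


def ndcg_at_k(ranked_gains, k, all_gains=None):
    """DCG of the actual ranking divided by DCG of the ideal ranking."""
    pool = all_gains if all_gains is not None else ranked_gains
    ideal_dcg = dcg(sorted(pool, reverse=True)[:k])
    if ideal_dcg == 0:
        return 0.0  # no relevant documents at all
    return dcg(ranked_gains[:k]) / ideal_dcg


perfect = ndcg_at_k([3, 2, 1], k=3)   # best document first
reversed_order = ndcg_at_k([1, 2, 3], k=3)  # best document last
```

The reversed ranking scores lower than the perfect one even though both contain the same documents, which is exactly the position penalty the card describes.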
⚖️
Precision@K

Of the K retrieved docs, how many are actually relevant?

  • Precision@5 of 0.6 = 3 of 5 are relevant
  • Tradeoff with recall: improve one and the other may drop
  • Critical for context quality (less noise)
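Precision@K is the mirror image of Recall@K: the denominator is K rather than the number of relevant documents. A minimal sketch (dividing by K even if fewer than K results came back, which is the usual convention and assumed here):

```python
def precision_at_k(retrieved_ids, relevant_ids, k=5):
    """Fraction of the top-k slots filled by relevant documents."""
    if k <= 0:
        raise ValueError("k must be positive")
    relevant = set(relevant_ids)
    hits = sum(1 for doc_id in retrieved_ids[:k] if doc_id in relevant)
    return hits / k


# Three of the five retrieved docs are relevant, matching the 0.6 example above.
example_precision = precision_at_k(
    ["d1", "d2", "d3", "d4", "d5"], {"d1", "d3", "d5", "d9"}, k=5
)
```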
Metric | What It Measures | How to Compute | Target
Faithfulness | Is the answer grounded in retrieved context? | LLM-as-judge: "Are all claims in the answer supported by context?" | >0.90
Answer Relevance | Does the answer address the user's question? | LLM-as-judge: "Does this answer the question? Score 1–5" | >0.85
Context Relevance | Was the retrieved context useful? | LLM-as-judge: "Is this context relevant to the question?" | >0.80
Correctness | Is the answer factually correct? | Compare against golden answer (exact or semantic match) | >0.85
Harmfulness | Does the answer contain harmful/biased content? | Safety classifier or LLM-as-judge | <0.01
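One way to turn the faithfulness question into a number is to split the answer into individual claims and ask a judge whether each one is supported by the context; the score is the supported fraction. The sketch below assumes the claims are already extracted and that `judge` is a hypothetical callable returning True or False (in practice it would wrap an LLM call).

```python
def faithfulness_score(claims, context, judge):
    """Fraction of answer claims that the judge finds supported by the context.

    judge: hypothetical callable (claim, context) -> bool, e.g. an LLM-as-judge wrapper.
    """
    if not claims:
        raise ValueError("Need at least one claim to score")
    supported = sum(1 for claim in claims if judge(claim, context))
    return supported / len(claims)


# Stand-in judge for illustration only: a claim counts as supported if it
# appears verbatim in the context. A real judge would be far more lenient.
def substring_judge(claim, context):
    return claim in context
```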

RAGAS (Retrieval Augmented Generation Assessment) is an open-source framework that automates RAG evaluation. It computes faithfulness, answer relevance, and context metrics using LLM-as-judge.

🔧
RAGAS evaluation pipeline
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_precision,
    context_recall,
)

# Prepare evaluation dataset: each list holds one entry per test question.
# Column names can differ between RAGAS releases; check the docs for the
# version you have installed.
eval_data = {
    "question": ["What's the refund policy?"],        # ... more questions
    "answer": ["The refund policy is 30 days..."],    # ... generated answers
    "contexts": [["Chunk 1 text", "Chunk 2 text"]],   # ... retrieved chunks per question
    "ground_truth": ["Refunds within 30 days..."],    # ... expected answers
}
dataset = Dataset.from_dict(eval_data)

# Run evaluation (the metrics call a judge LLM, so model credentials must be configured)
results = evaluate(
    dataset,
    metrics=[
        faithfulness,        # Is answer grounded in context?
        answer_relevancy,    # Does answer address the question?
        context_precision,   # Are retrieved docs relevant?
        context_recall,      # Did we find all relevant docs?
    ],
)
print(results)
# Example output:
# {'faithfulness': 0.92, 'answer_relevancy': 0.88,
#  'context_precision': 0.85, 'context_recall': 0.90}
Building a Golden Test Set

Create 50–100 (question, expected_answer, relevant_docs) tuples from your actual data. Have domain experts label them. Run RAGAS after every change to chunking, retrieval, or prompts. Automate this in CI: no RAG change ships without passing eval. This is the single most important practice for production RAG.
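A CI gate for this can be as small as a threshold comparison on the score dictionary. In the sketch below, `MIN_SCORES` and `regressions` are hypothetical names, and the mapping of this guide's targets onto RAGAS metric names (for example, using the Recall@5 target for `context_recall`) is an assumption you should adjust.

```python
# Minimum acceptable score per metric (assumed mapping of the targets in this chapter).
MIN_SCORES = {
    "faithfulness": 0.90,
    "answer_relevancy": 0.85,
    "context_precision": 0.80,
    "context_recall": 0.85,
}


def regressions(scores, min_scores=MIN_SCORES):
    """Return {metric: (actual, required)} for every metric below its minimum."""
    failures = {}
    for metric, required in min_scores.items():
        actual = scores.get(metric)
        # A missing metric is treated as a failure so gaps cannot pass silently.
        if actual is None or actual < required:
            failures[metric] = (actual, required)
    return failures
```

In a CI script, a non-empty result would typically be printed and followed by a non-zero exit code so the deploy is blocked.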

📝 User Query: incoming question
🔍 Retrieve: log chunks + scores
🤖 Generate: log answer
📊 Evaluate: auto-score quality
🚨 Alert: flag low scores
What to Log | Why | Alert Threshold
Retrieval scores (top-k) | Detect queries with no good matches | Best score < 0.3
Re-ranker scores | Detect relevance drops after re-ranking | Best score < 0.5
LLM response latency | Detect slow queries | p99 > 5s
"I don't know" rate | Detect coverage gaps | >15% of queries
User feedback (thumbs) | Ground truth from users | Negative rate >20%
Faithfulness (sampled) | Detect hallucination drift | Score < 0.85 on sample
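The alert thresholds in the table can be encoded as a small rule list. This is a sketch: the metric key names (`best_retrieval_score`, `idk_rate`, and so on) are hypothetical and would need to match whatever your logging pipeline actually emits.

```python
# (metric key, predicate that returns True when the value is bad, reason)
ALERT_RULES = [
    ("best_retrieval_score", lambda v: v < 0.3, "no good retrieval match"),
    ("best_rerank_score", lambda v: v < 0.5, "relevance dropped after re-ranking"),
    ("p99_latency_s", lambda v: v > 5.0, "slow queries"),
    ("idk_rate", lambda v: v > 0.15, "possible coverage gap"),
    ("negative_feedback_rate", lambda v: v > 0.20, "negative user feedback"),
    ("sampled_faithfulness", lambda v: v < 0.85, "possible hallucination drift"),
]


def triggered_alerts(metrics):
    """Return a message for every logged metric that crosses its threshold."""
    return [
        f"{name}: {reason}"
        for name, is_bad, reason in ALERT_RULES
        if name in metrics and is_bad(metrics[name])
    ]
```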
The Eval Paradox

You need evaluation to improve, but building eval sets takes time. Start small: 20 golden queries on day 1, add 5 per week from production logs. Within a month you'll have a meaningful eval set. Don't wait for perfection; imperfect evaluation beats no evaluation by a mile.

∑ Chapter 08 — Key Takeaways

  • RAG fails in 7 ways: 4 retrieval failures (missing, wrong, stale, chunking) + 3 generation failures (hallucination, wrong synthesis, over-refusal)
  • Retrieval metrics: Recall@K (did we find it?), MRR (how high?), NDCG (rank quality), Precision@K (how clean?)
  • Generation metrics: Faithfulness (grounded?), Answer Relevance (addresses question?), Correctness (factually right?)
  • Use the RAGAS framework to automate evaluation; run it after every pipeline change
  • Build a golden test set: 50–100 labeled (query, answer, docs) tuples. This is the most important practice for production RAG
  • Log everything in production: retrieval scores, latency, "I don't know" rates, user feedback. Alert on degradation
09
Chapter 09 · Architecture
Advanced RAG Patterns: Beyond Naive Retrieval

Chapters 1–8 covered how to build a solid RAG pipeline. This chapter goes further, into patterns that push RAG quality to the next level when the basics aren't enough. These techniques are more complex but solve specific failure modes that standard RAG can't handle.

Standard RAG uses whatever the retriever returns, even if the results are irrelevant. Corrective RAG (CRAG) adds a self-correction step: evaluate retrieval quality and, if it's poor, try alternative strategies before generating.

CRAG flow: evaluate retrieval quality, correct if needed
① Retrieve top-K chunks
② Evaluate: are results relevant? Score: 0–1
   ✓ Score > 0.7: use as-is
   ≈ Score 0.3–0.7: refine + web search
   ✗ Score < 0.3: web search fallback
③ Generate with best context → Answer
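The three-way branch in the CRAG flow can be sketched as a small routing function. The 0.7 and 0.3 cut-offs are the ones from the diagram; `refine` and `web_search` are hypothetical callables standing in for whatever refinement step and search tool you use.

```python
def crag_action(relevance_score, upper=0.7, lower=0.3):
    """Map a 0-1 retrieval relevance score to one of the three CRAG branches."""
    if relevance_score > upper:
        return "use_as_is"
    if relevance_score < lower:
        return "web_search_fallback"
    return "refine_and_web_search"


def build_context(chunks, relevance_score, refine, web_search):
    """Assemble the context for generation according to the chosen branch.

    refine: hypothetical callable (chunks) -> filtered/rewritten chunks.
    web_search: hypothetical callable () -> list of external snippets.
    """
    action = crag_action(relevance_score)
    if action == "use_as_is":
        return list(chunks)
    if action == "web_search_fallback":
        return list(web_search())  # discard the retrieved chunks entirely
    return list(refine(chunks)) + list(web_search())
```

Scores exactly at 0.7 or 0.3 fall into the middle branch here; how to treat the boundaries is a design choice the diagram leaves open.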

Self-RAG teaches the LLM to decide: (1) whether retrieval is needed, (2) whether retrieved docs are relevant, and (3) whether the generated answer is grounded. The model outputs special reflection tokens during generation.

❓ Query arrives: analyze intent
🤔 Need retrieval? [Retrieve] or [No Retrieve]
📄 If yes: retrieve, get docs
✅ Is it relevant? [ISREL] token
📝 Generate + check: [ISSUP] grounded?
When to Use Self-RAG

Self-RAG requires fine-tuning a model with reflection tokens. It's most valuable when: queries are mixed (some need retrieval, some don't), and when you need per-statement grounding verification. For most applications, CRAG (no fine-tuning needed) provides 80% of the benefit.

Standard RAG retrieves individual chunks. GraphRAG first builds a knowledge graph from documents (entities and relationships), then traverses the graph during retrieval. This enables multi-hop reasoning across documents.

Standard RAG

Query: "Who manages the team that built Project X?"

Retrieves chunks about Project X, but team and manager info is in different documents. Fails.

Each chunk is independent, with no connections between them.

GraphRAG

Query: "Who manages the team that built Project X?"

Graph: Project X → built by → Team Alpha → managed by → Sarah. Succeeds via graph traversal.

Entities and relationships connect information across documents.
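The multi-hop example above can be illustrated with a toy graph stored as a dictionary. This is a deliberately minimal sketch: a real GraphRAG system would extract the triples with an LLM and keep them in a graph store, and `follow_relations` is a hypothetical helper name.

```python
# Toy knowledge graph: (entity, relation) -> entity, using the example from the text.
GRAPH = {
    ("Project X", "built by"): "Team Alpha",
    ("Team Alpha", "managed by"): "Sarah",
}


def follow_relations(graph, start, relations):
    """Walk the graph one relation at a time; return None if any hop is missing."""
    node = start
    for relation in relations:
        node = graph.get((node, relation))
        if node is None:
            return None
    return node


# "Who manages the team that built Project X?" becomes a two-hop traversal.
manager = follow_relations(GRAPH, "Project X", ["built by", "managed by"])
```

Each hop can come from a different source document, which is why chunk-level retrieval alone struggles with this kind of question.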

✅
GraphRAG shines when
  • Multi-hop questions common
  • Entity relationships matter
  • Global summarization needed
  • Data is highly connected
❌
GraphRAG overkill when
  • Queries are simple lookups
  • Documents are independent
  • Graph construction cost too high
  • Data changes too fast
🔧
Implementation
  • Microsoft GraphRAG library
  • LLM extracts entities + relations
  • Store in Neo4j / NetworkX
  • Community detection for summaries

In standard RAG, the pipeline is fixed: retrieve → generate. In Agentic RAG, the LLM acts as an agent that decides what to retrieve, when, and how many times. It can reformulate queries, request more context, or search different sources.

🔁
Iterative Retrieval

Retrieve → analyze gaps → retrieve more → combine. The agent keeps searching until it has enough context.

  • FLARE: Forward-Looking Active REtrieval
  • Agent generates, detects uncertainty, retrieves more
  • 2–5 retrieval rounds typical
🧩
Query Decomposition

Complex question → break into sub-questions → retrieve for each → combine answers.

  • "Compare pricing of X vs Y" → two separate retrievals
  • LLM decomposes, retrieves, synthesizes
  • Better for multi-part questions
Agentic Complexity

Agentic RAG is powerful but adds latency (2–10× more LLM calls), cost, and unpredictability. The agent might loop, over-retrieve, or go off-track. Use it only when standard RAG demonstrably fails on your queries. Start with CRAG before going full agentic.
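A common guard against the looping risk is a hard cap on retrieval rounds. The sketch below shows an iterative retrieval loop with such a cap; `retrieve` and `next_query` are hypothetical callables (the latter would typically be an LLM deciding what is still missing), and the cap of 5 mirrors the "2–5 rounds typical" figure above.

```python
def iterative_retrieve(question, retrieve, next_query, max_rounds=5):
    """Retrieve, check for gaps, and retrieve again, for at most `max_rounds` rounds.

    retrieve: hypothetical callable (query) -> list of chunks.
    next_query: hypothetical callable (question, context) -> follow-up query,
        or None when the context is judged sufficient.
    Returns (context, rounds_used).
    """
    context = []
    query = question
    rounds_used = 0
    while query is not None and rounds_used < max_rounds:
        rounds_used += 1
        for chunk in retrieve(query):
            if chunk not in context:  # avoid stuffing duplicates into the prompt
                context.append(chunk)
        query = next_query(question, context)
    return context, rounds_used
```

Logging `rounds_used` per query makes it easy to spot agents that routinely hit the cap, which usually signals a retrieval or prompting problem rather than a need for more rounds.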

Pattern | Complexity | Latency | Best For | Requires
Standard RAG | Low | 200–500ms | 80% of use cases | Chapters 1–7
CRAG | Medium | 500–1000ms | Unreliable retrieval | Relevance scorer
Self-RAG | High | 500–1500ms | Mixed query types | Fine-tuned model
GraphRAG | High | 1–3s | Multi-hop, connected data | Graph DB, extraction pipeline
Agentic RAG | Very High | 2–10s | Complex multi-step queries | Agent framework, tool definitions
The Pragmatic Path

Start with standard RAG (Ch 1–7) + good evaluation (Ch 8). Measure where it fails. If retrieval quality is the bottleneck, add CRAG. If multi-hop queries fail, consider GraphRAG. If complex reasoning fails, consider agentic. Each pattern adds complexity, so only add it when you've measured the need.

∑ Chapter 09 — Key Takeaways

  • CRAG evaluates retrieval quality and falls back to web search or refined queries when results are poor
  • Self-RAG teaches the LLM to decide when to retrieve and whether results are grounded (requires fine-tuning)
  • GraphRAG builds a knowledge graph for multi-hop reasoning across connected documents
  • Agentic RAG lets the LLM control retrieval: iterative search, query decomposition, multi-source
  • Standard RAG covers 80% of use cases; only add advanced patterns when evaluation shows specific failures
  • Each pattern adds complexity, latency, and cost; measure the tradeoff before committing
10
Chapter 10 · Production Systems
Production Systems: Deployment, Monitoring, and Optimization

You've built a RAG system that works. Now ship it. Production RAG isn't about making the retrieval 1% better; it's about keeping it working reliably at scale, managing costs, and responding to drift. This chapter covers the engineering that keeps RAG systems alive.

Production RAG architecture: all the pieces
OFFLINE (DATA PIPELINE): Sources → Ingest → Chunk → Embed → Vector DB (Pinecone / Qdrant / pgvector), with Cron Sync
ONLINE (QUERY PIPELINE): Query → Retrieve → Re-rank → LLM → Answer, with Cache (Redis / semantic)
OBSERVABILITY: Logging (queries, chunks, scores) · Tracing (LangSmith / Phoenix) · Metrics (latency, cost, quality) · Alerts (degradation)
EVALUATION: Golden test set (CI/CD) · RAGAS automated scoring · A/B testing framework · User feedback loop

Many RAG queries are repetitive or semantically similar. Caching avoids re-running expensive retrieval and LLM calls for questions you've already answered.

🔑
Exact Cache

Hash the query string, cache the full response.

  • Hit rate: 5–15% typical
  • Simple: Redis key-value
  • TTL: 1–24 hours
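A minimal in-memory sketch of the exact cache is shown below. The dictionary stands in for Redis, and the query normalization (lowercasing and collapsing whitespace before hashing) is an added assumption that slightly raises the hit rate; drop it if case matters for your queries.

```python
import hashlib
import time


class ExactCache:
    """Exact-match response cache keyed by a hash of the normalized query."""

    def __init__(self, ttl_seconds=3600, clock=time.time):
        self.ttl_seconds = ttl_seconds
        self.clock = clock      # injectable so expiry can be tested
        self._store = {}        # key -> (response, stored_at)

    @staticmethod
    def key(query):
        normalized = " ".join(query.lower().split())
        return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

    def put(self, query, response):
        self._store[self.key(query)] = (response, self.clock())

    def get(self, query):
        cache_key = self.key(query)
        entry = self._store.get(cache_key)
        if entry is None:
            return None
        response, stored_at = entry
        if self.clock() - stored_at > self.ttl_seconds:
            del self._store[cache_key]  # expired: evict and report a miss
            return None
        return response
```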
🧠
Semantic Cache

Embed the query, find similar cached queries by vector distance.

  • Hit rate: 15–40% typical
  • "refund policy" ≈ "return policy"
  • Threshold: cosine > 0.95
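The semantic cache lookup can be sketched with plain Python lists as embeddings. This is an illustration only: it scans every cached entry linearly, whereas a production cache would use a vector index, and the embeddings themselves are assumed to come from whatever embedding model you already use.

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class SemanticCache:
    """Return a cached response when a past query embedding is close enough."""

    def __init__(self, threshold=0.95):
        self.threshold = threshold
        self._entries = []  # list of (embedding, response)

    def put(self, embedding, response):
        self._entries.append((list(embedding), response))

    def get(self, embedding):
        best_response, best_score = None, -1.0
        for cached_embedding, response in self._entries:
            score = cosine_similarity(embedding, cached_embedding)
            if score > best_score:
                best_response, best_score = response, score
        # Only serve the cached answer above the similarity threshold.
        return best_response if best_score > self.threshold else None
```

The threshold trades hit rate against the risk of serving an answer to a subtly different question, so it is worth validating against real query pairs before relying on it.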
📦
Retrieval Cache

Cache retrieval results only, still run LLM. Saves retrieval latency + vector DB cost.

  • Useful when prompts change often
  • TTL: 1–6 hours
  • Invalidate on index update
Cost Component | Typical % | Optimization | Savings
LLM generation | 60–70% | Fewer chunks in context, smaller model for simple queries, caching | 30–60%
Embedding API | 10–15% | Cache embeddings, batch calls, lower dimensions | 50–80%
Vector DB | 10–15% | Reduce dimensions, quantization, tiered storage | 30–60%
Re-ranker | 5–10% | Cache re-rank results, reduce candidate count | 20–40%
The Model Routing Pattern

Not every query needs GPT-4o. Route simple factual queries to a smaller/cheaper model (GPT-4o-mini, Claude Haiku) and complex reasoning queries to the best model. A simple classifier can save 40–60% on LLM costs by routing 70% of queries to the cheap model.
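As a starting point, the "simple classifier" can be a keyword-and-length heuristic like the sketch below. Everything here is an illustrative assumption: the hint words, the 12-word cutoff, and the placeholder model labels `small-model` and `large-model`. A trained classifier would replace it once you have labeled traffic.

```python
# Hypothetical cues that a query needs multi-step reasoning rather than a lookup.
REASONING_HINTS = ("compare", "why", "explain", "trade-off", "tradeoff", "versus", " vs ")


def choose_model(query, cheap_model="small-model", strong_model="large-model",
                 max_simple_words=12):
    """Route short factual queries to the cheap model, everything else to the strong one."""
    text = " " + query.lower() + " "
    needs_reasoning = any(hint in text for hint in REASONING_HINTS)
    is_long = len(query.split()) > max_simple_words
    return strong_model if needs_reasoning or is_long else cheap_model
```

Whatever router you use, evaluate both routes against the golden test set so that the cost savings do not come with a silent quality drop.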

Stage | Typical Latency | Optimization | Target
Embedding query | 20–50ms | Local model for embedding, batch | <50ms
Vector search | 5–20ms | HNSW tuning, pre-filter, warm cache | <20ms
Re-ranking | 50–200ms | Fewer candidates (20 not 50), ColBERT | <100ms
LLM generation | 500–3000ms | Streaming, shorter context, faster model | <2000ms
Total (no cache) | 800–3500ms | Parallelize retrieval + embedding | <2500ms
Total (cache hit) | 10–50ms | Semantic cache | <50ms
Stream Everything

Use streaming responses: start showing the LLM's answer token-by-token while it's still generating. Perceived latency drops from 2s to <500ms (time to first token). Every production RAG system should stream.

⏰
Scheduled Re-index

Cron job: re-index all docs every N hours. Simple but inefficient for large corpora.

  • Best for: <10K docs
  • Frequency: hourly to daily
  • Pro: Simple to implement
🔄
Incremental Sync

Track document hashes. Only re-embed changed/new/deleted docs. 10–100× faster than full re-index.

  • Best for: 10K–1M docs
  • Frequency: real-time to hourly
  • Pro: Efficient, low cost
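The hash-tracking idea behind incremental sync can be sketched as a diff between the current source documents and the hashes recorded at the last index run. The function names and the in-memory dictionaries are hypothetical; in practice the recorded hashes would live alongside the document IDs in your vector DB metadata or a side table.

```python
import hashlib


def content_hash(text):
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def plan_sync(source_docs, indexed_hashes):
    """Compare current documents with previously indexed hashes.

    source_docs: {doc_id: current_text}
    indexed_hashes: {doc_id: hash recorded when the doc was last embedded}
    Returns which doc IDs to embed for the first time, re-embed, or delete.
    """
    current = {doc_id: content_hash(text) for doc_id, text in source_docs.items()}
    return {
        "add": sorted(d for d in current if d not in indexed_hashes),
        "update": sorted(d for d in current
                         if d in indexed_hashes and current[d] != indexed_hashes[d]),
        "delete": sorted(d for d in indexed_hashes if d not in current),
        "hashes": current,  # persist this as the new baseline after syncing
    }
```

The `delete` list is what keeps removed source documents from lingering in the index, so it should be applied on every run, not only when new content arrives.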
📡
Event-driven

Webhook on doc change triggers re-indexing. Near-real-time freshness.

  • Best for: critical freshness needs
  • Frequency: real-time
  • Pro: Immediate, targeted
The Deletion Problem

When a document is deleted from the source, its chunks remain in the vector DB until explicitly removed. Users get answers from documents that no longer exist. Always track document IDs in your vector DB and delete chunks when source docs are removed.

Category | Checklist Item | Status
Data | All source documents ingested and validated | ☐
Data | Incremental sync pipeline running | ☐
Data | Stale document cleanup (TTL or deletion sync) | ☐
Quality | Golden test set (50+ queries) with passing scores | ☐
Quality | Eval runs in CI and blocks deploy on regression | ☐
Quality | Faithfulness >0.90, Recall@5 >0.85 on eval set | ☐
Performance | p99 latency <3s (or streaming TTFT <500ms) | ☐
Performance | Semantic cache deployed, hit rate monitored | ☐
Cost | Per-query cost calculated and budgeted | ☐
Cost | Model routing for simple vs complex queries | ☐
Observability | All queries/chunks/scores logged | ☐
Observability | Alerts on quality degradation, error spikes | ☐
Security | Access control on retrieval (user can only see their docs) | ☐
Security | PII handling in logs and cache | ☐
Fallback | "I don't know" with helpful fallback (human handoff, search link) | ☐

∑ Chapter 10 — Key Takeaways

  • Production RAG = offline pipeline (data) + online pipeline (query) + observability + evaluation
  • Caching (exact + semantic) can save 30–60% of costs and reduce latency to <50ms for repeated queries
  • LLM generation is 60–70% of cost; optimize with fewer chunks, model routing, and caching
  • Stream responses to cut perceived latency from 2s to <500ms time-to-first-token
  • Keep the index fresh: incremental sync + deletion tracking. Stale data destroys trust
  • Use the production checklist: data quality, evaluation gates, performance, cost, observability, security
  • RAG systems drift: continuous evaluation and monitoring are not optional, they're the product