RAG Engineering
From naive retrieval to production-grade RAG systems: chunking, embeddings, vector search, re-ranking, and advanced patterns for retrieval-augmented generation.
RAG is not about putting more data into the prompt. It's about putting the RIGHT data into the prompt. Production RAG requires careful engineering of every stage, from how you chunk documents to how you rank and present retrieved context. This guide covers all of it.
The fundamental problem with LLMs is not that they're stupid; it's that they don't know your data. They were trained on the public internet up to some cutoff date. They haven't read your internal docs, your product specs, your customer tickets. RAG is how you give them that knowledge at runtime, without retraining.
Retrieval-Augmented Generation (RAG) is an architecture pattern where you enhance LLM responses by retrieving relevant information from external data sources and injecting it into the prompt at inference time. The model generates its response grounded in this retrieved context, not just its pre-trained knowledge.
RAG separates knowledge storage from knowledge access. Your knowledge base can be petabytes. Your context window is 128K tokens. Retrieval is the bridge: it selects the few thousand tokens most relevant to this particular query, every time.
RAG = Search + Prompt Engineering. Retrieval decides WHAT the model sees. The prompt decides HOW the model uses it.
This separation is critical to understand:
Good retrieval, bad prompt: You find the right documents, but the prompt doesn't instruct the model to cite sources, stay grounded, or say "I don't know." The model hallucinates or ignores the context entirely.
Bad retrieval, good prompt: Your prompt is flawless, but the retriever returns irrelevant chunks. The model faithfully answers from garbage context, producing a well-formatted wrong answer.
A production RAG system must optimize both layers together. Retrieval quality without prompt quality is wasted effort. Prompt quality without retrieval quality is a beautiful facade over bad data. Every chapter in this guide addresses one or both.
LLMs have powerful language understanding and generation capabilities, but they ship with three fundamental limitations that RAG directly addresses:
Knowledge cutoff: Training data stops at some date. GPT-4o's cutoff is ~October 2023. Anything after that (new products, policy changes, recent events) is invisible to the base model.
- RAG injects current documents at inference
Private data access: Your internal docs, customer records, and proprietary research are not in the model. It was never trained on your company's Confluence, Notion, or databases.
- RAG retrieves from your data sources
Hallucination: Without grounding, the model can confidently generate plausible-sounding but wrong information. It's completing text patterns, not checking facts.
- RAG grounds answers in source documents
The most common question: "Should I fine-tune or use RAG?" This is not an either/or choice: they solve different problems and can be combined. Here's how to decide:
| Dimension | RAG | Fine-Tuning | Both Combined |
|---|---|---|---|
| What it changes | What the model sees (context) | How the model behaves (weights) | Both knowledge and behaviour |
| Knowledge updates | Instant: update the docs and the change takes effect immediately | Requires re-training (~hours/days) | Mixed: fast for the RAG portion |
| Cost structure | Higher per-query (retrieval + longer prompts) | Higher upfront, lower per-query | Highest upfront, balanced query cost |
| Factual accuracy | High: grounded in source docs | Lower: can still hallucinate | Highest with both |
| Style/format control | Limited: prompt engineering only | Strong: embedded in weights | Best of both |
| Complexity | Retrieval pipeline, chunking, indexing | Data curation, training, evaluation | Both complexities combined |
Choose RAG when:
- Knowledge changes frequently (docs, policies, products)
- You need source attribution ("this came from doc X")
- Dataset is large (>100K documents)
- You can't afford training compute
- Factual accuracy is critical (legal, medical, support)
Choose fine-tuning when:
- You need consistent style/tone/format
- Domain has specialized vocabulary the base model struggles with
- Knowledge is stable and won't change often
- Latency is critical (no retrieval overhead)
- You have high-quality training data (>1K examples)
In practice, many production systems combine RAG with a fine-tuned model: fine-tune for domain style and vocabulary, and use RAG for factual grounding. Example: a legal assistant fine-tuned on case law writing style, with RAG retrieving relevant precedents. The fine-tuning doesn't add factual knowledge; it teaches the model how to write like a lawyer.
A production RAG system has 6 core stages. Each stage has multiple design choices, and getting all of them right is what separates "RAG that works in demos" from "RAG that works in production."
Load documents from disparate sources, parse various formats (PDF, HTML, Markdown, Notion, Confluence), split into retrievable chunks with appropriate size and overlap.
- Loaders for 50+ formats
- Semantic vs fixed-size chunking
- Metadata extraction
Convert text chunks to dense vector representations. Store in specialized vector databases with efficient similarity search indices.
- OpenAI, Cohere, BGE models
- HNSW, IVF indexing
- Scaling to millions of vectors
At query time: find relevant chunks via vector similarity + keyword search, re-rank for precision, construct an optimized prompt, generate grounded response.
- Hybrid search strategies
- Cross-encoder re-ranking
- Context window management
Retrieval finds relevant chunks. But the LLM doesn't understand "relevance"; it only processes tokens. This means how you assemble those chunks into the prompt matters enormously.
Ordering: Important chunks buried in the middle get ignored (the "lost-in-the-middle" problem). Put the most relevant content first or last.
Formatting: Chunks dumped as raw text confuse the model. Adding source labels, separators, and structure helps the LLM parse context correctly.
Instruction placement: Where you place "answer only from context" relative to the chunks changes how strictly the model follows it.
You retrieved the perfect 5 chunks. But you stuffed them unformatted between a vague system prompt and the user query. The model skips to chunk #3 (the least relevant), hallucinates details from #1, and ignores #2 entirely. In practice, context construction often determines final answer quality more than retrieval itself. Chapter 7 covers this in depth.
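The ordering and formatting ideas above can be sketched in a few lines. This is a minimal illustration, not a prescribed implementation: it assumes chunks arrive as dicts with hypothetical `text`, `source`, and `score` keys, and the instruction wording is only an example.

```python
def order_for_context(chunks):
    """Place the highest-scoring chunks at the edges of the context.

    Ranked chunks are dealt alternately to the front and the back, so the
    weakest chunks end up in the middle, where models attend least.
    """
    ranked = sorted(chunks, key=lambda c: c["score"], reverse=True)
    front, back = [], []
    for i, chunk in enumerate(ranked):
        (front if i % 2 == 0 else back).append(chunk)
    return front + back[::-1]


def build_prompt(question, chunks):
    """Assemble a prompt with labeled, clearly separated sources."""
    blocks = [
        f"[Source {i}: {c['source']}]\n{c['text']}"
        for i, c in enumerate(order_for_context(chunks), start=1)
    ]
    context = "\n\n---\n\n".join(blocks)
    return (
        "Answer using only the sources below and cite them as [Source N]. "
        "If the sources do not contain the answer, say \"I don't know.\"\n\n"
        f"{context}\n\nQuestion: {question}"
    )
```

Whether the grounding instruction works better before or after the sources is worth testing on your own queries; the sketch places it first purely for simplicity.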
User queries are often vague, incomplete, or poorly structured. A single word like "refund" is not a good search query. Production systems improve retrieval by transforming queries before search.
Vague query: "refund"
Too vague: retrieves everything mentioning "refund" with no intent clarity.
Rewritten queries: "What is the refund policy?" and "How many days for refund eligibility?"
Specific and intent-clear: retrieves the right documents.
Techniques include: rewriting for clarity, expanding keywords, generating multiple search queries from one question, and using the LLM itself to reformulate. Chapters 5 and 7 cover these strategies in detail.
You can build a working RAG demo in 20 lines of code. You can also watch it fail spectacularly in production. The gap between "it works on my laptop" and "it works for 1000 users on real data" is filled with engineering challenges that naive implementations ignore.
| Naive Approach | What Goes Wrong | Production Solution |
|---|---|---|
| Fixed 500-token chunks | Splits mid-sentence, loses context, breaks tables | Semantic chunking, document-aware splitting |
| Top-3 vector results | Misses relevant docs, returns duplicates | Hybrid search + re-ranking + deduplication |
| Stuff all chunks in prompt | Lost-in-the-middle, context overflow, cost explosion | Compression, map-reduce, hierarchical retrieval |
| No metadata filtering | Retrieves outdated docs, wrong department's info | Metadata filters, access control, freshness ranking |
| "Answer from these docs" | Hallucinations when docs don't contain answer | "I don't know" instruction, citation enforcement |
| Embed once, never update | Stale index, deleted docs still retrieved | Incremental indexing, TTL, sync pipelines |
Naive RAG often achieves 80% accuracy in testing: good enough to demo, not good enough to deploy. The 20% failure cases are where users lose trust: wrong answers stated confidently, outdated information, obviously missed documents. Production RAG is about eliminating those 20%, and that requires engineering every stage of the pipeline.
RAG is not free. Each query involves multiple steps (query embedding, vector search, optional re-ranking, larger prompt construction), each adding latency and cost.
RAG adds roughly 100–500 ms over a direct LLM call: embedding (20–50 ms), vector search (5–20 ms), re-ranking (50–200 ms), plus slower generation from the longer prompt.
Retrieved context adds 2K–10K tokens per query. At GPT-4o rates, that's $0.005–$0.025 per query just for context, multiplied by thousands of daily queries.
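The per-query figures above can be reproduced with simple arithmetic. The sketch below assumes an input price of $2.50 per million tokens, which is the rate implied by the numbers in the text; substitute your provider's current pricing. The function name and the daily query volume are illustrative.

```python
def context_cost_per_query(context_tokens, usd_per_million_tokens=2.50):
    """Cost of the retrieved context alone, for a single query."""
    return context_tokens * usd_per_million_tokens / 1_000_000


for tokens in (2_000, 10_000):
    per_query = context_cost_per_query(tokens)
    per_day = per_query * 5_000  # assumed volume: 5,000 queries per day
    print(f"{tokens:>6} tokens: ${per_query:.4f}/query, ${per_day:.2f}/day")
```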
Naive implementations that retrieve too many chunks, skip caching, and use the most expensive model for every query become slow and expensive at scale.
Production systems control costs by: limiting retrieved chunks (5 not 20), compressing context (extract relevant sentences), caching results (semantic cache for repeated queries), and routing simple queries to cheaper models. Chapter 10 covers production optimization in depth.
You can't improve what you can't measure. RAG quality is measured in two independent dimensions: retrieval quality (did we find the right documents?) and generation quality (did we produce a correct, grounded answer?).
Recall@K: Of all relevant docs, what % did we retrieve in top-K?
Precision@K: Of retrieved docs, what % are actually relevant?
MRR: Mean Reciprocal Rank (how high does the first relevant result rank?)
NDCG: Normalized Discounted Cumulative Gain (a rank-quality score)
Faithfulness: Is the answer grounded in retrieved context? (no hallucination)
Answer Relevance: Does the answer address the user's question?
Context Relevance: Was the retrieved context actually useful?
RAGAS: Framework combining faithfulness + relevance metrics
Production RAG systems require automated evaluation pipelines. Build a golden test set of (query, relevant_docs, expected_answer) tuples. Run retrieval metrics after every indexing change. Run generation metrics after every prompt change. Chapter 8 covers this in depth.
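The retrieval metrics listed above are short enough to implement directly. The following is a sketch with binary relevance (a document is either relevant or not); it assumes `retrieved` is a ranked list of unique document ids and `relevant` is a set of ids from your golden test set.

```python
import math


def recall_at_k(retrieved, relevant, k):
    """Fraction of all relevant docs that appear in the top-k results."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & set(relevant)) / len(relevant)


def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k slots filled by relevant docs."""
    return len(set(retrieved[:k]) & set(relevant)) / k


def reciprocal_rank(retrieved, relevant):
    """1 / rank of the first relevant result, or 0 if none was retrieved."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs):
    """MRR over (retrieved, relevant) pairs, one pair per query."""
    return sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)


def ndcg_at_k(retrieved, relevant, k):
    """NDCG with binary relevance: DCG of this ranking over the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(retrieved[:k], start=1)
        if doc_id in relevant
    )
    ideal = sum(1.0 / math.log2(r + 1) for r in range(1, min(len(relevant), k) + 1))
    return dcg / ideal if ideal else 0.0
```

Averaging these per-query values over the golden test set gives one number per metric to track across indexing changes.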
If irrelevant chunks dominate the context, the model will still generate incorrect responses. A retriever with 90% recall but poor precision floods the context with noise, and the LLM faithfully summarizes that noise. Retrieval quality and generation quality must be optimized together.
RAG is powerful, but it's not a universal solution. Some problems are better solved with other approaches, and forcing RAG where it doesn't fit leads to complex systems that underperform simpler alternatives.
| Situation | Use RAG? | Better Alternative |
|---|---|---|
| Domain-specific style/vocabulary | No | Fine-tuning teaches the model how to speak, not what to say |
| General knowledge questions | No | The base LLM already knows this; RAG adds latency and cost for no benefit |
| Creative writing tasks | No | RAG constrains creativity; use the model's generative capabilities directly |
| Ultra-low latency (<100ms) | Often No | Retrieval adds 50–200ms minimum; consider pre-computed responses or caching |
| Highly structured data (SQL databases) | Sometimes | Text-to-SQL may be more accurate than embedding rows as text chunks |
| Document-grounded factual Q&A | Yes | RAG is the right tool for this job |
| Knowledge that changes frequently | Yes | RAG shines when knowledge is dynamic |
| Source attribution required | Yes | RAG naturally supports citation since sources are explicit |
| Misconception | Reality | Practical Implication |
|---|---|---|
| "RAG = vector search" | RAG = retrieval + augmentation + generation | Don't neglect re-ranking, context construction, and prompt engineering |
| "More chunks = better" | More noise in context = worse answers | Quality over quantity โ use re-ranking to select the best 3-5 chunks |
| "Embeddings capture everything" | Embeddings miss keywords, numbers, exact matches | Hybrid search (vector + BM25) outperforms pure vector in most cases |
| "One chunking strategy fits all" | Optimal chunking depends on doc type and query type | Test different strategies on your actual data with your actual queries |
| "Set and forget" | RAG systems drift as data and queries change | Continuous evaluation, monitoring, and re-indexing are required |
| "RAG eliminates hallucination" | RAG reduces but doesn't eliminate hallucination | Use citation prompting, faithfulness scoring, and "I don't know" instructions |
Chapter 01: Key Takeaways
- RAG = Retrieval-Augmented Generation: inject relevant external documents into LLM context at inference time
- RAG solves three LLM problems: knowledge cutoff, private data access, and hallucination grounding
- RAG vs Fine-tuning: RAG changes what the model sees (context), fine-tuning changes how it behaves (weights)
- The 6-stage pipeline: Ingestion → Chunking → Embedding → Storage → Retrieval → Generation
- Naive RAG (~80% accuracy) is not production RAG; the 20% failure cases destroy user trust
- Measure both retrieval quality (did we find the right docs?) and generation quality (did we answer correctly?)
- RAG is NOT always the answer: consider fine-tuning for style, the base LLM for general knowledge, SQL for structured data
Chunking is where most RAG systems silently fail. A bad chunking strategy doesn't throw errors; it just returns irrelevant results that the LLM confidently uses to generate wrong answers. Get chunking wrong, and nothing downstream can fix it.
Before you can retrieve, you need to ingest. The ingestion pipeline transforms raw documents (PDFs, web pages, Notion exports, database dumps) into indexed, searchable chunks. Each step has failure modes that propagate downstream.
Your knowledge lives in dozens of formats. Each format has quirks that affect text extraction quality. Use the right loader for each source, and always validate output before proceeding.
| Format | Recommended Loader | Watch Out For | Quality |
|---|---|---|---|
| PDF | pypdf, pdfplumber, unstructured | Scanned PDFs need OCR; tables often parse badly | Variable |
| HTML / Web | BeautifulSoup, trafilatura, unstructured | Nav/footer pollution; JavaScript-rendered content | Good |
| Markdown | Native text, MarkdownLoader | Code blocks need special handling; images are lost | Excellent |
| Word / DOCX | python-docx, unstructured | Embedded images, track changes, comments | Good |
| PowerPoint | python-pptx, unstructured | Layout is lost; speaker notes often missed | Fair |
| Notion | Notion API, notion-to-md | Nested blocks, databases need flattening | Good |
| Confluence | Confluence REST API, Atlassian SDK | Macros, embeds, permissions filtering | Fair |
| SQL Database | SQLAlchemy, custom extractors | Schema matters more than raw data; denormalize first | Good |
PDFs are the worst format for RAG. They're designed for printing, not parsing. A table in a PDF might render as "Column1 Column2 Row1Val1 Row1Val2 Row2Val1...", which is semantically meaningless. Solutions: use pdfplumber for tables, run OCR on scanned docs (Tesseract, Azure Doc Intelligence), or convert to Markdown before chunking.
Chunking determines what your retriever can find. Too large, and irrelevant content dilutes the signal. Too small, and context is lost. The right strategy depends on your document type and query patterns.
Fixed-size chunking: Split by character/token count with overlap. Simple, predictable, format-agnostic.
- Chunk size: 500–1000 tokens
- Overlap: 50–200 tokens (10–20%)
- Pro: Easy to implement
- Con: Splits mid-sentence/idea
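Fixed-size chunking with overlap can be sketched as a sliding window over a token list. The example below uses whitespace splitting as a stand-in tokenizer, which is an assumption for illustration; in practice you would count tokens with your embedding model's tokenizer.

```python
def chunk_fixed(tokens, chunk_size=500, overlap=50):
    """Split a token list into windows of chunk_size that share overlap tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the document
    return chunks


tokens = "the quick brown fox jumps over the lazy dog".split()
for chunk in chunk_fixed(tokens, chunk_size=4, overlap=1):
    print(" ".join(chunk))
```

Each window repeats the last token of the previous one, which is the overlap that keeps a sentence cut at a boundary partially visible in both chunks.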
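Recursive character splitting: Try splitting by paragraph first, then by sentence, then by character, moving to the finer separator only when a piece is still too large. This preserves document structure better.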
- Separators: ["\n\n", "\n", ". ", " "]
- Respects natural boundaries
- Pro: Better semantic units
- Con: Chunk sizes vary
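A simplified sketch of the recursive idea, using the separator list above. It measures size in characters and drops separators at chunk boundaries; library implementations differ in such details, so treat this as an illustration of the control flow rather than a drop-in replacement.

```python
def recursive_split(text, max_chars=200, separators=("\n\n", "\n", ". ", " ")):
    """Split on the coarsest separator first, recursing only into oversized pieces."""
    if len(text) <= max_chars:
        return [text] if text.strip() else []
    if not separators:
        # No separators left: fall back to a hard character cut.
        return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]

    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = f"{current}{sep}{piece}" if current else piece
        if len(candidate) <= max_chars:
            current = candidate  # keep merging small pieces into one chunk
            continue
        if current:
            chunks.append(current)
        if len(piece) <= max_chars:
            current = piece
        else:
            # Piece is still too large: retry with the next, finer separator.
            chunks.extend(recursive_split(piece, max_chars, finer))
            current = ""
    if current:
        chunks.append(current)
    return [c for c in chunks if c.strip()]
```

Short paragraphs stay whole, while an oversized paragraph is broken at sentence boundaries before any word-level split is attempted.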
Semantic chunking: Use embedding similarity to find natural breakpoints. Split where meaning shifts.
- Compute sentence embeddings
- Split at similarity drops
- Pro: Meaning-preserving
- Con: 10–100× slower, more complex
| Strategy | Best For | Chunk Size | Speed | Quality |
|---|---|---|---|---|
| Fixed-size | Uniform docs, quick prototyping | 512–1024 tokens | Fast | Fair |
| Recursive character | General purpose, most use cases | 500–1000 tokens | Fast | Good |
| Semantic | High-value docs, precision critical | Variable | Slow | Excellent |
| Document-aware (markdown headers) | Structured docs with clear sections | Section-based | Medium | Excellent |
| Sentence-window | Dense technical content | 3–5 sentences | Fast | Good |
Chunk size is a tradeoff between precision and context. There's no universal answer: it depends on query type, document structure, and embedding model capabilities.
Smaller chunks:
- Pro: Higher precision for specific queries
- Pro: Less noise in retrieved context
- Pro: Better for exact match questions
- Con: May lose surrounding context
- Con: More chunks = more vectors = higher cost
Best for: FAQ, definitions, code snippets
Larger chunks:
- Pro: Preserves full context
- Pro: Better for complex reasoning
- Pro: Fewer vectors to store/search
- Con: May include irrelevant content
- Con: Lower precision for specific queries
Best for: Analysis, summaries, narratives
Don't guess; test. Create 50–100 (query, expected_doc) pairs from your actual data. Run retrieval with chunk sizes 256, 512, 1024, 2048. Measure Recall@5. The winner varies by dataset: we've seen 256 win for support tickets and 1024 win for research papers. Your optimal size is the one that maximizes recall on your queries.
Production RAG systems often use more sophisticated chunking patterns that decouple what you search from what you retrieve. These add complexity but can significantly improve quality.
Parent-child chunking: Store small chunks for precise matching, but retrieve their larger parent for context.
- Best of both worlds: precision + context
- Requires document ID linking
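The "search small, retrieve large" pattern can be sketched as follows. Word overlap stands in for embedding similarity here, purely so the example runs without a model; the function names and the `parent_id` linking scheme are illustrative assumptions.

```python
def build_child_index(parents, child_size=8):
    """Split each parent into small word-window children that remember their parent."""
    children = []
    for parent_id, text in parents.items():
        words = text.split()
        for start in range(0, len(words), child_size):
            children.append({
                "parent_id": parent_id,
                "text": " ".join(words[start:start + child_size]),
            })
    return children


def overlap_score(query, text):
    """Toy relevance: number of shared lowercase words (stand-in for vector similarity)."""
    return len(set(query.lower().split()) & set(text.lower().split()))


def retrieve_parents(query, children, parents, top_k=3):
    """Search the small children, then return their de-duplicated parents."""
    ranked = sorted(children, key=lambda c: overlap_score(query, c["text"]), reverse=True)
    parent_ids = []
    for child in ranked[:top_k]:
        if child["parent_id"] not in parent_ids:
            parent_ids.append(child["parent_id"])
    return [parents[pid] for pid in parent_ids]
```

De-duplicating parent ids matters: several children of the same document often match one query, and returning the parent once keeps the context compact.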
Sentence-window retrieval: Embed individual sentences, but retrieve surrounding sentences as context.
- Very precise matching
- More setup complexity
Document-aware chunking: Headers become metadata, so you can filter by section, include the hierarchy in context, and maintain document structure.
Metadata enables filtering before semantic search, dramatically improving precision. Every chunk should carry metadata that answers: where did this come from, when, and who can see it?
| Metadata Field | Example Values | Use Case |
|---|---|---|
| source | pricing-faq.pdf, api-docs/auth.md | Filter by document, show citations |
| date_created | 2026-04-01 | Freshness ranking, exclude outdated |
| date_modified | 2026-04-20 | Sync detection, re-indexing triggers |
| department | engineering, sales, legal | Access control, relevance filtering |
| doc_type | faq, policy, tutorial, reference | Match intent to doc type |
| chunk_index | 0, 1, 2, ... | Reconstruct document order |
| parent_id | doc_123 | Link chunks to parent documents |
| section_title | Authentication > API Keys | Hierarchical filtering, better context |
Vector databases support metadata filtering during search, not after. Use it: filter={"department": "engineering", "date_modified": {"$gte": "2026-01-01"}}. Filtering before search is vastly more efficient than retrieving 100 results and filtering to 5.
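To make the filter semantics concrete, here is a small sketch that applies the filter shown above before scoring. It is a linear scan for illustration only; real vector databases evaluate such filters inside the index, and filter syntax varies by product. The `matches` and `filtered_search` names are hypothetical.

```python
def matches(metadata, flt):
    """Return True if metadata satisfies every condition in flt.

    Supports plain equality and a {"$gte": value} range condition.
    ISO-8601 date strings compare correctly as plain strings.
    """
    for field, condition in flt.items():
        value = metadata.get(field)
        if isinstance(condition, dict):
            if "$gte" in condition and (value is None or value < condition["$gte"]):
                return False
        elif value != condition:
            return False
    return True


def filtered_search(chunks, flt, score_fn, top_k=5):
    """Apply the metadata filter first, then score only the survivors."""
    candidates = [c for c in chunks if matches(c["metadata"], flt)]
    return sorted(candidates, key=score_fn, reverse=True)[:top_k]
```

Because the filter runs first, a high-scoring chunk from the wrong department or an outdated revision never reaches the ranking step.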
Chapter 02: Key Takeaways
- The ingestion pipeline: Load → Clean → Chunk → Enrich → Embed; each stage has failure modes
- PDFs are problematic: use specialized loaders (pdfplumber) and validate extraction quality
- Chunking strategies: fixed-size (simple), recursive (general purpose), semantic (highest quality)
- Chunk size is a tradeoff: smaller = higher precision, larger = better context; test on your data
- Advanced patterns: parent-child (search small, retrieve large) and sentence-window (precise + context)
- Metadata enables pre-filtering: extract source, date, department, doc_type at ingestion time
- Document-aware chunking (Markdown headers) preserves structure and enables hierarchical filtering
Embeddings are how machines "understand" text. A good embedding model compresses semantic meaning into a vector such that similar meanings are close together in vector space. The choice of embedding model fundamentally determines what your RAG system can and cannot retrieve.
An embedding is a fixed-length vector (array of numbers) that represents the semantic meaning of text. Embedding models are trained so that semantically similar text produces similar vectors, enabling "meaning-based" search rather than keyword matching.
Embedding models learn from billions of text pairs. During training, they learn that "password reset" and "forgot credentials" appear in similar contexts, so they map to nearby vectors. At search time, we embed the query and find the closest document vectors. No keyword matching required.
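"Closest" is usually measured with cosine similarity. The sketch below implements it in plain Python and ranks a few documents against a query; the three-dimensional vectors are made up for illustration, since real embeddings have hundreds or thousands of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest(query_vec, doc_vecs, top_k=3):
    """Rank documents by similarity to the query (brute-force scan)."""
    scored = [(doc_id, cosine_similarity(query_vec, vec)) for doc_id, vec in doc_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_k]


# Toy vectors standing in for real embeddings.
docs = {
    "password-reset": [0.9, 0.1, 0.0],
    "billing-faq": [0.0, 0.2, 0.9],
}
query = [0.8, 0.2, 0.1]  # pretend embedding of "forgot credentials"
print(nearest(query, docs, top_k=1))
```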
There are dozens of embedding models available. They differ in quality (MTEB benchmarks), dimensionality (affects storage/speed), cost, and whether they can run locally. Here are the ones that matter in 2026:
| Model | Dimensions | MTEB Avg | Cost | Where | Best For |
|---|---|---|---|---|---|
| OpenAI text-embedding-3-large | 3072 (or 256–1536) | 64.6 | $0.13/1M tokens | API | Production default, scalable |
| OpenAI text-embedding-3-small | 1536 (or 256–512) | 62.3 | $0.02/1M tokens | API | Budget production |
| Cohere embed-v3 | 1024 | 64.5 | $0.10/1M tokens | API | Multilingual, input types |
| Voyage-2 | 1024 | 65.4 | $0.10/1M tokens | API | Legal, code, finance |
| BGE-large-en-v1.5 | 1024 | 64.2 | Free | Local | Self-hosted, privacy |
| E5-mistral-7b-instruct | 4096 | 66.6 | Free | Local (GPU) | Highest quality OSS |
| GTE-Qwen2-7B-instruct | 3584 | 67.2 | Free | Local (GPU) | SOTA open-source |
| all-MiniLM-L6-v2 | 384 | 56.3 | Free | Local (CPU) | Prototyping only |
Use OpenAI text-embedding-3-large or Cohere embed-v3. Battle-tested, no GPU infra needed, good latency. Cost is usually negligible vs LLM costs.
Use BGE-large or E5-mistral. Run locally, no data leaves your infrastructure. BGE runs on CPU; E5/GTE need GPU but are higher quality.
Use Cohere embed-v3 (100+ languages) or multilingual-e5-large. Don't assume English-only models work for other languages; they generally don't.
Higher dimensions capture more nuance but cost more to store and search. OpenAI's embedding-3 models support dimension reduction via Matryoshka Representation Learning: you can truncate vectors without retraining.
Lower dimensions (e.g. 512):
- Storage: 1M vectors × 512 dims × 4 bytes ≈ 2 GB
- Speed: Faster similarity calculations
- Cost: Lower vector DB costs
- Quality: May lose subtle distinctions
Best for: Large scale (>10M docs), cost-sensitive
Higher dimensions (e.g. 3072):
- Quality: Captures nuanced meaning
- Precision: Better for similar documents
- Storage: 1M × 3072 × 4 bytes ≈ 12 GB
- Speed: Slower search (but still fast)
Best for: High precision needs, <1M docs
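The storage figures above, and the truncation step itself, can be sketched as follows. Truncation only preserves quality for models trained with Matryoshka-style objectives; the re-normalization step is an assumption that your index expects unit-length vectors. Hosted APIs may also offer a request parameter that returns shortened vectors directly, so check your provider's documentation.

```python
import math


def truncate_embedding(vec, dims):
    """Keep the first dims values and re-normalize to unit length."""
    head = vec[:dims]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head] if norm else head


def index_size_gb(num_vectors, dims, bytes_per_value=4):
    """Raw float32 vector storage in GB (10^9 bytes), excluding index overhead."""
    return num_vectors * dims * bytes_per_value / 1e9


for dims in (512, 1536, 3072):
    print(f"{dims:>4} dims: {index_size_gb(1_000_000, dims):.1f} GB per 1M vectors")
```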
Queries and documents are fundamentally different: queries are short questions, documents are long answers. Some embedding models handle this asymmetry explicitly, and they perform significantly better for retrieval tasks.
Symmetric: Same embedding model, same prompt for queries and documents. Works okay, but not optimal.
Asymmetric: Different instruction prefixes for queries vs documents. Models trained for this (E5, BGE, Cohere) perform ~5–10% better.
Each model has its own conventions. BGE v1.5 prepends "Represent this sentence for searching relevant passages: " to queries only. E5 uses "query: " and "passage: ". Cohere uses an API parameter for the input type. Using the wrong format can reduce retrieval quality by 10–15%, so read the model card.
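As a small illustration of the E5 convention, a helper like the one below keeps the prefixes in one place so queries and passages cannot be mixed up by accident. The helper name is hypothetical; confirm the exact prefix strings against the model card of the checkpoint you use.

```python
E5_QUERY_PREFIX = "query: "
E5_PASSAGE_PREFIX = "passage: "


def prepare_for_e5(texts, is_query):
    """Prepend the E5-style role prefix before sending texts to the embedder."""
    prefix = E5_QUERY_PREFIX if is_query else E5_PASSAGE_PREFIX
    return [prefix + text.strip() for text in texts]


print(prepare_for_e5(["how do I reset my password?"], is_query=True))
print(prepare_for_e5(["To reset your password, open Settings."], is_query=False))
```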
| Practice | Why | Implementation |
|---|---|---|
| Batch your API calls | 50–100× faster than one-by-one | Send up to 2048 texts per API call (OpenAI limit) |
| Cache embeddings | Don't re-embed unchanged docs | Hash document content, store in cache, skip if exists |
| Normalize vectors | Lets dot product stand in for cosine similarity | Most APIs return normalized vectors, but verify |
| Same model everywhere | Vectors from different models are not comparable | Query and doc embeddings must use the same model |
| Prepend chunk metadata | Context helps embedding quality | "Title: X | Section: Y | Content: Z" |
| Handle max length | Models truncate silently | Check the model's max tokens (usually 512–8192) |
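The batching and caching rows above combine naturally. In this sketch, `embed_batch` is a hypothetical callable that takes a list of texts and returns one vector per text (for example, a thin wrapper around your embedding API), and `cache` is any dict-like store.

```python
import hashlib


def content_key(text):
    """Stable cache key derived from the text content."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def embed_with_cache(texts, embed_batch, cache, batch_size=2048):
    """Embed texts, calling embed_batch only for content not already cached."""
    missing = {}
    for text in texts:
        key = content_key(text)
        if key not in cache:
            missing[key] = text  # the dict also de-duplicates repeated texts
    keys = list(missing)
    for start in range(0, len(keys), batch_size):
        batch_keys = keys[start:start + batch_size]
        vectors = embed_batch([missing[k] for k in batch_keys])
        cache.update(zip(batch_keys, vectors))
    return [cache[content_key(text)] for text in texts]
```

If you change embedding models, include the model name in the cache key or clear the cache, since vectors from different models must not be mixed.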
General-purpose embeddings work well for most text. But for specialized domains (legal, medical, code), fine-tuning on your data can improve retrieval quality significantly; 10–30% gains are common.
Fine-tune when:
- Domain has specialized vocabulary (medical, legal, code)
- General embeddings fail on your evaluation set
- You have (query, relevant_doc) pairs for training
- Retrieval quality is critical for production
- You can afford retraining when the base model updates
Don't fine-tune when:
- General-purpose embeddings work well enough
- You don't have labeled training data
- You need to iterate quickly (fine-tuning is slow)
- Domain is general knowledge / common English
- Using an API-only model (can't fine-tune OpenAI embeds)
Use sentence-transformers with contrastive loss. You need (anchor, positive, negative) triplets: the anchor is a query, the positive is a relevant doc, and the negative is an irrelevant doc. Train for 1–3 epochs with a learning rate of 2e-5. Evaluate on a held-out test set; if Recall@5 improves by >5%, deploy the fine-tuned model.
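One common objective over such triplets is the triplet margin loss, shown here as an illustration; the text does not say which contrastive loss is intended, and in-batch-negative losses are another frequent choice. The symbols are assumptions: $a$, $p$, $n$ are the embeddings of the anchor, positive, and negative, $d$ is a distance such as $1 - \cos$, and $m > 0$ is a margin hyperparameter.

```latex
\mathcal{L}(a, p, n) = \max\bigl(0,\; d(a, p) - d(a, n) + m\bigr)
```

The loss is zero once the positive is closer to the anchor than the negative by at least the margin, so training effort concentrates on triplets the model still gets wrong.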
| Fine-tuning Method | Training Data Needed | Quality Gain | Effort |
|---|---|---|---|
| Contrastive fine-tuning | 1K–10K (query, doc) pairs | +10–30% recall | Medium |
| Matryoshka fine-tuning | 1K–10K pairs | Maintain quality at low dims | Medium |
| Adapter layers (LoRA) | 500–2K pairs | +5–15% recall | Low |
| Hard negative mining | Requires iterative labeling | +15–25% over random negatives | High |
MTEB (Massive Text Embedding Benchmark) is the industry standard for comparing embedding models. It evaluates models across 56+ datasets in retrieval, classification, clustering, and more. But MTEB alone isn't enough: you must test on your own data.
Check huggingface.co/spaces/mteb/leaderboard for current rankings. Focus on the Retrieval category for RAG use cases.
Create 50โ100 (query, relevant_docs) pairs from your actual data. Measure Recall@5 and MRR. The model that wins on MTEB may not win on your domain.
Same chunking, same index, same queries. Only change the embedding model. Run 3 times, average results. Statistical significance matters.
Some models are effectively "trained on the test set": they've seen MTEB datasets during development. Their MTEB scores look great, but they may not generalize. Always validate on your own held-out data before committing to a model in production.
Chapter 03: Key Takeaways
- Embeddings turn text into vectors where similar meanings are close together, enabling semantic search
- Top models: OpenAI text-embedding-3-large (API), BGE-large (local), Cohere embed-v3 (multilingual)
- Dimensionality tradeoff: higher = better quality, lower = cheaper storage; use Matryoshka truncation to choose
- Use asymmetric embeddings: different prefixes for queries vs documents (E5: "query:", "passage:")
- Best practices: batch API calls, cache embeddings, prepend metadata to chunks
- Fine-tune for specialized domains: 10–30% gains possible with 1K+ training pairs
- MTEB is useful but not definitive: test on your own data before choosing a production model
You've got embeddings. Now where do you put them? Vector databases are purpose-built for storing millions of vectors and finding the nearest neighbors in milliseconds. The choice of database and index type determines your latency, accuracy, cost, and operational complexity.
You could store vectors in PostgreSQL as plain arrays. But when you have 10 million vectors, a brute-force scan that computes cosine similarity against all of them can take seconds to minutes per query. Vector databases use specialized indices that trade perfect accuracy for 100–1000× faster search.
ANN is approximate: it might miss the actual closest neighbor and return the 2nd or 3rd closest instead. In practice, with good tuning, you get 95–99% recall (on average, 95–99% of the true top-k neighbors are returned) at roughly 100× the speed. For RAG, this tradeoff is almost always worth it.
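ANN recall is measured by comparing the index's results with an exact brute-force search on a sample of queries. A minimal sketch, assuming you already have both result lists as ranked ids (the function names are illustrative):

```python
def ann_recall_at_k(approx_ids, exact_ids, k):
    """Share of the true top-k neighbors that the ANN index also returned."""
    return len(set(approx_ids[:k]) & set(exact_ids[:k])) / k


def mean_ann_recall(results, k):
    """Average recall over (approx_ids, exact_ids) pairs, one per sample query."""
    return sum(ann_recall_at_k(a, e, k) for a, e in results) / len(results)


# The index missed one of the four true neighbors for this query.
print(ann_recall_at_k([1, 2, 9, 4], [1, 2, 3, 4], k=4))
```

Running this on a few hundred sampled queries after each index parameter change shows how much accuracy a faster configuration costs.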
The vector database market has exploded. Here's an honest comparison of the major options. There's no single "best" choice, only tradeoffs for your use case.
| Database | Type | Best For | Scale | Cost | Complexity |
|---|---|---|---|---|---|
| Pinecone | Managed | Production, serverless, no ops | Billions | $70+/mo | Very low |
| Qdrant | Both | Self-hosted + cloud, flexible | Billions | Free–$$$ | Medium |
| Weaviate | Both | Built-in ML, hybrid search | Billions | Free–$$$ | Medium |
| Milvus | Self-hosted | Massive scale, on-prem | Billions+ | Free (infra) | High |
| pgvector | Extension | Already using Postgres | ~5M vectors | Free | Low |
| Chroma | Embedded | Prototyping, local dev | ~1M vectors | Free | Very low |
| FAISS | Library | Research, custom pipelines | Billions | Free | High |
Use Pinecone or Qdrant Cloud to start. Zero ops, scales automatically, free tiers available. Focus on your RAG logic, not infrastructure.
Use pgvector. No new infrastructure, same ops model, works up to ~5M vectors. Beyond that, consider dedicated vector DB.
Self-host Qdrant or Milvus. Both are production-ready. Qdrant is simpler; Milvus scales larger but needs more ops work.
ANN search works by building an index, a data structure that allows skipping most vectors during search. The two dominant index types are HNSW and IVF. Most modern vector databases default to HNSW.
HNSW (Hierarchical Navigable Small World)
How it works: Builds a multi-layer graph where each node connects to nearby neighbors. Search starts at the top layer and descends to find the approximate nearest neighbors.
- Pros: Fast search (1–10ms), high recall, no training needed, incrementally updatable
- Cons: High memory usage (stores graph edges), slow build time for very large datasets
Best for: Most RAG use cases, <100M vectors
IVF (Inverted File Index)
How it works: Clusters vectors into buckets (centroids) via k-means. Search only probes nearby buckets.
- Pros: Lower memory, compresses well with PQ (product quantization), good for very large scale
- Cons: Requires training phase, slower search than HNSW, updating is expensive
Best for: 100M+ vectors, memory-constrained, batch-build scenarios
| Parameter | What It Controls | Higher Value | Lower Value |
|---|---|---|---|
| M (HNSW) | Edges per node | Better recall, more memory | Less memory, lower recall |
| ef_construction | Build-time search width | Better index quality, slower build | Faster build, lower quality |
| ef_search | Query-time search width | Better recall, slower queries | Faster queries, lower recall |
| nlist (IVF) | Number of clusters | More granular, slower build | Faster build, coarser search |
| nprobe (IVF) | Clusters to search | Better recall, slower queries | Faster queries, lower recall |
Here's how to set up the most common vector databases. All examples use Python and store 1536-dimensional embeddings with metadata.
Vector databases scale differently than traditional databases. Memory is usually the bottleneck, not disk. Here's what to expect:
| Scale | Vectors | Memory (1536d) | Typical Cost | Latency (p99) |
|---|---|---|---|---|
| Small | <100K | ~600 MB | Free tier / $20/mo | <10ms |
| Medium | 100K–1M | 600 MB–6 GB | $50–200/mo | <20ms |
| Large | 1M–10M | 6–60 GB | $200–500/mo | 20–50ms |
| Very Large | 10M–100M | 60–600 GB | $500–2000/mo | 50–100ms |
| Massive | >100M | >600 GB | $2000+/mo + ops | 100ms+ (sharded) |
1. Reduce dimensions: 256d instead of 1536d = 6× less memory.
2. Quantization: Store int8 instead of float32 = 4× less memory (some quality loss).
3. Tiered storage: Keep hot vectors in memory, cold in disk-backed storage.
4. Aggressive deduplication: Remove near-duplicate chunks before indexing.
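The quantization lever can be illustrated with simple per-vector scalar quantization. This is a sketch of the idea only; vector databases implement their own schemes (often with calibrated ranges or product quantization), and the function names here are illustrative.

```python
from array import array


def quantize_int8(vec):
    """Map floats to signed 8-bit codes in [-127, 127] using one scale per vector."""
    scale = max(abs(x) for x in vec) / 127 or 1.0
    return array("b", (round(x / scale) for x in vec)), scale


def dequantize(codes, scale):
    """Approximate reconstruction of the original floats."""
    return [c * scale for c in codes]


vec = [0.12, -0.87, 0.45, 0.03]
codes, scale = quantize_int8(vec)
restored = dequantize(codes, scale)
print(codes.itemsize, "byte per value instead of 4")
print(max(abs(a - b) for a, b in zip(vec, restored)))  # worst-case rounding error
```

Each value is off by at most half a quantization step, which is the "some quality loss" noted above; whether that affects recall should be measured on your own queries.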
Chapter 04: Key Takeaways
- Vector databases use ANN indices (not brute force) to search millions of vectors in milliseconds
- Top choices: Pinecone (managed), Qdrant (flexible), pgvector (if already on Postgres)
- HNSW is the default index type: fast search, high recall, good for <100M vectors
- Key parameters: M (edges), ef_construction (build quality), ef_search (query recall)
- Memory scales with vectors: 1M × 1536d × 4 bytes ≈ 6 GB; dimension reduction helps
- Start with managed services; self-host only when you need privacy or have ops capacity
Retrieval is where RAG succeeds or fails. You can have perfect embeddings and a fast vector database, but if your retrieval strategy doesn't find the right documents, the LLM will confidently answer with irrelevant context. This chapter covers how to actually get retrieval right.
There are fundamentally two ways to find relevant documents: dense retrieval (semantic similarity via embeddings) and sparse retrieval (keyword matching via inverted indices). Both have strengths; the best systems use both.
Dense retrieval (embeddings)
How it works: Embed the query, then find the nearest document embeddings by cosine similarity
Strengths: Semantic understanding; "car" matches "automobile", handles paraphrase
Weaknesses: Misses exact keywords, struggles with numbers, acronyms, rare terms
Example: "Python web framework" retrieves docs about Flask even if the word "framework" never appears in them
Sparse retrieval (BM25)
How it works: Count word occurrences, score by term frequency and rarity (BM25)
Strengths: Exact keyword match, numbers, codes, acronyms, rare terms
Weaknesses: No semantic understanding; "car" doesn't match "automobile"
Example: "ERR-4592" finds the exact error code; vector search might miss it
Benchmarks consistently show: Hybrid search (dense + sparse) outperforms either alone by 5–15% on recall. Dense captures paraphrase and semantic similarity; sparse captures exact terms the embeddings might miss. Use both.
Hybrid search runs both dense and sparse retrieval, then combines the results. The key question: how do you merge two ranked lists? The most common approach is Reciprocal Rank Fusion (RRF).
| Hybrid Method | How It Works | Best When |
|---|---|---|
| RRF (Reciprocal Rank Fusion) | Sum of 1/(k+rank) across retrievers | Default choice, no tuning needed |
| Weighted Linear | α × dense_score + (1-α) × sparse_score | When you can tune α on your data |
| Cascade | Sparse first for recall, dense re-rank | When sparse is faster, need latency |
| Learned Fusion | Train a model to combine scores | When you have labeled data + resources |
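RRF from the table above can be sketched in a few lines. This is a minimal version, assuming each retriever returns an ordered list of document IDs; `k=60` is a commonly used constant rather than a value from this guide, and the document IDs are made up for illustration:

```python
from collections import defaultdict


def reciprocal_rank_fusion(ranked_lists, k=60):
    """Merge ranked lists of doc IDs by summing 1 / (k + rank) per document."""
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc ID for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


dense_results = ["doc_a", "doc_b", "doc_c"]    # hypothetical vector-search ranking
sparse_results = ["doc_c", "doc_a", "doc_d"]   # hypothetical BM25 ranking
fused = reciprocal_rank_fusion([dense_results, sparse_results])
print(fused)
```

Because RRF only looks at ranks, it needs no score normalization between the dense and sparse retrievers, which is why it works as a default without tuning.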
User queries are often poor search queries: too short, ambiguous, or phrased as questions when documents are statements. Query enhancement transforms the user's question into something more likely to retrieve relevant documents.
Query expansion
Add related terms to the query. "Python web" → "Python web Flask Django framework app"
- Use LLM to generate synonyms
- Or use WordNet/embeddings
- Increases recall, may hurt precision
HyDE (Hypothetical Document Embeddings)
Generate a hypothetical answer, embed that instead of the question.
- Questions embed differently than docs
- Generated answer is closer to real docs
- +10–15% recall on some benchmarks
Multi-query retrieval
Generate 3–5 variations of the query, retrieve with each, merge results.
- "refund policy" + "return items" + "money back guarantee"
- Captures more angles
- 3× retrieval cost
HyDE adds an LLM call to every query: 100–500ms of latency plus cost. It works best when: (1) question-answer asymmetry is high, (2) latency is not critical, (3) you've measured real improvement on your data. Don't blindly apply it; A/B test.
Sometimes you know constraints before search: only documents from this year, only from the engineering team, only product docs. Metadata filtering narrows the search space, improving both precision and speed.
| Filter Type | Example | Use Case |
|---|---|---|
| Equality | department = "engineering" | User belongs to a specific team |
| Range | date >= "2025-01-01" | Only recent documents |
| In-list | source IN ["faq", "docs"] | Only certain document types |
| Boolean AND | public = true AND lang = "en" | Multiple constraints |
| Geo | location NEAR (lat, lon, 10km) | Location-aware retrieval |
Pre-filter (narrow before ANN search) is much faster and should be default. Post-filter (search all, filter results) only when: (1) filter would leave <100 candidates, (2) you need exact top-k after filtering. Most vector DBs support efficient pre-filtering.
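The difference between the two filtering orders is easiest to see on a toy corpus. The sketch below uses brute-force cosine similarity as a stand-in for ANN search; the documents, vectors, and function names are hypothetical:

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def prefilter_search(query_vec, docs, predicate, top_k):
    """Apply the metadata filter first, then rank only the surviving docs."""
    candidates = [d for d in docs if predicate(d["meta"])]
    ranked = sorted(candidates, key=lambda d: cosine(query_vec, d["vec"]), reverse=True)
    return [d["id"] for d in ranked[:top_k]]


def postfilter_search(query_vec, docs, predicate, top_k):
    """Rank everything, take top_k, then drop results that fail the filter."""
    ranked = sorted(docs, key=lambda d: cosine(query_vec, d["vec"]), reverse=True)
    return [d["id"] for d in ranked[:top_k] if predicate(d["meta"])]


docs = [
    {"id": "eng-1", "vec": [0.9, 0.1], "meta": {"department": "engineering"}},
    {"id": "hr-1", "vec": [1.0, 0.0], "meta": {"department": "hr"}},
    {"id": "hr-2", "vec": [0.95, 0.05], "meta": {"department": "hr"}},
    {"id": "eng-2", "vec": [0.2, 0.8], "meta": {"department": "engineering"}},
]


def is_engineering(meta):
    return meta["department"] == "engineering"


pre = prefilter_search([1.0, 0.0], docs, is_engineering, top_k=2)
post = postfilter_search([1.0, 0.0], docs, is_engineering, top_k=2)
print(pre, post)
```

Here the post-filter returns nothing because both of the closest vectors belong to another department, while the pre-filter still returns two engineering documents.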
Real applications have multiple document types: FAQs, documentation, support tickets, code. Each may benefit from different chunking, embeddings, or ranking. Multi-index retrieval queries multiple specialized indices and merges results.
Use multiple indices when:
- Documents have very different structures (FAQ vs manuals)
- Different chunking strategies needed
- Different embedding models work better for each type
- Access control varies by doc type
Stay with a single index when:
- Documents are homogeneous
- Same chunking works everywhere
- Adding complexity without measured benefit
- Small corpus (<100K docs)
| Optimization | Expected Gain | Effort | When to Use |
|---|---|---|---|
| Add BM25 hybrid search | +5–15% recall | Low | Almost always; the default recommendation |
| Cross-encoder re-ranking | +5–20% precision | Medium | When precision matters more than latency |
| Multi-query retrieval | +5–10% recall | Low–Medium | Short/ambiguous queries |
| HyDE | +5–15% recall | Medium | High question-doc asymmetry |
| Metadata pre-filtering | Variable (precision) | Low | When you have useful metadata |
| Better chunking | +10–30% recall | Medium | Before other optimizations |
| Better embedding model | +5–15% recall | Medium | If current model underperforms on eval |
Before adding fancy techniques, get the basics right: 1. Good chunking (Ch 2), 2. Good embeddings (Ch 3), 3. Hybrid search (this chapter), 4. Re-ranking (Ch 6). Only then consider HyDE, multi-query, or multi-index. Measure each change on your eval set; many "optimizations" don't help specific datasets.
∑ Chapter 05 — Key Takeaways
- Dense retrieval (vectors) captures semantic similarity; sparse retrieval (BM25) captures exact keywords
- Hybrid search combining both outperforms either alone; use RRF to merge ranked lists
- Query enhancement: HyDE (hypothetical docs), multi-query (variations), query expansion (synonyms)
- Metadata filtering narrows search space before ANN: faster and more precise
- Multi-index architectures help when document types are very different
- Optimization order: chunking → embeddings → hybrid search → re-ranking → advanced techniques
- Always measure on your eval set; not all optimizations help all datasets
Retrieval gives you candidates. Re-ranking gives you the right candidates in the right order. A cross-encoder re-ranker looking at 20 retrieved chunks and selecting the best 5 can improve answer quality more than any other single optimization in the RAG pipeline.
Embedding-based retrieval is fast but shallow: it encodes the query and each document separately, so the model never reads the two together. Re-ranking takes the top-k candidates and applies a more powerful (but slower) model that reads query + document jointly.
Bi-encoder (retrieval)
How: Encode query and document independently, then compare
Speed: ~1ms per 1M docs (pre-indexed)
Quality: Good recall, moderate precision
Use: Initial retrieval from full corpus
Cross-encoder (re-ranking)
How: Feed [query + document] together through transformer
Speed: ~5ms per document pair
Quality: Much higher precision
Use: Re-rank top 20–50 candidates
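The two-stage shape can be sketched independently of any particular model. In the example below, `fast_score` and `slow_score` are toy word-overlap stand-ins for a bi-encoder and a cross-encoder (they are not real models), and the call counter only exists to show that the expensive scorer never touches the full corpus:

```python
def retrieve_then_rerank(query, corpus, fast_score, slow_score, k_retrieve=50, k_final=5):
    """Stage 1: cheap score over everything. Stage 2: expensive score on the shortlist."""
    candidates = sorted(corpus, key=lambda doc: fast_score(query, doc), reverse=True)[:k_retrieve]
    return sorted(candidates, key=lambda doc: slow_score(query, doc), reverse=True)[:k_final]


slow_calls = {"count": 0}


def fast_score(query, doc):
    # Toy stand-in for embedding similarity: number of shared words.
    return len(set(query.split()) & set(doc.split()))


def slow_score(query, doc):
    # Toy stand-in for a cross-encoder: reads query and doc together.
    slow_calls["count"] += 1
    return fast_score(query, doc) + (1 if query in doc else 0)


corpus = [
    "refund policy for annual plans",
    "policy on office plants",
    "how to request a refund",
    "annual report 2023",
]
top = retrieve_then_rerank("refund policy", corpus, fast_score, slow_score, k_retrieve=3, k_final=2)
print(top, slow_calls["count"])
```

With a real cross-encoder at roughly 5ms per pair, the shortlist size `k_retrieve` is what bounds re-ranking latency.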
| Model | Type | Quality | Speed | Cost | Best For |
|---|---|---|---|---|---|
| Cohere Rerank v3 | API | Excellent | ~100ms / 50 docs | $1/1K searches | Production default |
| Voyage Reranker | API | Excellent | ~100ms / 50 docs | $0.05/1K | Cost-effective API option |
| BGE-reranker-v2-m3 | Local | Very good | ~200ms / 50 docs (GPU) | Free | Self-hosted, multilingual |
| cross-encoder/ms-marco | Local | Good | ~300ms / 50 docs (GPU) | Free | Prototyping, English only |
| ColBERT v2 | Local | Very good | ~50ms / 50 docs | Free | Late interaction, fast |
| Jina Reranker v2 | Both | Very good | ~100ms / 50 docs | Free / API | Multilingual, long docs |
Research shows LLMs pay more attention to information at the beginning and end of the context window, but tend to miss information in the middle. This "lost in the middle" effect means document ordering matters.
Put the most relevant document first. Simple, effective for most cases.
Alternate: #1 first, #2 last, #3 second, #4 second-to-last. Spreads relevance across attention peaks.
Re-rank to top 3–5 instead of stuffing 10+. Less middle = less lost. Quality over quantity.
In benchmarks, placing the answer in the middle vs at the start reduces accuracy by 10–20%. The fix: re-rank aggressively (top 3–5 only), put best results first, and keep total context short. More context ≠ better answers.
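The interleaved ordering described above (#1 first, #2 last, #3 second, #4 second-to-last) can be sketched as follows, assuming the input list is already sorted best-first; the function name is illustrative:

```python
def interleave_for_attention(ranked):
    """Place odd-ranked items at the front and even-ranked items at the back, mirrored."""
    front, back = [], []
    for position, item in enumerate(ranked):
        (front if position % 2 == 0 else back).append(item)
    return front + back[::-1]


ordered = interleave_for_attention(["#1", "#2", "#3", "#4", "#5"])
print(ordered)
```

The weakest chunks end up in the middle of the context, where the model is least likely to attend to them.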
If your top 5 results are 5 chunks from the same document saying the same thing, you waste context and miss other relevant information. Diversity ranking ensures retrieved results cover different aspects of the query.
| Strategy | How It Works | When to Use |
|---|---|---|
| MMR (Maximal Marginal Relevance) | Balance relevance to query vs diversity from already-selected docs | Default choice: simple, effective |
| Source deduplication | Max 2 chunks per source document | When same doc is over-represented |
| Similarity threshold | Remove results with cosine sim >0.95 to each other | When near-duplicates are common |
| Category diversity | Ensure mix of doc types (FAQ + docs + code) | Multi-index systems |
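A minimal MMR sketch, assuming cosine similarity for both relevance and redundancy and a trade-off weight `lam` (the λ = 0.7 default and the toy vectors are illustrative choices, not values from this guide):

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def mmr(query_vec, doc_vecs, k=3, lam=0.7):
    """Return indices of k docs, trading relevance (lam) against redundancy (1 - lam)."""
    selected = []
    remaining = list(range(len(doc_vecs)))
    while remaining and len(selected) < k:
        def score(i):
            relevance = cosine(query_vec, doc_vecs[i])
            redundancy = max((cosine(doc_vecs[i], doc_vecs[j]) for j in selected), default=0.0)
            return lam * relevance - (1 - lam) * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


query = [1.0, 1.0]
docs = [
    [1.0, 0.2],    # covers the first aspect of the query
    [1.0, 0.25],   # near-duplicate of doc 0, slightly more relevant
    [0.2, 1.0],    # covers the second aspect
]
picked = mmr(query, docs, k=2)
print(picked)
```

Doc 1 is picked first on relevance alone; doc 2 then beats doc 0 because doc 0 is almost identical to what has already been selected.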
∑ Chapter 06 — Key Takeaways
- Use a two-stage architecture: fast bi-encoder retrieval (top 50), then cross-encoder re-ranking (top 5)
- Top re-rankers: Cohere Rerank v3 (API), BGE-reranker-v2 (local), ColBERT v2 (fast local)
- Re-ranking typically improves precision@5 by 15–30%, making it the highest-ROI optimization after hybrid search
- Lost in the middle: LLMs miss info in the middle of context, so put best results first and use fewer chunks
- MMR diversity ranking avoids redundant results by balancing relevance (λ) against diversity (1-λ)
- Deduplicate by source: max 2 chunks per document prevents one doc dominating context
You've retrieved the right documents and ranked them well. Now comes the final mile: how you assemble the prompt determines whether the LLM uses that context correctly. Bad context construction turns great retrieval into mediocre answers.
Every RAG prompt has four parts: system instruction, retrieved context, user query, and output format. Getting each part right, and getting the ordering right, is what separates good RAG from great RAG.
Citations serve two purposes: they let users verify answers and they reduce hallucination by forcing the model to ground statements in specific sources. Without citations, you can't tell if the model invented something.
Inline citations
"The refund window is 30 days [Source 1]. For subscriptions, cancellation is immediate [Source 2]."
- Easiest to implement
- Clear source per claim
- Works with any LLM
Footnote citations
"The refund window is 30 days¹. Cancel anytime²."
¹ pricing-faq.pdf ² terms.pdf
- Cleaner reading flow
- Needs post-processing
- Better for long answers
Quote-based citations
"As stated in the FAQ: 'returns within 30 days of purchase' (pricing-faq.pdf)"
- Verifiable quotes
- Highest trust
- Longer responses
LLMs can hallucinate citations: they'll cite "[Source 3]" even if only 2 sources exist, or attribute information to the wrong source. Always validate citations programmatically: check that cited source numbers exist, and optionally verify the claim appears in the cited chunk using string matching or semantic similarity.
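The first check (do the cited source numbers exist?) needs only a regular expression. A minimal sketch, assuming the inline `[Source N]` format shown earlier and 1-based source numbering; the function name is illustrative:

```python
import re


def validate_citations(answer: str, num_sources: int) -> dict:
    """Find [Source N] markers and flag any N outside 1..num_sources."""
    cited = {int(n) for n in re.findall(r"\[Source (\d+)\]", answer)}
    invalid = sorted(n for n in cited if not 1 <= n <= num_sources)
    return {"cited": sorted(cited), "invalid": invalid}


answer = "The refund window is 30 days [Source 1]. Cancel anytime [Source 3]."
report = validate_citations(answer, num_sources=2)
print(report)
```

This catches non-existent sources but not misattribution; checking that a claim actually appears in the cited chunk needs the string-matching or similarity step mentioned above.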
Even with 128K-token context windows, more context is not always better. Cost increases linearly, latency increases, and the "lost in the middle" effect gets worse. Smart context management is about using the window efficiently.
| Strategy | How It Works | Token Savings | Quality Impact |
|---|---|---|---|
| Fewer, better chunks | Re-rank to top 3–5 instead of 10+ | 50–70% fewer tokens | Often improves quality |
| Chunk compression | Use LLM to summarize each chunk before insertion | 60–80% fewer tokens | May lose details |
| Relevant sentence extraction | Extract only sentences relevant to query from each chunk | 50–70% fewer tokens | Preserves key info |
| Token budget allocation | Set max tokens per chunk (e.g., 500), truncate overflow | Predictable | May cut important context |
| Map-reduce for large corpus | Summarize each chunk separately, then combine summaries | Can handle unlimited docs | Multiple LLM calls, higher latency |
Before (context stuffing): Retrieving 20 chunks × 500 tokens = 10K tokens of context. Most of it is noise. Cost: $0.03 per query with GPT-4o. At 10K queries/day = $300/day just for context.
Quality: LLM drowns in irrelevant text, misses the answer, or picks wrong chunk.
After (re-ranked): Re-rank to top 5 chunks × 500 tokens = 2.5K tokens. Cost: $0.0075 per query. At 10K queries/day = $75/day. 4× cheaper.
Quality: Less noise, LLM focuses on best content, answers more accurately.
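The cost comparison is straightforward arithmetic. The sketch below assumes an input price of $3 per million tokens, which is the rate implied by the $0.03-per-10K-tokens figure in the text rather than a quoted price; check current pricing before relying on it:

```python
def context_cost(chunks, tokens_per_chunk, queries_per_day, usd_per_million_tokens=3.0):
    """Return (cost per query, cost per day) for the retrieved-context tokens only."""
    tokens = chunks * tokens_per_chunk
    per_query = tokens * usd_per_million_tokens / 1_000_000
    return per_query, per_query * queries_per_day


stuffed_query, stuffed_day = context_cost(20, 500, 10_000)
reranked_query, reranked_day = context_cost(5, 500, 10_000)

print(f"stuffed:   ${stuffed_query:.4f}/query, ${stuffed_day:.0f}/day")
print(f"re-ranked: ${reranked_query:.4f}/query, ${reranked_day:.0f}/day")
```

The 4× saving comes entirely from the chunk count, so it holds at any token price.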
The hardest part of RAG isn't answering questions; it's knowing when not to answer. When retrieved context doesn't contain the answer, the LLM should say "I don't know" rather than hallucinate.
For production systems, combine prompt-based abstention with a retrieval confidence check: if the best re-ranked score is below a threshold (e.g., 0.3), don't even send to the LLM; return a canned "I can't help with that" response. Saves tokens and avoids hallucination entirely.
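The confidence gate can be sketched as a thin wrapper around the generation call. This assumes re-ranked results arrive as `(chunk, score)` pairs and that `generate` is whatever function calls your LLM; both are placeholders, and the 0.3 threshold is the example value from the text:

```python
def answer_or_abstain(reranked, generate, min_score=0.3, fallback="I can't help with that."):
    """Skip the LLM entirely when no retrieved chunk clears the score threshold."""
    if not reranked or max(score for _, score in reranked) < min_score:
        return fallback
    return generate([chunk for chunk, _ in reranked])


def fake_generate(chunks):
    # Placeholder for a real LLM call.
    return f"Answer grounded in {len(chunks)} chunk(s)."


confident = answer_or_abstain([("refund policy text", 0.82), ("terms text", 0.41)], fake_generate)
unsure = answer_or_abstain([("unrelated text", 0.12)], fake_generate)
print(confident, "|", unsure)
```

The threshold depends on the re-ranker's score scale, so calibrate it on your own eval set rather than copying a number.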
In conversation, follow-up questions reference earlier context: "What about for enterprise plans?" requires knowing the previous question was about pricing. Query rewriting transforms follow-ups into standalone queries for retrieval.
Query rewriting adds an LLM call per turn (~50–100ms, ~100 tokens). For simple applications, check if the user message is self-contained first (using heuristics like "does it contain a noun?") and only rewrite when it's clearly a follow-up. Don't rewrite "How do I reset my password?"; it's already standalone.
∑ Chapter 07 — Key Takeaways
- RAG prompts have 4 parts: system instruction → retrieved context → user query → output format
- Citation prompting forces grounding (inline, footnote, or quote-based); always validate citations programmatically
- Less context is often better: re-rank to top 3–5 chunks, avoid stuffing 10+ into the prompt
- Teach the model to say "I don't know" with explicit abstention instructions + retrieval confidence thresholds
- Lost in the middle: put best results first, keep context short, consider interleaved ordering
- Multi-turn RAG needs query rewriting: transform follow-ups into standalone retrieval queries
- Context window management: fewer better chunks = 4× cheaper, better quality than stuffing everything
Every RAG system that "works in demos" eventually breaks in production. The difference between a toy and a product is knowing how it fails, measuring how often, and fixing it systematically. This chapter is about building that feedback loop.
RAG failures fall into two categories: retrieval failures (wrong documents found) and generation failures (wrong answer produced from correct documents). Each requires different fixes.
| Failure Mode | Symptom | Root Cause | Fix |
|---|---|---|---|
| ① Missing content | "I don't know" on answerable questions | Document not ingested, format parsing failed | Content coverage audit, loader validation |
| ② Wrong chunks | Confident answer from wrong topic | Poor embeddings, no metadata filter, bad chunking | Hybrid search, re-ranking, metadata filters |
| ③ Stale content | Outdated information returned | No re-indexing pipeline, deleted docs still indexed | TTL, incremental sync, freshness ranking |
| ④ Chunking artifacts | Partial/incoherent answers | Answer split across chunk boundary | Parent-child chunks, larger overlap, semantic chunking |
| ⑤ Hallucination | Facts not in any retrieved document | LLM uses parametric knowledge instead of context | Citation enforcement, faithfulness scoring |
| ⑥ Wrong synthesis | Misinterprets or contradicts source | Context too noisy, conflicting chunks, poor prompt | Fewer chunks, better re-ranking, explicit instructions |
| ⑦ Over-refusal | "I don't know" when answer exists in context | Abstention threshold too high, overly cautious prompt | Calibrate confidence threshold, tune prompt |
Recall@K
Of all relevant documents, what fraction did we find in the top-K?
MRR (Mean Reciprocal Rank)
On average, how high is the first relevant result?
NDCG (Normalized Discounted Cumulative Gain)
How good is the ordering of results? Penalizes relevant docs at low positions.
- NDCG=1.0 = perfect ranking
- Considers graded relevance (not just binary)
- Standard for search quality evaluation
Precision@K
Of the K retrieved docs, how many are actually relevant?
- Precision@5 of 0.6 = 3 of 5 are relevant
- Tradeoff with recall: improve one, the other may drop
- Critical for context quality (less noise)
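The four retrieval metrics can be computed with a few lines each. A minimal sketch, assuming binary relevance for Recall@K, Precision@K, and reciprocal rank, and a dictionary of graded gains for NDCG; the document IDs are made up:

```python
import math


def recall_at_k(retrieved, relevant, k):
    return len(set(retrieved[:k]) & set(relevant)) / len(relevant) if relevant else 0.0


def precision_at_k(retrieved, relevant, k):
    return len(set(retrieved[:k]) & set(relevant)) / k


def reciprocal_rank(retrieved, relevant):
    """1 / rank of the first relevant result, or 0 if none was retrieved."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs):
    """runs is a list of (retrieved, relevant) pairs, one per query."""
    return sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)


def ndcg_at_k(retrieved, gains, k):
    """gains maps doc ID to graded relevance; unlisted docs count as 0."""
    dcg = sum(gains.get(d, 0) / math.log2(i + 1) for i, d in enumerate(retrieved[:k], start=1))
    ideal = sorted(gains.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(i + 1) for i, g in enumerate(ideal, start=1))
    return dcg / idcg if idcg else 0.0


retrieved = ["d1", "d9", "d2", "d8", "d4"]
relevant = {"d1", "d2", "d4", "d5"}
gains = {"d1": 3, "d2": 2, "d4": 1}

print(recall_at_k(retrieved, relevant, 5), precision_at_k(retrieved, relevant, 5))
print(mean_reciprocal_rank([(retrieved, relevant), (["d9", "d1"], relevant)]))
print(ndcg_at_k(retrieved, gains, 5))
```

In this example three of the five retrieved documents are relevant, matching the Precision@5 = 0.6 case above, and recall is 3 of 4.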
| Metric | What It Measures | How to Compute | Target |
|---|---|---|---|
| Faithfulness | Is the answer grounded in retrieved context? | LLM-as-judge: "Are all claims in the answer supported by context?" | >0.90 |
| Answer Relevance | Does the answer address the user's question? | LLM-as-judge: "Does this answer the question? Score 1–5" | >0.85 |
| Context Relevance | Was the retrieved context useful? | LLM-as-judge: "Is this context relevant to the question?" | >0.80 |
| Correctness | Is the answer factually correct? | Compare against golden answer (exact or semantic match) | >0.85 |
| Harmfulness | Does the answer contain harmful/biased content? | Safety classifier or LLM-as-judge | <0.01 |
RAGAS (Retrieval Augmented Generation Assessment) is an open-source framework that automates RAG evaluation. It computes faithfulness, answer relevance, and context metrics using LLM-as-judge.
Create 50–100 (question, expected_answer, relevant_docs) tuples from your actual data. Have domain experts label them. Run RAGAS after every change to chunking, retrieval, or prompts. Automate this in CI: no RAG change ships without passing eval. This is the single most important practice for production RAG.
| What to Log | Why | Alert Threshold |
|---|---|---|
| Retrieval scores (top-k) | Detect queries with no good matches | Best score < 0.3 |
| Re-ranker scores | Detect relevance drops after re-ranking | Best score < 0.5 |
| LLM response latency | Detect slow queries | p99 > 5s |
| "I don't know" rate | Detect coverage gaps | >15% of queries |
| User feedback (thumbs) | Ground truth from users | Negative rate >20% |
| Faithfulness (sampled) | Detect hallucination drift | Score < 0.85 on sample |
You need evaluation to improve, but building eval sets takes time. Start small: 20 golden queries on day 1, add 5 per week from production logs. Within a month you'll have a meaningful eval set. Don't wait for perfection; imperfect evaluation beats no evaluation by a mile.
∑ Chapter 08 — Key Takeaways
- RAG fails in 7 ways: 4 retrieval failures (missing, wrong, stale, chunking) + 3 generation failures (hallucination, wrong synthesis, over-refusal)
- Retrieval metrics: Recall@K (did we find it?), MRR (how high?), NDCG (rank quality), Precision@K (how clean?)
- Generation metrics: Faithfulness (grounded?), Answer Relevance (addresses question?), Correctness (factually right?)
- Use the RAGAS framework to automate evaluation; run it after every pipeline change
- Build a golden test set: 50–100 labeled (query, answer, docs) tuples. This is the most important practice for production RAG
- Log everything in production: retrieval scores, latency, "I don't know" rates, user feedback. Alert on degradation
Chapters 1–8 covered how to build a solid RAG pipeline. This chapter goes beyond that, covering patterns that push RAG quality to the next level when the basics aren't enough. These techniques are more complex but solve specific failure modes that standard RAG can't handle.
Standard RAG uses whatever the retriever returns, even if the results are irrelevant. Corrective RAG (CRAG) adds a self-correction step: evaluate retrieval quality, and if it's poor, try alternative strategies before generating.
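The CRAG control flow can be sketched with pluggable functions. Everything passed in here (`retrieve`, `grade`, `fallback_retrieve`, `generate`) is a hypothetical placeholder for your own retriever, relevance scorer, alternative strategy, and LLM call, and the 0.5 cut-off is an illustrative value:

```python
def corrective_rag(query, retrieve, grade, fallback_retrieve, generate, min_relevance=0.5):
    """Grade retrieved docs; if none pass, fall back to an alternative retrieval strategy."""
    docs = retrieve(query)
    relevant = [doc for doc in docs if grade(query, doc) >= min_relevance]
    if not relevant:
        relevant = fallback_retrieve(query)
    return generate(query, relevant)


# Toy stand-ins so the sketch runs end to end.
def retrieve(query):
    return ["office plant care guide"]


def grade(query, doc):
    return 1.0 if "refund" in doc else 0.0


def fallback_retrieve(query):
    return ["refunds are accepted within 30 days"]


def generate(query, docs):
    return f"Answer based on: {docs[0]}" if docs else "I don't know"


answer = corrective_rag("what is the refund window?", retrieve, grade, fallback_retrieve, generate)
print(answer)
```

The fallback could be a rewritten query, a different index, or a web search; the key idea is that generation only runs on context that passed the relevance check or came from the corrective path.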
Self-RAG teaches the LLM to decide: (1) whether retrieval is needed, (2) whether retrieved docs are relevant, and (3) whether the generated answer is grounded. The model outputs special reflection tokens during generation.
Self-RAG requires fine-tuning a model with reflection tokens. It's most valuable when: queries are mixed (some need retrieval, some don't), and when you need per-statement grounding verification. For most applications, CRAG (no fine-tuning needed) provides 80% of the benefit.
Standard RAG retrieves individual chunks. GraphRAG first builds a knowledge graph from documents (entities and relationships), then traverses the graph during retrieval. This enables multi-hop reasoning across documents.
Standard RAG
Query: "Who manages the team that built Project X?"
Retrieves chunks about Project X, but team and manager info is in different documents. Fails.
Each chunk is independent, with no connections between them.
GraphRAG
Query: "Who manages the team that built Project X?"
Graph: Project X → built by → Team Alpha → managed by → Sarah. Succeeds via graph traversal.
Entities and relationships connect information across documents.
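The multi-hop lookup in the GraphRAG example reduces to following edges. A toy sketch, assuming the graph is stored as a dictionary keyed by `(entity, relation)`; a real system would extract these triples with an LLM and store them in a graph database:

```python
# (entity, relation) -> entity, mirroring the example in the text.
graph = {
    ("Project X", "built_by"): "Team Alpha",
    ("Team Alpha", "managed_by"): "Sarah",
}


def follow(start, relations):
    """Walk the graph from start along the given relations; None if a hop is missing."""
    node = start
    for relation in relations:
        node = graph.get((node, relation))
        if node is None:
            return None
    return node


manager = follow("Project X", ["built_by", "managed_by"])
print(manager)
```

Chunk-level retrieval has no equivalent of the second hop, which is why the same question fails without the graph.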
Use GraphRAG when:
- Multi-hop questions common
- Entity relationships matter
- Global summarization needed
- Data is highly connected
Skip GraphRAG when:
- Queries are simple lookups
- Documents are independent
- Graph construction cost too high
- Data changes too fast
How to build it:
- Microsoft GraphRAG library
- LLM extracts entities + relations
- Store in Neo4j / NetworkX
- Community detection for summaries
In standard RAG, the pipeline is fixed: retrieve → generate. In Agentic RAG, the LLM acts as an agent that decides what to retrieve, when, and how many times. It can reformulate queries, request more context, or search different sources.
Iterative retrieval
Retrieve → analyze gaps → retrieve more → combine. The agent keeps searching until it has enough context.
- FLARE: Forward-Looking Active REtrieval
- Agent generates, detects uncertainty, retrieves more
- 2–5 retrieval rounds typical
Query decomposition
Complex question → break into sub-questions → retrieve for each → combine answers.
- "Compare pricing of X vs Y" → two separate retrievals
- LLM decomposes, retrieves, synthesizes
- Better for multi-part questions
Agentic RAG is powerful but adds latency (2–10× more LLM calls), cost, and unpredictability. The agent might loop, over-retrieve, or go off-track. Use it only when standard RAG demonstrably fails on your queries. Start with CRAG before going full agentic.
| Pattern | Complexity | Latency | Best For | Requires |
|---|---|---|---|---|
| Standard RAG | Low | 200–500ms | 80% of use cases | Chapters 1–7 |
| CRAG | Medium | 500–1000ms | Unreliable retrieval | Relevance scorer |
| Self-RAG | High | 500–1500ms | Mixed query types | Fine-tuned model |
| GraphRAG | High | 1–3s | Multi-hop, connected data | Graph DB, extraction pipeline |
| Agentic RAG | Very High | 2–10s | Complex multi-step queries | Agent framework, tool definitions |
Start with standard RAG (Ch 1–7) + good evaluation (Ch 8). Measure where it fails. If retrieval quality is the bottleneck, add CRAG. If multi-hop queries fail, consider GraphRAG. If complex reasoning fails, consider agentic. Each pattern adds complexity, so only add it when you've measured the need.
∑ Chapter 09 — Key Takeaways
- CRAG evaluates retrieval quality and falls back to web search or refined queries when results are poor
- Self-RAG teaches the LLM to decide when to retrieve and whether results are grounded (requires fine-tuning)
- GraphRAG builds a knowledge graph for multi-hop reasoning across connected documents
- Agentic RAG lets the LLM control retrieval: iterative search, query decomposition, multi-source
- Standard RAG covers 80% of use cases; only add advanced patterns when evaluation shows specific failures
- Each pattern adds complexity, latency, and cost; measure the tradeoff before committing
You've built a RAG system that works. Now ship it. Production RAG isn't about making the retrieval 1% better; it's about keeping it working reliably at scale, managing costs, and responding to drift. This chapter covers the engineering that keeps RAG systems alive.
Many RAG queries are repetitive or semantically similar. Caching avoids re-running expensive retrieval and LLM calls for questions you've already answered.
Exact-match cache
Hash the query string, cache the full response.
- Hit rate: 5–15% typical
- Simple: Redis key-value
- TTL: 1–24 hours
Semantic cache
Embed the query, find similar cached queries by vector distance.
- Hit rate: 15–40% typical
- "refund policy" ≈ "return policy"
- Threshold: cosine > 0.95
Retrieval cache
Cache retrieval results only, still run LLM. Saves retrieval latency + vector DB cost.
- Useful when prompts change often
- TTL: 1–6 hours
- Invalidate on index update
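A semantic cache can be sketched as a list of `(query embedding, response)` pairs with a cosine threshold. This minimal version does a linear scan and uses made-up three-dimensional vectors in place of real embeddings; a production cache would add TTLs and use a vector index for lookup:

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


class SemanticCache:
    def __init__(self, threshold=0.95):
        self.threshold = threshold
        self.entries = []  # list of (query_vector, response)

    def put(self, query_vec, response):
        self.entries.append((query_vec, response))

    def get(self, query_vec):
        """Return the response of the most similar cached query above the threshold."""
        best_score, best_response = 0.0, None
        for vec, response in self.entries:
            score = cosine(query_vec, vec)
            if score > best_score:
                best_score, best_response = score, response
        return best_response if best_score > self.threshold else None


cache = SemanticCache()
cache.put([0.9, 0.1, 0.0], "Refunds are accepted within 30 days.")
hit = cache.get([0.88, 0.12, 0.0])   # near-identical query embedding
miss = cache.get([0.0, 0.1, 0.9])    # unrelated query embedding
print(hit, miss)
```

A threshold that is too low returns cached answers to questions that only look similar, so it is worth logging near-threshold hits and reviewing them.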
| Cost Component | Typical % | Optimization | Savings |
|---|---|---|---|
| LLM generation | 60–70% | Fewer chunks in context, smaller model for simple queries, caching | 30–60% |
| Embedding API | 10–15% | Cache embeddings, batch calls, lower dimensions | 50–80% |
| Vector DB | 10–15% | Reduce dimensions, quantization, tiered storage | 30–60% |
| Re-ranker | 5–10% | Cache re-rank results, reduce candidate count | 20–40% |
Not every query needs GPT-4o. Route simple factual queries to a smaller/cheaper model (GPT-4o-mini, Claude Haiku) and complex reasoning queries to the best model. A simple classifier can save 40–60% on LLM costs by routing 70% of queries to the cheap model.
| Stage | Typical Latency | Optimization | Target |
|---|---|---|---|
| Embedding query | 20–50ms | Local model for embedding, batch | <50ms |
| Vector search | 5–20ms | HNSW tuning, pre-filter, warm cache | <20ms |
| Re-ranking | 50–200ms | Fewer candidates (20 not 50), ColBERT | <100ms |
| LLM generation | 500–3000ms | Streaming, shorter context, faster model | <2000ms |
| Total (no cache) | 800–3500ms | Parallelize retrieval + embedding | <2500ms |
| Total (cache hit) | 10–50ms | Semantic cache | <50ms |
Use streaming responses: start showing the LLM's answer token-by-token while it's still generating. Perceived latency drops from 2s to <500ms (time to first token). Every production RAG system should stream.
Full re-index
Cron job: re-index all docs every N hours. Simple but inefficient for large corpora.
- Best for: <10K docs
- Frequency: hourly to daily
- Pro: Simple to implement
Incremental sync
Track document hashes. Only re-embed changed/new/deleted docs. 10–100× faster than full re-index.
- Best for: 10K–1M docs
- Frequency: real-time to hourly
- Pro: Efficient, low cost
Event-driven
Webhook on doc change triggers re-indexing. Near-real-time freshness.
- Best for: critical freshness needs
- Frequency: real-time
- Pro: Immediate, targeted
When a document is deleted from the source, its chunks remain in the vector DB until explicitly removed. Users get answers from documents that no longer exist. Always track document IDs in your vector DB and delete chunks when source docs are removed.
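Hash tracking and deletion sync can share one planning step. A minimal sketch, assuming you keep a mapping of document ID to content hash alongside the index; the function names and sample documents are illustrative:

```python
import hashlib


def content_hash(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def plan_sync(source_docs, indexed_hashes):
    """Compare source docs against stored hashes; return (ids to re-embed, ids to delete)."""
    to_upsert = sorted(
        doc_id for doc_id, text in source_docs.items()
        if indexed_hashes.get(doc_id) != content_hash(text)
    )
    to_delete = sorted(doc_id for doc_id in indexed_hashes if doc_id not in source_docs)
    return to_upsert, to_delete


indexed = {
    "faq": content_hash("Refunds within 30 days."),
    "terms": content_hash("Old terms."),
    "legacy": content_hash("Retired doc."),
}
source = {
    "faq": "Refunds within 30 days.",   # unchanged, skipped
    "terms": "New terms.",              # changed, re-embed
    "pricing": "Plans start at $10.",   # new, embed
}
to_upsert, to_delete = plan_sync(source, indexed)
print(to_upsert, to_delete)
```

Deleting by document ID only works if every chunk stores its source document ID as metadata, which is the tracking requirement described above.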
| Category | Checklist Item | Status |
|---|---|---|
| Data | All source documents ingested and validated | ☐ |
| Data | Incremental sync pipeline running | ☐ |
| Data | Stale document cleanup (TTL or deletion sync) | ☐ |
| Quality | Golden test set (50+ queries) with passing scores | ☐ |
| Quality | Eval runs in CI and blocks deploy on regression | ☐ |
| Quality | Faithfulness >0.90, Recall@5 >0.85 on eval set | ☐ |
| Performance | p99 latency <3s (or streaming TTFT <500ms) | ☐ |
| Performance | Semantic cache deployed, hit rate monitored | ☐ |
| Cost | Per-query cost calculated and budgeted | ☐ |
| Cost | Model routing for simple vs complex queries | ☐ |
| Observability | All queries/chunks/scores logged | ☐ |
| Observability | Alerts on quality degradation, error spikes | ☐ |
| Security | Access control on retrieval (user can only see their docs) | ☐ |
| Security | PII handling in logs and cache | ☐ |
| Fallback | "I don't know" with helpful fallback (human handoff, search link) | ☐ |
∑ Chapter 10 — Key Takeaways
- Production RAG = offline pipeline (data) + online pipeline (query) + observability + evaluation
- Caching (exact + semantic) can save 30–60% of costs and reduce latency to <50ms for repeated queries
- LLM generation is 60–70% of cost; optimize with fewer chunks, model routing, and caching
- Stream responses to cut perceived latency from 2s to <500ms time-to-first-token
- Keep the index fresh with incremental sync + deletion tracking; stale data destroys trust
- Use the production checklist: data quality, evaluation gates, performance, cost, observability, security
- RAG systems drift; continuous evaluation and monitoring are not optional, they're the product