
RAG Engineering

From naive retrieval to production-grade RAG systems — chunking, embeddings, vector search, re-ranking, and advanced patterns for retrieval-augmented generation.

RAG is not about putting more data into the prompt. It's about putting the RIGHT data into the prompt. Production RAG requires careful engineering of every stage — from how you chunk documents to how you rank and present retrieved context. This guide covers all of it.

Chapter 01 · Foundations

RAG Mental Model — How Retrieval Actually Works

The fundamental problem with LLMs is not that they're stupid — it's that they don't know your data. They were trained on the public internet up to some cutoff date. They haven't read your internal docs, your product specs, your customer tickets. RAG is how you give them that knowledge at runtime, without retraining.

What Is RAG — The Core Idea Foundation

Retrieval-Augmented Generation (RAG) is an architecture pattern where you enhance LLM responses by retrieving relevant information from external data sources and injecting it into the prompt at inference time. The model generates its response grounded in this retrieved context — not just its pre-trained knowledge.

The RAG pipeline — from question to grounded answer

The Key Insight

RAG separates knowledge storage from knowledge access. Your knowledge base can be petabytes. Your context window is 128K tokens. Retrieval is the bridge — it selects the few thousand tokens most relevant to this particular query, every time.

RAG = Search + Prompt Engineering. Retrieval decides WHAT the model sees. The prompt decides HOW the model uses it.

This separation is critical to understand:

Even perfect retrieval fails with poor prompts

You find the right documents, but the prompt doesn't instruct the model to cite sources, stay grounded, or say "I don't know." The model hallucinates or ignores the context entirely.

Even perfect prompts fail with bad retrieval

Your prompt is flawless, but the retriever returns irrelevant chunks. The model faithfully answers from garbage context — producing a well-formatted wrong answer.

The Implication

A production RAG system must optimize both layers together. Retrieval quality without prompt quality is wasted effort. Prompt quality without retrieval quality is a beautiful facade over bad data. Every chapter in this guide addresses one or both.

Why RAG Exists — The Problems It Solves Foundation

LLMs have powerful language understanding and generation capabilities, but they ship with three fundamental limitations that RAG directly addresses:

📅

Knowledge Cutoff

Training data stops at some date. GPT-4o's cutoff is ~October 2023. Anything after that — new products, policy changes, recent events — is invisible to the base model.

RAG injects current documents at inference

🔒

No Access to Private Data

Your internal docs, customer records, proprietary research — none of it is in the model. It was never trained on your company's Confluence, Notion, or databases.

RAG retrieves from your data sources

🎭

Hallucination Risk

Without grounding, the model can confidently generate plausible-sounding but wrong information. It's completing text patterns, not checking facts.

RAG grounds answers in source documents

What the LLM knows vs what it doesn't

RAG vs Fine-Tuning — The Decision Framework In-depth

The most common question: "Should I fine-tune or use RAG?" This is not an either/or — they solve different problems and can be combined. Here's how to decide:

Dimension	RAG	Fine-Tuning	Both Combined
What it changes	What the model sees (context)	How the model behaves (weights)	Both knowledge and behaviour
Knowledge updates	Fast: re-index the changed docs and the next query sees them	Requires retraining (hours to days)	Mixed: fast for the RAG portion
Cost structure	Higher per-query (retrieval + longer prompts)	Higher upfront, lower per-query	Highest upfront, balanced query cost
Factual accuracy	High — grounded in source docs	Lower — can still hallucinate	Highest with both
Style/format control	Limited — prompt engineering only	Strong — embedded in weights	Best of both
Complexity	Retrieval pipeline, chunking, indexing	Data curation, training, evaluation	Both complexities combined

✅ Choose RAG when...

• Knowledge changes frequently (docs, policies, products)

• You need source attribution ("this came from doc X")

• Dataset is large (>100K documents)

• You can't afford training compute

• Factual accuracy is critical (legal, medical, support)

✅ Choose Fine-Tuning when...

• You need consistent style/tone/format

• Domain has specialized vocabulary the base model struggles with

• Knowledge is stable and won't change often

• Latency is critical (no retrieval overhead)

• You have high-quality training data (>1K examples)

The Hybrid Sweet Spot

In practice, many production systems pair RAG with a fine-tuned model: fine-tune for domain style and vocabulary, and use RAG for factual grounding. Example: a legal assistant fine-tuned on case law writing style, with RAG retrieving relevant precedents. Here the fine-tuning doesn't add factual knowledge; it teaches the model how to write like a lawyer.

The RAG Pipeline — Component by Component Core

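A production RAG system has six core stages, grouped below into three pairs: ingestion and chunking, embedding and storage, retrieval and generation. Each stage has multiple design choices, and getting all of them right is what separates "RAG that works in demos" from "RAG that works in production."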

The 6 stages of a RAG pipeline — each is an engineering problem

📥

Ingestion & Chunking (Ch 2)

Load documents from disparate sources, parse various formats (PDF, HTML, Markdown, Notion, Confluence), split into retrievable chunks with appropriate size and overlap.

Loaders for 50+ formats
Semantic vs fixed-size chunking
Metadata extraction

🧮

Embedding & Storage (Ch 3-4)

Convert text chunks to dense vector representations. Store in specialized vector databases with efficient similarity search indices.

OpenAI, Cohere, BGE models
HNSW, IVF indexing
Scaling to millions of vectors

🔍

Retrieval & Generation (Ch 5-7)

At query time: find relevant chunks via vector similarity + keyword search, re-rank for precision, construct an optimized prompt, generate grounded response.

Hybrid search strategies
Cross-encoder re-ranking
Context window management

The Most Underrated Component: Context Construction In-depth

Retrieval finds relevant chunks. But the LLM doesn't understand "relevance" — it only processes tokens. This means how you assemble those chunks into the prompt matters enormously.

📍

Ordering matters

Important chunks buried in the middle get ignored (the "lost-in-the-middle" problem). Put the most relevant content first or last.

📐

Formatting matters

Chunks dumped as raw text confuse the model. Adding source labels, separators, and structure helps the LLM parse context correctly.

📌

Instruction placement matters

Where you place "answer only from context" relative to the chunks changes how strictly the model follows it.

Bad context construction destroys good retrieval

You retrieved the perfect 5 chunks. But you stuffed them unformatted between a vague system prompt and the user query. The model skips to chunk #3 (the least relevant), hallucinates details from #1, and ignores #2 entirely. In practice, context construction often determines final answer quality more than retrieval itself. Chapter 7 covers this in depth.
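The three points above (ordering, formatting, instruction placement) can be sketched in a few lines. This is a minimal illustration, not a prescribed template: the `Chunk` class, the `[Source N: ...]` label format, the separator, and the instruction wording are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    source: str
    score: float  # retrieval or re-ranker score, higher is better


def order_for_context(chunks: list[Chunk]) -> list[Chunk]:
    """Put the strongest chunks at the edges and the weakest in the middle."""
    ranked = sorted(chunks, key=lambda c: c.score, reverse=True)
    front, back = [], []
    for i, chunk in enumerate(ranked):
        # Alternate: best goes to the front, second best to the back, and so on.
        (front if i % 2 == 0 else back).append(chunk)
    return front + back[::-1]


def build_prompt(question: str, chunks: list[Chunk]) -> str:
    """Assemble labeled, separated context with the grounding instruction first."""
    blocks = [
        f"[Source {i}: {c.source}]\n{c.text}"
        for i, c in enumerate(order_for_context(chunks), start=1)
    ]
    context = "\n\n---\n\n".join(blocks)
    return (
        "Answer using only the context below. Cite sources as [Source N]. "
        "If the context does not contain the answer, say \"I don't know\".\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
    )


if __name__ == "__main__":
    demo = [
        Chunk("Refunds are accepted within 30 days.", "refund-policy.md", 0.92),
        Chunk("Shipping takes 2 business days.", "shipping.md", 0.40),
        Chunk("Refunds are processed in 5-7 business days.", "refund-faq.md", 0.81),
    ]
    print(build_prompt("How long do refunds take?", demo))
```

With scores 0.92, 0.81 and 0.40, the best chunk lands first, the second best last, and the weakest in the middle, which is one way to work around the lost-in-the-middle effect.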

Query Understanding & Rewriting Core

User queries are often vague, incomplete, or poorly structured. A single word like "refund" is not a good search query. Production systems improve retrieval by transforming queries before search.

❌ Raw user query

refund

Too vague — retrieves everything mentioning "refund" with no intent clarity.

✅ Rewritten queries

What is the refund policy?

How many days for refund eligibility?

Specific, intent-clear — retrieves the right documents.

Techniques include: rewriting for clarity, expanding keywords, generating multiple search queries from one question, and using the LLM itself to reformulate. Chapters 5 and 7 cover these strategies in detail.

Why "Naive RAG" Fails — The Demo-to-Production Gap In-depth

You can build a working RAG demo in 20 lines of code. You can also watch it fail spectacularly in production. The gap between "it works on my laptop" and "it works for 1000 users on real data" is filled with engineering challenges that naive implementations ignore.

Naive Approach	What Goes Wrong	Production Solution
Fixed 500-token chunks	Splits mid-sentence, loses context, breaks tables	Semantic chunking, document-aware splitting
Top-3 vector results	Misses relevant docs, returns duplicates	Hybrid search + re-ranking + deduplication
Stuff all chunks in prompt	Lost-in-the-middle, context overflow, cost explosion	Compression, map-reduce, hierarchical retrieval
No metadata filtering	Retrieves outdated docs, wrong department's info	Metadata filters, access control, freshness ranking
"Answer from these docs"	Hallucinations when docs don't contain answer	"I don't know" instruction, citation enforcement
Embed once, never update	Stale index, deleted docs still retrieved	Incremental indexing, TTL, sync pipelines

The "80% Accuracy Trap"

Naive RAG often achieves 80% accuracy in testing — good enough to demo, not good enough to deploy. The 20% failure cases are where users lose trust: wrong answers stated confidently, outdated information, obviously missed documents. Production RAG is about eliminating those 20% — and that requires engineering every stage of the pipeline.

The Hidden Cost of RAG Core

RAG is not free. Each query involves multiple steps — embedding lookup, vector search, optional re-ranking, larger prompt construction — each adding latency and cost.

⏱️

Latency

RAG typically adds roughly 100–500 ms over a direct LLM call: query embedding (20–50 ms), vector search (5–20 ms), re-ranking (50–200 ms), plus slower generation caused by the longer prompt.

💰

Token Cost

Retrieved context adds 2K–10K tokens per query. At roughly $2.50 per 1M input tokens (the GPT-4o rate these figures assume), that's $0.005–$0.025 per query just for context, multiplied by thousands of daily queries.

📈

Scaling Difficulty

Naive implementations that retrieve too many chunks, skip caching, and use the most expensive model for every query become slow and expensive at scale.

The Fix

Production systems control costs by: limiting retrieved chunks (5 not 20), compressing context (extract relevant sentences), caching results (semantic cache for repeated queries), and routing simple queries to cheaper models. Chapter 10 covers production optimization in depth.
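The per-query token cost quoted above is simple arithmetic, and it is worth rerunning with your own numbers. A small sketch follows; the price of $2.50 per 1M input tokens is the rate implied by the figures above, and the daily volume of 5,000 queries is a made-up example value.

```python
def context_cost_usd(context_tokens: int, usd_per_million_input_tokens: float) -> float:
    """Cost of the retrieved context alone, for one query."""
    return context_tokens / 1_000_000 * usd_per_million_input_tokens


PRICE_PER_MILLION = 2.50  # assumed input price in USD per 1M tokens
DAILY_QUERIES = 5_000     # hypothetical traffic

for tokens in (2_000, 10_000):
    per_query = context_cost_usd(tokens, PRICE_PER_MILLION)
    per_day = per_query * DAILY_QUERIES
    print(f"{tokens:>6} context tokens: ${per_query:.3f}/query, ${per_day:.2f}/day")
```

Cutting retrieved context from 10K to 2K tokens reduces this line item by 5×, which is why chunk limits and context compression come first in the list of fixes.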

How to Measure RAG Quality — The Metrics That Matter Core

You can't improve what you can't measure. RAG quality is measured in two independent dimensions: retrieval quality (did we find the right documents?) and generation quality (did we produce a correct, grounded answer?).

📊 Retrieval Metrics

Recall@K: Of all relevant docs, what % did we retrieve in top-K?

Precision@K: Of retrieved docs, what % are actually relevant?

MRR: Mean Reciprocal Rank — how high is the first relevant result?

NDCG: Normalized Discounted Cumulative Gain — rank quality score

📝 Generation Metrics

Faithfulness: Is the answer grounded in retrieved context? (no hallucination)

Answer Relevance: Does the answer address the user's question?

Context Relevance: Was the retrieved context actually useful?

RAGAS: Framework combining faithfulness + relevance metrics

The RAG quality equation — both retrieval and generation must succeed

Evaluation Is Non-Negotiable

Production RAG systems require automated evaluation pipelines. Build a golden test set of (query, relevant_docs, expected_answer) tuples. Run retrieval metrics after every indexing change. Run generation metrics after every prompt change. Chapter 8 covers this in depth.
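The retrieval metrics above are easy to compute once a golden set exists. The sketch below implements Recall@K, Precision@K and MRR over document IDs; the two-query golden set is invented purely for illustration.

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Share of all relevant docs that appear in the top-k results."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Share of the top-k slots filled with relevant docs (denominator is k)."""
    if k <= 0:
        return 0.0
    return sum(1 for doc in retrieved[:k] if doc in relevant) / k


def reciprocal_rank(retrieved: list[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant result, or 0 if none was retrieved."""
    for rank, doc in enumerate(retrieved, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs: list[tuple[list[str], set[str]]]) -> float:
    """Average reciprocal rank over (retrieved, relevant) pairs."""
    return sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)


# Hypothetical golden set: (retrieved doc IDs in rank order, relevant doc IDs)
golden_runs = [
    (["d3", "d1", "d7", "d2", "d9"], {"d1", "d2"}),
    (["d4", "d5", "d6"], {"d6", "d8"}),
]

for retrieved, relevant in golden_runs:
    print(recall_at_k(retrieved, relevant, 5), precision_at_k(retrieved, relevant, 5))
print(mean_reciprocal_rank(golden_runs))
```

Running these after every indexing change gives a regression signal for retrieval that is independent of the prompt and the LLM.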

High retrieval accuracy does NOT guarantee good answers

If irrelevant chunks dominate the context, the model will still generate incorrect responses. A retriever with 90% recall but poor precision floods the context with noise — and the LLM faithfully summarizes that noise. Retrieval quality and generation quality must be optimized together.

When RAG Is NOT the Right Choice Core

RAG is powerful, but it's not a universal solution. Some problems are better solved with other approaches — and forcing RAG where it doesn't fit leads to complex systems that underperform simpler alternatives.

Situation	Use RAG?	Better Alternative
Domain-specific style/vocabulary	No	Fine-tuning teaches the model how to speak, not what to say
General knowledge questions	No	The base LLM already knows this — RAG adds latency and cost for no benefit
Creative writing tasks	No	RAG constrains creativity; use the model's generative capabilities directly
Ultra-low latency (<100ms)	Often No	Retrieval adds 50–200ms minimum; consider pre-computed responses or caching
Highly structured data (SQL databases)	Sometimes	Text-to-SQL may be more accurate than embedding rows as text chunks
Document-grounded factual Q&A	Yes ✓	RAG is the right tool for this job
Knowledge that changes frequently	Yes ✓	RAG shines when knowledge is dynamic
Source attribution required	Yes ✓	RAG naturally supports citation since sources are explicit

RAG Mental Model — Quick Reference Core

Misconception	Reality	Practical Implication
"RAG = vector search"	RAG = retrieval + augmentation + generation	Don't neglect re-ranking, context construction, and prompt engineering
"More chunks = better"	More noise in context = worse answers	Quality over quantity — use re-ranking to select the best 3-5 chunks
"Embeddings capture everything"	Embeddings miss keywords, numbers, exact matches	Hybrid search (vector + BM25) outperforms pure vector in most cases
"One chunking strategy fits all"	Optimal chunking depends on doc type and query type	Test different strategies on your actual data with your actual queries
"Set and forget"	RAG systems drift as data and queries change	Continuous evaluation, monitoring, and re-indexing are required
"RAG eliminates hallucination"	RAG reduces but doesn't eliminate hallucination	Use citation prompting, faithfulness scoring, and "I don't know" instructions

∑ Chapter 01 — Key Takeaways

RAG = Retrieval-Augmented Generation — inject relevant external documents into LLM context at inference time
RAG addresses three LLM limitations: knowledge cutoff, no access to private data, and hallucination risk
RAG vs Fine-tuning: RAG changes what the model sees (context), fine-tuning changes how it behaves (weights)
The 6-stage pipeline: Ingestion → Chunking → Embedding → Storage → Retrieval → Generation
Naive RAG (~80% accuracy) is not production RAG — the 20% failure cases destroy user trust
Measure both retrieval quality (did we find the right docs?) and generation quality (did we answer correctly?)
RAG is NOT always the answer — consider fine-tuning for style, base LLM for general knowledge, SQL for structured data

Chapter 02 · Data Preparation

Data Ingestion & Chunking — Breaking Documents Into Retrievable Units

Chunking is where most RAG systems silently fail. A bad chunking strategy doesn't throw errors — it just returns irrelevant results that the LLM confidently uses to generate wrong answers. Get chunking wrong, and nothing downstream can fix it.

The Ingestion Pipeline — From Raw Docs to Indexed Chunks Foundation

Before you can retrieve, you need to ingest. The ingestion pipeline transforms raw documents — PDFs, web pages, Notion exports, database dumps — into indexed, searchable chunks. Each step has failure modes that propagate downstream.

The ingestion pipeline — each stage has decisions that affect retrieval quality

Document Loaders — Parsing Every Format Foundation

Your knowledge lives in dozens of formats. Each format has quirks that affect text extraction quality. Use the right loader for each source — and always validate output before proceeding.

Format	Recommended Loader	Watch Out For	Quality
PDF	pypdf, pdfplumber, unstructured	Scanned PDFs need OCR; tables often parse badly	Variable
HTML / Web	BeautifulSoup, trafilatura, unstructured	Nav/footer pollution; JavaScript-rendered content	Good
Markdown	Native text, MarkdownLoader	Code blocks need special handling; images are lost	Excellent
Word / DOCX	python-docx, unstructured	Embedded images, track changes, comments	Good
PowerPoint	python-pptx, unstructured	Layout is lost; speaker notes often missed	Fair
Notion	Notion API, notion-to-md	Nested blocks, databases need flattening	Good
Confluence	Confluence REST API, Atlassian SDK	Macros, embeds, permissions filtering	Fair
SQL Database	SQLAlchemy, custom extractors	Schema matters more than raw data; denormalize first	Good

The PDF Problem

PDFs are the worst format for RAG. They're designed for printing, not parsing. A table in a PDF might render as "Column1 Column2 Row1Val1 Row1Val2 Row2Val1..." — semantically meaningless. Solutions: Use pdfplumber for tables, run OCR on scanned docs (Tesseract, Azure Doc Intelligence), or convert to Markdown before chunking.

🔧

Production pattern: Unified document loading with LangChain

from pathlib import Path

from langchain_community.document_loaders import (
    PyPDFLoader,
    TextLoader,
    UnstructuredHTMLLoader,
)
from langchain_core.documents import Document


def load_document(path: str) -> list[Document]:
    """Load any supported format, choosing the loader by file extension."""
    # Map extensions to loader classes; add more formats here as needed.
    loaders = {
        ".pdf": PyPDFLoader,
        ".html": UnstructuredHTMLLoader,
        ".txt": TextLoader,
        ".md": TextLoader,
    }
    ext = Path(path).suffix.lower()
    loader_cls = loaders.get(ext)
    if loader_cls is None:
        raise ValueError(f"Unsupported format: {ext}")
    return loader_cls(path).load()

Chunking Strategies — The Heart of RAG Quality In-depth

Chunking determines what your retriever can find. Too large, and irrelevant content dilutes the signal. Too small, and context is lost. The right strategy depends on your document type and query patterns.

📏

Fixed-Size Chunking

Split by character/token count with overlap. Simple, predictable, format-agnostic.

Chunk size: 500–1000 tokens
Overlap: 50–200 tokens (10–20%)
Pro: Easy to implement
Con: Splits mid-sentence/idea

📄

Recursive Character Splitting

Try splitting by paragraph, then sentence, then character. Preserves document structure better.

Separators: ["\n\n", "\n", ". ", " "]
Respects natural boundaries
Pro: Better semantic units
Con: Chunk sizes vary

🧠

Semantic Chunking

Use embedding similarity to find natural breakpoints. Split where meaning shifts.

Compute sentence embeddings
Split at similarity drops
Pro: Meaning-preserving
Con: 10–100× slower, more complex

How different chunking strategies handle the same document

Strategy	Best For	Chunk Size	Speed	Quality
Fixed-size	Uniform docs, quick prototyping	512–1024 tokens	Fast	Fair
Recursive character	General purpose, most use cases	500–1000 tokens	Fast	Good
Semantic	High-value docs, precision critical	Variable	Slow	Excellent
Document-aware (markdown headers)	Structured docs with clear sections	Section-based	Medium	Excellent
Sentence-window	Dense technical content	3–5 sentences	Fast	Good

Chunk Size — The Goldilocks Problem Core

Chunk size is a tradeoff between precision and context. There's no universal answer — it depends on query type, document structure, and embedding model capabilities.

⬇️ Smaller Chunks (100–300 tokens)

✓ Pro: Higher precision for specific queries

✓ Pro: Less noise in retrieved context

✓ Pro: Better for exact match questions

✗ Con: May lose surrounding context

✗ Con: More chunks = more vectors = higher cost

Best for: FAQ, definitions, code snippets

⬆️ Larger Chunks (500–1500 tokens)

✓ Pro: Preserves full context

✓ Pro: Better for complex reasoning

✓ Pro: Fewer vectors to store/search

✗ Con: May include irrelevant content

✗ Con: Lower precision for specific queries

Best for: Analysis, summaries, narratives

The Empirical Answer

Don't guess — test. Create 50–100 (query, expected_doc) pairs from your actual data. Run retrieval with chunk sizes 256, 512, 1024, 2048. Measure Recall@5. The winner varies by dataset — we've seen 256 win for support tickets and 1024 win for research papers. Your optimal size is the one that maximizes recall on your queries.

Chunk size vs retrieval precision — typical tradeoff curve

Advanced Patterns — Parent-Child, Sentence-Window, Agentic In-depth

Production RAG systems often use more sophisticated chunking patterns that decouple what you search from what you retrieve. These add complexity but can significantly improve quality.

👨‍👧

Parent-Child Chunking

Store small chunks for precise matching, but retrieve their larger parent for context.

# Search: small chunk (256 tokens)
"The refund policy is 30 days"

# Retrieve: parent chunk (1024 tokens)
"RETURNS AND REFUNDS\n\nThe refund policy is 30 days from purchase date. To initiate a refund, contact support with your order number. Refunds are processed within 5–7 business days..."

Best of both worlds: precision + context
Requires document ID linking
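The document ID linking can be as simple as storing a `parent_id` on every child chunk. The sketch below assumes word-window children and uses a keyword-overlap score as a stand-in for vector similarity so that it runs without an embedding model; the function names and sample documents are hypothetical.

```python
def build_children(parents: dict[str, str], child_words: int) -> list[dict]:
    """Split each parent into small word windows that remember their parent ID."""
    children = []
    for parent_id, text in parents.items():
        words = text.split()
        for start in range(0, len(words), child_words):
            children.append({
                "parent_id": parent_id,
                "text": " ".join(words[start:start + child_words]),
            })
    return children


def keyword_score(query: str, text: str) -> int:
    """Stand-in for vector similarity: number of shared lowercase words."""
    return len(set(query.lower().split()) & set(text.lower().split()))


def retrieve_parents(query: str, children: list[dict], parents: dict[str, str], k: int = 1) -> list[str]:
    """Search the small chunks, then return their (deduplicated) parent texts."""
    ranked = sorted(children, key=lambda c: keyword_score(query, c["text"]), reverse=True)
    seen, results = set(), []
    for child in ranked:
        if child["parent_id"] not in seen:
            seen.add(child["parent_id"])
            results.append(parents[child["parent_id"]])
        if len(results) == k:
            break
    return results


parents = {
    "doc_refunds": "RETURNS AND REFUNDS The refund policy is 30 days from purchase date. "
                   "To initiate a refund, contact support with your order number.",
    "doc_shipping": "SHIPPING Orders ship within 2 business days. "
                    "Tracking numbers are emailed after dispatch.",
}
children = build_children(parents, child_words=8)
print(retrieve_parents("refund policy days", children, parents))
```

Deduplicating by `parent_id` matters: several children of the same parent often match, and returning the parent once keeps the context short.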

🪟

Sentence-Window

Embed individual sentences, but retrieve surrounding sentences as context.

# Index: single sentence
"The API rate limit is 100 req/min."

# Retrieve: sentence + window (±2)
"Our API uses rate limiting to ensure fair usage. The API rate limit is 100 req/min. Exceeding this triggers a 429 error. Use exponential backoff..."

Very precise matching
More setup complexity
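The window expansion itself is only a slice around the matched sentence. A minimal sketch, assuming a naive regex sentence splitter (real documents usually need a proper sentence tokenizer) and assuming the index of the matched sentence comes from whatever search you run over the per-sentence embeddings:

```python
import re


def split_sentences(text: str) -> list[str]:
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def sentence_with_window(sentences: list[str], hit_index: int, window: int = 2) -> str:
    """Return the matched sentence plus up to `window` neighbors on each side."""
    start = max(0, hit_index - window)
    end = min(len(sentences), hit_index + window + 1)
    return " ".join(sentences[start:end])


text = (
    "Our API uses rate limiting to ensure fair usage. "
    "The API rate limit is 100 req/min. "
    "Exceeding this triggers a 429 error. "
    "Use exponential backoff."
)
sentences = split_sentences(text)
# Suppose the search over single-sentence embeddings matched sentence 1.
print(sentence_with_window(sentences, hit_index=1, window=1))
```

The `max` and `min` clamps handle matches at the start or end of a document, where fewer neighbors exist.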

📑

Document-aware chunking for Markdown

from langchain.text_splitter import MarkdownHeaderTextSplitter

headers_to_split = [
    ("#", "h1"),
    ("##", "h2"),
    ("###", "h3"),
]

splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=headers_to_split,
    strip_headers=False,  # Keep headers in chunk
)

# markdown_doc is a str holding the Markdown source to split
chunks = splitter.split_text(markdown_doc)

# Each chunk includes its header hierarchy as metadata
# {"h1": "User Guide", "h2": "Authentication", "h3": "API Keys"}

Headers become metadata — filter by section, include hierarchy in context, maintain document structure.

Metadata — The Secret Weapon for Filtering Core

Metadata enables filtering before semantic search — dramatically improving precision. Every chunk should carry metadata that answers: where did this come from, when, and who can see it?

Metadata Field	Example Values	Use Case
source	pricing-faq.pdf, api-docs/auth.md	Filter by document, show citations
date_created	2026-04-01	Freshness ranking, exclude outdated
date_modified	2026-04-20	Sync detection, re-indexing triggers
department	engineering, sales, legal	Access control, relevance filtering
doc_type	faq, policy, tutorial, reference	Match intent to doc type
chunk_index	0, 1, 2, ...	Reconstruct document order
parent_id	doc_123	Link chunks to parent documents
section_title	Authentication > API Keys	Hierarchical filtering, better context

Pre-filter, Don't Post-filter

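Most vector databases can apply metadata filters during the search itself rather than afterwards. Use that, for example: filter={"department": "engineering", "date_modified": {"$gte": "2026-01-01"}}. The exact filter syntax, and the value types that range operators accept, vary by database. Filtering before search is vastly more efficient than retrieving 100 results and filtering them down to 5.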
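To show what pre-filtering means mechanically, here is an in-memory sketch that narrows the candidate set by metadata first and only then ranks by similarity. It supports equality and a `$gte` lower bound, mirroring the filter shown above; it is not the API of any particular vector database, and the records are invented.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def matches(metadata: dict, filters: dict) -> bool:
    """True if metadata satisfies every equality or {"$gte": value} condition."""
    for key, cond in filters.items():
        value = metadata.get(key)
        if isinstance(cond, dict):
            if "$gte" in cond and (value is None or value < cond["$gte"]):
                return False
        elif value != cond:
            return False
    return True


def filtered_search(query_vec: list[float], records: list[dict], filters: dict, k: int = 5) -> list[dict]:
    """Pre-filter by metadata, then rank only the survivors by similarity."""
    candidates = [r for r in records if matches(r["metadata"], filters)]
    candidates.sort(key=lambda r: cosine(query_vec, r["vector"]), reverse=True)
    return candidates[:k]


records = [
    {"id": "a", "vector": [1.0, 0.0], "metadata": {"department": "engineering", "date_modified": "2026-04-20"}},
    {"id": "b", "vector": [1.0, 0.1], "metadata": {"department": "sales", "date_modified": "2026-04-20"}},
    {"id": "c", "vector": [0.9, 0.1], "metadata": {"department": "engineering", "date_modified": "2025-11-02"}},
    {"id": "d", "vector": [0.0, 1.0], "metadata": {"department": "engineering", "date_modified": "2026-02-01"}},
]
filters = {"department": "engineering", "date_modified": {"$gte": "2026-01-01"}}
# ISO 8601 date strings compare correctly as plain strings.
print([r["id"] for r in filtered_search([1.0, 0.0], records, filters)])
```

Records "b" and "c" are close to the query vector but never reach the ranking step, which is the point: similarity alone would have returned the wrong department and an outdated document.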

∑ Chapter 02 — Key Takeaways

The ingestion pipeline: Load → Clean → Chunk → Enrich → Embed — each stage has failure modes
PDFs are problematic — use specialized loaders (pdfplumber) and validate extraction quality
Chunking strategies: fixed-size (simple), recursive (general purpose), semantic (highest quality)
Chunk size is a tradeoff: smaller = higher precision, larger = better context — test on your data
Advanced patterns: parent-child (search small, retrieve large) and sentence-window (precise + context)
Metadata enables pre-filtering — extract source, date, department, doc_type at ingestion time
Document-aware chunking (Markdown headers) preserves structure and enables hierarchical filtering

Chapter 03 · Representation

Embeddings & Representation — Turning Text Into Vectors

Embeddings are how machines "understand" text. A good embedding model compresses semantic meaning into a vector such that similar meanings are close together in vector space. The choice of embedding model fundamentally determines what your RAG system can and cannot retrieve.

What Are Embeddings — The Core Concept Foundation

An embedding is a fixed-length vector (array of numbers) that represents the semantic meaning of text. Embedding models are trained so that semantically similar text produces similar vectors — enabling "meaning-based" search rather than keyword matching.

How embeddings enable semantic search — similar meanings cluster together

Why This Works

Embedding models learn from billions of text pairs. During training, they learn that "password reset" and "forgot credentials" appear in similar contexts — so they map to nearby vectors. At search time, we embed the query and find the closest document vectors. No keyword matching required.
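The "find the closest document vectors" step is plain geometry. The sketch below uses hand-written 3-dimensional toy vectors in place of real embeddings (real ones have hundreds or thousands of dimensions and come from a model), just to show how cosine similarity picks the nearest meaning.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy vectors, invented for illustration (not produced by any model).
doc_vectors = {
    "How to recover forgotten credentials": [0.90, 0.10, 0.00],
    "Quarterly revenue report": [0.00, 0.20, 0.95],
}
query_vector = [0.88, 0.15, 0.05]  # stands in for embed("password reset")

best_doc = max(doc_vectors, key=lambda d: cosine_similarity(query_vector, doc_vectors[d]))
print(best_doc)
```

The query shares no words with the winning document; the match comes entirely from the vectors pointing in nearly the same direction.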

Embedding Models — Which One to Choose In-depth

There are dozens of embedding models available. They differ in quality (MTEB benchmarks), dimensionality (affects storage/speed), cost, and whether they can run locally. Here are the ones that matter in 2026:

Model	Dimensions	MTEB Avg	Cost	Where	Best For
OpenAI text-embedding-3-large	3072 (or 256–1536)	64.6	$0.13/1M tokens	API	Production default, scalable
OpenAI text-embedding-3-small	1536 (or 256–512)	62.3	$0.02/1M tokens	API	Budget production
Cohere embed-v3	1024	64.5	$0.10/1M tokens	API	Multilingual, input types
Voyage-2	1024	65.4	$0.10/1M tokens	API	Legal, code, finance
BGE-large-en-v1.5	1024	64.2	Free	Local	Self-hosted, privacy
E5-mistral-7b-instruct	4096	66.6	Free	Local (GPU)	Highest quality OSS
GTE-Qwen2-7B-instruct	3584	67.2	Free	Local (GPU)	SOTA open-source
all-MiniLM-L6-v2	384	56.3	Free	Local (CPU)	Prototyping only

🏢

Production (API)

Use OpenAI text-embedding-3-large or Cohere embed-v3. Battle-tested, no GPU infra needed, good latency. Cost is usually negligible vs LLM costs.

🔒

Privacy-Required

Use BGE-large or E5-mistral. Run locally, no data leaves your infrastructure. BGE runs on CPU; E5/GTE need GPU but are higher quality.

🌍

Multilingual

Use Cohere embed-v3 (100+ languages) or multilingual-e5-large. Don't assume English models work for other languages — they don't.

Dimensionality — Storage vs Quality Tradeoff Core

Higher dimensions capture more nuance but cost more to store and search. OpenAI's embedding-3 models support dimension reduction via Matryoshka Representation Learning — you can truncate vectors without retraining.

Low Dimensions (256–512)

✓ Storage: 1M vectors × 512 dims × 4 bytes = 2 GB

✓ Speed: Faster similarity calculations

✓ Cost: Lower vector DB costs

✗ Quality: May lose subtle distinctions

Best for: Large scale (>10M docs), cost-sensitive

High Dimensions (1536–3072)

✓ Quality: Captures nuanced meaning

✓ Precision: Better for similar documents

✗ Storage: 1M × 3072 × 4 bytes = 12 GB

✗ Speed: Slower search (but still fast)

Best for: High precision needs, <1M docs
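The storage figures above follow directly from vectors × dimensions × bytes per value. The sketch below reproduces them and also shows manual truncation of a Matryoshka-style vector; note that a truncated vector is no longer unit length, so it is re-normalized before use. The helper names are made up for this example, and the sizes exclude any index overhead.

```python
import math


def raw_index_size_gb(num_vectors: int, dims: int, bytes_per_value: int = 4) -> float:
    """Raw float32 storage in GB (10^9 bytes), excluding index overhead."""
    return num_vectors * dims * bytes_per_value / 1e9


def truncate_and_normalize(vector: list[float], dims: int) -> list[float]:
    """Keep the first `dims` values and rescale the result to unit length."""
    head = vector[:dims]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head] if norm > 0 else head


for dims in (256, 512, 1536, 3072):
    print(f"{dims:>4} dims: {raw_index_size_gb(1_000_000, dims):.2f} GB per 1M vectors")

print(truncate_and_normalize([3.0, 4.0, 12.0], dims=2))  # toy 3-dim vector cut to 2 dims
```

Truncating client-side like this only makes sense for models trained with Matryoshka-style objectives; for other models, the leading dimensions carry no special meaning.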

🔧

Dimension reduction with OpenAI embedding-3

from openai import OpenAI

client = OpenAI()

# Full dimensions (3072)
response = client.embeddings.create(
    model="text-embedding-3-large",
    input="Your text here",
)
full_vector = response.data[0].embedding  # 3072 dims

# Reduced dimensions (256) via the API parameter
response = client.embeddings.create(
    model="text-embedding-3-large",
    input="Your text here",
    dimensions=256,  # Matryoshka truncation
)
small_vector = response.data[0].embedding  # 256 dims

# Same cost, 12× less storage, ~2–5% quality drop

Asymmetric Embeddings — Query vs Document In-depth

Queries and documents are fundamentally different: queries are short questions, documents are long answers. Some embedding models handle this asymmetry explicitly — and they perform significantly better for retrieval tasks.

❌

Symmetric embedding (naive)

Same embedding model, same prompt for queries and documents. Works okay, but not optimal.

# Same embedding for both:
embed("What is the refund policy?")
embed("Our refund policy allows...")
# May not align well

✅

Asymmetric embedding (better)

Different instruction prefixes for queries vs documents. Models trained for this (E5, BGE, Cohere) perform ~5–10% better.

# E5 format:
embed("query: What is the refund policy?")
embed("passage: Our refund policy allows...")

# Cohere input_type parameter:
embed(text, input_type="search_query")
embed(text, input_type="search_document")

Always Check Documentation

Each model has its own conventions. BGE (the English v1.5 models) prepends "Represent this sentence for searching relevant passages: " to queries only, not to documents. E5 uses "query: " and "passage: ". Cohere uses API parameters. Using the wrong format can reduce retrieval quality by 10–15%, so read the model card.
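One way to avoid mixing up these conventions is to route every text through a single helper that knows its role. The sketch below covers only the E5 prefixes named above plus a no-prefix fallback; the `PREFIXES` table and function name are hypothetical, and any other model family should be added only after checking its model card.

```python
# Role-specific prefixes per model family. Only E5 is filled in here;
# "none" is for symmetric models that expect raw text.
PREFIXES = {
    "e5": {"query": "query: ", "document": "passage: "},
    "none": {"query": "", "document": ""},
}


def format_for_embedding(text: str, role: str, family: str = "e5") -> str:
    """Prepend the prefix the model family expects for this role."""
    if role not in ("query", "document"):
        raise ValueError(f"Unknown role: {role}")
    return PREFIXES[family][role] + text


print(format_for_embedding("What is the refund policy?", "query"))
print(format_for_embedding("Our refund policy allows...", "document"))
```

Calling the same helper at indexing time and at query time keeps the two sides consistent, which is where prefix mistakes usually creep in.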

Embedding Best Practices Core

Practice	Why	Implementation
Batch your API calls	50–100× faster than one-by-one	Send up to 2048 texts per API call (OpenAI limit)
Cache embeddings	Don't re-embed unchanged docs	Hash document content, store in cache, skip if exists
Normalize vectors	Unit-length vectors let a plain dot product stand in for cosine similarity	Most APIs return normalized vectors, but verify
Same model everywhere	Don't mix models	Query and doc embeddings must use same model
Prepend chunk metadata	Context helps embedding quality	"Title: X | Section: Y | Content: Z"
Handle max length	Models truncate silently	Check model's max tokens (usually 512–8192)

⚡

Production embedding pipeline

import hashlib

from openai import OpenAI

client = OpenAI()
BATCH_SIZE = 100
cache = {}  # In production: Redis or similar


def get_content_hash(text: str) -> str:
    return hashlib.md5(text.encode()).hexdigest()


def embed_batch(texts: list[str]) -> list[list[float]]:
    """Embed with caching and batching."""
    results = [None] * len(texts)
    to_embed = []  # (index, text, hash) triples

    # Check cache
    for i, text in enumerate(texts):
        h = get_content_hash(text)
        if h in cache:
            results[i] = cache[h]
        else:
            to_embed.append((i, text, h))

    # Batch embed uncached texts
    for batch_start in range(0, len(to_embed), BATCH_SIZE):
        batch = to_embed[batch_start:batch_start + BATCH_SIZE]
        response = client.embeddings.create(
            model="text-embedding-3-large",
            input=[t[1] for t in batch],
        )
        for j, emb in enumerate(response.data):
            idx, _, h = batch[j]
            results[idx] = emb.embedding
            cache[h] = emb.embedding  # Cache for next time

    return results

Fine-tuning Embeddings — When and How In-depth

General-purpose embeddings work well for most text. But for specialized domains (legal, medical, code), fine-tuning on your data can improve retrieval quality significantly — 10–30% gains are common.

✅ Fine-tune embeddings when...

• Domain has specialized vocabulary (medical, legal, code)

• General embeddings fail on your evaluation set

• You have (query, relevant_doc) pairs for training

• Retrieval quality is critical for production

• You can afford retraining when model updates

❌ Don't fine-tune when...

• General-purpose embeddings work well enough

• You don't have labeled training data

• You need to iterate quickly (fine-tuning is slow)

• Domain is general knowledge / common English

• Using an API-only model (can't fine-tune OpenAI embeds)

How to Fine-tune

Use sentence-transformers with contrastive loss. You need (anchor, positive, negative) triplets: the anchor is a query, positive is a relevant doc, negative is an irrelevant doc. Train for 1–3 epochs with a learning rate of 2e-5. Evaluate on a held-out test set — if Recall@5 improves by >5%, deploy the fine-tuned model.

Fine-tuning Method	Training Data Needed	Quality Gain	Effort
Contrastive fine-tuning	1K–10K (query, doc) pairs	+10–30% recall	Medium
Matryoshka fine-tuning	1K–10K pairs	Maintain quality at low dims	Medium
Adapter layers (LoRA)	500–2K pairs	+5–15% recall	Low
Hard negative mining	Requires iterative labeling	+15–25% over random negatives	High

Evaluating Embedding Quality — MTEB and Beyond Core

MTEB (Massive Text Embedding Benchmark) is the industry standard for comparing embedding models. It evaluates models across 56+ datasets in retrieval, classification, clustering, and more. But MTEB alone isn't enough — you must test on your own data.

📊

MTEB Leaderboard

Check huggingface.co/spaces/mteb/leaderboard for current rankings. Focus on the Retrieval category for RAG use cases.

🎯

Your Own Eval Set

Create 50–100 (query, relevant_docs) pairs from your actual data. Measure Recall@5 and MRR. The model that wins on MTEB may not win on your domain.

⚖️

Compare Fairly

Same chunking, same index, same queries. Only change the embedding model. Run 3 times, average results. Statistical significance matters.

MTEB Hacking

Some models are "trained on the test set" — they've seen MTEB datasets during development. Their MTEB scores look great, but they don't generalize. Always validate on your own held-out data before committing to a model in production.

∑ Chapter 03 — Key Takeaways

Embeddings turn text into vectors where similar meanings are close together — enabling semantic search
Top models: OpenAI text-embedding-3-large (API), BGE-large (local), Cohere embed-v3 (multilingual)
Dimensionality tradeoff: higher = better quality, lower = cheaper storage — use Matryoshka truncation to choose
Use asymmetric embeddings — different prefixes for queries vs documents (E5: "query:", "passage:")
Best practices: batch API calls, cache embeddings, prepend metadata to chunks
Fine-tune for specialized domains — 10–30% gains possible with 1K+ training pairs
MTEB is useful but not definitive — test on your own data before choosing a production model

Chapter 04 · Infrastructure

Vector Storage & Indexing — Storing and Searching at Scale

You've got embeddings. Now where do you put them? Vector databases are purpose-built for storing millions of vectors and finding the nearest neighbors in milliseconds. The choice of database and index type determines your latency, accuracy, cost, and operational complexity.

Why Vector Databases — Not Just Any Database Foundation

You could store vectors in PostgreSQL as plain arrays. But with 10 million vectors, every query becomes a full scan: computing cosine similarity against all of them can take seconds or longer per query. Vector databases use specialized indices that trade perfect accuracy for 100–1000× faster search.

Brute-force search vs. ANN (Approximate Nearest Neighbor) search

The ANN Tradeoff

ANN is approximate: it might miss the actual closest neighbor and return the 2nd or 3rd closest instead. In practice, with good tuning, you get 95–99% recall (on average, 95–99% of the true top-k neighbors appear in the results) at 100× the speed. For RAG, this tradeoff is almost always worth it.

Vector Database Options — Managed vs Self-Hosted In-depth

The vector database market has exploded. Here's an honest comparison of the major options — there's no single "best" choice, only tradeoffs for your use case.

Database	Type	Best For	Scale	Cost	Complexity
Pinecone	Managed	Production, serverless, no ops	Billions	$70+/mo	Very low
Qdrant	Both	Self-hosted + cloud, flexible	Billions	Free–$$$	Medium
Weaviate	Both	Built-in ML, hybrid search	Billions	Free–$$$	Medium
Milvus	Self-hosted	Massive scale, on-prem	Billions+	Free (infra)	High
pgvector	Extension	Already using Postgres	~5M vectors	Free	Low
Chroma	Embedded	Prototyping, local dev	~1M vectors	Free	Very low
FAISS	Library	Research, custom pipelines	Billions	Free	High

☁️

Start Here: Managed

Use Pinecone or Qdrant Cloud to start. Zero ops, scales automatically, free tiers available. Focus on your RAG logic, not infrastructure.

🐘

Already on Postgres?

Use pgvector. No new infrastructure, same ops model, works up to ~5M vectors. Beyond that, consider dedicated vector DB.

🏠

Privacy/On-Prem Required

Self-host Qdrant or Milvus. Both are production-ready. Qdrant is simpler; Milvus scales larger but needs more ops work.

Index Types — HNSW, IVF, and When to Use Each In-depth

ANN search works by building an index — a data structure that allows skipping most vectors during search. The two dominant index types are HNSW and IVF. Most modern vector databases default to HNSW.

🕸️ HNSW (Hierarchical NSW)

How it works: Builds a multi-layer graph where each node connects to nearby neighbors. Search starts at top layer, descends to find approximate nearest.

✓ Pros: Fast search (1–10ms), high recall, no training needed, incrementally updatable

✗ Cons: High memory usage (stores graph edges), slow build time for very large datasets

Best for: Most RAG use cases, <100M vectors

📊 IVF (Inverted File Index)

How it works: Clusters vectors into buckets (centroids) via k-means. Search only probes nearby buckets.

✓ Pros: Lower memory, compresses well with PQ, good for very large scale

✗ Cons: Requires training phase, slower search than HNSW, updating is expensive

Best for: 100M+ vectors, memory-constrained, batch-build scenarios

HNSW index structure — multi-layer graph for fast navigation

Parameter	What It Controls	Higher Value	Lower Value
M (HNSW)	Edges per node	Better recall, more memory	Less memory, lower recall
ef_construction	Build-time search width	Better index quality, slower build	Faster build, lower quality
ef_search	Query-time search width	Better recall, slower queries	Faster queries, lower recall
nlist (IVF)	Number of clusters	More granular, slower build	Faster build, coarser search
nprobe (IVF)	Clusters to search	Better recall, slower queries	Faster queries, lower recall

Setting Up Vector Storage — Code Examples Core

Here's how to set up the most common vector databases. All examples use Python and store 1536-dimensional embeddings with metadata.

🌲

Pinecone (managed, serverless)

from pinecone import Pinecone, ServerlessSpec

pc = Pinecone(api_key="YOUR_KEY")

# Create a serverless index sized for 1536-dimensional embeddings
pc.create_index(
    name="my-rag-index",
    dimension=1536,
    metric="cosine",
    spec=ServerlessSpec(cloud="aws", region="us-east-1"),
)

# Upsert vectors with metadata
# (embedding_vector is a list of 1536 floats from your embedding model)
index = pc.Index("my-rag-index")
index.upsert(vectors=[
    {
        "id": "doc1",
        "values": embedding_vector,
        "metadata": {"source": "faq.pdf", "date": "2026-01"},
    }
])

🔷

Qdrant (self-hosted or cloud)

from qdrant_client import QdrantClient
from qdrant_client.models import Distance, PointStruct, VectorParams

client = QdrantClient("localhost", port=6333)

# Create collection
client.create_collection(
    collection_name="my-rag",
    vectors_config=VectorParams(size=1536, distance=Distance.COSINE),
)

# Upsert with payload (metadata)
client.upsert(
    collection_name="my-rag",
    points=[
        PointStruct(id=1, vector=embedding_vector, payload={"source": "faq.pdf"})
    ],
)

🐘

pgvector (PostgreSQL extension)

-- Enable extension
CREATE EXTENSION vector;

-- Create table with vector column
CREATE TABLE documents (
    id SERIAL PRIMARY KEY,
    content TEXT,
    embedding vector(1536),  -- pgvector type
    source TEXT,
    created_at TIMESTAMP
);

-- Create HNSW index for fast search
CREATE INDEX ON documents
USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 64);

-- Query: find 5 nearest neighbors
SELECT id, content, source,
       1 - (embedding <=> '[0.1, 0.2, ...]') AS similarity
FROM documents
ORDER BY embedding <=> '[0.1, 0.2, ...]'
LIMIT 5;

Scaling Vector Storage — Cost and Performance Core

Vector databases scale differently than traditional databases. Memory is usually the bottleneck, not disk. Here's what to expect:

Scale	Vectors	Memory (1536d)	Typical Cost	Latency (p99)
Small	<100K	<600 MB	Free tier / $20/mo	<10ms
Medium	100K–1M	600 MB–6 GB	$50–200/mo	<20ms
Large	1M–10M	6–60 GB	$200–500/mo	20–50ms
Very Large	10M–100M	60–600 GB	$500–2000/mo	50–100ms
Massive	>100M	>600 GB	$2000+/mo + ops	100ms+ (sharded)

Cost-Saving Strategies

1. Reduce dimensions: 256d instead of 1536d = 6× less memory. 2. Quantization: Store int8 instead of float32 = 4× less memory (some quality loss). 3. Tiered storage: Keep hot vectors in memory, cold in disk-backed storage. 4. Aggressive deduplication: Remove near-duplicate chunks before indexing.
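The memory figures above follow from simple arithmetic: vectors × dimensions × bytes per value. The sketch below reproduces it for 1M vectors; the function name is an illustrative assumption, and the result covers raw vector storage only, not index overhead such as HNSW graph edges.

```python
def vector_memory_gb(num_vectors: int, dims: int, bytes_per_value: int = 4) -> float:
    """Raw vector storage in GB (10^9 bytes); excludes index overhead."""
    return num_vectors * dims * bytes_per_value / 1e9


baseline = vector_memory_gb(1_000_000, 1536)       # float32, full dimensions
reduced = vector_memory_gb(1_000_000, 256)         # float32, truncated to 256d
quantized = vector_memory_gb(1_000_000, 1536, 1)   # int8, full dimensions
both = vector_memory_gb(1_000_000, 256, 1)         # int8 and 256d combined

print(f"float32 1536d: {baseline:.3f} GB")
print(f"float32  256d: {reduced:.3f} GB ({baseline / reduced:.0f}x smaller)")
print(f"int8    1536d: {quantized:.3f} GB ({baseline / quantized:.0f}x smaller)")
print(f"int8     256d: {both:.3f} GB ({baseline / both:.0f}x smaller)")
```

The two reductions multiply, so combining 256 dimensions with int8 storage shrinks the raw footprint by a factor of 24. Whether retrieval quality survives that much compression is something to confirm on your own eval set.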

∑ Chapter 04 — Key Takeaways

Vector databases use ANN indices (not brute force) to search millions of vectors in milliseconds
Top choices: Pinecone (managed), Qdrant (flexible), pgvector (if already on Postgres)
HNSW is the default index type — fast search, high recall, good for <100M vectors
Key parameters: M (edges), ef_construction (build quality), ef_search (query recall)
Memory scales with vectors: 1M × 1536d × 4 bytes ≈ 6 GB — dimension reduction helps
Start with managed services — self-host only when you need privacy or have ops capacity

Chapter 05 · Core Mechanics

Retrieval Strategies — Finding What Matters

Retrieval is where RAG succeeds or fails. You can have perfect embeddings and a fast vector database, but if your retrieval strategy doesn't find the right documents, the LLM will confidently answer with irrelevant context. This chapter covers how to actually get retrieval right.

Dense vs Sparse Retrieval — Two Approaches Foundation

There are fundamentally two ways to find relevant documents: dense retrieval (semantic similarity via embeddings) and sparse retrieval (keyword matching via inverted indices). Both have strengths; the best systems use both.

🧠 Dense Retrieval (Vector Search)

How it works: Embed query, find nearest document embeddings by cosine similarity

✓ Strengths: Semantic understanding — "car" matches "automobile", handles paraphrase

✗ Weaknesses: Misses exact keywords, struggles with numbers, acronyms, rare terms

Example: "Python web framework" retrieves docs about Flask even if they never contain the words "web framework"

🔤 Sparse Retrieval (BM25/TF-IDF)

How it works: Count word occurrences, score by term frequency and rarity (BM25)

✓ Strengths: Exact keyword match, numbers, codes, acronyms, rare terms

✗ Weaknesses: No semantic understanding — "car" doesn't match "automobile"

Example: "ERR-4592" finds exact error code, vector search might miss it

Dense vs Sparse retrieval — each finds different documents

The Empirical Result

Benchmarks consistently show: Hybrid search (dense + sparse) outperforms either alone by 5–15% on recall. Dense captures paraphrase and semantic similarity; sparse captures exact terms the embeddings might miss. Use both.

Hybrid Search — Combining Dense and Sparse In-depth

Hybrid search runs both dense and sparse retrieval, then combines the results. The key question: how do you merge two ranked lists? The most common approach is Reciprocal Rank Fusion (RRF).

Reciprocal Rank Fusion (RRF)

RRF_score(d) = Σ_i 1 / (k + rank_i(d))

k = 60 (constant), rank_i(d) = document d's rank in retriever i, counting from 1. Higher score = better.

🔧

Hybrid search with RRF — production pattern

from rank_bm25 import BM25Okapi
import numpy as np


def hybrid_search(query: str, docs: list[str], embeddings: np.ndarray,
                  embed_fn, top_k: int = 10, alpha: float = 0.5):
    """
    Combine dense (vector) and sparse (BM25) retrieval using weighted RRF.
    alpha: weight for dense vs sparse (0.5 = equal weight)
    """
    # Dense retrieval. The dot product equals cosine similarity only when
    # the document and query embeddings are L2-normalized.
    query_emb = embed_fn(query)
    dense_scores = np.dot(embeddings, query_emb)
    dense_ranks = np.argsort(-dense_scores)  # best first

    # Sparse retrieval (BM25). In production, build the BM25 index once
    # at startup instead of on every query.
    tokenized_docs = [doc.split() for doc in docs]
    bm25 = BM25Okapi(tokenized_docs)
    sparse_scores = bm25.get_scores(query.split())
    sparse_ranks = np.argsort(-sparse_scores)

    # RRF fusion (ranks are 1-based, as in standard RRF)
    k = 60  # standard RRF constant
    rrf_scores = {}
    for rank, doc_idx in enumerate(dense_ranks, start=1):
        rrf_scores[doc_idx] = rrf_scores.get(doc_idx, 0.0) + alpha / (k + rank)
    for rank, doc_idx in enumerate(sparse_ranks, start=1):
        rrf_scores[doc_idx] = rrf_scores.get(doc_idx, 0.0) + (1 - alpha) / (k + rank)

    # Sort by fused score, best first
    ranked = sorted(rrf_scores.items(), key=lambda x: x[1], reverse=True)
    return [docs[idx] for idx, _ in ranked[:top_k]]

Hybrid Method	How It Works	Best When
RRF (Reciprocal Rank Fusion)	Sum of 1/(k+rank) across retrievers	Default choice, no tuning needed
Weighted Linear	α × dense_score + (1-α) × sparse_score	When you can tune α on your data
Cascade	Sparse retrieval first for recall, then dense re-ranking of the candidates	When sparse retrieval is cheaper and latency is tight
Learned Fusion	Train a model to combine scores	When you have labeled data + resources
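The weighted linear row hides one practical detail: cosine similarities and BM25 scores live on different scales, so they need to be normalized before they are mixed. Below is a minimal sketch assuming min-max normalization, which is one common choice rather than the only one; the function names and the sample scores are hypothetical.

```python
def min_max(scores: list[float]) -> list[float]:
    """Rescale scores to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0] * len(scores)
    return [(s - lo) / (hi - lo) for s in scores]


def weighted_fusion(dense: list[float], sparse: list[float],
                    alpha: float = 0.5) -> list[float]:
    """alpha * dense + (1 - alpha) * sparse, after normalizing each list."""
    d, s = min_max(dense), min_max(sparse)
    return [alpha * x + (1 - alpha) * y for x, y in zip(d, s)]


# Hypothetical scores for three documents, in the same document order
dense_scores = [0.82, 0.79, 0.40]   # cosine similarities
sparse_scores = [1.2, 9.5, 0.3]     # BM25 scores (unbounded)

fused = weighted_fusion(dense_scores, sparse_scores, alpha=0.5)
best = max(range(len(fused)), key=fused.__getitem__)
print(fused, "best document index:", best)
```

Without normalization the BM25 score of 9.5 would swamp every cosine similarity regardless of α. RRF sidesteps the problem by using ranks instead of raw scores, which is why it needs no tuning.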

Query Enhancement — Making Queries Retrieval-Friendly In-depth

User queries are often poor search queries: too short, ambiguous, or phrased as questions when documents are statements. Query enhancement transforms the user's question into something more likely to retrieve relevant documents.

📝

Query Expansion

Add related terms to the query. "Python web" → "Python web Flask Django framework app"

Use LLM to generate synonyms
Or use WordNet/embeddings
Increases recall, may hurt precision

✨

HyDE (Hypothetical Doc)

Generate a hypothetical answer, embed that instead of the question.

Questions embed differently than docs
Generated answer is closer to real docs
+10–15% recall on some benchmarks

🔀

Multi-Query

Generate 3–5 variations of the query, retrieve with each, merge results.

"refund policy" + "return items" + "money back guarantee"
Captures more angles
3× retrieval cost
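Multi-query retrieval needs a way to merge the result lists from each variation. The sketch below reuses RRF for the merge, which is an assumption on my part (a simple union with deduplication also works). The `retrieve` callable and the `FAKE_INDEX` lookup table are stand-ins for a real vector search.

```python
from collections import defaultdict
from typing import Callable


def multi_query_retrieve(
    queries: list[str],
    retrieve: Callable[[str], list[str]],
    top_k: int = 5,
    k: int = 60,
) -> list[str]:
    """Run every query variation and merge the ranked ID lists with RRF."""
    scores: dict[str, float] = defaultdict(float)
    for q in queries:
        for rank, doc_id in enumerate(retrieve(q), start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)[:top_k]


# Stand-in retriever: a fixed lookup table instead of a real vector search
FAKE_INDEX = {
    "refund policy": ["refunds.md", "terms.md"],
    "return items": ["returns.md", "refunds.md"],
    "money back guarantee": ["refunds.md", "guarantee.md"],
}

merged = multi_query_retrieve(list(FAKE_INDEX), lambda q: FAKE_INDEX[q], top_k=3)
print(merged)
```

Documents that show up for several variations accumulate score, so `refunds.md` rises to the top even though no single query ranks it first every time.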

🪄

HyDE implementation — generate hypothetical document, then embed

def hyde_retrieval(query: str, llm, embed_fn, index, top_k=5):
    """
    HyDE: Hypothetical Document Embeddings
    1. Generate a hypothetical answer to the query
    2. Embed that answer (not the question)
    3. Search for similar documents
    """
    # Step 1: Generate hypothetical document
    prompt = f"""Write a short passage that would answer this question:

Question: {query}

Passage:"""
    hypothetical_doc = llm.generate(prompt, max_tokens=200)

    # Step 2: Embed the hypothetical answer (not the question!)
    hyde_embedding = embed_fn(hypothetical_doc)

    # Step 3: Search with the hypothetical doc's embedding
    results = index.search(hyde_embedding, top_k=top_k)
    return results


# Why this works: Questions are phrased differently than answers.
# "What is the refund policy?" embeds far from "Our refund policy is..."
# HyDE generates "Our refund policy allows 30-day returns..." which
# embeds close to actual policy documents.

HyDE Caveat

HyDE adds an LLM call to every query — 100–500ms latency + cost. It works best when: (1) question-answer asymmetry is high, (2) latency is not critical, (3) you've measured real improvement on your data. Don't blindly apply it — A/B test.

Metadata Filtering — Narrowing Before Search Core

Sometimes you know constraints before search: only documents from this year, only from the engineering team, only product docs. Metadata filtering narrows the search space, improving both precision and speed.

Filter Type	Example	Use Case
Equality	`department = "engineering"`	User belongs to a specific team
Range	`date >= "2025-01-01"`	Only recent documents
In-list	`source IN ["faq", "docs"]`	Only certain document types
Boolean AND	`public = true AND lang = "en"`	Multiple constraints
Geo	`location NEAR (lat, lon, 10km)`	Location-aware retrieval

🌲

Pinecone filtering

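# Pinecone range operators ($gt, $gte, $lt, $lte) compare numbers only,
# so store dates as a numeric field. "date_ts" is an illustrative field
# name holding a Unix timestamp (1735689600 = 2025-01-01 00:00 UTC).
results = index.query(
    vector=query_embedding,
    top_k=10,
    filter={
        "department": {"$eq": "engineering"},
        "date_ts": {"$gte": 1735689600},
    },
    include_metadata=True,
)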

🔷

Qdrant filtering

from qdrant_client.models import FieldCondition, Filter, MatchValue

results = client.search(
    collection_name="docs",
    query_vector=query_embedding,
    limit=10,
    query_filter=Filter(must=[
        FieldCondition(key="department", match=MatchValue(value="engineering"))
    ]),
)

Pre-filter vs Post-filter

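Pre-filtering (applying the metadata filter before or during the ANN search) should be the default: every result is guaranteed to satisfy the filter, and most vector DBs support it efficiently. Post-filtering (search everything, then drop non-matching results) is acceptable only when the filter excludes few documents. With a selective filter it can return far fewer than k results, because most of the nearest neighbors get discarded. When a filter leaves only a small candidate set, an exact scan over the filtered subset is often the fastest option.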
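The difference between the two strategies is easy to see on a toy corpus. The sketch below uses precomputed similarity scores and an exact sort as a stand-in for ANN search; the corpus, field names, and helper functions are all hypothetical.

```python
def top_k_by_score(items: list[dict], k: int) -> list[dict]:
    """Exact nearest-neighbor stand-in: highest `score` first."""
    return sorted(items, key=lambda d: d["score"], reverse=True)[:k]


def pre_filter_search(items, predicate, k):
    """Filter first, then rank: returns k matches whenever k exist."""
    return top_k_by_score([d for d in items if predicate(d)], k)


def post_filter_search(items, predicate, k):
    """Rank first, then filter: may return fewer than k results."""
    return [d for d in top_k_by_score(items, k) if predicate(d)]


# Hypothetical corpus with similarity scores precomputed for one query
corpus = [
    {"id": 1, "department": "sales", "score": 0.95},
    {"id": 2, "department": "sales", "score": 0.90},
    {"id": 3, "department": "engineering", "score": 0.85},
    {"id": 4, "department": "sales", "score": 0.80},
    {"id": 5, "department": "engineering", "score": 0.60},
]


def is_engineering(doc: dict) -> bool:
    return doc["department"] == "engineering"


pre = pre_filter_search(corpus, is_engineering, k=2)
post = post_filter_search(corpus, is_engineering, k=2)
print("pre-filter ids: ", [d["id"] for d in pre])
print("post-filter ids:", [d["id"] for d in post])
```

Here the two best-scoring documents both belong to sales, so post-filtering with k=2 returns nothing while pre-filtering still finds both engineering documents. Real systems soften this by over-fetching before the post-filter, at the cost of extra work.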

Multi-Index Retrieval — Different Indices for Different Data In-depth

Real applications have multiple document types: FAQs, documentation, support tickets, code. Each may benefit from different chunking, embeddings, or ranking. Multi-index retrieval queries multiple specialized indices and merges results.

Multi-index architecture — route query to specialized indices

✅

When to use multi-index

Documents have very different structures (FAQ vs manuals)
Different chunking strategies needed
Different embedding models work better for each type
Access control varies by doc type

❌

When single index is fine

Documents are homogeneous
Same chunking works everywhere
Adding complexity without measured benefit
Small corpus (<100K docs)

Retrieval Optimization Checklist Core

Optimization	Expected Gain	Effort	When to Use
Add BM25 hybrid search	+5–15% recall	Low	Almost always — default recommendation
Cross-encoder re-ranking	+5–20% precision	Medium	When precision matters more than latency
Multi-query retrieval	+5–10% recall	Low–Medium	Short/ambiguous queries
HyDE	+5–15% recall	Medium	High question-doc asymmetry
Metadata pre-filtering	Variable (precision)	Low	When you have useful metadata
Better chunking	+10–30% recall	Medium	Before other optimizations
Better embedding model	+5–15% recall	Medium	If current model underperforms on eval

Optimization Order

Before adding fancy techniques, get the basics right: 1. Good chunking (Ch 2), 2. Good embeddings (Ch 3), 3. Hybrid search (this chapter), 4. Re-ranking (Ch 6). Only then consider HyDE, multi-query, or multi-index. Measure each change on your eval set — many "optimizations" don't help specific datasets.

∑ Chapter 05 — Key Takeaways

Dense retrieval (vectors) captures semantic similarity; sparse retrieval (BM25) captures exact keywords
Hybrid search combining both outperforms either alone — use RRF to merge ranked lists
Query enhancement: HyDE (hypothetical docs), multi-query (variations), query expansion (synonyms)
Metadata filtering narrows search space before ANN — faster and more precise
Multi-index architectures help when document types are very different
Optimization order: chunking → embeddings → hybrid search → re-ranking → advanced techniques
Always measure on your eval set — not all optimizations help all datasets

Chapter 06 · Quality

Ranking & Re-Ranking — From Retrieved to Relevant

Retrieval gives you candidates. Re-ranking gives you the right candidates in the right order. A cross-encoder re-ranker looking at 20 retrieved chunks and selecting the best 5 can improve answer quality more than any other single optimization in the RAG pipeline.

Why Re-ranking — The Two-Stage Architecture Foundation

Embedding-based retrieval is fast but shallow — it computes similarity independently for each document without comparing query and document together. Re-ranking takes the top-k candidates and applies a more powerful (but slower) model that reads query + document jointly.

Two-stage retrieval — fast recall first, then precise re-ranking

⚡ Bi-encoder (Stage 1)

How: Encode query and document independently, then compare

Speed: ~1ms per 1M docs (pre-indexed)

Quality: Good recall, moderate precision

Use: Initial retrieval from full corpus

🎯 Cross-encoder (Stage 2)

How: Feed [query + document] together through transformer

Speed: ~5ms per document pair

Quality: Much higher precision

Use: Re-rank top 20–50 candidates

Re-ranking Models — What to Use In-depth

Model	Type	Quality	Speed	Cost	Best For
Cohere Rerank v3	API	Excellent	~100ms / 50 docs	$1/1K searches	Production default
Voyage Reranker	API	Excellent	~100ms / 50 docs	$0.05/1K	Cost-effective API option
BGE-reranker-v2-m3	Local	Very good	~200ms / 50 docs (GPU)	Free	Self-hosted, multilingual
cross-encoder/ms-marco	Local	Good	~300ms / 50 docs (GPU)	Free	Prototyping, English only
ColBERT v2	Local	Very good	~50ms / 50 docs	Free	Late interaction, fast
Jina Reranker v2	Both	Very good	~100ms / 50 docs	Free / API	Multilingual, long docs

🔧

Re-ranking with Cohere — production pattern

import cohere

co = cohere.Client("YOUR_API_KEY")


def rerank_results(query: str, documents: list[str], top_k: int = 5):
    """Re-rank retrieved documents using Cohere Rerank."""
    response = co.rerank(
        model="rerank-english-v3.0",
        query=query,
        documents=documents,
        top_n=top_k,
        return_documents=True,
    )
    # Results come back sorted by relevance score, best first
    return [
        {
            "text": r.document.text,
            "score": r.relevance_score,  # 0.0 to 1.0
            "index": r.index,            # original position
        }
        for r in response.results
    ]


# Usage: retrieve 50 with vector search, re-rank to top 5
# (vector_search is your own first-stage retrieval function)
candidates = vector_search(query, top_k=50)
best = rerank_results(query, candidates, top_k=5)

🔧

Re-ranking with local cross-encoder (sentence-transformers)

from sentence_transformers import CrossEncoder

# Load model once at startup
reranker = CrossEncoder("BAAI/bge-reranker-v2-m3")


def rerank_local(query: str, docs: list[str], top_k: int = 5):
    """Re-rank using local cross-encoder model."""
    pairs = [[query, doc] for doc in docs]
    scores = reranker.predict(pairs)
    # Sort by score descending
    ranked = sorted(zip(docs, scores), key=lambda x: x[1], reverse=True)
    return [doc for doc, _ in ranked[:top_k]]

Lost in the Middle — Position Matters In-depth

Research shows LLMs pay more attention to information at the beginning and end of the context window, but tend to miss information in the middle. This "lost in the middle" effect means document ordering matters.

LLM attention by position — the "lost in the middle" effect

✅

Best-first ordering

Put the most relevant document first. Simple, effective for most cases.

🔄

Interleaved ordering

Alternate: #1 first, #2 last, #3 second, #4 second-to-last. Spreads relevance across attention peaks.

📐

Fewer, better chunks

Re-rank to top 3–5 instead of stuffing 10+. Less middle = less lost. Quality over quantity.

Practical Impact

In benchmarks, placing the answer in the middle vs at the start reduces accuracy by 10–20%. The fix: re-rank aggressively (top 3–5 only), put best results first, and keep total context short. More context ≠ better answers.
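The interleaved ordering described above is a short list manipulation. This is a minimal sketch assuming the input is already sorted best-first; the function name is illustrative.

```python
def interleave_for_context(ranked: list[str]) -> list[str]:
    """Put odd ranks at the front and even ranks at the back.

    Input is best-first. Ranks 1, 3, 5, ... fill the start and ranks
    2, 4, 6, ... fill the end in reverse, so the weakest items land
    in the middle of the context.
    """
    front = ranked[0::2]        # ranks 1, 3, 5, ...
    back = ranked[1::2][::-1]   # ranks ..., 6, 4, 2
    return front + back


chunks = ["rank1", "rank2", "rank3", "rank4", "rank5"]
ordered = interleave_for_context(chunks)
print(ordered)
```

For five chunks this places rank 1 first, rank 2 last, rank 3 second, and rank 4 second-to-last, leaving rank 5 in the middle where attention is weakest.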

Diversity Ranking — Avoiding Redundancy Core

If your top 5 results are 5 chunks from the same document saying the same thing, you waste context and miss other relevant information. Diversity ranking ensures retrieved results cover different aspects of the query.

Strategy	How It Works	When to Use
MMR (Maximal Marginal Relevance)	Balance relevance to query vs diversity from already-selected docs	Default choice — simple, effective
Source deduplication	Max 2 chunks per source document	When same doc is over-represented
Similarity threshold	Remove results with cosine sim >0.95 to each other	When near-duplicates are common
Category diversity	Ensure mix of doc types (FAQ + docs + code)	Multi-index systems
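Source deduplication is the simplest of these strategies to implement. The sketch below assumes each chunk is a dict with a `source` key and that the list is already sorted best-first; both are illustrative assumptions about your data shape.

```python
from collections import Counter


def cap_per_source(chunks: list[dict], max_per_source: int = 2) -> list[dict]:
    """Keep ranked order, but drop chunks once a source hits its cap."""
    seen: Counter = Counter()
    kept = []
    for chunk in chunks:  # assumed sorted best-first
        source = chunk["source"]
        if seen[source] < max_per_source:
            kept.append(chunk)
            seen[source] += 1
    return kept


# Hypothetical re-ranked results: one document dominates the top of the list
ranked_chunks = [
    {"source": "pricing-faq.pdf", "text": "Refunds within 30 days..."},
    {"source": "pricing-faq.pdf", "text": "Refund requests go to..."},
    {"source": "pricing-faq.pdf", "text": "Refunds are processed in..."},
    {"source": "terms.pdf", "text": "Subscriptions can be cancelled..."},
]

deduped = cap_per_source(ranked_chunks, max_per_source=2)
print([c["source"] for c in deduped])
```

Apply the cap after re-ranking, so the chunks that survive from each source are its highest-scoring ones.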

Maximal Marginal Relevance (MMR)

MMR = λ · Sim(q, d) − (1−λ) · max[Sim(d, d_selected)]

λ = 0.5–0.7 typical. Higher λ = more relevance, lower λ = more diversity.

🔧

MMR implementation

import numpy as np


def mmr_rerank(query_emb, doc_embs, docs, top_k=5, lambda_=0.6):
    """Select diverse, relevant documents using MMR."""
    selected = []
    remaining = list(range(len(docs)))

    # Precompute similarities (dot product equals cosine similarity
    # only when the embeddings are L2-normalized)
    q_sim = np.dot(doc_embs, query_emb)   # relevance scores
    d_sim = np.dot(doc_embs, doc_embs.T)  # pairwise doc similarities

    for _ in range(top_k):
        if not remaining:
            break
        if not selected:
            # First pick: most relevant
            best = max(remaining, key=lambda i: q_sim[i])
        else:
            # MMR: balance relevance vs diversity
            mmr_scores = {}
            for i in remaining:
                max_sim_to_selected = max(d_sim[i][j] for j in selected)
                mmr_scores[i] = lambda_ * q_sim[i] - (1 - lambda_) * max_sim_to_selected
            best = max(remaining, key=lambda i: mmr_scores[i])
        selected.append(best)
        remaining.remove(best)

    return [docs[i] for i in selected]

∑ Chapter 06 — Key Takeaways

Use a two-stage architecture: fast bi-encoder retrieval (top 50), then cross-encoder re-ranking (top 5)
Top re-rankers: Cohere Rerank v3 (API), BGE-reranker-v2 (local), ColBERT v2 (fast local)
Re-ranking typically improves precision@5 by 5–20%, making it the highest-ROI optimization after hybrid search
Lost in the middle: LLMs miss info in the middle of context — put best results first, use fewer chunks
MMR diversity ranking avoids redundant results — balance relevance (λ) vs diversity (1−λ)
Deduplicate by source — max 2 chunks per document prevents one doc dominating context

Chapter 07 · Prompting

Context Construction & Prompting — What Goes Into the LLM

You've retrieved the right documents and ranked them well. Now comes the final mile: how you assemble the prompt determines whether the LLM uses that context correctly. Bad context construction turns great retrieval into mediocre answers.

RAG Prompt Anatomy — The Four Parts Foundation

Every RAG prompt has four parts. Getting each part right — and getting the ordering right — is what separates good RAG from great RAG.

The four parts of a RAG prompt — order matters

📝

Production RAG prompt template

SYSTEM_PROMPT = """You are a helpful assistant for {company_name}.

INSTRUCTIONS:
- Answer ONLY using the provided context below
- If the context doesn't contain the answer, say "I don't have information about that in my knowledge base"
- Cite sources using the [Source N] labels shown in the context
- Be concise and direct
- Do NOT make up information not in the context

CONTEXT:
{retrieved_context}
"""

USER_PROMPT = """Question: {user_query}

Answer (cite sources):"""


def build_rag_prompt(query, chunks, company="Acme Corp"):
    # Format retrieved chunks with source metadata
    # (each chunk is assumed to expose .text and a .metadata dict)
    context_parts = []
    for i, chunk in enumerate(chunks):
        source = chunk.metadata.get("source", "unknown")
        section = chunk.metadata.get("section", "")
        header = f"[Source {i+1}: {source}"
        if section:
            header += f" | {section}"
        header += "]"
        context_parts.append(f"{header}\n{chunk.text}")
    context = "\n\n".join(context_parts)

    system = SYSTEM_PROMPT.format(company_name=company, retrieved_context=context)
    user = USER_PROMPT.format(user_query=query)
    return system, user

Citation Prompting — Making Answers Traceable Core

Citations serve two purposes: they let users verify answers and they reduce hallucination by forcing the model to ground statements in specific sources. Without citations, you can't tell if the model invented something.

1️⃣

Inline Citations

"The refund window is 30 days [Source 1]. For subscriptions, cancellation is immediate [Source 2]."

Easiest to implement
Clear source per claim
Works with any LLM

📎

Footnote Citations

"The refund window is 30 days¹. Cancel anytime²."
¹ pricing-faq.pdf ² terms.pdf

Cleaner reading flow
Needs post-processing
Better for long answers

💬

Quote-based Citations

"As stated in the FAQ: 'returns within 30 days of purchase' (pricing-faq.pdf)"

Verifiable quotes
Highest trust
Longer responses

Citation Hallucination

LLMs can hallucinate citations — they'll cite "[Source 3]" even if only 2 sources exist, or attribute information to the wrong source. Always validate citations programmatically: check that cited source numbers exist, and optionally verify the claim appears in the cited chunk using string matching or semantic similarity.
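The first of those checks, confirming that every cited source number actually exists, takes only a few lines. This sketch assumes the inline `[Source N]` citation format shown earlier; the function name is illustrative, and verifying that the claim really appears in the cited chunk is left out.

```python
import re

CITATION = re.compile(r"\[Source (\d+)\]")


def invalid_citations(answer: str, num_sources: int) -> list[int]:
    """Return cited source numbers that match no provided source."""
    cited = {int(n) for n in CITATION.findall(answer)}
    return sorted(n for n in cited if not 1 <= n <= num_sources)


answer = (
    "The refund window is 30 days [Source 1]. "
    "Enterprise plans have custom terms [Source 3]."
)
bad = invalid_citations(answer, num_sources=2)
print("invalid citations:", bad)
```

When the list is non-empty you can strip the offending citation, regenerate the answer, or flag the response for review, depending on how strict the application needs to be.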

Context Window Management — Fitting What Matters In-depth

Even with 128K-token context windows, more context is not always better. Cost increases linearly, latency increases, and the "lost in the middle" effect gets worse. Smart context management is about using the window efficiently.

Strategy	How It Works	Token Savings	Quality Impact
Fewer, better chunks	Re-rank to top 3–5 instead of 10+	50–70% fewer tokens	Often improves quality
Chunk compression	Use LLM to summarize each chunk before insertion	60–80% fewer tokens	May lose details
Relevant sentence extraction	Extract only sentences relevant to query from each chunk	50–70% fewer tokens	Preserves key info
Token budget allocation	Set max tokens per chunk (e.g., 500), truncate overflow	Predictable	May cut important context
Map-reduce for large corpus	Summarize each chunk separately, then combine summaries	Can handle unlimited docs	Multiple LLM calls, higher latency
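Token budget allocation from the table can be sketched as a greedy packer: truncate each chunk to a per-chunk cap, then stop once the overall budget is spent. The sketch counts whitespace-separated words as a crude stand-in for tokens, which is an assumption; a real system would use the model's tokenizer.

```python
def pack_context(chunks: list[str], total_budget: int, per_chunk_cap: int) -> list[str]:
    """Truncate each chunk to per_chunk_cap, stop when total_budget is spent.

    Lengths are measured in whitespace-separated words, not real tokens.
    """
    packed: list[str] = []
    used = 0
    for chunk in chunks:  # assumed sorted best-first
        words = chunk.split()[:per_chunk_cap]
        if used + len(words) > total_budget:
            break
        packed.append(" ".join(words))
        used += len(words)
    return packed


chunks = ["a b c d e f", "g h i", "j k l m"]
context = pack_context(chunks, total_budget=8, per_chunk_cap=4)
print(context)
```

The first chunk is cut to four words, the second fits whole, and the third is dropped because it would exceed the budget. Since chunks arrive best-first, the budget is always spent on the highest-ranked material.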

❌ Don't: Stuff everything

Retrieving 20 chunks × 500 tokens = 10K tokens of context. Most of it is noise. Cost: $0.03 per query with GPT-4o. At 10K queries/day = $300/day just for context.

Quality: LLM drowns in irrelevant text, misses the answer, or picks wrong chunk.

✅ Do: Curate aggressively

Re-rank to top 5 chunks × 500 tokens = 2.5K tokens. Cost: $0.0075 per query. At 10K queries/day = $75/day. 4× cheaper.

Quality: Less noise, LLM focuses on best content, answers more accurately.

Teaching "I Don't Know" — Abstention Core

The hardest part of RAG isn't answering questions — it's knowing when not to answer. When retrieved context doesn't contain the answer, the LLM should say "I don't know" rather than hallucinate.

❌

Bad: No abstention instruction

"Answer the user's question using the context."

# LLM will ALWAYS answer, even when the context doesn't contain the answer.
# It fills the gap with hallucination.

✅

Good: Explicit abstention

"Answer ONLY using the provided context. If the context does not contain enough information to answer, respond with: 'I don't have information about that in my knowledge base. Please contact support@company.com for help.'"

# LLM knows it's OK to not answer.
# Provides fallback action.

Confidence Scoring

For production systems, combine prompt-based abstention with a retrieval confidence check: if the best re-ranked score is below a threshold (e.g., 0.3), don't even send to the LLM — return a canned "I can't help with that" response. Saves tokens and avoids hallucination entirely.
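The retrieval confidence check can be sketched as a gate in front of the LLM call. The `generate` callable, the fallback wording, and the data shape (a list of `(text, score)` pairs from the re-ranker) are assumptions; the 0.3 threshold comes from the example above and should be calibrated on your own data.

```python
from typing import Callable

FALLBACK = "I can't help with that."


def answer_with_gate(
    query: str,
    reranked: list[tuple[str, float]],
    generate: Callable[[str, list[str]], str],
    threshold: float = 0.3,
) -> str:
    """Skip the LLM call when the best re-ranked score is below threshold."""
    if not reranked or max(score for _, score in reranked) < threshold:
        return FALLBACK
    return generate(query, [text for text, _ in reranked])


# Stand-in generator that records whether it was called
calls: list[str] = []


def fake_generate(query: str, context: list[str]) -> str:
    calls.append(query)
    return f"Answer based on {len(context)} chunk(s)"


weak = answer_with_gate("unrelated question", [("chunk", 0.12)], fake_generate)
strong = answer_with_gate("refund policy?", [("chunk", 0.81)], fake_generate)
print(weak)
print(strong)
```

Setting the threshold too high produces over-refusal, so track how often the gate fires on questions your corpus can actually answer.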

Multi-turn RAG — Handling Follow-up Questions In-depth

In conversation, follow-up questions reference earlier context: "What about for enterprise plans?" requires knowing the previous question was about pricing. Query rewriting transforms follow-ups into standalone queries for retrieval.

💬 User asks: "What's the refund policy?"

🔍 Retrieve: find refund docs

💬 Follow-up: "What about enterprise?"

✏️ Rewrite: "enterprise refund policy"

🔍 Retrieve: find enterprise docs

✏️

Query rewriting for multi-turn RAG

REWRITE_PROMPT = """Given the conversation history, rewrite the last user message as a standalone search query. Include all necessary context from the conversation.

Chat history:
{history}

Last message: {current_message}

Standalone query:"""


def rewrite_query(history: list, current: str, llm) -> str:
    """Rewrite follow-up question as standalone query."""
    # Only the last 4 messages are included to keep the prompt short
    formatted_history = "\n".join(
        f"{m['role']}: {m['content']}" for m in history[-4:]
    )
    prompt = REWRITE_PROMPT.format(
        history=formatted_history, current_message=current
    )
    return llm.generate(prompt, max_tokens=100)


# Example:
# History: "What's the refund policy?" → "30 days..."
# Current: "What about enterprise?"
# Rewritten: "What is the refund policy for enterprise plans?"

Rewrite Cost

Query rewriting adds an LLM call per turn (~50–100ms, ~100 tokens). For simple applications, first check whether the user message is self-contained, using heuristics such as "does it open with 'what about' or lean on a pronoun like 'it' or 'that'?", and rewrite only when it is clearly a follow-up. Don't rewrite "How do I reset my password?"; it's already standalone.
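One possible shape for such a pre-check is shown below. The opener list, the pronoun list, and the three-word cutoff are all illustrative guesses rather than tested values, so expect false positives and tune them against real conversations.

```python
import re

FOLLOW_UP_OPENERS = ("what about", "how about", "what if", "and ", "also ")
PRONOUNS = re.compile(r"\b(it|its|that|this|those|these|they|them)\b")


def needs_rewrite(message: str, has_history: bool) -> bool:
    """Cheap guess at whether a message depends on earlier turns."""
    if not has_history:
        return False  # first turn: nothing to resolve against
    text = message.strip().lower()
    if text.startswith(FOLLOW_UP_OPENERS):
        return True
    if PRONOUNS.search(text):
        return True
    return len(text.split()) <= 3  # very short messages rarely stand alone


print(needs_rewrite("What about enterprise?", has_history=True))
print(needs_rewrite("How do I reset my password?", has_history=True))
```

A false positive only costs one unnecessary rewrite call, while a false negative sends an ambiguous query to retrieval, so it is reasonable to bias the heuristic toward rewriting.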

∑ Chapter 07 — Key Takeaways

RAG prompts have 4 parts: system instruction → retrieved context → user query → output format
Citation prompting forces grounding — inline, footnote, or quote-based — always validate citations programmatically
Less context is often better: re-rank to top 3–5 chunks, avoid stuffing 10+ into the prompt
Teach the model to say "I don't know" with explicit abstention instructions + retrieval confidence thresholds
Lost in the middle: put best results first, keep context short, consider interleaved ordering
Multi-turn RAG needs query rewriting — transform follow-ups into standalone retrieval queries
Context window management: fewer better chunks = 4× cheaper, better quality than stuffing everything

Chapter 08 · Quality Assurance

Failure Modes & Evaluation — Why RAG Breaks and How to Measure

Every RAG system that "works in demos" eventually breaks in production. The difference between a toy and a product is knowing how it fails, measuring how often, and fixing it systematically. This chapter is about building that feedback loop.

The RAG Failure Taxonomy — 7 Ways RAG Breaks Core

RAG failures fall into two categories: retrieval failures (wrong documents found) and generation failures (wrong answer produced from correct documents). Each requires different fixes.

RAG failure taxonomy — retrieval vs generation failures

Failure Mode	Symptom	Root Cause	Fix
① Missing content	"I don't know" on answerable questions	Document not ingested, format parsing failed	Content coverage audit, loader validation
② Wrong chunks	Confident answer from wrong topic	Poor embeddings, no metadata filter, bad chunking	Hybrid search, re-ranking, metadata filters
③ Stale content	Outdated information returned	No re-indexing pipeline, deleted docs still indexed	TTL, incremental sync, freshness ranking
④ Chunking artifacts	Partial/incoherent answers	Answer split across chunk boundary	Parent-child chunks, larger overlap, semantic chunking
⑤ Hallucination	Facts not in any retrieved document	LLM uses parametric knowledge instead of context	Citation enforcement, faithfulness scoring
⑥ Wrong synthesis	Misinterprets or contradicts source	Context too noisy, conflicting chunks, poor prompt	Fewer chunks, better re-ranking, explicit instructions
⑦ Over-refusal	"I don't know" when answer exists in context	Abstention threshold too high, overly cautious prompt	Calibrate confidence threshold, tune prompt

Retrieval Metrics — Did We Find the Right Docs? In-depth

🎯

Recall@K

Of all relevant documents, what fraction did we find in the top-K?

Recall@K = |relevant ∩ retrieved@K| / |relevant|
K=5 is typical. Target: >0.85 for production.

📊

MRR (Mean Reciprocal Rank)

On average, how high is the first relevant result?

MRR = (1/N) × Σ (1 / rank_i)
rank_i = position of the first relevant doc for query i. MRR = 1.0 means the first relevant doc is always at rank 1.

📐

NDCG (Normalized DCG)

How good is the ordering of results? Penalizes relevant docs at low positions.

NDCG=1.0 = perfect ranking
Considers graded relevance (not just binary)
Standard for search quality evaluation

⚖️

Precision@K

Of the K retrieved docs, how many are actually relevant?

Precision@5 of 0.6 = 3 of 5 are relevant
Tradeoff with recall — improve one, other may drop
Critical for context quality (less noise)
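The four retrieval metrics above can be sketched in plain Python. This is an illustrative implementation, not a library API. The function names are assumptions, and the NDCG variant uses the common linear-gain, log2-discount form of DCG:

```python
import math

def recall_at_k(retrieved, relevant, k):
    """Fraction of all relevant doc IDs that appear in the top-k results."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & set(relevant)) / len(relevant)

def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k slots filled by relevant docs."""
    if k <= 0:
        return 0.0
    relevant_set = set(relevant)
    return sum(1 for doc in retrieved[:k] if doc in relevant_set) / k

def mrr(runs):
    """Mean reciprocal rank over (retrieved, relevant) pairs; a query with no hit adds 0."""
    total = 0.0
    for retrieved, relevant in runs:
        for rank, doc in enumerate(retrieved, start=1):
            if doc in relevant:
                total += 1.0 / rank
                break
    return total / len(runs) if runs else 0.0

def ndcg_at_k(retrieved, gains, k):
    """NDCG@k with graded relevance; `gains` maps doc ID to a grade (missing = 0)."""
    def dcg(values):
        return sum(g / math.log2(i + 2) for i, g in enumerate(values))
    actual = dcg([gains.get(doc, 0) for doc in retrieved[:k]])
    ideal = dcg(sorted(gains.values(), reverse=True)[:k])
    return actual / ideal if ideal > 0 else 0.0

retrieved = ["d1", "d9", "d2", "d8", "d3"]
relevant = {"d1", "d2", "d3", "d4"}
print(precision_at_k(retrieved, relevant, 5))  # 3 of 5 retrieved are relevant
print(recall_at_k(retrieved, relevant, 5))     # 3 of 4 relevant docs were found
```

In this example the same result list scores 0.6 on Precision@5 and 0.75 on Recall@5, which shows why the two are reported together.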

Generation Metrics — Did We Answer Correctly? In-depth

Metric	What It Measures	How to Compute	Target
Faithfulness	Is the answer grounded in retrieved context?	LLM-as-judge: "Are all claims in the answer supported by context?"	>0.90
Answer Relevance	Does the answer address the user's question?	LLM-as-judge: "Does this answer the question? Score 1–5" (normalized to 0–1)	>0.85
Context Relevance	Was the retrieved context useful?	LLM-as-judge: "Is this context relevant to the question?"	>0.80
Correctness	Is the answer factually correct?	Compare against golden answer (exact or semantic match)	>0.85
Harmfulness	Does the answer contain harmful/biased content?	Safety classifier or LLM-as-judge	<0.01
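One way to turn LLM-as-judge verdicts into a 0–1 faithfulness score is to judge each claim in the answer separately and report the supported share. The sketch below assumes the answer has already been split into claims and takes the judge as a plain callable. `keyword_judge` is a toy stand-in so the example runs offline; a real setup would replace it with an LLM call:

```python
import re
from typing import Callable, List

def faithfulness_score(claims: List[str], context: str,
                       judge: Callable[[str, str], bool]) -> float:
    """Share of answer claims that the judge marks as supported by the context."""
    if not claims:
        return 1.0  # an empty answer makes no unsupported claims
    supported = sum(1 for claim in claims if judge(claim, context))
    return supported / len(claims)

def keyword_judge(claim: str, context: str) -> bool:
    """Toy stand-in for an LLM judge: supported only if every claim word is in the context."""
    context_words = set(re.findall(r"\w+", context.lower()))
    return all(word in context_words for word in re.findall(r"\w+", claim.lower()))

context = "Refunds are accepted within 30 days of purchase."
claims = ["Refunds within 30 days", "Refunds within 90 days"]
print(faithfulness_score(claims, context, keyword_judge))  # one of two claims is supported
```

Scoring per claim rather than per answer makes it easier to see which statement drifted from the context.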

RAGAS — The Standard RAG Evaluation Framework Core

RAGAS (Retrieval Augmented Generation Assessment) is an open-source framework that automates RAG evaluation. It computes faithfulness, answer relevance, and context metrics using LLM-as-judge.

🔧

RAGAS evaluation pipeline

from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_precision,
    context_recall,
)
from datasets import Dataset

# Prepare evaluation dataset (the "..." entries are placeholders for more rows)
eval_data = {
    "question": ["What's the refund policy?", ...],
    "answer": ["The refund policy is 30 days...", ...],
    "contexts": [["Chunk 1 text", "Chunk 2 text"], ...],
    "ground_truth": ["Refunds within 30 days...", ...],
}
dataset = Dataset.from_dict(eval_data)

# Run evaluation
results = evaluate(
    dataset,
    metrics=[
        faithfulness,       # Is answer grounded in context?
        answer_relevancy,   # Does answer address the question?
        context_precision,  # Are retrieved docs relevant?
        context_recall,     # Did we find all relevant docs?
    ],
)
print(results)
# {'faithfulness': 0.92, 'answer_relevancy': 0.88,
#  'context_precision': 0.85, 'context_recall': 0.90}

Building a Golden Test Set

Create 50–100 (question, expected_answer, relevant_docs) tuples from your actual data. Have domain experts label them. Run RAGAS after every change to chunking, retrieval, or prompts. Automate this in CI — no RAG change ships without passing eval. This is the single most important practice for production RAG.
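A CI gate over the eval scores could look like the sketch below. The threshold values are assumptions mapped loosely from the targets in this chapter, and `failing_metrics` and `gate` are hypothetical names; the scores dict is expected to come from whatever eval run the pipeline uses:

```python
# Assumed minimums, loosely based on this chapter's targets; adjust per application.
THRESHOLDS = {
    "faithfulness": 0.90,
    "answer_relevancy": 0.85,
    "context_precision": 0.80,
    "context_recall": 0.85,
}

def failing_metrics(scores: dict, thresholds: dict = THRESHOLDS) -> list:
    """Return readable failures; a metric missing from `scores` counts as a failure."""
    failures = []
    for name, minimum in thresholds.items():
        value = scores.get(name)
        if value is None or value < minimum:
            failures.append(f"{name}: {value} (minimum {minimum})")
    return failures

def gate(scores: dict) -> int:
    """Return a process exit code: 0 when every metric passes, 1 otherwise."""
    problems = failing_metrics(scores)
    for problem in problems:
        print("EVAL REGRESSION:", problem)
    return 1 if problems else 0
```

In a CI job, calling `sys.exit(gate(scores))` at the end of the eval script is enough to block the deploy when any metric regresses.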

Continuous Evaluation — Monitoring in Production Core

📝 User Query (incoming question) → 🔍 Retrieve (log chunks + scores) → 🤖 Generate (log answer) → 📊 Evaluate (auto-score quality) → 🚨 Alert (flag low scores)

What to Log	Why	Alert Threshold
Retrieval scores (top-k)	Detect queries with no good matches	Best score < 0.3
Re-ranker scores	Detect relevance drops after re-ranking	Best score < 0.5
LLM response latency	Detect slow queries	p99 > 5s
"I don't know" rate	Detect coverage gaps	>15% of queries
User feedback (thumbs)	Ground truth from users	Negative rate >20%
Faithfulness (sampled)	Detect hallucination drift	Score < 0.85 on sample
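The thresholds in the table can be expressed as a small rule list. This is a sketch under the assumption that metrics arrive as a flat dict; the key names are hypothetical, and only the threshold values come from the table:

```python
ALERT_RULES = [
    # (metric key, predicate that is True when the alert should fire, message)
    ("best_retrieval_score", lambda v: v < 0.3, "no good retrieval match"),
    ("best_rerank_score", lambda v: v < 0.5, "low relevance after re-ranking"),
    ("p99_latency_s", lambda v: v > 5.0, "slow responses"),
    ("idk_rate", lambda v: v > 0.15, "possible coverage gap"),
    ("negative_feedback_rate", lambda v: v > 0.20, "high negative feedback"),
    ("sampled_faithfulness", lambda v: v < 0.85, "possible hallucination drift"),
]

def check_alerts(metrics: dict) -> list:
    """Return one alert string per rule that fires; metrics not present are skipped."""
    alerts = []
    for key, fires, message in ALERT_RULES:
        if key in metrics and fires(metrics[key]):
            alerts.append(f"{key}={metrics[key]}: {message}")
    return alerts

print(check_alerts({"best_retrieval_score": 0.22, "idk_rate": 0.10, "p99_latency_s": 6.1}))
```

Keeping the rules in data rather than in branching code makes it easy to review and adjust thresholds as the system matures.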

The Eval Paradox

You need evaluation to improve, but building eval sets takes time. Start small: 20 golden queries on day 1, add 5 per week from production logs. Within a month you'll have a meaningful eval set. Don't wait for perfection — imperfect evaluation beats no evaluation by a mile.

∑ Chapter 08 — Key Takeaways

RAG fails in 7 ways: 4 retrieval failures (missing, wrong, stale, chunking) + 3 generation failures (hallucination, wrong synthesis, over-refusal)
Retrieval metrics: Recall@K (did we find it?), MRR (how high?), NDCG (rank quality), Precision@K (how clean?)
Generation metrics: Faithfulness (grounded?), Answer Relevance (addresses question?), Correctness (factually right?)
Use RAGAS framework to automate evaluation — run after every pipeline change
Build a golden test set: 50–100 labeled (query, answer, docs) tuples — the most important practice for production RAG
Log everything in production: retrieval scores, latency, "I don't know" rates, user feedback — alert on degradation

Chapter 09 · Architecture

Advanced RAG Patterns — Beyond Naive Retrieval

Chapters 1–8 covered how to build a solid RAG pipeline. This chapter goes beyond — patterns that push RAG quality to the next level when the basics aren't enough. These techniques are more complex but solve specific failure modes that standard RAG can't handle.

Corrective RAG (CRAG) — Self-Correcting Retrieval In-depth

Standard RAG uses whatever the retriever returns, even if the results are irrelevant. CRAG adds a self-correction step: evaluate retrieval quality, and if it's poor, try alternative strategies before generating.

CRAG flow — evaluate retrieval quality, correct if needed
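The control flow can be sketched with pluggable components. This is a simplified illustration, not the reference CRAG implementation: the retriever, grader, rewriter, and generator are passed in as callables, the 0.5 threshold is an assumption, and query rewriting stands in for whichever alternative strategy is chosen:

```python
def corrective_rag(query, retrieve, grade, rewrite, generate,
                   threshold=0.5, max_attempts=2):
    """Retrieve, grade each chunk, and retry with a rewritten query if nothing passes."""
    current = query
    for _ in range(max_attempts):
        chunks = retrieve(current)
        # Grade against the ORIGINAL question so rewrites cannot drift off-topic.
        good = [chunk for chunk in chunks if grade(query, chunk) >= threshold]
        if good:
            return generate(query, good)
        current = rewrite(current)  # corrective step: try a refined query
    return "I don't know based on the available documents."
```

The grader can be a re-ranker score or an LLM-as-judge call. Capping `max_attempts` keeps the added latency bounded.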

Self-RAG — Decide Whether to Retrieve In-depth

Self-RAG teaches the LLM to decide: (1) whether retrieval is needed, (2) whether retrieved docs are relevant, and (3) whether the generated answer is grounded. The model outputs special reflection tokens during generation.

❓ Query arrives (analyze intent) → 🤔 Need retrieval? ([Retrieve] or [No Retrieve]) → 📄 If yes: retrieve (get docs) → ✅ Is it relevant? ([ISREL] token) → 📝 Generate + check ([ISSUP]: grounded?)

When to Use Self-RAG

Self-RAG requires fine-tuning a model to emit reflection tokens. It is most valuable when queries are mixed (some need retrieval, some don't) and when you need per-statement grounding verification. For most applications, CRAG (no fine-tuning needed) provides roughly 80% of the benefit.

GraphRAG — Knowledge Graphs Meet Retrieval In-depth

Standard RAG retrieves individual chunks. GraphRAG first builds a knowledge graph from documents (entities and relationships), then traverses the graph during retrieval. This enables multi-hop reasoning across documents.

Standard RAG

Query: "Who manages the team that built Project X?"

Retrieves chunks about Project X, but team and manager info is in different documents. Fails.

Each chunk is independent — no connections between them.

GraphRAG

Query: "Who manages the team that built Project X?"

Graph: Project X → built by → Team Alpha → managed by → Sarah. Succeeds via graph traversal.

Entities and relationships connect information across documents.
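The traversal in the Project X example can be illustrated with a toy graph held in a dictionary. This is only a sketch of the multi-hop idea: the triples are hand-written here, whereas a real pipeline would extract them with an LLM and store them in a graph database. The name `follow` is hypothetical:

```python
# Toy knowledge graph: (subject, relation) -> object. Hand-written for illustration.
GRAPH = {
    ("Project X", "built by"): "Team Alpha",
    ("Team Alpha", "managed by"): "Sarah",
}

def follow(graph, start, relations):
    """Walk from `start` along `relations` in order; return None if any hop is missing."""
    node = start
    for relation in relations:
        node = graph.get((node, relation))
        if node is None:
            return None
    return node

# "Who manages the team that built Project X?" becomes a two-hop path.
print(follow(GRAPH, "Project X", ["built by", "managed by"]))  # Sarah
```

Each hop may come from a different source document, which is exactly the connection that chunk-level retrieval cannot make.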

✅

GraphRAG shines when

Multi-hop questions common
Entity relationships matter
Global summarization needed
Data is highly connected

❌

GraphRAG overkill when

Queries are simple lookups
Documents are independent
Graph construction cost too high
Data changes too fast

🔧

Implementation

Microsoft GraphRAG library
LLM extracts entities + relations
Store in Neo4j / NetworkX
Community detection for summaries

Agentic RAG — LLM Controls the Retrieval In-depth

In standard RAG, the pipeline is fixed: retrieve → generate. In Agentic RAG, the LLM acts as an agent that decides what to retrieve, when, and how many times. It can reformulate queries, request more context, or search different sources.

🔁

Iterative Retrieval

Retrieve → analyze gaps → retrieve more → combine. The agent keeps searching until it has enough context.

FLARE: Forward-Looking Active REtrieval
Agent generates, detects uncertainty, retrieves more
2–5 retrieval rounds typical

🧩

Query Decomposition

Complex question → break into sub-questions → retrieve for each → combine answers.

"Compare pricing of X vs Y" → two separate retrievals
LLM decomposes, retrieves, synthesizes
Better for multi-part questions

Agentic Complexity

Agentic RAG is powerful but adds latency (2–10× more LLM calls), cost, and unpredictability. The agent might loop, over-retrieve, or go off-track. Use it only when standard RAG demonstrably fails on your queries. Start with CRAG before going full agentic.
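A bounded retrieval loop is one way to contain the looping and over-retrieval risks. In the sketch below, `retrieve` and `find_gap` are hypothetical callables (in practice `find_gap` would be an LLM call that proposes the next search query or returns `None` when the context is sufficient). The round cap and the repeated-query check are the guards:

```python
def iterative_retrieve(query, retrieve, find_gap, max_rounds=5):
    """Retrieve repeatedly until no gap remains, a query repeats, or the budget is spent."""
    context, seen_queries = [], set()
    current = query
    for _ in range(max_rounds):
        if current in seen_queries:
            break  # the agent is repeating itself: stop instead of looping
        seen_queries.add(current)
        for chunk in retrieve(current):
            if chunk not in context:
                context.append(chunk)  # keep order, drop duplicates
        current = find_gap(query, context)  # next search query, or None when satisfied
        if current is None:
            break
    return context, len(seen_queries)
```

Returning the number of rounds alongside the context makes it straightforward to log how often the agent hits the cap.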

Advanced Pattern Comparison Core

Pattern	Complexity	Latency	Best For	Requires
Standard RAG	Low	200–500ms	80% of use cases	Chapters 1–7
CRAG	Medium	500–1000ms	Unreliable retrieval	Relevance scorer
Self-RAG	High	500–1500ms	Mixed query types	Fine-tuned model
GraphRAG	High	1–3s	Multi-hop, connected data	Graph DB, extraction pipeline
Agentic RAG	Very High	2–10s	Complex multi-step queries	Agent framework, tool definitions

The Pragmatic Path

Start with standard RAG (Ch 1–7) + good evaluation (Ch 8). Measure where it fails. If retrieval quality is the bottleneck, add CRAG. If multi-hop queries fail, consider GraphRAG. If complex reasoning fails, consider agentic. Each pattern adds complexity — only add it when you've measured the need.

∑ Chapter 09 — Key Takeaways

CRAG evaluates retrieval quality and falls back to web search or refined queries when results are poor
Self-RAG teaches the LLM to decide when to retrieve and whether results are grounded (requires fine-tuning)
GraphRAG builds a knowledge graph for multi-hop reasoning across connected documents
Agentic RAG lets the LLM control retrieval — iterative search, query decomposition, multi-source
Standard RAG covers 80% of use cases — only add advanced patterns when evaluation shows specific failures
Each pattern adds complexity, latency, and cost — measure the tradeoff before committing

Chapter 10 · Production Systems

Production Systems — Deployment, Monitoring, and Optimization

You've built a RAG system that works. Now ship it. Production RAG isn't about making the retrieval 1% better — it's about keeping it working reliably at scale, managing costs, and responding to drift. This chapter covers the engineering that keeps RAG systems alive.

Production Architecture — The Full Stack Core

Production RAG architecture — all the pieces

Caching — Reducing Cost and Latency Core

Many RAG queries are repetitive or semantically similar. Caching avoids re-running expensive retrieval and LLM calls for questions you've already answered.

🔑

Exact Cache

Hash the query string, cache the full response.

Hit rate: 5–15% typical
Simple: Redis key-value
TTL: 1–24 hours

🧠

Semantic Cache

Embed the query, find similar cached queries by vector distance.

Hit rate: 15–40% typical
"refund policy" ≈ "return policy"
Threshold: cosine > 0.95

📦

Retrieval Cache

Cache retrieval results only, still run LLM. Saves retrieval latency + vector DB cost.

Useful when prompts change often
TTL: 1–6 hours
Invalidate on index update
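A semantic cache can be sketched in a few lines. This version is in-memory with a linear scan and no TTL, purely for illustration; the `embed` callable is an assumption (any function mapping text to a vector), and a real deployment would use a vector index and expire entries:

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

class SemanticCache:
    """Return a cached response when a previous query is similar enough."""

    def __init__(self, embed, threshold=0.95):
        self.embed = embed          # hypothetical: text -> list of floats
        self.threshold = threshold  # cosine similarity required for a hit
        self.entries = []           # list of (embedding, response)

    def get(self, query):
        vector = self.embed(query)
        best_score, best_response = 0.0, None
        for cached_vector, response in self.entries:
            score = cosine(vector, cached_vector)
            if score > best_score:
                best_score, best_response = score, response
        return best_response if best_score > self.threshold else None

    def put(self, query, response):
        self.entries.append((self.embed(query), response))
```

The threshold is the main tuning knob: set it too low and users get answers to a different question, set it too high and the hit rate falls back toward that of an exact cache.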

Cost Optimization — Making RAG Affordable Core

Cost Component	Typical %	Optimization	Savings
LLM generation	60–70%	Fewer chunks in context, smaller model for simple queries, caching	30–60%
Embedding API	10–15%	Cache embeddings, batch calls, lower dimensions	50–80%
Vector DB	10–15%	Reduce dimensions, quantization, tiered storage	30–60%
Re-ranker	5–10%	Cache re-rank results, reduce candidate count	20–40%

The Model Routing Pattern

Not every query needs GPT-4o. Route simple factual queries to a smaller/cheaper model (GPT-4o-mini, Claude Haiku) and complex reasoning queries to the best model. A simple classifier can save 40–60% on LLM costs by routing 70% of queries to the cheap model.
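A rough sketch of the routing idea, assuming a keyword rule in place of a trained classifier. The cue words, the model labels, and the cost model are all hypothetical; `routing_savings` assumes both query groups use similar token counts:

```python
# Hypothetical cue words; a trained classifier would replace this rule in practice.
COMPLEX_CUES = ("compare", "why", "explain", "trade-off", "tradeoff", "difference", "analyze")

def route_model(query: str, cheap: str = "small-model", strong: str = "large-model",
                max_simple_words: int = 20) -> str:
    """Send long or reasoning-style queries to the strong model, the rest to the cheap one."""
    text = query.lower()
    if len(text.split()) > max_simple_words:
        return strong
    if any(cue in text for cue in COMPLEX_CUES):
        return strong
    return cheap

def routing_savings(share_to_cheap: float, cheap_cost_ratio: float) -> float:
    """Fraction of LLM spend saved versus sending every query to the strong model."""
    return share_to_cheap * (1.0 - cheap_cost_ratio)

print(route_model("What is the refund window?"))
print(route_model("Compare the Pro and Enterprise plans"))
```

Under this cost model, routing 70% of queries to a model that costs 20% as much per query saves about 56%, and a model at 40% of the cost saves about 42%, which is consistent with the 40–60% range quoted above.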

Latency Optimization — Making RAG Fast In-depth

Stage	Typical Latency	Optimization	Target
Embedding query	20–50ms	Local model for embedding, batch	<50ms
Vector search	5–20ms	HNSW tuning, pre-filter, warm cache	<20ms
Re-ranking	50–200ms	Fewer candidates (20 not 50), ColBERT	<100ms
LLM generation	500–3000ms	Streaming, shorter context, faster model	<2000ms
Total (no cache)	800–3500ms	Parallelize independent steps (e.g. dense + keyword retrieval)	<2500ms
Total (cache hit)	10–50ms	Semantic cache	<50ms

Stream Everything

Use streaming responses — start showing the LLM's answer token-by-token while it's still generating. Perceived latency drops from 2s to <500ms (time to first token). Every production RAG system should stream.

Keeping the Index Fresh — Data Sync Pipelines Core

⏰

Scheduled Re-index

Cron job: re-index all docs every N hours. Simple but inefficient for large corpora.

Best for: <10K docs
Frequency: hourly to daily
Pro: Simple to implement

🔄

Incremental Sync

Track document hashes. Only re-embed changed/new/deleted docs. 10–100× faster than full re-index.

Best for: 10K–1M docs
Frequency: real-time to hourly
Pro: Efficient, low cost

📡

Event-driven

Webhook on doc change triggers re-indexing. Near-real-time freshness.

Best for: critical freshness needs
Frequency: real-time
Pro: Immediate, targeted

The Deletion Problem

When a document is deleted from the source, its chunks remain in the vector DB until explicitly removed. Users get answers from documents that no longer exist. Always track document IDs in your vector DB and delete chunks when source docs are removed.
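Hash-based incremental sync and deletion tracking can share one diff step. The sketch below assumes the source is available as a mapping of document ID to text and that the index stores one content hash per document ID; `plan_sync` is a hypothetical name, and the actual re-embedding and vector DB deletes are left to the caller:

```python
import hashlib

def content_hash(text: str) -> str:
    """Stable fingerprint of a document's content."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def plan_sync(source_docs: dict, indexed_hashes: dict):
    """Diff source docs (id -> text) against stored hashes (id -> hash).

    Returns (ids to re-embed and upsert, ids whose chunks must be deleted).
    """
    to_upsert = [doc_id for doc_id, text in source_docs.items()
                 if indexed_hashes.get(doc_id) != content_hash(text)]
    to_delete = [doc_id for doc_id in indexed_hashes if doc_id not in source_docs]
    return sorted(to_upsert), sorted(to_delete)

source = {"a": "hello", "b": "changed text", "c": "brand new"}
indexed = {"a": content_hash("hello"), "b": content_hash("old text"), "d": content_hash("gone")}
print(plan_sync(source, indexed))  # (['b', 'c'], ['d'])
```

Unchanged documents such as "a" are skipped entirely, which is where the speedup over a full re-index comes from. The deleted document "d" shows up in the second list so its chunks can be removed from the vector DB.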

Production Launch Checklist Core

Category	Checklist Item	Status
Data	All source documents ingested and validated	☐
Data	Incremental sync pipeline running	☐
Data	Stale document cleanup (TTL or deletion sync)	☐
Quality	Golden test set (50+ queries) with passing scores	☐
Quality	Eval runs in CI — blocks deploy on regression	☐
Quality	Faithfulness >0.90, Recall@5 >0.85 on eval set	☐
Performance	p99 latency <3s (or streaming TTFT <500ms)	☐
Performance	Semantic cache deployed, hit rate monitored	☐
Cost	Per-query cost calculated and budgeted	☐
Cost	Model routing for simple vs complex queries	☐
Observability	All queries/chunks/scores logged	☐
Observability	Alerts on quality degradation, error spikes	☐
Security	Access control on retrieval (user can only see their docs)	☐
Security	PII handling in logs and cache	☐
Fallback	"I don't know" with helpful fallback (human handoff, search link)	☐

∑ Chapter 10 — Key Takeaways

Production RAG = offline pipeline (data) + online pipeline (query) + observability + evaluation
Caching (exact + semantic) can save 30–60% of costs and reduce latency to <50ms for repeated queries
LLM generation is 60–70% of cost — optimize with fewer chunks, model routing, and caching
Stream responses to cut perceived latency from 2s to <500ms time-to-first-token
Keep the index fresh: incremental sync + deletion tracking — stale data destroys trust
Use the production checklist: data quality, evaluation gates, performance, cost, observability, security
RAG systems drift — continuous evaluation and monitoring are not optional, they're the product
