Context
Engineering
Building and optimizing context windows for LLM applications β construction, compression, windowing strategies, and production patterns.
Context is the bottleneck. How you construct, compress, and optimize context determines model quality, latency, and cost. This guide teaches the full spectrum of context engineering β from selection to compression to production caching.
The context window is not a container β it's a weighted attention field. What you put in it, where you put it, and how much of it is noise directly determines what the model can and cannot do with your input.
The model does not know what is relevant, what is correct, or what you intended. It only sees tokens. This single constraint has deep engineering implications:
The model cannot distinguish signal from noise. If irrelevant content is in the context, the model will attend to it. Relevance is your responsibility, not the model's.
Token position influences attention weight. The model does not magically find the most important chunk β it follows position bias. Design your context layout deliberately.
Every irrelevant token competes with relevant ones for attention. More noise = more diluted signal. A smaller, high-quality context consistently outperforms a large, noisy one.
Better context beats a better model β in most real-world systems. Switching from GPT-4o-mini to GPT-4o gives you a marginal improvement. Fixing your context construction can give you a 2β4Γ improvement on the same model. Always optimize context before upgrading the model.
Cost scales directly with token count β and the context window is almost always the largest component. 5 large chunks β ~10K input tokens β 5Γ the cost of a 2K-token context. Every optimization that reduces context size compounds across every request: lower cost, lower latency, less noise, better answers.
Every LLM receives a flat sequence of tokens β system prompt, conversation history, retrieved documents, tool results, and user input are all concatenated into a single integer array. The model attends across all of it simultaneously. There is no "memory" separate from this β the context window is the model's entire working memory for a given call.
| Model | Max Context | Typical Input Budget | Notes |
|---|---|---|---|
| GPT-4o | 128K tokens | ~120K usable | Good long-context performance up to ~64K |
| Claude 3.5 Sonnet | 200K tokens | ~190K usable | Strong at long documents |
| Gemini 1.5 Pro | 1M tokens | ~900K usable | Best for very long docs; some quality degradation at extremes |
| Llama 3.1 8B | 128K tokens | ~32K effective | Quality degrades significantly beyond 32K |
| Mistral 7B | 32K tokens | ~24K usable | Standard for local deployment |
A model may support 128K tokens but only reliably use 32Kβ64K. Beyond that, recall drops β especially for information placed in the middle. Always test your specific use case at your expected context lengths. Advertised != effective.
The 2023 paper "Lost in the Middle" demonstrated experimentally what practitioners already suspected: LLMs pay disproportionate attention to the beginning and end of the context window. Information placed in the middle is recalled less reliably β even when it's clearly the most relevant.
- Attention mechanisms naturally focus on nearby and very early tokens
- Position embeddings create an implicit primacy/recency bias
- Training data patterns reinforce beginning/end attention
- Longer contexts amplify the effect β more middle to get lost in
- Primacy: Put most critical context at the very start
- Recency: Move high-priority info near the query (end)
- Chunking: Shorter context windows reduce middle depth
- Repetition: Repeat key facts at start and end
- Explicit refs: "Based on document 1 above..." anchors attention
Place your most important instruction or the most relevant retrieved chunk either at the very beginning of the context or immediately before the user's question. Never bury critical facts in the middle of a long document list.
Every production LLM call has a token budget. Blow it and you get truncation errors, silent degradation, or hard failures. Managing this budget is a core engineering discipline.
| Budget Component | Typical Range | Controllable? | Notes |
|---|---|---|---|
| System prompt | 200β2,000 | Yes β compress it | Most system prompts can be halved with careful editing |
| Retrieved context | 1,000β50,000 | Yes β main lever | Chunk count Γ chunk size β your primary engineering surface |
| Conversation history | 0β32,000 | Partially β truncate | Grows unbounded in long chats; must be managed explicitly |
| User query | 10β500 | Not really | User controls this; guard against prompt injection stuffing |
| Max output | 256β8,000 | Yes β set it low | Reserve less when short answers expected; more for generation tasks |
Encoded in weights during training. Available without any prompt input.
Examples: What is Python? Who wrote Hamlet? What is a transformer?
Limits: Cutoff date, hallucination risk, no private data
Access: Always available, zero tokens
Provided at inference time via the context window. Overrides and extends parametric knowledge.
Examples: Your product docs, today's news, user's account data
Limits: Token budget, retrieval quality, position bias
Access: Costs tokens, requires retrieval or explicit injection
For facts the model already knows well (general concepts, public knowledge), don't waste context tokens restating them. Reserve your context budget for what the model cannot know: private data, recent events, user-specific information, and domain specifics it may hallucinate without grounding.
∑ Chapter 01 — Key Takeaways
- The context window is a flat token sequence β system prompt, retrieved docs, history, and user query all compete for the same budget
- Lost-in-the-middle: Recall drops for information placed in the center β put critical content at the start or immediately before the query
- Advertised β effective context length β always test your model at your target context sizes
- Token budget = model max β system β query β max_output β safety margin β retrieved context is your primary engineering lever
- Don't waste context on parametric knowledge the model already has β reserve it for private, recent, or user-specific information
Retrieval gives you candidates. Context construction turns candidates into a prompt the model can actually use. Selection, ordering, formatting, and citation anchoring are each distinct engineering problems.
Most teams treat retrieval and context construction as the same problem. They are not. Retrieval returns candidates. Context construction decides what to include, what to exclude, how to format it, and where to place it in the window.
System A retrieves the same 8 chunks as System B. It dumps them in retrieval order, unformatted, with no deduplication and no token budget. The model receives 6K tokens of noisy context.
System B re-ranks the same 8 chunks, drops 3 as irrelevant, deduplicates one, formats each with a source label, and places the most relevant first. The model receives 2K tokens of clean, ordered context.
System B consistently outperforms System A despite identical retrieval β purely from construction quality.
Production-grade systems do not use a fixed context template for all queries. They rewrite queries before retrieval, adapt the number of chunks based on query complexity, vary context size based on task type, and apply different construction strategies per user or workflow. A static pipeline that serves every query identically will underperform a dynamic one that adapts to the request.
Each stage affects quality independently. Teams that only focus on retrieval and ignore construction leave significant quality on the table.
Retrieved context always exceeds your budget. Selection decides what makes the cut.
Take the top N results by relevance score, regardless of token count.
- Simple, easy to reason about
- Problem: Long chunks waste budget; short chunks waste retrieval
- Use when: Chunks are uniform size
Add chunks in relevance order until token budget is exhausted.
- Efficient β always fills the budget exactly
- Problem: A large irrelevant chunk wastes the budget
- Use when: Maximizing context density matters
Only include chunks whose relevance score exceeds a minimum threshold.
- Quality control β excludes low-relevance noise
- Problem: May return empty context if nothing passes
- Use when: False positives are costly
Ordering is the direct answer to the lost-in-the-middle problem. Where you place retrieved chunks determines how well the model can use them.
| Strategy | Ordering | Best For | Risk |
|---|---|---|---|
| Relevance Descending | Most relevant first | Default β leverages primacy bias | Lowest-relevance chunks still in middle |
| Sandwich (U-shape) | Best β middle chunks β best | Long context, multiple equally relevant | Duplication of top chunk; slightly larger prompt |
| Reverse Relevance | Least relevant first, best last | When recency bias stronger than primacy | Model may anchor on weak context early |
| Temporal | Chronological order | Conversation history, time-sensitive docs | Most relevant may not be most recent |
| Hierarchical | Summary β detail chunks | Long documents with overview + details | Requires pre-computed summaries |
For 4+ retrieved chunks, use the sandwich ordering: most relevant chunk first, least relevant in the middle, second-most relevant last β immediately before the user's query. This exploits both primacy and recency bias simultaneously.
Formatting determines how clearly the model can distinguish between context chunks and how reliably it can cite them. Poor formatting causes models to conflate sources, miss boundaries, or fail to cite.
- No source labels β can't cite
- No chunk boundaries β model conflates
- No document IDs β can't reference
- Clear boundaries β model knows edges
- Source labels β enables citation
- IDs β "According to doc 1..."
Whatever format you choose in development, use exactly the same format in production. If you use XML tags, always use XML tags. If you use numbered lists, always use numbered. Models learn context patterns from your prompt β inconsistency causes unpredictable citation behavior.
Citation anchoring is the practice of instructing the model to explicitly reference which part of the context it used for each claim. It reduces hallucination, improves verifiability, and allows downstream validation.
∑ Chapter 02 — Key Takeaways
- Context construction is a pipeline: retrieve β re-rank β select β order β format β inject β each step affects quality
- Token-budget-aware selection: filter by score threshold, then fill budget greedily by relevance
- Ordering rule: Most relevant first (primacy); consider sandwich for 4+ chunks (primacy + recency)
- Use structured formatting (XML tags or numbered labels) β enables citation and prevents source conflation
- Citation anchoring in the system prompt: "cite using [doc id]" dramatically reduces hallucination and enables verification
Most production LLM quality issues trace back to context construction failures β not model limitations. Know these patterns so you can instrument for them.
The answer exists in your knowledge base but didn't make the top-k. Cause: embedding mismatch, wrong k, or poor chunking at ingestion.
Signal: user says "that info is in your docs" β you check and it is.
Retrieval returns chunks that share keywords with the query but don't answer it. The model attends to noise and produces a confused or blended answer.
Fix: raise relevance threshold; add re-ranking step.
Critical chunk placed at position 4β6 in a 8-chunk context. Lost-in-the-middle effect causes the model to underweight it or miss it entirely.
Fix: sandwich ordering β most important first or last.
Two chunks contradict each other (e.g., different policy versions). The model picks one without noting the conflict β often the wrong one (older, lower-quality, or earlier in context).
Fix: explicit conflict detection instruction + timestamp metadata.
Chunks injected as raw text with no delimiters or source labels. The model cannot distinguish where one document ends and another begins.
Fix: structured XML tags or numbered doc labels on every chunk.
The right chunk is in the context, but the model ignores it and generates from parametric memory anyway. Common when context is noisy, too long, or the relevant fact is in the middle.
Fix: reduce noise, move key chunk to start, add "only use provided sources" instruction.
Retrieval failures are easy to detect β the right chunk simply isn't there. Behavioral failures are harder: the right chunk is present, but the model still mixes sources, ignores the chunk, or hallucinates. These require evaluating both the context content (retrieval) and the model's grounding behaviour (faithfulness). Instrument both separately.
Context compression is not about making prompts shorter. It's about preserving maximum information density while spending fewer tokens. The goal: same answers, lower cost, lower latency, less position bias.
The instinct is to include more context β more docs, more history, more detail β to give the model "everything it needs." This is wrong. Beyond a quality threshold, adding more context actively degrades performance.
Every irrelevant token is noise. The model cannot filter noise itself β it attends to everything. More irrelevant content means more wrong attention patterns.
Attention is finite. Adding 10 chunks instead of 3 spreads the model's "focus" thinner. The relevant chunks get less effective attention weight.
Every additional chunk pushes other chunks further from the ends of the context window. A 10-chunk context is worse than a 4-chunk context if 6 chunks are low-relevance.
A small high-signal context consistently outperforms a large noisy context. Set relevance thresholds and enforce chunk limits. If you can answer the query with 3 chunks, don't include 8. The goal of compression is not smaller prompts β it's higher signal density per token.
Input tokens cost money. A 50% compression = 50% cost reduction on input tokens β meaningful at scale.
- GPT-4o: $2.50/1M input tokens
- 1M queries @ 10K tokens each = $25K
- 50% compression β $12.5K saved
Prefill time scales linearly with context length. Shorter context = faster TTFT (time-to-first-token).
- 10K tokens β 100β500ms prefill
- 50K tokens β 500β3000ms prefill
- Compression reduces this directly
Less context = less noise, less lost-in-the-middle, more focused attention on relevant parts.
- Removes off-topic sentences
- Reduces position bias effects
- Cleaner signal for the model
| Technique | How It Works | Compression Ratio | Quality Loss | Best For |
|---|---|---|---|---|
| Sentence Filtering | Remove sentences with low relevance score to query | 30β60% | Low (preserves exact text) | Long documents with mixed relevance |
| Extractive Summarization | Select and concatenate most relevant sentences | 40β70% | Low (exact sentences) | Articles, reports, documentation |
| Abstractive Summarization | LLM rewrites chunk in fewer tokens | 50β80% | Medium (may lose nuance) | Dense technical text, tables |
| Entity Extraction | Extract only key facts as structured snippets | 60β90% | High for open-ended; low for structured tasks | Structured data extraction tasks |
| LLMLingua / Selective Removal | Token-level perplexity scoring removes low-info tokens | 50β70% | Low (preserves key tokens) | General compression, RAG pipelines |
Sentence filtering is the highest-fidelity compression method: score each sentence against the query, keep only sentences above a relevance threshold. Lossless for the kept sentences; zero hallucination risk since no text is generated.
Abstractive compression uses an LLM (typically a small, cheap one) to rewrite chunks into denser summaries targeted at a specific query. It can achieve the highest compression ratios but introduces a quality dependency: the compressor must not lose or distort key facts.
β Long documents with redundant prose
β Dense technical text that can be paraphrased
β You need 60%+ compression
β Small fast model available as compressor
β Legal, medical, financial text where exact wording matters
β Code β summarization destroys syntax
β Numerical data β risk of transcription errors
β When compressor latency exceeds savings benefit
Abstractive compression requires an extra LLM call per chunk. At 5 chunks Γ 200ms per call = 1 second added to your pipeline. Use a small, fast model (Llama 3.1 8B, GPT-4o-mini, Claude Haiku) as the compressor β the main LLM's quality improvement must justify the added latency.
LLMLingua (Microsoft Research, 2023) is a compression method that uses a small LM to compute token-level perplexity. Tokens that are predictable (low perplexity) carry little information and can be dropped. The remaining tokens form a compressed β but still parseable β prompt.
- Works on any text β no LLM generation step
- Deterministic compression ratio control
- Preserves semantic meaning at 50% compression
- Fast β small local model for scoring
- Open source:
llmlinguapip package
- Compressed text looks garbled to humans
- Model-dependent: works best for models similar to compressor
- Sensitive structure (JSON, code, tables) may break
- Requires additional local model dependency
| Situation | Recommended Strategy | Reason |
|---|---|---|
| Legal / medical documents | Sentence filtering only | Exact wording matters β no LLM rewriting |
| General knowledge articles | Abstractive or LLMLingua | Prose is compressible without loss |
| Code snippets | Minimal or none | Code structure is brittle β compression may break syntax |
| Long-form reports (20+ pages) | Hierarchical: abstract summary + sentence filtering of sections | Need overview + relevant detail |
| Latency-sensitive pipeline | Sentence filtering (local embeddings) | No extra LLM call β milliseconds, not seconds |
| High-volume, cost-critical | LLMLingua (offline pre-compression) | Compress once, store compressed; pay cost once |
The most cost-effective compression happens at document ingestion, not at query time. Pre-compute summaries and compressed representations when documents are indexed. At query time, retrieve the pre-compressed version β zero extra latency, zero extra cost per query.
∑ Chapter 03 — Key Takeaways
- Compression delivers three wins: lower cost, lower latency, better quality (less noise = more focused attention)
- Sentence filtering: score sentences vs query with embeddings, drop below threshold β high fidelity, no hallucination risk
- Abstractive compression: use a fast LLM to rewrite chunks β high ratio but adds latency; avoid for precise text
- LLMLingua: token-level perplexity-based dropping β deterministic, works on arbitrary text, open source
- Pre-compress at ingestion β not at query time. Pay once; save cost on every query
- Never compress code, numerical tables, or legal text with abstractive methods β use extractive or no compression
Compression reduces tokens, cost, and latency β but it trades against information fidelity. Applied incorrectly, it loses the details your system depends on.
| Compression Risk | How It Manifests | Mitigation |
|---|---|---|
| Critical detail loss | A numeric threshold, date, or constraint is dropped because it scored low in isolation β but it was essential | Always evaluate compressed vs original output on a test set; flag numeric patterns for retention |
| Meaning alteration | Abstractive compression changes a negation, qualifier, or conditional β "not required" becomes "required" | Avoid abstractive compression for policy, legal, or safety content; use extractive only |
| Introduced bias | LLM compressor summarizes multiple viewpoints as one, losing nuance or introducing the model's own bias | Sample-evaluate compressed summaries; A/B test faithfulness scores |
| Over-compression | High compression ratio leaves too few tokens β the answer is no longer reconstructible from the compressed context | Set minimum token floor per chunk; test at boundary compression ratios |
Treat compression as a pipeline stage with its own regression tests. Every time you change compression logic, run your faithfulness eval on a held-out set and verify the score doesn't drop. Use compression selectively β apply it to narrative prose, not to structured data, code, or contractual language.
When the source material is longer than your context window, windowing decides which part of the document the model sees β and when. The wrong windowing strategy silently discards the answer before the model even runs.
A 200-page contract is ~150,000 tokens. A typical RAG chunk budget is 8,000β32,000 tokens. Even a "long context" model with 128K capacity can only fit ~85 pages. Windowing is how you navigate this mismatch for both indexing-time chunking and inference-time context assembly.
How you split documents into chunks when building the index. Determines retrieval granularity and the atomic unit of context.
Key question: How big should each chunk be?
Too large β diluted relevance scores, wasteful tokens
Too small β loses context needed to answer, more chunks to rank
How you manage context when a conversation grows or a task requires iterating through a long document.
Key question: What do I drop when the window fills?
Drop old turns β lose conversation coherence
Drop retrieved context β lose grounding
Fixed windows split documents into equal-size chunks at fixed token boundaries. Simple to implement, but mid-sentence and mid-paragraph cuts destroy semantic coherence.
The key fact is split across two chunks. Either chunk alone will produce an incomplete or wrong answer.
- Uniform prose with no sentence-crossing critical facts
- Large chunks (1,000+ tokens) where mid-cut rarely matters
- Pre-processing step before applying semantic splitting
- Code files split at function boundaries (not token count)
Minimum mitigation: add 10β20% overlap between adjacent chunks
Sliding windows add overlap between adjacent chunks. Each chunk shares N tokens with the previous chunk, ensuring that information at boundaries exists in at least one complete chunk.
| Chunk Size | Overlap | Use Case | Trade-off |
|---|---|---|---|
| 128β256 tokens | 20β40 tokens | FAQ, structured data, precise fact retrieval | High precision, low recall, more chunks to rank |
| 512 tokens | 50β100 tokens | General purpose RAG (most common) | Good balance β default starting point |
| 1,024β2,048 tokens | 100β200 tokens | Technical docs, legal text, reasoning tasks | Better context per chunk; harder to rank precisely |
| 4,096+ tokens | 400β500 tokens | Long-form analysis, whole-section retrieval | Context-rich but relevance score diluted |
Semantic chunking splits documents at natural topic boundaries rather than fixed token counts. When embedding similarity drops sharply between adjacent sentences, that's a topic transition β a good split point.
- Each chunk covers one coherent topic
- Retrieval scores are more meaningful (one topic per chunk)
- Fewer cross-boundary answer splits
- Works especially well on structured docs (reports, articles)
- Requires embedding every sentence β expensive at ingestion
- Variable chunk sizes complicate token budget planning
- Poor results on dense technical text with no clear topic shifts
- Needs a good sentence-level embedding model
Hierarchical windowing indexes small chunks for precise retrieval but retrieves larger parent chunks for fuller context. You get precision from small-chunk search and coherence from large-chunk content.
Small chunks win retrieval battles β they're precise, focused, and score high. But small chunks lose context battles β they're too short to fully answer. Parent-child solves both: retrieve with child precision, answer with parent context.
In multi-turn conversations, history grows unbounded. At some point, it exceeds the context window. Inference-time windowing decides what to keep and what to drop.
| Strategy | What Is Dropped | Pros | Cons |
|---|---|---|---|
| FIFO Truncation | Oldest messages first | Simple, no extra processing | Loses early context (system setup, key decisions) |
| Pinned + FIFO | Oldest non-pinned messages | Preserves system prompt + key anchors | Requires explicit pinning logic |
| Summary Compression | Older turns compressed to summary | No information loss per se; coherent history | Adds LLM call; summary may lose nuance |
| Semantic Retrieval | Low-relevance historical turns | Keeps only what's relevant to current query | Complex; requires embedding history |
Silent truncation is the most dangerous pattern β the model gets half a conversation with no indication that context was removed, leading to confused or contradictory responses. Always inject an explicit indicator when history is compressed: [Earlier conversation summarized: user asked about X, decided Y].
∑ Chapter 04 — Key Takeaways
- Fixed windows are simple but break sentences β always add 10β20% overlap at minimum
- Sliding windows with stride = chunk_size β overlap ensure boundary facts exist in at least one complete chunk
- Chunk size rule of thumb: 512 tokens with 64-token overlap is the default starting point; tune per domain
- Semantic chunking splits at topic boundaries β better precision but expensive to compute at ingestion
- Hierarchical windowing (parentβchild): index small chunks for retrieval, inject large parent chunks for context β the best of both worlds
- Inference-time: never silently truncate; use pinned-FIFO or summary compression; always tell the model when history was dropped
More context is not always better. Every low-signal token you add is a high-signal token the model attends to less. Information density engineering is about ensuring every token in your context window earns its place.
Transformer attention is not uniform β the model allocates attention across all tokens, but that allocation is competitive. When your context contains 20% useful signal and 80% boilerplate, the model must "work harder" to attend to the right parts. Dense context = better answers at lower token cost.
- Legal boilerplate and disclaimers
- Document headers, footers, page numbers
- Repeated information across chunks
- Off-topic paragraphs in retrieved docs
- Verbose explanations of obvious concepts
- Specific facts: numbers, dates, names
- Decision criteria and rules
- Definitions unique to your domain
- Step-by-step procedures
- Constraint lists and edge cases
- Maximize relevant tokens per chunk
- Remove formatting artifacts
- Deduplicate near-identical content
- Prefer structured representations
- Strip navigation, menus, ads, footers
| Noise Type | Example | Impact | Fix |
|---|---|---|---|
| Structural Artifacts | Page numbers, TOC entries, nav menus | Clutters chunks with zero-signal text | Strip during document pre-processing |
| Legal Boilerplate | "This document is confidential and intended only for..." | Wastes 100β500 tokens per document | Blacklist common boilerplate patterns |
| Duplicate Content | Same paragraph repeated in overview + detail section | Dilutes attention; inflates token count | Dedup at ingestion via embedding similarity |
| Irrelevant Sidebars | Related articles, footnotes, bibliography | Off-topic context confuses the model | Semantic filtering per section type |
| Verbose Prose | "It is worth noting that, in the context of..." (β "Note:") | 2β5Γ token waste on filler | Abstractive compression targeting filler phrases |
| Format Overhead | HTML tags, markdown escape sequences | Raw HTML uses 30β50% extra tokens vs plain text | Strip HTML; convert markdown to plain text |
Before tuning, measure. Two practical density metrics: relevance density (what fraction of the context is relevant to the query) and entity density (how many unique named entities per 100 tokens).
When the same information can be represented as prose or as a structured list/table, the structured version is almost always higher density β more facts per token, easier for the model to parse.
When indexing documents that contain policy lists, specifications, or tabular data, pre-convert prose to structured format during ingestion and store the structured version in your vector database. The embedding quality stays the same; the token density improves dramatically.
Research and production experience consistently show a non-monotonic relationship between context length and quality. Adding more context helps up to a point β then it starts to hurt.
Doubling your retrieved chunk count from 5 to 10 doesn't improve answers if the extra 5 chunks are marginally relevant. They add tokens, increase latency, and introduce noise. Default to fewer, higher-quality chunks. Start with 3β5; only add more if eval shows consistent improvement.
∑ Chapter 05 — Key Takeaways
- Attention is competitive β every low-signal token dilutes attention on high-signal tokens
- Main noise sources: structural artifacts, boilerplate, duplicates, verbose prose, HTML tags β strip at ingestion time
- Measure density: relevance density > 0.6 (fraction of sentences relevant to query) and entity density > 3.0 (named entities per 100 tokens)
- Structured representations beat prose β same information in bullet/table form uses 30β50% fewer tokens
- The threshold effect is real β quality peaks around 4Kβ8K tokens for most tasks; more context beyond that introduces noise
- Default to 3β5 high-quality chunks, not 10β20 mediocre ones
Long context models change what's possible β but not how you should think. A 1M token window doesn't eliminate the need for context engineering; it shifts the constraints. Cost, latency, and position bias all scale with context length.
| Model | Context Window | Effective Range | Input Cost | Best Use |
|---|---|---|---|---|
| GPT-4o | 128K tokens | ~32Kβ64K real reliability | $2.50/1M tokens | Standard production, API access |
| Claude 3.5 Sonnet | 200K tokens | Strong to ~128K | $3.00/1M tokens | Long legal/research docs |
| Claude 3 Opus | 200K tokens | Strong to ~150K | $15.00/1M tokens | Complex multi-doc analysis |
| Gemini 1.5 Pro | 1M tokens | Strong at 100Kβ500K | $1.25/1M (β€128K) / $2.50 (>128K) | Whole codebase, book-length docs |
| Gemini 1.5 Flash | 1M tokens | Good to ~200K, degrades beyond | $0.075/1M (β€128K) | High-volume, cost-sensitive long context |
| Llama 3.1 70B | 128K tokens | ~16Kβ32K effective for open models | Self-hosted | Private deployment, data sovereignty |
Understanding why models degrade at long context requires understanding position embeddings β how models encode token position in the sequence.
Each position has a fixed learned embedding. Max positions fixed at training β cannot generalize beyond training length.
- GPT-2 style; largely deprecated
- Hard stop at training max length
Relative positional encoding via rotation. Can extend beyond training length via RoPE scaling (YaRN, NTK-aware). Used by Llama, Mistral, Qwen.
- Extensible via fine-tuning or scaling
- Quality degrades gracefully beyond training range
Penalizes attention scores by distance, no explicit position embedding. Generalizes to arbitrary length without retraining.
- Used in MPT; robust extrapolation
- Some quality loss vs RoPE at long range
For models using RoPE (Llama, Mistral): stay within the fine-tuned context range. Beyond it, quality degrades unpredictably. For API models (GPT-4o, Claude, Gemini): the provider has already handled extension β but test your specific task at your target length before committing to production.
- The answer requires synthesizing across an entire document
- You can't predict which sections will be relevant (reduces retrieval risk)
- Document structure matters (cross-references, section dependencies)
- Few-shot examples are so large they exceed normal context
- Whole codebase analysis, full book Q&A, complete contract review
- RAG would achieve the same quality at 10Γ lower cost
- Only a small section of the doc is ever relevant
- You're answering many queries (cost scales with every call)
- You need low latency β 100K+ tokens β 2β10s prefill
- Cheaper model + retrieval outperforms expensive long-context model
| Strategy | Context Size | Cost (GPT-4o) | TTFT | Best For |
|---|---|---|---|---|
| RAG (sparse retrieval) | 2Kβ8K tokens | $0.005β$0.02 / query | <200ms | High-volume, known-answer retrieval |
| RAG (dense retrieval) | 8Kβ32K tokens | $0.02β$0.08 / query | 200β800ms | Complex queries, multiple documents |
| Long context (64K) | 64K tokens | $0.16 / query | 1β3s | Full-doc analysis, infrequent queries |
| Long context (200K) | 200K tokens | $0.50 / query | 5β15s | One-time analysis, not production serving |
The most cost-effective production pattern combines both: use retrieval as a first pass to identify relevant sections, then inject those sections (plus surrounding context) into a long-context model for synthesis. RAG gives you precision; long context gives you coherence within the relevant section. Cost stays bounded; quality improves.
Before committing to a long-context model for production, run the "needle in a haystack" test: place a specific fact (the needle) at various positions in a large document (the haystack) and ask questions that require recalling it. This reveals where each model's attention actually degrades.
∑ Chapter 06 — Key Takeaways
- Gemini 1.5 Pro (1M) is the longest-context option; Claude 3.5 Sonnet (200K) has the best long-context quality/cost balance for most production use
- Advertised context β effective context β run needle-in-a-haystack tests at your target lengths before committing
- RoPE (Llama, Mistral) can be extended but degrades beyond training range; API models (GPT-4o, Claude, Gemini) have handled this internally
- Use long context when synthesis across the whole document is required; use RAG when only a subset is relevant
- Long context has a cost tax: 64K tokens = $0.16/query (GPT-4o), 200K = $0.50/query β unsustainable for high-volume use
- Hybrid pattern: RAG to identify relevant sections β long-context model to synthesize within those sections
Context caching is one of the highest-ROI optimizations in LLM engineering. If the same prefix appears in multiple requests, you pay to process it once and reuse the KV cache. Savings of 50β90% on input token costs are achievable for the right workloads.
When an LLM processes a prompt, it computes key-value (KV) pairs for every token in the attention layers. This computation is expensive and proportional to context length. Prefix caching stores these pre-computed KV pairs β so if the same prefix appears in the next request, the model skips recomputing it entirely.
For caching to work, the prefix must be identical byte-for-byte across requests β same characters, same whitespace, same order. Even one token difference means a cache miss. This is why stable, front-loaded prefixes are the design pattern for cacheable prompts.
| Provider | Cache Type | Cached Token Cost | Min Cacheable Prefix | TTL |
|---|---|---|---|---|
| OpenAI (gpt-4o) | Automatic prompt caching | 50% off input tokens | 1,024 tokens | ~1 hour (auto-evicted) |
| Anthropic (Claude) | Explicit cache_control markers | ~90% off input (write once, read many) | 1,024 tokens | 5 min (ephemeral) / manual |
| Google (Gemini) | Explicit cached_content API | ~75% off input tokens | 32,768 tokens | Configurable (up to hours) |
| Self-hosted (vLLM) | Automatic prefix caching (βenable-prefix-caching) | GPU compute saved (no KV recompute) | Any prefix length | Memory-bound (in-flight) |
| Self-hosted (SGLang) | RadixAttention β tree-based KV sharing | Highest hit rate for branching prompts | Any prefix | Memory-bound |
OpenAI caches prompts automatically β no code changes required. Any prompt prefix of 1,024+ tokens that is reused within ~1 hour gets cached at 50% discount. The usage field in the response reports cached_tokens.
Claude's caching is explicit β you mark exactly which parts of the prompt to cache using cache_control breakpoints. The first request writes the cache (slight cost premium); subsequent requests read it at ~90% discount.
Prompt structure determines cache hit rate. Small changes to prompt ordering or dynamic content placement can mean the difference between 90% hit rate and 0%.
The date is in the prefix β every new day invalidates the entire cache. Zero cache hits.
Stable prefix stays identical β high cache hit rate for all static content.
| Dynamic Element | Where to Place | Why |
|---|---|---|
| Static instructions | System prompt β first | Always cached after first call |
| Large static documents | System prompt or early user turn | Highest value to cache β most tokens |
| Retrieved context | After stable prefix | Changes per query β prefix still cached |
| User query | Last in user turn | Always unique β keep after everything stable |
| Current date/time | Last β after all stable content | Invalidates cache if in prefix |
| Session/user ID | Never in cacheable prefix | Makes every request unique β zero hits |
∑ Chapter 07 — Key Takeaways
- Prefix caching reuses pre-computed KV pairs β 50β90% input token cost reduction for cacheable workloads
- OpenAI: automatic, no code changes β requires 1,024+ token prefix, ~1hr TTL, 50% discount
- Anthropic: explicit
cache_controlmarkers β up to 90% discount, use for large static documents - Gemini: explicit
cached_contentAPI, 32K+ min tokens β best for very large stable content - Design rule: stable content first, dynamic content last β never put dates/user IDs in the cacheable prefix
- For self-hosted: enable
--enable-prefix-cachingin vLLM or use SGLang's RadixAttention for branching prompts
Not all context is equally cache-worthy. The value of caching a piece of context depends on how frequently it's reused, how expensive it is to recompute, and how stable it is over time.
| What to Cache | Reuse Frequency | Benefit | Stability |
|---|---|---|---|
| Formatted retrieved chunks | High β same query pattern | Eliminates retrieval + formatting cost | Hoursβdays |
| Compressed document summaries | Very high β per document | Eliminates compression LLM call | Daysβweeks |
| System prompt | Every request | Provider prefix caching (50β90% discount) | Weeksβmonths |
| User preference context | Per user session | Eliminates DB lookup and formatting | Minutesβhours |
| Static knowledge base sections | High β shared across users | Serve from cache, skip retrieval | Days |
| Assembled context for top queries | Very high (80/20 rule) | Full pipeline bypass for hot queries | Minutesβhours |
Cached context is deterministic β the same pre-formatted, pre-compressed chunk is returned every time. This improves answer consistency across sessions. Without caching, minor variations in retrieval scores or compression outputs can cause the same query to produce different context β and different answers β across requests. Caching is both a cost lever and a reliability lever.
Most real-world queries require synthesizing across multiple documents or sources. How you rank, present, and delimit multiple documents determines whether the model synthesizes them correctly β or confuses, ignores, or contradicts them.
Model blends information from different documents into a single "answer," losing attribution of which source said what.
Fix: Explicit delimiters + citation instructions
Model answers from the first one or two documents, ignoring others entirely. Lost-in-the-middle at document level.
Fix: Fewer docs + sandwich ordering + explicit "use all sources" instruction
Documents contradict each other; model picks one without noting the contradiction.
Fix: Explicit conflict detection prompt + "note disagreements" instruction
All documents treated as equally current. An outdated doc overrides a newer one.
Fix: Inject timestamps; instruct model to prefer recent sources
Model cites "document 2" when the fact came from "document 4." Especially common with 5+ documents.
Fix: Unique, memorable source IDs (not just numbers)
When you have multiple retrieved documents, their order in the context window affects which ones the model uses. Rank-aware ordering is different from simple relevance sorting.
| Ranking Signal | What It Is | When to Use |
|---|---|---|
| Relevance Score | Cosine similarity or BM25 score to query | Default β most relevant first |
| Recency | Document timestamp or last-updated date | News, policies, product docs that change |
| Authority | Source type (official docs > forum post) or domain weight | Knowledge bases with mixed source quality |
| Re-rank Score | Cross-encoder score (Cohere Rerank, BGE reranker) | High-stakes retrieval; worth the extra latency |
| Diversity | MMR (Maximal Marginal Relevance) β relevance minus redundancy | When top chunks are near-duplicates of each other |
With multiple documents, clear formatting is critical. Labels must be unambiguous, metadata must be useful, and delimiters must prevent content bleeding between sources.
Including authority="official" vs authority="user" and document dates lets you instruct the model to resolve conflicts by authority and recency. Without this metadata, the model has no principled basis for choosing between contradictory sources.
Documents often contradict each other β especially across time (policy updated) or authority level (official docs vs user reports). Without explicit conflict handling, models pick arbitrarily.
Some answers require combining information from multiple documents β no single source is complete. Synthesis prompts encourage the model to explicitly integrate rather than just retrieve.
"Answer based on the documents provided."
Result: model picks the most relevant single document and answers from it, ignoring complementary information in others.
"Synthesize a complete answer by drawing from ALL provided documents. Identify which aspects each document contributes. Note if any document provides unique information not found in others."
Result: model explicitly combines across sources.
Quality of multi-document synthesis degrades significantly beyond 3β5 documents for most models. Each additional document increases the probability of source neglect, conflation, or mislabeling. If you need 10 documents, consider a two-pass approach: first pass summarizes each document independently; second pass synthesizes the summaries. This is cheaper, more reliable, and scales better than a single context with 10 documents.
∑ Chapter 08 — Key Takeaways
- Five multi-doc failure modes: conflation, neglect, conflict blindness, recency blindness, mislabeling β each requires a specific fix
- Use rich metadata (source ID, date, authority level) in document tags β it enables automatic conflict resolution by the model
- MMR ranking balances relevance and diversity β prevents top-k from returning near-duplicate chunks
- Explicit conflict resolution instructions: prefer official over user, recent over old, note contradictions explicitly
- Quality degrades with 3+ documents β for 10+ docs, use the map-reduce (two-pass) pattern: extract per doc, then synthesize
- Synthesis prompts outperform retrieval prompts β "synthesize from all" vs "answer based on documents" produces meaningfully different results
You can't improve what you don't measure. Context quality is the invisible variable that determines whether your LLM application works in production β and most teams only discover problems when users complain. This chapter defines the metric stack that tells you exactly where context is failing.
Does the context contain information that answers the query? High relevance = low noise. Measured per-chunk and at the context level.
Metric: relevance score (0β1); % of chunks used in the answer
Does the model's response stay grounded in the provided context? Low faithfulness = hallucination even when context is good.
Metric: RAGAS faithfulness; claim verification rate
Does the context include all facts needed to answer completely? Missing a key piece forces the model to hallucinate or hedge.
Metric: answer completeness; recall@k vs gold answer
How much of the context window is actually useful? Token waste = higher cost + higher latency + more noise for the model.
Metric: utilization ratio; noise fraction; token cost per query
Retrieval metrics measure the quality of what you put into context β before the model sees it. These are fast, cheap, and deterministic.
| Metric | What It Measures | How to Compute | Target |
|---|---|---|---|
| Precision@k | Fraction of top-k retrieved chunks that are relevant | Manual labels or LLM judge on sample | >0.7 |
| Recall@k | Fraction of all relevant chunks that appear in top-k | Requires ground-truth relevant set | >0.8 for factual QA |
| MRR | Mean Reciprocal Rank β how early is the first relevant result? | avg(1/rank of first relevant chunk) | >0.6 |
| NDCG@k | Normalized Discounted Cumulative Gain β graded relevance, rank-aware | Relevance labels (0/1/2) + DCG formula | >0.75 |
| Context Utilization | % of retrieved chunks cited or used in final answer | LLM judge: "which chunks did the model actually use?" | >50% β low means too much noise |
| Noise Fraction | % of context tokens that are irrelevant to the query | LLM relevance scorer per chunk | <30% β lower is better |
RAGAS (Retrieval-Augmented Generation Assessment) provides four core metrics that together cover the full RAG quality surface:
Are all claims in the answer supported by the context? Breaks the answer into atomic claims and verifies each against retrieved chunks.
Is the answer actually addressing the question asked? Generates back-questions from the answer and measures alignment with the original.
Are the retrieved chunks actually useful for generating the answer? Measures signal-to-noise in the context window.
Does the retrieved context contain all the information needed to answer? Measures coverage relative to ground-truth answer.
For production systems at scale, you can't manually review every retrieval result. LLM-as-judge provides automated, scalable evaluation that correlates well with human judgments.
Faithfulness measures whether the model's answer is grounded in the context. This is your primary hallucination detector for RAG systems.
Step 1 β Claim Extraction: Break the model's answer into atomic, verifiable claims. Each claim should be a single, unambiguous statement.
Step 2 β Claim Verification: For each claim, check whether it is supported, contradicted, or absent from the retrieved context.
Step 3 β Score Computation: Faithfulness = supported_claims / total_claims. A score below 0.8 indicates significant hallucination risk.
In production you need continuous metric tracking β not just offline eval. Log the key signals with every request and aggregate them into a live dashboard.
| Metric | Collection Method | Alert Threshold |
|---|---|---|
| Context Relevance (mean) | LLM scorer on sampled requests (5β10%) | <0.6 |
| Faithfulness | Async faithfulness check post-response | <0.75 |
| Context Utilization | Citation extraction from response | <0.4 β too many irrelevant chunks |
| Tokens per Query | LLM usage logs | >2Γ baseline β context bloat |
| Answer Latency p95 | Request timing | >5s β retrieval or context issues |
| User Feedback Rate | Thumbs up/down or follow-up question rate | Downvote rate >15% |
Offline evaluation on a benchmark dataset rarely reflects production performance. Production queries have a different distribution, different lengths, and different failure modes. Always run online metrics (sampled LLM evaluation + user feedback) alongside offline benchmarks. A system that scores 0.9 on your eval set may score 0.65 in production on queries you didn't anticipate.
∑ Chapter 09 — Key Takeaways
- Context quality has four pillars: relevance, faithfulness, coverage, efficiency β measure all four, not just end-task accuracy
- Use retrieval metrics (precision@k, recall@k, NDCG) as fast pre-LLM signals that catch retrieval failures before they reach the model
- RAGAS is the standard framework: faithfulness, answer relevancy, context precision, context recall β run it on every major change
- LLM-as-judge scales evaluation to production β sample 5β10% of requests and score chunk relevance asynchronously
- Faithfulness verification (claim extraction β claim verification) is your primary hallucination detector in RAG systems
- Build a live metrics dashboard: context relevance, faithfulness, utilization, token cost, latency β alert on degradation
Context quality must be evaluated, not assumed. A system that "seems to work" in manual testing can have systematic failure modes that only show up under controlled evaluation. Build a test harness that isolates context variables.
Remove individual chunks from the context and measure the change in answer accuracy. If removing a chunk doesn't change the answer, the chunk is wasted tokens. If removing it causes failure, it's critical.
Test: answer_quality(full_context) vs answer_quality(context - chunk_N)
Shuffle chunk order and measure how much answer quality varies. High variance = model is fragile to ordering. Low variance = ordering doesn't matter much for this query type.
Test 5 permutations; measure faithfulness variance across orderings.
Compare answers produced from the original uncompressed context vs compressed context on your test set. If faithfulness drops more than 5% absolute, the compression ratio is too aggressive.
Target: <5% faithfulness drop at your target compression ratio.
For a sample of production queries, measure what fraction of context tokens were cited in the answer. Tokens not cited are wasted. A utilization below 40% signals a retrieval or selection problem.
Target: >50% of context tokens referenced in the final answer.
Context engineering in a notebook is easy. Context engineering at production scale β with real latency budgets, cost constraints, concurrent users, and cascading failures β is an entirely different discipline. This chapter is the full production playbook.
Each stage has its own latency budget, failure mode, and optimization surface. Treat them as independent services with SLAs β not a single monolithic function.
| Stage | Typical Latency | Primary Optimization | Failure Mode |
|---|---|---|---|
| Query Parsing | 1β5ms | Pre-compiled regex; cached NLP models | Wrong intent extraction β wrong retrieval |
| Vector Retrieval | 10β50ms | ANN index (HNSW); GPU-accelerated | Index staleness; embedding model mismatch |
| Keyword Search | 5β20ms | Inverted index; field weighting | Sparse coverage on long-tail queries |
| Re-Ranking | 50β200ms | Async; cache popular queries | Latency spike; cross-encoder OOM |
| Compression | 100β500ms | Rule-based first; LLM only when needed | Over-compression loses key facts |
| LLM Inference | 500msβ5s | Streaming; prefix caching; batching | Timeout; context length exceeded |
Every millisecond of context construction latency adds directly to user-perceived response time. Parallelise all retrieval and processing steps that don't depend on each other.
At scale, context size is your primary cost driver. A system consuming 4,000 input tokens per request at $3/M tokens costs $0.012 per request β at 1M daily requests, that's $12,000/day just in input tokens.
Use small, cheap models (GPT-4o-mini, Haiku) for simple queries with short context. Route complex queries to large models. Saves 60β80% on most workloads.
Cache system prompts and static context with providers that support it (Anthropic, OpenAI). Repeated prefix tokens cost 10Γ less. Saves 20β40% for chat applications.
Set hard token budgets per context section. Use extractive compression on retrieved chunks. Remove boilerplate from system prompts. Target <2,000 input tokens for simple Q&A.
| Strategy | Token Reduction | Quality Impact | Implementation Effort |
|---|---|---|---|
| Reduce k (fewer chunks) | 20β40% | Minimal if precision is high | Low |
| Extractive compression | 30β60% | Low β keeps key sentences | Medium |
| History summarization | 40β70% | Moderate β may lose nuance | Medium |
| Prefix caching | 10β30% cost | None β same tokens | Low (provider feature) |
| Model routing | 50β80% cost | Depends on routing accuracy | High |
| Semantic deduplication | 10β25% | Positive β removes noise | Medium |
Full observability means you can trace any production failure back to its root cause in the context pipeline: was it a bad retrieval, a compression error, a cache miss, or an LLM failure?
Key signals to trace at every request:
- Retrieval latency (ms)
- Chunks retrieved (count)
- Mean relevance score
- Cache hit/miss
- Total tokens assembled
- Tokens per section
- Compression ratio
- Truncation events
- Input / output tokens
- Time to first token
- Total latency
- Provider / model used
If vector search fails or returns low-confidence results, fall back to BM25 keyword search. If both fail, serve from a pre-built static context for the query category.
Always check token count before sending to the LLM. If the assembled context exceeds the model's limit, apply emergency compression: truncate history first, then reduce chunk count.
Each pipeline stage gets a hard timeout. A slow re-ranker should not block the entire request. Degrade to fewer chunks rather than wait indefinitely.
If a retrieval backend fails repeatedly (e.g., vector DB unreachable), open the circuit and serve from cache or static context rather than hammering the failing service.
| Scale Level | Architecture | Key Optimizations |
|---|---|---|
| <100 RPS | Single service, async Python (FastAPI) | Async retrieval, prefix caching, response streaming |
| 100β1K RPS | Horizontal scaling + Redis cache | Semantic query caching, HNSW index on dedicated GPU, re-rank batching |
| 1Kβ10K RPS | Dedicated retrieval microservice + context assembly service | Read replicas, shard vector index, async evaluation pipeline |
| >10K RPS | Kafka-based pipeline, geo-distributed indexes, CDN for static context | Pre-computed context for top queries, speculative prefill, model replicas |
In most production systems, 20% of distinct query patterns account for 80% of traffic. Pre-compute and cache context for your top query templates. This can reduce live retrieval load by 60β80%, dramatically improving p99 latency. Use semantic clustering to identify your top query templates from production logs.
- Hybrid search (vector + BM25)
- Re-ranking on top-k
- Semantic deduplication
- Retrieval fallback chain
- Index freshness monitoring
- Token budget enforced per section
- Overflow guard (assert <limit)
- Extractive compression for long chunks
- Conversation history summarization
- Context template versioning
- Parallel async retrieval
- Timeout budgets per stage
- Prefix caching enabled
- Semantic query caching (Redis)
- Streaming responses
- Distributed tracing (OTEL)
- Token usage per section logged
- Relevance score sampled (5β10%)
- Faithfulness check on samples
- Alerts on degradation
∑ Chapter 10 — Key Takeaways
- Treat the context pipeline as a microservice graph with independent latency budgets, SLAs, and failure modes per stage
- Parallelise all retrieval β vector search, keyword search, history fetch, and user context should all run concurrently with per-stage timeouts
- Cost management: model tiering + prefix caching + aggressive compression can reduce token spend by 60β80% vs naΓ―ve implementation
- Full observability requires distributed tracing at every pipeline stage β retrieval span, assembly span, LLM span β not just end-to-end latency
- Reliability patterns: retrieval fallback chain, overflow guard, circuit breaker, and timeout-based degradation prevent single-component failures from cascading
- The 80/20 query distribution is your biggest scaling lever β pre-compute context for top query templates to cut live retrieval load by 60β80%
In 2024β2026, the capability gap between frontier models has narrowed dramatically. GPT-4o, Claude 3.5 Sonnet, and Gemini 1.5 Pro are all excellent β and increasingly similar on most benchmarks. The infrastructure for calling them is standardized. What's left as the primary differentiator?
How you select, order, format, and assemble context determines model grounding. Two teams using the same model get dramatically different results based on construction alone.
Teams that ruthlessly filter noise, compress aggressively, and enforce relevance thresholds see 2β4Γ quality improvements on the same model vs teams that dump raw retrieval results.
Context engineering is the discipline of controlling what the model sees β not hoping it figures it out. Teams with systematic eval loops, compression pipelines, and cache architectures win.
The model is a fixed function. You cannot change what it knows or how it reasons. The only variable you control is the input. Every improvement in your LLM application β quality, cost, latency, reliability β comes from engineering better inputs. Context engineering is not a supporting discipline. It is the core discipline.
Focus on: control over what enters the context Β· cost awareness at every token Β· failure handling at every pipeline stage Β· systematic measurement of what's working. These four habits separate production-grade context systems from everything else.