AI Advanced · Context Engineering

Context Engineering

Building and optimizing context windows for LLM applications: construction, compression, windowing strategies, and production patterns.

Context is the bottleneck. How you construct, compress, and optimize context determines model quality, latency, and cost. This guide teaches the full spectrum of context engineering, from selection to compression to production caching.

Chapter 01 · Foundations
Context Fundamentals: How LLMs Use Context

The context window is not a container; it's a weighted attention field. What you put in it, where you put it, and how much of it is noise directly determines what the model can and cannot do with your input.

The model does not know what is relevant, what is correct, or what you intended. It only sees tokens. This single constraint has deep engineering implications:

🎯
Relevance must be engineered

The model cannot distinguish signal from noise. If irrelevant content is in the context, the model will attend to it. Relevance is your responsibility, not the model's.

Ordering must be intentional

Token position influences attention weight. The model does not magically find the most important chunk; it follows position bias. Design your context layout deliberately.

📉
Noise directly reduces accuracy

Every irrelevant token competes with relevant ones for attention. More noise = more diluted signal. A smaller, high-quality context consistently outperforms a large, noisy one.

The Most Important Insight in Context Engineering

Better context beats a better model in most real-world systems. Switching from GPT-4o-mini to GPT-4o gives you a marginal improvement. Fixing your context construction can give you a 2–4× improvement on the same model. Always optimize context before upgrading the model.

Context Is the Primary Cost Driver

Cost scales directly with token count, and the context window is almost always the largest component. Five large chunks → ~10K input tokens → 5× the input cost of a 2K-token context. Every optimization that reduces context size compounds across every request: lower cost, lower latency, less noise, better answers.

Every LLM receives a flat sequence of tokens: system prompt, conversation history, retrieved documents, tool results, and user input are all concatenated into a single integer array. The model attends across all of it simultaneously. There is no "memory" separate from this; the context window is the model's entire working memory for a given call.

The context window: one flat token sequence
  • System prompt (role, instructions): ~200–2K tokens
  • Retrieved context (docs, chunks, tool output): ~1K–64K tokens
  • Conversation history (previous turns): ~0–32K tokens
  • User query (current message): ~10–500 tokens
  • Generated output: up to max_tokens
Total budget = model max (e.g. 128K), shared between input tokens and output tokens.
Model | Max Context | Typical Input Budget | Notes
GPT-4o | 128K tokens | ~120K usable | Good long-context performance up to ~64K
Claude 3.5 Sonnet | 200K tokens | ~190K usable | Strong at long documents
Gemini 1.5 Pro | 1M tokens | ~900K usable | Best for very long docs; some quality degradation at extremes
Llama 3.1 8B | 128K tokens | ~32K effective | Quality degrades significantly beyond 32K
Mistral 7B | 32K tokens | ~24K usable | Standard for local deployment
Advertised vs Effective Context Length

A model may support 128K tokens but only reliably use 32K–64K. Beyond that, recall drops, especially for information placed in the middle. Always test your specific use case at your expected context lengths. Advertised != effective.
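One way to run that test is a needle-in-a-haystack sweep: plant a known fact at several depths in filler text and check whether the model recalls it at each depth. The sketch below only builds the prompts. The helper names `build_needle_prompt` and `needle_depth_sweep` are hypothetical, and sending the prompts to your model and scoring the answers is left to your own client code.

```python
def build_needle_prompt(filler: list[str], needle: str, depth: float) -> str:
    """Insert `needle` at a relative depth of the filler paragraphs.

    depth = 0.0 places the needle first, 1.0 places it last.
    """
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0.0 and 1.0")
    index = round(depth * len(filler))
    parts = filler[:index] + [needle] + filler[index:]
    return "\n\n".join(parts)


def needle_depth_sweep(
    filler: list[str],
    needle: str,
    depths: tuple[float, ...] = (0.0, 0.25, 0.5, 0.75, 1.0),
) -> dict[float, str]:
    """Build one prompt per depth so recall can be compared by position."""
    return {d: build_needle_prompt(filler, needle, d) for d in depths}


# Usage: four filler paragraphs and one planted fact (illustrative data)
filler = [f"Filler paragraph {i}." for i in range(4)]
prompts = needle_depth_sweep(filler, "The vault code is 4921.")
```

Repeat the sweep at each total context length you expect in production; a drop in recall at the middle depths is the signal that you have passed the model's effective context length for your task.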

The 2023 paper "Lost in the Middle" demonstrated experimentally what practitioners already suspected: LLMs pay disproportionate attention to the beginning and end of the context window. Information placed in the middle is recalled less reliably, even when it's clearly the most relevant.

[Figure: Recall accuracy (0–100%) by position of relevant information in the context window. Recall is high at the start and end and drops in the middle.]
📉
Why It Happens
  • Attention mechanisms naturally focus on nearby and very early tokens
  • Position embeddings create an implicit primacy/recency bias
  • Training data patterns reinforce beginning/end attention
  • Longer contexts amplify the effect: more middle to get lost in
🎯
Mitigation Strategies
  • Primacy: Put most critical context at the very start
  • Recency: Move high-priority info near the query (end)
  • Chunking: Shorter context windows reduce middle depth
  • Repetition: Repeat key facts at start and end
  • Explicit refs: "Based on document 1 above..." anchors attention
The Practical Rule

Place your most important instruction or the most relevant retrieved chunk either at the very beginning of the context or immediately before the user's question. Never bury critical facts in the middle of a long document list.

Every production LLM call has a token budget. Blow it and you get truncation errors, silent degradation, or hard failures. Managing this budget is a core engineering discipline.

🔧
Token budget calculator (Python)
import tiktoken

def count_tokens(text: str, model: str = "gpt-4o") -> int:
    """Count tokens with the tokenizer tiktoken associates with `model`."""
    enc = tiktoken.encoding_for_model(model)
    return len(enc.encode(text))

def build_context_budget(
    model_max: int = 128000,
    system_prompt: str = "",
    user_query: str = "",
    max_output: int = 2000,
    safety_margin: int = 500,
) -> int:
    """Return how many tokens are available for retrieved context."""
    system_tokens = count_tokens(system_prompt)
    query_tokens = count_tokens(user_query)
    reserved = system_tokens + query_tokens + max_output + safety_margin
    available = model_max - reserved
    return max(0, available)

# Example: 128K model, ~500-token system prompt, ~50-token query
system_prompt = "..."  # your system prompt
user_query = "..."     # the current user message
budget = build_context_budget(
    model_max=128000,
    system_prompt=system_prompt,
    user_query=user_query,
)
# With a 500-token system prompt and a 50-token query:
# 128,000 - (500 + 50 + 2,000 + 500) = 124,950 tokens left for retrieved chunks
Budget Component | Typical Range (tokens) | Controllable? | Notes
System prompt | 200–2,000 | Yes: compress it | Most system prompts can be halved with careful editing
Retrieved context | 1,000–50,000 | Yes: main lever | Chunk count × chunk size; your primary engineering surface
Conversation history | 0–32,000 | Partially: truncate | Grows unbounded in long chats; must be managed explicitly
User query | 10–500 | Not really | User controls this; guard against prompt injection stuffing
Max output | 256–8,000 | Yes: set it low | Reserve less when short answers expected; more for generation tasks
Parametric Knowledge (baked in)

Encoded in weights during training. Available without any prompt input.

Examples: What is Python? Who wrote Hamlet? What is a transformer?

Limits: Cutoff date, hallucination risk, no private data

Access: Always available, zero tokens

Contextual Knowledge (injected)

Provided at inference time via the context window. Overrides and extends parametric knowledge.

Examples: Your product docs, today's news, user's account data

Limits: Token budget, retrieval quality, position bias

Access: Costs tokens, requires retrieval or explicit injection

The Engineering Implication

For facts the model already knows well (general concepts, public knowledge), don't waste context tokens restating them. Reserve your context budget for what the model cannot know: private data, recent events, user-specific information, and domain specifics it may hallucinate without grounding.

∑ Chapter 01 — Key Takeaways

  • The context window is a flat token sequence: system prompt, retrieved docs, history, and user query all compete for the same budget
  • Lost-in-the-middle: Recall drops for information placed in the center, so put critical content at the start or immediately before the query
  • Advertised ≠ effective context length: always test your model at your target context sizes
  • Token budget = model max − system − query − max_output − safety margin; retrieved context is your primary engineering lever
  • Don't waste context on parametric knowledge the model already has β€” reserve it for private, recent, or user-specific information
Chapter 02 · Building Context
Context Construction: Selection, Ordering, and Formatting

Retrieval gives you candidates. Context construction turns candidates into a prompt the model can actually use. Selection, ordering, formatting, and citation anchoring are each distinct engineering problems.

Most teams treat retrieval and context construction as the same problem. They are not. Retrieval returns candidates. Context construction decides what to include, what to exclude, how to format it, and where to place it in the window.

Two systems with identical retrieval, different results

System A retrieves the same 8 chunks as System B. It dumps them in retrieval order, unformatted, with no deduplication and no token budget. The model receives 6K tokens of noisy context.

System B re-ranks the same 8 chunks, drops 3 as irrelevant, deduplicates one, formats each with a source label, and places the most relevant first. The model receives 2K tokens of clean, ordered context.

System B consistently outperforms System A despite identical retrieval, purely from construction quality.
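As an illustration of one step System B performs, here is a minimal deduplication sketch. It assumes chunks are dicts with "text" and "score" keys, and it uses word-shingle Jaccard overlap as a cheap stand-in for whatever similarity measure your pipeline actually uses. The name `dedupe_chunks` and the 0.9 threshold are illustrative choices, not recommendations from a specific library.

```python
def _shingles(text: str, n: int = 3) -> set[tuple[str, ...]]:
    """Return the set of n-word shingles for a text (lowercased)."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(max(1, len(words) - n + 1))}


def _jaccard(a: set, b: set) -> float:
    """Jaccard similarity of two sets; two empty sets count as identical."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def dedupe_chunks(chunks: list[dict], threshold: float = 0.9) -> list[dict]:
    """Drop near-duplicate chunks, keeping the highest-scoring copy of each."""
    kept: list[dict] = []
    for chunk in sorted(chunks, key=lambda c: c["score"], reverse=True):
        candidate = _shingles(chunk["text"])
        if all(_jaccard(candidate, _shingles(k["text"])) < threshold for k in kept):
            kept.append(chunk)
    return kept


# Usage: the two refund chunks are identical, so only the higher-scoring one survives
chunks = [
    {"text": "The refund policy is 30 days with receipt", "score": 0.7},
    {"text": "The refund policy is 30 days with receipt", "score": 0.9},
    {"text": "Support hours are Monday to Friday nine to five", "score": 0.6},
]
unique = dedupe_chunks(chunks)
```

The pairwise comparison is quadratic in the number of chunks, which is fine for the handful of candidates a retriever returns per query.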

Dynamic Construction Outperforms Static Pipelines

Production-grade systems do not use a fixed context template for all queries. They rewrite queries before retrieval, adapt the number of chunks based on query complexity, vary context size based on task type, and apply different construction strategies per user or workflow. A static pipeline that serves every query identically will underperform a dynamic one that adapts to the request.

The construction pipeline, in order:
  • Retrieve: top-K candidates
  • Re-rank: by relevance to query
  • Select: token-budget-aware
  • Order: primacy/recency aware
  • Format: labels, delimiters
  • Inject: into system/user prompt

Each stage affects quality independently. Teams that only focus on retrieval and ignore construction leave significant quality on the table.

Retrieved candidates usually exceed your budget. Selection decides what makes the cut.

Top-K Cutoff

Take the top N results by relevance score, regardless of token count.

  • Simple, easy to reason about
  • Problem: Long chunks waste budget; short chunks waste retrieval
  • Use when: Chunks are uniform size
💰
Token Budget Fill

Add chunks in relevance order until token budget is exhausted.

  • Efficient: fills the budget as fully as the chunk sizes allow
  • Problem: A large irrelevant chunk wastes the budget
  • Use when: Maximizing context density matters
🎯
Score Threshold

Only include chunks whose relevance score exceeds a minimum threshold.

  • Quality control: excludes low-relevance noise
  • Problem: May return empty context if nothing passes
  • Use when: False positives are costly
🔧
Token-budget-aware selection
def select_chunks(
    chunks: list[dict],        # each: {"text": ..., "score": ..., "tokens": ...}
    token_budget: int,
    min_score: float = 0.5,
    max_chunks: int = 10,
) -> list[dict]:
    # Filter by minimum relevance
    candidates = [c for c in chunks if c["score"] >= min_score]
    # Sort by score descending
    candidates.sort(key=lambda x: x["score"], reverse=True)

    selected, used = [], 0
    for chunk in candidates[:max_chunks]:
        if used + chunk["tokens"] <= token_budget:
            selected.append(chunk)
            used += chunk["tokens"]
        else:
            break  # stop at the first chunk that does not fit, preserving strict relevance order
    return selected

Ordering is the direct answer to the lost-in-the-middle problem. Where you place retrieved chunks determines how well the model can use them.

Strategy | Ordering | Best For | Risk
Relevance Descending | Most relevant first | Default; leverages primacy bias | Mid-ranked chunks sit in the middle; the weakest chunks sit nearest the query
Sandwich (U-shape) | Best → middle chunks → best | Long context, multiple equally relevant | Duplication of top chunk; slightly larger prompt
Reverse Relevance | Least relevant first, best last | When recency bias stronger than primacy | Model may anchor on weak context early
Temporal | Chronological order | Conversation history, time-sensitive docs | Most relevant may not be most recent
Hierarchical | Summary → detail chunks | Long documents with overview + details | Requires pre-computed summaries
The Sandwich Pattern

For 4+ retrieved chunks, use the sandwich ordering: most relevant chunk first, least relevant in the middle, second-most relevant last, immediately before the user's query. This exploits both primacy and recency bias simultaneously.
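A minimal sketch of the ordering described in this paragraph (the variant without duplicating the top chunk), assuming each chunk is a dict with a "score" key. The function name `sandwich_order` is hypothetical. Ranked chunks are dealt alternately to the front and the back, so relevance falls toward the middle from both ends.

```python
def sandwich_order(chunks: list[dict]) -> list[dict]:
    """Order chunks so the best is first, the second-best is last,
    and the weakest end up in the middle."""
    ranked = sorted(chunks, key=lambda c: c["score"], reverse=True)
    front, back = [], []
    for rank, chunk in enumerate(ranked):
        # Even ranks (0, 2, 4, ...) fill the front; odd ranks fill the back
        (front if rank % 2 == 0 else back).append(chunk)
    # Reverse the back half so the second-best chunk lands at the very end
    return front + back[::-1]


# Usage with five illustrative chunks
chunks = [{"id": i, "score": s} for i, s in enumerate([0.5, 0.9, 0.7, 0.6, 0.8])]
ordered = sandwich_order(chunks)
order_scores = [c["score"] for c in ordered]  # [0.9, 0.7, 0.5, 0.6, 0.8]
```

With fewer than four chunks the result is nearly the same as plain relevance-descending order, which is why the pattern is suggested only for longer lists.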

Formatting determines how clearly the model can distinguish between context chunks and how reliably it can cite them. Poor formatting causes models to conflate sources, miss boundaries, or fail to cite.

❌
Poor Formatting
Here is some context: The refund policy is 30 days. Customer service hours are 9-5. Returns require a receipt. Answer the question.
  • No source labels → can't cite
  • No chunk boundaries → model conflates
  • No document IDs → can't reference
✅
Structured Formatting
<context>
<doc id="1" source="refund-policy.pdf">
The refund policy allows returns within 30 days of purchase with receipt.
</doc>
<doc id="2" source="support-hours.txt">
Customer service: Mon-Fri 9am-5pm EST.
</doc>
</context>
  • Clear boundaries → model knows edges
  • Source labels → enables citation
  • IDs → "According to doc 1..."
🔧
Context formatter (Python)
def format_context(
    chunks: list[dict],
    style: str = "xml",  # "xml" | "markdown" | "numbered"
) -> str:
    if style == "xml":
        parts = ["<context>"]
        for i, chunk in enumerate(chunks, 1):
            src = chunk.get("source", "unknown")
            parts.append(f'<doc id="{i}" source="{src}">')
            parts.append(chunk["text"].strip())
            parts.append("</doc>")
        parts.append("</context>")
        return "\n".join(parts)
    elif style == "markdown":
        parts = []
        for i, chunk in enumerate(chunks, 1):
            parts.append(f"### Source {i}: {chunk.get('source', '')}")
            parts.append(chunk["text"].strip())
            parts.append("---")
        return "\n\n".join(parts)
    elif style == "numbered":
        return "\n\n".join(
            f"[{i}] {c['text'].strip()}" for i, c in enumerate(chunks, 1)
        )
    # Fail loudly instead of silently returning None for an unknown style
    raise ValueError(f"unknown style: {style!r}")
Format Consistency Matters

Whatever format you choose in development, use exactly the same format in production. If you use XML tags, always use XML tags. If you use numbered lists, always use numbered. Models learn context patterns from your prompt, and inconsistency causes unpredictable citation behavior.

Citation anchoring is the practice of instructing the model to explicitly reference which part of the context it used for each claim. It reduces hallucination, improves verifiability, and allows downstream validation.

📋
Citation Instruction (System Prompt)
You are a helpful assistant with access to company documentation.

Rules:
- Answer ONLY using the provided context
- Cite your source using [doc id] notation
- If the answer is not in the context, say "I don't have information on this."
- Do NOT use your general knowledge
✅
Grounded Output Example
Our refund policy allows returns within 30 days of purchase [doc 1]. You'll need your original receipt to process the return [doc 1]. For questions, contact customer service Monday through Friday, 9am–5pm EST [doc 2].
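The downstream validation mentioned above can start very simply. The sketch below assumes the model follows the [doc N] notation from the system prompt and that documents are numbered from 1 in injection order. `check_citations` is a hypothetical helper; it checks only that citations point at documents that exist, not that the cited document actually supports the claim.

```python
import re

# Matches citations such as "[doc 1]" and captures the document number
CITATION_RE = re.compile(r"\[doc (\d+)\]")


def check_citations(answer: str, num_docs: int) -> dict:
    """Report which injected documents an answer cites, and flag bad IDs."""
    cited = {int(n) for n in CITATION_RE.findall(answer)}
    valid = {n for n in cited if 1 <= n <= num_docs}
    return {
        "cited": sorted(valid),                      # IDs that exist in the context
        "invalid": sorted(cited - valid),            # IDs the model made up
        "uncited_docs": sorted(set(range(1, num_docs + 1)) - valid),
        "has_citation": bool(valid),
    }


# Usage: two documents were injected, but the answer also cites a nonexistent doc 7
answer = "Returns are accepted within 30 days [doc 1]. Support is open 9-5 [doc 2]. See also [doc 7]."
report = check_citations(answer, num_docs=2)
```

An answer with no valid citation, or with any invalid one, is a cheap trigger for a retry or a fallback response before heavier faithfulness checks run.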

∑ Chapter 02 — Key Takeaways

  • Context construction is a pipeline: retrieve → re-rank → select → order → format → inject. Each step affects quality
  • Token-budget-aware selection: filter by score threshold, then fill budget greedily by relevance
  • Ordering rule: Most relevant first (primacy); consider sandwich for 4+ chunks (primacy + recency)
  • Use structured formatting (XML tags or numbered labels) β€” enables citation and prevents source conflation
  • Citation anchoring in the system prompt ("cite using [doc id]") can substantially reduce hallucination and enables verification

Most production LLM quality issues trace back to context construction failures, not model limitations. Know these patterns so you can instrument for them.

🙈
Relevant Chunk Not Selected

The answer exists in your knowledge base but didn't make the top-k. Cause: embedding mismatch, wrong k, or poor chunking at ingestion.

Signal: user says "that info is in your docs", and when you check, it is.

🔀
Too Many Irrelevant Chunks

Retrieval returns chunks that share keywords with the query but don't answer it. The model attends to noise and produces a confused or blended answer.

Fix: raise relevance threshold; add re-ranking step.

Key Info Buried in the Middle

Critical chunk placed at position 4–6 in an 8-chunk context. Lost-in-the-middle effect causes the model to underweight it or miss it entirely.

Fix: sandwich ordering, with the most important chunk first or last.

⚔️
Conflicting Chunks: Wrong One Wins

Two chunks contradict each other (e.g., different policy versions). The model picks one without noting the conflict, often the wrong one (older, lower-quality, or earlier in context).

Fix: explicit conflict detection instruction + timestamp metadata.

🏷️
Poor Formatting Causing Confusion

Chunks injected as raw text with no delimiters or source labels. The model cannot distinguish where one document ends and another begins.

Fix: structured XML tags or numbered doc labels on every chunk.

🌫️
Hallucination Despite Correct Context

The right chunk is in the context, but the model ignores it and generates from parametric memory anyway. Common when context is noisy, too long, or the relevant fact is in the middle.

Fix: reduce noise, move key chunk to start, add "only use provided sources" instruction.

Behavioral Failures Are Harder to Debug Than Retrieval Failures

Retrieval failures are easy to detect: the right chunk simply isn't there. Behavioral failures are harder: the right chunk is present, but the model still mixes sources, ignores the chunk, or hallucinates. These require evaluating both the context content (retrieval) and the model's grounding behavior (faithfulness). Instrument both separately.

Chapter 03 · Compression
Context Compression: Fitting More Signal into Fewer Tokens

Context compression is not about making prompts shorter. It's about preserving maximum information density while spending fewer tokens. The goal: same answers, lower cost, lower latency, less position bias.

The instinct is to include more context (more docs, more history, more detail) to give the model "everything it needs." This is wrong. Beyond a quality threshold, adding more context actively degrades performance.

📡
Increased Noise

Every irrelevant token is noise. The model cannot filter noise itself; it attends to everything. More irrelevant content means more wrong attention patterns.

🌊
Spread Attention

Attention is finite. Adding 10 chunks instead of 3 spreads the model's "focus" thinner. The relevant chunks get less effective attention weight.

Lost-in-the-Middle Worsens

Every additional chunk pushes other chunks further from the ends of the context window. A 10-chunk context is worse than a 4-chunk context if 6 chunks are low-relevance.

The High-Signal Rule

A small high-signal context consistently outperforms a large noisy context. Set relevance thresholds and enforce chunk limits. If you can answer the query with 3 chunks, don't include 8. The goal of compression is not smaller prompts; it's higher signal density per token.

💰
Cost

Input tokens cost money. A 50% compression = 50% cost reduction on input tokens, which is meaningful at scale.

  • GPT-4o: $2.50/1M input tokens
  • 1M queries @ 10K tokens each = $25K
  • 50% compression → $12.5K saved
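The arithmetic behind these figures, as a small sketch. The price and volumes are the ones quoted in the list above; `input_cost_usd` is a hypothetical helper, and current provider pricing should be checked before relying on the numbers.

```python
def input_cost_usd(queries: int, tokens_per_query: int, price_per_million: float) -> float:
    """Input-token cost in USD for a given query volume."""
    total_tokens = queries * tokens_per_query
    return total_tokens / 1_000_000 * price_per_million


# 1M queries at 10K input tokens each, priced at $2.50 per 1M input tokens
baseline = input_cost_usd(1_000_000, 10_000, 2.50)    # 25000.0
# The same traffic after 50% compression (5K input tokens per query)
compressed = input_cost_usd(1_000_000, 5_000, 2.50)   # 12500.0
saved = baseline - compressed                          # 12500.0
```

Output tokens are priced separately and are unaffected by context compression, so the saving applies to the input side of the bill only.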
⚑
Latency

Prefill time grows at least linearly with context length. Shorter context = faster TTFT (time-to-first-token).

  • 10K tokens ≈ 100–500ms prefill
  • 50K tokens ≈ 500–3000ms prefill
  • Compression reduces this directly
🎯
Quality

Less context = less noise, less lost-in-the-middle, more focused attention on relevant parts.

  • Removes off-topic sentences
  • Reduces position bias effects
  • Cleaner signal for the model
Technique | How It Works | Compression Ratio | Quality Loss | Best For
Sentence Filtering | Remove sentences with low relevance score to query | 30–60% | Low (preserves exact text) | Long documents with mixed relevance
Extractive Summarization | Select and concatenate most relevant sentences | 40–70% | Low (exact sentences) | Articles, reports, documentation
Abstractive Summarization | LLM rewrites chunk in fewer tokens | 50–80% | Medium (may lose nuance) | Dense technical text, tables
Entity Extraction | Extract only key facts as structured snippets | 60–90% | High for open-ended; low for structured tasks | Structured data extraction tasks
LLMLingua / Selective Removal | Token-level perplexity scoring removes low-info tokens | 50–70% | Low (preserves key tokens) | General compression, RAG pipelines

Sentence filtering is the highest-fidelity compression method: score each sentence against the query, keep only sentences above a relevance threshold. Lossless for the kept sentences; zero hallucination risk since no text is generated.

🔧
Sentence-level compression with embeddings
import re
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-small-en-v1.5")

def compress_chunk(chunk: str, query: str, threshold: float = 0.4) -> str:
    # Split into sentences at ., !, or ? followed by whitespace
    # (naive, but unlike splitting on "." it keeps decimals such as "2.5" intact)
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", chunk) if s.strip()]
    if not sentences:
        return chunk

    # Embed query and sentences (normalized, so dot product = cosine similarity)
    query_emb = model.encode([query], normalize_embeddings=True)[0]
    sent_embs = model.encode(sentences, normalize_embeddings=True)

    # Score by cosine similarity
    scores = sent_embs @ query_emb

    # Keep sentences above threshold, in their original order
    kept = [s for s, score in zip(sentences, scores) if score >= threshold]
    return " ".join(kept)

Abstractive compression uses an LLM (typically a small, cheap one) to rewrite chunks into denser summaries targeted at a specific query. It can achieve the highest compression ratios but introduces a quality dependency: the compressor must not lose or distort key facts.

When to Use Abstractive

✅ Long documents with redundant prose

✅ Dense technical text that can be paraphrased

✅ You need 60%+ compression

✅ Small fast model available as compressor

When NOT to Use Abstractive

❌ Legal, medical, financial text where exact wording matters

❌ Code: summarization destroys syntax

❌ Numerical data: risk of transcription errors

❌ When compressor latency exceeds savings benefit

🔧
Abstractive compression prompt
# System prompt for the compressor LLM
COMPRESS_SYSTEM = """You are a precise text compressor.
Given a document chunk and a user query, rewrite the chunk to retain ONLY information relevant to answering the query.

Rules:
- Preserve all numbers, dates, and named entities exactly
- Keep sentences that directly relate to the query
- Remove off-topic background information
- Output ONLY the compressed text, nothing else
- Target 40-60% of the original token count"""

COMPRESS_USER = """Query: {query}

Chunk to compress:
{chunk}"""
The Compressor Adds Latency

Abstractive compression requires an extra LLM call per chunk. With 5 chunks × 200 ms per call, compressing sequentially adds about 1 second to your pipeline (less if the calls run in parallel). Use a small, fast model (Llama 3.1 8B, GPT-4o-mini, Claude Haiku) as the compressor; the main LLM's quality improvement must justify the added latency.

LLMLingua (Microsoft Research, 2023) is a compression method that uses a small LM to compute token-level perplexity. Tokens that are predictable (low perplexity) carry little information and can be dropped. The remaining tokens form a compressed, but still parseable, prompt.

✅
LLMLingua Strengths
  • Works on any text, with no LLM generation step
  • Deterministic compression ratio control
  • Largely preserves semantic meaning at around 50% compression
  • Fast: small local model for scoring
  • Open source: llmlingua pip package
⚠️
LLMLingua Limitations
  • Compressed text looks garbled to humans
  • Model-dependent: works best for models similar to compressor
  • Sensitive structure (JSON, code, tables) may break
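  • Requires additional local model dependency
🔧
LLMLingua usage
# pip install llmlingua
from llmlingua import PromptCompressor

# Note: this checkpoint is LLMLingua-2, which scores tokens with a trained
# classifier rather than perplexity; the calling pattern is the same.
compressor = PromptCompressor(
    model_name="microsoft/llmlingua-2-bert-base-multilingual-cased-meetingbank",
    use_llmlingua2=True,
)

context = "..."  # the retrieved text you want to compress

compressed = compressor.compress_prompt(
    context,
    rate=0.5,              # keep 50% of tokens
    force_tokens=["\n"],   # always keep newlines
)

print(compressed["compressed_prompt"])  # compressed text
print(compressed["ratio"])              # actual achieved ratio
print(compressed["saving"])             # library's estimated saving (a formatted string)
print(compressed["origin_tokens"], compressed["compressed_tokens"])  # token counts before and after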
Situation | Recommended Strategy | Reason
Legal / medical documents | Sentence filtering only | Exact wording matters; no LLM rewriting
General knowledge articles | Abstractive or LLMLingua | Prose is compressible without loss
Code snippets | Minimal or none | Code structure is brittle; compression may break syntax
Long-form reports (20+ pages) | Hierarchical: abstract summary + sentence filtering of sections | Need overview + relevant detail
Latency-sensitive pipeline | Sentence filtering (local embeddings) | No extra LLM call; milliseconds, not seconds
High-volume, cost-critical | LLMLingua (offline pre-compression) | Compress once, store compressed; pay cost once
Pre-Compress at Ingestion Time

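The most cost-effective compression happens at document ingestion, not at query time. Pre-compute summaries and compressed representations when documents are indexed. At query time, retrieve the pre-compressed version, with no extra compression latency or cost per query. This applies to query-independent compression; query-aware methods such as sentence filtering against the user's question still have to run at query time.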

∑ Chapter 03 — Key Takeaways

  • Compression delivers three wins: lower cost, lower latency, better quality (less noise = more focused attention)
  • Sentence filtering: score sentences vs query with embeddings, drop below threshold. High fidelity, no hallucination risk
  • Abstractive compression: use a fast LLM to rewrite chunks. High ratio but adds latency; avoid for precise text
  • LLMLingua: token-level perplexity-based dropping. Deterministic, works on arbitrary text, open source
  • Pre-compress at ingestion, not at query time. Pay once; save cost on every query
  • Never compress code, numerical tables, or legal text with abstractive methods; use extractive or no compression

Compression reduces tokens, cost, and latency, but it trades against information fidelity. Applied incorrectly, it loses the details your system depends on.

Compression Risk | How It Manifests | Mitigation
Critical detail loss | A numeric threshold, date, or constraint is dropped because it scored low in isolation, but it was essential | Always evaluate compressed vs original output on a test set; flag numeric patterns for retention
Meaning alteration | Abstractive compression changes a negation, qualifier, or conditional: "not required" becomes "required" | Avoid abstractive compression for policy, legal, or safety content; use extractive only
Introduced bias | LLM compressor summarizes multiple viewpoints as one, losing nuance or introducing the model's own bias | Sample-evaluate compressed summaries; A/B test faithfulness scores
Over-compression | High compression ratio leaves too few tokens; the answer is no longer reconstructible from the compressed context | Set minimum token floor per chunk; test at boundary compression ratios
Production Rule

Treat compression as a pipeline stage with its own regression tests. Every time you change compression logic, run your faithfulness eval on a held-out set and verify the score doesn't drop. Use compression selectively: apply it to narrative prose, not to structured data, code, or contractual language.
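One cheap regression check that fits this rule is verifying that every number in the original chunk survives compression. The sketch below is an assumption-laden starting point rather than a full faithfulness eval: `missing_numbers` is a hypothetical helper, and its regex covers plain digits, thousands separators, decimals, and percentages only (not dates written in words, for example).

```python
import re

# Digits with optional , or . groups and an optional trailing percent sign
NUMBER_RE = re.compile(r"\d+(?:[.,]\d+)*%?")


def missing_numbers(original: str, compressed: str) -> list[str]:
    """Return numbers that appear in the original text but not in the compressed text."""
    kept = set(NUMBER_RE.findall(compressed))
    return [n for n in NUMBER_RE.findall(original) if n not in kept]


# Usage: the compressor dropped the order limit and the fee percentage
original = "Refunds are allowed within 30 days for orders under $50,000. A 15% fee applies."
compressed_text = "Refunds allowed within 30 days. A fee applies."
lost = missing_numbers(original, compressed_text)  # ["50,000", "15%"]
```

A non-empty result does not always mean the compression is wrong (the number may be irrelevant to the query), but it is a useful flag for manual review or for falling back to the uncompressed chunk.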

Chapter 04 · Windowing
Windowing Strategies: Managing Context Across Long Inputs

When the source material is longer than your context window, windowing decides which part of the document the model sees, and when. The wrong windowing strategy silently discards the answer before the model even runs.

A 200-page contract is ~150,000 tokens (about 750 tokens per page). A typical RAG chunk budget is 8,000–32,000 tokens. Even a "long context" model with 128K capacity can only fit ~170 pages at that density, before reserving room for instructions and output. Windowing is how you navigate this mismatch for both indexing-time chunking and inference-time context assembly.

Indexing-Time Windowing

How you split documents into chunks when building the index. Determines retrieval granularity and the atomic unit of context.

Key question: How big should each chunk be?

Too large → diluted relevance scores, wasteful tokens

Too small → loses context needed to answer, more chunks to rank

Inference-Time Windowing

How you manage context when a conversation grows or a task requires iterating through a long document.

Key question: What do I drop when the window fills?

Drop old turns → lose conversation coherence

Drop retrieved context → lose grounding

Fixed windows split documents into equal-size chunks at fixed token boundaries. Simple to implement, but mid-sentence and mid-paragraph cuts destroy semantic coherence.

❌
Fixed Window Failure Mode
# Split at exactly 512 tokens
Chunk 1: "...The defendant was found guilty on three counts. The sentence was determined by the presiding judge after carefully reviewing the evidence. The maximum penalty under statute 42B"
Chunk 2: "is ten years imprisonment or a fine not exceeding $50,000. The court also considered the defendant's..."

The key fact is split across two chunks. Either chunk alone will produce an incomplete or wrong answer.

✅
When Fixed Windows Are Acceptable
  • Uniform prose with no sentence-crossing critical facts
  • Large chunks (1,000+ tokens) where mid-cut rarely matters
  • Pre-processing step before applying semantic splitting
  • Code files split at function boundaries (not token count)

Minimum mitigation: add 10–20% overlap between adjacent chunks

Sliding windows add overlap between adjacent chunks. Each chunk shares N tokens with the previous chunk, so a fact that straddles a boundary appears whole in at least one chunk, provided the fact is no longer than the overlap.

Sliding window: each chunk overlaps its neighbors
[Diagram: a full document (e.g. 10,000 tokens) split into 512-token chunks, each overlapping the next. stride = chunk_size − overlap, so each step moves 512 − 64 = 448 tokens forward.]
🔧
Sliding window chunker
def sliding_window_chunks(
    text: str,
    chunk_size: int = 512,
    overlap: int = 64,
    tokenizer=None,
) -> list[str]:
    """Split text into overlapping token-based chunks."""
    # A stride of zero or less would never advance through the text
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    if tokenizer is None:
        import tiktoken
        tokenizer = tiktoken.encoding_for_model("gpt-4o")

    tokens = tokenizer.encode(text)
    stride = chunk_size - overlap  # how far each window advances
    chunks = []
    for start in range(0, len(tokens), stride):
        end = start + chunk_size
        chunks.append(tokenizer.decode(tokens[start:end]))
        if end >= len(tokens):
            break  # the last window already reaches the end of the text
    return chunks

# Rule of thumb: overlap = 10–20% of chunk_size
# chunk 512 tokens → overlap 50–100 tokens
Chunk Size | Overlap | Use Case | Trade-off
128–256 tokens | 20–40 tokens | FAQ, structured data, precise fact retrieval | High precision, low recall, more chunks to rank
512 tokens | 50–100 tokens | General purpose RAG (most common) | Good balance; default starting point
1,024–2,048 tokens | 100–200 tokens | Technical docs, legal text, reasoning tasks | Better context per chunk; harder to rank precisely
4,096+ tokens | 400–500 tokens | Long-form analysis, whole-section retrieval | Context-rich but relevance score diluted

Semantic chunking splits documents at natural topic boundaries rather than fixed token counts. When embedding similarity drops sharply between adjacent sentences, that's a topic transition and a good split point.

✅
Semantic Chunking Benefits
  • Each chunk covers one coherent topic
  • Retrieval scores are more meaningful (one topic per chunk)
  • Fewer cross-boundary answer splits
  • Works especially well on structured docs (reports, articles)
⚠️
Semantic Chunking Costs
  • Requires embedding every sentence, which is expensive at ingestion
  • Variable chunk sizes complicate token budget planning
  • Poor results on dense technical text with no clear topic shifts
  • Needs a good sentence-level embedding model
🔧
Semantic chunker (cosine drop method)
import re
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-small-en-v1.5")

def semantic_chunks(
    text: str,
    threshold: float = 0.3,        # split when neighbor similarity falls below 1.0 - threshold
    min_chunk_tokens: int = 100,
    max_chunk_tokens: int = 1024,
) -> list[str]:
    # Naive sentence split at ., !, or ? followed by whitespace
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    if not sentences:
        return []
    embeddings = model.encode(sentences, normalize_embeddings=True)

    # Cosine similarity between adjacent sentences (embeddings are unit-length)
    similarities = [
        float(embeddings[i] @ embeddings[i + 1])
        for i in range(len(embeddings) - 1)
    ]

    chunks, current, current_tokens = [], [], 0
    for i, sentence in enumerate(sentences):
        current.append(sentence)
        current_tokens += len(sentence.split())  # rough estimate: one word ~ one token
        topic_shift = i < len(similarities) and similarities[i] < (1.0 - threshold)
        # Split on a topic shift once the chunk is large enough, or when it hits the size cap
        if (topic_shift and current_tokens >= min_chunk_tokens) or current_tokens >= max_chunk_tokens:
            chunks.append(" ".join(current))
            current, current_tokens = [], 0
    if current:
        chunks.append(" ".join(current))
    return chunks

Hierarchical windowing indexes small chunks for precise retrieval but retrieves larger parent chunks for fuller context. You get precision from small-chunk search and coherence from large-chunk content.

[Figure: Parent-child chunking (retrieve small, inject large). A parent chunk (~2,048 tokens, the full section) is stored but not indexed for retrieval. Its three child chunks (~256 tokens each) are indexed. The query matches Child 2, and the full 2,048-token parent section is injected into context for richer context.]
The Key Advantage

Small chunks win retrieval battles: they're precise, focused, and score high. But small chunks lose context battles: they're too short to fully answer. Parent-child solves both: retrieve with child precision, answer with parent context.
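
The parent-child lookup described above can be sketched in a few lines. This is a minimal illustration, not a production retriever: word-overlap scoring stands in for a real embedding search, and the names `Child`, `build_index`, `overlap_score`, and `retrieve_parent` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Child:
    text: str
    parent_id: int  # index of the parent section this child was cut from


def build_index(parents: list[str], child_words: int = 40) -> list[Child]:
    """Cut each parent into small word windows; only these children are searched."""
    children = []
    for pid, parent in enumerate(parents):
        words = parent.split()
        for start in range(0, len(words), child_words):
            children.append(Child(" ".join(words[start:start + child_words]), pid))
    return children


def overlap_score(query: str, text: str) -> int:
    """Toy relevance: distinct query words found in the text.
    Assumption: a real system would use vector similarity here."""
    return len(set(query.lower().split()) & set(text.lower().split()))


def retrieve_parent(query: str, parents: list[str], children: list[Child]) -> str:
    """Search over the small children, then return the full parent of the best match."""
    best = max(children, key=lambda c: overlap_score(query, c.text))
    return parents[best.parent_id]


parents = [
    "Refunds are accepted within 30 days. A receipt is required for every refund.",
    "Shipping takes 5 business days. Express shipping is available at checkout.",
]
children = build_index(parents, child_words=6)
context = retrieve_parent("how long does express shipping take", parents, children)
```

The search only ever sees the six-word children, but the string injected into the prompt is the whole parent section.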

In multi-turn conversations, history grows unbounded. At some point, it exceeds the context window. Inference-time windowing decides what to keep and what to drop.

| Strategy | What Is Dropped | Pros | Cons |
| --- | --- | --- | --- |
| FIFO Truncation | Oldest messages first | Simple, no extra processing | Loses early context (system setup, key decisions) |
| Pinned + FIFO | Oldest non-pinned messages | Preserves system prompt + key anchors | Requires explicit pinning logic |
| Summary Compression | Older turns compressed to a summary | Keeps the gist of older turns; coherent history | Adds an LLM call; summary may lose nuance |
| Semantic Retrieval | Low-relevance historical turns | Keeps only what's relevant to the current query | Complex; requires embedding history |
Never Silently Truncate

Silent truncation is the most dangerous pattern: the model gets half a conversation with no indication that context was removed, leading to confused or contradictory responses. Always inject an explicit indicator when history is compressed: [Earlier conversation summarized: user asked about X, decided Y].
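
A minimal sketch of the Pinned + FIFO strategy with an explicit truncation marker. The `pinned` flag on messages, the `window_history` name, and the whitespace word count used as a token proxy are all assumptions for illustration; swap in your real tokenizer and message schema.

```python
def window_history(
    messages: list[dict],
    max_tokens: int,
    count_tokens=lambda text: len(text.split()),  # crude proxy, not a real tokenizer
) -> list[dict]:
    """Keep pinned messages, keep the newest unpinned turns that fit,
    and insert a visible marker where older turns were dropped."""
    keep = {i for i, m in enumerate(messages) if m.get("pinned")}
    budget = max_tokens - sum(count_tokens(messages[i]["content"]) for i in keep)

    # Walk backwards so the most recent unpinned turns are kept first.
    for idx in range(len(messages) - 1, -1, -1):
        if idx in keep:
            continue
        cost = count_tokens(messages[idx]["content"])
        if cost > budget:
            break  # this turn and everything older is dropped
        budget -= cost
        keep.add(idx)

    dropped = len(messages) - len(keep)
    result, marker_added = [], False
    for idx, msg in enumerate(messages):
        if idx in keep:
            result.append(msg)
        elif not marker_added:
            result.append({
                "role": "system",
                "content": f"[{dropped} earlier messages removed to fit the context window]",
            })
            marker_added = True
    return result


history = [
    {"role": "system", "content": "You are a support agent.", "pinned": True},
    {"role": "user", "content": "My order number is 1234."},
    {"role": "assistant", "content": "Thanks, I found order 1234."},
    {"role": "user", "content": "Where is it now?"},
]
windowed = window_history(history, max_tokens=10)
```

In a real system the marker would usually carry a summary of what was dropped rather than just a count; the point is that the model is never left guessing.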

∑ Chapter 04 — Key Takeaways

  • Fixed windows are simple but break sentences; always add at least 10–20% overlap
  • Sliding windows with stride = chunk_size − overlap ensure boundary facts exist in at least one complete chunk (as long as the fact is shorter than the overlap)
  • Chunk size rule of thumb: 512 tokens with 64-token overlap is the default starting point; tune per domain
  • Semantic chunking splits at topic boundaries: better precision, but expensive to compute at ingestion
  • Hierarchical windowing (parent–child): index small chunks for retrieval, inject large parent chunks for context, combining the strengths of both
  • Inference-time: never silently truncate; use pinned-FIFO or summary compression; always tell the model when history was dropped
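
The stride rule from the takeaways can be sketched as follows. This is an illustrative helper (the name `sliding_window_chunks` is hypothetical) that works on any pre-tokenized list.

```python
def sliding_window_chunks(
    tokens: list[str],
    chunk_size: int = 512,
    overlap: int = 64,
) -> list[list[str]]:
    """Fixed-size windows where consecutive chunks share `overlap` tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    stride = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), stride):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reaches the end of the document
    return chunks


tokens = [f"t{i}" for i in range(10)]
chunks = sliding_window_chunks(tokens, chunk_size=4, overlap=1)
```

With `chunk_size=4` and `overlap=1` the stride is 3, so each chunk starts on the last token of the previous one.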
05
Chapter 05 · Density
Information Density: Signal vs Noise in Context

More context is not always better. Every low-signal token you add is a high-signal token the model attends to less. Information density engineering is about ensuring every token in your context window earns its place.

Transformer attention is not uniform: the model allocates attention across all tokens, but that allocation is competitive. When your context contains 20% useful signal and 80% boilerplate, the model must "work harder" to attend to the right parts. Dense context = better answers at lower token cost.

📰
Low-Density Content
  • Legal boilerplate and disclaimers
  • Document headers, footers, page numbers
  • Repeated information across chunks
  • Off-topic paragraphs in retrieved docs
  • Verbose explanations of obvious concepts
🎯
High-Density Content
  • Specific facts: numbers, dates, names
  • Decision criteria and rules
  • Definitions unique to your domain
  • Step-by-step procedures
  • Constraint lists and edge cases
⚡
Density Engineering Goals
  • Maximize relevant tokens per chunk
  • Remove formatting artifacts
  • Deduplicate near-identical content
  • Prefer structured representations
  • Strip navigation, menus, ads, footers
| Noise Type | Example | Impact | Fix |
| --- | --- | --- | --- |
| Structural Artifacts | Page numbers, TOC entries, nav menus | Clutters chunks with zero-signal text | Strip during document pre-processing |
| Legal Boilerplate | "This document is confidential and intended only for..." | Wastes 100–500 tokens per document | Blacklist common boilerplate patterns |
| Duplicate Content | Same paragraph repeated in overview + detail section | Dilutes attention; inflates token count | Dedup at ingestion via embedding similarity |
| Irrelevant Sidebars | Related articles, footnotes, bibliography | Off-topic context confuses the model | Semantic filtering per section type |
| Verbose Prose | "It is worth noting that, in the context of..." (→ "Note:") | 2–5× token waste on filler | Abstractive compression targeting filler phrases |
| Format Overhead | HTML tags, markdown escape sequences | Raw HTML uses 30–50% extra tokens vs plain text | Strip HTML; convert markdown to plain text |
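
Several of the fixes in the table can be applied in one pre-processing pass. The sketch below uses only the standard library; the `BOILERPLATE_PATTERNS` list and the `clean_chunk` name are hypothetical examples, and the dedup step catches exact duplicate lines only (near-duplicates would need the embedding-based approach named in the table).

```python
import html
import re

# Hypothetical patterns; build this list from boilerplate seen in your own corpus.
BOILERPLATE_PATTERNS = [
    re.compile(r"this document is confidential.*", re.IGNORECASE),
    re.compile(r"^page \d+( of \d+)?$", re.IGNORECASE),
]


def clean_chunk(raw: str) -> str:
    """Strip HTML tags, boilerplate lines, and exact-duplicate lines."""
    text = re.sub(r"<[^>]+>", " ", raw)  # format overhead: drop HTML tags
    text = html.unescape(text)           # decode entities such as &amp;
    kept, seen = [], set()
    for line in text.splitlines():
        line = re.sub(r"\s+", " ", line).strip()
        if not line:
            continue
        if any(p.search(line) for p in BOILERPLATE_PATTERNS):
            continue                     # structural artifacts and legal boilerplate
        key = line.lower()
        if key in seen:
            continue                     # duplicate content (exact matches only)
        seen.add(key)
        kept.append(line)
    return "\n".join(kept)


raw = (
    "<p>Refund window: 30 days.</p>\n"
    "Page 3 of 12\n"
    "<p>Refund window: 30 days.</p>\n"
    "This document is confidential and intended only for staff."
)
cleaned = clean_chunk(raw)
```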

Before tuning, measure. Two practical density metrics: relevance density (what fraction of the context is relevant to the query) and entity density (how many unique named entities per 100 tokens).

🔧
Density scorer
import re

import tiktoken
from sentence_transformers import SentenceTransformer

embed_model = SentenceTransformer("BAAI/bge-small-en-v1.5")
tokenizer = tiktoken.encoding_for_model("gpt-4o")

def relevance_density(context: str, query: str) -> float:
    """Fraction of sentences with cosine sim > 0.4 to query."""
    # Naive sentence split; fragments of 20 characters or fewer are ignored
    sentences = [s.strip() for s in context.split(".") if len(s.strip()) > 20]
    if not sentences:
        return 0.0
    q_emb = embed_model.encode([query], normalize_embeddings=True)[0]
    s_embs = embed_model.encode(sentences, normalize_embeddings=True)
    scores = s_embs @ q_emb  # cosine similarity, since embeddings are normalized
    return float((scores > 0.4).mean())

def entity_density(text: str) -> float:
    """Named entities + numbers per 100 tokens (rough proxy)."""
    token_count = len(tokenizer.encode(text))
    # Count capitalized phrases and numbers as a rough entity proxy
    # (this also counts ordinary capitalized sentence starts)
    entities = len(re.findall(r'\b[A-Z][a-z]+(?:\s[A-Z][a-z]+)*|\b\d+[\d,.]*\b', text))
    return (entities / max(token_count, 1)) * 100

# Good targets: relevance_density > 0.6, entity_density > 3.0

When the same information can be represented as prose or as a structured list/table, the structured version is almost always higher density: more facts per token, and easier for the model to parse.

📄
Prose (73 tokens)
The refund window is 30 days from the date of purchase. Customers must have their original receipt. Items must be in original condition and unopened. Electronics are eligible for exchange only, not refund. Shipping costs are non-refundable in all cases.
📋
Structured (38 tokens, 48% less)
Refund policy:
- Window: 30 days from purchase
- Requires: original receipt
- Condition: unopened, original condition
- Electronics: exchange only (no refund)
- Shipping: non-refundable
Convert at Ingestion

When indexing documents that contain policy lists, specifications, or tabular data, pre-convert prose to a structured format during ingestion and store the structured version in your vector database. Embedding quality generally holds up, and token density improves substantially.

Research and production experience consistently show a non-monotonic relationship between context length and quality. Adding more context helps up to a point; then it starts to hurt.

[Figure: Answer quality vs context size (the threshold effect). X-axis: context size in tokens (0, 2K, 8K, 32K); y-axis: answer quality (low to high). Quality rises while added context is signal, peaks around 4K–8K tokens, then falls as noise dilutes attention.]
More Chunks ≠ Better Answers

Doubling your retrieved chunk count from 5 to 10 doesn't improve answers if the extra 5 chunks are marginally relevant. They add tokens, increase latency, and introduce noise. Default to fewer, higher-quality chunks. Start with 3–5; only add more if eval shows consistent improvement.

∑ Chapter 05 — Key Takeaways

  • Attention is competitive: every low-signal token dilutes attention on high-signal tokens
  • Main noise sources: structural artifacts, boilerplate, duplicates, verbose prose, HTML tags; strip them at ingestion time
  • Measure density: relevance density > 0.6 (fraction of sentences relevant to query) and entity density > 3.0 (named entities per 100 tokens)
  • Structured representations beat prose: the same information in bullet/table form uses 30–50% fewer tokens
  • The threshold effect is real: quality peaks around 4K–8K tokens for most tasks; more context beyond that introduces noise
  • Default to 3–5 high-quality chunks, not 10–20 mediocre ones
06
Chapter 06 · Long Context
Long Context Models: 100K+ Tokens in Practice

Long context models change what's possible, but not how you should think. A 1M-token window doesn't eliminate the need for context engineering; it shifts the constraints. Cost, latency, and position bias all scale with context length.

| Model | Context Window | Effective Range | Input Cost | Best Use |
| --- | --- | --- | --- | --- |
| GPT-4o | 128K tokens | ~32K–64K real reliability | $2.50/1M tokens | Standard production, API access |
| Claude 3.5 Sonnet | 200K tokens | Strong to ~128K | $3.00/1M tokens | Long legal/research docs |
| Claude 3 Opus | 200K tokens | Strong to ~150K | $15.00/1M tokens | Complex multi-doc analysis |
| Gemini 1.5 Pro | 1M tokens | Strong at 100K–500K | $1.25/1M (≤128K) / $2.50/1M (>128K) | Whole codebase, book-length docs |
| Gemini 1.5 Flash | 1M tokens | Good to ~200K, degrades beyond | $0.075/1M (≤128K) | High-volume, cost-sensitive long context |
| Llama 3.1 70B | 128K tokens | ~16K–32K effective for open models | Self-hosted | Private deployment, data sovereignty |

Understanding why models degrade at long context requires understanding position embeddings: how models encode token position in the sequence.

Absolute Position Embeddings

Each position has a fixed learned embedding. The maximum number of positions is fixed at training time, so the model cannot generalize beyond its training length.

  • GPT-2 style; largely deprecated
  • Hard stop at training max length
🔄
RoPE (Rotary Position Embedding)

Relative positional encoding via rotation. Can extend beyond training length via RoPE scaling (YaRN, NTK-aware). Used by Llama, Mistral, Qwen.

  • Extensible via fine-tuning or scaling
  • Quality degrades gracefully beyond training range
♾️
ALiBi (Attention with Linear Biases)

Penalizes attention scores by distance, no explicit position embedding. Generalizes to arbitrary length without retraining.

  • Used in MPT; robust extrapolation
  • Some quality loss vs RoPE at long range
The Practical Takeaway

For models using RoPE (Llama, Mistral): stay within the fine-tuned context range. Beyond it, quality degrades unpredictably. For API models (GPT-4o, Claude, Gemini): the provider has already handled extension, but test your specific task at your target length before committing to production.
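
To make the "penalizes attention scores by distance" idea concrete, here is a small sketch of the ALiBi bias. It uses the geometric slope schedule from the ALiBi paper for power-of-two head counts; the function names are illustrative, and plain Python lists are used instead of tensors to keep it dependency-free.

```python
def alibi_slopes(num_heads: int) -> list[float]:
    """Per-head slopes 2^(-8h/n) for h = 1..n (power-of-two head counts only)."""
    if num_heads <= 0 or num_heads & (num_heads - 1):
        raise ValueError("this sketch only handles power-of-two head counts")
    return [2 ** (-8 * (h + 1) / num_heads) for h in range(num_heads)]


def alibi_bias(seq_len: int, slope: float) -> list[list[float]]:
    """Causal bias matrix: entry [i][j] is added to the attention logit of
    query position i attending to key position j. Future keys are masked."""
    return [
        [-slope * (i - j) if j <= i else float("-inf") for j in range(seq_len)]
        for i in range(seq_len)
    ]


slopes = alibi_slopes(8)
bias = alibi_bias(4, slopes[0])
```

The penalty grows linearly with distance and depends only on `i - j`, which is why nothing in the formula is tied to a maximum training length.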

✅
Use Long Context When:
  • The answer requires synthesizing across an entire document
  • You can't predict which sections will be relevant (reduces retrieval risk)
  • Document structure matters (cross-references, section dependencies)
  • Few-shot examples are so large they exceed normal context
  • Whole codebase analysis, full book Q&A, complete contract review
❌
Don't Use Long Context When:
  • RAG would achieve the same quality at 10× lower cost
  • Only a small section of the doc is ever relevant
  • You're answering many queries (cost scales with every call)
  • You need low latency: 100K+ tokens means roughly 2–10s of prefill
  • Cheaper model + retrieval outperforms expensive long-context model
| Strategy | Context Size | Input Cost (at GPT-4o's $2.50/1M rate) | TTFT | Best For |
| --- | --- | --- | --- | --- |
| RAG (sparse retrieval) | 2K–8K tokens | $0.005–$0.02 / query | <200ms | High-volume, known-answer retrieval |
| RAG (dense retrieval) | 8K–32K tokens | $0.02–$0.08 / query | 200–800ms | Complex queries, multiple documents |
| Long context (64K) | 64K tokens | $0.16 / query | 1–3s | Full-doc analysis, infrequent queries |
| Long context (200K) | 200K tokens | $0.50 / query | 5–15s | One-time analysis, not production serving |
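
The per-query figures in the table are plain arithmetic on the input price. A small sketch, assuming the $2.50/1M input rate quoted for GPT-4o and ignoring output tokens (the function names are illustrative):

```python
def input_cost_per_query(context_tokens: int, price_per_1m: float = 2.50) -> float:
    """Input-token cost in dollars for one request."""
    return context_tokens / 1_000_000 * price_per_1m


def monthly_input_cost(
    context_tokens: int,
    queries_per_day: int,
    price_per_1m: float = 2.50,
    days: int = 30,
) -> float:
    """Input-token cost in dollars over a month of traffic."""
    return input_cost_per_query(context_tokens, price_per_1m) * queries_per_day * days


rag_monthly = monthly_input_cost(8_000, 10_000)        # 8K-token RAG context
long_ctx_monthly = monthly_input_cost(64_000, 10_000)  # 64K-token long context
```

At 10,000 queries per day, the 64K-token design costs eight times as much in input tokens as the 8K-token RAG design, before any latency difference is considered.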
The Hybrid Pattern

The most cost-effective production pattern combines both: use retrieval as a first pass to identify relevant sections, then inject those sections (plus surrounding context) into a long-context model for synthesis. RAG gives you precision; long context gives you coherence within the relevant section. Cost stays bounded; quality improves.
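
One way to sketch the "relevant sections plus surrounding context" step: take the section indices returned by retrieval and widen each hit by a fixed radius. The helper names and the `[section i]` label format are assumptions for illustration.

```python
def expand_with_neighbors(hit_indices: list[int], num_sections: int, radius: int = 1) -> list[int]:
    """Return the retrieved section indices plus `radius` neighbors on each side,
    deduplicated and in document order."""
    selected = set()
    for idx in hit_indices:
        lo = max(0, idx - radius)
        hi = min(num_sections - 1, idx + radius)
        selected.update(range(lo, hi + 1))
    return sorted(selected)


def build_hybrid_context(sections: list[str], hit_indices: list[int], radius: int = 1) -> str:
    """Join the expanded sections, labelled so the model can refer to them."""
    keep = expand_with_neighbors(hit_indices, len(sections), radius)
    return "\n\n".join(f"[section {i}]\n{sections[i]}" for i in keep)


sections = [f"Section {i} text" for i in range(10)]
picked = expand_with_neighbors([2, 3, 9], len(sections))
context = build_hybrid_context(sections, [2, 3, 9])
```

Keeping the sections in document order preserves cross-references between neighbors, which is the coherence benefit the pattern is after.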

Before committing to a long-context model for production, run the "needle in a haystack" test: place a specific fact (the needle) at various positions in a large document (the haystack) and ask questions that require recalling it. This reveals where each model's attention actually degrades.

🔧
NIAH test scaffold
def run_niah_test(
    model_fn,
    haystack: str,                      # large filler document (required)
    needle: str = "The secret code is PURPLE-42.",
    positions: tuple[float, ...] = (0.1, 0.3, 0.5, 0.7, 0.9),
    context_lengths: tuple[int, ...] = (8000, 32000, 64000, 128000),  # in tokens
    chars_per_token: int = 4,           # rough heuristic for English text
) -> dict:
    results = {}
    for ctx_len in context_lengths:
        char_budget = ctx_len * chars_per_token
        if len(haystack) < char_budget:
            raise ValueError(f"haystack too short for a {ctx_len}-token context")
        for pos in positions:
            # Insert needle at the given fraction of the context
            insert_at = int(char_budget * pos)
            ctx = haystack[:insert_at] + " " + needle + " " + haystack[insert_at:char_budget]
            response = model_fn(
                system="Answer only from the provided context.",
                user=f"{ctx}\n\nWhat is the secret code?",
            )
            results[(ctx_len, pos)] = "PURPLE-42" in response
    return results  # True/False grid: length × position

∑ Chapter 06 — Key Takeaways

  • Of the models compared above, Gemini 1.5 Pro (1M) is the longest-context option; Claude 3.5 Sonnet (200K) has the best long-context quality/cost balance for most production use
  • Advertised context ≠ effective context: run needle-in-a-haystack tests at your target lengths before committing
  • RoPE (Llama, Mistral) can be extended but degrades beyond training range; API models (GPT-4o, Claude, Gemini) have handled this internally
  • Use long context when synthesis across the whole document is required; use RAG when only a subset is relevant
  • Long context has a cost tax: at GPT-4o's $2.50/1M input rate, 64K tokens = $0.16/query and 200K tokens = $0.50/query, which is unsustainable for high-volume use
  • Hybrid pattern: RAG to identify relevant sections → long-context model to synthesize within those sections
07
Chapter 07 · Caching
Context Caching: Reusing Prefixes for Cost and Latency Savings

Context caching is one of the highest-ROI optimizations in LLM engineering. If the same prefix appears in multiple requests, you pay to process it once and reuse the KV cache. Savings of 50–90% on input token costs are achievable for the right workloads.

When an LLM processes a prompt, it computes key-value (KV) pairs for every token in the attention layers. This computation is expensive and grows with context length. Prefix caching stores these pre-computed KV pairs, so if the same prefix appears in the next request, the model skips recomputing it entirely.

[Figure: Without vs with prefix caching. Without caching, the 8K-token system prompt is computed in full for Query 1 and again for Query 2, so you pay for it twice. With prefix caching, the system prompt is computed and stored in the KV cache once; Query 2 is a cache hit and only its own tokens are computed.]
The Key Requirement

For caching to work, the prefix must be identical byte-for-byte across requests: same characters, same whitespace, same order. A single differing token ends the match, and everything after it is a cache miss. This is why stable, front-loaded prefixes are the design pattern for cacheable prompts.
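
A quick way to see the effect is to measure how much of two prompts is shared from the start. The sketch below uses whitespace-separated words as a stand-in for model tokens (an assumption; providers match in their own token units), and `shared_prefix_len` is a hypothetical helper.

```python
def shared_prefix_len(a: list[str], b: list[str]) -> int:
    """Number of leading tokens that are identical in both sequences."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


# Date early in the prompt: the match ends at the date.
early_1 = "You are a helpful assistant. Today is 2025-01-01. Answer briefly.".split()
early_2 = "You are a helpful assistant. Today is 2025-01-02. Answer briefly.".split()

# Date moved to the end: everything before it is shared.
late_1 = "You are a helpful assistant. Answer briefly. Today is 2025-01-01.".split()
late_2 = "You are a helpful assistant. Answer briefly. Today is 2025-01-02.".split()

early_shared = shared_prefix_len(early_1, early_2)
late_shared = shared_prefix_len(late_1, late_2)
```

The same ten words yield a shared prefix of 7 or 9 depending only on where the changing value sits; in a real prompt with thousands of static tokens, that placement decides whether they are cacheable at all.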

| Provider | Cache Type | Cached Token Cost | Min Cacheable Prefix | TTL |
| --- | --- | --- | --- | --- |
| OpenAI (gpt-4o) | Automatic prompt caching | 50% off input tokens | 1,024 tokens | Up to ~1 hour (auto-evicted; often sooner when idle) |
| Anthropic (Claude) | Explicit cache_control markers | ~90% off input (write once, read many) | 1,024 tokens | 5 min (ephemeral) / manual |
| Google (Gemini) | Explicit cached_content API | ~75% off input tokens | 32,768 tokens | Configurable (up to hours) |
| Self-hosted (vLLM) | Automatic prefix caching (--enable-prefix-caching) | GPU compute saved (no KV recompute) | Any prefix length | Memory-bound (in-flight) |
| Self-hosted (SGLang) | RadixAttention: tree-based KV sharing | Highest hit rate for branching prompts | Any prefix | Memory-bound |

OpenAI caches prompts automatically, with no code changes required. Any prompt prefix of 1,024+ tokens that is reused before the cache entry is evicted (at most ~1 hour) is billed at a 50% discount. The response's usage.prompt_tokens_details.cached_tokens field reports how many input tokens were served from cache.

🔧
Checking cache hits (OpenAI)
from openai import OpenAI

client = OpenAI()

long_system_prompt = "..."  # placeholder: your stable instructions (1K+ tokens)
user_query = "..."          # placeholder: the per-request question

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": long_system_prompt},  # 1K+ tokens
        {"role": "user", "content": user_query},
    ],
)

usage = response.usage
cached = usage.prompt_tokens_details.cached_tokens
print(f"Total input tokens : {usage.prompt_tokens}")
print(f"Cached tokens      : {cached}")
print(f"Cache hit rate     : {cached / usage.prompt_tokens:.1%}")

# Design for caching: stable prefix first, dynamic content last
# Cached tokens billed at $1.25/1M (vs $2.50/1M full price)

Claude's caching is explicit: you mark exactly which parts of the prompt to cache using cache_control breakpoints. The first request writes the cache (at a slight cost premium); subsequent requests read it at a ~90% discount.

🔧
Claude cache_control usage
import anthropic

client = anthropic.Anthropic()

large_document = "..."  # placeholder: the static document to cache (10K+ tokens)
user_question = "..."   # placeholder: the per-request question

response = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": large_document,  # 10K+ tokens: cache this
            "cache_control": {"type": "ephemeral"},
        },
        {
            "type": "text",
            "text": "Answer questions based only on the document above."
            # This part is NOT cached: it changes per use case
        },
    ],
    messages=[{"role": "user", "content": user_question}],
)

usage = response.usage
print(f"Cache write tokens : {usage.cache_creation_input_tokens}")
print(f"Cache read tokens  : {usage.cache_read_input_tokens}")
# Subsequent calls: 90% discount on the cached part

Prompt structure determines cache hit rate. Small changes to prompt ordering or dynamic content placement can mean the difference between 90% hit rate and 0%.

❌
Cache-Unfriendly Pattern
System: You are a helpful assistant.
Today is {current_date}.    ← CHANGES EVERY DAY
User: {question}
Context: {retrieved_docs}   ← CHANGES PER QUERY

The date is in the prefix, so every new day invalidates the entire cache and nothing after the date is ever reusable across days.

✅
Cache-Friendly Pattern
System: You are a helpful assistant.   ← STABLE PREFIX
[All static instructions here]         ← STABLE PREFIX
--- CONTEXT ---
{retrieved_docs}                       ← per-query, AFTER stable
--- QUERY ---
{question}                             ← per-query, at end
Note: Today is {current_date}.         ← dynamic, keep LAST

The stable prefix stays identical, giving a high cache hit rate for all static content.

| Dynamic Element | Where to Place | Why |
| --- | --- | --- |
| Static instructions | System prompt, first | Always cached after first call |
| Large static documents | System prompt or early user turn | Highest value to cache (most tokens) |
| Retrieved context | After stable prefix | Changes per query; prefix still cached |
| User query | Last in user turn | Always unique; keep after everything stable |
| Current date/time | Last, after all stable content | Invalidates cache if in prefix |
| Session/user ID | Never in cacheable prefix | Makes every request unique, so zero hits |
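
The placement rules in the table can be enforced by building prompts through a single function. This is a minimal sketch: `STATIC_INSTRUCTIONS` and `build_messages` are hypothetical names, and the delimiter strings follow the cache-friendly pattern shown earlier.

```python
STATIC_INSTRUCTIONS = (
    "You are a helpful assistant.\n"
    "Answer only from the provided context."
)  # never interpolate dates, user IDs, or session IDs into this string


def build_messages(retrieved_docs: list[str], question: str, current_date: str) -> list[dict]:
    """Stable content first, per-request content last."""
    dynamic = (
        "--- CONTEXT ---\n"
        + "\n\n".join(retrieved_docs)
        + "\n--- QUERY ---\n"
        + question
        + f"\nNote: Today is {current_date}."
    )
    return [
        {"role": "system", "content": STATIC_INSTRUCTIONS},  # identical on every request
        {"role": "user", "content": dynamic},                # changes per request
    ]


monday = build_messages(["Refund window: 30 days."], "Can I return this?", "2025-01-06")
tuesday = build_messages(["Shipping takes 5 days."], "When will it arrive?", "2025-01-07")
```

Because every dynamic value enters through `build_messages` arguments, a date or user ID cannot leak into the cacheable prefix by accident.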
🔧
Cache ROI calculator
def cache_roi(
    daily_requests: int,
    avg_input_tokens: int,
    stable_prefix_tokens: int,   # tokens that stay the same across requests
    cost_per_1m_full: float,     # e.g. 2.50 for gpt-4o
    cost_per_1m_cached: float,   # e.g. 1.25 for gpt-4o
    hit_rate: float = 0.85,      # expected cache hit rate
) -> dict:
    dynamic_tokens = avg_input_tokens - stable_prefix_tokens

    # Without caching
    cost_no_cache = (daily_requests * avg_input_tokens / 1_000_000) * cost_per_1m_full

    # With caching (prefix billed at the reduced rate on hits)
    prefix_cached = daily_requests * hit_rate * stable_prefix_tokens
    prefix_full = daily_requests * (1 - hit_rate) * stable_prefix_tokens
    dynamic_tokens_total = daily_requests * dynamic_tokens
    cost_with_cache = (
        (prefix_cached / 1_000_000) * cost_per_1m_cached
        + (prefix_full + dynamic_tokens_total) / 1_000_000 * cost_per_1m_full
    )

    savings = cost_no_cache - cost_with_cache
    return {
        "daily_cost_no_cache": cost_no_cache,
        "daily_cost_cached": cost_with_cache,
        "daily_savings": savings,
        "monthly_savings": savings * 30,
    }

# Example: 10K requests/day, 8K avg input, 6K stable prefix
roi = cache_roi(10000, 8000, 6000, 2.50, 1.25)
# Daily cost: $200.00 without caching vs $136.25 with caching
# Monthly savings ≈ $1,912 on a ≈ $6,000/month bill

∑ Chapter 07 — Key Takeaways

  • Prefix caching reuses pre-computed KV pairs: 50–90% input token cost reduction for cacheable workloads
  • OpenAI: automatic, no code changes; requires a 1,024+ token prefix, ~1hr maximum TTL, 50% discount
  • Anthropic: explicit cache_control markers; up to 90% discount, use for large static documents
  • Gemini: explicit cached_content API, 32K+ min tokens; best for very large stable content
  • Design rule: stable content first, dynamic content last; never put dates/user IDs in the cacheable prefix
  • For self-hosted: enable --enable-prefix-caching in vLLM or use SGLang's RadixAttention for branching prompts

Not all context is equally cache-worthy. The value of caching a piece of context depends on how frequently it's reused, how expensive it is to recompute, and how stable it is over time.

| What to Cache | Reuse Frequency | Benefit | Stability |
| --- | --- | --- | --- |
| Formatted retrieved chunks | High (same query pattern) | Eliminates retrieval + formatting cost | Hours–days |
| Compressed document summaries | Very high (per document) | Eliminates compression LLM call | Days–weeks |
| System prompt | Every request | Provider prefix caching (50–90% discount) | Weeks–months |
| User preference context | Per user session | Eliminates DB lookup and formatting | Minutes–hours |
| Static knowledge base sections | High (shared across users) | Serve from cache, skip retrieval | Days |
| Assembled context for top queries | Very high (80/20 rule) | Full pipeline bypass for hot queries | Minutes–hours |
Caching Improves Consistency, Not Just Cost

Cached context is deterministic: the same pre-formatted, pre-compressed chunk is returned every time. This improves answer consistency across sessions. Without caching, minor variations in retrieval scores or compression outputs can cause the same query to produce different context, and different answers, across requests. Caching is both a cost lever and a reliability lever.
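
A minimal sketch of an application-level cache for assembled context, with a TTL to match the stability column above. The `ContextCache` class, its in-memory dict, and the query normalization are illustrative assumptions; a production system would more likely use a shared store.

```python
import hashlib
import time


class ContextCache:
    """In-memory TTL cache keyed by a normalized query string."""

    def __init__(self, ttl_seconds: float, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self._store: dict[str, tuple[float, str]] = {}

    @staticmethod
    def key_for(query: str) -> str:
        # Normalize case and whitespace so trivially different queries share an entry.
        normalized = " ".join(query.lower().split())
        return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

    def get_or_build(self, query: str, build_fn) -> str:
        """Return cached context if still fresh, otherwise build and store it."""
        key = self.key_for(query)
        now = self.clock()
        entry = self._store.get(key)
        if entry is not None and now - entry[0] < self.ttl:
            return entry[1]
        context = build_fn(query)
        self._store[key] = (now, context)
        return context


calls = []

def assemble(query: str) -> str:
    """Stand-in for the full retrieve + compress + format pipeline."""
    calls.append(query)
    return f"context for: {query}"


cache = ContextCache(ttl_seconds=3600)
first = cache.get_or_build("What is the refund policy?", assemble)
second = cache.get_or_build("what is  the refund policy?", assemble)
```

The second call returns exactly the string produced by the first, which is the consistency benefit described above, and the pipeline runs only once.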

08
Chapter 08 · Multi-Document
Multi-Document Context: Synthesizing Across Multiple Sources

Most real-world queries require synthesizing across multiple documents or sources. How you rank, present, and delimit multiple documents determines whether the model synthesizes them correctly or instead confuses, ignores, or contradicts them.

🔀
Source Conflation

Model blends information from different documents into a single "answer," losing attribution of which source said what.

Fix: Explicit delimiters + citation instructions

🙈
Source Neglect

Model answers from the first one or two documents, ignoring others entirely. Lost-in-the-middle at document level.

Fix: Fewer docs + sandwich ordering + explicit "use all sources" instruction

⚔️
Conflict Blindness

Documents contradict each other; model picks one without noting the contradiction.

Fix: Explicit conflict detection prompt + "note disagreements" instruction

📅
Recency Blindness

All documents treated as equally current. An outdated doc overrides a newer one.

Fix: Inject timestamps; instruct model to prefer recent sources

🏷️
Source Mislabeling

Model cites "document 2" when the fact came from "document 4." Especially common with 5+ documents.

Fix: Unique, memorable source IDs (not just numbers)

When you have multiple retrieved documents, their order in the context window affects which ones the model uses. Rank-aware ordering is different from simple relevance sorting.

| Ranking Signal | What It Is | When to Use |
| --- | --- | --- |
| Relevance Score | Cosine similarity or BM25 score to query | Default: most relevant first |
| Recency | Document timestamp or last-updated date | News, policies, product docs that change |
| Authority | Source type (official docs > forum post) or domain weight | Knowledge bases with mixed source quality |
| Re-rank Score | Cross-encoder score (Cohere Rerank, BGE reranker) | High-stakes retrieval; worth the extra latency |
| Diversity | MMR (Maximal Marginal Relevance): relevance minus redundancy | When top chunks are near-duplicates of each other |
🔧
MMR document ranking (diversity-aware)
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-small-en-v1.5")

def mmr_rank(
    query: str,
    docs: list[str],
    k: int = 5,
    lambda_: float = 0.5,   # 0 = max diversity, 1 = max relevance
) -> list[int]:
    """Return indices of top-k docs via Maximal Marginal Relevance."""
    embeddings = model.encode([query] + docs, normalize_embeddings=True)
    q_emb, d_embs = embeddings[0], embeddings[1:]
    relevance = d_embs @ q_emb  # similarity to query

    selected, remaining = [], list(range(len(docs)))
    while len(selected) < k and remaining:
        if not selected:
            # First: pick the most relevant document
            best = max(remaining, key=lambda i: relevance[i])
        else:
            # MMR: balance relevance against similarity to already-selected docs
            sel_embs = d_embs[selected]
            mmr_scores = [
                lambda_ * relevance[i]
                - (1 - lambda_) * float((d_embs[i] @ sel_embs.T).max())
                for i in remaining
            ]
            best = remaining[int(np.argmax(mmr_scores))]
        selected.append(best)
        remaining.remove(best)
    return selected
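
Once documents are ranked, the sandwich ordering mentioned as a fix for source neglect can be applied as a final step. A minimal sketch, assuming the input list is already sorted best-first; `sandwich_order` is an illustrative name.

```python
def sandwich_order(ranked_docs: list[str]) -> list[str]:
    """Place the strongest documents at the start and end of the context
    and the weakest in the middle. Input must be sorted best-first."""
    front, back = [], []
    for i, doc in enumerate(ranked_docs):
        # Alternate: ranks 1, 3, 5... fill from the front; ranks 2, 4... from the back.
        (front if i % 2 == 0 else back).append(doc)
    return front + back[::-1]


ordered = sandwich_order(["d1", "d2", "d3", "d4", "d5"])
```

Here the best document opens the context, the second-best closes it, and the lowest-ranked one lands in the middle, where position bias hurts least.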

With multiple documents, clear formatting is critical. Labels must be unambiguous, metadata must be useful, and delimiters must prevent content bleeding between sources.

✅
Rich multi-document format (recommended)
<documents>
<doc id="refund-policy" source="internal/policies/refund-v3.pdf" date="2025-01-15" authority="official">
Customers may return items within 30 days of purchase. A valid receipt is required. Opened items are not eligible.
</doc>
<doc id="cs-faq" source="support/faq.md" date="2024-08-10" authority="support">
Q: Can I return without receipt? A: No. A receipt is required for all returns.
</doc>
<doc id="forum-post" source="community/post-4421" date="2023-05-02" authority="user">
I returned without a receipt and they were fine with it.
</doc>
</documents>

Answer the question using the documents above. Cite sources using [doc id] notation.
If documents conflict, note the disagreement and prefer higher-authority, more recent sources.
Authority Metadata Is Powerful

Including authority="official" vs authority="user" and document dates lets you instruct the model to resolve conflicts by authority and recency. Without this metadata, the model has no principled basis for choosing between contradictory sources.
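
A small sketch of rendering retrieved documents into the tagged format above. The dictionary keys mirror the attributes in the example; `format_documents` is a hypothetical helper, and escaping is handled by the standard library so document text cannot break the delimiters.

```python
from xml.sax.saxutils import escape, quoteattr


def format_documents(docs: list[dict]) -> str:
    """Render docs as <doc> blocks with id, source, date, and authority attributes."""
    parts = ["<documents>"]
    for doc in docs:
        attrs = " ".join(
            f"{name}={quoteattr(str(doc[name]))}"
            for name in ("id", "source", "date", "authority")
        )
        # escape() protects the delimiters if the text contains <, >, or &
        parts.append(f"<doc {attrs}>\n{escape(doc['text'])}\n</doc>")
    parts.append("</documents>")
    return "\n".join(parts)


rendered = format_documents([
    {
        "id": "refund-policy",
        "source": "internal/policies/refund-v3.pdf",
        "date": "2025-01-15",
        "authority": "official",
        "text": "Returns & exchanges are accepted within 30 days.",
    },
])
```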

Documents often contradict each other, especially across time (policy updated) or authority level (official docs vs user reports). Without explicit conflict handling, models pick arbitrarily.

📋
Conflict-Aware System Prompt
When documents contradict each other:
1. Note the contradiction explicitly
2. Prefer official/authoritative sources over community/user sources
3. Prefer more recent dates over older
4. If unresolvable, present both views and ask the user to clarify

Format: "According to [official source], X is the case. Note: [community source] states Y, but this may be outdated."
🎯
Conflict Detection Prompt
Before answering, check:
- Do any documents disagree with each other?
- Are any documents likely outdated (old date)?
- Is there uncertainty in the sources?

If yes: state the conflict, your resolution logic, and your confidence level.
If no conflicts: answer directly.

Some answers require combining information from multiple documents because no single source is complete. Synthesis prompts encourage the model to explicitly integrate rather than just retrieve.

Retrieval Prompt (bad synthesis)

"Answer based on the documents provided."

Result: model picks the most relevant single document and answers from it, ignoring complementary information in others.

Synthesis Prompt (good synthesis)

"Synthesize a complete answer by drawing from ALL provided documents. Identify which aspects each document contributes. Note if any document provides unique information not found in others."

Result: model explicitly combines across sources.

The 3+ Document Degradation Cliff

Quality of multi-document synthesis degrades significantly beyond 3–5 documents for most models. Each additional document increases the probability of source neglect, conflation, or mislabeling. If you need 10 documents, consider a two-pass approach: first pass summarizes each document independently; second pass synthesizes the summaries. This is cheaper, more reliable, and scales better than a single context with 10 documents.

📄 Doc 1 → Summary (independent extraction)
📄 Doc 2 → Summary (independent extraction)
📄 Doc N → Summary (independent extraction)
🔗 Synthesize: all summaries → final answer
🔧
Map-reduce document synthesis
import asyncio

async def map_reduce_synthesis(
    documents: list[dict],
    query: str,
    llm_fn,
) -> str:
    # MAP: extract relevant info from each doc independently
    MAP_PROMPT = """From the document below, extract ONLY information relevant to:
{query}

If nothing is relevant, respond: "No relevant information."
Be concise. Preserve exact numbers, dates, and names.

Document [{doc_id}]:
{content}"""

    extractions = await asyncio.gather(*[
        llm_fn(MAP_PROMPT.format(
            query=query, doc_id=doc["id"], content=doc["text"]
        ))
        for doc in documents
    ])

    # Filter empty extractions
    useful = [
        f"[{doc['id']}]: {ext}"
        for doc, ext in zip(documents, extractions)
        if "No relevant" not in ext
    ]
    if not useful:
        # Nothing to synthesize; skip the second LLM call
        return "No relevant information found in the provided documents."

    # REDUCE: synthesize all extractions into the final answer
    REDUCE_PROMPT = """Synthesize a complete answer to: {query}

Using these extracted facts from multiple sources:
{facts}

Cite each fact with its source ID. Note any conflicts."""

    return await llm_fn(REDUCE_PROMPT.format(
        query=query, facts="\n\n".join(useful)
    ))

∑ Chapter 08 — Key Takeaways

  • Five multi-doc failure modes: conflation, neglect, conflict blindness, recency blindness, mislabeling; each requires a specific fix
  • Use rich metadata (source ID, date, authority level) in document tags; it lets the model resolve conflicts by authority and recency
  • MMR ranking balances relevance and diversity, preventing top-k from returning near-duplicate chunks
  • Explicit conflict resolution instructions: prefer official over user, recent over old, note contradictions explicitly
  • Quality degrades beyond 3–5 documents; for 10+ docs, use the map-reduce (two-pass) pattern: extract per doc, then synthesize
  • Synthesis prompts outperform retrieval prompts: "synthesize from all" vs "answer based on documents" produces meaningfully different results
09
Chapter 09 · Evaluation
Context Quality Metrics: Measuring Effectiveness

You can't improve what you don't measure. Context quality is the invisible variable that determines whether your LLM application works in production, and most teams only discover problems when users complain. This chapter defines the metric stack that tells you exactly where context is failing.

🎯
Relevance

Does the context contain information that answers the query? High relevance = low noise. Measured per-chunk and at the context level.

Metric: relevance score (0–1); % of chunks used in the answer

Faithfulness

Does the model's response stay grounded in the provided context? Low faithfulness = hallucination even when context is good.

Metric: RAGAS faithfulness; claim verification rate

Coverage

Does the context include all facts needed to answer completely? Missing a key piece forces the model to hallucinate or hedge.

Metric: answer completeness; recall@k vs gold answer

⚡
Efficiency

How much of the context window is actually useful? Token waste = higher cost + higher latency + more noise for the model.

Metric: utilization ratio; noise fraction; token cost per query

Retrieval metrics measure the quality of what you put into context, before the model sees it. These are fast, cheap, and deterministic.

| Metric | What It Measures | How to Compute | Target |
| --- | --- | --- | --- |
| Precision@k | Fraction of top-k retrieved chunks that are relevant | Manual labels or LLM judge on sample | >0.7 |
| Recall@k | Fraction of all relevant chunks that appear in top-k | Requires ground-truth relevant set | >0.8 for factual QA |
| MRR | Mean Reciprocal Rank: how early is the first relevant result? | avg(1/rank of first relevant chunk) | >0.6 |
| NDCG@k | Normalized Discounted Cumulative Gain: graded relevance, rank-aware | Relevance labels (0/1/2) + DCG formula | >0.75 |
| Context Utilization | % of retrieved chunks cited or used in final answer | LLM judge: "which chunks did the model actually use?" | >50% (low means too much noise) |
| Noise Fraction | % of context tokens that are irrelevant to the query | LLM relevance scorer per chunk | <30% (lower is better) |
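The rank-based rows of the table can be computed in a few lines once you have labels. The sketch below is illustrative rather than a reference implementation: the chunk IDs, the function names, and the choice of linear gain in DCG (some implementations use 2^gain - 1) are all assumptions.

```python
import math


def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the top-k retrieved chunk IDs that are relevant."""
    if k <= 0:
        return 0.0
    return sum(1 for c in retrieved[:k] if c in relevant) / k


def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of all relevant chunk IDs that appear in the top-k."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def reciprocal_rank(retrieved: list[str], relevant: set[str]) -> float:
    """1/rank of the first relevant chunk; average over queries to get MRR."""
    for rank, chunk_id in enumerate(retrieved, start=1):
        if chunk_id in relevant:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(retrieved: list[str], gains: dict[str, int], k: int) -> float:
    """NDCG with linear gain; `gains` maps chunk ID to a 0/1/2 label."""
    def dcg(values: list[int]) -> float:
        return sum(g / math.log2(i + 2) for i, g in enumerate(values))

    actual = dcg([gains.get(c, 0) for c in retrieved[:k]])
    ideal = dcg(sorted(gains.values(), reverse=True)[:k])
    return actual / ideal if ideal > 0 else 0.0


# Hypothetical labels for one query
retrieved = ["c3", "c1", "c7", "c2"]
gains = {"c1": 2, "c2": 1, "c5": 2}
relevant = set(gains)

print(precision_at_k(retrieved, relevant, k=3))
print(recall_at_k(retrieved, relevant, k=3))
print(reciprocal_rank(retrieved, relevant))
print(ndcg_at_k(retrieved, gains, k=4))
```

Run these per query and average across your labeled set; a single query tells you very little.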

RAGAS (Retrieval-Augmented Generation Assessment) provides four core metrics that together cover the full RAG quality surface:

Faithfulness

Are all claims in the answer supported by the context? Breaks the answer into atomic claims and verifies each against retrieved chunks.

score = verified_claims / total_claims
🎯
Answer Relevancy

Is the answer actually addressing the question asked? Generates back-questions from the answer and measures alignment with the original.

score = cosine_sim(generated_Qs, original_Q)
Context Precision

Are the retrieved chunks actually useful for generating the answer? Measures signal-to-noise in the context window.

score ≈ useful_chunks / total_chunks (simplified; RAGAS also rewards ranking useful chunks first)
Context Recall

Does the retrieved context contain all the information needed to answer? Measures coverage relative to ground-truth answer.

score = covered_claims / total_claims_in_GT
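RAGAS evaluation pipeline
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (
    answer_relevancy,
    context_precision,
    context_recall,
    faithfulness,
)

# One evaluation sample: each key is a column, each list index is a row.
# These column names follow the classic RAGAS schema; other releases may
# expect different names, so check the version you have installed.
data = {
    "question": ["What is the return policy?"],
    "answer": ["30 days for unused items."],
    "contexts": [["Returns accepted within 30 days..."]],
    "ground_truth": ["Items can be returned within 30 days if unused."],
}
dataset = Dataset.from_dict(data)

# evaluate() calls an LLM and an embedding model, so credentials for the
# configured provider must be available in the environment.
result = evaluate(
    dataset,
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
)
print(result.to_pandas())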

For production systems at scale, you can't manually review every retrieval result. LLM-as-judge provides automated, scalable evaluation that correlates well with human judgments.

Chunk relevance scorer (LLM-as-judge)
import asyncio

RELEVANCE_PROMPT = """Rate the relevance of this context chunk to the query.

Query: {query}
Chunk: {chunk}

Score 0-3:
0 = Completely irrelevant
1 = Tangentially related
2 = Partially relevant
3 = Directly answers the query

Respond with ONLY the number."""


async def score_chunk_relevance(query: str, chunk: str, llm) -> int:
    # `llm` is any async callable that takes a prompt and returns text
    response = await llm(RELEVANCE_PROMPT.format(query=query, chunk=chunk))
    try:
        score = int(response.strip())
    except ValueError:
        return 0  # Unparseable reply: treat the chunk as irrelevant
    return min(max(score, 0), 3)  # Clamp to the 0-3 scale


async def evaluate_context(query: str, chunks: list[str], llm) -> dict:
    if not chunks:
        # Nothing retrieved: avoid dividing by zero below
        return {"mean_relevance": 0.0, "noise_fraction": 0.0, "high_quality_chunks": 0.0}
    scores = await asyncio.gather(*[
        score_chunk_relevance(query, c, llm) for c in chunks
    ])
    return {
        "mean_relevance": sum(scores) / (len(scores) * 3),  # Normalized to 0-1
        "noise_fraction": scores.count(0) / len(scores),
        "high_quality_chunks": sum(1 for s in scores if s >= 2) / len(scores),
    }

Faithfulness measures whether the model's answer is grounded in the context. This is your primary hallucination detector for RAG systems.

Faithfulness Evaluation Pattern

Step 1: Claim Extraction. Break the model's answer into atomic, verifiable claims. Each claim should be a single, unambiguous statement.

Step 2: Claim Verification. For each claim, check whether it is supported by, contradicted by, or absent from the retrieved context.

Step 3: Score Computation. Faithfulness = supported_claims / total_claims. A score below 0.8 indicates significant hallucination risk.

Faithfulness checker
import asyncio
import json

CLAIM_EXTRACTOR = """Extract all factual claims from this answer as a JSON list.
Each claim must be a single, atomic statement.

Answer: {answer}

Return: ["claim1", "claim2", ...]"""

CLAIM_VERIFIER = """Does the provided context support this claim?

Context: {context}
Claim: {claim}

Answer ONLY: SUPPORTED / CONTRADICTED / NOT_IN_CONTEXT"""


async def check_faithfulness(answer: str, context: str, llm) -> dict:
    # `llm` is any async callable that takes a prompt and returns text
    claims_json = await llm(CLAIM_EXTRACTOR.format(answer=answer))
    claims = json.loads(claims_json)  # Raises if the model returns non-JSON
    if not claims:
        # No verifiable claims, so there is nothing to score
        return {"faithfulness": None, "total_claims": 0, "supported": 0, "hallucinated": 0}
    verdicts = await asyncio.gather(*[
        llm(CLAIM_VERIFIER.format(context=context, claim=c)) for c in claims
    ])
    # Prefix match so a stray reply such as "UNSUPPORTED" is not counted
    supported = sum(1 for v in verdicts if v.strip().upper().startswith("SUPPORTED"))
    return {
        "faithfulness": supported / len(claims),
        "total_claims": len(claims),
        "supported": supported,
        "hallucinated": len(claims) - supported,  # Contradicted or not in context
    }

In production you need continuous metric tracking, not just offline eval. Log the key signals with every request and aggregate them into a live dashboard.

| Metric | Collection Method | Alert Threshold |
| --- | --- | --- |
| Context Relevance (mean) | LLM scorer on sampled requests (5–10%) | <0.6 |
| Faithfulness | Async faithfulness check post-response | <0.75 |
| Context Utilization | Citation extraction from response | <0.4 (too many irrelevant chunks) |
| Tokens per Query | LLM usage logs | >2× baseline (context bloat) |
| Answer Latency p95 | Request timing | >5s (retrieval or context issues) |
| User Feedback Rate | Thumbs up/down or follow-up question rate | Downvote rate >15% |
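One way to turn the alert thresholds into something a dashboard job can run is a small lookup of limits and directions. This is a sketch only: the metric keys and the `breached` helper are hypothetical names, and the aggregation window is left to your metrics store.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Threshold:
    limit: float
    alert_when: str  # "below" or "above"


# Limits taken from the table above; the metric keys are illustrative.
THRESHOLDS = {
    "context_relevance": Threshold(0.6, "below"),
    "faithfulness": Threshold(0.75, "below"),
    "context_utilization": Threshold(0.4, "below"),
    "tokens_vs_baseline": Threshold(2.0, "above"),
    "latency_p95_s": Threshold(5.0, "above"),
    "downvote_rate": Threshold(0.15, "above"),
}


def breached(metrics: dict[str, float]) -> list[str]:
    """Return the names of metrics that crossed their alert threshold."""
    alerts = []
    for name, value in metrics.items():
        threshold = THRESHOLDS.get(name)
        if threshold is None:
            continue  # Unknown metric: nothing to compare against
        too_low = threshold.alert_when == "below" and value < threshold.limit
        too_high = threshold.alert_when == "above" and value > threshold.limit
        if too_low or too_high:
            alerts.append(name)
    return sorted(alerts)


window = {"faithfulness": 0.70, "latency_p95_s": 3.2, "downvote_rate": 0.20}
print(breached(window))
```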
The Eval-Prod Gap

Offline evaluation on a benchmark dataset rarely reflects production performance. Production queries have a different distribution, different lengths, and different failure modes. Always run online metrics (sampled LLM evaluation + user feedback) alongside offline benchmarks. A system that scores 0.9 on your eval set may score 0.65 in production on queries you didn't anticipate.

∑ Chapter 09 — Key Takeaways

  • Context quality has four pillars: relevance, faithfulness, coverage, efficiency. Measure all four, not just end-task accuracy
  • Use retrieval metrics (precision@k, recall@k, NDCG) as fast pre-LLM signals that catch retrieval failures before they reach the model
  • RAGAS is the standard framework: faithfulness, answer relevancy, context precision, context recall. Run it on every major change
  • LLM-as-judge scales evaluation to production: sample 5–10% of requests and score chunk relevance asynchronously
  • Faithfulness verification (claim extraction → claim verification) is your primary hallucination detector in RAG systems
  • Build a live metrics dashboard: context relevance, faithfulness, utilization, token cost, latency. Alert on degradation

Context quality must be evaluated, not assumed. A system that "seems to work" in manual testing can have systematic failure modes that only show up under controlled evaluation. Build a test harness that isolates context variables.

Ablation Tests: Chunk Impact

Remove individual chunks from the context and measure the change in answer accuracy. If removing a chunk doesn't change the answer, the chunk is wasted tokens. If removing it causes failure, it's critical.

Test: answer_quality(full_context) vs answer_quality(context - chunk_N)
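A leave-one-out loop is enough to sketch the ablation test. Here `answer_quality` is a placeholder for whatever scorer you trust (an LLM judge, exact match against a gold answer); the toy scorer below only checks for two required facts and exists purely to make the example runnable.

```python
from typing import Callable


def ablation_impact(
    chunks: list[str],
    answer_quality: Callable[[list[str]], float],
) -> list[tuple[int, float]]:
    """Score each chunk by how much quality drops when it is removed."""
    baseline = answer_quality(chunks)
    impacts = []
    for i in range(len(chunks)):
        reduced = chunks[:i] + chunks[i + 1:]
        impacts.append((i, baseline - answer_quality(reduced)))
    # Largest drop first: these are the critical chunks
    return sorted(impacts, key=lambda pair: pair[1], reverse=True)


def toy_quality(chunks: list[str]) -> float:
    """Hypothetical scorer: fraction of required facts present in context."""
    needed = ("30 days", "unused")
    text = " ".join(chunks)
    return sum(fact in text for fact in needed) / len(needed)


chunks = [
    "Returns accepted within 30 days.",
    "Our stores open at 9am.",
    "Items must be unused.",
]
for index, drop in ablation_impact(chunks, toy_quality):
    print(index, drop)
```

Chunks with an impact of zero are candidates for removal. With a real LLM scorer, repeat each measurement a few times, since a single run is noisy.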

Ordering Sensitivity Tests

Shuffle chunk order and measure how much answer quality varies. High variance = model is fragile to ordering. Low variance = ordering doesn't matter much for this query type.

Test 5 permutations; measure faithfulness variance across orderings.
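The permutation test can be sketched as follows. `score` stands in for your faithfulness scorer, and the fixed seed is an assumption made so that runs are repeatable.

```python
import random
import statistics
from typing import Callable


def ordering_sensitivity(
    chunks: list[str],
    score: Callable[[list[str]], float],
    n_permutations: int = 5,
    seed: int = 0,
) -> dict:
    """Score several random chunk orderings and summarize the spread."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_permutations):
        shuffled = chunks[:]  # Copy so the caller's list is untouched
        rng.shuffle(shuffled)
        scores.append(score(shuffled))
    return {
        "mean": statistics.mean(scores),
        "variance": statistics.pvariance(scores),
        "min": min(scores),
        "max": max(scores),
    }


def toy_score(order: list[str]) -> float:
    """Hypothetical scorer that favors the key chunk appearing first."""
    return 1.0 if order[0].startswith("KEY") else 0.6


report = ordering_sensitivity(["KEY fact", "filler A", "filler B"], toy_score)
print(report)
```

A wide gap between `min` and `max` is the signal to pin chunk ordering explicitly rather than relying on retrieval order.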

Compression Quality Tests

Compare answers produced from the original uncompressed context vs compressed context on your test set. If faithfulness drops more than 5% absolute, the compression ratio is too aggressive.

Target: <5% faithfulness drop at your target compression ratio.
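The 5% rule is simple enough to encode as a regression gate. This sketch assumes you already have paired faithfulness scores for the same test queries, one list from the uncompressed context and one from the compressed context; the function name is illustrative.

```python
def compression_check(
    original: list[float],
    compressed: list[float],
    max_drop: float = 0.05,
) -> dict:
    """Compare mean faithfulness before and after compression."""
    if not original or len(original) != len(compressed):
        raise ValueError("need paired, non-empty score lists")
    mean_original = sum(original) / len(original)
    mean_compressed = sum(compressed) / len(compressed)
    drop = mean_original - mean_compressed  # Absolute drop, not relative
    return {
        "mean_original": mean_original,
        "mean_compressed": mean_compressed,
        "drop": drop,
        "passes": drop <= max_drop,
    }


# Hypothetical scores for three test queries
print(compression_check([0.92, 0.88, 0.95], [0.90, 0.80, 0.93]))
```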

Token Efficiency Audit

For a sample of production queries, measure what fraction of context tokens were cited in the answer. Tokens not cited are wasted. A utilization below 40% signals a retrieval or selection problem.

Target: >50% of context tokens referenced in the final answer.

10
Chapter 10 · Production
Production Context Systems: Scale and Reliability

Context engineering in a notebook is easy. Context engineering at production scale, with real latency budgets, cost constraints, concurrent users, and cascading failures, is an entirely different discipline. This chapter is the full production playbook.

User Query: Parse intent + entities
Retrieval: Vector + keyword search
Re-Rank: Cross-encoder scoring
Compress: Filter + summarize
Assemble: System + history + chunks
LLM Call: With assembled context

Each stage has its own latency budget, failure mode, and optimization surface. Treat them as independent services with SLAs, not a single monolithic function.

| Stage | Typical Latency | Primary Optimization | Failure Mode |
| --- | --- | --- | --- |
| Query Parsing | 1–5ms | Pre-compiled regex; cached NLP models | Wrong intent extraction → wrong retrieval |
| Vector Retrieval | 10–50ms | ANN index (HNSW); GPU-accelerated | Index staleness; embedding model mismatch |
| Keyword Search | 5–20ms | Inverted index; field weighting | Sparse coverage on long-tail queries |
| Re-Ranking | 50–200ms | Async; cache popular queries | Latency spike; cross-encoder OOM |
| Compression | 100–500ms | Rule-based first; LLM only when needed | Over-compression loses key facts |
| LLM Inference | 500ms–5s | Streaming; prefix caching; batching | Timeout; context length exceeded |

Every millisecond of context construction latency adds directly to user-perceived response time. Parallelise all retrieval and processing steps that don't depend on each other.

Parallel context construction with timeout
import asyncio

# vector_search, bm25_search, get_recent_turns, get_user_preferences,
# get_system_context, deduplicate, rerank, assemble_context, log_metric and
# SYSTEM_PROMPT are assumed to be defined elsewhere in the application.


async def build_context_parallel(
    query: str,
    user_id: str,
    conversation_id: str,
    timeout_ms: int = 300,
) -> dict:
    timeout = timeout_ms / 1000

    async def with_timeout(name: str, coro):
        # Each source gets its own deadline and degrades to an empty result
        try:
            return await asyncio.wait_for(coro, timeout=timeout)
        except asyncio.TimeoutError:
            log_metric(f"context_timeout_{name}", 1)
            return []  # Degrade gracefully

    sources = {
        "vector": vector_search(query, k=8),
        "keyword": bm25_search(query, k=5),
        "user_history": get_recent_turns(conversation_id, n=5),
        "user_prefs": get_user_preferences(user_id),
        "system_state": get_system_context(),
    }

    # gather() schedules every source at once. Awaiting the coroutines one by
    # one in a loop would run them sequentially and add their latencies.
    values = await asyncio.gather(
        *(with_timeout(name, coro) for name, coro in sources.items())
    )
    results = dict(zip(sources, values))

    # Merge, deduplicate, and assemble
    chunks = deduplicate(results["vector"] + results["keyword"])
    chunks = rerank(query, chunks, k=5)
    return assemble_context(
        system_prompt=SYSTEM_PROMPT,
        history=results["user_history"],
        user_context=results["user_prefs"],
        retrieved_chunks=chunks,
        system_state=results["system_state"],
    )

At scale, context size is your primary cost driver. A system consuming 4,000 input tokens per request at $3/M tokens costs $0.012 per request; at 1M daily requests, that's $12,000/day just in input tokens.
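The arithmetic above is worth keeping as a small helper so you can rerun it whenever context size or pricing changes. The function name is illustrative, and the $3/M price is the example figure from the text, not a quote for any particular model.

```python
def daily_input_cost(
    tokens_per_request: int,
    price_per_million_usd: float,
    requests_per_day: int,
) -> float:
    """Daily spend on input tokens, in USD."""
    total_tokens = tokens_per_request * requests_per_day
    return total_tokens * price_per_million_usd / 1_000_000


baseline = daily_input_cost(4_000, 3.0, 1_000_000)
trimmed = daily_input_cost(2_000, 3.0, 1_000_000)  # Same traffic, half the context
print(f"baseline: ${baseline:,.0f}/day")
print(f"trimmed:  ${trimmed:,.0f}/day")
```

Because cost is linear in token count, halving the context halves the input bill at the same traffic.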

Context Tiering

Use small, cheap models (GPT-4o-mini, Haiku) for simple queries with short context. Route complex queries to large models. Saves 60–80% on most workloads.

Prefix Caching

Cache system prompts and static context with providers that support it (Anthropic, OpenAI). Cached prefix tokens are billed at a discount that varies by provider, up to roughly 10× cheaper than uncached input. Saves 20–40% for chat applications.

Aggressive Compression

Set hard token budgets per context section. Use extractive compression on retrieved chunks. Remove boilerplate from system prompts. Target <2,000 input tokens for simple Q&A.
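A hard per-section budget can be enforced with a few lines. This is a minimal sketch under loud assumptions: `count_tokens` is a whitespace stand-in for your model's real tokenizer, truncation keeps the start of each section, and the 300/500/1,200 split of a 2,000-token budget is hypothetical.

```python
def count_tokens(text: str) -> int:
    """Crude stand-in for a tokenizer: counts whitespace-separated words."""
    return len(text.split())


def truncate_to_budget(text: str, budget: int) -> str:
    """Keep at most `budget` tokens from the start of the text."""
    return " ".join(text.split()[:budget])


def enforce_budgets(sections: dict[str, str], budgets: dict[str, int]) -> dict[str, str]:
    """Apply a hard cap per section; sections without a budget are dropped."""
    return {
        name: truncate_to_budget(text, budgets[name])
        for name, text in sections.items()
        if name in budgets
    }


# Hypothetical split of a 2,000-token budget for simple Q&A
BUDGETS = {"system": 300, "history": 500, "chunks": 1200}

sections = {
    "system": "You are a support assistant. " * 100,
    "history": "user: where is my order? " * 20,
    "chunks": "Returns accepted within 30 days. " * 400,
}
trimmed = enforce_budgets(sections, BUDGETS)
print({name: count_tokens(text) for name, text in trimmed.items()})
```

In practice you would truncate history from the oldest turn and chunks from the lowest-ranked one, rather than cutting mid-text as this sketch does.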

| Strategy | Token Reduction | Quality Impact | Implementation Effort |
| --- | --- | --- | --- |
| Reduce k (fewer chunks) | 20–40% | Minimal if precision is high | Low |
| Extractive compression | 30–60% | Low (keeps key sentences) | Medium |
| History summarization | 40–70% | Moderate (may lose nuance) | Medium |
| Prefix caching | 10–30% cost | None (same tokens) | Low (provider feature) |
| Model routing | 50–80% cost | Depends on routing accuracy | High |
| Semantic deduplication | 10–25% | Positive (removes noise) | Medium |

Full observability means you can trace any production failure back to its root cause in the context pipeline: was it a bad retrieval, a compression error, a cache miss, or an LLM failure?

Context span tracing with OpenTelemetry
from opentelemetry import trace

# retrieve, compress, assemble and count_tokens are assumed to be defined
# elsewhere in the application.
tracer = trace.get_tracer("context-pipeline")


async def traced_context_build(query: str, **kwargs):
    # Spans opened with start_as_current_span record exceptions and mark
    # themselves as errored by default, so no manual status handling is needed.
    with tracer.start_as_current_span("context.build") as root:
        root.set_attribute("query.length", len(query))

        with tracer.start_as_current_span("context.retrieve") as span:
            chunks = await retrieve(query)
            span.set_attribute("chunks.count", len(chunks))
            span.set_attribute("chunks.total_tokens", count_tokens(chunks))

        with tracer.start_as_current_span("context.compress") as span:
            compressed = compress(chunks, budget=2000)
            before = count_tokens(chunks)
            after = count_tokens(compressed)
            span.set_attribute("tokens.before", before)
            span.set_attribute("tokens.after", after)
            # Guard against an empty retrieval result
            span.set_attribute("compression.ratio", after / before if before else 1.0)

        context = assemble(compressed, **kwargs)
        root.set_attribute("context.final_tokens", count_tokens(context))
        return context

Key signals to trace at every request:

Retrieval Span
  • Retrieval latency (ms)
  • Chunks retrieved (count)
  • Mean relevance score
  • Cache hit/miss
Assembly Span
  • Total tokens assembled
  • Tokens per section
  • Compression ratio
  • Truncation events
LLM Span
  • Input / output tokens
  • Time to first token
  • Total latency
  • Provider / model used
Retrieval Fallback

If vector search fails or returns low-confidence results, fall back to BM25 keyword search. If both fail, serve from a pre-built static context for the query category.

vector → bm25 → static_fallback
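The fallback chain can be sketched as an ordered list of retrievers tried in turn. Everything here is illustrative: the `Retriever` signature, the `min_score` cutoff, and the assumption that every backend returns scores normalized to the same 0-1 scale (raw BM25 scores are not, so normalize them first).

```python
from typing import Callable

# A retriever takes a query and returns (chunk, score) pairs
Retriever = Callable[[str], list[tuple[str, float]]]


def retrieve_with_fallback(
    query: str,
    retrievers: list[tuple[str, Retriever]],
    static_fallback: list[str],
    min_score: float = 0.5,
) -> tuple[str, list[str]]:
    """Try each retriever in order; return the first confident result."""
    for name, retriever in retrievers:
        try:
            hits = retriever(query)
        except Exception:
            continue  # Backend failed: move on to the next one
        confident = [chunk for chunk, score in hits if score >= min_score]
        if confident:
            return name, confident
    # Every backend failed or returned only low-confidence hits
    return "static_fallback", static_fallback


def broken_vector_search(query: str) -> list[tuple[str, float]]:
    raise ConnectionError("vector DB unreachable")


def toy_bm25(query: str) -> list[tuple[str, float]]:
    return [("Returns accepted within 30 days.", 0.8), ("Store hours: 9am-6pm.", 0.1)]


source, chunks = retrieve_with_fallback(
    "return policy",
    [("vector", broken_vector_search), ("bm25", toy_bm25)],
    static_fallback=["See our help center for policy details."],
)
print(source, chunks)
```

Returning the name of the source that answered makes it easy to log how often each fallback level is actually used.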
Context Overflow Guard

Always check token count before sending to the LLM. If the assembled context exceeds the model's limit, apply emergency compression: truncate history first, then reduce chunk count.

assert tokens <= model_limit * 0.9
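A bare `assert` stops the request; in production you usually want to shrink the context instead. The sketch below follows the order described above (history first, then chunks). It assumes a whitespace `count_tokens` stand-in, history ordered oldest first, and chunks ordered best first; all names are hypothetical.

```python
def count_tokens(text: str) -> int:
    """Crude stand-in for a tokenizer: counts whitespace-separated words."""
    return len(text.split())


def fit_to_limit(
    system: str,
    history: list[str],
    chunks: list[str],
    model_limit: int,
    headroom: float = 0.9,
    min_chunks: int = 1,
) -> tuple[list[str], list[str]]:
    """Drop oldest history, then lowest-ranked chunks, until the context fits."""
    budget = int(model_limit * headroom)
    history, chunks = list(history), list(chunks)  # Do not mutate the inputs

    def total() -> int:
        return count_tokens(system) + sum(map(count_tokens, history + chunks))

    while total() > budget and history:
        history.pop(0)  # Oldest turn goes first
    while total() > budget and len(chunks) > min_chunks:
        chunks.pop()  # Lowest-ranked chunk goes next
    if total() > budget:
        raise ValueError("context still exceeds the limit after emergency compression")
    return history, chunks


history, chunks = fit_to_limit(
    system="be helpful",
    history=["old turn here", "new turn here"],
    chunks=["top ranked chunk text", "low ranked chunk text"],
    model_limit=10,
)
print(history, chunks)
```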
⏱️
Timeout Budgets

Each pipeline stage gets a hard timeout. A slow re-ranker should not block the entire request. Degrade to fewer chunks rather than wait indefinitely.

rerank timeout: 150ms → skip if exceeded
Circuit Breaker

If a retrieval backend fails repeatedly (e.g., vector DB unreachable), open the circuit and serve from cache or static context rather than hammering the failing service.

5 failures / 10s → open circuit for 30s
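The rule above fits in a small class. This is a simplified sketch, not a production breaker: it has no half-open probing state, it is not thread-safe, and the class and method names are illustrative. The injectable `clock` exists so the behavior can be tested without sleeping.

```python
import time
from collections import deque
from typing import Callable, Optional


class CircuitBreaker:
    """Open after `max_failures` within `window_s`; stay open for `open_s`."""

    def __init__(
        self,
        max_failures: int = 5,
        window_s: float = 10.0,
        open_s: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.max_failures = max_failures
        self.window_s = window_s
        self.open_s = open_s
        self._clock = clock
        self._failures: deque = deque()
        self._opened_at: Optional[float] = None

    def allow(self) -> bool:
        """True if the backend may be called right now."""
        if self._opened_at is None:
            return True
        if self._clock() - self._opened_at >= self.open_s:
            # Cool-down elapsed: close the circuit and start fresh
            self._opened_at = None
            self._failures.clear()
            return True
        return False

    def record_failure(self) -> None:
        now = self._clock()
        self._failures.append(now)
        # Forget failures that fell out of the sliding window
        while self._failures and now - self._failures[0] > self.window_s:
            self._failures.popleft()
        if len(self._failures) >= self.max_failures:
            self._opened_at = now

    def record_success(self) -> None:
        self._failures.clear()


breaker = CircuitBreaker()
print(breaker.allow())
```

Wrap each backend call as `if breaker.allow(): ... else: serve from cache`, and call `record_failure` or `record_success` based on the outcome.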
| Scale Level | Architecture | Key Optimizations |
| --- | --- | --- |
| <100 RPS | Single service, async Python (FastAPI) | Async retrieval, prefix caching, response streaming |
| 100–1K RPS | Horizontal scaling + Redis cache | Semantic query caching, HNSW index on dedicated GPU, re-rank batching |
| 1K–10K RPS | Dedicated retrieval microservice + context assembly service | Read replicas, shard vector index, async evaluation pipeline |
| >10K RPS | Kafka-based pipeline, geo-distributed indexes, CDN for static context | Pre-computed context for top queries, speculative prefill, model replicas |
The 80/20 Query Distribution

In most production systems, 20% of distinct query patterns account for 80% of traffic. Pre-compute and cache context for your top query templates. This can reduce live retrieval load by 60–80%, dramatically improving p99 latency. Use semantic clustering to identify your top query templates from production logs.
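To check whether the 80/20 pattern holds for your traffic, count requests per query template and find the smallest set that covers your target share. The sketch assumes template labels already exist (for example from the semantic clustering step mentioned above); the template names and counts are made up.

```python
from collections import Counter


def templates_for_coverage(template_counts: Counter, target: float = 0.8) -> list[str]:
    """Smallest set of most-frequent templates covering `target` of traffic."""
    total = sum(template_counts.values())
    if total == 0:
        return []
    selected, covered = [], 0
    for template, count in template_counts.most_common():
        if covered / total >= target:
            break
        selected.append(template)
        covered += count
    return selected


# Hypothetical request counts per template from production logs
counts = Counter({
    "return_policy": 500,
    "order_status": 300,
    "shipping_cost": 100,
    "warranty": 60,
    "other": 40,
})
print(templates_for_coverage(counts))
```

The templates this returns are the ones worth pre-computing context for; everything else stays on the live retrieval path.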

Retrieval
  • Hybrid search (vector + BM25)
  • Re-ranking on top-k
  • Semantic deduplication
  • Retrieval fallback chain
  • Index freshness monitoring
Context Assembly
  • Token budget enforced per section
  • Overflow guard (assert <limit)
  • Extractive compression for long chunks
  • Conversation history summarization
  • Context template versioning
Performance
  • Parallel async retrieval
  • Timeout budgets per stage
  • Prefix caching enabled
  • Semantic query caching (Redis)
  • Streaming responses
Observability
  • Distributed tracing (OTEL)
  • Token usage per section logged
  • Relevance score sampled (5–10%)
  • Faithfulness check on samples
  • Alerts on degradation

∑ Chapter 10 — Key Takeaways

  • Treat the context pipeline as a microservice graph with independent latency budgets, SLAs, and failure modes per stage
  • Parallelise all retrieval: vector search, keyword search, history fetch, and user context should all run concurrently with per-stage timeouts
  • Cost management: model tiering + prefix caching + aggressive compression can reduce token spend by 60–80% vs a naïve implementation
  • Full observability requires distributed tracing at every pipeline stage (retrieval span, assembly span, LLM span), not just end-to-end latency
  • Reliability patterns: retrieval fallback chain, overflow guard, circuit breaker, and timeout-based degradation prevent single-component failures from cascading
  • The 80/20 query distribution is your biggest scaling lever: pre-compute context for top query templates to cut live retrieval load by 60–80%

In 2024–2026, the capability gap between frontier models has narrowed dramatically. GPT-4o, Claude 3.5 Sonnet, and Gemini 1.5 Pro are all excellent, and increasingly similar on most benchmarks. The infrastructure for calling them is standardized. What's left as the primary differentiator?

Construction Quality

How you select, order, format, and assemble context determines model grounding. Two teams using the same model get dramatically different results based on construction alone.

Signal Density

Teams that ruthlessly filter noise, compress aggressively, and enforce relevance thresholds see 2–4× quality improvements on the same model vs teams that dump raw retrieval results.

Systematic Optimization

Context engineering is the discipline of controlling what the model sees, not hoping it figures it out. Teams with systematic eval loops, compression pipelines, and cache architectures win.

The Production Engineering Mindset

The model is a fixed function. You cannot change what it knows or how it reasons. The only variable you control is the input. Every improvement in your LLM application (quality, cost, latency, reliability) comes from engineering better inputs. Context engineering is not a supporting discipline. It is the core discipline.

Focus on: control over what enters the context · cost awareness at every token · failure handling at every pipeline stage · systematic measurement of what's working. These four habits separate production-grade context systems from everything else.