Context Engineering

Building and optimizing context windows for LLM applications — construction, compression, windowing strategies, and production patterns.

Context is the bottleneck. How you construct, compress, and optimize context determines model quality, latency, and cost. This guide teaches the full spectrum of context engineering — from selection to compression to production caching.

Chapter 01 · Foundations

Context Fundamentals — How LLMs Use Context

The context window is not a container — it's a weighted attention field. What you put in it, where you put it, and how much of it is noise directly determines what the model can and cannot do with your input.

Context Is the Only Thing the Model Sees (Mental Model)

The model does not know what is relevant, what is correct, or what you intended. It only sees tokens. This single constraint has deep engineering implications:

🎯

Relevance must be engineered

The model cannot distinguish signal from noise. If irrelevant content is in the context, the model will attend to it. Relevance is your responsibility, not the model's.

📐

Ordering must be intentional

Token position influences attention weight. The model does not magically find the most important chunk — it follows position bias. Design your context layout deliberately.

📉

Noise directly reduces accuracy

Every irrelevant token competes with relevant ones for attention. More noise = more diluted signal. A smaller, high-quality context consistently outperforms a large, noisy one.

The Most Important Insight in Context Engineering

In most real-world systems, better context beats a better model. Switching from GPT-4o-mini to GPT-4o often gives you only a marginal improvement, while fixing your context construction can give a much larger one on the same model, on the order of 2–4× in some systems. Optimize context before upgrading the model.

Context Is the Primary Cost Driver

Cost scales directly with token count — and the context window is almost always the largest component. 5 large chunks → ~10K input tokens → 5× the cost of a 2K-token context. Every optimization that reduces context size compounds across every request: lower cost, lower latency, less noise, better answers.

What Is Context — Token Sequences and Attention (Foundation)

Every LLM receives a flat sequence of tokens — system prompt, conversation history, retrieved documents, tool results, and user input are all concatenated into a single integer array. The model attends across all of it simultaneously. There is no "memory" separate from this — the context window is the model's entire working memory for a given call.

The context window — one flat token sequence

Model	Max Context	Typical Input Budget	Notes
GPT-4o	128K tokens	~120K usable	Good long-context performance up to ~64K
Claude 3.5 Sonnet	200K tokens	~190K usable	Strong at long documents
Gemini 1.5 Pro	1M tokens	~900K usable	Best for very long docs; some quality degradation at extremes
Llama 3.1 8B	128K tokens	~32K effective	Quality degrades significantly beyond 32K
Mistral 7B	32K tokens	~24K usable	Standard for local deployment

Advertised vs Effective Context Length

A model may support 128K tokens but only reliably use 32K–64K. Beyond that, recall drops, especially for information placed in the middle. Always test your specific use case at your expected context lengths. Advertised ≠ effective.
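A quick way to find where recall starts to drop is to plant a known fact at different depths of a long filler text and ask the model to retrieve it. The sketch below only builds the probe prompts; `build_probe`, the filler sentences, and the needle are illustrative assumptions, and sending the prompts to a model and scoring the answers is left to your own client code.

```python
def build_probe(filler_sentences: list[str], needle: str, depth: float) -> str:
    """Insert `needle` at a relative depth (0.0 = start, 1.0 = end) of the filler."""
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0.0 and 1.0")
    position = round(depth * len(filler_sentences))
    sentences = filler_sentences[:position] + [needle] + filler_sentences[position:]
    return " ".join(sentences)


# Hypothetical filler and needle; use text that resembles your real documents.
filler = [f"Background sentence number {i}." for i in range(100)]
needle = "The access code for the archive is 7291."

# One probe per depth. Run each through your model at your target context
# length and record whether the answer contains the access code.
probes = {d: build_probe(filler, needle, d) for d in (0.0, 0.25, 0.5, 0.75, 1.0)}
```

Repeating the probe at several total lengths (for example 8K, 32K, and 64K tokens of filler) gives a rough map of where your model's effective context ends for this kind of lookup.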

Lost-in-the-Middle — The Attention Decay Problem (In-depth)

The 2023 paper "Lost in the Middle" (Liu et al.) demonstrated experimentally what practitioners already suspected: LLMs pay disproportionate attention to the beginning and end of the context window. Information placed in the middle is recalled less reliably, even when it is the most relevant content in the window.

Recall accuracy by position in context window

📉

Why It Happens

Attention mechanisms naturally focus on nearby and very early tokens
Position embeddings create an implicit primacy/recency bias
Training data patterns reinforce beginning/end attention
Longer contexts amplify the effect — more middle to get lost in

🎯

Mitigation Strategies

Primacy: Put most critical context at the very start
Recency: Move high-priority info near the query (end)
Chunking: Shorter contexts leave less middle for facts to get lost in
Repetition: Repeat key facts at start and end
Explicit refs: "Based on document 1 above..." anchors attention

The Practical Rule

Place your most important instruction or the most relevant retrieved chunk either at the very beginning of the context or immediately before the user's question. Never bury critical facts in the middle of a long document list.

Token Budget Accounting — Know Your Numbers (Core)

Every production LLM call has a token budget. Blow it and you get truncation errors, silent degradation, or hard failures. Managing this budget is a core engineering discipline.

🔧

Token budget calculator (Python)

import tiktoken

def count_tokens(text: str, model: str = "gpt-4o") -> int:
    """Count tokens in text using the tokenizer for the given model."""
    enc = tiktoken.encoding_for_model(model)
    return len(enc.encode(text))

def build_context_budget(
    model_max: int = 128000,
    system_prompt: str = "",
    user_query: str = "",
    max_output: int = 2000,
    safety_margin: int = 500,
) -> int:
    """Return how many tokens are available for retrieved context."""
    system_tokens = count_tokens(system_prompt)
    query_tokens = count_tokens(user_query)
    reserved = system_tokens + query_tokens + max_output + safety_margin
    available = model_max - reserved
    return max(0, available)

# Example: 128K model, ~500-token system prompt, ~50-token query
system_prompt = "..."  # placeholder: your real system prompt
user_query = "..."     # placeholder: the user's question
budget = build_context_budget(
    model_max=128000,
    system_prompt=system_prompt,
    user_query=user_query,
)
# With a 500-token system prompt and a 50-token query:
# 128,000 - 500 - 50 - 2,000 - 500 = 124,950 tokens left for retrieved chunks

Budget Component	Typical Range	Controllable?	Notes
System prompt	200–2,000	Yes — compress it	Most system prompts can be halved with careful editing
Retrieved context	1,000–50,000	Yes — main lever	Chunk count × chunk size — your primary engineering surface
Conversation history	0–32,000	Partially — truncate	Grows unbounded in long chats; must be managed explicitly
User query	10–500	Not really	User controls this; guard against prompt injection stuffing
Max output	256–8,000	Yes — set it low	Reserve less when short answers expected; more for generation tasks

Context vs Knowledge — What the Model Knows vs What It Sees (Foundation)

Parametric Knowledge (baked in)

Encoded in weights during training. Available without any prompt input.

Examples: What is Python? Who wrote Hamlet? What is a transformer?

Limits: Cutoff date, hallucination risk, no private data

Access: Always available, zero tokens

Contextual Knowledge (injected)

Provided at inference time via the context window. Overrides and extends parametric knowledge.

Examples: Your product docs, today's news, user's account data

Limits: Token budget, retrieval quality, position bias

Access: Costs tokens, requires retrieval or explicit injection

The Engineering Implication

For facts the model already knows well (general concepts, public knowledge), don't waste context tokens restating them. Reserve your context budget for what the model cannot know: private data, recent events, user-specific information, and domain specifics it may hallucinate without grounding.

∑ Chapter 01 — Key Takeaways

The context window is a flat token sequence — system prompt, retrieved docs, history, and user query all compete for the same budget
Lost-in-the-middle: Recall drops for information placed in the center — put critical content at the start or immediately before the query
Advertised ≠ effective context length — always test your model at your target context sizes
Token budget = model max − system − query − max_output − safety margin — retrieved context is your primary engineering lever
Don't waste context on parametric knowledge the model already has — reserve it for private, recent, or user-specific information

Chapter 02 · Building Context

Context Construction — Selection, Ordering, and Formatting

Retrieval gives you candidates. Context construction turns candidates into a prompt the model can actually use. Selection, ordering, formatting, and citation anchoring are each distinct engineering problems.

Retrieval Finds Data — Construction Makes It Usable (Mental Model)

Most teams treat retrieval and context construction as the same problem. They are not. Retrieval returns candidates. Context construction decides what to include, what to exclude, how to format it, and where to place it in the window.

Two systems with identical retrieval — different results

System A retrieves the same 8 chunks as System B. It dumps them in retrieval order, unformatted, with no deduplication and no token budget. The model receives 6K tokens of noisy context.

System B re-ranks the same 8 chunks, drops 3 as irrelevant, deduplicates one, formats each with a source label, and places the most relevant first. The model receives 2K tokens of clean, ordered context.

System B consistently outperforms System A despite identical retrieval — purely from construction quality.
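The deduplication step in System B can be as simple as a word-overlap check. The sketch below is one possible approach, not a prescribed one: `jaccard`, `deduplicate`, and the 0.8 overlap cutoff are assumptions chosen for illustration, and an embedding-based similarity would catch paraphrased duplicates that this misses.

```python
def jaccard(a: str, b: str) -> float:
    """Word-set overlap between two texts: 0.0 (disjoint) to 1.0 (same word set)."""
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    if not set_a or not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)


def deduplicate(chunks: list[dict], max_overlap: float = 0.8) -> list[dict]:
    """Keep the highest-scoring chunk from each group of near-duplicates."""
    kept: list[dict] = []
    for chunk in sorted(chunks, key=lambda c: c["score"], reverse=True):
        # Keep the chunk only if it is sufficiently different from everything kept so far.
        if all(jaccard(chunk["text"], k["text"]) < max_overlap for k in kept):
            kept.append(chunk)
    return kept


chunks = [
    {"text": "Refunds are accepted within 30 days of purchase.", "score": 0.91},
    {"text": "REFUNDS are accepted within 30 days of purchase.", "score": 0.88},
    {"text": "Support is open Monday to Friday.", "score": 0.64},
]
unique = deduplicate(chunks)  # the 0.88 chunk is dropped as a duplicate of the 0.91 one
```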

Dynamic Construction Outperforms Static Pipelines

Production-grade systems do not use a fixed context template for all queries. They rewrite queries before retrieval, adapt the number of chunks based on query complexity, vary context size based on task type, and apply different construction strategies per user or workflow. A static pipeline that serves every query identically will underperform a dynamic one that adapts to the request.

The Construction Pipeline — From Retrieval to Prompt (Foundation)

🔍 Retrieve: Top-K candidates

📊 Re-rank: By relevance to query

✂️ Select: Token-budget-aware

🗂️ Order: Primacy/recency aware

🏷️ Format: Labels, delimiters

✅ Inject: Into system/user prompt

Each stage affects quality independently. Teams that only focus on retrieval and ignore construction leave significant quality on the table.

Selection Strategies — What to Include (In-depth)

Retrieval usually returns more candidate text than your budget allows. Selection decides what makes the cut.

📐

Top-K Cutoff

Take the top N results by relevance score, regardless of token count.

Simple, easy to reason about
Problem: Long chunks can blow the token budget; short chunks leave it underused
Use when: Chunks are uniform size

💰

Token Budget Fill

Add chunks in relevance order until token budget is exhausted.

Efficient: fills the budget as fully as chunk sizes allow
Problem: A large irrelevant chunk wastes the budget
Use when: Maximizing context density matters

🎯

Score Threshold

Only include chunks whose relevance score exceeds a minimum threshold.

Quality control — excludes low-relevance noise
Problem: May return empty context if nothing passes
Use when: False positives are costly

🔧

Token-budget-aware selection

def select_chunks(
    chunks: list[dict],   # each: {"text": ..., "score": ..., "tokens": ...}
    token_budget: int,
    min_score: float = 0.5,
    max_chunks: int = 10,
) -> list[dict]:
    # Filter by minimum relevance
    candidates = [c for c in chunks if c["score"] >= min_score]
    # Sort by score descending
    candidates.sort(key=lambda x: x["score"], reverse=True)

    selected, used = [], 0
    for chunk in candidates:
        if len(selected) >= max_chunks:
            break  # chunk limit reached
        if used + chunk["tokens"] > token_budget:
            continue  # too large for the remaining budget; a smaller chunk may still fit
        selected.append(chunk)
        used += chunk["tokens"]
    return selected

Ordering Strategies — Where to Place What (In-depth)

Ordering is the direct answer to the lost-in-the-middle problem. Where you place retrieved chunks determines how well the model can use them.

Strategy	Ordering	Best For	Risk
Relevance Descending	Most relevant first	Default — leverages primacy bias	Lowest-relevance chunks still in middle
Sandwich (U-shape)	Best first → weakest in the middle → second-best last	Long context, multiple equally relevant	More complex ordering; mid-ranked chunks sit toward the middle
Reverse Relevance	Least relevant first, best last	When recency bias stronger than primacy	Model may anchor on weak context early
Temporal	Chronological order	Conversation history, time-sensitive docs	Most relevant may not be most recent
Hierarchical	Summary → detail chunks	Long documents with overview + details	Requires pre-computed summaries

The Sandwich Pattern

For 4+ retrieved chunks, use the sandwich ordering: most relevant chunk first, least relevant in the middle, second-most relevant last — immediately before the user's query. This exploits both primacy and recency bias simultaneously.
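One simple way to produce that ordering is to deal the ranked chunks alternately to the front and the back of the list. This is a minimal sketch; `sandwich_order` is a hypothetical helper and it assumes each chunk carries a `score` field.

```python
def sandwich_order(chunks: list[dict]) -> list[dict]:
    """Order chunks so the strongest sit at the edges and the weakest in the middle."""
    ranked = sorted(chunks, key=lambda c: c["score"], reverse=True)
    front = ranked[0::2]        # ranks 1, 3, 5, ... fill the start of the context
    back = ranked[1::2][::-1]   # ranks ..., 6, 4, 2 fill the end, second-best last
    return front + back


chunks = [{"id": i, "score": s} for i, s in enumerate([0.9, 0.8, 0.7, 0.6, 0.5], 1)]
ordered = sandwich_order(chunks)
# Scores in context order: 0.9, 0.7, 0.5, 0.6, 0.8
```

The best chunk lands first, the second-best lands immediately before the user's query, and relevance falls off toward the center from both ends.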

Context Formatting — Labels, Delimiters, and Structure (Core)

Formatting determines how clearly the model can distinguish between context chunks and how reliably it can cite them. Poor formatting causes models to conflate sources, miss boundaries, or fail to cite.

❌

Poor Formatting

Here is some context: The refund policy is 30 days. Customer service hours are 9-5. Returns require a receipt. Answer the question.

No source labels → can't cite
No chunk boundaries → model conflates
No document IDs → can't reference

✅

Structured Formatting

<context>
<doc id="1" source="refund-policy.pdf">
The refund policy allows returns within 30 days of purchase with receipt.
</doc>
<doc id="2" source="support-hours.txt">
Customer service: Mon-Fri 9am-5pm EST.
</doc>
</context>

Clear boundaries → model knows edges
Source labels → enables citation
IDs → "According to doc 1..."

🔧

Context formatter (Python)

def format_context(
    chunks: list[dict],
    style: str = "xml",  # "xml" | "markdown" | "numbered"
) -> str:
    if style == "xml":
        parts = ["<context>"]
        for i, chunk in enumerate(chunks, 1):
            src = chunk.get("source", "unknown")
            parts.append(f'<doc id="{i}" source="{src}">')
            parts.append(chunk["text"].strip())
            parts.append("</doc>")
        parts.append("</context>")
        return "\n".join(parts)
    elif style == "markdown":
        parts = []
        for i, chunk in enumerate(chunks, 1):
            parts.append(f"### Source {i}: {chunk.get('source', '')}")
            parts.append(chunk["text"].strip())
            parts.append("---")
        return "\n\n".join(parts)
    elif style == "numbered":
        return "\n\n".join(
            f"[{i}] {c['text'].strip()}" for i, c in enumerate(chunks, 1)
        )
    # Fail loudly instead of silently returning None for an unknown style
    raise ValueError(f"Unknown style: {style!r}")

Format Consistency Matters

Whatever format you choose in development, use exactly the same format in production. If you use XML tags, always use XML tags. If you use numbered lists, always use numbered. Models learn context patterns from your prompt — inconsistency causes unpredictable citation behavior.

Citation Anchoring — Grounding Model Output in Sources (Core)

Citation anchoring is the practice of instructing the model to explicitly reference which part of the context it used for each claim. It reduces hallucination, improves verifiability, and allows downstream validation.

📋

Citation Instruction (System Prompt)

You are a helpful assistant with access to company documentation.

Rules:
- Answer ONLY using the provided context
- Cite your source using [doc id] notation
- If the answer is not in the context, say "I don't have information on this."
- Do NOT use your general knowledge

✅

Grounded Output Example

Our refund policy allows returns within 30 days of purchase [doc 1]. You'll need your original receipt to process the return [doc 1]. For questions, contact customer service Monday through Friday, 9am–5pm EST [doc 2].
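Because the citations follow a fixed `[doc N]` pattern, downstream validation can start with a plain regex check. The sketch below is illustrative: `check_citations` is a hypothetical helper, and it only verifies that cited IDs exist in the context, not that each claim is actually supported by the cited document.

```python
import re


def check_citations(answer: str, doc_ids: set[int]) -> dict:
    """Compare the [doc N] markers in an answer with the IDs that were in the context."""
    cited = {int(n) for n in re.findall(r"\[doc (\d+)\]", answer)}
    return {
        "cited": sorted(cited),
        "unknown": sorted(cited - doc_ids),  # cited but never provided: likely fabricated
        "unused": sorted(doc_ids - cited),   # provided but never cited
    }


answer = (
    "Returns are accepted within 30 days [doc 1]. "
    "Support is open 9am-5pm [doc 2]. "
    "Shipping is free [doc 7]."
)
report = check_citations(answer, doc_ids={1, 2, 3})
# report["unknown"] contains 7, which was never in the context
```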

∑ Chapter 02 — Key Takeaways

Context construction is a pipeline: retrieve → re-rank → select → order → format → inject — each step affects quality
Token-budget-aware selection: filter by score threshold, then fill budget greedily by relevance
Ordering rule: Most relevant first (primacy); consider sandwich for 4+ chunks (primacy + recency)
Use structured formatting (XML tags or numbered labels) — enables citation and prevents source conflation
Citation anchoring in the system prompt: "cite using [doc id]" reduces hallucination and enables verification

Common Context Failures in Production (Failure Modes)

Most production LLM quality issues trace back to context construction failures — not model limitations. Know these patterns so you can instrument for them.

🙈

Relevant Chunk Not Selected

The answer exists in your knowledge base but didn't make the top-k. Cause: embedding mismatch, wrong k, or poor chunking at ingestion.

Signal: user says "that info is in your docs" — you check and it is.

🔀

Too Many Irrelevant Chunks

Retrieval returns chunks that share keywords with the query but don't answer it. The model attends to noise and produces a confused or blended answer.

Fix: raise relevance threshold; add re-ranking step.

📍

Key Info Buried in the Middle

Critical chunk placed at position 4–6 in an 8-chunk context. Lost-in-the-middle effect causes the model to underweight it or miss it entirely.

Fix: sandwich ordering — most important first or last.

⚔️

Conflicting Chunks — Wrong One Wins

Two chunks contradict each other (e.g., different policy versions). The model picks one without noting the conflict — often the wrong one (older, lower-quality, or earlier in context).

Fix: explicit conflict detection instruction + timestamp metadata.

🏷️

Poor Formatting Causing Confusion

Chunks injected as raw text with no delimiters or source labels. The model cannot distinguish where one document ends and another begins.

Fix: structured XML tags or numbered doc labels on every chunk.

🌫️

Hallucination Despite Correct Context

The right chunk is in the context, but the model ignores it and generates from parametric memory anyway. Common when context is noisy, too long, or the relevant fact is in the middle.

Fix: reduce noise, move key chunk to start, add "only use provided sources" instruction.

Behavioral Failures Are Harder to Debug Than Retrieval Failures

Retrieval failures are easy to detect: the right chunk simply isn't there. Behavioral failures are harder: the right chunk is present, but the model still mixes sources, ignores the chunk, or hallucinates. These require evaluating both the context content (retrieval) and the model's grounding behavior (faithfulness). Instrument both separately.

Chapter 03 · Compression

Context Compression — Fitting More Signal into Fewer Tokens

Context compression is not about making prompts shorter. It's about preserving maximum information density while spending fewer tokens. The goal: same answers, lower cost, lower latency, less position bias.

More Context ≠ Better Performance (Critical Insight)

The instinct is to include more context — more docs, more history, more detail — to give the model "everything it needs." This is wrong. Beyond a quality threshold, adding more context actively degrades performance.

📡

Increased Noise

Every irrelevant token is noise. The model cannot filter noise itself — it attends to everything. More irrelevant content means more wrong attention patterns.

🌊

Spread Attention

Attention is finite. Adding 10 chunks instead of 3 spreads the model's "focus" thinner. The relevant chunks get less effective attention weight.

📍

Lost-in-the-Middle Worsens

Every additional chunk pushes other chunks further from the ends of the context window. A 10-chunk context is worse than a 4-chunk context if 6 chunks are low-relevance.

The High-Signal Rule

A small high-signal context consistently outperforms a large noisy context. Set relevance thresholds and enforce chunk limits. If you can answer the query with 3 chunks, don't include 8. The goal of compression is not smaller prompts — it's higher signal density per token.

Why Compress — The Economic and Quality Case (Foundation)

💰

Cost

Input tokens cost money. A 50% compression = 50% cost reduction on input tokens — meaningful at scale.

GPT-4o: $2.50/1M input tokens
1M queries @ 10K tokens each = $25K
50% compression → $12.5K saved
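The arithmetic behind those figures, as a small calculator you can rerun with your own volumes. The $2.50 per million input tokens is the price quoted above and is treated here as an assumption; check current pricing before relying on it. `input_cost_usd` is a hypothetical helper name.

```python
def input_cost_usd(queries: int, tokens_per_query: int, price_per_million: float) -> float:
    """Total input-token cost in USD for a given query volume."""
    return queries * tokens_per_query / 1_000_000 * price_per_million


baseline = input_cost_usd(1_000_000, 10_000, 2.50)    # 25000.0
compressed = input_cost_usd(1_000_000, 5_000, 2.50)   # 12500.0 after 50% compression
savings = baseline - compressed                       # 12500.0
```

This counts input tokens only. If compression itself uses a paid model, subtract that cost from the savings.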

⚡

Latency

Prefill time grows at least linearly with context length, and attention cost grows faster than linearly at long lengths. Shorter context = faster TTFT (time-to-first-token).

10K tokens ≈ 100–500ms prefill
50K tokens ≈ 500–3000ms prefill
Compression reduces this directly

🎯

Quality

Less context = less noise, less lost-in-the-middle, more focused attention on relevant parts.

Removes off-topic sentences
Reduces position bias effects
Cleaner signal for the model

Compression Techniques — The Tool Kit (In-depth)

Technique	How It Works	Compression Ratio (tokens removed)	Quality Loss	Best For
Sentence Filtering	Remove sentences with low relevance score to query	30–60%	Low (preserves exact text)	Long documents with mixed relevance
Extractive Summarization	Select and concatenate most relevant sentences	40–70%	Low (exact sentences)	Articles, reports, documentation
Abstractive Summarization	LLM rewrites chunk in fewer tokens	50–80%	Medium (may lose nuance)	Dense technical text, tables
Entity Extraction	Extract only key facts as structured snippets	60–90%	High for open-ended; low for structured tasks	Structured data extraction tasks
LLMLingua / Selective Removal	Token-level perplexity scoring removes low-info tokens	50–70%	Low (preserves key tokens)	General compression, RAG pipelines

Sentence Filtering — Query-Relevance Based Compression (Core)

Sentence filtering is the highest-fidelity compression method: score each sentence against the query, keep only sentences above a relevance threshold. Lossless for the kept sentences; zero hallucination risk since no text is generated.

🔧

Sentence-level compression with embeddings

import re

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-small-en-v1.5")

def compress_chunk(chunk: str, query: str, threshold: float = 0.4) -> str:
    # Naive sentence split after ., ! or ? (keeps decimals such as 2.5 intact);
    # use a proper sentence splitter for abbreviation-heavy text
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", chunk) if s.strip()]
    if not sentences:
        return chunk

    # Embed query and sentences as unit vectors
    query_emb = model.encode([query], normalize_embeddings=True)[0]
    sent_embs = model.encode(sentences, normalize_embeddings=True)

    # Dot product of unit vectors = cosine similarity
    scores = sent_embs @ query_emb

    # Keep sentences at or above the threshold, in their original order
    kept = [s for s, score in zip(sentences, scores) if score >= threshold]
    return " ".join(kept)

Abstractive Compression — LLM-Powered Rewriting (In-depth)

Abstractive compression uses an LLM (typically a small, cheap one) to rewrite chunks into denser summaries targeted at a specific query. It can achieve the highest compression ratios but introduces a quality dependency: the compressor must not lose or distort key facts.

When to Use Abstractive

✅ Long documents with redundant prose

✅ Dense technical text that can be paraphrased

✅ You need 60%+ compression

✅ Small fast model available as compressor

When NOT to Use Abstractive

❌ Legal, medical, financial text where exact wording matters

❌ Code — summarization destroys syntax

❌ Numerical data — risk of transcription errors

❌ When compressor latency exceeds savings benefit

🔧

Abstractive compression prompt

# System prompt for the compressor LLM
COMPRESS_SYSTEM = """You are a precise text compressor.
Given a document chunk and a user query, rewrite the chunk to retain
ONLY information relevant to answering the query.

Rules:
- Preserve all numbers, dates, and named entities exactly
- Keep sentences that directly relate to the query
- Remove off-topic background information
- Output ONLY the compressed text, nothing else
- Target 40-60% of the original token count"""

COMPRESS_USER = """Query: {query}

Chunk to compress:
{chunk}"""

The Compressor Adds Latency

Abstractive compression requires an extra LLM call per chunk. Run sequentially, 5 chunks × 200ms per call adds about 1 second to your pipeline; running the calls in parallel brings that closer to the latency of a single call. Use a small, fast model (Llama 3.1 8B, GPT-4o-mini, Claude Haiku) as the compressor, and make sure the main LLM's quality improvement justifies the added latency.

LLMLingua — Token-Level Selective Compression (In-depth)

LLMLingua (Microsoft Research, 2023) is a compression method that uses a small LM to compute token-level perplexity. Tokens that are predictable (low perplexity) carry little information and can be dropped. The remaining tokens form a compressed — but still parseable — prompt.

✅

LLMLingua Strengths

Works on any text — no LLM generation step
Deterministic compression ratio control
Preserves semantic meaning at 50% compression
Fast — small local model for scoring
Open source: llmlingua pip package

⚠️

LLMLingua Limitations

Compressed text looks garbled to humans
Model-dependent: works best for models similar to compressor
Sensitive structure (JSON, code, tables) may break
Requires additional local model dependency

🔧

LLMLingua usage

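# pip install llmlingua
from llmlingua import PromptCompressor

# Note: this LLMLingua-2 checkpoint scores tokens with a trained classifier
# rather than the raw perplexity scoring used by the original LLMLingua
compressor = PromptCompressor(
    model_name="microsoft/llmlingua-2-bert-base-multilingual-cased-meetingbank",
    use_llmlingua2=True,
)

context = "..."  # placeholder: the retrieved text you want to compress

compressed = compressor.compress_prompt(
    context,
    rate=0.5,              # keep 50% of tokens
    force_tokens=["\n"],   # always keep newlines
)

print(compressed["compressed_prompt"])  # compressed text
print(compressed["ratio"])              # actual achieved ratio
print(compressed["saving"])             # savings summary reported by the library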

Choosing a Strategy — Decision Guide (Core)

Situation	Recommended Strategy	Reason
Legal / medical documents	Sentence filtering only	Exact wording matters — no LLM rewriting
General knowledge articles	Abstractive or LLMLingua	Prose is compressible without loss
Code snippets	Minimal or none	Code structure is brittle — compression may break syntax
Long-form reports (20+ pages)	Hierarchical: abstract summary + sentence filtering of sections	Need overview + relevant detail
Latency-sensitive pipeline	Sentence filtering (local embeddings)	No extra LLM call — milliseconds, not seconds
High-volume, cost-critical	LLMLingua (offline pre-compression)	Compress once, store compressed; pay cost once

Pre-Compress at Ingestion Time

The most cost-effective compression happens at document ingestion, not at query time. Pre-compute summaries and compressed representations when documents are indexed. At query time, retrieve the pre-compressed version — zero extra latency, zero extra cost per query.
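A minimal sketch of ingestion-time compression with a content hash, so unchanged documents are never recompressed. Everything here is an assumption for illustration: `precompress`, the in-memory `store` dict standing in for your document store, and `first_sentence` standing in for whichever real compressor you use.

```python
import hashlib
from typing import Callable


def precompress(docs: dict[str, str], compress: Callable[[str], str], store: dict) -> dict:
    """Compress each document once at ingestion; recompress only when its text changes."""
    for doc_id, text in docs.items():
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if store.get(doc_id, {}).get("hash") != digest:
            store[doc_id] = {"hash": digest, "compressed": compress(text)}
    return store


calls = []  # records every compressor invocation so the caching is visible


def first_sentence(text: str) -> str:
    """Stand-in compressor: keeps only the first sentence."""
    calls.append(text)
    return text.split(". ")[0].rstrip(".") + "."


store: dict = {}
docs = {"policy": "Refunds are accepted within 30 days. This policy was introduced in 2019."}
precompress(docs, first_sentence, store)
precompress(docs, first_sentence, store)  # unchanged document: compressor is not called again
```

At query time you would read `store[doc_id]["compressed"]` instead of the raw text. Note that query-specific methods such as sentence filtering cannot be fully precomputed this way, because they depend on the query.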

∑ Chapter 03 — Key Takeaways

Compression delivers three wins: lower cost, lower latency, better quality (less noise = more focused attention)
Sentence filtering: score sentences vs query with embeddings, drop below threshold — high fidelity, no hallucination risk
Abstractive compression: use a fast LLM to rewrite chunks — high ratio but adds latency; avoid for precise text
LLMLingua: token-level perplexity-based dropping — deterministic, works on arbitrary text, open source
Pre-compress at ingestion — not at query time. Pay once; save cost on every query
Never compress code, numerical tables, or legal text with abstractive methods — use extractive or no compression

Compression Is a Tradeoff, Not a Free Win (Warning)

Compression reduces tokens, cost, and latency — but it trades against information fidelity. Applied incorrectly, it loses the details your system depends on.

Compression Risk	How It Manifests	Mitigation
Critical detail loss	A numeric threshold, date, or constraint is dropped because it scored low in isolation — but it was essential	Always evaluate compressed vs original output on a test set; flag numeric patterns for retention
Meaning alteration	Abstractive compression changes a negation, qualifier, or conditional — "not required" becomes "required"	Avoid abstractive compression for policy, legal, or safety content; use extractive only
Introduced bias	LLM compressor summarizes multiple viewpoints as one, losing nuance or introducing the model's own bias	Sample-evaluate compressed summaries; A/B test faithfulness scores
Over-compression	High compression ratio leaves too few tokens — the answer is no longer reconstructible from the compressed context	Set minimum token floor per chunk; test at boundary compression ratios

Production Rule

Treat compression as a pipeline stage with its own regression tests. Every time you change compression logic, run your faithfulness eval on a held-out set and verify the score doesn't drop. Use compression selectively — apply it to narrative prose, not to structured data, code, or contractual language.
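One cheap regression check for the "critical detail loss" risk is to confirm that every number in the original chunk survives compression. This is a sketch under stated assumptions: `missing_numbers` and the `NUMBER` pattern are hypothetical, and the pattern covers plain integers, thousands separators, decimals, and optional `$` or `%` signs, not dates or spelled-out numbers.

```python
import re

# Matches e.g. 30, 2.5, $50,000, 15%
NUMBER = re.compile(r"\$?\d+(?:,\d{3})*(?:\.\d+)?%?")


def missing_numbers(original: str, compressed: str) -> set[str]:
    """Return numeric tokens present in the original but absent from the compressed text."""
    return set(NUMBER.findall(original)) - set(NUMBER.findall(compressed))


original = "Refunds are allowed within 30 days. Orders over $50,000 need approval from 2 managers."
compressed = "Refunds allowed within 30 days. Large orders need approval from 2 managers."
lost = missing_numbers(original, compressed)
# lost contains "$50,000": the threshold was dropped, so this compression should be rejected
```

A check like this complements a faithfulness eval; it catches dropped figures but not altered negations or qualifiers.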

Chapter 04 · Windowing

Windowing Strategies — Managing Context Across Long Inputs

When the source material is longer than your context window, windowing decides which part of the document the model sees — and when. The wrong windowing strategy silently discards the answer before the model even runs.

The Windowing Problem — What Happens When Docs Don't Fit (Foundation)

A 200-page contract is ~150,000 tokens (about 750 tokens per page). A typical RAG chunk budget is 8,000–32,000 tokens. Even a "long context" model with 128K capacity can fit only ~170 of those pages, before reserving any room for the system prompt, query, and output. Windowing is how you navigate this mismatch for both indexing-time chunking and inference-time context assembly.

Indexing-Time Windowing

How you split documents into chunks when building the index. Determines retrieval granularity and the atomic unit of context.

Key question: How big should each chunk be?

Too large → diluted relevance scores, wasteful tokens

Too small → loses context needed to answer, more chunks to rank

Inference-Time Windowing

How you manage context when a conversation grows or a task requires iterating through a long document.

Key question: What do I drop when the window fills?

Drop old turns → lose conversation coherence

Drop retrieved context → lose grounding

Fixed Windows — Simple but Dangerous (Core)

Fixed windows split documents into equal-size chunks at fixed token boundaries. Simple to implement, but mid-sentence and mid-paragraph cuts destroy semantic coherence.

❌

Fixed Window Failure Mode

# Split at exactly 512 tokens
Chunk 1: "...The defendant was found guilty on three counts. The sentence was determined by the presiding judge after carefully reviewing the evidence. The maximum penalty under statute 42B"
Chunk 2: "is ten years imprisonment or a fine not exceeding $50,000. The court also considered the defendant's..."

The key fact is split across two chunks. Either chunk alone will produce an incomplete or wrong answer.

✅

When Fixed Windows Are Acceptable

Uniform prose with no sentence-crossing critical facts
Large chunks (1,000+ tokens) where mid-cut rarely matters
Pre-processing step before applying semantic splitting
Code files split at function boundaries (not token count)

Minimum mitigation: add 10–20% overlap between adjacent chunks

Sliding Windows — Overlap as a Safety Net (In-depth)

Sliding windows add overlap between adjacent chunks. Each chunk shares N tokens with the previous chunk, so a fact that straddles a boundary still appears whole in at least one chunk, provided the fact is no longer than the overlap.

Sliding window — each chunk overlaps its neighbors

🔧

Sliding window chunker

def sliding_window_chunks(
    text: str,
    chunk_size: int = 512,
    overlap: int = 64,
    tokenizer=None,
) -> list[str]:
    """Split text into overlapping token-based chunks."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    if tokenizer is None:
        import tiktoken
        tokenizer = tiktoken.encoding_for_model("gpt-4o")

    tokens = tokenizer.encode(text)
    stride = chunk_size - overlap  # how far the window advances each step
    chunks = []
    for start in range(0, len(tokens), stride):
        end = start + chunk_size
        chunks.append(tokenizer.decode(tokens[start:end]))
        if end >= len(tokens):
            break  # this window reached the end of the text
    return chunks

# Rule of thumb: overlap = 10–20% of chunk_size
# chunk 512 tokens → overlap 50–100 tokens

Chunk Size	Overlap	Use Case	Trade-off
128–256 tokens	20–40 tokens	FAQ, structured data, precise fact retrieval	High precision, low recall, more chunks to rank
512 tokens	50–100 tokens	General purpose RAG (most common)	Good balance — default starting point
1,024–2,048 tokens	100–200 tokens	Technical docs, legal text, reasoning tasks	Better context per chunk; harder to rank precisely
4,096+ tokens	400–500 tokens	Long-form analysis, whole-section retrieval	Context-rich but relevance score diluted
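Overlap is not free: it increases the number of chunks you embed, store, and rank. The sketch below estimates that overhead for a chunker that advances by `chunk_size - overlap` tokens per step. `chunk_count` is a hypothetical helper, and the 150,000-token document is just an example size.

```python
import math


def chunk_count(total_tokens: int, chunk_size: int, overlap: int) -> int:
    """Number of chunks a sliding window produces for a document of total_tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    if total_tokens <= 0:
        return 0
    if total_tokens <= chunk_size:
        return 1
    stride = chunk_size - overlap
    # One full first window, then one more window per stride until the end is covered.
    return math.ceil((total_tokens - chunk_size) / stride) + 1


no_overlap = chunk_count(150_000, 512, 0)     # 293
with_overlap = chunk_count(150_000, 512, 64)  # 335
overhead = with_overlap / no_overlap - 1      # roughly 0.14, i.e. about 14% more chunks
```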

Semantic Chunking — Splitting on Meaning (In-depth)

Semantic chunking splits documents at natural topic boundaries rather than fixed token counts. When embedding similarity drops sharply between adjacent sentences, that's a topic transition — a good split point.

✅

Semantic Chunking Benefits

Each chunk covers one coherent topic
Retrieval scores are more meaningful (one topic per chunk)
Fewer cross-boundary answer splits
Works especially well on structured docs (reports, articles)

⚠️

Semantic Chunking Costs

Requires embedding every sentence — expensive at ingestion
Variable chunk sizes complicate token budget planning
Poor results on dense technical text with no clear topic shifts
Needs a good sentence-level embedding model

🔧

Semantic chunker (cosine drop method)

import re

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-small-en-v1.5")

def semantic_chunks(
    text: str,
    threshold: float = 0.3,        # split when adjacent similarity < 1.0 - threshold
    min_chunk_tokens: int = 100,
    max_chunk_tokens: int = 1024,
) -> list[str]:
    # Naive sentence split on ., ! or ? followed by whitespace (keeps decimals intact)
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    if not sentences:
        return []
    embeddings = model.encode(sentences, normalize_embeddings=True)

    # Cosine similarity between adjacent sentences (embeddings are unit-normalized)
    similarities = [
        float(embeddings[i] @ embeddings[i + 1])
        for i in range(len(embeddings) - 1)
    ]

    chunks, current, size = [], [], 0
    for i, sentence in enumerate(sentences):
        current.append(sentence)
        size += len(sentence.split())  # word count as a rough token proxy
        topic_shift = i < len(similarities) and similarities[i] < (1.0 - threshold)
        # Split on a topic shift once the chunk is large enough, or when it grows too large
        if (topic_shift and size >= min_chunk_tokens) or size >= max_chunk_tokens:
            chunks.append(" ".join(current))
            current, size = [], 0
    if current:
        chunks.append(" ".join(current))
    return chunks

Hierarchical Windowing — Parent + Child Chunks In-depth

Hierarchical windowing indexes small chunks for precise retrieval but retrieves larger parent chunks for fuller context. You get precision from small-chunk search and coherence from large-chunk content.

Parent-child chunking — retrieve small, inject large

The Key Advantage

Small chunks win retrieval battles — they're precise, focused, and score high. But small chunks lose context battles — they're too short to fully answer. Parent-child solves both: retrieve with child precision, answer with parent context.
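The retrieve-small, inject-large idea can be sketched in a few lines. This is a minimal illustration, not a production retriever: the `Child` dataclass, the helper names, and the word-overlap score are all assumptions made for the example, and the scoring function stands in for a real embedding search.

```python
from dataclasses import dataclass


@dataclass
class Child:
    text: str
    parent_id: int  # index of the parent chunk this child was cut from


def build_index(parents: list[str], child_words: int = 8) -> list[Child]:
    """Cut each parent chunk into small word-based child chunks."""
    children = []
    for pid, parent in enumerate(parents):
        words = parent.split()
        for start in range(0, len(words), child_words):
            children.append(Child(" ".join(words[start:start + child_words]), pid))
    return children


def overlap_score(query: str, text: str) -> int:
    """Toy relevance: number of shared lowercase words (stand-in for embeddings)."""
    return len(set(query.lower().split()) & set(text.lower().split()))


def retrieve_parents(query: str, parents: list[str], children: list[Child], k: int = 2) -> list[str]:
    """Rank the children, then return the de-duplicated parents of the top-k hits."""
    ranked = sorted(children, key=lambda c: overlap_score(query, c.text), reverse=True)
    seen, result = set(), []
    for child in ranked[:k]:
        if child.parent_id not in seen:
            seen.add(child.parent_id)
            result.append(parents[child.parent_id])
    return result


parents = [
    "Refunds are accepted within 30 days of purchase. A receipt is required for every refund.",
    "Shipping takes 3 to 5 business days. Express shipping is available for an extra fee.",
]
children = build_index(parents)
context = retrieve_parents("how long does shipping take", parents, children, k=1)
```

The search runs over the short child texts, but what reaches the prompt is the full parent, so the model sees the surrounding sentences as well as the matching fragment.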

Inference-Time Windowing — Managing Growing Conversations Core

In multi-turn conversations, history grows unbounded. At some point, it exceeds the context window. Inference-time windowing decides what to keep and what to drop.

Strategy	What Is Dropped	Pros	Cons
FIFO Truncation	Oldest messages first	Simple, no extra processing	Loses early context (system setup, key decisions)
Pinned + FIFO	Oldest non-pinned messages	Preserves system prompt + key anchors	Requires explicit pinning logic
Summary Compression	Older turns compressed to summary	Keeps the gist of older turns; coherent history	Adds LLM call; summary may lose nuance
Semantic Retrieval	Low-relevance historical turns	Keeps only what's relevant to current query	Complex; requires embedding history

Never Silently Truncate

Silent truncation is the most dangerous pattern — the model gets half a conversation with no indication that context was removed, leading to confused or contradictory responses. Always inject an explicit indicator when history is compressed: [Earlier conversation summarized: user asked about X, decided Y].
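A pinned + FIFO window with an explicit truncation marker might look like the sketch below. The `pinned` flag on each message, the function name, and the word-count token estimate are assumptions for illustration; a real system would count tokens with the model's tokenizer and would probably replace the marker text with a summary.

```python
def window_history(messages: list[dict], max_tokens: int, count_tokens=None) -> list[dict]:
    """Keep pinned messages plus the newest unpinned ones that fit the budget."""
    if count_tokens is None:
        count_tokens = lambda text: len(text.split())  # crude proxy, not a real tokenizer

    pinned = {i for i, m in enumerate(messages) if m.get("pinned")}
    budget = max_tokens - sum(count_tokens(messages[i]["content"]) for i in pinned)

    kept = set(pinned)
    # Walk newest to oldest so the most recent turns survive.
    for i in range(len(messages) - 1, -1, -1):
        if i in pinned:
            continue
        cost = count_tokens(messages[i]["content"])
        if cost > budget:
            break  # FIFO: everything older than this point is dropped as well
        budget -= cost
        kept.add(i)

    dropped = len(messages) - len(kept)
    result, marker_added = [], False
    for i, message in enumerate(messages):
        if i in kept:
            result.append(message)
        elif not marker_added:
            # Explicit indicator placed where the first dropped message used to be.
            # The marker itself is not counted against the budget in this sketch.
            result.append({
                "role": "system",
                "content": f"[{dropped} earlier message(s) removed to fit the context window]",
            })
            marker_added = True
    return result


history = [
    {"role": "system", "content": "You are a support assistant.", "pinned": True},
    {"role": "user", "content": "My order 1234 arrived damaged."},
    {"role": "assistant", "content": "Sorry to hear that. I can offer a refund."},
    {"role": "user", "content": "Yes please refund it."},
]
windowed = window_history(history, max_tokens=20)
```

With a 20-word budget the oldest user turn no longer fits, so it is replaced by a single marker line while the system prompt and the two most recent turns are kept in order.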

∑ Chapter 04 — Key Takeaways

Fixed windows are simple but break sentences — always add 10–20% overlap at minimum
Sliding windows with stride = chunk_size − overlap keep a boundary fact whole in at least one chunk, provided the fact is no longer than the overlap
Chunk size rule of thumb: 512 tokens with 64-token overlap is the default starting point; tune per domain
Semantic chunking splits at topic boundaries — better precision but expensive to compute at ingestion
Hierarchical windowing (parent–child): index small chunks for retrieval, inject large parent chunks for context — the best of both worlds
Inference-time: never silently truncate; use pinned-FIFO or summary compression; always tell the model when history was dropped

Chapter 05 · Density

Information Density — Signal vs Noise in Context

More context is not always better. Every low-signal token you add is a high-signal token the model attends to less. Information density engineering is about ensuring every token in your context window earns its place.

The Density Principle — Attention Is a Finite Resource Foundation

Transformer attention is not uniform: the model allocates attention across all tokens, and that allocation is competitive. When your context contains 20% useful signal and 80% boilerplate, the useful tokens have to compete with the boilerplate for attention weight, and answer quality suffers. Dense context = better answers at lower token cost.

📰

Low-Density Content

Legal boilerplate and disclaimers
Document headers, footers, page numbers
Repeated information across chunks
Off-topic paragraphs in retrieved docs
Verbose explanations of obvious concepts

🎯

High-Density Content

Specific facts: numbers, dates, names
Decision criteria and rules
Definitions unique to your domain
Step-by-step procedures
Constraint lists and edge cases

⚡

Density Engineering Goals

Maximize relevant tokens per chunk
Remove formatting artifacts
Deduplicate near-identical content
Prefer structured representations
Strip navigation, menus, ads, footers

Noise Sources — What Degrades Density In-depth

Noise Type	Example	Impact	Fix
Structural Artifacts	Page numbers, TOC entries, nav menus	Clutters chunks with zero-signal text	Strip during document pre-processing
Legal Boilerplate	"This document is confidential and intended only for..."	Wastes 100–500 tokens per document	Blacklist common boilerplate patterns
Duplicate Content	Same paragraph repeated in overview + detail section	Dilutes attention; inflates token count	Dedup at ingestion via embedding similarity
Irrelevant Sidebars	Related articles, footnotes, bibliography	Off-topic context confuses the model	Semantic filtering per section type
Verbose Prose	"It is worth noting that, in the context of..." (→ "Note:")	2–5× token waste on filler	Abstractive compression targeting filler phrases
Format Overhead	HTML tags, markdown escape sequences	Raw HTML uses 30–50% extra tokens vs plain text	Strip HTML; convert markdown to plain text

Measuring Density — Quantifying Signal Quality Core

Before tuning, measure. Two practical density metrics: relevance density (what fraction of the context is relevant to the query) and entity density (how many unique named entities per 100 tokens).

🔧

Density scorer

from sentence_transformers import SentenceTransformer
import tiktoken, re

embed_model = SentenceTransformer("BAAI/bge-small-en-v1.5")
tokenizer = tiktoken.encoding_for_model("gpt-4o")

def relevance_density(context: str, query: str) -> float:
    """Fraction of sentences with cosine sim > 0.4 to query."""
    sentences = [s.strip() for s in context.split(".") if len(s.strip()) > 20]
    if not sentences:
        return 0.0
    q_emb = embed_model.encode([query], normalize_embeddings=True)[0]
    s_embs = embed_model.encode(sentences, normalize_embeddings=True)
    scores = s_embs @ q_emb
    return float((scores > 0.4).mean())

def entity_density(text: str) -> float:
    """Named entities + numbers per 100 tokens (rough proxy)."""
    token_count = len(tokenizer.encode(text))
    # Count capitalized phrases and numbers as rough entity proxy
    entities = len(re.findall(r'\b[A-Z][a-z]+(?:\s[A-Z][a-z]+)*|\b\d+[\d,.]*\b', text))
    return (entities / max(token_count, 1)) * 100

# Good targets: relevance_density > 0.6, entity_density > 3.0

Structured Representations — Tables and Lists Beat Prose Core

When the same information can be represented as prose or as a structured list/table, the structured version is almost always higher density — more facts per token, easier for the model to parse.

📄

Prose (73 tokens)

The refund window is 30 days from the date of purchase. Customers must have their original receipt. Items must be in original condition and unopened. Electronics are eligible for exchange only, not refund. Shipping costs are non-refundable in all cases.

📋

Structured (38 tokens — 48% less)

Refund policy:
- Window: 30 days from purchase
- Requires: original receipt
- Condition: unopened, original condition
- Electronics: exchange only (no refund)
- Shipping: non-refundable

Convert at Ingestion

When indexing documents that contain policy lists, specifications, or tabular data, pre-convert prose to structured format during ingestion and store the structured version in your vector database. Because the embedded text changes, spot-check retrieval on a sample of queries after converting; the token density gain is usually substantial.

The Threshold Effect — When More Context Hurts In-depth

Research and production experience consistently show a non-monotonic relationship between context length and quality. Adding more context helps up to a point — then it starts to hurt.

Answer quality vs context size — the threshold effect

More Chunks ≠ Better Answers

Doubling your retrieved chunk count from 5 to 10 doesn't improve answers if the extra 5 chunks are marginally relevant. They add tokens, increase latency, and introduce noise. Default to fewer, higher-quality chunks. Start with 3–5; only add more if eval shows consistent improvement.
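One way to enforce "fewer, higher-quality chunks" is to apply a relevance floor, a chunk cap, and a token budget together. The sketch below is an assumption-laden illustration: the function name, the default thresholds, and the word-count token estimate are placeholders to tune against your own eval set.

```python
def select_chunks(
    scored_chunks: list[tuple[float, str]],
    min_score: float = 0.5,
    max_chunks: int = 5,
    token_budget: int = 2000,
    count_tokens=None,
) -> list[str]:
    """Return chunk texts, best first, that pass the score floor and fit the budget."""
    if count_tokens is None:
        count_tokens = lambda text: len(text.split())  # crude proxy, not a real tokenizer

    selected, used = [], 0
    for score, text in sorted(scored_chunks, key=lambda pair: pair[0], reverse=True):
        if score < min_score or len(selected) >= max_chunks:
            break  # everything after this point is weaker or over the cap
        cost = count_tokens(text)
        if used + cost > token_budget:
            continue  # this chunk does not fit, a smaller one further down may
        selected.append(text)
        used += cost
    return selected


candidates = [
    (0.82, "Refunds are accepted within 30 days."),
    (0.41, "Our offices are closed on holidays."),
    (0.77, "A receipt is required for refunds."),
    (0.55, "Electronics are exchange only."),
]
top_two = select_chunks(candidates, max_chunks=2)
```

The marginal chunk (score 0.41) never reaches the prompt regardless of the cap, which is the behavior the threshold effect argues for.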

∑ Chapter 05 — Key Takeaways

Attention is competitive — every low-signal token dilutes attention on high-signal tokens
Main noise sources: structural artifacts, boilerplate, duplicates, verbose prose, HTML tags — strip at ingestion time
Measure density: relevance density > 0.6 (fraction of sentences relevant to query) and entity density > 3.0 (named entities per 100 tokens)
Structured representations beat prose — same information in bullet/table form uses 30–50% fewer tokens
The threshold effect is real — quality peaks around 4K–8K tokens for most tasks; more context beyond that introduces noise
Default to 3–5 high-quality chunks, not 10–20 mediocre ones

Chapter 06 · Long Context

Long Context Models — 100K+ Tokens in Practice

Long context models change what's possible — but not how you should think. A 1M token window doesn't eliminate the need for context engineering; it shifts the constraints. Cost, latency, and position bias all scale with context length.

The Long Context Landscape — What's Available Foundation

Model	Context Window	Effective Range	Input Cost	Best Use
GPT-4o	128K tokens	~32K–64K real reliability	$2.50/1M tokens	Standard production, API access
Claude 3.5 Sonnet	200K tokens	Strong to ~128K	$3.00/1M tokens	Long legal/research docs
Claude 3 Opus	200K tokens	Strong to ~150K	$15.00/1M tokens	Complex multi-doc analysis
Gemini 1.5 Pro	1M tokens	Strong at 100K–500K	$1.25/1M (≤128K) / $2.50 (>128K)	Whole codebase, book-length docs
Gemini 1.5 Flash	1M tokens	Good to ~200K, degrades beyond	$0.075/1M (≤128K)	High-volume, cost-sensitive long context
Llama 3.1 70B	128K tokens	~16K–32K effective for open models	Self-hosted	Private deployment, data sovereignty

Position Embeddings — Why Long Context Degrades In-depth

Understanding why models degrade at long context requires understanding position embeddings — how models encode token position in the sequence.

📐

Absolute Position Embeddings

Each position has a fixed learned embedding. Max positions fixed at training — cannot generalize beyond training length.

GPT-2 style; largely deprecated
Hard stop at training max length

🔄

RoPE (Rotary Position Embedding)

Relative positional encoding via rotation. Can extend beyond training length via RoPE scaling (YaRN, NTK-aware). Used by Llama, Mistral, Qwen.

Extensible via fine-tuning or scaling
Quality degrades gracefully beyond training range

♾️

ALiBi (Attention with Linear Biases)

Penalizes attention scores in proportion to query-key distance, with no explicit position embedding. Designed to extrapolate to sequences longer than those seen in training without retraining.

Used in MPT; robust extrapolation
Some quality loss vs RoPE at long range
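To make the ALiBi description concrete, here is a small pure-Python sketch of the distance penalty. It follows the slope schedule from the ALiBi paper for power-of-two head counts (a geometric sequence starting at 2^(-8/n)); the function names are made up for this example, and a real implementation would build the bias as a tensor inside the attention layer.

```python
def alibi_slopes(num_heads: int) -> list[float]:
    """Per-head slopes: a geometric sequence with ratio 2 ** (-8 / num_heads)."""
    if num_heads < 1 or num_heads & (num_heads - 1):
        raise ValueError("this sketch only handles power-of-two head counts")
    ratio = 2 ** (-8 / num_heads)
    return [ratio ** (head + 1) for head in range(num_heads)]


def alibi_bias(seq_len: int, slope: float) -> list[list[float]]:
    """Causal bias added to attention scores: -slope * (i - j) for keys j <= query i."""
    return [
        [-slope * (i - j) if j <= i else float("-inf") for j in range(seq_len)]
        for i in range(seq_len)
    ]


slopes = alibi_slopes(8)          # first head 0.5, last head 1/256
bias = alibi_bias(4, slopes[0])   # bias[3][0] is the penalty for a key 3 positions back
```

Because the penalty depends only on the distance i - j, nothing in the computation is tied to a maximum sequence length, which is why the scheme can be applied to longer inputs without new position parameters.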

The Practical Takeaway

For models using RoPE (Llama, Mistral): stay within the fine-tuned context range. Beyond it, quality degrades unpredictably. For API models (GPT-4o, Claude, Gemini): the provider has already handled extension — but test your specific task at your target length before committing to production.

Long Context Patterns — When to Use and How Core

✅

Use Long Context When:

The answer requires synthesizing across an entire document
You can't predict which sections will be relevant (reduces retrieval risk)
Document structure matters (cross-references, section dependencies)
Few-shot examples are so large they exceed normal context
Whole codebase analysis, full book Q&A, complete contract review

❌

Don't Use Long Context When:

RAG would achieve the same quality at 10× lower cost
Only a small section of the doc is ever relevant
You're answering many queries (cost scales with every call)
You need low latency — 100K+ tokens → 2–10s prefill
Cheaper model + retrieval outperforms expensive long-context model

Cost and Latency at Scale — The Long Context Tax In-depth

Strategy	Context Size	Cost (GPT-4o)	TTFT	Best For
RAG (sparse retrieval)	2K–8K tokens	$0.005–$0.02 / query	<200ms	High-volume, known-answer retrieval
RAG (dense retrieval)	8K–32K tokens	$0.02–$0.08 / query	200–800ms	Complex queries, multiple documents
Long context (64K)	64K tokens	$0.16 / query	1–3s	Full-doc analysis, infrequent queries
Long context (200K)	200K tokens	$0.50 / query at GPT-4o's rate (exceeds GPT-4o's 128K window; requires a 200K+ model)	5–15s	One-time analysis, not production serving
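The per-query figures in the table are straightforward arithmetic on input tokens. The sketch below reproduces them using the $2.50 per 1M input-token GPT-4o price quoted earlier in this guide; it ignores output tokens and any caching discount, and the 50,000 queries/day volume is an invented example.

```python
def input_cost_usd(input_tokens: int, price_per_million: float) -> float:
    """Input-side cost of one request in USD."""
    return input_tokens / 1_000_000 * price_per_million


GPT4O_INPUT_PRICE = 2.50  # USD per 1M input tokens, as listed in this guide

per_query = {
    tokens: input_cost_usd(tokens, GPT4O_INPUT_PRICE)
    for tokens in (8_000, 64_000, 200_000)
}

# Hypothetical volume: 50,000 queries per day at 64K input tokens each
daily_cost_64k = 50_000 * per_query[64_000]  # about $8,000 per day
```

At that volume the 64K pattern costs roughly eight times what an 8K RAG context would, which is the gap the table is pointing at.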

The Hybrid Pattern

The most cost-effective production pattern combines both: use retrieval as a first pass to identify relevant sections, then inject those sections (plus surrounding context) into a long-context model for synthesis. RAG gives you precision; long context gives you coherence within the relevant section. Cost stays bounded; quality improves.
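The "relevant sections plus surrounding context" step of the hybrid pattern can be sketched as follows. It assumes a retriever has already returned the indices of the matching sections; the function names and the omission marker text are illustrative choices, not a standard API.

```python
def expand_with_neighbors(hit_indices: list[int], num_sections: int, radius: int = 1) -> list[int]:
    """Sorted, de-duplicated section indices covering each hit plus `radius` neighbors."""
    keep = set()
    for idx in hit_indices:
        first = max(0, idx - radius)
        last = min(num_sections - 1, idx + radius)
        keep.update(range(first, last + 1))
    return sorted(keep)


def build_hybrid_context(sections: list[str], hit_indices: list[int], radius: int = 1) -> str:
    """Join the selected sections in document order, marking gaps explicitly."""
    chosen = expand_with_neighbors(hit_indices, len(sections), radius)
    parts, previous = [], None
    for idx in chosen:
        if previous is not None and idx != previous + 1:
            parts.append("[... sections omitted ...]")
        parts.append(sections[idx])
        previous = idx
    return "\n\n".join(parts)


sections = [f"Section {i}" for i in range(10)]
hybrid_context = build_hybrid_context(sections, hit_indices=[2, 7])
```

Keeping document order and marking the gap tells the model that the two spans are not contiguous, while the neighbor radius bounds how much extra context each hit can pull in.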

Needle in a Haystack — The Standard Long-Context Eval Reference

Before committing to a long-context model for production, run the "needle in a haystack" test: place a specific fact (the needle) at various positions in a large document (the haystack) and ask questions that require recalling it. This reveals where each model's attention actually degrades.

🔧

NIAH test scaffold

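def run_niah_test(
    model_fn,
    needle: str = "The secret code is PURPLE-42.",
    haystack: str = None,               # large filler document (required)
    positions: tuple[float, ...] = (0.1, 0.3, 0.5, 0.7, 0.9),
    context_lengths: tuple[int, ...] = (8000, 32000, 64000, 128000),  # in tokens
    chars_per_token: int = 4,           # rough English average; use a tokenizer for exact sizes
) -> dict:
    if haystack is None:
        raise ValueError("haystack is required")
    results = {}
    for ctx_len in context_lengths:
        ctx_chars = ctx_len * chars_per_token
        if len(haystack) < ctx_chars:
            raise ValueError(f"haystack too short for {ctx_len} tokens")
        filler = haystack[:ctx_chars]
        for pos in positions:
            # Insert needle at position fraction of context
            insert_at = int(ctx_chars * pos)
            ctx = filler[:insert_at] + " " + needle + " " + filler[insert_at:]
            response = model_fn(
                system="Answer only from the provided context.",
                user=f"{ctx}\n\nWhat is the secret code?",
            )
            results[(ctx_len, pos)] = "PURPLE-42" in response
    return results  # True/False grid: length × position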

∑ Chapter 06 — Key Takeaways

Gemini 1.5 Pro (1M) is the longest-context option; Claude 3.5 Sonnet (200K) has the best long-context quality/cost balance for most production use
Advertised context ≠ effective context — run needle-in-a-haystack tests at your target lengths before committing
RoPE (Llama, Mistral) can be extended but degrades beyond training range; API models (GPT-4o, Claude, Gemini) have handled this internally
Use long context when synthesis across the whole document is required; use RAG when only a subset is relevant
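Long context has a cost tax: at GPT-4o's $2.50/1M input rate, 64K tokens = $0.16/query and 200K tokens = $0.50/query (200K exceeds GPT-4o's own 128K window, so it requires a larger-window model), which is unsustainable for high-volume use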
Hybrid pattern: RAG to identify relevant sections → long-context model to synthesize within those sections

Chapter 07 · Caching

Context Caching — Reusing Prefixes for Cost and Latency Savings

Context caching is one of the highest-ROI optimizations in LLM engineering. If the same prefix appears in multiple requests, you pay to process it once and reuse the KV cache. Savings of 50–90% on input token costs are achievable for the right workloads.

How Prefix Caching Works — KV Cache Reuse Foundation

When an LLM processes a prompt, it computes key-value (KV) pairs for every token in the attention layers. This computation is expensive and proportional to context length. Prefix caching stores these pre-computed KV pairs — so if the same prefix appears in the next request, the model skips recomputing it entirely.

Without vs with prefix caching

The Key Requirement

For caching to work, the prefix must be identical byte-for-byte across requests: same characters, same whitespace, same order. A single differing token invalidates the cache from that point onward, so only the part of the prompt before the first difference can be reused. This is why stable, front-loaded prefixes are the design pattern for cacheable prompts.
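A toy comparison shows why position matters for the prefix rule. The sketch splits prompts on whitespace as a stand-in for real tokenization (an assumption for illustration; providers match on their own tokens and in fixed-size blocks), then measures how many leading tokens two requests share.

```python
def shared_prefix_len(a: list[str], b: list[str]) -> int:
    """Number of leading tokens that are identical in both sequences."""
    count = 0
    for x, y in zip(a, b):
        if x != y:
            break
        count += 1
    return count


stable = "You are a support assistant. Follow the policy below."
prompt_1 = f"{stable} Question: where is my order?"
prompt_2 = f"{stable} Question: can I get a refund?"
prompt_3 = f"Today is Monday. {stable} Question: can I get a refund?"

reusable_when_stable_first = shared_prefix_len(prompt_1.split(), prompt_2.split())
reusable_when_date_first = shared_prefix_len(prompt_2.split(), prompt_3.split())
```

With the stable text first, everything up to the question is shared between requests; moving one changing sentence to the front drops the shared prefix to zero even though the rest of the prompt is unchanged.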

Provider Caching Support — What's Available Core

Provider	Cache Type	Cached Token Cost	Min Cacheable Prefix	TTL
OpenAI (gpt-4o)	Automatic prompt caching	50% off input tokens	1,024 tokens	Typically 5–10 min of inactivity; at most ~1 hour (auto-evicted)
Anthropic (Claude)	Explicit cache_control markers	~90% off input (write once, read many)	1,024 tokens	5 min (ephemeral), refreshed on each cache hit
Google (Gemini)	Explicit cached_content API	~75% off input tokens	32,768 tokens	Configurable (up to hours)
Self-hosted (vLLM)	Automatic prefix caching (--enable-prefix-caching)	GPU compute saved (no KV recompute)	Any prefix length	Memory-bound (in-flight)
Self-hosted (SGLang)	RadixAttention — tree-based KV sharing	Highest hit rate for branching prompts	Any prefix	Memory-bound

OpenAI — Automatic Prompt Caching Core

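OpenAI caches prompts automatically, with no code changes required. Any prompt prefix of 1,024+ tokens that is reused while the cache entry is still warm (entries are typically evicted after 5–10 minutes of inactivity, and always within about an hour) is billed at a 50% discount. The usage field in the response reports cached_tokens.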

🔧

Checking cache hits (OpenAI)

from openai import OpenAI

client = OpenAI()

# long_system_prompt and user_query are assumed to be defined by the caller
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": long_system_prompt},  # 1K+ tokens
        {"role": "user", "content": user_query},
    ],
)

usage = response.usage
print(f"Total input tokens : {usage.prompt_tokens}")
print(f"Cached tokens : {usage.prompt_tokens_details.cached_tokens}")
print(f"Cache hit rate : {usage.prompt_tokens_details.cached_tokens / usage.prompt_tokens:.1%}")

# Design for caching: stable prefix first, dynamic content last
# Cached tokens billed at $1.25/1M (vs $2.50/1M full price)

Anthropic — Explicit cache_control Markers In-depth

Claude's caching is explicit — you mark exactly which parts of the prompt to cache using cache_control breakpoints. The first request writes the cache (slight cost premium); subsequent requests read it at ~90% discount.

🔧

Claude cache_control usage

import anthropic

client = anthropic.Anthropic()

# large_document and user_question are assumed to be defined by the caller
response = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": large_document,  # 10K+ tokens: cache this
            "cache_control": {"type": "ephemeral"},
        },
        {
            "type": "text",
            # This part is NOT cached: it changes per use case
            "text": "Answer questions based only on the document above.",
        },
    ],
    messages=[{"role": "user", "content": user_question}],
)

usage = response.usage
print(f"Cache write tokens : {usage.cache_creation_input_tokens}")
print(f"Cache read tokens : {usage.cache_read_input_tokens}")

# Subsequent calls: 90% discount on the cached part

Design Patterns for Cacheability — Structure Your Prompts to Win In-depth

Prompt structure determines cache hit rate. Small changes to prompt ordering or dynamic content placement can mean the difference between 90% hit rate and 0%.

❌

Cache-Unfriendly Pattern

System: You are a helpful assistant. Today is {current_date}.   ← CHANGES EVERY DAY
User: {question}
Context: {retrieved_docs}   ← CHANGES PER QUERY

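The date sits in the prefix, so every new day invalidates the cache for everything that follows it. With a finer-grained timestamp or any per-request value in that position, no request ever hits the cache.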

✅

Cache-Friendly Pattern

System: You are a helpful assistant.   ← STABLE PREFIX
[All static instructions here]   ← STABLE PREFIX
--- CONTEXT ---
{retrieved_docs}   ← per-query, AFTER stable
--- QUERY ---
{question}   ← per-query, at end
Note: Today is {current_date}.   ← dynamic, keep LAST

Stable prefix stays identical — high cache hit rate for all static content.

Dynamic Element	Where to Place	Why
Static instructions	System prompt — first	Always cached after first call
Large static documents	System prompt or early user turn	Highest value to cache — most tokens
Retrieved context	After stable prefix	Changes per query — prefix still cached
User query	Last in user turn	Always unique — keep after everything stable
Current date/time	Last — after all stable content	Invalidates cache if in prefix
Session/user ID	Never in cacheable prefix	Makes every request unique — zero hits
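The placement rules in the table translate into a small prompt builder. This is a sketch under assumptions: the `STATIC_INSTRUCTIONS` text, the function name, and the section markers are placeholders, and the message layout follows the common system/user chat format rather than any one provider's requirements.

```python
STATIC_INSTRUCTIONS = (
    "You are a helpful assistant.\n"
    "Answer only from the provided context."
)


def build_messages(retrieved_docs: str, question: str, current_date: str) -> list[dict]:
    """Stable content first, per-request content last."""
    dynamic = (
        f"--- CONTEXT ---\n{retrieved_docs}\n\n"
        f"--- QUERY ---\n{question}\n\n"
        f"Note: Today is {current_date}."
    )
    return [
        # Identical on every call, so it can be served from the prefix cache
        {"role": "system", "content": STATIC_INSTRUCTIONS},
        # Changes per request, so it goes after everything stable
        {"role": "user", "content": dynamic},
    ]


first = build_messages("Refund window: 30 days.", "Can I return this?", "2025-01-01")
second = build_messages("Shipping: 3 to 5 days.", "When will it arrive?", "2025-01-02")
```

Both calls produce the same system message, so only the user message differs between requests; the date and any session identifiers never touch the cacheable part.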

Measuring Cache ROI — Before and After Core

🔧

Cache ROI calculator

def cache_roi(
    daily_requests: int,
    avg_input_tokens: int,
    stable_prefix_tokens: int,     # tokens that stay same
    cost_per_1m_full: float,       # e.g. 2.50 for gpt-4o
    cost_per_1m_cached: float,     # e.g. 1.25 for gpt-4o
    hit_rate: float = 0.85,        # expected cache hit rate
) -> dict:
    dynamic_tokens = avg_input_tokens - stable_prefix_tokens

    # Without caching
    cost_no_cache = (daily_requests * avg_input_tokens / 1_000_000) * cost_per_1m_full

    # With caching (prefix cached at reduced rate on hits)
    prefix_cached = daily_requests * hit_rate * stable_prefix_tokens
    prefix_full = daily_requests * (1 - hit_rate) * stable_prefix_tokens
    dynamic_tokens_total = daily_requests * dynamic_tokens
    cost_with_cache = (
        (prefix_cached / 1_000_000) * cost_per_1m_cached
        + (prefix_full + dynamic_tokens_total) / 1_000_000 * cost_per_1m_full
    )

    savings = cost_no_cache - cost_with_cache
    return {
        "daily_cost_no_cache": cost_no_cache,
        "daily_cost_cached": cost_with_cache,
        "daily_savings": savings,
        "monthly_savings": savings * 30,
    }

# Example: 10K requests/day, 8K avg input, 6K stable prefix
roi = cache_roi(10000, 8000, 6000, 2.50, 1.25)
# Daily cost: $200.00 uncached vs $136.25 cached
# Monthly savings ≈ $1,912 on a $6,000/month bill

∑ Chapter 07 — Key Takeaways

Prefix caching reuses pre-computed KV pairs — 50–90% input token cost reduction for cacheable workloads
OpenAI: automatic, no code changes — requires 1,024+ token prefix, ~1hr TTL, 50% discount
Anthropic: explicit cache_control markers — up to 90% discount, use for large static documents
Gemini: explicit cached_content API, 32K+ min tokens — best for very large stable content
Design rule: stable content first, dynamic content last — never put dates/user IDs in the cacheable prefix
For self-hosted: enable --enable-prefix-caching in vLLM or use SGLang's RadixAttention for branching prompts

What to Cache — A Production Hierarchy Production

Not all context is equally cache-worthy. The value of caching a piece of context depends on how frequently it's reused, how expensive it is to recompute, and how stable it is over time.

What to Cache	Reuse Frequency	Benefit	Stability
Formatted retrieved chunks	High — same query pattern	Eliminates retrieval + formatting cost	Hours–days
Compressed document summaries	Very high — per document	Eliminates compression LLM call	Days–weeks
System prompt	Every request	Provider prefix caching (50–90% discount)	Weeks–months
User preference context	Per user session	Eliminates DB lookup and formatting	Minutes–hours
Static knowledge base sections	High — shared across users	Serve from cache, skip retrieval	Days
Assembled context for top queries	Very high (80/20 rule)	Full pipeline bypass for hot queries	Minutes–hours

Caching Improves Consistency, Not Just Cost

Cached context is deterministic — the same pre-formatted, pre-compressed chunk is returned every time. This improves answer consistency across sessions. Without caching, minor variations in retrieval scores or compression outputs can cause the same query to produce different context — and different answers — across requests. Caching is both a cost lever and a reliability lever.
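An application-level cache for assembled context can be as small as a dictionary with expiry times. The sketch below is a single-process illustration with assumed names (`ContextCache`, `get_or_build`); a production system would more likely use a shared store, and the whitespace and case normalization is only one possible keying choice.

```python
import hashlib
import time


class ContextCache:
    """In-memory TTL cache keyed by a normalized query (illustrative only)."""

    def __init__(self, ttl_seconds: float, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock  # injectable so expiry can be tested without sleeping
        self._store: dict[str, tuple[float, str]] = {}

    @staticmethod
    def key(query: str) -> str:
        normalized = " ".join(query.lower().split())
        return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

    def get_or_build(self, query: str, build) -> str:
        cache_key = self.key(query)
        now = self.clock()
        entry = self._store.get(cache_key)
        if entry is not None and now - entry[0] < self.ttl:
            return entry[1]        # hit: skip retrieval, formatting, compression
        context = build(query)     # miss: run the full pipeline once
        self._store[cache_key] = (now, context)
        return context


pipeline_calls = []

def fake_pipeline(query: str) -> str:
    pipeline_calls.append(query)
    return f"context for: {query}"

fake_now = [0.0]
cache = ContextCache(ttl_seconds=60, clock=lambda: fake_now[0])
first_context = cache.get_or_build("Refund  policy?", fake_pipeline)
second_context = cache.get_or_build("refund policy?", fake_pipeline)
```

The second call returns exactly the string built by the first, which is the consistency benefit described above: the same query maps to the same context until the entry expires.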

Chapter 08 · Multi-Document

Multi-Document Context — Synthesizing Across Multiple Sources

Most real-world queries require synthesizing across multiple documents or sources. How you rank, present, and delimit multiple documents determines whether the model synthesizes them correctly — or confuses, ignores, or contradicts them.

The Multi-Document Challenge — Five Failure Modes Foundation

🔀

Source Conflation

Model blends information from different documents into a single "answer," losing attribution of which source said what.

Fix: Explicit delimiters + citation instructions

🙈

Source Neglect

Model answers from the first one or two documents, ignoring others entirely. Lost-in-the-middle at document level.

Fix: Fewer docs + sandwich ordering + explicit "use all sources" instruction

⚔️

Conflict Blindness

Documents contradict each other; model picks one without noting the contradiction.

Fix: Explicit conflict detection prompt + "note disagreements" instruction

📅

Recency Blindness

All documents treated as equally current. An outdated doc overrides a newer one.

Fix: Inject timestamps; instruct model to prefer recent sources

🏷️

Source Mislabeling

Model cites "document 2" when the fact came from "document 4." Especially common with 5+ documents.

Fix: Unique, memorable source IDs (not just numbers)

Document Ranking — Which Source Goes First In-depth

When you have multiple retrieved documents, their order in the context window affects which ones the model uses. Rank-aware ordering is different from simple relevance sorting.

Ranking Signal	What It Is	When to Use
Relevance Score	Cosine similarity or BM25 score to query	Default — most relevant first
Recency	Document timestamp or last-updated date	News, policies, product docs that change
Authority	Source type (official docs > forum post) or domain weight	Knowledge bases with mixed source quality
Re-rank Score	Cross-encoder score (Cohere Rerank, BGE reranker)	High-stakes retrieval; worth the extra latency
Diversity	MMR (Maximal Marginal Relevance) — relevance minus redundancy	When top chunks are near-duplicates of each other

🔧

MMR document ranking (diversity-aware)

import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-small-en-v1.5")

def mmr_rank(
    query: str,
    docs: list[str],
    k: int = 5,
    lambda_: float = 0.5,  # 0=max diversity, 1=max relevance
) -> list[int]:
    """Return indices of top-k docs via Maximal Marginal Relevance."""
    embeddings = model.encode([query] + docs, normalize_embeddings=True)
    q_emb, d_embs = embeddings[0], embeddings[1:]
    relevance = d_embs @ q_emb  # similarity to query

    selected, remaining = [], list(range(len(docs)))
    while len(selected) < k and remaining:
        if not selected:
            # First: pick most relevant
            best = max(remaining, key=lambda i: relevance[i])
        else:
            # MMR: balance relevance vs redundancy
            sel_embs = d_embs[selected]
            mmr_scores = [
                lambda_ * relevance[i]
                - (1 - lambda_) * float((d_embs[i] @ sel_embs.T).max())
                for i in remaining
            ]
            best = remaining[int(np.argmax(mmr_scores))]
        selected.append(best)
        remaining.remove(best)
    return selected

Multi-Document Formatting — Labels, Metadata, and Delimiters Core

With multiple documents, clear formatting is critical. Labels must be unambiguous, metadata must be useful, and delimiters must prevent content bleeding between sources.

✅

Rich multi-document format (recommended)

<documents>
<doc id="refund-policy" source="internal/policies/refund-v3.pdf" date="2025-01-15" authority="official">
Customers may return items within 30 days of purchase. A valid receipt is required. Opened items are not eligible.
</doc>
<doc id="cs-faq" source="support/faq.md" date="2024-08-10" authority="support">
Q: Can I return without receipt?
A: No. A receipt is required for all returns.
</doc>
<doc id="forum-post" source="community/post-4421" date="2023-05-02" authority="user">
I returned without a receipt and they were fine with it.
</doc>
</documents>

Answer the question using the documents above. Cite sources using [doc id] notation.
If documents conflict, note the disagreement and prefer higher-authority, more recent sources.

Authority Metadata Is Powerful

Including authority="official" vs authority="user" and document dates lets you instruct the model to resolve conflicts by authority and recency. Without this metadata, the model has no principled basis for choosing between contradictory sources.
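Generating the tagged format from document records keeps labels and metadata consistent across requests. The helper below is a minimal sketch: the dictionary keys mirror the attributes in the example above, the function name is an assumption, and escaping uses Python's standard `xml.sax.saxutils` so that document text cannot accidentally close a tag.

```python
from xml.sax.saxutils import escape, quoteattr


def format_documents(docs: list[dict]) -> str:
    """Render documents as <doc> blocks with id, source, date and authority attributes."""
    lines = ["<documents>"]
    for doc in docs:
        attrs = " ".join(
            f"{name}={quoteattr(str(doc[name]))}"
            for name in ("id", "source", "date", "authority")
        )
        lines.append(f"<doc {attrs}>")
        lines.append(escape(doc["text"]))  # escape &, < and > inside the content
        lines.append("</doc>")
    lines.append("</documents>")
    return "\n".join(lines)


docs = [
    {
        "id": "refund-policy",
        "source": "internal/policies/refund-v3.pdf",
        "date": "2025-01-15",
        "authority": "official",
        "text": "Returns within 30 days & a valid receipt are required.",
    },
    {
        "id": "forum-post",
        "source": "community/post-4421",
        "date": "2023-05-02",
        "authority": "user",
        "text": "They accepted my return </doc> anyway.",
    },
]
formatted = format_documents(docs)
```

The second document contains a stray closing tag in its text; after escaping it can no longer end the block early, so content from one source does not bleed into the next.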

Conflict Resolution — Handling Contradictory Sources In-depth

Documents often contradict each other — especially across time (policy updated) or authority level (official docs vs user reports). Without explicit conflict handling, models pick arbitrarily.

📋

Conflict-Aware System Prompt

When documents contradict each other:
1. Note the contradiction explicitly
2. Prefer official/authoritative sources over community/user sources
3. Prefer more recent dates over older
4. If unresolvable, present both views and ask the user to clarify

Format: "According to [official source], X is the case. Note: [community source] states Y, but this may be outdated."

🎯

Conflict Detection Prompt

Before answering, check:
- Do any documents disagree with each other?
- Are any documents likely outdated (old date)?
- Is there uncertainty in the sources?

If yes: state the conflict, your resolution logic, and your confidence level.
If no conflicts: answer directly.

Cross-Document Synthesis — Building Complete Answers Core

Some answers require combining information from multiple documents — no single source is complete. Synthesis prompts encourage the model to explicitly integrate rather than just retrieve.

Retrieval Prompt (bad synthesis)

"Answer based on the documents provided."

Result: model picks the most relevant single document and answers from it, ignoring complementary information in others.

Synthesis Prompt (good synthesis)

"Synthesize a complete answer by drawing from ALL provided documents. Identify which aspects each document contributes. Note if any document provides unique information not found in others."

Result: model explicitly combines across sources.

The 3+ Document Degradation Cliff

Quality of multi-document synthesis degrades significantly beyond 3–5 documents for most models. Each additional document increases the probability of source neglect, conflation, or mislabeling. If you need 10 documents, consider a two-pass approach: first pass summarizes each document independently; second pass synthesizes the summaries. This costs one extra LLM call per document (the calls can run in parallel and on a cheaper model), but it is usually more reliable and scales better than a single context with 10 documents.

Two-Pass Synthesis — Map-Reduce for Documents In-depth

📄 Doc 1 → Summary (independent extraction)

📄 Doc 2 → Summary (independent extraction)

📄 Doc N → Summary (independent extraction)

🔗 Synthesize (all summaries → final answer)

🔧

Map-reduce document synthesis

import asyncio

async def map_reduce_synthesis(
    documents: list[dict],
    query: str,
    llm_fn,  # async callable: prompt string in, response string out
) -> str:
    # MAP: extract relevant info from each doc independently
    MAP_PROMPT = """From the document below, extract ONLY information relevant to: {query}
If nothing is relevant, respond: "No relevant information."
Be concise. Preserve exact numbers, dates, and names.

Document [{doc_id}]:
{content}"""

    extractions = await asyncio.gather(*[
        llm_fn(MAP_PROMPT.format(query=query, doc_id=doc["id"], content=doc["text"]))
        for doc in documents
    ])

    # Filter empty extractions
    useful = [
        f"[{doc['id']}]: {ext}"
        for doc, ext in zip(documents, extractions)
        if "No relevant" not in ext
    ]
    if not useful:
        # Nothing to synthesize: skip the reduce call
        return "No relevant information found in the provided documents."

    # REDUCE: synthesize all extractions into final answer
    REDUCE_PROMPT = """Synthesize a complete answer to: {query}

Using these extracted facts from multiple sources:
{facts}

Cite each fact with its source ID. Note any conflicts."""

    return await llm_fn(REDUCE_PROMPT.format(query=query, facts="\n\n".join(useful)))

∑ Chapter 08 — Key Takeaways

Five multi-doc failure modes: conflation, neglect, conflict blindness, recency blindness, mislabeling — each requires a specific fix
Use rich metadata (source ID, date, authority level) in document tags — it enables automatic conflict resolution by the model
MMR ranking balances relevance and diversity — prevents top-k from returning near-duplicate chunks
Explicit conflict resolution instructions: prefer official over user, recent over old, note contradictions explicitly
Quality degrades with 3+ documents — for 10+ docs, use the map-reduce (two-pass) pattern: extract per doc, then synthesize
Synthesis prompts outperform retrieval prompts — "synthesize from all" vs "answer based on documents" produces meaningfully different results

Chapter 09 · Evaluation

Context Quality Metrics — Measuring Effectiveness

You can't improve what you don't measure. Context quality is the invisible variable that determines whether your LLM application works in production — and most teams only discover problems when users complain. This chapter defines the metric stack that tells you exactly where context is failing.

The Four Pillars of Context Quality Foundation

🎯

Relevance

Does the context contain information that answers the query? High relevance = low noise. Measured per-chunk and at the context level.

Metric: relevance score (0–1); % of chunks used in the answer

🔍

Faithfulness

Does the model's response stay grounded in the provided context? Low faithfulness = hallucination even when context is good.

Metric: RAGAS faithfulness; claim verification rate

📐

Coverage

Does the context include all facts needed to answer completely? Missing a key piece forces the model to hallucinate or hedge.

Metric: answer completeness; recall@k vs gold answer

⚡

Efficiency

How much of the context window is actually useful? Token waste = higher cost + higher latency + more noise for the model.

Metric: utilization ratio; noise fraction; token cost per query

Retrieval Metrics — Measuring Before the LLM In-depth

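Retrieval metrics measure the quality of what you put into context, before the model sees it. The rank-based metrics (precision@k, recall@k, MRR, NDCG@k) are fast, cheap, and deterministic once relevance labels exist; the LLM-judged ones (context utilization, noise fraction) cost an extra model call and can vary between runs.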

Metric	What It Measures	How to Compute	Target
Precision@k	Fraction of top-k retrieved chunks that are relevant	Manual labels or LLM judge on sample	>0.7
Recall@k	Fraction of all relevant chunks that appear in top-k	Requires ground-truth relevant set	>0.8 for factual QA
MRR	Mean Reciprocal Rank — how early is the first relevant result?	avg(1/rank of first relevant chunk)	>0.6
NDCG@k	Normalized Discounted Cumulative Gain — graded relevance, rank-aware	Relevance labels (0/1/2) + DCG formula	>0.75
Context Utilization	% of retrieved chunks cited or used in final answer	LLM judge: "which chunks did the model actually use?"	>50% — low means too much noise
Noise Fraction	% of context tokens that are irrelevant to the query	LLM relevance scorer per chunk	<30% — lower is better
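
The rank-based metrics in the table can be computed in a few lines once you have relevance labels. The sketch below is one minimal way to do it, not a reference implementation: the function names are illustrative, chunk IDs are assumed to be strings, and NDCG uses the linear-gain form of DCG (some libraries use 2^gain - 1 instead).

```python
import math

def precision_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the top-k retrieved chunks that are relevant."""
    if k <= 0:
        return 0.0
    return sum(1 for d in ranked_ids[:k] if d in relevant_ids) / k

def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of all relevant chunks that appear in the top-k."""
    if not relevant_ids:
        return 0.0
    return sum(1 for d in ranked_ids[:k] if d in relevant_ids) / len(relevant_ids)

def reciprocal_rank(ranked_ids, relevant_ids):
    """1 / rank of the first relevant chunk, or 0.0 if none was retrieved."""
    for rank, d in enumerate(ranked_ids, start=1):
        if d in relevant_ids:
            return 1.0 / rank
    return 0.0

def mean_reciprocal_rank(runs):
    """runs: list of (ranked_ids, relevant_ids) pairs, one per query."""
    if not runs:
        return 0.0
    return sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)

def ndcg_at_k(ranked_ids, gains, k):
    """gains maps chunk id -> graded relevance (0/1/2); unknown ids count as 0."""
    dcg = sum(gains.get(d, 0) / math.log2(i + 2)
              for i, d in enumerate(ranked_ids[:k]))
    ideal = sorted(gains.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(i + 2) for i, g in enumerate(ideal))
    return dcg / idcg if idcg else 0.0
```

For a ranking `["c3", "c1", "c7", "c2"]` with relevant set `{"c1", "c2", "c9"}`, precision@3 and recall@3 are both 1/3 and the reciprocal rank is 0.5, because the first relevant chunk sits at position 2.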

RAGAS — The Standard RAG Evaluation Framework Tool

RAGAS (Retrieval-Augmented Generation Assessment) provides four core metrics that together cover the full RAG quality surface:

📊

Faithfulness

Are all claims in the answer supported by the context? Breaks the answer into atomic claims and verifies each against retrieved chunks.

score = verified_claims / total_claims

🎯

Answer Relevancy

Is the answer actually addressing the question asked? Generates back-questions from the answer and measures alignment with the original.

score = cosine_sim(generated_Qs, original_Q)

🔍

Context Precision

Are the retrieved chunks actually useful for generating the answer? Measures signal-to-noise in the context window.

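score ≈ useful_chunks / total_chunks (RAGAS itself computes a rank-weighted average of precision@k, so relevant chunks ranked higher score better)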

📐

Context Recall

Does the retrieved context contain all the information needed to answer? Measures coverage relative to ground-truth answer.

score = covered_claims / total_claims_in_GT

🔧

RAGAS evaluation pipeline

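from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (
    answer_relevancy,
    context_precision,
    context_recall,
    faithfulness,
)

# One row per evaluated query; "contexts" holds the list of retrieved chunks.
# Column names follow the classic RAGAS schema; newer releases may expect
# different names, so check the docs for your installed version.
data = {
    "question": ["What is the return policy?"],
    "answer": ["30 days for unused items."],
    "contexts": [["Returns accepted within 30 days..."]],
    "ground_truth": ["Items can be returned within 30 days if unused."],
}
dataset = Dataset.from_dict(data)

# The metrics are LLM-based, so a judge model must be configured
# (by default RAGAS uses OpenAI and reads OPENAI_API_KEY).
result = evaluate(
    dataset,
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
)
print(result.to_pandas())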

LLM-as-Judge for Context Quality In-depth

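For production systems at scale, you can't manually review every retrieval result. LLM-as-judge provides automated, scalable evaluation. Validate the judge against a small human-labeled sample before trusting its scores, because agreement with human judgment depends on the task and the judge prompt.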

🔧

Chunk relevance scorer (LLM-as-judge)

import asyncio

RELEVANCE_PROMPT = """Rate the relevance of this context chunk to the query.

Query: {query}
Chunk: {chunk}

Score 0-3:
0 = Completely irrelevant
1 = Tangentially related
2 = Partially relevant
3 = Directly answers the query

Respond with ONLY the number."""

async def score_chunk_relevance(query: str, chunk: str, llm) -> int:
    response = await llm(RELEVANCE_PROMPT.format(query=query, chunk=chunk))
    try:
        score = int(response.strip())
    except ValueError:
        return 0  # Unparseable reply: treat the chunk as irrelevant
    return min(max(score, 0), 3)  # Clamp to the 0-3 scale

async def evaluate_context(query: str, chunks: list[str], llm) -> dict:
    if not chunks:
        # Nothing retrieved: avoid dividing by zero below
        return {"mean_relevance": 0.0, "noise_fraction": 0.0, "high_quality_chunks": 0.0}
    scores = await asyncio.gather(*[
        score_chunk_relevance(query, c, llm) for c in chunks
    ])
    return {
        "mean_relevance": sum(scores) / (len(scores) * 3),  # Normalized to 0-1
        "noise_fraction": scores.count(0) / len(scores),
        "high_quality_chunks": sum(1 for s in scores if s >= 2) / len(scores),
    }

Faithfulness Verification — Catching Hallucinations Critical

Faithfulness measures whether the model's answer is grounded in the context. This is your primary hallucination detector for RAG systems.

Faithfulness Evaluation Pattern

Step 1 — Claim Extraction: Break the model's answer into atomic, verifiable claims. Each claim should be a single, unambiguous statement.

Step 2 — Claim Verification: For each claim, check whether it is supported, contradicted, or absent from the retrieved context.

Step 3 — Score Computation: Faithfulness = supported_claims / total_claims. A score below 0.8 indicates significant hallucination risk.

🔧

Faithfulness checker

import asyncio
import json

CLAIM_EXTRACTOR = """Extract all factual claims from this answer as a JSON list.
Each claim must be a single, atomic statement.

Answer: {answer}

Return: ["claim1", "claim2", ...]"""

CLAIM_VERIFIER = """Does the provided context support this claim?

Context: {context}
Claim: {claim}

Answer ONLY: SUPPORTED / CONTRADICTED / NOT_IN_CONTEXT"""

async def check_faithfulness(answer: str, context: str, llm) -> dict:
    claims_json = await llm(CLAIM_EXTRACTOR.format(answer=answer))
    claims = json.loads(claims_json)  # Raises json.JSONDecodeError on malformed output
    if not claims:
        # No factual claims to verify: report a vacuous score instead of dividing by zero
        return {"faithfulness": 1.0, "total_claims": 0, "supported": 0, "hallucinated": 0}
    verdicts = await asyncio.gather(*[
        llm(CLAIM_VERIFIER.format(context=context, claim=c)) for c in claims
    ])
    # Prefix match, so a stray reply such as "UNSUPPORTED" is not counted as supported
    supported = sum(1 for v in verdicts if v.strip().upper().startswith("SUPPORTED"))
    return {
        "faithfulness": supported / len(claims),
        "total_claims": len(claims),
        "supported": supported,
        "hallucinated": len(claims) - supported,  # Contradicted or not in context
    }

Building a Context Metrics Dashboard Production

In production you need continuous metric tracking — not just offline eval. Log the key signals with every request and aggregate them into a live dashboard.

Metric	Collection Method	Alert Threshold
Context Relevance (mean)	LLM scorer on sampled requests (5–10%)	<0.6
Faithfulness	Async faithfulness check post-response	<0.75
Context Utilization	Citation extraction from response	<0.4 → too many irrelevant chunks
Tokens per Query	LLM usage logs	>2× baseline → context bloat
Answer Latency p95	Request timing	>5s → retrieval or context issues
User Feedback Rate	Thumbs up/down or follow-up question rate	Downvote rate >15%
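
A minimal sketch of how the sampling rate and alert thresholds in the table might be wired together. The metric names, the `tokens_vs_baseline` ratio, and the hash-based sampler are illustrative assumptions; the threshold values are the ones from the table.

```python
import hashlib

SAMPLE_RATE = 0.05  # assumption: lower end of the 5-10% sampling range

# (direction, limit) per metric, taken from the alert thresholds in the table
THRESHOLDS = {
    "context_relevance": ("below", 0.6),
    "faithfulness": ("below", 0.75),
    "context_utilization": ("below", 0.4),
    "tokens_vs_baseline": ("above", 2.0),  # hypothetical ratio: tokens / baseline tokens
    "latency_p95_s": ("above", 5.0),
    "downvote_rate": ("above", 0.15),
}

def should_sample(request_id: str, rate: float = SAMPLE_RATE) -> bool:
    """Deterministic sampling: the same request ID always gets the same decision."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return bucket < rate

def breached(metrics: dict[str, float]) -> list[str]:
    """Return the names of metrics that crossed their alert threshold."""
    alerts = []
    for name, (direction, limit) in THRESHOLDS.items():
        value = metrics.get(name)
        if value is None:
            continue  # metric not collected for this window
        if direction == "below" and value < limit:
            alerts.append(name)
        elif direction == "above" and value > limit:
            alerts.append(name)
    return alerts
```

Hashing the request ID rather than calling a random generator keeps the decision reproducible, so a sampled request can be re-evaluated later with the same outcome.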

The Eval-Prod Gap

Offline evaluation on a benchmark dataset rarely reflects production performance. Production queries have a different distribution, different lengths, and different failure modes. Always run online metrics (sampled LLM evaluation + user feedback) alongside offline benchmarks. A system that scores 0.9 on your eval set may score 0.65 in production on queries you didn't anticipate.

∑ Chapter 09 — Key Takeaways

Context quality has four pillars: relevance, faithfulness, coverage, efficiency — measure all four, not just end-task accuracy
Use retrieval metrics (precision@k, recall@k, NDCG) as fast pre-LLM signals that catch retrieval failures before they reach the model
RAGAS is the standard framework: faithfulness, answer relevancy, context precision, context recall — run it on every major change
LLM-as-judge scales evaluation to production — sample 5–10% of requests and score chunk relevance asynchronously
Faithfulness verification (claim extraction → claim verification) is your primary hallucination detector in RAG systems
Build a live metrics dashboard: context relevance, faithfulness, utilization, token cost, latency — alert on degradation

Measuring Context Quality — What to Test and How Production

Context quality must be evaluated, not assumed. A system that "seems to work" in manual testing can have systematic failure modes that only show up under controlled evaluation. Build a test harness that isolates context variables.

🧪

Ablation Tests — Chunk Impact

Remove individual chunks from the context and measure the change in answer accuracy. If removing a chunk doesn't change the answer, the chunk is wasted tokens. If removing it causes failure, it's critical.

Test: answer_quality(full_context) vs answer_quality(context - chunk_N)
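
A leave-one-out ablation can be sketched as follows. `score_fn` is a hypothetical callable that runs your pipeline on a list of chunks and returns an answer-quality score between 0 and 1; how you grade the answer is up to your eval setup.

```python
from typing import Callable

def ablation_impact(
    chunks: list[str],
    score_fn: Callable[[list[str]], float],
) -> list[dict]:
    """Score the full context, then re-score with each chunk removed in turn."""
    baseline = score_fn(chunks)
    report = []
    for i in range(len(chunks)):
        without = chunks[:i] + chunks[i + 1:]
        # delta > 0: removing the chunk hurt quality, so the chunk matters
        # delta == 0: the chunk is a candidate for removal (wasted tokens)
        report.append({"index": i, "delta": baseline - score_fn(without)})
    return sorted(report, key=lambda r: r["delta"], reverse=True)
```

The cost is one extra pipeline run per chunk, so this is better suited to an offline test set than to live traffic.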

🔀

Ordering Sensitivity Tests

Shuffle chunk order and measure how much answer quality varies. High variance = model is fragile to ordering. Low variance = ordering doesn't matter much for this query type.

Test 5 permutations; measure faithfulness variance across orderings.
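
One way to run the permutation test, under the same assumption as before: `score_fn` is a hypothetical callable that builds the context in the given chunk order, queries the model, and returns a faithfulness score.

```python
import random
import statistics
from typing import Callable

def ordering_sensitivity(
    chunks: list[str],
    score_fn: Callable[[list[str]], float],
    n_permutations: int = 5,
    seed: int = 0,
) -> dict:
    """Score several random chunk orderings and report the spread."""
    rng = random.Random(seed)  # fixed seed keeps the test repeatable
    scores = []
    for _ in range(n_permutations):
        shuffled = chunks[:]
        rng.shuffle(shuffled)
        scores.append(score_fn(shuffled))
    return {
        "scores": scores,
        "mean": statistics.fmean(scores),
        "variance": statistics.pvariance(scores),
    }
```

Five permutations give only a rough estimate of variance; raise `n_permutations` if the result will drive a layout decision.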

✂️

Compression Quality Tests

Compare answers produced from the original uncompressed context vs compressed context on your test set. If faithfulness drops more than 5% absolute, the compression ratio is too aggressive.

Target: <5% faithfulness drop at your target compression ratio.

📊

Token Efficiency Audit

For a sample of production queries, measure what fraction of context tokens were cited in the answer. Tokens not cited are wasted. A utilization below 40% signals a retrieval or selection problem.

Target: >50% of context tokens referenced in the final answer.
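
A rough utilization audit might look like the sketch below. It assumes the answer cites chunks with bracketed IDs such as `[a]`, and it approximates token counts by whitespace splitting; swap in your real tokenizer and citation format.

```python
import re

def approx_tokens(text: str) -> int:
    """Crude token estimate (assumption): whitespace-separated words."""
    return len(text.split())

def utilization(answer: str, chunks: dict[str, str]) -> float:
    """Share of context tokens that belong to chunks cited in the answer."""
    cited = set(re.findall(r"\[([^\[\]]+)\]", answer))
    total = sum(approx_tokens(text) for text in chunks.values())
    if total == 0:
        return 0.0
    used = sum(approx_tokens(text) for cid, text in chunks.items() if cid in cited)
    return used / total
```

This counts every token of a cited chunk as used, so it is an upper bound on true utilization.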

Chapter 10 · Production

Production Context Systems — Scale and Reliability

Context engineering in a notebook is easy. Context engineering at production scale — with real latency budgets, cost constraints, concurrent users, and cascading failures — is an entirely different discipline. This chapter is the full production playbook.

Production Architecture — The Full Context Pipeline Foundation

📥 User Query: Parse intent + entities

🔍 Retrieval: Vector + keyword search

🏆 Re-Rank: Cross-encoder scoring

✂️ Compress: Filter + summarize

🧱 Assemble: System + history + chunks

🤖 LLM Call: With assembled context

Each stage has its own latency budget, failure mode, and optimization surface. Treat them as independent services with SLAs — not a single monolithic function.

Stage	Typical Latency	Primary Optimization	Failure Mode
Query Parsing	1–5ms	Pre-compiled regex; cached NLP models	Wrong intent extraction → wrong retrieval
Vector Retrieval	10–50ms	ANN index (HNSW); GPU-accelerated	Index staleness; embedding model mismatch
Keyword Search	5–20ms	Inverted index; field weighting	Sparse coverage on long-tail queries
Re-Ranking	50–200ms	Async; cache popular queries	Latency spike; cross-encoder OOM
Compression	100–500ms	Rule-based first; LLM only when needed	Over-compression loses key facts
LLM Inference	500ms–5s	Streaming; prefix caching; batching	Timeout; context length exceeded

Real-Time Context Construction — Latency-First Design In-depth

Every millisecond of context construction latency adds directly to user-perceived response time. Parallelize all retrieval and processing steps that don't depend on each other.

🔧

Parallel context construction with timeout

import asyncio

# vector_search, bm25_search, get_recent_turns, get_user_preferences,
# get_system_context, deduplicate, rerank, assemble_context, log_metric and
# SYSTEM_PROMPT are defined elsewhere in the pipeline.

async def _with_timeout(name: str, coro, timeout: float):
    """Await one retrieval task; on timeout, log it and return an empty result."""
    try:
        return name, await asyncio.wait_for(coro, timeout=timeout)
    except asyncio.TimeoutError:
        log_metric(f"context_timeout_{name}", 1)
        return name, []  # Degrade gracefully

async def build_context_parallel(
    query: str,
    user_id: str,
    conversation_id: str,
    timeout_ms: int = 300,
) -> dict:
    timeout = timeout_ms / 1000
    tasks = {
        "vector": vector_search(query, k=8),
        "keyword": bm25_search(query, k=5),
        "user_history": get_recent_turns(conversation_id, n=5),
        "user_prefs": get_user_preferences(user_id),
        "system_state": get_system_context(),
    }
    # gather() schedules every task at once, so they run concurrently.
    # Awaiting the coroutines one by one in a loop would run them sequentially.
    pairs = await asyncio.gather(*[
        _with_timeout(name, coro, timeout) for name, coro in tasks.items()
    ])
    results = dict(pairs)
    # Merge, deduplicate, and assemble
    chunks = deduplicate(results["vector"] + results["keyword"])
    chunks = rerank(query, chunks, k=5)
    return assemble_context(
        system_prompt=SYSTEM_PROMPT,
        history=results["user_history"],
        user_context=results["user_prefs"],
        retrieved_chunks=chunks,
        system_state=results["system_state"],
    )

Cost Management — Controlling Token Spend Critical

At scale, context size is your primary cost driver. A system consuming 4,000 input tokens per request at $3/M tokens costs $0.012 per request — at 1M daily requests, that's $12,000/day just in input tokens.
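
The arithmetic above is easy to turn into a small estimator for comparing context sizes. This is a sketch with an illustrative function name; the $3/M price is the figure used in the text, not a quote for any particular model.

```python
def input_cost(tokens_per_request: int, price_per_million_usd: float,
               requests_per_day: int) -> tuple[float, float]:
    """Return (cost per request, cost per day) in USD for input tokens only."""
    per_request = tokens_per_request * price_per_million_usd / 1_000_000
    return per_request, per_request * requests_per_day

if __name__ == "__main__":
    for tokens in (4_000, 2_000):
        per_request, per_day = input_cost(tokens, 3.0, 1_000_000)
        print(f"{tokens} tokens: ${per_request:.3f}/request, ${per_day:,.0f}/day")
```

Halving the context from 4,000 to 2,000 tokens halves the daily input bill, from $12,000 to $6,000 at this price and volume.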

📦

Context Tiering

Use small, cheap models (GPT-4o-mini, Haiku) for simple queries with short context. Route complex queries to large models. Saves 60–80% on most workloads.

💾

Prefix Caching

Cache system prompts and static context with providers that support it (Anthropic, OpenAI). Cached prefix tokens are billed at a steep discount, up to roughly 10× cheaper depending on provider and model. Saves 20–40% for chat applications.

✂️

Aggressive Compression

Set hard token budgets per context section. Use extractive compression on retrieved chunks. Remove boilerplate from system prompts. Target <2,000 input tokens for simple Q&A.
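
Hard per-section budgets can be enforced with a small truncation step before assembly. In this sketch the section names, the budget split, and the whitespace-based token estimate are all assumptions; history keeps its most recent words, while other sections keep their beginning.

```python
def truncate(text: str, budget: int, keep: str = "head") -> str:
    """Trim text to at most `budget` whitespace tokens (a crude token estimate)."""
    if budget <= 0:
        return ""
    words = text.split()
    if len(words) <= budget:
        return text
    kept = words[-budget:] if keep == "tail" else words[:budget]
    return " ".join(kept)

# Hypothetical split of a 2,000-token budget: (max tokens, which end to keep)
SECTION_BUDGETS = {
    "system": (300, "head"),
    "history": (500, "tail"),   # most recent turns matter most
    "chunks": (1200, "head"),   # assumes chunks are ordered best-first
}

def enforce_budgets(sections: dict[str, str], budgets=SECTION_BUDGETS) -> dict[str, str]:
    """Apply each section's budget; an unknown section name raises KeyError."""
    return {name: truncate(text, *budgets[name]) for name, text in sections.items()}
```

Raising on an unbudgeted section is deliberate: a new section should not be able to enter the context without someone assigning it a budget.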

Strategy	Token Reduction	Quality Impact	Implementation Effort
Reduce k (fewer chunks)	20–40%	Minimal if precision is high	Low
Extractive compression	30–60%	Low — keeps key sentences	Medium
History summarization	40–70%	Moderate — may lose nuance	Medium
Prefix caching	10–30% cost	None — same tokens	Low (provider feature)
Model routing	50–80% cost	Depends on routing accuracy	High
Semantic deduplication	10–25%	Positive — removes noise	Medium

Observability — Tracing Context Through the Pipeline In-depth

Full observability means you can trace any production failure back to its root cause in the context pipeline: was it a bad retrieval, a compression error, a cache miss, or an LLM failure?

🔧

Context span tracing with OpenTelemetry

from opentelemetry import trace

# retrieve, compress, assemble and count_tokens are defined elsewhere in the pipeline.
tracer = trace.get_tracer("context-pipeline")

async def traced_context_build(query: str, **kwargs):
    with tracer.start_as_current_span("context.build") as root:
        root.set_attribute("query.length", len(query))

        with tracer.start_as_current_span("context.retrieve") as span:
            chunks = await retrieve(query)
            tokens_before = count_tokens(chunks)
            span.set_attribute("chunks.count", len(chunks))
            span.set_attribute("chunks.total_tokens", tokens_before)

        with tracer.start_as_current_span("context.compress") as span:
            compressed = compress(chunks, budget=2000)
            tokens_after = count_tokens(compressed)
            span.set_attribute("tokens.before", tokens_before)
            span.set_attribute("tokens.after", tokens_after)
            # Skip the ratio when retrieval returned nothing (avoids division by zero)
            if tokens_before:
                span.set_attribute("compression.ratio", tokens_after / tokens_before)

        context = assemble(compressed, **kwargs)
        root.set_attribute("context.final_tokens", count_tokens(context))
        return context

Key signals to trace at every request:

Retrieval Span

Retrieval latency (ms)
Chunks retrieved (count)
Mean relevance score
Cache hit/miss

Assembly Span

Total tokens assembled
Tokens per section
Compression ratio
Truncation events

LLM Span

Input / output tokens
Time to first token
Total latency
Provider / model used

Reliability Patterns — Handling Failures Gracefully Critical

🔄

Retrieval Fallback

If vector search fails or returns low-confidence results, fall back to BM25 keyword search. If both fail, serve from a pre-built static context for the query category.

vector → bm25 → static_fallback
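
The fallback chain can be sketched as a loop over retrievers in priority order. The retriever signature, the `score` field, and the 0.5 confidence cutoff are assumptions made for illustration.

```python
import asyncio
from typing import Awaitable, Callable

Retriever = Callable[[str], Awaitable[list[dict]]]

async def retrieve_with_fallback(
    query: str,
    retrievers: list[tuple[str, Retriever]],
    static_fallback: list[dict],
    min_score: float = 0.5,
) -> tuple[str, list[dict]]:
    """Try each retriever in order; fall through on errors or low-confidence results."""
    for name, retriever in retrievers:
        try:
            results = await retriever(query)
        except Exception:
            continue  # backend failed: move on to the next retriever
        confident = [r for r in results if r.get("score", 0.0) >= min_score]
        if confident:
            return name, confident
    # Every retriever failed or returned only low-confidence results
    return "static_fallback", static_fallback
```

Returning the name of the stage that answered makes it easy to log how often each fallback level is used.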

✂️

Context Overflow Guard

Always check token count before sending to the LLM. If the assembled context exceeds the model's limit, apply emergency compression: truncate history first, then reduce chunk count.

assert tokens <= model_limit * 0.9
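
Rather than asserting and failing the request, the guard can shrink the context until it fits. The sketch below follows the order described above (history first, then chunks); the whitespace token estimate and the best-first chunk ordering are assumptions.

```python
def approx_tokens(text: str) -> int:
    """Crude token estimate (assumption): whitespace-separated words."""
    return len(text.split())

def fit_to_limit(
    system: str,
    history: list[str],
    chunks: list[str],
    model_limit: int,
    headroom: float = 0.9,
) -> tuple[list[str], list[str]]:
    """Drop oldest history turns, then lowest-ranked chunks, until under budget."""
    budget = int(model_limit * headroom)
    history, chunks = list(history), list(chunks)  # do not mutate the caller's lists

    def total() -> int:
        return (approx_tokens(system)
                + sum(approx_tokens(t) for t in history)
                + sum(approx_tokens(c) for c in chunks))

    while total() > budget and history:
        history.pop(0)   # oldest turn goes first
    while total() > budget and chunks:
        chunks.pop()     # then the lowest-ranked chunk (chunks sorted best-first)
    if total() > budget:
        raise ValueError("system prompt alone exceeds the token budget")
    return history, chunks
```

Log every call that actually removes something; frequent truncation events usually mean the per-section budgets are set too high.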

⏱️

Timeout Budgets

Each pipeline stage gets a hard timeout. A slow re-ranker should not block the entire request. Degrade to fewer chunks rather than wait indefinitely.

rerank timeout: 150ms → skip if exceeded
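
A hard timeout around the re-ranker can be expressed with `asyncio.wait_for`. In this sketch `rerank_fn` is a hypothetical async re-ranker, and on timeout the function falls back to the first few chunks in retrieval order.

```python
import asyncio

async def rerank_with_budget(query, chunks, rerank_fn,
                             timeout_s: float = 0.150, fallback_k: int = 3):
    """Return re-ranked chunks, or the top `fallback_k` in retrieval order on timeout."""
    try:
        return await asyncio.wait_for(rerank_fn(query, chunks), timeout=timeout_s)
    except asyncio.TimeoutError:
        # Re-ranker blew its budget: degrade to fewer, un-reranked chunks
        return chunks[:fallback_k]

async def slow_rerank(query, chunks):
    """Stand-in for a stalled cross-encoder (for demonstration only)."""
    await asyncio.sleep(0.5)
    return list(reversed(chunks))

if __name__ == "__main__":
    result = asyncio.run(
        rerank_with_budget("q", ["a", "b", "c", "d"], slow_rerank, timeout_s=0.05)
    )
    print(result)
```

`wait_for` cancels the underlying call when the timeout fires, so a stalled re-ranker does not keep consuming resources after the request has moved on.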

🏗️

Circuit Breaker

If a retrieval backend fails repeatedly (e.g., vector DB unreachable), open the circuit and serve from cache or static context rather than hammering the failing service.

5 failures / 10s → open circuit for 30s
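
A minimal circuit breaker matching the rule above (5 failures within 10s opens the circuit for 30s) could look like this. It is a simplified sketch: there is no half-open probing state, and the class and method names are illustrative.

```python
import time

class CircuitBreaker:
    def __init__(self, max_failures: int = 5, window_s: float = 10.0,
                 open_s: float = 30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.window_s = window_s
        self.open_s = open_s
        self.clock = clock          # injectable for testing
        self.failures = []          # timestamps of recent failures
        self.opened_at = None       # None means the circuit is closed

    def allow(self) -> bool:
        """True if a call to the backend may proceed."""
        if self.opened_at is None:
            return True
        if self.clock() - self.opened_at >= self.open_s:
            self.opened_at = None   # cool-down over: close the circuit again
            self.failures.clear()
            return True
        return False

    def record_failure(self) -> None:
        now = self.clock()
        # Keep only failures inside the sliding window, then add this one
        self.failures = [t for t in self.failures if now - t < self.window_s]
        self.failures.append(now)
        if len(self.failures) >= self.max_failures:
            self.opened_at = now

    def record_success(self) -> None:
        self.failures.clear()
```

Callers check `allow()` before hitting the vector DB and serve from cache or static context when it returns False.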

Scaling Patterns — Context Pipelines at High Throughput Advanced

Scale Level	Architecture	Key Optimizations
<100 RPS	Single service, async Python (FastAPI)	Async retrieval, prefix caching, response streaming
100–1K RPS	Horizontal scaling + Redis cache	Semantic query caching, HNSW index on dedicated GPU, re-rank batching
1K–10K RPS	Dedicated retrieval microservice + context assembly service	Read replicas, shard vector index, async evaluation pipeline
>10K RPS	Kafka-based pipeline, geo-distributed indexes, CDN for static context	Pre-computed context for top queries, speculative prefill, model replicas

The 80/20 Query Distribution

In most production systems, 20% of distinct query patterns account for 80% of traffic. Pre-compute and cache context for your top query templates. This can reduce live retrieval load by 60–80%, dramatically improving p99 latency. Use semantic clustering to identify your top query templates from production logs.
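
Once production queries have been mapped to template labels (for example by the semantic clustering mentioned above, which is assumed to have happened already), finding the templates worth pre-computing is a simple cumulative count. The function name and the sample labels are illustrative.

```python
from collections import Counter

def templates_for_coverage(template_counts: Counter, target: float = 0.8) -> list[str]:
    """Smallest set of most-frequent templates that covers `target` of all traffic."""
    total = sum(template_counts.values())
    if total == 0:
        return []
    covered, selected = 0, []
    for template, count in template_counts.most_common():
        if covered / total >= target:
            break
        selected.append(template)
        covered += count
    return selected

if __name__ == "__main__":
    counts = Counter({"return_policy": 50, "order_status": 30,
                      "warranty": 10, "careers": 5, "press": 5})
    print(templates_for_coverage(counts))
```

In this toy distribution two of the five templates cover 80% of traffic, so those two are the first candidates for pre-computed context.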

The Production Context Engineering Checklist Checklist

✅

Retrieval

Hybrid search (vector + BM25)
Re-ranking on top-k
Semantic deduplication
Retrieval fallback chain
Index freshness monitoring

✅

Context Assembly

Token budget enforced per section
Overflow guard (assert <limit)
Extractive compression for long chunks
Conversation history summarization
Context template versioning

✅

Performance

Parallel async retrieval
Timeout budgets per stage
Prefix caching enabled
Semantic query caching (Redis)
Streaming responses

✅

Observability

Distributed tracing (OTEL)
Token usage per section logged
Relevance score sampled (5–10%)
Faithfulness check on samples
Alerts on degradation

∑ Chapter 10 — Key Takeaways

Treat the context pipeline as a microservice graph with independent latency budgets, SLAs, and failure modes per stage
Parallelize all retrieval: vector search, keyword search, history fetch, and user context should all run concurrently with per-stage timeouts
Cost management: model tiering + prefix caching + aggressive compression can reduce token spend by 60–80% vs a naïve implementation
Full observability requires distributed tracing at every pipeline stage — retrieval span, assembly span, LLM span — not just end-to-end latency
Reliability patterns: retrieval fallback chain, overflow guard, circuit breaker, and timeout-based degradation prevent single-component failures from cascading
The 80/20 query distribution is your biggest scaling lever — pre-compute context for top query templates to cut live retrieval load by 60–80%

Context Engineering Is the Real Differentiator Golden Insight

In 2024–2026, the capability gap between frontier models has narrowed dramatically. GPT-4o, Claude 3.5 Sonnet, and Gemini 1.5 Pro are all excellent — and increasingly similar on most benchmarks. The infrastructure for calling them is standardized. What's left as the primary differentiator?

🏗️

Construction Quality

How you select, order, format, and assemble context determines model grounding. Two teams using the same model get dramatically different results based on construction alone.

🔬

Signal Density

Teams that ruthlessly filter noise, compress aggressively, and enforce relevance thresholds see 2–4× quality improvements on the same model vs teams that dump raw retrieval results.

⚙️

Systematic Optimization

Context engineering is the discipline of controlling what the model sees — not hoping it figures it out. Teams with systematic eval loops, compression pipelines, and cache architectures win.

The Production Engineering Mindset

At inference time, the model is a fixed function. Short of fine-tuning, you cannot change what it knows or how it reasons. The only variable you control is the input. Every improvement in your LLM application (quality, cost, latency, reliability) comes from engineering better inputs. Context engineering is not a supporting discipline. It is the core discipline.

Focus on: control over what enters the context · cost awareness at every token · failure handling at every pipeline stage · systematic measurement of what's working. These four habits separate production-grade context systems from everything else.
