
Evaluation & Observability

Measuring what matters — benchmarks, LLM-as-judge, regression testing, tracing, and production monitoring for AI systems.

"It works in the demo" is not a deployment criterion. Without systematic evaluation and observability, you are flying blind — shipping changes that might improve or destroy quality, with no way to know which.

Chapter 01 · Foundations

Why Evaluation Matters — The Cost of Not Measuring

Evaluation is the discipline that separates LLM experimentation from LLM engineering. Without it, every prompt change is a guess, every model upgrade is a risk, and every production incident is a surprise. Eval is not a QA step — it is the feedback loop that makes iteration possible.

LLM Systems Are Probabilistic — Evaluation Must Be Statistical Foundation

LLM-based systems do not behave like traditional software. The same input can produce different outputs, different reasoning paths, and different failure modes across runs. This is not a bug — it is the architecture. Evaluation must account for this explicitly.

🎲

Correctness Is a Distribution

A system that is "90% correct" fails 1 in 10 requests. At 10K queries/day that is 1,000 failures — even if every individual call "seems to work" when you test it manually.

Measure pass rate, not "does it work"
Track failure rate per input category
Report p50/p90 quality, not just avg

📊

Evaluation Must Be Statistical

Testing one or two inputs tells you almost nothing about system reliability. Evaluation requires a dataset large enough to detect real signal from noise.

Single-input testing ≠ evaluation
Minimum: 50 diverse cases for signal
Run multiple times to estimate variance
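
To make the sample-size point concrete, here is a minimal sketch of a confidence interval around an observed pass rate. It uses the Wilson score interval; the function name `wilson_interval` and the 95% z-value are illustrative choices, not something this chapter prescribes.

```python
import math

def wilson_interval(passes: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Approximate 95% confidence interval for a pass rate (Wilson score)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = passes / total
    denom = 1 + z**2 / total
    centre = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)

# Same observed 90% pass rate, very different certainty.
for n in (10, 50, 500):
    low, high = wilson_interval(round(0.9 * n), n)
    print(f"n={n:>3}: observed 90%, plausible range {low:.0%} to {high:.0%}")
```

The interval narrows as the number of cases grows, which is the practical reason a handful of manual checks cannot separate signal from noise.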

🛡️

Reliability Requires a Control Layer

Non-determinism cannot be eliminated — only bounded. Production systems must wrap LLM calls with validation, retries, and fallbacks that enforce acceptable behavior even when the model doesn't.

Validate every output before use
Retry on format failures (max 2–3×)
Fallback on repeated failure
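
The validate, retry, fallback pattern above can be sketched as a small wrapper. This is a sketch under assumptions: `call_model` is a hypothetical callable (prompt in, raw text out) standing in for whatever client you use, and the JSON-with-required-keys contract is only an example of "acceptable behavior".

```python
import json
from typing import Callable

def call_with_guardrails(
    call_model: Callable[[str], str],   # hypothetical: prompt in, raw text out
    prompt: str,
    required_keys: set[str],
    max_attempts: int = 3,
) -> dict:
    """Validate every output, retry on format failures, fall back when retries run out."""
    for attempt in range(1, max_attempts + 1):
        raw = call_model(prompt)
        try:
            parsed = json.loads(raw)
        except json.JSONDecodeError:
            continue  # format failure: retry
        if isinstance(parsed, dict) and required_keys.issubset(parsed.keys()):
            return {"status": "ok", "attempts": attempt, "data": parsed}
    # Fallback: a structured error the caller can handle deterministically.
    return {"status": "fallback", "attempts": max_attempts, "data": None}

# Stub model for demonstration: fails once, then returns valid JSON.
replies = iter(["not json", '{"label": "billing", "confidence": 0.9}'])
result = call_with_guardrails(lambda _: next(replies), "Classify this ticket", {"label"})
print(result)
```

Returning a structured fallback instead of raising keeps the failure visible to callers and countable in monitoring.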

The Measurement Gap — Why "Looks Good" Is Not Good Enough Foundation

LLM outputs are hard to evaluate by reading them. A response can look fluent, well-formatted, and confident — while being factually wrong, missing a key constraint, or breaking a downstream parser 15% of the time. Human spot-checking at scale is too slow, inconsistent, and biased.

🙈

The Fluency Trap

Fluent ≠ correct. LLMs produce grammatically perfect, confident-sounding text regardless of accuracy. Evaluating output by reading it catches obvious failures — not subtle ones.

Wrong answer, perfect prose
Missing constraints, elegant format
Hallucinated facts, appropriate hedging

📉

Silent Degradation

Without live eval, quality degrades invisibly. Model updates, prompt changes, schema changes, or input distribution shifts all erode quality — silently, until a user reports it.

Prompt change improves A, breaks B
Model upgrade changes tone/format
Edge case inputs grow over time

🎯

The Eval Feedback Loop

Evaluation converts vague "something feels off" into measurable "format compliance dropped 8% after the last prompt change." That's actionable — the former is not.

Quantified quality signal
A/B comparison between versions
Regression detection before users notice

The Core Principle

Every change to an LLM system — prompt, model, temperature, schema, retrieval — is a hypothesis. Evaluation is the experiment that tells you whether the hypothesis is correct. Without evaluation, you're not engineering — you're guessing with extra steps.

What to Measure — The Evaluation Taxonomy Core

LLM evaluation is not a single number. Different properties require different measurement approaches — and they don't always move together. A prompt change can improve accuracy while breaking format compliance.

Dimension	What It Measures	How to Measure	Priority
Functional correctness	Does output satisfy the task? (right answer, correct classification)	Exact match / regex / unit tests on output	Highest — ship-blocking
Format compliance	Is output parseable? Does it match the required schema?	JSON parse attempt / schema validation / regex	High — downstream systems break on failure
Factual accuracy	Are stated facts true? Are claims grounded in source?	LLM-as-judge / grounding check / human review	High for knowledge tasks
Quality / tone	Is output helpful, appropriate, on-brand?	LLM-as-judge with rubric / human rating	Medium — subjective but important
Safety / refusal	Does the system refuse harmful requests? Does it over-refuse benign ones?	Red-team datasets / adversarial test suite	Critical for user-facing systems
Latency	Time to first token / total response time	Instrumented timing / p50/p95/p99	Medium — SLA dependent
Cost per query	(Input tokens × input price) + (output tokens × output price)	Token count logging × price table	Medium (economics)
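
As a worked example of the cost-per-query row, a minimal sketch that prices input and output tokens separately. The prices below are placeholders, not any provider's actual rate card.

```python
def cost_per_query(input_tokens: int, output_tokens: int,
                   input_price_per_m: float, output_price_per_m: float) -> float:
    """Dollar cost of one call; prices are quoted per million tokens."""
    return (input_tokens * input_price_per_m
            + output_tokens * output_price_per_m) / 1_000_000

# Placeholder prices in dollars per million tokens (assumed for illustration).
PRICE_IN, PRICE_OUT = 2.50, 10.00
per_query = cost_per_query(1_200, 300, PRICE_IN, PRICE_OUT)
print(f"per query: ${per_query:.4f}, per 10K queries: ${per_query * 10_000:.2f}")
```

Logging both token counts per request lets the same function run over production logs to track cost alongside quality.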

Cost Is a First-Class Evaluation Metric Core

Evaluation pipelines themselves incur real cost. Running the wrong eval strategy will produce both false confidence and unnecessary API spend — simultaneously.

Eval Type	Run When	Cost per Run	Why This Cadence
Deterministic checks	Every commit, every output in production	~$0	Zero marginal cost — no reason not to run always
LLM judge (golden set)	Every pull request	$0.50–$2.00 per run (100–200 samples)	Catches quality regressions before merge; cost is low vs risk
Human eval	Major releases / model changes	$50–$500+ per review	Too slow and expensive for every change; reserved for high-stakes decisions
Online sampling + judge	Continuous in production (1–5% of traffic)	Scales with traffic volume	Real distribution signal; catches drift offline tests miss

Common Eval Cost Mistakes

Running an LLM judge on every commit (not just PRs) can burn $20–$50/day with no added signal. Verbose judge prompts inflate token counts and judge costs. Evaluating with GPT-4o wastes budget when a calibrated GPT-4o-mini gives the same scores at roughly 10× lower cost. Track your eval pipeline spend as a separate budget line: it is a real operational cost, not a sunk cost.

The Eval Hierarchy — Fast to Slow, Cheap to Expensive In-depth

Not all evals should run on every change. The principle: fast, cheap evals run always; slow, expensive evals run on significant changes. This is the eval equivalent of a testing pyramid — unit tests at the base, integration tests in the middle, human review at the top.

The evaluation hierarchy: checks run less often as cost per run increases

⚡

Level 1 — Run Everything, Always

Deterministic checks have zero marginal cost. JSON parse, schema validation, field presence checks, regex format checks — these should run on every single output in tests and in production.

Takes milliseconds per sample
Catches format regressions immediately
Gate all deployments on 100% pass

🤖

Level 2 — LLM Judge on PRs

Run LLM-as-judge evaluation on your golden set for every pull request. 100–200 samples × $0.005/judge call ≈ $0.50–$1.00 per PR. Worth it — catches quality regressions before merge.

~$0.50–$1.00 per eval run
Catches subtle quality changes
Automated — no human bottleneck

Deterministic Guards — First Line of Defense Core

Before any LLM-based evaluation runs, enforce deterministic checks. These have near-zero cost, run in milliseconds, and catch the most impactful failures — format errors that would crash downstream systems or produce silently corrupt data.

What Deterministic Guards Check

✅ JSON schema validation — does the output match the required schema?

✅ Required field presence — are all expected fields non-null?

✅ Regex constraints — does a field match its expected pattern (date, email, ID)?

✅ Type validation — is a numeric field actually a number?

✅ Enum validation — is a classification label one of the allowed values?

The Key Rule

If a failure can be caught deterministically, it must never reach an LLM judge.

Running an LLM judge on a malformed JSON response costs money and adds latency, when the right response is to fail immediately with a clear error.

Deterministic guards: ~$0, <1ms, run on every output

LLM judge: $0.005+, 1–3s, run on sampled outputs

Apply the cheapest check that can detect the failure — and only escalate when cheaper checks pass.
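
The five guard types listed above can be sketched with the standard library alone. The field names (`label`, `confidence`, `date`), the allowed labels, and the date pattern are hypothetical; substitute your own schema.

```python
import json
import re

ALLOWED_LABELS = {"billing", "technical", "account"}   # hypothetical enum
DATE_PATTERN = re.compile(r"^\d{4}-\d{2}-\d{2}$")       # hypothetical date format

def deterministic_guard(raw_output: str) -> list[str]:
    """Return a list of violations; an empty list means the output may go on to a judge."""
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return ["top-level value is not an object"]

    errors = []
    for name in ("label", "confidence", "date"):        # required field presence
        if data.get(name) is None:
            errors.append(f"missing field: {name}")
    label = data.get("label")
    if label is not None and not (isinstance(label, str) and label in ALLOWED_LABELS):
        errors.append("label not in allowed values")    # enum validation
    conf = data.get("confidence")
    if conf is not None and (isinstance(conf, bool) or not isinstance(conf, (int, float))):
        errors.append("confidence is not a number")     # type validation
    date = data.get("date")
    if date is not None and not (isinstance(date, str) and DATE_PATTERN.match(date)):
        errors.append("date does not match YYYY-MM-DD") # regex constraint
    return errors

print(deterministic_guard('{"label": "billing", "confidence": 0.93, "date": "2024-07-01"}'))
print(deterministic_guard('{"label": "refund", "confidence": "high"}'))
```

Only outputs that return an empty list would be passed on to an LLM judge; everything else fails immediately with a readable reason.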

Offline vs Online Evaluation — Two Complementary Views Core

Offline Evaluation (pre-production)

What: Run fixed test set against system before deployment

When: During development, on every significant change, blocking deployment

Pros: Controlled, reproducible, no user impact

Cons: Test set may not match production distribution; eval inputs may become stale

Tools: promptfoo, LangSmith, custom pytest harness

Online Evaluation (post-production)

What: Continuously sample and evaluate live traffic

When: Always running in production at a 1–5% sampling rate

Pros: Real distribution, catches drift, finds failures offline tests miss

Cons: Failures already reached users; LLM judge adds cost per sample

Tools: LangSmith, Langfuse, custom sampling + judge pipeline

Offline Eval Is Necessary But Not Sufficient

A golden test set built in January will not cover the inputs your users are actually sending in July. Production inputs drift over time — new topics, new edge cases, adversarial inputs. Online evaluation sampling at 5% of production traffic, judged automatically, is the only way to know if quality is holding as inputs evolve. Run both — they catch different things.
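
One way to pick the sampled slice of traffic is to hash a request identifier, so the decision is stable across retries and services. This is a sketch under assumptions: `request_id` is a hypothetical field, and the 5% default mirrors the rate mentioned above.

```python
import hashlib

def in_eval_sample(request_id: str, rate: float = 0.05) -> bool:
    """Deterministically pick roughly `rate` of requests by hashing their ID."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64   # uniform in [0, 1)
    return bucket < rate

sampled = sum(in_eval_sample(f"req-{i}") for i in range(100_000))
print(f"sampled {sampled} of 100000 requests")
```

Sampled requests would then go through the same deterministic checks and judge used offline, so online and offline scores stay comparable.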

The Cost of Not Measuring — Real Failure Patterns In-depth

Failure Pattern	How It Happens	Discovery Without Eval	Prevention
Prompt regression	Fix one failure mode in a prompt → silently break 3 others	User complaints weeks later	Eval on full golden set before merge
Model update breakage	Provider updates GPT-4o silently; JSON structure changes slightly	Parser 500s in production	Eval + model version pinning
Distribution shift	New user cohort sends different query types than anticipated	Low satisfaction scores over weeks	Online eval sampling detects drop early
Format drift	Downstream parser changes; LLM still outputs old format	Silent data corruption in DB	Schema validation on every output
Cost explosion	Prompt grows longer; token count doubles; nobody notices until bill arrives	Monthly invoice shock	Token count tracking in eval + alerts

Failure Taxonomy — What Actually Breaks in Production In-depth

LLM failures in production fall into four distinct categories — each with different visibility, detection difficulty, and downstream impact. Understanding which category a failure belongs to determines how to catch and fix it.

🔴

Hard Failures — Immediately Visible

System crashes, parser throws, API returns error. These are easy to detect but still must be handled gracefully.

Invalid JSON — parser throws
Missing required field — null pointer
Wrong data type — downstream cast fails
Detection: deterministic checks, error monitoring

🟡

Soft Failures — Subtly Wrong

Output parses successfully but is partially incorrect. These pass format checks but fail quality checks — often found only via LLM judge or human review.

Partially correct answer (misses one constraint)
Correct structure, wrong content
Missing context that changes the answer
Detection: LLM judge with multi-dimension rubric

⚫

Silent Failures — Most Dangerous

Output looks correct, passes all checks, reaches users. But it's wrong. These corrupt downstream systems, erode user trust, and are almost impossible to detect without continuous quality sampling.

Hallucinated values that look plausible
Confident wrong answers with no hedging
Format drift (subtle schema deviation)
Detection: online eval sampling + grounding checks

📉

Behavioral Drift — Gradual Degradation

The system works but quality drifts over time. Tone changes, verbosity increases, instruction adherence drops. No single failure — just a slow erosion of quality.

Longer responses than specified
Brand tone gradually shifts
Reliability of structured output drops week-over-week
Detection: rolling avg judge scores over time
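
A rolling average over judge scores can be sketched in a few lines. The class name `DriftMonitor`, the window size, and the tolerance are illustrative assumptions; tune them against your own score history.

```python
from collections import deque

class DriftMonitor:
    """Track a rolling mean of judge scores and flag drops below a baseline."""

    def __init__(self, baseline: float, window: int = 200, tolerance: float = 0.3):
        self.baseline = baseline
        self.tolerance = tolerance
        self.scores = deque(maxlen=window)   # oldest scores fall out automatically

    def rolling_mean(self) -> float:
        return sum(self.scores) / len(self.scores)

    def add(self, score: float) -> bool:
        """Record one score; return True when the full window has drifted too low."""
        self.scores.append(score)
        window_full = len(self.scores) == self.scores.maxlen
        return window_full and self.rolling_mean() < self.baseline - self.tolerance

monitor = DriftMonitor(baseline=4.2, window=5, tolerance=0.3)
alerts = [monitor.add(s) for s in [4.3, 4.1, 4.2, 3.9, 3.8, 3.7, 3.6, 3.5]]
print(alerts)
```

No single score in the example is alarming on its own; the alert fires only once the windowed mean falls below the baseline minus the tolerance.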

Silent Failures Are the Most Expensive

Hard failures get fixed immediately — they break the system loudly. Silent failures pass all your checks, reach all your users, and corrupt downstream data silently. By the time you notice (user report, data audit), the failure has been happening for days or weeks. The only defense is continuous online evaluation that samples and judges live traffic — not just offline testing that only catches what you thought to test for.

Reliability vs Quality — What Production Actually Requires Core

Quality and reliability are different goals — and they require different engineering approaches. A system can be high quality on good inputs while being completely unreliable at scale.

High Quality (but unreliable)

Produces excellent outputs on typical inputs

Fails unpredictably on edge cases

Hard to test because failures are non-obvious

Users experience occasional great results — and occasional crashes

Trust level: low — users can't predict when it works

High Reliability (production-grade)

Produces consistently acceptable outputs across all inputs

Edge cases handled gracefully — fallback, "I don't know," or structured error

Testable: pass/fail rate stable across runs

Users experience predictable behavior, not occasional brilliance

Trust level: high — behavior is predictable

In Production: Reliability Over Peak Quality

A system that produces brilliant output 70% of the time and crashes or hallucinates 30% of the time is not production-grade. Reliability is what makes users trust the system — and trust is built by consistent, predictable behavior, not occasional impressive outputs. Target reliability first (measure failure rate, build guardrails) before optimizing for peak quality.

Prompts and Evaluation Must Evolve Together Core

Prompt engineering without evaluation is unstable. Most prompt changes fix one issue and introduce new failures. Without a full eval suite, you can't know whether a change is a net improvement or a net regression.

Scenario	Without Eval	With Eval
Prompt change to fix hallucination	Seems fixed in manual testing — format compliance silently dropped 5%	Eval shows: hallucination ↓ 15%, format compliance ↓ 5%. Net positive — but catch the regression.
Model upgrade (mini → full)	Quality feels better. Nobody notices that cost rose 16× and token use rose about 30%.	Eval shows: accuracy +3%, cost ×16, tokens +28%. Decide intentionally.
Adding few-shot examples	Looks better on the 5 cases you tested — broke 8% of edge cases.	Eval on 200-case golden set shows: common cases improved, edge cases regressed.

The Prompt Engineering Workflow

Every prompt change should: (1) improve at least one metric, (2) not regress any metric beyond threshold, (3) be recorded with before/after eval scores. Track metric deltas — not just pass/fail. A change that improves accuracy from 88% to 91% is valuable. The same change that simultaneously drops format compliance from 99% to 94% may not be worth shipping.
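
The three rules above translate into a small gate over before/after scores. This sketch assumes every metric is "higher is better" and that `max_regression` holds per-metric tolerances; both are assumptions for illustration.

```python
def compare_runs(before: dict[str, float], after: dict[str, float],
                 max_regression: dict[str, float]) -> dict:
    """Compare two eval runs (higher is better) and decide whether the change can ship."""
    deltas = {name: round(after[name] - before[name], 4) for name in before}
    improved = [m for m, d in deltas.items() if d > 0]
    regressed = [m for m, d in deltas.items() if d < -max_regression.get(m, 0.0)]
    return {"deltas": deltas, "improved": improved, "regressed": regressed,
            "ship": bool(improved) and not regressed}

before = {"accuracy": 0.88, "format_compliance": 0.99}
after = {"accuracy": 0.91, "format_compliance": 0.94}
# Hypothetical thresholds: tolerate a 1-point drop on either metric.
report = compare_runs(before, after, {"accuracy": 0.01, "format_compliance": 0.01})
print(report)
```

Storing the returned `deltas` with each prompt version gives the before/after record that rule (3) asks for.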

Evaluation Anti-Patterns — What to Avoid In-depth

Anti-Pattern	What Happens	The Fix
"Looks good to me"	Human approval of a few test cases masquerades as evaluation. Fails silently on edge cases.	Require automated eval on 50+ cases before any merge. Manual review is a supplement, not a replacement.
Single-metric optimization	Accuracy improves; format compliance, latency, and cost all regress. Nobody noticed because only one metric was tracked.	Track all critical dimensions on every eval run. Block if any key metric regresses.
Static golden set	Golden set from 6 months ago. New features and user behaviors not covered. High offline scores, poor production quality.	Add cases from production failures monthly. Assign ownership for golden set maintenance.
Over-relying on LLM judge	LLM judge gives 4.2/5, which feels like high quality. JSON parse failure rate is 8%. Judge never checked format.	Always run deterministic checks first. LLM judge is for quality dimensions that deterministic checks can't cover.
Ignoring eval cost	LLM judge runs on every commit with GPT-4o. $30/day eval spend. Nobody noticed for 3 months.	Run LLM judge on PRs only. Use mini model when calibrated. Track eval pipeline cost as a budget line.

∑ Chapter 01 — Key Takeaways

Evaluation is the feedback loop that turns LLM changes from guesses into measurable improvements
Fluent ≠ correct — silent degradation is invisible without systematic measurement
Measure 7 dimensions: functional correctness, format compliance, factual accuracy, quality, safety, latency, cost
The eval hierarchy: deterministic checks (always) → LLM judge (per PR) → human eval (major changes) → A/B (production)
Offline eval gates deployment; online eval catches drift in production — both are required
Five failure patterns that evaluation prevents: prompt regression, model update breakage, distribution shift, format drift, cost explosion

Chapter 02 · Measurement

Benchmarks — Standard Evaluations for LLMs

Benchmarks let you compare models on standardized tasks. But benchmark performance and production performance are not the same thing. Understanding what benchmarks measure — and what they miss — is essential before using them to make model selection decisions.

Major Benchmarks — What They Measure Core

Benchmark	What It Tests	Format	Useful For
MMLU Hendrycks et al. 2021	World knowledge across 57 academic subjects (STEM, humanities, law, medicine)	Multiple choice, 4 options	Comparing knowledge breadth; model selection for knowledge-intensive tasks
HumanEval OpenAI 2021	Python function completion from docstrings; 164 programming problems	Code generation, unit test pass rate	Code assistant model selection; comparing coding capability
MT-Bench LMSYS 2023	Multi-turn conversation quality across 8 categories (writing, math, coding, reasoning)	LLM-as-judge scoring by GPT-4	Chat model quality; instruction following across domains
GPQA Rein et al. 2023	Graduate-level science questions designed to be hard for non-experts	Multiple choice, expert-validated	Frontier model capability; distinguishing top-tier models
GSM8K Cobbe et al. 2021	Grade-school math word problems requiring multi-step arithmetic reasoning	Free-form answer, exact match	Multi-step reasoning; CoT effectiveness
HellaSwag Zellers et al. 2019	Commonsense reasoning: which sentence correctly continues an activity description	Multiple choice	Common sense; less useful for frontier models (most score 95%+)
LMSYS Chatbot Arena LMSYS 2023	Human preference ranking: users compare two anonymous model responses head-to-head	Elo rating from human votes	Real user preference; most production-relevant benchmark

The Most Production-Relevant Benchmark

Of all public benchmarks, LMSYS Chatbot Arena most closely predicts which models users prefer in practice, because it uses real human preference data rather than academic tasks. MMLU tells you about knowledge breadth. Arena tells you about perceived output quality. Use both, but weight Arena more heavily for user-facing applications.

Benchmark Limitations — Why Scores Can Mislead In-depth

🎯

Data Contamination

If benchmark questions appeared in the model's training data, scores reflect memorization — not capability. Increasingly common as benchmarks become widely used.

Impossible to verify from outside
Inflates reported scores
New benchmarks become contaminated faster than expected

📊

Benchmark Saturation

When most frontier models score 85–92% on a benchmark, it can no longer distinguish between them. HellaSwag is essentially useless for comparing GPT-4o vs Claude 3.5 — both score 95%+.

Old benchmarks can't rank new models
Need constantly harder challenges
GPQA was designed specifically for this

🔧

Task-Distribution Mismatch

Academic benchmarks test standardized tasks. Your production system has a specific input distribution. A model that tops MMLU may be worse than a smaller model on your specific task type.

MMLU doesn't predict JSON extraction quality
HumanEval ≠ Python code review quality
Always run domain-specific eval

Goodhart's Law Applies to Benchmarks

Once a benchmark becomes a target, it ceases to be a good measure. Labs optimize specifically for leaderboard benchmarks — through training data selection, prompt engineering, and sometimes cherry-picking evaluation conditions. A model's actual usefulness on your task may be uncorrelated with its leaderboard position. Always validate on your own data before making model selection decisions from benchmarks alone.

When to Use Benchmarks — Practical Guidance Core

Decision	Use Benchmarks?	What to Use Instead / Also
Initial model shortlisting	Yes, to filter obvious losers	MMLU for knowledge tasks, HumanEval for code, Arena Elo for general quality
Final model selection	Insufficient alone	Your own golden set + task-specific eval is mandatory
Tracking model provider updates	Watch for score changes	Your production eval set is more reliable signal
Comparing your fine-tuned model to base	Yes — use MMLU to detect capability regression	Domain eval for capability gain measurement
Communicating model quality externally	Use with caveats	Benchmark + task-specific results together tell a more honest story

Custom Task Benchmarks — Building Your Own Core

For production systems, custom benchmarks tuned to your task type are more valuable than any public benchmark. They measure what you actually care about — on representative inputs from your users.

1️⃣ Sample inputs: 100–500 real/realistic

2️⃣ Annotate outputs: human or LLM-judge

3️⃣ Define metrics: accuracy, format, quality

4️⃣ Automate runner: promptfoo / pytest

5️⃣ Run on change: gate deployment

Minimum Viable Benchmark

Start with 50 diverse inputs, not 500. Cover: (1) typical cases (60%), (2) hard/ambiguous cases (20%), (3) edge cases and failures you've seen in production (20%). A 50-case eval that runs in CI catches a large share of regressions. Perfect coverage is the enemy of getting started. Add cases as you discover new failure modes.
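
A minimum viable benchmark runner can be very small. The sketch below uses exact-match scoring and reports pass rate per case kind; the cases, the `kind` tag, and `keyword_classifier` (a stand-in for the system under test) are all hypothetical.

```python
from collections import defaultdict
from typing import Callable

# Hypothetical cases; "kind" follows the typical / hard / edge split.
CASES = [
    {"input": "Where is my invoice?", "expected": "billing", "kind": "typical"},
    {"input": "App crashes on login", "expected": "technical", "kind": "typical"},
    {"input": "Charged twice after the crash", "expected": "billing", "kind": "hard"},
    {"input": "", "expected": "unknown", "kind": "edge"},
]

def run_benchmark(system: Callable[[str], str], cases: list[dict]) -> dict[str, float]:
    """Exact-match pass rate per case kind, plus an overall rate."""
    passed, total = defaultdict(int), defaultdict(int)
    for case in cases:
        ok = system(case["input"]).strip().lower() == case["expected"]
        for key in (case["kind"], "overall"):
            total[key] += 1
            passed[key] += ok
    return {key: passed[key] / total[key] for key in total}

def keyword_classifier(text: str) -> str:
    """Stand-in for the real system under test."""
    if not text:
        return "unknown"
    return "technical" if "crash" in text.lower() else "billing"

print(run_benchmark(keyword_classifier, CASES))
```

Reporting per-kind rates matters because an overall number can look healthy while the hard cases fail outright, as they do for this stand-in classifier.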

∑ Chapter 02 — Key Takeaways

Key benchmarks: MMLU (knowledge), HumanEval (code), MT-Bench (chat quality), GSM8K (reasoning), Arena (real user preference)
LMSYS Chatbot Arena is the most production-relevant benchmark — uses real human preference data
Three benchmark failure modes: data contamination, saturation, task-distribution mismatch
Benchmarks are for initial shortlisting — final model selection requires your own task-specific eval
Goodhart's Law: when a benchmark becomes a target, it stops being a good measure — leaderboard ≠ production performance
Build a 50-case custom benchmark before anything else — typical (60%), hard (20%), edge cases (20%)

Chapter 03 · Scalable Eval

LLM-as-Judge — Automated Qualitative Evaluation

Human evaluation is the gold standard, but it costs $0.10–$1+ per sample and can't scale to thousands of daily outputs. LLM-as-judge closes the gap: automated qualitative evaluation that costs $0.001–$0.01 per sample and scales to thousands of outputs per day. When designed correctly, it reaches roughly 80–90% agreement with human judgment.

What Is LLM-as-Judge — The Core Concept Foundation

LLM-as-judge uses a capable frontier model (GPT-4o, Claude 3.5 Sonnet) to evaluate the output of another model — or even the same model. The judge receives the original input, relevant context, the output to evaluate, and a scoring rubric. It returns a score + brief justification.

LLM-as-judge evaluation pipeline

When LLM-as-Judge Works Well

✅ Evaluating text quality, helpfulness, tone

✅ Grounding checks (does output use source material?)

✅ Multi-turn conversation coherence

✅ Relative comparison: "which response is better?"

✅ Structured scoring with clear rubrics (1–5 scale)

When LLM-as-Judge Falls Short

❌ Mathematical / code correctness (use unit tests)

❌ Tasks requiring specific domain expertise

❌ Very long outputs (>2K tokens) — judge loses focus

❌ Fine-grained factual claims without reference source

❌ Safety evaluation — judges can be jailbroken

Judge Prompt Design — The Difference Between Good and Noisy Evaluation In-depth

The quality of your LLM judge is almost entirely determined by the quality of your judge prompt. A weak judge prompt produces noisy, inconsistent scores. A strong judge prompt produces reliable, calibrated scores that correlate with human judgment.

❌

Weak judge prompt — vague, inconsistent

System: You are an expert evaluator. Rate the response quality.
Score from 1-10 and explain why.

Input: {question}
Response: {output}

Problems: "quality" is undefined, scale is vague (what's a 6?), no structured output, inconsistent across runs.

✅

Strong judge prompt — explicit rubric, structured output

System: You are evaluating AI assistant responses for a customer support system.
Score the response on HELPFULNESS only: how well it resolves the user's issue.

Scoring rubric:
5 = Fully resolves the issue with correct, actionable steps
4 = Mostly resolves with minor gaps or ambiguity
3 = Partially addresses but missing key steps
2 = Identifies the issue but gives incorrect or unhelpful guidance
1 = Does not address the user's issue at all

Return JSON only:
{"score": <1-5>, "reason": "<one sentence>", "missing": "<what's missing, or null>"}

User query: {question}
Response to evaluate: {output}

Evaluate:

Specific dimension, explicit per-score definitions, structured JSON output, consistent reasoning required.

Design Principle	Why It Matters
One dimension per judge	Helpfulness + accuracy + tone in one prompt → confused, noisy scores. One judge per dimension, aggregated separately.
Explicit per-level rubric	A scale of 1–5 without definitions means different things on each call. Define exactly what each score value means.
Require structured output	JSON output is parseable and consistent. Freeform reasoning varies in format and is hard to aggregate.
Include reference answer if available	Comparing to a ground-truth answer dramatically improves accuracy evaluation over judging in isolation.
Ask for a reason	The justification catches judge errors — if the reason contradicts the score, the evaluation is unreliable.

Multi-Dimension Scoring — Evaluating More Than One Thing Core

One overall quality score hides signal. A response can be perfectly accurate but poorly formatted, or beautifully written but factually wrong. Multi-dimension scoring separates these signals so you know what to fix.

🎯

Dimension: Correctness

"Is the answer factually accurate and does it satisfy the user's stated request?"

5: Fully correct, nothing to dispute
3: Mostly correct, minor inaccuracy
1: Incorrect or misleading answer

📋

Dimension: Completeness

"Does the answer cover all aspects of the question, or does it miss key parts?"

5: All aspects addressed
3: Main answer present, details missing
1: Major parts of question unanswered

🔗

Dimension: Groundedness

"For RAG systems: is every claim in the answer supported by the provided source documents?"

5: Every claim traced to source
3: Mostly grounded, one unsupported claim
1: Significant hallucination present

The Four Canonical Dimensions (for most systems)

Most production LLM systems benefit from four evaluation dimensions: (1) Correctness: is the answer right? (2) Completeness: does it cover everything asked? (3) Groundedness: are claims supported? (critical for RAG) (4) Format compliance: handled deterministically, not by an LLM judge. Run one judge call for each of the first three dimensions, return structured JSON per call, and aggregate across your golden set.
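
Aggregating per dimension, rather than into one blended score, can be sketched as follows. The result format (one dict of dimension scores per golden-set case) and the pass threshold of 4 are assumptions for illustration.

```python
from statistics import mean

def aggregate_dimensions(results: list[dict[str, int]],
                         pass_at: int = 4) -> dict[str, dict[str, float]]:
    """Mean score and pass rate per judged dimension, never blended into one number."""
    dimensions = results[0].keys()
    return {
        dim: {
            "mean": round(mean(r[dim] for r in results), 2),
            "pass_rate": sum(r[dim] >= pass_at for r in results) / len(results),
        }
        for dim in dimensions
    }

# Hypothetical judge scores for four golden-set cases.
results = [
    {"correctness": 5, "completeness": 3, "groundedness": 5},
    {"correctness": 4, "completeness": 2, "groundedness": 4},
    {"correctness": 5, "completeness": 4, "groundedness": 5},
    {"correctness": 4, "completeness": 3, "groundedness": 3},
]
summary = aggregate_dimensions(results)
for dim, stats in summary.items():
    print(dim, stats)
```

In this made-up data, correctness looks healthy while completeness clearly does not, which a single averaged score would have hidden.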

Judge Biases — What Distorts LLM-as-Judge Scores In-depth

Bias	What It Is	Mitigation
Verbosity bias	Judges rate longer responses higher, even when a shorter answer is better	Add explicit instruction: "Do not reward verbosity. A concise correct answer scores higher than a verbose correct answer."
Self-preference bias	GPT-4o-as-judge prefers GPT-4o outputs; Claude-as-judge prefers Claude outputs	Use a different model family as judge than the model being evaluated. Or use multiple judges and average.
Position bias	In A-vs-B comparisons, judge prefers whichever response appears first (or second)	Run each comparison twice with reversed order. Only accept agreements; reclassify disagreements as ties.
Sycophancy	If you include "this response is from our best model", judge inflates score	Never include model identity in judge prompt. Blind evaluation only.
Formatting halo	Well-formatted responses (headers, bullet points) get higher scores regardless of content quality	Add: "Evaluate content quality, not formatting. Ignore markdown styling when scoring."
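
The position-bias mitigation in the table (run both orders, keep agreements, treat flips as ties) can be sketched like this. `compare(first, second)` is a hypothetical judge call that returns "first" or "second"; the two stub judges exist only to show the behavior.

```python
from typing import Callable

def debiased_compare(compare: Callable[[str, str], str], a: str, b: str) -> str:
    """Run an A-vs-B judge in both orders; return "a", "b", or "tie"."""
    a_wins_forward = compare(a, b) == "first"     # a shown first
    a_wins_backward = compare(b, a) == "second"   # b shown first
    if a_wins_forward and a_wins_backward:
        return "a"
    if not a_wins_forward and not a_wins_backward:
        return "b"
    return "tie"   # verdict flipped with order: treat as a tie

def always_first(first: str, second: str) -> str:
    """Stub judge with pure position bias."""
    return "first"

def prefers_longer(first: str, second: str) -> str:
    """Stub judge whose preference does not depend on order."""
    return "first" if len(first) >= len(second) else "second"

print(debiased_compare(always_first, "short", "a longer answer"))
print(debiased_compare(prefers_longer, "short", "a longer answer"))
```

The cost is two judge calls per comparison, so this fits A-vs-B evaluations on a golden set better than high-volume online sampling.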

Calibrating Your Judge — Ensuring Scores Mean Something Core

Before trusting your LLM judge at scale, validate that its scores correlate with human judgment. This is called calibration — and it's what separates a reliable eval pipeline from a false sense of measurement.

1️⃣ Human labels: 50–100 samples rated by humans

2️⃣ Judge labels: same samples, LLM judge

3️⃣ Agreement rate: target ≥80% on binary pass/fail

4️⃣ Inspect disagreements: refine rubric on systematic gaps

5️⃣ Re-calibrate: after rubric changes

Never Deploy an Uncalibrated Judge

An LLM judge you haven't validated against human labels is measuring something — you just don't know what. A judge that looks correct on casual inspection may be systematically scoring a known failure mode as passing. Validate on at least 50 human-labeled examples before using any judge in CI. If judge-human agreement is below 75%, your rubric needs work before the judge is trustworthy.
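
The calibration check reduces to comparing two label lists. A minimal sketch, assuming 1–5 scores are collapsed to pass/fail at a threshold of 4 (the threshold and the sample scores are illustrative):

```python
def agreement_report(human: list[bool], judge: list[bool]) -> dict:
    """Binary pass/fail agreement between human labels and judge labels."""
    if len(human) != len(judge) or not human:
        raise ValueError("need two non-empty label lists of equal length")
    disagreements = [i for i, (h, j) in enumerate(zip(human, judge)) if h != j]
    false_passes = [i for i in disagreements if judge[i]]   # judge passed, human failed
    return {
        "agreement": 1 - len(disagreements) / len(human),
        "disagreements": disagreements,
        "false_passes": false_passes,
    }

def to_pass(score: int, threshold: int = 4) -> bool:
    """Collapse a 1-5 score to pass/fail; the threshold of 4 is an assumption."""
    return score >= threshold

human_scores = [5, 4, 2, 3, 5, 1, 4, 4, 2, 5]
judge_scores = [5, 4, 4, 3, 5, 2, 3, 4, 2, 5]
report = agreement_report([to_pass(s) for s in human_scores],
                          [to_pass(s) for s in judge_scores])
print(report)
```

The `false_passes` indices deserve the closest reading: they are the cases where the judge would let a known failure through.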

🔧

Minimal judge implementation (Python)

import json
from typing import Optional

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

JUDGE_SYSTEM = """You evaluate AI assistant responses for correctness.
Score 1-5 where:
5 = Fully correct and complete
4 = Mostly correct, minor omission
3 = Partially correct
2 = Incorrect with some valid content
1 = Completely incorrect or irrelevant
Return JSON: {"score": <1-5>, "reason": "<one sentence>"}"""


def judge(question: str, response: str, reference: Optional[str] = None) -> dict:
    """Score one response; pass a reference answer when one is available."""
    user_msg = f"Question: {question}\nResponse: {response}"
    if reference:
        user_msg += f"\nReference answer: {reference}"
    result = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": JUDGE_SYSTEM},
            {"role": "user", "content": user_msg},
        ],
        response_format={"type": "json_object"},  # JSON mode: output parses as JSON
        temperature=0,  # reduces run-to-run variation
    )
    return json.loads(result.choices[0].message.content)

∑ Chapter 03 — Key Takeaways

LLM-as-judge: $0.001–0.01 per sample, with roughly 80–90% agreement with human judgment when designed correctly
Works for: quality, helpfulness, groundedness, coherence. Does not work for: math, code correctness, safety
One dimension per judge call — combined rubrics produce noisy, ambiguous scores
Strong judge prompts need: explicit per-level rubric, structured JSON output, reasoning field
Five key biases to mitigate: verbosity, self-preference, position, sycophancy, formatting halo
Calibrate against 50+ human labels before trusting any judge in CI — target ≥80% agreement on binary pass/fail
Use temperature=0 and json_object mode for consistent, parseable judge outputs

Chapter 04 · Test Data

Golden Sets — Building Your Evaluation Dataset

A golden set is the foundation of every eval pipeline. Without one, you have no baseline, no regression detection, and no way to compare prompt versions. Building a high-quality golden set is the most important engineering task in LLM evaluation — and it's done once, then maintained continuously.

What Is a Golden Set — The Anatomy Foundation

A golden set is a curated collection of (input, expected output) pairs — or more precisely, (input, evaluation criteria) pairs — that represent the task your system must perform. "Golden" means human-verified: each case has been reviewed and annotated to define what correct looks like.

📥

Input

The query, document, or task given to the system. Should be representative of real production traffic — not synthetic or idealized.

Sampled from real user inputs
Anonymized if needed (PII removal)
Diverse across your task distribution

🎯

Expected Output / Criteria

Defines what "correct" means for this input. Can be an exact reference answer, a set of required elements, or rubric criteria.

Exact answer (classification, extraction)
Required fields / key points (summarization)
Rubric criteria (open-ended quality)

🏷️

Metadata

Tags that enable sliced analysis — query type, difficulty level, failure category, date added. Essential for understanding which cases regressed.

Difficulty: easy / medium / hard
Category: topic or task type
Source: synthetic / prod sample / manual

Collection Strategies — Where Cases Come From In-depth

Source	How	Pros	Cons
Production sampling	Log 1–5% of live traffic; human-annotate a sample	Real distribution; catches actual failure modes	Requires annotation pipeline; PII concerns
Manual curation	Domain expert writes inputs covering known difficulty areas	High quality; targets known hard cases	Slow; may not cover real distribution
Failure mining	Collect every confirmed system failure → add to golden set	Directly prevents known regressions	Reactive; only catches known problems
LLM-assisted generation	Use a strong model to generate diverse inputs; human-verify	Fast at scale; can cover edge cases systematically	Distribution differs from real users; needs review
Adversarial construction	Deliberately craft inputs that test edge cases, ambiguity, format stress	Finds failure modes that sampling misses	Requires effort; hard to know what to target

The Recommended Mix

A production-grade golden set should contain: 60% production-sampled (real distribution), 20% failure-mined (known regressions), 20% adversarial/edge cases (hard cases your sampling won't catch naturally). The failure-mined cases are critical — they ensure that every bug you've fixed stays fixed.
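One way to keep the mix honest is to count cases by their source tag. The sketch below assumes each case carries a `source` field and that the labels are `production_sample`, `failure_mined`, and `adversarial`; the 10-point tolerance is an illustrative choice, not a recommendation from this chapter.

```python
from collections import Counter

# Target shares from the recommended mix; the source labels are assumed names.
TARGET_MIX = {"production_sample": 0.60, "failure_mined": 0.20, "adversarial": 0.20}

def mix_report(cases: list[dict], tolerance: float = 0.10) -> dict:
    """Return share, target, and an in-tolerance flag for every source label."""
    counts = Counter(case.get("source", "unknown") for case in cases)
    total = len(cases)
    report = {}
    for source in sorted(set(counts) | set(TARGET_MIX)):
        share = counts[source] / total if total else 0.0
        target = TARGET_MIX.get(source, 0.0)
        report[source] = {
            "share": round(share, 3),
            "target": target,
            "ok": abs(share - target) <= tolerance,
        }
    return report

# Example: a set that is light on failure-mined cases.
cases = (
    [{"source": "production_sample"}] * 30
    + [{"source": "failure_mined"}] * 3
    + [{"source": "adversarial"}] * 17
)
report = mix_report(cases)
```

Running a check like this in CI makes it visible when the set slowly fills up with one kind of case.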

Coverage Principles — What Your Golden Set Must Cover Core

🗺️

Task Distribution Coverage

Your golden set must represent the full range of task types your system handles — not just the most common. An eval set of only easy cases gives you a false sense of quality.

All major task categories proportionally represented
Long inputs and short inputs included
All supported languages/domains

⚠️

Edge Case Coverage

Edge cases are where production systems break. Empty inputs, extremely long inputs, ambiguous queries, multi-intent queries, adversarial phrasing.

Empty / one-word inputs
Inputs near context window limit
Ambiguous or contradictory requests

🔧

Format Stress Coverage

Test inputs that stress your output format: inputs that require nested JSON, inputs in different languages, inputs with special characters, very short or very long expected outputs.

Special characters in input (quotes, brackets)
Inputs that should produce minimal output
Inputs that should produce structured output

🕵️

Regression Coverage

Every confirmed production failure becomes a golden set case. This is your regression suite — evidence that fixed bugs stay fixed across future changes.

Add case within 1 day of confirming a failure
Tag with failure date and root cause
Never remove — only deprecate with reason

Annotation — Defining What Correct Looks Like Core

Annotation is the hardest part of building a golden set. The goal is to define "correct" precisely enough that an automated evaluator (schema check, exact match, or LLM judge) can reliably determine pass/fail.

Task Type	Annotation Format	Eval Method	Example
Classification	Exact label(s)	Exact match	"Label: BILLING"
Extraction	Required field values	Key-value match / schema check	{"name": "John", "date": "2024-01-15"}
Summarization	Key points that must be present	LLM judge (completeness rubric)	Required: [acquisition price, date, acquirer]
Q&A / Factual	Reference answer + acceptable variants	Exact / fuzzy match + LLM judge	Answer: "42.5 million" or "42,500,000"
Generation / Writing	Rubric criteria (tone, structure, constraints)	LLM judge (multi-dimension)	Must: professional tone, <200 words, include CTA
Code generation	Unit tests that must pass	Execute + test pass rate	assert output(4) == 16 # squares input
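The deterministic eval methods in the table can be sketched as small comparison functions. The function names and normalization rules below are assumptions for illustration; rubric-based rows still need an LLM judge.

```python
import re

def exact_match(output: str, expected: str) -> bool:
    """Classification: compare labels, ignoring case and surrounding whitespace."""
    return output.strip().upper() == expected.strip().upper()

def fields_match(output: dict, expected: dict) -> bool:
    """Extraction: every annotated field must be present with the expected value."""
    return all(output.get(key) == value for key, value in expected.items())

def variant_match(output: str, accepted: list[str]) -> bool:
    """Q&A: accept any annotated variant, ignoring case, commas, and extra spaces."""
    def normalize(text: str) -> str:
        return re.sub(r"\s+", " ", text.replace(",", "").lower()).strip()
    return normalize(output) in {normalize(variant) for variant in accepted}

# Examples taken from the table rows above.
label_ok = exact_match(" billing\n", "BILLING")
fields_ok = fields_match(
    {"name": "John", "date": "2024-01-15", "extra": "ignored"},
    {"name": "John", "date": "2024-01-15"},
)
answer_ok = variant_match("42.5  Million", ["42.5 million", "42,500,000"])
```

Extra fields in the output are tolerated by `fields_match`; whether that is acceptable depends on how strict your schema is.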

Size, Versioning & Maintenance — Keeping Your Golden Set Alive In-depth

Size Guidelines

50 cases: Minimum viable — catches major regressions, runs in <5 min. Start here.

200 cases: Production standard — statistically meaningful, covers all categories.

500+ cases: Large system / multiple task types — run on releases, not every PR.

Rule: Run time < 10 minutes for PR-blocking evals. Split larger sets into fast (PR) and full (release) tiers.

Versioning Rules

Store in Git: Golden set is code — it belongs in version control with change history.

JSONL format: One case per line, easy to diff and append.

Never delete cases — mark deprecated with reason and date.

Tag with schema version — when eval format changes, old cases can still run against old schema.

📄

Golden Set JSONL format (recommended)

# golden_set.jsonl: one case per line
{"id": "cs-001", "input": "I was charged twice this month", "expected_label": "BILLING", "eval_method": "exact_match", "tags": ["classification", "easy"], "source": "production_sample", "added": "2024-03-12"}
{"id": "cs-002", "input": "app crashes sometimes but also i think billing is wrong", "expected_label": "BILLING", "eval_method": "exact_match", "tags": ["classification", "hard", "multi-intent"], "source": "failure_mined", "added": "2024-04-01", "failure_date": "2024-03-30"}
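A small loader can enforce the format before any eval runs. This is a minimal sketch: the required-field set and the `deprecated` flag are assumptions about how you might encode the versioning rules, and `#` lines are skipped only as a convenience because strict JSONL has no comment syntax.

```python
import json

REQUIRED_FIELDS = {"id", "input", "eval_method"}  # assumed minimum schema

def load_golden_set(text: str) -> list[dict]:
    """Parse JSONL text, validate each case, and drop deprecated cases."""
    cases, seen_ids = [], set()
    for line_no, line in enumerate(text.splitlines(), start=1):
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue  # blank line or label comment
        case = json.loads(stripped)
        missing = REQUIRED_FIELDS - set(case)
        if missing:
            raise ValueError(f"line {line_no}: missing fields {sorted(missing)}")
        if case["id"] in seen_ids:
            raise ValueError(f"line {line_no}: duplicate id {case['id']}")
        seen_ids.add(case["id"])
        if not case.get("deprecated", False):  # hypothetical deprecation flag
            cases.append(case)
    return cases

sample = "\n".join([
    "# golden_set.jsonl",
    '{"id": "cs-001", "input": "I was charged twice", "eval_method": "exact_match"}',
    "",
    '{"id": "cs-002", "input": "old feature", "eval_method": "exact_match", "deprecated": true}',
])
active_cases = load_golden_set(sample)
```

Failing loudly on a duplicate id or a missing field keeps a malformed case from silently shrinking the eval set.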

Stale Golden Sets Are Worse Than No Golden Sets

A golden set that hasn't been updated in 6 months while your product has evolved will let through regressions you care about and block on cases that are no longer relevant. Review and add to your golden set monthly: (1) add cases for new features, (2) add cases from production failures, (3) deprecate cases for removed features. Assign ownership: golden set maintenance is an engineering responsibility, not a one-time task.

Golden Set Drift — Why Your Test Data Goes Stale In-depth

A golden set that was excellent six months ago may be misleading today. Input distribution, product scope, and user behavior all change over time — and a static test set can produce high offline scores that do not reflect production reality.

Drift Cause	Symptom	Detection	Response
Changing user behavior	Offline score stable; production quality drops; new query types not covered	Online eval score diverges from offline eval score	Sample production queries monthly; add new input types to golden set
New edge cases	System breaks on inputs that never appeared before; golden set doesn't cover them	Production errors cluster to specific input categories	Mine production failures; add as regression cases within 1 day
Evolving system scope	New features added; golden set tests old behavior only; no coverage on new paths	New features untested — discovered only on user report	Add golden set cases as part of every feature development cycle
Obsolete cases	Deprecated features still in golden set; cases always pass (trivially); set size inflated	Cases with 100% pass rate for 3+ months	Deprecate with reason and date — never delete; just mark inactive
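The "100% pass rate for 3+ months" detection rule in the last row can be automated from stored eval history. The sketch below assumes a hypothetical history structure mapping case id to a list of `(run_date, passed)` tuples; flagged cases are candidates for human review, not automatic deprecation.

```python
from datetime import date, timedelta

def obsolete_candidates(history: dict[str, list[tuple[date, bool]]],
                        today: date, window_days: int = 90) -> list[str]:
    """Return ids of cases that passed every run across the whole window."""
    cutoff = today - timedelta(days=window_days)
    flagged = []
    for case_id, runs in history.items():
        # Require history that reaches back past the cutoff, so new cases are not flagged.
        has_full_window = any(run_date <= cutoff for run_date, _ in runs)
        recent = [passed for run_date, passed in runs if run_date >= cutoff]
        if has_full_window and recent and all(recent):
            flagged.append(case_id)
    return sorted(flagged)

history = {
    "cs-001": [(date(2024, 3, 1), True), (date(2024, 5, 1), True), (date(2024, 6, 15), True)],
    "cs-002": [(date(2024, 3, 1), True), (date(2024, 5, 10), False)],
    "cs-003": [(date(2024, 6, 1), True)],  # too new to judge
}
candidates = obsolete_candidates(history, today=date(2024, 7, 1))
```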

∑ Chapter 04 — Key Takeaways

A golden set is (input, expected output / criteria, metadata) — human-verified pairs that define correctness
Best collection mix: 60% production-sampled, 20% failure-mined, 20% adversarial/edge cases
Coverage must include: task distribution, edge cases, format stress, and regression cases (every confirmed failure)
Annotation format depends on task: exact label (classification), key-value (extraction), unit tests (code), rubric (generation)
Size: 50 (minimum viable), 200 (production standard), 500+ (split into PR and release tiers)
Store in Git as JSONL, never delete cases, review and extend monthly — ownership is an engineering responsibility

Chapter 05 · Quality Gates

Regression Testing — Catching Quality Drops Before Production

Regression testing answers one question: "Did this change make things worse?" For LLM systems it is the primary quality gate — because unlike traditional software, LLM changes (prompts, models, parameters) are hard to reason about and easy to get subtly wrong. Regression tests catch the "works on the cases I checked, broke on the ones I didn't" failure pattern.

Establishing a Baseline — You Can't Regress Without a Reference Foundation

A regression is a drop relative to a baseline. Without a recorded baseline, all you have is a current score with no context. Baselines must be stable, reproducible, and stored — not just computed on demand.

1️⃣ Run eval on current system: golden set, all metrics

2️⃣ Store results as baseline: timestamped, versioned

3️⃣ Make a change: prompt / model / schema

4️⃣ Run eval on new version: same golden set, same metrics

5️⃣ Diff against baseline: flag drops, celebrate gains

Baseline Metric	What to Record	Update Frequency
Format compliance rate	% of outputs that parse as valid JSON/schema	Update after any format-affecting change
Functional accuracy	% correct on classification/extraction tasks	Update after prompt or model change
LLM judge score	Avg score per dimension (correctness, completeness, etc.)	Update after any semantic change
P95 latency	95th percentile response time in ms	Update after model or infrastructure change
Avg tokens per query	Input + output tokens averaged over golden set	Update after prompt or schema change
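A baseline record covering the deterministic metrics in this table might be built as follows. The per-case result fields (`format_ok`, `passed`, `latency_ms`, `tokens`) and the version string are hypothetical names; the judge score is omitted because it comes from a separate scoring pass.

```python
import json
import math
from datetime import datetime, timezone

def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile for pct in (0, 100]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

def build_baseline(results: list[dict], version: str) -> dict:
    """Aggregate per-case results into a timestamped, versioned baseline."""
    n = len(results)
    return {
        "version": version,
        "created_at": datetime.now(timezone.utc).isoformat(),
        "n_cases": n,
        "format_compliance": sum(r["format_ok"] for r in results) / n,
        "accuracy": sum(r["passed"] for r in results) / n,
        "p95_latency_ms": percentile([r["latency_ms"] for r in results], 95),
        "avg_tokens": sum(r["tokens"] for r in results) / n,
    }

# Twenty synthetic results: cases 10 and 20 fail, latency grows with the index.
results = [
    {"format_ok": True, "passed": i % 10 != 0, "latency_ms": i * 100, "tokens": 500}
    for i in range(1, 21)
]
baseline = build_baseline(results, version="prompt-v12")
baseline_json = json.dumps(baseline, indent=2)  # commit this next to the golden set
```

Storing the serialized record in version control gives every later run a fixed reference to diff against.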

Regression Thresholds — Blocking vs Warning In-depth

Not all regressions are equal. A 0.5% accuracy drop may be noise; a 5% drop may be a real regression; a format compliance drop from 100% to 95% is always a blocker. Thresholds encode your beliefs about what matters.

🚫

Hard Block

Deployment stops. These regressions always indicate a real problem that must be fixed before release.

Format compliance drops below 98%
Any required field missing on >1% of cases
Accuracy drops >5% from baseline
P95 latency increases >50%

⚠️

Warning (Review Required)

Deployment can proceed with explicit human sign-off. Requires a documented reason for the regression.

Accuracy drops 2–5% from baseline
Judge score drops >0.3 on any dimension
Token count increases >20%
New failure pattern appears on 3+ cases

✅

Pass (Auto-approve)

Change is safe to deploy without manual review. Score is within acceptable variance.

Accuracy delta within ±2% of baseline
Format compliance ≥98%
Latency within ±20% of baseline
No new failure categories introduced
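The three tiers can be encoded as a single gate function. This sketch assumes accuracy and format compliance are stored as fractions, reads "accuracy drops 5%" as five percentage points, and uses hypothetical metric keys; the qualitative rules (missing required fields, new failure patterns) are left out because they need per-case analysis.

```python
def gate(baseline: dict, current: dict) -> tuple[str, list[str]]:
    """Return ("block" | "warn" | "pass", reasons) for a candidate run."""
    block: list[str] = []
    warn: list[str] = []

    acc_drop = baseline["accuracy"] - current["accuracy"]  # 0.03 means 3 points
    judge_drop = baseline["judge_score"] - current["judge_score"]
    latency_ratio = current["p95_latency_ms"] / baseline["p95_latency_ms"]
    token_ratio = current["avg_tokens"] / baseline["avg_tokens"]

    if current["format_compliance"] < 0.98:
        block.append("format compliance below 98%")
    if acc_drop > 0.05:
        block.append("accuracy dropped more than 5 points")
    elif acc_drop > 0.02:
        warn.append("accuracy dropped 2 to 5 points")
    if latency_ratio > 1.5:
        block.append("P95 latency up more than 50%")
    if judge_drop > 0.3:
        warn.append("judge score dropped more than 0.3")
    if token_ratio > 1.2:
        warn.append("token count up more than 20%")

    if block:
        return "block", block
    if warn:
        return "warn", warn
    return "pass", []

base = {"accuracy": 0.90, "judge_score": 4.2, "p95_latency_ms": 1200,
        "avg_tokens": 800, "format_compliance": 1.0}
verdict, reasons = gate(base, {**base, "accuracy": 0.87})
```

Returning the reasons alongside the verdict lets the CI job print exactly which threshold fired.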

Don't Set Thresholds Too Tight

If your thresholds block every small change, engineers route around them by running fewer evals or skipping the process. Thresholds should block real regressions, not noise. For an LLM judge score (inherently variable), ±0.2 is noise; ±0.5 is signal. For exact match accuracy on a 100-case eval set, a 2% swing (2 cases) can come from one or two annotation errors. Calibrate thresholds against your eval's natural variance before enforcing them.

Statistical Significance — When Is a Difference Real? Core

With small eval sets, observed score differences can be coincidental. A 3% accuracy difference on a 50-case set might not be statistically significant — it could easily be 1–2 cases flipping due to LLM non-determinism.

Eval Set Size	Minimum Meaningful Difference	Recommended Use
50 cases	~8–10% difference to be confident (4–5 cases)	Use directional signal only, not hard thresholds
100 cases	~5–6% difference meaningful (5–6 cases)	Reasonable for PR gates with soft thresholds
200 cases	~3–4% difference meaningful	Good for release gates with hard thresholds
500+ cases	~2% difference meaningful	Strong statistical confidence; suitable for A/B
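The table's rule of thumb can be checked with a standard binomial approximation. The sketch below assumes two independent runs and a base pass rate of 90%; under those assumptions the unpaired estimate is somewhat more conservative than the table, while comparing the same cases pairwise is tighter. The function names are illustrative.

```python
import math

def pass_rate_interval(passed: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a pass rate (about 95% coverage at z=1.96)."""
    p = passed / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return center - half, center + half

def min_detectable_diff(total: int, base_rate: float = 0.9, z: float = 1.96) -> float:
    """Smallest gap between two independent runs that exceeds sampling noise."""
    se_diff = math.sqrt(2 * base_rate * (1 - base_rate) / total)
    return z * se_diff

low, high = pass_rate_interval(passed=45, total=50)
diffs = {n: round(min_detectable_diff(n), 3) for n in (50, 100, 200, 500)}
```

The interval for 45 of 50 passing is wide, which is the reason a 50-case set supports directional signal only.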

Practical Approach: Run Twice, Compare

For non-deterministic evals (LLM judge at temperature>0), run the eval twice and take the average. If the two runs differ by more than 3%, something is wrong with your judge (temperature too high, rubric too vague). Use temperature=0 for all judge calls to minimize this variance. Hosted models are not guaranteed to be fully deterministic even at temperature=0, but the judge should be as repeatable as possible even when the system under test is not.

Implementation — Regression Testing with promptfoo In-depth

promptfoo is a widely used open-source LLM testing framework. It runs your prompt against your golden set, applies assertions, and generates a pass/fail report suitable for CI integration.

⚙️

promptfoo config (promptfooconfig.yaml)

providers:
  - id: openai:gpt-4o-mini
    config:
      temperature: 0

prompts:
  - "Classify this support ticket. Reply with exactly one of: BILLING | TECHNICAL | ACCOUNT | OTHER\n\nTicket: {{input}}"

tests:
  - vars:
      input: "I was charged twice this month"
    assert:
      - type: equals
        value: "BILLING"
  - vars:
      input: "App crashes on iOS 17"
    assert:
      - type: equals
        value: "TECHNICAL"
  - vars:
      input: "app crashes sometimes but also i think billing is wrong"
    assert:
      - type: equals
        value: "BILLING"  # primary intent
      - type: latency
        threshold: 3000  # fail if >3s

🔧

Custom Python eval runner (when promptfoo isn't enough)

import json
import time
from pathlib import Path

def run_regression(system_fn, golden_path: str, baseline_path: str | None = None) -> dict:
    """Run every golden case through system_fn and compare accuracy to a baseline."""
    # Skip blank lines and '#' label lines so they do not crash json.loads
    lines = Path(golden_path).read_text().splitlines()
    cases = [
        json.loads(line) for line in lines
        if line.strip() and not line.lstrip().startswith("#")
    ]
    if not cases:
        raise ValueError(f"no cases found in {golden_path}")

    results = []
    for case in cases:
        start = time.monotonic()
        output = system_fn(case["input"])
        latency = (time.monotonic() - start) * 1000  # milliseconds
        passed = output.strip() == case["expected_label"].strip()
        results.append({
            "id": case["id"],
            "passed": passed,
            "latency_ms": latency,
            "output": output,
        })

    accuracy = sum(r["passed"] for r in results) / len(results)

    if baseline_path:
        baseline = json.loads(Path(baseline_path).read_text())
        delta = accuracy - baseline["accuracy"]
        if delta < -0.05:  # Hard block: accuracy fell by more than 5 points
            raise SystemExit(f"REGRESSION: accuracy dropped {-delta:.1%}, blocking deploy")

    return {"accuracy": accuracy, "results": results}

Rollback Criteria — When to Revert Core

Signal	Rollback Trigger	Action
Offline eval regression	Hard threshold violation pre-deploy	Block merge → fix prompt → re-run eval
Production error rate spike	Parser 500s increase >2× in 30 min post-deploy	Immediate rollback to previous version
Online eval quality drop	LLM judge avg drops >0.5 on live sample over 2h	Review samples → decide rollback or hotfix
Cost explosion	Token avg increases >50% vs previous version	Alert + review → rollback if no valid reason
User-reported failures	Confirmed reports cluster to specific input type	Mine failure cases → add to golden set → patch

∑ Chapter 05 — Key Takeaways

Regression testing answers: "Did this change make things worse?" — requires a recorded baseline to compare against
Record baselines for 5 metrics: format compliance, accuracy, LLM judge score, P95 latency, avg tokens
Three threshold tiers: hard block (deploy stops), warning (review required), pass (auto-approve)
Statistical significance: small eval sets need larger differences to be meaningful — 200 cases for reliable hard thresholds
Use temperature=0 for all judge calls: it minimizes eval variance, so score changes reflect real regressions rather than noise
Rollback triggers: hard threshold violation, parser 500s, online quality drop >0.5, 50%+ cost increase

Chapter 06 · Observability

Tracing — Following Requests Through LLM Systems

When an LLM system returns a wrong answer, how do you know which step failed? Tracing gives you the answer: a complete record of every step in a request's execution — which LLM was called, with what prompt, what it returned, how long it took, and what it cost. Without traces, debugging is guesswork.

Traces and Spans — The Core Concepts Foundation

Tracing for LLM systems borrows from distributed systems observability. Every user request generates one trace, which is a tree of spans — each span representing one unit of work. For LLM systems, spans map directly to the operations that matter.

Trace anatomy — a RAG pipeline request broken into spans

What to Trace — The Minimum Viable Span Set Core

Span Type	Fields to Capture	Why It Matters
LLM call	model, prompt (truncated), response (truncated), tokens_in, tokens_out, cost, latency, TTFT, retry_count	Primary cost and latency driver; contains the most debugging information
Vector/DB retrieval	query, top_k, similarity scores, retrieved_doc_ids, latency	Retrieval quality is leading indicator of RAG answer quality
Tool call	tool_name, input_args, output, latency, status (success/error)	Tool failures are the #1 agent failure mode; input/output needed for debugging
Output validation	schema_passed, parsed_output, validation_errors	Reveals format failure rate; links validation failures to which LLM call produced them
Whole request (root span)	request_id, user_id, total_latency, total_cost, total_tokens, status, feature_name	Enables per-user cost attribution and system-level performance dashboards
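To make the span tree concrete, here is a minimal in-memory sketch of a trace for one RAG request. The `Span` class and its fields are illustrative and not the schema of any particular tracing tool.

```python
from dataclasses import dataclass, field

@dataclass
class Span:
    name: str
    kind: str                 # "root", "retrieval", "llm", "tool", "validation"
    latency_ms: float
    cost_usd: float = 0.0
    attributes: dict = field(default_factory=dict)
    children: list["Span"] = field(default_factory=list)

    def total_cost(self) -> float:
        """Cost of this span plus everything beneath it."""
        return self.cost_usd + sum(child.total_cost() for child in self.children)

    def slowest_leaf(self) -> "Span":
        """The leaf operation that took the most time."""
        if not self.children:
            return self
        return max((c.slowest_leaf() for c in self.children), key=lambda s: s.latency_ms)

trace = Span("request", "root", latency_ms=2300, attributes={"request_id": "req-1"}, children=[
    Span("vector_search", "retrieval", latency_ms=120, cost_usd=0.0005,
         attributes={"top_k": 5}),
    Span("llm_call", "llm", latency_ms=2100, cost_usd=0.004,
         attributes={"tokens_in": 1800, "tokens_out": 350, "retry_count": 0}),
    Span("validate_output", "validation", latency_ms=5,
         attributes={"schema_passed": True}),
])
bottleneck = trace.slowest_leaf().name
```

Rolling cost up from the leaves is what lets the root span carry `total_cost` for per-user attribution.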

Don't Log Raw Prompts in Production Without Care

Full prompt logging contains user input, which may contain PII, confidential business data, or sensitive content. Before logging prompts in production: (1) implement PII detection and redaction, (2) restrict trace access to authorized personnel, (3) set retention policies (30–90 days typical), (4) check your privacy policy and data residency requirements. Truncating prompts to 500 characters usually captures enough for debugging and limits how much user content is stored, but it does not remove PII that falls within those 500 characters, so redact before truncating.
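A minimal redact-then-truncate helper might look like the sketch below. The two regular expressions are deliberately simple illustrations (the phone pattern will also catch some dates and long numbers) and are no substitute for a dedicated PII detection service.

```python
import re

# Simplistic illustrative patterns; real PII detection needs far more coverage.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")

def redact_for_trace(text: str, max_chars: int = 500) -> str:
    """Redact first, then truncate, so a cut never splits a match."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    text = PHONE_RE.sub("[PHONE]", text)
    if len(text) > max_chars:
        text = text[:max_chars] + "...[truncated]"
    return text

safe = redact_for_trace("Contact jane.doe@example.com or +1 (555) 123-4567 about invoice 42")
```

Redacting before truncating matters: cutting first could leave half an email address in the log that the pattern no longer matches.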

The Observability Gap — Logs Are Not Enough Core

Application logs tell you what happened — which endpoint was called, what status code was returned, how long it took. They do not tell you whether the LLM output was correct, helpful, or safe. That gap is the central observability problem for LLM systems.

What Logs Give You

✅ Request volume and error rates

✅ Response latency (total)

✅ HTTP status codes

✅ Token count per call

❌ Whether the answer was correct

❌ Whether quality is degrading week-over-week

❌ Which failure category the error belongs to

What You Also Need to Track

📊 Output quality scores — LLM judge per dimension, rolling avg

📊 Format pass rate — % of outputs passing schema validation

📊 TTFT + generation latency — separately, not just total

📊 Token usage breakdown — input vs output, by feature

📊 Retry rate — % of requests that required ≥1 retry

📊 Fallback trigger rate — % of requests falling back to cheaper/cached response

OpenTelemetry for LLMs — Standardized Instrumentation In-depth

OpenTelemetry (OTel) is the industry standard for distributed tracing. The OpenTelemetry Semantic Conventions for LLMs (GenAI conventions) define standard span attribute names for LLM calls — enabling consistent tooling across providers.

📐

GenAI OTel Span Attributes (standard)

gen_ai.system — "openai", "anthropic"
gen_ai.request.model — "gpt-4o-mini"
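gen_ai.usage.prompt_tokens (renamed gen_ai.usage.input_tokens in newer releases of the conventions)
gen_ai.usage.completion_tokens (renamed gen_ai.usage.output_tokens in newer releases of the conventions)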
gen_ai.response.finish_reasons
gen_ai.request.temperature

🔌

Auto-instrumentation Libraries

These libraries automatically wrap LLM SDK calls with OTel spans — no manual instrumentation needed for basic tracing.

opentelemetry-instrumentation-openai
LangSmith — LangChain native tracing
Langfuse — open-source, self-hostable
Arize Phoenix — local + cloud

🔧

Manual span instrumentation (OpenAI + OTel)

import time

import openai
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider

# Register an SDK tracer provider; attach a span processor and exporter to ship spans somewhere.
trace.set_tracer_provider(TracerProvider())
tracer = trace.get_tracer("llm-service")

def call_llm_with_trace(messages: list, model: str = "gpt-4o-mini"):
    """Call the chat API inside a span that records model, tokens, latency, and status."""
    with tracer.start_as_current_span("llm_call") as span:
        span.set_attribute("gen_ai.system", "openai")
        span.set_attribute("gen_ai.request.model", model)
        span.set_attribute("gen_ai.request.message_count", len(messages))  # custom attribute
        start = time.monotonic()
        try:
            response = openai.chat.completions.create(
                model=model,
                messages=messages,
            )
            latency = (time.monotonic() - start) * 1000  # milliseconds
            span.set_attribute("gen_ai.usage.prompt_tokens", response.usage.prompt_tokens)
            span.set_attribute("gen_ai.usage.completion_tokens", response.usage.completion_tokens)
            span.set_attribute("llm.latency_ms", round(latency))
            span.set_attribute("llm.status", "success")
            return response
        except Exception as e:
            span.set_attribute("llm.status", "error")
            span.record_exception(e)
            raise

Tracing Tools — LangSmith, Langfuse, and Phoenix Core

Tool	Hosting	Strengths	Best For
LangSmith	SaaS (paid)	Deep LangChain integration, eval pipelines, human feedback, datasets	Teams using LangChain; want eval + tracing in one tool
Langfuse	Open-source + SaaS	Self-hostable, SDK-agnostic, cost tracking, LLM-as-judge built in	Privacy-sensitive deployments; teams wanting data control
Arize Phoenix	Open-source (local-first)	Notebook-friendly, evals, embedding visualization, OTel native	ML engineers; debugging sessions; local dev tracing
Braintrust	SaaS	Strong eval + experiment tracking, prompt versioning, CI integration	Teams focused on eval-driven development
Custom OTel + Jaeger/Tempo	Self-hosted	Full control, integrates with existing observability stack (Grafana)	Orgs with mature monitoring infra; enterprise scale

Reading a Trace — Latency Breakdown Patterns In-depth

A trace doesn't just show you where time went — it shows you why. Three latency patterns diagnose different root causes and point to different solutions.

🐌

Pattern: High TTFT

Time-to-first-token is >1s. The first token arrives late even though retrieval is fast.

Root causes: Long input prompt, provider load, large model. Fix: Prompt compression, prompt cache, smaller model, multi-provider failover.

⏳

Pattern: Long Generation

TTFT is fast but total LLM span is 5–10s. Output token count is very high.

Root causes: Model generating unnecessarily long responses. Fix: Set aggressive max_tokens, add "be concise" to prompt, use streaming to hide latency.

🔄

Pattern: Retry-Inflated Latency

Total latency 2–3× what a single call should be. Retry spans visible in trace.

Root causes: Timeout too low, intermittent provider issues, format validation failing. Fix: Tune timeout, fix format issue that's causing retries, improve output parsing.

The Trace Review Workflow

For every production p95 latency alert: (1) pull a trace from the 95th percentile, (2) identify the span contributing the most time, (3) check if it's TTFT (prompt issue), generation (output length issue), or retry (reliability issue). Traces turn "it's slow" into "the 1,847-token system prompt is adding 340ms of TTFT on every call" — a fixable problem.
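Step (3) of the workflow can be approximated with a small classifier over LLM span fields. The field names (`ttft_ms`, `total_ms`, `retry_count`) and the cutoffs, taken loosely from the three patterns above, are assumptions to adjust for your own traces.

```python
def classify_latency(span: dict, ttft_limit_ms: float = 1000,
                     generation_limit_ms: float = 5000) -> str:
    """Label an LLM span with its dominant latency pattern."""
    if span.get("retry_count", 0) > 0:
        return "retry_inflated"      # reliability issue
    if span["ttft_ms"] > ttft_limit_ms:
        return "high_ttft"           # prompt or provider issue
    if span["total_ms"] - span["ttft_ms"] > generation_limit_ms:
        return "long_generation"     # output length issue
    return "normal"

slow_span = {"ttft_ms": 400, "total_ms": 7900, "retry_count": 0}
pattern = classify_latency(slow_span)
```

Retries are checked first because a retried call inflates both TTFT and total time, which would otherwise hide the real cause.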

∑ Chapter 06 — Key Takeaways

A trace is a tree of spans for one request — each span records one operation's input, output, latency, cost, and status
Minimum span set: LLM call, retrieval, tool call, output validation, root request — each with timing + cost
Use OpenTelemetry GenAI conventions for standard attribute names — enables consistent tooling and dashboards
Log prompts safely: truncate to 500 chars, redact PII, restrict access, set 30–90 day retention
Tool choice: Langfuse (self-hosted, privacy), LangSmith (LangChain teams), Phoenix (local dev), Braintrust (eval-focused)
Three latency patterns: high TTFT (prompt too long), long generation (max_tokens not set), retry inflation (format failures or timeouts)

Chapter 07 · Production

Monitoring — Continuous Quality in Production

Offline eval tells you the system worked before you deployed. Monitoring tells you whether it's working right now. Production quality degrades silently until someone complains — unless you're continuously sampling, evaluating, and alerting on live traffic.

Why Production Monitoring Is Different From Offline Eval Foundation

Offline Eval — What You Know

Fixed test set, controlled inputs, run before deployment

Tests the cases you thought to include

Snapshot in time — result is stable

Catches regressions relative to your golden set

Gap: Real traffic evolves; your test set doesn't

Production Monitoring — What You Need

Live traffic sample, real user inputs, continuous

Tests the cases users actually send

Rolling signal — changes as inputs and behavior change

Catches drift that offline eval can't see

Value: Discovers new failure modes before users escalate

The Production Monitoring Stack

A complete production monitoring stack has three layers: (1) Hard metrics: latency, error rate, cost (from logs, near real-time). (2) Quality sampling: LLM judge on 1–5% of live traffic (asynchronous; signal accumulates over hours). (3) Drift detection: aggregate trend analysis over hours/days. Layer 1 alerts in minutes; Layer 2 in hours; Layer 3 in days. Each catches different failure modes.

Sampling Strategies — Which Traffic to Evaluate In-depth

You can't judge every production request: each judged request adds at least one more LLM call, which can roughly double your LLM costs. Strategic sampling gets you signal coverage at manageable cost.

Strategy	What It Does	Sample Rate	Best For
Random sampling	Evaluate a uniform random subset of all requests	1–5% of traffic	Baseline quality tracking, cost drift detection
Stratified sampling	Ensure all query categories / user cohorts are represented equally	Varies per stratum	Systems with very unequal query distributions
Triggered sampling	Always evaluate when: long latency, retry occurred, format validation failed, high token count	100% of anomalous cases	Catching the worst failures immediately
User feedback sampling	Evaluate all requests where user gave explicit negative feedback (👎 / edit / rephrase)	100% of flagged cases	Connecting quality scores to user satisfaction
Time-window sampling	Heavier sampling for first hour after a deployment, then fall back to baseline rate	10–20% post-deploy, 1–2% steady-state	Catching deployment regressions fast
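Several of these strategies can be combined in one sampling decision. In this sketch the request fields, the 5-second latency trigger, and the exact rates (10% in the first hour, 2% steady-state) are illustrative picks from the ranges in the table; stratified sampling is left out for brevity.

```python
import random

def should_evaluate(request: dict, minutes_since_deploy: float,
                    rng: random.Random) -> tuple[bool, str]:
    """Decide whether a finished request goes to the online judge, and why."""
    anomalous = (
        request.get("retry_count", 0) > 0
        or not request.get("format_ok", True)
        or request.get("latency_ms", 0) > 5000
    )
    if anomalous:
        return True, "triggered"          # 100% of anomalous cases
    if request.get("feedback") == "negative":
        return True, "user_feedback"      # 100% of flagged cases
    if minutes_since_deploy < 60:
        return rng.random() < 0.10, "post_deploy"
    return rng.random() < 0.02, "random"

rng = random.Random(0)  # seeded only to make the example repeatable
decision = should_evaluate({"retry_count": 1, "latency_ms": 900}, 500, rng)
```

Recording the reason with each sample lets the dashboard separate baseline quality (random samples) from worst-case quality (triggered samples).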

Online Eval Pipeline — From Request to Quality Signal Core

Online evaluation pipeline — async quality scoring on live traffic

Never Block User Responses on Online Eval

The online eval pipeline must be completely asynchronous. The user receives their response immediately. Evaluation happens in a background worker, writing to a separate metrics store. If your eval pipeline goes down, users are unaffected. Coupling eval to the critical path is a common mistake that turns a monitoring failure into a user-facing outage.
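The decoupling can be sketched with an in-process queue and a background worker. The LLM and judge calls are stand-ins, and a real deployment would more likely use an external queue or job system; the point here is only that the request handler never waits on evaluation.

```python
import asyncio

async def handle_request(query: str, eval_queue: asyncio.Queue) -> str:
    response = f"answer to: {query}"              # stand-in for the real LLM call
    try:
        eval_queue.put_nowait((query, response))  # never await on the eval path
    except asyncio.QueueFull:
        pass                                      # drop the sample rather than delay the user
    return response

async def eval_worker(eval_queue: asyncio.Queue, scores: list) -> None:
    while True:
        query, response = await eval_queue.get()
        try:
            await asyncio.sleep(0.01)             # stand-in for a slow judge call
            scores.append({"query": query, "score": 5})
        except Exception:
            pass                                  # eval failures must not propagate
        finally:
            eval_queue.task_done()

async def main() -> tuple[list, list, int]:
    queue: asyncio.Queue = asyncio.Queue(maxsize=100)
    scores: list = []
    worker = asyncio.create_task(eval_worker(queue, scores))
    responses = [await handle_request(q, queue) for q in ["a", "b", "c"]]
    scored_at_response_time = len(scores)         # nothing scored yet: users were not blocked
    await queue.join()                            # wait for background scoring to finish
    worker.cancel()
    return responses, scores, scored_at_response_time

responses, scores, scored_early = asyncio.run(main())
```

Dropping a sample when the queue is full is a deliberate choice: losing one quality data point is cheaper than adding latency to a user response.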

Drift Detection — Catching Quality Decay Over Time Core

📉

Quality Drift

LLM judge scores trend downward over days/weeks without a single obvious cause. Often caused by gradual input distribution shift.

Track 7-day rolling avg judge score
Alert if 7-day avg drops >0.3 below 30-day avg
Pull failing traces to identify new input patterns

🔀

Distribution Drift

The types of queries users send change. New topics, new use cases, seasonal patterns. Your system wasn't designed for these inputs.

Track query topic distribution over time
Embed queries, monitor cluster centroids
Alert on new topic clusters with low scores

💰

Cost Drift

Average tokens per query increases over weeks. Often caused by growing conversation history, longer user inputs, or unhealthy retry patterns.

Track avg tokens/query daily
Alert on >20% increase week-over-week
Break down by feature and model
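The quality and cost drift rules above reduce to simple comparisons over stored aggregates. This sketch assumes you already keep one average judge score per day and one average token count per week; the function names are illustrative.

```python
from statistics import mean

def quality_drift(daily_scores: list[float], threshold: float = 0.3) -> dict:
    """Compare the last 7 daily averages against the last 30."""
    if len(daily_scores) < 30:
        raise ValueError("need at least 30 daily averages")
    avg_7d = mean(daily_scores[-7:])
    avg_30d = mean(daily_scores[-30:])
    return {"avg_7d": avg_7d, "avg_30d": avg_30d, "alert": avg_30d - avg_7d > threshold}

def cost_drift(weekly_avg_tokens: list[float], threshold: float = 0.20) -> bool:
    """Alert when average tokens per query rose more than 20% week over week."""
    previous, current = weekly_avg_tokens[-2:]
    return (current - previous) / previous > threshold

# 23 healthy days followed by a bad week.
drift = quality_drift([4.2] * 23 + [3.6] * 7)
cost_alert = cost_drift([1000, 1250])
```

Because the 30-day window includes the most recent 7 days, a sustained drop has to be fairly large before the alert fires, which keeps this check low-noise.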

Alerting — What to Alert On and How to Respond In-depth

Alert Type	Trigger Condition	Severity	Initial Response
Format failure spike	JSON parse failure rate >5% in any 10-min window	P1 — page on-call immediately	Check recent deployment; roll back if <2h since deploy
Error rate increase	LLM API error rate >10% (timeouts, 429s, 5xx)	P1 — activate fallback provider	Switch to backup model; check provider status page
P95 latency breach	P95 response time >2× of 7-day baseline for >5 min	P2 — investigate within 30 min	Check trace for retry inflation or prompt length increase
Quality score drop	Rolling 1-hour judge avg drops >0.5 below baseline	P2 — review within 1 hour	Sample recent failing traces; check for new input patterns
Cost anomaly	Hourly cost >2× the 7-day hourly average	P2 — investigate within 1 hour	Check for token count spike; look for runaway agent loops
Quality drift	7-day rolling avg drops >0.3 below 30-day avg	P3 — review in next working day	Analyze input distribution shifts; plan golden set expansion

The Quality Dashboard — What to Show Core

📊

Real-Time Panel (last 1h)

Requests/min + error rate
P50 / P95 / P99 latency
Format compliance % (last 100 requests)
Current $ cost/hour
Active provider + fallback status

📈

Quality Trend Panel (last 7 days)

Rolling average judge scores per dimension
Pass rate on sampled traffic
Failure category breakdown (format / content / safety)
Token trend (avg tokens/query over time)
Cost per query trend

🔍

Failure Explorer

Recent low-score samples (judge score <3)
Format failures with error detail
High-latency traces (p99 outliers)
High-cost requests (>$0.10/req)
Link to full trace for each row

⚙️

Model Routing Panel

% of traffic per model (mini vs frontier)
Routing correctness (are cheap queries going to cheap model?)
Cache hit rate
Batch API vs sync API split

∑ Chapter 07 — Key Takeaways

Production monitoring has three layers: hard metrics (minutes), quality sampling (hours), drift detection (days)
Best sampling mix: 1–5% random + 100% triggered (anomalous requests) + 10–20% immediately post-deploy
Online eval pipeline must be fully asynchronous — never block user responses on evaluation
Three drift types to monitor: quality drift (judge scores), distribution drift (query topics), cost drift (tokens/query)
P1 alerts: format failure >5%, error rate >10% — page on-call. P2: latency 2×, quality drop 0.5 — review within 1h
Dashboard covers four panels: real-time health, quality trend, failure explorer, model routing

Chapter 08 · Troubleshooting

Debugging LLM Applications — Finding What Went Wrong

Debugging LLM systems is qualitatively different from debugging traditional software. There is no stack trace for "the model gave a wrong answer." Systematic debugging requires traces, structured reproduction, and an understanding of LLM failure taxonomy — otherwise you're changing prompts at random and hoping.

The Debugging Mindset — Systematic Over Intuitive Foundation

The most common debugging mistake in LLM systems is jumping straight to "fix the prompt" without first diagnosing which component failed and why. A wrong answer in a RAG system could be a retrieval failure, a prompting failure, a model capability failure, or a parsing failure — each has a different fix.

1️⃣ Reproduce: exact input that failed

2️⃣ Isolate: which component failed

3️⃣ Classify: failure taxonomy

4️⃣ Root cause: why this component

5️⃣ Fix + verify: eval before/after

6️⃣ Regression: add to golden set

LLM Failure Taxonomy — Classifying What Broke In-depth

Failure Class	Symptoms	Root Component	Diagnostic
Format failure	Parser throws, missing fields, wrong data types	Output parsing / LLM output	Check raw LLM response before parsing; use structured output mode
Instruction violation	Model ignores a constraint (language, length, tone, field)	Prompt / system prompt priority	Test prompt in isolation; check instruction placement (start/end wins)
Hallucination	Model states facts not in the provided context or training data	Model + retrieval (for RAG)	Check if retrieval returned relevant docs; test with oracle context
Retrieval failure	Correct docs not returned; answer misses key information	Embedding / vector search / chunking	Check retrieved doc IDs and scores in trace; test retrieval in isolation
Capability gap	Model can't perform the task regardless of prompting	Model selection	Try frontier model (GPT-4o); if frontier succeeds, route task differently
Context overflow	Key instructions or context silently truncated; model ignores injected content	Context management	Count tokens; check if input exceeds window; trim/summarize context
Tool failure	Agent calls wrong tool, with wrong args, or loop doesn't terminate	Tool schema / agent prompt	Check tool input/output in trace; reduce visible tools; tighten schemas

Prompt Debugging — Isolating Instruction Failures Core

🔬

Prompt Isolation Technique

Strip everything away. Test the prompt against the failing input with no context, no history, no tools. If it still fails, the issue is in the prompt itself. If it passes, the issue is in one of the stripped components.

Test system prompt alone first
Add context back in stages
Pin to temperature=0 during debugging

📍

Instruction Placement

LLMs suffer from the lost-in-the-middle effect. Instructions buried in the middle of a long prompt are often ignored or underweighted.

Move ignored instructions to beginning or end
Repeat critical constraints at end of prompt
Use XML-style delimiters to separate sections

🧪

Minimal Reproduction

The most powerful debugging tool: find the shortest prompt that still fails. This eliminates noise and focuses attention on the actual problem.

Start with failing case, strip tokens
Stop when removing a piece makes the failure disappear
The last piece you removed is the cause, or part of it
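The stripping loop can be automated. The sketch below is a simple greedy reduction over prompt chunks; `still_fails` is a hypothetical callback that runs your model on the remaining chunks and reports whether the bug still reproduces.

```python
from typing import Callable


def shrink_prompt(chunks: list[str], still_fails: Callable[[list[str]], bool]) -> list[str]:
    """Greedily drop chunks while the failure still reproduces."""
    if not still_fails(chunks):
        raise ValueError("the starting prompt does not reproduce the failure")
    current = list(chunks)
    changed = True
    while changed:
        changed = False
        for i in range(len(current)):
            candidate = current[:i] + current[i + 1:]
            if still_fails(candidate):
                current = candidate  # chunk i was noise, drop it
                changed = True
                break
    return current


def conflicting(chunks: list[str]) -> bool:
    # Stand-in for a real model run: the bug needs both language rules present.
    text = " ".join(chunks)
    return "French" in text and "English" in text


minimal = shrink_prompt(
    ["Be concise.", "Use a formal tone.", "Answer in French.", "Reply in English only."],
    conflicting,
)
print(minimal)  # ['Answer in French.', 'Reply in English only.']
```

Every chunk left in the result is needed to trigger the failure, so the cause is somewhere in what remains. Because model output is non-deterministic, a real `still_fails` should run the prompt several times and report the failure only if it recurs.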

⚠️

Instruction Conflict

System prompt says "be concise." User message context implies long output. Model follows whichever is statistically stronger — often the one that appears nearest the end.

Audit for contradictory instructions
System prompt usually wins when it states its priority explicitly
Add "regardless of the input length" clarifiers

Hallucination Root Cause — Why It Happens and How to Fix It In-depth

Hallucination Type	Cause	Fix
Factual hallucination	Model fills gaps with plausible-sounding facts not in training data	Add explicit "say I don't know if uncertain" instruction. Use RAG with grounding check. Add citation requirement.
Context hallucination (RAG)	Retrieved docs don't contain the answer; model extrapolates from partial information	Improve retrieval (hybrid search, re-ranking). Add: "Only use information from the provided documents."
Confident wrong answer	Model lacks uncertainty calibration; outputs high confidence regardless	Prompt: "If you are not certain, explicitly say so before answering." Add LLM-as-judge calibration eval.
Temporal hallucination	Model answers about post-training events as if it knows them	Add training cutoff date to system prompt. Provide current date. Tell model to acknowledge cutoff.
Structural hallucination	Model invents required fields not in source (e.g. JSON fields it was asked for but source lacks)	Add: "Leave fields as null if information is not present — do not guess." Use structured output with Optional fields.
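For the structural case in the last row, the schema itself can make "missing" a legal answer. The sketch below assumes a hypothetical `InvoiceFields` schema and uses a deliberately crude grounding heuristic (verbatim substring match); a real system would likely need something more tolerant of reformatting.

```python
import json
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class InvoiceFields:
    """Hypothetical extraction schema: every field may legitimately be absent."""
    vendor: Optional[str] = None
    invoice_number: Optional[str] = None
    due_date: Optional[str] = None


def parse_extraction(raw_json: str, source_text: str) -> InvoiceFields:
    """Parse model output and null out values that never appear in the source."""
    data = json.loads(raw_json)
    allowed = {f.name for f in fields(InvoiceFields)}
    unknown = set(data) - allowed
    if unknown:
        raise ValueError(f"unexpected fields: {sorted(unknown)}")
    record = InvoiceFields(**data)
    for f in fields(record):
        value = getattr(record, f.name)
        # Crude grounding heuristic: verbatim, case-insensitive substring match.
        if value is not None and str(value).lower() not in source_text.lower():
            setattr(record, f.name, None)
    return record


source = "Invoice INV-204 from Acme Corp. Payment terms: net 30."
raw = '{"vendor": "Acme Corp", "invoice_number": "INV-204", "due_date": "2024-07-01"}'
print(parse_extraction(raw, source))
# InvoiceFields(vendor='Acme Corp', invoice_number='INV-204', due_date=None)
```

The invented due date is dropped because nothing in the source supports it, while the two grounded fields survive.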

The Grounding Test

To diagnose RAG hallucinations precisely, replace the retrieved context with oracle context (a passage that contains the correct answer verbatim) and re-run the query. If the model now answers correctly, the problem is retrieval: your docs are wrong, missing, or irrelevant. If the model still hallucinates with the correct context in front of it, the problem is prompting: the model isn't being told to stay grounded. If explicit grounding instructions still don't help, suspect a capability gap in the model itself.
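The grounding test is mechanical enough to script. In this minimal sketch, `answer_fn` and `is_correct` are hypothetical: the first wraps your RAG generation step and takes the context as an argument, the second is whatever correctness check your golden set uses.

```python
from typing import Callable


def grounding_test(
    answer_fn: Callable[[str, str], str],
    is_correct: Callable[[str], bool],
    query: str,
    retrieved_context: str,
    oracle_context: str,
) -> str:
    """Classify a RAG failure by swapping retrieved context for oracle context."""
    if is_correct(answer_fn(query, retrieved_context)):
        return "no_failure"
    if is_correct(answer_fn(query, oracle_context)):
        return "retrieval"  # correct once the right passage is supplied
    return "prompting"  # wrong even with the right passage in front of it


def fake_answer(query: str, context: str) -> str:
    # Stand-in for a real RAG call that answers only from the supplied context.
    return "30 days" if "30 days" in context else "14 days"


verdict = grounding_test(
    fake_answer,
    lambda answer: answer == "30 days",
    "What is the refund window?",
    retrieved_context="Shipping takes 5 days.",
    oracle_context="Refunds are accepted within 30 days.",
)
print(verdict)  # retrieval
```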

Chain & Pipeline Debugging — Isolating Multi-Step Failures Core

In multi-step pipelines, errors compound: a bad Step 1 output becomes the corrupted input to Step 2. The failure often surfaces in Step 3 or 4 but was caused in Step 1. Traces are the only way to see this clearly.

🔍

Step Isolation

Test each step independently with ideal inputs. If Step 2 works perfectly with handcrafted input, the bug is in Step 1's output. Walk the pipeline backwards from the failure point.

📋

Intermediate Output Logging

Log every intermediate output — not just the final result. Without step-by-step traces, you can't see where the corruption entered. This is non-negotiable for multi-step systems.

🔄

Error Propagation

Validate every step output before passing to the next step. A format error in Step 2 that isn't caught there will manifest as a confusing error in Step 5. Fail fast, fail clearly.
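Intermediate logging and fail-fast validation fit naturally into one small runner. This is a sketch under assumptions: the `(name, run, validate)` step shape, the `StepValidationError` class, and the logger name are illustrative, not a prescribed framework.

```python
import json
import logging
from typing import Any, Callable

logger = logging.getLogger("pipeline")


class StepValidationError(Exception):
    """Raised when a step's output fails its validator."""


# Each step is (name, run, validate); the shape is illustrative.
Step = tuple[str, Callable[[Any], Any], Callable[[Any], bool]]


def run_pipeline(steps: list[Step], initial_input: Any) -> tuple[Any, list[dict]]:
    """Run steps in order, logging every intermediate output and failing fast."""
    trace: list[dict] = []
    value = initial_input
    for name, run, validate in steps:
        value = run(value)
        record = {"step": name, "output": value}
        trace.append(record)
        logger.info("step_output %s", json.dumps(record, default=str))
        if not validate(value):
            # Stop here so the error names the step that actually broke.
            raise StepValidationError(f"step '{name}' produced invalid output: {value!r}")
    return value, trace


steps = [
    ("extract", lambda text: {"amount": text.split()[-1]}, lambda out: "amount" in out),
    ("parse", lambda out: float(out["amount"]), lambda out: out >= 0),
]
result, trace = run_pipeline(steps, "total due 42.50")
print(result)  # 42.5
```

Because each output is logged before it is validated, the bad value is already in the log when the error is raised.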

∑ Chapter 08 — Key Takeaways

Debugging workflow: reproduce → isolate component → classify failure → root cause → fix → verify → add to golden set
Seven failure classes: format, instruction violation, hallucination, retrieval, capability gap, context overflow, tool failure
Prompt isolation technique: strip everything, test alone, add components back — narrows failure to one layer
Lost-in-the-middle: move ignored instructions to the beginning or end — buried instructions are ignored
Grounding test: replace retrieved context with oracle context that contains the answer; if the model answers correctly, the bug is retrieval; if still wrong, the bug is prompting
Multi-step debugging: log every intermediate output, walk backwards from failure, validate each step before passing to the next

Chapter 09 · Automation

CI/CD Integration — Automated Evaluation in Your Pipeline

An eval pipeline you run manually will eventually not get run. Automated eval in CI is the only sustainable enforcement mechanism — it gates every merge, runs without human intervention, and creates an auditable quality history for every change to your LLM system.

Eval in the Development Cycle — The Right Stages Foundation

Eval gates in the LLM system development and deployment cycle

GitHub Actions — Eval on Every Pull Request In-depth

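The standard CI pattern: run deterministic checks (fast, free) on every commit; run the LLM judge eval on every PR; block the merge if either fails. The workflow below implements the PR half: it triggers on pull requests, and the judge job runs only after the deterministic job passes.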

⚙️

.github/workflows/eval.yml — PR eval gate

name: LLM Eval Gate

on:
  pull_request:
    paths:
      - 'prompts/**'
      - 'src/llm/**'
      - 'evals/**'

jobs:
  deterministic-eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with: { python-version: '3.12' }
      - run: pip install -r requirements.txt
      - name: Run deterministic checks
        # Schema validation, exact match, format checks.
        # pytest exits non-zero on any failure, which blocks the merge.
        run: python -m pytest evals/deterministic/ -v

  llm-judge-eval:
    runs-on: ubuntu-latest
    needs: deterministic-eval
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with: { python-version: '3.12' }
      - run: pip install -r requirements.txt
      - name: Run LLM judge evaluation
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
        # Exits non-zero if accuracy drops >5%, which blocks the merge.
        run: |
          python evals/run_judge_eval.py \
            --golden evals/golden_set.jsonl \
            --baseline evals/baseline.json \
            --threshold 0.05
      - name: Upload eval report
        if: always()  # keep the report even when the gate fails
        uses: actions/upload-artifact@v4
        with:
          name: eval-report
          path: evals/report.json

Eval-on-PR Patterns — Smart Gating Core

⚡

Path-based Triggers

Only run LLM eval when LLM-related files change. Don't waste API calls and eval time on CSS or docs changes.

Trigger on: prompts/**, src/llm/**, evals/**
Skip on: docs/**, *.md, styles/**
Saves 80%+ of unnecessary eval runs

📊

PR Comment Report

Post eval results as a PR comment so reviewers see the quality impact inline — not buried in CI logs.

Accuracy: 94.2% (baseline: 93.8%) ✅
Format compliance: 100% ✅
Avg tokens: 847 (baseline: 812) ⚠️ +4%
Cost per run: $0.73
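A report like the one above is easy to generate from the eval output. The sketch below assumes hypothetical metric keys (`accuracy`, `format_compliance`, `avg_tokens`, `cost_usd`) and an arbitrary 3% token-growth warning threshold; posting the text to the PR is left to your CI tooling.

```python
def format_eval_comment(current: dict, baseline: dict, token_warn_pct: float = 3.0) -> str:
    """Render eval metrics as a short PR comment body."""

    def mark(ok: bool, bad: str) -> str:
        return "✅" if ok else bad

    token_delta = 100 * (current["avg_tokens"] - baseline["avg_tokens"]) / baseline["avg_tokens"]
    acc_ok = current["accuracy"] >= baseline["accuracy"]
    fmt_ok = current["format_compliance"] >= 1.0
    token_ok = token_delta <= token_warn_pct
    lines = [
        f"Accuracy: {current['accuracy']:.1%} (baseline: {baseline['accuracy']:.1%}) {mark(acc_ok, '❌')}",
        f"Format compliance: {current['format_compliance']:.0%} {mark(fmt_ok, '❌')}",
        f"Avg tokens: {current['avg_tokens']:.0f} (baseline: {baseline['avg_tokens']:.0f}) "
        f"{mark(token_ok, '⚠️')} {token_delta:+.0f}%",
        f"Cost per run: ${current['cost_usd']:.2f}",
    ]
    return "\n".join(lines)


comment = format_eval_comment(
    current={"accuracy": 0.942, "format_compliance": 1.0, "avg_tokens": 847, "cost_usd": 0.73},
    baseline={"accuracy": 0.938, "avg_tokens": 812},
)
print(comment)
```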

💾

Baseline Management

Store the baseline JSON in the repo. When you intentionally improve quality, update the baseline — making new improvements the new floor.

evals/baseline.json in version control
Update with python evals/update_baseline.py
Commit baseline update with the change that caused it
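The gate itself reduces to a comparison against the stored baseline. This is a minimal sketch of what a script like `run_judge_eval.py` might do after scoring, assuming both JSON files are flat metric-to-score maps on a 0–1 scale where higher is better; the function names are hypothetical.

```python
import json
from pathlib import Path


def check_regression(current: dict, baseline: dict, threshold: float = 0.05) -> list[str]:
    """Return the metrics that dropped more than `threshold` below baseline."""
    failures = []
    for metric, base_value in baseline.items():
        value = current.get(metric)
        if value is None:
            failures.append(f"{metric}: missing from current results")
        elif base_value - value > threshold:
            failures.append(f"{metric}: {value:.3f} vs baseline {base_value:.3f}")
    return failures


def main(report_path: str, baseline_path: str) -> int:
    """Return a process exit code: 1 if any metric regressed, else 0."""
    current = json.loads(Path(report_path).read_text())
    baseline = json.loads(Path(baseline_path).read_text())
    failures = check_regression(current, baseline)
    for failure in failures:
        print(f"REGRESSION {failure}")
    return 1 if failures else 0


print(check_regression({"accuracy": 0.88}, {"accuracy": 0.94}))
# ['accuracy: 0.880 vs baseline 0.940']
```

A CI step would pass `main`'s return value to `sys.exit`, so any regression fails the job.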

LLM Eval in CI Has a Cost

100 golden set cases × GPT-4o judge × 3 dimensions = ~300 judge calls × $0.005 = $1.50 per PR eval run. On an active team with 20 PRs/day, that's $30/day or ~$900/month. Optimize: use GPT-4o-mini as judge where calibrated (often works as well at 10× lower cost), run full eval only on significant prompt/model changes, use path triggers to skip irrelevant PRs. Track eval pipeline cost as a project metric.
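The arithmetic above is worth keeping as a small function so you can re-run it when the golden set or judge model changes. The per-call price is the estimate used in the text, not a quoted provider rate, and the 30-day month is an assumption.

```python
def eval_cost(cases: int, dimensions: int, cost_per_call: float,
              prs_per_day: int, days_per_month: int = 30) -> dict:
    """Estimate judge-eval spend per run, per day, and per month (USD)."""
    calls = cases * dimensions
    per_run = round(calls * cost_per_call, 2)
    per_day = round(per_run * prs_per_day, 2)
    return {
        "judge_calls": calls,
        "per_run": per_run,
        "per_day": per_day,
        "per_month": round(per_day * days_per_month, 2),
    }


print(eval_cost(cases=100, dimensions=3, cost_per_call=0.005, prs_per_day=20))
# {'judge_calls': 300, 'per_run': 1.5, 'per_day': 30.0, 'per_month': 900.0}
```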

Prompt Versioning — Treating Prompts Like Code In-depth

Prompts change frequently and silently break things. Without version control and an eval audit trail, you have no history of what changed, when, and what effect it had on quality.

What to Version Control

✅ All prompt templates (system + user)

✅ Model selection per feature

✅ Eval golden set (as code)

✅ Baseline scores per golden set version

✅ Judge prompts and rubrics

✅ Model API parameters (temperature, max_tokens)

The Prompt Change Record (in commit message)

What changed: System prompt — added instruction to cite sources

Why: Reduce hallucination rate in doc QA feature

Eval result: Groundedness: 3.8 → 4.2 (+0.4). Accuracy: no change.

Cost impact: +15 tokens/query (avg)

Baseline updated: Yes — new floor

∑ Chapter 09 — Key Takeaways

Eval in CI has four gates: PR gate (L1 + L2 judge) → staging gate (full set) → deploy (canary) → production (online eval)
Use path-based triggers — only run LLM eval when LLM-related files change; saves 80%+ of unnecessary API calls
Post eval results as PR comment — accuracy delta, format compliance, token count change, cost per run
CI eval cost: ~$1.50/run at 100 cases × GPT-4o judge × 3 dimensions — use GPT-4o-mini judge where calibrated (10× cheaper)
Prompt versioning: treat prompts, model selection, golden set, baselines, and judge configs as version-controlled code
Every prompt change commit should record: what changed, why, eval result, cost impact, whether baseline was updated

Chapter 10 · Ecosystem

Tooling — LangSmith, Langfuse, Promptfoo, and More

The eval and observability tooling ecosystem has matured rapidly. You don't need to build everything from scratch — but you do need to pick the right tools for your stack. The wrong tool choice leads to vendor lock-in, missing features, or paying for capabilities you don't need.

Tool Landscape — The Four Categories Foundation

🔭

Category 1: Tracing & Observability

Capture, store, and visualize traces from production. Focus: what happened in this specific request?

LangSmith — LangChain ecosystem
Langfuse — open-source, self-hostable
Arize Phoenix — local-first, OTel native
Custom OTel + Grafana

🧪

Category 2: Evaluation & Testing

Run eval pipelines, compare prompt versions, gate deployments. Focus: is this better or worse than before?

promptfoo — open-source CI eval
Braintrust — eval + experiment tracking
RAGAS — RAG-specific eval metrics
LangSmith — datasets + eval integrations

🏗️

Category 3: Prompt Management

Version, store, and deploy prompts. Focus: which prompt version is in production right now?

LangSmith Hub — prompt registry
Langfuse Prompts — prompt versioning + A/B
Braintrust — prompt snapshots
Git + plain files — the simplest option

📊

Category 4: Analytics & Cost

Track aggregate quality, cost, and usage over time. Focus: is quality trending up or down this week?

Langfuse — cost dashboard built-in
Custom Grafana dashboards — from OTel metrics
Provider dashboards — OpenAI, Anthropic usage pages

Tool Deep Dives — LangSmith, Langfuse, and promptfoo In-depth

Tool	Core Strength	Tracing	Eval	Prompt Mgmt	Hosting	Cost
LangSmith	End-to-end LangChain observability	✅ Native	✅ Datasets + CI	✅ Hub	SaaS (self-hosting on enterprise plans)	Free tier + paid plans
Langfuse	Self-hostable full-stack LLM observability	✅ SDK + OTel	✅ Built-in + RAGAS	✅ Versioning + A/B	Self-hosted or SaaS	Free self-hosted
promptfoo	CI-first eval framework	❌ Not tracing	✅ Best-in-class CLI + CI	⚠️ Basic	Open-source CLI	Free (open-source)
Braintrust	Eval + experiment tracking	⚠️ Basic spans	✅ Strong eval + A/B	✅ Prompt snapshots	SaaS only	Paid (usage-based)
Arize Phoenix	Local-first debugging + evals	✅ OTel native	✅ Evals + embedding viz	❌	Local + cloud	Free local tier
RAGAS	RAG-specific eval metrics	❌	✅ RAG metrics only	❌	Open-source library	Free (open-source)

Self-Hosted vs SaaS — The Decision Framework Core

Consideration	Choose SaaS	Choose Self-Hosted
Data sensitivity	Prompts contain no PII / confidential data	Prompts contain PII, IP, or regulated data (HIPAA, GDPR)
Team size / infra	Small team, no dedicated infra engineer	Mature infra team; existing K8s / monitoring stack
Time to value	Need tracing working in hours, not days	Can accept 1–2 days for initial setup
Cost at scale	SaaS costs rise linearly — large volumes become expensive	Fixed infra cost amortizes over volume
Compliance & audit	Provider compliance certifications sufficient	Need full data residency control and audit logs

The Recommended Starting Point

Start with promptfoo for CI eval (open-source, no data leaves your system) + Langfuse self-hosted for tracing and online eval (Docker Compose in 10 minutes, free, your data stays local). This covers 90% of production evaluation needs with no license fees; the ongoing cost is the infrastructure you run it on plus judge API calls. Graduate to SaaS tools (LangSmith, Braintrust) if you need richer integrations with LangChain or dedicated eval UX for a larger team.

Integration Patterns — Wiring Tools Into Your Stack In-depth

The best observability stack for most teams is not one monolithic tool — it's lightweight integration of best-of-breed components, each doing one thing well.

🔧

Langfuse tracing — 4-line integration

# Import paths follow the Langfuse Python SDK v2 (langfuse.decorators).
from langfuse.openai import openai  # Drop-in replacement for the openai module
from langfuse.decorators import langfuse_context, observe

# That's it: all OpenAI calls are now automatically traced.
# Langfuse captures: model, tokens, cost, latency, input, output.

# retrieve() and build_messages() are your own application functions.
@observe()  # Wraps any function as a traced span
def run_rag_pipeline(query: str) -> str:
    docs = retrieve(query)  # retrieval span (if retrieve() is also decorated)
    response = openai.chat.completions.create(  # auto-traced LLM span
        model="gpt-4o-mini",
        messages=build_messages(query, docs),
    )
    langfuse_context.update_current_observation(
        metadata={"doc_count": len(docs), "feature": "doc_qa"}
    )
    return response.choices[0].message.content

🔁

Recommended Stack (most teams)

promptfoo — CI eval gate on PRs
Langfuse (self-hosted) — tracing + online eval
Git — golden set + prompt versioning
Grafana — dashboards from OTel/Langfuse metrics

🏢

Enterprise Stack

LangSmith or Braintrust — team eval UI
Custom OTel + existing APM (Datadog)
Internal prompt registry + CI gates
RAGAS for RAG-specific metrics

🚀

Startup / Fast Start

LangSmith free tier — instant setup
promptfoo — CI eval (free)
Provider dashboards for cost tracking
Upgrade to self-hosted when data sensitivity requires it

∑ Chapter 10 — Key Takeaways

Four tool categories: tracing, evaluation/testing, prompt management, analytics/cost. Most teams need one tool from each
Best-in-class: promptfoo (CI eval), Langfuse (tracing + online eval, self-hostable), RAGAS (RAG metrics), Braintrust (eval UX)
Self-host when: PII/confidential prompts, regulated data, HIPAA/GDPR, high volume. Use SaaS when: small team, speed needed, no PII
Recommended starting stack: promptfoo + Langfuse self-hosted + Git, which covers 90% of needs with no license fees
Langfuse tracing integration: 4 lines of code — drop-in OpenAI replacement auto-traces all calls
Track eval pipeline cost itself as a project metric — it can reach $900+/month on active teams if unmanaged

Minimal Production Evaluation Architecture — The Complete Picture Reference

A production LLM system requires all four evaluation layers working together. Each layer serves a different purpose and catches failures the others miss. This is the minimum architecture — not an aspirational target.

The four-layer production evaluation system — all required, all complementary

✅

What Each Layer Catches

Offline: prompt regressions, model update breakage, format regressions
Online: distribution shift, new input patterns, subtle quality erosion
Monitoring: drift over time, cost anomalies, latency spikes
Feedback: known-failure regression prevention; evolving coverage

⚠️

What Breaks Without Each Layer

No offline eval: prompt changes break things silently
No online eval: distribution shift goes undetected for weeks
No monitoring: cost spikes and quality drifts are invisible until too late
No feedback loop: same failures recur; golden set never improves

The Final Principle — You Cannot Eyeball LLM Quality In-depth

Human inspection does not scale. Reading 10 responses and concluding "it looks good" is not evaluation; it is selection bias. The cases you inspect are rarely the cases that fail in production.

🙈

What Looks Correct

May fail on the specific edge cases your users actually send
May pass today but degrade silently after the next model update
May hallucinate confidently on topics that are rare in your test set

Human inspection tells you about the inputs and outputs you chose to look at. It tells you almost nothing about the distribution of inputs you haven't seen.

📊

What Measurement Gives You

A statistical signal over real inputs — not a cherry-picked sample
A baseline to detect regression — not a subjective "feels better"
A continuous production signal — not a one-time check before launch

Measurement turns "I think it works" into "it passes 94.2% of cases with format compliance at 99.1%." One is a guess. The other is an engineering decision.

🎯

The Engineering Standard

Every prompt change: measure before and after
Every model upgrade: run full eval suite first
Every production deployment: have an online eval signal within hours

If you wouldn't deploy a backend service without monitoring and error rate tracking, don't deploy an LLM system without eval pipelines and quality dashboards.

If you are not measuring it, you are not controlling it. LLM quality is not self-evident. It is measured, tracked, and defended — with golden sets, judges, traces, dashboards, and feedback loops. Every component in this guide exists because "it seemed fine" was not enough.
