Evaluation & Observability

Measuring what matters: benchmarks, LLM-as-judge, regression testing, tracing, and production monitoring for AI systems.

"It works in the demo" is not a deployment criterion. Without systematic evaluation and observability, you are flying blind: shipping changes that might improve or destroy quality, with no way to know which.

Chapter 01 · Foundations
Why Evaluation Matters: The Cost of Not Measuring

Evaluation is the discipline that separates LLM experimentation from LLM engineering. Without it, every prompt change is a guess, every model upgrade is a risk, and every production incident is a surprise. Eval is not a QA step; it is the feedback loop that makes iteration possible.

LLM-based systems do not behave like traditional software. The same input can produce different outputs, different reasoning paths, and different failure modes across runs. This is not a bug; it is inherent to the architecture. Evaluation must account for this explicitly.

🎲
Correctness Is a Distribution

A system that is "90% correct" fails 1 in 10 requests. At 10K queries/day that is 1,000 failures, even if every individual call "seems to work" when you test it manually.

  • Measure pass rate, not "does it work"
  • Track failure rate per input category
  • Report p50/p90 quality, not just avg
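The bullets above can be sketched as a small aggregation step. This is a minimal sketch, assuming each eval result is a `(category, passed)` pair; the function name `pass_rates` and the category labels are illustrative and not part of any particular eval framework.

```python
from collections import defaultdict


def pass_rates(results):
    """Return (overall_pass_rate, per_category_pass_rate) for (category, passed) pairs."""
    totals = defaultdict(int)
    passes = defaultdict(int)
    for category, passed in results:
        totals[category] += 1
        passes[category] += int(passed)
    if not totals:
        raise ValueError("no results to aggregate")
    per_category = {c: passes[c] / totals[c] for c in totals}
    overall = sum(passes.values()) / sum(totals.values())
    return overall, per_category


# Hypothetical results; the category names are made up for illustration.
results = [
    ("billing", True), ("billing", True), ("billing", False),
    ("refunds", True), ("refunds", False), ("refunds", False),
]
overall, per_category = pass_rates(results)
print(f"overall pass rate: {overall:.0%}")
for category, rate in sorted(per_category.items()):
    print(f"{category}: {rate:.0%}")
```

Slicing by category is what exposes the case where a healthy overall pass rate hides one input type that fails most of the time.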
📊
Evaluation Must Be Statistical

Testing one or two inputs tells you almost nothing about system reliability. Evaluation requires a dataset large enough to separate real signal from noise.

  • Single-input testing ≠ evaluation
  • Minimum: 50 diverse cases for signal
  • Run multiple times to estimate variance
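One way to see why dataset size matters is to put a confidence interval around a measured pass rate. The sketch below uses the Wilson score interval, which is a standard choice for binomial proportions but is an assumption here and not something the text prescribes; `wilson_interval` is an illustrative name.

```python
import math


def wilson_interval(passes: int, n: int, z: float = 1.96):
    """Wilson score interval for a pass rate (z = 1.96 gives roughly 95% coverage)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = passes / n
    denom = 1 + z ** 2 / n
    center = (p + z ** 2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z ** 2 / (4 * n ** 2)) / denom
    return center - half_width, center + half_width


# The same 90% observed pass rate at two dataset sizes.
for passes, n in [(45, 50), (450, 500)]:
    low, high = wilson_interval(passes, n)
    print(f"{passes}/{n}: {low:.3f} to {high:.3f}")
```

With 45 passes out of 50 the interval spans roughly 0.79 to 0.96, so a 3-point change between two runs on a 50-case set is well within noise. Larger sets or repeated runs narrow the interval.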
🛡️
Reliability Requires a Control Layer

Non-determinism cannot be eliminated, only bounded. Production systems must wrap LLM calls with validation, retries, and fallbacks that enforce acceptable behavior even when the model doesn't.

  • Validate every output before use
  • Retry on format failures (max 2–3×)
  • Fallback on repeated failure
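A control layer of this kind might look like the following. It is a minimal sketch under stated assumptions: `generate` stands in for whatever function calls your model, the `answer` key is a hypothetical required field, and validation is reduced to "parses as a JSON object with the required keys".

```python
import json
from typing import Callable, Optional


def call_with_guardrails(
    generate: Callable[[str], str],
    prompt: str,
    required_keys=("answer",),
    max_retries: int = 2,
    fallback: Optional[dict] = None,
) -> dict:
    """Validate every output, retry on format failures, fall back after repeated failure."""
    if fallback is None:
        fallback = {"answer": None, "error": "fallback"}
    for _ in range(1 + max_retries):  # first attempt plus retries
        raw = generate(prompt)
        try:
            data = json.loads(raw)
        except json.JSONDecodeError:
            continue  # format failure: try again
        if isinstance(data, dict) and all(key in data for key in required_keys):
            return data  # validated output
    return fallback


# Stub model for illustration: fails once, then returns valid JSON.
_outputs = iter(["not json", '{"answer": "42"}'])
result = call_with_guardrails(lambda prompt: next(_outputs), "What is 6 x 7?")
print(result)
```

The caller always receives a dictionary with a known shape, which is the point of the control layer: downstream code never sees an unvalidated model output.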

LLM outputs are hard to evaluate by reading them. A response can look fluent, well-formatted, and confident while being factually wrong, missing a key constraint, or breaking a downstream parser 15% of the time. Human spot-checking at scale is too slow, inconsistent, and biased.

🙈
The Fluency Trap

Fluent ≠ correct. LLMs produce grammatically perfect, confident-sounding text regardless of accuracy. Evaluating output by reading it catches obvious failures, not subtle ones.

  • Wrong answer, perfect prose
  • Missing constraints, elegant format
  • Hallucinated facts, appropriate hedging
📉
Silent Degradation

Without live eval, quality degrades invisibly. Model updates, prompt changes, schema changes, and input distribution shifts all erode quality silently, until a user reports it.

  • Prompt change improves A, breaks B
  • Model upgrade changes tone/format
  • Edge case inputs grow over time
🎯
The Eval Feedback Loop

Evaluation converts a vague "something feels off" into a measurable "format compliance dropped 8% after the last prompt change." The latter is actionable; the former is not.

  • Quantified quality signal
  • A/B comparison between versions
  • Regression detection before users notice
The Core Principle

Every change to an LLM system (prompt, model, temperature, schema, retrieval) is a hypothesis. Evaluation is the experiment that tells you whether the hypothesis is correct. Without evaluation, you're not engineering; you're guessing with extra steps.

LLM evaluation is not a single number. Different properties require different measurement approaches, and they don't always move together. A prompt change can improve accuracy while breaking format compliance.

Dimension | What It Measures | How to Measure | Priority
Functional correctness | Does output satisfy the task? (right answer, correct classification) | Exact match / regex / unit tests on output | Highest (ship-blocking)
Format compliance | Is output parseable? Does it match the required schema? | JSON parse attempt / schema validation / regex | High (downstream systems break on failure)
Factual accuracy | Are stated facts true? Are claims grounded in source? | LLM-as-judge / grounding check / human review | High for knowledge tasks
Quality / tone | Is output helpful, appropriate, on-brand? | LLM-as-judge with rubric / human rating | Medium (subjective but important)
Safety / refusal | Does the system refuse harmful requests? Does it over-refuse benign ones? | Red-team datasets / adversarial test suite | Critical for user-facing systems
Latency | Time to first token / total response time | Instrumented timing / p50/p95/p99 | Medium (SLA dependent)
Cost per query | (Input tokens × input price) + (output tokens × output price) | Token count logging × price table | Medium (economics)

Evaluation pipelines themselves incur real cost. Running the wrong eval strategy produces false confidence and unnecessary API spend at the same time.

Eval Type | Run When | Cost per Run | Why This Cadence
Deterministic checks | Every commit, every output in production | ~$0 | Zero marginal cost; no reason not to run always
LLM judge (golden set) | Every pull request | $0.50–$2.00 per run (100–200 samples) | Catches quality regressions before merge; cost is low vs risk
Human eval | Major releases / model changes | $50–$500+ per review | Too slow and expensive for every change; reserved for high-stakes decisions
Online sampling + judge | Continuous in production (1–5% of traffic) | Scales with traffic volume | Real distribution signal; catches drift offline tests miss
Common Eval Cost Mistakes

Running an LLM judge on every commit (not just PRs) can burn $20–$50/day with no added signal. Verbose judge prompts inflate token counts and judge costs. Evaluating with GPT-4o when a calibrated GPT-4o-mini gives the same scores at roughly 10× lower cost wastes budget. Track your eval pipeline spend as a separate budget line: it's a real operational cost, not a sunk cost.

Not all evals should run on every change. The principle: fast, cheap evals run always; slow, expensive evals run on significant changes. This is the eval equivalent of a testing pyramid: unit tests at the base, integration tests in the middle, human review at the top.

The evaluation hierarchy: cost per sample increases and run frequency decreases from Level 1 to Level 4
Level 1: Deterministic Evals | Schema validation · JSON parse · Exact match · Regex · Field presence | Cost: ~$0 · Time: <1ms/sample · Run: every commit (always on)
Level 2: Model-Based Evals (LLM-as-Judge) | LLM judge · ROUGE/BLEU · Embedding similarity | Cost: $0.001–$0.01/sample · Time: 1–3s · Run: on PR (per PR)
Level 3: Human Evaluation | Expert review · Preference ranking · Red-teaming | Cost: $0.10–$1+/sample · Time: minutes · Run: major changes (major releases)
Level 4: A/B Production | Run: post-launch
⚡
Level 1: Run Everything, Always

Deterministic checks have zero marginal cost. JSON parsing, schema validation, field presence checks, and regex format checks should run on every single output, in tests and in production.

  • Takes milliseconds per sample
  • Catches format regressions immediately
  • Gate all deployments on 100% pass
🤖
Level 2: LLM Judge on PRs

Run LLM-as-judge evaluation on your golden set for every pull request. 100–200 samples × $0.005/judge call ≈ $0.50–$1.00 per PR. That is worth it: it catches quality regressions before merge.

  • ~$0.50–$1.00 per eval run
  • Catches subtle quality changes
  • Automated, no human bottleneck

Before any LLM-based evaluation runs, enforce deterministic checks. These have near-zero cost, run in milliseconds, and catch the most impactful failures: format errors that would crash downstream systems or produce silently corrupt data.

What Deterministic Guards Check

✅ JSON schema validation: does the output match the required schema?

✅ Required field presence: are all expected fields non-null?

✅ Regex constraints: does a field match its expected pattern (date, email, ID)?

✅ Type validation: is a numeric field actually a number?

✅ Enum validation: is a classification label one of the allowed values?

The Key Rule

If a failure can be caught deterministically, it must never reach an LLM judge.

Running an LLM judge on a malformed JSON response costs money and adds latency, when the right response is to fail immediately with a clear error.

Deterministic guards: ~$0, <1ms, run on every output

LLM judge: $0.005+, 1–3s, run on sampled outputs

Apply the cheapest check that can detect the failure, and escalate only when cheaper checks pass.
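The escalation rule can be sketched as a two-stage evaluator. This is a minimal sketch using only the standard library; the field names (`label`, `date`, `amount`), the allowed label set, and the `judge` callable are all hypothetical stand-ins for your own schema and judge.

```python
import json
import re

ALLOWED_LABELS = {"billing", "technical", "other"}  # hypothetical enum values
DATE_PATTERN = re.compile(r"^\d{4}-\d{2}-\d{2}$")   # assumed date format


def deterministic_errors(raw: str) -> list:
    """Return a list of format errors; an empty list means all guards passed."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return ["invalid JSON"]
    if not isinstance(data, dict):
        return ["top-level value is not an object"]
    missing = [f for f in ("label", "date", "amount") if data.get(f) is None]
    if missing:
        return [f"missing field: {f}" for f in missing]
    errors = []
    if data["label"] not in ALLOWED_LABELS:
        errors.append("label not in allowed set")
    if not (isinstance(data["date"], str) and DATE_PATTERN.match(data["date"])):
        errors.append("date does not match YYYY-MM-DD")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(data["amount"], bool) or not isinstance(data["amount"], (int, float)):
        errors.append("amount is not a number")
    return errors


def evaluate(raw: str, judge) -> dict:
    """Run the cheap checks first; call the judge only if they all pass."""
    errors = deterministic_errors(raw)
    if errors:
        return {"passed": False, "stage": "deterministic", "errors": errors}
    return {"passed": True, "stage": "judge", "judge": judge(raw)}


print(evaluate('{"label": "billing"', judge=lambda raw: {"score": 5}))
```

Because the judge is passed in as a function, a malformed output never triggers a paid judge call, and the failure record says exactly which guard rejected it.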

Offline Evaluation (pre-production)

What: Run fixed test set against system before deployment

When: During development, on every significant change, blocking deployment

Pros: Controlled, reproducible, no user impact

Cons: Test set may not match production distribution; eval inputs may become stale

Tools: promptfoo, LangSmith, custom pytest harness

Online Evaluation (post-production)

What: Continuously sample and evaluate live traffic

When: Always running in production at 1–10% sampling rate

Pros: Real distribution, catches drift, finds failures offline tests miss

Cons: Failures already reached users; LLM judge adds cost per sample

Tools: LangSmith, Langfuse, custom sampling + judge pipeline

Offline Eval Is Necessary But Not Sufficient

A golden test set built in January will not cover the inputs your users are actually sending in July. Production inputs drift over time: new topics, new edge cases, adversarial inputs. Online evaluation sampling at 5% of production traffic, judged automatically, is the only way to know if quality is holding as inputs evolve. Run both; they catch different things.
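Selecting a fixed fraction of live traffic for judging can be done statelessly by hashing a request identifier. The following is a minimal sketch, assuming every request carries a unique string ID; `should_sample` is an illustrative name, and hash-based sampling is one reasonable choice, not the only one.

```python
import hashlib


def should_sample(request_id: str, rate: float = 0.05) -> bool:
    """Deterministically select about `rate` of requests for online evaluation."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2 ** 64
    return bucket < rate


# Hypothetical request IDs, used only to show the achieved sampling rate.
sampled = sum(should_sample(f"req-{i}") for i in range(20_000))
print(f"sampled {sampled} of 20000 requests")
```

Hashing instead of calling a random number generator means the same request is always either in or out of the sample, which keeps reruns and debugging reproducible.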

Failure Pattern | How It Happens | Discovery Without Eval | Prevention
Prompt regression | Fix one failure mode in a prompt → silently break 3 others | User complaints weeks later | Eval on full golden set before merge
Model update breakage | Provider updates GPT-4o silently; JSON structure changes slightly | Parser 500s in production | Eval + model version pinning
Distribution shift | New user cohort sends different query types than anticipated | Low satisfaction scores over weeks | Online eval sampling detects drop early
Format drift | Downstream parser changes; LLM still outputs old format | Silent data corruption in DB | Schema validation on every output
Cost explosion | Prompt grows longer; token count doubles; nobody notices until bill arrives | Monthly invoice shock | Token count tracking in eval + alerts

LLM failures in production fall into four distinct categories, each with different visibility, detection difficulty, and downstream impact. Understanding which category a failure belongs to determines how to catch and fix it.

🔴
Hard Failures: Immediately Visible

System crashes, parser throws, API returns error. These are easy to detect but still must be handled gracefully.

  • Invalid JSON: parser throws
  • Missing required field: null pointer
  • Wrong data type: downstream cast fails
  • Detection: deterministic checks, error monitoring
🟡
Soft Failures: Subtly Wrong

Output parses successfully but is partially incorrect. These pass format checks but fail quality checks, and are often found only via LLM judge or human review.

  • Partially correct answer (misses one constraint)
  • Correct structure, wrong content
  • Missing context that changes the answer
  • Detection: LLM judge with multi-dimension rubric
⚫
Silent Failures: Most Dangerous

Output looks correct, passes all checks, reaches users. But it's wrong. These corrupt downstream systems, erode user trust, and are almost impossible to detect without continuous quality sampling.

  • Hallucinated values that look plausible
  • Confident wrong answers with no hedging
  • Format drift (subtle schema deviation)
  • Detection: online eval sampling + grounding checks
📉
Behavioral Drift: Gradual Degradation

The system works but quality drifts over time. Tone changes, verbosity increases, instruction adherence drops. There is no single failure, just a slow erosion of quality.

  • Longer responses than specified
  • Brand tone gradually shifts
  • Reliability of structured output drops week-over-week
  • Detection: rolling avg judge scores over time
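Rolling-average detection can be sketched with a fixed-size window over judge scores. This is a minimal sketch: the class name `RollingScore`, the window size, the baseline of 4.2, and the allowed drop of 0.3 are all assumed values that you would calibrate on your own traffic.

```python
from collections import deque


class RollingScore:
    """Track a rolling mean of judge scores and flag drops below a baseline."""

    def __init__(self, window: int = 200, baseline: float = 4.2, max_drop: float = 0.3):
        self.scores = deque(maxlen=window)  # oldest scores fall out automatically
        self.baseline = baseline
        self.max_drop = max_drop

    def add(self, score: float) -> bool:
        """Record a score; return True when the full-window mean has drifted too low."""
        self.scores.append(score)
        if len(self.scores) < self.scores.maxlen:
            return False  # not enough data for a stable average yet
        mean = sum(self.scores) / len(self.scores)
        return mean < self.baseline - self.max_drop


# Tiny window so the behavior is visible; real windows would be much larger.
monitor = RollingScore(window=3, baseline=4.0, max_drop=0.5)
alerts = [monitor.add(score) for score in [4, 4, 4, 3, 3]]
print(alerts)
```

No single score of 3 triggers the alert; only the sustained shift in the window mean does, which matches the "no single failure" nature of drift.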
Silent Failures Are the Most Expensive

Hard failures get fixed immediately because they break the system loudly. Silent failures pass all your checks, reach all your users, and quietly corrupt downstream data. By the time you notice (user report, data audit), the failure has been happening for days or weeks. The only defense is continuous online evaluation that samples and judges live traffic, not just offline testing that only catches what you thought to test for.

Quality and reliability are different goals, and they require different engineering approaches. A system can be high quality on good inputs while being completely unreliable at scale.

High Quality (but unreliable)

Produces excellent outputs on typical inputs

Fails unpredictably on edge cases

Hard to test because failures are non-obvious

Users experience occasional great results, and occasional crashes

Trust level: low (users can't predict when it works)

High Reliability (production-grade)

Produces consistently acceptable outputs across all inputs

Edge cases handled gracefully: fallback, "I don't know," or structured error

Testable: pass/fail rate stable across runs

Users experience predictable behavior, not occasional brilliance

Trust level: high (behavior is predictable)

In Production: Reliability Over Peak Quality

A system that produces brilliant output 70% of the time and crashes or hallucinates 30% of the time is not production-grade. Reliability is what makes users trust the system, and trust is built by consistent, predictable behavior, not occasional impressive outputs. Target reliability first (measure failure rate, build guardrails) before optimizing for peak quality.

Prompt engineering without evaluation is unstable. Most prompt changes fix one issue and introduce new failures. Without a full eval suite, you can't know whether a change is a net improvement or a net regression.

Scenario | Without Eval | With Eval
Prompt change to fix hallucination | Seems fixed in manual testing; format compliance silently dropped 5% | Eval shows: hallucination ↓ 15%, format compliance ↓ 5%. Net positive, but the regression is caught.
Model upgrade (mini → full) | Quality feels better; cost increased 16× and token use rose about 30%, both unnoticed | Eval shows: accuracy +3%, cost 16×, tokens +28%. Decide intentionally.
Adding few-shot examples | Looks better on the 5 cases you tested; broke 8% of edge cases | Eval on 200-case golden set shows: common cases improved, edge cases regressed.
The Prompt Engineering Workflow

Every prompt change should: (1) improve at least one metric, (2) not regress any metric beyond threshold, (3) be recorded with before/after eval scores. Track metric deltas, not just pass/fail. A change that improves accuracy from 88% to 91% is valuable. The same change that simultaneously drops format compliance from 99% to 94% may not be worth shipping.
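The three rules can be expressed as a small gate over before/after scores. This is a minimal sketch, assuming every metric is "higher is better" and that per-metric regression thresholds are chosen by you; the function name `gate` and the 1-point thresholds are illustrative.

```python
def gate(before: dict, after: dict, max_regression: dict) -> dict:
    """Decide whether a change ships, based on per-metric deltas (higher is better)."""
    deltas = {metric: after[metric] - before[metric] for metric in before}
    improved = [m for m, d in deltas.items() if d > 0]
    regressed = [m for m, d in deltas.items() if d < -max_regression.get(m, 0.0)]
    return {
        "deltas": deltas,                 # record these alongside the change
        "improved": improved,
        "regressed": regressed,
        "ship": bool(improved) and not regressed,
    }


# The example from the text: accuracy 88% -> 91%, format compliance 99% -> 94%.
decision = gate(
    before={"accuracy": 0.88, "format_compliance": 0.99},
    after={"accuracy": 0.91, "format_compliance": 0.94},
    max_regression={"accuracy": 0.01, "format_compliance": 0.01},  # assumed thresholds
)
print(decision["ship"], decision["regressed"])
```

Under these thresholds the change is blocked: the accuracy gain is real, but the 5-point drop in format compliance exceeds the allowed regression.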

Anti-Pattern | What Happens | The Fix
"Looks good to me" | Human approval of a few test cases masquerades as evaluation. Fails silently on edge cases. | Require automated eval on 50+ cases before any merge. Manual review is a supplement, not a replacement.
Single-metric optimization | Accuracy improves; format compliance, latency, and cost all regress. Nobody noticed because only one metric was tracked. | Track all critical dimensions on every eval run. Block if any key metric regresses.
Static golden set | Golden set from 6 months ago. New features and user behaviors not covered. High offline scores, poor production quality. | Add cases from production failures monthly. Assign ownership for golden set maintenance.
Over-relying on LLM judge | LLM judge gives 4.2/5, which feels like high quality. JSON parse failure rate is 8%. Judge never checked format. | Always run deterministic checks first. LLM judge is for quality dimensions that deterministic checks can't cover.
Ignoring eval cost | LLM judge runs on every commit with GPT-4o. $30/day eval spend. Nobody noticed for 3 months. | Run LLM judge on PRs only. Use mini model when calibrated. Track eval pipeline cost as a budget line.

∑ Chapter 01 — Key Takeaways

  • Evaluation is the feedback loop that turns LLM changes from guesses into measurable improvements
  • Fluent ≠ correct; silent degradation is invisible without systematic measurement
  • Measure 7 dimensions: functional correctness, format compliance, factual accuracy, quality, safety, latency, cost
  • The eval hierarchy: deterministic checks (always) → LLM judge (per PR) → human eval (major changes) → A/B (production)
  • Offline eval gates deployment; online eval catches drift in production. Both are required
  • The five silent failure patterns: prompt regression, model update breakage, distribution shift, format drift, cost explosion
Chapter 02 · Measurement
Benchmarks: Standard Evaluations for LLMs

Benchmarks let you compare models on standardized tasks. But benchmark performance and production performance are not the same thing. Understanding what benchmarks measure, and what they miss, is essential before using them to make model selection decisions.

Benchmark | What It Tests | Format | Useful For
MMLU (Hendrycks et al. 2021) | World knowledge across 57 academic subjects (STEM, humanities, law, medicine) | Multiple choice, 4 options | Comparing knowledge breadth; model selection for knowledge-intensive tasks
HumanEval (OpenAI 2021) | Python function completion from docstrings; 164 programming problems | Code generation, unit test pass rate | Code assistant model selection; comparing coding capability
MT-Bench (LMSYS 2023) | Multi-turn conversation quality across 8 categories (writing, math, coding, reasoning) | LLM-as-judge scoring by GPT-4 | Chat model quality; instruction following across domains
GPQA (Rein et al. 2023) | Graduate-level science questions designed to be hard for non-experts | Multiple choice, expert-validated | Frontier model capability; distinguishing top-tier models
GSM8K (Cobbe et al. 2021) | Grade-school math word problems requiring multi-step arithmetic reasoning | Free-form answer, exact match | Multi-step reasoning; CoT effectiveness
HellaSwag (Zellers et al. 2019) | Commonsense reasoning: which sentence continues an activity description correctly | Multiple choice | Common sense; less useful for frontier models (most score 95%+)
LMSYS Chatbot Arena (LMSYS 2023) | Human preference ranking: users compare two anonymous model responses head-to-head | Elo ranking from human votes | Real user preference; most production-relevant benchmark
The Most Production-Relevant Benchmark

Of all public benchmarks, LMSYS Chatbot Arena most closely predicts which models users prefer in practice, because it uses real human preference data rather than academic tasks. MMLU tells you about knowledge breadth. Arena tells you about perceived output quality. Use both, but weight Arena more heavily for user-facing applications.

🎯
Data Contamination

If benchmark questions appeared in the model's training data, scores reflect memorization, not capability. This is increasingly common as benchmarks become widely used.

  • Impossible to verify from outside
  • Inflates reported scores
  • New benchmarks contaminate faster than expected
📊
Benchmark Saturation

When most frontier models score 85–92% on a benchmark, it can no longer distinguish between them. HellaSwag is essentially useless for comparing GPT-4o vs Claude 3.5, since both score 95%+.

  • Old benchmarks can't rank new models
  • Need constantly harder challenges
  • GPQA was designed specifically for this
🔧
Task-Distribution Mismatch

Academic benchmarks test standardized tasks. Your production system has a specific input distribution. A model that tops MMLU may be worse than a smaller model on your specific task type.

  • MMLU doesn't predict JSON extraction quality
  • HumanEval ≠ Python code review quality
  • Always run domain-specific eval
Goodhart's Law Applies to Benchmarks

Once a benchmark becomes a target, it ceases to be a good measure. Labs optimize specifically for leaderboard benchmarks through training data selection, prompt engineering, and sometimes cherry-picking evaluation conditions. A model's actual usefulness on your task may be uncorrelated with its leaderboard position. Always validate on your own data before making model selection decisions from benchmarks alone.

Decision | Use Benchmarks? | What to Use Instead / Also
Initial model shortlisting | Yes, to filter obvious losers | MMLU for knowledge tasks, HumanEval for code, Arena Elo for general quality
Final model selection | Insufficient alone | Your own golden set + task-specific eval is mandatory
Tracking model provider updates | Watch for score changes | Your production eval set is more reliable signal
Comparing your fine-tuned model to base | Yes, use MMLU to detect capability regression | Domain eval for capability gain measurement
Communicating model quality externally | Use with caveats | Benchmark + task-specific results together tell a more honest story

For production systems, custom benchmarks tuned to your task type are more valuable than any public benchmark. They measure what you actually care about, on representative inputs from your users.

1. Sample inputs: 100–500 real/realistic
2. Annotate outputs: human or LLM-judge
3. Define metrics: accuracy, format, quality
4. Automate runner: promptfoo / pytest
5. Run on change: gate deployment
Minimum Viable Benchmark

Start with 50 diverse inputs, not 500. Cover: (1) typical cases (60%), (2) hard/ambiguous cases (20%), (3) edge cases and failures you've seen in production (20%). A 50-case eval that runs in CI catches 80% of regressions. Perfect coverage is the enemy of getting started. Add cases as you discover new failure modes.

∑ Chapter 02 — Key Takeaways

  • Key benchmarks: MMLU (knowledge), HumanEval (code), MT-Bench (chat quality), GSM8K (reasoning), Arena (real user preference)
  • LMSYS Chatbot Arena is the most production-relevant benchmark because it uses real human preference data
  • Three benchmark failure modes: data contamination, saturation, task-distribution mismatch
  • Benchmarks are for initial shortlisting; final model selection requires your own task-specific eval
  • Goodhart's Law: when a benchmark becomes a target, it stops being a good measure (leaderboard ≠ production performance)
  • Build a 50-case custom benchmark before anything else: typical (60%), hard (20%), edge cases (20%)
Chapter 03 · Scalable Eval
LLM-as-Judge: Automated Qualitative Evaluation

Human evaluation is the gold standard, but it costs $0.10–$1+ per sample and can't scale to thousands of daily outputs. LLM-as-judge closes the gap: automated qualitative evaluation that costs $0.001–$0.01 per sample and scales to high volumes. When designed correctly, it agrees with human judgment roughly 80–90% of the time.

LLM-as-judge uses a capable frontier model (GPT-4o, Claude 3.5 Sonnet) to evaluate the output of another model, or even the same model. The judge receives the original input, relevant context, the output to evaluate, and a scoring rubric. It returns a score plus a brief justification.

LLM-as-judge evaluation pipeline
Input (user query + context) → Your System (model under test; generates output) → Judge LLM (GPT-4o / Claude Sonnet; Input + Output + Rubric → Score (1–5) + Reason; ~$0.003–0.01 per evaluation) → Score + Reason (structured output; logged + aggregated) → Aggregate (avg score / pass%)
When LLM-as-Judge Works Well

✅ Evaluating text quality, helpfulness, tone

✅ Grounding checks (does output use source material?)

✅ Multi-turn conversation coherence

✅ Relative comparison: "which response is better?"

✅ Structured scoring with clear rubrics (1–5 scale)

When LLM-as-Judge Falls Short

โŒ Mathematical / code correctness (use unit tests)

โŒ Tasks requiring specific domain expertise

โŒ Very long outputs (>2K tokens) โ€” judge loses focus

โŒ Fine-grained factual claims without reference source

โŒ Safety evaluation โ€” judges can be jailbroken

The quality of your LLM judge is almost entirely determined by the quality of your judge prompt. A weak judge prompt produces noisy, inconsistent scores. A strong judge prompt produces reliable, calibrated scores that correlate with human judgment.

โŒ
Weak judge prompt โ€” vague, inconsistent
System: You are an expert evaluator. Rate the response quality. Score from 1-10 and explain why. Input: {question} Response: {output}

Problems: "quality" is undefined, scale is vague (what's a 6?), no structured output, inconsistent across runs.

✅
Strong judge prompt: explicit rubric, structured output
System: You are evaluating AI assistant responses for a customer support system.
Score the response on HELPFULNESS only: how well it resolves the user's issue.
Scoring rubric:
5 = Fully resolves the issue with correct, actionable steps
4 = Mostly resolves with minor gaps or ambiguity
3 = Partially addresses but missing key steps
2 = Identifies the issue but gives incorrect or unhelpful guidance
1 = Does not address the user's issue at all
Return JSON only: {"score": <1-5>, "reason": "<one sentence>", "missing": "<what's missing, or null>"}
User query: {question}
Response to evaluate: {output}
Evaluate:

Specific dimension, explicit per-score definitions, structured JSON output, consistent reasoning required.

Design Principle | Why It Matters
One dimension per judge | Helpfulness + accuracy + tone in one prompt → confused, noisy scores. Use one judge per dimension, aggregated separately.
Explicit per-level rubric | A scale of 1–5 without definitions means different things on each call. Define exactly what each score value means.
Require structured output | JSON output is parseable and consistent. Freeform reasoning varies in format and is hard to aggregate.
Include reference answer if available | Comparing to a ground-truth answer dramatically improves accuracy evaluation over judging in isolation.
Ask for a reason | The justification catches judge errors: if the reason contradicts the score, the evaluation is unreliable.

One overall quality score hides signal. A response can be perfectly accurate but poorly formatted, or beautifully written but factually wrong. Multi-dimension scoring separates these signals so you know what to fix.

🎯
Dimension: Correctness

"Is the answer factually accurate and does it satisfy the user's stated request?"

  • 5: Fully correct, nothing to dispute
  • 3: Mostly correct, minor inaccuracy
  • 1: Incorrect or misleading answer
📋
Dimension: Completeness

"Does the answer cover all aspects of the question, or does it miss key parts?"

  • 5: All aspects addressed
  • 3: Main answer present, details missing
  • 1: Major parts of question unanswered
🔗
Dimension: Groundedness

"For RAG systems: is every claim in the answer supported by the provided source documents?"

  • 5: Every claim traced to source
  • 3: Mostly grounded, one unsupported claim
  • 1: Significant hallucination present
The Four Canonical Dimensions (for most systems)

Most production LLM systems benefit from exactly four judge dimensions: (1) Correctness: is the answer right? (2) Completeness: does it cover everything asked? (3) Groundedness: are claims supported? (critical for RAG) (4) Format compliance: handled deterministically, not by LLM judge. Run one judge call per dimension, return structured JSON per call, aggregate across your golden set.

Bias | What It Is | Mitigation
Verbosity bias | Judges rate longer responses higher, even when a shorter answer is better | Add explicit instruction: "Do not reward verbosity. A concise correct answer scores higher than a verbose correct answer."
Self-preference bias | GPT-4o-as-judge prefers GPT-4o outputs; Claude-as-judge prefers Claude outputs | Use a different model family as judge than the model being evaluated. Or use multiple judges and average.
Position bias | In A-vs-B comparisons, judge prefers whichever response appears first (or second) | Run each comparison twice with reversed order. Only accept agreements; reclassify disagreements as ties.
Sycophancy | If you include "this response is from our best model", judge inflates score | Never include model identity in judge prompt. Blind evaluation only.
Formatting halo | Well-formatted responses (headers, bullet points) get higher scores regardless of content quality | Add: "Evaluate content quality, not formatting. Ignore markdown styling when scoring."
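The position-bias mitigation in the table can be sketched as a wrapper around any pairwise judge. This is a minimal sketch, assuming a hypothetical `judge_pair(question, first, second)` callable that returns the string `"first"` or `"second"`; the stub judges below exist only to show the behavior.

```python
def debiased_compare(judge_pair, question: str, answer_a: str, answer_b: str) -> str:
    """Run an A-vs-B comparison in both orders; disagreements count as ties."""
    verdict_ab = judge_pair(question, answer_a, answer_b)  # A shown first
    verdict_ba = judge_pair(question, answer_b, answer_a)  # B shown first
    winner_ab = "A" if verdict_ab == "first" else "B"
    winner_ba = "B" if verdict_ba == "first" else "A"
    return winner_ab if winner_ab == winner_ba else "tie"


# Stub judge with pure position bias: it always picks whatever is shown first.
def always_first(question, first, second):
    return "first"


# Stub judge with a real (if crude) preference: it picks the longer answer.
def prefers_longer(question, first, second):
    return "first" if len(first) >= len(second) else "second"


print(debiased_compare(always_first, "q", "short", "a much longer answer"))
print(debiased_compare(prefers_longer, "q", "short", "a much longer answer"))
```

A judge that only responds to position produces contradictory verdicts across the two orders and is reported as a tie, while a judge with a consistent preference survives the swap. The cost is two judge calls per comparison.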

Before trusting your LLM judge at scale, validate that its scores correlate with human judgment. This is called calibration, and it's what separates a reliable eval pipeline from a false sense of measurement.

1. Human labels: 50–100 samples rated by humans
2. Judge labels: same samples, LLM judge
3. Agreement rate: target ≥80% on binary pass/fail
4. Inspect disagreements: refine rubric on systematic gaps
5. Re-calibrate: after rubric changes
Never Deploy an Uncalibrated Judge

An LLM judge you haven't validated against human labels is measuring something; you just don't know what. A judge that looks correct on casual inspection may be systematically scoring a known failure mode as passing. Validate on at least 50 human-labeled examples before using any judge in CI. If judge-human agreement is below 75%, your rubric needs work before the judge is trustworthy.
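The agreement check can be sketched in a few lines. This is a minimal sketch, assuming both humans and the judge score on the same 1–5 scale and that a score of 4 or above counts as a pass; that threshold and the function names are assumptions to adapt to your rubric.

```python
def agreement_rate(human_scores, judge_scores, pass_threshold: int = 4) -> float:
    """Fraction of samples where human and judge agree on binary pass/fail."""
    if not human_scores or len(human_scores) != len(judge_scores):
        raise ValueError("score lists must be non-empty and the same length")
    matches = sum(
        (h >= pass_threshold) == (j >= pass_threshold)
        for h, j in zip(human_scores, judge_scores)
    )
    return matches / len(human_scores)


def disagreements(human_scores, judge_scores, pass_threshold: int = 4) -> list:
    """Indices of samples to inspect when refining the rubric."""
    return [
        i for i, (h, j) in enumerate(zip(human_scores, judge_scores))
        if (h >= pass_threshold) != (j >= pass_threshold)
    ]


# Hypothetical labels for five samples; a real calibration set needs 50 or more.
human = [5, 4, 2, 3, 5]
judge = [4, 5, 4, 2, 5]
print(agreement_rate(human, judge), disagreements(human, judge))
```

The disagreement indices matter as much as the rate: reading those specific samples is how you find the systematic gap in the rubric.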

🔧
Minimal judge implementation (Python)
import json
from typing import Optional

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

JUDGE_SYSTEM = """You evaluate AI assistant responses for correctness.
Score 1-5 where:
5 = Fully correct and complete
4 = Mostly correct, minor omission
3 = Partially correct
2 = Incorrect with some valid content
1 = Completely incorrect or irrelevant
Return JSON: {"score": <1-5>, "reason": "<one sentence>"}"""


def judge(question: str, response: str, reference: Optional[str] = None) -> dict:
    """Score one response and return the judge's parsed JSON verdict."""
    user_msg = f"Question: {question}\nResponse: {response}"
    if reference:
        # A ground-truth answer, when available, anchors the correctness score.
        user_msg += f"\nReference answer: {reference}"
    result = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": JUDGE_SYSTEM},
            {"role": "user", "content": user_msg},
        ],
        response_format={"type": "json_object"},  # JSON mode keeps output parseable
        temperature=0,  # reduces run-to-run score variance
    )
    return json.loads(result.choices[0].message.content)

∑ Chapter 03 — Key Takeaways

  • LLM-as-judge: $0.001–0.01 per sample, agrees with human judgment roughly 80–90% of the time when designed correctly
  • Works for: quality, helpfulness, groundedness, coherence. Does not work for: math, code correctness, safety
  • One dimension per judge call; combined rubrics produce noisy, ambiguous scores
  • Strong judge prompts need: explicit per-level rubric, structured JSON output, reasoning field
  • Five key biases to mitigate: verbosity, self-preference, position, sycophancy, formatting halo
  • Calibrate against 50+ human labels before trusting any judge in CI; target ≥80% agreement on binary pass/fail
  • Use temperature=0 and json_object mode for consistent, parseable judge outputs
Chapter 04 · Test Data
Golden Sets: Building Your Evaluation Dataset

A golden set is the foundation of every eval pipeline. Without one, you have no baseline, no regression detection, and no way to compare prompt versions. Building a high-quality golden set is the most important engineering task in LLM evaluation. It is built once, then maintained continuously.

A golden set is a curated collection of (input, expected output) pairs, or more precisely (input, evaluation criteria) pairs, that represent the task your system must perform. "Golden" means human-verified: each case has been reviewed and annotated to define what correct looks like.

📥
Input

The query, document, or task given to the system. Should be representative of real production traffic, not synthetic or idealized.

  • Sampled from real user inputs
  • Anonymized if needed (PII removal)
  • Diverse across your task distribution
๐ŸŽฏ
Expected Output / Criteria

Defines what "correct" means for this input. Can be an exact reference answer, a set of required elements, or rubric criteria.

  • Exact answer (classification, extraction)
  • Required fields / key points (summarization)
  • Rubric criteria (open-ended quality)
Metadata

Tags that enable sliced analysis: query type, difficulty level, failure category, date added. Essential for understanding which cases regressed.

  • Difficulty: easy / medium / hard
  • Category: topic or task type
  • Source: synthetic / prod sample / manual
Source | How | Pros | Cons
Production sampling | Log 1–5% of live traffic; human-annotate a sample | Real distribution; catches actual failure modes | Requires annotation pipeline; PII concerns
Manual curation | Domain expert writes inputs covering known difficulty areas | High quality; targets known hard cases | Slow; may not cover real distribution
Failure mining | Collect every confirmed system failure → add to golden set | Directly prevents known regressions | Reactive; only catches known problems
LLM-assisted generation | Use a strong model to generate diverse inputs; human-verify | Fast at scale; can cover edge cases systematically | Distribution differs from real users; needs review
Adversarial construction | Deliberately craft inputs that test edge cases, ambiguity, format stress | Finds failure modes that sampling misses | Requires effort; hard to know what to target
The Recommended Mix

A production-grade golden set should contain: 60% production-sampled (real distribution), 20% failure-mined (known regressions), 20% adversarial/edge cases (hard cases your sampling won't catch naturally). The failure-mined cases are critical: they ensure that every bug you've fixed stays fixed.

Task Distribution Coverage

Your golden set must represent the full range of task types your system handles, not just the most common. An eval set of only easy cases gives you a false sense of quality.

  • All major task categories proportionally represented
  • Long inputs and short inputs included
  • All supported languages/domains
Edge Case Coverage

Edge cases are where production systems break. Cover empty inputs, extremely long inputs, ambiguous queries, multi-intent queries, and adversarial phrasing.

  • Empty / one-word inputs
  • Inputs near context window limit
  • Ambiguous or contradictory requests
Format Stress Coverage

Test inputs that stress your output format: inputs that require nested JSON, inputs in different languages, inputs with special characters, very short or very long expected outputs.

  • Special characters in input (quotes, brackets)
  • Inputs that should produce minimal output
  • Inputs that should produce structured output
Regression Coverage

Every confirmed production failure becomes a golden set case. This is your regression suite: evidence that fixed bugs stay fixed across future changes.

  • Add case within 1 day of confirming a failure
  • Tag with failure date and root cause
  • Never remove; only deprecate with reason

Annotation is the hardest part of building a golden set. The goal is to define "correct" precisely enough that an automated evaluator (schema check, exact match, or LLM judge) can reliably determine pass/fail.

Task Type | Annotation Format | Eval Method | Example
Classification | Exact label(s) | Exact match | "Label: BILLING"
Extraction | Required field values | Key-value match / schema check | {"name": "John", "date": "2024-01-15"}
Summarization | Key points that must be present | LLM judge (completeness rubric) | Required: [acquisition price, date, acquirer]
Q&A / Factual | Reference answer + acceptable variants | Exact / fuzzy match + LLM judge | Answer: "42.5 million" or "42,500,000"
Generation / Writing | Rubric criteria (tone, structure, constraints) | LLM judge (multi-dimension) | Must: professional tone, <200 words, include CTA
Code generation | Unit tests that must pass | Execute + test pass rate | assert output(4) == 16 # squares input
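The table above maps each task type to an eval method, which suggests dispatching on a per-case field. Below is a minimal sketch of that idea for the three deterministic methods. `eval_method`, `expected_label` and `exact_match` follow the JSONL example in this chapter; the method names `key_value_match` and `variant_match` and the fields `expected_fields` and `acceptable_answers` are hypothetical, chosen only for illustration.

```python
import json


def exact_match(output: str, case: dict) -> bool:
    """Classification: the stripped output must equal the expected label."""
    return output.strip() == case["expected_label"].strip()


def key_value_match(output: str, case: dict) -> bool:
    """Extraction: every expected field must be present with the expected value."""
    try:
        parsed = json.loads(output)
    except json.JSONDecodeError:
        return False
    if not isinstance(parsed, dict):
        return False
    return all(parsed.get(k) == v for k, v in case["expected_fields"].items())


def variant_match(output: str, case: dict) -> bool:
    """Factual Q&A: passes if any acceptable variant appears in the output."""
    text = output.lower()
    return any(variant.lower() in text for variant in case["acceptable_answers"])


EVALUATORS = {
    "exact_match": exact_match,
    "key_value_match": key_value_match,
    "variant_match": variant_match,
}


def evaluate(output: str, case: dict) -> bool:
    """Route a case to the evaluator named in its eval_method field."""
    method = case["eval_method"]
    if method not in EVALUATORS:
        raise ValueError(f"no evaluator registered for {method!r}")
    return EVALUATORS[method](output, case)
```

Rubric-based methods (summarization, writing) would register an LLM-judge function in the same table; they are left out here because they need a model call.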
Size Guidelines

50 cases: Minimum viable. Catches major regressions, runs in <5 min. Start here.

200 cases: Production standard. Statistically meaningful, covers all categories.

500+ cases: Large system / multiple task types. Run on releases, not every PR.

Rule: Run time < 10 minutes for PR-blocking evals. Split larger sets into fast (PR) and full (release) tiers.

Versioning Rules

Store in Git: Golden set is code; it belongs in version control with change history.

JSONL format: One case per line, easy to diff and append.

Never delete cases: mark deprecated with reason and date.

Tag with schema version: when eval format changes, old cases can still run against old schema.

Golden Set JSONL format (recommended)
golden_set.jsonl (one case per line):
{"id": "cs-001", "input": "I was charged twice this month", "expected_label": "BILLING", "eval_method": "exact_match", "tags": ["classification", "easy"], "source": "production_sample", "added": "2024-03-12"}
{"id": "cs-002", "input": "app crashes sometimes but also i think billing is wrong", "expected_label": "BILLING", "eval_method": "exact_match", "tags": ["classification", "hard", "multi-intent"], "source": "failure_mined", "added": "2024-04-01", "failure_date": "2024-03-30"}
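A golden set in this format is easy to validate on load. The sketch below is one possible loader, assuming the field names from the example above; the set of required keys and the helper names `parse_golden_set` and `slice_by_tag` are illustrative choices rather than a fixed schema.

```python
import json

# Assumed minimum schema, taken from the example cases above.
REQUIRED_KEYS = {"id", "input", "eval_method", "tags", "source", "added"}


def parse_golden_set(text: str) -> list[dict]:
    """Parse JSONL text, rejecting cases with missing keys or duplicate ids."""
    cases, seen_ids = [], set()
    for line_no, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # tolerate blank lines and a trailing newline
        case = json.loads(line)
        missing = REQUIRED_KEYS - case.keys()
        if missing:
            raise ValueError(f"line {line_no}: missing keys {sorted(missing)}")
        if case["id"] in seen_ids:
            raise ValueError(f"line {line_no}: duplicate id {case['id']!r}")
        seen_ids.add(case["id"])
        cases.append(case)
    return cases


def slice_by_tag(cases: list[dict], tag: str) -> list[dict]:
    """Return only the cases carrying a given tag, for sliced analysis."""
    return [case for case in cases if tag in case["tags"]]
```

Running the validation in CI keeps a malformed or duplicated case from silently shrinking the eval set.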
Stale Golden Sets Are Worse Than No Golden Sets

A golden set that hasn't been updated in 6 months while your product has evolved will miss regressions you care about and block on cases that are no longer relevant. Review and extend your golden set monthly: (1) add cases for new features, (2) add cases from production failures, (3) deprecate cases for removed features. Assign ownership: golden set maintenance is an ongoing engineering responsibility, not a one-time task.

A golden set that was excellent six months ago may be misleading today. Input distribution, product scope, and user behavior all change over time, and a static test set can produce high offline scores that do not reflect production reality.

Drift Cause | Symptom | Detection | Response
Changing user behavior | Offline score stable; production quality drops; new query types not covered | Online eval score diverges from offline eval score | Sample production queries monthly; add new input types to golden set
New edge cases | System breaks on inputs that never appeared before; golden set doesn't cover them | Production errors cluster to specific input categories | Mine production failures; add as regression cases within 1 day
Evolving system scope | New features added; golden set tests old behavior only; no coverage on new paths | New features untested, discovered only on user report | Add golden set cases as part of every feature development cycle
Obsolete cases | Deprecated features still in golden set; cases always pass (trivially); set size inflated | Cases with 100% pass rate for 3+ months | Deprecate with reason and date; never delete, just mark inactive

∑ Chapter 04 — Key Takeaways

  • A golden set is (input, expected output / criteria, metadata): human-verified pairs that define correctness
  • Best collection mix: 60% production-sampled, 20% failure-mined, 20% adversarial/edge cases
  • Coverage must include: task distribution, edge cases, format stress, and regression cases (every confirmed failure)
  • Annotation format depends on task: exact label (classification), key-value (extraction), unit tests (code), rubric (generation)
  • Size: 50 (minimum viable), 200 (production standard), 500+ (split into PR and release tiers)
  • Store in Git as JSONL, never delete cases, review and extend monthly; ownership is an engineering responsibility
05
Chapter 05 · Quality Gates
Regression Testing: Catching Quality Drops Before Production

Regression testing answers one question: "Did this change make things worse?" For LLM systems it is the primary quality gate, because unlike traditional software, LLM changes (prompts, models, parameters) are hard to reason about and easy to get subtly wrong. Regression tests catch the "works on the cases I checked, broke on the ones I didn't" failure pattern.

A regression is a drop relative to a baseline. Without a recorded baseline, all you have is a current score with no context. Baselines must be stable, reproducible, and stored, not just computed on demand.

1. Run eval on current system (golden set, all metrics)
2. Store results as baseline (timestamped, versioned)
3. Make a change (prompt / model / schema)
4. Run eval on new version (same golden set, same metrics)
5. Diff against baseline (flag drops, celebrate gains)
Baseline Metric | What to Record | Update Frequency
Format compliance rate | % of outputs that parse as valid JSON/schema | Update after any format-affecting change
Functional accuracy | % correct on classification/extraction tasks | Update after prompt or model change
LLM judge score | Avg score per dimension (correctness, completeness, etc.) | Update after any semantic change
P95 latency | 95th percentile response time in ms | Update after model or infrastructure change
Avg tokens per query | Input + output tokens averaged over golden set | Update after prompt or schema change
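Storing the five baseline metrics as a timestamped, versioned file could look like the sketch below. The metric key names, the `save_baseline` helper and the file naming scheme are assumptions made for illustration; only the `accuracy` key is chosen to line up with the custom eval runner shown later in this chapter.

```python
import json
from datetime import datetime, timezone
from pathlib import Path

# Hypothetical key names for the five baseline metrics in the table above.
BASELINE_METRICS = (
    "format_compliance",
    "accuracy",
    "judge_score",
    "p95_latency_ms",
    "avg_tokens",
)


def save_baseline(metrics: dict, system_version: str, directory: str) -> Path:
    """Write one baseline record as JSON and return the path of the new file."""
    missing = [name for name in BASELINE_METRICS if name not in metrics]
    if missing:
        raise ValueError(f"baseline is missing metrics: {missing}")
    recorded_at = datetime.now(timezone.utc)
    record = {
        "system_version": system_version,
        "recorded_at": recorded_at.isoformat(),
        **{name: metrics[name] for name in BASELINE_METRICS},
    }
    filename = f"baseline_{recorded_at:%Y%m%dT%H%M%SZ}_{system_version}.json"
    path = Path(directory) / filename
    path.write_text(json.dumps(record, indent=2))
    return path
```

Writing a new file per baseline, rather than overwriting one, keeps the history that makes a later diff meaningful.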

Not all regressions are equal. A 0.5% accuracy drop may be noise; a 5% drop may be a real regression; a format compliance drop from 100% to 95% is always a blocker. Thresholds encode your beliefs about what matters.

Hard Block

Deployment stops. These regressions always indicate a real problem that must be fixed before release.

  • Format compliance drops below 98%
  • Any required field missing on >1% of cases
  • Accuracy drops >5% from baseline
  • P95 latency increases >50%
Warning (Review Required)

Deployment can proceed with explicit human sign-off. Requires a documented reason for the regression.

  • Accuracy drops 2–5% from baseline
  • Judge score drops >0.3 on any dimension
  • Token count increases >20%
  • New failure pattern appears on 3+ cases
Pass (Auto-approve)

Change is safe to deploy without manual review. Score is within acceptable variance.

  • Accuracy delta within ±2% of baseline
  • Format compliance ≥98%
  • Latency within ±20% of baseline
  • No new failure categories introduced
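The three tiers above can be expressed as a small gate function. This sketch makes several assumptions that the text does not settle: metric key names are hypothetical, accuracy drops are read as percentage points, and a P95 latency increase between 20% and 50% is treated as a warning. The checks for missing required fields and new failure patterns are omitted because they need per-case data.

```python
def classify_change(baseline: dict, current: dict) -> str:
    """Return 'block', 'warn', or 'pass' for a candidate versus its baseline."""
    acc_drop = baseline["accuracy"] - current["accuracy"]  # positive = worse
    judge_drop = baseline["judge_score"] - current["judge_score"]
    latency_increase = current["p95_latency_ms"] / baseline["p95_latency_ms"] - 1
    token_increase = current["avg_tokens"] / baseline["avg_tokens"] - 1

    # Hard block: deployment stops.
    if (
        current["format_compliance"] < 0.98
        or acc_drop > 0.05
        or latency_increase > 0.50
    ):
        return "block"

    # Warning: deployment needs explicit human sign-off.
    if (
        acc_drop > 0.02
        or judge_drop > 0.3
        or token_increase > 0.20
        or latency_increase > 0.20  # assumption: 20-50% latency growth warns
    ):
        return "warn"

    return "pass"
```

A CI job would typically exit non-zero on `block` and post a review request on `warn`.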
Don't Set Thresholds Too Tight

If your thresholds block every small change, engineers route around them by running fewer evals or skipping the process. Thresholds should block real regressions, not noise. For an LLM judge score (inherently variable), ±0.2 is noise; ±0.5 is signal. For exact match accuracy on a 100-case eval set, a 2% swing is only 2 cases, which one or two annotation errors can explain. Calibrate thresholds against your eval's natural variance before enforcing them.

With small eval sets, observed score differences can be coincidental. A 3% accuracy difference on a 50-case set might not be statistically significant: it could easily be 1–2 cases flipping due to LLM non-determinism.

Eval Set Size | Minimum Meaningful Difference | Confidence Level
50 cases | ~8–10% difference to be confident (4–5 cases) | Use directional signal only, not hard thresholds
100 cases | ~5–6% difference meaningful (5–6 cases) | Reasonable for PR gates with soft thresholds
200 cases | ~3–4% difference meaningful | Good for release gates with hard thresholds
500+ cases | ~2% difference meaningful | Strong statistical confidence; suitable for A/B
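One way to sanity-check numbers like those in the table is the normal-approximation margin of error for a pass rate, z * sqrt(p(1 - p) / n). The sketch below assumes a pass rate around 90% and a 95% confidence level (z = 1.96); both are assumptions, and this is offered as a rough check rather than as the derivation behind the table. Under those assumptions the margins come out at roughly 8%, 6%, 4% and 3% for 50, 100, 200 and 500 cases.

```python
import math


def margin_of_error(pass_rate: float, n_cases: int, z: float = 1.96) -> float:
    """Approximate half-width of the confidence interval for a pass rate.

    Uses the normal approximation to the binomial, which is rough for
    small n or pass rates very close to 0 or 1.
    """
    if not 0.0 <= pass_rate <= 1.0:
        raise ValueError("pass_rate must be between 0 and 1")
    if n_cases <= 0:
        raise ValueError("n_cases must be positive")
    return z * math.sqrt(pass_rate * (1.0 - pass_rate) / n_cases)


for n in (50, 100, 200, 500):
    print(f"{n:>4} cases: +/- {margin_of_error(0.90, n):.1%}")
```

Comparing two separately measured scores widens the margin further, so small sets are best read as directional signal.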
Practical Approach: Run Twice, Compare

For non-deterministic evals (LLM judge at temperature>0), run the eval twice and take the average. If the two runs differ by more than 3%, something is wrong with your judge (temperature too high, rubric too vague). Use temperature=0 for all judge calls to minimize this variance: the judge should be as close to deterministic as possible even when the system under test is not.

promptfoo is a widely used open-source LLM testing framework. It runs your prompt against your golden set, applies assertions, and generates a pass/fail report suitable for CI integration.

promptfoo config (promptfooconfig.yaml)
providers:
  - id: openai:gpt-4o-mini
    config:
      temperature: 0

prompts:
  - "Classify this support ticket. Reply with exactly one of: BILLING | TECHNICAL | ACCOUNT | OTHER\n\nTicket: {{input}}"

tests:
  - vars:
      input: "I was charged twice this month"
    assert:
      - type: equals
        value: "BILLING"
  - vars:
      input: "App crashes on iOS 17"
    assert:
      - type: equals
        value: "TECHNICAL"
  - vars:
      input: "app crashes sometimes but also i think billing is wrong"
    assert:
      - type: equals
        value: "BILLING"  # primary intent
      - type: latency
        threshold: 3000  # fail if >3s
Custom Python eval runner (when promptfoo isn't enough)
import json
import time
from pathlib import Path


def run_regression(system_fn, golden_path: str, baseline_path: str | None = None) -> dict:
    # One JSON object per line; skip blank lines so a trailing newline cannot break parsing.
    lines = Path(golden_path).read_text().splitlines()
    cases = [json.loads(line) for line in lines if line.strip()]
    if not cases:
        raise ValueError(f"no cases found in {golden_path}")

    results = []
    for case in cases:
        start = time.monotonic()
        output = system_fn(case["input"])
        latency = (time.monotonic() - start) * 1000
        passed = output.strip() == case["expected_label"].strip()
        results.append({
            "id": case["id"],
            "passed": passed,
            "latency_ms": latency,
            "output": output,
        })

    accuracy = sum(r["passed"] for r in results) / len(results)

    if baseline_path:
        baseline = json.loads(Path(baseline_path).read_text())
        delta = accuracy - baseline["accuracy"]
        if delta < -0.05:  # hard block: accuracy fell by more than 5 points
            raise SystemExit(f"REGRESSION: accuracy dropped {abs(delta):.1%}, blocking deploy")

    return {"accuracy": accuracy, "results": results}
Signal | Rollback Trigger | Action
Offline eval regression | Hard threshold violation pre-deploy | Block merge → fix prompt → re-run eval
Production error rate spike | Parser 500s increase >2× in 30 min post-deploy | Immediate rollback to previous version
Online eval quality drop | LLM judge avg drops >0.5 on live sample over 2h | Review samples → decide rollback or hotfix
Cost explosion | Token avg increases >50% vs previous version | Alert + review → rollback if no valid reason
User-reported failures | Confirmed reports cluster to specific input type | Mine failure cases → add to golden set → patch

∑ Chapter 05 — Key Takeaways

  • Regression testing answers: "Did this change make things worse?" It requires a recorded baseline to compare against
  • Record baselines for 5 metrics: format compliance, accuracy, LLM judge score, P95 latency, avg tokens
  • Three threshold tiers: hard block (deploy stops), warning (review required), pass (auto-approve)
  • Statistical significance: small eval sets need larger differences to be meaningful; use 200 cases for reliable hard thresholds
  • Use temperature=0 for all judge calls: it minimizes eval variance, so flagged regressions are real rather than noise
  • Rollback triggers: hard threshold violation, parser 500s, online quality drop >0.5, 50%+ cost increase
06
Chapter 06 · Observability
Tracing: Following Requests Through LLM Systems

When an LLM system returns a wrong answer, how do you know which step failed? Tracing gives you the answer: a complete record of every step in a request's execution, including which LLM was called, with what prompt, what it returned, how long it took, and what it cost. Without traces, debugging is guesswork.

Tracing for LLM systems borrows from distributed systems observability. Every user request generates one trace, which is a tree of spans, each span representing one unit of work. For LLM systems, spans map directly to the operations that matter.

Trace anatomy: a RAG pipeline request broken into spans

TRACE: req_7f3a2b | Total: 2,340ms | Cost: $0.0087 | Status: success
  validate_input | 3ms
  vector_search (embed + retrieve) | 187ms | top_k=5 | score=0.87
  prompt_build | 14ms
  llm_call (gpt-4o-mini) | 2,090ms (TTFT: 310ms) | in: 1,243 tok | out: 287 tok | $0.0082
    retry (attempt 1 timed out) | +850ms
  parse+validate | 47ms

Each span records:
  • span_id + parent_span_id (for nesting)
  • operation name + start/end timestamps
  • input / output (truncated if long)
  • metadata: model, tokens, cost, status
  • error info if failed
  • custom tags (user_id, feature, env)

One trace per user request, composed of nested spans for each step. Spans tell you exactly where time and money went.
Span Type | Fields to Capture | Why It Matters
LLM call | model, prompt (truncated), response (truncated), tokens_in, tokens_out, cost, latency, TTFT, retry_count | Primary cost and latency driver; contains the most debugging information
Vector/DB retrieval | query, top_k, similarity scores, retrieved_doc_ids, latency | Retrieval quality is a leading indicator of RAG answer quality
Tool call | tool_name, input_args, output, latency, status (success/error) | Tool failures are the #1 agent failure mode; input/output needed for debugging
Output validation | schema_passed, parsed_output, validation_errors | Reveals format failure rate; links validation failures to the LLM call that produced them
Whole request (root span) | request_id, user_id, total_latency, total_cost, total_tokens, status, feature_name | Enables per-user cost attribution and system-level performance dashboards
Don't Log Raw Prompts in Production Without Care

Full prompt logging contains user input, which may contain PII, confidential business data, or sensitive content. Before logging prompts in production: (1) implement PII detection and redaction, (2) restrict trace access to authorized personnel, (3) set retention policies (30–90 days typical), (4) check your privacy policy and data residency requirements. Truncating prompts to 500 characters captures enough for debugging without logging full user content.
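The redact-then-truncate step can be sketched in a few lines. The two regular expressions below are deliberately crude stand-ins for a real PII detector: they are assumptions for illustration, will miss many PII types, and will over-match (the phone pattern also catches ISO dates, for example). The helper name `prepare_prompt_for_logging` is likewise hypothetical.

```python
import re

# Crude illustrative patterns; a production system would use a dedicated PII detector.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def prepare_prompt_for_logging(prompt: str, max_chars: int = 500) -> str:
    """Redact obvious PII, then truncate to the logging limit."""
    # Redact before truncating so a cut cannot leave half an email address behind.
    redacted = EMAIL_RE.sub("[EMAIL]", prompt)
    redacted = PHONE_RE.sub("[PHONE]", redacted)
    if len(redacted) > max_chars:
        redacted = redacted[:max_chars] + "...[truncated]"
    return redacted


print(prepare_prompt_for_logging(
    "Contact jane.doe@example.com or +1 (555) 123-4567 about my bill"
))
```

The order matters: truncating first could split an identifier so that the redaction pattern no longer recognizes it.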

Application logs tell you what happened: which endpoint was called, what status code was returned, how long it took. They do not tell you whether the LLM output was correct, helpful, or safe. That gap is the central observability problem for LLM systems.

What Logs Give You

✅ Request volume and error rates

✅ Response latency (total)

✅ HTTP status codes

✅ Token count per call

❌ Whether the answer was correct

❌ Whether quality is degrading week-over-week

❌ Which failure category the error belongs to

What You Also Need to Track

  • Output quality scores: LLM judge per dimension, rolling avg

  • Format pass rate: % of outputs passing schema validation

  • TTFT + generation latency: separately, not just total

  • Token usage breakdown: input vs output, by feature

  • Retry rate: % of requests that required ≥1 retry

  • Fallback trigger rate: % of requests falling back to cheaper/cached response

OpenTelemetry (OTel) is the industry standard for distributed tracing. The OpenTelemetry Semantic Conventions for LLMs (GenAI conventions) define standard span attribute names for LLM calls, enabling consistent tooling across providers.

GenAI OTel Span Attributes (standard)
  • gen_ai.system: "openai", "anthropic"
  • gen_ai.request.model: "gpt-4o-mini"
  • gen_ai.usage.prompt_tokens (renamed gen_ai.usage.input_tokens in newer versions of the conventions)
  • gen_ai.usage.completion_tokens (renamed gen_ai.usage.output_tokens in newer versions of the conventions)
  • gen_ai.response.finish_reasons
  • gen_ai.request.temperature
Auto-instrumentation Libraries

These libraries automatically wrap LLM SDK calls with OTel spans; no manual instrumentation is needed for basic tracing.

  • opentelemetry-instrumentation-openai
  • LangSmith: LangChain native tracing
  • Langfuse: open-source, self-hostable
  • Arize Phoenix: local + cloud
Manual span instrumentation (OpenAI + OTel)
import time

import openai
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider

# Register an SDK provider; without one, spans are no-ops.
# Add a span processor and exporter to ship spans to a backend.
trace.set_tracer_provider(TracerProvider())
tracer = trace.get_tracer("llm-service")


def call_llm_with_trace(messages: list, model: str = "gpt-4o-mini"):
    with tracer.start_as_current_span("llm_call") as span:
        span.set_attribute("gen_ai.system", "openai")
        span.set_attribute("gen_ai.request.model", model)
        span.set_attribute("gen_ai.request.message_count", len(messages))
        start = time.monotonic()
        try:
            response = openai.chat.completions.create(
                model=model,
                messages=messages,
            )
            latency = (time.monotonic() - start) * 1000
            span.set_attribute("gen_ai.usage.prompt_tokens", response.usage.prompt_tokens)
            span.set_attribute("gen_ai.usage.completion_tokens", response.usage.completion_tokens)
            span.set_attribute("llm.latency_ms", round(latency))
            span.set_attribute("llm.status", "success")
            return response
        except Exception as e:
            span.set_attribute("llm.status", "error")
            span.record_exception(e)
            raise
Tool | Hosting | Strengths | Best For
LangSmith | SaaS (paid) | Deep LangChain integration, eval pipelines, human feedback, datasets | Teams using LangChain; want eval + tracing in one tool
Langfuse | Open-source + SaaS | Self-hostable, SDK-agnostic, cost tracking, LLM-as-judge built in | Privacy-sensitive deployments; teams wanting data control
Arize Phoenix | Open-source (local-first) | Notebook-friendly, evals, embedding visualization, OTel native | ML engineers; debugging sessions; local dev tracing
Braintrust | SaaS | Strong eval + experiment tracking, prompt versioning, CI integration | Teams focused on eval-driven development
Custom OTel + Jaeger/Tempo | Self-hosted | Full control, integrates with existing observability stack (Grafana) | Orgs with mature monitoring infra; enterprise scale

A trace doesn't just show you where time went; it shows you why. Three latency patterns diagnose different root causes and point to different solutions.

Pattern: High TTFT

Time-to-first-token is >1s. LLM span starts late even though retrieval is fast.

Root causes: Long input prompt, provider load, large model. Fix: Prompt compression, prompt cache, smaller model, multi-provider failover.

Pattern: Long Generation

TTFT is fast but total LLM span is 5–10s. Output token count is very high.

Root causes: Model generating unnecessarily long responses. Fix: Set aggressive max_tokens, add "be concise" to prompt, use streaming to hide latency.

Pattern: Retry-Inflated Latency

Total latency 2–3× what a single call should be. Retry spans visible in trace.

Root causes: Timeout too low, intermittent provider issues, format validation failing. Fix: Tune timeout, fix format issue that's causing retries, improve output parsing.

The Trace Review Workflow

For every production p95 latency alert: (1) pull a trace from the 95th percentile, (2) identify the span contributing the most time, (3) check if it's TTFT (prompt issue), generation (output length issue), or retry (reliability issue). Traces turn "it's slow" into "the 1,847-token system prompt is adding 340ms of TTFT on every call", which is a fixable problem.
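Step (3) of the workflow above can be partly automated by checking an LLM span against the three patterns. This is a minimal sketch: the span field names (`ttft_ms`, `latency_ms`, `retry_count`) are hypothetical, and the limits reuse the figures quoted in this chapter (TTFT over 1s, total LLM span of 5s or more), which you would tune per system.

```python
def diagnose_latency(
    llm_span: dict,
    ttft_limit_ms: float = 1000,
    generation_limit_ms: float = 5000,
) -> str:
    """Map one LLM span to a latency pattern name.

    Expects a dict with hypothetical keys: ttft_ms, latency_ms, and
    optionally retry_count.
    """
    if llm_span.get("retry_count", 0) > 0:
        return "retry_inflation"  # retries visible in the trace
    if llm_span["ttft_ms"] > ttft_limit_ms:
        return "high_ttft"  # slow to first token: prompt length, provider load, model size
    if llm_span["latency_ms"] >= generation_limit_ms:
        return "long_generation"  # fast TTFT but a long span: output too long
    return "no_pattern"


# The llm_call span from the trace anatomy example earlier in this chapter.
example_span = {"ttft_ms": 310, "latency_ms": 2090, "retry_count": 1}
print(diagnose_latency(example_span))
```

Retries are checked first because a retried call inflates both TTFT and total latency, which would otherwise mask the real cause.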

∑ Chapter 06 — Key Takeaways

  • A trace is a tree of spans for one request; each span records one operation's input, output, latency, cost, and status
  • Minimum span set: LLM call, retrieval, tool call, output validation, root request, each with timing + cost
  • Use OpenTelemetry GenAI conventions for standard attribute names; this enables consistent tooling and dashboards
  • Log prompts safely: truncate to 500 chars, redact PII, restrict access, set 30–90 day retention
  • Tool choice: Langfuse (self-hosted, privacy), LangSmith (LangChain teams), Phoenix (local dev), Braintrust (eval-focused)
  • Three latency patterns: high TTFT (prompt too long), long generation (max_tokens not set), retry inflation (format failures or timeouts)
07
Chapter 07 · Production
Monitoring: Continuous Quality in Production

Offline eval tells you the system worked before you deployed. Monitoring tells you whether it's working right now. Production quality degrades silently until someone complains, unless you're continuously sampling, evaluating, and alerting on live traffic.

Offline Eval: What You Know

Fixed test set, controlled inputs, run before deployment

Tests the cases you thought to include

Snapshot in time; result is stable

Catches regressions relative to your golden set

Gap: Real traffic evolves; your test set doesn't

Production Monitoring: What You Need

Live traffic sample, real user inputs, continuous

Tests the cases users actually send

Rolling signal; changes as inputs and behavior change

Catches drift that offline eval can't see

Value: Discovers new failure modes before users escalate

The Production Monitoring Stack

A complete production monitoring stack has three layers: (1) Hard metrics: latency, error rate, cost (from logs, near real-time). (2) Quality sampling: LLM judge on 1–5% of live traffic (near real-time). (3) Drift detection: aggregate trend analysis over hours/days. Layer 1 alerts in minutes; Layer 2 in hours; Layer 3 in days. Each catches different failure modes.

You can't judge every production request; doing so would roughly double your LLM costs. Strategic sampling gets you signal coverage at manageable cost.

Strategy | What It Does | Sample Rate | Best For
Random sampling | Evaluate a uniform random subset of all requests | 1–5% of traffic | Baseline quality tracking, cost drift detection
Stratified sampling | Ensure all query categories / user cohorts are represented equally | Varies per stratum | Systems with very unequal query distributions
Triggered sampling | Always evaluate when: long latency, retry occurred, format validation failed, high token count | 100% of anomalous cases | Catching the worst failures immediately
User feedback sampling | Evaluate all requests where user gave explicit negative feedback (thumbs-down / edit / rephrase) | 100% of flagged cases | Connecting quality scores to user satisfaction
Time-window sampling | Heavier sampling for first hour after a deployment, then fall back to baseline rate | 10–20% post-deploy, 1–2% steady-state | Catching deployment regressions fast
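Several of the strategies in the table can be combined in a single sampling decision. The sketch below is one possible combination, not a reference design: the request field names and the 5-second latency trigger are hypothetical, while the 2% steady-state rate, the 10% post-deploy rate and the one-hour window are picked from the ranges in the table.

```python
import random

STEADY_RATE = 0.02           # within the 1-2% steady-state range
POST_DEPLOY_RATE = 0.10      # within the 10-20% post-deploy range
POST_DEPLOY_WINDOW_MIN = 60  # heavier sampling for the first hour
LATENCY_TRIGGER_MS = 5000    # hypothetical threshold for "long latency"


def should_evaluate(request: dict, minutes_since_deploy: float, rng: random.Random) -> bool:
    """Decide whether a served request is sent to the online eval queue."""
    # Triggered and user-feedback sampling: anomalous requests are always evaluated.
    triggered = (
        request.get("retry_count", 0) > 0
        or not request.get("format_valid", True)
        or request.get("negative_feedback", False)
        or request.get("latency_ms", 0) > LATENCY_TRIGGER_MS
    )
    if triggered:
        return True
    # Time-window sampling on top of uniform random sampling.
    in_window = minutes_since_deploy < POST_DEPLOY_WINDOW_MIN
    rate = POST_DEPLOY_RATE if in_window else STEADY_RATE
    return rng.random() < rate
```

Passing the random generator in as an argument keeps the decision reproducible in tests; stratified sampling would add a per-category rate lookup in place of the single `rate`.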
Online evaluation pipeline: async quality scoring on live traffic

Live Request (served normally) → Sampler (1–5% random + triggered) → Queue (async worker, no user impact) → LLM Judge (correctness, completeness, groundedness; ~$0.005/call) → Metrics Store (score + dimension; rolling avg / p50) → Alert + Dashboard (threshold breach, trend anomaly)

Async: never blocks the user response. The judge runs in a background worker. Total pipeline lag: 5–30s after request.
Never Block User Responses on Online Eval

The online eval pipeline must be completely asynchronous. The user receives their response immediately. Evaluation happens in a background worker, writing to a separate metrics store. If your eval pipeline goes down, users are unaffected. Coupling eval to the critical path is a common mistake that turns a monitoring failure into a user-facing outage.
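The decoupling described above can be sketched with an in-process `asyncio` queue. Everything here is a stand-in: `judge_stub` replaces a real LLM judge call, the echo response replaces the real system, and a production deployment would more likely use an external queue and a separate worker process. The point of the sketch is only that the request path enqueues without waiting and never sees eval errors.

```python
import asyncio


async def judge_stub(sample: dict) -> float:
    """Hypothetical stand-in for a real LLM judge call; always returns 4.0."""
    await asyncio.sleep(0)  # yield control, as a real network call would
    return 4.0


async def eval_worker(queue: asyncio.Queue, scores: list) -> None:
    """Background worker: scores queued samples and never raises into request code."""
    while True:
        sample = await queue.get()
        try:
            scores.append(await judge_stub(sample))
        except Exception:
            pass  # in real use, log the error; eval failures stay off the user path
        finally:
            queue.task_done()


def handle_request(user_input: str, queue: asyncio.Queue) -> str:
    """Build the user's response first, then enqueue the sample without waiting."""
    response = f"echo: {user_input}"  # placeholder for the real system call
    try:
        queue.put_nowait({"input": user_input, "output": response})
    except asyncio.QueueFull:
        pass  # drop the sample rather than delay the user
    return response


async def main() -> tuple[list, list]:
    queue: asyncio.Queue = asyncio.Queue(maxsize=100)
    scores: list = []
    worker = asyncio.create_task(eval_worker(queue, scores))
    replies = [handle_request(f"q{i}", queue) for i in range(3)]
    await queue.join()  # only for this demo; a server would keep running
    worker.cancel()
    await asyncio.gather(worker, return_exceptions=True)
    return replies, scores


replies, scores = asyncio.run(main())
```

The bounded queue is a deliberate choice: if the judge falls behind, samples are dropped instead of building up memory or slowing requests.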

Quality Drift

LLM judge scores trend downward over days/weeks without a single obvious cause. Often caused by gradual input distribution shift.

  • Track 7-day rolling avg judge score
  • Alert if 7-day avg drops >0.3 below 30-day avg
  • Pull failing traces to identify new input patterns
Distribution Drift

The types of queries users send change. New topics, new use cases, seasonal patterns. Your system wasn't designed for these inputs.

  • Track query topic distribution over time
  • Embed queries, monitor cluster centroids
  • Alert on new topic clusters with low scores
Cost Drift

Average tokens per query increases over weeks. Often caused by growing conversation history, longer user inputs, or unhealthy retry patterns.

  • Track avg tokens/query daily
  • Alert on >20% increase week-over-week
  • Break down by feature and model
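Two of the drift rules above reduce to comparisons of rolling averages. The sketch below assumes one aggregated value per day, ordered oldest first; the function names and input shapes are illustrative, and the thresholds are the ones quoted in the bullets (a 7-day average more than 0.3 below the 30-day average, and a week-over-week token increase above 20%).

```python
from statistics import mean


def quality_drift(daily_scores: list[float], drop_threshold: float = 0.3) -> bool:
    """True if the 7-day avg judge score is more than drop_threshold below the 30-day avg."""
    if len(daily_scores) < 30:
        raise ValueError("need at least 30 daily scores")
    avg_7 = mean(daily_scores[-7:])
    avg_30 = mean(daily_scores[-30:])
    return avg_30 - avg_7 > drop_threshold


def cost_drift(daily_avg_tokens: list[float], increase_threshold: float = 0.20) -> bool:
    """True if avg tokens/query rose by more than increase_threshold week-over-week."""
    if len(daily_avg_tokens) < 14:
        raise ValueError("need at least 14 daily values")
    this_week = mean(daily_avg_tokens[-7:])
    last_week = mean(daily_avg_tokens[-14:-7])
    return (this_week - last_week) / last_week > increase_threshold
```

Distribution drift is harder to reduce to one number; it typically needs the embedding and clustering step mentioned above before any comparable threshold check.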
Alert Type | Trigger Condition | Severity | Initial Response
Format failure spike | JSON parse failure rate >5% in any 10-min window | P1: page on-call immediately | Check recent deployment; roll back if <2h since deploy
Error rate increase | LLM API error rate >10% (timeouts, 429s, 5xx) | P1: activate fallback provider | Switch to backup model; check provider status page
P95 latency breach | P95 response time >2× the 7-day baseline for >5 min | P2: investigate within 30 min | Check trace for retry inflation or prompt length increase
Quality score drop | Rolling 1-hour judge avg drops >0.5 below baseline | P2: review within 1 hour | Sample recent failing traces; check for new input patterns
Cost anomaly | Hourly cost >2× the 7-day hourly average | P2: investigate within 1 hour | Check for token count spike; look for runaway agent loops
Quality drift | 7-day rolling avg drops >0.3 below 30-day avg | P3: review on next working day | Analyze input distribution shifts; plan golden set expansion
Real-Time Panel (last 1h)
  • Requests/min + error rate
  • P50 / P95 / P99 latency
  • Format compliance % (last 100 requests)
  • Current $ cost/hour
  • Active provider + fallback status
Quality Trend Panel (last 7 days)
  • Rolling average judge scores per dimension
  • Pass rate on sampled traffic
  • Failure category breakdown (format / content / safety)
  • Token trend (avg tokens/query over time)
  • Cost per query trend
Failure Explorer
  • Recent low-score samples (judge score <3)
  • Format failures with error detail
  • High-latency traces (p99 outliers)
  • High-cost requests (>$0.10/req)
  • Link to full trace for each row
Model Routing Panel
  • % of traffic per model (mini vs frontier)
  • Routing correctness (are cheap queries going to cheap model?)
  • Cache hit rate
  • Batch API vs sync API split

∑ Chapter 07 — Key Takeaways

  • Production monitoring has three layers: hard metrics (minutes), quality sampling (hours), drift detection (days)
  • Best sampling mix: 1–5% random + 100% triggered (anomalous requests) + 10–20% immediately post-deploy
  • Online eval pipeline must be fully asynchronous; never block user responses on evaluation
  • Three drift types to monitor: quality drift (judge scores), distribution drift (query topics), cost drift (tokens/query)
  • P1 alerts: format failure >5%, error rate >10% (page on-call). P2: latency 2× (investigate within 30 min), quality drop 0.5 (review within 1h)
  • Dashboard covers four panels: real-time health, quality trend, failure explorer, model routing
08
Chapter 08 · Troubleshooting
Debugging LLM Applications: Finding What Went Wrong

Debugging LLM systems is qualitatively different from debugging traditional software. There is no stack trace for "the model gave a wrong answer." Systematic debugging requires traces, structured reproduction, and an understanding of LLM failure taxonomy; otherwise you're changing prompts at random and hoping.

The most common debugging mistake in LLM systems is jumping straight to "fix the prompt" without first diagnosing which component failed and why. A wrong answer in a RAG system could be a retrieval failure, a prompting failure, a model capability failure, or a parsing failure, and each has a different fix.

1. Reproduce: exact input that failed
2. Isolate: which component failed
3. Classify: failure taxonomy
4. Root cause: why this component
5. Fix + verify: eval before/after
6. Regression: add to golden set
Failure Class | Symptoms | Root Component | Diagnostic
Format failure | Parser throws, missing fields, wrong data types | Output parsing / LLM output | Check raw LLM response before parsing; use structured output mode
Instruction violation | Model ignores a constraint (language, length, tone, field) | Prompt / system prompt priority | Test prompt in isolation; check instruction placement (start/end wins)
Hallucination | Model states facts not in the provided context or training data | Model + retrieval (for RAG) | Check if retrieval returned relevant docs; test with oracle context
Retrieval failure | Correct docs not returned; answer misses key information | Embedding / vector search / chunking | Check retrieved doc IDs and scores in trace; test retrieval in isolation
Capability gap | Model can't perform the task regardless of prompting | Model selection | Try frontier model (GPT-4o); if frontier succeeds, route task differently
Context overflow | Key instructions or context silently truncated; model ignores injected content | Context management | Count tokens; check if input exceeds window; trim/summarize context
Tool failure | Agent calls wrong tool, with wrong args, or loop doesn't terminate | Tool schema / agent prompt | Check tool input/output in trace; reduce visible tools; tighten schemas
Prompt Isolation Technique

Strip everything away. Test the prompt against the failing input with no context, no history, no tools. If it still fails, the issue is in the prompt itself. If it passes, the issue is in one of the stripped components.

  • Test system prompt alone first
  • Add context back in stages
  • Pin to temperature=0 during debugging
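The staged re-adding described above can be sketched as a small loop. This is a minimal illustration, not a prescribed implementation: `find_breaking_component`, the `run` callback, and the component names are all hypothetical, and `run` is assumed to call the model at temperature=0 and return True when the output passes your check for the failing input.

```python
from typing import Callable, Dict, Optional


def find_breaking_component(
    run: Callable[[Dict[str, str]], bool],
    system_prompt: str,
    components: Dict[str, str],
) -> Optional[str]:
    """Return the first component whose addition makes the case fail.

    Returns "system_prompt" if the bare prompt already fails, and None
    if the case passes with every component restored.
    """
    parts = {"system_prompt": system_prompt}
    if not run(parts):
        return "system_prompt"
    for name, value in components.items():
        parts[name] = value          # add one stripped component back
        if not run(dict(parts)):
            return name              # failure reappeared with this layer
    return None


# Stand-in runner for illustration: the failure appears once history is present.
def fake_run(parts: Dict[str, str]) -> bool:
    return "history" not in parts


culprit = find_breaking_component(
    fake_run,
    system_prompt="Answer in English.",
    components={"context": "docs", "history": "turns", "tools": "schemas"},
)
print(culprit)  # history
```

Adding components back one at a time keeps the comparison clean: only one layer changes between a passing run and a failing run.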
Instruction Placement

LLMs suffer from the lost-in-the-middle effect. Instructions buried in the middle of a long prompt are often ignored or underweighted.

  • Move ignored instructions to beginning or end
  • Repeat critical constraints at end of prompt
  • Use XML-style delimiters to separate sections
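One way to apply these placement rules is a small prompt builder. The sketch below is illustrative only: `build_prompt`, the tag names, and the `<reminder>` section are assumptions, not a standard format.

```python
from typing import List


def build_prompt(
    instructions: str,
    context: str,
    question: str,
    critical_constraints: List[str],
) -> str:
    """Put instructions first, delimit each section, repeat constraints last."""
    reminder = "\n".join(f"- {c}" for c in critical_constraints)
    return (
        f"<instructions>\n{instructions}\n</instructions>\n\n"
        f"<context>\n{context}\n</context>\n\n"
        f"<question>\n{question}\n</question>\n\n"
        f"<reminder>\n{reminder}\n</reminder>"
    )


prompt = build_prompt(
    instructions="Answer using only the context. Respond in English.",
    context="(long retrieved documents go here)",
    question="What is the refund window?",
    critical_constraints=["Respond in English.", "Maximum 3 sentences."],
)
print(prompt)
```

The long context sits in the middle, where underweighting matters least, while the constraints the model tends to ignore appear at both ends.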
Minimal Reproduction

The most powerful debugging tool: find the shortest prompt that still fails. This eliminates noise and focuses attention on the actual problem.

  • Start with the failing case and strip tokens
  • Stop stripping when the failure disappears
  • The content you removed last is the likely cause
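The stripping procedure can be automated with a greedy shrink loop. This is a simplified sketch in the spirit of delta debugging, not a full algorithm: `shrink_failing_prompt` and the `fails` callback are hypothetical, and `fails` is assumed to assemble the chunks, call the model deterministically, and return True when the failure still reproduces.

```python
from typing import Callable, List


def shrink_failing_prompt(
    chunks: List[str], fails: Callable[[List[str]], bool]
) -> List[str]:
    """Greedily drop chunks while the failure still reproduces."""
    if not fails(chunks):
        raise ValueError("the starting prompt must reproduce the failure")
    current = list(chunks)
    i = 0
    while i < len(current):
        candidate = current[:i] + current[i + 1:]
        if candidate and fails(candidate):
            current = candidate  # chunk i was noise: drop it
        else:
            i += 1               # chunk i is needed to reproduce: keep it
    return current


# Stand-in failure check: the bug needs both conflicting instructions present.
def conflict(chunks: List[str]) -> bool:
    return "Always answer in French." in chunks and "Answer in English." in chunks


minimal = shrink_failing_prompt(
    [
        "You are a helpful assistant.",
        "Be concise.",
        "Always answer in French.",
        "Context: quarterly report",
        "Answer in English.",
    ],
    conflict,
)
print(minimal)  # ['Always answer in French.', 'Answer in English.']
```

What survives the shrink is the smallest set of chunks (under this greedy order) that still triggers the failure, which is usually far easier to reason about than the full prompt.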
Instruction Conflict

System prompt says "be concise." User message context implies long output. Model follows whichever is statistically stronger, often the one that appears nearest the end.

  • Audit for contradictory instructions
  • System prompt wins when explicit
  • Add "regardless of the input length" clarifiers
| Hallucination Type | Cause | Fix |
|---|---|---|
| Factual hallucination | Model fills gaps with plausible-sounding facts not in training data | Add explicit "say I don't know if uncertain" instruction. Use RAG with grounding check. Add citation requirement. |
| Context hallucination (RAG) | Retrieved docs don't contain the answer; model extrapolates from partial information | Improve retrieval (hybrid search, re-ranking). Add: "Only use information from the provided documents." |
| Confident wrong answer | Model lacks uncertainty calibration; outputs high confidence regardless | Prompt: "If you are not certain, explicitly say so before answering." Add LLM-as-judge calibration eval. |
| Temporal hallucination | Model answers about post-training events as if it knows them | Add training cutoff date to system prompt. Provide current date. Tell model to acknowledge cutoff. |
| Structural hallucination | Model invents required fields not in source (e.g. JSON fields it was asked for but source lacks) | Add: "Leave fields as null if information is not present. Do not guess." Use structured output with Optional fields. |
The Grounding Test

To diagnose RAG hallucinations precisely: replace the retrieved context with the oracle answer verbatim and re-run the query. If the model now answers correctly, the problem is retrieval: your docs are wrong, missing, or irrelevant. If the model still hallucinates even with the correct context in front of it, the problem is prompting: the model isn't being told to stay grounded.
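The grounding test reduces to two runs and a comparison. A minimal sketch, assuming a hypothetical `answer_fn(query, context)` that wraps your RAG prompt and an `is_correct` check you supply; the function name and return labels are illustrative.

```python
from typing import Callable


def grounding_test(
    answer_fn: Callable[[str, str], str],
    is_correct: Callable[[str], bool],
    query: str,
    retrieved_context: str,
    oracle_context: str,
) -> str:
    """Classify a RAG failure as a retrieval problem or a prompting problem."""
    if is_correct(answer_fn(query, retrieved_context)):
        return "not_reproduced"   # the original failure did not recur
    if is_correct(answer_fn(query, oracle_context)):
        return "retrieval"        # correct context fixes it: fix retrieval
    return "prompting"            # still wrong with correct context


# Stand-in model: answers from the context when it can, otherwise guesses.
def fake_answer(query: str, context: str) -> str:
    return "30 days" if "30 days" in context else "90 days"


verdict = grounding_test(
    fake_answer,
    is_correct=lambda answer: answer == "30 days",
    query="What is the refund window?",
    retrieved_context="Shipping takes 5 business days.",
    oracle_context="Refunds are accepted within 30 days of purchase.",
)
print(verdict)  # retrieval
```

Run this at temperature=0 so a difference between the two runs reflects the context swap rather than sampling noise.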

In multi-step pipelines, errors compound: a bad Step 1 output becomes the corrupted input to Step 2. The failure often surfaces in Step 3 or 4 but was caused in Step 1. Traces are the only way to see this clearly.

Step Isolation

Test each step independently with ideal inputs. If Step 2 works perfectly with handcrafted input, the bug is in Step 1's output. Walk the pipeline backwards from the failure point.

Intermediate Output Logging

Log every intermediate output, not just the final result. Without step-by-step traces, you can't see where the corruption entered. This is non-negotiable for multi-step systems.

Error Propagation

Validate every step output before passing to the next step. A format error in Step 2 that isn't caught there will manifest as a confusing error in Step 5. Fail fast, fail clearly.
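Intermediate logging and per-step validation combine naturally in one pipeline runner. The sketch below is a minimal illustration: `run_pipeline`, `StepValidationError`, and the two toy steps are hypothetical, and a real system would write the trace to its tracing backend rather than to a list.

```python
from typing import Any, Callable, List, Tuple


class StepValidationError(Exception):
    """Raised at the step that produced invalid output, not downstream."""


# (name, step function, validator for that step's output)
Step = Tuple[str, Callable[[Any], Any], Callable[[Any], bool]]


def run_pipeline(steps: List[Step], initial_input: Any, trace: List[dict]) -> Any:
    value = initial_input
    for name, fn, validate in steps:
        output = fn(value)
        ok = validate(output)
        # Log every intermediate output, valid or not.
        trace.append({"step": name, "input": value, "output": output, "valid": ok})
        if not ok:
            raise StepValidationError(f"step '{name}' produced invalid output: {output!r}")
        value = output
    return value


steps: List[Step] = [
    ("extract", lambda text: {"amount": text.split()[-1]}, lambda out: out["amount"].isdigit()),
    ("to_cents", lambda out: int(out["amount"]) * 100, lambda cents: cents > 0),
]

trace: List[dict] = []
print(run_pipeline(steps, "total due 42", trace))  # 4200

bad_trace: List[dict] = []
try:
    run_pipeline(steps, "total due unknown", bad_trace)
except StepValidationError as err:
    print(err)  # names the "extract" step, where the bad value entered
```

Without the validator, the second input would surface as a `ValueError` inside `to_cents`, one step away from where the problem actually started.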

∑ Chapter 08 — Key Takeaways

  • Debugging workflow: reproduce → isolate component → classify failure → root cause → fix → verify → add to golden set
  • Seven failure classes: format, instruction violation, hallucination, retrieval, capability gap, context overflow, tool failure
  • Prompt isolation technique: strip everything, test alone, add components back; this narrows the failure to one layer
  • Lost-in-the-middle: move ignored instructions to the beginning or end, because buried instructions are often ignored
  • Grounding test: replace retrieved context with the oracle answer; if the model answers correctly, the bug is retrieval; if still wrong, the bug is prompting
  • Multi-step debugging: log every intermediate output, walk backwards from failure, validate each step before passing to the next
09
Chapter 09 · Automation
CI/CD Integration: Automated Evaluation in Your Pipeline

An eval pipeline you run manually will eventually not get run. Automated eval in CI is the only sustainable enforcement mechanism: it gates every merge, runs without human intervention, and creates an auditable quality history for every change to your LLM system.

Eval gates in the LLM system development and deployment cycle
  • Local Dev: manual smoke test on a fast subset
  • PR Gate: L1 deterministic checks + L2 LLM judge; blocks merge on failure
  • Staging Gate: full golden set, L2 + L3 eval; warns on regression
  • Deploy: canary at 10% of traffic, monitor for 30 min, auto-rollback on alert
  • Production: online eval running; alert / rollback if drift is detected

The standard CI pattern: run deterministic checks (fast, free) on every commit; run LLM judge eval on every PR; block merge if either fails.

.github/workflows/eval.yml: PR eval gate

    name: LLM Eval Gate

    on:
      pull_request:
        paths:
          - 'prompts/**'
          - 'src/llm/**'
          - 'evals/**'

    jobs:
      deterministic-eval:
        runs-on: ubuntu-latest
        steps:
          - uses: actions/checkout@v4
          - uses: actions/setup-python@v5
            with: { python-version: '3.12' }
          - run: pip install -r requirements.txt
          - name: Run deterministic checks
            # Schema validation, exact match, format checks.
            # Exit 1 on any failure → blocks merge.
            run: python -m pytest evals/deterministic/ -v

      llm-judge-eval:
        runs-on: ubuntu-latest
        needs: deterministic-eval
        steps:
          - uses: actions/checkout@v4
          - uses: actions/setup-python@v5
            with: { python-version: '3.12' }
          - run: pip install -r requirements.txt
          - name: Run LLM judge evaluation
            env:
              OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
            # Exits non-zero if accuracy drops >5% → blocks merge.
            run: |
              python evals/run_judge_eval.py \
                --golden evals/golden_set.jsonl \
                --baseline evals/baseline.json \
                --threshold 0.05
          - name: Upload eval report
            if: always()  # upload the report even when the eval step fails
            uses: actions/upload-artifact@v4
            with:
              name: eval-report
              path: evals/report.json
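The workflow calls evals/run_judge_eval.py, which this guide does not show. The sketch below illustrates only the gating logic such a script might contain. It assumes the threshold is an absolute drop in accuracy against the baseline, and every function name here is hypothetical.

```python
import json
from typing import Dict, List


def accuracy(results: List[bool]) -> float:
    """Fraction of golden-set cases the judge marked as passing."""
    return sum(results) / len(results) if results else 0.0


def gate(current: float, baseline: float, threshold: float) -> Dict[str, object]:
    """Pass unless accuracy dropped by more than `threshold` (absolute)."""
    drop = baseline - current
    return {
        "current": current,
        "baseline": baseline,
        "drop": round(drop, 4),
        "passed": drop <= threshold,
    }


def main(results: List[bool], baseline_accuracy: float, threshold: float = 0.05) -> int:
    report = gate(accuracy(results), baseline_accuracy, threshold)
    print(json.dumps(report))          # a real script would also write report.json
    return 0 if report["passed"] else 1  # non-zero exit code fails the CI job


# 90/100 passing against a 0.94 baseline is a 0.04 drop: within the threshold.
exit_code = main([True] * 90 + [False] * 10, baseline_accuracy=0.94)
print(exit_code)  # 0
# In the real script: sys.exit(main(...))
```

Returning the exit code from `main` instead of calling `sys.exit` inside it keeps the gating logic unit-testable.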
Path-based Triggers

Only run LLM eval when LLM-related files change. Don't waste API calls and eval time on CSS or docs changes.

  • Trigger on: prompts/**, src/llm/**, evals/**
  • Skip on: docs/**, *.md, styles/**
  • Saves 80%+ of unnecessary eval runs
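On CI systems without built-in path filters, the same decision can be made in a script. A rough sketch, assuming the list of changed files comes from something like `git diff --name-only`. Note that `fnmatchcase` treats `*` as matching `/`, so this only approximates GitHub's glob semantics; the function and constant names are hypothetical.

```python
from fnmatch import fnmatchcase
from typing import Iterable, Sequence

LLM_PATHS = ("prompts/**", "src/llm/**", "evals/**")


def should_run_llm_eval(
    changed_files: Iterable[str], patterns: Sequence[str] = LLM_PATHS
) -> bool:
    """True if any changed file falls under an LLM-related path."""
    return any(
        fnmatchcase(path, pattern)
        for path in changed_files
        for pattern in patterns
    )


print(should_run_llm_eval(["docs/guide.md", "styles/site.css"]))       # False
print(should_run_llm_eval(["README.md", "prompts/system/doc_qa.txt"]))  # True
```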
PR Comment Report

Post eval results as a PR comment so reviewers see the quality impact inline, not buried in CI logs.

  • Accuracy: 94.2% (baseline: 93.8%) ✅
  • Format compliance: 100% ✅
  • Avg tokens: 847 (baseline: 812) ⚠️ +4%
  • Cost per run: $0.73
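A comment like the one above can be generated from the eval report. This sketch is illustrative: the dictionary keys, the `format_eval_comment` name, and the 3% token warning threshold are assumptions, and posting the text to the PR is left to your CI tooling.

```python
def format_eval_comment(current: dict, baseline: dict, token_warn_pct: float = 3.0) -> str:
    """Render eval metrics against the baseline as PR comment text."""
    token_delta_pct = (
        (current["avg_tokens"] - baseline["avg_tokens"]) / baseline["avg_tokens"] * 100
    )
    acc_icon = "✅" if current["accuracy"] >= baseline["accuracy"] else "❌"
    fmt_icon = "✅" if current["format_compliance"] >= 1.0 else "❌"
    tok_icon = "⚠️" if token_delta_pct > token_warn_pct else "✅"
    lines = [
        f"Accuracy: {current['accuracy']:.1%} (baseline: {baseline['accuracy']:.1%}) {acc_icon}",
        f"Format compliance: {current['format_compliance']:.0%} {fmt_icon}",
        f"Avg tokens: {current['avg_tokens']} (baseline: {baseline['avg_tokens']}) "
        f"{tok_icon} {token_delta_pct:+.0f}%",
        f"Cost per run: ${current['cost_usd']:.2f}",
    ]
    return "\n".join(lines)


comment = format_eval_comment(
    current={"accuracy": 0.942, "format_compliance": 1.0, "avg_tokens": 847, "cost_usd": 0.73},
    baseline={"accuracy": 0.938, "avg_tokens": 812},
)
print(comment)
```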
Baseline Management

Store the baseline JSON in the repo. When you intentionally improve quality, update the baseline so the improved score becomes the new floor.

  • evals/baseline.json in version control
  • Update with python evals/update_baseline.py
  • Commit baseline update with the change that caused it
LLM Eval in CI Has a Cost

100 golden set cases × GPT-4o judge × 3 dimensions = ~300 judge calls × $0.005 = $1.50 per PR eval run. On an active team with 20 PRs/day, that's $30/day or ~$900/month. Optimize: use GPT-4o-mini as judge where calibrated (often works as well at 10× lower cost), run full eval only on significant prompt/model changes, use path triggers to skip irrelevant PRs. Track eval pipeline cost as a project metric.
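The arithmetic above is easy to keep in a small helper so the estimate updates when team size or judge pricing changes. The per-call prices are the illustrative figures from the paragraph, not current provider pricing, and `eval_cost` is a hypothetical name.

```python
def eval_cost(
    cases: int,
    dimensions: int,
    cost_per_call: float,
    prs_per_day: int,
    days: int = 30,
) -> dict:
    """Estimate judge-eval spend per run, per day, and per month (USD)."""
    calls = cases * dimensions
    per_run = calls * cost_per_call
    per_day = per_run * prs_per_day
    return {
        "calls": calls,
        "per_run": round(per_run, 2),
        "per_day": round(per_day, 2),
        "per_month": round(per_day * days, 2),
    }


full = eval_cost(cases=100, dimensions=3, cost_per_call=0.005, prs_per_day=20)
print(full)  # {'calls': 300, 'per_run': 1.5, 'per_day': 30.0, 'per_month': 900.0}

# Same workload with a judge that is 10x cheaper per call.
mini = eval_cost(cases=100, dimensions=3, cost_per_call=0.0005, prs_per_day=20)
print(mini["per_month"])  # 90.0
```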

Prompts change frequently and silently break things. Without version control and an eval audit trail, you have no history of what changed, when, and what effect it had on quality.

What to Version Control

  ✅ All prompt templates (system + user)
  ✅ Model selection per feature
  ✅ Eval golden set (as code)
  ✅ Baseline scores per golden set version
  ✅ Judge prompts and rubrics
  ✅ Model API parameters (temperature, max_tokens)

The Prompt Change Record (in commit message)

What changed: System prompt, added instruction to cite sources

Why: Reduce hallucination rate in doc QA feature

Eval result: Groundedness: 3.8 → 4.2 (+0.4). Accuracy: no change.

Cost impact: +15 tokens/query (avg)

Baseline updated: Yes (new floor)

∑ Chapter 09 — Key Takeaways

  • Eval in CI has four gates: PR gate (L1 + L2 judge) → staging gate (full set) → deploy (canary) → production (online eval)
  • Use path-based triggers: only run LLM eval when LLM-related files change; saves 80%+ of unnecessary API calls
  • Post eval results as a PR comment: accuracy delta, format compliance, token count change, cost per run
  • CI eval cost: ~$1.50/run at 100 cases × GPT-4o judge × 3 dimensions; use a GPT-4o-mini judge where calibrated (10× cheaper)
  • Prompt versioning: treat prompts, model selection, golden set, baselines, and judge configs as version-controlled code
  • Every prompt change commit should record: what changed, why, eval result, cost impact, whether baseline was updated
10
Chapter 10 · Ecosystem
Tooling: LangSmith, Langfuse, Promptfoo, and More

The eval and observability tooling ecosystem has matured rapidly. You don't need to build everything from scratch, but you do need to pick the right tools for your stack. The wrong tool choice leads to vendor lock-in, missing features, or paying for capabilities you don't need.

Category 1: Tracing & Observability

Capture, store, and visualize traces from production. Focus: what happened in this specific request?

  • LangSmith: LangChain ecosystem
  • Langfuse: open-source, self-hostable
  • Arize Phoenix: local-first, OTel native
  • Custom OTel + Grafana
Category 2: Evaluation & Testing

Run eval pipelines, compare prompt versions, gate deployments. Focus: is this better or worse than before?

  • promptfoo: open-source CI eval
  • Braintrust: eval + experiment tracking
  • RAGAS: RAG-specific eval metrics
  • LangSmith: datasets + eval integrations
Category 3: Prompt Management

Version, store, and deploy prompts. Focus: which prompt version is in production right now?

  • LangSmith Hub: prompt registry
  • Langfuse Prompts: prompt versioning + A/B
  • Braintrust: prompt snapshots
  • Git + plain files: the simplest option
Category 4: Analytics & Cost

Track aggregate quality, cost, and usage over time. Focus: is quality trending up or down this week?

  • Langfuse: cost dashboard built-in
  • Custom Grafana dashboards: from OTel metrics
  • Provider dashboards: OpenAI, Anthropic usage pages
| Tool | Core Strength | Tracing | Eval | Prompt Mgmt | Hosting | Cost |
|---|---|---|---|---|---|---|
| LangSmith | End-to-end LangChain observability | ✅ Native | ✅ Datasets + CI | ✅ Hub | SaaS (self-hosting on enterprise plans) | Free tier + paid plans |
| Langfuse | Self-hostable full-stack LLM observability | ✅ SDK + OTel | ✅ Built-in + RAGAS | ✅ Versioning + A/B | Self-hosted or SaaS | Free self-hosted |
| promptfoo | CI-first eval framework | ❌ Not tracing | ✅ Best-in-class CLI + CI | ⚠️ Basic | Open-source CLI | Free (open-source) |
| Braintrust | Eval + experiment tracking | ⚠️ Basic spans | ✅ Strong eval + A/B | ✅ Prompt snapshots | SaaS only | Paid (usage-based) |
| Arize Phoenix | Local-first debugging + evals | ✅ OTel native | ✅ Evals + embedding viz | ❌ | Local + cloud | Free local tier |
| RAGAS | RAG-specific eval metrics | ❌ | ✅ RAG metrics only | ❌ | Open-source library | Free (open-source) |

| Consideration | Choose SaaS | Choose Self-Hosted |
|---|---|---|
| Data sensitivity | Prompts contain no PII / confidential data | Prompts contain PII, IP, or regulated data (HIPAA, GDPR) |
| Team size / infra | Small team, no dedicated infra engineer | Mature infra team; existing K8s / monitoring stack |
| Time to value | Need tracing working in hours, not days | Can accept 1–2 days for initial setup |
| Cost at scale | SaaS costs rise linearly, so large volumes become expensive | Fixed infra cost amortizes over volume |
| Compliance & audit | Provider compliance certifications sufficient | Need full data residency control and audit logs |
The Recommended Starting Point

Start with promptfoo for CI eval (open-source; it runs locally, so test data is sent only to the model providers you configure) + Langfuse self-hosted for tracing and online eval (Docker Compose in 10 minutes, free, your data stays local). This covers 90% of production evaluation needs at zero ongoing cost. Graduate to SaaS tools (LangSmith, Braintrust) if you need richer integrations with LangChain or dedicated eval UX for a larger team.

The best observability stack for most teams is not one monolithic tool; it's a lightweight integration of best-of-breed components, each doing one thing well.

Langfuse tracing: 4-line integration

    # Import paths follow the Langfuse Python SDK v2 decorator API.
    from langfuse.openai import openai  # drop-in replacement for the openai module
    from langfuse.decorators import langfuse_context, observe

    # That's it: all OpenAI calls made through this module are traced automatically.
    # Langfuse captures: model, tokens, cost, latency, input, output.

    # retrieve() and build_messages() are your application's own helpers.
    @observe()  # wraps the function as a traced span
    def run_rag_pipeline(query: str) -> str:
        # Shows up as its own span if retrieve() is also decorated with @observe()
        docs = retrieve(query)
        response = openai.chat.completions.create(  # auto-traced LLM span
            model="gpt-4o-mini",
            messages=build_messages(query, docs),
        )
        langfuse_context.update_current_observation(
            metadata={"doc_count": len(docs), "feature": "doc_qa"}
        )
        return response.choices[0].message.content
Recommended Stack (most teams)
  • promptfoo: CI eval gate on PRs
  • Langfuse (self-hosted): tracing + online eval
  • Git: golden set + prompt versioning
  • Grafana: dashboards from OTel/Langfuse metrics
Enterprise Stack
  • LangSmith or Braintrust: team eval UI
  • Custom OTel + existing APM (Datadog)
  • Internal prompt registry + CI gates
  • RAGAS for RAG-specific metrics
Startup / Fast Start
  • LangSmith free tier: instant setup
  • promptfoo: CI eval (free)
  • Provider dashboards for cost tracking
  • Upgrade to self-hosted when data sensitivity requires it

∑ Chapter 10 — Key Takeaways

  • Four tool categories: tracing, evaluation/testing, prompt management, analytics/cost; most teams need one from each
  • Best-in-class: promptfoo (CI eval), Langfuse (tracing + online eval, self-hostable), RAGAS (RAG metrics), Braintrust (eval UX)
  • Self-host when: PII/confidential prompts, regulated data, HIPAA/GDPR, high volume. Use SaaS when: small team, speed needed, no PII
  • Recommended starting stack: promptfoo + Langfuse self-hosted + Git, which covers 90% of needs at zero ongoing cost
  • Langfuse tracing integration: 4 lines of code; the drop-in OpenAI replacement auto-traces all calls
  • Track eval pipeline cost itself as a project metric; it can reach $900+/month on active teams if unmanaged

A production LLM system requires all four evaluation layers working together. Each layer serves a different purpose and catches failures the others miss. This is the minimum architecture, not an aspirational target.

Figure: The four-layer production evaluation system (all required, all complementary)
  • ① Offline Evaluation: golden set (50–200 cases), deterministic checks, LLM judge on PRs, CI gate that blocks merge. Tools: promptfoo + Git. Cost: ~$1/PR.
  • ② Online Evaluation: 1–5% traffic sampling, async LLM judge scoring, 100% of triggered anomalies, real distribution signal. Tools: Langfuse + judge. Cost: scales with traffic.
  • ③ Monitoring: quality trend dashboards, latency / cost metrics, drift detection, alerting thresholds. Tools: Grafana / dashboards. Cost: infra only.
  • ④ Feedback Loop: failed cases → golden set, prod samples → test data, regressions → CI tests; improves Layer 1 over time. Process: monthly review. Cost: engineering time.
Layer 4 feeds Layer 1: every production failure strengthens the offline test suite. The system improves itself over time.
What Each Layer Catches
  • Offline: prompt regressions, model update breakage, format regressions
  • Online: distribution shift, new input patterns, subtle quality erosion
  • Monitoring: drift over time, cost anomalies, latency spikes
  • Feedback: known-failure regression prevention; evolving coverage
What Breaks Without Each Layer
  • No offline eval: prompt changes break things silently
  • No online eval: distribution shift goes undetected for weeks
  • No monitoring: cost spikes and quality drifts are invisible until too late
  • No feedback loop: same failures recur; golden set never improves
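The online evaluation layer samples a small share of traffic (1–5%) and scores every request that trips an anomaly trigger. A minimal sketch of that sampling decision, assuming each request carries a stable trace ID; `should_judge` and the 2% default are illustrative choices, not part of any tool's API.

```python
import hashlib


def should_judge(trace_id: str, sample_rate: float = 0.02, *, anomaly: bool = False) -> bool:
    """Decide whether a production trace is sent to the async LLM judge."""
    if anomaly:
        return True  # triggered cases are always scored
    # Hash the trace ID so the decision is deterministic and reproducible.
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return bucket < sample_rate


sampled = sum(should_judge(f"trace-{i}") for i in range(10_000))
print(f"sampled {sampled} of 10000 traces")
print(should_judge("trace-42", sample_rate=0.0, anomaly=True))  # True
```

Hashing instead of calling `random.random()` means the same trace always gets the same decision, which makes it possible to reproduce which requests were judged.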

Human inspection does not scale. Reading 10 responses and concluding "it looks good" is not evaluation; it is survivorship bias. The cases you inspect are rarely the cases that fail in production.

What Looks Correct
  • May fail on the specific edge cases your users actually send
  • May pass today but degrade silently after the next model update
  • May hallucinate confidently on topics that are rare in your test set

Human inspection tells you about the inputs and outputs you chose to look at. It tells you almost nothing about the distribution of inputs you haven't seen.

What Measurement Gives You
  • A statistical signal over real inputs, not a cherry-picked sample
  • A baseline to detect regression, not a subjective "feels better"
  • A continuous production signal, not a one-time check before launch

Measurement turns "I think it works" into "it passes 94.2% of cases with format compliance at 99.1%." One is a guess. The other is an engineering decision.

The Engineering Standard
  • Every prompt change: measure before and after
  • Every model upgrade: run full eval suite first
  • Every production deployment: have an online eval signal within hours

If you wouldn't deploy a backend service without monitoring and error rate tracking, don't deploy an LLM system without eval pipelines and quality dashboards.

If you are not measuring it, you are not controlling it. LLM quality is not self-evident. It is measured, tracked, and defended with golden sets, judges, traces, dashboards, and feedback loops. Every component in this guide exists because "it seemed fine" was not enough.