Evaluation & Observability
Measuring what matters: benchmarks, LLM-as-judge, regression testing, tracing, and production monitoring for AI systems.
"It works in the demo" is not a deployment criterion. Without systematic evaluation and observability, you are flying blind: shipping changes that might improve or destroy quality, with no way to know which.
Evaluation is the discipline that separates LLM experimentation from LLM engineering. Without it, every prompt change is a guess, every model upgrade is a risk, and every production incident is a surprise. Eval is not a QA step; it is the feedback loop that makes iteration possible.
LLM-based systems do not behave like traditional software. The same input can produce different outputs, different reasoning paths, and different failure modes across runs. This is not a bug; it is the architecture. Evaluation must account for this explicitly.
A system that is "90% correct" fails 1 in 10 requests. At 10K queries/day that is 1,000 failures per day, even if every individual call "seems to work" when you test it manually.
- Measure pass rate, not "does it work"
- Track failure rate per input category
- Report p50/p90 quality, not just avg
Testing one or two inputs tells you almost nothing about system reliability. Evaluation requires a dataset large enough to detect real signal from noise.
- Single-input testing ≠ evaluation
- Minimum: 50 diverse cases for signal
- Run multiple times to estimate variance
Non-determinism cannot be eliminated, only bounded. Production systems must wrap LLM calls with validation, retries, and fallbacks that enforce acceptable behavior even when the model doesn't.
- Validate every output before use
- Retry on format failures (max 2–3×)
- Fallback on repeated failure
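The validate, retry, and fallback steps above can be sketched as a thin wrapper around the model call. This is a minimal illustration, not a reference implementation: `call_model`, the label set, and the fallback value are hypothetical placeholders for whatever your system actually uses.

```python
import json
from typing import Callable, Optional

ALLOWED_LABELS = {"BILLING", "TECHNICAL", "OTHER"}   # hypothetical label set
FALLBACK = {"label": "OTHER", "fallback": True}       # hypothetical safe default


def validate(raw: str) -> Optional[dict]:
    """Return the parsed output if it is usable, otherwise None."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict) or data.get("label") not in ALLOWED_LABELS:
        return None
    return data


def call_with_guardrails(
    call_model: Callable[[str], str], prompt: str, max_retries: int = 2
) -> dict:
    """Validate every output, retry on format failure, fall back after repeated failure."""
    for _ in range(1 + max_retries):      # one initial attempt plus the retries
        parsed = validate(call_model(prompt))
        if parsed is not None:
            return parsed
    return dict(FALLBACK)                 # every attempt failed validation
```

Counting how often the fallback path fires gives a direct reliability metric: a rising fallback rate is an early sign that the prompt or model has regressed.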
LLM outputs are hard to evaluate by reading them. A response can look fluent, well-formatted, and confident while being factually wrong, missing a key constraint, or breaking a downstream parser 15% of the time. Human spot-checking at scale is too slow, inconsistent, and biased.
Fluent ≠ correct. LLMs produce grammatically perfect, confident-sounding text regardless of accuracy. Evaluating output by reading it catches obvious failures, not subtle ones.
- Wrong answer, perfect prose
- Missing constraints, elegant format
- Hallucinated facts, appropriate hedging
Without live eval, quality degrades invisibly. Model updates, prompt changes, schema changes, and input distribution shifts all erode quality silently, until a user reports it.
- Prompt change improves A, breaks B
- Model upgrade changes tone/format
- Edge case inputs grow over time
Evaluation converts a vague "something feels off" into a measurable "format compliance dropped 8% after the last prompt change." That's actionable; the former is not.
- Quantified quality signal
- A/B comparison between versions
- Regression detection before users notice
Every change to an LLM system (prompt, model, temperature, schema, retrieval) is a hypothesis. Evaluation is the experiment that tells you whether the hypothesis is correct. Without evaluation, you're not engineering; you're guessing with extra steps.
LLM evaluation is not a single number. Different properties require different measurement approaches, and they don't always move together. A prompt change can improve accuracy while breaking format compliance.
| Dimension | What It Measures | How to Measure | Priority |
|---|---|---|---|
| Functional correctness | Does output satisfy the task? (right answer, correct classification) | Exact match / regex / unit tests on output | Highest (ship-blocking) |
| Format compliance | Is output parseable? Does it match the required schema? | JSON parse attempt / schema validation / regex | High (downstream systems break on failure) |
| Factual accuracy | Are stated facts true? Are claims grounded in source? | LLM-as-judge / grounding check / human review | High for knowledge tasks |
| Quality / tone | Is output helpful, appropriate, on-brand? | LLM-as-judge with rubric / human rating | Medium (subjective but important) |
| Safety / refusal | Does the system refuse harmful requests? Does it over-refuse benign ones? | Red-team datasets / adversarial test suite | Critical for user-facing systems |
| Latency | Time to first token / total response time | Instrumented timing / p50/p95/p99 | Medium (SLA-dependent) |
| Cost per query | Input tokens × input price + output tokens × output price | Token count logging × price table | Medium (economics) |
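The cost-per-query row reduces to simple arithmetic once token counts are logged. The sketch below assumes separate input and output prices quoted per million tokens; the model name and prices are placeholders, not real provider pricing.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Price:
    input_per_million: float    # USD per 1M input tokens (placeholder value)
    output_per_million: float   # USD per 1M output tokens (placeholder value)


# Hypothetical price table; fill in from your provider's current price list.
PRICES = {"example-model": Price(input_per_million=1.00, output_per_million=4.00)}


def cost_per_query(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the USD cost of one call from its logged token counts."""
    price = PRICES[model]
    total = (
        input_tokens * price.input_per_million
        + output_tokens * price.output_per_million
    )
    return total / 1_000_000
```

With these placeholder prices, a call with 1,000 input tokens and 500 output tokens costs (1,000 × 1.00 + 500 × 4.00) / 1,000,000 = $0.003.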
Evaluation pipelines themselves incur real cost. Running the wrong eval strategy produces false confidence and unnecessary API spend at the same time.
| Eval Type | Run When | Cost per Run | Why This Cadence |
|---|---|---|---|
| Deterministic checks | Every commit, every output in production | ~$0 | Zero marginal cost; no reason not to run always |
| LLM judge (golden set) | Every pull request | $0.50–$2.00 per run (100–200 samples) | Catches quality regressions before merge; cost is low vs risk |
| Human eval | Major releases / model changes | $50–$500+ per review | Too slow and expensive for every change; reserved for high-stakes decisions |
| Online sampling + judge | Continuous in production (1–5% of traffic) | Scales with traffic volume | Real distribution signal; catches drift offline tests miss |
Common ways to overspend: running the LLM judge on every commit (not just PRs) can burn $20–$50/day with no added signal; verbose judge prompts inflate token counts and judge costs; and evaluating with GPT-4o when GPT-4o-mini has been calibrated to give the same scores costs roughly 10× more than necessary. Track your eval pipeline spend as a separate budget line; it's a real operational cost, not a sunk cost.
Not all evals should run on every change. The principle: fast, cheap evals run always; slow, expensive evals run on significant changes. This is the eval equivalent of a testing pyramid: unit tests at the base, integration tests in the middle, human review at the top.
Deterministic checks have zero marginal cost. JSON parsing, schema validation, field presence checks, and regex format checks should run on every single output, in tests and in production.
- Takes milliseconds per sample
- Catches format regressions immediately
- Gate all deployments on 100% pass
Run LLM-as-judge evaluation on your golden set for every pull request. 100–200 samples × $0.005 per judge call ≈ $0.50–$1.00 per PR. That is worth it: it catches quality regressions before merge.
- ~$0.50–$1.00 per eval run
- Catches subtle quality changes
- Automated, with no human bottleneck
Before any LLM-based evaluation runs, enforce deterministic checks. These have near-zero cost, run in milliseconds, and catch the most impactful failures: format errors that would crash downstream systems or produce silently corrupt data.
✓ JSON schema validation: does the output match the required schema?
✓ Required field presence: are all expected fields non-null?
✓ Regex constraints: does a field match its expected pattern (date, email, ID)?
✓ Type validation: is a numeric field actually a number?
✓ Enum validation: is a classification label one of the allowed values?
If a failure can be caught deterministically, it must never reach an LLM judge.
Running an LLM judge on a malformed JSON response costs money and adds latency, when the right response is to fail immediately with a clear error.
Deterministic guards: ~$0, <1ms, run on every output
LLM judge: $0.005+, 1–3s, run on sampled outputs
Apply the cheapest check that can detect the failure, and escalate only when the cheaper checks pass.
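The deterministic guards listed above can be expressed in a few lines of standard-library Python. This is a sketch under assumptions: the field names (`category`, `date`, `amount`), the allowed categories, and the date pattern are invented for illustration, and a real system might use a JSON Schema validator instead of hand-written checks.

```python
import json
import re

DATE_RE = re.compile(r"^\d{4}-\d{2}-\d{2}$")            # assumed date format
ALLOWED_CATEGORIES = {"BILLING", "TECHNICAL", "OTHER"}  # assumed enum values
REQUIRED_FIELDS = ("category", "date", "amount")        # assumed schema


def deterministic_check(raw: str) -> list[str]:
    """Return a list of failure reasons; an empty list means every check passed."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return ["top-level value is not an object"]

    # Required field presence: every expected field must exist and be non-null.
    errors = [f"missing field: {f}" for f in REQUIRED_FIELDS if data.get(f) is None]
    if errors:
        return errors

    # Enum validation.
    if not isinstance(data["category"], str) or data["category"] not in ALLOWED_CATEGORIES:
        errors.append("category not in allowed values")
    # Regex constraint.
    if not isinstance(data["date"], str) or not DATE_RE.match(data["date"]):
        errors.append("date does not match YYYY-MM-DD")
    # Type validation (bool is excluded because it is a subclass of int).
    if isinstance(data["amount"], bool) or not isinstance(data["amount"], (int, float)):
        errors.append("amount is not a number")
    return errors
```

Only outputs that return an empty list here would be passed on to an LLM judge; anything else fails immediately with a specific reason.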
What: Run fixed test set against system before deployment
When: During development, on every significant change, blocking deployment
Pros: Controlled, reproducible, no user impact
Cons: Test set may not match production distribution; eval inputs may become stale
Tools: promptfoo, LangSmith, custom pytest harness
What: Continuously sample and evaluate live traffic
When: Always running in production at a 1–10% sampling rate
Pros: Real distribution, catches drift, finds failures offline tests miss
Cons: Failures already reached users; LLM judge adds cost per sample
Tools: LangSmith, Langfuse, custom sampling + judge pipeline
A golden test set built in January will not cover the inputs your users are actually sending in July. Production inputs drift over time: new topics, new edge cases, adversarial inputs. Online evaluation sampling at 5% of production traffic, judged automatically, is the only way to know if quality is holding as inputs evolve. Run both; they catch different things.
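One way to pick the 5% of traffic to judge is to hash a request identifier, so the decision is stable and reproducible rather than dependent on a random number generator. The function below is an illustrative sketch; `request_id` is an assumed field, and the sampled requests would then be handed to whatever judge pipeline you run.

```python
import hashlib


def should_sample(request_id: str, rate: float = 0.05) -> bool:
    """Deterministically select roughly `rate` of requests for online evaluation."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    # Map the first 4 bytes of the hash to a value in [0, 1).
    bucket = int.from_bytes(digest[:4], "big") / 2**32
    return bucket < rate
```

Because the same `request_id` always lands in the same bucket, a sampled request can be re-judged later (for example after a rubric change) without re-drawing the sample.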
| Failure Pattern | How It Happens | Discovery Without Eval | Prevention |
|---|---|---|---|
| Prompt regression | Fix one failure mode in a prompt → silently break 3 others | User complaints weeks later | Eval on full golden set before merge |
| Model update breakage | Provider updates GPT-4o silently; JSON structure changes slightly | Parser 500s in production | Eval + model version pinning |
| Distribution shift | New user cohort sends different query types than anticipated | Low satisfaction scores over weeks | Online eval sampling detects drop early |
| Format drift | Downstream parser changes; LLM still outputs old format | Silent data corruption in DB | Schema validation on every output |
| Cost explosion | Prompt grows longer; token count doubles; nobody notices until bill arrives | Monthly invoice shock | Token count tracking in eval + alerts |
LLM failures in production fall into four distinct categories, each with different visibility, detection difficulty, and downstream impact. Understanding which category a failure belongs to determines how to catch and fix it.
System crashes, parser throws, API returns error. These are easy to detect but still must be handled gracefully.
- Invalid JSON → parser throws
- Missing required field → null pointer
- Wrong data type → downstream cast fails
- Detection: deterministic checks, error monitoring
Output parses successfully but is partially incorrect. These pass format checks but fail quality checks, and are often found only via LLM judge or human review.
- Partially correct answer (misses one constraint)
- Correct structure, wrong content
- Missing context that changes the answer
- Detection: LLM judge with multi-dimension rubric
Output looks correct, passes all checks, reaches users. But it's wrong. These corrupt downstream systems, erode user trust, and are almost impossible to detect without continuous quality sampling.
- Hallucinated values that look plausible
- Confident wrong answers with no hedging
- Format drift (subtle schema deviation)
- Detection: online eval sampling + grounding checks
The system works but quality drifts over time. Tone changes, verbosity increases, instruction adherence drops. There is no single failure, just a slow erosion of quality.
- Longer responses than specified
- Brand tone gradually shifts
- Reliability of structured output drops week-over-week
- Detection: rolling avg judge scores over time
Hard failures get fixed immediately because they break the system loudly. Silent failures pass all your checks, reach all your users, and corrupt downstream data. By the time you notice (user report, data audit), the failure has been happening for days or weeks. The only defense is continuous online evaluation that samples and judges live traffic, not just offline testing that only catches what you thought to test for.
Quality and reliability are different goals, and they require different engineering approaches. A system can be high quality on good inputs while being completely unreliable at scale.
Produces excellent outputs on typical inputs
Fails unpredictably on edge cases
Hard to test because failures are non-obvious
Users experience occasional great results and occasional crashes
Trust level: low (users can't predict when it works)
Produces consistently acceptable outputs across all inputs
Edge cases handled gracefully: fallback, "I don't know," or structured error
Testable: pass/fail rate stable across runs
Users experience predictable behavior, not occasional brilliance
Trust level: high (behavior is predictable)
A system that produces brilliant output 70% of the time and crashes or hallucinates 30% of the time is not production-grade. Reliability is what makes users trust the system, and trust is built by consistent, predictable behavior, not occasional impressive outputs. Target reliability first (measure failure rate, build guardrails) before optimizing for peak quality.
Prompt engineering without evaluation is unstable. Most prompt changes fix one issue and introduce new failures. Without a full eval suite, you can't know whether a change is a net improvement or a net regression.
| Scenario | Without Eval | With Eval |
|---|---|---|
| Prompt change to fix hallucination | Seems fixed in manual testing; format compliance silently dropped 5% | Eval shows: hallucination ↓ 15%, format compliance ↓ 5%. Net positive, but the regression is caught. |
| Model upgrade (mini → full) | Quality feels better; cost increased 16×, token use up 30%. Unknown. | Eval shows: accuracy +3%, cost +16×, tokens +28%. Decide intentionally. |
| Adding few-shot examples | Looks better on the 5 cases you tested; broke 8% of edge cases. | Eval on 200-case golden set shows: common cases improved, edge cases regressed. |
Every prompt change should: (1) improve at least one metric, (2) not regress any metric beyond threshold, (3) be recorded with before/after eval scores. Track metric deltas, not just pass/fail. A change that improves accuracy from 88% to 91% is valuable. The same change that simultaneously drops format compliance from 99% to 94% may not be worth shipping.
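The three rules above can be turned into a small merge gate that compares before/after eval scores. This is a hedged sketch: it assumes every metric is a rate where higher is better, and the one-percentage-point regression threshold is an arbitrary example, not a recommendation from the text.

```python
def gate_change(baseline: dict, candidate: dict, max_regression: float = 0.01) -> dict:
    """Compare eval metrics before and after a change (higher is better for all)."""
    deltas = {m: round(candidate[m] - baseline[m], 4) for m in baseline}
    improved = [m for m, d in deltas.items() if d > 0]
    regressed = [m for m, d in deltas.items() if d < -max_regression]
    return {
        "deltas": deltas,          # recorded with the change (rule 3)
        "improved": improved,      # must be non-empty (rule 1)
        "regressed": regressed,    # must be empty (rule 2)
        "ship": bool(improved) and not regressed,
    }


# The example from the text: accuracy improves, format compliance regresses.
result = gate_change(
    baseline={"accuracy": 0.88, "format_compliance": 0.99},
    candidate={"accuracy": 0.91, "format_compliance": 0.94},
)
```

Here `result["ship"]` is `False`: accuracy gained three points, but format compliance dropped five, which exceeds the assumed threshold.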
| Anti-Pattern | What Happens | The Fix |
|---|---|---|
| "Looks good to me" | Human approval of a few test cases masquerades as evaluation. Fails silently on edge cases. | Require automated eval on 50+ cases before any merge. Manual review is a supplement, not a replacement. |
| Single-metric optimization | Accuracy improves; format compliance, latency, and cost all regress. Nobody noticed because only one metric was tracked. | Track all critical dimensions on every eval run. Block if any key metric regresses. |
| Static golden set | Golden set from 6 months ago. New features and user behaviors not covered. High offline scores, poor production quality. | Add cases from production failures monthly. Assign ownership for golden set maintenance. |
| Over-relying on LLM judge | LLM judge gives 4.2/5, which feels like high quality. JSON parse failure rate is 8%. Judge never checked format. | Always run deterministic checks first. LLM judge is for quality dimensions that determinism can't cover. |
| Ignoring eval cost | LLM judge runs on every commit with GPT-4o. $30/day eval spend. Nobody noticed for 3 months. | Run LLM judge on PRs only. Use mini model when calibrated. Track eval pipeline cost as a budget line. |
∑ Chapter 01 — Key Takeaways
- Evaluation is the feedback loop that turns LLM changes from guesses into measurable improvements
- Fluent ≠ correct: silent degradation is invisible without systematic measurement
- Measure 7 dimensions: functional correctness, format compliance, factual accuracy, quality, safety, latency, cost
- The eval hierarchy: deterministic checks (always) → LLM judge (per PR) → human eval (major changes) → A/B (production)
- Offline eval gates deployment; online eval catches drift in production. Both are required
- The five silent failure patterns: prompt regression, model update breakage, distribution shift, format drift, cost explosion
Benchmarks let you compare models on standardized tasks. But benchmark performance and production performance are not the same thing. Understanding what benchmarks measure, and what they miss, is essential before using them to make model selection decisions.
| Benchmark | What It Tests | Format | Useful For |
|---|---|---|---|
| MMLU (Hendrycks et al., 2021) | World knowledge across 57 academic subjects (STEM, humanities, law, medicine) | Multiple choice, 4 options | Comparing knowledge breadth; model selection for knowledge-intensive tasks |
| HumanEval (OpenAI, 2021) | Python function completion from docstrings; 164 programming problems | Code generation, unit test pass rate | Code assistant model selection; comparing coding capability |
| MT-Bench (LMSYS, 2023) | Multi-turn conversation quality across 8 categories (writing, math, coding, reasoning) | LLM-as-judge scoring by GPT-4 | Chat model quality; instruction following across domains |
| GPQA (Rein et al., 2023) | Graduate-level science questions designed to be hard for non-experts | Multiple choice, expert-validated | Frontier model capability; distinguishing top-tier models |
| GSM8K (Cobbe et al., 2021) | Grade-school math word problems requiring multi-step arithmetic reasoning | Free-form answer, exact match | Multi-step reasoning; CoT effectiveness |
| HellaSwag (Zellers et al., 2019) | Commonsense reasoning: which sentence continues an activity description correctly | Multiple choice | Common sense; less useful for frontier models (most score 95%+) |
| LMSYS Chatbot Arena (LMSYS, 2023) | Human preference ranking: users compare two anonymous model responses head-to-head | Elo ranking from human votes | Real user preference; most production-relevant benchmark |
Of all public benchmarks, LMSYS Chatbot Arena most closely predicts which models users prefer in practice, because it uses real human preference data rather than academic tasks. MMLU tells you about knowledge breadth. Arena tells you about perceived output quality. Use both, and weight Arena more heavily for user-facing applications.
If benchmark questions appeared in the model's training data, scores reflect memorization, not capability. This is increasingly common as benchmarks become widely used.
- Impossible to verify from outside
- Inflates reported scores
- New benchmarks contaminate faster than expected
When most frontier models score 85–92% on a benchmark, it can no longer distinguish between them. HellaSwag is essentially useless for comparing GPT-4o vs Claude 3.5: both score 95%+.
- Old benchmarks can't rank new models
- Need constantly harder challenges
- GPQA was designed specifically for this
Academic benchmarks test standardized tasks. Your production system has a specific input distribution. A model that tops MMLU may be worse than a smaller model on your specific task type.
- MMLU doesn't predict JSON extraction quality
- HumanEval ≠ Python code review quality
- Always run domain-specific eval
Once a benchmark becomes a target, it ceases to be a good measure. Labs optimize specifically for leaderboard benchmarks โ through training data selection, prompt engineering, and sometimes cherry-picking evaluation conditions. A model's actual usefulness on your task may be uncorrelated with its leaderboard position. Always validate on your own data before making model selection decisions from benchmarks alone.
| Decision | Use Benchmarks? | What to Use Instead / Also |
|---|---|---|
| Initial model shortlisting | Yes, to filter out obvious losers | MMLU for knowledge tasks, HumanEval for code, Arena Elo for general quality |
| Final model selection | Insufficient alone | Your own golden set + task-specific eval is mandatory |
| Tracking model provider updates | Watch for score changes | Your production eval set is a more reliable signal |
| Comparing your fine-tuned model to base | Yes, use MMLU to detect capability regression | Domain eval for capability gain measurement |
| Communicating model quality externally | Use with caveats | Benchmark + task-specific results together tell a more honest story |
For production systems, custom benchmarks tuned to your task type are more valuable than any public benchmark. They measure what you actually care about, on representative inputs from your users.
Start with 50 diverse inputs, not 500. Cover: (1) typical cases (60%), (2) hard/ambiguous cases (20%), (3) edge cases and failures you've seen in production (20%). As a rule of thumb, a 50-case eval that runs in CI catches roughly 80% of regressions. Perfect coverage is the enemy of getting started. Add cases as you discover new failure modes.
∑ Chapter 02 — Key Takeaways
- Key benchmarks: MMLU (knowledge), HumanEval (code), MT-Bench (chat quality), GSM8K (reasoning), Arena (real user preference)
- LMSYS Chatbot Arena is the most production-relevant benchmark because it uses real human preference data
- Three benchmark failure modes: data contamination, saturation, task-distribution mismatch
- Benchmarks are for initial shortlisting; final model selection requires your own task-specific eval
- Goodhart's Law: when a benchmark becomes a target, it stops being a good measure. Leaderboard ≠ production performance
- Build a 50-case custom benchmark before anything else: typical (60%), hard (20%), edge cases (20%)
Human evaluation is the gold standard, but it costs $0.10–$1+ per sample and can't scale to thousands of daily outputs. LLM-as-judge closes the gap: automated qualitative evaluation that costs $0.001–$0.01 per sample and scales with traffic. When designed correctly, it correlates with human judgment at 80–90%.
LLM-as-judge uses a capable frontier model (GPT-4o, Claude 3.5 Sonnet) to evaluate the output of another model, or even the same model. The judge receives the original input, relevant context, the output to evaluate, and a scoring rubric. It returns a score plus a brief justification.
✓ Evaluating text quality, helpfulness, tone
✓ Grounding checks (does output use source material?)
✓ Multi-turn conversation coherence
✓ Relative comparison: "which response is better?"
✓ Structured scoring with clear rubrics (1–5 scale)
✗ Mathematical / code correctness (use unit tests)
✗ Tasks requiring specific domain expertise
✗ Very long outputs (>2K tokens): the judge loses focus
✗ Fine-grained factual claims without reference source
✗ Safety evaluation: judges can be jailbroken
The quality of your LLM judge is almost entirely determined by the quality of your judge prompt. A weak judge prompt produces noisy, inconsistent scores. A strong judge prompt produces reliable, calibrated scores that correlate with human judgment.
Problems with a weak judge prompt: "quality" is undefined, the scale is vague (what's a 6?), there is no structured output, and scores are inconsistent across runs.
What a strong judge prompt has: a specific dimension, explicit per-score definitions, structured JSON output, and a required justification for each score.
| Design Principle | Why It Matters |
|---|---|
| One dimension per judge | Helpfulness + accuracy + tone in one prompt → confused, noisy scores. One judge per dimension, aggregated separately. |
| Explicit per-level rubric | A scale of 1–5 without definitions means different things on each call. Define exactly what each score value means. |
| Require structured output | JSON output is parseable and consistent. Freeform reasoning varies in format and is hard to aggregate. |
| Include reference answer if available | Comparing to a ground-truth answer dramatically improves accuracy evaluation over judging in isolation. |
| Ask for a reason | The justification catches judge errors: if the reason contradicts the score, the evaluation is unreliable. |
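The design principles in the table can be combined into a single-dimension judge prompt plus a strict parser for the judge's reply. The sketch below is illustrative only: the template wording, the function names, and the handling of intermediate scores (2 and 4) are assumptions, and the actual model call is deliberately left out.

```python
import json
from typing import Optional, Tuple

# Hypothetical template: one dimension, explicit rubric, reference answer, JSON output.
JUDGE_TEMPLATE = """You are grading ONE dimension only: correctness.

Question: {question}
Reference answer: {reference}
Answer to grade: {answer}

Rubric:
5 = Fully correct, nothing to dispute
3 = Mostly correct, minor inaccuracy
1 = Incorrect or misleading answer
Use 4 or 2 when the answer falls between two levels.
Do not reward verbosity or formatting.

Respond with JSON only: {{"score": <integer 1-5>, "reason": "<one sentence>"}}"""


def build_judge_prompt(question: str, answer: str, reference: Optional[str] = None) -> str:
    """Fill the template; the reference answer is optional."""
    return JUDGE_TEMPLATE.format(
        question=question, answer=answer, reference=reference or "(none provided)"
    )


def parse_judge_output(raw: str) -> Tuple[int, str]:
    """Parse and validate the judge's reply; raise ValueError if it is unusable."""
    data = json.loads(raw)  # json.JSONDecodeError is a subclass of ValueError
    if not isinstance(data, dict):
        raise ValueError("judge output is not a JSON object")
    score, reason = data.get("score"), data.get("reason")
    if isinstance(score, bool) or not isinstance(score, int) or not 1 <= score <= 5:
        raise ValueError(f"score out of range: {score!r}")
    if not isinstance(reason, str) or not reason.strip():
        raise ValueError("missing reason")
    return score, reason
```

Treating an unparseable judge reply as an error, rather than silently defaulting to a score, keeps judge failures visible in the eval report.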
One overall quality score hides signal. A response can be perfectly accurate but poorly formatted, or beautifully written but factually wrong. Multi-dimension scoring separates these signals so you know what to fix.
"Is the answer factually accurate and does it satisfy the user's stated request?"
- 5: Fully correct, nothing to dispute
- 3: Mostly correct, minor inaccuracy
- 1: Incorrect or misleading answer
"Does the answer cover all aspects of the question, or does it miss key parts?"
- 5: All aspects addressed
- 3: Main answer present, details missing
- 1: Major parts of question unanswered
"For RAG systems: is every claim in the answer supported by the provided source documents?"
- 5: Every claim traced to source
- 3: Mostly grounded, one unsupported claim
- 1: Significant hallucination present
Most production LLM systems benefit from exactly four evaluation dimensions: (1) Correctness: is the answer right? (2) Completeness: does it cover everything asked? (3) Groundedness: are claims supported? (critical for RAG) (4) Format compliance: handled deterministically, not by LLM judge. Run one judge call per judged dimension, return structured JSON per call, and aggregate across your golden set.
| Bias | What It Is | Mitigation |
|---|---|---|
| Verbosity bias | Judges rate longer responses higher, even when a shorter answer is better | Add explicit instruction: "Do not reward verbosity. A concise correct answer scores higher than a verbose correct answer." |
| Self-preference bias | GPT-4o-as-judge prefers GPT-4o outputs; Claude-as-judge prefers Claude outputs | Use a different model family as judge than the model being evaluated. Or use multiple judges and average. |
| Position bias | In A-vs-B comparisons, judge prefers whichever response appears first (or second) | Run each comparison twice with reversed order. Only accept agreements; reclassify disagreements as ties. |
| Sycophancy | If you include "this response is from our best model", judge inflates score | Never include model identity in judge prompt. Blind evaluation only. |
| Formatting halo | Well-formatted responses (headers, bullet points) get higher scores regardless of content quality | Add: "Evaluate content quality, not formatting. Ignore markdown styling when scoring." |
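The position-bias mitigation in the table (run each comparison twice with the order reversed, and treat disagreements as ties) can be sketched as follows. The `judge` callable is a hypothetical stand-in for a pairwise LLM judge that returns `"first"` or `"second"`.

```python
from typing import Callable


def debiased_compare(judge: Callable[[str, str], str], resp_a: str, resp_b: str) -> str:
    """Return "A", "B", or "tie" after judging both presentation orders.

    `judge(first, second)` must return "first" or "second" for the response
    it prefers in the order shown.
    """
    forward = judge(resp_a, resp_b)    # A is shown first
    backward = judge(resp_b, resp_a)   # B is shown first
    winner_forward = "A" if forward == "first" else "B"
    winner_backward = "B" if backward == "first" else "A"
    # Accept the verdict only when both orderings agree.
    return winner_forward if winner_forward == winner_backward else "tie"
```

A judge that always prefers whichever response appears first produces a tie for every pair under this scheme, which is the intended outcome: pure position preference carries no information about quality.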
Before trusting your LLM judge at scale, validate that its scores correlate with human judgment. This is called calibration, and it's what separates a reliable eval pipeline from a false sense of measurement.
An LLM judge you haven't validated against human labels is measuring something; you just don't know what. A judge that looks correct on casual inspection may be systematically scoring a known failure mode as passing. Validate on at least 50 human-labeled examples before using any judge in CI. If judge-human agreement is below 75%, your rubric needs work before the judge is trustworthy.
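Judge-human agreement on binary pass/fail can be computed directly from paired labels. This sketch assumes the judge returns 1–5 scores and that a score of 4 or higher counts as a pass; that cutoff is an illustrative choice, not something the text prescribes.

```python
from typing import Sequence


def agreement_rate(
    judge_scores: Sequence[int], human_pass: Sequence[bool], pass_threshold: int = 4
) -> float:
    """Fraction of examples where the judge's pass/fail verdict matches the human label."""
    if not judge_scores or len(judge_scores) != len(human_pass):
        raise ValueError("need equal-length, non-empty score and label lists")
    matches = sum(
        (score >= pass_threshold) == label
        for score, label in zip(judge_scores, human_pass)
    )
    return matches / len(judge_scores)


# Toy example with 4 labeled cases (a real calibration set should have 50+).
rate = agreement_rate([5, 4, 2, 3], [True, True, False, True])
```

In the toy example the judge disagrees with the human on the last case (score 3 is a fail, the human said pass), so `rate` is 0.75. Inspecting the disagreements is usually more useful than the number itself, since they show which rubric level is ambiguous.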
∑ Chapter 03 — Key Takeaways
- LLM-as-judge: $0.001–$0.01 per sample, correlates with human judgment at 80–90% when designed correctly
- Works for: quality, helpfulness, groundedness, coherence. Does not work for: math, code correctness, safety
- One dimension per judge call; combined rubrics produce noisy, ambiguous scores
- Strong judge prompts need: explicit per-level rubric, structured JSON output, reasoning field
- Five key biases to mitigate: verbosity, self-preference, position, sycophancy, formatting halo
- Calibrate against 50+ human labels before trusting any judge in CI; target ≥80% agreement on binary pass/fail
- Use temperature=0 and json_object mode for consistent, parseable judge outputs
A golden set is the foundation of every eval pipeline. Without one, you have no baseline, no regression detection, and no way to compare prompt versions. Building a high-quality golden set is the most important engineering task in LLM evaluation: it is built once, then maintained continuously.
A golden set is a curated collection of (input, expected output) pairs, or more precisely (input, evaluation criteria) pairs, that represent the task your system must perform. "Golden" means human-verified: each case has been reviewed and annotated to define what correct looks like.
The query, document, or task given to the system. Should be representative of real production traffic, not synthetic or idealized.
- Sampled from real user inputs
- Anonymized if needed (PII removal)
- Diverse across your task distribution
Defines what "correct" means for this input. Can be an exact reference answer, a set of required elements, or rubric criteria.
- Exact answer (classification, extraction)
- Required fields / key points (summarization)
- Rubric criteria (open-ended quality)
Tags that enable sliced analysis โ query type, difficulty level, failure category, date added. Essential for understanding which cases regressed.
- Difficulty: easy / medium / hard
- Category: topic or task type
- Source: synthetic / prod sample / manual
| Source | How | Pros | Cons |
|---|---|---|---|
| Production sampling | Log 1–5% of live traffic; human-annotate a sample | Real distribution; catches actual failure modes | Requires annotation pipeline; PII concerns |
| Manual curation | Domain expert writes inputs covering known difficulty areas | High quality; targets known hard cases | Slow; may not cover real distribution |
| Failure mining | Collect every confirmed system failure → add to golden set | Directly prevents known regressions | Reactive; only catches known problems |
| LLM-assisted generation | Use a strong model to generate diverse inputs; human-verify | Fast at scale; can cover edge cases systematically | Distribution differs from real users; needs review |
| Adversarial construction | Deliberately craft inputs that test edge cases, ambiguity, format stress | Finds failure modes that sampling misses | Requires effort; hard to know what to target |
A production-grade golden set should contain: 60% production-sampled (real distribution), 20% failure-mined (known regressions), 20% adversarial/edge cases (hard cases your sampling won't catch naturally). The failure-mined cases are critical: they ensure that every bug you've fixed stays fixed.
Your golden set must represent the full range of task types your system handles, not just the most common. An eval set of only easy cases gives you a false sense of quality.
- All major task categories proportionally represented
- Long inputs and short inputs included
- All supported languages/domains
Edge cases are where production systems break. Empty inputs, extremely long inputs, ambiguous queries, multi-intent queries, adversarial phrasing.
- Empty / one-word inputs
- Inputs near context window limit
- Ambiguous or contradictory requests
Test inputs that stress your output format: inputs that require nested JSON, inputs in different languages, inputs with special characters, very short or very long expected outputs.
- Special characters in input (quotes, brackets)
- Inputs that should produce minimal output
- Inputs that should produce structured output
Every confirmed production failure becomes a golden set case. This is your regression suite: evidence that fixed bugs stay fixed across future changes.
- Add case within 1 day of confirming a failure
- Tag with failure date and root cause
- Never remove; only deprecate with reason
Annotation is the hardest part of building a golden set. The goal is to define "correct" precisely enough that an automated evaluator (schema check, exact match, or LLM judge) can reliably determine pass/fail.
| Task Type | Annotation Format | Eval Method | Example |
|---|---|---|---|
| Classification | Exact label(s) | Exact match | "Label: BILLING" |
| Extraction | Required field values | Key-value match / schema check | {"name": "John", "date": "2024-01-15"} |
| Summarization | Key points that must be present | LLM judge (completeness rubric) | Required: [acquisition price, date, acquirer] |
| Q&A / Factual | Reference answer + acceptable variants | Exact / fuzzy match + LLM judge | Answer: "42.5 million" or "42,500,000" |
| Generation / Writing | Rubric criteria (tone, structure, constraints) | LLM judge (multi-dimension) | Must: professional tone, <200 words, include CTA |
| Code generation | Unit tests that must pass | Execute + test pass rate | assert output(4) == 16 # squares input |
50 cases: Minimum viable. Catches major regressions, runs in <5 min. Start here.
200 cases: Production standard. Statistically meaningful, covers all categories.
500+ cases: Large system / multiple task types. Run on releases, not every PR.
Rule: Run time < 10 minutes for PR-blocking evals. Split larger sets into fast (PR) and full (release) tiers.
Store in Git: Golden set is code; it belongs in version control with change history.
JSONL format: One case per line, easy to diff and append.
Never delete cases; mark them deprecated with reason and date.
Tag with schema version: when the eval format changes, old cases can still run against the old schema.
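A JSONL golden set following the storage rules above might look like the two records below, with a loader that skips deprecated cases instead of deleting them. The field names (`id`, `schema_version`, `expected`, `tags`, `deprecated`) are an assumed layout for illustration, not a standard.

```python
import json

# Two example records, one per line. In practice this would be a .jsonl file in Git.
GOLDEN_JSONL = """\
{"id": "billing-001", "schema_version": 1, "input": "Why was I charged twice?", "expected": {"label": "BILLING"}, "tags": {"difficulty": "easy", "source": "prod sample"}, "deprecated": null}
{"id": "legacy-007", "schema_version": 1, "input": "How do I use the old export tool?", "expected": {"label": "OTHER"}, "tags": {"difficulty": "medium", "source": "manual"}, "deprecated": {"reason": "feature removed", "date": "2024-06-01"}}
"""


def load_active_cases(text: str) -> list[dict]:
    """Parse JSONL and return only the cases that are not marked deprecated."""
    cases = [json.loads(line) for line in text.splitlines() if line.strip()]
    return [case for case in cases if not case.get("deprecated")]


active = load_active_cases(GOLDEN_JSONL)
```

Because each case is a single line, adding or deprecating a case shows up as a one-line diff in code review, which is what makes the set easy to audit over time.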
A golden set that hasn't been updated in 6 months while your product has evolved will let through regressions you care about and block on cases that are no longer relevant. Review and add to your golden set monthly: (1) add cases for new features, (2) add cases from production failures, (3) deprecate cases for removed features. Assign ownership: golden set maintenance is an engineering responsibility, not a one-time task.
A golden set that was excellent six months ago may be misleading today. Input distribution, product scope, and user behavior all change over time, and a static test set can produce high offline scores that do not reflect production reality.
| Drift Cause | Symptom | Detection | Response |
|---|---|---|---|
| Changing user behavior | Offline score stable; production quality drops; new query types not covered | Online eval score diverges from offline eval score | Sample production queries monthly; add new input types to golden set |
| New edge cases | System breaks on inputs that never appeared before; golden set doesn't cover them | Production errors cluster to specific input categories | Mine production failures; add as regression cases within 1 day |
| Evolving system scope | New features added; golden set tests old behavior only; no coverage on new paths | New features untested; discovered only on user report | Add golden set cases as part of every feature development cycle |
| Obsolete cases | Deprecated features still in golden set; cases always pass (trivially); set size inflated | Cases with 100% pass rate for 3+ months | Deprecate with reason and date; never delete, just mark inactive |
∑ Chapter 04 — Key Takeaways
- A golden set is (input, expected output / criteria, metadata): human-verified pairs that define correctness
- Best collection mix: 60% production-sampled, 20% failure-mined, 20% adversarial/edge cases
- Coverage must include: task distribution, edge cases, format stress, and regression cases (every confirmed failure)
- Annotation format depends on task: exact label (classification), key-value (extraction), unit tests (code), rubric (generation)
- Size: 50 (minimum viable), 200 (production standard), 500+ (split into PR and release tiers)
- Store in Git as JSONL, never delete cases, review and extend monthly; ownership is an engineering responsibility
Regression testing answers one question: "Did this change make things worse?" For LLM systems it is the primary quality gate, because unlike traditional software, LLM changes (prompts, models, parameters) are hard to reason about and easy to get subtly wrong. Regression tests catch the "works on the cases I checked, broke on the ones I didn't" failure pattern.
A regression is a drop relative to a baseline. Without a recorded baseline, all you have is a current score with no context. Baselines must be stable, reproducible, and stored, not just computed on demand.
| Baseline Metric | What to Record | Update Frequency |
|---|---|---|
| Format compliance rate | % of outputs that parse as valid JSON/schema | Update after any format-affecting change |
| Functional accuracy | % correct on classification/extraction tasks | Update after prompt or model change |
| LLM judge score | Avg score per dimension (correctness, completeness, etc.) | Update after any semantic change |
| P95 latency | 95th percentile response time in ms | Update after model or infrastructure change |
| Avg tokens per query | Input + output tokens averaged over golden set | Update after prompt or schema change |
Not all regressions are equal. A 0.5% accuracy drop may be noise; a 5% drop may be a real regression; a format compliance drop from 100% to 95% is always a blocker. Thresholds encode your beliefs about what matters.
Deployment stops. These regressions always indicate a real problem that must be fixed before release.
- Format compliance drops below 98%
- Any required field missing on >1% of cases
- Accuracy drops >5% from baseline
- P95 latency increases >50%
Deployment can proceed with explicit human sign-off. Requires a documented reason for the regression.
- Accuracy drops 2-5% from baseline
- Judge score drops >0.3 on any dimension
- Token count increases >20%
- New failure pattern appears on 3+ cases
Change is safe to deploy without manual review. Score is within acceptable variance.
- Accuracy delta within ±2% of baseline
- Format compliance ≥98%
- Latency within ±20% of baseline
- No new failure categories introduced
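The three tiers above can be sketched as a single gate function. This is an illustrative sketch under stated assumptions: `EvalResult` and `gate` are hypothetical names, accuracy drops are treated as absolute percentage points, and a latency increase between 20% and 50% is treated as a warning (the tiers above do not say which tier that range belongs to).

```python
from dataclasses import dataclass


@dataclass
class EvalResult:
    accuracy: float           # fraction of cases correct, 0.0-1.0
    format_compliance: float  # fraction of outputs passing schema validation
    judge_score: float        # average LLM judge score
    p95_latency_ms: float
    avg_tokens: float


def gate(baseline: EvalResult, current: EvalResult) -> str:
    """Return "block", "warn", or "pass" for a candidate change."""
    accuracy_drop = baseline.accuracy - current.accuracy
    judge_drop = baseline.judge_score - current.judge_score
    latency_increase = current.p95_latency_ms / baseline.p95_latency_ms - 1
    token_increase = current.avg_tokens / baseline.avg_tokens - 1

    # Hard block: deployment stops.
    if (current.format_compliance < 0.98
            or accuracy_drop > 0.05
            or latency_increase > 0.50):
        return "block"
    # Warning: deployment needs explicit human sign-off.
    if (accuracy_drop > 0.02
            or judge_drop > 0.3
            or token_increase > 0.20
            or latency_increase > 0.20):  # assumption: 20-50% latency is a warning
        return "warn"
    return "pass"
```

Checks that need per-case data (required fields missing on >1% of cases, new failure patterns on 3+ cases) are omitted here because they cannot be computed from aggregate scores alone.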
If your thresholds block every small change, engineers route around them by running fewer evals or skipping the process. Thresholds should block real regressions, not noise. For an LLM judge score (inherently variable), ±0.2 is noise; ±0.5 is signal. For exact match accuracy on a 100-case eval set, a 2% swing (2 cases) can be a single annotation error. Calibrate thresholds against your eval's natural variance before enforcing them.
With small eval sets, observed score differences can be coincidental. A 3% accuracy difference on a 50-case set might not be statistically significant: it could easily be 1-2 cases flipping due to LLM non-determinism.
| Eval Set Size | Minimum Meaningful Difference | Confidence Level |
|---|---|---|
| 50 cases | ~8-10% difference to be confident (4-5 cases) | Use directional signal only, not hard thresholds |
| 100 cases | ~5-6% difference meaningful (5-6 cases) | Reasonable for PR gates with soft thresholds |
| 200 cases | ~3-4% difference meaningful | Good for release gates with hard thresholds |
| 500+ cases | ~2% difference meaningful | Strong statistical confidence; suitable for A/B |
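The figures in the table are close to the 95% margin of error of a pass rate near 90%, using the normal approximation to the binomial. The sketch below shows that calculation; the 90% pass rate and the choice of approximation are assumptions made for illustration, and the function name is hypothetical.

```python
import math


def margin_of_error(pass_rate: float, n_cases: int, z: float = 1.96) -> float:
    """Approximate 95% margin of error for a pass rate measured on n_cases."""
    return z * math.sqrt(pass_rate * (1 - pass_rate) / n_cases)


for n in (50, 100, 200, 500):
    print(f"{n:>4} cases: +/- {margin_of_error(0.90, n):.1%}")
```

At a 90% pass rate this gives roughly 8%, 6%, 4% and 3% for 50, 100, 200 and 500 cases. Comparing two separately measured scores is noisier than a single score, so treat these as lower bounds on what counts as a meaningful difference.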
For non-deterministic evals (LLM judge at temperature>0), run the eval twice and take the average. If the two runs differ by more than 3%, something is wrong with your judge (temperature too high, rubric too vague). Use temperature=0 for all judge calls to minimize this variance: the judge should be as close to deterministic as the provider allows, even when the system under test is not. Some providers still show small run-to-run differences at temperature=0, so keep the two-run comparison as a sanity check.
promptfoo is a widely used open-source LLM testing framework. It runs your prompt against your golden set, applies assertions, and generates a pass/fail report suitable for CI integration.
| Signal | Rollback Trigger | Action |
|---|---|---|
| Offline eval regression | Hard threshold violation pre-deploy | Block merge → fix prompt → re-run eval |
| Production error rate spike | Parser 500s increase >2× in 30 min post-deploy | Immediate rollback to previous version |
| Online eval quality drop | LLM judge avg drops >0.5 on live sample over 2h | Review samples → decide rollback or hotfix |
| Cost explosion | Token avg increases >50% vs previous version | Alert + review → rollback if no valid reason |
| User-reported failures | Confirmed reports cluster to specific input type | Mine failure cases → add to golden set → patch |
∑ Chapter 05 — Key Takeaways
- Regression testing answers: "Did this change make things worse?" It requires a recorded baseline to compare against
- Record baselines for 5 metrics: format compliance, accuracy, LLM judge score, P95 latency, avg tokens
- Three threshold tiers: hard block (deploy stops), warning (review required), pass (auto-approve)
- Statistical significance: small eval sets need larger differences to be meaningful; use 200 cases for reliable hard thresholds
- Use temperature=0 for all judge calls: this minimizes eval variance, so measured regressions are real rather than noise
- Rollback triggers: hard threshold violation, parser 500s, online quality drop >0.5, 50%+ cost increase
When an LLM system returns a wrong answer, how do you know which step failed? Tracing gives you the answer: a complete record of every step in a request's execution, including which LLM was called, with what prompt, what it returned, how long it took, and what it cost. Without traces, debugging is guesswork.
Tracing for LLM systems borrows from distributed systems observability. Every user request generates one trace, which is a tree of spans, each span representing one unit of work. For LLM systems, spans map directly to the operations that matter.
| Span Type | Fields to Capture | Why It Matters |
|---|---|---|
| LLM call | model, prompt (truncated), response (truncated), tokens_in, tokens_out, cost, latency, TTFT, retry_count | Primary cost and latency driver; contains the most debugging information |
| Vector/DB retrieval | query, top_k, similarity scores, retrieved_doc_ids, latency | Retrieval quality is a leading indicator of RAG answer quality |
| Tool call | tool_name, input_args, output, latency, status (success/error) | Tool failures are the #1 agent failure mode; input/output needed for debugging |
| Output validation | schema_passed, parsed_output, validation_errors | Reveals format failure rate; links validation failures to which LLM call produced them |
| Whole request (root span) | request_id, user_id, total_latency, total_cost, total_tokens, status, feature_name | Enables per-user cost attribution and system-level performance dashboards |
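To make the tree-of-spans idea concrete, here is a standard-library-only sketch of a trace for one RAG request. The `Span` class and its methods are hypothetical illustrations, not the API of OpenTelemetry or any tracing product; a real system would use an existing tracing SDK.

```python
import time
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class Span:
    name: str
    kind: str  # "request", "llm", "retrieval", "tool", or "validation"
    attributes: dict[str, Any] = field(default_factory=dict)
    children: list["Span"] = field(default_factory=list)
    start: float = field(default_factory=time.perf_counter)
    end: Optional[float] = None

    def child(self, name: str, kind: str, **attributes: Any) -> "Span":
        """Open a child span nested under this one."""
        span = Span(name, kind, dict(attributes))
        self.children.append(span)
        return span

    def finish(self, **attributes: Any) -> None:
        """Record result attributes and close the span."""
        self.attributes.update(attributes)
        self.end = time.perf_counter()

    def total(self, key: str) -> float:
        """Sum a numeric attribute over this span and all descendants."""
        own = self.attributes.get(key, 0)
        return own + sum(c.total(key) for c in self.children)


root = Span("answer_question", "request",
            {"request_id": "req-1", "feature_name": "doc_qa"})
retrieval = root.child("vector_search", "retrieval", top_k=5)
retrieval.finish(retrieved_doc_ids=["d1", "d7"])
llm = root.child("generate", "llm", model="example-model")
llm.finish(tokens_in=1200, tokens_out=150, cost=0.004)
root.finish(status="ok")
```

Rolling totals up from child spans is what makes the root span usable for per-request cost and token dashboards.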
Full prompt logging contains user input, which may contain PII, confidential business data, or sensitive content. Before logging prompts in production: (1) implement PII detection and redaction, (2) restrict trace access to authorized personnel, (3) set retention policies (30-90 days typical), (4) check your privacy policy and data residency requirements. Truncating prompts to 500 characters captures enough for debugging without logging full user content.
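A minimal sketch of redact-then-truncate before a prompt reaches the trace store. The two regular expressions are deliberately crude illustrations (the phone pattern will also match other long digit runs such as dates); production PII detection usually needs a dedicated tool. The function name and placeholder tokens are assumptions.

```python
import re

EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def redact_for_logging(prompt: str, max_chars: int = 500) -> str:
    """Redact obvious PII, then truncate to the logging limit."""
    text = EMAIL.sub("[EMAIL]", prompt)
    text = PHONE.sub("[PHONE]", text)
    # Truncate after redaction so a cut-off email or number is never logged.
    if len(text) > max_chars:
        text = text[:max_chars] + "...[truncated]"
    return text
```

For example, `redact_for_logging("Contact jane.doe@example.com or +1 (555) 123-4567 today")` yields `"Contact [EMAIL] or [PHONE] today"`.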
Application logs tell you what happened: which endpoint was called, what status code was returned, how long it took. They do not tell you whether the LLM output was correct, helpful, or safe. That gap is the central observability problem for LLM systems.
✓ Request volume and error rates
✓ Response latency (total)
✓ HTTP status codes
✓ Token count per call
✗ Whether the answer was correct
✗ Whether quality is degrading week-over-week
✗ Which failure category the error belongs to
- Output quality scores: LLM judge per dimension, rolling avg
- Format pass rate: % of outputs passing schema validation
- TTFT + generation latency: separately, not just total
- Token usage breakdown: input vs output, by feature
- Retry rate: % of requests that required ≥1 retry
- Fallback trigger rate: % of requests falling back to cheaper/cached response
OpenTelemetry (OTel) is the industry standard for distributed tracing. The OpenTelemetry Semantic Conventions for LLMs (GenAI conventions) define standard span attribute names for LLM calls, enabling consistent tooling across providers. The conventions are still evolving: later versions rename the token-usage attributes to gen_ai.usage.input_tokens and gen_ai.usage.output_tokens, so check which version your tooling targets.
- gen_ai.system: "openai", "anthropic"
- gen_ai.request.model: "gpt-4o-mini"
- gen_ai.usage.prompt_tokens
- gen_ai.usage.completion_tokens
- gen_ai.response.finish_reasons
- gen_ai.request.temperature
These libraries automatically wrap LLM SDK calls with OTel spans, so no manual instrumentation is needed for basic tracing.
- opentelemetry-instrumentation-openai
- LangSmith: LangChain native tracing
- Langfuse: open-source, self-hostable
- Arize Phoenix: local + cloud
| Tool | Hosting | Strengths | Best For |
|---|---|---|---|
| LangSmith | SaaS (paid) | Deep LangChain integration, eval pipelines, human feedback, datasets | Teams using LangChain; want eval + tracing in one tool |
| Langfuse | Open-source + SaaS | Self-hostable, SDK-agnostic, cost tracking, LLM-as-judge built in | Privacy-sensitive deployments; teams wanting data control |
| Arize Phoenix | Open-source (local-first) | Notebook-friendly, evals, embedding visualization, OTel native | ML engineers; debugging sessions; local dev tracing |
| Braintrust | SaaS | Strong eval + experiment tracking, prompt versioning, CI integration | Teams focused on eval-driven development |
| Custom OTel + Jaeger/Tempo | Self-hosted | Full control, integrates with existing observability stack (Grafana) | Orgs with mature monitoring infra; enterprise scale |
A trace doesn't just show you where time went; it shows you why. Three latency patterns diagnose different root causes and point to different solutions.
Time-to-first-token is >1s. LLM span starts late even though retrieval is fast.
Root causes: Long input prompt, provider load, large model. Fix: Prompt compression, prompt cache, smaller model, multi-provider failover.
TTFT is fast but total LLM span is 5-10s. Output token count is very high.
Root causes: Model generating unnecessarily long responses. Fix: Set aggressive max_tokens, add "be concise" to prompt, use streaming to hide latency.
Total latency 2-3× what a single call should be. Retry spans visible in trace.
Root causes: Timeout too low, intermittent provider issues, format validation failing. Fix: Tune timeout, fix format issue that's causing retries, improve output parsing.
For every production p95 latency alert: (1) pull a trace from the 95th percentile, (2) identify the span contributing the most time, (3) check if it's TTFT (prompt issue), generation (output length issue), or retry (reliability issue). Traces turn "it's slow" into "the 1,847-token system prompt is adding 340ms of TTFT on every call", which is a fixable problem.
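The triage in step (3) can be sketched as a small classifier over one LLM span. The 1s TTFT and 5s generation limits come from the patterns described above; the class and function names, and the choice to check retries first, are illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass
class LLMSpanTiming:
    ttft_s: float     # time to first token, in seconds
    total_s: float    # full span duration, in seconds
    retry_count: int  # retries recorded on this span


def classify_latency(span: LLMSpanTiming,
                     ttft_limit_s: float = 1.0,
                     generation_limit_s: float = 5.0) -> str:
    """Map one slow LLM span to the latency pattern it most resembles."""
    if span.retry_count > 0:
        return "retry inflation"   # reliability issue
    if span.ttft_s > ttft_limit_s:
        return "high TTFT"         # prompt length, provider load, model size
    if span.total_s - span.ttft_s > generation_limit_s:
        return "long generation"   # output length issue
    return "ok"
```

Retries are checked first because a retried call inflates both TTFT and total time, which would otherwise hide the real cause.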
∑ Chapter 06 — Key Takeaways
- A trace is a tree of spans for one request; each span records one operation's input, output, latency, cost, and status
- Minimum span set: LLM call, retrieval, tool call, output validation, root request, each with timing + cost
- Use OpenTelemetry GenAI conventions for standard attribute names; this enables consistent tooling and dashboards
- Log prompts safely: truncate to 500 chars, redact PII, restrict access, set 30-90 day retention
- Tool choice: Langfuse (self-hosted, privacy), LangSmith (LangChain teams), Phoenix (local dev), Braintrust (eval-focused)
- Three latency patterns: high TTFT (prompt too long), long generation (max_tokens not set), retry inflation (format failures or timeouts)
Offline eval tells you the system worked before you deployed. Monitoring tells you whether it's working right now. Production quality degrades silently until someone complains, unless you're continuously sampling, evaluating, and alerting on live traffic.
| Aspect | Offline eval | Online eval |
|---|---|---|
| Inputs | Fixed test set, controlled inputs, run before deployment | Live traffic sample, real user inputs, continuous |
| Coverage | Tests the cases you thought to include | Tests the cases users actually send |
| Signal | Snapshot in time; result is stable | Rolling signal; changes as inputs and behavior change |
| Catches | Regressions relative to your golden set | Drift that offline eval can't see |
| Gap / value | Gap: Real traffic evolves; your test set doesn't | Value: Discovers new failure modes before users escalate |
A complete production monitoring stack has three layers: (1) Hard metrics: latency, error rate, cost (from logs, near real-time). (2) Quality sampling: LLM judge on 1-5% of live traffic (near real-time). (3) Drift detection: aggregate trend analysis over hours/days. Layer 1 alerts in minutes; Layer 2 in hours; Layer 3 in days. Each catches different failure modes.
You can't judge every production request: one judge call per request can roughly double your LLM costs. Strategic sampling gets you signal coverage at manageable cost.
| Strategy | What It Does | Sample Rate | Best For |
|---|---|---|---|
| Random sampling | Evaluate a uniform random subset of all requests | 1-5% of traffic | Baseline quality tracking, cost drift detection |
| Stratified sampling | Ensure all query categories / user cohorts are represented equally | Varies per stratum | Systems with very unequal query distributions |
| Triggered sampling | Always evaluate when: long latency, retry occurred, format validation failed, high token count | 100% of anomalous cases | Catching the worst failures immediately |
| User feedback sampling | Evaluate all requests where user gave explicit negative feedback (thumbs-down / edit / rephrase) | 100% of flagged cases | Connecting quality scores to user satisfaction |
| Time-window sampling | Heavier sampling for first hour after a deployment, then fall back to baseline rate | 10-20% post-deploy, 1-2% steady-state | Catching deployment regressions fast |
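Several of these strategies can be combined into one sampling decision per request. The sketch below covers triggered, user-feedback, time-window and random sampling; the `RequestInfo` fields, the 10-second "slow" cutoff and the specific rates are assumptions chosen from within the ranges in the table.

```python
import random
from dataclasses import dataclass


@dataclass
class RequestInfo:
    latency_s: float
    retried: bool
    format_failed: bool
    negative_feedback: bool
    seconds_since_deploy: float


def should_evaluate(req: RequestInfo,
                    rng: random.Random,
                    base_rate: float = 0.02,
                    post_deploy_rate: float = 0.15,
                    slow_s: float = 10.0) -> bool:
    """Decide whether this request is sent to the online judge."""
    # Triggered sampling: always evaluate anomalous requests.
    if req.retried or req.format_failed or req.latency_s > slow_s:
        return True
    # User feedback sampling: always evaluate flagged requests.
    if req.negative_feedback:
        return True
    # Time-window sampling: heavier rate in the first hour after a deploy.
    rate = post_deploy_rate if req.seconds_since_deploy < 3600 else base_rate
    return rng.random() < rate
```

Passing the random generator in explicitly keeps the decision reproducible in tests.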
The online eval pipeline must be completely asynchronous. The user receives their response immediately. Evaluation happens in a background worker, writing to a separate metrics store. If your eval pipeline goes down, users are unaffected. Coupling eval to the critical path is a common mistake that turns a monitoring failure into a user-facing outage.
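One way to keep evaluation off the critical path is a bounded in-process queue drained by a background worker. This is a sketch only: `judge` is a placeholder for a real LLM-judge call, `scores` stands in for a separate metrics store, and a production system would more likely use an external queue.

```python
import queue
import threading

eval_queue: queue.Queue = queue.Queue(maxsize=1000)
scores: list[tuple[str, float]] = []  # stand-in for a separate metrics store


def enqueue_for_eval(request_id: str, prompt: str, response: str) -> bool:
    """Called on the request path: never blocks, never raises."""
    try:
        eval_queue.put_nowait((request_id, prompt, response))
        return True
    except queue.Full:
        return False  # drop the sample rather than delay the user


def judge(prompt: str, response: str) -> float:
    """Placeholder judge; a real one would call an LLM with a rubric."""
    return 5.0 if response.strip() else 1.0


def eval_worker() -> None:
    while True:
        item = eval_queue.get()
        if item is None:  # shutdown signal
            eval_queue.task_done()
            break
        request_id, prompt, response = item
        try:
            scores.append((request_id, judge(prompt, response)))
        except Exception:
            pass  # an eval failure must not propagate to serving
        finally:
            eval_queue.task_done()


worker = threading.Thread(target=eval_worker, daemon=True)
worker.start()
enqueue_for_eval("req-1", "What is 2+2?", "4")
eval_queue.put(None)
eval_queue.join()  # only for this demo; the serving path never waits
```

If the worker dies or the queue fills up, samples are lost but user responses are unaffected, which is the property the paragraph above asks for.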
LLM judge scores trend downward over days/weeks without a single obvious cause. Often caused by gradual input distribution shift.
- Track 7-day rolling avg judge score
- Alert if 7-day avg drops >0.3 below 30-day avg
- Pull failing traces to identify new input patterns
The types of queries users send change. New topics, new use cases, seasonal patterns. Your system wasn't designed for these inputs.
- Track query topic distribution over time
- Embed queries, monitor cluster centroids
- Alert on new topic clusters with low scores
Average tokens per query increases over weeks. Often caused by growing conversation history, longer user inputs, or unhealthy retry patterns.
- Track avg tokens/query daily
- Alert on >20% increase week-over-week
- Break down by feature and model
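The quality-drift and cost-drift alert rules above reduce to comparisons of rolling averages. A minimal sketch, assuming one aggregated value per day with the most recent day last; the function names are hypothetical, and the 30-day window here includes the most recent 7 days.

```python
from statistics import mean


def quality_drift(daily_scores: list[float],
                  short: int = 7, long: int = 30,
                  max_drop: float = 0.3) -> bool:
    """True if the 7-day average is more than max_drop below the 30-day average."""
    if len(daily_scores) < long:
        return False  # not enough history to judge
    return mean(daily_scores[-long:]) - mean(daily_scores[-short:]) > max_drop


def cost_drift(daily_avg_tokens: list[float],
               max_increase: float = 0.20) -> bool:
    """True if this week's avg tokens/query is >20% above the previous week's."""
    if len(daily_avg_tokens) < 14:
        return False
    previous_week = mean(daily_avg_tokens[-14:-7])
    this_week = mean(daily_avg_tokens[-7:])
    return this_week > previous_week * (1 + max_increase)
```

Distribution drift needs query embeddings and clustering, so it is not covered by this sketch.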
| Alert Type | Trigger Condition | Severity | Initial Response |
|---|---|---|---|
| Format failure spike | JSON parse failure rate >5% in any 10-min window | P1: page on-call immediately | Check recent deployment; roll back if <2h since deploy |
| Error rate increase | LLM API error rate >10% (timeouts, 429s, 5xx) | P1: activate fallback provider | Switch to backup model; check provider status page |
| P95 latency breach | P95 response time >2× of 7-day baseline for >5 min | P2: investigate within 30 min | Check trace for retry inflation or prompt length increase |
| Quality score drop | Rolling 1-hour judge avg drops >0.5 below baseline | P2: review within 1 hour | Sample recent failing traces; check for new input patterns |
| Cost anomaly | Hourly cost >2× the 7-day hourly average | P2: investigate within 1 hour | Check for token count spike; look for runaway agent loops |
| Quality drift | 7-day rolling avg drops >0.3 below 30-day avg | P3: review in next working day | Analyze input distribution shifts; plan golden set expansion |
- Requests/min + error rate
- P50 / P95 / P99 latency
- Format compliance % (last 100 requests)
- Current $ cost/hour
- Active provider + fallback status
- Rolling average judge scores per dimension
- Pass rate on sampled traffic
- Failure category breakdown (format / content / safety)
- Token trend (avg tokens/query over time)
- Cost per query trend
- Recent low-score samples (judge score <3)
- Format failures with error detail
- High-latency traces (p99 outliers)
- High-cost requests (>$0.10/req)
- Link to full trace for each row
- % of traffic per model (mini vs frontier)
- Routing correctness (are cheap queries going to cheap model?)
- Cache hit rate
- Batch API vs sync API split
∑ Chapter 07 — Key Takeaways
- Production monitoring has three layers: hard metrics (minutes), quality sampling (hours), drift detection (days)
- Best sampling mix: 1-5% random + 100% triggered (anomalous requests) + 10-20% immediately post-deploy
- Online eval pipeline must be fully asynchronous; never block user responses on evaluation
- Three drift types to monitor: quality drift (judge scores), distribution drift (query topics), cost drift (tokens/query)
- P1 alerts: format failure >5%, error rate >10% → page on-call. P2: latency 2×, quality drop 0.5 → review within 1h
- Dashboard covers four panels: real-time health, quality trend, failure explorer, model routing
Debugging LLM systems is qualitatively different from debugging traditional software. There is no stack trace for "the model gave a wrong answer." Systematic debugging requires traces, structured reproduction, and an understanding of LLM failure taxonomy; otherwise you're changing prompts at random and hoping.
The most common debugging mistake in LLM systems is jumping straight to "fix the prompt" without first diagnosing which component failed and why. A wrong answer in a RAG system could be a retrieval failure, a prompting failure, a model capability failure, or a parsing failure, and each has a different fix.
| Failure Class | Symptoms | Root Component | Diagnostic |
|---|---|---|---|
| Format failure | Parser throws, missing fields, wrong data types | Output parsing / LLM output | Check raw LLM response before parsing; use structured output mode |
| Instruction violation | Model ignores a constraint (language, length, tone, field) | Prompt / system prompt priority | Test prompt in isolation; check instruction placement (start/end wins) |
| Hallucination | Model states facts not in the provided context or training data | Model + retrieval (for RAG) | Check if retrieval returned relevant docs; test with oracle context |
| Retrieval failure | Correct docs not returned; answer misses key information | Embedding / vector search / chunking | Check retrieved doc IDs and scores in trace; test retrieval in isolation |
| Capability gap | Model can't perform the task regardless of prompting | Model selection | Try frontier model (GPT-4o); if frontier succeeds, route task differently |
| Context overflow | Key instructions or context silently truncated; model ignores injected content | Context management | Count tokens; check if input exceeds window; trim/summarize context |
| Tool failure | Agent calls wrong tool, with wrong args, or loop doesn't terminate | Tool schema / agent prompt | Check tool input/output in trace; reduce visible tools; tighten schemas |
Strip everything away. Test the prompt against the failing input with no context, no history, no tools. If it still fails, the issue is in the prompt itself. If it passes, the issue is in one of the stripped components.
- Test system prompt alone first
- Add context back in stages
- Pin to temperature=0 during debugging
LLMs suffer from the lost-in-the-middle effect. Instructions buried in the middle of a long prompt are often ignored or underweighted.
- Move ignored instructions to beginning or end
- Repeat critical constraints at end of prompt
- Use XML-style delimiters to separate sections
The most powerful debugging tool: find the shortest prompt that still fails. This eliminates noise and focuses attention on the actual problem.
- Start with failing case, strip tokens
- Stop stripping when failure disappears
- That stripped context is the cause
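The stripping procedure can be automated as a greedy reduction over prompt segments. This is a simple one-at-a-time variant, not a full delta-debugging algorithm; `fails` is a caller-supplied predicate (in practice it would call the model at temperature=0 and check the output), and the function name is hypothetical.

```python
from typing import Callable


def minimize_failing_prompt(segments: list[str],
                            fails: Callable[[list[str]], bool]) -> list[str]:
    """Drop segments one at a time for as long as the failure still reproduces."""
    if not fails(segments):
        raise ValueError("start from a prompt that actually fails")
    current = list(segments)
    changed = True
    while changed:
        changed = False
        for i in range(len(current)):
            candidate = current[:i] + current[i + 1:]
            if candidate and fails(candidate):
                current = candidate  # segment i was not needed for the failure
                changed = True
                break
    return current


# Toy predicate: the failure needs both conflicting instructions to be present.
def conflict(segs: list[str]) -> bool:
    return "Be concise." in segs and "Write a detailed report." in segs


prompt = ["You are a helpful assistant.", "Be concise.",
          "Use XML tags.", "Write a detailed report."]
minimal = minimize_failing_prompt(prompt, conflict)
```

Every segment left in `minimal` is one whose removal makes the failure disappear, so those segments are where to look for the cause. Each call to `fails` costs one model call, so segment at the paragraph or instruction level rather than per token.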
System prompt says "be concise." User message context implies long output. Model follows whichever is statistically stronger, often the one that appears nearest the end.
- Audit for contradictory instructions
- System prompt wins when explicit
- Add "regardless of the input length" clarifiers
| Hallucination Type | Cause | Fix |
|---|---|---|
| Factual hallucination | Model fills gaps with plausible-sounding facts not in training data | Add explicit "say I don't know if uncertain" instruction. Use RAG with grounding check. Add citation requirement. |
| Context hallucination (RAG) | Retrieved docs don't contain the answer; model extrapolates from partial information | Improve retrieval (hybrid search, re-ranking). Add: "Only use information from the provided documents." |
| Confident wrong answer | Model lacks uncertainty calibration; outputs high confidence regardless | Prompt: "If you are not certain, explicitly say so before answering." Add LLM-as-judge calibration eval. |
| Temporal hallucination | Model answers about post-training events as if it knows them | Add training cutoff date to system prompt. Provide current date. Tell model to acknowledge cutoff. |
| Structural hallucination | Model invents required fields not in source (e.g. JSON fields it was asked for but source lacks) | Add: "Leave fields as null if information is not present; do not guess." Use structured output with Optional fields. |
To diagnose RAG hallucinations precisely: replace the retrieved context with the oracle answer verbatim and re-run the query. If the model now answers correctly, the problem is retrieval: your docs are wrong, missing, or irrelevant. If the model still hallucinates even with the correct context in front of it, the problem is prompting: the model isn't being told to stay grounded.
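The oracle-context test can be expressed as a small decision function. In this sketch `ask` and `is_correct` are caller-supplied stand-ins for the real model call and the real correctness check; all names are hypothetical.

```python
from typing import Callable


def diagnose_rag_failure(ask: Callable[[str, str], str],
                         question: str,
                         retrieved_context: str,
                         oracle_context: str,
                         is_correct: Callable[[str], bool]) -> str:
    """Return "not reproduced", "retrieval", or "prompting"."""
    if is_correct(ask(question, retrieved_context)):
        return "not reproduced"  # the original failure did not recur
    if is_correct(ask(question, oracle_context)):
        return "retrieval"       # right context fixes it: retrieval is at fault
    return "prompting"           # still wrong with the right context


# Stub model for illustration: answers correctly only if the context has the fact.
def stub_ask(question: str, context: str) -> str:
    return "Paris" if "Paris" in context else "Lyon"


verdict = diagnose_rag_failure(
    stub_ask,
    "What is the capital of France?",
    retrieved_context="France is a country in Europe.",
    oracle_context="The capital of France is Paris.",
    is_correct=lambda answer: answer == "Paris",
)
```

With the stub above, `verdict` is `"retrieval"`, because supplying the oracle context is enough to fix the answer.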
In multi-step pipelines, errors compound: a bad Step 1 output becomes the corrupted input to Step 2. The failure often surfaces in Step 3 or 4 but was caused in Step 1. Traces are the only way to see this clearly.
Test each step independently with ideal inputs. If Step 2 works perfectly with handcrafted input, the bug is in Step 1's output. Walk the pipeline backwards from the failure point.
Log every intermediate output, not just the final result. Without step-by-step traces, you can't see where the corruption entered. This is non-negotiable for multi-step systems.
Validate every step output before passing to the next step. A format error in Step 2 that isn't caught there will manifest as a confusing error in Step 5. Fail fast, fail clearly.
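A minimal sketch of a pipeline runner that logs every intermediate output and validates each step before handing its result to the next one. The `Step` tuple layout and the `StepValidationError` class are illustrative assumptions.

```python
from typing import Any, Callable, Optional

# (step name, step function, validator returning an error message or None)
Step = tuple[str, Callable[[Any], Any], Callable[[Any], Optional[str]]]


class StepValidationError(Exception):
    def __init__(self, step: str, problem: str, trace: list):
        super().__init__(f"{step}: {problem}")
        self.step = step
        self.trace = trace  # intermediate outputs up to and including the failure


def run_pipeline(steps: list[Step], initial: Any) -> tuple[Any, list]:
    """Run steps in order; fail at the first step whose output is invalid."""
    trace = []
    value = initial
    for name, run, validate in steps:
        value = run(value)
        trace.append((name, value))  # log every intermediate output
        problem = validate(value)
        if problem is not None:
            raise StepValidationError(name, problem, trace)
    return value, trace


steps: list[Step] = [
    ("extract",
     lambda text: {"amount": text.split()[-1]},
     lambda out: None if out["amount"].isdigit() else "amount is not numeric"),
    ("double",
     lambda out: int(out["amount"]) * 2,
     lambda out: None),
]

result, trace = run_pipeline(steps, "total 21")
```

With input `"total n/a"` the runner raises at `extract` with a clear message, instead of letting `int("n/a")` blow up one step later with a less helpful error.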
∑ Chapter 08 — Key Takeaways
- Debugging workflow: reproduce → isolate component → classify failure → root cause → fix → verify → add to golden set
- Seven failure classes: format, instruction violation, hallucination, retrieval, capability gap, context overflow, tool failure
- Prompt isolation technique: strip everything, test alone, add components back; this narrows the failure to one layer
- Lost-in-the-middle: move ignored instructions to the beginning or end; buried instructions are often ignored
- Grounding test: replace retrieved context with oracle answer; if model answers correctly, bug is retrieval; if still wrong, bug is prompting
- Multi-step debugging: log every intermediate output, walk backwards from failure, validate each step before passing to the next
An eval pipeline you run manually will eventually not get run. Automated eval in CI is the only sustainable enforcement mechanism: it gates every merge, runs without human intervention, and creates an auditable quality history for every change to your LLM system.
The standard CI pattern: run deterministic checks (fast, free) on every commit; run LLM judge eval on every PR; block merge if either fails.
Only run LLM eval when LLM-related files change. Don't waste API calls and eval time on CSS or docs changes.
- Trigger on: prompts/**, src/llm/**, evals/**
- Skip on: docs/**, *.md, styles/**
- Saves 80%+ of unnecessary eval runs
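CI systems usually provide path filters natively, but the decision itself is easy to sketch. The version below uses Python's `fnmatch`, where `*` also matches `/`, so `prompts/**` covers nested paths. Letting skip patterns take precedence over trigger patterns is an assumption; the function name is hypothetical.

```python
from fnmatch import fnmatchcase

TRIGGER_PATTERNS = ["prompts/**", "src/llm/**", "evals/**"]
SKIP_PATTERNS = ["docs/**", "*.md", "styles/**"]


def needs_llm_eval(changed_files: list[str]) -> bool:
    """True if any changed file is LLM-related and not covered by a skip pattern."""
    for path in changed_files:
        if any(fnmatchcase(path, pattern) for pattern in SKIP_PATTERNS):
            continue  # assumption: skip patterns win over trigger patterns
        if any(fnmatchcase(path, pattern) for pattern in TRIGGER_PATTERNS):
            return True
    return False
```

For example, a PR touching only `styles/main.css` and `docs/guide.md` skips the eval, while one touching `src/llm/router.py` runs it.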
Post eval results as a PR comment so reviewers see the quality impact inline, not buried in CI logs.
- Accuracy: 94.2% (baseline: 93.8%) ✅
- Format compliance: 100% ✅
- Avg tokens: 847 (baseline: 812) ⚠️ +4%
- Cost per run: $0.73
Store the baseline JSON in the repo. When you intentionally improve quality, update the baseline, making new improvements the new floor.
- evals/baseline.json in version control
- Update with python evals/update_baseline.py
- Commit baseline update with the change that caused it
100 golden set cases × GPT-4o judge × 3 dimensions = ~300 judge calls × $0.005 = $1.50 per PR eval run. On an active team with 20 PRs/day, that's $30/day or ~$900/month. Optimize: use GPT-4o-mini as judge where calibrated (often works as well at 10× lower cost), run full eval only on significant prompt/model changes, use path triggers to skip irrelevant PRs. Track eval pipeline cost as a project metric.
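The arithmetic above is worth keeping as a small calculator so the estimate can be re-run when the golden set grows or the judge changes. The $0.005 per judge call is the figure used in the text, not a quoted provider price; the function name and the 30-day month are assumptions.

```python
def eval_cost_per_run(cases: int, dimensions: int, cost_per_judge_call: float) -> float:
    """One judge call per case per rubric dimension."""
    return cases * dimensions * cost_per_judge_call


per_run = eval_cost_per_run(cases=100, dimensions=3, cost_per_judge_call=0.005)
per_day = per_run * 20    # 20 PRs per day
per_month = per_day * 30  # assuming a 30-day month

# Same eval with a judge that is 10x cheaper per call.
cheap_per_month = eval_cost_per_run(100, 3, 0.0005) * 20 * 30
```

These values work out to about $1.50 per run, $30 per day and $900 per month, dropping to about $90 per month with the cheaper judge.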
Prompts change frequently and silently break things. Without version control and an eval audit trail, you have no history of what changed, when, and what effect it had on quality.
- All prompt templates (system + user)
- Model selection per feature
- Eval golden set (as code)
- Baseline scores per golden set version
- Judge prompts and rubrics
- Model API parameters (temperature, max_tokens)
What changed: System prompt, added instruction to cite sources
Why: Reduce hallucination rate in doc QA feature
Eval result: Groundedness: 3.8 → 4.2 (+0.4). Accuracy: no change.
Cost impact: +15 tokens/query (avg)
Baseline updated: Yes (new floor)
∑ Chapter 09 — Key Takeaways
- Eval in CI has four gates: PR gate (L1 + L2 judge) → staging gate (full set) → deploy (canary) → production (online eval)
- Use path-based triggers: only run LLM eval when LLM-related files change; saves 80%+ of unnecessary API calls
- Post eval results as PR comment: accuracy delta, format compliance, token count change, cost per run
- CI eval cost: ~$1.50/run at 100 cases × GPT-4o judge × 3 dimensions; use GPT-4o-mini judge where calibrated (10× cheaper)
- Prompt versioning: treat prompts, model selection, golden set, baselines, and judge configs as version-controlled code
- Every prompt change commit should record: what changed, why, eval result, cost impact, whether baseline was updated
The eval and observability tooling ecosystem has matured rapidly. You don't need to build everything from scratch, but you do need to pick the right tools for your stack. The wrong tool choice leads to vendor lock-in, missing features, or paying for capabilities you don't need.
Capture, store, and visualize traces from production. Focus: what happened in this specific request?
- LangSmith: LangChain ecosystem
- Langfuse: open-source, self-hostable
- Arize Phoenix: local-first, OTel native
- Custom OTel + Grafana
Run eval pipelines, compare prompt versions, gate deployments. Focus: is this better or worse than before?
- promptfoo: open-source CI eval
- Braintrust: eval + experiment tracking
- RAGAS: RAG-specific eval metrics
- LangSmith: datasets + eval integrations
Version, store, and deploy prompts. Focus: which prompt version is in production right now?
- LangSmith Hub: prompt registry
- Langfuse Prompts: prompt versioning + A/B
- Braintrust: prompt snapshots
- Git + plain files: the simplest option
Track aggregate quality, cost, and usage over time. Focus: is quality trending up or down this week?
- Langfuse: cost dashboard built-in
- Custom Grafana dashboards: from OTel metrics
- Provider dashboards: OpenAI, Anthropic usage pages
| Tool | Core Strength | Tracing | Eval | Prompt Mgmt | Hosting | Cost |
|---|---|---|---|---|---|---|
| LangSmith | End-to-end LangChain observability | Yes: native | Yes: datasets + CI | Yes: Hub | SaaS only | Free tier + paid plans |
| Langfuse | Self-hostable full-stack LLM observability | Yes: SDK + OTel | Yes: built-in + RAGAS | Yes: versioning + A/B | Self-hosted or SaaS | Free self-hosted |
| promptfoo | CI-first eval framework | No | Yes: best-in-class CLI + CI | Partial: basic | Open-source CLI | Free (open-source) |
| Braintrust | Eval + experiment tracking | Partial: basic spans | Yes: strong eval + A/B | Yes: prompt snapshots | SaaS only | Paid (usage-based) |
| Arize Phoenix | Local-first debugging + evals | Yes: OTel native | Yes: evals + embedding viz | No | Local + cloud | Free local tier |
| RAGAS | RAG-specific eval metrics | No | Yes: RAG metrics only | No | Open-source library | Free (open-source) |
| Consideration | Choose SaaS | Choose Self-Hosted |
|---|---|---|
| Data sensitivity | Prompts contain no PII / confidential data | Prompts contain PII, IP, or regulated data (HIPAA, GDPR) |
| Team size / infra | Small team, no dedicated infra engineer | Mature infra team; existing K8s / monitoring stack |
| Time to value | Need tracing working in hours, not days | Can accept 1-2 days for initial setup |
| Cost at scale | SaaS costs rise linearly; large volumes become expensive | Fixed infra cost amortizes over volume |
| Compliance & audit | Provider compliance certifications sufficient | Need full data residency control and audit logs |
Start with promptfoo for CI eval (open-source, no data leaves your system) + Langfuse self-hosted for tracing and online eval (Docker Compose in 10 minutes, free, your data stays local). This covers 90% of production evaluation needs at zero ongoing cost. Graduate to SaaS tools (LangSmith, Braintrust) if you need richer integrations with LangChain or dedicated eval UX for a larger team.
The best observability stack for most teams is not one monolithic tool; it's a lightweight integration of best-of-breed components, each doing one thing well.
- promptfoo: CI eval gate on PRs
- Langfuse (self-hosted): tracing + online eval
- Git: golden set + prompt versioning
- Grafana: dashboards from OTel/Langfuse metrics
- LangSmith or Braintrust: team eval UI
- Custom OTel + existing APM (Datadog)
- Internal prompt registry + CI gates
- RAGAS for RAG-specific metrics
- LangSmith free tier โ instant setup
- promptfoo โ CI eval (free)
- Provider dashboards for cost tracking
- Upgrade to self-hosted when data sensitivity requires it
Chapter 10: Key Takeaways
- Four tool categories: tracing, evaluation/testing, prompt management, analytics/cost. Teams often need one from each
- Best-in-class: promptfoo (CI eval), Langfuse (tracing + online eval, self-hostable), RAGAS (RAG metrics), Braintrust (eval UX)
- Self-host when: PII/confidential prompts, regulated data, HIPAA/GDPR, high volume. Use SaaS when: small team, speed needed, no PII
- Recommended starting stack: promptfoo + Langfuse self-hosted + Git, which covers 90% of needs at zero ongoing cost
- Langfuse tracing integration: 4 lines of code; the drop-in OpenAI replacement auto-traces all calls
- Track eval pipeline cost itself as a project metric: it can reach $900+/month on active teams if unmanaged
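
To track eval pipeline cost as a metric, it helps to have a rough estimator. The following is a minimal sketch under stated assumptions: the `EvalRunProfile` fields and the token price are hypothetical illustration values, not figures from any provider, and the model counts only LLM-judge tokens.

```python
from dataclasses import dataclass


@dataclass
class EvalRunProfile:
    """Hypothetical description of how an eval suite is run."""
    cases: int                  # golden-set size
    judge_calls_per_case: int   # LLM-judge calls made per case
    tokens_per_judge_call: int  # prompt + completion tokens per call
    runs_per_month: int         # CI runs plus scheduled runs


def monthly_eval_cost(profile: EvalRunProfile, usd_per_1m_tokens: float) -> float:
    """Estimated monthly judge spend in USD for the given profile."""
    total_tokens = (
        profile.cases
        * profile.judge_calls_per_case
        * profile.tokens_per_judge_call
        * profile.runs_per_month
    )
    return total_tokens / 1_000_000 * usd_per_1m_tokens


# Assumed example values; substitute your own measurements.
profile = EvalRunProfile(cases=200, judge_calls_per_case=2,
                         tokens_per_judge_call=1500, runs_per_month=100)
print(f"Estimated eval spend: ${monthly_eval_cost(profile, usd_per_1m_tokens=5.0):,.2f}/month")
```

Because the cost is a product of four factors, doubling the run frequency or the golden-set size doubles the bill, which is how unmanaged pipelines grow expensive.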
A production LLM system requires all four evaluation layers working together. Each layer serves a different purpose and catches failures the others miss. This is the minimum architecture, not an aspirational target.
- Offline: prompt regressions, model update breakage, format regressions
- Online: distribution shift, new input patterns, subtle quality erosion
- Monitoring: drift over time, cost anomalies, latency spikes
- Feedback: known-failure regression prevention; evolving coverage
- No offline eval: prompt changes break things silently
- No online eval: distribution shift goes undetected for weeks
- No monitoring: cost spikes and quality drifts are invisible until too late
- No feedback loop: same failures recur; golden set never improves
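
As an illustration of the monitoring layer, here is a minimal sketch of a rolling-window drift check. It assumes each production response already has a quality score between 0 and 1 (for example from an online judge); the class name, window size, and threshold are hypothetical choices, not a prescribed design.

```python
from collections import deque
from statistics import mean


class QualityDriftMonitor:
    """Flags drift when the rolling mean score falls too far below a baseline."""

    def __init__(self, baseline: float, window: int = 100, max_drop: float = 0.05):
        self.baseline = baseline
        self.max_drop = max_drop
        self.scores = deque(maxlen=window)  # keeps only the most recent scores

    def record(self, score: float) -> bool:
        """Add one score; return True if the full window has drifted below baseline."""
        self.scores.append(score)
        if len(self.scores) < self.scores.maxlen:
            return False  # not enough data yet to judge
        return self.baseline - mean(self.scores) > self.max_drop


monitor = QualityDriftMonitor(baseline=0.90, window=5, max_drop=0.05)
for score in [0.9, 0.9, 0.9, 0.9, 0.9, 0.7, 0.7, 0.7, 0.7, 0.7]:
    if monitor.record(score):
        print("Quality drift detected: alert and investigate")
```

A real deployment would usually export the rolling mean as a metric and alert from the dashboard layer instead of printing.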
Human inspection does not scale. Reading 10 responses and concluding "it looks good" is not evaluation; it is survivorship bias. The cases you inspect are rarely the cases that fail in production.
- May fail on the specific edge cases your users actually send
- May pass today but degrade silently after the next model update
- May hallucinate confidently on topics that are rare in your test set
Human inspection tells you about the inputs and outputs you chose to look at. It tells you almost nothing about the distribution of inputs you haven't seen.
- A statistical signal over real inputs, not a cherry-picked sample
- A baseline to detect regression, not a subjective "feels better"
- A continuous production signal, not a one-time check before launch
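
To see why a handful of inspected responses is not a statistical signal, a confidence interval on the pass rate helps. The sketch below uses the Wilson score interval, which is one common choice for proportions and is an assumption here, not a method the text prescribes.

```python
import math


def wilson_interval(passes: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Approximate 95% Wilson score interval for a pass rate (z=1.96)."""
    if total <= 0 or not 0 <= passes <= total:
        raise ValueError("need total > 0 and 0 <= passes <= total")
    p = passes / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return max(0.0, center - half), min(1.0, center + half)


for passes, total in [(10, 10), (942, 1000)]:
    low, high = wilson_interval(passes, total)
    print(f"{passes}/{total} passed: true pass rate plausibly in [{low:.3f}, {high:.3f}]")
```

With 10 out of 10 passing, the lower bound is only about 0.72, so the system could still be failing roughly one request in four. With 942 out of 1,000, the interval narrows to roughly 0.93 to 0.95.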
Measurement turns "I think it works" into "it passes 94.2% of cases with format compliance at 99.1%." One is a guess. The other is an engineering decision.
- Every prompt change: measure before and after
- Every model upgrade: run full eval suite first
- Every production deployment: have an online eval signal within hours
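
The "measure before and after" rule can be enforced with a small regression gate. This is a minimal sketch assuming each eval case yields a boolean pass/fail; the `max_drop` tolerance of 2 percentage points is a hypothetical default, and the function names are illustrative.

```python
def pass_rate(results: list[bool]) -> float:
    """Fraction of eval cases that passed."""
    if not results:
        raise ValueError("results must not be empty")
    return sum(results) / len(results)


def regression_gate(baseline: list[bool], candidate: list[bool],
                    max_drop: float = 0.02) -> tuple[bool, str]:
    """Return (ok, message); ok is False if the candidate regresses beyond max_drop."""
    before, after = pass_rate(baseline), pass_rate(candidate)
    ok = before - after <= max_drop
    verdict = "PASS" if ok else "FAIL"
    return ok, f"{verdict}: pass rate {before:.1%} -> {after:.1%} (allowed drop {max_drop:.1%})"


# Hypothetical results from running the same golden set on two prompt versions.
baseline_results = [True] * 94 + [False] * 6
candidate_results = [True] * 90 + [False] * 10

ok, message = regression_gate(baseline_results, candidate_results)
print(message)
```

In CI, a failing gate would typically exit non-zero so the prompt change or model upgrade cannot merge without review.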
If you wouldn't deploy a backend service without monitoring and error rate tracking, don't deploy an LLM system without eval pipelines and quality dashboards.
If you are not measuring it, you are not controlling it. LLM quality is not self-evident. It is measured, tracked, and defended with golden sets, judges, traces, dashboards, and feedback loops. Every component in this guide exists because "it seemed fine" was not enough.