LLM System Design

Architecture patterns for LLM applications — from single-model APIs to multi-model orchestration, scaling, and production infrastructure.

Building with LLMs is not like building traditional software. Latency is measured in seconds, costs scale with tokens, and outputs are non-deterministic. This guide teaches you to design systems that work despite — and because of — these constraints.

Chapter 01 · Foundations

LLM Design Principles — What Makes LLM Systems Different

Every design decision in an LLM system is shaped by three constraints that don't exist in traditional software: non-deterministic outputs, second-scale latency, and per-token costs. Understanding these isn't optional — it's the foundation everything else builds on.

LLM Systems Are Controlled Systems — The Core Mental Model Foundation

LLMs are not reliable components by default. A production LLM system is not just LLM + prompt. It is:

The four layers every production LLM system requires

🎛️

Without a Control Layer

Systems become unpredictable, expensive, and hard to debug. The LLM produces output; the control layer decides whether to trust it.

Costs spike without warning
Errors cascade silently
No visibility into failures

⚙️

What the Control Layer Enforces

Limits retries, enforces timeouts, tracks token usage, terminates loops, validates output format before downstream use.

Max retries: 2–3 per call
Per-call timeout: 10–30s
Token budget per request

📐

Evaluation Is Not Optional

Without an eval layer, you can't know if your system is working or regressing. Every production system needs a baseline quality signal.

Offline evals in CI
Online quality sampling
Cost-per-quality tracking

The Three Constraints — Why LLM Systems Are Different Foundation

Traditional software is deterministic, fast, and free at the margin. LLM software is none of these. Every architecture decision must account for these three constraints — or the system will be unreliable, slow, or unaffordable.

The three constraints that shape every LLM system design decision

Traditional Software

Deterministic: f(x) = y, always

Fast: <50ms response times typical

Cheap: Fixed infra cost, marginal cost ~$0

Testable: Unit tests verify exact behavior

LLM Software

Non-deterministic: f(x) ≈ y (different each time)

Slow: 200ms–5s per call, seconds for complex tasks

Expensive: Per-token pricing, cost grows with usage

Hard to test: Eval suites, fuzzy matching, LLM-as-judge

The Missing Layer — Control Core

Every production LLM system includes a control layer — whether engineers planned for it or grew it reactively. This layer is responsible for bounding cost, latency, and failure cascades that the LLM itself cannot prevent.

① Request: enters system

② LLM Call: non-deterministic

③ Validate: format + content

④ Retry? If failed, max 2–3×

⑤ Fallback: cheaper model / cache

⑥ Return: controlled output

Control Responsibility	Without It	Implementation
Retry limiting	Infinite retries → runaway cost	Max 2–3 retries with exponential backoff
Timeout enforcement	Stalled requests block workers forever	Hard timeout per call (10–30s)
Token tracking	A single request can burn the budget	Count tokens before and after each call
Loop termination	Agent loops run indefinitely	Max steps cap (e.g. 10) + cost circuit breaker
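The retry and timeout rows above can be sketched as a small wrapper. This is a minimal illustration, not a definitive implementation: `call_with_controls`, `TransientLLMError`, and `LLMCallFailed` are hypothetical names, and `call` stands in for whatever async function invokes your provider.

```python
import asyncio
import random


class TransientLLMError(Exception):
    """Hypothetical error type for retryable failures (rate limit, 5xx, bad format)."""


class LLMCallFailed(Exception):
    """Raised once every allowed attempt has failed or timed out."""


async def call_with_controls(call, *, max_retries=2, timeout_s=20.0, base_delay_s=0.5):
    """Run `call()` with a hard per-attempt timeout and a capped number of retries."""
    last_error = None
    for attempt in range(max_retries + 1):
        try:
            # Hard timeout: a stalled provider call cannot block the worker forever.
            return await asyncio.wait_for(call(), timeout=timeout_s)
        except (asyncio.TimeoutError, TransientLLMError) as exc:
            last_error = exc
            if attempt < max_retries:
                # Exponential backoff (0.5s, 1s, 2s, ...) plus up to 10% jitter.
                delay = base_delay_s * (2 ** attempt)
                await asyncio.sleep(delay + random.uniform(0, delay * 0.1))
    raise LLMCallFailed(f"gave up after {max_retries + 1} attempts") from last_error
```

Token tracking and loop termination would sit one level up, around the whole request rather than a single call.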

Designing for Non-determinism — The Core Challenge Core

In traditional software, if the code is correct, the output is correct. In LLM systems, even correct code can produce wrong outputs. The model might hallucinate, return malformed JSON, miss a key fact, or give a subtly wrong answer. Your system must handle this gracefully.

🔄

Output Variability

Same prompt, same model, temperature=0 → outputs can still differ slightly across calls. Structure your system to tolerate variation.

Parse loosely, validate strictly
Use structured output (JSON mode)
Retry on format failures

❌

Failure as Normal

LLM calls fail regularly: hallucination, refusal, rate limit, malformed output. Unlike traditional APIs, failure is not exceptional — it's expected.

Design retry + fallback for every call
Validate outputs before trusting them
Never trust LLM output as ground truth

✅

Evaluation-Driven Development

You can't unit-test LLM outputs. You need eval suites — sets of (input, expected_output) scored by automated metrics or LLM-as-judge.

Build eval set before building features
Run evals on every prompt change
Treat evals like integration tests

The Design Principle

Never make an LLM call a single point of failure. Every LLM call in your system should have: (1) output validation, (2) a retry strategy, (3) a fallback path (cheaper model, cached result, or graceful "I don't know"). Systems that treat LLM calls like database queries — reliable and deterministic — break in production.

Behavioral Failures — Beyond Wrong Answers In-depth

LLM failures are not limited to incorrect answers. The subtler class of behavioral failures is harder to detect and more dangerous in production because they often pass naive validation.

Behavioral Failure	What It Looks Like	Detection Strategy
Repeated mistakes across retries	Each retry returns the same wrong answer — retrying is futile	Hash outputs across retries; escalate if identical failures
Partially correct but misleading	90% correct + 10% confidently wrong — harder to catch than fully wrong	LLM-as-judge or grounding checks on key claims
Instruction ignored	Model responds in wrong language, skips a required field, ignores constraints	Schema validation + presence checks on required fields
Overconfident wrong answer	Model says "definitely X" when it should say "I don't know"	Calibration eval; add explicit uncertainty instructions
Inconsistent outputs	Same question → different answers across sessions (inconsistent brand voice, logic)	Consistency eval suite; lock temperature to 0 for deterministic tasks

Retrying an Already-Failed Prompt Is Usually Wrong

If a model returns the same malformed output on three consecutive retries, more retries will not help — the prompt or schema is the problem. Build retry logic that modifies the prompt on failure (e.g. appending "Return valid JSON. Error was: …") rather than blindly resending the same request. Blind retries multiply cost with zero benefit.

LLM Latency Characteristics — It's Not Like a REST API In-depth

LLM latency isn't a single number. It has two distinct phases, and understanding them changes how you design your system.

LLM latency breakdown — TTFT + generation time

Latency Factor	Impact	Design Implication
Input tokens (prompt length)	+50–200ms per 1K tokens of input	Keep prompts short. Cache system prompts. Compress context.
Output tokens	+20–80ms per output token	Set `max_tokens` aggressively. Ask for concise responses.
Model size	Larger model = slower	Route simple tasks to smaller/faster models.
Provider load	Peak times = 2–3× latency	Multi-provider failover, request queuing.
Streaming	Reduces perceived latency to TTFT (~200ms)	Always stream for user-facing responses.
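A back-of-the-envelope estimate makes the two phases concrete. The sketch below is only a rough model: the default rates are assumed mid-range values taken from the table above, the function name is hypothetical, and network overhead, queuing, and provider load are ignored.

```python
def estimate_latency_ms(input_tokens, output_tokens, *,
                        ms_per_1k_input=100.0, ms_per_output_token=40.0):
    """Return (ttft_ms, generation_ms, total_ms) under a simple linear model."""
    ttft_ms = input_tokens / 1000 * ms_per_1k_input       # prompt processing
    generation_ms = output_tokens * ms_per_output_token   # token-by-token decoding
    return ttft_ms, generation_ms, ttft_ms + generation_ms


ttft, generation, total = estimate_latency_ms(2000, 100)
# ttft = 200.0 ms, generation = 4000.0 ms, total = 4200.0 ms
```

Under these assumptions generation dominates, which is why capping output length and streaming matter more than trimming a few hundred prompt tokens.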

The LLM Cost Model — Pay Per Token Core

In traditional SaaS, you pay for servers. In LLM systems, you pay for tokens — every input token and every output token, every call. This fundamentally changes how you think about system design.

Model (2024–2025)	Input Cost	Output Cost	Cost per 1K queries (avg 2K in + 500 out)
GPT-4o	$2.50 / 1M tokens	$10.00 / 1M tokens	$10.00
GPT-4o-mini	$0.15 / 1M tokens	$0.60 / 1M tokens	$0.60
Claude 3.5 Sonnet	$3.00 / 1M tokens	$15.00 / 1M tokens	$13.50
Claude 3 Haiku	$0.25 / 1M tokens	$1.25 / 1M tokens	$1.13
Gemini 1.5 Flash	$0.075 / 1M tokens	$0.30 / 1M tokens	$0.30

The cost gap matters. GPT-4o costs roughly 17× more than GPT-4o-mini per query ($10.00 vs $0.60 per 1K queries). If 70% of your queries are simple enough for mini, routing them saves about two-thirds (~66%) of your LLM spend. Model routing is the single highest-ROI design decision for cost optimization.
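The arithmetic behind the table and the routing estimate can be checked directly. This sketch uses the GPT-4o and GPT-4o-mini prices from the table and the same assumed query shape (2K input tokens, 500 output tokens); the function names are illustrative.

```python
# (input USD, output USD) per 1M tokens, copied from the table above.
PRICES_PER_1M = {
    "gpt-4o": (2.50, 10.00),
    "gpt-4o-mini": (0.15, 0.60),
}


def cost_per_1k_queries(model, tokens_in=2000, tokens_out=500):
    """Cost in USD of 1,000 queries with the given average token counts."""
    price_in, price_out = PRICES_PER_1M[model]
    per_query = (tokens_in * price_in + tokens_out * price_out) / 1_000_000
    return per_query * 1000


def blended_cost(share_cheap, cheap="gpt-4o-mini", expensive="gpt-4o"):
    """Cost per 1K queries when `share_cheap` of traffic goes to the cheap model."""
    return (share_cheap * cost_per_1k_queries(cheap)
            + (1 - share_cheap) * cost_per_1k_queries(expensive))


baseline = cost_per_1k_queries("gpt-4o")   # 10.00
routed = blended_cost(0.70)                # 0.7 * 0.60 + 0.3 * 10.00 = 3.42
savings = 1 - routed / baseline            # about 0.66
```

Routing 70% of traffic to the cheap model therefore cuts spend by roughly two-thirds, before counting the cost of any classifier call used to make the routing decision.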

How Costs Spiral — System Behavior, Not Single Calls In-depth

LLM cost overruns rarely come from a single expensive call. They come from emergent system behavior — patterns that only appear at scale or under edge-case inputs.

🔁

Agent Loop Multiplication

1 user request → 5 agent steps → up to 3 attempts each → up to 15 LLM calls for what should have been 1–2.

Fix: Cap steps (max_iterations=10), cap retries (max=2), add cost circuit breaker per request.

📈

Context Window Growth

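Multi-turn conversations or document chains grow context with every turn. Turn 10 of a chat can have 10× the input tokens of turn 1, and because each turn resends the full history, cumulative cost grows roughly quadratically with conversation length. The cost structure is the same; the price is very different.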

Fix: Summarize old turns. Never pass raw history unbounded.
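A quick calculation shows how resending history adds up. This is a simplified model under stated assumptions: every turn adds the same number of tokens, and the bounded variant keeps only the last `window` turns (the tokens of any summary are ignored). The function name is hypothetical.

```python
def cumulative_input_tokens(turns, tokens_per_turn, window=None):
    """Total input tokens billed over a conversation.

    With window=None every call resends the whole history; otherwise only
    the most recent `window` turns are sent.
    """
    total = 0
    for turn in range(1, turns + 1):
        history_turns = turn if window is None else min(turn, window)
        total += history_turns * tokens_per_turn
    return total


unbounded = cumulative_input_tokens(10, 200)           # 200 * (1 + 2 + ... + 10) = 11000
bounded = cumulative_input_tokens(10, 200, window=3)   # 200 * (1 + 2 + 3 * 8) = 5400
```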

⚡

Fan-out Without Limit

MapReduce on user-uploaded documents: a 500-page PDF becomes 250 LLM map calls. Multiply by daily uploads.

Fix: Cap input size. Estimate cost before processing. Require user confirmation above threshold.

🔄

Silent Retry Storms

A bug causes 100% of requests to fail validation → all retry 3× → 4× provider load → rate limiting → more retries. Cost spikes 4× with zero benefit.

Fix: Circuit breaker: if error rate >50% in 60s, stop retrying and alert.
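One way to express that rule is a sliding-window breaker consulted before each retry. This is a sketch under assumptions: the class name, the `min_samples` guard (to avoid tripping on a handful of requests), and the injectable `clock` are all illustrative choices, not part of any library.

```python
import time
from collections import deque


class RetryCircuitBreaker:
    """Blocks retries while the recent error rate is above a threshold."""

    def __init__(self, threshold=0.5, window_s=60.0, min_samples=20, clock=time.monotonic):
        self.threshold = threshold
        self.window_s = window_s
        self.min_samples = min_samples
        self._clock = clock
        self._events = deque()  # (timestamp, failed) pairs, oldest first

    def _evict(self, now):
        # Drop events that fell out of the sliding window.
        while self._events and now - self._events[0][0] > self.window_s:
            self._events.popleft()

    def record(self, failed):
        now = self._clock()
        self._events.append((now, failed))
        self._evict(now)

    def allow_retry(self):
        self._evict(self._clock())
        if len(self._events) < self.min_samples:
            return True  # not enough data to judge
        failures = sum(1 for _, failed in self._events if failed)
        return failures / len(self._events) <= self.threshold
```

Call `record()` after every LLM call and check `allow_retry()` before retrying; when it returns False, fail fast and raise an alert instead of adding load.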

LLM-Specific Failure Modes — What Breaks In-depth

Failure Mode	Traditional Software	LLM Systems
Wrong output	Bug → fix code → fixed forever	Hallucination → tweak prompt → might recur
Slow response	Profile code → optimize → consistent improvement	Depends on provider load, token count, model — varies per call
Rate limiting	Scale horizontally, add servers	Provider-imposed limits, can't self-serve more capacity
Cost overrun	Fixed infrastructure cost, predictable	Usage-based, runaway loops can burn budget in minutes
Format errors	Type system prevents malformed data	LLM can return any string — malformed JSON, truncated output
Provider outage	Your infra, your control	Third-party dependency — OpenAI goes down, your app goes down

The Provider Dependency

Most LLM systems depend on 1–2 providers (OpenAI, Anthropic). When they have outages — and they do, regularly — your entire application stops. Production systems need multi-provider failover: primary model (GPT-4o) → fallback model (Claude Sonnet) → degraded mode (cached responses or smaller model). Chapter 3 covers model selection and routing.

Seven Design Principles for LLM Systems Core

①

Minimize LLM Calls

Every LLM call costs money and time. Ask: can this be done without calling the LLM? Can I cache the result? Can I batch multiple queries?

②

Use the Cheapest Model That Works

Don't use GPT-4o for classification. Don't use Claude Opus for extraction. Route each task to the cheapest model that achieves acceptable quality.

③

Validate Every Output

LLMs can return anything. Parse, validate, and type-check every response. Use JSON mode / structured outputs where available.

④

Stream Everything User-Facing

A 3-second response feels fast when streamed token-by-token. Without streaming, users stare at a blank screen for 3 seconds. Always stream.

⑤

Design for Failure

Every LLM call can fail: hallucination, rate limit, timeout, malformed output. Retry, fallback, degrade gracefully — never crash on LLM failure.

⑥

Measure Everything

Track latency, cost, quality, and error rates per model, per endpoint, per prompt version. You can't optimize what you can't measure.

⑦

Build the Simplest System That Works

Don't start with agents, multi-model routing, and semantic caching. Start with a single model, a simple prompt, and an eval set. Add complexity only when measurement shows you need it. The best LLM system is the one with the fewest LLM calls.

LLM Application Reference Architecture Core

Generic LLM application architecture — all the layers

Every LLM application has these layers, whether you build them explicitly or not. The chapters that follow cover each in depth: architecture patterns (Ch 2), model selection (Ch 3), API design (Ch 4), caching (Ch 5), scaling (Ch 6), latency (Ch 7), cost (Ch 8), infrastructure (Ch 9), and real-world case studies (Ch 10).

What Must Be Logged — Minimum Observability for LLM Systems Core

Without structured logging, debugging LLM failures is guesswork. Every call must emit a structured log entry. For multi-step systems, every intermediate step must be logged — not just the final output.

Single-call minimum log

• request_id — trace across services

• user_id — cost attribution

• prompt — sanitized if PII possible

• response — actual output

• tokens_in / tokens_out — cost calculation

• latency_ms — performance tracking

• model — which model was used

• retries — number of retry attempts

• fallback_triggered — boolean

Multi-step / agent additional log

• step_index — which step in the chain

• step_type — llm_call / tool_call / validate

• tool_name + tool_input + tool_output

• intermediate_output — per step

• total_cost_usd — running total

• terminated_early — if circuit breaker fired

• loop_count — for agent iterations

• parent_request_id — for sub-calls
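The single-call fields above map naturally onto a small record type. The sketch below is one possible shape, assuming JSON lines as the log format; the class name and the example values are hypothetical.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class LLMCallLog:
    """One structured log entry per LLM call (fields from the list above)."""
    request_id: str
    user_id: str
    prompt: str            # sanitize before logging if PII is possible
    response: str
    tokens_in: int
    tokens_out: int
    latency_ms: float
    model: str
    retries: int = 0
    fallback_triggered: bool = False

    def to_json(self):
        # Sorted keys keep log lines stable and easy to diff.
        return json.dumps(asdict(self), sort_keys=True)


entry = LLMCallLog(
    request_id="req-1", user_id="user-42", prompt="Classify: ...",
    response="positive", tokens_in=120, tokens_out=3,
    latency_ms=412.5, model="example-model",
)
line = entry.to_json()
```

Multi-step systems could extend the same record with `step_index`, `step_type`, and `parent_request_id` rather than inventing a second format.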

Hidden Complexity — The Application Layer Grows Fastest In-depth

In the architecture diagram, the Application Layer appears as a thin band. In practice, it becomes the largest and most complex part of the system. Unlike the LLM layer (a managed API) and the data layer (a database), the application layer is entirely custom code that grows with every feature.

The Application Layer Tax

The application layer accumulates: prompt templates, output parsers, validation schemas, retry logic, routing rules, cost tracking hooks, fallback chains, streaming wrappers, and tool dispatch. Each added to handle a specific failure in production. Design it to be modular from day one — test each component independently, make each observable, and version your prompt templates. A monolithic application layer is the #1 source of debugging pain in mature LLM systems.

Start Simple — The Most Important Principle Core

1️⃣ Single model: one prompt, one call

2️⃣ Add eval: measure quality

3️⃣ Add caching: reduce cost/latency

4️⃣ Add routing: cheap model for simple tasks

5️⃣ Add RAG: if knowledge needed

The Complexity Tax

Every component you add (RAG, routing, caching, agents) adds failure modes, latency, and maintenance burden. A single GPT-4o call with a good prompt can often outperform an overengineered pipeline with multiple models and retrieval steps. Only add complexity when your eval suite proves simple isn't good enough.

What Most Systems Don't Need Core

Most production LLM systems do not need agents, multi-model orchestration, or complex pipelines. A single well-designed prompt and model solves the majority of use cases — at lower cost, lower latency, and higher reliability.

Component	Add It When…	Don't Add It Because…
Agents / orchestration	Control flow is genuinely dynamic; tool use is required	It looks powerful or is trending
Multi-model routing	Evals show 1 model can't handle all task types and costs differ	You want to use multiple providers
RAG pipeline	Knowledge is too large or dynamic for the context window	You have <100 documents
Semantic caching	Exact cache hit rate <5% and queries are repetitive in meaning	You haven't measured exact cache hit rate yet
Self-hosted models	Volume exceeds ~10M tokens/day or strict data privacy mandate	You want more control in principle

The best LLM system is the one with the fewest LLM calls. Every call is a source of latency, cost, and non-determinism. Reduce calls through caching, batching, and simpler architectures — and your system will be faster, cheaper, and more reliable.

∑ Chapter 01 — Key Takeaways

LLM systems are constrained by three things traditional software isn't: non-determinism, second-scale latency, and per-token costs
Never make an LLM call a single point of failure — validate outputs, retry, fallback, degrade gracefully
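Latency has two phases: TTFT (input processing) and generation (output tokens); streaming cuts perceived latency to TTFT by hiding generation time
Per-query cost differs roughly 17× between GPT-4o and GPT-4o-mini (more across providers); routing simple tasks to cheap models is the highest-ROI optimization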
LLM failure modes are different: hallucination, rate limits, format errors, provider outages — design for all of them
Seven principles: minimize calls, cheapest model, validate outputs, stream, design for failure, measure, keep it simple
Start with the simplest system — add complexity only when evaluation proves you need it
The best LLM system is the one with the fewest LLM calls

Chapter 02 · Patterns

Architecture Patterns — Common LLM Application Architectures

Every LLM application is built from a small set of composable patterns. Understanding these patterns lets you pick the right architecture for your problem instead of over-engineering or under-building. Start with the simplest pattern that works.

The Six Core Patterns Core

①

Single Call

One prompt → one LLM call → one response. The simplest possible pattern.

Classification, extraction, summarization
Latency: 200ms–2s
Cost: 1 LLM call
Start here.

②

Chain (Sequential)

Output of call A becomes input to call B. Multi-step processing with deterministic order.

Extract → classify → format
Latency: N × single call
Cost: N LLM calls
Each step can use different model

③

Router

Classify input first, then route to the appropriate handler (model, prompt, or pipeline).

Intent detection → specialized pipeline
Latency: classifier + handler
Cost: 1 cheap classify + 1 handler
Key pattern for model routing

④

Parallel Fan-out

Send the same input to multiple LLM calls simultaneously, aggregate results.

Generate 3 drafts → pick best
Latency: max(calls) not sum
Cost: N × single call
Needs aggregation logic

⑤

MapReduce

Split large input into chunks, process each (map), then combine results (reduce).

Summarize 100-page document
Latency: map (parallel) + reduce
Cost: N map calls + 1 reduce
Handles inputs beyond context window

⑥

Orchestrator (Agent)

LLM decides what to do next in a loop. Non-deterministic control flow.

Tool use, multi-step reasoning
Latency: unpredictable (3–15 steps)
Cost: high, variable
Use only when others can't work

Pattern Comparison — When to Use What In-depth

Pattern	LLM Calls	Latency	Predictability	Best For
Single Call	1	200ms–2s	High	Classification, extraction, simple Q&A
Chain	2–5	1–5s	High	Multi-step processing, transform pipelines
Router	2	0.5–3s	High	Cost optimization, intent-based dispatch
Parallel Fan-out	N (parallel)	max(calls)	High	Quality improvement, consensus, diversity
MapReduce	N+1	1–10s	High	Large docs, batch processing
Orchestrator	3–15+	3–30s+	Low	Dynamic multi-step, tool use, research

Tool Scaling Problem — When More Tools Means Worse Results In-depth

In tool-using agents and routers, there is a non-obvious scaling limit: as the number of available tools grows, model tool selection accuracy degrades. More tools = more confusion, not more capability.

✅

5–10 tools

Reliable selection. Model consistently picks the right tool, descriptions are easy to differentiate.

Selection accuracy: ~95%
Manageable prompt overhead
Good tool descriptions sufficient

⚠️

11–20 tools

Noticeable degradation. Model occasionally picks wrong tool or combines incompatible tools.

Selection accuracy: ~80–85%
Requires more specific descriptions
Needs mitigation strategies

❌

20+ tools

Significant degradation. Tool selection becomes a primary failure source, outweighing other problems.

Selection accuracy: <70%
High hallucination of tool names
Requires structural mitigation

Three Mitigation Strategies

(1) Group tools by function — expose only the relevant group per task (search tools vs write tools vs compute tools). (2) Use a router before tool exposure — classify intent first, then present only the 3–5 relevant tools for that intent category. (3) Limit visible tools per step — in multi-step agents, expose only the tools needed for the current step. The goal: never present more than 10 tools at once.
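Strategies (1) and (2) can be combined in a few lines: classify intent first, then expose only that group's tools. This is a minimal sketch; the group names follow the text, while the tool names and the `tools_for_intent` helper are made up for illustration.

```python
MAX_VISIBLE_TOOLS = 10  # the ceiling suggested above

# Hypothetical tool registry, grouped by function.
TOOL_GROUPS = {
    "search": ["web_search", "doc_search"],
    "write": ["create_ticket", "send_email"],
    "compute": ["run_sql", "calculator"],
}


def tools_for_intent(intent, groups=TOOL_GROUPS, limit=MAX_VISIBLE_TOOLS):
    """Return only the tools relevant to the classified intent."""
    tools = groups.get(intent, [])
    if len(tools) > limit:
        # A group this large should be split further, not shown whole.
        raise ValueError(f"group {intent!r} has {len(tools)} tools, limit is {limit}")
    return list(tools)


visible = tools_for_intent("search")  # ['web_search', 'doc_search']
```

The intent itself would come from a cheap classifier call or a rule; an unknown intent yields no tools, which forces an explicit fallback decision instead of exposing everything.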

Pattern Architecture Diagrams Core

Router pattern — classify intent, dispatch to specialized handler

MapReduce pattern — process large documents that exceed context window
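The MapReduce flow can be sketched with `asyncio.gather`. Assumptions in this example: `summarize` is a placeholder for your own async LLM call, chunking is by character count rather than tokens, and the concurrency cap is an illustrative guard against unbounded fan-out.

```python
import asyncio


def split_into_chunks(text, chunk_chars):
    """Naive fixed-size chunking; real systems usually split on tokens or sections."""
    return [text[i:i + chunk_chars] for i in range(0, len(text), chunk_chars)]


async def map_reduce(text, summarize, *, chunk_chars=4000, max_concurrency=8):
    """Summarize each chunk in parallel (map), then summarize the summaries (reduce)."""
    chunks = split_into_chunks(text, chunk_chars)
    semaphore = asyncio.Semaphore(max_concurrency)

    async def map_one(chunk):
        async with semaphore:  # cap simultaneous provider calls
            return await summarize(chunk)

    partials = await asyncio.gather(*(map_one(chunk) for chunk in chunks))
    return await summarize("\n".join(partials))  # N map calls + 1 reduce call
```

If the joined partial summaries still exceed the context window, the reduce step would need to be applied recursively; that case is left out here.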

Composing Patterns — Real Systems Use Multiple Core

Real LLM applications compose multiple patterns. A customer support system might use: Router (classify intent) → RAG (retrieve knowledge for FAQ queries) → Chain (extract + respond for complex issues) → Agent (multi-step for account changes).

The Progression

Most successful LLM applications follow this evolution: Single Call (prototype) → Chain (add structure) → Router (add cost optimization) → RAG/MapReduce (add knowledge) → Agent (add autonomy, only if needed). Each step is driven by evaluation showing the simpler pattern isn't sufficient.

Reality Check — Most Production Systems Are Not Agents Core

Despite the popularity of agents in demos and research, the majority of production LLM systems use simpler patterns — because simpler patterns are cheaper, faster, and more predictable.

What most production systems use

• Single call — extraction, classification, summarization

• Chain — structured multi-step processing

• Router — cost and quality optimization

• RAG pipeline — knowledge-grounded Q&A

These cover ~90% of real-world LLM use cases.

When agents are actually justified

• Control flow is genuinely dynamic — can't be predetermined

• Tool use is required (search, code execution, APIs)

• Problem requires multi-step reasoning with branching

• Simpler patterns have been tried and measured as insufficient

The Agent Complexity Cost

Agents add three compounding costs: latency (3–15 LLM calls instead of 1–2), cost (multiplicative with step count), and unpredictability (non-deterministic control flow is hard to test and debug). If a Chain or Router can solve the problem, use it. Agents should be the last resort, not the first architecture.

∑ Chapter 02 — Key Takeaways

Six core patterns: Single Call, Chain, Router, Parallel Fan-out, MapReduce, Orchestrator
Single Call first — most tasks don't need multi-step processing
Router is the key cost pattern — classify intent, dispatch to cheapest capable handler
MapReduce handles documents larger than context window — map in parallel, reduce to one answer
Orchestrator (Agent) is the most powerful but most expensive and unpredictable — use last
Real systems compose patterns — router → chain → RAG is a common production stack

Chapter 03 · Models

Model Selection & Routing — Picking the Right Model for Each Task

There is no "best model." There is only the best model for this task at this cost. GPT-4o is overkill for classification. Haiku is too weak for complex reasoning. Model selection and routing is how you get quality and affordability.

Model Capabilities Matrix — What Each Model Is Good At Core

Capability	GPT-4o	GPT-4o-mini	Claude Sonnet	Claude Haiku	Gemini Flash
Complex reasoning	★★★★★	★★★	★★★★★	★★★	★★★
Code generation	★★★★★	★★★★	★★★★★	★★★	★★★★
Classification	★★★★★	★★★★★	★★★★★	★★★★	★★★★
Extraction	★★★★★	★★★★	★★★★★	★★★★	★★★★
Long context	128K	128K	200K	200K	1M
Speed	Medium	Fast	Medium	Very Fast	Very Fast
Cost	$$$	$	$$$	$	$

The Key Insight

For classification, extraction, and simple formatting, cheap models (mini, Haiku, Flash) perform within 1–2% of frontier models — at 10–20× lower cost. Reserve expensive models for complex reasoning, nuanced writing, and multi-step analysis.

Model Routing Strategies — How to Choose at Runtime In-depth

📋

Task-based Routing

Map task types to models at design time. Simplest approach.

Classification → mini
Summarization → Haiku
Complex analysis → GPT-4o
Static, no runtime overhead

🤖

LLM-based Routing

Use a cheap model to classify query complexity, then route to the appropriate model.

Classifier (mini) → decides: simple/complex
Simple → mini ($0.0003)
Complex → GPT-4o ($0.02)
Overhead: 1 cheap LLM call

📊

Confidence-based Routing

Try cheap model first. If confidence is low, escalate to expensive model.

Try mini → if uncertain, try 4o
80% of queries handled by mini
20% escalated to 4o
Best quality-cost tradeoff

🔧

Confidence-based routing implementation

    import math
    from statistics import mean

    # `cheap_model` / `expensive_model` are placeholder clients exposing an async
    # generate() whose response carries per-token log-probabilities.
    async def route_query(query: str, cheap_model, expensive_model):
        """Try cheap model first, escalate if uncertain."""
        # Step 1: Try cheap model
        response = await cheap_model.generate(
            query,
            temperature=0,
            logprobs=True,  # Get confidence scores
        )

        # Step 2: Check confidence
        avg_logprob = mean(response.logprobs)
        confidence = math.exp(avg_logprob)  # 0–1 scale (geometric-mean token probability)

        if confidence > 0.85:
            return response  # Cheap model is confident → use it

        # Step 3: Escalate to expensive model
        return await expensive_model.generate(query, temperature=0)

Fallback Chains — Multi-Provider Resilience Core

1️⃣ Primary: GPT-4o

2️⃣ Fallback: Claude Sonnet

3️⃣ Budget: GPT-4o-mini

4️⃣ Cache: Cached response

5️⃣ Degrade: "Try again later"

Trigger	Fallback Action	User Impact
Provider timeout (>10s)	Switch to secondary provider	Slightly different style, same quality
Rate limit (429)	Queue + retry with backoff, or secondary	200–500ms added delay
Provider outage	Switch to secondary provider entirely	Seamless if well-tested
All providers down	Serve cached responses for common queries	Stale but available
Budget exhausted	Route all traffic to cheapest model	Lower quality, still functional
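The chain above can be sketched as an ordered loop over providers, followed by cache and degraded mode. Everything named here is hypothetical: `ProviderError`, the `(name, generate)` pairs, and the plain dict standing in for a response cache.

```python
import asyncio


class ProviderError(Exception):
    """Hypothetical error raised by a provider client on outage or rate limit."""


DEGRADED_MESSAGE = "The service is temporarily unavailable. Please try again later."


async def generate_with_fallback(prompt, providers, cache, *, timeout_s=10.0):
    """Try each (name, async generate) pair in order, then cache, then degrade.

    Returns (source, text) so callers can log which tier answered.
    """
    for name, generate in providers:
        try:
            text = await asyncio.wait_for(generate(prompt), timeout=timeout_s)
            return name, text
        except (asyncio.TimeoutError, ProviderError):
            continue  # move on to the next provider in the chain
    if prompt in cache:
        return "cache", cache[prompt]  # stale but available
    return "degraded", DEGRADED_MESSAGE
```

Returning the source alongside the text makes it easy to populate a `fallback_triggered` log field and to measure how often each tier is actually used.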

Test Your Fallbacks

A fallback chain that's never been tested doesn't work. Regularly simulate provider failures (chaos engineering) and verify: your fallback triggers correctly, the secondary provider returns compatible output, and your parsers handle the different response format. The worst time to discover your fallback is broken is during an actual outage.

∑ Chapter 03 — Key Takeaways

There's no "best model" — only the best model for this task at this cost
Cheap models (mini, Haiku, Flash) match frontier models for classification and extraction at 10–20× lower cost
Three routing strategies: task-based (static), LLM-based (classify then route), confidence-based (try cheap, escalate if unsure)
Confidence-based routing handles 80% of queries cheaply, escalates 20% to expensive models
Fallback chains across providers prevent single-provider outages from taking down your system
Test your fallbacks — untested fallback chains fail when you need them most

Chapter 04 · Interfaces

API Design — Designing LLM-Powered APIs

Your LLM system's API is the contract with your consumers. LLM-powered APIs have unique challenges: long response times, streaming output, non-deterministic results, and variable costs per call. Traditional REST patterns don't always apply.

Streaming vs Batch — Two Delivery Models Core

Batch (Request-Response)

Client sends request, waits for complete response.

Pro: Simple, standard REST. Easy to cache, retry, log.

Con: User stares at spinner for 2–5s. Feels slow.

Best for: API-to-API calls, background processing, short responses.

Streaming (SSE / WebSocket)

Server sends tokens as they're generated.

Pro: First token in ~200ms. Feels instant. Better UX.

Con: Harder to cache, parse, and handle errors mid-stream.

Best for: User-facing chat, long responses, real-time interaction.

The Rule

Stream for humans, batch for machines. If the consumer is a user looking at a screen, stream via SSE. If the consumer is another service that needs a complete JSON response, use standard request-response. Many systems expose both: a streaming endpoint for the frontend and a batch endpoint for internal services.

Async Patterns — Handling Long-Running LLM Calls In-depth

⏱️

Synchronous + Timeout

Simple: send request, wait for response with a timeout. Best for calls under 10s.

Timeout: 10–30s
Return 504 on timeout
Client retries

📬

Job Queue (Submit + Poll)

Client submits job, gets job_id, polls for result. Best for calls 10s–5min.

POST /jobs → returns job_id
GET /jobs/{id} → status + result
Or: webhook on completion

📡

WebSocket (Bidirectional)

Persistent connection. Server pushes updates. Best for real-time + multi-turn.

Progress updates: "Searching..."
Streaming tokens
Client can cancel mid-generation
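The submit-and-poll pattern can be illustrated with an in-memory job store. This is only a sketch: the `JobStore` class is hypothetical, state lives in a dict (a real service would use a durable queue and database), and the HTTP layer that would map `submit` to POST /jobs and `poll` to GET /jobs/{id} is omitted.

```python
import asyncio
import uuid


class JobStore:
    """In-memory stand-in for a job queue; must be used inside a running event loop."""

    def __init__(self):
        self._jobs = {}
        self._tasks = set()  # keep references so tasks are not garbage-collected

    def submit(self, work):
        """Start `work` (a zero-argument async callable) and return a job_id."""
        job_id = uuid.uuid4().hex
        self._jobs[job_id] = {"status": "pending", "result": None}
        task = asyncio.create_task(self._run(job_id, work))
        self._tasks.add(task)
        task.add_done_callback(self._tasks.discard)
        return job_id

    async def _run(self, job_id, work):
        try:
            result = await work()
            self._jobs[job_id] = {"status": "done", "result": result}
        except Exception as exc:
            self._jobs[job_id] = {"status": "failed", "result": str(exc)}

    def poll(self, job_id):
        """Return the current status and result for a job."""
        return self._jobs.get(job_id, {"status": "not_found", "result": None})
```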

Structured Responses — Getting Reliable Output Core

LLMs return strings. Your API consumers expect structured data. The gap between these two is where most production bugs live.

Strategy	How	Reliability	When
JSON Mode	OpenAI native: `response_format: {"type": "json_object"}`; other providers expose different mechanisms	Very high for syntax: output is valid JSON, but the schema is not enforced	Always, when available
Structured Outputs	OpenAI: define JSON schema, model must match it exactly	Highest — schema-enforced	When you need guaranteed schema
Prompt + parse	Ask for JSON in prompt, parse manually	Medium — model may add markdown fences, skip fields	When structured output unavailable
Retry on parse failure	If JSON parsing fails, retry with error feedback	Good with 1–2 retries	Always as fallback layer

🔧

Robust output parsing with retry

    import json
    from pydantic import BaseModel, ValidationError

    class AnalysisResult(BaseModel):
        sentiment: str      # "positive" | "negative" | "neutral"
        confidence: float   # 0.0–1.0
        summary: str

    # `llm` is a placeholder client whose async generate() returns the raw response string.
    async def get_structured_response(prompt, llm, max_retries=2):
        current_prompt = prompt
        for attempt in range(max_retries + 1):
            response = await llm.generate(
                current_prompt,
                response_format={"type": "json_object"},
            )
            try:
                data = json.loads(response)
                return AnalysisResult(**data)  # Validates schema
            except (json.JSONDecodeError, ValidationError, TypeError) as e:
                # TypeError covers valid JSON that is not an object (e.g. a list)
                if attempt == max_retries:
                    raise
                # Rebuild from the original prompt so error notes don't pile up
                current_prompt = f"{prompt}\n\nError: {e}. Return valid JSON matching the schema."

Rate Limiting & Backpressure Core

LLM provider rate limits are strict and per-organization. If one customer sends 1000 requests, they can exhaust your rate limit for everyone. You need rate limiting at your API layer too.

👤

Per-User Limits

Cap requests per user per minute. Prevents one user from starving others.

Free tier: 10 req/min
Pro tier: 60 req/min
Return 429 with Retry-After header

💰

Token Budgets

Cap total tokens per user per day/month. Prevents cost overrun.

Track cumulative tokens per API key
Return 429 when budget exhausted
Dashboard showing usage

🚦

Global Backpressure

When approaching provider rate limits, queue requests instead of failing.

Request queue with priority
Return 202 + job_id when queued
Shed load during peak
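Per-user limits can be sketched as a sliding-window counter that also produces the value for the Retry-After header. This is an in-memory illustration under assumptions: the class name is hypothetical, and a multi-instance deployment would need shared state (for example in Redis) instead of a local dict.

```python
import time
from collections import defaultdict, deque


class PerUserRateLimiter:
    """Allows at most `limit_per_minute` requests per user in any 60-second window."""

    def __init__(self, limit_per_minute, clock=time.monotonic):
        self.limit = limit_per_minute
        self._clock = clock
        self._hits = defaultdict(deque)  # user_id -> timestamps of accepted requests

    def check(self, user_id):
        """Return (allowed, retry_after_seconds)."""
        now = self._clock()
        hits = self._hits[user_id]
        while hits and now - hits[0] >= 60.0:
            hits.popleft()  # forget requests older than the window
        if len(hits) < self.limit:
            hits.append(now)
            return True, 0.0
        # Rejected: the oldest accepted request leaves the window at hits[0] + 60.
        return False, 60.0 - (now - hits[0])
```

When `check` returns False, respond with 429 and set Retry-After to the returned number of seconds, rounded up.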

Idempotency — Handling Retries Safely In-depth

LLM calls are expensive and non-deterministic. When a client retries a timed-out request, you don't want to run (and pay for) the LLM call again. Idempotency keys solve this.

❌ Without Idempotency

Client sends request → timeout → retries → LLM called twice.

You pay double. User may get different answers for the "same" request.

✅ With Idempotency Key

Client sends request + Idempotency-Key: abc123.

First call: runs LLM, caches result keyed by abc123.

Retry: returns cached result. No second LLM call.

Non-determinism + Retries = Confusion

Without idempotency, a client that retries the same request might get a different answer — because LLMs are non-deterministic. This confuses users and breaks downstream systems that expect consistent results. Always cache the first response for a given idempotency key (TTL: 24h typical).
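A minimal idempotency layer caches the first result per key and makes concurrent retries wait for it. The sketch below assumes a single process and in-memory storage; `IdempotencyCache` and `get_or_run` are hypothetical names, and expired entries and per-key locks are never cleaned up here.

```python
import asyncio
import time


class IdempotencyCache:
    """Runs the LLM call once per idempotency key and replays the result on retries."""

    def __init__(self, ttl_s=24 * 3600, clock=time.monotonic):
        self.ttl_s = ttl_s
        self._clock = clock
        self._entries = {}  # key -> (expires_at, result)
        self._locks = {}    # key -> asyncio.Lock, so concurrent retries wait

    def _fresh(self, key):
        entry = self._entries.get(key)
        if entry is not None and entry[0] > self._clock():
            return entry
        return None

    async def get_or_run(self, key, run):
        """Return the cached result for `key`, or await `run()` and cache it."""
        entry = self._fresh(key)
        if entry is not None:
            return entry[1]
        lock = self._locks.setdefault(key, asyncio.Lock())
        async with lock:
            entry = self._fresh(key)  # another caller may have finished meanwhile
            if entry is not None:
                return entry[1]
            result = await run()
            self._entries[key] = (self._clock() + self.ttl_s, result)
            return result
```

The per-key lock matters because a client retry often arrives while the first call is still running; without it both requests would reach the LLM.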

∑ Chapter 04 — Key Takeaways

Stream for humans, batch for machines — expose both endpoints
Async patterns: sync + timeout (<10s), job queue (10s–5min), WebSocket (real-time + multi-turn)
Use JSON Mode or Structured Outputs whenever available — prompt-based JSON is fragile
Always add Pydantic/schema validation + retry on parse failure as a safety net
Rate limit at your API layer — per-user request limits + token budgets + global backpressure
Idempotency keys prevent double LLM calls on retries and ensure consistent responses

Chapter 05 · Performance

Caching Strategies — Reducing Cost and Latency with Smart Caching

The cheapest and fastest LLM call is the one you don't make. Caching is the most impactful optimization for LLM systems — it reduces cost, latency, and provider dependency in one move. But LLM caching is harder than traditional caching because inputs are natural language, not exact keys.

Four Caching Layers — From Simple to Sophisticated Core

①

Exact Match Cache

Hash the full prompt → cache response. Identical prompt = cache hit.

Hit rate: 5–15% (prompts vary)
Implementation: Redis / in-memory
Zero false positives
Always implement this first

②

Semantic Cache

Embed the query → find similar past queries in vector DB → return cached answer if similar enough.

Hit rate: 15–35% (catches paraphrases)
Implementation: Vector DB + threshold
Risk: false positives if threshold too low
Saves the most money

③

Prompt Cache (Provider)

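OpenAI and Anthropic cache repeated prompt prefixes (such as long system prompts) across calls. Same prefix = faster + cheaper.

Automatic on OpenAI for long prompts; Anthropic requires explicit cache_control markers
Discount on cached input tokens: about 50% on OpenAI, up to 90% on Anthropic
Reduced TTFT
Little or no implementation needed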

④

KV Cache Reuse (Self-hosted)

When self-hosting: reuse key-value cache across requests with shared prefixes.

vLLM automatic prefix caching
Same system prompt = cached KV
30–60% faster TTFT
Only for self-hosted models
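
As a starting point for layer ①, an exact-match cache might look like the sketch below. It assumes an in-memory dictionary in place of Redis and a hypothetical `call_llm(model, messages)` callable; the key hashes everything that influences the output, so two requests share an entry only when they are truly identical.

```python
import hashlib
import json
from typing import Callable


def cache_key(model: str, messages: list[dict], temperature: float = 0.0) -> str:
    """Hash every input that affects the output into a stable key."""
    payload = json.dumps(
        {"model": model, "messages": messages, "temperature": temperature},
        sort_keys=True,
        separators=(",", ":"),
    )
    return "llm:" + hashlib.sha256(payload.encode("utf-8")).hexdigest()


class ExactMatchCache:
    """In-memory stand-in for Redis: identical request -> cached response."""

    def __init__(self) -> None:
        self._store: dict[str, str] = {}
        self.hits = 0
        self.misses = 0

    def get_or_call(
        self,
        model: str,
        messages: list[dict],
        call_llm: Callable[[str, list[dict]], str],
    ) -> str:
        key = cache_key(model, messages)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        response = call_llm(model, messages)
        self._store[key] = response
        return response
```

Tracking `hits` and `misses` from day one gives you the measured hit rate to compare against the 5–15% range above.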

Semantic Caching — Catching Paraphrases In-depth

Semantic cache flow — embed query, search for similar, return cached or call LLM

The False Positive Trap

Setting the similarity threshold too low causes wrong answers to be returned from the cache. "What's the refund policy?" and "What's the return policy?" may be 0.92 similar but have different answers. Start with a threshold of ≥ 0.95 and lower it only after testing. A wrong cached answer is worse than a slow correct one.
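
The threshold check can be illustrated with a small sketch. It assumes embeddings are already computed by some embedding model (not shown) and uses a linear scan in place of a real vector DB; `SemanticCache` and its methods are hypothetical names for illustration only.

```python
import math
from typing import Optional

SIMILARITY_THRESHOLD = 0.95  # conservative starting point from the text


def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


class SemanticCache:
    """Linear-scan stand-in for a vector DB lookup."""

    def __init__(self, threshold: float = SIMILARITY_THRESHOLD) -> None:
        self.threshold = threshold
        self._entries: list[tuple[list[float], str]] = []

    def add(self, embedding: list[float], answer: str) -> None:
        self._entries.append((embedding, answer))

    def lookup(self, embedding: list[float]) -> Optional[str]:
        """Return the closest cached answer, or None if it is not similar enough."""
        best_score, best_answer = 0.0, None
        for cached_embedding, answer in self._entries:
            score = cosine_similarity(embedding, cached_embedding)
            if score > best_score:
                best_score, best_answer = score, answer
        return best_answer if best_score >= self.threshold else None
```

A query whose nearest neighbour scores around 0.90 is treated as a miss at the default threshold and falls through to the LLM, which is the conservative behaviour the warning above argues for.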

Cache Invalidation — The Hard Part Core

Strategy	How	Best For
TTL (time-to-live)	Cache expires after N hours/days	General answers, FAQs (TTL: 6–24h)
Version key	Include prompt version in cache key — new prompt = new cache	When you update prompts/models
Event-driven	Clear cache when underlying data changes	RAG: doc updated → clear related caches
Manual purge	Admin action to clear specific cache entries	Wrong answers discovered in production

∑ Chapter 05 — Key Takeaways

Four caching layers: exact match (simple), semantic (paraphrases), prompt cache (provider), KV cache (self-hosted)
Exact match first — zero false positives, easy to implement, 5–15% hit rate
Semantic cache catches paraphrases (15–35% hit rate) but needs careful threshold tuning (≥0.95)
Provider prompt caching is a near-free optimization: roughly 50% or more off cached input tokens on long system prompts
Cache invalidation: TTL for general, version keys for prompt changes, events for data changes
A wrong cached answer is worse than a slow correct one — tune thresholds conservatively

Chapter 06 · Scale

Scaling LLM Applications — From Prototype to Production Load

Scaling LLM applications is fundamentally different from scaling traditional web apps. You can't just add servers — your bottleneck is third-party API rate limits, not your own compute. Scaling strategy is about managing concurrency, queuing, and provider capacity.

Why LLM Apps Are Hard to Scale Foundation

Challenge	Traditional App	LLM App
Bottleneck	Your servers (scalable)	Provider API rate limits (not in your control)
Response time	<100ms (scale horizontally)	1–5s per call (can't parallelize a single call)
Cost of scale	Fixed infrastructure → amortized	Linear: 2× queries = 2× LLM cost
Capacity planning	Auto-scale based on CPU/memory	Pre-negotiate rate limits, multi-provider
Request weight	All requests ~equal cost	Requests vary 10–100× in token cost

Queue-Based Architecture — The Production Pattern Core

Queue-based scaling — decouple request intake from LLM execution

Why Queues Work

The queue absorbs burst traffic, the worker pool enforces the provider rate limit. Clients get immediate acknowledgment (202 Accepted), workers process at the maximum rate the provider allows. Scale workers up to match rate limits, not beyond. Queue depth is your auto-scaling signal — high queue = add workers (up to rate limit cap).
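
One way to picture the pattern is the asyncio sketch below: jobs go into a queue, a fixed pool of workers drains it, and a shared pacing limiter keeps the combined call rate under the provider limit. All names here (`IntervalRateLimiter`, `process_jobs`, the `call_llm` coroutine) are hypothetical, and a production system would use a durable queue such as SQS rather than an in-process one.

```python
import asyncio
import time


class IntervalRateLimiter:
    """Spaces calls at least 1/rate seconds apart (a simple pacing limiter)."""

    def __init__(self, calls_per_second: float) -> None:
        self._interval = 1.0 / calls_per_second
        self._next_slot = 0.0
        self._lock = asyncio.Lock()

    async def acquire(self) -> None:
        async with self._lock:
            now = time.monotonic()
            wait = max(0.0, self._next_slot - now)
            self._next_slot = max(now, self._next_slot) + self._interval
        if wait > 0:
            await asyncio.sleep(wait)


async def worker(queue, limiter, call_llm, results) -> None:
    while True:
        job_id, prompt = await queue.get()
        try:
            await limiter.acquire()  # stay under the provider rate limit
            results[job_id] = await call_llm(prompt)
        except Exception as exc:  # keep the worker alive; record the failure
            results[job_id] = exc
        finally:
            queue.task_done()


async def process_jobs(jobs, call_llm, num_workers=4, calls_per_second=50.0):
    queue: asyncio.Queue = asyncio.Queue()
    limiter = IntervalRateLimiter(calls_per_second)
    results: dict = {}
    workers = [
        asyncio.create_task(worker(queue, limiter, call_llm, results))
        for _ in range(num_workers)
    ]
    for job in jobs:
        queue.put_nowait(job)  # intake: an API would return 202 Accepted here
    await queue.join()  # wait until every job has been processed
    for task in workers:
        task.cancel()
    await asyncio.gather(*workers, return_exceptions=True)
    return results
```

Adding workers beyond what the limiter allows does not increase throughput; it only adds idle tasks, which is why the worker count should track the rate limit.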

Rate Limit Management — Playing Within the Limits In-depth

📊

Track Usage in Real-Time

Monitor requests/min and tokens/min against provider limits. Throttle before hitting the limit.

Sliding window counter
Alert at 80% of limit
Auto-throttle at 90%

🔄

Multi-Provider Spreading

Split traffic across multiple providers to multiply effective rate limits.

60% OpenAI, 40% Anthropic
Weighted routing per model quality
Up to 2× effective capacity (with comparable limits per provider)

📅

Request Prioritization

When approaching limits, serve high-priority requests first.

Paid users before free users
Real-time before batch
Short requests before long
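
The "alert at 80%, throttle at 90%" idea from the first card can be sketched with a sliding window counter. This is an illustrative single-process version with hypothetical names; a multi-server deployment would keep the counter in a shared store such as Redis.

```python
import time
from collections import deque
from typing import Optional


class SlidingWindowUsage:
    """Tracks usage (requests or tokens) over a sliding time window."""

    def __init__(self, limit: int, window_seconds: float = 60.0) -> None:
        self.limit = limit
        self.window = window_seconds
        self._events: deque = deque()  # (timestamp, amount), oldest first
        self._total = 0

    def _evict(self, now: float) -> None:
        # Drop events that have fallen out of the window.
        while self._events and self._events[0][0] <= now - self.window:
            _, amount = self._events.popleft()
            self._total -= amount

    def record(self, amount: int, now: Optional[float] = None) -> None:
        now = time.monotonic() if now is None else now
        self._evict(now)
        self._events.append((now, amount))
        self._total += amount

    def utilization(self, now: Optional[float] = None) -> float:
        now = time.monotonic() if now is None else now
        self._evict(now)
        return self._total / self.limit

    def status(self, now: Optional[float] = None) -> str:
        used = self.utilization(now)
        if used >= 0.90:
            return "throttle"  # start delaying or shedding low-priority work
        if used >= 0.80:
            return "alert"
        return "ok"
```

One instance per limit (requests/min and tokens/min) is usually enough; the `status` value can also drive the prioritization rules, for example by pausing batch work once it reports "alert".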

Scaling Reference Numbers Core

Scale Tier	Daily Queries	Architecture	Estimated LLM Cost/Day
Prototype	<1K	Single server, sync calls	$1–$10
Small prod	1K–10K	App server + cache (Redis)	$10–$100
Medium prod	10K–100K	Queue + worker pool + multi-provider	$100–$1,000
Large prod	100K–1M+	Full queue arch + model routing + caching + self-host mix	$1,000–$10,000+

∑ Chapter 06 — Key Takeaways

LLM scaling bottleneck is provider rate limits, not your servers — you can't just add compute
Queue-based architecture decouples intake from execution — workers process at max provider rate
Scale workers to match rate limits, not beyond — queue depth is your auto-scaling signal
Multi-provider spreading adds the providers' rate limits together (two providers with similar limits give up to 2× capacity)
Prioritize: paid before free, real-time before batch, short before long
Cost scales linearly: 2× queries = 2× LLM cost (caching, routing, and prompt compression are your main levers)

Chapter 07 · Speed

Latency Optimization — Making LLM Applications Feel Fast

Users expect sub-second responses. LLM calls take 1–5 seconds. This gap is where latency engineering lives — reducing actual latency where possible, and masking it with streaming and progressive rendering where not.

The Latency Budget — Where Time Goes Core

Component	Typical Latency	Optimization	Savings
Network (client → server)	10–50ms	CDN, edge deployment	20–30ms
Your app logic	5–20ms	Optimize prompts, pre-compute	10ms
Cache check	1–5ms (Redis)	In-memory for hot queries	Skips LLM entirely on hit
TTFT (LLM)	200ms–2s	Shorter prompts, prompt caching, smaller model	50–500ms
Token generation	1–5s total	max_tokens limit, concise prompts, streaming	Perceived: ~0ms with streaming
Post-processing	5–50ms	Async validation, stream while processing	Overlap with generation

Optimization Techniques — Ranked by Impact Core

🏆

1. Streaming

First token in ~200ms vs waiting 3s for full response. The single biggest UX improvement.

SSE for REST APIs
WebSocket for bidirectional
Effort: Low

🥈

2. Caching

Skip the LLM call entirely. Cache hit = response in <10ms instead of 2s.

Exact match + semantic cache
15–35% of queries cached
Effort: Medium

🥉

3. Smaller Models

GPT-4o-mini is 2–3× faster than GPT-4o. Route simple queries to fast models.

Classifier → route to mini/Haiku
70% of queries can use mini
Effort: Medium

4️⃣

Shorter Prompts

Every 1K fewer input tokens saves 50–200ms TTFT. Remove fluff from system prompts.

5️⃣

Parallel Calls

Independent LLM calls run simultaneously. Latency = max(calls) not sum(calls).

6️⃣

Progressive Rendering

"Searching..." → "Found 3 docs..." → "Generating..." → streamed answer. Each step feels fast.

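The "latency = max(calls), not sum(calls)" point from the Parallel Calls card can be demonstrated with a short asyncio sketch. `fake_llm_call` is a hypothetical stand-in that sleeps instead of calling a provider, so the numbers only illustrate the shape of the effect.

```python
import asyncio
import time


async def fake_llm_call(name: str, seconds: float) -> str:
    """Stand-in for a real LLM call; sleeps instead of hitting a provider."""
    await asyncio.sleep(seconds)
    return f"{name} done"


async def run_sequential(calls):
    # Each call waits for the previous one: total time is the sum.
    return [await fake_llm_call(name, seconds) for name, seconds in calls]


async def run_parallel(calls):
    # Independent calls run concurrently: total time is roughly the slowest call.
    return list(
        await asyncio.gather(*(fake_llm_call(name, seconds) for name, seconds in calls))
    )


def timed(coro):
    start = time.perf_counter()
    result = asyncio.run(coro)
    return result, time.perf_counter() - start
```

Calling `timed(run_sequential(calls))` and `timed(run_parallel(calls))` with the same list returns identical results in the same order, with the parallel run finishing in about the time of the slowest call. This only applies when the calls do not depend on each other's output.
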
Streaming Architecture — End-to-End In-depth

🔧

SSE streaming endpoint (FastAPI)

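import json

from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from openai import AsyncOpenAI
from pydantic import BaseModel

app = FastAPI()
client = AsyncOpenAI()  # reads OPENAI_API_KEY from the environment


class ChatRequest(BaseModel):
    messages: list[dict]  # e.g. [{"role": "user", "content": "..."}]


@app.post("/chat")
async def chat_stream(request: ChatRequest):
    async def generate():
        stream = await client.chat.completions.create(
            model="gpt-4o",
            messages=request.messages,
            stream=True,
        )
        async for chunk in stream:
            # Some chunks carry no choices or an empty delta; skip them.
            if chunk.choices and chunk.choices[0].delta.content:
                token = chunk.choices[0].delta.content
                # One SSE event per token
                yield f"data: {json.dumps({'token': token})}\n\n"
        yield "data: [DONE]\n\n"

    return StreamingResponse(generate(), media_type="text/event-stream")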

Streaming + Structured Output = Tricky

When streaming JSON responses, you get partial JSON tokens: {"sen → timent": → "pos → itive"}. You can't parse the object until it has fully arrived. Solutions: (1) stream plain text and return structured metadata separately, (2) use a streaming (incremental) JSON parser, for example json-stream or ijson in Python, or (3) stream the text response and send the structured data as one final, non-streamed event.
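
To see why partial JSON cannot be parsed mid-stream, consider this small sketch. It uses only the standard library; `try_parse` and `collect_streamed_json` are hypothetical helper names, and the chunk boundaries mirror the example above.

```python
import json
from typing import Iterable, Optional


def try_parse(partial: str) -> Optional[dict]:
    """Return the parsed object, or None while the JSON is still incomplete."""
    try:
        return json.loads(partial)
    except json.JSONDecodeError:
        return None


def collect_streamed_json(chunks: Iterable[str]) -> dict:
    """Buffer streamed fragments and parse once the stream has ended."""
    buffer: list[str] = []
    for chunk in chunks:
        buffer.append(chunk)
    return json.loads("".join(buffer))
```

Every prefix of the stream fails to parse until the closing brace arrives, which is why the buffered result is best delivered as a single final event rather than token by token.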

∑ Chapter 07 — Key Takeaways

Streaming is #1 — first token in ~200ms vs 3s wait. The single biggest UX improvement you can make.
Caching is #2 — skip the LLM entirely. Cache hit = <10ms response.
Smaller models are #3 — mini/Haiku are 2–3× faster for simple tasks
Shorter prompts directly reduce TTFT — every 1K fewer tokens saves 50–200ms
Run independent calls in parallel — latency = max, not sum
Progressive rendering (showing steps) makes any system feel faster

Chapter 08 · Economics

Cost Management — Keeping LLM Costs Under Control

LLM costs are deceptive: they start small and grow linearly with usage. A system that costs $10/day at launch can cost $10,000/day at scale without any architecture changes. Cost management is not an afterthought — it's a first-class design constraint.

The Token Cost Equation — Where Money Goes Core

Every dollar you spend on LLMs comes from three levers: how many calls you make, how many tokens per call, and which model you use. Cost optimization attacks all three.

LLM cost breakdown — three levers you control

The Cost Formula

Daily Cost = Queries/day × Avg Tokens/query × Model Price/token. Optimize all three. Model routing alone (sending 70% of queries to mini instead of GPT-4o) can cut costs by 60–70%. Combined with caching (eliminating 20–30% of calls) and prompt compression (reducing tokens by 30%), a total cost reduction of 75–85% is achievable, provided your evals confirm that quality holds.
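
The formula can be worked through with illustrative numbers. The traffic figures (100K queries/day, 1,500 tokens/query) and the prices ($2.50 and $0.15 per 1M tokens) are assumptions chosen for the example, not quotes from any provider's current price list.

```python
def daily_cost(queries_per_day: float, avg_tokens_per_query: float,
               price_per_million_tokens: float) -> float:
    """Daily Cost = Queries/day x Avg Tokens/query x Model Price/token."""
    return queries_per_day * avg_tokens_per_query * price_per_million_tokens / 1_000_000


def blended_price(frontier_price: float, cheap_price: float,
                  share_to_cheap: float) -> float:
    """Average price per 1M tokens when a share of traffic goes to the cheap model."""
    return (1 - share_to_cheap) * frontier_price + share_to_cheap * cheap_price


# Assumed baseline: everything on the frontier model.
baseline = daily_cost(100_000, 1_500, 2.50)            # 375.0 USD/day

# Lever 1: route 70% of queries to the cheap model.
routed_price = blended_price(2.50, 0.15, 0.70)         # 0.855 USD per 1M tokens
routed = daily_cost(100_000, 1_500, routed_price)
routing_reduction = 1 - routed / baseline              # about 66%

# Levers 2 and 3: cache removes 25% of calls, compression removes 30% of tokens.
combined = daily_cost(100_000 * 0.75, 1_500 * 0.70, routed_price)
total_reduction = 1 - combined / baseline              # about 82%
```

Under these assumptions, routing alone lands inside the 60–70% range and the three levers together inside the 75–85% range quoted above. The levers multiply rather than add, so each one matters less once the others are in place.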

Token Budget System — Preventing Cost Overruns Core

Without token budgets, a single runaway agent loop or abusive user can burn your monthly budget in minutes. Token budgets enforce hard and soft limits at every level.

Budget Level	What It Limits	Implementation	Enforcement
Per request	Max tokens per single LLM call	Set `max_tokens` parameter	Provider enforces — free
Per user/day	Total tokens a user can spend daily	Track in Redis with daily TTL key	Return 429 when exhausted
Per API key/month	Total organization spend	Provider dashboard spend limits	Provider cuts off at limit
Per feature	Limit expensive features (agents, long-form)	Feature flags based on user tier	App-level enforcement
System-wide	Total queries per hour (circuit breaker)	Global counter, open circuit if exceeded	Degrade gracefully when triggered

🔧

Per-user token budget with Redis

from datetime import datetime, timezone

import redis.asyncio as redis  # async client, so calls don't block the event loop

r = redis.Redis()
DAILY_TOKEN_LIMIT = 100_000  # tokens/user/day


async def check_and_deduct_budget(user_id: str, estimated_tokens: int) -> bool:
    # One counter per user per UTC day
    day = datetime.now(timezone.utc).strftime("%Y-%m-%d")
    key = f"tokens:{user_id}:{day}"

    # Increment and refresh the TTL together (the pipeline runs as MULTI/EXEC)
    pipe = r.pipeline()
    pipe.incrby(key, estimated_tokens)
    pipe.expire(key, 86400)  # 24h TTL so old day keys clean themselves up
    results = await pipe.execute()

    total_used = results[0]
    if total_used > DAILY_TOKEN_LIMIT:
        await r.decrby(key, estimated_tokens)  # roll back the reservation
        return False  # budget exhausted
    return True

Model Tiering — The Highest-ROI Optimization In-depth

💚

Tier 1 — Free / Near-Free

Tasks solvable without LLM — use rule-based or traditional ML.

Keyword matching, regex
Traditional classifiers (sklearn)
Embedding similarity (no LLM)
Cost: ~$0

💙

Tier 2 — Cheap LLM

Simple tasks where a small/fast model performs within 2% of frontier.

GPT-4o-mini / Claude Haiku / Gemini Flash
Classification, extraction, formatting
Short Q&A, paraphrase detection
Cost: $0.01–$0.10 / 1K queries

💛

Tier 3 — Frontier LLM

Complex tasks requiring frontier reasoning or nuanced writing.

GPT-4o / Claude Sonnet / Gemini Pro
Multi-step reasoning, code gen
Complex analysis, long-form writing
Cost: $0.50–$5.00 / 1K queries

The Tiering Rule

Benchmark each task type across tiers before committing to a model. In practice, 60–75% of production LLM queries are Tier 2: classification, simple extraction, FAQ answers, format conversion. These can run on cheap models at 10–20× lower cost with <2% quality delta. Reserve Tier 3 only for tasks where you've measured that it makes a difference.
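
A tier router can start out as simple as the sketch below. The regex and keyword hints are placeholder heuristics invented for illustration; in practice the routing rule would come from the per-task benchmarks described above, often as a small trained classifier.

```python
import re
from typing import Optional

# Model per tier, using the model names from this guide; Tier 1 needs no LLM.
TIER_MODELS: dict[int, Optional[str]] = {1: None, 2: "gpt-4o-mini", 3: "gpt-4o"}

# Hypothetical heuristics, for illustration only.
GREETING_RE = re.compile(r"^\s*(hi|hello|hey|thanks|thank you)\b[\s!.]*$", re.IGNORECASE)
COMPLEX_HINTS = ("step by step", "refactor", "analyze", "design")


def pick_tier(query: str) -> int:
    """Return 1 (no LLM), 2 (cheap LLM), or 3 (frontier LLM) for a query."""
    if GREETING_RE.match(query):
        return 1  # canned response, no LLM call
    lowered = query.lower()
    if len(query.split()) > 60 or any(hint in lowered for hint in COMPLEX_HINTS):
        return 3  # long or reasoning-heavy: frontier model
    return 2  # default: cheap model


def pick_model(query: str) -> Optional[str]:
    return TIER_MODELS[pick_tier(query)]
```

Defaulting to Tier 2 and escalating only on explicit signals keeps the cheap model as the common path, which is where the cost savings come from. Logging the chosen tier per request makes it possible to check later whether the router's decisions match measured quality.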

Batch Processing — 50% Cost Reduction on Non-Realtime Work Core

OpenAI and Anthropic offer batch APIs that process requests asynchronously at a 50% price discount. If your use case tolerates minutes-to-hours latency, batch processing cuts the bill in half with very little engineering effort.

Synchronous API (real-time)

Latency: 1–5s per request

Cost: Full price (e.g. $2.50/1M tokens)

Best for: User-facing generation, interactive features

When: Response needed in <10s

Batch API (async)

Latency: Minutes to 24 hours

Cost: 50% off (e.g. $1.25/1M tokens)

Best for: Embedding runs, bulk eval, background analysis

When: Any offline/background processing

Use Case	API Type	Rationale
Chat response	Sync (real-time)	User waiting — can't delay
Embedding 10K documents	Batch	No user waiting — 50% savings
Nightly eval run	Batch	Background job — 50% savings
Bulk data extraction	Batch	No realtime need — 50% savings
Content moderation	Depends	Sync if blocking publish; batch if reviewing async
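
For the batch rows in the table, preparing the work mostly means writing one request per line to a JSONL file. The sketch below builds lines in the shape OpenAI's Batch API documents for chat completions (`custom_id`, `method`, `url`, `body`); treat the exact fields as something to verify against the provider's current documentation, and note that `build_batch_lines` is a hypothetical helper name.

```python
import json


def build_batch_lines(
    prompts: dict[str, str],
    model: str = "gpt-4o-mini",
    max_tokens: int = 256,
) -> list[str]:
    """Turn {custom_id: prompt} into JSONL lines for a batch input file."""
    lines = []
    for custom_id, prompt in prompts.items():
        request = {
            "custom_id": custom_id,  # used to match results back to inputs
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {
                "model": model,
                "messages": [{"role": "user", "content": prompt}],
                "max_tokens": max_tokens,
            },
        }
        lines.append(json.dumps(request))
    return lines


def write_batch_file(path: str, lines: list[str]) -> None:
    with open(path, "w", encoding="utf-8") as f:
        f.write("\n".join(lines) + "\n")
```

The resulting file is then uploaded and submitted as a batch job; results come back asynchronously and are matched to inputs by `custom_id`, so those IDs should be unique and meaningful.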

Prompt Compression — Reduce Tokens Without Reducing Quality In-depth

✂️

System Prompt Audit

System prompts repeat every call. Every 1K tokens of system prompt = 1K input tokens per query. Audit ruthlessly.

Remove examples that aren't needed
Delete redundant instructions
Use terse style over verbose
Typical saving: 30–50%

🗜️

Context Window Pruning

For multi-turn conversations, don't send complete history. Summarize old turns instead.

Keep last N turns verbatim
Summarize older turns (~10% of tokens)
Drop irrelevant retrieved docs
Typical saving: 40–70% on long chats

📏

Output Length Control

Output tokens typically cost 3–5× as much as input tokens (on most models). Constrain output aggressively.

Set aggressive max_tokens
Ask for "concise" / "one sentence"
Request JSON vs prose
Typical saving: 20–40%

Over-Compression Kills Quality

Prompt compression has diminishing and then negative returns. Removing too much context causes the model to hallucinate missing information, produce wrong formats, or miss nuance. Always run your eval suite after compressing prompts. The goal is to remove tokens that don't affect output quality — not to minimize prompts at any cost.
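
Context window pruning, as described in the cards above, might be sketched like this. The `summarize` callable is a hypothetical placeholder (typically a cheap-model call), and messages are assumed to be plain dicts with `role` and `content` keys.

```python
from typing import Callable


def prune_history(
    messages: list[dict],
    keep_last: int,
    summarize: Callable[[list[dict]], str],
) -> list[dict]:
    """Keep system messages and the last N turns verbatim; summarize the rest."""
    system = [m for m in messages if m["role"] == "system"]
    turns = [m for m in messages if m["role"] != "system"]
    if len(turns) <= keep_last:
        return system + turns  # nothing old enough to summarize

    split = len(turns) - keep_last
    older, recent = turns[:split], turns[split:]
    summary = {
        "role": "system",
        "content": "Summary of earlier conversation: " + summarize(older),
    }
    return system + [summary] + recent
```

Because the summary replaces real context, this is exactly the kind of change that should be followed by an eval run, per the warning above.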

Cost Monitoring — You Can't Optimize What You Don't Measure Core

📊

Per-Request Cost Logging

Log tokens in + tokens out + model + cost for every LLM call. Attach user ID, feature name, request ID.

Cost per feature / per endpoint
Cost per user cohort
Identify expensive edge cases

🚨

Cost Anomaly Alerts

Alert when spend rate deviates from baseline. A 5× cost spike within 10 minutes usually signals a runaway loop rather than organic traffic growth.

Alert: hourly cost > 2× 7-day avg
Alert: single request > $0.50
Alert: daily budget > 80%

📈

Cost-Quality Dashboard

Track cost alongside quality metrics. A 30% cost reduction that drops quality 10% may not be worth it.

Cost per quality point (eval score)
Model routing effectiveness
Cache hit rate vs cost saved

🔍

Attribution by Feature

Know which feature is driving cost. Often one feature (e.g. agent with long context) drives 80% of LLM spend.

Tag every call with feature name
Cost breakdown by feature weekly
Prioritize optimization by cost share
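
The logging, attribution, and alerting ideas above can be combined in a small sketch. The price table is an illustrative assumption (USD per 1M tokens) and should be replaced with current provider pricing; the class and function names are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass

# Assumed prices in USD per 1M tokens, for illustration only.
PRICES = {
    "gpt-4o": {"input": 2.50, "output": 10.00},
    "gpt-4o-mini": {"input": 0.15, "output": 0.60},
}


@dataclass
class CallRecord:
    request_id: str
    user_id: str
    feature: str
    model: str
    input_tokens: int
    output_tokens: int

    @property
    def cost_usd(self) -> float:
        price = PRICES[self.model]
        return (
            self.input_tokens * price["input"] + self.output_tokens * price["output"]
        ) / 1_000_000


def cost_by_feature(records: list[CallRecord]) -> dict[str, float]:
    """Attribute spend to features so the most expensive one is easy to find."""
    totals: dict[str, float] = defaultdict(float)
    for record in records:
        totals[record.feature] += record.cost_usd
    return dict(totals)


def is_hourly_anomaly(current_hour_cost: float, seven_day_hourly_avg: float,
                      factor: float = 2.0) -> bool:
    """Implements the 'hourly cost > 2x 7-day avg' alert rule."""
    return current_hour_cost > factor * seven_day_hourly_avg
```

Emitting one `CallRecord` per LLM call is enough to answer most cost questions later: group by `feature` for attribution, by `user_id` for cohorts, and by hour for anomaly checks.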

∑ Chapter 08 — Key Takeaways

Three levers for cost: call volume (caching/batching), tokens per call (compression), model choice (tiering/routing)
Model tiering is highest ROI — 60–75% of queries can run on cheap models at 10–20× lower cost
Token budgets at every level prevent runaway costs: per request, per user per day, per API key per month, per feature, and system-wide
Batch API gives 50% cost reduction on any non-realtime workload — nightly evals, bulk embedding, background analysis
Prompt compression (audit, prune, constrain output) typically saves 30–60% tokens — always run evals after
Measure cost per feature — one expensive feature often drives 80% of LLM spend; find it and optimize it first

Chapter 09 · Infrastructure

Infrastructure — Self-Hosted vs API, GPUs, and Deployment

Your infrastructure choice, API inference vs self-hosting, is one of the most consequential technical decisions in an LLM system. The wrong choice can cost up to 10× more than necessary or require months of re-engineering. This chapter gives you the framework to choose correctly and build it right.

API Providers vs Self-Hosting — The Core Trade-off Core

Dimension	API Providers (OpenAI, Anthropic…)	Self-Hosted (vLLM, TGI…)
Setup time	Minutes — just an API key	Days to weeks (GPU, infra, tuning)
Model quality	Frontier models (GPT-4o, Claude 3.5)	Open-source models (Llama 3, Mistral)
Cost at low volume	Cheap — pay per token, no infra	Expensive — GPU cost even at idle
Cost at high volume	Linear: cost grows in proportion to tokens	Fixed GPU cost, amortizes over volume
Data privacy	Data leaves your infrastructure	Full data control, on-prem option
Rate limits	Provider-imposed, shared quotas	Your hardware, your limits
Maintenance	Zero — provider handles everything	GPU infra, model updates, monitoring
Customisation	Limited (fine-tuning via API)	Full — fine-tune, modify, distill

The Decision Rule

Use API providers until one of these triggers hits: (1) volume exceeds ~10M tokens/day (self-hosting becomes cheaper), (2) data privacy requirements mandate on-prem, (3) you need a customized/fine-tuned model, (4) rate limits block growth. Most companies never hit these triggers — API is the right default.
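
The volume trigger is easier to reason about as a break-even calculation. The sketch below is deliberately simplified: it compares only raw GPU rental against API token spend, and the example inputs (a ~$1.50/hr GPU and $2.50 per 1M tokens) are assumptions taken from figures used elsewhere in this guide.

```python
def gpu_cost_per_day(hourly_rate: float, num_gpus: int = 1) -> float:
    """A rented GPU is billed around the clock, even when idle."""
    return hourly_rate * 24 * num_gpus


def api_cost_per_day(tokens_per_day: float, price_per_million: float) -> float:
    return tokens_per_day * price_per_million / 1_000_000


def break_even_tokens_per_day(hourly_rate: float, price_per_million: float,
                              num_gpus: int = 1) -> float:
    """Daily token volume at which API spend equals GPU rental."""
    return gpu_cost_per_day(hourly_rate, num_gpus) / price_per_million * 1_000_000


# Assumed inputs: one GPU at $1.50/hr vs an API at $2.50 per 1M tokens.
threshold = break_even_tokens_per_day(1.50, 2.50)  # about 14.4M tokens/day
```

With these inputs the break-even lands in the same range as the ~10M tokens/day rule of thumb. The calculation ignores engineering time, GPU utilization, and quality differences between models, all of which push the real threshold higher, so treat the result as a lower bound.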

API Provider Comparison — Picking the Right Provider Stack Core

🟢

OpenAI (GPT-4o family)

Largest ecosystem, best tooling, JSON mode + Structured Outputs. Primary choice for most teams.

Models: GPT-4o, GPT-4o-mini, o1
Context: 128K tokens
Strengths: code, reasoning, function calling
Best for: general-purpose default

🟠

Anthropic (Claude family)

Long context leader (200K), best for document analysis. Excellent instruction following.

Models: Claude 3.5 Sonnet, Claude Haiku
Context: 200K tokens
Strengths: long-context, nuanced writing
Best for: document QA, long analysis

🔵

Google (Gemini family)

1M context window, multimodal by default, competitive pricing. Strong for bulk/cheap processing.

Models: Gemini 1.5 Pro, Gemini Flash
Context: 1M tokens
Strengths: massive context, multimodal
Best for: very long docs, video, cost efficiency

⚡

Groq / Together / Fireworks

Inference-optimized providers for open-source models. Token generation is often several times faster than on general-purpose hosted APIs.

Models: Llama 3, Mistral, Mixtral
Strengths: ultra-low latency (50ms TTFT)
Speeds: 200–800 tok/s vs 30–80 for OpenAI
Best for: latency-critical, open-source models

Self-Hosting Stack — vLLM, TGI, and Ollama In-depth

Tool	Use Case	Key Feature	Best For
vLLM	Production serving	PagedAttention (up to 24× higher throughput than HuggingFace Transformers in vLLM's benchmarks), continuous batching	High-volume production self-hosting
TGI (Text Gen Inference)	Production serving	HuggingFace ecosystem, OpenAI-compatible API	HuggingFace model ecosystem
Ollama	Dev / local	One-command model management, Mac/Linux support	Local development, testing, prototyping
llama.cpp	Edge / CPU	Quantized models on CPU (no GPU needed)	Edge deployment, air-gapped systems
LiteLLM	Proxy / abstraction	Unified OpenAI-compatible interface over 100+ models	Multi-model routing, provider abstraction

🚀

vLLM — getting started

# Install vLLM and serve Llama 3 8B with an OpenAI-compatible API (shell)
pip install vllm

# Serve the model on port 8000; --tensor-parallel-size is the number of GPUs
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Meta-Llama-3-8B-Instruct \
    --port 8000 \
    --tensor-parallel-size 1

# Use with the standard OpenAI client (Python)
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")
response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    messages=[{"role": "user", "content": "Hello!"}],
)

GPU Selection — Matching Hardware to Workload Core

GPU	VRAM	Max Model Size	Cloud Cost/hr	Best For
NVIDIA T4	16 GB	7B models (8-bit/4-bit; fp16 is a tight fit)	~$0.50	Dev/small inference, budget prod
NVIDIA A10G	24 GB	13B models (8-bit), 7B fp16 with headroom	~$1.50	Small production workloads
NVIDIA A100 (40GB)	40 GB	30B models (8-bit), 13B fp16	~$3.00–$4.00	Medium production, fine-tuning
NVIDIA A100 (80GB)	80 GB	70B models (4-bit), 30B fp16 with batching	~$5.00–$7.00	Large production workloads
NVIDIA H100 (80GB)	80 GB	70B models (quantized) at highest throughput	~$15–$25	Highest throughput demand

VRAM sizing rule: Model parameters × 2 bytes (fp16) = minimum VRAM for the weights. Llama 3 8B ≈ 16 GB, Llama 3 70B ≈ 140 GB (at least 2× A100 80GB). Always add about 20% headroom for the KV cache, or quantize to 8-bit/4-bit to fit larger models on smaller GPUs. For multi-GPU: use tensor parallelism (vLLM --tensor-parallel-size N).
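
The sizing rule translates directly into a small calculator. This is a rough estimate under the stated rule (weights plus a flat 20% headroom); real KV cache needs depend on batch size and context length, and the function names are hypothetical.

```python
import math


def estimate_vram_gb(params_billions: float, bytes_per_param: float = 2.0,
                     headroom: float = 0.20) -> float:
    """Weights (1B params x 2 bytes is about 2 GB in fp16) plus KV cache headroom."""
    weights_gb = params_billions * bytes_per_param
    return weights_gb * (1 + headroom)


def gpus_needed(params_billions: float, gpu_vram_gb: float,
                bytes_per_param: float = 2.0, headroom: float = 0.20) -> int:
    """Smallest GPU count whose combined VRAM covers the estimate."""
    total = estimate_vram_gb(params_billions, bytes_per_param, headroom)
    return math.ceil(total / gpu_vram_gb)


llama_8b_fp16 = estimate_vram_gb(8)                          # about 19.2 GB
llama_70b_fp16_gpus = gpus_needed(70, 80)                    # 168 GB on 80 GB cards
llama_70b_4bit_gpus = gpus_needed(70, 80, bytes_per_param=0.5)
```

Note that a 70B model in fp16 fits its weights on two 80 GB cards, but the estimate with headroom exceeds 160 GB, so the calculator asks for a third. In practice tensor parallelism usually wants a GPU count that divides the model's attention heads evenly, so teams often jump to four cards or quantize instead.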

Hybrid Architecture — API + Self-Hosted Together In-depth

Hybrid architecture — use the right infrastructure for each query type

A common pattern: self-host a capable open-source model (Llama 3 70B) for bulk, sensitive, or high-volume requests; use API providers (GPT-4o) for frontier-quality tasks. LiteLLM acts as a transparent proxy, making both look like the same OpenAI API to your application.

∑ Chapter 09 — Key Takeaways

API providers by default — only self-host when volume (>10M tokens/day), privacy, or customization demands it
Provider strengths: OpenAI (ecosystem, code), Anthropic (200K context, documents), Google (1M context, cheap), Groq (ultra-low latency)
vLLM is the production standard for self-hosting: PagedAttention gives up to 24× higher throughput than HuggingFace Transformers
VRAM sizing: model params × 2 bytes (fp16) + 20% headroom for KV cache
LiteLLM provides a unified OpenAI-compatible interface over all providers — use it to avoid vendor lock-in
Hybrid architecture: self-host for bulk/sensitive, use API for frontier quality — router decides per request

Chapter 10 · Real World

Case Studies — LLM System Design in Practice

Theory meets reality. Four complete system design walkthroughs: a customer support chatbot, a code assistant, a document Q&A pipeline, and a content generation system. Each applies the full design toolkit from this guide — architecture, model routing, caching, scaling, cost, and observability.

Case Study 1 — Customer Support Chatbot Core

Requirements

Scale: 50K queries/day, peak 200 req/min

Latency: <2s first token (streaming)

Quality: Grounded in knowledge base, accurate

Cost target: <$0.01/query

Constraints: Customer data privacy

Architecture Decisions

Pattern: Router → RAG → Chain

Primary model: GPT-4o-mini (80% of queries)

Complex model: GPT-4o (20% escalated)

Cache: Exact + semantic (Redis + Qdrant)

Queue: SQS for burst absorption

Customer support chatbot — full architecture

Metric	Target	How Achieved
Cost per query	$0.008 (vs $0.025 naive)	Routing 80% → mini + 25% cache hit rate
P95 latency	1.2s TTFT	Streaming + cache + mini for most queries
Hallucination rate	<1%	RAG grounding + output validator
Availability	99.9%	OpenAI primary → Anthropic fallback → cached degraded

Case Study 2 — Code Assistant (IDE Plugin) In-depth

Requirements

Scale: 200K completions/day (heavy users)

Latency: <500ms first token (inline autocomplete)

Quality: Context-aware, correct syntax

Cost target: <$0.005/completion

Special: Must work on private repos (data sensitivity)

Architecture Decisions

Inline completions: Self-hosted Llama 3 8B (privacy)

Complex generation: GPT-4o via API (quality)

Context: Sliding window of relevant files

Cache: Prefix cache (same method stubs = cache hit)

Infra: vLLM + 2× A100 80GB

⌨️

Inline Autocomplete

Self-hosted Llama 3 8B on vLLM. 300ms P90 latency. No data leaves the company.

vLLM with speculative decoding
Prefix caching on file headers
Code-specific fine-tune
Cost: GPU amortized (~$0.0001)

💬

Chat / Explain

User asks "explain this function" or "refactor this code" → GPT-4o via API for quality.

Full conversation context
Code context injection (RAG)
OpenAI API (data anonymized)
Cost: ~$0.02/chat exchange

🔍

Semantic Search

Find relevant code snippets across repo. Embedding + vector search, no LLM call.

text-embedding-3-small
Qdrant vector store
Embeds on file save
Cost: <$0.0001/search

Context Window Management is Critical

Code assistants are uniquely difficult because relevant context (imports, type definitions, caller code) may be spread across many files. Naively concatenating files burns the context window and buries the relevant code in noise. Use semantic retrieval to find the 3–5 most relevant code chunks, not the 100 lines immediately above the cursor. This also cuts input token cost by 60–80%.

Case Study 3 — Document Q&A System (Enterprise) Core

Requirements

Scale: 5K queries/day over 100K documents

Documents: PDFs, Word docs, PPTs up to 500 pages

Quality: Grounded answers with citations, no hallucination

Latency: <5s (batch-acceptable)

Cost target: <$0.05/query

Architecture Decisions

Pattern: RAG with re-ranking

Chunking: Semantic (512 tokens, 10% overlap)

Retrieval: Hybrid (BM25 + dense), top-20 → re-rank → top-5

Model: Claude 3.5 Sonnet (200K context)

Vector store: Qdrant (self-hosted)

Pipeline Stage	Component	Why This Choice
Document parsing	Unstructured.io (PDF, PPTX, DOCX)	Handles tables, images, complex layouts
Chunking	Semantic chunking (sentence-transformers)	Respects paragraph/section boundaries
Embedding	text-embedding-3-large (3072d)	Best retrieval quality on enterprise docs
Retrieval	Hybrid BM25 + dense (RRF fusion)	Dense catches semantics; BM25 catches keywords
Re-ranking	Cohere Rerank v3 (top-20 → top-5)	+15% answer accuracy vs raw retrieval
Generation	Claude 3.5 Sonnet + citation prompting	Long context + strong grounding instructions
Validation	Answer grounding check (LLM judge)	Catches hallucinations before serving
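
The RRF fusion step in the retrieval row can be sketched as follows. Reciprocal Rank Fusion scores each document as the sum of 1/(k + rank) over the rankings it appears in; k = 60 is the commonly used default, and the document IDs below are made up for illustration.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of document IDs into one ranking."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


bm25_hits = ["d1", "d2", "d3"]    # keyword ranking
dense_hits = ["d3", "d1", "d4"]   # embedding ranking
fused = reciprocal_rank_fusion([bm25_hits, dense_hits])
# fused == ["d1", "d3", "d2", "d4"]
```

Documents that appear in both lists rise to the top even if neither retriever ranked them first, which is the behaviour that makes hybrid retrieval robust. The fused top-20 would then go to the re-ranker.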

Case Study 4 — Content Generation Pipeline In-depth

Requirements

Scale: 50K articles/month, burst possible

Quality: Brand-consistent, SEO-aware, fact-checked

Latency: Hours acceptable (batch job)

Cost target: <$0.50/article

Multi-step: Brief → outline → draft → edit → SEO

Architecture Decisions

Pattern: Sequential chain (5 stages)

Batch API: Yes — 50% cost savings

Models: Mix of GPT-4o (quality) + mini (SEO, outline)

Queue: Celery + Redis for async pipeline

Human-in-loop: Review gate before publish

1️⃣ Brief Input: topic, audience, keywords

2️⃣ Outline: mini (cheap)

3️⃣ Draft: GPT-4o (quality)

4️⃣ Edit + Fact-check: GPT-4o + search tool

5️⃣ SEO optimize: mini (cheap, structured)

💰

Cost Breakdown per Article

Stage 2 (outline, mini): $0.003
Stage 3 (draft, GPT-4o): $0.28
Stage 4 (edit + fact-check, GPT-4o): $0.12
Stage 5 (SEO, mini): $0.005
Total: ~$0.41 (vs $0.85 naive)

⚡

Optimization Wins

Batch API: saves 50% on outline + SEO
Model routing: mini for cheap steps
Prompt compression: 40% shorter system prompt
Result: $0.41 vs $0.85 baseline

✅

Quality Controls

Fact-check step with web search
Brand voice eval (LLM judge)
SEO score check (keyword density)
Human review gate before publish

Production Launch Checklist — Before You Ship Core

🛡️

Reliability

✅ Multi-provider fallback configured and tested
✅ Retry logic with exponential backoff
✅ Circuit breaker on all LLM calls
✅ Graceful degradation path defined
✅ Timeout set on every LLM call

💸

Cost Controls

✅ Per-request max_tokens set
✅ Per-user token budget enforced
✅ Provider spend limits configured
✅ Cost anomaly alerts firing
✅ Model routing tested and validated

📊

Observability

✅ Every LLM call logged (tokens, cost, latency)
✅ Trace IDs propagated end-to-end
✅ Error rates dashboarded
✅ P50/P95/P99 latency tracked per endpoint
✅ Evals running in CI on prompt changes

🔒

Security

✅ API keys in secrets manager (not env files)
✅ Output sanitization before rendering
✅ Input length limits enforced
✅ Prompt injection mitigations in place
✅ PII not logged in traces

∑ Chapter 10 — Key Takeaways

Customer support chatbot: Router → RAG → Chain with model tiering (mini for 80%) achieves $0.008/query vs $0.025 naive
Code assistant: Self-host for inline completions (privacy + speed), API for complex chat — best of both worlds
Document Q&A: Hybrid retrieval + re-ranking + citation prompting → grounded, cited answers (re-ranking adds about 15% answer accuracy over raw retrieval)
Content generation: Multi-step chain + batch API + model tiering = 50% cost reduction vs single-model pipeline
Every production system needs: fallbacks, cost controls, observability, and security — all of them, before launch
The best architecture evolves from simple to complex — driven by measured gaps, not anticipated ones
