
LLM System Design

Architecture patterns for LLM applications: from single-model APIs to multi-model orchestration, scaling, and production infrastructure.

Building with LLMs is not like building traditional software. Latency is measured in seconds, costs scale with tokens, and outputs are non-deterministic. This guide teaches you to design systems that work despite, and because of, these constraints.

Chapter 01 · Foundations
LLM Design Principles: What Makes LLM Systems Different

Every design decision in an LLM system is shaped by three constraints that don't exist in traditional software: non-deterministic outputs, second-scale latency, and per-token costs. Understanding these isn't optional; it's the foundation everything else builds on.

LLMs are not reliable components by default. A production LLM system is not just LLM + prompt. It is:

The four layers every production LLM system requires
  • LLM: non-deterministic
  • Control Layer: validate · retry · budget · timeout
  • Data: knowledge · cache · history
  • Evaluation: quality · cost · latency

🎛️
Without a Control Layer

Systems become unpredictable, expensive, and hard to debug. The LLM produces output; the control layer decides whether to trust it.

  • Costs spike without warning
  • Errors cascade silently
  • No visibility into failures
⚙️
What the Control Layer Enforces

Limits retries, enforces timeouts, tracks token usage, terminates loops, validates output format before downstream use.

  • Max retries: 2–3 per call
  • Per-call timeout: 10–30s
  • Token budget per request
Evaluation Is Not Optional

Without an eval layer, you can't know if your system is working or regressing. Every production system needs a baseline quality signal.

  • Offline evals in CI
  • Online quality sampling
  • Cost-per-quality tracking

Traditional software is deterministic, fast, and free at the margin. LLM software is none of these. Every architecture decision must account for these three constraints, or the system will be unreliable, slow, or unaffordable.

The three constraints that shape every LLM system design decision
  • Non-determinism: same input → different output; can't unit test like normal code; outputs may be wrong, partial, or formatted unexpectedly. → Build for variability, not correctness.
  • Latency: 200ms–5s per LLM call; 10–100× slower than a DB query; scales with token count; streaming hides but doesn't fix it. → Design for perceived speed.
  • Cost: $0.001–$0.06 per 1K tokens; cost grows with every call; context = input tokens = money; no fixed pricing, usage-based. → Minimize calls and tokens.

Traditional Software

Deterministic: f(x) = y, always

Fast: <50ms response times typical

Cheap: Fixed infra cost, marginal cost ~$0

Testable: Unit tests verify exact behavior

LLM Software

Non-deterministic: f(x) ≈ y (different each time)

Slow: 200ms–5s per call, seconds for complex tasks

Expensive: Per-token pricing, cost grows with usage

Hard to test: Eval suites, fuzzy matching, LLM-as-judge

Every production LLM system includes a control layer, whether engineers planned for it or grew it reactively. This layer is responsible for bounding cost, latency, and failure cascades that the LLM itself cannot prevent.

  ① Request: enters system
  ② LLM Call: non-deterministic
  ③ Validate: format + content
  ④ Retry?: if failed, max 2–3×
  ⑤ Fallback: cheaper model / cache
  ⑥ Return: controlled output

| Control Responsibility | Without It | Implementation |
| --- | --- | --- |
| Retry limiting | Infinite retries → runaway cost | Max 2–3 retries with exponential backoff |
| Timeout enforcement | Stalled requests block workers forever | Hard timeout per call (10–30s) |
| Token tracking | A single request can burn the budget | Count tokens before and after each call |
| Loop termination | Agent loops run indefinitely | Max steps cap (e.g. 10) + cost circuit breaker |
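
The request flow above can be sketched as a small wrapper around any async model call. This is a minimal illustration, not a definitive implementation: `controlled_call`, `primary`, `fallback`, `validate`, and `backoff_s` are hypothetical names, and the defaults simply mirror the ranges in the table.

```python
import asyncio
from typing import Awaitable, Callable, Optional

# Hypothetical type: any async function that takes a prompt and returns text.
LLMCall = Callable[[str], Awaitable[str]]


async def controlled_call(
    prompt: str,
    primary: LLMCall,
    fallback: LLMCall,
    validate: Callable[[str], bool],
    max_retries: int = 2,
    timeout_s: float = 20.0,
    backoff_s: float = 0.5,
) -> Optional[str]:
    """Call `primary` with a hard timeout and bounded retries, then fall back."""
    for attempt in range(max_retries + 1):
        try:
            output = await asyncio.wait_for(primary(prompt), timeout=timeout_s)
            if validate(output):
                return output
        except asyncio.TimeoutError:
            pass  # a stalled call counts as a failed attempt
        if attempt < max_retries:
            # Exponential backoff between attempts: backoff_s, 2x, 4x, ...
            await asyncio.sleep(backoff_s * 2 ** attempt)
    # Retries exhausted: one attempt on the fallback path, still validated.
    try:
        output = await asyncio.wait_for(fallback(prompt), timeout=timeout_s)
    except asyncio.TimeoutError:
        return None
    return output if validate(output) else None
```

Returning `None` leaves the graceful "I don't know" response to the caller; token tracking and loop caps would sit alongside this wrapper rather than inside it.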

In traditional software, if the code is correct, the output is correct. In LLM systems, even correct code can produce wrong outputs. The model might hallucinate, return malformed JSON, miss a key fact, or give a subtly wrong answer. Your system must handle this gracefully.

🔄
Output Variability

Same prompt, same model, temperature=0 → still slightly different outputs across calls. Structure your system to tolerate variation.

  • Parse loosely, validate strictly
  • Use structured output (JSON mode)
  • Retry on format failures
❌
Failure as Normal

LLM calls fail regularly: hallucination, refusal, rate limit, malformed output. Unlike traditional APIs, failure is not exceptional; it's expected.

  • Design retry + fallback for every call
  • Validate outputs before trusting them
  • Never trust LLM output as ground truth
✅
Evaluation-Driven Development

You can't unit-test LLM outputs. You need eval suites: sets of (input, expected_output) pairs scored by automated metrics or LLM-as-judge.

  • Build eval set before building features
  • Run evals on every prompt change
  • Treat evals like integration tests
The Design Principle

Never make an LLM call a single point of failure. Every LLM call in your system should have: (1) output validation, (2) a retry strategy, (3) a fallback path (cheaper model, cached result, or graceful "I don't know"). Systems that treat LLM calls like database queries (reliable and deterministic) break in production.

LLM failures are not limited to incorrect answers. The subtler class of behavioral failures is harder to detect and more dangerous in production because they often pass naive validation.

| Behavioral Failure | What It Looks Like | Detection Strategy |
| --- | --- | --- |
| Repeated mistakes across retries | Each retry returns the same wrong answer; retrying is futile | Hash outputs across retries; escalate if identical failures |
| Partially correct but misleading | 90% correct + 10% confidently wrong; harder to catch than fully wrong | LLM-as-judge or grounding checks on key claims |
| Instruction ignored | Model responds in wrong language, skips a required field, ignores constraints | Schema validation + presence checks on required fields |
| Overconfident wrong answer | Model says "definitely X" when it should say "I don't know" | Calibration eval; add explicit uncertainty instructions |
| Inconsistent outputs | Same question → different answers across sessions (inconsistent brand voice, logic) | Consistency eval suite; lock temperature to 0 for deterministic tasks |
Retrying an Already-Failed Prompt Is Usually Wrong

If a model returns the same malformed output on three consecutive retries, more retries will not help: the prompt or schema is the problem. Build retry logic that modifies the prompt on failure (e.g. appending "Return valid JSON. Error was: …") rather than blindly resending the same request. Blind retries multiply cost with zero benefit.
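
A minimal sketch of that retry logic, assuming a synchronous `call_llm(prompt) -> str` function (a placeholder, not a real client API). It appends the parse error to the prompt on failure and hashes outputs so that a repeated identical failure escalates instead of burning more retries.

```python
import hashlib
import json
from typing import Any, Callable


def retry_with_feedback(
    prompt: str, call_llm: Callable[[str], str], max_retries: int = 2
) -> Any:
    """Retry on invalid JSON with error feedback; stop on repeated identical output."""
    seen_hashes = set()
    current_prompt = prompt
    for _ in range(max_retries + 1):
        output = call_llm(current_prompt)
        try:
            return json.loads(output)
        except json.JSONDecodeError as err:
            digest = hashlib.sha256(output.encode("utf-8")).hexdigest()
            if digest in seen_hashes:
                # Same bad output twice: retrying is futile, escalate instead.
                raise RuntimeError(
                    "identical failure repeated; fix the prompt or schema"
                ) from err
            seen_hashes.add(digest)
            # Rebuild from the original prompt so feedback does not pile up.
            current_prompt = f"{prompt}\n\nReturn valid JSON. Error was: {err}"
    raise RuntimeError("retries exhausted")
```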

LLM latency isn't a single number. It has two distinct phases, and understanding them changes how you design your system.

LLM latency breakdown: TTFT + generation time
  • TTFT (Time to First Token): 200ms–2s. Prompt processing; scales with INPUT tokens.
  • Token generation (streaming output): 20–80ms per token. Sequential token generation; scales with OUTPUT tokens.
  • Total: 0.5–5s typical, 10× slower than a DB query.

| Latency Factor | Impact | Design Implication |
| --- | --- | --- |
| Input tokens (prompt length) | +50–200ms per 1K tokens of input | Keep prompts short. Cache system prompts. Compress context. |
| Output tokens | +20–80ms per output token | Set max_tokens aggressively. Ask for concise responses. |
| Model size | Larger model = slower | Route simple tasks to smaller/faster models. |
| Provider load | Peak times = 2–3× latency | Multi-provider failover, request queuing. |
| Streaming | Reduces perceived latency to TTFT (~200ms) | Always stream for user-facing responses. |
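
The two-phase model can be turned into a back-of-the-envelope estimate. The sketch below is illustrative only: the function name is made up, and the default coefficients are assumed midpoints of the ranges in the table, not measurements from any provider.

```python
def estimate_latency_ms(
    input_tokens: int,
    output_tokens: int,
    ttft_base_ms: float = 200.0,         # assumed fixed overhead before the first token
    ms_per_1k_input: float = 100.0,      # assumed, within the 50-200ms range above
    ms_per_output_token: float = 40.0,   # assumed, within the 20-80ms range above
) -> tuple:
    """Return (ttft_ms, total_ms) under a simple linear latency model."""
    ttft_ms = ttft_base_ms + ms_per_1k_input * input_tokens / 1000
    total_ms = ttft_ms + ms_per_output_token * output_tokens
    return ttft_ms, total_ms


# A 2K-token prompt with a 100-token answer: the user sees the first token
# after ~0.4s when streaming, but waits ~4.4s for the full response otherwise.
ttft_ms, total_ms = estimate_latency_ms(input_tokens=2000, output_tokens=100)
```

Under these assumptions output tokens dominate total time, which is why capping max_tokens usually helps more than trimming the prompt.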

In traditional SaaS, you pay for servers. In LLM systems, you pay for tokens: every input token and every output token, every call. This fundamentally changes how you think about system design.

| Model (2024–2025) | Input Cost | Output Cost | Cost per 1K queries (avg 2K in + 500 out) |
| --- | --- | --- | --- |
| GPT-4o | $2.50 / 1M tokens | $10.00 / 1M tokens | $10.00 |
| GPT-4o-mini | $0.15 / 1M tokens | $0.60 / 1M tokens | $0.60 |
| Claude 3.5 Sonnet | $3.00 / 1M tokens | $15.00 / 1M tokens | $13.50 |
| Claude 3 Haiku | $0.25 / 1M tokens | $1.25 / 1M tokens | $1.13 |
| Gemini 1.5 Flash | $0.075 / 1M tokens | $0.30 / 1M tokens | $0.30 |

The 16× cost difference matters. GPT-4o costs 16× more than GPT-4o-mini per query. If 70% of your queries are simple enough for mini, routing them saves ~60% of your LLM spend. Model routing is the single highest-ROI design decision for cost optimization.
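
The per-1K-query column and the routing savings follow from simple arithmetic. The sketch below reproduces it for the two OpenAI rows using the prices listed in the table; the function names are illustrative, and the blended figure ignores routing overhead such as a classifier call, so it is an upper bound on the savings.

```python
# (input $, output $) per 1M tokens, copied from the table above.
PRICES_PER_1M = {
    "gpt-4o": (2.50, 10.00),
    "gpt-4o-mini": (0.15, 0.60),
}


def cost_per_1k_queries(model: str, tokens_in: int = 2000, tokens_out: int = 500) -> float:
    """Dollar cost of 1,000 queries at the given average token counts."""
    price_in, price_out = PRICES_PER_1M[model]
    per_query = (tokens_in * price_in + tokens_out * price_out) / 1_000_000
    return per_query * 1000


def routing_savings(share_cheap: float, cheap: str = "gpt-4o-mini", expensive: str = "gpt-4o") -> float:
    """Fraction of spend saved by sending `share_cheap` of traffic to the cheap model."""
    baseline = cost_per_1k_queries(expensive)
    blended = share_cheap * cost_per_1k_queries(cheap) + (1 - share_cheap) * baseline
    return 1 - blended / baseline
```

With 70% of traffic on the cheap model, the blended cost is 0.7 × $0.60 + 0.3 × $10.00 = $3.42 per 1K queries, a saving of roughly two thirds before any classifier or escalation overhead is subtracted.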

LLM cost overruns rarely come from a single expensive call. They come from emergent system behavior: patterns that only appear at scale or under edge-case inputs.

Agent Loop Multiplication

1 user request → 5 agent steps → 3 retries each → 15 LLM calls for what should have been 1–2.

Fix: Cap steps (max_iterations=10), cap retries (max=2), add cost circuit breaker per request.

📈
Context Window Growth

Multi-turn conversations or document chains grow context with every turn, so the cumulative token bill grows roughly quadratically with conversation length. Turn 10 of a chat can have 10× the tokens of turn 1: same cost structure, very different price.

Fix: Summarize old turns. Never pass raw history unbounded.

⚑
Fan-out Without Limit

MapReduce on user-uploaded documents: a 500-page PDF becomes 250 LLM map calls. Multiply by daily uploads.

Fix: Cap input size. Estimate cost before processing. Require user confirmation above threshold.

🔄
Silent Retry Storms

A bug causes 100% of requests to fail validation → all retry 3× → 4× provider load → rate limiting → more retries. Cost spikes 4× with zero benefit.

Fix: Circuit breaker: if error rate >50% in 60s, stop retrying and alert.
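
A sliding-window breaker along those lines might look like the sketch below. `RetryCircuitBreaker`, `min_calls`, and the injectable `clock` are assumptions added for illustration; alerting is left to the caller.

```python
import time
from collections import deque


class RetryCircuitBreaker:
    """Stops retries when the recent failure rate exceeds a threshold."""

    def __init__(self, threshold=0.5, window_s=60.0, min_calls=10, clock=time.monotonic):
        self.threshold = threshold
        self.window_s = window_s
        self.min_calls = min_calls      # assumed guard: don't trip on a tiny sample
        self._clock = clock
        self._events = deque()          # (timestamp, failed) pairs, oldest first

    def _evict(self, now: float) -> None:
        # Drop events that have fallen out of the sliding window.
        while self._events and now - self._events[0][0] > self.window_s:
            self._events.popleft()

    def record(self, failed: bool) -> None:
        now = self._clock()
        self._events.append((now, failed))
        self._evict(now)

    def allow_retry(self) -> bool:
        self._evict(self._clock())
        if len(self._events) < self.min_calls:
            return True
        failures = sum(1 for _, failed in self._events if failed)
        return failures / len(self._events) <= self.threshold
```

Callers would check `allow_retry()` before each retry and raise an alert the first time it returns `False`.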

| Failure Mode | Traditional Software | LLM Systems |
| --- | --- | --- |
| Wrong output | Bug → fix code → fixed forever | Hallucination → tweak prompt → might recur |
| Slow response | Profile code → optimize → consistent improvement | Depends on provider load, token count, model; varies per call |
| Rate limiting | Scale horizontally, add servers | Provider-imposed limits, can't self-serve more capacity |
| Cost overrun | Fixed infrastructure cost, predictable | Usage-based, runaway loops can burn budget in minutes |
| Format errors | Type system prevents malformed data | LLM can return any string: malformed JSON, truncated output |
| Provider outage | Your infra, your control | Third-party dependency: OpenAI goes down, your app goes down |
The Provider Dependency

Most LLM systems depend on 1–2 providers (OpenAI, Anthropic). When they have outages (and they do, regularly), your entire application stops. Production systems need multi-provider failover: primary model (GPT-4o) → fallback model (Claude Sonnet) → degraded mode (cached responses or smaller model). Chapter 3 covers model selection and routing.

①
Minimize LLM Calls

Every LLM call costs money and time. Ask: can this be done without calling the LLM? Can I cache the result? Can I batch multiple queries?

②
Use the Cheapest Model That Works

Don't use GPT-4o for classification. Don't use Claude Opus for extraction. Route each task to the cheapest model that achieves acceptable quality.

③
Validate Every Output

LLMs can return anything. Parse, validate, and type-check every response. Use JSON mode / structured outputs where available.

④
Stream Everything User-Facing

A 3-second response feels fast when streamed token-by-token. Without streaming, users stare at a blank screen for 3 seconds. Always stream.

⑤
Design for Failure

Every LLM call can fail: hallucination, rate limit, timeout, malformed output. Retry, fall back, degrade gracefully; never crash on LLM failure.

⑥
Measure Everything

Track latency, cost, quality, and error rates per model, per endpoint, per prompt version. You can't optimize what you can't measure.

⑦
Build the Simplest System That Works

Don't start with agents, multi-model routing, and semantic caching. Start with a single model, a simple prompt, and an eval set. Add complexity only when measurement shows you need it. The best LLM system is the one with the fewest LLM calls.

Generic LLM application architecture: all the layers
  • Client Layer: Web app / API consumer / Mobile app
  • API Gateway: Auth, rate limiting, request routing, streaming SSE/WebSocket
  • Application Layer: Prompt builder, Output parser, Validator, Router, Retry / Fallback, Cache check
  • LLM Providers: OpenAI / Anthropic / Google / Self-hosted
  • Data: Vector DB / Cache (Redis) / DB / External APIs
  • Observability: Logging → Traces → Metrics → Cost tracking → Alerts → Eval pipeline

Every LLM application has these layers, whether you build them explicitly or not. The chapters that follow cover each in depth: architecture patterns (Ch 2), model selection (Ch 3), API design (Ch 4), caching (Ch 5), scaling (Ch 6), latency (Ch 7), cost (Ch 8), infrastructure (Ch 9), and real-world case studies (Ch 10).

Without structured logging, debugging LLM failures is guesswork. Every call must emit a structured log entry. For multi-step systems, every intermediate step must be logged, not just the final output.

Single-call minimum log

  • request_id: trace across services
  • user_id: cost attribution
  • prompt: sanitized if PII possible
  • response: actual output
  • tokens_in / tokens_out: cost calculation
  • latency_ms: performance tracking
  • model: which model was used
  • retries: number of retry attempts
  • fallback_triggered: boolean

Multi-step / agent additional log

  • step_index: which step in the chain
  • step_type: llm_call / tool_call / validate
  • tool_name + tool_input + tool_output
  • intermediate_output: per step
  • total_cost_usd: running total
  • terminated_early: if circuit breaker fired
  • loop_count: for agent iterations
  • parent_request_id: for sub-calls
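
One way to keep the single-call fields consistent is to define them once as a typed record and emit JSON. The sketch below is an assumption about shape, not a prescribed schema: the class name `LLMCallLog` and the choice of a generated hex `request_id` are illustrative.

```python
import json
import uuid
from dataclasses import asdict, dataclass, field


@dataclass
class LLMCallLog:
    """One structured log entry per LLM call (single-call minimum fields)."""

    user_id: str
    model: str
    prompt: str            # sanitize before logging if PII is possible
    response: str
    tokens_in: int
    tokens_out: int
    latency_ms: float
    retries: int = 0
    fallback_triggered: bool = False
    # Generated here for the sketch; normally propagated from the incoming request.
    request_id: str = field(default_factory=lambda: uuid.uuid4().hex)

    def to_json(self) -> str:
        # Sorted keys keep log lines stable and easy to diff.
        return json.dumps(asdict(self), sort_keys=True)


entry = LLMCallLog(
    user_id="u-42", model="gpt-4o-mini", prompt="Classify: ...",
    response="positive", tokens_in=120, tokens_out=3, latency_ms=412.5,
)
log_line = entry.to_json()
```

The multi-step fields (step_index, step_type, parent_request_id, and so on) could extend this record in a subclass so that agent steps share the same base schema.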

In the architecture diagram, the Application Layer appears as a thin band. In practice, it becomes the largest and most complex part of the system. Unlike the LLM layer (a managed API) and the data layer (a database), the application layer is entirely custom code that grows with every feature.

The Application Layer Tax

The application layer accumulates prompt templates, output parsers, validation schemas, retry logic, routing rules, cost tracking hooks, fallback chains, streaming wrappers, and tool dispatch, each added to handle a specific failure in production. Design it to be modular from day one: test each component independently, make each observable, and version your prompt templates. A monolithic application layer is the #1 source of debugging pain in mature LLM systems.

1️⃣ Single model: one prompt, one call
2️⃣ Add eval: measure quality
3️⃣ Add caching: reduce cost/latency
4️⃣ Add routing: cheap model for simple
5️⃣ Add RAG: if knowledge needed
The Complexity Tax

Every component you add (RAG, routing, caching, agents) adds failure modes, latency, and maintenance burden. A single GPT-4o call with a good prompt can often outperform an overengineered pipeline with multiple models and retrieval steps. Only add complexity when your eval suite proves simple isn't good enough.

Most production LLM systems do not need agents, multi-model orchestration, or complex pipelines. A single well-designed prompt and model solves the majority of use cases, at lower cost, lower latency, and higher reliability.

| Component | Add It When… | Don't Add It Because… |
| --- | --- | --- |
| Agents / orchestration | Control flow is genuinely dynamic; tool use is required | It looks powerful or is trending |
| Multi-model routing | Evals show 1 model can't handle all task types and costs differ | You want to use multiple providers |
| RAG pipeline | Knowledge is too large or dynamic for the context window | You have <100 documents |
| Semantic caching | Exact cache hit rate <5% and queries are repetitive in meaning | You haven't measured exact cache hit rate yet |
| Self-hosted models | Volume exceeds ~10M tokens/day or strict data privacy mandate | You want more control in principle |

The best LLM system is the one with the fewest LLM calls. Every call is a source of latency, cost, and non-determinism. Reduce calls through caching, batching, and simpler architectures, and your system will be faster, cheaper, and more reliable.

∑ Chapter 01 — Key Takeaways

  • LLM systems are constrained by three things traditional software isn't: non-determinism, second-scale latency, and per-token costs
  • Never make an LLM call a single point of failure: validate outputs, retry, fall back, degrade gracefully
  • Latency has two phases: TTFT (input processing) and generation (output tokens); streaming reduces perceived latency to TTFT
  • Cost varies 16× across models; routing simple tasks to cheap models is the highest-ROI optimization
  • LLM failure modes are different: hallucination, rate limits, format errors, provider outages. Design for all of them
  • Seven principles: minimize calls, cheapest model, validate outputs, stream, design for failure, measure, keep it simple
  • Start with the simplest system β€” add complexity only when evaluation proves you need it
  • The best LLM system is the one with the fewest LLM calls
Chapter 02 · Patterns
Architecture Patterns: Common LLM Application Architectures

Every LLM application is built from a small set of composable patterns. Understanding these patterns lets you pick the right architecture for your problem instead of over-engineering or under-building. Start with the simplest pattern that works.

①
Single Call

One prompt → one LLM call → one response. The simplest possible pattern.

  • Classification, extraction, summarization
  • Latency: 200ms–2s
  • Cost: 1 LLM call
  • Start here.
②
Chain (Sequential)

Output of call A becomes input to call B. Multi-step processing with deterministic order.

  • Extract → classify → format
  • Latency: N × single call
  • Cost: N LLM calls
  • Each step can use a different model
③
Router

Classify input first, then route to the appropriate handler (model, prompt, or pipeline).

  • Intent detection → specialized pipeline
  • Latency: classifier + handler
  • Cost: 1 cheap classify + 1 handler
  • Key pattern for model routing
④
Parallel Fan-out

Send the same input to multiple LLM calls simultaneously, aggregate results.

  • Generate 3 drafts → pick best
  • Latency: max(calls) not sum
  • Cost: N × single call
  • Needs aggregation logic
⑤
MapReduce

Split large input into chunks, process each (map), then combine results (reduce).

  • Summarize 100-page document
  • Latency: map (parallel) + reduce
  • Cost: N map calls + 1 reduce
  • Handles inputs beyond context window
⑥
Orchestrator (Agent)

LLM decides what to do next in a loop. Non-deterministic control flow.

  • Tool use, multi-step reasoning
  • Latency: unpredictable (3–15 steps)
  • Cost: high, variable
  • Use only when others can't work
| Pattern | LLM Calls | Latency | Predictability | Best For |
| --- | --- | --- | --- | --- |
| Single Call | 1 | 200ms–2s | High | Classification, extraction, simple Q&A |
| Chain | 2–5 | 1–5s | High | Multi-step processing, transform pipelines |
| Router | 2 | 0.5–3s | High | Cost optimization, intent-based dispatch |
| Parallel Fan-out | N (parallel) | max(calls) | High | Quality improvement, consensus, diversity |
| MapReduce | N+1 | 1–10s | High | Large docs, batch processing |
| Orchestrator | 3–15+ | 3–30s+ | Low | Dynamic multi-step, tool use, research |

In tool-using agents and routers, there is a non-obvious scaling limit: as the number of available tools grows, model tool selection accuracy degrades. More tools = more confusion, not more capability.

✅
5–10 tools

Reliable selection. Model consistently picks the right tool, descriptions are easy to differentiate.

  • Selection accuracy: ~95%
  • Manageable prompt overhead
  • Good tool descriptions sufficient
⚠️
11–20 tools

Noticeable degradation. Model occasionally picks wrong tool or combines incompatible tools.

  • Selection accuracy: ~80–85%
  • Requires more specific descriptions
  • Needs mitigation strategies
❌
20+ tools

Significant degradation. Tool selection becomes a primary failure source, outweighing other problems.

  • Selection accuracy: <70%
  • High hallucination of tool names
  • Requires structural mitigation
Three Mitigation Strategies

(1) Group tools by function: expose only the relevant group per task (search tools vs write tools vs compute tools). (2) Use a router before tool exposure: classify intent first, then present only the 3–5 relevant tools for that intent category. (3) Limit visible tools per step: in multi-step agents, expose only the tools needed for the current step. The goal: never present more than 10 tools at once.
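
Strategies (1) and (2) combine naturally: classify first, then expose only the matching group. In the sketch below the tool names, the `TOOL_GROUPS` registry, and the `classify_intent` callable are all hypothetical placeholders; a real classifier would typically be a cheap model call.

```python
from typing import Callable, Dict, List

# Hypothetical tool registry, grouped by function as in strategy (1).
TOOL_GROUPS: Dict[str, List[str]] = {
    "search": ["web_search", "doc_search", "code_search"],
    "write": ["create_file", "edit_file", "send_email"],
    "compute": ["run_python", "calculator"],
}
MAX_VISIBLE_TOOLS = 10


def tools_for_request(
    user_input: str,
    classify_intent: Callable[[str], str],
    groups: Dict[str, List[str]] = TOOL_GROUPS,
    limit: int = MAX_VISIBLE_TOOLS,
) -> List[str]:
    """Route first, then expose only the matching tool group, capped at `limit`."""
    intent = classify_intent(user_input)
    # Unknown intent: expose no tools rather than the whole registry.
    return groups.get(intent, [])[:limit]
```

Returning an empty list for unknown intents is a deliberate choice in this sketch: the model answers without tools instead of choosing among every tool in the registry.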

Router pattern: classify intent, dispatch to specialized handler
  • User Input → Classifier (GPT-4o-mini, ~100ms, $0.0002)
  • Simple Q&A → mini (70% of queries, cheap)
  • Complex → GPT-4o (20% of queries, quality)
  • Knowledge → RAG pipeline (10% of queries, retrieval)
  • Response: ~60% cost savings

MapReduce pattern: process large documents that exceed context window
  • Large Doc (100 pages) → split
  • Map (parallel, fast): summarize chunk 1, summarize chunk 2, …, summarize chunk N
  • Reduce: combine summaries → Final Summary
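
The MapReduce flow can be sketched with asyncio. Everything here is an assumption for illustration: `summarize` stands in for any async model call, chunking is by character count rather than tokens, and `max_parallel` is an added cap so the fan-out cannot exceed provider rate limits.

```python
import asyncio
from typing import Awaitable, Callable, List

# Hypothetical type: an async function that summarizes a piece of text.
Summarize = Callable[[str], Awaitable[str]]


def split_into_chunks(text: str, chunk_chars: int) -> List[str]:
    """Naive fixed-size split; real systems usually split on tokens or sections."""
    return [text[i:i + chunk_chars] for i in range(0, len(text), chunk_chars)]


async def map_reduce_summary(
    text: str, summarize: Summarize, chunk_chars: int = 4000, max_parallel: int = 8
) -> str:
    chunks = split_into_chunks(text, chunk_chars)
    semaphore = asyncio.Semaphore(max_parallel)

    async def map_one(chunk: str) -> str:
        async with semaphore:  # bound the number of concurrent map calls
            return await summarize(chunk)

    # Map: N calls run concurrently, results come back in chunk order.
    partials = await asyncio.gather(*(map_one(c) for c in chunks))
    # Reduce: one final call over the joined partial summaries.
    return await summarize("\n".join(partials))
```

This makes N map calls plus 1 reduce call. If the joined partial summaries still exceed the context window, the reduce step would need to be applied recursively, which this sketch does not do.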

Real LLM applications compose multiple patterns. A customer support system might use: Router (classify intent) → RAG (retrieve knowledge for FAQ queries) → Chain (extract + respond for complex issues) → Agent (multi-step for account changes).

The Progression

Most successful LLM applications follow this evolution: Single Call (prototype) → Chain (add structure) → Router (add cost optimization) → RAG/MapReduce (add knowledge) → Agent (add autonomy, only if needed). Each step is driven by evaluation showing the simpler pattern isn't sufficient.

Despite the popularity of agents in demos and research, the majority of production LLM systems use simpler patterns, because simpler patterns are cheaper, faster, and more predictable.

What most production systems use

  • Single call: extraction, classification, summarization
  • Chain: structured multi-step processing
  • Router: cost and quality optimization
  • RAG pipeline: knowledge-grounded Q&A

These cover ~90% of real-world LLM use cases.

When agents are actually justified

  • Control flow is genuinely dynamic and can't be predetermined
  • Tool use is required (search, code execution, APIs)
  • Problem requires multi-step reasoning with branching
  • Simpler patterns have been tried and measured as insufficient

The Agent Complexity Cost

Agents add three compounding costs: latency (3–15 LLM calls instead of 1–2), cost (multiplicative with step count), and unpredictability (non-deterministic control flow is hard to test and debug). If a Chain or Router can solve the problem, use it. Agents should be the last resort, not the first architecture.

∑ Chapter 02 — Key Takeaways

  • Six core patterns: Single Call, Chain, Router, Parallel Fan-out, MapReduce, Orchestrator
  • Single Call first: most tasks don't need multi-step processing
  • Router is the key cost pattern: classify intent, dispatch to cheapest capable handler
  • MapReduce handles documents larger than context window: map in parallel, reduce to one answer
  • Orchestrator (Agent) is the most powerful but most expensive and unpredictable; use it last
  • Real systems compose patterns: router → chain → RAG is a common production stack
Chapter 03 · Models
Model Selection & Routing: Picking the Right Model for Each Task

There is no "best model." There is only the best model for this task at this cost. GPT-4o is overkill for classification. Haiku is too weak for complex reasoning. Model selection and routing is how you get quality and affordability.

| Capability | GPT-4o | GPT-4o-mini | Claude Sonnet | Claude Haiku | Gemini Flash |
| --- | --- | --- | --- | --- | --- |
| Complex reasoning | ★★★★★ | ★★★ | ★★★★★ | ★★★ | ★★★ |
| Code generation | ★★★★★ | ★★★★ | ★★★★★ | ★★★ | ★★★★ |
| Classification | ★★★★★ | ★★★★★ | ★★★★★ | ★★★★ | ★★★★ |
| Extraction | ★★★★★ | ★★★★ | ★★★★★ | ★★★★ | ★★★★ |
| Long context | 128K | 128K | 200K | 200K | 1M |
| Speed | Medium | Fast | Medium | Very Fast | Very Fast |
| Cost | $$$ | $ | $$$ | $ | $ |
The Key Insight

For classification, extraction, and simple formatting, cheap models (mini, Haiku, Flash) perform within 1–2% of frontier models, at 10–20× lower cost. Reserve expensive models for complex reasoning, nuanced writing, and multi-step analysis.

📋
Task-based Routing

Map task types to models at design time. Simplest approach.

  • Classification → mini
  • Summarization → Haiku
  • Complex analysis → GPT-4o
  • Static, no runtime overhead
🤖
LLM-based Routing

Use a cheap model to classify query complexity, then route to the appropriate model.

  • Classifier (mini) → decides: simple/complex
  • Simple → mini ($0.0003)
  • Complex → GPT-4o ($0.02)
  • Overhead: 1 cheap LLM call
📊
Confidence-based Routing

Try cheap model first. If confidence is low, escalate to expensive model.

  • Try mini → if uncertain, try 4o
  • 80% of queries handled by mini
  • 20% escalated to 4o
  • Best quality-cost tradeoff
🔧
Confidence-based routing implementation

    import math
    from statistics import mean

    # cheap_model / expensive_model are placeholder client objects whose
    # generate() returns a response exposing per-token log-probabilities.
    async def route_query(query: str, cheap_model, expensive_model):
        """Try the cheap model first; escalate if it is uncertain."""
        # Step 1: try the cheap model and request per-token log-probabilities
        response = await cheap_model.generate(
            query,
            temperature=0,
            logprobs=True,  # used below as a confidence proxy
        )

        # Step 2: exp(mean log-prob) is the geometric-mean token probability, in (0, 1]
        avg_logprob = mean(response.logprobs)
        confidence = math.exp(avg_logprob)
        if confidence > 0.85:
            return response  # cheap model is confident, use it

        # Step 3: escalate to the expensive model
        return await expensive_model.generate(query, temperature=0)
1️⃣ Primary: GPT-4o
2️⃣ Fallback: Claude Sonnet
3️⃣ Budget: GPT-4o-mini
4️⃣ Cache: Cached response
5️⃣ Degrade: "Try again later"

| Trigger | Fallback Action | User Impact |
| --- | --- | --- |
| Provider timeout (>10s) | Switch to secondary provider | Slightly different style, same quality |
| Rate limit (429) | Queue + retry with backoff, or secondary | 200–500ms added delay |
| Provider outage | Switch to secondary provider entirely | Seamless if well-tested |
| All providers down | Serve cached responses for common queries | Stale but available |
| Budget exhausted | Route all traffic to cheapest model | Lower quality, still functional |
Test Your Fallbacks

A fallback chain that's never been tested doesn't work. Regularly simulate provider failures (chaos engineering) and verify: your fallback triggers correctly, the secondary provider returns compatible output, and your parsers handle the different response format. The worst time to discover your fallback is broken is during an actual outage.
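
A fallback chain is easy to exercise in tests when providers are injected as plain callables. The sketch below assumes synchronous `call(prompt) -> str` functions and an in-memory dict as the response cache; both are placeholders, and the broad `except Exception` is a simplification that real code would narrow to timeout, rate-limit, and outage errors.

```python
from typing import Callable, Dict, Sequence, Tuple

# Hypothetical provider entry: (name, function that calls the model).
Provider = Tuple[str, Callable[[str], str]]


def call_with_fallback(
    prompt: str,
    providers: Sequence[Provider],
    cache: Dict[str, str],
    degraded_message: str = "Try again later",
) -> Tuple[str, str]:
    """Walk the chain in order; return (source, response) so callers can log the path."""
    for name, call in providers:
        try:
            return name, call(prompt)
        except Exception:
            continue  # sketch only: move on to the next provider on any error
    if prompt in cache:
        return "cache", cache[prompt]  # stale but available
    return "degraded", degraded_message


def simulated_outage(prompt: str) -> str:
    """Stand-in for a provider that is down, for chaos-style tests."""
    raise TimeoutError("simulated provider outage")
```

Swapping `simulated_outage` in for the primary provider in a scheduled test verifies that the secondary path, the cache path, and the degraded message all trigger as intended.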

∑ Chapter 03 — Key Takeaways

  • There's no "best model", only the best model for this task at this cost
  • Cheap models (mini, Haiku, Flash) match frontier models for classification and extraction at 10–20× lower cost
  • Three routing strategies: task-based (static), LLM-based (classify then route), confidence-based (try cheap, escalate if unsure)
  • Confidence-based routing handles 80% of queries cheaply, escalates 20% to expensive models
  • Fallback chains across providers prevent single-provider outages from taking down your system
  • Test your fallbacks: untested fallback chains fail when you need them most
Chapter 04 · Interfaces
API Design: Designing LLM-Powered APIs

Your LLM system's API is the contract with your consumers. LLM-powered APIs have unique challenges: long response times, streaming output, non-deterministic results, and variable costs per call. Traditional REST patterns don't always apply.

Batch (Request-Response)

Client sends request, waits for complete response.

Pro: Simple, standard REST. Easy to cache, retry, log.

Con: User stares at spinner for 2–5s. Feels slow.

Best for: API-to-API calls, background processing, short responses.

Streaming (SSE / WebSocket)

Server sends tokens as they're generated.

Pro: First token in ~200ms. Feels instant. Better UX.

Con: Harder to cache, parse, and handle errors mid-stream.

Best for: User-facing chat, long responses, real-time interaction.

The Rule

Stream for humans, batch for machines. If the consumer is a user looking at a screen, stream via SSE. If the consumer is another service that needs a complete JSON response, use standard request-response. Many systems expose both: a streaming endpoint for the frontend and a batch endpoint for internal services.

⏱️
Synchronous + Timeout

Simple: send request, wait for response with a timeout. Best for calls under 10s.

  • Timeout: 10–30s
  • Return 504 on timeout
  • Client retries
📬
Job Queue (Submit + Poll)

Client submits job, gets job_id, polls for result. Best for calls 10s–5min.

  • POST /jobs → returns job_id
  • GET /jobs/{id} → status + result
  • Or: webhook on completion
📡
WebSocket (Bidirectional)

Persistent connection. Server pushes updates. Best for real-time + multi-turn.

  • Progress updates: "Searching..."
  • Streaming tokens
  • Client can cancel mid-generation

LLMs return strings. Your API consumers expect structured data. The gap between these two is where most production bugs live.

| Strategy | How | Reliability | When |
| --- | --- | --- | --- |
| JSON Mode | Provider-native (e.g. OpenAI: response_format: {"type": "json_object"}) | Very high: model is constrained to output valid JSON | Always, when available |
| Structured Outputs | OpenAI: define JSON schema, model must match it exactly | Highest: schema-enforced | When you need guaranteed schema |
| Prompt + parse | Ask for JSON in prompt, parse manually | Medium: model may add markdown fences, skip fields | When structured output unavailable |
| Retry on parse failure | If JSON parsing fails, retry with error feedback | Good with 1–2 retries | Always as fallback layer |
🔧
Robust output parsing with retry

    import json
    from pydantic import BaseModel, ValidationError

    class AnalysisResult(BaseModel):
        sentiment: str      # "positive" | "negative" | "neutral"
        confidence: float   # 0.0–1.0
        summary: str

    # llm is a placeholder client whose generate() returns the raw response text.
    async def get_structured_response(prompt, llm, max_retries=2):
        for attempt in range(max_retries + 1):
            response = await llm.generate(
                prompt, response_format={"type": "json_object"}
            )
            try:
                data = json.loads(response)
                return AnalysisResult(**data)  # validates against the schema
            except (json.JSONDecodeError, ValidationError, TypeError) as e:
                # TypeError covers valid JSON that is not an object (e.g. a list)
                if attempt == max_retries:
                    raise
                # Feed the error back so the retry differs from the failed request
                prompt += f"\n\nError: {e}. Return valid JSON matching the schema."

LLM provider rate limits are strict and per-organization. If one customer sends 1000 requests, they can exhaust your rate limit for everyone. You need rate limiting at your API layer too.

👤
Per-User Limits

Cap requests per user per minute. Prevents one user from starving others.

  • Free tier: 10 req/min
  • Pro tier: 60 req/min
  • Return 429 with Retry-After header
💰
Token Budgets

Cap total tokens per user per day/month. Prevents cost overrun.

  • Track cumulative tokens per API key
  • Return 429 when budget exhausted
  • Dashboard showing usage
🚦
Global Backpressure

When approaching provider rate limits, queue requests instead of failing.

  • Request queue with priority
  • Return 202 + job_id when queued
  • Shed load during peak
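
Per-user request limits are commonly implemented as a token bucket. The sketch below keeps state in process memory and is illustrative only: `PerUserRateLimiter` and its injectable `clock` are assumed names, and a multi-instance deployment would need a shared store such as Redis instead of a dict.

```python
import time
from typing import Dict, Tuple


class PerUserRateLimiter:
    """Token bucket per user: `requests_per_minute` burst, refilled continuously."""

    def __init__(self, requests_per_minute: int, clock=time.monotonic):
        self.capacity = float(requests_per_minute)
        self.refill_per_s = requests_per_minute / 60.0
        self._clock = clock
        self._buckets: Dict[str, Tuple[float, float]] = {}  # user_id -> (tokens, last_seen)

    def check(self, user_id: str) -> Tuple[bool, float]:
        """Return (allowed, retry_after_seconds); the latter feeds the Retry-After header."""
        now = self._clock()
        tokens, last = self._buckets.get(user_id, (self.capacity, now))
        # Refill for the time elapsed since this user's last request.
        tokens = min(self.capacity, tokens + (now - last) * self.refill_per_s)
        if tokens >= 1.0:
            self._buckets[user_id] = (tokens - 1.0, now)
            return True, 0.0
        self._buckets[user_id] = (tokens, now)
        return False, (1.0 - tokens) / self.refill_per_s
```

When `check` returns `False`, the API layer would respond with 429 and a Retry-After value derived from the second element. A token budget works the same way with tokens consumed per request instead of a fixed cost of 1.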

LLM calls are expensive and non-deterministic. When a client retries a timed-out request, you don't want to run (and pay for) the LLM call again. Idempotency keys solve this.

❌ Without Idempotency

Client sends request → timeout → retries → LLM called twice.

You pay double. User may get different answers for the "same" request.

✅ With Idempotency Key

Client sends request + Idempotency-Key: abc123.

First call: runs LLM, caches result keyed by abc123.

Retry: returns cached result. No second LLM call.

Non-determinism + Retries = Confusion

Without idempotency, a client that retries the same request might get a different answer, because LLMs are non-deterministic. This confuses users and breaks downstream systems that expect consistent results. Always cache the first response for a given idempotency key (TTL: 24h typical).
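
A minimal in-memory sketch of the idea, assuming the key arrives in an `Idempotency-Key` header and that `run` is a zero-argument function wrapping the LLM call. `IdempotencyCache` is a hypothetical name; the sketch is not safe for concurrent duplicate requests, which would need a lock or an atomic set-if-absent in a shared store.

```python
import time
from typing import Callable, Dict, Tuple


class IdempotencyCache:
    """Stores the first response per idempotency key for `ttl_s` seconds."""

    def __init__(self, ttl_s: float = 24 * 3600, clock=time.monotonic):
        self.ttl_s = ttl_s
        self._clock = clock
        self._store: Dict[str, Tuple[float, str]] = {}  # key -> (expires_at, response)

    def get_or_run(self, key: str, run: Callable[[], str]) -> str:
        now = self._clock()
        hit = self._store.get(key)
        if hit is not None and hit[0] > now:
            return hit[1]  # retry of a known request: no second LLM call
        response = run()
        self._store[key] = (now + self.ttl_s, response)
        return response
```

A client retry with the same key then returns the stored response, so the user sees one consistent answer and the provider is billed once.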

∑ Chapter 04 — Key Takeaways

  • Stream for humans, batch for machines; expose both endpoints
  • Async patterns: sync + timeout (<10s), job queue (10s–5min), WebSocket (real-time + multi-turn)
  • Use JSON Mode or Structured Outputs whenever available; prompt-based JSON is fragile
  • Always add Pydantic/schema validation + retry on parse failure as a safety net
  • Rate limit at your API layer β€” per-user request limits + token budgets + global backpressure
  • Idempotency keys prevent double LLM calls on retries and ensure consistent responses
05
Chapter 05 · Performance
Caching Strategies: Reducing Cost and Latency with Smart Caching

The cheapest and fastest LLM call is the one you don't make. Caching is the most impactful optimization for LLM systems: it reduces cost, latency, and provider dependency in one move. But LLM caching is harder than traditional caching because inputs are natural language, not exact keys.

①
Exact Match Cache

Hash the full prompt → cache response. Identical prompt = cache hit.

  • Hit rate: 5–15% (prompts vary)
  • Implementation: Redis / in-memory
  • Zero false positives
  • Always implement this first
②
Semantic Cache

Embed the query → find similar past queries in vector DB → return cached answer if similar enough.

  • Hit rate: 15–35% (catches paraphrases)
  • Implementation: Vector DB + threshold
  • Risk: false positives if threshold too low
  • Saves the most money
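③
Prompt Cache (Provider)

OpenAI and Anthropic cache long prompt prefixes (such as system prompts) across calls. Same prefix = faster + cheaper.

  • Automatic on OpenAI for long prompts; Anthropic requires marking the cacheable blocks
  • About 50% input cost reduction on cached tokens (varies by provider)
  • Reduced TTFT
  • Little or no implementation needed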
④
KV Cache Reuse (Self-hosted)

When self-hosting: reuse key-value cache across requests with shared prefixes.

  • vLLM automatic prefix caching
  • Same system prompt = cached KV
  • 30–60% faster TTFT
  • Only for self-hosted models
Semantic cache flow: embed query, search for similar, return cached or call LLM
Query → Embed (→ vector) → Vector Search (similarity > 0.95? ~5ms) → Hit: return cached (0ms LLM cost) | Miss: call LLM (1–3s, $$), then store in cache for next time
The False Positive Trap

Setting the similarity threshold too low causes wrong answers to be returned from cache. "What's the refund policy?" and "What's the return policy?" may be 0.92 similar but have different answers. Start with a threshold ≥ 0.95 and lower it only with testing. A wrong cached answer is worse than a slow correct one.
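The threshold check can be sketched with a linear scan standing in for the vector DB. `SemanticCache` and the caller-supplied `embed` function are hypothetical; a real system would use an embedding model and an approximate nearest-neighbour index instead.

```python
import math

SIMILARITY_THRESHOLD = 0.95  # conservative starting point from the text above


def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class SemanticCache:
    """Linear-scan stand-in for a vector DB; `embed` maps text to a vector."""

    def __init__(self, embed, threshold=SIMILARITY_THRESHOLD):
        self._embed = embed
        self._threshold = threshold
        self._entries = []  # list of (vector, answer)

    def lookup(self, query):
        """Return the cached answer of the most similar query, or None on a miss."""
        vec = self._embed(query)
        best_score, best_answer = 0.0, None
        for cached_vec, answer in self._entries:
            score = cosine(vec, cached_vec)
            if score > best_score:
                best_score, best_answer = score, answer
        return best_answer if best_score >= self._threshold else None

    def store(self, query, answer):
        self._entries.append((self._embed(query), answer))
```

Constructing the cache with a lower `threshold` (say 0.90) is exactly how the refund/return mix-up described above would slip through.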

Strategy | How | Best For
TTL (time-to-live) | Cache expires after N hours/days | General answers, FAQs (TTL: 6–24h)
Version key | Include prompt version in cache key (new prompt = new cache) | When you update prompts/models
Event-driven | Clear cache when underlying data changes | RAG: doc updated → clear related caches
Manual purge | Admin action to clear specific cache entries | Wrong answers discovered in production
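A rough sketch of how an exact-match cache can combine three of these strategies: a version key, a TTL, and a manual purge. `cache_key`, `TTLCache`, `PROMPT_VERSION`, and `MODEL_NAME` are hypothetical names, and the dictionary is an in-memory stand-in for Redis.

```python
import hashlib
import time

PROMPT_VERSION = "v3"       # hypothetical label; bump it whenever the prompt changes
MODEL_NAME = "gpt-4o-mini"  # part of the key so a model swap also starts a fresh cache


def cache_key(prompt: str, version: str = PROMPT_VERSION, model: str = MODEL_NAME) -> str:
    """Exact-match key: identical prompt + version + model gives an identical key."""
    digest = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
    return f"llm:{model}:{version}:{digest}"


class TTLCache:
    """Tiny in-memory stand-in for a Redis SET with an expiry."""

    def __init__(self, ttl_seconds: float, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._data = {}  # key -> (expires_at, value)

    def get(self, key: str):
        entry = self._data.get(key)
        if entry is None or entry[0] <= self._clock():
            self._data.pop(key, None)  # expired entries are dropped lazily
            return None
        return entry[1]

    def set(self, key: str, value) -> None:
        self._data[key] = (self._clock() + self._ttl, value)

    def purge(self, key: str) -> None:
        """Manual purge for a known-bad entry."""
        self._data.pop(key, None)
```

Because the version is part of the key, old entries never need to be deleted on a prompt change; they simply stop being hit and age out through the TTL.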

∑ Chapter 05 — Key Takeaways

  • Four caching layers: exact match (simple), semantic (paraphrases), prompt cache (provider), KV cache (self-hosted)
  • Exact match first: zero false positives, easy to implement, 5–15% hit rate
  • Semantic cache catches paraphrases (15–35% hit rate) but needs careful threshold tuning (≥0.95)
  • Provider prompt caching is a near-free optimization: about 50% input cost reduction on long system prompts
  • Cache invalidation: TTL for general, version keys for prompt changes, events for data changes
  • A wrong cached answer is worse than a slow correct one, so tune thresholds conservatively
06
Chapter 06 · Scale
Scaling LLM Applications: From Prototype to Production Load

Scaling LLM applications is fundamentally different from scaling traditional web apps. You can't just add servers: your bottleneck is third-party API rate limits, not your own compute. Scaling strategy is about managing concurrency, queuing, and provider capacity.

Challenge | Traditional App | LLM App
Bottleneck | Your servers (scalable) | Provider API rate limits (not in your control)
Response time | <100ms (scale horizontally) | 1–5s per call (can't parallelize a single call)
Cost of scale | Fixed infrastructure → amortized | Linear: 2× queries = 2× LLM cost
Capacity planning | Auto-scale based on CPU/memory | Pre-negotiate rate limits, multi-provider
Request weight | All requests ~equal cost | Requests vary 10–100× in token cost
Queue-based scaling: decouple request intake from LLM execution
Clients (100+ req/s, burst traffic, variable load) → API Server (validates, enqueues) → Queue (Redis / SQS, priority ordering) → Worker Pool (N concurrent workers, rate-limit aware, auto-scale on queue depth, N ≤ provider rate limit) → LLM API (rate limited) → Store
Why Queues Work

The queue absorbs burst traffic, and the worker pool enforces the provider rate limit. Clients get immediate acknowledgment (202 Accepted), while workers process at the maximum rate the provider allows. Scale workers up to match rate limits, not beyond. Queue depth is your auto-scaling signal: a deep queue means add workers (up to the rate limit cap).
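The intake/execution split can be sketched with `asyncio`, assuming an in-process queue in place of Redis or SQS. `run_pool`, `worker`, and the `call_llm` coroutine are hypothetical names; `max_workers` plays the role of the provider-imposed concurrency cap.

```python
import asyncio


async def worker(queue: asyncio.Queue, call_llm, results: dict) -> None:
    """Pull jobs off the queue forever; the pool size caps concurrent LLM calls."""
    while True:
        job_id, prompt = await queue.get()
        try:
            results[job_id] = await call_llm(prompt)
        except Exception as exc:
            results[job_id] = exc  # record the failure but keep the worker alive
        finally:
            queue.task_done()


async def run_pool(jobs, call_llm, max_workers: int) -> dict:
    """Enqueue all jobs at once, then drain them with at most `max_workers` in flight."""
    queue = asyncio.Queue()
    results = {}
    for job in jobs:
        queue.put_nowait(job)  # intake is immediate, regardless of LLM speed
    workers = [
        asyncio.create_task(worker(queue, call_llm, results))
        for _ in range(max_workers)
    ]
    await queue.join()  # wait until every job has been processed
    for task in workers:
        task.cancel()
    await asyncio.gather(*workers, return_exceptions=True)
    return results
```

In a real deployment the queue would be external, so the API server can return 202 with a job ID and exit while separate worker processes drain the backlog.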

📊
Track Usage in Real-Time

Monitor requests/min and tokens/min against provider limits. Throttle before hitting the limit.

  • Sliding window counter
  • Alert at 80% of limit
  • Auto-throttle at 90%
🔄
Multi-Provider Spreading

Split traffic across multiple providers to multiply effective rate limits.

  • 60% OpenAI, 40% Anthropic
  • Weighted routing per model quality
  • Up to 2× effective capacity (when the split matches each provider's limits)
📅
Request Prioritization

When approaching limits, serve high-priority requests first.

  • Paid users before free users
  • Real-time before batch
  • Short requests before long
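The "alert at 80%, throttle at 90%" rule from the usage-tracking card can be sketched with a sliding one-minute window. `UsageTracker` is a hypothetical helper and the window lives in process memory; with several workers, the counter would need to be shared.

```python
import time
from collections import deque


class UsageTracker:
    """Sliding one-minute window of token usage against a provider limit."""

    def __init__(self, tokens_per_minute_limit: int, clock=time.monotonic):
        self._limit = tokens_per_minute_limit
        self._clock = clock
        self._events = deque()  # (timestamp, tokens)
        self._total = 0

    def record(self, tokens: int) -> None:
        self._events.append((self._clock(), tokens))
        self._total += tokens

    def utilization(self) -> float:
        """Fraction of the per-minute limit used in the last 60 seconds."""
        cutoff = self._clock() - 60.0
        while self._events and self._events[0][0] <= cutoff:
            _, tokens = self._events.popleft()
            self._total -= tokens
        return self._total / self._limit

    def action(self) -> str:
        used = self.utilization()
        if used >= 0.90:
            return "throttle"  # start queueing or delaying low-priority requests
        if used >= 0.80:
            return "alert"
        return "ok"
```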
Scale Tier | Daily Queries | Architecture | Estimated LLM Cost/Day
Prototype | <1K | Single server, sync calls | $1–$10
Small prod | 1K–10K | App server + cache (Redis) | $10–$100
Medium prod | 10K–100K | Queue + worker pool + multi-provider | $100–$1,000
Large prod | 100K–1M+ | Full queue arch + model routing + caching + self-host mix | $1,000–$10,000+

∑ Chapter 06 — Key Takeaways

  • LLM scaling bottleneck is provider rate limits, not your servers, so you can't just add compute
  • Queue-based architecture decouples intake from execution, so workers process at max provider rate
  • Scale workers to match rate limits, not beyond; queue depth is your auto-scaling signal
  • Multi-provider spreading multiplies effective rate limits (up to 2× capacity with two providers of similar limits)
  • Prioritize: paid before free, real-time before batch, short before long
  • Cost scales linearly: 2× queries = 2× LLM cost (caching and routing are your main levers)
07
Chapter 07 · Speed
Latency Optimization: Making LLM Applications Feel Fast

Users expect sub-second responses. LLM calls take 1–5 seconds. This gap is where latency engineering lives: reducing actual latency where possible, and masking it with streaming and progressive rendering where not.

Component | Typical Latency | Optimization | Savings
Network (client → server) | 10–50ms | CDN, edge deployment | 20–30ms
Your app logic | 5–20ms | Optimize prompts, pre-compute | 10ms
Cache check | 1–5ms (Redis) | In-memory for hot queries | Skips LLM entirely on hit
TTFT (LLM) | 200ms–2s | Shorter prompts, prompt caching, smaller model | 50–500ms
Token generation | 1–5s total | max_tokens limit, concise prompts, streaming | Perceived: ~0ms with streaming
Post-processing | 5–50ms | Async validation, stream while processing | Overlap with generation
🏆
1. Streaming

First token in ~200ms vs waiting 3s for full response. The single biggest UX improvement.

  • SSE for REST APIs
  • WebSocket for bidirectional
  • Effort: Low
🥈
2. Caching

Skip the LLM call entirely. Cache hit = response in <10ms instead of 2s.

  • Exact match + semantic cache
  • 15–35% of queries cached
  • Effort: Medium
🥉
3. Smaller Models

GPT-4o-mini is 2–3× faster than GPT-4o. Route simple queries to fast models.

  • Classifier → route to mini/Haiku
  • 70% of queries can use mini
  • Effort: Medium
4️⃣
Shorter Prompts

Every 1K fewer input tokens saves 50–200ms TTFT. Remove fluff from system prompts.

5️⃣
Parallel Calls

Independent LLM calls run simultaneously. Latency = max(calls) not sum(calls).
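A small sketch of the max-not-sum effect using `asyncio.gather`. `fake_llm_call` is a hypothetical stand-in that just sleeps; in practice each coroutine would be a real async client call.

```python
import asyncio
import time


async def fake_llm_call(name: str, seconds: float) -> str:
    """Stand-in for an independent LLM call that takes `seconds` to finish."""
    await asyncio.sleep(seconds)
    return f"{name} done"


async def run_parallel():
    """Run three independent calls concurrently and time the whole batch."""
    start = time.perf_counter()
    results = await asyncio.gather(
        fake_llm_call("summarize", 0.20),
        fake_llm_call("classify", 0.10),
        fake_llm_call("extract", 0.15),
    )
    return results, time.perf_counter() - start
```

Run sequentially these calls would take about 0.45s; with `gather` the batch finishes in roughly the time of the slowest one. This only applies when no call depends on another's output.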

6️⃣
Progressive Rendering

"Searching..." → "Found 3 docs..." → "Generating..." → streamed answer. Each step feels fast.

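🔧
SSE streaming endpoint (FastAPI)
import json
from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from openai import AsyncOpenAI
from pydantic import BaseModel

app = FastAPI()
client = AsyncOpenAI()  # reads OPENAI_API_KEY from the environment

class ChatRequest(BaseModel):
    # Assumed request shape: [{"role": "user", "content": "..."}]
    messages: list[dict]

@app.post("/chat")
async def chat_stream(request: ChatRequest):
    async def generate():
        stream = await client.chat.completions.create(
            model="gpt-4o",
            messages=request.messages,
            stream=True,
        )
        async for chunk in stream:
            # Some chunks carry no choices or an empty delta; skip them.
            if chunk.choices and chunk.choices[0].delta.content:
                token = chunk.choices[0].delta.content
                yield f"data: {json.dumps({'token': token})}\n\n"
        yield "data: [DONE]\n\n"

    return StreamingResponse(generate(), media_type="text/event-stream")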
Streaming + Structured Output = Tricky

When streaming JSON responses, you get partial JSON tokens: {"sen → timent": → "pos → itive"}. You can't parse until the full object arrives. Solutions: (1) stream plain text and return metadata through a separate call, (2) use a streaming JSON parser (jsonstream), or (3) stream the text response and append the structured data as a final non-streamed event.
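Option (3) can be sketched as a generator that forwards tokens as they arrive and emits the structured payload once, at the end. `sse_events` and `build_metadata` are hypothetical names, and the `type` field is an assumed convention for telling the two event kinds apart on the client.

```python
import json


def sse_events(token_stream, build_metadata):
    """Yield one SSE event per token, then a single final structured event."""
    parts = []
    for token in token_stream:
        parts.append(token)
        yield f"data: {json.dumps({'type': 'token', 'token': token})}\n\n"
    full_text = "".join(parts)
    # The structured data is computed from the complete text, so it is always valid JSON.
    final = {"type": "final", "data": build_metadata(full_text)}
    yield f"data: {json.dumps(final)}\n\n"
    yield "data: [DONE]\n\n"
```

The client renders `token` events immediately and parses only the `final` event as structured data, so it never sees half an object.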

∑ Chapter 07 — Key Takeaways

  • Streaming is #1: first token in ~200ms vs 3s wait. The single biggest UX improvement you can make.
  • Caching is #2: skip the LLM entirely. Cache hit = <10ms response.
  • Smaller models are #3: mini/Haiku are 2–3× faster for simple tasks
  • Shorter prompts directly reduce TTFT: every 1K fewer tokens saves 50–200ms
  • Run independent calls in parallel: latency = max, not sum
  • Progressive rendering (showing steps) makes any system feel faster
08
Chapter 08 · Economics
Cost Management: Keeping LLM Costs Under Control

LLM costs are deceptive: they start small and grow linearly with usage. A system that costs $10/day at launch can cost $10,000/day at scale without any architecture changes. Cost management is not an afterthought; it's a first-class design constraint.

Every dollar you spend on LLMs comes from three levers: how many calls you make, how many tokens per call, and which model you use. Cost optimization attacks all three.

LLM cost breakdown: three levers you control
Lever 1: Call Volume. How often do you call the LLM? Reduce via: caching, batching, routing, non-LLM solutions. Impact: 20–50% fewer calls
Lever 2: Tokens per Call. How long are prompts & responses? Reduce via: prompt compression, max_tokens limits, context pruning. Impact: 30–60% token reduction
Lever 3: Model Choice. Which model handles each request? Reduce via: model routing, tiering, cheap model for simple tasks. Impact: 50–80% cost reduction
The Cost Formula

Daily Cost = Queries/day × Avg Tokens/query × Model Price/token. Optimize all three. Model routing alone (sending 70% of queries to mini instead of GPT-4o) can cut costs by 60–70%. Combined with caching (eliminating 20–30% of calls) and prompt compression (reducing tokens 30%), total cost reduction of 75–85% is achievable without quality loss.
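A worked instance of the formula, under assumed numbers: 100K queries/day, 2K tokens per query, a frontier price of $2.50 per 1M tokens, and a mini price of $0.15 per 1M tokens. The query volume, token count, and mini price are illustrative assumptions, and the sketch uses one blended price instead of separate input and output rates.

```python
def daily_cost(queries_per_day: float, avg_tokens_per_query: float,
               price_per_million_tokens: float) -> float:
    """Daily Cost = Queries/day x Avg Tokens/query x Model Price/token."""
    return queries_per_day * avg_tokens_per_query * price_per_million_tokens / 1_000_000


# Assumed inputs, for illustration only.
QUERIES, TOKENS = 100_000, 2_000
PRICE_FRONTIER, PRICE_MINI = 2.50, 0.15  # USD per 1M tokens

baseline = daily_cost(QUERIES, TOKENS, PRICE_FRONTIER)

# Lever 3: route 70% of queries to the cheap model.
blended_price = 0.30 * PRICE_FRONTIER + 0.70 * PRICE_MINI
# Lever 1: caching removes 25% of calls. Lever 2: compression trims 30% of tokens.
optimized = daily_cost(QUERIES * 0.75, TOKENS * 0.70, blended_price)

reduction = 1 - optimized / baseline
print(f"baseline ${baseline:.2f}/day, optimized ${optimized:.2f}/day, saved {reduction:.0%}")
```

Under these assumptions the baseline is $500/day and the optimized figure is about $90/day, a reduction of roughly 82%. Routing alone moves the blended price from $2.50 to about $0.86, which is the 60–70% saving on its own.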

Without token budgets, a single runaway agent loop or abusive user can burn your monthly budget in minutes. Token budgets enforce hard and soft limits at every level.

Budget Level | What It Limits | Implementation | Enforcement
Per request | Max tokens per single LLM call | Set max_tokens parameter | Provider enforces (free)
Per user/day | Total tokens a user can spend daily | Track in Redis with daily TTL key | Return 429 when exhausted
Per API key/month | Total organization spend | Provider dashboard spend limits | Provider cuts off at limit
Per feature | Limit expensive features (agents, long-form) | Feature flags based on user tier | App-level enforcement
System-wide | Total queries per hour (circuit breaker) | Global counter, open circuit if exceeded | Degrade gracefully when triggered
🔧
Per-user token budget with Redis
import redis.asyncio as redis  # async client, so the event loop is not blocked
from datetime import datetime, timezone

r = redis.Redis()
DAILY_TOKEN_LIMIT = 100_000  # tokens/user/day

async def check_and_deduct_budget(user_id: str, estimated_tokens: int) -> bool:
    day = datetime.now(timezone.utc).strftime("%Y-%m-%d")
    key = f"tokens:{user_id}:{day}"

    # Increment and set the TTL together in one transaction
    pipe = r.pipeline()
    pipe.incrby(key, estimated_tokens)
    pipe.expire(key, 86400)  # 24h TTL
    results = await pipe.execute()

    total_used = results[0]
    if total_used > DAILY_TOKEN_LIMIT:
        await r.decrby(key, estimated_tokens)  # Roll back
        return False  # Budget exhausted
    return True
💚
Tier 1: Free / Near-Free

Tasks solvable without an LLM: use rule-based or traditional ML.

  • Keyword matching, regex
  • Traditional classifiers (sklearn)
  • Embedding similarity (no LLM)
  • Cost: ~$0
💙
Tier 2: Cheap LLM

Simple tasks where a small/fast model performs within 2% of frontier.

  • GPT-4o-mini / Claude Haiku / Gemini Flash
  • Classification, extraction, formatting
  • Short Q&A, paraphrase detection
  • Cost: $0.01–$0.10 / 1K queries
💛
Tier 3: Frontier LLM

Complex tasks requiring frontier reasoning or nuanced writing.

  • GPT-4o / Claude Sonnet / Gemini Pro
  • Multi-step reasoning, code gen
  • Complex analysis, long-form writing
  • Cost: $0.50–$5.00 / 1K queries
The Tiering Rule

Benchmark each task type across tiers before committing to a model. In practice, 60–75% of production LLM queries are Tier 2: classification, simple extraction, FAQ answers, format conversion. These can run on cheap models at 10–20× lower cost with <2% quality delta. Reserve Tier 3 only for tasks where you've measured that it makes a difference.
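A toy router for the three tiers, assuming the task type is already known (for example from an upstream classifier). The `ORDER_STATUS` regex, the `CHEAP_TASKS` set, and the task-type labels are hypothetical; in practice the sets should come from your own benchmarks.

```python
import re

# Which model serves each tier; Tier 1 needs no LLM at all.
TIER_MODELS = {1: None, 2: "gpt-4o-mini", 3: "gpt-4o"}

# Hypothetical Tier 1 rule: order-status questions are answered by a database lookup.
ORDER_STATUS = re.compile(r"\border\s+#?\d+\b", re.IGNORECASE)

# Hypothetical task types that benchmarked within tolerance on the cheap model.
CHEAP_TASKS = {"classification", "extraction", "formatting", "faq"}


def choose_tier(task_type: str, text: str) -> int:
    """Pick the cheapest tier that is expected to handle the request."""
    if ORDER_STATUS.search(text):
        return 1
    if task_type in CHEAP_TASKS:
        return 2
    return 3  # anything unmeasured defaults to the frontier model
```

Defaulting unknown task types to Tier 3 trades cost for safety; a task only moves to Tier 2 after it has been measured.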

OpenAI and Anthropic offer batch APIs that process requests asynchronously at a 50% price discount. If your use case tolerates minutes-to-hours latency, batch processing is a free cost halving.

Synchronous API (real-time)

Latency: 1–5s per request

Cost: Full price (e.g. $2.50/1M tokens)

Best for: User-facing generation, interactive features

When: Response needed in <10s

Batch API (async)

Latency: Minutes to 24 hours

Cost: 50% off (e.g. $1.25/1M tokens)

Best for: Embedding runs, bulk eval, background analysis

When: Any offline/background processing

Use Case | API Type | Rationale
Chat response | Sync (real-time) | User waiting, can't delay
Embedding 10K documents | Batch | No user waiting: 50% savings
Nightly eval run | Batch | Background job: 50% savings
Bulk data extraction | Batch | No realtime need: 50% savings
Content moderation | Depends | Sync if blocking publish; batch if reviewing async
✂️
System Prompt Audit

System prompts repeat every call. Every 1K tokens of system prompt = 1K input tokens per query. Audit ruthlessly.

  • Remove examples that aren't needed
  • Delete redundant instructions
  • Use terse style over verbose
  • Typical saving: 30–50%
🗜️
Context Window Pruning

For multi-turn conversations, don't send complete history. Summarize old turns instead.

  • Keep last N turns verbatim
  • Summarize older turns (~10% of tokens)
  • Drop irrelevant retrieved docs
  • Typical saving: 40–70% on long chats
📏
Output Length Control

Output tokens cost 3–5× more than input tokens (on most models). Constrain output aggressively.

  • Set aggressive max_tokens
  • Ask for "concise" / "one sentence"
  • Request JSON vs prose
  • Typical saving: 20–40%
Over-Compression Kills Quality

Prompt compression has diminishing and then negative returns. Removing too much context causes the model to hallucinate missing information, produce wrong formats, or miss nuance. Always run your eval suite after compressing prompts. The goal is to remove tokens that don't affect output quality, not to minimize prompts at any cost.
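Context window pruning ("keep last N turns verbatim, summarize older turns") can be sketched as below. `prune_history` is a hypothetical helper, and `summarize` is supplied by the caller; it would typically be a call to a cheap model.

```python
def prune_history(messages, keep_last: int, summarize):
    """Keep the last `keep_last` messages verbatim and fold older ones into one summary."""
    split = max(0, len(messages) - keep_last)
    older, recent = messages[:split], messages[split:]
    if not older:
        return list(recent)  # short conversation: nothing to compress
    summary = {
        "role": "system",
        "content": "Summary of earlier turns: " + summarize(older),
    }
    return [summary] + list(recent)
```

The summary itself costs an LLM call, so it is worth caching per conversation and refreshing only every few turns.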

📊
Per-Request Cost Logging

Log tokens in + tokens out + model + cost for every LLM call. Attach user ID, feature name, request ID.

  • Cost per feature / per endpoint
  • Cost per user cohort
  • Identify expensive edge cases
🚨
Cost Anomaly Alerts

Alert when spend rate deviates from baseline. A 5× cost spike in 10 minutes is a runaway loop, not a traffic spike.

  • Alert: hourly cost > 2× 7-day avg
  • Alert: single request > $0.50
  • Alert: daily budget > 80%
📈
Cost-Quality Dashboard

Track cost alongside quality metrics. A 30% cost reduction that drops quality 10% may not be worth it.

  • Cost per quality point (eval score)
  • Model routing effectiveness
  • Cache hit rate vs cost saved
🔍
Attribution by Feature

Know which feature is driving cost. Often one feature (e.g. agent with long context) drives 80% of LLM spend.

  • Tag every call with feature name
  • Cost breakdown by feature weekly
  • Prioritize optimization by cost share
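Per-request logging, feature attribution, and the three alert rules can be sketched together. `CallRecord`, `cost_by_feature`, and `check_alerts` are hypothetical names, and the `PRICES` table holds assumed USD rates per 1M input/output tokens that should be replaced with your provider's current pricing.

```python
from dataclasses import dataclass

# Assumed prices in USD per 1M tokens: (input, output). Check current provider rates.
PRICES = {"gpt-4o": (2.50, 10.00), "gpt-4o-mini": (0.15, 0.60)}


@dataclass
class CallRecord:
    request_id: str
    user_id: str
    feature: str
    model: str
    tokens_in: int
    tokens_out: int

    @property
    def cost_usd(self) -> float:
        price_in, price_out = PRICES[self.model]
        return (self.tokens_in * price_in + self.tokens_out * price_out) / 1_000_000


def cost_by_feature(records):
    """Sum cost per feature tag so the most expensive feature stands out."""
    totals = {}
    for record in records:
        totals[record.feature] = totals.get(record.feature, 0.0) + record.cost_usd
    return totals


def check_alerts(records, hourly_cost, avg_hourly_cost_7d, daily_cost, daily_budget):
    """Apply the three alert rules listed above and return the ones that fire."""
    found = []
    if hourly_cost > 2 * avg_hourly_cost_7d:
        found.append("hourly cost > 2x 7-day average")
    if any(record.cost_usd > 0.50 for record in records):
        found.append("single request > $0.50")
    if daily_cost > 0.80 * daily_budget:
        found.append("daily budget > 80% used")
    return found
```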

∑ Chapter 08 — Key Takeaways

  • Three levers for cost: call volume (caching/batching), tokens per call (compression), model choice (tiering/routing)
  • Model tiering is highest ROI: 60–75% of queries can run on cheap models at 10–20× lower cost
  • Token budgets at every level prevent runaway costs: per request, per user, per day, per feature
  • Batch API gives 50% cost reduction on any non-realtime workload: nightly evals, bulk embedding, background analysis
  • Prompt compression (audit, prune, constrain output) typically saves 30–60% tokens; always run evals after
  • Measure cost per feature: one expensive feature often drives 80% of LLM spend; find it and optimize it first
09
Chapter 09 · Infrastructure
Infrastructure: Self-Hosted vs API, GPUs, and Deployment

Your infrastructure choice is the single most consequential technical decision in an LLM system: API inference vs self-hosting. The wrong choice costs 10× more than necessary or requires months of re-engineering. This chapter gives you the framework to choose correctly and build it right.

Dimension | API Providers (OpenAI, Anthropic…) | Self-Hosted (vLLM, TGI…)
Setup time | Minutes: just an API key | Days to weeks (GPU, infra, tuning)
Model quality | Frontier models (GPT-4o, Claude 3.5) | Open-source models (Llama 3, Mistral)
Cost at low volume | Cheap: pay per token, no infra | Expensive: GPU cost even at idle
Cost at high volume | Linear: cost scales with tokens | Fixed GPU cost, amortizes over volume
Data privacy | Data leaves your infrastructure | Full data control, on-prem option
Rate limits | Provider-imposed, shared quotas | Your hardware, your limits
Maintenance | Zero: provider handles everything | GPU infra, model updates, monitoring
Customisation | Limited (fine-tuning via API) | Full: fine-tune, modify, distill
The Decision Rule

Use API providers until one of these triggers hits: (1) volume exceeds ~10M tokens/day (self-hosting becomes cheaper), (2) data privacy requirements mandate on-prem, (3) you need a customized/fine-tuned model, (4) rate limits block growth. Most companies never hit these triggers, so API is the right default.

🟢
OpenAI (GPT-4o family)

Largest ecosystem, best tooling, JSON mode + Structured Outputs. Primary choice for most teams.

  • Models: GPT-4o, GPT-4o-mini, o1
  • Context: 128K tokens
  • Strengths: code, reasoning, function calling
  • Best for: general-purpose default
🟠
Anthropic (Claude family)

Long context leader (200K), best for document analysis. Excellent instruction following.

  • Models: Claude 3.5 Sonnet, Claude Haiku
  • Context: 200K tokens
  • Strengths: long-context, nuanced writing
  • Best for: document QA, long analysis
🔵
Google (Gemini family)

1M context window, multimodal by default, competitive pricing. Strong for bulk/cheap processing.

  • Models: Gemini 1.5 Pro, Gemini Flash
  • Context: 1M tokens
  • Strengths: massive context, multimodal
  • Best for: very long docs, video, cost efficiency
⚡
Groq / Together / Fireworks

Inference-optimized providers for open-source models. 10–50× faster than traditional GPU hosting.

  • Models: Llama 3, Mistral, Mixtral
  • Strengths: ultra-low latency (50ms TTFT)
  • Speeds: 200–800 tok/s vs 30–80 for OpenAI
  • Best for: latency-critical, open-source models
Tool | Use Case | Key Feature | Best For
vLLM | Production serving | PagedAttention (up to 24× higher throughput than HuggingFace Transformers in vLLM's published benchmarks), continuous batching | High-volume production self-hosting
TGI (Text Gen Inference) | Production serving | HuggingFace ecosystem, OpenAI-compatible API | HuggingFace model ecosystem
Ollama | Dev / local | One-command model management, Mac/Linux support | Local development, testing, prototyping
llama.cpp | Edge / CPU | Quantized models on CPU (no GPU needed) | Edge deployment, air-gapped systems
LiteLLM | Proxy / abstraction | Unified OpenAI-compatible interface over 100+ models | Multi-model routing, provider abstraction
🚀
vLLM: getting started
# Install and serve Llama 3 8B with an OpenAI-compatible API (shell)
pip install vllm

# Serve model on port 8000 (OpenAI-compatible)
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Meta-Llama-3-8B-Instruct \
    --port 8000 \
    --tensor-parallel-size 1  # number of GPUs

# Use with the standard OpenAI client (Python)
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")
response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    messages=[{"role": "user", "content": "Hello!"}],
)
GPU | VRAM | Max Model Size | Cloud Cost/hr | Best For
NVIDIA T4 | 16 GB | 7B models (fp16) | ~$0.50 | Dev/small inference, budget prod
NVIDIA A10G | 24 GB | 13B models (quantized), 7B with headroom | ~$1.50 | Small production workloads
NVIDIA A100 (40GB) | 40 GB | 30B models (quantized) | ~$3.00–$4.00 | Medium production, fine-tuning
NVIDIA A100 (80GB) | 80 GB | 70B models (quantized), 30B+ with batching | ~$5.00–$7.00 | Large production workloads
NVIDIA H100 (80GB) | 80 GB | 70B models (quantized) at highest throughput | ~$15–$25 | Highest throughput demand

VRAM sizing rule: Model parameters × 2 bytes (fp16) = minimum VRAM for the weights. Llama 3 8B ≈ 16 GB, Llama 3 70B ≈ 140 GB (2× A100 80GB for the weights alone). Always add 20% headroom for KV cache. For multi-GPU: use tensor parallelism (vLLM --tensor-parallel-size N).
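The sizing rule as arithmetic. `vram_gb` and `gpus_needed` are hypothetical helpers that apply only the rule stated above (bytes per parameter plus a flat headroom); real KV cache needs depend on batch size and context length.

```python
import math


def vram_gb(params_billions: float, bytes_per_param: float = 2.0,
            headroom: float = 0.20) -> float:
    """Weights (1B params x 2 bytes = 2 GB at fp16) plus flat headroom for KV cache."""
    weights_gb = params_billions * bytes_per_param
    return weights_gb * (1 + headroom)


def gpus_needed(params_billions: float, gpu_vram_gb: float, **kwargs) -> int:
    """Smallest number of identical GPUs whose combined VRAM covers the estimate."""
    return math.ceil(vram_gb(params_billions, **kwargs) / gpu_vram_gb)
```

With the 20% headroom included, an 8B model needs about 19 GB and a 70B model about 168 GB at fp16. That is more than two 80 GB cards provide, so a 70B deployment on two A100s leaves little room for KV cache unless the weights are quantized (for example `bytes_per_param=1.0` for 8-bit).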

Hybrid architecture: use the right infrastructure for each query type
Request (incoming) → Router (classify task + sensitivity) → one of: Self-hosted (vLLM), Llama 3 70B (sensitive data, high-volume, bulk jobs) | OpenAI GPT-4o (API) (complex reasoning, frontier quality) | GPT-4o-mini / Haiku (API) (simple tasks, classification, fast) → Response

A common pattern: self-host a capable open-source model (Llama 3 70B) for bulk, sensitive, or high-volume requests; use API providers (GPT-4o) for frontier-quality tasks. LiteLLM acts as a transparent proxy, making both look like the same OpenAI API to your application.

∑ Chapter 09 — Key Takeaways

  • API providers by default: only self-host when volume (>10M tokens/day), privacy, or customization demands it
  • Provider strengths: OpenAI (ecosystem, code), Anthropic (200K context, documents), Google (1M context, cheap), Groq (ultra-low latency)
  • vLLM is the production standard for self-hosting: PagedAttention gives up to 24× higher throughput
  • VRAM sizing: model params × 2 bytes (fp16) + 20% headroom for KV cache
  • LiteLLM provides a unified OpenAI-compatible interface over all providers; use it to avoid vendor lock-in
  • Hybrid architecture: self-host for bulk/sensitive, use API for frontier quality, with a router deciding per request
10
Chapter 10 · Real World
Case Studies: LLM System Design in Practice

Theory meets reality. Four complete system design walkthroughs: a customer support chatbot, a code assistant, a document Q&A pipeline, and a content generation system. Each applies the full design toolkit from this guide: architecture, model routing, caching, scaling, cost, and observability.

Requirements

Scale: 50K queries/day, peak 200 req/min

Latency: <2s first token (streaming)

Quality: Grounded in knowledge base, accurate

Cost target: <$0.01/query

Constraints: Customer data privacy

Architecture Decisions

Pattern: Router → RAG → Chain

Primary model: GPT-4o-mini (80% of queries)

Complex model: GPT-4o (20% escalated)

Cache: Exact + semantic (Redis + Qdrant)

Queue: SQS for burst absorption

Customer support chatbot: full architecture
Customer (web/mobile) → Cache (exact + semantic, ~25% hit rate) → Classifier (GPT-4o-mini: intent + complex?) → RAG + mini (80% of queries) or RAG + GPT-4o (20% complex) → Validator (grounding check, hallucination scan) → Stream to Customer
Observability: every call logged with user_id, intent, model, tokens, cost, latency, grounding_score
Metric | Target | How Achieved
Cost per query | $0.008 (vs $0.025 naive) | Routing 80% → mini + 25% cache hit rate
P95 latency | 1.2s TTFT | Streaming + cache + mini for most queries
Hallucination rate | <1% | RAG grounding + output validator
Availability | 99.9% | OpenAI primary → Anthropic fallback → cached degraded
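The availability row's fallback chain (primary → fallback → cached degraded) can be sketched as an ordered list of callables. `answer_with_fallback` is a hypothetical helper, and each provider entry is assumed to be a function that either returns text or raises.

```python
def answer_with_fallback(prompt, providers, degraded_cache):
    """Try providers in order; fall back to a cached (possibly stale) answer."""
    errors = []
    for name, call in providers:
        try:
            return {"source": name, "text": call(prompt)}
        except Exception as exc:  # timeout, rate limit, 5xx, ...
            errors.append(f"{name}: {exc}")
    cached = degraded_cache.get(prompt)
    if cached is not None:
        return {"source": "cache (degraded)", "text": cached}
    raise RuntimeError("all providers failed: " + "; ".join(errors))
```

Returning the `source` alongside the text lets the observability layer count how often the system is running degraded.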
Requirements

Scale: 200K completions/day (heavy users)

Latency: <500ms first token (inline autocomplete)

Quality: Context-aware, correct syntax

Cost target: <$0.005/completion

Special: Must work on private repos (data sensitivity)

Architecture Decisions

Inline completions: Self-hosted Llama 3 8B (privacy)

Complex generation: GPT-4o via API (quality)

Context: Sliding window of relevant files

Cache: Prefix cache (same method stubs = cache hit)

Infra: vLLM + 2× A100 80GB

⌨️
Inline Autocomplete

Self-hosted Llama 3 8B on vLLM. 300ms P90 latency. No data leaves the company.

  • vLLM with speculative decoding
  • Prefix caching on file headers
  • Code-specific fine-tune
  • Cost: GPU amortized (~$0.0001)
💬
Chat / Explain

User asks "explain this function" or "refactor this code" → GPT-4o via API for quality.

  • Full conversation context
  • Code context injection (RAG)
  • OpenAI API (data anonymized)
  • Cost: ~$0.02/chat exchange
🔍
Semantic Search

Find relevant code snippets across repo. Embedding + vector search, no LLM call.

  • text-embedding-3-small
  • Qdrant vector store
  • Embeds on file save
  • Cost: <$0.0001/search
Context Window Management is Critical

Code assistants are uniquely difficult because relevant context (imports, type definitions, caller code) may be spread across many files. Naively concatenating files burns the context window and buries the relevant code in noise. Use semantic retrieval to find the 3–5 most relevant code chunks, not the 100 lines immediately above the cursor. This also cuts input token cost by 60–80%.

Requirements

Scale: 5K queries/day over 100K documents

Documents: PDFs, Word docs, PPTs up to 500 pages

Quality: Grounded answers with citations, no hallucination

Latency: <5s (batch-acceptable)

Cost target: <$0.05/query

Architecture Decisions

Pattern: RAG with re-ranking

Chunking: Semantic (512 tokens, 10% overlap)

Retrieval: Hybrid (BM25 + dense), top-20 → re-rank → top-5

Model: Claude 3.5 Sonnet (200K context)

Vector store: Qdrant (self-hosted)

Pipeline Stage | Component | Why This Choice
Document parsing | Unstructured.io (PDF, PPTX, DOCX) | Handles tables, images, complex layouts
Chunking | Semantic chunking (sentence-transformers) | Respects paragraph/section boundaries
Embedding | text-embedding-3-large (3072d) | Best retrieval quality on enterprise docs
Retrieval | Hybrid BM25 + dense (RRF fusion) | Dense catches semantics; BM25 catches keywords
Re-ranking | Cohere Rerank v3 (top-20 → top-5) | +15% answer accuracy vs raw retrieval
Generation | Claude 3.5 Sonnet + citation prompting | Long context + strong grounding instructions
Validation | Answer grounding check (LLM judge) | Catches hallucinations before serving
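The RRF fusion step in the retrieval row can be sketched in a few lines. Reciprocal rank fusion scores each document as the sum of 1 / (k + rank) over the rankings it appears in; k = 60 is the commonly used default, and `top_n=20` mirrors the top-20 candidate set that goes to the re-ranker. The function name and the tie-break by document ID are choices made for this sketch.

```python
def reciprocal_rank_fusion(rankings, k: int = 60, top_n: int = 20):
    """Fuse several ranked lists of doc IDs into one list, best first."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort by fused score, breaking ties by ID so the output is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))[:top_n]
```

RRF uses only ranks, so the BM25 and dense scores never need to be normalized against each other.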
Requirements

Scale: 50K articles/month, burst possible

Quality: Brand-consistent, SEO-aware, fact-checked

Latency: Hours acceptable (batch job)

Cost target: <$0.50/article

Multi-step: Brief → outline → draft → edit → SEO

Architecture Decisions

Pattern: Sequential chain (5 stages)

Batch API: Yes (50% cost savings)

Models: Mix of GPT-4o (quality) + mini (SEO, outline)

Queue: Celery + Redis for async pipeline

Human-in-loop: Review gate before publish

1️⃣ Brief Input: topic, audience, keywords
2️⃣ Outline: mini (cheap)
3️⃣ Draft: GPT-4o (quality)
4️⃣ Edit + Fact-check: GPT-4o + search tool
5️⃣ SEO optimize: mini (cheap, structured)
💰
Cost Breakdown per Article
  • Stage 2 (outline, mini): $0.003
  • Stage 3 (draft, GPT-4o): $0.28
  • Stage 4 (edit, GPT-4o): $0.12
  • Stage 5 (SEO, mini): $0.005
  • Total: ~$0.41 (vs $0.85 naive)
⚡
Optimization Wins
  • Batch API: saves 50% on outline + SEO
  • Model routing: mini for cheap steps
  • Prompt compression: 40% shorter system prompt
  • Result: $0.41 vs $0.85 baseline
✅
Quality Controls
  • Fact-check step with web search
  • Brand voice eval (LLM judge)
  • SEO score check (keyword density)
  • Human review gate before publish
🛡️
Reliability
  • ✅ Multi-provider fallback configured and tested
  • ✅ Retry logic with exponential backoff
  • ✅ Circuit breaker on all LLM calls
  • ✅ Graceful degradation path defined
  • ✅ Timeout set on every LLM call
💸
Cost Controls
  • ✅ Per-request max_tokens set
  • ✅ Per-user token budget enforced
  • ✅ Provider spend limits configured
  • ✅ Cost anomaly alerts firing
  • ✅ Model routing tested and validated
📊
Observability
  • ✅ Every LLM call logged (tokens, cost, latency)
  • ✅ Trace IDs propagated end-to-end
  • ✅ Error rates dashboarded
  • ✅ P50/P95/P99 latency tracked per endpoint
  • ✅ Evals running in CI on prompt changes
🔒
Security
  • ✅ API keys in secrets manager (not env files)
  • ✅ Output sanitization before rendering
  • ✅ Input length limits enforced
  • ✅ Prompt injection mitigations in place
  • ✅ PII not logged in traces
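The "retry logic with exponential backoff" item from the reliability checklist can be sketched as below. `call_with_retries` is a hypothetical helper; the delay values are assumptions, and a production version would retry only on errors known to be transient (timeouts, 429s, 5xx) instead of on every exception.

```python
import random
import time


def call_with_retries(call, max_retries: int = 3, base_delay: float = 0.5,
                      max_delay: float = 8.0, sleep=time.sleep):
    """Retry `call` with exponentially growing, jittered delays between attempts."""
    for attempt in range(max_retries + 1):
        try:
            return call()
        except Exception:
            if attempt == max_retries:
                raise  # out of retries: surface the last error
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Jitter spreads out retries so many clients do not hit the provider in lockstep.
            sleep(delay * random.uniform(0.5, 1.0))
```

Keep `max_retries` small (the 2–3 range used elsewhere in this guide): every retry is another paid LLM call and adds to tail latency.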

∑ Chapter 10 — Key Takeaways

  • Customer support chatbot: Router → RAG → Chain with model tiering (mini for 80%) achieves $0.008/query vs $0.025 naive
  • Code assistant: Self-host for inline completions (privacy + speed), API for complex chat, getting the best of both worlds
  • Document Q&A: Hybrid retrieval + re-ranking + citation prompting → <1% hallucination rate
  • Content generation: Multi-step chain + batch API + model tiering ≈ 50% cost reduction vs single-model pipeline
  • Every production system needs: fallbacks, cost controls, observability, and security, all of them, before launch
  • The best architecture evolves from simple to complex, driven by measured gaps, not anticipated ones