LLM System Design
Architecture patterns for LLM applications β from single-model APIs to multi-model orchestration, scaling, and production infrastructure.
Building with LLMs is not like building traditional software. Latency is measured in seconds, costs scale with tokens, and outputs are non-deterministic. This guide teaches you to design systems that work despite β and because of β these constraints.
Every design decision in an LLM system is shaped by three constraints that don't exist in traditional software: non-deterministic outputs, second-scale latency, and per-token costs. Understanding these isn't optional β it's the foundation everything else builds on.
LLMs are not reliable components by default. A production LLM system is not just LLM + prompt. It is:
Systems become unpredictable, expensive, and hard to debug. The LLM produces output; the control layer decides whether to trust it.
- Costs spike without warning
- Errors cascade silently
- No visibility into failures
Limits retries, enforces timeouts, tracks token usage, terminates loops, validates output format before downstream use.
- Max retries: 2β3 per call
- Per-call timeout: 10β30s
- Token budget per request
Without an eval layer, you can't know if your system is working or regressing. Every production system needs a baseline quality signal.
- Offline evals in CI
- Online quality sampling
- Cost-per-quality tracking
Traditional software is deterministic, fast, and free at the margin. LLM software is none of these. Every architecture decision must account for these three constraints β or the system will be unreliable, slow, or unaffordable.
Deterministic: f(x) = y, always
Fast: <50ms response times typical
Cheap: Fixed infra cost, marginal cost ~$0
Testable: Unit tests verify exact behavior
Non-deterministic: f(x) β y (different each time)
Slow: 200msβ5s per call, seconds for complex tasks
Expensive: Per-token pricing, cost grows with usage
Hard to test: Eval suites, fuzzy matching, LLM-as-judge
Every production LLM system includes a control layer β whether engineers planned for it or grew it reactively. This layer is responsible for bounding cost, latency, and failure cascades that the LLM itself cannot prevent.
| Control Responsibility | Without It | Implementation |
|---|---|---|
| Retry limiting | Infinite retries β runaway cost | Max 2β3 retries with exponential backoff |
| Timeout enforcement | Stalled requests block workers forever | Hard timeout per call (10β30s) |
| Token tracking | A single request can burn the budget | Count tokens before and after each call |
| Loop termination | Agent loops run indefinitely | Max steps cap (e.g. 10) + cost circuit breaker |
In traditional software, if the code is correct, the output is correct. In LLM systems, even correct code can produce wrong outputs. The model might hallucinate, return malformed JSON, miss a key fact, or give a subtly wrong answer. Your system must handle this gracefully.
Same prompt, same model, temperature=0 β still slightly different outputs across calls. Structure your system to tolerate variation.
- Parse loosely, validate strictly
- Use structured output (JSON mode)
- Retry on format failures
LLM calls fail regularly: hallucination, refusal, rate limit, malformed output. Unlike traditional APIs, failure is not exceptional β it's expected.
- Design retry + fallback for every call
- Validate outputs before trusting them
- Never trust LLM output as ground truth
You can't unit-test LLM outputs. You need eval suites β sets of (input, expected_output) scored by automated metrics or LLM-as-judge.
- Build eval set before building features
- Run evals on every prompt change
- Treat evals like integration tests
Never make an LLM call a single point of failure. Every LLM call in your system should have: (1) output validation, (2) a retry strategy, (3) a fallback path (cheaper model, cached result, or graceful "I don't know"). Systems that treat LLM calls like database queries β reliable and deterministic β break in production.
LLM failures are not limited to incorrect answers. The subtler class of behavioral failures is harder to detect and more dangerous in production because they often pass naive validation.
| Behavioral Failure | What It Looks Like | Detection Strategy |
|---|---|---|
| Repeated mistakes across retries | Each retry returns the same wrong answer β retrying is futile | Hash outputs across retries; escalate if identical failures |
| Partially correct but misleading | 90% correct + 10% confidently wrong β harder to catch than fully wrong | LLM-as-judge or grounding checks on key claims |
| Instruction ignored | Model responds in wrong language, skips a required field, ignores constraints | Schema validation + presence checks on required fields |
| Overconfident wrong answer | Model says "definitely X" when it should say "I don't know" | Calibration eval; add explicit uncertainty instructions |
| Inconsistent outputs | Same question β different answers across sessions (inconsistent brand voice, logic) | Consistency eval suite; lock temperature to 0 for deterministic tasks |
If a model returns the same malformed output on three consecutive retries, more retries will not help β the prompt or schema is the problem. Build retry logic that modifies the prompt on failure (e.g. appending "Return valid JSON. Error was: β¦") rather than blindly resending the same request. Blind retries multiply cost with zero benefit.
LLM latency isn't a single number. It has two distinct phases, and understanding them changes how you design your system.
| Latency Factor | Impact | Design Implication |
|---|---|---|
| Input tokens (prompt length) | +50β200ms per 1K tokens of input | Keep prompts short. Cache system prompts. Compress context. |
| Output tokens | +20β80ms per output token | Set max_tokens aggressively. Ask for concise responses. |
| Model size | Larger model = slower | Route simple tasks to smaller/faster models. |
| Provider load | Peak times = 2β3Γ latency | Multi-provider failover, request queuing. |
| Streaming | Reduces perceived latency to TTFT (~200ms) | Always stream for user-facing responses. |
In traditional SaaS, you pay for servers. In LLM systems, you pay for tokens β every input token and every output token, every call. This fundamentally changes how you think about system design.
| Model (2024β2025) | Input Cost | Output Cost | Cost per 1K queries (avg 2K in + 500 out) |
|---|---|---|---|
| GPT-4o | $2.50 / 1M tokens | $10.00 / 1M tokens | $10.00 |
| GPT-4o-mini | $0.15 / 1M tokens | $0.60 / 1M tokens | $0.60 |
| Claude 3.5 Sonnet | $3.00 / 1M tokens | $15.00 / 1M tokens | $13.50 |
| Claude 3.5 Haiku | $0.25 / 1M tokens | $1.25 / 1M tokens | $1.13 |
| Gemini 1.5 Flash | $0.075 / 1M tokens | $0.30 / 1M tokens | $0.30 |
The 16Γ cost difference matters. GPT-4o costs 16Γ more than GPT-4o-mini per query. If 70% of your queries are simple enough for mini, routing them saves ~60% of your LLM spend. Model routing is the single highest-ROI design decision for cost optimization.
LLM cost overruns rarely come from a single expensive call. They come from emergent system behavior β patterns that only appear at scale or under edge-case inputs.
1 user request β 5 agent steps β 3 retries each β 15 LLM calls for what should have been 1β2.
Fix: Cap steps (max_iterations=10), cap retries (max=2), add cost circuit breaker per request.
Multi-turn conversations or document chains grow context exponentially. Turn 10 of a chat can have 10Γ the tokens of turn 1 β same cost structure, very different price.
Fix: Summarize old turns. Never pass raw history unbounded.
MapReduce on user-uploaded documents: a 500-page PDF becomes 250 LLM map calls. Multiply by daily uploads.
Fix: Cap input size. Estimate cost before processing. Require user confirmation above threshold.
A bug causes 100% of requests to fail validation β all retry 3Γ β 4Γ provider load β rate limiting β more retries. Cost spikes 4Γ with zero benefit.
Fix: Circuit breaker: if error rate >50% in 60s, stop retrying and alert.
| Failure Mode | Traditional Software | LLM Systems |
|---|---|---|
| Wrong output | Bug β fix code β fixed forever | Hallucination β tweak prompt β might recur |
| Slow response | Profile code β optimize β consistent improvement | Depends on provider load, token count, model β varies per call |
| Rate limiting | Scale horizontally, add servers | Provider-imposed limits, can't self-serve more capacity |
| Cost overrun | Fixed infrastructure cost, predictable | Usage-based, runaway loops can burn budget in minutes |
| Format errors | Type system prevents malformed data | LLM can return any string β malformed JSON, truncated output |
| Provider outage | Your infra, your control | Third-party dependency β OpenAI goes down, your app goes down |
Most LLM systems depend on 1β2 providers (OpenAI, Anthropic). When they have outages β and they do, regularly β your entire application stops. Production systems need multi-provider failover: primary model (GPT-4o) β fallback model (Claude Sonnet) β degraded mode (cached responses or smaller model). Chapter 3 covers model selection and routing.
Every LLM call costs money and time. Ask: can this be done without calling the LLM? Can I cache the result? Can I batch multiple queries?
Don't use GPT-4o for classification. Don't use Claude Opus for extraction. Route each task to the cheapest model that achieves acceptable quality.
LLMs can return anything. Parse, validate, and type-check every response. Use JSON mode / structured outputs where available.
A 3-second response feels fast when streamed token-by-token. Without streaming, users stare at a blank screen for 3 seconds. Always stream.
Every LLM call can fail: hallucination, rate limit, timeout, malformed output. Retry, fallback, degrade gracefully β never crash on LLM failure.
Track latency, cost, quality, and error rates per model, per endpoint, per prompt version. You can't optimize what you can't measure.
Don't start with agents, multi-model routing, and semantic caching. Start with a single model, a simple prompt, and an eval set. Add complexity only when measurement shows you need it. The best LLM system is the one with the fewest LLM calls.
Every LLM application has these layers, whether you build them explicitly or not. The chapters that follow cover each in depth: architecture patterns (Ch 2), model selection (Ch 3), API design (Ch 4), caching (Ch 5), scaling (Ch 6), latency (Ch 7), cost (Ch 8), infrastructure (Ch 9), and real-world case studies (Ch 10).
Without structured logging, debugging LLM failures is guesswork. Every call must emit a structured log entry. For multi-step systems, every intermediate step must be logged β not just the final output.
β’ request_id β trace across services
β’ user_id β cost attribution
β’ prompt β sanitized if PII possible
β’ response β actual output
β’ tokens_in / tokens_out β cost calculation
β’ latency_ms β performance tracking
β’ model β which model was used
β’ retries β number of retry attempts
β’ fallback_triggered β boolean
β’ step_index β which step in the chain
β’ step_type β llm_call / tool_call / validate
β’ tool_name + tool_input + tool_output
β’ intermediate_output β per step
β’ total_cost_usd β running total
β’ terminated_early β if circuit breaker fired
β’ loop_count β for agent iterations
β’ parent_request_id β for sub-calls
In the architecture diagram, the Application Layer appears as a thin band. In practice, it becomes the largest and most complex part of the system. Unlike the LLM layer (a managed API) and the data layer (a database), the application layer is entirely custom code that grows with every feature.
The application layer accumulates: prompt templates, output parsers, validation schemas, retry logic, routing rules, cost tracking hooks, fallback chains, streaming wrappers, and tool dispatch. Each added to handle a specific failure in production. Design it to be modular from day one β test each component independently, make each observable, and version your prompt templates. A monolithic application layer is the #1 source of debugging pain in mature LLM systems.
Every component you add (RAG, routing, caching, agents) adds failure modes, latency, and maintenance burden. A single GPT-4o call with a good prompt can often outperform an overengineered pipeline with multiple models and retrieval steps. Only add complexity when your eval suite proves simple isn't good enough.
Most production LLM systems do not need agents, multi-model orchestration, or complex pipelines. A single well-designed prompt and model solves the majority of use cases β at lower cost, lower latency, and higher reliability.
| Component | Add It When⦠| Don't Add It Because⦠|
|---|---|---|
| Agents / orchestration | Control flow is genuinely dynamic; tool use is required | It looks powerful or is trending |
| Multi-model routing | Evals show 1 model can't handle all task types and costs differ | You want to use multiple providers |
| RAG pipeline | Knowledge is too large or dynamic for the context window | You have <100 documents |
| Semantic caching | Exact cache hit rate <5% and queries are repetitive in meaning | You haven't measured exact cache hit rate yet |
| Self-hosted models | Volume exceeds ~10M tokens/day or strict data privacy mandate | You want more control in principle |
The best LLM system is the one with the fewest LLM calls. Every call is a source of latency, cost, and non-determinism. Reduce calls through caching, batching, and simpler architectures β and your system will be faster, cheaper, and more reliable.
∑ Chapter 01 — Key Takeaways
- LLM systems are constrained by three things traditional software isn't: non-determinism, second-scale latency, and per-token costs
- Never make an LLM call a single point of failure β validate outputs, retry, fallback, degrade gracefully
- Latency has two phases: TTFT (input processing) and generation (output tokens) β streaming hides TTFT
- Cost varies 16Γ across models β routing simple tasks to cheap models is the highest-ROI optimization
- LLM failure modes are different: hallucination, rate limits, format errors, provider outages β design for all of them
- Seven principles: minimize calls, cheapest model, validate outputs, stream, design for failure, measure, keep it simple
- Start with the simplest system β add complexity only when evaluation proves you need it
- The best LLM system is the one with the fewest LLM calls
Every LLM application is built from a small set of composable patterns. Understanding these patterns lets you pick the right architecture for your problem instead of over-engineering or under-building. Start with the simplest pattern that works.
One prompt β one LLM call β one response. The simplest possible pattern.
- Classification, extraction, summarization
- Latency: 200msβ2s
- Cost: 1 LLM call
- Start here.
Output of call A becomes input to call B. Multi-step processing with deterministic order.
- Extract β classify β format
- Latency: N Γ single call
- Cost: N LLM calls
- Each step can use different model
Classify input first, then route to the appropriate handler (model, prompt, or pipeline).
- Intent detection β specialized pipeline
- Latency: classifier + handler
- Cost: 1 cheap classify + 1 handler
- Key pattern for model routing
Send the same input to multiple LLM calls simultaneously, aggregate results.
- Generate 3 drafts β pick best
- Latency: max(calls) not sum
- Cost: N Γ single call
- Needs aggregation logic
Split large input into chunks, process each (map), then combine results (reduce).
- Summarize 100-page document
- Latency: map (parallel) + reduce
- Cost: N map calls + 1 reduce
- Handles inputs beyond context window
LLM decides what to do next in a loop. Non-deterministic control flow.
- Tool use, multi-step reasoning
- Latency: unpredictable (3β15 steps)
- Cost: high, variable
- Use only when others can't work
| Pattern | LLM Calls | Latency | Predictability | Best For |
|---|---|---|---|---|
| Single Call | 1 | 200msβ2s | High | Classification, extraction, simple Q&A |
| Chain | 2β5 | 1β5s | High | Multi-step processing, transform pipelines |
| Router | 2 | 0.5β3s | High | Cost optimization, intent-based dispatch |
| Parallel Fan-out | N (parallel) | max(calls) | High | Quality improvement, consensus, diversity |
| MapReduce | N+1 | 1β10s | High | Large docs, batch processing |
| Orchestrator | 3β15+ | 3β30s+ | Low | Dynamic multi-step, tool use, research |
In tool-using agents and routers, there is a non-obvious scaling limit: as the number of available tools grows, model tool selection accuracy degrades. More tools = more confusion, not more capability.
Reliable selection. Model consistently picks the right tool, descriptions are easy to differentiate.
- Selection accuracy: ~95%
- Manageable prompt overhead
- Good tool descriptions sufficient
Noticeable degradation. Model occasionally picks wrong tool or combines incompatible tools.
- Selection accuracy: ~80β85%
- Requires more specific descriptions
- Needs mitigation strategies
Significant degradation. Tool selection becomes a primary failure source, outweighing other problems.
- Selection accuracy: <70%
- High hallucination of tool names
- Requires structural mitigation
(1) Group tools by function β expose only the relevant group per task (search tools vs write tools vs compute tools). (2) Use a router before tool exposure β classify intent first, then present only the 3β5 relevant tools for that intent category. (3) Limit visible tools per step β in multi-step agents, expose only the tools needed for the current step. The goal: never present more than 10 tools at once.
Real LLM applications compose multiple patterns. A customer support system might use: Router (classify intent) β RAG (retrieve knowledge for FAQ queries) β Chain (extract + respond for complex issues) β Agent (multi-step for account changes).
Most successful LLM applications follow this evolution: Single Call (prototype) β Chain (add structure) β Router (add cost optimization) β RAG/MapReduce (add knowledge) β Agent (add autonomy, only if needed). Each step is driven by evaluation showing the simpler pattern isn't sufficient.
Despite the popularity of agents in demos and research, the majority of production LLM systems use simpler patterns β because simpler patterns are cheaper, faster, and more predictable.
β’ Single call β extraction, classification, summarization
β’ Chain β structured multi-step processing
β’ Router β cost and quality optimization
β’ RAG pipeline β knowledge-grounded Q&A
These cover ~90% of real-world LLM use cases.
β’ Control flow is genuinely dynamic β can't be predetermined
β’ Tool use is required (search, code execution, APIs)
β’ Problem requires multi-step reasoning with branching
β’ Simpler patterns have been tried and measured as insufficient
Agents add three compounding costs: latency (3β15 LLM calls instead of 1β2), cost (multiplicative with step count), and unpredictability (non-deterministic control flow is hard to test and debug). If a Chain or Router can solve the problem, use it. Agents should be the last resort, not the first architecture.
∑ Chapter 02 — Key Takeaways
- Six core patterns: Single Call, Chain, Router, Parallel Fan-out, MapReduce, Orchestrator
- Single Call first β most tasks don't need multi-step processing
- Router is the key cost pattern β classify intent, dispatch to cheapest capable handler
- MapReduce handles documents larger than context window β map in parallel, reduce to one answer
- Orchestrator (Agent) is the most powerful but most expensive and unpredictable β use last
- Real systems compose patterns β router β chain β RAG is a common production stack
There is no "best model." There is only the best model for this task at this cost. GPT-4o is overkill for classification. Haiku is too weak for complex reasoning. Model selection and routing is how you get quality and affordability.
| Capability | GPT-4o | GPT-4o-mini | Claude Sonnet | Claude Haiku | Gemini Flash |
|---|---|---|---|---|---|
| Complex reasoning | β β β β β | β β β | β β β β β | β β β | β β β |
| Code generation | β β β β β | β β β β | β β β β β | β β β | β β β β |
| Classification | β β β β β | β β β β β | β β β β β | β β β β | β β β β |
| Extraction | β β β β β | β β β β | β β β β β | β β β β | β β β β |
| Long context | 128K | 128K | 200K | 200K | 1M |
| Speed | Medium | Fast | Medium | Very Fast | Very Fast |
| Cost | $$$ | $ | $$$ | $ | $ |
For classification, extraction, and simple formatting, cheap models (mini, Haiku, Flash) perform within 1β2% of frontier models β at 10β20Γ lower cost. Reserve expensive models for complex reasoning, nuanced writing, and multi-step analysis.
Map task types to models at design time. Simplest approach.
- Classification β mini
- Summarization β Haiku
- Complex analysis β GPT-4o
- Static, no runtime overhead
Use a cheap model to classify query complexity, then route to the appropriate model.
- Classifier (mini) β decides: simple/complex
- Simple β mini ($0.0003)
- Complex β GPT-4o ($0.02)
- Overhead: 1 cheap LLM call
Try cheap model first. If confidence is low, escalate to expensive model.
- Try mini β if uncertain, try 4o
- 80% of queries handled by mini
- 20% escalated to 4o
- Best quality-cost tradeoff
| Trigger | Fallback Action | User Impact |
|---|---|---|
| Provider timeout (>10s) | Switch to secondary provider | Slightly different style, same quality |
| Rate limit (429) | Queue + retry with backoff, or secondary | 200β500ms added delay |
| Provider outage | Switch to secondary provider entirely | Seamless if well-tested |
| All providers down | Serve cached responses for common queries | Stale but available |
| Budget exhausted | Route all traffic to cheapest model | Lower quality, still functional |
A fallback chain that's never been tested doesn't work. Regularly simulate provider failures (chaos engineering) and verify: your fallback triggers correctly, the secondary provider returns compatible output, and your parsers handle the different response format. The worst time to discover your fallback is broken is during an actual outage.
∑ Chapter 03 — Key Takeaways
- There's no "best model" β only the best model for this task at this cost
- Cheap models (mini, Haiku, Flash) match frontier models for classification and extraction at 10β20Γ lower cost
- Three routing strategies: task-based (static), LLM-based (classify then route), confidence-based (try cheap, escalate if unsure)
- Confidence-based routing handles 80% of queries cheaply, escalates 20% to expensive models
- Fallback chains across providers prevent single-provider outages from taking down your system
- Test your fallbacks β untested fallback chains fail when you need them most
Your LLM system's API is the contract with your consumers. LLM-powered APIs have unique challenges: long response times, streaming output, non-deterministic results, and variable costs per call. Traditional REST patterns don't always apply.
Client sends request, waits for complete response.
Pro: Simple, standard REST. Easy to cache, retry, log.
Con: User stares at spinner for 2β5s. Feels slow.
Best for: API-to-API calls, background processing, short responses.
Server sends tokens as they're generated.
Pro: First token in ~200ms. Feels instant. Better UX.
Con: Harder to cache, parse, and handle errors mid-stream.
Best for: User-facing chat, long responses, real-time interaction.
Stream for humans, batch for machines. If the consumer is a user looking at a screen, stream via SSE. If the consumer is another service that needs a complete JSON response, use standard request-response. Many systems expose both: a streaming endpoint for the frontend and a batch endpoint for internal services.
Simple: send request, wait for response with a timeout. Best for calls under 10s.
- Timeout: 10β30s
- Return 504 on timeout
- Client retries
Client submits job, gets job_id, polls for result. Best for calls 10sβ5min.
- POST /jobs β returns job_id
- GET /jobs/{id} β status + result
- Or: webhook on completion
Persistent connection. Server pushes updates. Best for real-time + multi-turn.
- Progress updates: "Searching..."
- Streaming tokens
- Client can cancel mid-generation
LLMs return strings. Your API consumers expect structured data. The gap between these two is where most production bugs live.
| Strategy | How | Reliability | When |
|---|---|---|---|
| JSON Mode | OpenAI/Anthropic native: response_format: {"type": "json_object"} | Very high β model forced to output valid JSON | Always, when available |
| Structured Outputs | OpenAI: define JSON schema, model must match it exactly | Highest β schema-enforced | When you need guaranteed schema |
| Prompt + parse | Ask for JSON in prompt, parse manually | Medium β model may add markdown fences, skip fields | When structured output unavailable |
| Retry on parse failure | If JSON parsing fails, retry with error feedback | Good with 1β2 retries | Always as fallback layer |
LLM provider rate limits are strict and per-organization. If one customer sends 1000 requests, they can exhaust your rate limit for everyone. You need rate limiting at your API layer too.
Cap requests per user per minute. Prevents one user from starving others.
- Free tier: 10 req/min
- Pro tier: 60 req/min
- Return 429 with Retry-After header
Cap total tokens per user per day/month. Prevents cost overrun.
- Track cumulative tokens per API key
- Return 429 when budget exhausted
- Dashboard showing usage
When approaching provider rate limits, queue requests instead of failing.
- Request queue with priority
- Return 202 + job_id when queued
- Shed load during peak
LLM calls are expensive and non-deterministic. When a client retries a timed-out request, you don't want to run (and pay for) the LLM call again. Idempotency keys solve this.
Client sends request β timeout β retries β LLM called twice.
You pay double. User may get different answers for the "same" request.
Client sends request + Idempotency-Key: abc123.
First call: runs LLM, caches result keyed by abc123.
Retry: returns cached result. No second LLM call.
Without idempotency, a client that retries the same request might get a different answer β because LLMs are non-deterministic. This confuses users and breaks downstream systems that expect consistent results. Always cache the first response for a given idempotency key (TTL: 24h typical).
∑ Chapter 04 — Key Takeaways
- Stream for humans, batch for machines β expose both endpoints
- Async patterns: sync + timeout (<10s), job queue (10sβ5min), WebSocket (real-time + multi-turn)
- Use JSON Mode or Structured Outputs whenever available β prompt-based JSON is fragile
- Always add Pydantic/schema validation + retry on parse failure as a safety net
- Rate limit at your API layer β per-user request limits + token budgets + global backpressure
- Idempotency keys prevent double LLM calls on retries and ensure consistent responses
The cheapest and fastest LLM call is the one you don't make. Caching is the most impactful optimization for LLM systems β it reduces cost, latency, and provider dependency in one move. But LLM caching is harder than traditional caching because inputs are natural language, not exact keys.
Hash the full prompt β cache response. Identical prompt = cache hit.
- Hit rate: 5β15% (prompts vary)
- Implementation: Redis / in-memory
- Zero false positives
- Always implement this first
Embed the query β find similar past queries in vector DB β return cached answer if similar enough.
- Hit rate: 15β35% (catches paraphrases)
- Implementation: Vector DB + threshold
- Risk: false positives if threshold too low
- Saves the most money
OpenAI/Anthropic cache system prompts across calls. Same prefix = faster + cheaper.
- Automatic for long system prompts
- 50% input cost reduction
- Reduced TTFT
- No implementation needed β built in
When self-hosting: reuse key-value cache across requests with shared prefixes.
- vLLM automatic prefix caching
- Same system prompt = cached KV
- 30β60% faster TTFT
- Only for self-hosted models
Setting the similarity threshold too low causes wrong answers returned from cache. "What's the refund policy?" and "What's the return policy?" may be 0.92 similar but have different answers. Start with threshold β₯ 0.95 and lower only with testing. A wrong cached answer is worse than a slow correct one.
| Strategy | How | Best For |
|---|---|---|
| TTL (time-to-live) | Cache expires after N hours/days | General answers, FAQs (TTL: 6β24h) |
| Version key | Include prompt version in cache key β new prompt = new cache | When you update prompts/models |
| Event-driven | Clear cache when underlying data changes | RAG: doc updated β clear related caches |
| Manual purge | Admin action to clear specific cache entries | Wrong answers discovered in production |
∑ Chapter 05 — Key Takeaways
- Four caching layers: exact match (simple), semantic (paraphrases), prompt cache (provider), KV cache (self-hosted)
- Exact match first β zero false positives, easy to implement, 5β15% hit rate
- Semantic cache catches paraphrases (15β35% hit rate) but needs careful threshold tuning (β₯0.95)
- Provider prompt caching is free optimization β 50% input cost reduction on long system prompts
- Cache invalidation: TTL for general, version keys for prompt changes, events for data changes
- A wrong cached answer is worse than a slow correct one β tune thresholds conservatively
Scaling LLM applications is fundamentally different from scaling traditional web apps. You can't just add servers β your bottleneck is third-party API rate limits, not your own compute. Scaling strategy is about managing concurrency, queuing, and provider capacity.
| Challenge | Traditional App | LLM App |
|---|---|---|
| Bottleneck | Your servers (scalable) | Provider API rate limits (not in your control) |
| Response time | <100ms (scale horizontally) | 1β5s per call (can't parallelize a single call) |
| Cost of scale | Fixed infrastructure β amortized | Linear: 2Γ queries = 2Γ LLM cost |
| Capacity planning | Auto-scale based on CPU/memory | Pre-negotiate rate limits, multi-provider |
| Request weight | All requests ~equal cost | Requests vary 10β100Γ in token cost |
The queue absorbs burst traffic, the worker pool enforces the provider rate limit. Clients get immediate acknowledgment (202 Accepted), workers process at the maximum rate the provider allows. Scale workers up to match rate limits, not beyond. Queue depth is your auto-scaling signal β high queue = add workers (up to rate limit cap).
Monitor requests/min and tokens/min against provider limits. Throttle before hitting the limit.
- Sliding window counter
- Alert at 80% of limit
- Auto-throttle at 90%
Split traffic across multiple providers to multiply effective rate limits.
- 60% OpenAI, 40% Anthropic
- Weighted routing per model quality
- 2Γ effective capacity
When approaching limits, serve high-priority requests first.
- Paid users before free users
- Real-time before batch
- Short requests before long
| Scale Tier | Daily Queries | Architecture | Estimated LLM Cost/Day |
|---|---|---|---|
| Prototype | <1K | Single server, sync calls | $1β$10 |
| Small prod | 1Kβ10K | App server + cache (Redis) | $10β$100 |
| Medium prod | 10Kβ100K | Queue + worker pool + multi-provider | $100β$1,000 |
| Large prod | 100Kβ1M+ | Full queue arch + model routing + caching + self-host mix | $1,000β$10,000+ |
∑ Chapter 06 — Key Takeaways
- LLM scaling bottleneck is provider rate limits, not your servers β you can't just add compute
- Queue-based architecture decouples intake from execution β workers process at max provider rate
- Scale workers to match rate limits, not beyond β queue depth is your auto-scaling signal
- Multi-provider spreading multiplies effective rate limits (60/40 split = 2Γ capacity)
- Prioritize: paid before free, real-time before batch, short before long
- Cost scales linearly β 2Γ queries = 2Γ LLM cost (caching and routing are your only levers)
Users expect sub-second responses. LLM calls take 1β5 seconds. This gap is where latency engineering lives β reducing actual latency where possible, and masking it with streaming and progressive rendering where not.
| Component | Typical Latency | Optimization | Savings |
|---|---|---|---|
| Network (client β server) | 10β50ms | CDN, edge deployment | 20β30ms |
| Your app logic | 5β20ms | Optimize prompts, pre-compute | 10ms |
| Cache check | 1β5ms (Redis) | In-memory for hot queries | Skips LLM entirely on hit |
| TTFT (LLM) | 200msβ2s | Shorter prompts, prompt caching, smaller model | 50β500ms |
| Token generation | 1β5s total | max_tokens limit, concise prompts, streaming | Perceived: ~0ms with streaming |
| Post-processing | 5β50ms | Async validation, stream while processing | Overlap with generation |
First token in ~200ms vs waiting 3s for full response. The single biggest UX improvement.
- SSE for REST APIs
- WebSocket for bidirectional
- Effort: Low
Skip the LLM call entirely. Cache hit = response in <10ms instead of 2s.
- Exact match + semantic cache
- 15β35% of queries cached
- Effort: Medium
GPT-4o-mini is 2β3Γ faster than GPT-4o. Route simple queries to fast models.
- Classifier β route to mini/Haiku
- 70% of queries can use mini
- Effort: Medium
Every 1K fewer input tokens saves 50β200ms TTFT. Remove fluff from system prompts.
Independent LLM calls run simultaneously. Latency = max(calls) not sum(calls).
"Searching..." β "Found 3 docs..." β "Generating..." β streamed answer. Each step feels fast.
When streaming JSON responses, you get partial JSON tokens: {"sen β timent": β "pos β itive"}. You can't parse until the full object arrives. Solutions: (1) stream text, return metadata separately, (2) use a streaming JSON parser (jsonstream), or (3) stream the text response and return structured data as a final non-streamed event.
∑ Chapter 07 — Key Takeaways
- Streaming is #1 β first token in ~200ms vs 3s wait. The single biggest UX improvement you can make.
- Caching is #2 β skip the LLM entirely. Cache hit = <10ms response.
- Smaller models are #3 β mini/Haiku are 2β3Γ faster for simple tasks
- Shorter prompts directly reduce TTFT β every 1K fewer tokens saves 50β200ms
- Run independent calls in parallel β latency = max, not sum
- Progressive rendering (showing steps) makes any system feel faster
LLM costs are deceptive: they start small and grow linearly with usage. A system that costs $10/day at launch can cost $10,000/day at scale without any architecture changes. Cost management is not an afterthought β it's a first-class design constraint.
Every dollar you spend on LLMs comes from three levers: how many calls you make, how many tokens per call, and which model you use. Cost optimization attacks all three.
Daily Cost = Queries/day Γ Avg Tokens/query Γ Model Price/token. Optimize all three. Model routing alone (sending 70% of queries to mini instead of GPT-4o) can cut costs by 60β70%. Combined with caching (eliminating 20β30% of calls) and prompt compression (reducing tokens 30%), total cost reduction of 75β85% is achievable without quality loss.
Without token budgets, a single runaway agent loop or abusive user can burn your monthly budget in minutes. Token budgets enforce hard and soft limits at every level.
| Budget Level | What It Limits | Implementation | Enforcement |
|---|---|---|---|
| Per request | Max tokens per single LLM call | Set max_tokens parameter | Provider enforces β free |
| Per user/day | Total tokens a user can spend daily | Track in Redis with daily TTL key | Return 429 when exhausted |
| Per API key/month | Total organization spend | Provider dashboard spend limits | Provider cuts off at limit |
| Per feature | Limit expensive features (agents, long-form) | Feature flags based on user tier | App-level enforcement |
| System-wide | Total queries per hour (circuit breaker) | Global counter, open circuit if exceeded | Degrade gracefully when triggered |
Tasks solvable without LLM β use rule-based or traditional ML.
- Keyword matching, regex
- Traditional classifiers (sklearn)
- Embedding similarity (no LLM)
- Cost: ~$0
Simple tasks where a small/fast model performs within 2% of frontier.
- GPT-4o-mini / Claude Haiku / Gemini Flash
- Classification, extraction, formatting
- Short Q&A, paraphrase detection
- Cost: $0.01β$0.10 / 1K queries
Complex tasks requiring frontier reasoning or nuanced writing.
- GPT-4o / Claude Sonnet / Gemini Pro
- Multi-step reasoning, code gen
- Complex analysis, long-form writing
- Cost: $0.50β$5.00 / 1K queries
Benchmark each task type across tiers before committing to a model. In practice, 60β75% of production LLM queries are Tier 2 β classification, simple extraction, FAQ answers, format conversion. These can run on cheap models at 10β20Γ lower cost with <2% quality delta. Only reserve Tier 3 for tasks where you've measured it makes a difference.
OpenAI and Anthropic offer batch APIs that process requests asynchronously at a 50% price discount. If your use case tolerates minutes-to-hours latency, batch processing is a free cost halving.
Latency: 1β5s per request
Cost: Full price (e.g. $2.50/1M tokens)
Best for: User-facing generation, interactive features
When: Response needed in <10s
Latency: Minutes to 24 hours
Cost: 50% off (e.g. $1.25/1M tokens)
Best for: Embedding runs, bulk eval, background analysis
When: Any offline/background processing
| Use Case | API Type | Rationale |
|---|---|---|
| Chat response | Sync (real-time) | User waiting β can't delay |
| Embedding 10K documents | Batch | No user waiting β 50% savings |
| Nightly eval run | Batch | Background job β 50% savings |
| Bulk data extraction | Batch | No realtime need β 50% savings |
| Content moderation | Depends | Sync if blocking publish; batch if reviewing async |
System prompts repeat every call. Every 1K tokens of system prompt = 1K input tokens per query. Audit ruthlessly.
- Remove examples that aren't needed
- Delete redundant instructions
- Use terse style over verbose
- Typical saving: 30β50%
For multi-turn conversations, don't send complete history. Summarize old turns instead.
- Keep last N turns verbatim
- Summarize older turns (~10% of tokens)
- Drop irrelevant retrieved docs
- Typical saving: 40β70% on long chats
Output tokens cost 3β5Γ more than input tokens (on most models). Constrain output aggressively.
- Set aggressive
max_tokens - Ask for "concise" / "one sentence"
- Request JSON vs prose
- Typical saving: 20β40%
Prompt compression has diminishing and then negative returns. Removing too much context causes the model to hallucinate missing information, produce wrong formats, or miss nuance. Always run your eval suite after compressing prompts. The goal is to remove tokens that don't affect output quality β not to minimize prompts at any cost.
Log tokens in + tokens out + model + cost for every LLM call. Attach user ID, feature name, request ID.
- Cost per feature / per endpoint
- Cost per user cohort
- Identify expensive edge cases
Alert when spend rate deviates from baseline. A 5Γ cost spike in 10 minutes is a runaway loop, not a traffic spike.
- Alert: hourly cost > 2Γ 7-day avg
- Alert: single request > $0.50
- Alert: daily budget > 80%
Track cost alongside quality metrics. A 30% cost reduction that drops quality 10% may not be worth it.
- Cost per quality point (eval score)
- Model routing effectiveness
- Cache hit rate vs cost saved
Know which feature is driving cost. Often one feature (e.g. agent with long context) drives 80% of LLM spend.
- Tag every call with feature name
- Cost breakdown by feature weekly
- Prioritize optimization by cost share
∑ Chapter 08 — Key Takeaways
- Three levers for cost: call volume (caching/batching), tokens per call (compression), model choice (tiering/routing)
- Model tiering is highest ROI β 60β75% of queries can run on cheap models at 10β20Γ lower cost
- Token budgets at every level prevent runaway costs β per request, per user, per day, per feature
- Batch API gives 50% cost reduction on any non-realtime workload β nightly evals, bulk embedding, background analysis
- Prompt compression (audit, prune, constrain output) typically saves 30β60% tokens β always run evals after
- Measure cost per feature β one expensive feature often drives 80% of LLM spend; find it and optimize it first
Your infrastructure choice is the single most consequential technical decision in an LLM system: API inference vs self-hosting. The wrong choice costs 10Γ more than necessary or requires months of re-engineering. This chapter gives you the framework to choose correctly and build it right.
| Dimension | API Providers (OpenAI, Anthropicβ¦) | Self-Hosted (vLLM, TGIβ¦) |
|---|---|---|
| Setup time | Minutes β just an API key | Days to weeks (GPU, infra, tuning) |
| Model quality | Frontier models (GPT-4o, Claude 3.5) | Open-source models (Llama 3, Mistral) |
| Cost at low volume | Cheap β pay per token, no infra | Expensive β GPU cost even at idle |
| Cost at high volume | Linear β $cost = $tokens | Fixed GPU cost, amortizes over volume |
| Data privacy | Data leaves your infrastructure | Full data control, on-prem option |
| Rate limits | Provider-imposed, shared quotas | Your hardware, your limits |
| Maintenance | Zero β provider handles everything | GPU infra, model updates, monitoring |
| Customisation | Limited (fine-tuning via API) | Full β fine-tune, modify, distill |
Use API providers until one of these triggers hits: (1) volume exceeds ~10M tokens/day (self-hosting becomes cheaper), (2) data privacy requirements mandate on-prem, (3) you need a customized/fine-tuned model, (4) rate limits block growth. Most companies never hit these triggers β API is the right default.
Largest ecosystem, best tooling, JSON mode + Structured Outputs. Primary choice for most teams.
- Models: GPT-4o, GPT-4o-mini, o1
- Context: 128K tokens
- Strengths: code, reasoning, function calling
- Best for: general-purpose default
Long context leader (200K), best for document analysis. Excellent instruction following.
- Models: Claude 3.5 Sonnet, Claude Haiku
- Context: 200K tokens
- Strengths: long-context, nuanced writing
- Best for: document QA, long analysis
1M context window, multimodal by default, competitive pricing. Strong for bulk/cheap processing.
- Models: Gemini 1.5 Pro, Gemini Flash
- Context: 1M tokens
- Strengths: massive context, multimodal
- Best for: very long docs, video, cost efficiency
Inference-optimized providers for open-source models. 10β50Γ faster than traditional GPU hosting.
- Models: Llama 3, Mistral, Mixtral
- Strengths: ultra-low latency (50ms TTFT)
- Speeds: 200β800 tok/s vs 30β80 for OpenAI
- Best for: latency-critical, open-source models
| Tool | Use Case | Key Feature | Best For |
|---|---|---|---|
| vLLM | Production serving | PagedAttention β 24Γ higher throughput, continuous batching | High-volume production self-hosting |
| TGI (Text Gen Inference) | Production serving | HuggingFace ecosystem, OpenAI-compatible API | HuggingFace model ecosystem |
| Ollama | Dev / local | One-command model management, Mac/Linux support | Local development, testing, prototyping |
| llama.cpp | Edge / CPU | Quantized models on CPU (no GPU needed) | Edge deployment, air-gapped systems |
| LiteLLM | Proxy / abstraction | Unified OpenAI-compatible interface over 100+ models | Multi-model routing, provider abstraction |
| GPU | VRAM | Max Model Size | Cloud Cost/hr | Best For |
|---|---|---|---|---|
| NVIDIA T4 | 16 GB | 7B models (fp16) | ~$0.50 | Dev/small inference, budget prod |
| NVIDIA A10G | 24 GB | 13B models, 7B with headroom | ~$1.50 | Small production workloads |
| NVIDIA A100 (40GB) | 40 GB | 30B models | ~$3.00β$4.00 | Medium production, fine-tuning |
| NVIDIA A100 (80GB) | 80 GB | 70B models, 30B+ with batching | ~$5.00β$7.00 | Large production workloads |
| NVIDIA H100 (80GB) | 80 GB | 70B models at highest throughput | ~$15β$25 | Highest throughput demand |
VRAM sizing rule: Model parameters Γ 2 bytes (fp16) = minimum VRAM. Llama 3 8B β 16 GB, Llama 3 70B β 140 GB (2Γ A100 80GB). Always add 20% headroom for KV cache. For multi-GPU: use tensor parallelism (vLLM --tensor-parallel-size N).
A common pattern: self-host a capable open-source model (Llama 3 70B) for bulk, sensitive, or high-volume requests; use API providers (GPT-4o) for frontier-quality tasks. LiteLLM acts as a transparent proxy, making both look like the same OpenAI API to your application.
∑ Chapter 09 — Key Takeaways
- API providers by default β only self-host when volume (>10M tokens/day), privacy, or customization demands it
- Provider strengths: OpenAI (ecosystem, code), Anthropic (200K context, documents), Google (1M context, cheap), Groq (ultra-low latency)
- vLLM is the production standard for self-hosting β PagedAttention gives 24Γ higher throughput
- VRAM sizing: model params Γ 2 bytes (fp16) + 20% headroom for KV cache
- LiteLLM provides a unified OpenAI-compatible interface over all providers β use it to avoid vendor lock-in
- Hybrid architecture: self-host for bulk/sensitive, use API for frontier quality β router decides per request
Theory meets reality. Four complete system design walkthroughs: a customer support chatbot, a code assistant, a document Q&A pipeline, and a content generation system. Each applies the full design toolkit from this guide β architecture, model routing, caching, scaling, cost, and observability.
Scale: 50K queries/day, peak 200 req/min
Latency: <2s first token (streaming)
Quality: Grounded in knowledge base, accurate
Cost target: <$0.01/query
Constraints: Customer data privacy
Pattern: Router β RAG β Chain
Primary model: GPT-4o-mini (80% of queries)
Complex model: GPT-4o (20% escalated)
Cache: Exact + semantic (Redis + Qdrant)
Queue: SQS for burst absorption
| Metric | Target | How Achieved |
|---|---|---|
| Cost per query | $0.008 (vs $0.025 naive) | Routing 80% β mini + 25% cache hit rate |
| P95 latency | 1.2s TTFT | Streaming + cache + mini for most queries |
| Hallucination rate | <1% | RAG grounding + output validator |
| Availability | 99.9% | OpenAI primary β Anthropic fallback β cached degraded |
Scale: 200K completions/day (heavy users)
Latency: <500ms first token (inline autocomplete)
Quality: Context-aware, correct syntax
Cost target: <$0.005/completion
Special: Must work on private repos (data sensitivity)
Inline completions: Self-hosted Llama 3 8B (privacy)
Complex generation: GPT-4o via API (quality)
Context: Sliding window of relevant files
Cache: Prefix cache (same method stubs = cache hit)
Infra: vLLM + 2Γ A100 80GB
Self-hosted Llama 3 8B on vLLM. 300ms P90 latency. No data leaves the company.
- vLLM with speculative decoding
- Prefix caching on file headers
- Code-specific fine-tune
- Cost: GPU amortized (~$0.0001)
User asks "explain this function" or "refactor this code" β GPT-4o via API for quality.
- Full conversation context
- Code context injection (RAG)
- OpenAI API (data anonymized)
- Cost: ~$0.02/chat exchange
Find relevant code snippets across repo. Embedding + vector search, no LLM call.
- text-embedding-3-small
- Qdrant vector store
- Embeds on file save
- Cost: <$0.0001/search
Code assistants are uniquely difficult because relevant context (imports, type definitions, caller code) may be spread across many files. Naively concatenating files burns the context window and buries the relevant code in noise. Use semantic retrieval to find the 3β5 most relevant code chunks, not the 100 lines immediately above the cursor. This also cuts input token cost by 60β80%.
Scale: 5K queries/day over 100K documents
Documents: PDFs, Word docs, PPTs up to 500 pages
Quality: Grounded answers with citations, no hallucination
Latency: <5s (batch-acceptable)
Cost target: <$0.05/query
Pattern: RAG with re-ranking
Chunking: Semantic (512 tokens, 10% overlap)
Retrieval: Hybrid (BM25 + dense), top-20 β re-rank β top-5
Model: Claude 3.5 Sonnet (200K context)
Vector store: Qdrant (self-hosted)
| Pipeline Stage | Component | Why This Choice |
|---|---|---|
| Document parsing | Unstructured.io (PDF, PPTX, DOCX) | Handles tables, images, complex layouts |
| Chunking | Semantic chunking (sentence-transformers) | Respects paragraph/section boundaries |
| Embedding | text-embedding-3-large (3072d) | Best retrieval quality on enterprise docs |
| Retrieval | Hybrid BM25 + dense (RRF fusion) | Dense catches semantics; BM25 catches keywords |
| Re-ranking | Cohere Rerank v3 (top-20 β top-5) | +15% answer accuracy vs raw retrieval |
| Generation | Claude 3.5 Sonnet + citation prompting | Long context + strong grounding instructions |
| Validation | Answer grounding check (LLM judge) | Catches hallucinations before serving |
Scale: 50K articles/month, burst possible
Quality: Brand-consistent, SEO-aware, fact-checked
Latency: Hours acceptable (batch job)
Cost target: <$0.50/article
Multi-step: Brief β outline β draft β edit β SEO
Pattern: Sequential chain (5 stages)
Batch API: Yes β 50% cost savings
Models: Mix of GPT-4o (quality) + mini (SEO, outline)
Queue: Celery + Redis for async pipeline
Human-in-loop: Review gate before publish
- Step 1 (outline, mini): $0.003
- Step 2 (draft, GPT-4o): $0.28
- Step 3 (edit, GPT-4o): $0.12
- Step 4 (SEO, mini): $0.005
- Total: ~$0.41 (vs $0.85 naive)
- Batch API: saves 50% on outline + SEO
- Model routing: mini for cheap steps
- Prompt compression: 40% shorter system prompt
- Result: $0.41 vs $0.85 baseline
- Fact-check step with web search
- Brand voice eval (LLM judge)
- SEO score check (keyword density)
- Human review gate before publish
- β Multi-provider fallback configured and tested
- β Retry logic with exponential backoff
- β Circuit breaker on all LLM calls
- β Graceful degradation path defined
- β Timeout set on every LLM call
- β Per-request max_tokens set
- β Per-user token budget enforced
- β Provider spend limits configured
- β Cost anomaly alerts firing
- β Model routing tested and validated
- β Every LLM call logged (tokens, cost, latency)
- β Trace IDs propagated end-to-end
- β Error rates dashboarded
- β P50/P95/P99 latency tracked per endpoint
- β Evals running in CI on prompt changes
- β API keys in secrets manager (not env files)
- β Output sanitization before rendering
- β Input length limits enforced
- β Prompt injection mitigations in place
- β PII not logged in traces
∑ Chapter 10 — Key Takeaways
- Customer support chatbot: Router β RAG β Chain with model tiering (mini for 80%) achieves $0.008/query vs $0.025 naive
- Code assistant: Self-host for inline completions (privacy + speed), API for complex chat β best of both worlds
- Document Q&A: Hybrid retrieval + re-ranking + citation prompting β <1% hallucination rate
- Content generation: Multi-step chain + batch API + model tiering = 50% cost reduction vs single-model pipeline
- Every production system needs: fallbacks, cost controls, observability, and security β all of them, before launch
- The best architecture evolves from simple to complex β driven by measured gaps, not anticipated ones