System Design · Reference

Reference & Resources

Patterns index, decision trees, key numbers, and interview guide.

Chapter One

System Design Patterns Index

The Pattern Catalog

Every system design problem is solved by combining a small set of recurring patterns. This index organizes them by concern — scalability, reliability, data, and communication — with one-line descriptions and links to deep coverage. Use this as a lookup table during design and review.

📈

Scalability Patterns

Horizontal Scaling — add machines, not CPU
Load Balancing — distribute requests across servers
Database Sharding — partition data by key range or hash
Read Replicas — separate read and write paths
CQRS — different models for read vs write
Auto-Scaling — adjust capacity based on load
CDN Caching — serve static content from edge
Scatter-Gather — fan out query to N nodes, aggregate results
Fan-Out on Write — push updates to followers at write time
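
As an illustration of the Scatter-Gather entry above, here is a minimal sketch. It assumes each "node" is just an in-memory list, and `search_shard`, `SHARDS`, and the scores are hypothetical stand-ins for a real per-node query.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical shards: each node holds part of the data set as (doc_id, score).
SHARDS = [
    [("doc1", 0.91), ("doc4", 0.40)],
    [("doc2", 0.75)],
    [("doc3", 0.88), ("doc5", 0.12)],
]

def search_shard(shard, min_score):
    """Stand-in for a per-node query: return local matches only."""
    return [(doc, score) for doc, score in shard if score >= min_score]

def scatter_gather(shards, min_score, top_k):
    """Fan the query out to every shard, then merge and rank the partial results."""
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        partials = list(pool.map(lambda s: search_shard(s, min_score), shards))
    merged = [hit for partial in partials for hit in partial]
    return sorted(merged, key=lambda hit: hit[1], reverse=True)[:top_k]

top_hits = scatter_gather(SHARDS, min_score=0.3, top_k=3)
print(top_hits)
```

The overall latency of this pattern is bounded by the slowest shard, so a production version would usually add per-shard timeouts and decide whether partial results are acceptable.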

🛡️

Reliability Patterns

Replication — multiple copies across nodes/regions
Circuit Breaker — fail fast when downstream is degraded
Bulkhead — isolate failures to subsystems
Retry + Backoff — handle transient failures gracefully
Health Checks — detect failures before users do
Graceful Degradation — serve partial data over no data
Idempotency — make retries safe
Dead Letter Queue — capture failed messages for inspection and replay
Saga Pattern — manage distributed transactions without 2PC
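
The Retry + Backoff entry above can be sketched as follows. This is one common variant (exponential backoff with full jitter), not the only one. `TransientError`, `flaky`, and the default delays are hypothetical names and values chosen for illustration.

```python
import random
import time

class TransientError(Exception):
    """Hypothetical marker for failures that are worth retrying."""

def retry_with_backoff(operation, max_attempts=5, base_delay=0.1, max_delay=2.0,
                       sleep=time.sleep, rng=random.random):
    """Call operation(); on TransientError wait up to base_delay * 2**attempt, capped."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the failure to the caller
            ceiling = min(max_delay, base_delay * (2 ** attempt))
            sleep(rng() * ceiling)  # full jitter: uniform in [0, ceiling)

# Demo: an operation that fails twice, then succeeds.
calls = {"count": 0}

def flaky():
    calls["count"] += 1
    if calls["count"] < 3:
        raise TransientError("temporary failure")
    return "ok"

# Inject a fake sleep and a fixed rng so the demo runs instantly and deterministically.
delays = []
result = retry_with_backoff(flaky, sleep=delays.append, rng=lambda: 1.0)
print(result, delays)
```

Retrying like this is only safe when the operation is idempotent, which is why Idempotency appears in the same list.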

💾

Data Patterns

Cache-Aside — app reads cache, miss → DB → fill cache
Write-Through — write to cache + DB simultaneously
Event Sourcing — store events, derive state
Change Data Capture — stream DB changes to consumers
Consistent Hashing — distribute data with minimal reshuffling
Bloom Filter — probabilistic set membership test
Double-Entry Ledger — debit + credit = 0 invariant
Write-Behind Cache — write to cache, async flush to DB
Inverted Index — term → document_ids for full-text search
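
To make "minimal reshuffling" in the Consistent Hashing entry concrete, here is a small sketch of a hash ring with virtual nodes. The class name `HashRing`, the node names, the key format, and the use of SHA-256 are assumptions made for this example.

```python
import bisect
import hashlib

def _hash(key: str) -> int:
    """Map a string to a point on the ring (SHA-256 used only for its even spread)."""
    return int(hashlib.sha256(key.encode("utf-8")).hexdigest(), 16)

class HashRing:
    def __init__(self, nodes=(), vnodes=150):
        self.vnodes = vnodes
        self._points = []   # sorted ring positions
        self._owner = {}    # ring position -> node name
        for node in nodes:
            self.add_node(node)

    def add_node(self, node: str) -> None:
        """Place `vnodes` virtual points for this node on the ring."""
        for i in range(self.vnodes):
            point = _hash(f"{node}#{i}")
            self._owner[point] = node
            bisect.insort(self._points, point)

    def get_node(self, key: str) -> str:
        """Return the owner of the first ring point clockwise from the key."""
        idx = bisect.bisect_right(self._points, _hash(key)) % len(self._points)
        return self._owner[self._points[idx]]

keys = [f"user:{i}" for i in range(10_000)]
ring = HashRing(["node-a", "node-b", "node-c"])
before = {k: ring.get_node(k) for k in keys}
ring.add_node("node-d")
after = {k: ring.get_node(k) for k in keys}
moved = sum(1 for k in keys if before[k] != after[k])
print(f"moved {moved / len(keys):.1%} of keys")
```

Going from three nodes to four should move roughly a quarter of the keys, and every moved key lands on the new node. With naive `hash(key) % N` placement, most keys would change owner instead.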

🔗

Communication Patterns

Pub/Sub — decouple producers from consumers
Request-Reply — synchronous call-and-wait
Event-Driven — react to events asynchronously
API Gateway — single entry point, route to services
Service Mesh — infrastructure-level traffic management
WebSocket — bidirectional persistent connection
gRPC — binary, typed, streaming RPC
Long Polling — HTTP hold until data available (chat fallback)
Fan-Out — one message to many consumers simultaneously

Pattern Selection by System Characteristic

📋 Chapter 1 — Summary

36 core patterns organized by concern: scalability, reliability, data, communication.
Identify your system's dominant characteristic (read-heavy, write-heavy, real-time, durable).
Select patterns that address the primary constraint first, then layer secondary patterns.

Chapter Two

Decision Trees

Structured Guidance for Common Design Decisions

System design decisions are not arbitrary — they follow logical paths based on your requirements. These decision trees encode the reasoning experienced engineers use intuitively. Start at the top, answer each question honestly, and follow the path to a well-reasoned choice.

Decision Tree: SQL vs NoSQL

🔄

Sync vs Async Communication

Use Sync (REST/gRPC) when:
Caller needs immediate response (user-facing API)
Operation is fast (<500ms), low-failure-rate
Strong consistency required between caller + callee
Use Async (Queue/Event) when:
Caller doesn't need immediate result (background job)
Downstream is unreliable or slow
Fan-out to multiple consumers
Need retry/replay capability

💾

When to Add a Cache

Add cache when:
Read:write ratio > 10:1
Same data requested repeatedly (temporal locality)
Computation to produce data is expensive
Source latency unacceptable for user experience
Don't cache when:
Data changes frequently (invalidation cost > benefit)
Every request is unique (no locality)
Stale data is unacceptable (strong consistency required)
Data is cheap to recompute
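
One way to make these criteria quantitative is to estimate average read latency as a function of cache hit ratio. The sketch below assumes a cache-aside read path where a miss pays for both the cache lookup and the source read. The 1 ms and 50 ms figures are illustrative assumptions, not measurements.

```python
def effective_latency_ms(hit_ratio, cache_ms, source_ms):
    """Average read latency: hits pay the cache, misses pay cache + source."""
    if not 0.0 <= hit_ratio <= 1.0:
        raise ValueError("hit_ratio must be between 0 and 1")
    miss_ratio = 1.0 - hit_ratio
    return hit_ratio * cache_ms + miss_ratio * (cache_ms + source_ms)

for hit_ratio in (0.0, 0.5, 0.9, 0.99):
    avg = effective_latency_ms(hit_ratio, cache_ms=1.0, source_ms=50.0)
    print(f"hit ratio {hit_ratio:.0%}: {avg:.1f} ms average read")
```

With no locality (hit ratio near zero) the cache only adds overhead, which matches the "every request is unique" case above. The benefit grows quickly once most reads hit the cache.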

🏗️

Monolith vs Microservices

Start with monolith when:
Team < 10 engineers
Domain boundaries unclear (still learning)
Speed of iteration > operational complexity
Shared database is fine for consistency
Move to microservices when:
Teams own independent domains (5-9 people each)
Independent deployment cadence needed
Different scaling needs per component
Technology diversity required (ML in Python, API in Go)

📊

Replication vs Sharding

Replicate when:
Read throughput is the bottleneck (read replicas)
Availability is critical (failover to replica)
Data fits on a single node
Geographic distribution for latency
Shard when:
Data too large for one node
Write throughput is the bottleneck
Need to isolate tenants (noisy neighbor)
Often combine: shard for writes + replicate each shard for reads

Decision Tree: Cache Invalidation Strategy

Decision Tree: Sharding Key Selection

The most important system design decision you make is rarely about technology; it is about the consistency model. Strong consistency (every read reflects the latest write) limits horizontal scaling because nodes must coordinate. Eventual consistency (reads may be stale) enables massive scale but requires your application to handle divergence. Before choosing a database, a cache strategy, or a replication model, ask: "Can my users tolerate reading slightly stale data?" If yes, you have design freedom. If no, every component on the read path must enforce consistency, and that constraint propagates through your entire architecture.

📋 Chapter 2 — Summary

SQL vs NoSQL: default PostgreSQL. Choose NoSQL by access pattern (KV, document, wide-column).
Sync vs Async: sync for user-facing fast ops. Async for background, fan-out, unreliable downstream.
Cache: add when read:write >10:1 and temporal locality exists.
Monolith → Microservices: start mono, split when team/domain boundaries are clear.

Chapter Three

Numbers Every Engineer Should Know

The Latency, Throughput and Storage Numbers That Drive Estimation

Back-of-envelope estimation is impossible without internalizing key numbers. These are not exact — they vary by hardware generation — but knowing the order of magnitude separates "this probably works" from "this definitely won't scale." Memorize the scale relationships, not the exact values.

Latency Numbers Every Programmer Should Know (2024)

📊

Throughput Numbers

Redis: 100K-1M ops/sec (single node)
PostgreSQL: 10K-50K queries/sec (depends on query)
Kafka: 1M messages/sec per broker
Nginx: 100K+ concurrent connections
Single server: ~1000 req/sec for typical web app

💾

Storage Numbers

1 char: 1 byte (ASCII) / 1-4 bytes (UTF-8)
1 tweet: ~1 KB (text + metadata)
1 photo: ~200 KB (compressed JPEG)
1 minute video: ~50 MB (720p compressed)
1 TB: ≈ 5M photos (at 200 KB each) or ~330 hours of video (at 50 MB/min)
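
As a sketch of how these sizes feed an estimate, the snippet below derives per-terabyte capacity from the per-item sizes above and projects one year of uploads. Decimal units, the upload volume, and the replication factor of 3 are assumptions for illustration.

```python
KB = 1_000
MB = 1_000 * KB
TB = 1_000_000 * MB

PHOTO_BYTES = 200 * KB          # ~200 KB compressed JPEG (from the list above)
VIDEO_BYTES_PER_MIN = 50 * MB   # ~50 MB per minute of 720p video

photos_per_tb = TB // PHOTO_BYTES
video_hours_per_tb = TB / VIDEO_BYTES_PER_MIN / 60

def yearly_storage_tb(uploads_per_day, bytes_per_upload, replication_factor=3):
    """Raw storage for one year of uploads, including replicas (hypothetical factor)."""
    return uploads_per_day * bytes_per_upload * 365 * replication_factor / TB

print(f"{photos_per_tb:,} photos per TB")
print(f"{video_hours_per_tb:.0f} hours of video per TB")
print(f"{yearly_storage_tb(1_000_000, PHOTO_BYTES):.0f} TB per year at 1M photos/day")
```

At these sizes one terabyte holds about 5 million photos or about 333 hours of video, and a service taking a million photo uploads a day would need on the order of 200 TB of raw storage per year once replicas are counted.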

⏱️

Time Conversions

1 day: 86,400 seconds (~10⁵)
1 month: 2.6M seconds (~2.5 × 10⁶)
1 year: 31.5M seconds (~3 × 10⁷)
QPS from daily: daily_count / 86,400
Peak = 2-5x average (use 3x for estimates)
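
The conversions above can be sketched as two small helpers. The traffic figures in the demo are hypothetical, and the default peak factor is the 3x rule of thumb from the list.

```python
SECONDS_PER_DAY = 86_400

def average_qps(daily_requests):
    """Average queries per second from a daily request count."""
    return daily_requests / SECONDS_PER_DAY

def peak_qps(daily_requests, peak_factor=3.0):
    """Peak load estimate: average QPS scaled by a rule-of-thumb factor."""
    return average_qps(daily_requests) * peak_factor

daily = 1_000_000 * 10   # hypothetical: 1M users, 10 requests each per day
print(f"average: {average_qps(daily):.0f} QPS")
print(f"peak:    {peak_qps(daily):.0f} QPS")
```

For mental math, dividing by 100,000 instead of 86,400 gives an answer about 14% low, which is well within the precision a back-of-envelope estimate needs.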

System Throughput — Orders of Magnitude

The key mental model: memory → SSD → HDD, each step roughly 100x slower, with network round trips layered on top. L1 cache (1ns) → RAM (100ns) → SSD (16μs) → same-DC network round trip (0.5ms) → HDD seek (4ms) → cross-globe round trip (150ms). If you remember nothing else: "memory is nanoseconds, SSD is microseconds, spinning disk is milliseconds, cross-continent is hundreds of milliseconds." This tells you quickly whether a design can meet its latency requirements.
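
A quick way to apply these figures is to add up the sequential steps on a request path. The sketch below uses the order-of-magnitude values quoted above. The request shape (two same-DC hops and three SSD reads) is a hypothetical example.

```python
# Order-of-magnitude latencies in seconds, taken from the paragraph above.
LATENCY = {
    "l1_cache": 1e-9,
    "ram": 100e-9,
    "ssd_read": 16e-6,
    "same_dc_round_trip": 0.5e-3,
    "hdd_seek": 4e-3,
    "cross_globe_round_trip": 150e-3,
}

def path_latency_ms(steps):
    """Sum sequential steps given as (name, count) pairs; result in milliseconds."""
    return sum(LATENCY[name] * count for name, count in steps) * 1_000

# Hypothetical request: two same-DC hops plus three SSD reads.
in_region = path_latency_ms([("same_dc_round_trip", 2), ("ssd_read", 3)])
# The same request when the caller sits on another continent.
cross_region = in_region + path_latency_ms([("cross_globe_round_trip", 1)])

print(f"in-region:    {in_region:.2f} ms")
print(f"cross-region: {cross_region:.2f} ms")
```

The in-region path comes to about 1 ms while the cross-continent one is about 151 ms, so a single long-haul round trip dominates everything else on the path.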

📋 Chapter 3 — Summary

RAM is 100x faster than SSD. SSD is 100x faster than HDD. Network adds 0.5-150ms.
Redis: 100K+ ops/sec. PostgreSQL: 10-50K QPS. Kafka: 1M msg/sec.
1 day ≈ 100K seconds. Daily requests / 100K ≈ average QPS. Multiply by 3 for peak.

Chapter Four

Interview Guide

How to Ace the System Design Interview

The system design interview is not a test of how many systems you've memorized — it is an evaluation of how you think under ambiguity, communicate trade-offs, and drive toward a reasonable solution within constraints. Interviewers are evaluating your process, not just your answer.

45-Minute Interview — Time Allocation

The 4-Step System Design Framework

🚫

Common Mistakes

Jumping to solution without clarifying requirements
Designing for infinite scale when 1K users suffices
Naming technologies without explaining WHY
Ignoring failure modes and edge cases
Going silent — not explaining your thinking
Never discussing trade-offs — "this is perfect"

✅

What Interviewers Evaluate

Problem definition: Do you ask the right questions?
Scale awareness: Do you know what breaks at 10x/100x?
Trade-off reasoning: Can you explain WHY, not just WHAT?
Communication: Is your explanation clear and structured?
Depth on demand: Can you dive deep when asked?
Pragmatism: Do you solve the actual problem, not a fantasy?

💡

Pro Tips

Start with API design (inputs/outputs) — grounds the conversation
Draw the diagram early — visual discussion is 3x faster
Say "I'd choose X over Y because Z" — explicit trade-off
Use numbers: "1M users × 10 req/day = ~120 QPS"
Identify the hardest sub-problem and solve it well
End with: "If I had more time, I'd address X, Y, Z"

Weak Answer vs Strong Answer

Weak: "I'd use Kafka for the message queue."
Strong: "I'd use Kafka because we need ordered delivery per user partition, at-least-once guarantees, and replay capability. The trade-off over SQS is operational complexity."

Weak: "We need a database, probably MySQL."
Strong: "MySQL for user sessions (ACID, simple schema) and Cassandra for the event log (write-heavy, time-series, no joins needed)."

Weak: "Add a cache to make it fast."
Strong: "Cache the top 20% of URLs in Redis with LRU; this handles the 100:1 read:write ratio without hitting the DB for 95% of traffic."

Weak: "We'll shard the database."
Strong: "Shard by user_id hash using consistent hashing with 150 virtual nodes; adding a node moves only about 1/N of the keys."

Weak: "This handles 1M users."
Strong: "1M DAU × 50 events/day = 50M events/day, which is ~580 events/sec on average and ~1,700 at peak (3×). Kafka at 1M msg/sec handles this with more than 99% headroom."

The #1 differentiator between mediocre and excellent candidates: trade-off articulation. A mediocre answer says "use Kafka." An excellent answer says "I'd use Kafka here because we need ordered delivery within a partition, at-least-once guarantees, and the ability to replay events if the consumer fails. The trade-off is operational complexity — if the team is small, an SQS queue would be simpler with acceptable ordering relaxation." Name the alternative, explain why you didn't choose it.

📋 Chapter 4 — Summary

Spend 55%+ of time on design, not requirements. Clarify fast, design deep.
Interviewers evaluate process and trade-off reasoning, not memorized architectures.
Always name alternatives: "X over Y because Z" — explicit trade-off.
Use numbers to justify decisions. "100K QPS → need caching" is better than "add cache."

Chapter Five

Recommended Resources

Books, Papers, and Channels Worth Your Time

System design expertise comes from reading how real systems work at real scale. These resources represent the highest-signal-to-noise materials available — each one has directly influenced how production systems are built today.

📚

Essential Books

Designing Data-Intensive Applications (Kleppmann) — the bible of distributed systems fundamentals
System Design Interview Vol 1 & 2 (Alex Xu) — structured case studies with diagrams
Building Microservices (Sam Newman) — practical service decomposition
Site Reliability Engineering (Google) — operations at scale
Web Scalability for Startup Engineers (Ejsmont) — pragmatic scaling patterns

📄

Landmark Papers

Google MapReduce (2004) — batch processing at scale
Amazon Dynamo (2007) — eventual consistency, consistent hashing
Google Bigtable (2006) — wide-column storage
Google Spanner (2012) — globally distributed SQL
Raft Consensus (2014) — understandable consensus
Kafka (LinkedIn, 2011) — distributed commit log
Facebook TAO (2013) — social graph caching at scale

🌐

Engineering Blogs

Netflix Tech Blog — microservices, chaos engineering, streaming
Uber Engineering — geo systems, real-time, scale
Cloudflare Blog — networking, CDN, edge computing
Stripe Engineering — payments, idempotency, API design
Meta Engineering — social graph, caching (TAO, Memcache)
AWS Architecture Blog — cloud patterns, well-architected
Discord Engineering — real-time, Elixir, ScyllaDB migration

🎥

Talks & Lectures

Martin Kleppmann (Cambridge) — distributed systems lectures (YouTube, free)
MIT 6.824 — distributed systems course with labs (YouTube)
StrangeLoop — annual conference, many architecture talks
QCon — engineering talks from practitioners at major companies
AWS re:Invent — architecture patterns, scale stories from AWS customers
"Turning the database inside-out" (Kleppmann) — event sourcing talk

Learning Path — From Zero to System Design Expert

Phase 1: Foundation (1-2 months)

Read DDIA chapters 1-9 (core distributed concepts)
Practice 3 beginner case studies: URL Shortener, Rate Limiter, Notification System
Focus on: clarifying requirements, back-of-envelope estimation, identifying dominant constraint
Memorize key numbers (latency, throughput, storage)

Phase 2: Depth (2-4 months)

Read 3-5 landmark papers (Dynamo, Spanner, Raft)
Practice 5+ intermediate/advanced case studies: Chat, Video Streaming, Social Feed, Payment, Distributed Cache
Add: timed practice with a peer (45 min per session)
Build a toy distributed system (consensus, KV store)

Phase 3: Mastery (ongoing)

Read one engineering blog post per week (pick from list above)
Do a design review for every new system you encounter at work
Practice teaching: explain a system to a junior engineer
Contribute to architecture discussions (propose, critique, document)

📋 Chapter 5 — Summary

Start with DDIA — it covers 80% of what you need for system design.
Papers for depth: Dynamo, Spanner, Raft, Kafka explain real production decisions.
Blogs for freshness: Netflix, Uber, Cloudflare, Stripe publish real architecture decisions.
Practice with peers: timed mock interviews are 10x more effective than solo reading.
