Reference & Resources
Patterns index, decision trees, key numbers, and interview guide.
System Design Patterns Index
Every system design problem is solved by combining a small set of recurring patterns. This index organizes them by concern (scalability, reliability, data, and communication) with one-line descriptions and links to deep coverage. Use this as a lookup table during design and review.
Scalability Patterns
- Horizontal Scaling: add machines, not CPU
- Load Balancing: distribute requests across servers
- Database Sharding: partition data by key range or hash
- Read Replicas: separate read and write paths
- CQRS: different models for read vs write
- Auto-Scaling: adjust capacity based on load
- CDN Caching: serve static content from edge
- Scatter-Gather: fan out query to N nodes, aggregate results
- Fan-Out on Write: push updates to followers at write time
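As an illustration of Scatter-Gather from the list above, here is a minimal sketch using Python's standard thread pool. The shard functions, the sample rows, and the `merge_sorted` helper are hypothetical stand-ins for real nodes; a production version would also need per-shard timeouts and a policy for partial results.

```python
from concurrent.futures import ThreadPoolExecutor


def scatter_gather(query, shards, merge, max_workers=8):
    """Run `query` against every shard in parallel and merge the results."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves shard order; total latency tracks the slowest shard.
        partials = list(pool.map(lambda shard: shard(query), shards))
    return merge(partials)


def make_shard(rows):
    """Hypothetical shard: searches only its own slice of the data."""
    return lambda term: [row for row in rows if term in row]


def merge_sorted(partials):
    """Flatten the per-shard hits into one sorted result list."""
    return sorted(hit for part in partials for hit in part)


shards = [
    make_shard(["red apple", "green pear"]),
    make_shard(["apple pie", "plum tart"]),
    make_shard(["banana bread"]),
]

results = scatter_gather("apple", shards, merge_sorted)
print(results)  # ['apple pie', 'red apple']
```

Because the caller waits for every shard, one slow node delays the whole query, which is why this pattern is usually paired with timeouts and graceful degradation.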
Reliability Patterns
- Replication: multiple copies across nodes/regions
- Circuit Breaker: fail fast when downstream is degraded
- Bulkhead: isolate failures to subsystems
- Retry + Backoff: handle transient failures gracefully
- Health Checks: detect failures before users do
- Graceful Degradation: serve partial data over no data
- Idempotency: make retries safe
- Dead Letter Queue: capture failed messages for inspection and replay
- Saga Pattern: manage distributed transactions without 2PC
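Retry + Backoff from the list above can be sketched as follows. This is one common variant (exponential backoff with full jitter), not the only one. `TransientError` and `make_flaky` are hypothetical names for illustration, and the sketch assumes the retried operation is idempotent, since otherwise a retry could apply the same change twice.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for failures that are worth retrying."""


def retry_with_backoff(op, max_attempts=5, base_delay=0.1, max_delay=5.0,
                       sleep=time.sleep, rng=random.random):
    """Call op() until it succeeds or max_attempts is exhausted.

    Between attempts, waits a random time in
    [0, min(max_delay, base_delay * 2**attempt)] so that many clients
    retrying at once do not hit the downstream service in lockstep.
    """
    for attempt in range(max_attempts):
        try:
            return op()
        except TransientError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the failure to the caller
            cap = min(max_delay, base_delay * (2 ** attempt))
            sleep(rng() * cap)


def make_flaky(failures):
    """Build an operation that fails `failures` times, then returns "ok"."""
    state = {"calls": 0}

    def op():
        state["calls"] += 1
        if state["calls"] <= failures:
            raise TransientError("temporary failure")
        return "ok"

    return op, state
```

Calling `retry_with_backoff(make_flaky(2)[0])` succeeds on the third attempt. The `sleep` and `rng` parameters exist only so the timing can be replaced in tests.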
Data Patterns
- Cache-Aside: app reads cache; on a miss, reads the DB, then fills the cache
- Write-Through: write to cache + DB simultaneously
- Event Sourcing: store events, derive state
- Change Data Capture: stream DB changes to consumers
- Consistent Hashing: distribute data with minimal reshuffling
- Bloom Filter: probabilistic set membership test
- Double-Entry Ledger: debit + credit = 0 invariant
- Write-Behind Cache: write to cache, async flush to DB
- Inverted Index: term → document_ids for full-text search
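To make "minimal reshuffling" concrete, here is a small sketch of a consistent hash ring with virtual nodes. The class name `HashRing`, the node names, and the choice of MD5 as a placement hash are illustrative assumptions, not a reference implementation.

```python
import bisect
import hashlib


def _hash(value: str) -> int:
    """Map a string to a stable 64-bit position on the ring."""
    return int.from_bytes(hashlib.md5(value.encode()).digest()[:8], "big")


class HashRing:
    def __init__(self, nodes=(), vnodes=150):
        self.vnodes = vnodes
        self._points = []   # sorted ring positions
        self._owner = {}    # ring position -> node name
        for node in nodes:
            self.add_node(node)

    def add_node(self, node: str) -> None:
        # Each node owns `vnodes` points so load spreads evenly.
        for i in range(self.vnodes):
            point = _hash(f"{node}#{i}")
            self._owner[point] = node
            bisect.insort(self._points, point)

    def get_node(self, key: str) -> str:
        """Return the owner of the first point clockwise from the key."""
        idx = bisect.bisect_right(self._points, _hash(key))
        return self._owner[self._points[idx % len(self._points)]]


ring = HashRing(["node-a", "node-b", "node-c"])
keys = [f"user:{i}" for i in range(10_000)]
before = {k: ring.get_node(k) for k in keys}

ring.add_node("node-d")
after = {k: ring.get_node(k) for k in keys}
moved = [k for k in keys if before[k] != after[k]]

# Roughly a quarter of the keys move, and all of them move to the new node.
print(len(moved) / len(keys))
```

With plain `hash(key) % N` placement, going from 3 to 4 nodes would remap most keys. On the ring, only the keys that now fall just before one of the new node's points change owner.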
Communication Patterns
- Pub/Sub: decouple producers from consumers
- Request-Reply: synchronous call-and-wait
- Event-Driven: react to events asynchronously
- API Gateway: single entry point, route to services
- Service Mesh: infrastructure-level traffic management
- WebSocket: bidirectional persistent connection
- gRPC: binary, typed, streaming RPC
- Long Polling: HTTP hold until data available (chat fallback)
- Fan-Out: one message to many consumers simultaneously
Key Takeaways
- 36 core patterns organized by concern: scalability, reliability, data, communication.
- Identify your system's dominant characteristic (read-heavy, write-heavy, real-time, durable).
- Select patterns that address the primary constraint first, then layer secondary patterns.
Decision Trees
System design decisions are not arbitrary: they follow logical paths based on your requirements. These decision trees encode the reasoning experienced engineers use intuitively. Start at the top, answer each question honestly, and follow the path to a well-reasoned choice.
Sync vs Async Communication
- Use Sync (REST/gRPC) when:
- Caller needs immediate response (user-facing API)
- Operation is fast (<500ms), low-failure-rate
- Strong consistency required between caller + callee
- Use Async (Queue/Event) when:
- Caller doesn't need immediate result (background job)
- Downstream is unreliable or slow
- Fan-out to multiple consumers
- Need retry/replay capability
When to Add a Cache
- Add cache when:
- Read:write ratio > 10:1
- Same data requested repeatedly (temporal locality)
- Computation to produce data is expensive
- Source latency unacceptable for user experience
- Don't cache when:
- Data changes frequently (invalidation cost > benefit)
- Every request is unique (no locality)
- Stale data is unacceptable (strong consistency required)
- Data is cheap to recompute
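When the checklist above points toward caching, the simplest starting point is usually the cache-aside pattern with a TTL. The sketch below is an in-process illustration. `CacheAside`, the `load` callback, and the 60-second TTL are assumptions for the example, and a real deployment would more likely use a shared store such as Redis.

```python
import time


class CacheAside:
    """Cache-aside reads with a TTL; `load` reads the slow source of truth."""

    def __init__(self, load, ttl_seconds=60.0, clock=time.monotonic):
        self._load = load
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}  # key -> (value, expires_at)
        self.hits = 0
        self.misses = 0

    def get(self, key):
        entry = self._entries.get(key)
        if entry is not None and entry[1] > self._clock():
            self.hits += 1
            return entry[0]
        # Miss or expired entry: read the source, then fill the cache.
        self.misses += 1
        value = self._load(key)
        self._entries[key] = (value, self._clock() + self._ttl)
        return value

    def invalidate(self, key):
        """Call after a write so the next read reloads from the source."""
        self._entries.pop(key, None)


db = {"user:1": "Ada"}  # hypothetical stand-in for a database
cache = CacheAside(load=lambda key: db[key], ttl_seconds=60)
cache.get("user:1")  # miss: loads from db and fills the cache
cache.get("user:1")  # hit: served from memory
print(cache.hits, cache.misses)  # 1 1
```

The TTL is the staleness bound: until an entry expires or is invalidated, readers can see an old value. That is why the checklist rules caching out when stale data is unacceptable.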
Monolith vs Microservices
- Start with monolith when:
- Team < 10 engineers
- Domain boundaries unclear (still learning)
- Speed of iteration > operational complexity
- Shared database is fine for consistency
- Move to microservices when:
- Teams own independent domains (5-9 people each)
- Independent deployment cadence needed
- Different scaling needs per component
- Technology diversity required (ML in Python, API in Go)
Replication vs Sharding
- Replicate when:
- Read throughput is the bottleneck (read replicas)
- Availability is critical (failover to replica)
- Data fits on a single node
- Geographic distribution for latency
- Shard when:
- Data too large for one node
- Write throughput is the bottleneck
- Need to isolate tenants (noisy neighbor)
- Often combine: shard for writes + replicate each shard for reads
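The combined approach in the last bullet can be sketched as a small router: the key's hash picks the shard, writes go to that shard's primary, and reads rotate across its replicas. The class names and node names below are hypothetical, and the sketch ignores failover and replication lag.

```python
import hashlib
import itertools


class Shard:
    def __init__(self, primary, replicas):
        self.primary = primary
        # Round-robin over replicas; fall back to the primary if none exist.
        self._readers = itertools.cycle(replicas or [primary])

    def next_reader(self):
        return next(self._readers)


class ShardedCluster:
    def __init__(self, shards):
        self.shards = shards

    def _shard_for(self, key: str) -> Shard:
        # Stable hash so the same key always maps to the same shard.
        digest = hashlib.sha256(key.encode()).digest()
        index = int.from_bytes(digest[:8], "big") % len(self.shards)
        return self.shards[index]

    def route_write(self, key: str) -> str:
        return self._shard_for(key).primary

    def route_read(self, key: str) -> str:
        return self._shard_for(key).next_reader()


cluster = ShardedCluster([
    Shard("s0-primary", ["s0-replica-1", "s0-replica-2"]),
    Shard("s1-primary", ["s1-replica-1", "s1-replica-2"]),
])
print(cluster.route_write("user:42"))
print(cluster.route_read("user:42"))
```

Two caveats apply to this sketch: replica reads can lag behind the primary, and modulo placement remaps most keys when the shard count changes, which is the problem consistent hashing addresses.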
The most important system design decision you make is rarely about technology; it is about the consistency model. Strong consistency (every read reflects the latest write) limits horizontal scaling because nodes must coordinate. Eventual consistency (reads may be stale) enables massive scale but requires your application to handle divergence. Before choosing a database, a cache strategy, or a replication model, ask: "Can my users tolerate reading slightly stale data?" If yes, you have design freedom. If no, every component on the read path must enforce consistency, and that constraint propagates through your entire architecture.
Key Takeaways
- SQL vs NoSQL: default PostgreSQL. Choose NoSQL by access pattern (KV, document, wide-column).
- Sync vs Async: sync for user-facing fast ops. Async for background, fan-out, unreliable downstream.
- Cache: add when read:write >10:1 and temporal locality exists.
- Monolith → Microservices: start mono, split when team/domain boundaries are clear.
Numbers Every Engineer Should Know
Back-of-envelope estimation is impossible without internalizing key numbers. These are not exact (they vary by hardware generation), but knowing the order of magnitude separates "this probably works" from "this definitely won't scale." Memorize the scale relationships, not the exact values.
Throughput Numbers
- Redis: 100K-1M ops/sec (single node)
- PostgreSQL: 10K-50K queries/sec (depends on query)
- Kafka: 1M messages/sec per broker
- Nginx: 100K+ concurrent connections
- Single server: ~1000 req/sec for typical web app
Storage Numbers
- 1 char: 1 byte (ASCII) / 1-4 bytes (UTF-8)
- 1 tweet: ~1 KB (text + metadata)
- 1 photo: ~200 KB (compressed JPEG)
- 1 minute video: ~50 MB (720p compressed)
- 1 TB: ≈ 5M photos (at 200 KB each) or ~330 hours of video (at 50 MB/min)
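The storage figures above combine by simple division. A short sketch of the arithmetic, using decimal units and the per-item sizes from the list; the upload volume of 1M photos per day is a hypothetical workload.

```python
KB = 1_000
MB = 1_000 * KB
TB = 1_000_000 * MB

PHOTO_BYTES = 200 * KB         # compressed JPEG, from the list above
VIDEO_BYTES_PER_MIN = 50 * MB  # 720p compressed, from the list above

photos_per_tb = TB // PHOTO_BYTES
video_hours_per_tb = TB / VIDEO_BYTES_PER_MIN / 60


def yearly_storage_tb(uploads_per_day: int, bytes_per_upload: int) -> float:
    """Raw storage added per year, before replication or compression."""
    return uploads_per_day * bytes_per_upload * 365 / TB


print(photos_per_tb)              # 5000000
print(round(video_hours_per_tb))  # 333
print(yearly_storage_tb(1_000_000, PHOTO_BYTES))  # 73.0
```

At these sizes, 1 TB holds about 5 million photos or about 333 hours of 720p video. Replication typically multiplies the raw figure, for example by 3 with three copies.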
Time Conversions
- 1 day: 86,400 seconds (~10⁵)
- 1 month: 2.6M seconds (~2.5 × 10⁶)
- 1 year: 31.5M seconds (~3 × 10⁷)
- QPS from daily: daily_count / 86,400
- Peak = 2-5x average (use 3x for estimates)
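The two rules above (divide the daily count by 86,400, then multiply by a peak factor) fit in a few lines. The workload of 1M users making 10 requests per day is a hypothetical example.

```python
SECONDS_PER_DAY = 86_400


def average_qps(daily_count: float) -> float:
    """Average queries per second for a given daily request count."""
    return daily_count / SECONDS_PER_DAY


def peak_qps(daily_count: float, peak_factor: float = 3.0) -> float:
    """Estimated peak load; 3x is the rule-of-thumb factor from the list."""
    return average_qps(daily_count) * peak_factor


daily_requests = 1_000_000 * 10  # hypothetical: 1M users x 10 requests/day
print(round(average_qps(daily_requests)))  # 116
print(round(peak_qps(daily_requests)))     # 347
```

Rounding 86,400 up to 100,000 gives 100 QPS for the same workload, which is close enough for a whiteboard estimate.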
The key mental model: each step from memory to SSD to spinning disk is roughly 100x slower, and the network adds anywhere from half a millisecond to hundreds of milliseconds. L1 cache (1ns) → RAM (100ns) → SSD (16μs) → network same-DC (0.5ms) → HDD (4ms) → cross-globe (150ms). If you remember nothing else: "memory is nanoseconds, SSD is microseconds, HDD and same-datacenter network are around a millisecond, cross-continent is hundreds of milliseconds." This tells you quickly whether a design can meet its latency requirements.
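One way to apply these latency figures is to add up the steps on a request path and compare the total to a budget. The sketch below uses the numbers from the paragraph above; the request paths and the 100 ms budget are hypothetical.

```python
# Rough order-of-magnitude latencies in seconds, from the paragraph above.
LATENCY = {
    "l1_cache": 1e-9,
    "ram": 100e-9,
    "ssd": 16e-6,
    "same_dc_network": 0.5e-3,
    "hdd": 4e-3,
    "cross_globe_network": 150e-3,
}


def path_latency(steps):
    """Sum latencies for a list of (operation, count) pairs."""
    return sum(LATENCY[op] * count for op, count in steps)


def fits_budget(steps, budget_seconds):
    """True if the sequential path stays within the latency budget."""
    return path_latency(steps) <= budget_seconds


# Hypothetical request: two same-DC hops plus one SSD read.
local_path = [("same_dc_network", 2), ("ssd", 1)]
# Same request with one cross-globe hop added.
global_path = local_path + [("cross_globe_network", 1)]

print(fits_budget(local_path, 0.100))   # True  (about 1 ms)
print(fits_budget(global_path, 0.100))  # False (about 151 ms)
```

The sketch assumes the steps run sequentially. Parallel calls would be bounded by the slowest branch rather than the sum.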
Key Takeaways
- RAM is roughly 100x faster than SSD. SSD is roughly 100x faster than HDD. Network adds 0.5-150ms.
- Redis: 100K+ ops/sec. PostgreSQL: 10-50K QPS. Kafka: 1M msg/sec.
- 1 day ≈ 100K seconds. Daily requests / 100K ≈ average QPS. Multiply by 3 for peak.
Interview Guide
The system design interview is not a test of how many systems you've memorized; it is an evaluation of how you think under ambiguity, communicate trade-offs, and drive toward a reasonable solution within constraints. Interviewers are evaluating your process, not just your answer.
Common Mistakes
- Jumping to solution without clarifying requirements
- Designing for infinite scale when 1K users suffices
- Naming technologies without explaining WHY
- Ignoring failure modes and edge cases
- Going silent: not explaining your thinking
- Never discussing trade-offs: "this is perfect"
What Interviewers Evaluate
- Problem definition: Do you ask the right questions?
- Scale awareness: Do you know what breaks at 10x/100x?
- Trade-off reasoning: Can you explain WHY, not just WHAT?
- Communication: Is your explanation clear and structured?
- Depth on demand: Can you dive deep when asked?
- Pragmatism: Do you solve the actual problem, not a fantasy?
Pro Tips
- Start with API design (inputs/outputs): it grounds the conversation
- Draw the diagram early: visual discussion is 3x faster
- Say "I'd choose X over Y because Z": an explicit trade-off
- Use numbers: "1M users × 10 req/day = ~120 QPS"
- Identify the hardest sub-problem and solve it well
- End with: "If I had more time, I'd address X, Y, Z"
Weak Answers
- "I'd use Kafka for the message queue."
- "We need a database, probably MySQL."
- "Add a cache to make it fast."
- "We'll shard the database."
- "This handles 1M users."
Strong Answers
- "I'd use Kafka because we need ordered delivery within each user's partition, at-least-once guarantees, and replay capability. The trade-off over SQS is operational complexity."
- "MySQL for user sessions (ACID, simple schema) and Cassandra for the event log (write-heavy, time-series, no joins needed)."
- "Cache the top 20% of URLs in Redis with LRU: this handles the 100:1 read:write ratio without hitting the DB for 95% of traffic."
- "Shard by user_id hash using consistent hashing with 150 virtual nodes: adding a node moves only about 1/N of the keys."
- "At 1M DAU × 50 events/day = ~580 events/sec avg, ~1,700 peak (3×). Kafka at 1M msg/sec handles this with more than 99% headroom."
The #1 differentiator between mediocre and excellent candidates is trade-off articulation. A mediocre answer says "use Kafka." An excellent answer says "I'd use Kafka here because we need ordered delivery within a partition, at-least-once guarantees, and the ability to replay events if the consumer fails. The trade-off is operational complexity: if the team is small, an SQS queue would be simpler with acceptable ordering relaxation." Name the alternative and explain why you didn't choose it.
Key Takeaways
- Spend 55%+ of time on design, not requirements. Clarify fast, design deep.
- Interviewers evaluate process and trade-off reasoning, not memorized architectures.
- Always name alternatives: "X over Y because Z" makes the trade-off explicit.
- Use numbers to justify decisions. "100K QPS → need caching" is better than "add cache."
Recommended Resources
System design expertise comes from reading how real systems work at real scale. These resources are among the highest signal-to-noise materials available, and each has influenced how production systems are built today.
Essential Books
- Designing Data-Intensive Applications (Kleppmann): the bible of distributed systems fundamentals
- System Design Interview Vol 1 & 2 (Alex Xu): structured case studies with diagrams
- Building Microservices (Sam Newman): practical service decomposition
- Site Reliability Engineering (Google): operations at scale
- Web Scalability for Startup Engineers (Ejsmont): pragmatic scaling patterns
Landmark Papers
- Google MapReduce (2004): batch processing at scale
- Amazon Dynamo (2007): eventual consistency, consistent hashing
- Google Bigtable (2006): wide-column storage
- Google Spanner (2012): globally distributed SQL
- Raft Consensus (2014): understandable consensus
- Kafka (LinkedIn, 2011): distributed commit log
- Facebook TAO (2013): social graph caching at scale
Engineering Blogs
- Netflix Tech Blog: microservices, chaos engineering, streaming
- Uber Engineering: geo systems, real-time, scale
- Cloudflare Blog: networking, CDN, edge computing
- Stripe Engineering: payments, idempotency, API design
- Meta Engineering: social graph, caching (TAO, Memcache)
- AWS Architecture Blog: cloud patterns, well-architected
- Discord Engineering: real-time, Elixir, ScyllaDB migration
Talks & Lectures
- Martin Kleppmann (Cambridge): distributed systems lectures (YouTube, free)
- MIT 6.824: distributed systems course with labs (YouTube)
- StrangeLoop: annual conference, many architecture talks
- QCon: engineering talks from practitioners at major companies
- AWS re:Invent: architecture patterns, scale stories from AWS customers
- "Turning the database inside-out" (Kleppmann): event sourcing talk
Phase 1: Foundation (1-2 months)
- Read DDIA chapters 1-9 (core distributed concepts)
- Practice 3 beginner case studies: URL Shortener, Rate Limiter, Notification System
- Focus on: clarifying requirements, back-of-envelope estimation, identifying dominant constraint
- Memorize key numbers (latency, throughput, storage)
Phase 2: Depth (2-4 months)
- Read 3-5 landmark papers (Dynamo, Spanner, Raft)
- Practice 5+ intermediate/advanced case studies: Chat, Video Streaming, Social Feed, Payment, Distributed Cache
- Add: timed practice with a peer (45 min per session)
- Build a toy distributed system (consensus, KV store)
Phase 3: Mastery (ongoing)
- Read one engineering blog post per week (pick from list above)
- Do a design review for every new system you encounter at work
- Practice teaching: explain a system to a junior engineer
- Contribute to architecture discussions (propose, critique, document)
Key Takeaways
- Start with DDIA: it covers 80% of what you need for system design.
- Papers for depth: Dynamo, Spanner, Raft, Kafka explain real production decisions.
- Blogs for freshness: Netflix, Uber, Cloudflare, Stripe publish real architecture decisions.
- Practice with peers: timed mock interviews are 10x more effective than solo reading.