System Design

How software systems are built to scale, survive, and evolve.

01
Chapter One

What Is System Design

From "It Works" to "It Works at Scale"

A system that handles a thousand users flawlessly can fall over at a hundred thousand — not because the code is bad, but because nobody designed for what comes after launch. That gap between working software and software that works at scale is precisely what system design addresses. It is the discipline of making deliberate decisions about how components connect, communicate, scale, and fail — before and while you are building them.

🏛️

Software Architecture

Strategic structure and principles. Long-term system organization, patterns, and evolution. Academic and organizational in scope.

Asks: "What should we build and why?"

⚙️

System Design

Practical scaling decisions. How components connect, communicate, and survive failure at real load. Where the rubber meets the road.

Asks: "How do we build it at scale?"

💻

Software Engineering

Implementation craft. Algorithms, data structures, testing, tooling. The code that makes it run.

Asks: "What code makes it work?"

The insight: Any system that grew beyond one server had to be designed. The only question is whether that design was deliberate or accidental. Accidental designs accumulate invisible constraints until they collapse under their own weight — usually at the worst possible moment for the business.

Where System Design Sits
The Design Spectrum — Three Disciplines, One Continuum
Example decisions at each level of the spectrum:
  • Software Engineering — code & implementation: algorithm choice, data structures, testing approach, tooling
  • System Design — scaling decisions: database selection, caching layer, API design, scale strategy
  • Software Architecture — strategy & organization: team structure, build vs buy, long-term evolution
Go Deeper on Software Architecture — The right side of this spectrum is covered in depth in the Software Architecture Foundation section: principles, design patterns, structural styles, quality attributes, and the architect's role. System Design is its applied, at-scale companion.
📋 Chapter 1 — Summary
  • System design fills the gap between working code and software that works at scale.
  • Software Architecture — strategic structure: the WHAT and WHY.
  • System Design — practical scaling decisions: the HOW and AT WHAT SCALE.
  • Software Engineering — implementation craft: the WITH WHAT CODE.
  • Every system is designed. The only question is whether that design was deliberate.
02
Chapter Two

Why System Design Matters

The Consequences Are Real

Poor system design is not an academic problem. It is a business problem that shows up at the worst possible time — during a product launch, on the morning of peak traffic, when you have just signed your largest customer. The failure mode is almost always the same: a system that was never built to handle what it is now being asked to handle.

🐳

Twitter Fail Whale — 2008–2012

A monolithic architecture buckled every time a celebrity sent a tweet. Cascading failures brought down the entire platform. The iconic “fail whale” became a cultural symbol. Rebuilding took years and a complete re-architecture.

Lesson: A single monolith cannot absorb viral, uneven load. Fan-out at celebrity scale requires explicit design.

📉

Knight Capital Group — 2012

$440 million lost in 45 minutes. A flawed deployment left decommissioned high-frequency trading code running alongside new code. No circuit breaker, no kill switch, no operational rollback design.

Lesson: System design includes operational design. Deployments, rollbacks, and runaway process containment are not afterthoughts.

💬

WhatsApp — 2014 Acquisition (Success)

50 engineers. 600 million users. $19 billion acquisition. Built on Erlang with minimal infrastructure, designed from day one around extreme reliability and doing exactly one thing exceptionally well.

Lesson: Good system design multiplies team output. The right architecture lets a small team operate at enormous, sustained scale.

🏥

Healthcare.gov — 2013 Launch

$2 billion spent. Crashed on day one under real user load. Hundreds of contractors, no unified architecture, no integration testing, no load testing. Design by committee with no coherent technical vision.

Lesson: Budget does not substitute for design. Fragmented ownership produces fragmented, incompatible systems.

The Compounding Cost of Poor Design
Cost to Fix Over Time — Good Design vs Poor Design
The chart plots cost-to-fix from Day 1 through Year 5. Good design (invest upfront) stays nearly flat; poor design (pay later, with interest) climbs past the break-even point around Year 1, through the first scaling crisis and an emergency rewrite, until the system is too expensive to fix at all.

The 3am principle: The cost of poor system design is not paid on day one. It is paid at 3am when your system falls over during the most important moment for your business — a launch, a traffic surge, a live demo. Good design is invisible: systems just work. Poor design is very visible, and very expensive.

📋 Chapter 2 — Summary
  • Poor design fails visibly — at peak load, during launches, at maximum audience.
  • Twitter: monolith cannot absorb viral fan-out. WhatsApp: 50 engineers, 600M users — good design multiplies output.
  • Knight Capital: operational design (deployments, rollbacks) is inseparable from system design.
  • Cost of fixing poor design compounds over time. After a threshold, rewriting is cheaper than patching.
03
Chapter Three

The Domains of System Design

Nine Areas, One Discipline

Saying “I know system design” is like saying “I know medicine.” There are specialties, and there is breadth. Engineers do not master system design as a single monolithic skill — they develop depth in specific domains and breadth across all of them. What follows is a map of the nine domains that together constitute system design at scale. None of them are optional. All of them are covered in this section.

System Design Domain Map
The map arranges the nine domains around System Design at the center: Fundamentals and Building Blocks (foundational), then Scalability, Communication & APIs, Data at Scale, Security & Observability, and Architecture Styles (intermediate), Distributed Systems (advanced), and Case Studies (applied practice).
All Nine Domains
🧭

01 · Fundamentals

Requirements, trade-offs, capacity estimation. The thinking tools behind every design decision.

Without these, every other domain is guesswork.

Level: Beginner

🔩

02 · Building Blocks

Caches, databases, queues, load balancers. The universal components every scaled system assembles from.

Know these before designing anything.

Level: Beginner – Intermediate

📈

03 · Scalability & Reliability

How systems handle more load without falling over. The difference between a prototype and a production system.

Level: Intermediate

🔗

04 · Communication & APIs

How components talk to each other. REST, gRPC, events, async patterns. Poor communication design creates invisible bottlenecks.

Level: Intermediate

🗄️

05 · Data at Scale

When data outgrows one machine. Storage strategies, consistency models, stream processing. Where the hardest problems live.

Level: Intermediate

🔒

06 · Security & Observability

Protecting systems and understanding what is happening inside them. You cannot fix what you cannot see.

Level: Intermediate

🌐

07 · Distributed Systems

Clocks, consensus, partition tolerance. Where distributed systems theory meets engineering reality. The deep end.

Level: Advanced

🏗️

08 · Architecture Styles

Monolith vs microservices vs serverless. Not a technology choice — a trade-off choice that shapes everything else.

Level: Intermediate

📋

09 · Case Studies

Real systems designed from scratch. Where all domains come together. Knowledge becomes judgment.

Level: All levels

Breadth before depth. Know that all nine domains exist before going deep on any single one. An engineer who knows one domain thoroughly but is blind to the others will consistently solve the wrong problem with the wrong tool.

📋 Chapter 3 — Summary
  • System design is not one skill — it is a collection of nine related domains.
  • Fundamentals and Building Blocks are the entry points: learn these before the others.
  • Distributed Systems is the most theoretically demanding; Case Studies is where all domains converge.
  • Engineers develop depth in specific domains and breadth across all of them. Both matter.
04
Chapter Four

How Systems Evolve Over Time

Complexity Is Earned, Not Chosen

No system starts complex. Complexity is forced — by growth, by usage patterns, by requirements that could not be anticipated on day one. This evolution arc is the most important mental model for understanding why every component in system design exists. Every pattern, every building block, every architecture style was invented because someone hit a wall with a simpler approach and needed a way through it.

System Evolution — From Single Server to Distributed Architecture
The complexity curve runs from simple to complex across seven stages:
  1. Single Server (Day 1) — app and database on one machine; single point of failure
  2. Separate Database (~1K users) — sessions still tied to one app server
  3. Load Balancer (~10K users) — the database becomes the bottleneck
  4. Read Replicas (~100K users) — reads scale out; writes become the bottleneck
  5. Cache Layer (~500K users) — cache invalidation becomes the hard problem
  6. Database Sharding (~1M+ users) — cross-shard queries become hard
  7. Microservices (team scale) — full-scale distribution; distributed complexity

The key insight: Nobody jumps from Stage 1 to Stage 7. Each stage solves the bottleneck of the previous stage and introduces new problems. Understanding this arc is understanding why every building block in system design was invented. Every component exists because someone hit a wall at a specific scale.

Warning: Premature scaling is as dangerous as under-scaling. Moving to Stage 5 when you have Stage 2 traffic adds real complexity with no real benefit. Match architecture to actual scale, not aspirational scale.
📋 Chapter 4 — Summary
  • No system starts complex — complexity is earned by growth, not chosen by preference.
  • 7 stages: Single Server → Separate DB → Load Balancer → Read Replicas → Cache → Sharding → Microservices.
  • Each stage solves the bottleneck of the stage before it and creates the bottleneck of the stage after it.
  • Premature scaling adds real cost with no real benefit. Match architecture to actual, not imagined, scale.
05
Chapter Five

How to Use This Section

Three Paths Through One Map

This section is not a textbook to read cover to cover. It is a map. Some readers need to start at the beginning. Some need to fill specific gaps. Some are here to formalize instincts built over a decade of production experience. All three approaches are valid — but each has a different optimal entry point, and getting that wrong wastes time that could have been spent building.

Learning Paths by Experience Level
  • Path 1 — Beginner: Overview → Fundamentals → Building Blocks → Scalability → each domain as it becomes relevant (3–4 weeks of foundation)
  • Path 2 — Practitioner (3–7 years): NFRs + Estimation → Building Blocks gaps → Scalability → domain of current relevance → Case Studies (reference as needed)
  • Path 3 — Senior Engineer (7+ years): Distributed Systems → Case Studies → Architecture Styles → reference patterns and decision trees (targeted sessions)
Path 01 — The Beginner
You write code. You have not thought much about scale yet.
  1. Read this Overview completely — orientation before depth
  2. Work through Fundamentals chapter by chapter
  3. Read Building Blocks in order — each one builds on the last
  4. Then Scalability & Reliability
  5. Return to other domains as they become relevant to your actual work
⏱ 3–4 weeks for a solid, durable foundation
Path 02 — The Practitioner
3–7 years experience. You have built systems but have not always known why certain decisions were made.
  1. Skim Fundamentals Ch3–Ch5 (NFRs, Estimation, Trade-offs)
  2. Go deep on Building Blocks you use daily
  3. Read Scalability end to end
  4. Pick the domain most relevant to your current work
  5. Use Case Studies as practice — design before reading the solution
⏱ Use as a reference — deep-dive where you have gaps
Path 03 — The Senior Engineer
7+ years. Strong instincts. Ready to formalize your thinking or prepare to mentor others.
  1. Distributed Systems — clocks, consensus, replication theory
  2. Case Studies — compare your instincts against the guided designs
  3. Architecture Styles — validate your trade-off thinking
  4. Reference section for patterns index and decision trees
⏱ Focused sessions on specific topics — not linear reading
What to Expect on Every Domain Page
🎯

WHY Before HOW

Every chapter opens with the problem being solved, not the solution. Context first. Mechanism second.

📐

Visual Before Text

Diagrams introduce concepts visually before prose explains them. Patterns are spatial — diagrams first is intentional.

🌉

Beginner Bridge Present

Every advanced concept has a beginner entry point. Advanced readers skip it; beginners have a path in.

📋

Chapter Summaries

Each chapter ends with a tight summary. Review without re-reading. Useful for spaced repetition.

🔗

Architecture Cross-Links

Where system design meets its architectural underpinning, there is always a link to the Architecture Foundation.

⚖️

Trade-offs Explicit

Every decision shows the trade-off, not a "best answer." There are no best answers — only documented trade-offs.

This section rewards curiosity more than linear reading. Follow what interests you. The Fundamentals chapter will always be here when you need to ground a concept. The deeper domains will always be here when a production problem sends you looking for answers.

📋 Chapter 5 — Summary
  • Beginner: Start at Overview, work through Fundamentals and Building Blocks in order.
  • Practitioner: Fill gaps — NFRs, estimation, then the domain you work in most.
  • Senior: Start at Distributed Systems and Case Studies; use everything else as validation.
  • Every page: WHY before HOW. Visual before text. Explicit trade-offs, not imaginary best practices.
06
Chapter Six

Key Vocabulary

12 Terms That Appear in Every Discussion

This is not a glossary. A glossary gives you definitions to memorize. What follows is a mental model primer — the twelve terms that appear in almost every system design conversation. Understanding their intuition before their formal definition is what lets you read the rest of this section without constantly stopping to look things up.

Latency
How long does one thing take?

Definition: Time between a request being sent and the response being received.

Example: 200ms to load a webpage. 1ms for a database query on the same server.

Key nuance: Measure p99, not the average. Averages hide the worst 1% of user experiences — the ones who churn.
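To make the p99-versus-average point concrete, here is a minimal sketch; the latency values are invented purely for illustration:

```python
# Why p99 rather than the mean: a handful of slow requests disappears in
# the average but dominates the tail. Latency values are illustrative.
latencies_ms = [20] * 99 + [2000]  # 99 fast requests, one very slow one

mean = sum(latencies_ms) / len(latencies_ms)
p99 = sorted(latencies_ms)[int(0.99 * len(latencies_ms))]

print(f"mean = {mean:.1f}ms")  # ~40ms: looks healthy
print(f"p99  = {p99}ms")       # 2000ms: what 1 in 100 users actually sees
```

A dashboard showing only the mean would report this system as fine.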

Throughput
How much can flow through a pipe per second?

Definition: Number of operations a system can handle per unit of time. Measured as QPS, RPS, or TPS.

Example: A payment service processing 10,000 transactions per second.

Key nuance: High throughput and low latency are often in tension. Batching increases throughput but adds latency to each item.
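The batching tension can be shown with a back-of-the-envelope model; the 10ms round-trip and 1ms per-item costs are assumed numbers, not benchmarks:

```python
# Batching trade-off: fixed per-request overhead amortizes across a batch.
ROUND_TRIP_MS = 10  # assumed fixed cost per network round trip
PER_ITEM_MS = 1     # assumed processing cost per item

def time_to_send(items: int, batch_size: int) -> float:
    """Total time to send `items` items in batches of `batch_size`."""
    batches = -(-items // batch_size)  # ceiling division
    return batches * ROUND_TRIP_MS + items * PER_ITEM_MS

print(time_to_send(1000, 1))    # 11000ms: 1000 round trips, one item each
print(time_to_send(1000, 100))  # 1100ms: 10 round trips, ~10x the throughput
# But an item in a batch of 100 waits for 99 others before it ships:
# its individual latency went up even as total throughput went up.
```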

Availability
Is the system up when someone tries to use it?

Definition: Percentage of time a system is operational and accessible. Expressed as the nines.

Example: 99.9% = 8.76 hours of downtime per year. 99.99% = 52 minutes per year.

Key nuance: Availability is a promise that must be designed for, not hoped for. Redundancy and automatic failover are the mechanisms.
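The nines translate directly into a downtime budget; the arithmetic behind the figures quoted above is just:

```python
# Downtime allowed per year at each availability target.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600

for nines in (99.0, 99.9, 99.99, 99.999):
    downtime_min = MINUTES_PER_YEAR * (1 - nines / 100)
    print(f"{nines}% -> {downtime_min:,.1f} min/year (~{downtime_min / 60:.2f} h)")
# 99.9%  -> ~525.6 min/year (~8.76 h)
# 99.99% -> ~52.6 min/year
```

Each extra nine cuts the budget by 10x, which is why each one is dramatically harder to engineer than the last.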

Scalability
Can it handle 10x the load without a rewrite?

Definition: Ability of a system to handle increased load by adding resources rather than redesigning from scratch.

Example: Adding servers to handle Black Friday traffic, then removing them afterward.

Key nuance: Horizontal scaling (more machines) scales further but requires stateless design. Vertical scaling (bigger machine) has a ceiling.

Consistency
Do all users see the same data at the same time?

Definition: Every read receives the most recent write, or an error. No stale data served.

Example: A bank balance must be consistent — $500 on one device cannot show as $1,000 on another.

Key nuance: Strong consistency has a latency cost. In distributed systems, it conflicts with availability during partitions (CAP Theorem).

Partition Tolerance
What happens when two parts of your system cannot talk to each other?

Definition: System continues to operate despite network partitions between nodes.

Example: A datacenter network split — servers on both sides keep serving requests.

Key nuance: In distributed systems, partitions will happen. The real design question is how to behave when they do.

Fault Tolerance
Can the system survive individual component failures?

Definition: Ability to continue operating correctly when components fail, without service interruption.

Example: Netflix continues streaming when one server crashes because requests route to healthy replicas automatically.

Key nuance: Fault tolerance is designed through redundancy and graceful degradation — not through hoping components do not fail.
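Redundancy plus automatic failover can be sketched in miniature; the replica names and the simulated failure are made up for illustration:

```python
# Try replicas in order until one answers — failover as a simple loop.
REPLICAS = ["replica-a", "replica-b", "replica-c"]
DOWN = {"replica-a"}  # pretend this node just crashed

def fetch(replica: str) -> str:
    if replica in DOWN:
        raise ConnectionError(f"{replica} unreachable")
    return f"response from {replica}"

def fetch_with_failover() -> str:
    for replica in REPLICAS:
        try:
            return fetch(replica)
        except ConnectionError:
            continue  # degrade gracefully: move on to the next healthy copy
    raise RuntimeError("all replicas down")

print(fetch_with_failover())  # "response from replica-b" — caller never sees the crash
```

Real systems put this logic in load balancers and client libraries, driven by health checks rather than a hardcoded set.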

Idempotency
Is it safe to do the same thing twice?

Definition: An operation that produces the same result whether applied once or multiple times.

Example: A payment retry must not charge the customer twice. DELETE on an already-deleted record must not error.

Key nuance: Critical for retry logic in distributed systems. Without idempotency, retries are dangerous. With it, retries are safe.

Replication
Keeping copies of data in multiple places.

Definition: Maintaining copies of data across multiple nodes for reliability, availability, or read performance.

Example: Database read replicas — one primary handles writes, three replicas distribute read traffic.

Key nuance: Replication introduces consistency challenges. How stale is acceptable? When does a replica become the source of truth?

Sharding
Splitting data across multiple machines.

Definition: Horizontal partitioning of data across multiple database instances. Each shard owns a subset of the data.

Example: Users A–M on shard 1, users N–Z on shard 2. Each shard is an independent database.

Key nuance: Sharding makes certain queries very hard — joining across shards is expensive. Shard key selection is the most consequential decision.
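Hash-based routing on the shard key can be sketched as follows; the four-shard layout is a toy, and real systems typically use consistent hashing or a lookup service so shards can be added without mass rehashing:

```python
# The shard key (user_id) deterministically picks one database.
import hashlib

NUM_SHARDS = 4

def shard_for(user_id: str) -> int:
    digest = hashlib.md5(user_id.encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS

# Same user always routes to the same shard:
assert shard_for("alice") == shard_for("alice")

# But "all users who signed up today" now has to fan out to every shard —
# this is the cross-shard query cost described above.
fan_out = [f"query shard {s}" for s in range(NUM_SHARDS)]
```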

Caching
Storing frequently used data closer to where it is needed.

Definition: Temporary storage of copies of data to reduce access latency and database load.

Example: Redis storing user session data — 1ms retrieval instead of a 20ms database query on every request.

Key nuance: Cache invalidation is one of the two genuinely hard problems in computer science. Stale cache means wrong data served at scale.
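The cache-aside pattern behind the Redis example looks roughly like this sketch, where a dict stands in for Redis and `db_query` is a hypothetical stand-in for the slow database call:

```python
# Cache-aside: check the cache, fall back to the database, populate on miss.
import time

cache: dict[str, tuple[float, str]] = {}  # key -> (stored_at, value)
TTL_SECONDS = 60

def db_query(user_id: str) -> str:
    return f"profile-of-{user_id}"  # pretend this costs ~20ms

def get_profile(user_id: str) -> str:
    entry = cache.get(user_id)
    if entry is not None and time.time() - entry[0] < TTL_SECONDS:
        return entry[1]                    # hit: the ~1ms path
    value = db_query(user_id)              # miss: pay the database cost
    cache[user_id] = (time.time(), value)
    return value

get_profile("u42")  # miss -> database
get_profile("u42")  # hit -> cache
# The hard part is not shown: if the profile changes, this cached copy is
# stale until the TTL expires unless you invalidate it explicitly.
```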

SLA · SLO · SLI
Promises about how well a system will perform.

SLA (Service Level Agreement): A contract with external consequences. Breach it and there are financial or legal penalties.

SLO (Service Level Objective): An internal target, stricter than the SLA. The goal you aim for to stay safely within SLA bounds.

SLI (Service Level Indicator): The actual measurement. What you are observing in production right now.

Example: SLA = 99.9% uptime (contractual). SLO = 99.95% (internal target). SLI = 99.97% (measured today).

Key nuance: SLOs must be stricter than SLAs. Always leave yourself a margin. Driving the SLI close to the SLA leaves no runway for incidents.
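That margin is usually managed as an error budget; using the example figures above (SLO 99.95%), the monthly budget works out to:

```python
# Error budget: the gap between perfection and the SLO is downtime you are
# allowed to "spend" on incidents and deliberate risk (deploys, migrations).
MINUTES_PER_MONTH = 30 * 24 * 60  # 43,200

slo = 99.95
budget_min = MINUTES_PER_MONTH * (1 - slo / 100)
print(f"SLO {slo}% leaves an error budget of {budget_min:.1f} min/month")
# ~21.6 minutes of tolerated downtime per 30-day month; once it is spent,
# a common policy is to freeze risky changes until the window resets.
```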

Vocabulary Clusters — How These Concepts Relate
  • Performance: Latency, Throughput — often in tension with each other
  • Reliability: Availability, Fault Tolerance, Replication
  • Scale: Scalability, Sharding, Caching
  • Consistency: Consistency, Partition Tolerance, Idempotency
  • Measurement: SLA · SLO · SLI — measures all of the above

You do not need to memorize these. You need to understand their trade-offs. Every system design decision is a negotiation between at least two of these concepts. The engineer who knows why latency and throughput tension against each other makes better decisions than one who memorized definitions.

📋 Chapter 6 — Summary
  • Latency (how long) and Throughput (how much) are often in tension — batching trades one for the other.
  • Availability is a promise requiring design; Fault Tolerance is its implementation mechanism.
  • Consistency conflicts with Availability during network partitions — this is the CAP Theorem.
  • Sharding and Caching are the primary tools for scaling data beyond one machine.
  • SLOs must be stricter than SLAs — always leave yourself a margin before the contractual penalty fires.
Explore the Domains
System Design at a Glance
01 · What Is System Design

Deliberate Structural Decisions

  • Bridges working code and software that works at scale
  • Not about code — about structure, boundaries, and trade-offs
  • Distinct from architecture (strategic) and engineering (implementation)
  • Every system is designed. Deliberately or accidentally.
02 · Why It Matters

The Cost Is Paid at 3am

  • Poor design fails visibly — at peak load, during launches
  • Good design is invisible: systems just work
  • WhatsApp: 50 engineers, 600M users — good design multiplies output
  • Cost of fixing poor design compounds exponentially over time
03 · The Domains

Nine Areas of Expertise

  • 9 distinct domains from Fundamentals to Distributed Systems
  • Breadth before depth — know all domains before deep-diving any
  • Each domain builds on the ones before it
  • Case Studies is where all domains converge into judgment
04 · How Systems Evolve

Complexity Is Earned, Not Chosen

  • No system starts complex — growth forces it
  • 7 stages: Single Server to Microservices
  • Each stage solves the previous bottleneck and creates the next
  • Premature scaling is as dangerous as under-scaling
05 · How to Use This Section

Three Learning Paths

  • Beginner: Overview → Fundamentals → Building Blocks → Scalability
  • Practitioner: fill gaps, focus on current domain, use Case Studies
  • Senior: Distributed Systems → Case Studies → Architecture Styles
  • WHY always comes before HOW on every page
06 · Key Vocabulary

12 Terms, Not Definitions

  • Latency (how long) and Throughput (how much) are often in tension
  • Availability and Fault Tolerance: designed, never hoped for
  • Consistency conflicts with Availability during partitions (CAP)
  • Every design decision negotiates between at least two of these