System Design
How software systems are built to scale, survive, and evolve.
What Is System Design
A system that handles a thousand users flawlessly can fall over at a hundred thousand — not because the code is bad, but because nobody designed for what comes after launch. That gap between working software and software that works at scale is precisely what system design addresses. It is the discipline of making deliberate decisions about how components connect, communicate, scale, and fail — before and while you are building them.
Software Architecture
Strategic structure and principles. Long-term system organization, patterns, and evolution. Academic and organizational in scope.
Asks: "What should we build and why?"
System Design
Practical scaling decisions. How components connect, communicate, and survive failure at real load. Where the rubber meets the road.
Asks: "How do we build it at scale?"
Software Engineering
Implementation craft. Algorithms, data structures, testing, tooling. The code that makes it run.
Asks: "What code makes it work?"
The insight: Any system that grew beyond one server had to be designed. The only question is whether that design was deliberate or accidental. Accidental designs accumulate invisible constraints until they collapse under their own weight — usually at the worst possible moment for the business.
- System design fills the gap between working code and software that works at scale.
- Software Architecture — strategic structure: the WHAT and WHY.
- System Design — practical scaling decisions: the HOW and AT WHAT SCALE.
- Software Engineering — implementation craft: the WITH WHAT CODE.
- Every system is designed. The only question is whether that design was deliberate.
Why System Design Matters
Poor system design is not an academic problem. It is a business problem that shows up at the worst possible time — during a product launch, on the morning of peak traffic, when you have just signed your largest customer. The failure mode is almost always the same: a system that was never built to handle what it is now being asked to handle.
Twitter Fail Whale — 2008–2012
A monolithic architecture buckled every time a celebrity sent a tweet. Cascading failures brought down the entire platform. The iconic “fail whale” became a cultural symbol. Rebuilding took years and a complete re-architecture.
Lesson: A single monolith cannot absorb viral, uneven load. Fan-out at celebrity scale requires explicit design.
Knight Capital Group — 2012
$440 million lost in 45 minutes. A flawed deployment left decommissioned high-frequency trading code running alongside new code. No circuit breaker, no kill switch, no operational rollback design.
Lesson: System design includes operational design. Deployments, rollbacks, and runaway process containment are not afterthoughts.
WhatsApp — 2014 Acquisition (Success)
50 engineers. 600 million users. $19 billion acquisition. Built on Erlang with minimal infrastructure, designed from day one around extreme reliability and doing exactly one thing exceptionally well.
Lesson: Good system design multiplies team output. The right architecture lets a small team operate at enormous, sustained scale.
Healthcare.gov — 2013 Launch
$2 billion spent. Crashed on day one under real user load. Hundreds of contractors, no unified architecture, no integration testing, no load testing. Design by committee with no coherent technical vision.
Lesson: Budget does not substitute for design. Fragmented ownership produces fragmented, incompatible systems.
The 3am principle: The cost of poor system design is not paid on day one. It is paid at 3am when your system falls over during the most important moment for your business — a launch, a traffic surge, a live demo. Good design is invisible: systems just work. Poor design is very visible, and very expensive.
- Poor design fails visibly — at peak load, during launches, at maximum audience.
- Twitter: monolith cannot absorb viral fan-out. WhatsApp: 50 engineers, 600M users — good design multiplies output.
- Knight Capital: operational design (deployments, rollbacks) is inseparable from system design.
- Cost of fixing poor design compounds over time. After a threshold, rewriting is cheaper than patching.
The Domains of System Design
Saying “I know system design” is like saying “I know medicine.” There are specialties, and there is breadth. Engineers do not master system design as a single monolithic skill — they develop depth in specific domains and breadth across all of them. What follows is a map of the nine domains that together constitute system design at scale. None of them are optional. All of them are covered in this section.
01 · Fundamentals
Requirements, trade-offs, capacity estimation. The thinking tools behind every design decision.
Without these, every other domain is guesswork.
Level: Beginner
02 · Building Blocks
Caches, databases, queues, load balancers. The universal components every scaled system assembles from.
Know these before designing anything.
Level: Beginner – Intermediate
03 · Scalability & Reliability
How systems handle more load without falling over. The difference between a prototype and a production system.
Level: Intermediate
04 · Communication & APIs
How components talk to each other. REST, gRPC, events, async patterns. Poor communication design creates invisible bottlenecks.
Level: Intermediate
05 · Data at Scale
When data outgrows one machine. Storage strategies, consistency models, stream processing. Where the hardest problems live.
Level: Intermediate
06 · Security & Observability
Protecting systems and understanding what is happening inside them. You cannot fix what you cannot see.
Level: Intermediate
07 · Distributed Systems
Clocks, consensus, partition tolerance. Where distributed systems theory meets engineering reality. The deep end.
Level: Advanced
08 · Architecture Styles
Monolith vs microservices vs serverless. Not a technology choice — a trade-off choice that shapes everything else.
Level: Intermediate
09 · Case Studies
Real systems designed from scratch. Where all domains come together. Knowledge becomes judgment.
Level: All levels
Breadth before depth. Know that all nine domains exist before going deep on any single one. An engineer who knows one domain thoroughly but is blind to the others will consistently solve the wrong problem with the wrong tool.
- System design is not one skill — it is a collection of nine related domains.
- Fundamentals and Building Blocks are the entry points: learn these before the others.
- Distributed Systems is the most theoretically demanding; Case Studies is where all domains converge.
- Engineers develop depth in specific domains and breadth across all of them. Both matter.
How Systems Evolve Over Time
No system starts complex. Complexity is forced — by growth, by usage patterns, by requirements that could not be anticipated on day one. This evolution arc is the most important mental model for understanding why every component in system design exists. Every pattern, every building block, every architecture style was invented because someone hit a wall with a simpler approach and needed a way through it.
The key insight: Nobody jumps from Stage 1 to Stage 7. The arc runs Single Server → Separate Database → Load Balancer → Read Replicas → Cache → Sharding → Microservices, and each stage solves the bottleneck of the previous stage while introducing new problems. Understanding this arc is understanding why every building block in system design was invented. Every component exists because someone hit a wall at a specific scale.
- No system starts complex — complexity is earned by growth, not chosen by preference.
- 7 stages: Single Server → Separate DB → Load Balancer → Read Replicas → Cache → Sharding → Microservices.
- Each stage solves the bottleneck of the stage before it and creates the bottleneck of the stage after it.
- Premature scaling adds real cost with no real benefit. Match architecture to actual, not imagined, scale.
How to Use This Section
This section is not a textbook to read cover to cover. It is a map. Some readers need to start at the beginning. Some need to fill specific gaps. Some are here to formalize instincts built over a decade of production experience. All three approaches are valid — but each has a different optimal entry point, and getting that wrong wastes time that could have been spent building.
Beginner Path
- Read this Overview completely — orientation before depth
- Work through Fundamentals chapter by chapter
- Read Building Blocks in order — each one builds on the last
- Then Scalability & Reliability
- Return to other domains as they become relevant to your actual work
Practitioner Path
- Skim Fundamentals Ch3–Ch5 (NFRs, Estimation, Trade-offs)
- Go deep on Building Blocks you use daily
- Read Scalability end to end
- Pick the domain most relevant to your current work
- Use Case Studies as practice — design before reading the solution
Senior Path
- Distributed Systems — clocks, consensus, replication theory
- Case Studies — compare your instincts against the guided designs
- Architecture Styles — validate your trade-off thinking
- Reference section for patterns index and decision trees
WHY Before HOW
Every chapter opens with the problem being solved, not the solution. Context first. Mechanism second.
Visual Before Text
Diagrams introduce concepts visually before prose explains them. Patterns are spatial — diagrams first is intentional.
Beginner Bridge Present
Every advanced concept has a beginner entry point. Advanced readers skip it; beginners have a path in.
Chapter Summaries
Each chapter ends with a tight summary. Review without re-reading. Useful for spaced repetition.
Architecture Cross-Links
Where system design meets its architectural underpinning, there is always a link to the Architecture Foundation.
Trade-offs Explicit
Every decision shows the trade-off, not a "best answer." There are no best answers — only documented trade-offs.
This section rewards curiosity more than linear reading. Follow what interests you. The Fundamentals chapter will always be here when you need to ground a concept. The deeper domains will always be here when a production problem sends you looking for answers.
- Beginner: Start at Overview, work through Fundamentals and Building Blocks in order.
- Practitioner: Fill gaps — NFRs, estimation, then the domain you work in most.
- Senior: Start at Distributed Systems and Case Studies; use everything else as validation.
- Every page: WHY before HOW. Visual before text. Explicit trade-offs, not imaginary best practices.
Key Vocabulary
This is not a glossary. A glossary gives you definitions to memorize. What follows is a mental model primer — the twelve terms that appear in almost every system design conversation. Grasping the intuition behind each term before its formal definition is what lets you read the rest of this section without constantly stopping to look things up.
Latency
Definition: Time between a request being sent and the response being received.
Example: 200ms to load a webpage. 1ms for a database query on the same server.
Key nuance: Measure p99, not the average. Averages hide the worst 1% of user experiences — the ones who churn.
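To make the nuance concrete, here is a minimal Python sketch, with invented latency values, showing how a single slow outlier barely moves the mean but defines the tail:

```python
# Why p99, not the average: one slow outlier vanishes in the mean
# but dominates the tail. Latency values are invented for illustration.
import math

latencies_ms = [12, 15, 14, 13, 16, 12, 14, 15, 13, 950]

def percentile(values, pct):
    """Nearest-rank percentile: smallest value at or below which pct% of samples fall."""
    ordered = sorted(values)
    rank = math.ceil(pct / 100 * len(ordered))
    return ordered[max(rank, 1) - 1]

mean = sum(latencies_ms) / len(latencies_ms)
print(f"mean = {mean:.1f} ms")                      # 107.4 ms: looks mediocre
print(f"p50  = {percentile(latencies_ms, 50)} ms")  # 14 ms: looks great
print(f"p99  = {percentile(latencies_ms, 99)} ms")  # 950 ms: the users who churn
```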
Throughput
Definition: Number of operations a system can handle per unit of time. Measured as QPS, RPS, or TPS.
Example: A payment service processing 10,000 transactions per second.
Key nuance: High throughput and low latency are often in tension. Batching increases throughput but adds latency to each item.
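The tension is easy to see with back-of-envelope arithmetic. A Python sketch, with invented per-request overhead and per-item costs:

```python
# Batching amortizes fixed per-request overhead across many items:
# throughput climbs, but every item now waits on the whole batch.
# The cost numbers are invented for illustration.
overhead_ms = 5.0   # fixed cost per request (round trip, parsing, commit)
per_item_ms = 0.1   # marginal cost per item inside the batch

for batch_size in (1, 10, 100, 1000):
    batch_ms = overhead_ms + per_item_ms * batch_size
    throughput = batch_size / (batch_ms / 1000)  # items per second
    print(f"batch={batch_size:4d}: {throughput:8.0f} items/s, "
          f"~{batch_ms:.1f} ms per item (plus time spent filling the batch)")
```

At batch size 1 this handles roughly 200 items per second at ~5 ms each; at batch size 1000 it handles roughly 9,500 items per second, but each item now waits over 100 ms.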
Availability
Definition: Percentage of time a system is operational and accessible. Expressed as the nines.
Example: 99.9% = 8.76 hours of downtime per year. 99.99% = 52.6 minutes per year.
Key nuance: Availability is a promise that must be designed for, not hoped for. Redundancy and automatic failover are the mechanisms.
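The downtime figures above fall straight out of the arithmetic. A small Python sketch:

```python
# Downtime budget implied by each availability target, over a 365-day year.
HOURS_PER_YEAR = 365 * 24  # 8,760

for label, availability in [("99%", 0.99), ("99.9%", 0.999),
                            ("99.99%", 0.9999), ("99.999%", 0.99999)]:
    downtime_h = HOURS_PER_YEAR * (1 - availability)
    if downtime_h >= 1:
        print(f"{label}: {downtime_h:.2f} hours of downtime per year")
    else:
        print(f"{label}: {downtime_h * 60:.1f} minutes of downtime per year")
```

Each extra nine cuts the budget by a factor of ten, which is why each one is dramatically more expensive to engineer than the last.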
Scalability
Definition: Ability of a system to handle increased load by adding resources rather than redesigning from scratch.
Example: Adding servers to handle Black Friday traffic, then removing them afterward.
Key nuance: Horizontal scaling (more machines) scales further but requires stateless design. Vertical scaling (bigger machine) has a ceiling.
Consistency
Definition: Every read receives the most recent write, or an error. No stale data served.
Example: A bank balance must be consistent — $500 on one device cannot show as $1,000 on another.
Key nuance: Strong consistency has a latency cost. In distributed systems, it conflicts with availability during partitions (CAP Theorem).
Partition Tolerance
Definition: System continues to operate despite network partitions between nodes.
Example: A datacenter network split — servers on both sides keep serving requests.
Key nuance: In distributed systems, partitions will happen. The real design question is how to behave when they do.
Fault Tolerance
Definition: Ability to continue operating correctly when components fail, without service interruption.
Example: Netflix continues streaming when one server crashes because requests route to healthy replicas automatically.
Key nuance: Fault tolerance is designed through redundancy and graceful degradation — not through hoping components do not fail.
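In code, the redundancy mechanism can be as simple as trying the next healthy replica. A minimal sketch, where replicas and fetch_from are hypothetical stand-ins for real service discovery and RPC machinery:

```python
# Failover by redundancy: route around failed replicas instead of
# surfacing their errors to the user.
import random

class AllReplicasDown(Exception):
    pass

def fetch_with_failover(replicas, fetch_from):
    # Shuffle so retries do not hammer replicas in a fixed order.
    for replica in random.sample(replicas, len(replicas)):
        try:
            return fetch_from(replica)
        except ConnectionError:
            continue  # unhealthy replica: try the next one
    raise AllReplicasDown("no healthy replica could serve the request")
```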
Idempotency
Definition: An operation that produces the same result whether applied once or multiple times.
Example: A payment retry must not charge the customer twice. DELETE on an already-deleted record must not error.
Key nuance: Critical for retry logic in distributed systems. Without idempotency, retries are dangerous. With it, retries are safe.
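A common implementation is an idempotency key supplied by the client. A minimal sketch, where charge_card is a hypothetical payment call and a plain dict stands in for a durable store with atomic inserts:

```python
# Idempotency via a client-supplied key: retries with the same key replay
# the stored result instead of charging again.
_results: dict[str, str] = {}

def charge_once(idempotency_key: str, amount_cents: int, charge_card) -> str:
    if idempotency_key in _results:       # retry: replay the first outcome
        return _results[idempotency_key]
    receipt = charge_card(amount_cents)   # first attempt: actually charge
    _results[idempotency_key] = receipt
    return receipt
```

The client generates the key once per logical operation and reuses it on every retry; the server then guarantees at-most-once execution per key.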
Replication
Definition: Maintaining copies of data across multiple nodes for reliability, availability, or read performance.
Example: Database read replicas — one primary handles writes, three replicas distribute read traffic.
Key nuance: Replication introduces consistency challenges. How stale is acceptable? When does a replica become the source of truth?
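The standard way to exploit read replicas is read/write splitting. A minimal sketch, where primary and the replica handles are hypothetical database connections:

```python
# Read/write splitting: writes go to the primary, reads round-robin across
# replicas. Deliberately naive: it ignores replication lag and
# read-your-own-writes requirements, which real routing must handle.
import itertools

class RoutingPool:
    def __init__(self, primary, replicas):
        self._primary = primary
        self._replicas = itertools.cycle(replicas)

    def execute(self, sql: str):
        is_read = sql.lstrip().upper().startswith("SELECT")
        conn = next(self._replicas) if is_read else self._primary
        return conn.execute(sql)
```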
Sharding
Definition: Horizontal partitioning of data across multiple database instances. Each shard owns a subset of the data.
Example: Users A–M on shard 1, users N–Z on shard 2. Each shard is an independent database.
Key nuance: Sharding makes certain queries very hard — joining across shards is expensive. Shard key selection is the most consequential decision.
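In practice the shard is usually chosen by a stable hash of the shard key. A minimal Python sketch, using hashlib rather than the built-in hash(), which is randomized between processes:

```python
# Hash sharding: a stable hash of the shard key picks the shard, so every
# service instance routes the same user to the same database.
import hashlib

def shard_for(user_id: str, num_shards: int) -> int:
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards

print(shard_for("alice", 4))  # same user, same shard, on every machine
```

Note the trade-off hiding in the modulo: changing num_shards remaps almost every key, which is why resharding is painful and why consistent hashing exists.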
Caching
Definition: Temporary storage of copies of data to reduce access latency and database load.
Example: Redis storing user session data — 1ms retrieval instead of a 20ms database query on every request.
Key nuance: Cache invalidation is one of the two genuinely hard problems in computer science. Stale cache means wrong data served at scale.
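The most common pattern is cache-aside with a TTL, which bounds how stale a value can get. A minimal sketch, where a dict stands in for Redis and load_user is a hypothetical database query:

```python
# Cache-aside with a TTL: read through the cache, fall back to the
# database on a miss, and let entries expire so staleness is bounded.
import time

TTL_SECONDS = 60
_cache: dict[str, tuple[float, object]] = {}  # key -> (expires_at, value)

def get_user(user_id: str, load_user):
    entry = _cache.get(user_id)
    if entry and entry[0] > time.monotonic():    # hit, still fresh
        return entry[1]
    value = load_user(user_id)                   # miss or expired: hit the DB
    _cache[user_id] = (time.monotonic() + TTL_SECONDS, value)
    return value
```

The TTL is the staleness bound: a 60-second TTL means a reader can see data up to 60 seconds old, which is the explicit trade-off against database load.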
SLA / SLO / SLI
SLA (Service Level Agreement): A contract with external consequences. Breach it and there are financial or legal penalties.
SLO (Service Level Objective): An internal target, stricter than the SLA. The goal you aim for to stay safely within SLA bounds.
SLI (Service Level Indicator): The actual measurement. What you are observing in production right now.
Example: SLA = 99.9% uptime (contractual). SLO = 99.95% (internal target). SLI = 99.97% (measured today).
Key nuance: SLOs must be stricter than SLAs. Always leave yourself a margin. Driving the SLI close to the SLA leaves no runway for incidents.
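The margin is easiest to see as an error budget. A small Python sketch using the numbers from the example above:

```python
# Error-budget arithmetic for SLA 99.9% and SLO 99.95% over a 30-day month.
# The gap between the two is the runway before the contractual penalty fires.
MINUTES_PER_MONTH = 30 * 24 * 60  # 43,200

sla, slo = 0.999, 0.9995
sla_budget = MINUTES_PER_MONTH * (1 - sla)  # minutes the contract tolerates
slo_budget = MINUTES_PER_MONTH * (1 - slo)  # minutes the internal target allows

print(f"SLA tolerates {sla_budget:.1f} min/month of downtime")     # 43.2
print(f"SLO allows    {slo_budget:.1f} min/month")                 # 21.6
print(f"margin        {sla_budget - slo_budget:.1f} min/month of runway")
```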
You do not need to memorize these. You need to understand their trade-offs. Every system design decision is a negotiation between at least two of these concepts. The engineer who knows why latency and throughput tension against each other makes better decisions than one who memorized definitions.
- Latency (how long) and Throughput (how much) are often in tension — batching trades one for the other.
- Availability is a promise requiring design; Fault Tolerance is its implementation mechanism.
- Consistency conflicts with Availability during network partitions — this is the CAP Theorem.
- Sharding and Caching are the primary tools for scaling data beyond one machine.
- SLOs must be stricter than SLAs — always leave yourself a margin before the contractual penalty fires.
01 · Fundamentals
Requirements, trade-offs, capacity estimation. The thinking tools behind every design decision.
Beginner
02 · Building Blocks
Caches, databases, queues, load balancers. Universal components every scaled system uses.
Beginner – Intermediate
03 · Scalability & Reliability
How systems handle more load without falling over. Where prototypes become production systems.
Intermediate
04 · Communication & APIs
How components talk to each other and to the outside world. REST, gRPC, events, async patterns.
Intermediate
05 · Data at Scale
Storage, modeling, and stream processing when data outgrows one machine. The hardest problems.
Intermediate
06 · Security & Observability
Protecting systems and understanding what is happening inside them. You cannot fix what you cannot see.
Intermediate
07 · Distributed Systems
Clocks, consensus, partition tolerance. Where distributed systems theory meets engineering reality.
Advanced
08 · Architecture Styles
Monoliths, microservices, serverless. Not a technology choice — a trade-off that shapes everything.
Intermediate
09 · Case Studies
Real systems designed from scratch. Where all domains converge into judgment and practice.
All levels
Reference
Patterns index, decision trees, key numbers, and the system design interview guide.
All levels
Deliberate Structural Decisions
- Bridges working code and software that works at scale
- Not about code — about structure, boundaries, and trade-offs
- Distinct from architecture (strategic) and engineering (implementation)
- Every system is designed. Deliberately or accidentally.
The Cost Is Paid at 3am
- Poor design fails visibly — at peak load, during launches
- Good design is invisible: systems just work
- WhatsApp: 50 engineers, 600M users — good design multiplies output
- Cost of fixing poor design compounds exponentially over time
Nine Areas of Expertise
- 9 distinct domains from Fundamentals to Distributed Systems
- Breadth before depth — know all domains before deep-diving any
- Each domain builds on the ones before it
- Case Studies is where all domains converge into judgment
Complexity Is Earned, Not Chosen
- No system starts complex — growth forces it
- 7 stages: Single Server to Microservices
- Each stage solves the previous bottleneck and creates the next
- Premature scaling is as dangerous as under-scaling
Three Learning Paths
- Beginner: Overview → Fundamentals → Building Blocks → Scalability
- Practitioner: fill gaps, focus on current domain, use Case Studies
- Senior: Distributed Systems → Case Studies → Architecture Styles
- WHY always comes before HOW on every page
12 Terms, Not Definitions
- Latency (how long) and Throughput (how much) are often in tension
- Availability and Fault Tolerance: designed, never hoped for
- Consistency conflicts with Availability during partitions (CAP)
- Every design decision negotiates between at least two of these