System Design
How software systems are built to scale, survive, and evolve.
What Is System Design
A system that handles a thousand users flawlessly can fall over at a hundred thousand — not because the code is bad, but because nobody designed for what comes after launch. That gap between working software and software that works at scale is precisely what system design addresses. It is the discipline of making deliberate decisions about how components connect, communicate, scale, and fail — before and while you are building them.
Software Architecture
Strategic structure and principles. Long-term system organization, patterns, and evolution. Academic and organizational in scope.
Asks: "What should we build and why?"
System Design
Practical scaling decisions. How components connect, communicate, and survive failure at real load. Where the rubber meets the road.
Asks: "How do we build it at scale?"
Software Engineering
Implementation craft. Algorithms, data structures, testing, tooling. The code that makes it run.
Asks: "What code makes it work?"
The insight: Any system that grew beyond one server had to be designed. The only question is whether that design was deliberate or accidental. Accidental designs accumulate invisible constraints until they collapse under their own weight — usually at the worst possible moment for the business.
- System design fills the gap between working code and software that works at scale.
- Software Architecture — strategic structure: the WHAT and WHY.
- System Design — practical scaling decisions: the HOW and AT WHAT SCALE.
- Software Engineering — implementation craft: the WITH WHAT CODE.
- Every system is designed. The only question is whether that design was deliberate.
Why System Design Matters
Poor system design is not an academic problem. It is a business problem that shows up at the worst possible time — during a product launch, on the morning of peak traffic, when you have just signed your largest customer. The failure mode is almost always the same: a system that was never built to handle what it is now being asked to handle.
Twitter Fail Whale — 2008–2012
A monolithic architecture buckled every time a celebrity sent a tweet. Cascading failures brought down the entire platform. The iconic “fail whale” became a cultural symbol. Rebuilding took years and a complete re-architecture.
Lesson: A single monolith cannot absorb viral, uneven load. Fan-out at celebrity scale requires explicit design.
Knight Capital Group — 2012
$440 million lost in 45 minutes. A flawed deployment left decommissioned high-frequency trading code running alongside new code. No circuit breaker, no kill switch, no operational rollback design.
Lesson: System design includes operational design. Deployments, rollbacks, and runaway process containment are not afterthoughts.
WhatsApp — 2014 Acquisition (Success)
50 engineers. 600 million users. $19 billion acquisition. Built on Erlang with minimal infrastructure, designed from day one around extreme reliability and doing exactly one thing exceptionally well.
Lesson: Good system design multiplies team output. The right architecture lets a small team operate at enormous, sustained scale.
Healthcare.gov — 2013 Launch
$2 billion spent. Crashed on day one under real user load. Hundreds of contractors, no unified architecture, no integration testing, no load testing. Design by committee with no coherent technical vision.
Lesson: Budget does not substitute for design. Fragmented ownership produces fragmented, incompatible systems.
The 3am principle: The cost of poor system design is not paid on day one. It is paid at 3am when your system falls over during the most important moment for your business — a launch, a traffic surge, a live demo. Good design is invisible: systems just work. Poor design is very visible, and very expensive.
- Poor design fails visibly — at peak load, during launches, at maximum audience.
- Twitter: monolith cannot absorb viral fan-out. WhatsApp: 50 engineers, 600M users — good design multiplies output.
- Knight Capital: operational design (deployments, rollbacks) is inseparable from system design.
- Cost of fixing poor design compounds over time. After a threshold, rewriting is cheaper than patching.
The Domains of System Design
Saying “I know system design” is like saying “I know medicine.” There are specialties, and there is breadth. Engineers do not master system design as a single monolithic skill — they develop depth in specific domains and breadth across all of them. What follows is a map of the nine domains that together constitute system design at scale. None of them are optional. All of them are covered in this section.
01 · Fundamentals
Requirements, trade-offs, capacity estimation. The thinking tools behind every design decision.
Without these, every other domain is guesswork.
Level: Beginner
02 · Building Blocks
Caches, databases, queues, load balancers. The universal components every scaled system assembles from.
Know these before designing anything.
Level: Beginner – Intermediate
03 · Scalability & Reliability
How systems handle more load without falling over. The difference between a prototype and a production system.
Level: Intermediate
04 · Communication & APIs
How components talk to each other. REST, gRPC, events, async patterns. Poor communication design creates invisible bottlenecks.
Level: Intermediate
05 · Data at Scale
When data outgrows one machine. Storage strategies, consistency models, stream processing. Where the hardest problems live.
Level: Intermediate
06 · Security & Observability
Protecting systems and understanding what is happening inside them. You cannot fix what you cannot see.
Level: Intermediate
07 · Distributed Systems
Clocks, consensus, partition tolerance. Where distributed systems theory meets engineering reality. The deep end.
Level: Advanced
08 · Architecture Styles
Monolith vs microservices vs serverless. Not a technology choice — a trade-off choice that shapes everything else.
Level: Intermediate
09 · Case Studies
Real systems designed from scratch. Where all domains come together. Knowledge becomes judgment.
Level: All levels
Breadth before depth. Know that all nine domains exist before going deep on any single one. An engineer who knows one domain thoroughly but is blind to the others will consistently solve the wrong problem with the wrong tool.
- System design is not one skill — it is a collection of nine related domains.
- Fundamentals and Building Blocks are the entry points: learn these before the others.
- Distributed Systems is the most theoretically demanding; Case Studies is where all domains converge.
- Engineers develop depth in specific domains and breadth across all of them. Both matter.
How Systems Evolve Over Time
No system starts complex. Complexity is forced — by growth, by usage patterns, by requirements that could not be anticipated on day one. This evolution arc is the most important mental model for understanding why every component in system design exists. Every pattern, every building block, every architecture style was invented because someone hit a wall with a simpler approach and needed a way through it.
The key insight: Nobody jumps from Stage 1 to Stage 7. The arc runs Single Server → Separate Database → Load Balancer → Read Replicas → Cache → Sharding → Microservices, and each stage solves the bottleneck of the previous stage while introducing new problems. Understanding this arc is understanding why every building block in system design was invented. Every component exists because someone hit a wall at a specific scale.
- No system starts complex — complexity is earned by growth, not chosen by preference.
- 7 stages: Single Server → Separate DB → Load Balancer → Read Replicas → Cache → Sharding → Microservices.
- Each stage solves the bottleneck of the stage before it and creates the bottleneck of the stage after it.
- Premature scaling adds real cost with no real benefit. Match architecture to actual, not imagined, scale.
How to Use This Section
This section is not a textbook to read cover to cover. It is a map. Some readers need to start at the beginning. Some need to fill specific gaps. Some are here to formalize instincts built over a decade of production experience. All three approaches are valid — but each has a different optimal entry point, and getting that wrong wastes time that could have been spent building.
Beginner Path
- Read this Overview completely — orientation before depth
- Work through Fundamentals chapter by chapter
- Read Building Blocks in order — each one builds on the last
- Then Scalability & Reliability
- Return to other domains as they become relevant to your actual work
Practitioner Path
- Skim Fundamentals Ch3–Ch5 (NFRs, Estimation, Trade-offs)
- Go deep on Building Blocks you use daily
- Read Scalability end to end
- Pick the domain most relevant to your current work
- Use Case Studies as practice — design before reading the solution
Senior Path
- Distributed Systems — clocks, consensus, replication theory
- Case Studies — compare your instincts against the guided designs
- Architecture Styles — validate your trade-off thinking
- Reference section for patterns index and decision trees
WHY Before HOW
Every chapter opens with the problem being solved, not the solution. Context first. Mechanism second.
Visual Before Text
Diagrams introduce concepts visually before prose explains them. Patterns are spatial — diagrams first is intentional.
Beginner Bridge Present
Every advanced concept has a beginner entry point. Advanced readers skip it; beginners have a path in.
Chapter Summaries
Each chapter ends with a tight summary. Review without re-reading. Useful for spaced repetition.
Architecture Cross-Links
Where system design meets its architectural underpinning, there is always a link to the Architecture Foundation.
Trade-offs Explicit
Every decision shows the trade-off, not a "best answer." There are no best answers — only documented trade-offs.
This section rewards curiosity more than linear reading. Follow what interests you. The Fundamentals chapter will always be here when you need to ground a concept. The deeper domains will always be here when a production problem sends you looking for answers.
- Beginner: Start at Overview, work through Fundamentals and Building Blocks in order.
- Practitioner: Fill gaps — NFRs, estimation, then the domain you work in most.
- Senior: Start at Distributed Systems and Case Studies; use everything else as validation.
- Every page: WHY before HOW. Visual before text. Explicit trade-offs, not imaginary best practices.
Key Vocabulary
This is not a glossary. A glossary gives you definitions to memorize. What follows is a mental model primer — the twelve terms that appear in almost every system design conversation. Grasping the intuition behind each term before its formal definition is what lets you read the rest of this section without constantly stopping to look things up.
Latency
Definition: Time between a request being sent and the response being received.
Example: 200ms to load a webpage. 1ms for a database query on the same server.
Key nuance: Measure p99, not the average. Averages hide the worst 1% of user experiences — the ones who churn.
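To make the nuance concrete, here is a minimal Python sketch, with invented latency values, showing how a single slow outlier barely moves the mean but defines the tail:

```python
# Why p99, not the average: one slow outlier vanishes in the mean
# but dominates the tail. Latency values are invented for illustration.
import math

latencies_ms = [12, 15, 14, 13, 16, 12, 14, 15, 13, 950]

def percentile(values, pct):
    """Nearest-rank percentile: smallest value at or below which pct% of samples fall."""
    ordered = sorted(values)
    rank = math.ceil(pct / 100 * len(ordered))
    return ordered[max(rank, 1) - 1]

mean = sum(latencies_ms) / len(latencies_ms)
print(f"mean = {mean:.1f} ms")                      # 107.4 ms: looks mediocre
print(f"p50  = {percentile(latencies_ms, 50)} ms")  # 14 ms: looks great
print(f"p99  = {percentile(latencies_ms, 99)} ms")  # 950 ms: the users who churn
```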
Throughput
Definition: Number of operations a system can handle per unit of time. Measured as QPS, RPS, or TPS.
Example: A payment service processing 10,000 transactions per second.
Key nuance: High throughput and low latency are often in tension. Batching increases throughput but adds latency to each item.
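The tension is easy to see with back-of-envelope arithmetic. A Python sketch, with invented per-request overhead and per-item costs:

```python
# Batching amortizes fixed per-request overhead across many items:
# throughput climbs, but every item now waits on the whole batch.
# The cost numbers are invented for illustration.
overhead_ms = 5.0   # fixed cost per request (round trip, parsing, commit)
per_item_ms = 0.1   # marginal cost per item inside the batch

for batch_size in (1, 10, 100, 1000):
    batch_ms = overhead_ms + per_item_ms * batch_size
    throughput = batch_size / (batch_ms / 1000)  # items per second
    print(f"batch={batch_size:4d}: {throughput:8.0f} items/s, "
          f"~{batch_ms:.1f} ms per item (plus time spent filling the batch)")
```

At batch size 1 this handles roughly 200 items per second at ~5 ms each; at batch size 1000 it handles roughly 9,500 items per second, but each item now waits over 100 ms.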
Availability
Definition: Percentage of time a system is operational and accessible. Expressed as the nines.
Example: 99.9% = 8.76 hours of downtime per year. 99.99% = 52.6 minutes per year.
Key nuance: Availability is a promise that must be designed for, not hoped for. Redundancy and automatic failover are the mechanisms.
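The downtime figures above fall straight out of the arithmetic. A small Python sketch:

```python
# Downtime budget implied by each availability target, over a 365-day year.
HOURS_PER_YEAR = 365 * 24  # 8,760

for label, availability in [("99%", 0.99), ("99.9%", 0.999),
                            ("99.99%", 0.9999), ("99.999%", 0.99999)]:
    downtime_h = HOURS_PER_YEAR * (1 - availability)
    if downtime_h >= 1:
        print(f"{label}: {downtime_h:.2f} hours of downtime per year")
    else:
        print(f"{label}: {downtime_h * 60:.1f} minutes of downtime per year")
```

Each extra nine cuts the budget by a factor of ten, which is why each one is dramatically more expensive to engineer than the last.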
Scalability
Definition: Ability of a system to handle increased load by adding resources rather than redesigning from scratch.
Example: Adding servers to handle Black Friday traffic, then removing them afterward.
Key nuance: Horizontal scaling (more machines) scales further but requires stateless design. Vertical scaling (bigger machine) has a ceiling.
Consistency
Definition: Every read receives the most recent write, or an error. No stale data served.
Example: A bank balance must be consistent — $500 on one device cannot show as $1,000 on another.
Key nuance: Strong consistency has a latency cost. In distributed systems, it conflicts with availability during partitions (CAP Theorem).
Partition Tolerance
Definition: System continues to operate despite network partitions between nodes.
Example: A datacenter network split — servers on both sides keep serving requests.
Key nuance: In distributed systems, partitions will happen. The real design question is how to behave when they do.
Fault Tolerance
Definition: Ability to continue operating correctly when components fail, without service interruption.
Example: Netflix continues streaming when one server crashes because requests route to healthy replicas automatically.
Key nuance: Fault tolerance is designed through redundancy and graceful degradation — not through hoping components do not fail.
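In code, the redundancy mechanism can be as simple as trying the next healthy replica. A minimal sketch, where replicas and fetch_from are hypothetical stand-ins for real service discovery and RPC machinery:

```python
# Failover by redundancy: route around failed replicas instead of
# surfacing their errors to the user.
import random

class AllReplicasDown(Exception):
    pass

def fetch_with_failover(replicas, fetch_from):
    # Shuffle so retries do not hammer replicas in a fixed order.
    for replica in random.sample(replicas, len(replicas)):
        try:
            return fetch_from(replica)
        except ConnectionError:
            continue  # unhealthy replica: try the next one
    raise AllReplicasDown("no healthy replica could serve the request")
```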
Idempotency
Definition: An operation that produces the same result whether applied once or multiple times.
Example: A payment retry must not charge the customer twice. DELETE on an already-deleted record must not error.
Key nuance: Critical for retry logic in distributed systems. Without idempotency, retries are dangerous. With it, retries are safe.
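A common implementation is an idempotency key supplied by the client. A minimal sketch, where charge_card is a hypothetical payment call and a plain dict stands in for a durable store with atomic inserts:

```python
# Idempotency via a client-supplied key: retries with the same key replay
# the stored result instead of charging again.
_results: dict[str, str] = {}

def charge_once(idempotency_key: str, amount_cents: int, charge_card) -> str:
    if idempotency_key in _results:       # retry: replay the first outcome
        return _results[idempotency_key]
    receipt = charge_card(amount_cents)   # first attempt: actually charge
    _results[idempotency_key] = receipt
    return receipt
```

The client generates the key once per logical operation and reuses it on every retry; the server then guarantees at-most-once execution per key.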
Replication
Definition: Maintaining copies of data across multiple nodes for reliability, availability, or read performance.
Example: Database read replicas — one primary handles writes, three replicas distribute read traffic.
Key nuance: Replication introduces consistency challenges. How stale is acceptable? When does a replica become the source of truth?
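The standard way to exploit read replicas is read/write splitting. A minimal sketch, where primary and the replica handles are hypothetical database connections:

```python
# Read/write splitting: writes go to the primary, reads round-robin across
# replicas. Deliberately naive: it ignores replication lag and
# read-your-own-writes requirements, which real routing must handle.
import itertools

class RoutingPool:
    def __init__(self, primary, replicas):
        self._primary = primary
        self._replicas = itertools.cycle(replicas)

    def execute(self, sql: str):
        is_read = sql.lstrip().upper().startswith("SELECT")
        conn = next(self._replicas) if is_read else self._primary
        return conn.execute(sql)
```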
Sharding
Definition: Horizontal partitioning of data across multiple database instances. Each shard owns a subset of the data.
Example: Users A–M on shard 1, users N–Z on shard 2. Each shard is an independent database.
Key nuance: Sharding makes certain queries very hard — joining across shards is expensive. Shard key selection is the most consequential decision.
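In practice the shard is usually chosen by a stable hash of the shard key. A minimal Python sketch, using hashlib rather than the built-in hash(), which is randomized between processes:

```python
# Hash sharding: a stable hash of the shard key picks the shard, so every
# service instance routes the same user to the same database.
import hashlib

def shard_for(user_id: str, num_shards: int) -> int:
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards

print(shard_for("alice", 4))  # same user, same shard, on every machine
```

Note the trade-off hiding in the modulo: changing num_shards remaps almost every key, which is why resharding is painful and why consistent hashing exists.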
Caching
Definition: Temporary storage of copies of data to reduce access latency and database load.
Example: Redis storing user session data — 1ms retrieval instead of a 20ms database query on every request.
Key nuance: Cache invalidation is one of the two genuinely hard problems in computer science. Stale cache means wrong data served at scale.
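The most common pattern is cache-aside with a TTL, which bounds how stale a value can get. A minimal sketch, where a dict stands in for Redis and load_user is a hypothetical database query:

```python
# Cache-aside with a TTL: read through the cache, fall back to the
# database on a miss, and let entries expire so staleness is bounded.
import time

TTL_SECONDS = 60
_cache: dict[str, tuple[float, object]] = {}  # key -> (expires_at, value)

def get_user(user_id: str, load_user):
    entry = _cache.get(user_id)
    if entry and entry[0] > time.monotonic():    # hit, still fresh
        return entry[1]
    value = load_user(user_id)                   # miss or expired: hit the DB
    _cache[user_id] = (time.monotonic() + TTL_SECONDS, value)
    return value
```

The TTL is the staleness bound: a 60-second TTL means a reader can see data up to 60 seconds old, which is the explicit trade-off against database load.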
SLA / SLO / SLI
SLA (Service Level Agreement): A contract with external consequences. Breach it and there are financial or legal penalties.
SLO (Service Level Objective): An internal target, stricter than the SLA. The goal you aim for to stay safely within SLA bounds.
SLI (Service Level Indicator): The actual measurement. What you are observing in production right now.
Example: SLA = 99.9% uptime (contractual). SLO = 99.95% (internal target). SLI = 99.97% (measured today).
Key nuance: SLOs must be stricter than SLAs. Always leave yourself a margin. Driving the SLI close to the SLA leaves no runway for incidents.
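The margin is easiest to see as an error budget. A small Python sketch using the numbers from the example above:

```python
# Error-budget arithmetic for SLA 99.9% and SLO 99.95% over a 30-day month.
# The gap between the two is the runway before the contractual penalty fires.
MINUTES_PER_MONTH = 30 * 24 * 60  # 43,200

sla, slo = 0.999, 0.9995
sla_budget = MINUTES_PER_MONTH * (1 - sla)  # minutes the contract tolerates
slo_budget = MINUTES_PER_MONTH * (1 - slo)  # minutes the internal target allows

print(f"SLA tolerates {sla_budget:.1f} min/month of downtime")     # 43.2
print(f"SLO allows    {slo_budget:.1f} min/month")                 # 21.6
print(f"margin        {sla_budget - slo_budget:.1f} min/month of runway")
```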
You do not need to memorize these. You need to understand their trade-offs. Every system design decision is a negotiation between at least two of these concepts. The engineer who knows why latency and throughput tension against each other makes better decisions than one who memorized definitions.
- Latency (how long) and Throughput (how much) are often in tension — batching trades one for the other.
- Availability is a promise requiring design; Fault Tolerance is its implementation mechanism.
- Consistency conflicts with Availability during network partitions — this is the CAP Theorem.
- Sharding and Caching are the primary tools for scaling data beyond one machine.
- SLOs must be stricter than SLAs — always leave yourself a margin before the contractual penalty fires.
01 · Fundamentals
Requirements, trade-offs, capacity estimation. The thinking tools behind every design decision.
Beginner
02 · Building Blocks
Caches, databases, queues, load balancers. Universal components every scaled system uses.
Beginner – Intermediate
03 · Scalability & Reliability
How systems handle more load without falling over. Where prototypes become production systems.
Intermediate
04 · Communication & APIs
How components talk to each other and to the outside world. REST, gRPC, events, async patterns.
Intermediate
05 · Data at Scale
Storage, modeling, and stream processing when data outgrows one machine. The hardest problems.
Intermediate
06 · Security & Observability
Protecting systems and understanding what is happening inside them. You cannot fix what you cannot see.
Intermediate
07 · Distributed Systems
Clocks, consensus, partition tolerance. Where distributed systems theory meets engineering reality.
Advanced
08 · Architecture Styles
Monoliths, microservices, serverless. Not a technology choice — a trade-off that shapes everything.
Intermediate
09 · Case Studies
Real systems designed from scratch. Where all domains converge into judgment and practice.
All levels
Reference
Patterns index, decision trees, key numbers, and the system design interview guide.
All levels
Deliberate Structural Decisions
- Bridges working code and software that works at scale
- Not about code — about structure, boundaries, and trade-offs
- Distinct from architecture (strategic) and engineering (implementation)
- Every system is designed. Deliberately or accidentally.
The Cost Is Paid at 3am
- Poor design fails visibly — at peak load, during launches
- Good design is invisible: systems just work
- WhatsApp: 50 engineers, 600M users — good design multiplies output
- Cost of fixing poor design compounds exponentially over time
Nine Areas of Expertise
- 9 distinct domains from Fundamentals to Distributed Systems
- Breadth before depth — know all domains before deep-diving any
- Each domain builds on the ones before it
- Case Studies is where all domains converge into judgment
Complexity Is Earned, Not Chosen
- No system starts complex — growth forces it
- 7 stages: Single Server to Microservices
- Each stage solves the previous bottleneck and creates the next
- Premature scaling is as dangerous as under-scaling
Three Learning Paths
- Beginner: Overview → Fundamentals → Building Blocks → Scalability
- Practitioner: fill gaps, focus on current domain, use Case Studies
- Senior: Distributed Systems → Case Studies → Architecture Styles
- WHY always comes before HOW on every page
12 Terms, Not Definitions
- Latency (how long) and Throughput (how much) are often in tension
- Availability and Fault Tolerance: designed, never hoped for
- Consistency conflicts with Availability during partitions (CAP)
- Every design decision negotiates between at least two of these