System Design · Fundamentals

System Design Fundamentals

Requirements, trade-offs, and the thinking behind scalable systems.

01
Chapter One

System Design vs Software Architecture vs Software Engineering

Mapping the Disciplines

If you ask ten engineers what system design is, you will get ten different answers — and most of them will blur into software architecture, which will blur into software engineering. This confusion is not ignorance. These disciplines genuinely overlap, share vocabulary, and influence each other constantly. The ability to quickly identify which lens you need at any given moment is one of the clearest marks of engineering maturity.

🏛️

Software Architecture

The discipline of structure and strategy.

Concerned with the long-term organization of a system: which components exist, how they relate, what principles guide their evolution. Academic and organizational in scope.

Example decision: "Should this be a layered or hexagonal architecture?"

💻

Software Engineering

The craft of implementation.

Concerned with writing, testing, and maintaining code. Focused on algorithms, data structures, design patterns at the code level, testing strategies, and tooling.

Example decision: "Should I use a HashMap or a TreeMap here?"

⚙️

System Design

The practice of infrastructure decisions.

Concerned with how components connect, communicate, scale, and fail in production. Closer to engineering than theory. Focused on real trade-offs at real scale.

Example decision: "Should we use a message queue or direct service calls?"

How They Overlap
The Three Disciplines — Where They Meet
[Venn diagram: three overlapping circles — Software Architecture (Structure · Principles · Patterns · Evolution), Software Engineering (Code · Algorithms · Testing · Tooling), System Design (Scale · Reliability · Components · Trade-offs). Overlap regions: Patterns in Practice, Code Structure, Implementation Decisions, Real Systems.]

The mental model: Architecture asks WHAT and WHY. System Design asks HOW and AT WHAT SCALE. Engineering asks WITH WHAT CODE. In practice, the best engineers move fluidly across all three — but they always know which hat they are wearing.

Cross-Reference — Go Deeper
Software Architecture Foundation

The Architecture Foundation section covers software architecture in depth: principles, design patterns, quality attributes, documentation, and the architect's role. System Design builds on that foundation — it is the applied, at-scale practice of the architectural thinking covered there.

📋 Chapter 1 — Summary
  • Software Architecture — structure, principles, long-term evolution (WHAT/WHY)
  • Software Engineering — code, algorithms, implementation craft (WITH WHAT CODE)
  • System Design — scale, reliability, infrastructure trade-offs (HOW/AT WHAT SCALE)
  • In production, all three overlap. Knowing which lens applies at each decision is the skill.
02
Chapter Two

The System Design Process

An Iterative Loop, Not a Waterfall Phase

The most dangerous thing a junior engineer does in a system design session is jump to solutions. They hear "design Twitter" and within 90 seconds they are drawing microservices boxes. The experienced engineer pauses, asks questions, and does math before picking up a pen. System design is not a one-shot activity — it is a loop. Understanding requirements sends you back to recalibrate your design. A trade-off decision sends you back to re-examine constraints. This is not inefficiency. It is the process working correctly.

The System Design Process — An Iterative Loop
[Diagram: the process as a loop. ① Understand Requirements → ② Define Constraints → ③ Estimate Scale → ④ High-Level Design → ⑤ Component Deep Dive → ⑥ Trade-off Analysis, with feedback arrows returning to earlier stages (revisit scale as needed).]
The Six Stages in Practice
🎯

① Understand Requirements

Functional: what the system does. Non-functional: how well it does it.

Key question: "What would make this system fail at its job?"

🚧

② Define Constraints

Budget, team, timeline, existing infrastructure, regulations. Constraints prevent over-engineering.

Key question: "What can we NOT do, regardless of what we want?"

📐

③ Estimate Scale

Order-of-magnitude math. DAU, QPS, storage growth rate. Grounds decisions in reality.

Key question: "What size problem are we actually solving?"

🗺

④ High-Level Design

Major components and data flows. No implementation details. Focus on boundaries.

Key question: "What are the 5–8 boxes and how does data move between them?"

🔬

⑤ Component Deep Dive

Pick the hardest or highest-risk component. Design it in detail. Repeat.

Key question: "Where is the most likely point of failure at scale?"

⚖️

⑥ Trade-off Analysis

Every decision has a cost. Make trade-offs explicit. Document what was rejected and why.

Key question: "What did we give up to get what we gained?"

The goal is not a perfect design. The goal is a design whose trade-offs are understood and documented. A system that surprises its operators is a poorly designed system, regardless of how elegant it looks on a whiteboard.

Hard-Earned Truth

"A design that cannot explain its trade-offs is not a design — it is a guess. And production environments are not forgiving of guesses."

📋 Chapter 2 — Summary
  • System design is an iterative loop, not a linear phase — trade-off findings feed back into earlier stages
  • Six stages: Requirements → Constraints → Scale Estimation → High-Level Design → Deep Dive → Trade-off Analysis
  • Most common mistake: jumping to solutions before understanding requirements and scale
  • Goal: a design with documented and understood trade-offs
03
Chapter Three

Functional vs Non-Functional Requirements

What the System Must Do vs How It Must Behave

Most engineers, when asked to design a system, start thinking about features. This is completely natural and completely backwards. The features — what the system does — are the easy part. The architecture is almost entirely determined by how the system must behave: its availability target, its latency budget, its consistency model, its durability guarantees. Two systems with identical features can require fundamentally different architectures based on NFRs alone. This is the insight that separates junior system designers from senior ones.

✅ Functional Requirements (FR)
⚡ Non-Functional Requirements (NFR)

What the system shall do. Concrete, observable behaviors. User-facing features.

  • Key question: "What does it do?"
  • Testing: Pass / fail (did it happen?)
  • Example: "Users can upload a photo"
  • Drives: Data model, API design, business logic
  • Stability: Can change as product evolves

How well the system does it. Quality attributes. Often called the "-ilities."

  • Key question: "How well must it do it?"
  • Testing: Measurement against threshold
  • Example: "Upload must complete in <500ms at p99"
  • Drives: Architecture, infrastructure choices
  • Stability: Extremely expensive to change later
The Availability Nines — What They Actually Cost

Availability is the most commonly cited NFR and the most commonly misunderstood. "99.9% availability" sounds impressive until you realize it allows eight and a half hours of downtime per year. Every additional nine costs an order of magnitude more in engineering effort, infrastructure, and operational complexity.

Availability Nines — Permitted Downtime
  • 99% (two nines): 3.65 days/year · 7.2 hours/month
  • 99.9% (three nines): 8.76 hours/year · 43.8 min/month
  • 99.99% (four nines): 52.6 minutes/year · 4.38 min/month
  • 99.999% (five nines): 5.26 minutes/year · 26.3 sec/month
  • 99.9999% (six nines): 31.5 seconds/year · 2.63 sec/month

Each additional nine reduces permitted downtime by ~90% and multiplies engineering cost accordingly.
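These downtime budgets are pure arithmetic; a minimal Python sketch to derive the yearly figures:

```python
# Permitted downtime per year for a given availability target.
# Uses a 365-day year; the monthly figures in the table divide this by 12.
SECONDS_PER_YEAR = 365 * 24 * 3600  # 31,536,000

def downtime_seconds(availability_pct: float, period_seconds: float) -> float:
    """Seconds of permitted downtime in the period at the given availability."""
    return period_seconds * (1 - availability_pct / 100)

for nines in (99.0, 99.9, 99.99, 99.999):
    hours = downtime_seconds(nines, SECONDS_PER_YEAR) / 3600
    print(f"{nines}% availability -> {hours:.2f} hours of downtime per year")
```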
Key Non-Functional Requirements

Latency

Time for a single operation to complete.

Measurement: p50, p95, p99 percentiles

Averages lie — one slow request in a hundred destroys user experience.
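To see why averages lie, a small sketch using nearest-rank percentiles over a synthetic sample (the latency values are invented for illustration):

```python
import math

def percentile(samples_ms, p):
    """Nearest-rank percentile: value at rank ceil(p/100 * n) in sorted order."""
    ordered = sorted(samples_ms)
    rank = math.ceil(p / 100 * len(ordered))
    return ordered[rank - 1]

# 99 requests: 98 complete in 10 ms, one takes 2 full seconds.
latencies = [10] * 98 + [2000]
avg = sum(latencies) / len(latencies)

print(f"average = {avg:.1f} ms")               # ~30 ms, looks healthy
print(f"p50 = {percentile(latencies, 50)} ms")  # 10 ms
print(f"p99 = {percentile(latencies, 99)} ms")  # 2000 ms: the outlier surfaces
```

The average hides the outlier entirely; the p99 exposes it, which is why latency SLOs are written against percentiles rather than means.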

📊

Throughput

Operations completed per unit of time.

Measurement: QPS, TPS, RPS

Often traded against latency — high throughput can mean individual requests wait longer.

🔄

Availability

Percentage of time the system is operational.

Measurement: uptime / (uptime + downtime)

Each nine costs roughly 10x more than the last.

💾

Durability

Probability that stored data survives failures.

Measurement: "Eleven nines" = 99.999999999%

S3 design goal. Achieved through replication + erasure coding.

🔁

Consistency

Whether all nodes see the same data at the same time.

Measurement: Strong / Eventual / Causal

Directly conflicts with availability in distributed systems (CAP).

📈

Scalability

Ability to handle increased load without architectural change.

Measurement: Performance at 2x, 10x, 100x load

Horizontal does not mean infinite — every system has a bottleneck.

🛡️

Security

Resistance to unauthorized access and data breaches.

Measurement: Compliance standards, pen test results

Every security control adds latency. That is a trade-off decision, not a bug.

🔧

Maintainability

How quickly defects can be fixed and features added.

Measurement: Mean time to change, deploy frequency

Often dropped under deadline pressure. Always regretted.

🔒

Reliability

Probability of correct operation over a time period.

Measurement: MTBF, MTTR, error rate

Availability is a subset of reliability. A system can be up but unreliable.

How NFRs Conflict With Each Other
NFR Conflict Map — Trade-offs You Will Face in Every System
[Diagram: conflict map across Consistency, Availability, Performance, Security, Durability, and Scalability. Conflict edges: Consistency vs Availability (CAP), Security vs Performance (overhead), Durability vs Performance (fsync on write). Support edge: replication copies aid Durability.]

NFRs are not constraints on the real work. NFRs ARE the real work. Features describe what the system does. NFRs describe whether it actually works for real users at real scale. Systems fail because of NFRs — never because a feature was technically implemented.

📋 Chapter 3 — Summary
  • Functional requirements define what the system does. Testable as pass/fail.
  • Non-functional requirements define how well it does it. Drive architecture choices.
  • The same FR set requires radically different architecture at 100 vs 100M users
  • Key NFRs: Availability, Latency (p99, not average), Throughput, Durability, Consistency, Scalability
  • NFRs conflict: Consistency vs Availability, Security vs Performance, Durability vs Write Latency
  • The nines: 99.9% allows 8.76 hours/year downtime. Each nine costs ~10x more to achieve.
04
Chapter Four

Capacity Estimation & Back-of-Envelope Math

Numbers Before Boxes

Decisions made without numbers are opinions. Decisions made with order-of-magnitude numbers are engineering. You don't need exact figures — you need to be in the right ballpark. Is this a problem that needs one database or fifty? Does this workload need a cache? Will the data fit on a single disk in three years? These are not philosophical questions. They have numerical answers, and those answers should arrive before you draw your first architecture box.

Latency Reference — Orders of Magnitude (Log Scale)
  • L1 cache: ~1 ns
  • L2 cache: ~10 ns
  • RAM access: ~100 ns
  • SSD read: ~100 μs
  • Same-DC network round trip: ~1 ms
  • HDD seek: ~10 ms
  • Cross-continent round trip: ~100 ms

Going to disk is 100,000× slower than RAM. Cross-continent is 100,000,000× slower than L1 cache.
💾

Storage Scale Reference

  • 1 KB — a short text message
  • 1 MB — a compressed photo
  • 1 GB — a feature film (compressed)
  • 1 TB — ~1,000 films or ~1,000,000 one-megabyte photos
  • 1 PB — 500 billion pages of text

A tweet with text: ~300 bytes. A tweet with image: ~1 MB.

Time Scale Reference

  • 1 day = 86,400 seconds ≈ 10⁵
  • 1 month ≈ 2.5 × 10⁶ seconds
  • 1 year ≈ 3.15 × 10⁷ seconds

Why it matters: 500M requests/day ÷ 86,400 = ~5,800 avg QPS. Peak is typically 2–5× average.
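That daily-volume-to-QPS conversion is worth having as muscle memory; a minimal sketch (the 3× peak factor is the rule-of-thumb assumption from above, not a measured value):

```python
SECONDS_PER_DAY = 86_400

def avg_qps(requests_per_day: float) -> float:
    """Average queries per second from a daily request count."""
    return requests_per_day / SECONDS_PER_DAY

def peak_qps(requests_per_day: float, peak_factor: float = 3.0) -> float:
    """Peak QPS, assuming peak traffic is a multiple of the daily average."""
    return peak_factor * avg_qps(requests_per_day)

print(f"{avg_qps(500e6):,.0f} avg QPS")    # 500M requests/day -> ~5,787
print(f"{peak_qps(500e6):,.0f} peak QPS")  # at a 3x peak factor
```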

The Estimation Process
Back-of-Envelope Estimation — Six Steps to Architecture Inputs
  1. Identify primary entities: users, posts, photos, messages, transactions
  2. Estimate scale per entity: DAU, MAU, growth rate, engagement rate
  3. Calculate read/write ratio: most systems are read-heavy; 100:1 is common
  4. Calculate storage needs: item_size × daily_volume × retention_years
  5. Calculate bandwidth: peak_QPS × avg_request_size
  6. Identify bottlenecks from the numbers: single DB? Need cache? CDN? Sharding?

These six outputs are the architecture inputs.
Worked Example — Twitter at Scale
Worked Example — Twitter-Scale Estimation
Given
  • 500M Daily Active Users (DAU)
  • Each user reads timeline 5×/day
  • Each user posts 0.1 tweets/day
  • Avg tweet: 300 bytes text + metadata
  • 10% of tweets include 1 MB media
Calculations
  • Daily reads: 500M × 5 = 2.5B reads
  • Daily writes: 500M × 0.1 = 50M tweets
  • Read:Write ratio = 50:1
  • Storage/day: 50M × (300 B + 0.1 × 1 MB) ≈ 5 TB/day
  • After 5 years: 5 TB × 365 × 5 ≈ 9 PB total
Peak QPS (3× average)
Read avg = 2.5B / 86,400 ≈ 29,000 QPS → peak ≈ 87,000 read QPS
Write avg = 50M / 86,400 ≈ 578 QPS → peak ≈ 1,750 write QPS
Architecture implications: 87K read QPS destroys a single DB — aggressive caching (Redis) required. 9 PB cannot live in one datacenter — distributed object storage required. Read:write of 50:1 means you optimize the read path first.
QPS = (DAU × requests/user/day) ÷ 86,400 sec
Storage/day = write QPS × avg item size × 86,400
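The worked example above can be checked end to end in a few lines; all inputs are the chapter's stated assumptions, not measured figures:

```python
# Reproduces the Twitter-scale worked example. Inputs are the chapter's
# assumptions (500M DAU, 5 reads and 0.1 tweets per user per day, etc.).
SECONDS_PER_DAY = 86_400

dau = 500e6                 # daily active users
reads_per_user = 5          # timeline loads per user per day
tweets_per_user = 0.1       # tweets posted per user per day
tweet_bytes = 300           # text + metadata
media_fraction = 0.10       # share of tweets carrying media
media_bytes = 1e6           # 1 MB per media attachment

daily_reads = dau * reads_per_user             # 2.5 billion
daily_writes = dau * tweets_per_user           # 50 million
read_write_ratio = daily_reads / daily_writes  # 50:1

storage_per_day = daily_writes * (tweet_bytes + media_fraction * media_bytes)
five_year_storage = storage_per_day * 365 * 5

peak_read_qps = 3 * daily_reads / SECONDS_PER_DAY    # 3x peak factor
peak_write_qps = 3 * daily_writes / SECONDS_PER_DAY

print(f"read:write     = {read_write_ratio:.0f}:1")
print(f"storage/day    = {storage_per_day / 1e12:.1f} TB")
print(f"5-year storage = {five_year_storage / 1e15:.1f} PB")
print(f"peak read QPS  = {peak_read_qps:,.0f}")
print(f"peak write QPS = {peak_write_qps:,.0f}")
```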

The numbers tell you whether you need one database or fifty. They tell you whether you need a cache. They tell you whether you need a CDN. Estimation is not arithmetic — it is architecture input.

📋 Chapter 4 — Summary
  • Order-of-magnitude accuracy is enough — the goal is the right ballpark, not the exact figure
  • Internalize: 86,400 sec/day · L1=1ns · RAM=100ns · SSD=100μs · same-DC=1ms · cross-continent=100ms
  • Estimation process: Entities → Scale → R/W Ratio → Storage → Bandwidth → Bottlenecks → Architecture Inputs
  • The output answers: "Do I need sharding? A cache? A CDN? Multiple datacenters?"
05
Chapter Five

Design Trade-offs & Decision Frameworks

Where Experience Becomes Visible

There is a pattern that appears in every architecture review: junior engineers pick technologies; senior engineers pick trade-offs. The junior engineer hears "we need a database" and reaches for what they know. The senior engineer asks what the system needs to optimize for, what it can sacrifice, and what the consequence of that sacrifice is at 2 AM in production. The difference is not knowledge of more tools — it is the habit of making trade-offs explicit before committing to a direction.

Juniors pick technologies. Seniors pick trade-offs. The technology is merely the mechanism through which you implement the trade-off decision you have already made.

CAP Theorem — The Trade-off That Defines Distributed Databases
CAP Theorem — C or A, P is Non-Negotiable in Distributed Systems
[Diagram: CAP triangle. Consistency (same data on all nodes) · Availability (always responds) · Partition Tolerance (survives network splits). CP systems: HBase, ZooKeeper, etcd, MongoDB (w:majority). AP systems: Cassandra, CouchDB, DynamoDB, DNS. CA holds only on a single node. P is non-negotiable in distributed systems; the real choice is C vs A.]
PACELC — What Happens When There Is No Partition

CAP only describes behavior during a network failure. In normal operation, every distributed database trades Latency (respond fast, possibly with slightly stale data) against Consistency (ensure all replicas agree before responding). PACELC makes this everyday trade-off explicit alongside the partition-time behavior.

PACELC — The Full Picture of Distributed Database Trade-offs
[Diagram: PACELC. During a partition (P), choose A (stay available, serve possibly stale data) or C (stay consistent, refuse requests). AP: Cassandra, DynamoDB, CouchDB, DNS. CP: ZooKeeper, etcd, HBase, MongoDB (w:majority). Else (E), in normal operation, choose L (low latency, slightly stale reads acceptable) or C (wait for quorum). EL: DynamoDB, Cassandra with tunable reads. EC: PostgreSQL, MySQL, Google Spanner.]
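The "wait for quorum" choice on the EC side can be made concrete. In Dynamo-style stores with N replicas, a write waits for W acknowledgements and a read for R replies; a read is guaranteed to overlap the latest acknowledged write only when R + W > N. A sketch (the N=3 configurations shown are illustrative):

```python
# Dynamo-style tunable quorums: with n replicas, a write waits for w acks
# and a read waits for r replies. The read and write quorums are forced to
# overlap (so reads see the latest acknowledged write) only when r + w > n.
def read_is_strong(n: int, r: int, w: int) -> bool:
    return r + w > n

print(read_is_strong(3, r=1, w=1))  # False -> EL: fast, possibly stale
print(read_is_strong(3, r=2, w=2))  # True  -> EC: consistent, waits for quorum
print(read_is_strong(3, r=1, w=3))  # True  -> every replica has the write
```

Tunable consistency means choosing r and w per operation, which is exactly the everyday latency-vs-consistency dial PACELC describes.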
Common Trade-offs in Practice

⚡ Consistency vs Availability

Optimize consistency: All reads return the latest data. Write latency increases. Requires coordination.

Optimize availability: Always responds, even with stale data. Higher write throughput. Users may see brief inconsistency.

Choose C: financial transactions, inventory. Choose A: social feeds, like counts, DNS.

⚡ Latency vs Throughput

Optimize latency: Each request processed immediately. Resources may be under-utilized.

Optimize throughput: Batch requests together. Each request waits slightly longer but the system processes more total work.

Choose latency: user-facing reads. Choose throughput: bulk ingestion, log processing.
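A toy model makes this trade visible. Assume each trip to a backend costs a fixed overhead plus some work per request it carries; the 5 ms / 1 ms figures are invented for illustration, not measurements:

```python
# Toy batching model: a backend trip costs fixed overhead plus per-request
# work. Batching amortizes the overhead (throughput rises), but every
# request in the batch waits for the whole batch (latency rises).
OVERHEAD_MS = 5.0
WORK_PER_REQ_MS = 1.0

def batch_stats(batch_size: int):
    """Return (throughput in req/s, per-request latency in ms) for a batch size."""
    batch_time_ms = OVERHEAD_MS + WORK_PER_REQ_MS * batch_size
    throughput = batch_size / (batch_time_ms / 1000)
    return throughput, batch_time_ms

for size in (1, 10, 100):
    tput, lat = batch_stats(size)
    print(f"batch={size:>3}: {tput:>6.0f} req/s, {lat:.0f} ms per request")
```

Under these assumptions, going from batch size 1 to 100 multiplies throughput roughly sixfold while per-request latency grows from 6 ms to over 100 ms: the trade in one loop.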

⚡ Read vs Write Performance

Optimize reads: Pre-compute results, denormalize data, add indexes. Write cost increases.

Optimize writes: Normalize data, compute on read. Read cost increases.

Most consumer systems: optimize reads (100:1 read/write ratios are common).

⚡ Simplicity vs Scalability

Start simple: One DB, one server. Fast to build, easy to understand, but limited scale ceiling.

Start distributed: Scales to millions. But debugging complexity multiplies and failure modes compound.

Default: start simple. Migrate to distributed when the numbers demand it.

The Decision Framework
5-Step Trade-off Decision Framework
1
List your top 3 NFRs for this system
What does this system absolutely have to be? Where are the non-negotiable quality thresholds?
2
State what each candidate decision optimizes and what it sacrifices
Force yourself to name both sides. If you cannot name the sacrifice, you do not understand the decision.
3
Check if the sacrifice is acceptable given your NFR priorities
Does the sacrifice conflict with a top-3 NFR? If yes, discard this option.
4
Document the decision AND the rejected alternatives with reasons
Undocumented decisions become mysteries. Mysteries become incidents when the original engineer has left.
5
Set a review trigger
At what scale, date, or failure condition will you revisit this decision? Write it down now.
Note — Architecture Decision Records (ADRs)

The industry formalization of the documentation step is the Architecture Decision Record — a short, structured document capturing what was decided, the context, the alternatives rejected, and the consequences. Undocumented trade-offs become "technical debt" in 18 months when the engineer who made them has left the team.

See the Architecture Foundation → Documentation chapter for ADR templates.
📋 Chapter 5 — Summary
  • Juniors pick technologies. Seniors pick trade-offs. Technology is the mechanism, not the decision.
  • CAP: During a partition, choose Consistency (CP) or Availability (AP). P is mandatory in distributed systems.
  • PACELC: Normal operation also forces a choice — Latency or Consistency on every read/write.
  • CP examples: ZooKeeper, etcd, HBase. AP examples: Cassandra, DynamoDB, DNS.
  • Common trade-offs: C/A · Latency/Throughput · Read/Write perf · Simplicity/Scalability
  • 5-step framework: NFRs → name the sacrifice → check acceptability → document → set review trigger
Fundamentals at a Glance
01 · The Three Disciplines

Architecture, Engineering & Design

  • Architecture = WHAT and WHY (structure, principles)
  • System Design = HOW and AT WHAT SCALE (infra decisions)
  • Engineering = WITH WHAT CODE (implementation)
  • They overlap — knowing which lens to use is the skill
02 · The Design Process

An Iterative Loop, Not a Waterfall

  • 6 stages: Requirements → Constraints → Scale → HLD → Deep Dive → Trade-offs
  • Trade-off findings loop back to earlier stages — that is healthy
  • Goal: documented trade-offs, not a perfect design
  • Most common mistake: jumping to solutions before estimating scale
03 · Requirements

NFRs Drive Architecture

  • FRs describe what the system does (features)
  • NFRs describe how well it does it (quality attributes)
  • Same FR + different NFRs = completely different architecture
  • Systems fail because of NFRs, never because of features
  • 99.9% availability = 8.76 hours downtime per year
04 · Capacity Estimation

Numbers Before Boxes

  • Order-of-magnitude is enough — ballpark reveals architecture
  • 86,400 sec/day · RAM=100ns · SSD=100μs · same-DC=1ms
  • QPS = DAU × requests/user/day ÷ 86,400
  • Numbers reveal whether you need cache, CDN, sharding
05 · Trade-off Thinking

CAP, PACELC, and Decision Frameworks

  • CAP: During partition, choose C (consistency) or A (availability)
  • PACELC: Normal operation also forces L vs C on every operation
  • CP: ZooKeeper, etcd · AP: Cassandra, DynamoDB, DNS
  • Document decisions AND rejected alternatives — future self thanks you