System Design Fundamentals
Requirements, trade-offs, and the thinking behind scalable systems.
System Design vs Software Architecture vs Software Engineering
If you ask ten engineers what system design is, you will get ten different answers — and most of them will blur into software architecture, which will blur into software engineering. This confusion is not ignorance. These disciplines genuinely overlap, share vocabulary, and influence each other constantly. The ability to quickly identify which lens you need at any given moment is one of the clearest marks of engineering maturity.
Software Architecture
The discipline of structure and strategy.
Concerned with the long-term organization of a system: which components exist, how they relate, what principles guide their evolution. Academic and organizational in scope.
Example decision: "Should this be a layered or hexagonal architecture?"
Software Engineering
The craft of implementation.
Concerned with writing, testing, and maintaining code. Focused on algorithms, data structures, design patterns at the code level, testing strategies, and tooling.
Example decision: "Should I use a HashMap or a TreeMap here?"
System Design
The practice of infrastructure decisions.
Concerned with how components connect, communicate, scale, and fail in production. Closer to engineering than theory. Focused on real trade-offs at real scale.
Example decision: "Should we use a message queue or direct service calls?"
The mental model: Architecture asks WHAT and WHY. System Design asks HOW and AT WHAT SCALE. Engineering asks WITH WHAT CODE. In practice, the best engineers move fluidly across all three — but they always know which hat they are wearing.
The Architecture Foundation section covers software architecture in depth: principles, design patterns, quality attributes, documentation, and the architect's role. System Design builds on that foundation — it is the applied, at-scale practice of the architectural thinking covered there.
- Software Architecture — structure, principles, long-term evolution (WHAT/WHY)
- Software Engineering — code, algorithms, implementation craft (WITH WHAT CODE)
- System Design — scale, reliability, infrastructure trade-offs (HOW/AT WHAT SCALE)
- In production, all three overlap. Knowing which lens applies at each decision is the skill.
The System Design Process
The most dangerous thing a junior engineer does in a system design session is jump to solutions. They hear "design Twitter" and within 90 seconds they are drawing microservices boxes. The experienced engineer pauses, asks questions, and does math before picking up a pen. System design is not a one-shot activity — it is a loop. Understanding requirements sends you back to recalibrate your design. A trade-off decision sends you back to re-examine constraints. This is not inefficiency. It is the process working correctly.
① Understand Requirements
Functional: what the system does. Non-functional: how well it does it.
Key question: "What would make this system fail at its job?"
② Define Constraints
Budget, team, timeline, existing infrastructure, regulations. Constraints prevent over-engineering.
Key question: "What can we NOT do, regardless of what we want?"
③ Estimate Scale
Order-of-magnitude math. DAU, QPS, storage growth rate. Grounds decisions in reality.
Key question: "What size problem are we actually solving?"
④ High-Level Design
Major components and data flows. No implementation details. Focus on boundaries.
Key question: "What are the 5–8 boxes and how do data move between them?"
⑤ Component Deep Dive
Pick the hardest or highest-risk component. Design it in detail. Repeat.
Key question: "Where is the most likely point of failure at scale?"
⑥ Trade-off Analysis
Every decision has a cost. Make trade-offs explicit. Document what was rejected and why.
Key question: "What did we give up to get what we gained?"
The goal is not a perfect design. The goal is a design whose trade-offs are understood and documented. A system that surprises its operators is a poorly designed system, regardless of how elegant it looks on a whiteboard.
"A design that cannot explain its trade-offs is not a design — it is a guess. And production environments are not forgiving of guesses."
- System design is an iterative loop, not a linear phase — trade-off findings feed back into earlier stages
- Six stages: Requirements → Constraints → Scale Estimation → High-Level Design → Deep Dive → Trade-off Analysis
- Most common mistake: jumping to solutions before understanding requirements and scale
- Goal: a design with documented and understood trade-offs
Functional vs Non-Functional Requirements
Most engineers, when asked to design a system, start thinking about features. This is completely natural and completely backwards. The features — what the system does — are the easy part. The architecture is almost entirely determined by how the system must behave: its availability target, its latency budget, its consistency model, its durability guarantees. Two systems with identical features can require fundamentally different architectures based on NFRs alone. This is the insight that separates junior system designers from senior ones.
What the system shall do. Concrete, observable behaviors. User-facing features.
- Key question: "What does it do?"
- Testing: Pass / fail (did it happen?)
- Example: "Users can upload a photo"
- Drives: Data model, API design, business logic
- Stability: Can change as product evolves
How well the system does it. Quality attributes. Often called the "-ilities."
- Key question: "How well must it do it?"
- Testing: Measurement against threshold
- Example: "Upload must complete in <500ms at p99"
- Drives: Architecture, infrastructure choices
- Stability: Extremely expensive to change later
Availability is the most commonly cited NFR and the most commonly misunderstood. "99.9% availability" sounds impressive until you realize it allows 8.76 hours of downtime per year. Every additional nine costs an order of magnitude more in engineering effort, infrastructure, and operational complexity.
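The arithmetic behind the nines is worth doing once by hand. A minimal sketch in Python (figures rounded in the comments):

```python
# Allowed downtime per year at each availability level.
SECONDS_PER_YEAR = 365 * 24 * 3600  # ≈ 3.15 × 10⁷

for availability in (0.99, 0.999, 0.9999, 0.99999):
    downtime = (1 - availability) * SECONDS_PER_YEAR  # seconds per year
    print(f"{availability:.3%} -> {downtime / 3600:6.2f} h/year "
          f"({downtime / 60:7.1f} min)")

# 99.000% ->  87.60 h/year ( 5256.0 min)
# 99.900% ->   8.76 h/year (  525.6 min)
# 99.990% ->   0.88 h/year (   52.6 min)
# 99.999% ->   0.09 h/year (    5.3 min)
```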
Latency
Time for a single operation to complete.
Measurement: p50, p95, p99 percentiles
Averages lie — one slow request in a hundred destroys user experience.
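Percentiles are cheap to compute and make the point concrete. A small sketch, assuming raw per-request latencies in milliseconds (the sample values are invented for illustration):

```python
import math

def percentile(samples, p):
    """p-th percentile by the nearest-rank method (exact, no interpolation)."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]

latencies_ms = [12, 11, 13, 12, 14, 11, 12, 13, 950, 12]  # one straggler
print("mean:", sum(latencies_ms) / len(latencies_ms))  # 106.0: alarming but vague
print("p50: ", percentile(latencies_ms, 50))           # 12:   typical request is fine
print("p99: ", percentile(latencies_ms, 99))           # 950:  the tail users actually feel
```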
Throughput
Operations completed per unit of time.
Measurement: QPS, TPS, RPS
Often traded against latency — high throughput can mean individual requests wait longer.
Availability
Percentage of time the system is operational.
Measurement: uptime / (uptime + downtime)
Each nine costs roughly 10x more than the last.
Durability
Probability that stored data survives failures.
Measurement: "Eleven nines" = 99.999999999%
S3 design goal. Achieved through replication + erasure coding.
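As a toy illustration of why replication drives durability, assume each replica is lost independently with some annual probability. Real durability analysis must also model repair time and correlated failures (same rack, same availability zone), so treat this as intuition only:

```python
# Toy model: data is lost only if every replica fails; failures independent,
# no repair. Real systems repair failed replicas and suffer correlated
# failures, so real durability math is more involved.
p_replica_loss = 0.01  # assumed: 1% chance a given replica is lost in a year

for replicas in (1, 2, 3):
    durability = 1 - p_replica_loss ** replicas
    print(f"{replicas} replica(s): {durability:.6f}")

# 1 replica(s): 0.990000
# 2 replica(s): 0.999900
# 3 replica(s): 0.999999
```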
Consistency
Whether all nodes see the same data at the same time.
Measurement: Strong / Eventual / Causal
Directly conflicts with availability in distributed systems (CAP).
Scalability
Ability to handle increased load without architectural change.
Measurement: Performance at 2x, 10x, 100x load
Horizontal does not mean infinite — every system has a bottleneck.
Security
Resistance to unauthorized access and data breaches.
Measurement: Compliance standards, pen test results
Every security control adds latency. That is a trade-off decision, not a bug.
Maintainability
How quickly defects can be fixed and features added.
Measurement: Mean time to change, deploy frequency
Often dropped under deadline pressure. Always regretted.
Reliability
Probability of correct operation over a time period.
Measurement: MTBF, MTTR, error rate
Availability is a subset of reliability. A system can be up but unreliable.
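The standard steady-state formula, availability = MTBF / (MTBF + MTTR), makes the two improvement levers explicit: fail less often, or recover faster. A quick sketch with assumed figures:

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability = MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)

# Assumed: one failure per month (MTBF ≈ 720 h), one hour to recover.
print(f"{availability(720, 1.0):.4%}")   # 99.8613% (not even three nines)
# Halving MTTR buys as much as doubling MTBF:
print(f"{availability(720, 0.5):.4%}")   # 99.9306%
print(f"{availability(1440, 1.0):.4%}")  # 99.9306%
```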
NFRs are not constraints on the real work. NFRs ARE the real work. Features describe what the system does. NFRs describe whether it actually works for real users at real scale. Systems fail in production because of unmet NFRs, not because of missing features: a feature that technically works but is too slow, too inconsistent, or too often unavailable is still a failure.
- Functional requirements define what the system does. Testable as pass/fail.
- Non-functional requirements define how well it does it. Drive architecture choices.
- The same FR set requires radically different architecture at 100 vs 100M users
- Key NFRs: Availability, Latency (p99, not average), Throughput, Durability, Consistency, Scalability
- NFRs conflict: Consistency vs Availability, Security vs Performance, Durability vs Write Latency
- The nines: 99.9% allows 8.76 hours/year downtime. Each nine costs ~10x more to achieve.
Capacity Estimation & Back-of-Envelope Math
Decisions made without numbers are opinions. Decisions made with order-of-magnitude numbers are engineering. You don't need exact figures — you need to be in the right ballpark. Is this a problem that needs one database or fifty? Does this workload need a cache? Will the data fit on a single disk in three years? These are not philosophical questions. They have numerical answers, and those answers should arrive before you draw your first architecture box.
Storage Scale Reference
- 1 KB — a short text message
- 1 MB — a compressed photo
- 1 GB — a feature film (compressed)
- 1 TB — ~1,000 films or ~1,000,000 compressed photos
- 1 PB — 500 billion pages of text
A tweet with text: ~300 bytes. A tweet with image: ~1 MB.
Time Scale Reference
- 1 day = 86,400 seconds ≈ 10⁵
- 1 month ≈ 2.5 × 10⁶ seconds
- 1 year ≈ 3.15 × 10⁷ seconds
Why it matters: 500M requests/day ÷ 86,400 = ~5,800 avg QPS. Peak is typically 2–5× average.
Assumptions
- 500M Daily Active Users (DAU)
- Each user reads timeline 5×/day
- Each user posts 0.1 tweets/day
- Avg tweet: 300 bytes text + metadata
- 10% of tweets include 1 MB media
Derived numbers
- Daily reads: 500M × 5 = 2.5B reads
- Daily writes: 500M × 0.1 = 50M tweets
- Read:Write ratio = 50:1
- Storage/day: ≈ 5 TB (5M media tweets × 1 MB dominates; text adds only ~15 GB)
- After 5 years: ~9 PB total
Write avg = 50M / 86,400 ≈ 578 QPS → peak ≈ 1,750 write QPS
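The same arithmetic fits in a few lines of Python. The inputs are the assumptions listed above; the assumed peak factor of 3× is within the typical 2–5× range:

```python
# Back-of-envelope capacity estimate for the Twitter-like example above.
DAU = 500_000_000
READS_PER_USER_PER_DAY = 5
TWEETS_PER_USER_PER_DAY = 0.1
TWEET_TEXT_BYTES = 300
MEDIA_FRACTION = 0.10
MEDIA_BYTES = 1_000_000  # ~1 MB
SECONDS_PER_DAY = 86_400
PEAK_FACTOR = 3  # assumed peak-to-average ratio (typically 2–5×)

daily_reads = DAU * READS_PER_USER_PER_DAY      # 2.5B
daily_writes = DAU * TWEETS_PER_USER_PER_DAY    # 50M
read_qps = daily_reads / SECONDS_PER_DAY
write_qps = daily_writes / SECONDS_PER_DAY

storage_per_day = (daily_writes * TWEET_TEXT_BYTES
                   + daily_writes * MEDIA_FRACTION * MEDIA_BYTES)

print(f"read QPS:  avg {read_qps:,.0f}, peak ~{read_qps * PEAK_FACTOR:,.0f}")
print(f"write QPS: avg {write_qps:,.0f}, peak ~{write_qps * PEAK_FACTOR:,.0f}")
print(f"storage:   {storage_per_day / 1e12:.1f} TB/day, "
      f"~{storage_per_day * 365 * 5 / 1e15:.1f} PB in 5 years")

# read QPS:  avg 28,935, peak ~86,806
# write QPS: avg 579, peak ~1,736
# storage:   5.0 TB/day, ~9.2 PB in 5 years
```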
The numbers tell you whether you need one database or fifty. They tell you whether you need a cache. They tell you whether you need a CDN. Estimation is not arithmetic — it is architecture input.
- Order-of-magnitude accuracy is enough — the goal is the right ballpark, not the exact figure
- Internalize: 86,400 sec/day · L1=1ns · RAM=100ns · SSD=100μs · same-DC=1ms · cross-continent=100ms
- Estimation process: Entities → Scale → R/W Ratio → Storage → Bandwidth → Bottlenecks → Architecture Inputs
- The output answers: "Do I need sharding? A cache? A CDN? Multiple datacenters?"
Design Trade-offs & Decision Frameworks
There is a pattern that appears in every architecture review: junior engineers pick technologies; senior engineers pick trade-offs. The junior engineer hears "we need a database" and reaches for what they know. The senior engineer asks what the system needs to optimize for, what it can sacrifice, and what the consequence of that sacrifice is at 2 AM in production. The difference is not knowledge of more tools — it is the habit of making trade-offs explicit before committing to a direction.
Juniors pick technologies. Seniors pick trade-offs. The technology is merely the mechanism through which you implement the trade-off decision you have already made.
The CAP theorem says that when a network partition occurs, a distributed system must choose between Consistency (every read sees the latest write) and Availability (every request gets a response). But CAP only describes behavior during a network failure. In normal operation, every distributed database trades Latency (respond fast, possibly with slightly stale data) against Consistency (ensure all replicas agree before responding). PACELC makes this everyday trade-off explicit alongside the partition-time behavior.
⚡ Consistency vs Availability
Optimize consistency: All reads return the latest data. Write latency increases. Requires coordination.
Optimize availability: Always responds, even with stale data. Higher write throughput. Users may see brief inconsistency.
Choose C: financial transactions, inventory. Choose A: social feeds, like counts, DNS.
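One concrete knob behind this trade-off is quorum configuration in replicated stores: with N replicas, writing W copies and reading R copies guarantees a read overlaps the latest acknowledged write only when R + W > N. A sketch of the rule, using the usual Dynamo-style parameter names:

```python
def quorum_overlap(n: int, w: int, r: int) -> bool:
    """True when any read quorum must intersect any write quorum (R + W > N),
    so a read is guaranteed to see the latest acknowledged write."""
    return r + w > n

print(quorum_overlap(n=3, w=2, r=2))  # True:  consistent, but slower writes and reads
print(quorum_overlap(n=3, w=1, r=1))  # False: fast and available, possibly stale
```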
⚡ Latency vs Throughput
Optimize latency: Each request processed immediately. Resources may be under-utilized.
Optimize throughput: Batch requests together. Each request waits slightly longer but the system processes more total work.
Choose latency: user-facing reads. Choose throughput: bulk ingestion, log processing.
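Batching is the canonical mechanism for buying throughput with latency: items wait until the batch fills or a deadline passes, so per-item latency rises while per-item overhead is amortized. A toy sketch (class and parameter names are illustrative, not a real library API):

```python
import time

class MicroBatcher:
    """Accumulate items; flush when the batch fills or the oldest item ages out."""

    def __init__(self, flush, max_items=100, max_wait_s=0.01):
        self.flush = flush            # callable invoked with a full batch
        self.max_items = max_items    # size trigger (throughput lever)
        self.max_wait_s = max_wait_s  # age trigger (bounds worst-case latency)
        self.items = []
        self.oldest = None

    def add(self, item):
        if not self.items:
            self.oldest = time.monotonic()
        self.items.append(item)
        full = len(self.items) >= self.max_items
        stale = time.monotonic() - self.oldest >= self.max_wait_s
        if full or stale:
            self.flush(self.items)
            self.items = []

batcher = MicroBatcher(flush=lambda batch: print(f"wrote {len(batch)} rows"))
for i in range(250):
    batcher.add(i)  # typically prints "wrote 100 rows" twice; 50 items still buffered
```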
⚡ Read vs Write Performance
Optimize reads: Pre-compute results, denormalize data, add indexes. Write cost increases.
Optimize writes: Normalize data, compute on read. Read cost increases.
Most consumer systems: optimize reads (100:1 read/write ratios are common).
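The classic instance is a social timeline. Fan-out-on-write precomputes every follower's feed at post time (cheap reads, expensive writes); fan-out-on-read assembles the feed per request (cheap writes, expensive reads). A minimal in-memory sketch with invented names:

```python
from collections import defaultdict

followers = {"alice": ["bob", "carol"], "bob": [], "carol": []}
posts_by_author = defaultdict(list)  # source of truth
timelines = defaultdict(list)        # precomputed, denormalized copies

def post_fanout_on_write(author, text):
    posts_by_author[author].append(text)
    for f in followers[author]:      # write cost grows with follower count
        timelines[f].append((author, text))

def timeline_fanout_on_read(following):
    # Read cost grows with the number of followed accounts.
    return [(a, t) for a in following for t in posts_by_author[a]]

post_fanout_on_write("alice", "hello")
print(timelines["bob"])                       # [('alice', 'hello')]: O(1) read
print(timeline_fanout_on_read(["alice"]))     # same result, computed at read time
```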
⚡ Simplicity vs Scalability
Start simple: One DB, one server. Fast to build, easy to understand, but limited scale ceiling.
Start distributed: Scales to millions. But debugging complexity multiplies and failure modes compound.
Default: start simple. Migrate to distributed when the numbers demand it.
The industry formalization of the documentation step is the Architecture Decision Record — a short, structured document capturing what was decided, the context, the alternatives rejected, and the consequences. Undocumented trade-offs become "technical debt" in 18 months when the engineer who made them has left the team.
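A minimal ADR sketch in the widely used Nygard style; the content and numbering are invented for illustration:

```
ADR-017: Use a message queue between Orders and Fulfillment
Status: Accepted (2025-03-04)

Context: Order spikes (~10× the daily average) overload synchronous calls
to Fulfillment; timeouts cascade back into checkout.

Decision: Publish order events to a queue; Fulfillment consumes at its own rate.

Alternatives rejected:
- Direct calls with retries: retries amplify load during the spikes.
- Scaling Fulfillment to peak: ~3× infrastructure cost, idle most of the time.

Consequences: Checkout and Fulfillment deploy independently. Fulfillment is
now eventually consistent with Orders; we accept up to 30 s of lag and must
monitor queue depth.
```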
See the Architecture Foundation → Documentation chapter for ADR templates.
- Juniors pick technologies. Seniors pick trade-offs. Technology is the mechanism, not the decision.
- CAP: During a partition, choose Consistency (CP) or Availability (AP). P is mandatory in distributed systems.
- PACELC: Normal operation also forces a choice — Latency or Consistency on every read/write.
- CP examples: ZooKeeper, etcd, HBase. AP examples: Cassandra, DynamoDB, DNS.
- Common trade-offs: C/A · Latency/Throughput · Read/Write perf · Simplicity/Scalability
- 5-step framework: NFRs → name the sacrifice → check acceptability → document → set review trigger
Architecture, Engineering & Design
- Architecture = WHAT and WHY (structure, principles)
- System Design = HOW and AT WHAT SCALE (infra decisions)
- Engineering = WITH WHAT CODE (implementation)
- They overlap — knowing which lens to use is the skill
An Iterative Loop, Not a Waterfall
- 6 stages: Requirements → Constraints → Scale → HLD → Deep Dive → Trade-offs
- Trade-off findings loop back to earlier stages — that is healthy
- Goal: documented trade-offs, not a perfect design
- Most common mistake: jumping to solutions before estimating scale
NFRs Drive Architecture
- FRs describe what the system does (features)
- NFRs describe how well it does it (quality attributes)
- Same FR + different NFRs = completely different architecture
- Systems fail because of NFRs, never because of features
- 99.9% availability = 8.76 hours downtime per year
Numbers Before Boxes
- Order-of-magnitude is enough — ballpark reveals architecture
- 86,400 sec/day · RAM=100ns · SSD=100μs · same-DC=1ms
- QPS = DAU × requests/user/day ÷ 86,400
- Numbers reveal whether you need cache, CDN, sharding
CAP, PACELC, and Decision Frameworks
- CAP: During partition, choose C (consistency) or A (availability)
- PACELC: Normal operation also forces L vs C on every operation
- CP: ZooKeeper, etcd · AP: Cassandra, DynamoDB, DNS
- Document decisions AND rejected alternatives — future self thanks you