System Design Fundamentals
Requirements, trade-offs, and the thinking behind scalable systems.
System Design vs Software Architecture vs Software Engineering
If you ask ten engineers what system design is, you will get ten different answers — and most of them will blur into software architecture, which will blur into software engineering. This confusion is not ignorance. These disciplines genuinely overlap, share vocabulary, and influence each other constantly. The ability to quickly identify which lens you need at any given moment is one of the clearest marks of engineering maturity.
Software Architecture
The discipline of structure and strategy.
Concerned with the long-term organization of a system: which components exist, how they relate, what principles guide their evolution. Academic and organizational in scope.
Example decision: "Should this be a layered or hexagonal architecture?"
Software Engineering
The craft of implementation.
Concerned with writing, testing, and maintaining code. Focused on algorithms, data structures, design patterns at the code level, testing strategies, and tooling.
Example decision: "Should I use a HashMap or a TreeMap here?"
System Design
The practice of infrastructure decisions.
Concerned with how components connect, communicate, scale, and fail in production. Closer to engineering than theory. Focused on real trade-offs at real scale.
Example decision: "Should we use a message queue or direct service calls?"
The mental model: Architecture asks WHAT and WHY. System Design asks HOW and AT WHAT SCALE. Engineering asks WITH WHAT CODE. In practice, the best engineers move fluidly across all three — but they always know which hat they are wearing.
The Architecture Foundation section covers software architecture in depth: principles, design patterns, quality attributes, documentation, and the architect's role. System Design builds on that foundation — it is the applied, at-scale practice of the architectural thinking covered there.
- Software Architecture — structure, principles, long-term evolution (WHAT/WHY)
- Software Engineering — code, algorithms, implementation craft (WITH WHAT CODE)
- System Design — scale, reliability, infrastructure trade-offs (HOW/AT WHAT SCALE)
- In production, all three overlap. Knowing which lens applies at each decision is the skill.
The System Design Process
The most dangerous thing a junior engineer does in a system design session is jump to solutions. They hear "design Twitter" and within 90 seconds they are drawing microservices boxes. The experienced engineer pauses, asks questions, and does math before picking up a pen. System design is not a one-shot activity — it is a loop. Understanding requirements sends you back to recalibrate your design. A trade-off decision sends you back to re-examine constraints. This is not inefficiency. It is the process working correctly.
① Understand Requirements
Functional: what the system does. Non-functional: how well it does it.
Key question: "What would make this system fail at its job?"
② Define Constraints
Budget, team, timeline, existing infrastructure, regulations. Constraints prevent over-engineering.
Key question: "What can we NOT do, regardless of what we want?"
③ Estimate Scale
Order-of-magnitude math. DAU, QPS, storage growth rate. Grounds decisions in reality.
Key question: "What size problem are we actually solving?"
④ High-Level Design
Major components and data flows. No implementation details. Focus on boundaries.
Key question: "What are the 5–8 boxes and how do data move between them?"
⑤ Component Deep Dive
Pick the hardest or highest-risk component. Design it in detail. Repeat.
Key question: "Where is the most likely point of failure at scale?"
⑥ Trade-off Analysis
Every decision has a cost. Make trade-offs explicit. Document what was rejected and why.
Key question: "What did we give up to get what we gained?"
The goal is not a perfect design. The goal is a design whose trade-offs are understood and documented. A system that surprises its operators is a poorly designed system, regardless of how elegant it looks on a whiteboard.
"A design that cannot explain its trade-offs is not a design — it is a guess. And production environments are not forgiving of guesses."
- System design is an iterative loop, not a linear phase — trade-off findings feed back into earlier stages
- Six stages: Requirements → Constraints → Scale Estimation → High-Level Design → Deep Dive → Trade-off Analysis
- Most common mistake: jumping to solutions before understanding requirements and scale
- Goal: a design with documented and understood trade-offs
Functional vs Non-Functional Requirements
Most engineers, when asked to design a system, start thinking about features. This is completely natural and completely backwards. The features — what the system does — are the easy part. The architecture is almost entirely determined by how the system must behave: its availability target, its latency budget, its consistency model, its durability guarantees. Two systems with identical features can require fundamentally different architectures based on NFRs alone. This is the insight that separates junior system designers from senior ones.
What the system shall do. Concrete, observable behaviors. User-facing features.
- Key question: "What does it do?"
- Testing: Pass / fail (did it happen?)
- Example: "Users can upload a photo"
- Drives: Data model, API design, business logic
- Stability: Can change as product evolves
How well the system does it. Quality attributes. Often called the "-ilities."
- Key question: "How well must it do it?"
- Testing: Measurement against threshold
- Example: "Upload must complete in <500ms at p99"
- Drives: Architecture, infrastructure choices
- Stability: Extremely expensive to change later
Availability is the most commonly cited NFR and the most commonly misunderstood. "99.9% availability" sounds impressive until you realize it allows 8.76 hours of downtime per year. Every additional nine costs an order of magnitude more in engineering effort, infrastructure, and operational complexity.
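The arithmetic behind the nines is worth doing once by hand. A minimal sketch in Python (figures rounded in the comments):

```python
# Allowed downtime per year at each availability level.
SECONDS_PER_YEAR = 365 * 24 * 3600  # ≈ 3.15 × 10⁷

for availability in (0.99, 0.999, 0.9999, 0.99999):
    downtime = (1 - availability) * SECONDS_PER_YEAR  # seconds per year
    print(f"{availability:.3%} -> {downtime / 3600:6.2f} h/year "
          f"({downtime / 60:7.1f} min)")

# 99.000% ->  87.60 h/year ( 5256.0 min)
# 99.900% ->   8.76 h/year (  525.6 min)
# 99.990% ->   0.88 h/year (   52.6 min)
# 99.999% ->   0.09 h/year (    5.3 min)
```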
Latency
Time for a single operation to complete.
Measurement: p50, p95, p99 percentiles
Averages lie — one slow request in a hundred destroys user experience.
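Percentiles are cheap to compute and make the point concrete. A small sketch, assuming raw per-request latencies in milliseconds (the sample values are invented for illustration):

```python
import math

def percentile(samples, p):
    """p-th percentile by the nearest-rank method (exact, no interpolation)."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]

latencies_ms = [12, 11, 13, 12, 14, 11, 12, 13, 950, 12]  # one straggler
print("mean:", sum(latencies_ms) / len(latencies_ms))  # 106.0: alarming but vague
print("p50: ", percentile(latencies_ms, 50))           # 12:   typical request is fine
print("p99: ", percentile(latencies_ms, 99))           # 950:  the tail users actually feel
```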
Throughput
Operations completed per unit of time.
Measurement: QPS, TPS, RPS
Often traded against latency — high throughput can mean individual requests wait longer.
Availability
Percentage of time the system is operational.
Measurement: uptime / (uptime + downtime)
Each nine costs roughly 10x more than the last.
Durability
Probability that stored data survives failures.
Measurement: "Eleven nines" = 99.999999999%
S3 design goal. Achieved through replication + erasure coding.
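As a toy illustration of why replication drives durability, assume each replica is lost independently with some annual probability. Real durability analysis must also model repair time and correlated failures (same rack, same availability zone), so treat this as intuition only:

```python
# Toy model: data is lost only if every replica fails; failures independent,
# no repair. Real systems repair failed replicas and suffer correlated
# failures, so real durability math is more involved.
p_replica_loss = 0.01  # assumed: 1% chance a given replica is lost in a year

for replicas in (1, 2, 3):
    durability = 1 - p_replica_loss ** replicas
    print(f"{replicas} replica(s): {durability:.6f}")

# 1 replica(s): 0.990000
# 2 replica(s): 0.999900
# 3 replica(s): 0.999999
```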
Consistency
Whether all nodes see the same data at the same time.
Measurement: Strong / Eventual / Causal
Directly conflicts with availability in distributed systems (CAP).
Scalability
Ability to handle increased load without architectural change.
Measurement: Performance at 2x, 10x, 100x load
Horizontal does not mean infinite — every system has a bottleneck.
Security
Resistance to unauthorized access and data breaches.
Measurement: Compliance standards, pen test results
Every security control adds latency. That is a trade-off decision, not a bug.
Maintainability
How quickly defects can be fixed and features added.
Measurement: Mean time to change, deploy frequency
Often dropped under deadline pressure. Always regretted.
Reliability
Probability of correct operation over a time period.
Measurement: MTBF, MTTR, error rate
Availability is a subset of reliability. A system can be up but unreliable.
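The standard steady-state formula, availability = MTBF / (MTBF + MTTR), makes the two improvement levers explicit: fail less often, or recover faster. A quick sketch with assumed figures:

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability = MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)

# Assumed: one failure per month (MTBF ≈ 720 h), one hour to recover.
print(f"{availability(720, 1.0):.4%}")   # 99.8613% (not even three nines)
# Halving MTTR buys as much as doubling MTBF:
print(f"{availability(720, 0.5):.4%}")   # 99.9306%
print(f"{availability(1440, 1.0):.4%}")  # 99.9306%
```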
NFRs are not constraints on the real work. NFRs ARE the real work. Features describe what the system does. NFRs describe whether it actually works for real users at real scale. Systems fail in production because of unmet NFRs, not because of missing features: a feature that technically works but is too slow, too inconsistent, or too often unavailable is still a failure.
- Functional requirements define what the system does. Testable as pass/fail.
- Non-functional requirements define how well it does it. Drive architecture choices.
- The same FR set requires radically different architecture at 100 vs 100M users
- Key NFRs: Availability, Latency (p99, not average), Throughput, Durability, Consistency, Scalability
- NFRs conflict: Consistency vs Availability, Security vs Performance, Durability vs Write Latency
- The nines: 99.9% allows 8.76 hours/year downtime. Each nine costs ~10x more to achieve.
Capacity Estimation & Back-of-Envelope Math
Decisions made without numbers are opinions. Decisions made with order-of-magnitude numbers are engineering. You don't need exact figures — you need to be in the right ballpark. Is this a problem that needs one database or fifty? Does this workload need a cache? Will the data fit on a single disk in three years? These are not philosophical questions. They have numerical answers, and those answers should arrive before you draw your first architecture box.
Storage Scale Reference
- 1 KB — a short text message
- 1 MB — a compressed photo
- 1 GB — a feature film (compressed)
- 1 TB — ~1,000 films or ~1,000,000 compressed photos
- 1 PB — 500 billion pages of text
A tweet with text: ~300 bytes. A tweet with image: ~1 MB.
Time Scale Reference
- 1 day = 86,400 seconds ≈ 10⁵
- 1 month ≈ 2.5 × 10⁶ seconds
- 1 year ≈ 3.15 × 10⁷ seconds
Why it matters: 500M requests/day ÷ 86,400 = ~5,800 avg QPS. Peak is typically 2–5× average.
Assumptions
- 500M Daily Active Users (DAU)
- Each user reads timeline 5×/day
- Each user posts 0.1 tweets/day
- Avg tweet: 300 bytes text + metadata
- 10% of tweets include 1 MB media
Derived numbers
- Daily reads: 500M × 5 = 2.5B reads
- Daily writes: 500M × 0.1 = 50M tweets
- Read:Write ratio = 50:1
- Storage/day: ≈ 5 TB (5M media tweets × 1 MB dominates; text adds only ~15 GB)
- After 5 years: ~9 PB total
Write avg = 50M / 86,400 ≈ 578 QPS → peak ≈ 1,750 write QPS
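The same arithmetic fits in a few lines of Python. The inputs are the assumptions listed above; the assumed peak factor of 3× is within the typical 2–5× range:

```python
# Back-of-envelope capacity estimate for the Twitter-like example above.
DAU = 500_000_000
READS_PER_USER_PER_DAY = 5
TWEETS_PER_USER_PER_DAY = 0.1
TWEET_TEXT_BYTES = 300
MEDIA_FRACTION = 0.10
MEDIA_BYTES = 1_000_000  # ~1 MB
SECONDS_PER_DAY = 86_400
PEAK_FACTOR = 3  # assumed peak-to-average ratio (typically 2–5×)

daily_reads = DAU * READS_PER_USER_PER_DAY      # 2.5B
daily_writes = DAU * TWEETS_PER_USER_PER_DAY    # 50M
read_qps = daily_reads / SECONDS_PER_DAY
write_qps = daily_writes / SECONDS_PER_DAY

storage_per_day = (daily_writes * TWEET_TEXT_BYTES
                   + daily_writes * MEDIA_FRACTION * MEDIA_BYTES)

print(f"read QPS:  avg {read_qps:,.0f}, peak ~{read_qps * PEAK_FACTOR:,.0f}")
print(f"write QPS: avg {write_qps:,.0f}, peak ~{write_qps * PEAK_FACTOR:,.0f}")
print(f"storage:   {storage_per_day / 1e12:.1f} TB/day, "
      f"~{storage_per_day * 365 * 5 / 1e15:.1f} PB in 5 years")

# read QPS:  avg 28,935, peak ~86,806
# write QPS: avg 579, peak ~1,736
# storage:   5.0 TB/day, ~9.2 PB in 5 years
```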
The numbers tell you whether you need one database or fifty. They tell you whether you need a cache. They tell you whether you need a CDN. Estimation is not arithmetic — it is architecture input.
- Order-of-magnitude accuracy is enough — the goal is the right ballpark, not the exact figure
- Internalize: 86,400 sec/day · L1=1ns · RAM=100ns · SSD=100μs · same-DC=1ms · cross-continent=100ms
- Estimation process: Entities → Scale → R/W Ratio → Storage → Bandwidth → Bottlenecks → Architecture Inputs
- The output answers: "Do I need sharding? A cache? A CDN? Multiple datacenters?"
Design Trade-offs & Decision Frameworks
There is a pattern that appears in every architecture review: junior engineers pick technologies; senior engineers pick trade-offs. The junior engineer hears "we need a database" and reaches for what they know. The senior engineer asks what the system needs to optimize for, what it can sacrifice, and what the consequence of that sacrifice is at 2 AM in production. The difference is not knowledge of more tools — it is the habit of making trade-offs explicit before committing to a direction.
Juniors pick technologies. Seniors pick trade-offs. The technology is merely the mechanism through which you implement the trade-off decision you have already made.
The CAP theorem says that when a network partition occurs, a distributed system must choose between Consistency (every read sees the latest write) and Availability (every request gets a response). But CAP only describes behavior during a network failure. In normal operation, every distributed database trades Latency (respond fast, possibly with slightly stale data) against Consistency (ensure all replicas agree before responding). PACELC makes this everyday trade-off explicit alongside the partition-time behavior.
⚡ Consistency vs Availability
Optimize consistency: All reads return the latest data. Write latency increases. Requires coordination.
Optimize availability: Always responds, even with stale data. Higher write throughput. Users may see brief inconsistency.
Choose C: financial transactions, inventory. Choose A: social feeds, like counts, DNS.
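One concrete knob behind this trade-off is quorum configuration in replicated stores: with N replicas, writing W copies and reading R copies guarantees a read overlaps the latest acknowledged write only when R + W > N. A sketch of the rule, using the usual Dynamo-style parameter names:

```python
def quorum_overlap(n: int, w: int, r: int) -> bool:
    """True when any read quorum must intersect any write quorum (R + W > N),
    so a read is guaranteed to see the latest acknowledged write."""
    return r + w > n

print(quorum_overlap(n=3, w=2, r=2))  # True:  consistent, but slower writes and reads
print(quorum_overlap(n=3, w=1, r=1))  # False: fast and available, possibly stale
```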
⚡ Latency vs Throughput
Optimize latency: Each request processed immediately. Resources may be under-utilized.
Optimize throughput: Batch requests together. Each request waits slightly longer but the system processes more total work.
Choose latency: user-facing reads. Choose throughput: bulk ingestion, log processing.
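Batching is the canonical mechanism for buying throughput with latency: items wait until the batch fills or a deadline passes, so per-item latency rises while per-item overhead is amortized. A toy sketch (class and parameter names are illustrative, not a real library API):

```python
import time

class MicroBatcher:
    """Accumulate items; flush when the batch fills or the oldest item ages out."""

    def __init__(self, flush, max_items=100, max_wait_s=0.01):
        self.flush = flush            # callable invoked with a full batch
        self.max_items = max_items    # size trigger (throughput lever)
        self.max_wait_s = max_wait_s  # age trigger (bounds worst-case latency)
        self.items = []
        self.oldest = None

    def add(self, item):
        if not self.items:
            self.oldest = time.monotonic()
        self.items.append(item)
        full = len(self.items) >= self.max_items
        stale = time.monotonic() - self.oldest >= self.max_wait_s
        if full or stale:
            self.flush(self.items)
            self.items = []

batcher = MicroBatcher(flush=lambda batch: print(f"wrote {len(batch)} rows"))
for i in range(250):
    batcher.add(i)  # typically prints "wrote 100 rows" twice; 50 items still buffered
```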
⚡ Read vs Write Performance
Optimize reads: Pre-compute results, denormalize data, add indexes. Write cost increases.
Optimize writes: Normalize data, compute on read. Read cost increases.
Most consumer systems: optimize reads (100:1 read/write ratios are common).
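The classic instance is a social timeline. Fan-out-on-write precomputes every follower's feed at post time (cheap reads, expensive writes); fan-out-on-read assembles the feed per request (cheap writes, expensive reads). A minimal in-memory sketch with invented names:

```python
from collections import defaultdict

followers = {"alice": ["bob", "carol"], "bob": [], "carol": []}
posts_by_author = defaultdict(list)  # source of truth
timelines = defaultdict(list)        # precomputed, denormalized copies

def post_fanout_on_write(author, text):
    posts_by_author[author].append(text)
    for f in followers[author]:      # write cost grows with follower count
        timelines[f].append((author, text))

def timeline_fanout_on_read(following):
    # Read cost grows with the number of followed accounts.
    return [(a, t) for a in following for t in posts_by_author[a]]

post_fanout_on_write("alice", "hello")
print(timelines["bob"])                       # [('alice', 'hello')]: O(1) read
print(timeline_fanout_on_read(["alice"]))     # same result, computed at read time
```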
⚡ Simplicity vs Scalability
Start simple: One DB, one server. Fast to build, easy to understand, but limited scale ceiling.
Start distributed: Scales to millions. But debugging complexity multiplies and failure modes compound.
Default: start simple. Migrate to distributed when the numbers demand it.
The industry formalization of the documentation step is the Architecture Decision Record — a short, structured document capturing what was decided, the context, the alternatives rejected, and the consequences. Undocumented trade-offs become "technical debt" in 18 months when the engineer who made them has left the team.
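A minimal ADR sketch in the widely used Nygard style; the content and numbering are invented for illustration:

```
ADR-017: Use a message queue between Orders and Fulfillment
Status: Accepted (2025-03-04)

Context: Order spikes (~10× the daily average) overload synchronous calls
to Fulfillment; timeouts cascade back into checkout.

Decision: Publish order events to a queue; Fulfillment consumes at its own rate.

Alternatives rejected:
- Direct calls with retries: retries amplify load during the spikes.
- Scaling Fulfillment to peak: ~3× infrastructure cost, idle most of the time.

Consequences: Checkout and Fulfillment deploy independently. Fulfillment is
now eventually consistent with Orders; we accept up to 30 s of lag and must
monitor queue depth.
```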
See the Architecture Foundation → Documentation chapter for ADR templates.
- Juniors pick technologies. Seniors pick trade-offs. Technology is the mechanism, not the decision.
- CAP: During a partition, choose Consistency (CP) or Availability (AP). P is mandatory in distributed systems.
- PACELC: Normal operation also forces a choice — Latency or Consistency on every read/write.
- CP examples: ZooKeeper, etcd, HBase. AP examples: Cassandra, DynamoDB, DNS.
- Common trade-offs: C/A · Latency/Throughput · Read/Write perf · Simplicity/Scalability
- 5-step framework: NFRs → name the sacrifice → check acceptability → document → set review trigger
Architecture, Engineering & Design
- Architecture = WHAT and WHY (structure, principles)
- System Design = HOW and AT WHAT SCALE (infra decisions)
- Engineering = WITH WHAT CODE (implementation)
- They overlap — knowing which lens to use is the skill
An Iterative Loop, Not a Waterfall
- 6 stages: Requirements → Constraints → Scale → HLD → Deep Dive → Trade-offs
- Trade-off findings loop back to earlier stages — that is healthy
- Goal: documented trade-offs, not a perfect design
- Most common mistake: jumping to solutions before estimating scale
NFRs Drive Architecture
- FRs describe what the system does (features)
- NFRs describe how well it does it (quality attributes)
- Same FR + different NFRs = completely different architecture
- Systems fail because of NFRs, never because of features
- 99.9% availability = 8.76 hours downtime per year
Numbers Before Boxes
- Order-of-magnitude is enough — ballpark reveals architecture
- 86,400 sec/day · RAM=100ns · SSD=100μs · same-DC=1ms
- QPS = DAU × requests/user/day ÷ 86,400
- Numbers reveal whether you need cache, CDN, sharding
CAP, PACELC, and Decision Frameworks
- CAP: During partition, choose C (consistency) or A (availability)
- PACELC: Normal operation also forces L vs C on every operation
- CP: ZooKeeper, etcd · AP: Cassandra, DynamoDB, DNS
- Document decisions AND rejected alternatives — future self thanks you