System Design · Scalability & Reliability

Scalability & Reliability

Designing systems that handle load and survive failure.

Chapter One

Scalability Fundamentals

What Scalability Actually Means

Scalability is not a feature you add later. By the time you need it urgently, it is too late to add it cleanly. The systems that scale gracefully were designed with scale in mind from the start — not necessarily built for scale on day one, but designed so that scaling is possible without a rewrite. A system that requires a fundamental rearchitecture to handle 10× load is not scalable. A system that requires adding servers is.

Scalability: the ability of a system to handle increased load by adding resources — without requiring fundamental architectural changes. The key phrase is "without fundamental changes." That constraint is what separates scalable architecture from a prototype.

Vertical vs Horizontal Scaling

Vertical Scaling (Scale Up)

Add more power to existing machines: bigger CPU, more RAM, faster SSD
Simple: no code changes, no distributed complexity
Hard ceiling: ~96 cores, ~12TB RAM on largest cloud instances
Single point of failure remains: one machine dies, system dies
Downtime often required for hardware upgrade
Best for: early stage, databases hard to distribute, buying time

Horizontal Scaling (Scale Out)

Add more machines, distribute load across them
Theoretically unlimited scale: add servers as demand grows
Fault tolerant: lose one server, others handle the load
Cost-effective at scale: commodity hardware, not premium instances
Requires stateless application design
Adds distributed system complexity: network, consistency, coordination

Vertical vs Horizontal Scaling — How Each Handles Growth

Stateless vs Stateful Services

The prerequisite for horizontal scaling is stateless services. If a server holds session data, user state, or any request-specific context in memory, you cannot freely route requests to any instance. Stateless design pushes all state to dedicated stores — databases, Redis, object storage — so application servers become interchangeable and disposable.

Stateful vs Stateless — Why Stateless Scales
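To make the stateless idea concrete, here is a minimal sketch in which two application servers share one external session store. The `SessionStore` class is a hypothetical in-memory stand-in for something like Redis, and the names `AppServer` and `add_to_cart` are illustrative assumptions, not part of any real framework.

```python
from dataclasses import dataclass


class SessionStore:
    """Hypothetical stand-in for a shared external store such as Redis."""

    def __init__(self) -> None:
        self._data: dict[str, dict] = {}

    def get(self, session_id: str) -> dict:
        return self._data.setdefault(session_id, {"cart": []})

    def put(self, session_id: str, session: dict) -> None:
        self._data[session_id] = session


@dataclass
class AppServer:
    """A stateless server: it keeps no per-user data between requests."""
    name: str
    store: SessionStore

    def add_to_cart(self, session_id: str, item: str) -> list[str]:
        session = self.store.get(session_id)   # load state from the shared store
        session["cart"].append(item)
        self.store.put(session_id, session)    # write it back before responding
        return list(session["cart"])


store = SessionStore()
server_a = AppServer("a", store)
server_b = AppServer("b", store)

# Consecutive requests from one user can land on different instances.
server_a.add_to_cart("user-42", "book")
cart = server_b.add_to_cart("user-42", "pen")
print(cart)  # ['book', 'pen']
```

Because neither server holds the cart in its own memory, either one can be removed or replaced without losing the user's session.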

The Six Dimensions of Scale

📈

Traffic Volume

More requests/sec. Strategy: horizontal scaling, load balancing, caching, CDN.

💾

Data Volume

More data to store/query. Strategy: sharding, partitioning, tiered storage, archival.

🌍

Geographic Scale

Users in more regions. Strategy: CDN, multi-region deployment, edge computing.

🧩

Feature Scale

More system complexity. Strategy: modular architecture, service boundaries, API contracts.

👥

Team Scale

More engineers. Strategy: microservices, clear ownership, independent deployability.

⏱️

Operational Scale

More to monitor and maintain. Strategy: observability, automated runbooks, incident response frameworks.

Horizontal scaling is not a technology choice.

📋 Chapter 1 — Summary

Scalability = handling more load by adding resources without fundamental changes.
Vertical: simple but has a hard ceiling (~96 cores). Single point of failure remains.
Horizontal: theoretically unlimited — requires stateless design and distributed complexity.
Stateless prerequisite: push all state to dedicated stores. Servers become disposable.
Six dimensions: traffic, data, geography, features, team, and operations. Each needs different strategies.

Chapter Two

Reliability Patterns

Building Systems That Survive Failure

Distributed systems fail. Not might fail — will fail. Networks partition, servers crash, disks fill up, dependencies time out. The question is not whether your system will experience failures — it is whether those failures are isolated or cascading. Every pattern in this chapter exists to turn inevitable failures into manageable, contained events.

Circuit Breaker

Named after the electrical equivalent: when a downstream dependency is failing, the circuit breaker stops sending requests to it — giving it time to recover. Without a circuit breaker, a slow dependency causes thread pool exhaustion in the caller. One failing service takes down the entire system via cascade.

Circuit Breaker — Three States
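The three states can be sketched as a small state machine. This is a minimal, single-threaded illustration under assumed defaults (`failure_threshold=3`, `reset_timeout=30.0`); the class and parameter names are hypothetical, and production libraries add thread safety, sliding windows, and metrics.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without trying it."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # failures before opening
        self.reset_timeout = reset_timeout          # seconds to stay open
        self.clock = clock                          # injectable for testing
        self.state = "closed"
        self.failures = 0
        self.opened_at = 0.0

    def call(self, func, *args, **kwargs):
        if self.state == "open":
            if self.clock() - self.opened_at < self.reset_timeout:
                raise CircuitOpenError("dependency is failing; call rejected")
            self.state = "half_open"  # cool-down elapsed: allow one trial call
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._on_failure()
            raise
        self._on_success()
        return result

    def _on_failure(self):
        self.failures += 1
        # A failed trial call, or too many failures, opens the circuit.
        if self.state == "half_open" or self.failures >= self.failure_threshold:
            self.state = "open"
            self.opened_at = self.clock()

    def _on_success(self):
        self.state = "closed"
        self.failures = 0
```

While the circuit is open, callers get `CircuitOpenError` immediately instead of waiting on a slow dependency, which is what keeps the caller's thread pool from filling up.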

🔄

Retry with Exponential Backoff

Wait = base × 2^attempt + random jitter
Jitter prevents thundering herd on recovery
Max retries: 3–5. Beyond that, return error
Only safe for idempotent operations
Retrying a payment without idempotency key = double charge
Implementation: client-generated UUID as idempotency key; server stores key→response; replay returns stored response
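The retry rules above can be sketched as follows. The function names, the default `base` of 0.1 seconds, and the `PaymentServer` class are illustrative assumptions; the server is an in-memory stand-in that shows the key→response replay idea, not a real payment API.

```python
import random
import time
import uuid


def backoff_delay(attempt, base=0.1, max_jitter=0.1, rng=random.random):
    """Wait = base * 2^attempt + random jitter, in seconds."""
    return base * (2 ** attempt) + rng() * max_jitter


def retry(operation, max_retries=3, base=0.1, sleep=time.sleep):
    """Call `operation`, retrying up to `max_retries` times on failure.

    Only safe when `operation` is idempotent.
    """
    for attempt in range(max_retries + 1):
        try:
            return operation()
        except Exception:
            if attempt == max_retries:
                raise  # retries exhausted: surface the error to the caller
            sleep(backoff_delay(attempt, base=base))


class PaymentServer:
    """Hypothetical server that stores idempotency key -> response."""

    def __init__(self):
        self.responses = {}
        self.charges = 0

    def charge(self, idempotency_key, amount):
        if idempotency_key in self.responses:
            return self.responses[idempotency_key]  # replay, no second charge
        self.charges += 1
        response = {"status": "charged", "amount": amount}
        self.responses[idempotency_key] = response
        return response


server = PaymentServer()
key = str(uuid.uuid4())  # generated once by the client, reused on every retry
first = retry(lambda: server.charge(key, 25))
second = retry(lambda: server.charge(key, 25))
print(server.charges)  # 1
```

The key is created outside the retry loop on purpose: a fresh key per attempt would defeat the deduplication.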

⏱️

Timeout

Every external call must have a timeout
Without it: threads hang forever, pool exhausts
Too short = false failures. Too long = cascade risk
Rule: p99 latency of dependency + 20% margin
Always set connect timeout AND read timeout separately
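As a rough illustration of the p99 + 20% rule, the sketch below derives a read timeout from latency samples. The sample values are invented for the example, and the nearest-rank percentile is one of several common definitions.

```python
import math


def p99(samples_ms):
    """Nearest-rank 99th percentile of latency samples."""
    ordered = sorted(samples_ms)
    rank = math.ceil(99 * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


def read_timeout_ms(samples_ms, margin=0.20):
    """Rule of thumb from the text: p99 latency plus a 20% margin."""
    return p99(samples_ms) * (1 + margin)


# Hypothetical measurements: 98 fast calls and 2 slow ones.
samples = [20] * 98 + [400, 900]
print(p99(samples))                     # 400
print(round(read_timeout_ms(samples)))  # 480
```

The connect timeout is configured separately and is usually much shorter. With the `requests` library, for example, the `timeout` argument accepts a `(connect, read)` tuple in seconds.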

Bulkhead Pattern — One Leak Doesn't Sink the Ship
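One way to picture a bulkhead is a separate concurrency limit per dependency. The sketch below uses a semaphore per compartment; the `Bulkhead` class, the limit of 2, and the dependency names are assumptions made for illustration.

```python
import threading


class BulkheadFullError(Exception):
    """Raised when a dependency's compartment has no free slots."""


class Bulkhead:
    def __init__(self, max_concurrent):
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def call(self, func, *args, **kwargs):
        # Non-blocking acquire: reject immediately instead of queuing.
        if not self._slots.acquire(blocking=False):
            raise BulkheadFullError("compartment full; failing fast")
        try:
            return func(*args, **kwargs)
        finally:
            self._slots.release()


# One compartment per downstream dependency (hypothetical names).
bulkheads = {
    "payments": Bulkhead(max_concurrent=2),
    "recommendations": Bulkhead(max_concurrent=2),
}
```

If every payments slot is stuck on a slow call, further payments calls are rejected at once, while calls through the recommendations compartment are unaffected.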

🎯

Fallback: Recommendations

Primary: personalized. Fallback: show popular items. Limitation: lower engagement.

⚙️

Fallback: Preferences

Primary: user settings. Fallback: defaults. Limitation: impersonal experience.

📦

Fallback: Inventory

Primary: real-time stock. Fallback: cached + "may be inaccurate." Limitation: overselling risk.
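The fallback cards above share one shape: try the primary source, and serve a degraded answer if it fails. A minimal sketch, with a simulated outage and a hypothetical `POPULAR_ITEMS` list standing in for precomputed popular items:

```python
def with_fallback(primary, fallback):
    """Return (value, degraded), using `fallback` if `primary` raises."""
    try:
        return primary(), False
    except Exception:
        return fallback(), True


POPULAR_ITEMS = ["item-1", "item-2", "item-3"]  # hypothetical precomputed list


def personalized_recommendations(user_id):
    raise TimeoutError("recommendation service unavailable")  # simulated outage


items, degraded = with_fallback(
    lambda: personalized_recommendations("user-42"),
    lambda: POPULAR_ITEMS,
)
print(items, degraded)  # ['item-1', 'item-2', 'item-3'] True
```

Returning the `degraded` flag lets the caller label the response, for example showing "may be inaccurate" next to cached inventory.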

Active-Active

Multiple instances all serve traffic simultaneously
Loss of one reduces capacity, does NOT cause outage
RTO: near-zero (traffic already distributed)
RPO: near-zero (writes on all nodes)
Best for: web servers, stateless APIs, CDN

Active-Passive

Primary serves traffic; secondary is warm standby
Failover on primary failure, with some downtime during the switch
RTO: seconds to minutes depending on detection
RPO: seconds to minutes (replication lag = data at risk)
Best for: databases, stateful services, legacy systems

A system that fails fast is more reliable than one that fails slowly. Slow failures cascade — a single slow dependency exhausts thread pools, backs up queues, and takes down unrelated services. Fast failures isolate — the circuit opens, the fallback serves, and the rest of the system continues working.

📋 Chapter 2 — Summary

Circuit breaker: Closed → Open → Half-Open. Stops cascade by failing fast.
Retry: exponential backoff + jitter. Only for idempotent ops. Max 3–5 attempts.
Timeout: every external call. Set at p99 + 20% margin.
Bulkhead: isolate pools per dependency. One leak doesn't sink the ship.
Fallback: serve degraded response when primary fails.
Active-Active for zero-downtime; Active-Passive for simpler consistency.

Chapter Three

Availability Deep Dive

The Nines and What They Cost

Availability is a promise. When you say your system has 99.9% availability, you are making a commitment to users about what they can expect. That promise is only as good as the architecture behind it. Most teams discover their real availability number after an incident — not before. Each additional "nine" is an order of magnitude harder and more expensive to achieve.

The Availability Nines — Downtime Budget

Error Budgets

Error Budget: 100% − SLO = error budget. If SLO is 99.9%, your error budget is 0.1% of the month (43.8 minutes). When error budget is exhausted: no new features, only reliability work. This is how Google and Netflix balance velocity with reliability.
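The arithmetic behind the budget can be checked with a few lines. The sketch assumes an average month of 365.25 / 12 days, which is what yields the 43.8-minute figure; a 30-day month would give 43.2 minutes instead.

```python
MINUTES_PER_MONTH = 365.25 * 24 * 60 / 12  # average month: 43,830 minutes


def error_budget_minutes(slo_percent):
    """Allowed downtime per month for a given SLO, in minutes."""
    return (100 - slo_percent) / 100 * MINUTES_PER_MONTH


for slo in (99.0, 99.9, 99.99, 99.999):
    print(f"{slo}% -> {error_budget_minutes(slo):.2f} min/month")
```

Each extra nine divides the budget by ten: roughly 438 minutes, 43.8 minutes, 4.4 minutes, and under half a minute per month.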

Single Points of Failure

🗄️

Single Database

No replica. Server dies = all data unavailable. Mitigate: primary-replica, multi-AZ RDS, automated failover.

🌐

Single Load Balancer

LB dies = all traffic drops. Mitigate: dual LB, DNS failover, cloud-managed LB.

🏢

Single Availability Zone

AZ power/network failure = total outage. Mitigate: multi-AZ deployment.

🧠

Single Engineer Knowledge

Bus factor = 1. Mitigate: docs, pair programming, runbooks, code reviews.

Failure Modes

💻

Hardware Failure

Servers, disks, NICs. Detect: health checks. Mitigate: redundancy, auto-replacement.

🔌

Network Partition

Nodes can't communicate. Detect: timeout, gossip. Mitigate: multi-zone, graceful degradation.

🐛

Software Failure

Bugs, memory leaks. Detect: alerts, error spikes. Mitigate: circuit breakers, auto-restart.

🔗

Dependency Failure

Third-party down. Detect: timeout. Mitigate: fallbacks, cached last-known-good.

👤

Human Error

Wrong config, bad migration. Mitigate: automated rollback, canary deploys.

⚡

Capacity Exhaustion

Disk full, OOM, connection limits. Detect: threshold alerts. Mitigate: autoscaling, capacity planning.

Chaos Engineering

Netflix Chaos Monkey philosophy: if failure will happen anyway, test it in controlled conditions. Define steady state → hypothesize → introduce failure → observe. Not reckless — disciplined. Levels: kill process → terminate server → fail AZ → inject latency → saturate CPU.

Illustrative example: during a game day, a team injects 100ms of latency into its payment service. Discovery: the retry storm from the auth service doubles load on the payment service, causing cascading timeouts. Fix deployed before a real incident: circuit breaker and exponential backoff added to auth→payment calls.

Your system's availability is determined by its weakest link. One single point of failure makes every redundancy investment irrelevant. SPOF elimination is not optimization — it is prerequisite work.
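The weakest-link claim follows from how availabilities combine. The sketch below assumes component failures are independent, which real systems only approximate, and uses made-up availability figures for a load balancer, an app tier, and a database.

```python
from math import prod


def serial(*availabilities):
    """All components must be up: availabilities multiply."""
    return prod(availabilities)


def redundant(availability, copies):
    """At least one of `copies` independent replicas must be up."""
    return 1 - (1 - availability) ** copies


# Hypothetical chain: load balancer -> app tier -> database.
app_tier = redundant(0.99, 3)                            # three app servers
no_spof = serial(0.9999, app_tier, redundant(0.999, 2))  # replicated database
with_spof = serial(0.9999, app_tier, 0.99)               # single database at 99%

print(f"{no_spof:.4%}")
print(f"{with_spof:.4%}")
```

With the single 99% database in the chain, the whole system lands just under 99% no matter how many app servers are added.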

📋 Chapter 3 — Summary

99.9% = 43 min/month downtime. Each nine = 10× harder and more expensive.
Error budget = 100% − SLO. Exhausted = feature freeze, reliability work only.
SPOFs: single DB, single LB, single AZ, single engineer. Eliminate first.
Failure modes: hardware, network, software, dependency, human, capacity. Each needs different mitigation.
Chaos Engineering: test failure deliberately — better to discover weakness on your terms.

Chapter Four

Performance Patterns

Latency vs Throughput vs Resources

Performance and scalability are related but different. Performance is how fast a single request is served. Scalability is how well the system maintains that performance as load increases. A system can be fast but not scalable. A system can be scalable but slow. Both matter — they require different techniques.

Latency Distribution — Why p99 Matters

Patterns for Performance

⚡

Async Processing

Move work out of the request path
When: work >200ms, not needed for response, can retry
Examples: email, image processing, reports
Pattern: message queue + background worker
Trade-off: no immediate result for user
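The queue-plus-worker pattern can be sketched with the standard library. This is an in-process illustration only: a real deployment would typically use a durable message broker so that queued work survives a restart. The `handle_signup` handler and the email job are hypothetical.

```python
import queue
import threading

jobs = queue.Queue()
sent = []


def worker():
    while True:
        job = jobs.get()
        if job is None:          # sentinel: stop the worker
            jobs.task_done()
            break
        sent.append(f"welcome email to {job['email']}")  # the slow work
        jobs.task_done()


def handle_signup(email):
    """Request path: enqueue the slow work and return immediately."""
    jobs.put({"email": email})
    return {"status": "accepted"}


thread = threading.Thread(target=worker, daemon=True)
thread.start()
response = handle_signup("ada@example.com")
jobs.join()      # demo only: wait until the worker has drained the queue
jobs.put(None)
thread.join()
print(response, sent)
```

The handler's response says only that the work was accepted, which is the trade-off named above: the user gets no immediate result.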

🔮

Pre-computation

Compute expensive results ahead of time
Twitter: fan-out on write for most users
Leaderboards: recompute every minute, cache
Analytics: nightly aggregation, materialized views
Trade-off: stale data — how stale is acceptable?

Fan-out on Write vs Read — Twitter's Hybrid Model
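A toy version of the hybrid model: ordinary authors push posts into follower timelines at write time, while posts from authors above a follower threshold are merged in at read time. All names and the threshold of 3 are assumptions for illustration; the sketch does not describe Twitter's actual implementation.

```python
from collections import defaultdict

CELEBRITY_THRESHOLD = 3  # hypothetical cutoff; real systems use far larger values

followers = defaultdict(set)         # author -> users following them
following = defaultdict(set)         # user -> authors they follow
timelines = defaultdict(list)        # user -> precomputed entries (write path)
celebrity_posts = defaultdict(list)  # author -> entries merged at read time


def follow(user, author):
    followers[author].add(user)
    following[user].add(author)


def is_celebrity(author):
    return len(followers[author]) >= CELEBRITY_THRESHOLD


def post(author, text, ts):
    entry = (ts, author, text)
    if is_celebrity(author):
        celebrity_posts[author].append(entry)   # one write, merged on read
    else:
        for user in followers[author]:
            timelines[user].append(entry)       # fan-out on write


def read_timeline(user):
    entries = list(timelines[user])
    for author in following[user]:
        if is_celebrity(author):
            entries.extend(celebrity_posts[author])  # fan-out on read
    return sorted(entries, reverse=True)             # newest first
```

A celebrity post costs one write instead of one write per follower, at the price of extra merge work on every read by their followers.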

📖

Read Replicas

Route reads to replicas (80–90% of traffic). Read-your-own-writes: read from primary for X sec after write.
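Read-your-own-writes routing can be sketched as a per-user pin to the primary. The 5-second window is an assumed value for the "X sec" above, and `ReplicaRouter` is a hypothetical name; a real window should exceed the observed replication lag.

```python
import time


class ReplicaRouter:
    def __init__(self, pin_seconds=5.0, clock=time.monotonic):
        self.pin_seconds = pin_seconds   # the "X sec" window from the text
        self.clock = clock               # injectable for testing
        self._last_write = {}            # user_id -> time of last write

    def record_write(self, user_id):
        self._last_write[user_id] = self.clock()

    def target_for_read(self, user_id):
        last = self._last_write.get(user_id)
        if last is not None and self.clock() - last < self.pin_seconds:
            return "primary"             # replica may not have the write yet
        return "replica"
```

Only the user who just wrote is pinned to the primary; everyone else keeps reading from replicas.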

🔌

Connection Pooling

Reuse DB connections (~50ms cost per new one). Pool size: cores × 2 + spindle_count. Too many = context-switch overhead. Too few = request queuing. PgBouncer for PostgreSQL.
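The sizing rule of thumb is simple arithmetic. The example values (8 cores, 1 disk) are assumptions, and the result is a starting point to validate under load, not a guarantee.

```python
def pool_size(cores, spindle_count):
    """Rule of thumb from the text: cores * 2 + spindle_count."""
    return cores * 2 + spindle_count


print(pool_size(cores=8, spindle_count=1))  # 17
```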

📋

Denormalization

Duplicate data to avoid expensive joins. Trade-off: faster reads, slower writes, app manages consistency.

Lazy vs Eager Loading

Lazy Loading

Load data only when requested, on first access
Lower startup cost and memory usage
Possible latency spike on first access (cold start)
Best for: large datasets, optional/rarely-used relations
Example: load user profile details only when user clicks "View Profile"

Eager Loading

Pre-load all related data upfront in a single query
Faster subsequent reads: no additional round trips
Wastes resources if loaded data is never used
Best for: small datasets, always-needed relations, N+1 query avoidance
Example: load order + items + shipping in one query for order detail page
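The difference shows up in when the query runs. In this sketch `fetch` is a hypothetical stand-in for a database call that records each simulated round trip, and the class names are illustrative.

```python
from functools import cached_property

queries = []  # records each simulated database round trip


def fetch(table, key):
    queries.append(table)
    return f"{table} for {key}"


class LazyUser:
    def __init__(self, user_id):
        self.user_id = user_id          # no query at construction time

    @cached_property
    def profile(self):
        return fetch("profile", self.user_id)   # runs on first access only


class EagerOrder:
    def __init__(self, order_id):
        # One simulated joined query loads everything up front.
        self.details = fetch("order+items+shipping", order_id)


user = LazyUser(1)
print(queries)   # []
user.profile
user.profile     # cached: no second query
order = EagerOrder(7)
print(queries)   # ['profile', 'order+items+shipping']
```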

Throughput vs Latency Tension

Optimize for Latency

Process each request immediately, no waiting
Lower per-request response time
Higher resource cost per operation (no batching)
Best for: user-facing APIs, real-time interactions

Optimize for Throughput

Batch multiple operations together to amortize overhead
Higher total operations/second
Individual requests wait for batch to fill (added latency)
Best for: data pipelines, background jobs, analytics ingestion

The throughput-latency trade-off is fundamental. Batching (Kafka, bulk writes) maximizes throughput at the cost of per-message latency. Streaming (WebSocket, single-row commits) minimizes latency at the cost of throughput. Pipelining (HTTP/2, Redis pipelines) improves both — but adds implementation complexity. Choose based on whether the consumer is a human (latency) or a machine (throughput).
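Batching can be illustrated with a small buffer that flushes in groups. The `Batcher` class and the batch size of 3 are assumptions; real batchers usually also flush on a timer so that a partial batch does not wait indefinitely.

```python
class Batcher:
    """Buffers items and flushes them in groups of `batch_size`."""

    def __init__(self, batch_size, sink):
        self.batch_size = batch_size
        self.sink = sink          # called once per batch, e.g. a bulk write
        self.buffer = []

    def add(self, item):
        self.buffer.append(item)
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self):
        if self.buffer:
            self.sink(list(self.buffer))
            self.buffer.clear()


writes = []
batcher = Batcher(batch_size=3, sink=writes.append)
for n in range(7):
    batcher.add(n)
batcher.flush()   # drain the final partial batch
print(writes)     # [[0, 1, 2], [3, 4, 5], [6]]
```

Seven items cost three sink calls instead of seven, but item 0 is not written until item 2 arrives, which is the added latency described above.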

Optimize for p99, not average. Your average latency can look healthy while 1% of users, often your most active, highest-value customers, experience unacceptable slowness. The average of [1ms, 1ms, 1ms, 1000ms] is about 251ms (250.75ms), a figure that describes none of the four requests. The p99 tells you someone waited a full second.
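The four-sample example can be verified directly. The percentile function below uses the nearest-rank definition, which is an assumption; other definitions interpolate between samples.

```python
import math
from statistics import mean


def percentile(samples, pct):
    """Nearest-rank percentile: smallest sample covering `pct` percent."""
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


latencies_ms = [1, 1, 1, 1000]
print(mean(latencies_ms))            # 250.75
print(percentile(latencies_ms, 50))  # 1
print(percentile(latencies_ms, 99))  # 1000
```

The median says the typical request is fast, the mean is distorted by one outlier, and only the p99 reports the one-second wait.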

📋 Chapter 4 — Summary

Performance ≠ Scalability. Performance = speed of one request. Scalability = speed under load.
p99 over averages: exposes the worst 1% that averages hide.
Async: move non-essential work out of the request path. Queue + worker.
Pre-compute: compute ahead of time. Trade freshness for speed.
Fan-out on write for most; fan-out on read for celebrities. Hybrid wins.
Read replicas, connection pooling, denormalization each trade something for read speed.

Chapter Five

CAP Theorem in Practice

What You Actually Choose Between

CAP theorem is one of the most cited and most misunderstood concepts in distributed systems. The common misreading is "pick 2 of 3." The reality: partition tolerance (P) is not optional in distributed systems. Partitions will happen. The real choice is Consistency vs Availability during a partition — and that choice only applies during the partition event itself.

CAP Theorem — The Real Choice

PACELC — The Practical Extension

CAP only describes what happens during a partition — a rare event. PACELC extends the model: in normal operation (99.9% of the time), do you prioritize Latency or Consistency? Most real decisions are PACELC decisions, not CAP decisions.

PACELC — What Real Databases Choose

Spanner TrueTime

Google Spanner achieves strong consistency (PC/EC) across global regions using TrueTime: GPS receivers and atomic clocks give each server a bounded clock uncertainty (typically a few milliseconds, around ±4ms). Spanner waits out the uncertainty window before acknowledging a commit, which guarantees external consistency without servers having to compare clocks with each other. Trade-off: a commit latency floor of roughly 7ms. Spanner does not escape CAP: during a partition it chooses consistency over availability. It is both globally distributed and highly available in practice because Google's private fiber network makes partitions rare.

Consistency Models & Real Decisions

Consistency Spectrum — Eventual to Strong

5-Step Consistency Decision Framework

What is the business cost of stale data? — If a user sees stale data for 5 seconds, does anyone lose money?
How long can staleness be tolerated? — Milliseconds (payments) vs seconds (likes) vs minutes (analytics).
Is the data read-heavy or write-heavy? — Read-heavy favors eventual + caching. Write-heavy favors strong + single leader.
Is cross-region latency acceptable? — Strong consistency across regions adds 100–300ms per write (consensus round trip).
Can conflicts be resolved automatically? — CRDTs and last-write-wins enable eventual consistency without manual conflict resolution.

🛒

Shopping Cart — Eventual OK

Two devices add simultaneously: merge both. Briefly stale view harms nobody. Optimize for speed.

💺

Seat Booking — Strong Required

Double-booking = angry customer + refund. Stale data has direct financial cost. Locking required.

👍

Social Likes — Eventual OK

Count off by a few for seconds — not user-visible. No business harm. Optimize for latency.

🏦

Bank Balance — Strong Required

Overdraft has real financial consequences. Must see accurate balance. Cannot trade consistency here.
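The contrast between the cart and the seat map can be sketched in a few lines. Both functions are illustrative assumptions: the cart merge keeps the larger quantity per item and cannot represent removals, and the seat map uses a single in-process lock where a real system would rely on a database transaction or conditional write.

```python
import threading


def merge_carts(cart_a, cart_b):
    """Merge two device-local carts: keep every item, larger quantity wins."""
    merged = dict(cart_a)
    for item, qty in cart_b.items():
        merged[item] = max(merged.get(item, 0), qty)
    return merged


class SeatMap:
    """Strongly consistent booking: check and write happen under one lock."""

    def __init__(self):
        self._lock = threading.Lock()
        self._owner = {}

    def book(self, seat, user):
        with self._lock:
            if seat in self._owner:
                return False      # already taken: reject, never double-book
            self._owner[seat] = user
            return True


print(merge_carts({"book": 1}, {"pen": 2, "book": 3}))  # {'book': 3, 'pen': 2}

seats = SeatMap()
print(seats.book("12A", "alice"))  # True
print(seats.book("12A", "bob"))    # False
```

The cart merge can run on any replica in any order and still converge, which is why eventual consistency is acceptable there. The seat check cannot be split from the write without risking a double booking.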

CAP theorem does not say consistency and availability are always in conflict. It says they conflict during a partition. Design for the partition case — but optimize for the normal case, which is 99.9% of your system's life. That's what PACELC is about.

📋 Chapter 5 — Summary

CAP: P is mandatory. During partition: choose C (consistency) or A (availability).
"Pick 2 of 3" is wrong. P is not optional. Real choice is C vs A during failure.
PACELC: more useful. Normal operation: Latency (L) vs Consistency (C).
DynamoDB/Cassandra: PA/EL. PostgreSQL/Spanner: PC/EC.
Choose based on business consequence of stale data, not technical preference.

Scalability & Reliability at a Glance

01 · Scalability Fundamentals

Vertical Has a Ceiling, Horizontal Needs Design

Vertical: simple but capped (~96 cores, ~12TB)
Horizontal: unlimited but requires stateless architecture
Stateless prerequisite: push state to dedicated stores
Six dimensions: traffic, data, geography, features, team, operations

02 · Reliability Patterns

Circuit Breakers Stop Cascades

Circuit breaker: Closed → Open → Half-Open
Retry: exponential backoff + jitter, idempotent only
Bulkhead: isolate so one failure doesn't sink the ship
Timeout every external call. Fallback when primary fails

03 · Availability

Each Nine Is 10× Harder

99.9% = 43 min/month. 99.99% = 4.4 min/month
A single SPOF negates every other redundancy investment
Error budget: exhausted = feature freeze
Chaos Engineering: test failure on your terms

04 · Performance Patterns

Optimize for p99, Not Average

Performance ≠ Scalability (speed vs speed under load)
Async: move non-essential work off request path
Pre-compute: trade freshness for speed
Read replicas, connection pools, denormalization

05 · CAP in Practice

P Is Not Optional, PACELC Is More Useful

During partition: choose C or A
PACELC: normal operation forces L vs C
DynamoDB: PA/EL. PostgreSQL: PC/EC
Choose based on business cost of stale data
