System Design ยท Scalability & Reliability

Scalability & Reliability

Designing systems that handle load and survive failure.

01
Chapter One

Scalability Fundamentals

What Scalability Actually Means

Scalability is not a feature you add later. By the time you need it urgently, it is too late to add it cleanly. The systems that scale gracefully were designed with scale in mind from the start โ€” not necessarily built for scale on day one, but designed so that scaling is possible without a rewrite. A system that requires a fundamental rearchitecture to handle 10ร— load is not scalable. A system that requires adding servers is.

Scalability: the ability of a system to handle increased load by adding resources โ€” without requiring fundamental architectural changes. The key phrase is "without fundamental changes." That constraint is what separates scalable architecture from a prototype.

Vertical vs Horizontal Scaling
Vertical Scaling (Scale Up)
Horizontal Scaling (Scale Out)
  • Add more power to existing machines โ€” bigger CPU, more RAM, faster SSD
  • Simple: no code changes, no distributed complexity
  • Hard ceiling: ~96 cores, ~12TB RAM on largest cloud instances
  • Single point of failure remains โ€” one machine dies, system dies
  • Downtime often required for hardware upgrade
  • Best for: early stage, databases hard to distribute, buying time
  • Add more machines, distribute load across them
  • Theoretically unlimited scale โ€” add servers as demand grows
  • Fault tolerant โ€” lose one server, others handle the load
  • Cost-effective at scale: commodity hardware, not premium instances
  • Requires stateless application design
  • Adds distributed system complexity โ€” network, consistency, coordination
Vertical vs Horizontal Scaling โ€” How Each Handles Growth
Vertical Scaling Server Bigger Biggest 96 cores ยท 12TB Hardware Ceiling Horizontal Scaling LB S1 S2 S3 S4 + add more as needed No ceiling Shared State Redis / DB
Stateless vs Stateful Services

The prerequisite for horizontal scaling is stateless services. If a server holds session data, user state, or any request-specific context in memory, you cannot freely route requests to any instance. Stateless design pushes all state to dedicated stores โ€” databases, Redis, object storage โ€” so application servers become interchangeable and disposable.

Stateful vs Stateless โ€” Why Stateless Scales
Stateful (Problem) LB (sticky) Server A state: user1 state: user2 Server B state: user3 โœ— Server A dies = user1 & user2 state LOST Stateless (Solution) LB (any) S1 S2 S3 Redis (all state) user1, user2, user3 Any server dies = no state lost Requests reroute automatically
The Five Dimensions of Scale
๐Ÿ“ˆ

Traffic Volume

More requests/sec. Strategy: horizontal scaling, load balancing, caching, CDN.

๐Ÿ’พ

Data Volume

More data to store/query. Strategy: sharding, partitioning, tiered storage, archival.

๐ŸŒ

Geographic Scale

Users in more regions. Strategy: CDN, multi-region deployment, edge computing.

๐Ÿงฉ

Feature Scale

More system complexity. Strategy: modular architecture, service boundaries, API contracts.

๐Ÿ‘ฅ

Team Scale

More engineers. Strategy: microservices, clear ownership, independent deployability.

โฑ๏ธ

Operational Scale

More to monitor and maintain. Strategy: observability, automated runbooks, incident response frameworks.

Horizontal scaling is not a technology choice.

๐Ÿ“‹ Chapter 1 โ€” Summary
  • Scalability = handling more load by adding resources without fundamental changes.
  • Vertical: simple but has a hard ceiling (~96 cores). Single point of failure remains.
  • Horizontal: theoretically unlimited โ€” requires stateless design and distributed complexity.
  • Stateless prerequisite: push all state to dedicated stores. Servers become disposable.
  • Five dimensions: traffic, data, geography, features, and team โ€” each needs different strategies.
02
Chapter Two

Reliability Patterns

Building Systems That Survive Failure

Distributed systems fail. Not might fail โ€” will fail. Networks partition, servers crash, disks fill up, dependencies time out. The question is not whether your system will experience failures โ€” it is whether those failures are isolated or cascading. Every pattern in this chapter exists to turn inevitable failures into manageable, contained events.

Circuit Breaker

Named after the electrical equivalent: when a downstream dependency is failing, the circuit breaker stops sending requests to it โ€” giving it time to recover. Without a circuit breaker, a slow dependency causes thread pool exhaustion in the caller. One failing service takes down the entire system via cascade.

Circuit Breaker โ€” Three States
CLOSED Normal operation Requests flow through OPEN Dependency failing Requests rejected fast HALF-OPEN Testing recovery One request allowed Failures exceed threshold Timeout elapsed Test request succeeded Test failed Without circuit breaker: slow dependency exhausts caller thread pool = cascade
๐Ÿ”„

Retry with Exponential Backoff

  • Wait = base ร— 2^attempt + random jitter
  • Jitter prevents thundering herd on recovery
  • Max retries: 3โ€“5. Beyond that, return error
  • Only safe for idempotent operations
  • Retrying a payment without idempotency key = double charge
  • Implementation: client-generated UUID as idempotency key; server stores keyโ†’response; replay returns stored response
โฑ๏ธ

Timeout

  • Every external call must have a timeout
  • Without it: threads hang forever, pool exhausts
  • Too short = false failures. Too long = cascade risk
  • Rule: p99 latency of dependency + 20% margin
  • Always set connect timeout AND read timeout separately
Bulkhead Pattern โ€” One Leak Doesn't Sink the Ship
Without Bulkhead SHARED THREAD POOL Dep A (slow) Dep B calls Dep A exhausts pool โ†’ Dep B also fails Total system failure With Bulkhead Pool A Dep A (slow) EXHAUSTED Pool B Dep B calls UNAFFECTED Partial degradation only Named after ship hull compartments โ€” one leak doesn't sink the ship. Separate thread pools, connection pools, queues per dependency.
๐ŸŽฏ

Fallback: Recommendations

Primary: personalized. Fallback: show popular items. Limitation: lower engagement.

โš™๏ธ

Fallback: Preferences

Primary: user settings. Fallback: defaults. Limitation: impersonal experience.

๐Ÿ“ฆ

Fallback: Inventory

Primary: real-time stock. Fallback: cached + "may be inaccurate." Limitation: overselling risk.

Active-Active
Active-Passive
  • Multiple instances all serve traffic simultaneously
  • Loss of one reduces capacity, does NOT cause outage
  • RTO: near-zero (traffic already distributed)
  • RPO: near-zero (writes on all nodes)
  • Best for: web servers, stateless APIs, CDN
  • Primary serves traffic; secondary is warm standby
  • Failover on primary failure โ€” some downtime during switch
  • RTO: seconds to minutes depending on detection
  • RPO: seconds to minutes (replication lag = data at risk)
  • Best for: databases, stateful services, legacy systems

A system that fails fast is more reliable than one that fails slowly. Slow failures cascade โ€” a single slow dependency exhausts thread pools, backs up queues, and takes down unrelated services. Fast failures isolate โ€” the circuit opens, the fallback serves, and the rest of the system continues working.

๐Ÿ“‹ Chapter 2 โ€” Summary
  • Circuit breaker: Closed โ†’ Open โ†’ Half-Open. Stops cascade by failing fast.
  • Retry: exponential backoff + jitter. Only for idempotent ops. Max 3โ€“5 attempts.
  • Timeout: every external call. Set at p99 + 20% margin.
  • Bulkhead: isolate pools per dependency. One leak doesn't sink the ship.
  • Fallback: serve degraded response when primary fails.
  • Active-Active for zero-downtime; Active-Passive for simpler consistency.
03
Chapter Three

Availability Deep Dive

The Nines and What They Cost

Availability is a promise. When you say your system has 99.9% availability, you are making a commitment to users about what they can expect. That promise is only as good as the architecture behind it. Most teams discover their real availability number after an incident โ€” not before. Each additional "nine" is an order of magnitude harder and more expensive to achieve.

The Availability Nines โ€” Downtime Budget
Availability Downtime/Year Downtime/Month Typical Use 99% 3 days 15 hrs 7.2 hours Internal tools 99.9% 8 hrs 45 min 43.8 minutes Consumer apps 99.99% 52 min 35 sec 4.38 minutes Business SaaS 99.999% 5 min 15 sec 26 seconds Finance / Health Each nine = 10ร— harder ยท 10ร— more expensive ยท 10ร— more operational discipline
Error Budgets

Error Budget: 100% โˆ’ SLO = error budget. If SLO is 99.9%, your error budget is 0.1% of the month (43.8 minutes). When error budget is exhausted: no new features, only reliability work. This is how Google and Netflix balance velocity with reliability.

Single Points of Failure
๐Ÿ—„๏ธ

Single Database

No replica. Server dies = all data unavailable. Mitigate: primary-replica, multi-AZ RDS, automated failover.

๐ŸŒ

Single Load Balancer

LB dies = all traffic drops. Mitigate: dual LB, DNS failover, cloud-managed LB.

๐Ÿข

Single Availability Zone

AZ power/network failure = total outage. Mitigate: multi-AZ deployment.

๐Ÿง 

Single Engineer Knowledge

Bus factor = 1. Mitigate: docs, pair programming, runbooks, code reviews.

๐Ÿ’ป

Hardware Failure

Servers, disks, NICs. Detect: health checks. Mitigate: redundancy, auto-replacement.

๐Ÿ”Œ

Network Partition

Nodes can't communicate. Detect: timeout, gossip. Mitigate: multi-zone, graceful degradation.

๐Ÿ›

Software Failure

Bugs, memory leaks. Detect: alerts, error spikes. Mitigate: circuit breakers, auto-restart.

๐Ÿ”—

Dependency Failure

Third-party down. Detect: timeout. Mitigate: fallbacks, cached last-known-good.

๐Ÿ‘ค

Human Error

Wrong config, bad migration. Mitigate: automated rollback, canary deploys.

โšก

Capacity Exhaustion

Disk full, OOM, connection limits. Detect: threshold alerts. Mitigate: autoscaling, capacity planning.

Chaos Engineering

Netflix Chaos Monkey philosophy: if failure will happen anyway, test it in controlled conditions. Define steady state โ†’ hypothesize โ†’ introduce failure โ†’ observe. Not reckless โ€” disciplined. Levels: kill process โ†’ terminate server โ†’ fail AZ โ†’ inject latency โ†’ saturate CPU.

Concrete example: Netflix injects 100ms latency into their payment service during a game day. Discovery: the retry storm from the auth service doubles load on the payment service, causing cascading timeout. Fix deployed before a real incident: circuit breaker + exponential backoff added to authโ†’payment calls.

Your system's availability is determined by its weakest link. One single point of failure makes every redundancy investment irrelevant. SPOF elimination is not optimization โ€” it is prerequisite work.

๐Ÿ“‹ Chapter 3 โ€” Summary
  • 99.9% = 43 min/month downtime. Each nine = 10ร— harder and more expensive.
  • Error budget = 100% โˆ’ SLO. Exhausted = feature freeze, reliability work only.
  • SPOFs: single DB, single LB, single AZ, single engineer. Eliminate first.
  • Failure modes: hardware, network, software, dependency, human. Each needs different mitigation.
  • Chaos Engineering: test failure deliberately โ€” better to discover weakness on your terms.
04
Chapter Four

Performance Patterns

Latency vs Throughput vs Resources

Performance and scalability are related but different. Performance is how fast a single request is served. Scalability is how well the system maintains that performance as load increases. A system can be fast but not scalable. A system can be scalable but slow. Both matter โ€” they require different techniques.

Latency Distribution โ€” Why p99 Matters
p50 p95 p99 Long tail Response time (ms) โ†’ The 1% in the tail are often your most active, highest-value users
Patterns for Performance
โšก

Async Processing

  • Move work out of the request path
  • When: work >200ms, not needed for response, can retry
  • Examples: email, image processing, reports
  • Pattern: message queue + background worker
  • Trade-off: no immediate result for user
๐Ÿ”ฎ

Pre-computation

  • Compute expensive results ahead of time
  • Twitter: fan-out on write for most users
  • Leaderboards: recompute every minute, cache
  • Analytics: nightly aggregation, materialized views
  • Trade-off: stale data โ€” how stale is acceptable?
Fan-out on Write vs Read โ€” Twitter's Hybrid Model
Fan-out on Write User posts โ†’ Write to all followers' feeds Read = instant (pre-computed) 10M followers = 10M writes Best for: most users Fan-out on Read User posts โ†’ Stored once 1 write Read = merge N sources Write = instant Best for: celebrities Twitter: hybrid โ€” write for most, read for celebs Optimize for the common case. Handle the exceptional case differently.
๐Ÿ“–

Read Replicas

Route reads to replicas (80โ€“90% of traffic). Read-your-own-writes: read from primary for X sec after write.

๐Ÿ”Œ

Connection Pooling

Reuse DB connections (~50ms cost per new one). Pool size: cores ร— 2 + spindle_count. Too many = context-switch overhead. Too few = request queuing. PgBouncer for PostgreSQL.

๐Ÿ“‹

Denormalization

Duplicate data to avoid expensive joins. Trade-off: faster reads, slower writes, app manages consistency.

Lazy vs Eager Loading
Lazy Loading
Eager Loading
  • Load data only when requested โ€” on first access
  • Lower startup cost and memory usage
  • Possible latency spike on first access (cold start)
  • Best for: large datasets, optional/rarely-used relations
  • Example: load user profile details only when user clicks "View Profile"
  • Pre-load all related data upfront in a single query
  • Faster subsequent reads โ€” no additional round trips
  • Wastes resources if loaded data is never used
  • Best for: small datasets, always-needed relations, N+1 query avoidance
  • Example: load order + items + shipping in one query for order detail page
Throughput vs Latency Tension
Optimize for Latency
Optimize for Throughput
  • Process each request immediately โ€” no waiting
  • Lower per-request response time
  • Higher resource cost per operation (no batching)
  • Best for: user-facing APIs, real-time interactions
  • Batch multiple operations together โ€” amortize overhead
  • Higher total operations/second
  • Individual requests wait for batch to fill (added latency)
  • Best for: data pipelines, background jobs, analytics ingestion

The throughput-latency trade-off is fundamental. Batching (Kafka, bulk writes) maximizes throughput at the cost of per-message latency. Streaming (WebSocket, single-row commits) minimizes latency at the cost of throughput. Pipelining (HTTP/2, Redis pipelines) improves both โ€” but adds implementation complexity. Choose based on whether the consumer is a human (latency) or a machine (throughput).

Optimize for p99, not average. Your average latency can look healthy while 1% of users โ€” often your most active, highest-value customers โ€” experience unacceptable slowness. The average of [1ms, 1ms, 1ms, 1000ms] is 250ms. The p99 tells you someone waited a full second.

๐Ÿ“‹ Chapter 4 โ€” Summary
  • Performance โ‰  Scalability. Performance = speed of one request. Scalability = speed under load.
  • p99 over averages: exposes the worst 1% that averages hide.
  • Async: move non-essential work out of the request path. Queue + worker.
  • Pre-compute: compute ahead of time. Trade freshness for speed.
  • Fan-out on write for most; fan-out on read for celebrities. Hybrid wins.
  • Read replicas, connection pooling, denormalization each trade something for read speed.
05
Chapter Five

CAP Theorem in Practice

What You Actually Choose Between

CAP theorem is one of the most cited and most misunderstood concepts in distributed systems. The common misreading is "pick 2 of 3." The reality: partition tolerance (P) is not optional in distributed systems. Partitions will happen. The real choice is Consistency vs Availability during a partition โ€” and that choice only applies during the partition event itself.

CAP Theorem โ€” The Real Choice
C Consistency A Availability P Partition Tol. CP Systems AP Systems CA: single-node only HBase, ZooKeeper, etcd Cassandra, DynamoDB P is non-negotiable. Real choice: C or A during a partition.
PACELC โ€” The Practical Extension

CAP only describes what happens during a partition โ€” a rare event. PACELC extends the model: in normal operation (99.9% of the time), do you prioritize Latency or Consistency? Most real decisions are PACELC decisions, not CAP decisions.

PACELC โ€” What Real Databases Choose
During Partition (P) Choose A DynamoDB Cassandra CouchDB Choose C PostgreSQL Spanner HBase Normal Operation (E) Choose L DynamoDB Cassandra Choose C PostgreSQL Spanner DynamoDB: PA/EL PostgreSQL: PC/EC Spanner: PC/EC PA/EL = Available during partition, Low latency normally ยท PC/EC = Consistent always
Spanner TrueTime

Google Spanner achieves strong consistency (PC/EC) across global regions using TrueTime โ€” GPS receivers and atomic clocks give each server bounded time uncertainty (typically ยฑ4ms). Spanner waits out the uncertainty window before committing a transaction, guaranteeing external consistency without coordination. Trade-off: ~7ms commit latency floor. This is how Spanner is both globally distributed and strongly consistent โ€” something CAP says is impossible during partitions, but Spanner minimizes partition probability through private fiber networks.

Consistency Models & Real Decisions
Consistency Spectrum โ€” Eventual to Strong
Eventual Read-Your-Writes Causal Strong Social likes User profile Collab editing Bank transfer โ† Higher availability & lower latency Higher correctness โ†’ Choose based on: business consequence of stale data
5-Step Consistency Decision Framework
  1. What is the business cost of stale data? โ€” If a user sees stale data for 5 seconds, does anyone lose money?
  2. How long can staleness be tolerated? โ€” Milliseconds (payments) vs seconds (likes) vs minutes (analytics).
  3. Is the data read-heavy or write-heavy? โ€” Read-heavy favors eventual + caching. Write-heavy favors strong + single leader.
  4. Is cross-region latency acceptable? โ€” Strong consistency across regions adds 100โ€“300ms per write (consensus round trip).
  5. Can conflicts be resolved automatically? โ€” CRDTs and last-write-wins enable eventual consistency without manual conflict resolution.
๐Ÿ›’

Shopping Cart โ€” Eventual OK

Two devices add simultaneously: merge both. Briefly stale view harms nobody. Optimize for speed.

๐Ÿ’บ

Seat Booking โ€” Strong Required

Double-booking = angry customer + refund. Stale data has direct financial cost. Locking required.

๐Ÿ‘

Social Likes โ€” Eventual OK

Count off by a few for seconds โ€” not user-visible. No business harm. Optimize for latency.

๐Ÿฆ

Bank Balance โ€” Strong Required

Overdraft has real financial consequences. Must see accurate balance. Cannot trade consistency here.

CAP theorem does not say consistency and availability are always in conflict. It says they conflict during a partition. Design for the partition case โ€” but optimize for the normal case, which is 99.9% of your system's life. That's what PACELC is about.

๐Ÿ“‹ Chapter 5 โ€” Summary
  • CAP: P is mandatory. During partition: choose C (consistency) or A (availability).
  • "Pick 2 of 3" is wrong. P is not optional. Real choice is C vs A during failure.
  • PACELC: more useful. Normal operation: Latency (L) vs Consistency (C).
  • DynamoDB/Cassandra: PA/EL. PostgreSQL/Spanner: PC/EC.
  • Choose based on business consequence of stale data, not technical preference.
Scalability & Reliability at a Glance
01 ยท Scalability Fundamentals

Vertical Has a Ceiling, Horizontal Needs Design

  • Vertical: simple but capped (~96 cores, ~12TB)
  • Horizontal: unlimited but requires stateless architecture
  • Stateless prerequisite: push state to dedicated stores
  • Five dimensions: traffic, data, geography, features, team
02 ยท Reliability Patterns

Circuit Breakers Stop Cascades

  • Circuit breaker: Closed โ†’ Open โ†’ Half-Open
  • Retry: exponential backoff + jitter, idempotent only
  • Bulkhead: isolate so one failure doesn't sink the ship
  • Timeout every external call. Fallback when primary fails
03 ยท Availability

Each Nine Is 10ร— Harder

  • 99.9% = 43 min/month. 99.99% = 4.4 min/month
  • SPOFs eliminate all redundancy investments
  • Error budget: exhausted = feature freeze
  • Chaos Engineering: test failure on your terms
04 ยท Performance Patterns

Optimize for p99, Not Average

  • Performance โ‰  Scalability (speed vs speed under load)
  • Async: move non-essential work off request path
  • Pre-compute: trade freshness for speed
  • Read replicas, connection pools, denormalization
05 ยท CAP in Practice

P Is Not Optional, PACELC Is More Useful

  • During partition: choose C or A
  • PACELC: normal operation forces L vs C
  • DynamoDB: PA/EL. PostgreSQL: PC/EC
  • Choose based on business cost of stale data