Scalability & Reliability
Designing systems that handle load and survive failure.
Scalability Fundamentals
Scalability is not a feature you add later. By the time you need it urgently, it is too late to add it cleanly. The systems that scale gracefully were designed with scale in mind from the start โ not necessarily built for scale on day one, but designed so that scaling is possible without a rewrite. A system that requires a fundamental rearchitecture to handle 10ร load is not scalable. A system that requires adding servers is.
Scalability: the ability of a system to handle increased load by adding resources โ without requiring fundamental architectural changes. The key phrase is "without fundamental changes." That constraint is what separates scalable architecture from a prototype.
- Add more power to existing machines โ bigger CPU, more RAM, faster SSD
- Simple: no code changes, no distributed complexity
- Hard ceiling: ~96 cores, ~12TB RAM on largest cloud instances
- Single point of failure remains โ one machine dies, system dies
- Downtime often required for hardware upgrade
- Best for: early stage, databases hard to distribute, buying time
- Add more machines, distribute load across them
- Theoretically unlimited scale โ add servers as demand grows
- Fault tolerant โ lose one server, others handle the load
- Cost-effective at scale: commodity hardware, not premium instances
- Requires stateless application design
- Adds distributed system complexity โ network, consistency, coordination
The prerequisite for horizontal scaling is stateless services. If a server holds session data, user state, or any request-specific context in memory, you cannot freely route requests to any instance. Stateless design pushes all state to dedicated stores โ databases, Redis, object storage โ so application servers become interchangeable and disposable.
Traffic Volume
More requests/sec. Strategy: horizontal scaling, load balancing, caching, CDN.
Data Volume
More data to store/query. Strategy: sharding, partitioning, tiered storage, archival.
Geographic Scale
Users in more regions. Strategy: CDN, multi-region deployment, edge computing.
Feature Scale
More system complexity. Strategy: modular architecture, service boundaries, API contracts.
Team Scale
More engineers. Strategy: microservices, clear ownership, independent deployability.
Operational Scale
More to monitor and maintain. Strategy: observability, automated runbooks, incident response frameworks.
Horizontal scaling is not a technology choice.
- Scalability = handling more load by adding resources without fundamental changes.
- Vertical: simple but has a hard ceiling (~96 cores). Single point of failure remains.
- Horizontal: theoretically unlimited โ requires stateless design and distributed complexity.
- Stateless prerequisite: push all state to dedicated stores. Servers become disposable.
- Five dimensions: traffic, data, geography, features, and team โ each needs different strategies.
Reliability Patterns
Distributed systems fail. Not might fail โ will fail. Networks partition, servers crash, disks fill up, dependencies time out. The question is not whether your system will experience failures โ it is whether those failures are isolated or cascading. Every pattern in this chapter exists to turn inevitable failures into manageable, contained events.
Named after the electrical equivalent: when a downstream dependency is failing, the circuit breaker stops sending requests to it โ giving it time to recover. Without a circuit breaker, a slow dependency causes thread pool exhaustion in the caller. One failing service takes down the entire system via cascade.
Retry with Exponential Backoff
- Wait = base ร 2^attempt + random jitter
- Jitter prevents thundering herd on recovery
- Max retries: 3โ5. Beyond that, return error
- Only safe for idempotent operations
- Retrying a payment without idempotency key = double charge
- Implementation: client-generated UUID as idempotency key; server stores keyโresponse; replay returns stored response
Timeout
- Every external call must have a timeout
- Without it: threads hang forever, pool exhausts
- Too short = false failures. Too long = cascade risk
- Rule: p99 latency of dependency + 20% margin
- Always set connect timeout AND read timeout separately
Fallback: Recommendations
Primary: personalized. Fallback: show popular items. Limitation: lower engagement.
Fallback: Preferences
Primary: user settings. Fallback: defaults. Limitation: impersonal experience.
Fallback: Inventory
Primary: real-time stock. Fallback: cached + "may be inaccurate." Limitation: overselling risk.
- Multiple instances all serve traffic simultaneously
- Loss of one reduces capacity, does NOT cause outage
- RTO: near-zero (traffic already distributed)
- RPO: near-zero (writes on all nodes)
- Best for: web servers, stateless APIs, CDN
- Primary serves traffic; secondary is warm standby
- Failover on primary failure โ some downtime during switch
- RTO: seconds to minutes depending on detection
- RPO: seconds to minutes (replication lag = data at risk)
- Best for: databases, stateful services, legacy systems
A system that fails fast is more reliable than one that fails slowly. Slow failures cascade โ a single slow dependency exhausts thread pools, backs up queues, and takes down unrelated services. Fast failures isolate โ the circuit opens, the fallback serves, and the rest of the system continues working.
- Circuit breaker: Closed โ Open โ Half-Open. Stops cascade by failing fast.
- Retry: exponential backoff + jitter. Only for idempotent ops. Max 3โ5 attempts.
- Timeout: every external call. Set at p99 + 20% margin.
- Bulkhead: isolate pools per dependency. One leak doesn't sink the ship.
- Fallback: serve degraded response when primary fails.
- Active-Active for zero-downtime; Active-Passive for simpler consistency.
Availability Deep Dive
Availability is a promise. When you say your system has 99.9% availability, you are making a commitment to users about what they can expect. That promise is only as good as the architecture behind it. Most teams discover their real availability number after an incident โ not before. Each additional "nine" is an order of magnitude harder and more expensive to achieve.
Error Budget: 100% โ SLO = error budget. If SLO is 99.9%, your error budget is 0.1% of the month (43.8 minutes). When error budget is exhausted: no new features, only reliability work. This is how Google and Netflix balance velocity with reliability.
Single Database
No replica. Server dies = all data unavailable. Mitigate: primary-replica, multi-AZ RDS, automated failover.
Single Load Balancer
LB dies = all traffic drops. Mitigate: dual LB, DNS failover, cloud-managed LB.
Single Availability Zone
AZ power/network failure = total outage. Mitigate: multi-AZ deployment.
Single Engineer Knowledge
Bus factor = 1. Mitigate: docs, pair programming, runbooks, code reviews.
Hardware Failure
Servers, disks, NICs. Detect: health checks. Mitigate: redundancy, auto-replacement.
Network Partition
Nodes can't communicate. Detect: timeout, gossip. Mitigate: multi-zone, graceful degradation.
Software Failure
Bugs, memory leaks. Detect: alerts, error spikes. Mitigate: circuit breakers, auto-restart.
Dependency Failure
Third-party down. Detect: timeout. Mitigate: fallbacks, cached last-known-good.
Human Error
Wrong config, bad migration. Mitigate: automated rollback, canary deploys.
Capacity Exhaustion
Disk full, OOM, connection limits. Detect: threshold alerts. Mitigate: autoscaling, capacity planning.
Netflix Chaos Monkey philosophy: if failure will happen anyway, test it in controlled conditions. Define steady state โ hypothesize โ introduce failure โ observe. Not reckless โ disciplined. Levels: kill process โ terminate server โ fail AZ โ inject latency โ saturate CPU.
Concrete example: Netflix injects 100ms latency into their payment service during a game day. Discovery: the retry storm from the auth service doubles load on the payment service, causing cascading timeout. Fix deployed before a real incident: circuit breaker + exponential backoff added to authโpayment calls.
Your system's availability is determined by its weakest link. One single point of failure makes every redundancy investment irrelevant. SPOF elimination is not optimization โ it is prerequisite work.
- 99.9% = 43 min/month downtime. Each nine = 10ร harder and more expensive.
- Error budget = 100% โ SLO. Exhausted = feature freeze, reliability work only.
- SPOFs: single DB, single LB, single AZ, single engineer. Eliminate first.
- Failure modes: hardware, network, software, dependency, human. Each needs different mitigation.
- Chaos Engineering: test failure deliberately โ better to discover weakness on your terms.
Performance Patterns
Performance and scalability are related but different. Performance is how fast a single request is served. Scalability is how well the system maintains that performance as load increases. A system can be fast but not scalable. A system can be scalable but slow. Both matter โ they require different techniques.
Async Processing
- Move work out of the request path
- When: work >200ms, not needed for response, can retry
- Examples: email, image processing, reports
- Pattern: message queue + background worker
- Trade-off: no immediate result for user
Pre-computation
- Compute expensive results ahead of time
- Twitter: fan-out on write for most users
- Leaderboards: recompute every minute, cache
- Analytics: nightly aggregation, materialized views
- Trade-off: stale data โ how stale is acceptable?
Read Replicas
Route reads to replicas (80โ90% of traffic). Read-your-own-writes: read from primary for X sec after write.
Connection Pooling
Reuse DB connections (~50ms cost per new one). Pool size: cores ร 2 + spindle_count. Too many = context-switch overhead. Too few = request queuing. PgBouncer for PostgreSQL.
Denormalization
Duplicate data to avoid expensive joins. Trade-off: faster reads, slower writes, app manages consistency.
- Load data only when requested โ on first access
- Lower startup cost and memory usage
- Possible latency spike on first access (cold start)
- Best for: large datasets, optional/rarely-used relations
- Example: load user profile details only when user clicks "View Profile"
- Pre-load all related data upfront in a single query
- Faster subsequent reads โ no additional round trips
- Wastes resources if loaded data is never used
- Best for: small datasets, always-needed relations, N+1 query avoidance
- Example: load order + items + shipping in one query for order detail page
- Process each request immediately โ no waiting
- Lower per-request response time
- Higher resource cost per operation (no batching)
- Best for: user-facing APIs, real-time interactions
- Batch multiple operations together โ amortize overhead
- Higher total operations/second
- Individual requests wait for batch to fill (added latency)
- Best for: data pipelines, background jobs, analytics ingestion
The throughput-latency trade-off is fundamental. Batching (Kafka, bulk writes) maximizes throughput at the cost of per-message latency. Streaming (WebSocket, single-row commits) minimizes latency at the cost of throughput. Pipelining (HTTP/2, Redis pipelines) improves both โ but adds implementation complexity. Choose based on whether the consumer is a human (latency) or a machine (throughput).
Optimize for p99, not average. Your average latency can look healthy while 1% of users โ often your most active, highest-value customers โ experience unacceptable slowness. The average of [1ms, 1ms, 1ms, 1000ms] is 250ms. The p99 tells you someone waited a full second.
- Performance โ Scalability. Performance = speed of one request. Scalability = speed under load.
- p99 over averages: exposes the worst 1% that averages hide.
- Async: move non-essential work out of the request path. Queue + worker.
- Pre-compute: compute ahead of time. Trade freshness for speed.
- Fan-out on write for most; fan-out on read for celebrities. Hybrid wins.
- Read replicas, connection pooling, denormalization each trade something for read speed.
CAP Theorem in Practice
CAP theorem is one of the most cited and most misunderstood concepts in distributed systems. The common misreading is "pick 2 of 3." The reality: partition tolerance (P) is not optional in distributed systems. Partitions will happen. The real choice is Consistency vs Availability during a partition โ and that choice only applies during the partition event itself.
CAP only describes what happens during a partition โ a rare event. PACELC extends the model: in normal operation (99.9% of the time), do you prioritize Latency or Consistency? Most real decisions are PACELC decisions, not CAP decisions.
Google Spanner achieves strong consistency (PC/EC) across global regions using TrueTime โ GPS receivers and atomic clocks give each server bounded time uncertainty (typically ยฑ4ms). Spanner waits out the uncertainty window before committing a transaction, guaranteeing external consistency without coordination. Trade-off: ~7ms commit latency floor. This is how Spanner is both globally distributed and strongly consistent โ something CAP says is impossible during partitions, but Spanner minimizes partition probability through private fiber networks.
- What is the business cost of stale data? โ If a user sees stale data for 5 seconds, does anyone lose money?
- How long can staleness be tolerated? โ Milliseconds (payments) vs seconds (likes) vs minutes (analytics).
- Is the data read-heavy or write-heavy? โ Read-heavy favors eventual + caching. Write-heavy favors strong + single leader.
- Is cross-region latency acceptable? โ Strong consistency across regions adds 100โ300ms per write (consensus round trip).
- Can conflicts be resolved automatically? โ CRDTs and last-write-wins enable eventual consistency without manual conflict resolution.
Shopping Cart โ Eventual OK
Two devices add simultaneously: merge both. Briefly stale view harms nobody. Optimize for speed.
Seat Booking โ Strong Required
Double-booking = angry customer + refund. Stale data has direct financial cost. Locking required.
Social Likes โ Eventual OK
Count off by a few for seconds โ not user-visible. No business harm. Optimize for latency.
Bank Balance โ Strong Required
Overdraft has real financial consequences. Must see accurate balance. Cannot trade consistency here.
CAP theorem does not say consistency and availability are always in conflict. It says they conflict during a partition. Design for the partition case โ but optimize for the normal case, which is 99.9% of your system's life. That's what PACELC is about.
- CAP: P is mandatory. During partition: choose C (consistency) or A (availability).
- "Pick 2 of 3" is wrong. P is not optional. Real choice is C vs A during failure.
- PACELC: more useful. Normal operation: Latency (L) vs Consistency (C).
- DynamoDB/Cassandra: PA/EL. PostgreSQL/Spanner: PC/EC.
- Choose based on business consequence of stale data, not technical preference.
Vertical Has a Ceiling, Horizontal Needs Design
- Vertical: simple but capped (~96 cores, ~12TB)
- Horizontal: unlimited but requires stateless architecture
- Stateless prerequisite: push state to dedicated stores
- Five dimensions: traffic, data, geography, features, team
Circuit Breakers Stop Cascades
- Circuit breaker: Closed โ Open โ Half-Open
- Retry: exponential backoff + jitter, idempotent only
- Bulkhead: isolate so one failure doesn't sink the ship
- Timeout every external call. Fallback when primary fails
Each Nine Is 10ร Harder
- 99.9% = 43 min/month. 99.99% = 4.4 min/month
- SPOFs eliminate all redundancy investments
- Error budget: exhausted = feature freeze
- Chaos Engineering: test failure on your terms
Optimize for p99, Not Average
- Performance โ Scalability (speed vs speed under load)
- Async: move non-essential work off request path
- Pre-compute: trade freshness for speed
- Read replicas, connection pools, denormalization
P Is Not Optional, PACELC Is More Useful
- During partition: choose C or A
- PACELC: normal operation forces L vs C
- DynamoDB: PA/EL. PostgreSQL: PC/EC
- Choose based on business cost of stale data