System Design ยท Case Studies

Case Study: Distributed Cache

Design, trade-offs, and alternatives for a distributed cache at scale.

01
Chapter One

Problem Statement

What We Are Building

A distributed cache stores frequently accessed data in memory across multiple servers, reducing database load and serving responses in microseconds instead of milliseconds. The challenge is not caching a single value โ€” it is maintaining consistency across a cluster of cache nodes, handling node failures without data loss storms, distributing keys evenly, and invalidating stale data without serving incorrect results. A cache that serves stale data can be worse than no cache at all.

Scale Requirements

Traffic & Scale

  • 10M requests/sec to cache cluster
  • 50TB+ data cached across cluster
  • 1000+ cache nodes in production
  • Sub-millisecond latency: p99 < 1ms

Requirements

  • Hit rate: >95% (misses hit database)
  • Availability: 99.99% โ€” cache down = database overwhelmed
  • Eviction: graceful under memory pressure (no OOM crashes)
  • Consistency: stale data bounded (configurable TTL)

A distributed cache is infrastructure that OTHER systems depend on. When your cache goes down, it doesn't just affect one feature โ€” it cascades. Every service that relied on cached responses now hits the database directly. 10M req/sec that were served from cache now flood the DB. The database can't handle 100x its normal load. Result: cache failure = system-wide outage. This is the "thundering herd" or "cache stampede" problem โ€” and it makes cache availability more critical than almost any other component.

๐Ÿ“‹ Chapter 1 โ€” Summary
  • 10M requests/sec, 50TB+ cached data, 1000+ nodes. Sub-millisecond latency.
  • Cache failure cascades: 95% of traffic that was cached now hits DB โ†’ DB melts.
  • Hit rate >95%. Stale data bounded by TTL. Graceful eviction under memory pressure.
  • Cache is infrastructure โ€” its failure is everyone's failure.
02
Chapter Two

Questions to Ask

Clarifying Before Designing
๐Ÿ“ฆ

Data Model

  • Key-value only? Or complex structures (lists, sets, hashes)?
  • Average value size? (1KB session vs 1MB page fragment)
  • Hot key distribution? (Zipfian or uniform?)
  • TTL per key or global policy?
๐Ÿ”„

Consistency Model

  • Cache-aside, read-through, or write-through?
  • How stale can data be? (seconds, minutes, never?)
  • Active invalidation or TTL-based expiry?
  • Multi-region consistency needed?
๐Ÿ›ก๏ธ

Failure Handling

  • What happens on cache miss? (fallback to DB)
  • Cache node failure: how fast must recovery be?
  • Cold start: how to warm cache after restart?
  • Memory exhaustion: evict or reject new writes?

The cache pattern (aside vs through) determines who owns the write path. Cache-aside: application reads from cache, on miss reads DB and populates cache. Application owns logic. Read-through: cache itself fetches from DB on miss. Write-through: writes go to cache AND DB simultaneously. Write-behind: writes go to cache, async flush to DB. Each trades complexity, consistency, and performance differently.

For This Case Study, Our Answers Are:

  • Data model: key-value strings + sorted sets (for leaderboards/rate limiting)
  • Average value size: ~1-10KB (user sessions, API responses, rendered page fragments)
  • Key distribution: Zipfian โ€” top 1% of keys receive ~50% of traffic (hot key problem expected)
  • Cache pattern: cache-aside (application controls logic)
  • Staleness tolerance: up to 60 seconds for most data; 5 seconds for inventory/pricing
  • Invalidation: TTL-based expiry + active invalidation on writes
  • Multi-region: yes โ€” separate cache cluster per region, no cross-region consistency
  • Node failure recovery: replica promotion within 30 seconds
  • Cold start: gradual traffic rollout (1% โ†’ 10% โ†’ 100%) + pre-warming from DB snapshot
  • Eviction policy: LRU (Least Recently Used) โ€” evict cold data under memory pressure
๐Ÿ“‹ Chapter 2 โ€” Summary
  • Cache pattern choice (aside/through) determines application complexity and consistency.
  • Hot keys (Zipfian) need special handling โ€” one popular key can overload a single node.
  • TTL sets the staleness budget. Too short = low hit rate. Too long = stale data.
  • Cold start and node failure recovery are operational challenges, not just design ones.
03
Chapter Three

Naive Design

Single Redis Instance

The simplest design: one Redis server. All application servers read/write to this single instance. Fast (in-memory), simple (one endpoint), and effective up to ~100K requests/sec and ~25GB of data. Beyond that: one server's memory is the ceiling, one server's CPU is the bottleneck, and one server's failure means 100% cache miss rate โ€” every request hits the database simultaneously.

Naive Design โ€” Single Redis Instance
App Server 1 App Server 2 App Server N Redis Single Instance Max ~25GB RAM ~100K ops/sec Database miss fallback Server failure โ†’ 0% hit rate โ†’ all 10M req/sec hit DB simultaneously โ†’ DB melts Memory ceiling: single server caps at ~25-100GB. 50TB dataset impossible. CPU ceiling: single-threaded Redis maxes at ~100K complex ops/sec. 10M req/sec impossible. ALL app servers โ†’ ONE Redis = single point of failure for entire system
โœ…

What Works

  • Simple โ€” one server, no partitioning logic
  • Atomic operations (INCR, LPUSH) without distributed coordination
  • Sub-millisecond reads (100K+ ops/sec single-threaded)
  • Great for <25GB datasets and moderate traffic
๐Ÿ’ฅ

What Breaks

  • Single server memory limit (~25-100GB max practical)
  • Single point of failure: server dies = all cache gone
  • Thundering herd: all misses hit DB simultaneously
  • CPU bound: single-threaded Redis maxes at ~100K complex ops/sec
  • No horizontal scaling โ€” only vertical
๐Ÿ“‹ Chapter 3 โ€” Summary
  • Single Redis: simple, fast, limited. Ceiling at ~25GB and ~100K ops/sec.
  • SPOF: server failure = 100% miss rate = database flood.
  • No horizontal scaling. Only solution to more data/traffic is sharding.
04
Chapter Four

Refined Design

Consistent Hashing with Replication

The refined design distributes keys across multiple cache nodes using consistent hashing. Each key maps to a position on a hash ring, and the node responsible for that range serves it. When a node fails, only its keys are redistributed โ€” not the entire cache. Replication (each key stored on 2-3 nodes) ensures that a single node failure doesn't cause a miss storm. A client-side library handles routing, failover, and connection pooling transparently.

Refined Design โ€” Distributed Cache Cluster
Hash Ring N1a N2a N3a N1b N2b N3b Virtual nodes โ†’ even distribution App Server 1 App Server N Cache Client consistent hash ring routing + failover Node 1 slots 0โ€“5460 Node 2 slots 5461โ€“10922 Node 3 slots 10923โ€“16383 Node 1' (replica) Node 2' (replica) Node 3' (replica) async replication failover reads Database on miss populate Node added: only K/N keys migrate 4th node โ†’ 25% keys reassign Modulo: add 1 node โ†’ 75%+ move Client hashes key โ†’ routes to correct node. Miss โ†’ DB โ†’ populate cache. Node failure: replica promotes. Only 1/N keys affected (not all). Consistent hashing + virtual nodes: adding/removing nodes moves only K/N keys.
๐Ÿ”—

Key Distribution

  • Consistent hashing: key โ†’ hash โ†’ ring position โ†’ owning node
  • Virtual nodes (vnodes): 100-200 per physical node for even distribution
  • Adding a node: only K/N keys migrate (not all keys reshuffled)
  • Removing a node: its keys taken by next node on ring
๐Ÿ›ก๏ธ

Failure Handling

  • Replication factor 2-3: each key on multiple nodes
  • Node failure: client routes to replica automatically
  • No thundering herd: keys still served from surviving replicas
  • Recovery: new node joins, pulls data from replicas

Consistent hashing with virtual nodes is the foundation of every production distributed cache. Without it, adding or removing a node invalidates all keys (modulo-based hashing). With consistent hashing, only 1/N keys are affected. Virtual nodes (multiple positions per physical node) prevent hotspots from uneven distribution. This is why Memcached, Redis Cluster, and DynamoDB all use variations of consistent hashing.

Consistent Hashing: Adding a Node Moves Only K/N Keys
Modulo Hashing (3 nodes) N1 N2 N3 k1 k2 k3 k4 k5 k6 k7 k8 k9 k10 k11 k12 Add Node 4 โ†’ modulo changes 9/12 keys (75%) reassigned! Consistent Hash (4th node added) N1 N2 N3 N4 NEW k1โ†’N4 k2 k3 k4 k5โ†’N4 k6 k7 k8 k9โ†’N4 k10 k11 k12 Only 3/12 keys (25%) moved to N4 9 keys stay with original node โœ“ Modulo: 75% keys move | Consistent hash: only 25% move (K/N) At 50TB and 1000 nodes, this difference = hours of rebalancing saved.
๐Ÿ“‹ Chapter 4 โ€” Summary
  • Consistent hashing: key โ†’ ring position โ†’ node. Minimal redistribution on changes.
  • Virtual nodes: 100+ positions per physical node. Even key distribution.
  • Replication: each key on 2-3 nodes. Node failure doesn't cause miss storm.
  • Client-side routing: intelligent library handles hash, failover, connection pooling.
05
Chapter Five

Alternative Approaches

Cache Strategies Compared
Cache-Aside (Lazy Loading)
Read-Through / Write-Through
  • Application checks cache โ†’ miss โ†’ reads DB โ†’ writes to cache
  • Application owns all logic โ€” cache is passive storage
  • Only requested data is cached (no wasted memory)
  • Cache miss = extra latency (DB round-trip + cache write)
  • Stale data possible: DB updated but cache not invalidated
  • Used by: most web applications, Memcached patterns
  • Cache sits between app and DB. Reads/writes go through cache.
  • Read-through: cache fetches from DB on miss automatically
  • Write-through: write to cache + DB synchronously (consistent)
  • Simpler application code โ€” cache handles DB interaction
  • Write-through: higher write latency (2 writes per operation)
  • Used by: DynamoDB DAX, Hibernate L2 cache
Cache Pattern Flows โ€” Read Path Comparison
Cache-Aside (Lazy Loading): App Cache GET key HIT Return โœ“ MISS App โ†’ DB App โ†’ Cache: SET Return App owns all logic Read-Through: App Cache GET key HIT Return โœ“ MISS Cache โ†’ DB (internal) Cache stores + returns Cache owns DB fetch Write-Through: App Cache SET key DB sync write Both updated โ†’ Return โœ“ Cache + DB always in sync Write-Behind (async): App Cache SET key Return immediately โœ“ async โ†’ DB (later) โš  node fails before flush = DATA LOSS Fast but risky
Write-Behind (Async Write)
Refresh-Ahead (Proactive)
  • Write to cache immediately, async flush to DB in background
  • Very fast writes โ€” user gets response before DB write
  • Risk: cache failure before DB write = data loss
  • Batching: multiple writes aggregated into one DB write
  • Good for: write-heavy with acceptable data loss risk (counters, analytics)
  • Proactively refresh cache entries before they expire
  • Monitor access patterns: if key accessed frequently, refresh early
  • No cache miss latency for popular keys (always fresh)
  • Wastes resources on keys that won't be read again
  • Good for: predictable access patterns, low-staleness requirements

Write-behind is the only cache pattern where cache failure causes data loss. In cache-aside, read-through, and write-through, the database is always up-to-date or can be recovered from. In write-behind, writes go to cache first โ€” if the cache node fails before flushing to the database, those writes are gone. This makes write-behind appropriate only for truly expendable data: view counts, analytics counters, rate limit buckets. Never use write-behind for anything that would cause a business problem if lost.

๐Ÿ“‹ Chapter 5 โ€” Summary
  • Cache-aside: app controls logic. Lazy. Only caches what's needed.
  • Read/write-through: cache handles DB. Simpler app code. Higher consistency.
  • Write-behind: fast writes, data loss risk on node failure. Only for expendable data (counters, analytics). Never for transactional data.
  • Refresh-ahead: proactive refresh. No miss latency for hot keys.
06
Chapter Six

What Real Companies Did

Production Cache Systems
๐Ÿ”ด

Redis (OSS + Enterprise)

  • Redis Cluster: hash slots (16384) distributed across nodes
  • Single-threaded per shard โ€” predictable latency
  • Rich data structures: strings, lists, sets, sorted sets, streams
  • Redis Sentinel for HA failover (replica promotion)
  • Used by: Twitter, GitHub, Stack Overflow, Instagram
๐ŸŸข

Memcached

  • Pure key-value, multi-threaded โ€” high throughput per node
  • No built-in clustering โ€” client-side consistent hashing
  • Slab allocator: efficient memory management, no fragmentation
  • No persistence โ€” pure volatile cache
  • Used by: Facebook (mcrouter), Wikipedia, YouTube (early)
๐Ÿ“˜

Facebook TAO + Memcache

  • mcrouter: custom proxy for billion-request-scale Memcached
  • Lease mechanism: prevents thundering herd on hot keys
  • Regional pools: cache per data center, invalidation via message bus
  • TAO: social graph cache (objects + associations)
  • Handles 10B+ cache requests/sec across fleet
โ˜๏ธ

Amazon ElastiCache / DAX

  • ElastiCache: managed Redis or Memcached clusters
  • DAX: DynamoDB accelerator (read-through cache, microsecond reads)
  • Auto-scaling, automatic failover, encryption at rest
  • Multi-AZ replication for high availability
  • Used by: most AWS-native applications
Production Cache Systems โ€” Comparison
System Architecture Data Structures Clustering Special Pattern Redis Cluster Hash slots (16384) across nodes Strings, lists, sets, sorted sets, streams Built-in cluster + Sentinel HA Rich data structures, pub/sub, Lua scripting Memcached Client-side consistent hashing String only (binary) No built-in โ€” client handles it Multi-threaded, slab allocator, mcrouter Facebook TAO Custom graph- aware cache Objects + associations (graph edges) Regional pools per DC, msg bus invalidation Lease mechanism prevents thundering herd Amazon DAX Read-through for DynamoDB DynamoDB item format Managed multi-AZ cluster Microsecond reads, transparent to app
๐Ÿ“‹ Chapter 6 โ€” Summary
  • Redis Cluster: hash slots, rich data structures, Sentinel for HA.
  • Memcached: pure KV, multi-threaded, client-side hashing. Facebook at billion-scale.
  • Facebook: mcrouter + lease mechanism + regional pools. 10B+ req/sec.
  • AWS DAX: read-through cache for DynamoDB. Microsecond reads.
07
Chapter Seven

Best Practices Extracted

Transferable Lessons
๐Ÿ”‘

Cache Key Design

  • Include version in key: user:v2:123
  • Namespace by service: auth:session:abc
  • Avoid hot keys: shard popular keys across multiple entries
  • Key length matters: shorter keys = less memory overhead
  • Transfers to: any key-value storage design
โšก

Thundering Herd Prevention

  • Lock on miss: only 1 request fetches from DB, others wait
  • Stale-while-revalidate: serve stale, refresh in background
  • Probabilistic early expiry: random jitter prevents synchronized expiry
  • Lease mechanism: Facebook's approach โ€” token prevents stampede
  • Transfers to: any system with expensive fallback on miss
๐Ÿ“Š

Cache Warming

  • Pre-populate cache before traffic arrives (deploy, restart)
  • Shadow traffic: replay production reads against new cache
  • Gradual rollout: route 1% โ†’ 10% โ†’ 50% โ†’ 100% to new cache
  • Never go cold-to-hot instantly โ€” database can't absorb the surge
  • Transfers to: any system with hot/cold state transition
Thundering Herd Prevention โ€” Lock-on-Miss Pattern
Without Lock (stampede) Key expires at t=0 10K requests all miss DB 10K queries OVERLOAD! DB overwhelmed โ†’ cascade failure With Lock-on-Miss Key expires at t=0 Req #1 lock DB 1 query Reqs #2-10,000 โ†’ wait (lock held) โณ queued behind lock write to cache + release lock All 10K requests โ†’ cache HIT โœ“ 1 DB query serves 10,000 requests โœ“ DB load: 1 query instead of 10,000
๐Ÿ“‹ Chapter 7 โ€” Summary
  • Key design: versioned, namespaced, short. Avoid hot keys via sharding.
  • Thundering herd: lock on miss, stale-while-revalidate, probabilistic expiry.
  • Cache warming: never go cold-to-hot instantly. Gradual rollout + pre-population.
  • Negative caching: cache "not found" results with short TTL (30s). Prevents penetration attacks from bypassing cache entirely.
08
Chapter Eight

What Could Go Wrong

Common Failure Patterns
๐Ÿ˜

Cache Stampede

  • Popular key expires โ†’ 10K simultaneous requests all miss
  • All 10K requests hit DB for same data simultaneously
  • DB overwhelmed โ†’ slow response โ†’ more timeouts โ†’ cascade
  • Fix: lock on miss (one fetches, rest wait), never-expire + background refresh, probabilistic early recompute.
Cache Stampede vs Probabilistic Early Expiry
Time (seconds) โ†’ DB queries/sec t=0 t=30 t=60 t=90 DB capacity 10K queries (all expire at t=60) Synchronized TTL=60s (stampede) โ†‘ Probabilistic expiry TTL = 60 ยฑ random(0-10s) spread over time, under capacity โœ“
๐Ÿ’€

Cache Penetration

  • Requests for keys that don't exist in DB either
  • Every request misses cache AND misses DB โ€” no data to cache
  • Attacker can exploit: request random non-existent IDs โ†’ bypass cache
  • Fix: cache negative results (empty with short TTL), Bloom filter to reject impossible keys before DB query.
Cache Penetration: Bloom Filter as Pre-Query Guard
Without Bloom Filter: GET /user/999999 Cache MISS DB MISS too Nothing to cache โ†’ next request does same thing 1M fake IDs ร— N requests = DB melted With Bloom Filter: GET /user/999999 ๐Ÿ›ก๏ธ Bloom Filter Definitely NOT in DB Return 404 โœ“ No cache lookup, no DB query possibly valid Cache DB normal path Bloom filter: O(1) check, ~1% false positive rate. 1M fake IDs blocked instantly. DB untouched.
โ„๏ธ

Cold Start Avalanche

  • Cache cluster restarted (deploy, failure) โ†’ 0% hit rate
  • 100% of traffic hits DB โ†’ DB can't handle 100x normal load
  • DB goes down โ†’ entire system offline
  • Fix: rolling restart (one node at a time), cache warming from snapshot, gradual traffic shift to new cluster.
๐Ÿ”ฅ

Hot Key Problem

  • One key (celebrity post, viral product) gets 100K req/sec
  • Single cache node serving that key becomes CPU/network bottleneck
  • Other keys on same node get slow responses (noisy neighbor)
  • Fix: replicate hot keys to multiple nodes, client-side caching for ultra-hot keys, key splitting (key_1, key_2... round-robin).

The most insidious failure is serving stale data silently. Cache stampede and cold start are visible (latency spikes, errors). But serving stale data โ€” outdated prices, old permissions, deleted content โ€” causes silent corruption. Users see wrong information, make decisions on stale data, and nobody gets alerted because the system appears healthy. Monitor staleness: track cache lag vs source-of-truth age.

๐Ÿ“‹ Chapter 8 โ€” Summary
  • Stampede: lock on miss + background refresh. Never let popular keys expire synchronously.
  • Penetration: cache negatives + Bloom filter. Block impossible keys before DB.
  • Cold start: rolling restart + warming + gradual traffic shift. Never all-at-once.
  • Hot key: replicate to multiple nodes or split key. No single node for viral content.