System Design Β· Case Studies

Case Study: Cloud Storage

Design, trade-offs, and alternatives for a cloud storage service at scale.

01
Chapter One

Problem Statement

What We Are Building

A cloud storage service (like S3, Google Cloud Storage, or Azure Blob) stores arbitrary objects (files) durably and makes them accessible via HTTP from anywhere in the world. The defining challenge is durability at exabyte scale: you are promising customers that their data will never be lost β€” 99.999999999% (11 nines) durability means losing at most 1 object per 100 billion stored per year. Simultaneously, you must serve millions of reads/sec and handle objects from 1 byte to 5 TB, all with consistent performance.

Scale Requirements

Traffic & Scale

  • 100+ exabytes total stored data
  • 100M+ requests/sec (reads + writes combined)
  • Trillions of objects stored
  • Objects from 1 byte to 5 TB each

Requirements

  • Durability: 99.999999999% (11 nines)
  • Availability: 99.99% for reads, 99.9% for writes
  • Latency: first byte <100ms (standard), seconds (archival)
  • Consistency: strong read-after-write (S3 since 2020)

11 nines durability means data replication is not optional β€” it is the entire system. A single disk fails every ~3 years. A rack loses power. A data center floods. To achieve 11 nines, you must replicate data across multiple disks, multiple racks, and multiple geographic regions β€” such that simultaneous failure of any 2 replicas still leaves the data recoverable. The system is fundamentally a replication and consistency engine that happens to store files.

πŸ“‹ Chapter 1 β€” Summary
  • 100+ exabytes, trillions of objects, 100M+ requests/sec.
  • 11 nines durability: lose at most 1 object per 100B per year.
  • Objects from 1 byte to 5 TB. Strong read-after-write consistency.
  • The system is primarily a replication engine that guarantees data survival.
02
Chapter Two

Questions to Ask

Clarifying Before Designing
πŸ“¦

Object Model

  • Flat namespace (bucket/key) or hierarchical (directories)?
  • Maximum object size? Multi-part upload needed?
  • Versioning (keep all versions of an object)?
  • Object metadata (custom headers, tags)?
πŸ”’

Access & Security

  • Public vs private access control?
  • Pre-signed URLs (time-limited access)?
  • Encryption at rest and in transit?
  • Cross-account access (IAM policies)?
πŸ’°

Storage Tiers

  • Hot (frequent access) vs cold (archival)?
  • Lifecycle rules (auto-transition after 30 days)?
  • Different durability SLAs per tier?
  • Retrieval latency tolerance (ms vs hours for glacier)?

For This Case Study, Our Answers Are:

  • Object model: flat namespace (bucket + key), no true directories
  • Maximum object size: 5 TB (requires multi-part upload above 5 GB)
  • Versioning: yes β€” enabled per bucket, soft-delete on overwrite
  • Access control: private by default; bucket policies + pre-signed URLs for time-limited public access
  • Encryption: server-side encryption at rest (AES-256); TLS in transit
  • Storage tiers: Hot, Warm (Infrequent Access), Cold, Archive β€” lifecycle rules for automatic transition
  • Freshness: strong read-after-write consistency (not eventual)
  • Durability target: 11 nines (99.999999999%)
  • Multi-region: cross-region replication optional (async, for DR)
  • Retrieval latency: Hot = <100ms first byte; Archive = 1-12 hours

Storage tiers are the primary cost optimization lever. Hot storage (SSD-backed, instant access) costs ~$0.023/GB/month. Archive (tape-backed, hours to retrieve) costs ~$0.004/GB/month β€” a 6x difference. At exabyte scale, this is billions of dollars/year in savings. Lifecycle policies that automatically transition objects from hot to cold based on access patterns are not a feature β€” they are the business model.

πŸ“‹ Chapter 2 β€” Summary
  • Flat namespace (bucket/key) is simpler and more scalable than hierarchical.
  • Multi-part upload required for large objects (5TB max).
  • Storage tiers: hot/warm/cold/archive. Lifecycle rules for automatic transition.
  • Encryption at rest (server-side) is standard. Customer-managed keys for compliance.
03
Chapter Three

Naive Design

Single NFS Server

The simplest design: a single Network File System (NFS) server with a RAID array. Clients mount the share, read/write files directly. RAID provides redundancy against single-disk failure. Works for a team's shared drive. At scale: single server = single point of failure, RAID rebuild time at TB scale takes days (another failure during rebuild = data loss), no geographic replication, and NFS doesn't scale beyond a few hundred concurrent clients.

Naive Design β€” Single NFS Server with RAID
Client 1 Client 2 Client N NFS Server Single mount point SPOF RAID Array FAIL Rebuild: 3+ days at TB scale Server failure β†’ all clients lose access instantly. No geographic redundancy. RAID rebuild at TB scale takes 3+ days. Second disk fails during rebuild β†’ permanent data loss. NFS metadata bottleneck: 100s of concurrent clients max. At 100M req/sec β†’ impossible. No versioning, no lifecycle tiers, no geographic replication, no fine-grained access control.
βœ…

What Works

  • Simple β€” standard NFS, POSIX filesystem semantics
  • RAID protects against single disk failure
  • Low latency (local network)
  • Fine for <10TB, <100 users
πŸ’₯

What Breaks

  • SPOF: server failure = all data inaccessible
  • RAID rebuild at TB scale: days. Second failure = data loss.
  • No geographic replication: data center fire = permanent loss
  • NFS doesn't scale: 100s of clients max, metadata bottleneck
  • No object versioning, no lifecycle, no access control granularity
πŸ“‹ Chapter 3 β€” Summary
  • NFS + RAID: works for small teams. SPOF, no geo-replication, no scale.
  • RAID rebuild vulnerability: during multi-day rebuild, a second failure = data loss.
  • No storage tiers, no versioning, no fine-grained access control.
04
Chapter Four

Refined Design

Distributed Object Store with Erasure Coding

The refined design separates metadata (where is the object?) from data (the object bytes). A metadata service maps bucket+key to the physical locations of data chunks. The data itself is split into chunks, erasure-coded (e.g., Reed-Solomon 6+3: any 6 of 9 chunks can reconstruct the object), and distributed across multiple racks and availability zones. Reads go to the nearest available chunk. Writes succeed only after a quorum of chunks are durably stored.

Refined Design β€” Distributed Object Store
Client API Gateway auth + routing β‘  Metadata Service key β†’ chunk locations β‘‘ Metadata DB strongly consistent KV durable chunk locs Availability Zone A Availability Zone B Availability Zone C Node 1 data: 1, 2 parity: 7 3 chunks Node 2 data: 3, 4 parity: 8 3 chunks Node 3 data: 5, 6 parity: 9 3 chunks β‘’ write chunks β‘£ quorum ACK β‘€ confirmed read: 6 of 9 chunks (min needed) 6 data chunks + 3 parity chunks = 9 total. Any 6 β†’ full recovery. Write: β‘  Client β†’ β‘‘ Metadata (allocate) β†’ β‘’ Write chunks (parallel) β†’ β‘£ Quorum ACK β†’ β‘€ Confirmed Read: Metadata (find chunks) β†’ read 6 of 9 from nearest nodes β†’ reconstruct β†’ stream to client Erasure coding: 1.5x storage overhead (vs 3x for triple replication) for same 11-nines durability
✍️

Write Path

  • 1. Client uploads object via API gateway
  • 2. Metadata service allocates chunk locations (across AZs)
  • 3. Object split into data chunks + parity chunks (erasure coding)
  • 4. Chunks written to data nodes across racks/AZs
  • 5. Quorum acknowledged β†’ write confirmed to client
  • 6. Metadata updated: key β†’ [chunk_locations]
πŸ“–

Read Path

  • 1. Client requests object by bucket + key
  • 2. Metadata service returns chunk locations
  • 3. Read minimum chunks needed (6 of 9) from nearest nodes
  • 4. Reconstruct full object from chunks
  • 5. Stream to client (chunked transfer for large objects)
  • If a node is down: read from alternative chunks (self-healing)
Multi-Part Upload β€” Handling Large Objects
Client splits 5TB file into 5MB parts Initiate upload β†’ get upload_id Parallel Part Uploads Part 1 (5MB) β†’ ETag: abc... Part 2 (5MB) β†’ ETag: def... Part 3 (5MB) β†’ ETag: ghi... Part 4 β†’ ETag: jkl... ... Part N β†’ ETag: xyz... Complete Upload (upload_id + ETags in order) Server assembles β†’ erasure codes full object β†’ available for reads Resumable: if Part 7 fails, retry only Part 7. Don't restart from beginning. Required for objects over 5 GB. Supports up to 10,000 parts.

Erasure coding vs triple replication is the fundamental storage efficiency decision. Triple replication (3 copies): simple, fast reads, 3x storage cost. Erasure coding (6+3 Reed-Solomon): 1.5x storage cost for same durability, but higher compute on read (reconstruction) and write (parity calculation). At exabyte scale, the 2x storage savings from erasure coding saves billions of dollars. This is why S3, GCS, and Azure all use erasure coding for their standard tier.

Erasure Coding vs Triple Replication β€” Storage Overhead Comparison
Triple Replication Original: 6 GB Copy 1: 6 GB β†’ Node A Copy 2: 6 GB β†’ Node B Copy 3: 6 GB β†’ Node C Total: 18 GB (3Γ— overhead) Survives: 2 node failures βœ… Fast reads (no reconstruction) ❌ 3Γ— storage cost Erasure Coding (6+3) Original: 6 GB β†’ split into 9 chunks D1 1GB D2 1GB D3 1GB D4 1GB D5 1GB D6 1GB P7 1GB P8 1GB P9 1GB ← data ← parity Total: 9 GB (1.5Γ— overhead) Survives: 3 node failures βœ… 50% less storage cost ⚠️ Higher compute on read Same 11-nines durability. 50% less storage cost. Billions saved at exabyte scale. Production standard: erasure coding for warm/cold data, triple replication for hot metadata.
πŸ“‹ Chapter 4 β€” Summary
  • Metadata service: maps key β†’ chunk locations. Backed by strongly consistent KV store.
  • Erasure coding (6+3): 1.5x overhead for 11 nines durability. Any 6 of 9 chunks recover data.
  • Multi-AZ writes: chunks distributed across availability zones. AZ failure = no data loss.
  • Quorum writes: confirmed only after sufficient chunks stored durably.
  • Multi-part upload: split large objects into 5MB chunks for parallel, resumable uploads.
05
Chapter Five

Alternative Approaches

Replication & Storage Strategies
Triple Replication
Erasure Coding
  • Store 3 complete copies of every object
  • Simple: any copy serves reads independently
  • Fast reads: no reconstruction needed
  • 3x storage cost β€” expensive at scale
  • Good for: hot data (frequent reads), small objects
  • Used by: HDFS (default), GFS (original Google)
  • Split into K data + M parity chunks. Any K chunks recover.
  • Storage overhead: (K+M)/K (e.g., 6+3 = 1.5x)
  • CPU cost on write (encoding) and degraded reads (reconstruction)
  • Same durability as replication with 50% less storage
  • Good for: warm/cold data, large objects
  • Used by: S3, Azure Blob, Google Cloud Storage
Replication vs Erasure Coding β€” Trade-off Space
Read Performance β†’ Storage Efficiency β†’ Slow Fast Low High Triple Replication Fast reads, 3Γ— overhead EC 6+3 Reconstruct on degraded, 1.5Γ— β˜… Production standard EC 12+4 More compute, 1.33Γ— efficiency–performance frontier
Object Store (Flat Namespace)
Distributed File System (Hierarchical)
  • Flat: bucket + key. No directories, no renames, no POSIX.
  • Immutable objects: write once, delete, or overwrite completely
  • Immutability enables parallel reads without locking β€” no reader needs to worry about the object changing mid-read.
  • Massively scalable: no directory metadata bottleneck
  • HTTP API: GET/PUT/DELETE (stateless, cacheable)
  • Good for: cloud-native, web content, backups, data lakes
  • Used by: S3, GCS, Azure Blob, MinIO
  • Hierarchical: directories, files, POSIX semantics
  • Mutable: append, seek, partial overwrite
  • Directory listing = metadata scan (bottleneck at scale)
  • Rename is atomic β€” important for many applications
  • Good for: HPC, ML training (random access), legacy workloads
  • Used by: HDFS, Lustre, Azure Data Lake Gen2

The flat namespace of object stores is a deliberate scalability choice, not a limitation. Hierarchical file systems have a fundamental bottleneck: directory operations (rename, list, move) require locking the directory metadata β€” at petabyte scale this becomes a serialized queue. Object stores eliminate this by making every object independently addressable by its full key. "Directories" in S3 are just key prefixes β€” there is no directory inode to lock. This is why S3 can handle trillions of objects with no metadata bottleneck, while HDFS struggles above a few billion files.

πŸ“‹ Chapter 5 β€” Summary
  • Triple replication: simple, fast reads, 3x cost. Good for hot/small objects.
  • Erasure coding: 1.5x cost, same durability. Production standard for cloud storage.
  • Object store: flat namespace, immutable, HTTP API. Infinitely scalable.
  • Distributed FS: hierarchical, mutable, POSIX. Limited by metadata bottleneck.
06
Chapter Six

What Real Companies Did

Production Cloud Storage Systems
☁️

Amazon S3

  • Launched 2006. 100+ exabytes stored. Trillions of objects.
  • 11 nines durability via erasure coding across 3+ AZs
  • Strong consistency (read-after-write) since Dec 2020
  • Storage classes: Standard, IA, Glacier, Deep Archive
  • ShardStore: custom internal storage engine (replaced ext4)
πŸ”΅

Google Cloud Storage

  • Built on Colossus (successor to Google File System)
  • Erasure coding: Reed-Solomon across multiple data centers
  • Dual-region: synchronous replication to 2 regions
  • Turbo replication: 15-minute RPO for selected objects
  • Integrated with BigQuery for direct analytics on stored data
🟦

Azure Blob Storage

  • LRS (3 copies), ZRS (3 AZs), GRS (cross-region)
  • Block blobs up to 190 TB each (largest in cloud)
  • Tiering: Hot, Cool, Cold, Archive (automated lifecycle)
  • Immutable storage: WORM compliance (legal hold)
  • Data Lake Gen2: hierarchical namespace on blob storage
🟠

MinIO (Open Source)

  • S3-compatible API: drop-in replacement for self-hosted
  • Erasure coding: configurable parity per deployment
  • Written in Go β€” single binary, no dependencies
  • Kubernetes-native: StatefulSets for persistent volumes
  • Used by: on-premise object storage, hybrid cloud
Cloud Storage Providers β€” Comparison
Provider Durability Erasure Coding Storage Classes Special Pattern Amazon S3 11 nines Yes β€” 3+ AZs Standard, IA, Glacier, Deep ShardStore, strong consistency Google GCS 11 nines Reed-Solomon Standard, Near, Cold, Archive Colossus, Turbo 15-min RPO Azure Blob 11 nines (GRS) Yes (ZRS/GRS) Hot, Cool, Cold, Archive 190 TB/blob, WORM, DL Gen2 MinIO Configurable Yes β€” K+M config Single tier (self-managed) S3-compatible, K8s-native
πŸ“‹ Chapter 6 β€” Summary
  • S3: exabyte-scale, erasure coding, strong consistency, multiple storage classes.
  • GCS: Colossus-backed, dual-region sync replication, BigQuery integration.
  • Azure: up to 190 TB per blob, WORM compliance, hierarchical namespace option.
  • MinIO: S3-compatible open-source, self-hosted, Kubernetes-native.
07
Chapter Seven

Best Practices Extracted

Transferable Lessons
πŸ”€

Separate Metadata from Data

  • Metadata: small, frequently accessed, needs strong consistency
  • Data: large, write-once, needs durability and throughput
  • Different storage engines for each (KV store vs blob store)
  • Metadata is the bottleneck β€” optimize it independently
  • Transfers to: any large-object storage system
πŸ›‘οΈ

Defense in Depth (Durability)

  • Layer 1: Checksums detect silent corruption (bit rot)
  • Layer 2: Erasure coding within a data center
  • Layer 3: Replication across availability zones
  • Layer 4: Cross-region backup for disaster recovery
  • Transfers to: any system with durability guarantees
πŸ’°

Lifecycle & Tiering

  • Automatically transition objects based on access patterns
  • 30 days no access β†’ warm. 90 days β†’ cold. 365 days β†’ archive.
  • Delete after N days (compliance retention policies)
  • Cost difference: 6x between hot and archive tiers
  • Transfers to: any system with aging data (logs, backups)
Durability Defense in Depth β€” Four Layers
Layer 4: Cross-Region Backup Survives: entire region destroyed (flood, fire, catastrophic event) Layer 3: Multi-AZ Replication Survives: one availability zone goes down entirely Layer 2: Erasure Coding (within DC) Survives: up to 3 simultaneous node/rack failures Layer 1: Checksums (Bit Rot Detection) Detects: silent data corruption from disk degradation Write checksum with data β†’ verify on every read β†’ background scrubbing
πŸ“‹ Chapter 7 β€” Summary
  • Metadata separation: different engines for small metadata vs large data blobs.
  • Durability layers: checksums + erasure coding + multi-AZ + cross-region (4 nested layers).
  • Lifecycle tiering: automatic cost optimization based on access age. 6x savings.
  • Multi-part upload: split large objects into 5MB chunks. Each part uploaded independently. Resumable on failure. Required for objects over 5GB.
08
Chapter Eight

What Could Go Wrong

Common Failure Patterns
🦠

Silent Data Corruption (Bit Rot)

  • Disk sector degrades silently β€” no error reported
  • Read returns corrupted data that looks valid
  • Discovered months later when all copies have rotted
  • Fix: end-to-end checksums (write checksum with data, verify on every read), periodic background scrubbing (read and verify all chunks).
πŸ’Ύ

Metadata Service Overload

  • Listing millions of objects in one bucket β†’ metadata scan
  • One customer's LIST request overwhelms metadata service
  • Affects all customers sharing the same metadata partition
  • Fix: rate limiting per bucket, metadata sharded by key prefix, pagination enforced (max 1000 per LIST).
πŸ—‘οΈ

Accidental Deletion at Scale

  • Customer accidentally deletes entire bucket (script bug)
  • Deletion is immediate and permanent β€” no recycle bin
  • At exabyte scale: millions of objects gone in seconds
  • Fix: versioning (soft-delete), MFA delete for destructive ops, bucket lock policies, 24h delete delay for large operations.
🐌

Degraded Read Performance

  • One data node slow (disk degrading, network congestion)
  • Reads requiring that node's chunks are all slow
  • Customer sees intermittent latency spikes (random objects)
  • Fix: hedged reads (request from replica if primary slow), speculative execution, proactive disk replacement on latency increase.
Bit Rot Detection β€” Write Checksum, Verify on Read
Without Checksums β€” Silent Corruption t=0 Write OK t=180d Disk sector degrades silently corrupted t=365d Read returns corrupted data No warning. Data lost. With Checksums + Background Scrubbing β€” Detected & Healed t=0 Write + SHA-256 checksum stored t=180d Sector degrades t=181d Scrub detects mismatch! Heal from parity chunks t=365d Read returns correct data Customer unaffected βœ“
Object Lifecycle β€” Automatic Cost Optimization via Tier Transitions
Object lifespan β€” automatic tier transitions HOT $0.023/GB/mo Day 0-30 SSD, <100ms β†’ 30d rule WARM / IA $0.0125/GB/mo Day 30-120 HDD, ms access β†’ 90d rule COLD $0.004/GB/mo min-hrs retrieval β†’ 365d rule ARCHIVE $0.001/GB/mo Day 365+ hrs to retrieve 6Γ— cost savings from Hot to Archive. Lifecycle rules apply automatically β€” no manual intervention. Optional: auto-delete after retention period (e.g., 730 days for compliance). At exabyte scale: tier transitions save billions of dollars per year.

Bit rot is the silent killer that makes checksums non-negotiable. Unlike a disk failure (which is visible), bit rot silently corrupts stored data over time. If you don't checksum on write and verify on read, you discover corruption only when someone needs the data β€” possibly years later, after all copies have degraded. Background scrubbing (reading and verifying every chunk periodically) is the only defense against this invisible threat.

πŸ“‹ Chapter 8 β€” Summary
  • Bit rot: end-to-end checksums + periodic scrubbing. Silent corruption = worst failure mode.
  • Metadata overload: rate limit per bucket, shard by prefix, enforce pagination.
  • Accidental deletion: versioning, MFA delete, delete delay for bulk ops.
  • Degraded reads: hedged requests to replicas, proactive disk replacement.