System Design · Case Studies

Case Study: Cloud Storage

Design, trade-offs, and alternatives for a cloud storage service at scale.

Chapter One

Problem Statement

What We Are Building

A cloud storage service (like S3, Google Cloud Storage, or Azure Blob) stores arbitrary objects (files) durably and makes them accessible via HTTP from anywhere in the world. The defining challenge is durability at exabyte scale: you are promising customers that their data will never be lost — 99.999999999% (11 nines) durability means losing at most 1 object per 100 billion stored per year. Simultaneously, you must serve millions of reads/sec and handle objects from 1 byte to 5 TB, all with consistent performance.

Scale Requirements

Traffic & Scale

100+ exabytes total stored data
100M+ requests/sec (reads + writes combined)
Trillions of objects stored
Objects from 1 byte to 5 TB each

Requirements

Durability: 99.999999999% (11 nines)
Availability: 99.99% for reads, 99.9% for writes
Latency: first byte <100ms (standard), seconds (archival)
Consistency: strong read-after-write (S3 since 2020)

11 nines durability means data replication is not optional — it is the entire system. A single disk fails every ~3 years. A rack loses power. A data center floods. To achieve 11 nines, you must replicate data across multiple disks, multiple racks, and multiple geographic regions — such that simultaneous failure of any 2 replicas still leaves the data recoverable. The system is fundamentally a replication and consistency engine that happens to store files.

📋 Chapter 1 — Summary

100+ exabytes, trillions of objects, 100M+ requests/sec.
11 nines durability: lose at most 1 object per 100B per year.
Objects from 1 byte to 5 TB. Strong read-after-write consistency.
The system is primarily a replication engine that guarantees data survival.

Chapter Two

Questions to Ask

Clarifying Before Designing

📦

Object Model

Flat namespace (bucket/key) or hierarchical (directories)?
Maximum object size? Multi-part upload needed?
Versioning (keep all versions of an object)?
Object metadata (custom headers, tags)?

🔒

Access & Security

Public vs private access control?
Pre-signed URLs (time-limited access)?
Encryption at rest and in transit?
Cross-account access (IAM policies)?

💰

Storage Tiers

Hot (frequent access) vs cold (archival)?
Lifecycle rules (auto-transition after 30 days)?
Different durability SLAs per tier?
Retrieval latency tolerance (ms vs hours for glacier)?

For This Case Study, Our Answers Are:

Object model: flat namespace (bucket + key), no true directories
Maximum object size: 5 TB (requires multi-part upload above 5 GB)
Versioning: yes — enabled per bucket, soft-delete on overwrite
Access control: private by default; bucket policies + pre-signed URLs for time-limited public access
Encryption: server-side encryption at rest (AES-256); TLS in transit
Storage tiers: Hot, Warm (Infrequent Access), Cold, Archive — lifecycle rules for automatic transition
Freshness: strong read-after-write consistency (not eventual)
Durability target: 11 nines (99.999999999%)
Multi-region: cross-region replication optional (async, for DR)
Retrieval latency: Hot = <100ms first byte; Archive = 1-12 hours

Storage tiers are the primary cost optimization lever. Hot storage (SSD-backed, instant access) costs ~$0.023/GB/month. Archive (tape-backed, hours to retrieve) costs ~$0.004/GB/month — a 6x difference. At exabyte scale, this is billions of dollars/year in savings. Lifecycle policies that automatically transition objects from hot to cold based on access patterns are not a feature — they are the business model.

📋 Chapter 2 — Summary

Flat namespace (bucket/key) is simpler and more scalable than hierarchical.
Multi-part upload required for large objects (5TB max).
Storage tiers: hot/warm/cold/archive. Lifecycle rules for automatic transition.
Encryption at rest (server-side) is standard. Customer-managed keys for compliance.

Chapter Three

Naive Design

Single NFS Server

The simplest design: a single Network File System (NFS) server with a RAID array. Clients mount the share, read/write files directly. RAID provides redundancy against single-disk failure. Works for a team's shared drive. At scale: single server = single point of failure, RAID rebuild time at TB scale takes days (another failure during rebuild = data loss), no geographic replication, and NFS doesn't scale beyond a few hundred concurrent clients.

Naive Design — Single NFS Server with RAID

✅

What Works

Simple — standard NFS, POSIX filesystem semantics
RAID protects against single disk failure
Low latency (local network)
Fine for <10TB, <100 users

💥

What Breaks

SPOF: server failure = all data inaccessible
RAID rebuild at TB scale: days. Second failure = data loss.
No geographic replication: data center fire = permanent loss
NFS doesn't scale: 100s of clients max, metadata bottleneck
No object versioning, no lifecycle, no access control granularity

📋 Chapter 3 — Summary

NFS + RAID: works for small teams. SPOF, no geo-replication, no scale.
RAID rebuild vulnerability: during multi-day rebuild, a second failure = data loss.
No storage tiers, no versioning, no fine-grained access control.

Chapter Four

Refined Design

Distributed Object Store with Erasure Coding

The refined design separates metadata (where is the object?) from data (the object bytes). A metadata service maps bucket+key to the physical locations of data chunks. The data itself is split into chunks, erasure-coded (e.g., Reed-Solomon 6+3: any 6 of 9 chunks can reconstruct the object), and distributed across multiple racks and availability zones. Reads go to the nearest available chunk. Writes succeed only after a quorum of chunks are durably stored.

Refined Design — Distributed Object Store

✍️

Write Path

1. Client uploads object via API gateway
2. Metadata service allocates chunk locations (across AZs)
3. Object split into data chunks + parity chunks (erasure coding)
4. Chunks written to data nodes across racks/AZs
5. Quorum acknowledged → write confirmed to client
6. Metadata updated: key → [chunk_locations]

📖

Read Path

1. Client requests object by bucket + key
2. Metadata service returns chunk locations
3. Read minimum chunks needed (6 of 9) from nearest nodes
4. Reconstruct full object from chunks
5. Stream to client (chunked transfer for large objects)
If a node is down: read from alternative chunks (self-healing)

Multi-Part Upload — Handling Large Objects

Erasure coding vs triple replication is the fundamental storage efficiency decision. Triple replication (3 copies): simple, fast reads, 3x storage cost. Erasure coding (6+3 Reed-Solomon): 1.5x storage cost for same durability, but higher compute on read (reconstruction) and write (parity calculation). At exabyte scale, the 2x storage savings from erasure coding saves billions of dollars. This is why S3, GCS, and Azure all use erasure coding for their standard tier.

Erasure Coding vs Triple Replication — Storage Overhead Comparison

📋 Chapter 4 — Summary

Metadata service: maps key → chunk locations. Backed by strongly consistent KV store.
Erasure coding (6+3): 1.5x overhead for 11 nines durability. Any 6 of 9 chunks recover data.
Multi-AZ writes: chunks distributed across availability zones. AZ failure = no data loss.
Quorum writes: confirmed only after sufficient chunks stored durably.
Multi-part upload: split large objects into 5MB chunks for parallel, resumable uploads.

Chapter Five

Alternative Approaches

Replication & Storage Strategies

Triple Replication

Erasure Coding

Store 3 complete copies of every object
Simple: any copy serves reads independently
Fast reads: no reconstruction needed
3x storage cost — expensive at scale
Good for: hot data (frequent reads), small objects
Used by: HDFS (default), GFS (original Google)

Split into K data + M parity chunks. Any K chunks recover.
Storage overhead: (K+M)/K (e.g., 6+3 = 1.5x)
CPU cost on write (encoding) and degraded reads (reconstruction)
Same durability as replication with 50% less storage
Good for: warm/cold data, large objects
Used by: S3, Azure Blob, Google Cloud Storage

Replication vs Erasure Coding — Trade-off Space

Object Store (Flat Namespace)

Distributed File System (Hierarchical)

Flat: bucket + key. No directories, no renames, no POSIX.
Immutable objects: write once, delete, or overwrite completely
Immutability enables parallel reads without locking — no reader needs to worry about the object changing mid-read.
Massively scalable: no directory metadata bottleneck
HTTP API: GET/PUT/DELETE (stateless, cacheable)
Good for: cloud-native, web content, backups, data lakes
Used by: S3, GCS, Azure Blob, MinIO

Hierarchical: directories, files, POSIX semantics
Mutable: append, seek, partial overwrite
Directory listing = metadata scan (bottleneck at scale)
Rename is atomic — important for many applications
Good for: HPC, ML training (random access), legacy workloads
Used by: HDFS, Lustre, Azure Data Lake Gen2

The flat namespace of object stores is a deliberate scalability choice, not a limitation. Hierarchical file systems have a fundamental bottleneck: directory operations (rename, list, move) require locking the directory metadata — at petabyte scale this becomes a serialized queue. Object stores eliminate this by making every object independently addressable by its full key. "Directories" in S3 are just key prefixes — there is no directory inode to lock. This is why S3 can handle trillions of objects with no metadata bottleneck, while HDFS struggles above a few billion files.

📋 Chapter 5 — Summary

Triple replication: simple, fast reads, 3x cost. Good for hot/small objects.
Erasure coding: 1.5x cost, same durability. Production standard for cloud storage.
Object store: flat namespace, immutable, HTTP API. Infinitely scalable.
Distributed FS: hierarchical, mutable, POSIX. Limited by metadata bottleneck.

Chapter Six

What Real Companies Did

Production Cloud Storage Systems

☁️

Amazon S3

Launched 2006. 100+ exabytes stored. Trillions of objects.
11 nines durability via erasure coding across 3+ AZs
Strong consistency (read-after-write) since Dec 2020
Storage classes: Standard, IA, Glacier, Deep Archive
ShardStore: custom internal storage engine (replaced ext4)

🔵

Google Cloud Storage

Built on Colossus (successor to Google File System)
Erasure coding: Reed-Solomon across multiple data centers
Dual-region: synchronous replication to 2 regions
Turbo replication: 15-minute RPO for selected objects
Integrated with BigQuery for direct analytics on stored data

🟦

Azure Blob Storage

LRS (3 copies), ZRS (3 AZs), GRS (cross-region)
Block blobs up to 190 TB each (largest in cloud)
Tiering: Hot, Cool, Cold, Archive (automated lifecycle)
Immutable storage: WORM compliance (legal hold)
Data Lake Gen2: hierarchical namespace on blob storage

🟠

MinIO (Open Source)

S3-compatible API: drop-in replacement for self-hosted
Erasure coding: configurable parity per deployment
Written in Go — single binary, no dependencies
Kubernetes-native: StatefulSets for persistent volumes
Used by: on-premise object storage, hybrid cloud

Cloud Storage Providers — Comparison

📋 Chapter 6 — Summary

S3: exabyte-scale, erasure coding, strong consistency, multiple storage classes.
GCS: Colossus-backed, dual-region sync replication, BigQuery integration.
Azure: up to 190 TB per blob, WORM compliance, hierarchical namespace option.
MinIO: S3-compatible open-source, self-hosted, Kubernetes-native.

Chapter Seven

Best Practices Extracted

Transferable Lessons

🔀

Separate Metadata from Data

Metadata: small, frequently accessed, needs strong consistency
Data: large, write-once, needs durability and throughput
Different storage engines for each (KV store vs blob store)
Metadata is the bottleneck — optimize it independently
Transfers to: any large-object storage system

🛡️

Defense in Depth (Durability)

Layer 1: Checksums detect silent corruption (bit rot)
Layer 2: Erasure coding within a data center
Layer 3: Replication across availability zones
Layer 4: Cross-region backup for disaster recovery
Transfers to: any system with durability guarantees

💰

Lifecycle & Tiering

Automatically transition objects based on access patterns
30 days no access → warm. 90 days → cold. 365 days → archive.
Delete after N days (compliance retention policies)
Cost difference: 6x between hot and archive tiers
Transfers to: any system with aging data (logs, backups)

Durability Defense in Depth — Four Layers

📋 Chapter 7 — Summary

Metadata separation: different engines for small metadata vs large data blobs.
Durability layers: checksums + erasure coding + multi-AZ + cross-region (4 nested layers).
Lifecycle tiering: automatic cost optimization based on access age. 6x savings.
Multi-part upload: split large objects into 5MB chunks. Each part uploaded independently. Resumable on failure. Required for objects over 5GB.

Chapter Eight

What Could Go Wrong

Common Failure Patterns

🦠

Silent Data Corruption (Bit Rot)

Disk sector degrades silently — no error reported
Read returns corrupted data that looks valid
Discovered months later when all copies have rotted
Fix: end-to-end checksums (write checksum with data, verify on every read), periodic background scrubbing (read and verify all chunks).

💾

Metadata Service Overload

Listing millions of objects in one bucket → metadata scan
One customer's LIST request overwhelms metadata service
Affects all customers sharing the same metadata partition
Fix: rate limiting per bucket, metadata sharded by key prefix, pagination enforced (max 1000 per LIST).

🗑️

Accidental Deletion at Scale

Customer accidentally deletes entire bucket (script bug)
Deletion is immediate and permanent — no recycle bin
At exabyte scale: millions of objects gone in seconds
Fix: versioning (soft-delete), MFA delete for destructive ops, bucket lock policies, 24h delete delay for large operations.

🐌

Degraded Read Performance

One data node slow (disk degrading, network congestion)
Reads requiring that node's chunks are all slow
Customer sees intermittent latency spikes (random objects)
Fix: hedged reads (request from replica if primary slow), speculative execution, proactive disk replacement on latency increase.

Bit Rot Detection — Write Checksum, Verify on Read

Object Lifecycle — Automatic Cost Optimization via Tier Transitions

Bit rot is the silent killer that makes checksums non-negotiable. Unlike a disk failure (which is visible), bit rot silently corrupts stored data over time. If you don't checksum on write and verify on read, you discover corruption only when someone needs the data — possibly years later, after all copies have degraded. Background scrubbing (reading and verifying every chunk periodically) is the only defense against this invisible threat.

📋 Chapter 8 — Summary

Bit rot: end-to-end checksums + periodic scrubbing. Silent corruption = worst failure mode.
Metadata overload: rate limit per bucket, shard by prefix, enforce pagination.
Accidental deletion: versioning, MFA delete, delete delay for bulk ops.
Degraded reads: hedged requests to replicas, proactive disk replacement.

← Search Engine Recommendation Engine →