Case Study: Cloud Storage
Design, trade-offs, and alternatives for a cloud storage service at scale.
Problem Statement
A cloud storage service (like S3, Google Cloud Storage, or Azure Blob) stores arbitrary objects (files) durably and makes them accessible via HTTP from anywhere in the world. The defining challenge is durability at exabyte scale: you are promising customers that their data will never be lost β 99.999999999% (11 nines) durability means losing at most 1 object per 100 billion stored per year. Simultaneously, you must serve millions of reads/sec and handle objects from 1 byte to 5 TB, all with consistent performance.
Traffic & Scale
- 100+ exabytes total stored data
- 100M+ requests/sec (reads + writes combined)
- Trillions of objects stored
- Objects from 1 byte to 5 TB each
Requirements
- Durability: 99.999999999% (11 nines)
- Availability: 99.99% for reads, 99.9% for writes
- Latency: first byte <100ms (standard), seconds (archival)
- Consistency: strong read-after-write (S3 since 2020)
11 nines durability means data replication is not optional β it is the entire system. A single disk fails every ~3 years. A rack loses power. A data center floods. To achieve 11 nines, you must replicate data across multiple disks, multiple racks, and multiple geographic regions β such that simultaneous failure of any 2 replicas still leaves the data recoverable. The system is fundamentally a replication and consistency engine that happens to store files.
- 100+ exabytes, trillions of objects, 100M+ requests/sec.
- 11 nines durability: lose at most 1 object per 100B per year.
- Objects from 1 byte to 5 TB. Strong read-after-write consistency.
- The system is primarily a replication engine that guarantees data survival.
Questions to Ask
Object Model
- Flat namespace (bucket/key) or hierarchical (directories)?
- Maximum object size? Multi-part upload needed?
- Versioning (keep all versions of an object)?
- Object metadata (custom headers, tags)?
Access & Security
- Public vs private access control?
- Pre-signed URLs (time-limited access)?
- Encryption at rest and in transit?
- Cross-account access (IAM policies)?
Storage Tiers
- Hot (frequent access) vs cold (archival)?
- Lifecycle rules (auto-transition after 30 days)?
- Different durability SLAs per tier?
- Retrieval latency tolerance (ms vs hours for glacier)?
For This Case Study, Our Answers Are:
- Object model: flat namespace (bucket + key), no true directories
- Maximum object size: 5 TB (requires multi-part upload above 5 GB)
- Versioning: yes β enabled per bucket, soft-delete on overwrite
- Access control: private by default; bucket policies + pre-signed URLs for time-limited public access
- Encryption: server-side encryption at rest (AES-256); TLS in transit
- Storage tiers: Hot, Warm (Infrequent Access), Cold, Archive β lifecycle rules for automatic transition
- Freshness: strong read-after-write consistency (not eventual)
- Durability target: 11 nines (99.999999999%)
- Multi-region: cross-region replication optional (async, for DR)
- Retrieval latency: Hot = <100ms first byte; Archive = 1-12 hours
Storage tiers are the primary cost optimization lever. Hot storage (SSD-backed, instant access) costs ~$0.023/GB/month. Archive (tape-backed, hours to retrieve) costs ~$0.004/GB/month β a 6x difference. At exabyte scale, this is billions of dollars/year in savings. Lifecycle policies that automatically transition objects from hot to cold based on access patterns are not a feature β they are the business model.
- Flat namespace (bucket/key) is simpler and more scalable than hierarchical.
- Multi-part upload required for large objects (5TB max).
- Storage tiers: hot/warm/cold/archive. Lifecycle rules for automatic transition.
- Encryption at rest (server-side) is standard. Customer-managed keys for compliance.
Naive Design
The simplest design: a single Network File System (NFS) server with a RAID array. Clients mount the share, read/write files directly. RAID provides redundancy against single-disk failure. Works for a team's shared drive. At scale: single server = single point of failure, RAID rebuild time at TB scale takes days (another failure during rebuild = data loss), no geographic replication, and NFS doesn't scale beyond a few hundred concurrent clients.
What Works
- Simple β standard NFS, POSIX filesystem semantics
- RAID protects against single disk failure
- Low latency (local network)
- Fine for <10TB, <100 users
What Breaks
- SPOF: server failure = all data inaccessible
- RAID rebuild at TB scale: days. Second failure = data loss.
- No geographic replication: data center fire = permanent loss
- NFS doesn't scale: 100s of clients max, metadata bottleneck
- No object versioning, no lifecycle, no access control granularity
- NFS + RAID: works for small teams. SPOF, no geo-replication, no scale.
- RAID rebuild vulnerability: during multi-day rebuild, a second failure = data loss.
- No storage tiers, no versioning, no fine-grained access control.
Refined Design
The refined design separates metadata (where is the object?) from data (the object bytes). A metadata service maps bucket+key to the physical locations of data chunks. The data itself is split into chunks, erasure-coded (e.g., Reed-Solomon 6+3: any 6 of 9 chunks can reconstruct the object), and distributed across multiple racks and availability zones. Reads go to the nearest available chunk. Writes succeed only after a quorum of chunks are durably stored.
Write Path
- 1. Client uploads object via API gateway
- 2. Metadata service allocates chunk locations (across AZs)
- 3. Object split into data chunks + parity chunks (erasure coding)
- 4. Chunks written to data nodes across racks/AZs
- 5. Quorum acknowledged β write confirmed to client
- 6. Metadata updated: key β [chunk_locations]
Read Path
- 1. Client requests object by bucket + key
- 2. Metadata service returns chunk locations
- 3. Read minimum chunks needed (6 of 9) from nearest nodes
- 4. Reconstruct full object from chunks
- 5. Stream to client (chunked transfer for large objects)
- If a node is down: read from alternative chunks (self-healing)
Erasure coding vs triple replication is the fundamental storage efficiency decision. Triple replication (3 copies): simple, fast reads, 3x storage cost. Erasure coding (6+3 Reed-Solomon): 1.5x storage cost for same durability, but higher compute on read (reconstruction) and write (parity calculation). At exabyte scale, the 2x storage savings from erasure coding saves billions of dollars. This is why S3, GCS, and Azure all use erasure coding for their standard tier.
- Metadata service: maps key β chunk locations. Backed by strongly consistent KV store.
- Erasure coding (6+3): 1.5x overhead for 11 nines durability. Any 6 of 9 chunks recover data.
- Multi-AZ writes: chunks distributed across availability zones. AZ failure = no data loss.
- Quorum writes: confirmed only after sufficient chunks stored durably.
- Multi-part upload: split large objects into 5MB chunks for parallel, resumable uploads.
Alternative Approaches
- Store 3 complete copies of every object
- Simple: any copy serves reads independently
- Fast reads: no reconstruction needed
- 3x storage cost β expensive at scale
- Good for: hot data (frequent reads), small objects
- Used by: HDFS (default), GFS (original Google)
- Split into K data + M parity chunks. Any K chunks recover.
- Storage overhead: (K+M)/K (e.g., 6+3 = 1.5x)
- CPU cost on write (encoding) and degraded reads (reconstruction)
- Same durability as replication with 50% less storage
- Good for: warm/cold data, large objects
- Used by: S3, Azure Blob, Google Cloud Storage
- Flat: bucket + key. No directories, no renames, no POSIX.
- Immutable objects: write once, delete, or overwrite completely
- Immutability enables parallel reads without locking β no reader needs to worry about the object changing mid-read.
- Massively scalable: no directory metadata bottleneck
- HTTP API: GET/PUT/DELETE (stateless, cacheable)
- Good for: cloud-native, web content, backups, data lakes
- Used by: S3, GCS, Azure Blob, MinIO
- Hierarchical: directories, files, POSIX semantics
- Mutable: append, seek, partial overwrite
- Directory listing = metadata scan (bottleneck at scale)
- Rename is atomic β important for many applications
- Good for: HPC, ML training (random access), legacy workloads
- Used by: HDFS, Lustre, Azure Data Lake Gen2
The flat namespace of object stores is a deliberate scalability choice, not a limitation. Hierarchical file systems have a fundamental bottleneck: directory operations (rename, list, move) require locking the directory metadata β at petabyte scale this becomes a serialized queue. Object stores eliminate this by making every object independently addressable by its full key. "Directories" in S3 are just key prefixes β there is no directory inode to lock. This is why S3 can handle trillions of objects with no metadata bottleneck, while HDFS struggles above a few billion files.
- Triple replication: simple, fast reads, 3x cost. Good for hot/small objects.
- Erasure coding: 1.5x cost, same durability. Production standard for cloud storage.
- Object store: flat namespace, immutable, HTTP API. Infinitely scalable.
- Distributed FS: hierarchical, mutable, POSIX. Limited by metadata bottleneck.
What Real Companies Did
Amazon S3
- Launched 2006. 100+ exabytes stored. Trillions of objects.
- 11 nines durability via erasure coding across 3+ AZs
- Strong consistency (read-after-write) since Dec 2020
- Storage classes: Standard, IA, Glacier, Deep Archive
- ShardStore: custom internal storage engine (replaced ext4)
Google Cloud Storage
- Built on Colossus (successor to Google File System)
- Erasure coding: Reed-Solomon across multiple data centers
- Dual-region: synchronous replication to 2 regions
- Turbo replication: 15-minute RPO for selected objects
- Integrated with BigQuery for direct analytics on stored data
Azure Blob Storage
- LRS (3 copies), ZRS (3 AZs), GRS (cross-region)
- Block blobs up to 190 TB each (largest in cloud)
- Tiering: Hot, Cool, Cold, Archive (automated lifecycle)
- Immutable storage: WORM compliance (legal hold)
- Data Lake Gen2: hierarchical namespace on blob storage
MinIO (Open Source)
- S3-compatible API: drop-in replacement for self-hosted
- Erasure coding: configurable parity per deployment
- Written in Go β single binary, no dependencies
- Kubernetes-native: StatefulSets for persistent volumes
- Used by: on-premise object storage, hybrid cloud
- S3: exabyte-scale, erasure coding, strong consistency, multiple storage classes.
- GCS: Colossus-backed, dual-region sync replication, BigQuery integration.
- Azure: up to 190 TB per blob, WORM compliance, hierarchical namespace option.
- MinIO: S3-compatible open-source, self-hosted, Kubernetes-native.
Best Practices Extracted
Separate Metadata from Data
- Metadata: small, frequently accessed, needs strong consistency
- Data: large, write-once, needs durability and throughput
- Different storage engines for each (KV store vs blob store)
- Metadata is the bottleneck β optimize it independently
- Transfers to: any large-object storage system
Defense in Depth (Durability)
- Layer 1: Checksums detect silent corruption (bit rot)
- Layer 2: Erasure coding within a data center
- Layer 3: Replication across availability zones
- Layer 4: Cross-region backup for disaster recovery
- Transfers to: any system with durability guarantees
Lifecycle & Tiering
- Automatically transition objects based on access patterns
- 30 days no access β warm. 90 days β cold. 365 days β archive.
- Delete after N days (compliance retention policies)
- Cost difference: 6x between hot and archive tiers
- Transfers to: any system with aging data (logs, backups)
- Metadata separation: different engines for small metadata vs large data blobs.
- Durability layers: checksums + erasure coding + multi-AZ + cross-region (4 nested layers).
- Lifecycle tiering: automatic cost optimization based on access age. 6x savings.
- Multi-part upload: split large objects into 5MB chunks. Each part uploaded independently. Resumable on failure. Required for objects over 5GB.
What Could Go Wrong
Silent Data Corruption (Bit Rot)
- Disk sector degrades silently β no error reported
- Read returns corrupted data that looks valid
- Discovered months later when all copies have rotted
- Fix: end-to-end checksums (write checksum with data, verify on every read), periodic background scrubbing (read and verify all chunks).
Metadata Service Overload
- Listing millions of objects in one bucket β metadata scan
- One customer's LIST request overwhelms metadata service
- Affects all customers sharing the same metadata partition
- Fix: rate limiting per bucket, metadata sharded by key prefix, pagination enforced (max 1000 per LIST).
Accidental Deletion at Scale
- Customer accidentally deletes entire bucket (script bug)
- Deletion is immediate and permanent β no recycle bin
- At exabyte scale: millions of objects gone in seconds
- Fix: versioning (soft-delete), MFA delete for destructive ops, bucket lock policies, 24h delete delay for large operations.
Degraded Read Performance
- One data node slow (disk degrading, network congestion)
- Reads requiring that node's chunks are all slow
- Customer sees intermittent latency spikes (random objects)
- Fix: hedged reads (request from replica if primary slow), speculative execution, proactive disk replacement on latency increase.
Bit rot is the silent killer that makes checksums non-negotiable. Unlike a disk failure (which is visible), bit rot silently corrupts stored data over time. If you don't checksum on write and verify on read, you discover corruption only when someone needs the data β possibly years later, after all copies have degraded. Background scrubbing (reading and verifying every chunk periodically) is the only defense against this invisible threat.
- Bit rot: end-to-end checksums + periodic scrubbing. Silent corruption = worst failure mode.
- Metadata overload: rate limit per bucket, shard by prefix, enforce pagination.
- Accidental deletion: versioning, MFA delete, delete delay for bulk ops.
- Degraded reads: hedged requests to replicas, proactive disk replacement.