
Replication & Sharding

Scaling databases horizontally through data distribution.

01
Chapter One

What Replication and Sharding Are

Two Different Problems, Two Different Tools

When a single database can no longer keep up — reads, writes, or storage — engineers reach for two techniques that look similar from a distance and behave nothing alike up close. Replication makes copies of the same data on multiple machines. Sharding splits the data itself, so different machines hold different rows. They solve different problems, scale different bottlenecks, and have completely different operational profiles. Most teams need replication long before they need sharding, and the ones who jump straight to sharding usually regret it.

📋

Replication — Many Copies of Everything

What it does: every node holds a full copy of the data.

Solves: read scaling (route reads across replicas), high availability (failover when primary dies), geographic latency (replicas in user regions).

Does not solve: write scaling, storage limits.

Analogy: photocopying a book and putting copies in every branch library.

✂️

Sharding — Split the Data Itself

What it does: partition the data so different nodes hold different rows (e.g. users A–M on shard 1, N–Z on shard 2).

Solves: write scaling, storage capacity, single-table size limits.

Does not solve: high availability (each shard is still its own SPOF unless replicated).

Analogy: splitting an encyclopedia — volume A–M in one library, N–Z in another.
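
To make the A–M / N–Z example concrete, here is a minimal range-router sketch in Python; the shard names and the two-way split are illustrative only, and real systems more often hash the key than range-partition it.

```python
# Minimal range-partition router: usernames A–M on shard 1, N–Z on shard 2.
# Shard names and the two-way split are illustrative, not a recommendation.
SHARDS = {
    "shard-1": ("a", "m"),   # usernames starting a..m
    "shard-2": ("n", "z"),   # usernames starting n..z
}

def shard_for(username: str) -> str:
    first = username[0].lower()
    for shard, (lo, hi) in SHARDS.items():
        if lo <= first <= hi:
            return shard
    raise ValueError(f"no shard owns key {username!r}")

assert shard_for("alice") == "shard-1"
assert shard_for("zoe") == "shard-2"
```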

Replication vs Sharding — The Same Data, Two Different Treatments
Replication (same data on every node):
  • Node 1: A–Z, all rows · Node 2: A–Z, all rows · Node 3: A–Z, all rows
  • ✓ scales reads · ✓ HA / failover · ✓ geo latency
  • ✗ doesn't scale writes · ✗ doesn't add capacity
  • Storage: 3× the data size.

Sharding (different rows on different nodes):
  • Shard 1: A–H, ~33% of rows · Shard 2: I–P, ~33% · Shard 3: Q–Z, ~33%
  • ✓ scales writes · ✓ scales storage · ✓ smaller per-node working set
  • ✗ cross-shard queries painful · ✗ resharding is hard
  • Storage: 1× the data size.

Replication is mostly free. Sharding is mostly painful. The vast majority of systems need replication; only a small fraction genuinely need sharding. Production systems that combine both — sharded data with each shard replicated — pay the operational cost of both, but get write scaling and high availability.

📋 Chapter 1 — Summary
  • Replication = copies: scales reads, enables failover, reduces geographic latency.
  • Sharding = partitions: scales writes and storage by splitting data across nodes.
  • Replication does not scale writes; sharding does not provide HA on its own.
  • Production systems at scale combine both: sharded data, each shard replicated.
02
Chapter Two

How They Work Internally

Leader-Follower Replication — The Default Model

The replication architecture you will encounter 90% of the time is leader-follower (also called primary-replica). One node is the leader; it accepts all writes. Followers receive a stream of changes from the leader — usually the WAL (write-ahead log) — and apply them in order. Reads can go to any node; writes must go to the leader. The leader-follower model is simple, well-understood, and supported natively by every mainstream database.
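
At the application layer, the model boils down to "send writes to the leader, spread reads across everything." A minimal routing sketch, with hypothetical endpoint names:

```python
import random

# Hypothetical connection endpoints; in practice these come from service discovery.
LEADER = "db-leader.internal:5432"
FOLLOWERS = ["db-replica-1.internal:5432", "db-replica-2.internal:5432"]

def endpoint_for(query_is_write: bool) -> str:
    # All writes must go to the leader; reads can go to any node.
    if query_is_write:
        return LEADER
    return random.choice(FOLLOWERS + [LEADER])
```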

Replication can be synchronous (the leader waits for follower acknowledgement before committing — safer, slower) or asynchronous (the leader commits immediately and replicates in the background — faster, risk of small data loss on failover). Most production systems use semi-sync: at least one follower must acknowledge, but not all. It buys most of the durability of sync without the latency hit.
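
The three modes differ only in how many follower acknowledgements the leader waits for before confirming a commit. A conceptual sketch, not any particular database's protocol:

```python
# Conceptual sketch of the sync / async / semi-sync commit rules.
# 'acks' is how many followers have confirmed the write so far.
def commit_ok(acks: int, total_followers: int, mode: str) -> bool:
    if mode == "async":       # commit immediately, replicate in the background
        return True
    if mode == "semi-sync":   # at least one follower must confirm
        return acks >= 1
    if mode == "sync":        # every follower must confirm
        return acks == total_followers
    raise ValueError(mode)
```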

Multi-Leader Replication and Its Conflicts

When you need writes accepted in multiple regions or want to survive leader failures with zero downtime, multi-leader replication is the next step up. Multiple nodes accept writes; changes flow between them in both directions. The price is conflicts — what happens when two leaders simultaneously update the same row to different values? Resolution strategies vary: last-write-wins (simple, sometimes lossy), version vectors (correct, complex), application-level merge (CRDTs, custom logic). Multi-leader is rarely needed and rarely worth the operational complexity unless you have a specific multi-region write requirement.
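
Last-write-wins is simple enough to sketch in a few lines. The ordering below (timestamp, then writer id as a tiebreak) is one common convention rather than any specific database's implementation, and the example shows where the lossiness comes from:

```python
# Last-write-wins merge of two conflicting versions of the same row.
# Each version carries (value, wall_clock_timestamp, writer_node_id).
def lww_merge(a: tuple, b: tuple) -> tuple:
    # Higher timestamp wins; writer id breaks exact ties deterministically.
    # Note the lossiness: the losing write is silently discarded.
    return max(a, b, key=lambda v: (v[1], v[2]))

v1 = ("alice@old.example", 1_700_000_000.120, "leader-us")
v2 = ("alice@new.example", 1_700_000_000.450, "leader-eu")
assert lww_merge(v1, v2) == v2   # the US write is dropped without a trace
```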

Sharding — Consistent Hashing and Virtual Nodes

The naive way to shard is modulo: shard = hash(key) % N. The problem appears the moment you add or remove a node: now N changes, every key's shard changes, and you have to move all your data. Consistent hashing solves this: the hash space is mapped onto a ring; each node owns a contiguous arc; only keys on the affected arc need to move when nodes are added or removed. Virtual nodes — each physical node owns many small arcs instead of one big one — smooth out the distribution and make rebalancing finer-grained.

Consistent Hashing — Why Adding a Node Doesn't Move Everything
Naive modulo (shard = hash(k) mod N):
  • N = 3: k1 → 1, k2 → 0, k3 → 2, k4 → 1, k5 → 2, k6 → 0
  • N = 4 (node added): k1 → 0 (was 1), k2 → 1 (was 0), k3 → 2 (unchanged), k4 → 3 (was 1), k5 → 0 (was 2), k6 → 1 (was 0)
  • 5 of 6 keys move: roughly 83% of the data migrates on every resize.

Consistent hash ring (N1, N2, N3, plus a new node N4):
  • Each node owns an arc of the ring; N4 takes over part of one existing arc.
  • Of the six keys, only k1 sits on the arc N4 took over, so only k1 moves.
  • On a resize, only ~1/N of the keys move, not all of them.
🔗

Synchronous vs Async Replication

Synchronous: commit waits for follower ack. Zero data loss on leader failure; high latency cost.

Asynchronous: commit returns immediately; followers catch up. Fast writes; risk losing the last few seconds of data on failover.

Semi-sync (production default): at least one follower must ack — durability without paying for full sync.

🍃

Virtual Nodes

What: each physical node owns many small arcs of the hash ring (typically 100–256 vnodes per node).

Why: spreads load more evenly; rebalancing is fine-grained (move 1/256th, not 1/3rd).

Used by: Cassandra, Riak, ScyllaDB, Dynamo-style systems.
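
Putting the ring and virtual nodes together, a minimal sketch; the hash function, vnode count, and node names are all illustrative:

```python
import bisect
import hashlib

def _h(key: str) -> int:
    # Map a string onto the ring; any stable hash works for a sketch.
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class ConsistentHashRing:
    def __init__(self, nodes, vnodes_per_node=128):
        self._ring = []                       # sorted list of (position, node)
        for node in nodes:
            for i in range(vnodes_per_node):  # many small arcs per physical node
                self._ring.append((_h(f"{node}#{i}"), node))
        self._ring.sort()

    def node_for(self, key: str) -> str:
        # First vnode clockwise from the key's position; wrap at the end of the ring.
        idx = bisect.bisect(self._ring, (_h(key), "")) % len(self._ring)
        return self._ring[idx][1]

ring = ConsistentHashRing(["node-1", "node-2", "node-3"])
before = {k: ring.node_for(k) for k in ("k1", "k2", "k3", "k4", "k5", "k6")}
ring = ConsistentHashRing(["node-1", "node-2", "node-3", "node-4"])
moved = [k for k, owner in before.items() if ring.node_for(k) != owner]
# Only keys on arcs taken over by node-4 change owner (roughly 1/N of them).
```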

Consistent hashing is what makes “just add a node” possible without a downtime window. If your sharding scheme requires moving the entire dataset to add capacity, you don't have a sharded system — you have a manual data-migration project that happens to be triggered by load.

📋 Chapter 2 — Summary
  • Leader-follower is the standard replication model: one writer, many readers, WAL streaming.
  • Multi-leader enables multi-region writes but introduces conflict resolution complexity.
  • Consistent hashing on a ring + virtual nodes minimises data movement when adding or removing shards.
  • Most production systems use semi-sync replication: balance of durability and latency.
03
Chapter Three

When to Use — and When Not To

Replication First, Sharding Last

The order of escalation matters more than any single decision. Replication is cheap to add, well-supported by every database, and solves the problems most teams actually have: “our reads are saturating the database” and “we need to survive a node failure.” Sharding is operationally heavy, application-invasive, and only solves problems that genuinely require horizontal write scaling. Add replication when you need it. Add sharding when you can prove no other option will work.

USE Replication When…

Reads are saturating the primary. Read replicas absorb the load.

You need HA / failover. A standby replica becomes the new primary on crash.

Geographic latency matters. Replicas in user regions cut round-trip time.

You want a hot standby for analytics. Heavy reporting queries on a replica don't hurt OLTP latency on the primary.

Cost: mostly storage and modest replication-stream bandwidth.

DO NOT Use Replication for…

Scaling writes. All writes still hit the primary. Adding replicas does nothing for write throughput.

Adding storage capacity. Each replica has to hold the full dataset.

Strict read-after-write consistency on replicas. Replication lag is real; route reads to primary in the post-write window.

USE Sharding When…

Writes saturate the primary. Replication can't help; sharding is the only way to scale write throughput.

Storage exceeds a single machine. A 50 TB table is operationally painful on one node regardless of CPU.

Working set per shard is too big to cache. Smaller per-shard data fits in RAM; reads stay fast.

You can pick a clean shard key. User ID, tenant ID, geographic region — something that stays in the access path of every query.

DO NOT Shard When…

You haven't maxed out vertical scaling. A bigger box is cheaper than a sharded architecture.

You haven't fixed slow queries first. Sharding a poorly indexed database multiplies the pain across N nodes.

Your queries cross shards constantly. Cross-shard JOINs and aggregations defeat the entire model.

You don't have a stable shard key. Picking the wrong key is worse than not sharding — resharding is a multi-month project.

The escalation ladder: vertical scaling → indexes and query fixes → caching → replication → sharding. Skip steps at your peril. Most teams who jump to sharding could have stayed on a single node for another 18–24 months by fixing queries and adding replicas first.

📋 Chapter 3 — Summary
  • Replication scales reads and provides HA — cheap to add, supported everywhere.
  • Sharding scales writes and storage — expensive operationally, application-invasive.
  • Climb the escalation ladder: vertical scaling → query fixes → caching → replication → sharding.
  • Don't shard without a clean, stable shard key. The wrong key is more painful than no sharding.
04
Chapter Four

Trade-offs & Comparisons

Replication Lag — Real Numbers, Real Consequences

Replication lag is not a theoretical concern. On a healthy PostgreSQL replica it is usually 5–50 ms; under heavy write load it can hit hundreds of milliseconds; during long-running queries on the primary it can stretch to multiple seconds; on a partitioned network it can be unbounded. Every system using read replicas has to design for the gap between “committed on primary” and “visible on replica.”

The classic bug: a user creates a comment, the page refreshes and reads from a replica, and the comment isn't there yet. The standard fix is route-by-recency: track the timestamp (or LSN) of the last write per session, and send reads to the primary whenever the replica's position is still behind that point. A less precise alternative: route reads from the same user to the primary for a fixed window after any write (5–30 seconds covers ~99% of real lag).
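
A sketch of route-by-recency; the session store and helper methods are hypothetical stand-ins for whatever your driver exposes (PostgreSQL, for example, reports positions via pg_current_wal_lsn() and pg_last_wal_replay_lsn()):

```python
# Route-by-recency: remember the primary's position after each write, and only
# read from a replica that has replayed at least that far. All helpers below
# are hypothetical stand-ins for your driver / session store.
def record_write(session, primary) -> None:
    session["last_write_lsn"] = primary.current_lsn()   # e.g. pg_current_wal_lsn()

def choose_read_node(session, primary, replicas):
    needed = session.get("last_write_lsn")
    if needed is None:
        return replicas[0]                               # no recent write: any replica is fine
    for replica in replicas:
        if replica.replayed_lsn() >= needed:             # e.g. pg_last_wal_replay_lsn()
            return replica                               # caught up past our write
    return primary                                       # nobody caught up yet: read the primary
```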

The Cross-Shard Query Problem

A query that touches a single shard is fast. A query that has to consult every shard, gather partial results, and merge them in the application is slow, complex, and breaks the linear-scaling promise. SELECT COUNT(*) FROM orders across 50 shards means 50 queries, 50 result rows, and one application-level sum. JOINs across shards are worse: either denormalize so the joined data lives on the same shard, or accept that some queries cannot run efficiently in your sharded system.
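
What that scatter-gather looks like in application code, assuming one DB-API connection per shard:

```python
# Scatter-gather COUNT(*): one query per shard, merged in the application.
# 'shard_connections' is a hypothetical list of DB-API connections, one per shard.
def count_orders(shard_connections) -> int:
    total = 0
    for conn in shard_connections:                  # 50 shards = 50 round trips
        cur = conn.cursor()
        cur.execute("SELECT COUNT(*) FROM orders")
        total += cur.fetchone()[0]                  # the merge happens here, not in SQL
    return total
```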

🔀

Cross-Shard JOINs

The problem: the data you want to JOIN lives on different shards.

Bad solution: fetch from one shard, then iterate to fetch from others — N+1 across the network.

Better solutions: denormalize (duplicate the joined columns onto both shards); pre-join in the write path; use a search index for cross-shard lookups.

🔄

Resharding Reality

The problem: the dataset outgrew the original shard count, or one shard is genuinely hot.

The cost: migrating live data while serving traffic. Dual writes, gradual reads, careful cutover — weeks to months of work.

The lesson: over-provision shards from day one (e.g. 256 logical shards on 8 physical nodes) so you only ever rebalance, never reshard.
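
A sketch of the logical-over-physical layout, using the 256-on-8 numbers from the card above; the hashing and assignment scheme are illustrative:

```python
import hashlib

# 256 logical shards spread over 8 physical nodes. A key hashes to its logical
# shard once and forever; only the logical→physical assignment changes on rebalance.
LOGICAL_SHARDS = 256
physical_nodes = [f"node-{i}" for i in range(8)]    # hypothetical node names

def logical_shard(key: str) -> int:
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % LOGICAL_SHARDS

# Initial assignment: logical shards striped across physical nodes.
assignment = {ls: physical_nodes[ls % len(physical_nodes)] for ls in range(LOGICAL_SHARDS)}

def node_for(key: str) -> str:
    return assignment[logical_shard(key)]

# Adding capacity later means editing 'assignment' and copying whole logical
# shards; no key ever changes its logical shard, so nothing is re-hashed.
```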

Replication Topologies — What Failure Looks Like
📋

Single-Leader

Strength: simple, no conflicts.

Weakness: writes bottlenecked on one node; failover requires leader election.

Use: nearly all production OLTP systems.

📋📋

Multi-Leader

Strength: writes accepted in any region; survives leader loss.

Weakness: conflicts. Last-write-wins is lossy; CRDTs are correct but limited.

Use: multi-region active-active, collaborative apps.

🌐

Leaderless (Dynamo-style)

Strength: no single point of failure; quorum reads/writes.

Weakness: tunable consistency means tunable correctness; read repair complexity.

Use: Cassandra, DynamoDB, Riak.
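
The quorum behaviour behind leaderless systems reduces to one inequality; a short sketch using classic Dynamo-style numbers as an example:

```python
# Quorum condition: with N replicas per key, W write acks and R read acks,
# every read overlaps at least one up-to-date replica when R + W > N.
N, W, R = 3, 2, 2          # a common Dynamo-style configuration
assert R + W > N           # 2 + 2 > 3: reads see the latest acknowledged write
# Tuning down (e.g. R = 1, W = 1) buys latency at the cost of that guarantee,
# which is the "tunable consistency means tunable correctness" trade-off above.
```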

Resharding is the single most underestimated cost in sharded architectures. The teams that survive at scale are the ones who chose 256 logical shards on day one and never had to split a single one. The teams who chose 4 shards and grew into 16 spent six months migrating data while pretending the system was healthy.

📋 Chapter 4 — Summary
  • Replication lag is real — design read paths to handle the gap; route post-write reads to the primary.
  • Cross-shard queries break linear scaling — denormalize, pre-join, or use search indexes.
  • Resharding is hard — over-provision logical shards from day one to avoid it.
  • Topology choice determines failure shape: single-leader is simple, multi-leader handles conflicts, leaderless trades correctness for availability.
05
Chapter Five

Production Patterns & Common Mistakes

Patterns That Make Sharding Survivable

Teams that run sharded systems without pain share a few discipline patterns. They picked a shard key that appears in the WHERE clause of nearly every query, usually a tenant ID, user ID, or geographic region. They over-provisioned logical shards (often 256 or 1024) on day one so adding capacity is rebalancing, not resharding. And they treat the shard map as a critical artifact: stored, versioned, monitored, and routed through a single service so the application doesn't hard-code shard locations.
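
What "route through the shard map, never hard-code locations" means in practice is roughly this; the client class and refresh policy below are a hypothetical sketch (Vitess and Citus ship this routing layer for you):

```python
import time

class ShardMapClient:
    """Routes by asking a central shard-map service, refreshing periodically."""
    def __init__(self, fetch_map, refresh_seconds=5):
        self._fetch_map = fetch_map              # callable returning {logical_shard: "host:port"}
        self._refresh_seconds = refresh_seconds
        self._mapping = {}
        self._fetched_at = float("-inf")

    def node_for(self, logical_shard: int) -> str:
        now = time.monotonic()
        if now - self._fetched_at > self._refresh_seconds:
            self._mapping = self._fetch_map()    # pick up rebalances without a deploy
            self._fetched_at = now
        return self._mapping[logical_shard]
```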

🔑

Pattern: Shard Key Selection

Goal: a key that appears in nearly every query AND distributes evenly.

Good keys: tenant ID for SaaS; user ID for B2C; UUIDs.

Bad keys: sequential timestamps (hot recent shard); country code (skewed by population); enum values (only N possible values).

Test: if 80% of queries can't include the shard key, you picked wrong.

🔥

Pattern: Hotspot Prevention

Symptoms: one shard at 90% CPU while others sit at 10%. One tenant generating 50% of writes.

Mitigations: hash-prefix the shard key (spreads sequential IDs); split “whale” tenants onto dedicated shards; add salt to keys for write-heavy time-series.

Watch: hotspots are the leading indicator of needing a reshard. Catch them before they tip a node over.
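
A sketch of the hash-prefix and salting mitigations from the card above; the key formats are illustrative:

```python
import hashlib
import random

# Hash-prefixing: sequential order IDs (1001, 1002, ...) would all land on the
# "latest" range; prefixing with a short hash spreads them across shards.
def prefixed_key(order_id: int) -> str:
    prefix = hashlib.md5(str(order_id).encode()).hexdigest()[:4]
    return f"{prefix}:{order_id}"            # e.g. "9f1c:1002"

# Salting for write-heavy time series: fan one hot logical key out over a few
# salted keys on write, then read all the salted keys back when querying.
SALT_BUCKETS = 8
def salted_key(metric: str, minute: str) -> str:
    return f"{metric}:{minute}:{random.randrange(SALT_BUCKETS)}"
```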

The Five Mistakes That Wreck Sharded Systems
🔌

Mistake 1 — Wrong Shard Key

Picked a column that 80% of queries don't include → every query is a scatter-gather across all shards. Fix: pick the key that's in your access path before sharding. Migrating the shard key later is a project, not a refactor.

🔥

Mistake 2 — Hot Shard from Day One

Used a sequential ID or timestamp as shard key → all new writes pile onto the latest shard. Fix: hash the key or use a composite (tenant + time bucket).

♻️

Mistake 3 — Too Few Logical Shards

Started with 4 shards; outgrew them; now resharding is a multi-month project. Fix: over-provision logical shards (256+) on day one so growth is rebalancing, not resharding.

🔗

Mistake 4 — Cross-Shard Transactions

App expects ACID across shards → either incorrect or requires distributed transactions (slow, fragile). Fix: design so each transaction stays in one shard. If it can't, use sagas or accept eventual consistency.

⏱️

Mistake 5 — Ignoring Replication Lag

Reads-after-writes hit a stale replica → users see their own changes disappear. Fix: route post-write reads to primary for a window; use LSN tracking for precise routing.

📝

Bonus — No Shard Map Service

Application hardcodes shard locations → rebalancing requires a deploy. Fix: central shard-map service (e.g. Vitess, Citus); applications query it for routing decisions; rebalancing is operational, not a code change.

Sharding is the most expensive technical decision most companies make. Done right, it scales for a decade. Done wrong, it consumes the engineering team. The difference is upstream work — shard key selection, logical shard count, and shard-map tooling — not heroics during the outage.

📋 Chapter 5 — Summary
  • Pick a shard key that appears in nearly every query and distributes evenly — tenant ID, user ID, hashed UUIDs.
  • Over-provision logical shards (256+) on day one; growth becomes rebalancing, not resharding.
  • Route reads through a shard-map service so rebalancing is operational, not a code deploy.
  • The five outage mistakes: wrong shard key, hot shard, too few logical shards, cross-shard transactions, ignored replication lag.