System Design · Data at Scale

Data at Scale

Storage, modeling, and processing when data grows beyond one machine.

Chapter One

Data Modeling for Scale

Schema Design Decisions That Last Decades

The data model is the most consequential decision you make in a system — more consequential than the programming language, more consequential than the cloud provider. A wrong schema choice at 100 users becomes an impossible migration at 100 million users. The schema determines which queries are fast, which are slow, and which are impossible without a rewrite. Most systems that fail at scale fail because of data model decisions made in the first sprint.

Normalization vs Denormalization

In a normalized schema, every fact lives in exactly one place. Updates are simple — change it once. But reads often require joining many tables. In a denormalized schema, data is duplicated to make reads fast. Updates become expensive — change it everywhere. The choice is not "which is better" — it is "which read and write patterns dominate your system."

Normalized (Write-Optimized)

Every fact stored exactly once, with no duplication
Updates are atomic: change in one place, visible everywhere
Strong data integrity through foreign keys
Reads require JOINs: expensive at scale
Best for: OLTP, write-heavy workloads, financial systems

Denormalized (Read-Optimized)

Data duplicated across tables for fast reads
Updates require changing multiple places (fan-out writes)
Application responsible for consistency
Reads are single-table lookups: fast at any scale
Best for: read-heavy systems, NoSQL, feeds, dashboards
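The trade-off can be made concrete with a small sketch. It uses Python's built-in sqlite3 module purely so it runs anywhere; the users, posts, and feed tables and the rename_user helper are hypothetical names invented for illustration, not part of any system described here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Normalized: the author's name lives only in users.
    CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE posts (id INTEGER PRIMARY KEY,
                        user_id INTEGER NOT NULL REFERENCES users(id),
                        title TEXT NOT NULL);
    -- Denormalized: the author's name is copied into every feed row.
    CREATE TABLE feed (post_id INTEGER PRIMARY KEY,
                       author_name TEXT NOT NULL,
                       title TEXT NOT NULL);
    INSERT INTO users VALUES (1, 'Ada');
    INSERT INTO posts VALUES (10, 1, 'First'), (11, 1, 'Second');
    INSERT INTO feed VALUES (10, 'Ada', 'First'), (11, 'Ada', 'Second');
""")

def rename_user(user_id: int, new_name: str) -> int:
    """Rename a user and return how many rows had to be written in total."""
    with conn:  # one transaction covers the single write and the fan-out
        one = conn.execute(
            "UPDATE users SET name = ? WHERE id = ?", (new_name, user_id))
        fan_out = conn.execute(
            "UPDATE feed SET author_name = ? WHERE post_id IN "
            "(SELECT id FROM posts WHERE user_id = ?)", (new_name, user_id))
        return one.rowcount + fan_out.rowcount

# Normalized read: needs a JOIN to attach the author's name.
normalized_read = conn.execute(
    "SELECT p.title, u.name FROM posts p JOIN users u ON u.id = p.user_id "
    "ORDER BY p.id").fetchall()
# Denormalized read: a single-table lookup returns the same answer.
denormalized_read = conn.execute(
    "SELECT title, author_name FROM feed ORDER BY post_id").fetchall()

rows_written = rename_user(1, "Ada L.")  # 1 users row + 2 feed rows
```

Both reads return the same rows, but the rename touches one row in the normalized tables and one row per post in the denormalized feed. That per-post fan-out is the consistency responsibility the application takes on.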

Normalized vs Denormalized — The Trade-off at Scale

OLTP vs OLAP — Two Worlds

💳

OLTP — Transactional

Serve user requests: inserts, updates, point queries
Normalized schema, row-oriented storage
Latency: milliseconds. QPS: thousands to millions
Tools: PostgreSQL, MySQL, DynamoDB
Optimized for: small, frequent reads/writes by primary key

📊

OLAP — Analytical

Answer business questions: aggregations, scans, reports
Star/snowflake schema, column-oriented storage
Latency: seconds. QPS: few concurrent queries
Tools: BigQuery, Redshift, ClickHouse, Snowflake
Optimized for: scanning millions of rows across few columns

Schema Evolution — Compatibility Contracts

Forward compatibility means old code can read new data. Old readers must ignore fields they do not recognize: if you add a field and old consumers crash when they see it, you broke forward compatibility.

Backward compatibility means new code can read old data. New fields must be optional with defaults, because old data will not contain them. Do not remove or rename fields that existing consumers depend on.

Protobuf field numbers protect both: renaming a field is safe because the wire format uses numbers, not names. The consumer decodes by field number regardless of what you call the field in code. Never reuse a field number after removing a field — the old binary data still has that number and will be misread.

Raw JSON schema changes are dangerous because there is no enforcement mechanism. A breaking change ships silently until a downstream consumer crashes in production. Protobuf and Avro schemas are explicit, so they can be checked by a schema registry (for example, Confluent Schema Registry) that rejects incompatible changes at publish time.
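A toy model of number-keyed decoding shows why the compatibility rules above hold. This is not the Protobuf library or its wire format: the dictionaries SCHEMA_V1 and SCHEMA_V2 and the decode function are hypothetical stand-ins that only mimic the idea of identifying fields by number.

```python
# Hypothetical schemas: field number -> (name in code, default value).
# v2 renames field 2 and adds optional field 3 with a default.
SCHEMA_V1 = {1: ("user_id", None), 2: ("email", "")}
SCHEMA_V2 = {1: ("user_id", None), 2: ("contact_email", ""), 3: ("locale", "en")}

def decode(wire: dict[int, object],
           schema: dict[int, tuple[str, object]]) -> dict[str, object]:
    """Decode by field number: unknown numbers are skipped, missing ones get defaults."""
    return {name: wire.get(number, default)
            for number, (name, default) in schema.items()}

old_message = {1: 42, 2: "a@example.com"}            # written by v1 code
new_message = {1: 42, 2: "a@example.com", 3: "fr"}   # written by v2 code

forward = decode(new_message, SCHEMA_V1)   # old code reads new data
backward = decode(old_message, SCHEMA_V2)  # new code reads old data
```

The old reader silently skips field 3, the new reader fills in the default for it, and the rename of field 2 is invisible on the wire. If field 2 were later removed and its number reused for a different meaning, old messages would decode into the wrong field, which is why removed numbers must stay retired.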

Polyglot Persistence

Polyglot persistence means using different databases for different concerns within one system — each chosen for the access pattern it serves best. A single PostgreSQL instance is the right answer at the start. As the system grows, certain workloads demand purpose-built storage engines that a general-purpose relational database cannot serve efficiently.

🐘

PostgreSQL

Transactional data and complex relational queries. Source of truth for structured data. ACID guarantees, foreign keys, complex joins.

⚡

Redis

Sessions, caches, rate limiting counters. Sub-millisecond reads from memory. TTL-based expiry. Do not treat it as a durable store: depending on the persistence configuration (RDB snapshots, AOF), recent writes can be lost on restart.

🔍

Elasticsearch

Full-text search. Always a secondary index — never the source of truth. Synced from your primary DB via CDC. Built for inverted-index lookups and faceted filtering.

🗂️

S3 / Object Storage

Files, images, videos, ML training data. Effectively unlimited capacity at roughly $0.023/GB/month for standard storage. Not a database: no secondary indexes and no general-purpose queries. Use metadata in PostgreSQL to find objects.

📈

InfluxDB / Prometheus

Time-series metrics. Append-only, high-velocity writes. Downsampling through retention policies or recording rules. Purpose-built compression for sequential timestamps. Not suited for relational queries.

🕸️

Neo4j / Neptune

Graph data when relationships are first-class. Social graphs, recommendation engines, fraud rings. Traversals that would require dozens of JOINs in SQL run in milliseconds.

Operational cost warning: each additional database is a new system to monitor, back up, recover from failure, upgrade, and train your team on. Do not add a new database until you have exhausted what your existing database can do. Redis-like caching can be approximated with PostgreSQL's unlogged tables. Full-text search is built into PostgreSQL (tsvector). Graph queries can be done in SQL with recursive CTEs. Only add a purpose-built database when the performance or operational evidence demands it.
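As an illustration of the recursive CTE point, the sketch below walks a small follower graph in plain SQL. It runs on Python's built-in sqlite3 module for convenience; PostgreSQL supports WITH RECURSIVE as well, though this exact statement has only been written against SQLite here. The follows table and the reachable helper are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE follows (src TEXT NOT NULL, dst TEXT NOT NULL);
    INSERT INTO follows VALUES
        ('ann', 'bob'), ('bob', 'cat'), ('cat', 'dan'), ('dan', 'ann');
""")

def reachable(start: str, max_hops: int) -> list[tuple[str, int]]:
    """Return (user, fewest hops) for everyone reachable from start within max_hops."""
    return conn.execute("""
        WITH RECURSIVE walk(node, hops) AS (
            SELECT ?, 0
            UNION
            SELECT f.dst, w.hops + 1
            FROM walk w JOIN follows f ON f.src = w.node
            WHERE w.hops < ?          -- hop bound also stops cycles
        )
        SELECT node, MIN(hops) FROM walk
        WHERE node <> ?
        GROUP BY node
        ORDER BY MIN(hops), node
    """, (start, max_hops, start)).fetchall()

two_hops = reachable("ann", 2)
```

This is adequate for shallow traversals on modest graphs. When traversal depth or graph size grows to the point where these queries dominate load, that is the kind of evidence that justifies a dedicated graph database.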

Your data model is your most expensive decision. It determines which queries are fast, slow, and impossible. Choose based on your dominant access pattern — not on textbook purity. At scale, every JOIN is a liability. Every denormalization is a consistency responsibility.

📋 Chapter 1 — Summary

Normalization = write-optimized. One fact, one place. Reads require JOINs.
Denormalization = read-optimized. Duplicate data. Single-table reads. App manages consistency.
Decision rule: read:write > 10:1 → denormalize. Write-heavy or strong consistency → normalize.
OLTP = row-oriented, millisecond latency, normalized. OLAP = column-oriented, scan-heavy, star schema.
Schema evolution: forward + backward compatibility. Protobuf/Avro > raw JSON for safe evolution. Never reuse removed field numbers.
Polyglot persistence: use the right database for each access pattern. PostgreSQL for transactions, Redis for caching, Elasticsearch for search, S3 for files. Each additional database is an operational cost.

Chapter Two

Time-Series Data

When Everything Is a Timestamped Event

Time-series data is one of the fastest-growing categories of data in modern systems. Every metric your servers emit, every reading from an IoT sensor, every financial tick, every user action log: all are timestamped events arriving at high velocity. The workload characteristics are unusual: writes are append-only, reads are time-range scans, and data value decays over time. General-purpose databases struggle with these patterns. Purpose-built time-series databases exist because this problem is that different.

Why relational databases fail at time-series: a relational database with a timestamped events table works at thousands of rows per day. At 100,000 rows per second, B-tree index maintenance becomes the bottleneck. Every insert must update each index on the table, which means random page writes and periodic page splits to keep keys in sorted order across the whole dataset. TSDBs use append-only log structures and time-based partitioning that match the write pattern. New data lands at the end of the log. Indexes are built per time partition, not across the entire dataset, so per-write index maintenance is small and confined to the newest partition.

Time-Series Data Pipeline — Collect to Visualize

📈

Workload Characteristics

Append-only writes at high velocity (100K+ events/sec)
Reads are time-range scans: "last 24 hours" or "this week"
Data value decays — second-level detail needed for hours, not years
Heavy compression opportunity (timestamps are sequential)

🗜️

Downsampling & Retention

Raw data: keep 7–30 days at full resolution
5-minute averages: keep 90 days
1-hour rollups: keep 1–2 years
Reduces storage 100–1000× while preserving trends
Automated via retention policies in TSDB
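The rollup step itself is simple to sketch. The function below averages raw points into fixed buckets; the name downsample, the (unix timestamp, value) tuple layout, and the 300-second default are assumptions for illustration. A real TSDB would run the equivalent logic inside its retention or continuous-aggregation machinery.

```python
from collections import defaultdict
from statistics import fmean

def downsample(points: list[tuple[int, float]],
               bucket_seconds: int = 300) -> list[tuple[int, float]]:
    """Average (unix_ts, value) points into fixed buckets keyed by bucket start time."""
    buckets: dict[int, list[float]] = defaultdict(list)
    for ts, value in points:
        buckets[ts - ts % bucket_seconds].append(value)
    return [(start, fmean(values)) for start, values in sorted(buckets.items())]

# 15-second samples over 10 minutes: 40 raw points become 2 five-minute averages.
raw = [(t, float(t // 300)) for t in range(0, 600, 15)]
rollup = downsample(raw)
```

Going from 15-second samples to 5-minute averages is a 20× reduction on its own; stacking an hourly rollup on top, and dropping raw data after its retention window, is where the larger savings come from.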

🖥️

Use Case: Infrastructure Metrics

CPU, memory, disk, network per server. 15-second intervals. Alert on anomalies. Prometheus + Grafana standard stack.

🌡️

Use Case: IoT Sensors

Temperature, pressure, GPS readings. Thousands of devices, sub-second intervals. TimescaleDB or InfluxDB.

📊

Use Case: High-Cardinality Analytics

Ad impressions, clickstream, user events with millions of unique dimension combinations (user × product × time × location). ClickHouse is a columnar OLAP database that is well suited to time-series analytics. It is often significantly faster than InfluxDB for aggregation queries on high-cardinality dimensions. Used by Cloudflare, ByteDance, and Uber for analytics at very large scale.

Time-series databases are not a luxury — they are a necessity at scale. A general-purpose relational DB can handle 1,000 metrics. At 100,000 metrics per second with 90-day retention, you need purpose-built compression, automatic downsampling, and time-partitioned storage. That's what TSDBs provide.

📋 Chapter 2 — Summary

Time-series workloads: append-only writes, time-range reads, data value decays with age.
Why relational DBs fail: B-tree index maintenance on every insert (random page writes, page splits) becomes the bottleneck at 100K+ rows/sec. TSDBs use append-only log structures and time-based partitioning that keep per-write index overhead minimal.
Pipeline: sources → collector (OTel/Telegraf) → TSDB → query → visualize (Grafana).
Downsampling: full resolution short-term, aggregated long-term. 100–1000× storage reduction.
Tools: Prometheus (metrics), InfluxDB (general TSDB), TimescaleDB (SQL-compatible), ClickHouse (columnar OLAP for high-cardinality analytics at extreme scale).

Chapter Three

Search at Scale

When Your Database Can't Answer the Question

Every application eventually needs search that goes beyond primary-key lookups and simple WHERE clauses. Full-text search, fuzzy matching, faceted filtering, autocomplete, semantic similarity — your primary data store was not designed for these access patterns. Search is always a secondary index — derived from your primary data but optimized for a completely different query shape. The hard problems are not search itself — they are keeping the search index in sync with the source of truth.

Search Index Sync — The Dual-Write Problem

How an Inverted Index Works

A forward index maps a document identifier to its content: given an ID, retrieve the record. An inverted index reverses this: it maps a term to all documents containing that term. This reversal is what makes full-text search fast. Look up the term and you get the posting list of matching document IDs directly, with no document scanning required.
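A minimal in-memory version makes the structure concrete. It assumes lowercase whitespace tokenization and omits stemming, stop words, and ranking; build_index and search_all are hypothetical helper names.

```python
from collections import defaultdict

def build_index(docs: dict[int, str]) -> dict[str, set[int]]:
    """Map each lowercase whitespace-separated term to the IDs of documents containing it."""
    index: dict[str, set[int]] = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def search_all(index: dict[str, set[int]], query: str) -> list[int]:
    """AND query: intersect the posting lists of every query term."""
    postings = [index.get(term, set()) for term in query.lower().split()]
    return sorted(set.intersection(*postings)) if postings else []

docs = {1: "red running shoes", 2: "blue running jacket", 3: "red rain jacket"}
index = build_index(docs)
matches = search_all(index, "red jacket")
```

The query never reads document text. It fetches two posting lists and intersects them, which is why lookup cost depends on posting list sizes and not on the total number of documents.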

Inverted Index — Documents to Terms to Posting Lists

🔤

Full-Text Search (Keyword)

Inverted index: maps terms → documents
Supports fuzzy matching, stemming, stop words
Tools: Elasticsearch, OpenSearch, Typesense, Meilisearch
Best for: product search, log search, document lookup

🧠

Vector / Semantic Search

Embedding model converts text → vector (for example, 1536 dimensions)
ANN (Approximate Nearest Neighbor) index for similarity
Tools: Pinecone, Weaviate, pgvector, Qdrant, Milvus
Best for: "find similar", RAG, recommendation, image search

Relevance Scoring — TF-IDF and BM25

TF-IDF scores documents by how often a term appears in the document (Term Frequency) weighted by how rare the term is across all documents (Inverse Document Frequency). Common terms like "the" score low because they appear everywhere. Rare domain-specific terms score high because finding them is meaningful.

BM25 (Best Matching 25) is the modern refinement of TF-IDF and the default algorithm in Elasticsearch. BM25 adds term-frequency saturation, so each repeated occurrence of a term counts for less than the one before, and document length normalization: the same term appearing in a short 50-word document is more significant than the same term in a 5,000-word document. This prevents long documents from dominating results simply because they repeat terms more.
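The length-normalization effect can be checked with a small scoring function. This is a sketch of the commonly published BM25 formula using the widely cited defaults k1 = 1.2 and b = 0.75 and a smoothed IDF; it is not Elasticsearch's implementation, and the bm25 function and sample documents are illustrative assumptions.

```python
import math

def bm25(query: list[str], doc: list[str], corpus: list[list[str]],
         k1: float = 1.2, b: float = 0.75) -> float:
    """Score one tokenized document against a tokenized query."""
    n_docs = len(corpus)
    avg_len = sum(len(d) for d in corpus) / n_docs
    score = 0.0
    for term in query:
        tf = doc.count(term)                            # term frequency in this doc
        df = sum(1 for d in corpus if term in d)        # docs containing the term
        idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
        length_norm = 1 - b + b * len(doc) / avg_len    # > 1 for longer-than-average docs
        score += idf * tf * (k1 + 1) / (tf + k1 * length_norm)
    return score

short_doc = ["kafka", "tuning", "guide"]
long_doc = ["kafka"] + ["filler"] * 49
corpus = [short_doc, long_doc, ["postgres", "indexing"]]

short_score = bm25(["kafka"], short_doc, corpus)
long_score = bm25(["kafka"], long_doc, corpus)
```

Both documents contain "kafka" exactly once, yet the 3-word document scores higher than the 50-word one. Setting b to 0 switches length normalization off and makes the two scores equal.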

Tuning relevance scoring is often more impactful than tuning infrastructure. A 10% improvement in ranking quality beats a 10% improvement in query latency for the vast majority of users.

Hybrid search combines both: keyword matching for exact terms + vector similarity for semantic meaning. Modern search systems weight both signals and merge results. This is the pattern behind modern RAG (Retrieval-Augmented Generation) systems.
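One simple way to merge the two signals is a weighted sum of normalized scores, sketched below. Production systems use a variety of fusion methods, so treat this as one possible approach; the normalize and hybrid_rank functions, the min-max scaling, and the sample scores are all assumptions made for illustration.

```python
def normalize(scores: dict[str, float]) -> dict[str, float]:
    """Min-max scale scores to [0, 1] so the two signals are comparable."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 1.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}

def hybrid_rank(keyword: dict[str, float], vector: dict[str, float],
                keyword_weight: float = 0.5) -> list[str]:
    """Rank document IDs by a weighted sum of normalized keyword and vector scores."""
    kw, vec = normalize(keyword), normalize(vector)
    combined = {
        doc: keyword_weight * kw.get(doc, 0.0)
             + (1 - keyword_weight) * vec.get(doc, 0.0)
        for doc in kw.keys() | vec.keys()
    }
    # Highest combined score first; ties broken by document ID for determinism.
    return sorted(combined, key=lambda doc: (-combined[doc], doc))

keyword_scores = {"a": 12.0, "b": 4.0}            # e.g. BM25 scores
vector_scores = {"b": 0.9, "c": 0.8, "a": 0.1}    # e.g. cosine similarities
ranking = hybrid_rank(keyword_scores, vector_scores, keyword_weight=0.3)
```

Document "c" has no keyword match at all but still ranks above "a" because the vector signal carries 70% of the weight here. The weight is a tuning knob that should be set from relevance evaluation, not guessed.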

Elasticsearch Shards and Replicas — The Mistake You Cannot Undo

The primary shard count is set at index creation and cannot be changed in place. Changing it means reindexing, or using the split or shrink APIs, which also write a new index. This is the most common Elasticsearch production mistake: starting with too few shards (often the default of 1) and hitting a wall when the index grows beyond what one shard can serve efficiently.

Primary shards distribute data across nodes. Replica shards on other nodes provide redundancy and additional read throughput. A 3-node cluster with 6 primary shards and 1 replica per shard = 12 total shards, 4 per node.

Start with a shard count that anticipates 1–2 years of index growth. A good rule: keep individual shards under 50GB. If you expect 300GB of index data after two years, start with at least 6–10 shards.
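The sizing arithmetic above is easy to script as a sanity check. The helper names and the 50GB default below simply encode the rule of thumb from the text; they are not part of any Elasticsearch API.

```python
import math

def primary_shards(expected_index_gb: float, max_shard_gb: float = 50.0) -> int:
    """Smallest primary shard count that keeps every shard at or below max_shard_gb."""
    return max(1, math.ceil(expected_index_gb / max_shard_gb))

def shards_per_node(primaries: int, replicas_per_primary: int, nodes: int) -> float:
    """Total shard copies (primaries plus replicas) spread evenly across nodes."""
    return primaries * (1 + replicas_per_primary) / nodes

minimum = primary_shards(300)            # 300GB expected after two years
per_node = shards_per_node(6, 1, 3)      # 6 primaries, 1 replica each, 3 nodes
```

The result for 300GB is the floor of the 6–10 range; choosing a number toward the upper end leaves headroom if growth outpaces the forecast.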

Cross-Reference

See Building Blocks → Search Systems for a deeper treatment of inverted indexes, analyzers, and Elasticsearch internals.

Search is a derived view, not a source of truth. The moment you treat your search index as the primary data store, you have created an unrecoverable consistency problem. Write to your primary DB. Let CDC stream changes to the search index. Never dual-write.

📋 Chapter 3 — Summary

Search is a secondary index — derived from primary data, optimized for different queries.
Sync pattern: CDC (Debezium + Kafka) → search index. Never dual-write.
Inverted index: maps terms to posting lists — no document scanning, direct lookup then set intersection.
Full-text (keyword): inverted index, fuzzy matching. Elasticsearch, Typesense.
BM25 relevance scoring: default in Elasticsearch, better than TF-IDF for most cases. Adds document length normalization. Tuning ranking beats tuning infrastructure.
ES shards: set at index creation, cannot change without full reindex. Plan shard count for 1–2 years of growth upfront. Keep individual shards under 50GB.
Vector (semantic): embeddings + ANN index. Pinecone, Weaviate, pgvector.
Hybrid search: combine keyword + vector for best-of-both results (RAG pattern).

Chapter Four

Batch vs Stream Processing

Old Data vs Fresh Data: The Fundamental Trade-off

Every data system eventually faces the same question: do you process data in large batches at scheduled intervals, or do you process it continuously as it arrives? Batch gives you high throughput and simpler correctness. Stream gives you low latency and real-time insights. Neither is universally better — the choice depends on how fresh your results need to be and what cost you are willing to pay for that freshness.

Batch Processing

Process large volumes of data at scheduled intervals
High throughput, high latency (minutes to hours)
Simple correctness: process all data, rerun if failed
Tools: Spark, Hadoop MapReduce, dbt, Airflow
Best for: reports, ML training, daily aggregations

Stream Processing

Process events continuously as they arrive
Low latency (milliseconds to seconds), lower throughput per node
Complex correctness: ordering, exactly-once, late arrivals
Tools: Kafka Streams, Flink, Spark Streaming
Best for: fraud detection, real-time dashboards, alerting

Lambda Architecture — Batch + Stream Layers

🏗️

Lambda Architecture

Batch layer: complete, correct results (recomputed nightly)
Speed layer: real-time approximate results (streaming)
Serving layer: merges both for query responses
Downside: two separate codebases to maintain

♾️

Kappa Architecture

Single streaming pipeline — no separate batch layer
Replay from Kafka log when you need to recompute
Simpler: one codebase, one data path
Requires: Kafka with sufficient retention (days/weeks)

Cross-Reference

See Building Blocks → Message Queues for Kafka as the streaming backbone — partitioning, consumer groups, and exactly-once semantics.

Why Streaming Correctness Is Harder Than Batch

Batch correctness is simple: process all the data, rerun the job if something fails. Streaming adds three problems that do not exist in batch:

Windowing: events must be grouped into time windows to compute aggregations (e.g., "orders per minute"). There are three window types. Tumbling windows are fixed and non-overlapping, e.g. 1-minute windows where every event belongs to exactly one window. Sliding windows overlap, e.g. the last 5 minutes computed every 1 minute, so one event can appear in multiple windows. Session windows are bounded by activity gaps: a window closes after a period of inactivity. The question: which window does a late-arriving event belong to?

Watermarks: the system's estimate of how late an event can arrive before its window is considered complete and results are emitted. Set too tight and late-arriving events are silently dropped. Set too loose and results are delayed unnecessarily, increasing end-to-end latency.

Exactly-once semantics: each event must affect the output exactly once even if the processor crashes and restarts mid-computation. This requires coordinated transactions between the stream processor and the output sink — non-trivial engineering. Flink and Kafka Streams support it — but with a measurable latency cost. Exactly-once is not the same as at-least-once, which allows duplicates.
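The interaction between windows and watermarks is easier to see in a toy counter. The sketch below uses tumbling windows and a simple watermark heuristic (largest event time seen so far minus an allowed lateness). The TumblingCounter class and its parameters are hypothetical, and real engines such as Flink handle state, checkpoints, and exactly-once delivery that this deliberately ignores.

```python
from collections import defaultdict

class TumblingCounter:
    """Counts events per fixed window; a window closes once the watermark passes its end."""

    def __init__(self, window_seconds: int, allowed_lateness: int) -> None:
        self.window_seconds = window_seconds
        self.allowed_lateness = allowed_lateness
        self.max_event_time = 0
        self.open_windows: dict[int, int] = defaultdict(int)
        self.emitted: dict[int, int] = {}   # window start -> final count
        self.dropped = 0

    def _watermark(self) -> int:
        # Assumed heuristic: events older than this are considered too late.
        return self.max_event_time - self.allowed_lateness

    def on_event(self, event_time: int) -> None:
        window_start = event_time - event_time % self.window_seconds
        if window_start + self.window_seconds <= self._watermark():
            self.dropped += 1   # its window was already emitted: the event is lost
            return
        self.open_windows[window_start] += 1
        self.max_event_time = max(self.max_event_time, event_time)
        for start in sorted(self.open_windows):
            if start + self.window_seconds <= self._watermark():
                self.emitted[start] = self.open_windows.pop(start)

counter = TumblingCounter(window_seconds=60, allowed_lateness=10)
for event_time in [5, 30, 65, 58, 75, 59]:   # 58 and 59 arrive out of order
    counter.on_event(event_time)
```

The event at 58 arrives after 65 but before the watermark passes the end of the first window, so it is counted. The event at 59 arrives after the watermark has reached 65, so the first window has already been emitted with a count of 3 and the event is dropped. Raising allowed_lateness would save it at the cost of emitting every window later.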

Start with batch. Add streaming when freshness is a hard business requirement, not a nice-to-have. Streaming adds operational complexity — windowing, watermarks, late arrivals, exactly-once guarantees. If nightly aggregation is good enough for the business, batch is the right answer.

📋 Chapter 4 — Summary

Batch: high throughput, high latency. Simple correctness. Best for reports, training, aggregations.
Stream: low latency, complex correctness. Best for alerts, fraud detection, real-time dashboards.
Streaming correctness challenges: windowing (tumbling/sliding/session), watermarks (late arrival estimates), exactly-once semantics. Batch avoids all of these.
Lambda: batch + stream merged at serving layer. Two codebases to maintain.
Kappa: single streaming pipeline. Replay from Kafka log for recomputation. Simpler but needs retention.
Default to batch. Add streaming only when freshness is a genuine business constraint.

Chapter Five

Data Lakes & Warehouses

Storing Everything to Analyze Anything

At some scale, your transactional database cannot serve both operational queries and analytical queries without degrading one or the other. That is when you need a separate analytical system — a data warehouse (structured, fast queries) or a data lake (raw, flexible schema). The modern answer is often a lakehouse that combines the best of both. The decision depends on team maturity, query patterns, and whether you know what questions you will ask before you store the data.

🏢

Data Warehouse

Schema-on-write: data structured before loading
Column-oriented storage reads only the columns needed for a query. A query such as SELECT month, COUNT(*) FROM orders GROUP BY month scans only the month column, not all 50 columns in each row. This can make analytical aggregations 10–100× faster than row-oriented OLTP storage for the same data volume.
SQL interface, optimized for aggregations
Tools: BigQuery, Redshift, Snowflake
Best for: known queries, BI dashboards, reports

🌊

Data Lake

Schema-on-read: store raw, interpret later
Any format: JSON, Parquet, CSV, images, video
Cheap object storage (S3, GCS, ADLS)
Tools: S3 + Spark, Databricks, EMR
Best for: ML training, unknown future queries

🏠

Data Lakehouse

Structured layer on top of a lake
ACID transactions + schema enforcement on files
SQL queries on Parquet files in object storage
Tools: Delta Lake, Apache Iceberg, Apache Hudi
Best for: combining ML + BI on one platform

Modern Data Stack — Sources to Insights

The Missing Component: Data Catalog

The data stack above is missing one critical component that most teams add too late: a data catalog. Without it, data exists in the warehouse but nobody can find it, trust it, or understand where it came from. "Where does this revenue_v3 table come from? Is it still used? Which dashboard depends on it?" — these questions consume hours of engineering time in every growing data team.

DataHub, Apache Atlas, and dbt docs are the common solutions. Treat the catalog as infrastructure — not documentation. Build lineage tracking from day one.

Cost Model: Why Lake-First Makes Sense at Scale

Object storage (S3, GCS) costs approximately $0.023 per GB per month. BigQuery on-demand pricing has been approximately $5 per TB scanned; check current pricing before budgeting. This cost asymmetry drives the lake-first strategy at scale: store everything cheaply in object storage, run warehouse compute only on the queries that need structured access.

At 10TB of data queried daily, BigQuery costs ~$50/day. The same 10TB sitting in S3 costs about $230/month, or roughly $7.67/day, just to store. The lake is not a technical choice; it is a financial one.
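Working the numbers through with the approximate prices quoted above (both are assumptions that change over time and by region), using decimal units and a 30-day month:

```python
STORAGE_USD_PER_GB_MONTH = 0.023   # approximate object-storage price from the text
QUERY_USD_PER_TB = 5.0             # approximate per-TB query price from the text

def storage_cost_per_day(terabytes: float, days_in_month: int = 30) -> float:
    """Daily cost of keeping data in object storage (1 TB = 1000 GB)."""
    return terabytes * 1000 * STORAGE_USD_PER_GB_MONTH / days_in_month

def query_cost_per_day(terabytes_scanned_daily: float) -> float:
    """Daily cost of scanning this many terabytes at per-TB query pricing."""
    return terabytes_scanned_daily * QUERY_USD_PER_TB

store_10tb = storage_cost_per_day(10)   # about 7.67 USD per day
scan_10tb = query_cost_per_day(10)      # 50 USD per day
```

Storing 10TB costs several times less per day than scanning it once per day. The two figures measure different things (capacity versus compute), so the practical takeaway is to keep data cheap at rest and be deliberate about how much of it each query scans.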

If you know the questions, use a warehouse. If you don't know the questions yet, use a lake. If you need both (and most growing companies do), a lakehouse gives you ACID transactions and SQL queries on cheap object storage — the convergence point of both worlds.

📋 Chapter 5 — Summary

Warehouse: schema-on-write, column-oriented (reads only needed columns — 10–100× faster for analytics), SQL-first. Best for known queries and BI.
Lake: schema-on-read, any format, object storage (~$0.023/GB/month). Best for ML and unknown future questions.
Lakehouse: ACID + schema enforcement on lake files (Delta Lake, Iceberg). Best of both worlds.
ELT > ETL: load raw, transform inside the warehouse. More flexible, cheaper iteration.
Modern stack: Sources → Ingest (Fivetran) → Lake/Warehouse → Transform (dbt) → Serve (BI, ML).
Data catalog: DataHub, Atlas, dbt docs. Add early — treat as infrastructure, not documentation.
Cost: S3 at $0.023/GB vs BigQuery at $5/TB queried. Lake-first is a financial choice at scale.

Chapter Six

Consistency Patterns

From Strong Consistency to Eventual Consistency and Back

In a distributed data system, consistency is not binary — it is a spectrum. "Strong consistency" and "eventual consistency" are the endpoints, but most real systems operate somewhere in between. The choice is not about technical preference — it is about what your users can tolerate. A banking system that shows a briefly wrong balance creates real financial harm. A social media feed that is 2 seconds stale creates zero harm. Both are valid consistency choices — for their context.

Consistency Spectrum — Guarantees vs Cost

Key Consistency Models Defined

✍️

Read-Your-Own-Writes

A user always sees their own most recent write, even if other users may still see stale data. You post a comment and immediately see it. Another user in a different region may see it seconds later.

This is the minimum acceptable consistency for most user-facing features. Violating it is immediately noticed: "I just submitted the form — why isn't my change showing?"

Implementation: route reads to the primary for a short window after any write from that user (for example, 60 seconds, or any value safely above your worst-case replication lag), then fall back to replica reads.
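A minimal sketch of that routing rule, assuming a single process that tracks each user's last write time in memory. The ReadRouter class, its method names, and the injectable clock are hypothetical; a real deployment would keep this state somewhere shared, such as a session cookie or a cache.

```python
import time

class ReadRouter:
    """Send a user's reads to the primary for a pinning window after their last write."""

    def __init__(self, pin_seconds: float = 60.0, clock=time.monotonic) -> None:
        self.pin_seconds = pin_seconds
        self.clock = clock
        self.last_write: dict[str, float] = {}

    def record_write(self, user_id: str) -> None:
        self.last_write[user_id] = self.clock()

    def target_for_read(self, user_id: str) -> str:
        written_at = self.last_write.get(user_id)
        if written_at is not None and self.clock() - written_at < self.pin_seconds:
            return "primary"
        return "replica"

fake_now = [0.0]  # injectable clock so the example does not need to sleep
router = ReadRouter(pin_seconds=60.0, clock=lambda: fake_now[0])
router.record_write("user-1")
right_after_write = router.target_for_read("user-1")
other_user = router.target_for_read("user-2")
fake_now[0] = 61.0
after_window = router.target_for_read("user-1")
```

Only the writing user is pinned to the primary; everyone else keeps reading from replicas, so the extra primary load is proportional to write activity and not to total traffic.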

⏩

Monotonic Reads

Once you read a value, you never read an older version of it. Prevents the confusing experience of reading newer data, refreshing, and seeing older data — as if time went backwards.

This can happen when different reads within the same session go to different replicas with different replication lag. Replica A is 100ms behind. Replica B is 2 seconds behind. Read from A then B and you see the past.

Implementation: route all reads within a session to the same replica. Session stickiness by replica ID.
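Session stickiness can be approximated with a stable hash of the session ID, as in the sketch below. The function name and replica labels are hypothetical. Python's built-in hash() is avoided because it is randomized per process for strings, which would break stickiness across application servers.

```python
import hashlib

def replica_for_session(session_id: str, replicas: list[str]) -> str:
    """Pick the same replica for every read in a session via a stable hash."""
    digest = hashlib.sha256(session_id.encode("utf-8")).digest()
    return replicas[int.from_bytes(digest[:8], "big") % len(replicas)]

replicas = ["replica-a", "replica-b", "replica-c"]
first_read = replica_for_session("session-123", replicas)
second_read = replica_for_session("session-123", replicas)
```

One caveat of this simple modulo scheme: if the replica list changes, sessions can be remapped to a replica that is further behind, so the guarantee only holds while the replica set is stable.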

🔄

Eventual Consistency

Given no new updates, all replicas converge
Time frame: milliseconds to seconds typically
Cheapest, fastest, most available
Acceptable for: counters, likes, view counts, CDN
Not acceptable for: bank balance, inventory, bookings

🔒

Strong (Linearizable)

Every read sees the most recent write — globally
Operations appear instantaneous and ordered
Highest latency, lowest availability during partition
Required for: payments, seat booking, inventory deduction
Implementation: consensus (Paxos, Raft) or single-leader

Conflict Resolution Strategies

In eventually consistent systems, concurrent writes to the same record will conflict. You need a resolution strategy decided at design time — not discovered at incident time.

⏰

Last-Write-Wins (LWW)

Latest timestamp wins. Simple but loses earlier writes silently. Depends on closely synchronized clocks. Used by Cassandra and by DynamoDB global tables (default).

🔀

Application-Level Merge

Return all conflicting versions to the app. App logic resolves (e.g., merge shopping carts). Complex but lossless. Used by CouchDB, Riak.

📝

CRDTs

Conflict-free Replicated Data Types. Mathematical guarantee of convergence without coordination. Used for counters, sets, text editing (Yjs).
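The simplest CRDT, a grow-only counter (G-Counter), shows how convergence without coordination works. Each replica increments only its own slot, and merging takes the per-slot maximum. The GCounter class below is a teaching sketch, not the API of Yjs or any other library.

```python
class GCounter:
    """Grow-only counter CRDT: one slot per node, merged with element-wise max."""

    def __init__(self, node_id: str) -> None:
        self.node_id = node_id
        self.counts: dict[str, int] = {}

    def increment(self, amount: int = 1) -> None:
        # A node only ever writes to its own slot.
        self.counts[self.node_id] = self.counts.get(self.node_id, 0) + amount

    def merge(self, other: "GCounter") -> None:
        for node, count in other.counts.items():
            self.counts[node] = max(self.counts.get(node, 0), count)

    def value(self) -> int:
        return sum(self.counts.values())

a, b = GCounter("a"), GCounter("b")
a.increment(3)   # three likes recorded on replica a
b.increment(2)   # two concurrent likes recorded on replica b
a.merge(b)
b.merge(a)
b.merge(a)       # merging again changes nothing (idempotent)
```

Both replicas end at 5 regardless of merge order or repetition. Under last-write-wins, one of the two concurrent updates would have been discarded instead.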

Cross-Reference — Go Deeper

See Distributed Systems for consensus algorithms (Paxos, Raft), vector clocks, and the theoretical foundations behind these consistency guarantees.

Choose the weakest consistency model that still satisfies your business requirements. Every step stronger on the spectrum costs latency, availability, and operational complexity. Most systems need strong consistency for payments and eventual consistency for everything else — mixing models within one system is normal and expected.

📋 Chapter 6 — Summary

Consistency is a spectrum: eventual → monotonic → read-your-writes → causal → strong.
Read-your-own-writes: user always sees their own latest write. Minimum for user-facing features. Route writes and immediate subsequent reads to primary.
Monotonic reads: never see older data after seeing newer. Route session reads to the same replica.
Choose based on business harm: stale bank balance = real harm. Stale like count = zero harm.
Conflict resolution: LWW (simple, lossy), app-level merge (complex, lossless), CRDTs (automatic).
Most systems mix models: strong for payments, eventual for counters. This is normal.
Stronger = more expensive. Higher latency, lower availability, more operational complexity.

Data at Scale — At a Glance

01 · Data Modeling

Schema Is Your Most Expensive Decision

Normalize for writes, denormalize for reads
Read:write > 10:1 → denormalize
OLTP: row-oriented, ms latency. OLAP: column-oriented, scan-heavy
Schema evolution: forward + backward compat. Protobuf field numbers protect both.
Polyglot persistence: right DB for each pattern. PostgreSQL for transactions, Redis for caching, Elasticsearch for search, S3 for files. Each extra DB is operational cost.

02 · Time-Series Data

Append-Only, Time-Range, Value Decays

Unique workload: high-velocity writes, range reads
Downsample older data: 100–1000× storage reduction
Tools: Prometheus, InfluxDB, TimescaleDB, ClickHouse
General-purpose DBs struggle — purpose-built TSDBs needed

03 · Search at Scale

Secondary Index, Not Source of Truth

CDC pattern: DB → Kafka → search index (never dual-write)
Inverted index: maps terms to posting lists — no document scanning, direct lookup then set intersection
Full-text: Elasticsearch, Typesense. Vector: Pinecone, pgvector
BM25 relevance scoring: default in ES, better than TF-IDF. Tuning ranking beats tuning infrastructure
ES shards: set at index creation, cannot change without reindex. Plan for 1–2 years of growth upfront
Hybrid: keyword + semantic for best results (RAG)

04 · Batch vs Stream

Default Batch, Stream When Required

Batch: high throughput, simple correctness (Spark, dbt)
Stream: low latency, complex correctness (Flink, Kafka Streams)
Streaming correctness challenges: windowing, watermarks, late arrivals, exactly-once semantics. Batch avoids all of these.
Lambda: batch + stream layers merged. Two codebases
Kappa: single stream, replay from log. Simpler

05 · Lakes & Warehouses

Know Questions → Warehouse. Don't → Lake

Warehouse: schema-on-write, SQL, BI (BigQuery, Snowflake)
Lake: schema-on-read, any format, cheap (S3 + Spark)
Lakehouse: ACID on files (Delta Lake, Iceberg)
ELT > ETL: load raw, transform inside warehouse

06 · Consistency Patterns

Weakest Model That Satisfies Business

Spectrum: eventual → causal → strong
Read-your-own-writes: user always sees their own latest write. Minimum for user-facing features.
Monotonic reads: never see older data after seeing newer. Route session reads to the same replica.
Choose based on business harm of stale data
Conflicts: LWW (simple), app-merge (lossless), CRDTs (auto)
Most systems mix: strong for payments, eventual for rest
