System Design Β· Data at Scale

Data at Scale

Storage, modeling, and processing when data grows beyond one machine.

01
Chapter One

Data Modeling for Scale

Schema Design Decisions That Last Decades

The data model is the most consequential decision you make in a system β€” more consequential than the programming language, more consequential than the cloud provider. A wrong schema choice at 100 users becomes an impossible migration at 100 million users. The schema determines which queries are fast, which are slow, and which are impossible without a rewrite. Most systems that fail at scale fail because of data model decisions made in the first sprint.

Normalization vs Denormalization

In a normalized schema, every fact lives in exactly one place. Updates are simple β€” change it once. But reads often require joining many tables. In a denormalized schema, data is duplicated to make reads fast. Updates become expensive β€” change it everywhere. The choice is not "which is better" β€” it is "which read and write patterns dominate your system."

Normalized (Write-Optimized)
Denormalized (Read-Optimized)
  • Every fact stored exactly once β€” no duplication
  • Updates are atomic: change in one place, visible everywhere
  • Strong data integrity through foreign keys
  • Reads require JOINs β€” expensive at scale
  • Best for: OLTP, write-heavy workloads, financial systems
  • Data duplicated across tables for fast reads
  • Updates require changing multiple places β€” fan-out writes
  • Application responsible for consistency
  • Reads are single-table lookups β€” fast at any scale
  • Best for: read-heavy systems, NoSQL, feeds, dashboards
Normalized vs Denormalized β€” The Trade-off at Scale
Normalized Schema Users id, name, email 1 row per user Posts id, user_id, text FK β†’ Users Read: JOIN users ON posts.user_id = slow at 100M rows Denormalized Schema Posts (with embedded user) id, text, user_name, user_avatar No JOIN needed β€” single read user_name duplicated in every post Read: single table scan = fast Write: user renames β†’ update ALL posts The decision framework Read:Write ratio > 10:1 β†’ denormalize. Write-heavy or strong consistency β†’ normalize.
OLTP vs OLAP β€” Two Worlds
πŸ’³

OLTP β€” Transactional

  • Serve user requests: inserts, updates, point queries
  • Normalized schema, row-oriented storage
  • Latency: milliseconds. QPS: thousands to millions
  • Tools: PostgreSQL, MySQL, DynamoDB
  • Optimized for: small, frequent reads/writes by primary key
πŸ“Š

OLAP β€” Analytical

  • Answer business questions: aggregations, scans, reports
  • Star/snowflake schema, column-oriented storage
  • Latency: seconds. QPS: few concurrent queries
  • Tools: BigQuery, Redshift, ClickHouse, Snowflake
  • Optimized for: scanning millions of rows across few columns
Schema Evolution β€” Compatibility Contracts

Forward compatibility means old code can read new data. New fields must be optional with defaults β€” if you add a field and old consumers crash when they see it, you broke forward compatibility.

Backward compatibility means new code can read old data. Do not remove or rename fields that existing consumers depend on.

Protobuf field numbers protect both: renaming a field is safe because the wire format uses numbers, not names. The consumer decodes by field number regardless of what you call the field in code. Never reuse a field number after removing a field β€” the old binary data still has that number and will be misread.

Raw JSON schema changes are dangerous because there is no enforcement mechanism. A breaking change ships silently until a downstream consumer crashes in production. Protobuf and Avro have schema registries that reject incompatible changes at publish time.
Polyglot Persistence

Polyglot persistence means using different databases for different concerns within one system β€” each chosen for the access pattern it serves best. A single PostgreSQL instance is the right answer at the start. As the system grows, certain workloads demand purpose-built storage engines that a general-purpose relational database cannot serve efficiently.

🐘

PostgreSQL

Transactional data and complex relational queries. Source of truth for structured data. ACID guarantees, foreign keys, complex joins.

⚑

Redis

Sessions, caches, rate limiting counters. Sub-millisecond reads from memory. TTL-based expiry. Not a durable store β€” data can be lost on restart.

πŸ”

Elasticsearch

Full-text search. Always a secondary index β€” never the source of truth. Synced from your primary DB via CDC. Built for inverted-index lookups and faceted filtering.

πŸ—‚οΈ

S3 / Object Storage

Files, images, videos, ML training data. Infinitely scalable at ~$0.023/GB/month. Not a database β€” no indexing, no queries. Use metadata in PostgreSQL to find objects.

πŸ“ˆ

InfluxDB / Prometheus

Time-series metrics. Append-only, high-velocity writes. Automatic downsampling. Purpose-built compression for sequential timestamps. Not suited for relational queries.

πŸ•ΈοΈ

Neo4j / Neptune

Graph data when relationships are first-class. Social graphs, recommendation engines, fraud rings. Traversals that would require dozens of JOINs in SQL run in milliseconds.

Operational cost warning: each additional database is a new system to monitor, back up, recover from failure, upgrade, and train your team on. Do not add a new database until you have exhausted what your existing database can do. Redis-like caching can be approximated with PostgreSQL's unlogged tables. Full-text search is built into PostgreSQL (tsvector). Graph queries can be done in SQL with recursive CTEs. Only add a purpose-built database when the performance or operational evidence demands it.

Your data model is your most expensive decision. It determines which queries are fast, slow, and impossible. Choose based on your dominant access pattern β€” not on textbook purity. At scale, every JOIN is a liability. Every denormalization is a consistency responsibility.

πŸ“‹ Chapter 1 β€” Summary
  • Normalization = write-optimized. One fact, one place. Reads require JOINs.
  • Denormalization = read-optimized. Duplicate data. Single-table reads. App manages consistency.
  • Decision rule: read:write > 10:1 β†’ denormalize. Write-heavy or strong consistency β†’ normalize.
  • OLTP = row-oriented, millisecond latency, normalized. OLAP = column-oriented, scan-heavy, star schema.
  • Schema evolution: forward + backward compatibility. Protobuf/Avro > raw JSON for safe evolution. Never reuse removed field numbers.
  • Polyglot persistence: use the right database for each access pattern. PostgreSQL for transactions, Redis for caching, Elasticsearch for search, S3 for files. Each additional database is an operational cost.
02
Chapter Two

Time-Series Data

When Everything Is a Timestamped Event

Time-series data is the fastest-growing category of data in modern systems. Every metric your servers emit, every reading from an IoT sensor, every financial tick, every user action log β€” all timestamped events arriving at high velocity. The workload characteristics are unique: writes are append-only, reads are time-range scans, and data value decays over time. General-purpose databases struggle with these patterns. Purpose-built time-series databases exist because this problem is that different.

Why relational databases fail at time-series: a relational database with a timestamped events table works at thousands of rows per day. At 100,000 rows per second, B-tree index updates on every insert become the bottleneck β€” the index must be rebalanced on each write to maintain sorted order across all values. TSDBs use append-only log structures and time-based partitioning that match the write pattern exactly. New data lands at the end of the log. Indexes are built per time partition, not across the entire dataset. There is no per-write index maintenance overhead.
Time-Series Data Pipeline β€” Collect to Visualize
Sources Servers IoT sensors App events Collect Telegraf / OTel Kafka buffer Store (TSDB) InfluxDB TimescaleDB Prometheus Query PromQL SQL / Flux Visualize Grafana Dashboards Downsample older data
πŸ“ˆ

Workload Characteristics

  • Append-only writes at high velocity (100K+ events/sec)
  • Reads are time-range scans: "last 24 hours" or "this week"
  • Data value decays β€” second-level detail needed for hours, not years
  • Heavy compression opportunity (timestamps are sequential)
πŸ—œοΈ

Downsampling & Retention

  • Raw data: keep 7–30 days at full resolution
  • 5-minute averages: keep 90 days
  • 1-hour rollups: keep 1–2 years
  • Reduces storage 100–1000Γ— while preserving trends
  • Automated via retention policies in TSDB
πŸ–₯️

Use Case: Infrastructure Metrics

CPU, memory, disk, network per server. 15-second intervals. Alert on anomalies. Prometheus + Grafana standard stack.

🌑️

Use Case: IoT Sensors

Temperature, pressure, GPS readings. Thousands of devices, sub-second intervals. TimescaleDB or InfluxDB.

πŸ“Š

Use Case: High-Cardinality Analytics

Ad impressions, clickstream, user events with millions of unique dimension combinations (user Γ— product Γ— time Γ— location). ClickHouse is a columnar OLAP database optimized for time-series analytics. It is significantly faster than InfluxDB for aggregation queries on high-cardinality dimensions. Used by Cloudflare, ByteDance, and Uber for analytics at extreme scale.

Time-series databases are not a luxury β€” they are a necessity at scale. A general-purpose relational DB can handle 1,000 metrics. At 100,000 metrics per second with 90-day retention, you need purpose-built compression, automatic downsampling, and time-partitioned storage. That's what TSDBs provide.

πŸ“‹ Chapter 2 β€” Summary
  • Time-series workloads: append-only writes, time-range reads, data value decays with age.
  • Why relational DBs fail: B-tree index rebalancing on every insert becomes the bottleneck at 100K+ rows/sec. TSDBs use append-only log structures and time-based partitioning with no per-write index overhead.
  • Pipeline: sources β†’ collector (OTel/Telegraf) β†’ TSDB β†’ query β†’ visualize (Grafana).
  • Downsampling: full resolution short-term, aggregated long-term. 100–1000Γ— storage reduction.
  • Tools: Prometheus (metrics), InfluxDB (general TSDB), TimescaleDB (SQL-compatible), ClickHouse (columnar OLAP for high-cardinality analytics at extreme scale).
03
Chapter Three

Search at Scale

When Your Database Can't Answer the Question

Every application eventually needs search that goes beyond primary-key lookups and simple WHERE clauses. Full-text search, fuzzy matching, faceted filtering, autocomplete, semantic similarity β€” your primary data store was not designed for these access patterns. Search is always a secondary index β€” derived from your primary data but optimized for a completely different query shape. The hard problems are not search itself β€” they are keeping the search index in sync with the source of truth.

Search Index Sync β€” The Dual-Write Problem
Application Primary DB Source of truth CDC / Events Debezium / Kafka Search Index Elasticsearch / Typesense Write Dual-write (dangerous!) Dual-write = write to DB AND search directly β†’ inconsistency when one fails CDC pattern = DB is truth β†’ changes stream to search index asynchronously β†’ safe
How an Inverted Index Works

A traditional database index maps a row identifier to its data β€” given an ID, retrieve the record. An inverted index reverses this: it maps a term to all documents containing that term. This reversal is what makes full-text search fast β€” look up the term, get the posting list of matching document IDs instantly, no document scanning required.

Inverted Index β€” Documents to Terms to Posting Lists
Documents Doc 1 "fast database scale" Doc 2 "database query fast" Doc 3 "scale systems fast" Index Terms "fast" "database" "scale" Posting Lists Doc1, Doc2, Doc3 Doc1, Doc2 Doc1, Doc3 Query "fast scale" β†’ intersect posting lists β†’ [Doc1, Doc3]
πŸ”€

Full-Text Search (Keyword)

  • Inverted index: maps terms β†’ documents
  • Supports fuzzy matching, stemming, stop words
  • Tools: Elasticsearch, OpenSearch, Typesense, Meilisearch
  • Best for: product search, log search, document lookup
🧠

Vector / Semantic Search

  • Embedding model converts text β†’ vector (1536-dim)
  • ANN (Approximate Nearest Neighbor) index for similarity
  • Tools: Pinecone, Weaviate, pgvector, Qdrant, Milvus
  • Best for: "find similar", RAG, recommendation, image search
Relevance Scoring β€” TF-IDF and BM25

TF-IDF scores documents by how often a term appears in the document (Term Frequency) weighted by how rare the term is across all documents (Inverse Document Frequency). Common terms like "the" score low because they appear everywhere. Rare domain-specific terms score high because finding them is meaningful.

BM25 (Better Match 25) is the modern refinement of TF-IDF and the default algorithm in Elasticsearch. BM25 adds document length normalization β€” the same term appearing in a short 50-word document is more significant than the same term in a 5,000-word document. This prevents long documents from dominating results simply because they repeat terms more.

Tuning relevance scoring is often more impactful than tuning infrastructure. A 10% improvement in ranking quality beats a 10% improvement in query latency for the vast majority of users.

Hybrid search combines both: keyword matching for exact terms + vector similarity for semantic meaning. Modern search systems weight both signals and merge results. This is the pattern behind modern RAG (Retrieval-Augmented Generation) systems.

Elasticsearch Shards and Replicas β€” The Mistake You Cannot Undo

Shard count is set at index creation and cannot be changed without reindexing the entire dataset. This is the most common Elasticsearch production mistake β€” starting with too few shards (often the default of 1) and hitting a wall when the index grows beyond what one shard can serve efficiently.

Primary shards distribute data across nodes. Replica shards on other nodes provide redundancy and additional read throughput. A 3-node cluster with 6 primary shards and 1 replica per shard = 12 total shards, 4 per node.

Start with a shard count that anticipates 1–2 years of index growth. A good rule: keep individual shards under 50GB. If you expect 300GB of index data after two years, start with at least 6–10 shards.
Cross-Reference

See Building Blocks β†’ Search Systems for a deeper treatment of inverted indexes, analyzers, and Elasticsearch internals.

Search is a derived view, not a source of truth. The moment you treat your search index as the primary data store, you have created an unrecoverable consistency problem. Write to your primary DB. Let CDC stream changes to the search index. Never dual-write.

πŸ“‹ Chapter 3 β€” Summary
  • Search is a secondary index β€” derived from primary data, optimized for different queries.
  • Sync pattern: CDC (Debezium + Kafka) β†’ search index. Never dual-write.
  • Inverted index: maps terms to posting lists β€” no document scanning, direct lookup then set intersection.
  • Full-text (keyword): inverted index, fuzzy matching. Elasticsearch, Typesense.
  • BM25 relevance scoring: default in Elasticsearch, better than TF-IDF for most cases. Adds document length normalization. Tuning ranking beats tuning infrastructure.
  • ES shards: set at index creation, cannot change without full reindex. Plan shard count for 1–2 years of growth upfront. Keep individual shards under 50GB.
  • Vector (semantic): embeddings + ANN index. Pinecone, Weaviate, pgvector.
  • Hybrid search: combine keyword + vector for best-of-both results (RAG pattern).
04
Chapter Four

Batch vs Stream Processing

Old Data vs Fresh Data: The Fundamental Trade-off

Every data system eventually faces the same question: do you process data in large batches at scheduled intervals, or do you process it continuously as it arrives? Batch gives you high throughput and simpler correctness. Stream gives you low latency and real-time insights. Neither is universally better β€” the choice depends on how fresh your results need to be and what cost you are willing to pay for that freshness.

Batch Processing
Stream Processing
  • Process large volumes of data at scheduled intervals
  • High throughput, high latency (minutes to hours)
  • Simple correctness: process all data, rerun if failed
  • Tools: Spark, Hadoop MapReduce, dbt, Airflow
  • Best for: reports, ML training, daily aggregations
  • Process events continuously as they arrive
  • Low latency (milliseconds to seconds), lower throughput per node
  • Complex correctness: ordering, exactly-once, late arrivals
  • Tools: Kafka Streams, Flink, Spark Streaming
  • Best for: fraud detection, real-time dashboards, alerting
Lambda Architecture β€” Batch + Stream Layers
Events All data in Batch Layer Spark / MapReduce Complete, correct, slow Speed Layer Kafka Streams / Flink Approximate, fast Serving Layer Merge batch + speed Serve to queries Kappa Architecture: eliminate batch layer β€” single streaming pipeline for everything
πŸ—οΈ

Lambda Architecture

  • Batch layer: complete, correct results (recomputed nightly)
  • Speed layer: real-time approximate results (streaming)
  • Serving layer: merges both for query responses
  • Downside: two separate codebases to maintain
♾️

Kappa Architecture

  • Single streaming pipeline β€” no separate batch layer
  • Replay from Kafka log when you need to recompute
  • Simpler: one codebase, one data path
  • Requires: Kafka with sufficient retention (days/weeks)
Cross-Reference

See Building Blocks β†’ Message Queues for Kafka as the streaming backbone β€” partitioning, consumer groups, and exactly-once semantics.

Why Streaming Correctness Is Harder Than Batch

Batch correctness is simple: process all the data, rerun the job if something fails. Streaming adds three problems that do not exist in batch:

Windowing: events must be grouped into time windows to compute aggregations (e.g., "orders per minute"). Three window types β€” tumbling (fixed non-overlapping, e.g. 1-minute windows where every event belongs to exactly one window), sliding (overlapping, e.g. last 5 minutes computed every 1 minute β€” one event can appear in multiple windows), session (activity-based gaps, a window closes after a period of inactivity). The question: which window does a late-arriving event belong to?

Watermarks: the system's estimate of how late an event can arrive before its window is considered complete and results are emitted. Set too tight and late-arriving events are silently dropped. Set too loose and results are delayed unnecessarily, increasing end-to-end latency.

Exactly-once semantics: each event must affect the output exactly once even if the processor crashes and restarts mid-computation. This requires coordinated transactions between the stream processor and the output sink β€” non-trivial engineering. Flink and Kafka Streams support it β€” but with a measurable latency cost. Exactly-once is not the same as at-least-once, which allows duplicates.

Start with batch. Add streaming when freshness is a hard business requirement, not a nice-to-have. Streaming adds operational complexity β€” windowing, watermarks, late arrivals, exactly-once guarantees. If nightly aggregation is good enough for the business, batch is the right answer.

πŸ“‹ Chapter 4 β€” Summary
  • Batch: high throughput, high latency. Simple correctness. Best for reports, training, aggregations.
  • Stream: low latency, complex correctness. Best for alerts, fraud detection, real-time dashboards.
  • Streaming correctness challenges: windowing (tumbling/sliding/session), watermarks (late arrival estimates), exactly-once semantics. Batch avoids all of these.
  • Lambda: batch + stream merged at serving layer. Two codebases to maintain.
  • Kappa: single streaming pipeline. Replay from Kafka log for recomputation. Simpler but needs retention.
  • Default to batch. Add streaming only when freshness is a genuine business constraint.
05
Chapter Five

Data Lakes & Warehouses

Storing Everything to Analyze Anything

At some scale, your transactional database cannot serve both operational queries and analytical queries without degrading one or the other. That is when you need a separate analytical system β€” a data warehouse (structured, fast queries) or a data lake (raw, flexible schema). The modern answer is often a lakehouse that combines the best of both. The decision depends on team maturity, query patterns, and whether you know what questions you will ask before you store the data.

🏒

Data Warehouse

  • Schema-on-write: data structured before loading
  • Column-oriented storage reads only the columns needed for a query. A COUNT(orders) BY month query scans only the month column and the orders column β€” not all 50 columns in the row. This makes analytical aggregations 10–100Γ— faster than row-oriented OLTP storage for the same data volume.
  • SQL interface, optimized for aggregations
  • Tools: BigQuery, Redshift, Snowflake
  • Best for: known queries, BI dashboards, reports
🌊

Data Lake

  • Schema-on-read: store raw, interpret later
  • Any format: JSON, Parquet, CSV, images, video
  • Cheap object storage (S3, GCS, ADLS)
  • Tools: S3 + Spark, Databricks, EMR
  • Best for: ML training, unknown future queries
🏠

Data Lakehouse

  • Structured layer on top of a lake
  • ACID transactions + schema enforcement on files
  • SQL queries on Parquet files in object storage
  • Tools: Delta Lake, Apache Iceberg, Apache Hudi
  • Best for: combining ML + BI on one platform
Modern Data Stack β€” Sources to Insights
Sources App DBs SaaS APIs Events Files Ingest Fivetran Airbyte Kafka Storage Lake (S3/GCS) Warehouse (BigQuery/Snowflake) Lakehouse (Iceberg) Transform dbt Spark SQL models Serve BI (Looker/Metabase) ML training Reverse ETL Dashboards ELT pattern: load raw first, transform inside the warehouse vs ETL: transform before loading (older pattern, less flexible)
The Missing Component: Data Catalog

The data stack above is missing one critical component that most teams add too late: a data catalog. Without it, data exists in the warehouse but nobody can find it, trust it, or understand where it came from. "Where does this revenue_v3 table come from? Is it still used? Which dashboard depends on it?" β€” these questions consume hours of engineering time in every growing data team.

DataHub, Apache Atlas, and dbt docs are the common solutions. Treat the catalog as infrastructure β€” not documentation. Build lineage tracking from day one.
Cost Model: Why Lake-First Makes Sense at Scale

Object storage (S3, GCS) costs approximately $0.023 per GB per month. BigQuery charges approximately $5 per TB queried. This cost asymmetry drives the lake-first strategy at scale β€” store everything cheaply in object storage, run warehouse compute only on the queries that need structured access.

At 10TB of data queried daily, BigQuery costs ~$50/day. The same data sitting in S3 costs $0.23/day just to store. The lake is not a technical choice β€” it is a financial one.

If you know the questions, use a warehouse. If you don't know the questions yet, use a lake. If you need both (and most growing companies do), a lakehouse gives you ACID transactions and SQL queries on cheap object storage β€” the convergence point of both worlds.

πŸ“‹ Chapter 5 β€” Summary
  • Warehouse: schema-on-write, column-oriented (reads only needed columns β€” 10–100Γ— faster for analytics), SQL-first. Best for known queries and BI.
  • Lake: schema-on-read, any format, object storage (~$0.023/GB/month). Best for ML and unknown future questions.
  • Lakehouse: ACID + schema enforcement on lake files (Delta Lake, Iceberg). Best of both worlds.
  • ELT > ETL: load raw, transform inside the warehouse. More flexible, cheaper iteration.
  • Modern stack: Sources β†’ Ingest (Fivetran) β†’ Lake/Warehouse β†’ Transform (dbt) β†’ Serve (BI, ML).
  • Data catalog: DataHub, Atlas, dbt docs. Add early β€” treat as infrastructure, not documentation.
  • Cost: S3 at $0.023/GB vs BigQuery at $5/TB queried. Lake-first is a financial choice at scale.
06
Chapter Six

Consistency Patterns

From Strong Consistency to Eventual Consistency and Back

In a distributed data system, consistency is not binary β€” it is a spectrum. "Strong consistency" and "eventual consistency" are the endpoints, but most real systems operate somewhere in between. The choice is not about technical preference β€” it is about what your users can tolerate. A banking system that shows a briefly wrong balance creates real financial harm. A social media feed that is 2 seconds stale creates zero harm. Both are valid consistency choices β€” for their context.

Consistency Spectrum β€” Guarantees vs Cost
Eventual Monotonic Reads Read-Your- Writes Causal Strong DNS, CDN Like counts Session data Feeds User profile Comments Chat messages Collab editing Bank balance Inventory ← Faster, more available, cheaper Stronger guarantees, higher latency β†’ Choose the weakest model that still satisfies business requirements
Key Consistency Models Defined
✍️

Read-Your-Own-Writes

A user always sees their own most recent write, even if other users may still see stale data. You post a comment and immediately see it. Another user in a different region may see it seconds later.

This is the minimum acceptable consistency for most user-facing features. Violating it is immediately noticed: "I just submitted the form β€” why isn't my change showing?"

Implementation: route reads to the primary for a short window (typically 60 seconds) after any write from that user, then fall back to replica reads.

⏩

Monotonic Reads

Once you read a value, you never read an older version of it. Prevents the confusing experience of reading newer data, refreshing, and seeing older data β€” as if time went backwards.

This can happen when different reads within the same session go to different replicas with different replication lag. Replica A is 100ms behind. Replica B is 2 seconds behind. Read from A then B and you see the past.

Implementation: route all reads within a session to the same replica. Session stickiness by replica ID.

πŸ”„

Eventual Consistency

  • Given no new updates, all replicas converge
  • Time frame: milliseconds to seconds typically
  • Cheapest, fastest, most available
  • Acceptable for: counters, likes, view counts, CDN
  • Not acceptable for: bank balance, inventory, bookings
πŸ”’

Strong (Linearizable)

  • Every read sees the most recent write β€” globally
  • Operations appear instantaneous and ordered
  • Highest latency, lowest availability during partition
  • Required for: payments, seat booking, inventory deduction
  • Implementation: consensus (Paxos, Raft) or single-leader
Conflict Resolution Strategies

In eventually consistent systems, concurrent writes to the same record will conflict. You need a resolution strategy decided at design time β€” not discovered at incident time.

⏰

Last-Write-Wins (LWW)

Latest timestamp wins. Simple but loses earlier writes silently. Requires synchronized clocks. Used by Cassandra, DynamoDB (default).

πŸ”€

Application-Level Merge

Return all conflicting versions to the app. App logic resolves (e.g., merge shopping carts). Complex but lossless. Used by CouchDB, Riak.

πŸ“

CRDTs

Conflict-free Replicated Data Types. Mathematical guarantee of convergence without coordination. Used for counters, sets, text editing (Yjs).

Cross-Reference β€” Go Deeper

See Distributed Systems for consensus algorithms (Paxos, Raft), vector clocks, and the theoretical foundations behind these consistency guarantees.

Choose the weakest consistency model that still satisfies your business requirements. Every step stronger on the spectrum costs latency, availability, and operational complexity. Most systems need strong consistency for payments and eventual consistency for everything else β€” mixing models within one system is normal and expected.

πŸ“‹ Chapter 6 β€” Summary
  • Consistency is a spectrum: eventual β†’ monotonic β†’ read-your-writes β†’ causal β†’ strong.
  • Read-your-own-writes: user always sees their own latest write. Minimum for user-facing features. Route writes and immediate subsequent reads to primary.
  • Monotonic reads: never see older data after seeing newer. Route session reads to the same replica.
  • Choose based on business harm: stale bank balance = real harm. Stale like count = zero harm.
  • Conflict resolution: LWW (simple, lossy), app-level merge (complex, lossless), CRDTs (automatic).
  • Most systems mix models: strong for payments, eventual for counters. This is normal.
  • Stronger = more expensive. Higher latency, lower availability, more operational complexity.
Data at Scale β€” At a Glance
01 Β· Data Modeling

Schema Is Your Most Expensive Decision

  • Normalize for writes, denormalize for reads
  • Read:write > 10:1 β†’ denormalize
  • OLTP: row-oriented, ms latency. OLAP: column-oriented, scan-heavy
  • Schema evolution: forward + backward compat. Protobuf field numbers protect both.
  • Polyglot persistence: right DB for each pattern. PostgreSQL for transactions, Redis for caching, Elasticsearch for search, S3 for files. Each extra DB is operational cost.
02 Β· Time-Series Data

Append-Only, Time-Range, Value Decays

  • Unique workload: high-velocity writes, range reads
  • Downsample older data: 100–1000Γ— storage reduction
  • Tools: Prometheus, InfluxDB, TimescaleDB, ClickHouse
  • General-purpose DBs struggle β€” purpose-built TSDBs needed
03 Β· Search at Scale

Secondary Index, Not Source of Truth

  • CDC pattern: DB β†’ Kafka β†’ search index (never dual-write)
  • Inverted index: maps terms to posting lists β€” no document scanning, direct lookup then set intersection
  • Full-text: Elasticsearch, Typesense. Vector: Pinecone, pgvector
  • BM25 relevance scoring: default in ES, better than TF-IDF. Tuning ranking beats tuning infrastructure
  • ES shards: set at index creation, cannot change without reindex. Plan for 1–2 years of growth upfront
  • Hybrid: keyword + semantic for best results (RAG)
04 Β· Batch vs Stream

Default Batch, Stream When Required

  • Batch: high throughput, simple correctness (Spark, dbt)
  • Stream: low latency, complex correctness (Flink, Kafka Streams)
  • Streaming correctness challenges: windowing, watermarks, late arrivals, exactly-once semantics. Batch avoids all of these.
  • Lambda: batch + stream layers merged. Two codebases
  • Kappa: single stream, replay from log. Simpler
05 Β· Lakes & Warehouses

Know Questions β†’ Warehouse. Don't β†’ Lake

  • Warehouse: schema-on-write, SQL, BI (BigQuery, Snowflake)
  • Lake: schema-on-read, any format, cheap (S3 + Spark)
  • Lakehouse: ACID on files (Delta Lake, Iceberg)
  • ELT > ETL: load raw, transform inside warehouse
06 Β· Consistency Patterns

Weakest Model That Satisfies Business

  • Spectrum: eventual β†’ causal β†’ strong
  • Read-your-own-writes: user always sees their own latest write. Minimum for user-facing features.
  • Monotonic reads: never see older data after seeing newer. Route session reads to the same replica.
  • Choose based on business harm of stale data
  • Conflicts: LWW (simple), app-merge (lossless), CRDTs (auto)
  • Most systems mix: strong for payments, eventual for rest