System Design ยท Case Studies

Case Study: Search Engine

Design, trade-offs, and alternatives for a search engine at scale.

01
Chapter One

Problem Statement

What We Are Building

A search engine crawls the web, builds an index of every page's content, and serves relevant results for arbitrary text queries in under 500 milliseconds. The scale is staggering: 100B+ pages indexed, 8.5B searches per day (~100K queries/sec), and new content must be discoverable within minutes of publication. The core challenge is not finding a needle in a haystack โ€” it is ranking the 10 most relevant needles from 100 billion candidates, faster than the user can blink.

Scale Requirements

Traffic & Scale

  • 8.5B searches/day (~100K queries/sec)
  • 100B+ web pages indexed
  • Index size: 100+ PB (compressed inverted index)
  • Crawl rate: 1B+ pages/day for freshness

Requirements

  • Query latency: <500ms end-to-end (p99)
  • Freshness: breaking news indexed in <5 minutes
  • Relevance: top-10 results satisfy user intent >90%
  • Availability: 99.99% โ€” search is the gateway to the internet

Search is two completely separate systems: the indexing pipeline and the query serving system. Indexing is a massive batch/stream processing problem (crawl โ†’ parse โ†’ index). Query serving is a low-latency distributed query problem (parse query โ†’ scatter to shards โ†’ gather โ†’ rank โ†’ return). They share only the index data structure. Designing them as one system is the same mistake as combining video upload with video playback.

๐Ÿ“‹ Chapter 1 โ€” Summary
  • 8.5B searches/day across 100B+ indexed pages. 100K queries/sec.
  • Two systems: indexing pipeline (batch) and query serving (low-latency).
  • Sub-500ms latency. Freshness under 5 minutes. Relevance >90% satisfaction.
  • The inverted index is the fundamental data structure that makes this possible.
02
Chapter Two

Questions to Ask

Clarifying Before Designing
๐Ÿ”

Query Types

  • Full-text keyword search or structured filters?
  • Boolean (AND/OR/NOT) or natural language?
  • Phrase matching ("exact phrase")?
  • Fuzzy/typo-tolerant? (did you mean...)
๐Ÿ“ฅ

Indexing Strategy

  • Batch indexing (hours) or real-time (seconds)?
  • Multi-language support? (tokenization differs)
  • Structured metadata + full-text hybrid?
  • Document updates: re-index or delta update?
๐Ÿ‘ค

User Experience

  • Autocomplete/typeahead suggestions?
  • Personalized results (per-user ranking)?
  • Faceting/filtering (category, price, date)?
  • Highlighting matched terms in results?

Real-time vs batch indexing changes the entire architecture. Batch indexing (rebuild index every N hours) is simpler โ€” build offline, swap atomically. Real-time indexing (document searchable in seconds) requires append-to-live-index, segment merging, and handling queries against partially-built index segments. Most production systems use near-real-time: batch for the base index, real-time for recent documents merged on query.

For This Case Study, Our Answers Are:

  • Search scope: web search (full internet crawl, not internal/enterprise docs)
  • Query types: natural language + boolean operators + phrase matching
  • Fuzzy/typo-tolerance: yes โ€” edit distance โ‰ค 2 for terms under 8 characters
  • Indexing model: near-real-time hybrid โ€” batch base index + real-time segment append
  • Freshness target: breaking news indexed in <5 minutes; long-tail in <24 hours
  • Personalization: yes โ€” per-user click history influences ranking
  • Autocomplete: yes โ€” separate prefix trie, not covered in this design
  • Multi-language: yes โ€” separate tokenizers per language (50+ languages)
  • Result features: snippets, image thumbnails, Knowledge Graph panels
  • Query latency SLA: p50 <100ms, p99 <500ms
๐Ÿ“‹ Chapter 2 โ€” Summary
  • Query types (boolean, phrase, fuzzy) determine index data structures needed.
  • Real-time vs batch indexing: freshness vs simplicity trade-off.
  • Multi-language: different tokenizers, stemmers, stop words per language.
  • Autocomplete needs a separate data structure (prefix trie or edge-ngrams).
03
Chapter Three

Naive Design

Single Node Full-Text Index

The simplest design: a single Elasticsearch node indexing all documents into one inverted index. queries go to that one server, which scans the index, scores results using BM25, and returns ranked results. Works beautifully for 1M documents. At 100B documents, the index is 100+ PB โ€” it doesn't fit on any single machine. Even at 10M documents, query latency degrades as the index grows, because scoring must traverse longer posting lists for common terms.

Naive Design โ€” Single Elasticsearch Node
User Query Single ES Node Inverted Index: "the" โ†’ [doc1, doc2, doc3 ... doc50M] 50M entries โ€” SLOW "python" โ†’ [d1,d7,d42] 3 entries โ€” fast BM25 scoring: O(posting_list_length) Common terms = long traversal = slow Local Disk All docs on 1 machine 100B docs = 100+ PB index. Doesn't fit on one machine. Long posting lists for common terms โ†’ latency degrades. SPOF. Common query terms ("the", "how", "best") have posting lists with millions of entries โ€” scoring each is O(posting_list_length) per query term. Query "how to best..." โ†’ 50M + 40M + 30M entries traversed. Seconds, not milliseconds.
โœ…

What Works

  • Simple โ€” one node, standard Elasticsearch
  • Full-text search with BM25 scoring out-of-box
  • Faceting, filtering, highlighting built-in
  • Great for <10M documents
๐Ÿ’ฅ

What Breaks

  • Index too large for single machine at scale
  • Query latency degrades with index size
  • Single point of failure โ€” node dies = no search
  • No separation of indexing from serving (compete for resources)
  • Scaling: only vertical (bigger machine)
๐Ÿ“‹ Chapter 3 โ€” Summary
  • Single node: works for small datasets. Index doesn't fit at web scale.
  • Latency degrades as posting lists grow. Common terms = slow queries.
  • Indexing and serving on same node compete for CPU/IO resources.
04
Chapter Four

Refined Design

Sharded Inverted Index with Scatter-Gather Query

The refined design partitions the inverted index across hundreds of shards. Each shard holds a subset of documents and can answer queries independently. A query coordinator receives user queries, broadcasts them to all shards in parallel (scatter), collects top-K results from each (gather), merges and re-ranks globally, then returns the final top-10. The indexing pipeline (crawler โ†’ parser โ†’ indexer) runs separately, building index segments that are pushed to serving shards.

Refined Design โ€” Distributed Search Architecture
Query Serving Path User Coordinator parse query query plan scatter โ†’ gather โ†’ rank Shard 1 Shard 2 Shard N parallel, simultaneous top-K results Latency = slowest shard Replicas Shard 1' Shard 2' Shard N' sync repl. failover / load balance Indexing Pipeline Web Crawler Parser extract text, links, meta tokenize + normalize Indexer Index Segments segment push to serving shards (no downtime) Query: User โ†’ Coordinator โ†’ Scatter to all shards (parallel) โ†’ Gather top-K โ†’ Merge + re-rank โ†’ Return top-10 Index: Crawler โ†’ Parser โ†’ Indexer โ†’ Segments pushed to serving shards Each shard: independent inverted index. Replicas for HA + read throughput + hedging.
๐Ÿ”

Query Path (Scatter-Gather)

  • 1. User query โ†’ coordinator parses and plans
  • 2. Scatter: send query to all shards in parallel
  • 3. Each shard: search local inverted index, return top-K
  • 4. Gather: coordinator merges Kร—N results, re-ranks globally
  • 5. Return top-10 with snippets and highlighting
  • Latency = slowest shard response (tail latency matters)
๐Ÿ“ฅ

Indexing Pipeline

  • Crawler: discover new/updated pages (BFS, priority queue)
  • Parser: extract text, links, metadata from HTML
  • Indexer: build inverted index segments (term โ†’ doc_ids)
  • Segment push: new segments deployed to serving shards
  • Near-real-time: fresh content searchable in minutes
Inverted Index โ€” The Core Data Structure
Documents: Doc 1: "fast Python tutorial" Doc 2: "Python web framework guide" Doc 3: "fast car review" Inverted Index: Term Posting List (doc_ids) "fast" โ†’ [doc1, doc3] "python" โ†’ [doc1, doc2] "tutorial" โ†’ [doc1] "web" โ†’ [doc2] "car" โ†’ [doc3] ... Query: "python fast" โ‘  Lookup "python" โ†’ [doc1, doc2] โ‘ก Lookup "fast" โ†’ [doc1, doc3] โ‘ข Intersection โ†’ [doc1] โœ“ "fast Python tutorial" โ€” matches! Only Doc 1 contains BOTH terms. Query = intersection of posting lists. No full-text scan of documents needed. This is O(posting_list_length) โ€” not O(total_documents). The index makes search fast.

Tail latency is the enemy of scatter-gather search. If you scatter to 100 shards and one is slow (GC pause, disk IO spike), your entire query is slow. At p99, you are waiting for the slowest shard's p99. Solutions: hedged requests (send to replica if primary is slow), speculative execution, and aggressive timeout + fallback to partial results.

Hedged Requests โ€” Eliminating Tail Latency
Without Hedging: t=0 t=100ms t=450ms 500ms SLA Send to Shard 1 GC pause โ€” shard unresponsive resp 450ms โ€” barely within SLA โš ๏ธ With Hedging (100ms threshold): t=0 t=100ms t=120ms hedge threshold Shard 1 (primary) GC pause (cancelled) Shard 1' (replica) responds at 120ms โœ“ Query complete in 120ms. Saved 330ms.
๐Ÿ“‹ Chapter 4 โ€” Summary
  • Sharded inverted index: each shard holds a subset of documents independently.
  • Scatter-gather query: coordinator fans out to all shards, merges results.
  • Indexing pipeline separate from serving: crawler โ†’ parser โ†’ indexer โ†’ segment push.
  • Tail latency problem: slowest shard determines query latency. Hedge requests.
05
Chapter Five

Alternative Approaches

Retrieval & Ranking Strategies
Inverted Index (BM25)
Vector Embeddings (Semantic)
  • Term โ†’ document_ids posting list. Exact keyword matching.
  • BM25 scoring: term frequency ร— inverse doc frequency
  • Fast: O(posting_list_length) per term
  • No understanding of meaning: "car" doesn't match "automobile"
  • Well-understood, 30+ years of optimization
  • Used by: Google (base layer), Elasticsearch, Solr
  • Document โ†’ dense vector (768-dim). Query โ†’ vector. Nearest neighbor.
  • Semantic understanding: "car" matches "automobile" naturally
  • Requires ML model to encode documents and queries
  • Approximate Nearest Neighbor (ANN): HNSW, IVF, ScaNN
  • Higher latency and cost than inverted index at scale
  • Used by: Google (semantic layer), Pinecone, Weaviate
Batch Indexing
Real-Time Indexing
  • Build complete index offline, swap atomically to serving
  • Simple: no concurrent read/write on same index
  • Freshness: new content waits for next build (hours)
  • Consistent: all queries see same index version
  • Good for: catalogs, reference data, static content
  • Append to live index immediately. Searchable in seconds.
  • Segment-based: new docs go to small segments, merged later
  • Complex: queries span old + new segments simultaneously
  • Background merge: compact small segments into large ones
  • Good for: news, social media, e-commerce (inventory)
Real-Time Indexing โ€” Segment-Based Architecture
New document arrives In-memory buffer (searchable immediately) ~1s flush seg_a seg_b seg_c seg_d seg_e Small segs โ†“ background merge Medium seg (a+b) Medium seg (d+e) ~10s of docs โ†“ merge Large segment (1000s of docs) Base Index Segment Queries search ALL segments simultaneously (old + new). This is how Lucene/Elasticsearch achieve NRT โ€” searchable in ~1 second.
Hybrid Search Pipeline โ€” Production Standard
Query: "best Python web framework" Stage 1: Inverted Index (BM25) Fast keyword recall โ€” posting list intersection ~10ms โ†’ 1,000 candidate documents Stage 2: Lightweight Scorer BM25 score + freshness + authority signals ~50ms โ†’ 100 candidates Stage 3: Vector Re-Ranker (ML) Query embedding โ†” doc embeddings ~200ms โ†’ 10 final results Return to user (<500ms total)

Hybrid retrieval is now the production standard for any serious search system. Pure keyword search misses synonyms and intent ("cheap" vs "affordable", "heart attack" vs "myocardial infarction"). Pure vector search misses exact matches and is more expensive. The hybrid approach runs both in parallel โ€” or in cascade โ€” and combines scores. Google, Bing, and all major enterprise search platforms use this approach. If you are building search today, assume hybrid retrieval from day one.

๐Ÿ“‹ Chapter 5 โ€” Summary
  • Inverted index: fast, exact keyword matching. 30 years of optimization.
  • Vector search: semantic understanding, higher cost. Good for meaning-based queries.
  • Hybrid (production standard): inverted index for recall, vectors for re-ranking.
  • Real-time indexing: segment-based append. Freshness in seconds.
06
Chapter Six

What Real Companies Did

Production Search Systems
๐Ÿ”

Google

  • 100B+ pages. Caffeine: near-real-time incremental indexing
  • PageRank: link-based authority scoring (original breakthrough)
  • BERT/MUM: neural language understanding for query intent
  • Knowledge Graph: structured answers without clicking results
  • Custom hardware (TPUs) for ML inference in serving path
๐Ÿ”Ž

Elasticsearch

  • Built on Apache Lucene: inverted index + BM25
  • Shard + replica model: distribute index across cluster
  • Near-real-time: documents searchable in ~1 second
  • Rich query DSL: boolean, phrase, fuzzy, aggregations
  • Used by: GitHub, Netflix, Uber, Wikipedia (search)
โšก

Algolia

  • Search-as-a-service: hosted, <50ms globally
  • Typo-tolerance built-in (edit distance at index time)
  • Edge search: replicate index to 70+ PoPs worldwide
  • Instant search: results update as user types each character
  • Used by: Stripe docs, Twitch, Lacoste, Medium
๐ŸŒ

Bing

  • Deep learning ranking models (transformer-based)
  • FPGA acceleration: custom hardware for inference in serving path
  • Multi-stage ranking: BM25 โ†’ lightweight ML โ†’ heavy ML
  • Conversational search integration (Copilot/ChatGPT)
  • Index freshness: minutes for news, hours for long-tail
Production Search Systems โ€” Comparison
System Index Type Ranking Model Special Pattern Scale Google Inverted + vector (hybrid) PageRank โ†’ BERT/MUM neural ranking Caffeine real-time index, Knowledge Graph 100B+ pages, 8.5B searches/day Elastic- search Lucene inverted index BM25 + custom scoring Shard/replica, rich DSL, NRT (1s) GitHub, Netflix, Uber Algolia Inverted + typo- tolerance Custom relevance + business rules Edge-replicated 70+ PoPs, instant as-you-type <50ms globally Bing Inverted + deep learning Multi-stage: BM25 โ†’ light ML โ†’ transformer FPGA inference accel., conversational search Billions of queries/day
๐Ÿ“‹ Chapter 6 โ€” Summary
  • Google: Caffeine (real-time), PageRank + BERT/MUM neural ranking, Knowledge Graph.
  • Elasticsearch: Lucene-based, shard/replica, near-real-time. De facto standard for app search.
  • Algolia: edge-replicated, typo-tolerant, <50ms instant search.
  • Bing: multi-stage ranking with FPGA-accelerated ML inference.
07
Chapter Seven

Best Practices Extracted

Transferable Lessons
๐Ÿ“Š

Index Partitioning

  • Document-partitioned: each shard holds complete docs (standard)
  • Term-partitioned: each shard holds some terms for ALL docs (rare)
  • Document-partitioned wins: each shard scores independently
  • Re-shard by splitting: subdivide existing shards as data grows
  • Transfers to: any distributed data store partitioning
๐Ÿ”€

Multi-Stage Ranking

  • Stage 1: Retrieval โ€” fast, recall-focused (inverted index, top-1000)
  • Stage 2: Lightweight scoring โ€” BM25 + simple features (top-100)
  • Stage 3: Heavy ML model โ€” deep ranking (top-10)
  • Each stage trades precision for cost reduction
  • Transfers to: recommendations, ad ranking, feed ranking
๐Ÿ’พ

Query Result Caching

  • Cache top results for popular queries (20% of queries = 80% of volume)
  • Short TTL: 5-15 min (index changes, freshness matters)
  • Cache key: normalized query + filters + page
  • Invalidate on index refresh for affected terms
  • Transfers to: any expensive computation with repeating inputs
Multi-Stage Ranking โ€” Candidate Funnel
Stage 1 โ€” Retrieval (Inverted Index) 100,000 candidates ~10ms (index lookup) Stage 2 โ€” Lightweight Scoring (BM25) 1,000 ~50ms (score 1K docs) Stage 3 โ€” ML Re-ranking (neural) 100 ~200ms (ML inference) Stage 4 โ€” Final 10 ~10ms (sort + filter) Each stage reduces candidates by 10-100x. Heavy ML only runs on top 100 โ€” not 100K. Total: ~270ms within 500ms SLA. 99.9% of candidates eliminated before expensive ML scoring.
๐Ÿ“‹ Chapter 7 โ€” Summary
  • Document-partitioned: standard approach. Each shard scores independently.
  • Multi-stage ranking: retrieval โ†’ lightweight scoring โ†’ heavy ML. Funnel reduces cost.
  • Query caching: popular queries cached with short TTL. Invalidate on index refresh.
  • Hedged requests: if primary shard slow after 100ms, query replica in parallel. First response wins. Eliminates tail latency from GC pauses.
08
Chapter Eight

What Could Go Wrong

Common Failure Patterns
๐Ÿ’€

Query of Death

  • User enters expensive regex or wildcard (e.g., *a*b*c*)
  • Explodes into millions of term lookups per shard
  • One query consumes all CPU on every shard simultaneously
  • Fix: query complexity limits, wildcard restrictions, per-query timeout (hard kill at 5s), query cost estimation before execution.
Query of Death โ€” Wildcard Explosion
Normal query: "python tutorial" "python" โ†’ posting list: 50K entries "tutorial" โ†’ posting list: 10K entries Total: ~60K ops. Fast. โœ“ O(posting_list) per term. Bounded. Wildcard: *a*b*c* Must expand to ALL matching terms: "abc" + "abstract" + "alphabet" + ... โ†’ matches 100,000+ terms in the index โ†’ each term has its own posting list โ†’ 100K lists ร— avg 50K entries each = 5 BILLION operations. โœ• CPU maxed on every shard. Entire cluster unresponsive. Fix: reject queries with leading wildcards. Max 3 wildcard terms. Hard kill at 5s.
๐Ÿ“‰

Relevance Degradation

  • Ranking signals become stale (click data from months ago)
  • SEO spam games the ranking algorithm gradually
  • Users see poor results but don't report โ€” they just leave
  • Fix: continuous ranking evaluation (human raters + metrics), fresh click signals, anti-spam ML models, A/B testing all ranking changes.
๐Ÿ’พ

Index Corruption

  • Bad indexing code writes corrupted segment to one shard
  • That shard returns wrong results or crashes on certain queries
  • Affects 1/N of all queries โ€” randomly, hard to reproduce
  • Fix: index checksums, segment-level validation, rollback capability (keep previous segment version), canary indexing.
Safe Index Deployment โ€” Segment Versioning
t=0: Serving Segment v47 (current, stable) t=1: New segment v48 built (indexing pipeline) t=2: Canary โ€” route 1% of traffic to v48 shards Monitor: error rate, latency, result quality metrics v47 (99%) v48 (1%) t=3: v48 passes canary โ†’ promote to 100% Keep v47 for 24h rollback window v48 live โœ“ โ€” v47 retained as backup Rollback to v47 (seconds) If issues detected โ†’ Never deploy index changes to 100% without canary validation. Always keep rollback.
๐Ÿง 

Shard Rebalancing Split-Brain

  • Adding shards: some docs temporarily not queryable during migration
  • Coordinator's shard map out of sync โ†’ routes to wrong shard
  • Partial results returned without indication of incompleteness
  • Fix: double-write during migration (old + new shard), coordinator health checks, explicit "partial results" flag in response.

Relevance degradation is the most insidious search failure. The system returns 200 OK, latency is fine, no errors โ€” but users get poor results and silently switch to a competitor. You need continuous relevance monitoring: automated evaluation sets, click-through-rate tracking, and human rater programs. If CTR on position 1 drops 5% week-over-week, something is wrong even though all infrastructure metrics are green.

๐Ÿ“‹ Chapter 8 โ€” Summary
  • Query of death: complexity limits + hard timeout. Never let one query kill the cluster.
  • Relevance degradation: silent failure. Monitor CTR, use evaluation sets, A/B test changes.
  • Index corruption: checksums + rollback + canary deployment for index changes.
  • Split-brain: double-write during migration, explicit partial-results signaling.