System Design · Case Studies

Case Study: Search Engine

Design, trade-offs, and alternatives for a search engine at scale.

Chapter One

Problem Statement

What We Are Building

A search engine crawls the web, builds an index of every page's content, and serves relevant results for arbitrary text queries in under 500 milliseconds. The scale is staggering: 100B+ pages indexed, 8.5B searches per day (~100K queries/sec), and new content must be discoverable within minutes of publication. The core challenge is not finding a needle in a haystack — it is ranking the 10 most relevant needles from 100 billion candidates, faster than the user can blink.

Scale Requirements

Traffic & Scale

8.5B searches/day (~100K queries/sec)
100B+ web pages indexed
Index size: 100+ PB (compressed inverted index)
Crawl rate: 1B+ pages/day for freshness

Requirements

Query latency: <500ms end-to-end (p99)
Freshness: breaking news indexed in <5 minutes
Relevance: top-10 results satisfy user intent >90%
Availability: 99.99% — search is the gateway to the internet

Search is two completely separate systems: the indexing pipeline and the query serving system. Indexing is a massive batch/stream processing problem (crawl → parse → index). Query serving is a low-latency distributed query problem (parse query → scatter to shards → gather → rank → return). They share only the index data structure. Designing them as one system is the same mistake as combining video upload with video playback.

📋 Chapter 1 — Summary

8.5B searches/day across 100B+ indexed pages. 100K queries/sec.
Two systems: indexing pipeline (batch) and query serving (low-latency).
Sub-500ms latency. Freshness under 5 minutes. Relevance >90% satisfaction.
The inverted index is the fundamental data structure that makes this possible.

Chapter Two

Questions to Ask

Clarifying Before Designing

🔍

Query Types

Full-text keyword search or structured filters?
Boolean (AND/OR/NOT) or natural language?
Phrase matching ("exact phrase")?
Fuzzy/typo-tolerant? (did you mean...)

📥

Indexing Strategy

Batch indexing (hours) or real-time (seconds)?
Multi-language support? (tokenization differs)
Structured metadata + full-text hybrid?
Document updates: re-index or delta update?

👤

User Experience

Autocomplete/typeahead suggestions?
Personalized results (per-user ranking)?
Faceting/filtering (category, price, date)?
Highlighting matched terms in results?

Real-time vs batch indexing changes the entire architecture. Batch indexing (rebuild index every N hours) is simpler — build offline, swap atomically. Real-time indexing (document searchable in seconds) requires append-to-live-index, segment merging, and handling queries against partially-built index segments. Most production systems use near-real-time: batch for the base index, real-time for recent documents merged on query.

For This Case Study, Our Answers Are:

Search scope: web search (full internet crawl, not internal/enterprise docs)
Query types: natural language + boolean operators + phrase matching
Fuzzy/typo-tolerance: yes — edit distance ≤ 2 for terms under 8 characters
Indexing model: near-real-time hybrid — batch base index + real-time segment append
Freshness target: breaking news indexed in <5 minutes; long-tail in <24 hours
Personalization: yes — per-user click history influences ranking
Autocomplete: yes — separate prefix trie, not covered in this design
Multi-language: yes — separate tokenizers per language (50+ languages)
Result features: snippets, image thumbnails, Knowledge Graph panels
Query latency SLA: p50 <100ms, p99 <500ms

📋 Chapter 2 — Summary

Query types (boolean, phrase, fuzzy) determine index data structures needed.
Real-time vs batch indexing: freshness vs simplicity trade-off.
Multi-language: different tokenizers, stemmers, stop words per language.
Autocomplete needs a separate data structure (prefix trie or edge-ngrams).

Chapter Three

Naive Design

Single Node Full-Text Index

The simplest design: a single Elasticsearch node indexing all documents into one inverted index. queries go to that one server, which scans the index, scores results using BM25, and returns ranked results. Works beautifully for 1M documents. At 100B documents, the index is 100+ PB — it doesn't fit on any single machine. Even at 10M documents, query latency degrades as the index grows, because scoring must traverse longer posting lists for common terms.

Naive Design — Single Elasticsearch Node

✅

What Works

Simple — one node, standard Elasticsearch
Full-text search with BM25 scoring out-of-box
Faceting, filtering, highlighting built-in
Great for <10M documents

💥

What Breaks

Index too large for single machine at scale
Query latency degrades with index size
Single point of failure — node dies = no search
No separation of indexing from serving (compete for resources)
Scaling: only vertical (bigger machine)

📋 Chapter 3 — Summary

Single node: works for small datasets. Index doesn't fit at web scale.
Latency degrades as posting lists grow. Common terms = slow queries.
Indexing and serving on same node compete for CPU/IO resources.

Chapter Four

Refined Design

Sharded Inverted Index with Scatter-Gather Query

The refined design partitions the inverted index across hundreds of shards. Each shard holds a subset of documents and can answer queries independently. A query coordinator receives user queries, broadcasts them to all shards in parallel (scatter), collects top-K results from each (gather), merges and re-ranks globally, then returns the final top-10. The indexing pipeline (crawler → parser → indexer) runs separately, building index segments that are pushed to serving shards.

Refined Design — Distributed Search Architecture

🔍

Query Path (Scatter-Gather)

1. User query → coordinator parses and plans
2. Scatter: send query to all shards in parallel
3. Each shard: search local inverted index, return top-K
4. Gather: coordinator merges K×N results, re-ranks globally
5. Return top-10 with snippets and highlighting
Latency = slowest shard response (tail latency matters)

📥

Indexing Pipeline

Crawler: discover new/updated pages (BFS, priority queue)
Parser: extract text, links, metadata from HTML
Indexer: build inverted index segments (term → doc_ids)
Segment push: new segments deployed to serving shards
Near-real-time: fresh content searchable in minutes

Inverted Index — The Core Data Structure

Tail latency is the enemy of scatter-gather search. If you scatter to 100 shards and one is slow (GC pause, disk IO spike), your entire query is slow. At p99, you are waiting for the slowest shard's p99. Solutions: hedged requests (send to replica if primary is slow), speculative execution, and aggressive timeout + fallback to partial results.

Hedged Requests — Eliminating Tail Latency

📋 Chapter 4 — Summary

Sharded inverted index: each shard holds a subset of documents independently.
Scatter-gather query: coordinator fans out to all shards, merges results.
Indexing pipeline separate from serving: crawler → parser → indexer → segment push.
Tail latency problem: slowest shard determines query latency. Hedge requests.

Chapter Five

Alternative Approaches

Retrieval & Ranking Strategies

Inverted Index (BM25)

Vector Embeddings (Semantic)

Term → document_ids posting list. Exact keyword matching.
BM25 scoring: term frequency × inverse doc frequency
Fast: O(posting_list_length) per term
No understanding of meaning: "car" doesn't match "automobile"
Well-understood, 30+ years of optimization
Used by: Google (base layer), Elasticsearch, Solr

Document → dense vector (768-dim). Query → vector. Nearest neighbor.
Semantic understanding: "car" matches "automobile" naturally
Requires ML model to encode documents and queries
Approximate Nearest Neighbor (ANN): HNSW, IVF, ScaNN
Higher latency and cost than inverted index at scale
Used by: Google (semantic layer), Pinecone, Weaviate

Batch Indexing

Real-Time Indexing

Build complete index offline, swap atomically to serving
Simple: no concurrent read/write on same index
Freshness: new content waits for next build (hours)
Consistent: all queries see same index version
Good for: catalogs, reference data, static content

Append to live index immediately. Searchable in seconds.
Segment-based: new docs go to small segments, merged later
Complex: queries span old + new segments simultaneously
Background merge: compact small segments into large ones
Good for: news, social media, e-commerce (inventory)

Real-Time Indexing — Segment-Based Architecture

Hybrid Search Pipeline — Production Standard

Hybrid retrieval is now the production standard for any serious search system. Pure keyword search misses synonyms and intent ("cheap" vs "affordable", "heart attack" vs "myocardial infarction"). Pure vector search misses exact matches and is more expensive. The hybrid approach runs both in parallel — or in cascade — and combines scores. Google, Bing, and all major enterprise search platforms use this approach. If you are building search today, assume hybrid retrieval from day one.

📋 Chapter 5 — Summary

Inverted index: fast, exact keyword matching. 30 years of optimization.
Vector search: semantic understanding, higher cost. Good for meaning-based queries.
Hybrid (production standard): inverted index for recall, vectors for re-ranking.
Real-time indexing: segment-based append. Freshness in seconds.

Chapter Six

What Real Companies Did

Production Search Systems

🔍

Google

100B+ pages. Caffeine: near-real-time incremental indexing
PageRank: link-based authority scoring (original breakthrough)
BERT/MUM: neural language understanding for query intent
Knowledge Graph: structured answers without clicking results
Custom hardware (TPUs) for ML inference in serving path

🔎

Elasticsearch

Built on Apache Lucene: inverted index + BM25
Shard + replica model: distribute index across cluster
Near-real-time: documents searchable in ~1 second
Rich query DSL: boolean, phrase, fuzzy, aggregations
Used by: GitHub, Netflix, Uber, Wikipedia (search)

⚡

Algolia

Search-as-a-service: hosted, <50ms globally
Typo-tolerance built-in (edit distance at index time)
Edge search: replicate index to 70+ PoPs worldwide
Instant search: results update as user types each character
Used by: Stripe docs, Twitch, Lacoste, Medium

🌐

Bing

Deep learning ranking models (transformer-based)
FPGA acceleration: custom hardware for inference in serving path
Multi-stage ranking: BM25 → lightweight ML → heavy ML
Conversational search integration (Copilot/ChatGPT)
Index freshness: minutes for news, hours for long-tail

Production Search Systems — Comparison

📋 Chapter 6 — Summary

Google: Caffeine (real-time), PageRank + BERT/MUM neural ranking, Knowledge Graph.
Elasticsearch: Lucene-based, shard/replica, near-real-time. De facto standard for app search.
Algolia: edge-replicated, typo-tolerant, <50ms instant search.
Bing: multi-stage ranking with FPGA-accelerated ML inference.

Chapter Seven

Best Practices Extracted

Transferable Lessons

📊

Index Partitioning

Document-partitioned: each shard holds complete docs (standard)
Term-partitioned: each shard holds some terms for ALL docs (rare)
Document-partitioned wins: each shard scores independently
Re-shard by splitting: subdivide existing shards as data grows
Transfers to: any distributed data store partitioning

🔀

Multi-Stage Ranking

Stage 1: Retrieval — fast, recall-focused (inverted index, top-1000)
Stage 2: Lightweight scoring — BM25 + simple features (top-100)
Stage 3: Heavy ML model — deep ranking (top-10)
Each stage trades precision for cost reduction
Transfers to: recommendations, ad ranking, feed ranking

💾

Query Result Caching

Cache top results for popular queries (20% of queries = 80% of volume)
Short TTL: 5-15 min (index changes, freshness matters)
Cache key: normalized query + filters + page
Invalidate on index refresh for affected terms
Transfers to: any expensive computation with repeating inputs

Multi-Stage Ranking — Candidate Funnel

📋 Chapter 7 — Summary

Document-partitioned: standard approach. Each shard scores independently.
Multi-stage ranking: retrieval → lightweight scoring → heavy ML. Funnel reduces cost.
Query caching: popular queries cached with short TTL. Invalidate on index refresh.
Hedged requests: if primary shard slow after 100ms, query replica in parallel. First response wins. Eliminates tail latency from GC pauses.

Chapter Eight

What Could Go Wrong

Common Failure Patterns

💀

Query of Death

User enters expensive regex or wildcard (e.g., *a*b*c*)
Explodes into millions of term lookups per shard
One query consumes all CPU on every shard simultaneously
Fix: query complexity limits, wildcard restrictions, per-query timeout (hard kill at 5s), query cost estimation before execution.

Query of Death — Wildcard Explosion

📉

Relevance Degradation

Ranking signals become stale (click data from months ago)
SEO spam games the ranking algorithm gradually
Users see poor results but don't report — they just leave
Fix: continuous ranking evaluation (human raters + metrics), fresh click signals, anti-spam ML models, A/B testing all ranking changes.

💾

Index Corruption

Bad indexing code writes corrupted segment to one shard
That shard returns wrong results or crashes on certain queries
Affects 1/N of all queries — randomly, hard to reproduce
Fix: index checksums, segment-level validation, rollback capability (keep previous segment version), canary indexing.

Safe Index Deployment — Segment Versioning

🧠

Shard Rebalancing Split-Brain

Adding shards: some docs temporarily not queryable during migration
Coordinator's shard map out of sync → routes to wrong shard
Partial results returned without indication of incompleteness
Fix: double-write during migration (old + new shard), coordinator health checks, explicit "partial results" flag in response.

Relevance degradation is the most insidious search failure. The system returns 200 OK, latency is fine, no errors — but users get poor results and silently switch to a competitor. You need continuous relevance monitoring: automated evaluation sets, click-through-rate tracking, and human rater programs. If CTR on position 1 drops 5% week-over-week, something is wrong even though all infrastructure metrics are green.

📋 Chapter 8 — Summary

Query of death: complexity limits + hard timeout. Never let one query kill the cluster.
Relevance degradation: silent failure. Monitor CTR, use evaluation sets, A/B test changes.
Index corruption: checksums + rollback + canary deployment for index changes.
Split-brain: double-write during migration, explicit partial-results signaling.

← Distributed Cache Cloud Storage →