Case Study: Search Engine
Design, trade-offs, and alternatives for a search engine at scale.
Problem Statement
A search engine crawls the web, builds an index of every page's content, and serves relevant results for arbitrary text queries in under 500 milliseconds. The scale is staggering: 100B+ pages indexed, 8.5B searches per day (~100K queries/sec), and new content must be discoverable within minutes of publication. The core challenge is not finding a needle in a haystack โ it is ranking the 10 most relevant needles from 100 billion candidates, faster than the user can blink.
Traffic & Scale
- 8.5B searches/day (~100K queries/sec)
- 100B+ web pages indexed
- Index size: 100+ PB (compressed inverted index)
- Crawl rate: 1B+ pages/day for freshness
Requirements
- Query latency: <500ms end-to-end (p99)
- Freshness: breaking news indexed in <5 minutes
- Relevance: top-10 results satisfy user intent >90%
- Availability: 99.99% โ search is the gateway to the internet
Search is two completely separate systems: the indexing pipeline and the query serving system. Indexing is a massive batch/stream processing problem (crawl โ parse โ index). Query serving is a low-latency distributed query problem (parse query โ scatter to shards โ gather โ rank โ return). They share only the index data structure. Designing them as one system is the same mistake as combining video upload with video playback.
- 8.5B searches/day across 100B+ indexed pages. 100K queries/sec.
- Two systems: indexing pipeline (batch) and query serving (low-latency).
- Sub-500ms latency. Freshness under 5 minutes. Relevance >90% satisfaction.
- The inverted index is the fundamental data structure that makes this possible.
Questions to Ask
Query Types
- Full-text keyword search or structured filters?
- Boolean (AND/OR/NOT) or natural language?
- Phrase matching ("exact phrase")?
- Fuzzy/typo-tolerant? (did you mean...)
Indexing Strategy
- Batch indexing (hours) or real-time (seconds)?
- Multi-language support? (tokenization differs)
- Structured metadata + full-text hybrid?
- Document updates: re-index or delta update?
User Experience
- Autocomplete/typeahead suggestions?
- Personalized results (per-user ranking)?
- Faceting/filtering (category, price, date)?
- Highlighting matched terms in results?
Real-time vs batch indexing changes the entire architecture. Batch indexing (rebuild index every N hours) is simpler โ build offline, swap atomically. Real-time indexing (document searchable in seconds) requires append-to-live-index, segment merging, and handling queries against partially-built index segments. Most production systems use near-real-time: batch for the base index, real-time for recent documents merged on query.
For This Case Study, Our Answers Are:
- Search scope: web search (full internet crawl, not internal/enterprise docs)
- Query types: natural language + boolean operators + phrase matching
- Fuzzy/typo-tolerance: yes โ edit distance โค 2 for terms under 8 characters
- Indexing model: near-real-time hybrid โ batch base index + real-time segment append
- Freshness target: breaking news indexed in <5 minutes; long-tail in <24 hours
- Personalization: yes โ per-user click history influences ranking
- Autocomplete: yes โ separate prefix trie, not covered in this design
- Multi-language: yes โ separate tokenizers per language (50+ languages)
- Result features: snippets, image thumbnails, Knowledge Graph panels
- Query latency SLA: p50 <100ms, p99 <500ms
- Query types (boolean, phrase, fuzzy) determine index data structures needed.
- Real-time vs batch indexing: freshness vs simplicity trade-off.
- Multi-language: different tokenizers, stemmers, stop words per language.
- Autocomplete needs a separate data structure (prefix trie or edge-ngrams).
Naive Design
The simplest design: a single Elasticsearch node indexing all documents into one inverted index. queries go to that one server, which scans the index, scores results using BM25, and returns ranked results. Works beautifully for 1M documents. At 100B documents, the index is 100+ PB โ it doesn't fit on any single machine. Even at 10M documents, query latency degrades as the index grows, because scoring must traverse longer posting lists for common terms.
What Works
- Simple โ one node, standard Elasticsearch
- Full-text search with BM25 scoring out-of-box
- Faceting, filtering, highlighting built-in
- Great for <10M documents
What Breaks
- Index too large for single machine at scale
- Query latency degrades with index size
- Single point of failure โ node dies = no search
- No separation of indexing from serving (compete for resources)
- Scaling: only vertical (bigger machine)
- Single node: works for small datasets. Index doesn't fit at web scale.
- Latency degrades as posting lists grow. Common terms = slow queries.
- Indexing and serving on same node compete for CPU/IO resources.
Refined Design
The refined design partitions the inverted index across hundreds of shards. Each shard holds a subset of documents and can answer queries independently. A query coordinator receives user queries, broadcasts them to all shards in parallel (scatter), collects top-K results from each (gather), merges and re-ranks globally, then returns the final top-10. The indexing pipeline (crawler โ parser โ indexer) runs separately, building index segments that are pushed to serving shards.
Query Path (Scatter-Gather)
- 1. User query โ coordinator parses and plans
- 2. Scatter: send query to all shards in parallel
- 3. Each shard: search local inverted index, return top-K
- 4. Gather: coordinator merges KรN results, re-ranks globally
- 5. Return top-10 with snippets and highlighting
- Latency = slowest shard response (tail latency matters)
Indexing Pipeline
- Crawler: discover new/updated pages (BFS, priority queue)
- Parser: extract text, links, metadata from HTML
- Indexer: build inverted index segments (term โ doc_ids)
- Segment push: new segments deployed to serving shards
- Near-real-time: fresh content searchable in minutes
Tail latency is the enemy of scatter-gather search. If you scatter to 100 shards and one is slow (GC pause, disk IO spike), your entire query is slow. At p99, you are waiting for the slowest shard's p99. Solutions: hedged requests (send to replica if primary is slow), speculative execution, and aggressive timeout + fallback to partial results.
- Sharded inverted index: each shard holds a subset of documents independently.
- Scatter-gather query: coordinator fans out to all shards, merges results.
- Indexing pipeline separate from serving: crawler โ parser โ indexer โ segment push.
- Tail latency problem: slowest shard determines query latency. Hedge requests.
Alternative Approaches
- Term โ document_ids posting list. Exact keyword matching.
- BM25 scoring: term frequency ร inverse doc frequency
- Fast: O(posting_list_length) per term
- No understanding of meaning: "car" doesn't match "automobile"
- Well-understood, 30+ years of optimization
- Used by: Google (base layer), Elasticsearch, Solr
- Document โ dense vector (768-dim). Query โ vector. Nearest neighbor.
- Semantic understanding: "car" matches "automobile" naturally
- Requires ML model to encode documents and queries
- Approximate Nearest Neighbor (ANN): HNSW, IVF, ScaNN
- Higher latency and cost than inverted index at scale
- Used by: Google (semantic layer), Pinecone, Weaviate
- Build complete index offline, swap atomically to serving
- Simple: no concurrent read/write on same index
- Freshness: new content waits for next build (hours)
- Consistent: all queries see same index version
- Good for: catalogs, reference data, static content
- Append to live index immediately. Searchable in seconds.
- Segment-based: new docs go to small segments, merged later
- Complex: queries span old + new segments simultaneously
- Background merge: compact small segments into large ones
- Good for: news, social media, e-commerce (inventory)
Hybrid retrieval is now the production standard for any serious search system. Pure keyword search misses synonyms and intent ("cheap" vs "affordable", "heart attack" vs "myocardial infarction"). Pure vector search misses exact matches and is more expensive. The hybrid approach runs both in parallel โ or in cascade โ and combines scores. Google, Bing, and all major enterprise search platforms use this approach. If you are building search today, assume hybrid retrieval from day one.
- Inverted index: fast, exact keyword matching. 30 years of optimization.
- Vector search: semantic understanding, higher cost. Good for meaning-based queries.
- Hybrid (production standard): inverted index for recall, vectors for re-ranking.
- Real-time indexing: segment-based append. Freshness in seconds.
What Real Companies Did
- 100B+ pages. Caffeine: near-real-time incremental indexing
- PageRank: link-based authority scoring (original breakthrough)
- BERT/MUM: neural language understanding for query intent
- Knowledge Graph: structured answers without clicking results
- Custom hardware (TPUs) for ML inference in serving path
Elasticsearch
- Built on Apache Lucene: inverted index + BM25
- Shard + replica model: distribute index across cluster
- Near-real-time: documents searchable in ~1 second
- Rich query DSL: boolean, phrase, fuzzy, aggregations
- Used by: GitHub, Netflix, Uber, Wikipedia (search)
Algolia
- Search-as-a-service: hosted, <50ms globally
- Typo-tolerance built-in (edit distance at index time)
- Edge search: replicate index to 70+ PoPs worldwide
- Instant search: results update as user types each character
- Used by: Stripe docs, Twitch, Lacoste, Medium
Bing
- Deep learning ranking models (transformer-based)
- FPGA acceleration: custom hardware for inference in serving path
- Multi-stage ranking: BM25 โ lightweight ML โ heavy ML
- Conversational search integration (Copilot/ChatGPT)
- Index freshness: minutes for news, hours for long-tail
- Google: Caffeine (real-time), PageRank + BERT/MUM neural ranking, Knowledge Graph.
- Elasticsearch: Lucene-based, shard/replica, near-real-time. De facto standard for app search.
- Algolia: edge-replicated, typo-tolerant, <50ms instant search.
- Bing: multi-stage ranking with FPGA-accelerated ML inference.
Best Practices Extracted
Index Partitioning
- Document-partitioned: each shard holds complete docs (standard)
- Term-partitioned: each shard holds some terms for ALL docs (rare)
- Document-partitioned wins: each shard scores independently
- Re-shard by splitting: subdivide existing shards as data grows
- Transfers to: any distributed data store partitioning
Multi-Stage Ranking
- Stage 1: Retrieval โ fast, recall-focused (inverted index, top-1000)
- Stage 2: Lightweight scoring โ BM25 + simple features (top-100)
- Stage 3: Heavy ML model โ deep ranking (top-10)
- Each stage trades precision for cost reduction
- Transfers to: recommendations, ad ranking, feed ranking
Query Result Caching
- Cache top results for popular queries (20% of queries = 80% of volume)
- Short TTL: 5-15 min (index changes, freshness matters)
- Cache key: normalized query + filters + page
- Invalidate on index refresh for affected terms
- Transfers to: any expensive computation with repeating inputs
- Document-partitioned: standard approach. Each shard scores independently.
- Multi-stage ranking: retrieval โ lightweight scoring โ heavy ML. Funnel reduces cost.
- Query caching: popular queries cached with short TTL. Invalidate on index refresh.
- Hedged requests: if primary shard slow after 100ms, query replica in parallel. First response wins. Eliminates tail latency from GC pauses.
What Could Go Wrong
Query of Death
- User enters expensive regex or wildcard (e.g.,
*a*b*c*) - Explodes into millions of term lookups per shard
- One query consumes all CPU on every shard simultaneously
- Fix: query complexity limits, wildcard restrictions, per-query timeout (hard kill at 5s), query cost estimation before execution.
Relevance Degradation
- Ranking signals become stale (click data from months ago)
- SEO spam games the ranking algorithm gradually
- Users see poor results but don't report โ they just leave
- Fix: continuous ranking evaluation (human raters + metrics), fresh click signals, anti-spam ML models, A/B testing all ranking changes.
Index Corruption
- Bad indexing code writes corrupted segment to one shard
- That shard returns wrong results or crashes on certain queries
- Affects 1/N of all queries โ randomly, hard to reproduce
- Fix: index checksums, segment-level validation, rollback capability (keep previous segment version), canary indexing.
Shard Rebalancing Split-Brain
- Adding shards: some docs temporarily not queryable during migration
- Coordinator's shard map out of sync โ routes to wrong shard
- Partial results returned without indication of incompleteness
- Fix: double-write during migration (old + new shard), coordinator health checks, explicit "partial results" flag in response.
Relevance degradation is the most insidious search failure. The system returns 200 OK, latency is fine, no errors โ but users get poor results and silently switch to a competitor. You need continuous relevance monitoring: automated evaluation sets, click-through-rate tracking, and human rater programs. If CTR on position 1 drops 5% week-over-week, something is wrong even though all infrastructure metrics are green.
- Query of death: complexity limits + hard timeout. Never let one query kill the cluster.
- Relevance degradation: silent failure. Monitor CTR, use evaluation sets, A/B test changes.
- Index corruption: checksums + rollback + canary deployment for index changes.
- Split-brain: double-write during migration, explicit partial-results signaling.