Search Systems

How inverted indexes power fast, relevant search at scale.

01
Chapter One

What a Search System Is — and Why Your Database Isn't One

The Wall You Hit With LIKE '%query%'

A user types “wireless bluetooth headphones” into your search box and expects to see relevant products in 100 ms. The first instinct is reasonable: stick a LIKE '%query%' on the products table and call it done. It works for the demo. It limps along at a thousand products. At a hundred thousand products it takes seconds. At ten million it takes minutes — and it returns nothing for “wireles bluetoth”, finds “headphones” only when the column literally contains that exact substring, and ranks results by row order rather than relevance. The wall isn't a Postgres tuning problem. The wall is that you're using the wrong tool: relational databases are built to find rows by exact field value, and search systems are built to find documents by what they mean.

A search system flips the data structure inside out. Instead of storing rows and scanning columns, it stores an inverted index — a map from each word to the list of documents containing it. Instead of ranking by primary key, it scores by relevance using statistical signals about how informative each word is. Instead of exact-match, it tokenises and analyses text so “running shoes”, “run shoe”, and “Runners Shoe” can all match “runner's shoes”. The cost: a separate system, a separate index to keep in sync, a different query language, and a separate operational burden. The benefit: sub-100 ms full-text queries over hundreds of millions of documents, with relevance ranking, fuzzy matching, faceting, and aggregations that no relational database can match at any scale.
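
To make the inversion concrete, here is a minimal inverted index in Python — an illustrative sketch only (real engines add term dictionaries, compression, skip lists, and on-disk segment formats):

    # Build a term -> set-of-doc_ids map, then answer AND queries by
    # intersecting posting lists.
    docs = {
        "D1": "wireless bluetooth headphones",
        "D2": "wireless mouse usb-c",
        "D3": "bluetooth speaker portable",
    }

    index = {}
    for doc_id, text in docs.items():
        for term in text.lower().split():      # a trivial stand-in "analyser"
            index.setdefault(term, set()).add(doc_id)

    def search_and(*terms):
        """AND query: intersect the posting lists of every term."""
        postings = [index.get(t, set()) for t in terms]
        return set.intersection(*postings) if postings else set()

    print(search_and("wireless", "bluetooth"))  # {'D1'}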

Why Databases Fail at Full-Text Search

Wrong index: B-trees are built for ordered field lookups, not word-in-document lookups.

No relevance: rows have no notion of “closer match” — only equal or not.

No analysis: “running” and “run” are different strings, end of story.

No facets: grouping/counting across millions of matches is a full-table scan.

The escape hatch: Postgres' tsvector works at small/medium scale but doesn't reach Elasticsearch's relevance, language handling, or distributed performance.

What a Search System Adds

Inverted index: word → documents. Lookup is O(1); intersection of posting lists is the query.

Relevance scoring: BM25 ranks results by statistical relevance, not row order.

Analysers: tokenise, lowercase, stem, remove stopwords, handle synonyms.

Fuzzy / typo tolerance: match “bluetoth” to “bluetooth” via edit distance (see the sketch after this list).

Aggregations: facets and counts over the matched set in milliseconds.
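
The fuzzy-matching bullet hides a concrete algorithm: Levenshtein edit distance. A minimal sketch follows — note that real engines evaluate a Levenshtein automaton against the term dictionary rather than comparing the query to every indexed term pairwise, which would be far too slow:

    def edit_distance(a: str, b: str) -> int:
        """Levenshtein distance via classic dynamic programming."""
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            curr = [i]
            for j, cb in enumerate(b, 1):
                curr.append(min(prev[j] + 1,                 # delete from a
                                curr[j - 1] + 1,             # insert into a
                                prev[j - 1] + (ca != cb)))   # substitute
            prev = curr
        return prev[-1]

    # "bluetoth" is one edit from "bluetooth", so it matches at fuzziness 1.
    print(edit_distance("bluetoth", "bluetooth"))  # 1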

The Book Index Analogy — Inverted, Literally

Open a textbook to its index at the back. You don't find a list of pages numbered 1 to 800; you find an alphabetical list of topics, each pointing to the pages where that topic appears. To find every page that discusses “recursion”, you look up the word “recursion” once and read the page list. To find pages discussing both “recursion” and “memoisation”, you look up both and intersect the lists. That is exactly an inverted index. A relational table is the book's page-by-page contents (find me page 247); an inverted index is the back-of-book index (find me every page that mentions X). The structures are inversions of each other — and once you see this, the rest of search architecture becomes intuitive.

Forward Index (Database) vs Inverted Index (Search)

Forward index (relational table) — doc_id → full text:
  D1 → “wireless bluetooth headphones”
  D2 → “wireless mouse, USB-C”
  D3 → “bluetooth speaker, portable”
Query “bluetooth” → scan all rows — O(N), gets worse as data grows.

Inverted index (search engine) — term → doc_ids:
  wireless → [D1, D2]
  bluetooth → [D1, D3]
  headphones → [D1]
  speaker → [D3]
Query “bluetooth” → one O(1) lookup (+ intersect for AND) — constant time even at 10M docs.

A search system is not a faster database. It is a different shape of index, optimised for the question “which documents contain these words?” rather than “what is the value of this column for this row?” If your queries are mostly the second kind, you don't need a search system. If they're mostly the first, no amount of database tuning will get you there.

📋 Chapter 1 — Summary
  • Databases find rows by exact field values; search systems find documents by what they mean.
  • The core data structure is the inverted index — a map from each term to the documents that contain it.
  • Search adds relevance scoring, analysis (tokenising, stemming), fuzzy matching, and aggregations — none of which a B-tree provides.
  • The book's back-index is the analogy: alphabetical terms → pages, the literal inversion of contents pages.
02
Chapter Two

How Elasticsearch Works Internally

From Document to Index — the Indexing Pipeline

When a document arrives at Elasticsearch, it doesn't go straight into the index. It passes through an analyser that turns text into terms, and only those terms end up in the inverted index. The pipeline has three stages: character filters normalise the raw text (strip HTML, transliterate accents); the tokeniser breaks the stream into tokens (usually on whitespace and punctuation); token filters transform each token (lowercase, remove stopwords like “the”, stem “running” → “run”, expand synonyms). The output — a sequence of normalised terms — is what gets indexed. The same analyser is applied to the query at search time, which is why the choice of analyser is the single most important configuration decision in a search system. Get it wrong and your index is full of unmatchable strings.

Indexing Pipeline — Document → Inverted Index

Same analyser must run at index time AND query time.

  Document: “The Running Shoes!”
  → Char filter: strip HTML, transliterate accents
  → Tokeniser: [The, Running, Shoes]
  → Token filters: lowercase, stopwords, stem, synonyms
  → Terms: [run, shoe]
  → Inverted index: run → [D1], shoe → [D1]

A query for “runners” goes through the SAME pipeline → becomes “run” → matches D1. Mismatched analysers between index and query = silently broken search.
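
A toy version of the three stages in Python. The “stemmer” here is a crude suffix-stripper standing in for a real one (Lucene ships Porter/Snowball stemmers), so read it as a shape illustration, not production analysis:

    import re

    STOPWORDS = {"the", "a", "an", "and", "of"}

    def char_filter(text: str) -> str:
        return re.sub(r"<[^>]+>", "", text)          # strip HTML tags

    def tokenise(text: str) -> list[str]:
        return re.findall(r"[A-Za-z0-9]+", text)     # split on non-alphanumerics

    def token_filters(tokens: list[str]) -> list[str]:
        out = []
        for tok in tokens:
            tok = tok.lower()                        # lowercase
            if tok in STOPWORDS:                     # drop stopwords
                continue
            out.append(re.sub(r"(ning|ing|s)$", "", tok))  # crude stemmer stand-in
        return out

    def analyse(text: str) -> list[str]:
        return token_filters(tokenise(char_filter(text)))

    print(analyse("The Running Shoes!"))  # ['run', 'shoe'] — same result at query time
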
Relevance Scoring — TF-IDF and BM25

Returning matching documents is the easy part. Ranking them by usefulness is what separates a search system from a glorified WHERE clause. The classic algorithm is TF-IDF, built on two intuitions: term frequency (a doc that mentions “recursion” ten times is more about recursion than one mentioning it once); inverse document frequency (the word “the” appears everywhere, so its presence tells us nothing — rare words like “recursion” are more informative). The score for a doc is the product: high TF, high IDF → high score. Modern Elasticsearch defaults to BM25, a refinement that prevents term frequency from dominating (the tenth occurrence adds less than the second) and adjusts for document length (a 50-word doc with three matches is more focused than a 5,000-word one with the same three matches). The signals are statistical, not semantic, but they get you remarkably far on real corpora.

📐 TF-IDF (the classic intuition)

TF (term frequency): how often the term appears in this doc. More mentions → more relevant.

IDF (inverse document frequency): how rare the term is across the corpus. Rare terms are more informative.

Score: roughly TF × IDF — a doc gets points for hitting rare-and-frequent terms.

Weakness: linear in TF (the 100th occurrence counts the same as the 1st); ignores doc length.

🎯 BM25 (Elasticsearch default since 5.0)

Saturation: TF contribution levels off — diminishing returns past a few mentions.

Length normalisation: short focused docs beat long unfocused ones for the same term count.

Tunables: k1 (saturation point) and b (length penalty strength).

Why it's the default: better behaviour on real corpora; tunable per use case.
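
To pin the two refinements down, here is a compact BM25 scorer — the standard Okapi form with a Lucene-style smoothed IDF. A sketch of the maths, not a byte-for-byte match of Elasticsearch's implementation:

    import math

    def bm25(query_terms, doc, corpus, k1=1.2, b=0.75):
        """Score one tokenised doc against a query; corpus is a list of token lists."""
        N = len(corpus)
        avgdl = sum(len(d) for d in corpus) / N
        score = 0.0
        for term in query_terms:
            df = sum(1 for d in corpus if term in d)          # docs containing the term
            if df == 0:
                continue
            idf = math.log(1 + (N - df + 0.5) / (df + 0.5))   # rare term -> high weight
            tf = doc.count(term)
            # Saturation: the tf factor levels off as tf grows (k1 sets where).
            # Length normalisation: b penalises docs longer than the corpus average.
            score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(doc) / avgdl))
        return score

    corpus = [["run", "shoe"], ["run", "run", "run", "jacket"], ["shoe", "polish", "kit"]]
    for doc in corpus:
        print(doc, round(bm25(["run", "shoe"], doc, corpus), 3))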

Shards, Replicas, Segments

Elasticsearch achieves scale by splitting an index into shards (each a self-contained Lucene index). A query fans out to all shards in parallel; each returns its top-K, and the coordinator merges them. Add shards → more parallelism, more capacity. Replicas are full copies of each shard on different nodes — they provide read scale-out and survive node failures. The number of primary shards is fixed when the index is created and can't be changed without reindexing; the number of replicas can be tuned anytime. Inside a shard, data lives in immutable segments — new writes create new segments, and a background merge process consolidates them. Deletions are tombstones; updates are delete-plus-insert. Understanding the shard / replica / segment hierarchy is the difference between a cluster that scales gracefully and one that mysteriously falls over at 80% disk usage.
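
The hierarchy surfaces as two index settings. A sketch, assuming a node on localhost:9200 and an illustrative index name — primary shard count is immutable after creation, replica count is a live setting:

    import requests  # assumes an Elasticsearch/OpenSearch node at localhost:9200

    ES = "http://localhost:9200"

    # number_of_shards is fixed forever at creation time.
    requests.put(f"{ES}/products_v1", json={
        "settings": {"number_of_shards": 3, "number_of_replicas": 2}
    })

    # number_of_replicas can be changed at any time, without reindexing.
    requests.put(f"{ES}/products_v1/_settings", json={
        "index": {"number_of_replicas": 3}
    })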

The single biggest beginner mistake: using a default analyser (or none) and discovering six months later that the index has been silently storing un-stemmed, case-sensitive tokens. By that point reindexing tens of millions of documents is a multi-day operation. Pick the analyser deliberately, test it with the analyse API on real strings, and re-test it any time you change the mapping.
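
Testing it takes one call to the _analyze endpoint. A sketch, assuming the node and index above and the built-in english analyser:

    import requests

    ES = "http://localhost:9200"

    resp = requests.post(f"{ES}/products_v1/_analyze", json={
        "analyzer": "english",
        "text": "The Running Shoes!",
    })
    # The terms the index would actually store for this string.
    print([t["token"] for t in resp.json()["tokens"]])  # ['run', 'shoe']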

📋 Chapter 2 — Summary
  • The analyser pipeline (char filter → tokeniser → token filter) turns text into terms; the same analyser must run at index time and query time.
  • Relevance: TF-IDF intuition, BM25 in practice — with saturation and length normalisation as the key refinements.
  • Shards partition data for parallel queries; replicas provide redundancy and read scale-out; primary shard count is fixed at creation.
  • Segments are immutable; deletes are tombstones; merges run in the background.
03
Chapter Three

When a Search System Earns Its Keep

The Question Isn't “Should I Use Elasticsearch?” It's “Have I Outgrown LIKE?”

A search system is not a database upgrade. It is a complement, almost always living alongside your primary store. So the right question is not “ES or Postgres?” — it's “am I trying to answer a question my database is fundamentally bad at, and is that costing me enough to justify a second system?” The honest answer for many teams is no: a Postgres tsvector column with a GIN index will serve a few million rows of full-text data with respectable relevance, no extra infrastructure, and the same transactional guarantees as the rest of your data. Reach for a dedicated search system when you cross thresholds the database can't cross: tens of millions of documents, sub-100 ms query latency requirements, multi-language analysers, faceted navigation across high-cardinality fields, or relevance tuning that goes beyond “ORDER BY ts_rank”.
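
What the escape hatch looks like in practice — a sketch assuming psycopg 3, a local database, and a hypothetical products(name) table; the stored generated column keeps the tsvector in sync with the source text automatically:

    import psycopg  # assumes psycopg 3 and a local Postgres

    with psycopg.connect("dbname=app") as conn:
        conn.execute("""
            ALTER TABLE products ADD COLUMN IF NOT EXISTS fts tsvector
            GENERATED ALWAYS AS (to_tsvector('english', name)) STORED
        """)
        conn.execute("CREATE INDEX IF NOT EXISTS products_fts "
                     "ON products USING GIN (fts)")
        q = "wireless headphones"
        rows = conn.execute(
            "SELECT name FROM products "
            "WHERE fts @@ plainto_tsquery('english', %s) "
            "ORDER BY ts_rank(fts, plainto_tsquery('english', %s)) DESC LIMIT 10",
            (q, q),
        ).fetchall()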

USE a Search System When…

Tens of millions of documents with full-text search latency targets in the tens of ms.

Relevance ranking matters — users care about quality of results, not just presence.

Faceted search / aggregations — counts and groupings over the matched set in real time.

Multi-language / fuzzy matching / synonyms — non-trivial linguistic handling.

Logs and observability data — the canonical “everything is text and we want to query it fast” case.

Auto-complete / suggest-as-you-type at scale.

DO NOT Use One When…

You have < 1M documents and Postgres FTS would do fine — one fewer system to operate.

You need ACID transactions across search + writes — ES is eventually consistent and not your primary store.

Queries are exact-key lookups by ID or simple equality — the database wins on every dimension.

You want it as the system of record — never. ES is a derived index, not a source of truth.

Strong consistency is required — index lag of seconds is normal; design around it or pick another tool.

Your “search” is really analytics — OLAP engines (ClickHouse, Druid) are usually a better fit for that.

Search and Database Are Complements, Not Competitors

The right mental model: your relational database is the source of truth. Every product, user, document, log entry — the authoritative copy lives there with referential integrity, transactions, and constraints. The search system is a derived, denormalised view of that data, optimised for one access pattern: full-text + faceted + relevance-ranked queries. Writes go to the database; an indexing pipeline projects them into the search index; reads for search-shaped queries go to the index, while everything else — checkout, billing, account changes — stays on the database. The two systems will be slightly inconsistent at any given moment (typically by milliseconds to seconds); designs that pretend otherwise will hit subtle bugs.

If your team is two engineers and your search problem is “find users matching this name fragment in a 200k-row table”, do not deploy Elasticsearch. Postgres with a GIN-indexed tsvector, or a trigram index, will solve your problem with no second system. Reach for a search system when the gap between what your database does and what your users need is wide enough that no amount of tuning will close it.

📋 Chapter 3 — Summary
  • Use a search system for full-text at scale, relevance ranking, facets, multi-language, logs, and auto-complete.
  • Skip it for < 1M docs (Postgres FTS is fine), exact-key lookups, ACID needs, or analytics workloads.
  • Search is a complement to the database, not a replacement — the DB stays the source of truth.
  • Never use a search engine as the system of record. It's a derived, eventually-consistent index.
04
Chapter Four

Trade-offs & Comparisons

The Three Trade-offs You'll Live With

First: consistency lag between database and search index. The index is updated by a pipeline that runs after the write commits in the database — a refresh interval (Elasticsearch default: 1 second) plus pipeline latency. A user updates their profile, immediately searches for themselves, and sees the old value. This is normal. Workarounds — refresh-on-write, query-after-write — exist but cost throughput. Second: relevance versus performance. More analysis (n-grams, synonyms, semantic embeddings) means better recall but bigger indexes and slower queries. The right operating point is a per-product decision and changes over time. Third: operational complexity. Elasticsearch is a stateful distributed system with shard rebalancing, master elections, JVM tuning, segment merges, mapping migrations, and snapshot/restore concerns. Treating it like “just another service” is the fastest way to a Friday-evening cluster red status.

⏱️ Consistency Lag — Always Present

Source: refresh interval (default 1 s in ES) + indexing pipeline latency.

Implication: read-your-own-write through search will see stale data.

Mitigations: manual refresh on critical writes (costs throughput); shorter refresh intervals (cost throughput differently); read recent writes from the DB, not the index — see the sketch below.
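
The first two mitigations map to concrete API knobs — a sketch, with an illustrative index name and doc id:

    import requests

    ES = "http://localhost:9200"

    # Option 1: block this one write until it is visible to search (costs throughput).
    requests.put(f"{ES}/products_v1/_doc/42", params={"refresh": "wait_for"},
                 json={"name": "bluetooth headphones"})

    # Option 2: shorten the index-wide refresh interval (costs throughput differently).
    requests.put(f"{ES}/products_v1/_settings",
                 json={"index": {"refresh_interval": "500ms"}})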

⚖️ Relevance vs Performance

Levers that improve relevance: n-grams, synonyms, fuzzy matching, more fields, larger analysers, semantic re-ranking.

Cost of those levers: index size, indexing throughput, query latency, RAM for caches.

Discipline: measure relevance with a real eval set; tune one knob at a time; benchmark every change.

Elasticsearch vs Solr vs Typesense vs Algolia
🔍 Elasticsearch (and OpenSearch)

Strengths: dominant ecosystem, vast feature set (search + log analytics + vector + ML), Kibana, Logstash.

Trade-off: heaviest operational footprint (JVM, shards, clusters); cost at very high scale.

OpenSearch: Apache 2.0 fork after the licence change; near-feature-parity for most use cases.

Pick when: you need everything — search, logs, observability, vector search — in one platform.

☘️ Apache Solr

Strengths: battle-tested for over a decade; same Lucene core as ES; rich enterprise search features (faceting, joins, document grouping).

Trade-off: smaller modern community; fewer cloud-managed options.

Pick when: you have an existing Solr deployment that works, or strict requirements the ES-flavoured stack doesn't meet.

Typesense (and Meilisearch)

Strengths: tiny operational footprint, dev-friendly defaults, excellent typo tolerance out of the box, fast.

Trade-off: simpler feature set; not the right tool for log analytics or extreme scale.

Pick when: you want product/site search up in an afternoon and don't need ES's breadth.

☁️ Algolia (managed SaaS)

Strengths: incredibly fast, polished SDKs, near-zero ops, great for instant-search UX.

Trade-off: per-record + per-query pricing scales painfully past a few million records; vendor lock-in.

Pick when: small to mid product catalogue, premium UX requirements, no appetite for ops.

The choice between these is rarely about features — all four can do typo tolerance, facets, and ranking. It's about your operating model. Self-host with a mature ecosystem? Elasticsearch / OpenSearch. Self-host with minimal ops? Typesense or Meilisearch. No ops, willing to pay? Algolia. Existing investment? Solr. Pick the one whose operational style fits your team, not the one with the longest feature matrix.

📋 Chapter 4 — Summary
  • Three permanent trade-offs: consistency lag, relevance vs performance, operational complexity.
  • Elasticsearch / OpenSearch for breadth; Solr for legacy / specific needs; Typesense / Meilisearch for low ops; Algolia for managed.
  • Choose by operating model and team capacity, not by feature checklist.
  • Always run a real relevance eval set before and after changing anything — intuition lies.
05
Chapter Five

Production Patterns & Common Mistakes

Two Patterns That Keep DB and Index Honest

If you take only two patterns into production from this chapter, take these. Outbox-driven indexing over naive dual writes: don't have your application write to the database and the search index in two separate calls — one will eventually fail, the index will drift, and you'll spend a Saturday reconciling. Instead, write the database row and an “outbox” event in the same transaction, and have a downstream worker (or CDC pipeline like Debezium) read that outbox and update the index. The transaction guarantees you can't commit data without queueing the index update. Versioned aliases for zero-downtime reindexing: never write or read against an index name directly — always use an alias. When you need to reindex (mapping change, analyser change, schema evolution), build a new index in parallel, atomically swap the alias once it's caught up, and delete the old one. This turns a downtime-heavy migration into a routine operation.

📤 Pattern: Outbox-Driven Indexing

Goal: never have the DB and index disagree because of partial failures.

How: in one DB transaction, write the business data + an outbox row describing the change. A worker tails the outbox table (or uses CDC) and updates the search index.

Win: at-least-once delivery to the index; the index will eventually catch up. Combine with idempotent updates to the index (use the doc's primary key as the ES _id).
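
A minimal sketch of the write path, using stdlib sqlite3 so it runs anywhere (substitute your real database); table and column names are illustrative:

    import json
    import sqlite3

    db = sqlite3.connect("app.db")
    db.execute("CREATE TABLE IF NOT EXISTS products (id INTEGER PRIMARY KEY, name TEXT)")
    db.execute("CREATE TABLE IF NOT EXISTS outbox ("
               "id INTEGER PRIMARY KEY AUTOINCREMENT, payload TEXT, processed INTEGER DEFAULT 0)")

    def save_product(product_id: int, name: str) -> None:
        """Commit the business row AND the index instruction atomically."""
        with db:  # one transaction: rolls back both writes on any exception
            db.execute("INSERT OR REPLACE INTO products VALUES (?, ?)", (product_id, name))
            db.execute("INSERT INTO outbox (payload) VALUES (?)",
                       (json.dumps({"op": "upsert", "id": product_id, "name": name}),))

    save_product(42, "bluetooth headphones")
    # A separate worker tails outbox (or a CDC stream) and applies each event to
    # the search index, using the product id as the ES _id so retries are idempotent.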

🔁 Pattern: Versioned Aliases for Reindex

Goal: change mappings/analysers without downtime.

How: apps read/write via products alias; today it points to products_v3. When you need new mappings, create products_v4, dual-write or replay, then atomically point the alias to v4. Drop v3.

Win: reindexing becomes a routine, reversible operation instead of a stop-the-world event.
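
The swap itself is a single _aliases call; both actions apply in one cluster-state update, so readers never see a moment with no index behind the alias. A sketch with the index names from above:

    import requests

    ES = "http://localhost:9200"

    # Atomically repoint the 'products' alias from v3 to v4.
    requests.post(f"{ES}/_aliases", json={
        "actions": [
            {"remove": {"index": "products_v3", "alias": "products"}},
            {"add":    {"index": "products_v4", "alias": "products"}},
        ]
    })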

The Six Mistakes That Take Search Down
🏛️ Mistake 1 — ES as System of Record

No backups, no transactions, no constraints — one corrupted shard or accidental delete is irrecoverable. Fix: the database is always the source of truth. The index is rebuildable from it.

📈 Mistake 2 — Indexing Everything

Every field marked searchable, including 50KB JSON blobs and binary data. Index doubles in size; query latency triples. Fix: index only fields you actually search; mark the rest as doc_values: false or index: false.

👻 Mistake 3 — Default / Missing Analyser

No language stemmer, case-sensitive, no stopwords. “Running” doesn't match “run”; users blame “the search”. Fix: pick the right language analyser at index creation; verify with the analyse API.

🔥 Mistake 4 — Naive Dual Writes

App writes to DB, then to ES, in two separate calls. ES write fails — data drifts forever. Fix: outbox pattern + idempotent updater, or CDC (Debezium → Kafka → ES).

🏷️ Mistake 5 — No Aliases, Reindex = Outage

Apps hardcoded to products_v1; mapping change requires deploy + reindex coordination. Fix: always read/write via aliases; switch atomically.

🔍 Mistake 6 — Mapping Explosion

Dynamic mapping enabled + arbitrary user JSON ingested → thousands of fields, OOM on the master, cluster red. Fix: set strict mappings; cap with index.mapping.total_fields.limit.
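
Both fixes fit in the create-index body — a sketch with illustrative names; "dynamic": "strict" rejects any document carrying an unmapped field, and the field limit hard-caps mapping growth:

    import requests

    ES = "http://localhost:9200"

    requests.put(f"{ES}/events_v1", json={
        "settings": {"index.mapping.total_fields.limit": 200},
        "mappings": {
            "dynamic": "strict",                     # unknown fields -> indexing error
            "properties": {
                "message":   {"type": "text"},
                "timestamp": {"type": "date"},
            },
        },
    })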

The single highest-leverage decision in a search project is the analyser, set at index creation, before you put a single document in. The single highest-leverage operational practice is aliases, set up before you ever query the index. The single highest-leverage architectural practice is treating the database as the source of truth and the search index as rebuildable. Do those three things and you'll avoid 80% of the production pain that search systems generate.

📋 Chapter 5 — Summary
  • Outbox-driven indexing instead of naive dual writes — one transaction commits both the data and the index instruction.
  • Versioned aliases turn reindexing from an outage into a routine operation.
  • The six mistakes: ES as system of record, indexing everything, missing analyser, naive dual writes, no aliases, mapping explosion.
  • Database is truth; the index is rebuildable. Build that assumption into every part of your design.