System Design · Case Studies

Case Study: Recommendation Engine

Design, trade-offs, and alternatives for a recommendation engine at scale.

01
Chapter One

Problem Statement

What We Are Building

A recommendation engine predicts what items a user will engage with (products to buy, videos to watch, songs to listen to, people to follow) based on their past behavior, similar users, and item characteristics. The core challenge is ranking millions of candidate items for each user in real time, using signals that range from explicit (ratings, purchases) to implicit (dwell time, scroll speed, skip patterns). At scale, you are computing personalized rankings for 500M+ users across 100M+ items, serving recommendations in under 100ms per request.

Scale Requirements

Traffic & Scale

  • 500M daily active users
  • 100M+ items in catalog
  • 1M recommendation requests/sec at peak
  • 10B+ user-item interactions/day (impressions, clicks, purchases)

Requirements

  • Latency: <100ms per recommendation request
  • Freshness: new interactions reflected in <30 minutes
  • Relevance: personalized, not just popular
  • Diversity: avoid filter bubbles (explore vs exploit)

A recommendation engine is a real-time ranking system that must balance exploitation (show what we know the user likes) with exploration (show new items to learn preferences). Pure exploitation creates filter bubbles: users see the same type of content forever. Pure exploration feels random. The best systems use multi-armed bandit approaches or explicit diversity injection to balance both. This is not just an engineering problem; it is a product and ethical decision that affects what 500M people see every day.

Chapter 1 Summary
  • 500M users, 100M+ items, 1M rec requests/sec. Personalized in <100ms.
  • Signals: clicks, purchases, dwell time, ratings, skip patterns.
  • Balance exploitation (relevance) with exploration (diversity).
  • New interactions must update recommendations within 30 minutes.
02
Chapter Two

Questions to Ask

Clarifying Before Designing

Recommendation Context

  • Homepage (cold start) or in-session (already browsing)?
  • "Similar items" or "personalized for you"?
  • Single domain (movies) or cross-domain (movies + books)?
  • Real-time context (time, location, device)?

Signal Types

  • Explicit feedback (ratings 1-5)?
  • Implicit feedback (views, clicks, dwell time)?
  • Negative signals (skip, hide, dislike)?
  • Social signals (friends' activity)?

Business Constraints

  • Revenue optimization or engagement?
  • Fairness constraints (expose all sellers)?
  • Content moderation (don't recommend harmful content)?
  • Explainability ("Because you watched...")?

The cold start problem determines your architecture's complexity. A new user with no history can't get collaborative filtering recommendations. A new item with no interactions can't be recommended by any behavior-based model. You need content-based fallbacks (recommend based on item features), popularity-based defaults, and rapid learning from the first few interactions. Cold start is not an edge case: it is 20%+ of your traffic (new users, new items daily).
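The fallback order described above can be sketched as a small routing function. This is a minimal illustration, not a prescribed design: the `User` fields, the strategy names, and the `MIN_HISTORY_FOR_CF` threshold are all assumptions introduced for the example.

```python
from dataclasses import dataclass, field

# Assumed threshold: below this many interactions the user is treated as "cold".
MIN_HISTORY_FOR_CF = 5


@dataclass
class User:
    user_id: str
    history: list = field(default_factory=list)            # item ids the user engaged with
    onboarding_topics: list = field(default_factory=list)  # topics picked at sign-up, if any


def choose_strategy(user: User) -> str:
    """Pick a recommendation source based on how much is known about the user."""
    if len(user.history) >= MIN_HISTORY_FOR_CF:
        return "collaborative"   # enough behavior for behavior-based models
    if user.history or user.onboarding_topics:
        return "content_based"   # a little signal: match on item features
    return "popularity"          # nothing known: fall back to popular items


print(choose_strategy(User("brand_new")))                             # popularity
print(choose_strategy(User("picked", onboarding_topics=["jazz"])))    # content_based
print(choose_strategy(User("regular", history=list("abcde"))))        # collaborative
```

The same routing idea applies to new items, where metadata-based features stand in for missing interaction data.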

For This Case Study, Our Answers Are:

  • Context: homepage recommendations (cold open, no session context)
  • Item domain: single domain, video content (YouTube-style)
  • Primary signals: implicit (watch time, completion rate, skip) + explicit (likes, saves)
  • Negative signals: yes, skip and explicit "not interested" used
  • Social signals: no, individual behavior only (no friend graph)
  • Cold start: yes โ€” 20%+ of users are new or returning after 30+ days
  • Business constraints: no harmful content, max 2 sponsored slots in top-20
  • Personalization freshness: user model updated every 30 minutes
  • Explainability: "Because you watched X" shown for top recommendations
  • Catalog size: 100M+ items, ~10K new items added daily
Chapter 2 Summary
  • Context matters: homepage vs in-session recommendations have different models.
  • Implicit signals (clicks, dwell time) far more abundant than explicit (ratings).
  • Cold start (new users, new items) is 20%+ of traffic and needs a fallback strategy.
  • Business constraints (fairness, revenue, safety) override pure relevance.
03
Chapter Three

Naive Design

Global Popularity + Simple Rules

The simplest recommendation: show everyone the most popular items. Sort by total purchases/views in the last 7 days, return top-20. Add basic rules: "if user bought X, show category Y." No personalization, no machine learning, no user modeling. Works surprisingly well for a new service (popular items are popular for a reason). Breaks when you need to differentiate: everyone sees the same recommendations, engagement plateaus, and long-tail items never get surfaced.


What Works

  • Simple: one SQL query sorted by popularity
  • No cold start problem (works for all users equally)
  • Decent baseline: popular items have broad appeal
  • No ML infrastructure needed

What Breaks

  • Zero personalization: everyone sees the same list
  • Rich-get-richer: popular items stay popular, new items invisible
  • Engagement plateau: users see nothing new or surprising
  • Long-tail items never recommended (90% of catalog ignored)
  • No learning: system never improves from user behavior

Popularity-based recommendations remain the correct fallback for cold-start users: when no behavior history exists, popular items are a better bet than a personalized model with no data to work from.
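As a rough sketch of the baseline, the "most popular in the last 7 days" rule fits in a few lines. The event shape `(item_id, timestamp)` and the function name `top_popular` are assumptions for illustration; a production version would more likely be a scheduled SQL aggregate.

```python
from collections import Counter
from datetime import datetime, timedelta


def top_popular(events, now, k=20, window_days=7):
    """events: iterable of (item_id, timestamp) view or purchase events."""
    cutoff = now - timedelta(days=window_days)
    counts = Counter(item for item, ts in events if ts >= cutoff)
    # most_common sorts by count, highest first
    return [item for item, _ in counts.most_common(k)]


now = datetime(2024, 1, 8)
events = [
    ("a", datetime(2024, 1, 7)),
    ("a", datetime(2024, 1, 6)),
    ("b", datetime(2024, 1, 5)),
    ("c", datetime(2023, 12, 1)),   # older than the 7-day window, ignored
]
print(top_popular(events, now, k=2))   # ['a', 'b']
```

Every caller gets the same list, which is exactly the limitation listed under "What Breaks".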

Figure: Naive Design (Global Popularity Ranking)
User A (jazz fan), User B (metal fan), and User N (new user) all call the same Recommendation Service, which runs one query (SELECT item_id FROM items ORDER BY views DESC LIMIT 20) and returns an identical Top-20 (Pop Hits) list to each of them.
Problems shown in the figure:
  • Zero personalization: User A (jazz fan) and User B (metal fan) get the same "Best Of Pop" recommendations.
  • Rich-get-richer: popular items stay popular. Long-tail items (90% of catalog) never surface.
  • No feedback loop: the system never learns. The list stays the same regardless of user behavior.
Chapter 3 Summary
  • Popularity-based: simple, no personalization, decent baseline.
  • Everyone sees same list, so engagement plateaus quickly.
  • Long-tail items (90% of catalog) never surfaced.
  • No feedback loop: system never learns from user behavior.
04
Chapter Four

Refined Design

Multi-Stage Retrieval + Ranking Pipeline

The refined design uses a funnel approach: from 100M candidate items, narrow to 1000 candidates (retrieval), score those 1000 (ranking), apply business rules (filtering), then return top-20. Each stage uses different models optimized for its task: retrieval prioritizes recall (don't miss good items), ranking prioritizes precision (order the best items correctly). This separation allows each stage to use an appropriate compute budget: retrieval is cheap per item, ranking is expensive but runs on few items.

Figure: Refined Design (Multi-Stage Recommendation Pipeline)
  • Candidate Retrieval: 100M → 1000 items, using ANN and collaborative filtering, ~10ms
  • Ranking Model: 1000 → 100 scored items, using a deep neural network, ~50ms
  • Re-Ranking: 100 → 20 final items, ~5ms. Rules: dedup (no same brand), diversity (max 3/category), boost (sponsored, new), filter (out of stock, 18+)
  • Supporting stores: Item Index (ANN over pre-computed embeddings), Feature Store (user + item features), User Profile (embeddings, history)
  • Offline Training Pipeline: event logs (click, purchase, skip, collected asynchronously) → feature engineering → model training → deploy model weights and embeddings
Retrieval (~10ms) + Ranking (~50ms) + Re-rank (~5ms) ≈ 65ms, inside the <100ms serving budget. Training: offline batch (hours). Serving: online real-time (<100ms). The feature store bridges offline training and online serving: same features, different latency.
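The three stages can be sketched end to end on a toy catalog. This is an illustrative skeleton under simplifying assumptions: the candidate sources are hard-coded lists, `model_score` is a lookup table standing in for the ranking network, and the function names are hypothetical.

```python
def retrieve(sources, limit):
    """Cheap stage: merge candidate lists from several sources, dropping duplicates."""
    merged, seen = [], set()
    for source in sources:
        for item in source:
            if item not in seen:
                seen.add(item)
                merged.append(item)
    return merged[:limit]


def rank(candidates, score, limit):
    """Expensive stage: order the surviving candidates by model score."""
    return sorted(candidates, key=score, reverse=True)[:limit]


def rerank(ranked, category_of, max_per_category, limit):
    """Business rules: cap how many items any one category may contribute."""
    kept, per_category = [], {}
    for item in ranked:
        category = category_of[item]
        if per_category.get(category, 0) >= max_per_category:
            continue
        per_category[category] = per_category.get(category, 0) + 1
        kept.append(item)
        if len(kept) == limit:
            break
    return kept


# Toy inputs: three retrieval sources (e.g. CF, content-based, trending).
sources = [["a", "b", "c"], ["c", "e"], ["d"]]
model_score = {"a": 0.2, "b": 0.9, "c": 0.8, "d": 1.0, "e": 0.7}
category_of = {"a": "jazz", "b": "jazz", "c": "jazz", "d": "metal", "e": "rock"}

candidates = retrieve(sources, limit=4)                # ['a', 'b', 'c', 'e']
ranked = rank(candidates, model_score.get, limit=4)    # ['b', 'c', 'e', 'a']
final = rerank(ranked, category_of, max_per_category=1, limit=3)
print(final)                                           # ['b', 'e']
```

Item "d" has the highest model score but never reaches the ranker because retrieval cut it, which is why the retrieval stage is tuned for recall.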

Candidate Retrieval

  • Goal: recall, find 1000 potentially relevant items from 100M
  • Source 1: Collaborative filtering (users like you liked...)
  • Source 2: Content-based (similar item features)
  • Source 3: ANN vector search (user embedding → nearest item embeddings)
  • Source 4: Popularity (trending, new arrivals)
  • Multiple sources merged to maximize recall at low compute cost

Ranking Model

  • Goal: precision, correctly order the 1000 candidates
  • Deep neural network with rich features (user + item + context)
  • Features from feature store: user history, item attributes, real-time context
  • Predict P(click), P(purchase), P(watch_completion)
  • Score = weighted combination of predicted objectives
  • Heavy compute but only runs on 1000 items (not 100M)
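The "weighted combination of predicted objectives" might look like the sketch below. The weights and the two example candidates are made-up values for illustration; real weights would be tuned through online experiments.

```python
# Assumed objective weights (illustrative only).
WEIGHTS = {"p_click": 0.2, "p_purchase": 0.5, "p_watch_completion": 0.3}


def combined_score(predictions: dict) -> float:
    """Blend per-objective model outputs into a single ranking score."""
    return sum(WEIGHTS[name] * predictions[name] for name in WEIGHTS)


candidates = {
    "clickbait": {"p_click": 0.9, "p_purchase": 0.01, "p_watch_completion": 0.1},
    "solid":     {"p_click": 0.4, "p_purchase": 0.05, "p_watch_completion": 0.8},
}
ranked = sorted(candidates, key=lambda c: combined_score(candidates[c]), reverse=True)
print(ranked)   # ['solid', 'clickbait']
```

With these weights the high-click, low-completion item loses to the item users actually finish, which is the point of scoring on more than P(click).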
Figure: Two-Tower Model (How Retrieval Works)
  • User Tower: age, location, watch history, recent searches, device type → neural network → user embedding (128d), computed at serving time
  • Item Tower: title, genre, description, engagement rates, recency → neural network → item embeddings (128d), pre-computed and indexed
  • ANN index lookup by cosine similarity → top-K nearest items (candidates)
Item embeddings are pre-computed. Only the user embedding is computed at serving time. ANN lookup is roughly O(log N).
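The serving side of the two-tower setup can be sketched as follows. Both towers are assumed to have already produced their vectors, and an exact cosine scan over three toy items stands in for the ANN index; a real system would replace `top_k` with an approximate index lookup.

```python
import heapq
import math


def normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


# Offline: item-tower outputs (toy 3-d vectors), normalized once and stored.
ITEM_INDEX = {
    item: normalize(vec)
    for item, vec in {
        "jazz_live": [0.9, 0.1, 0.0],
        "jazz_docu": [0.8, 0.2, 0.1],
        "metal_gig": [0.0, 0.1, 0.9],
    }.items()
}


def top_k(user_embedding, k):
    """Online: score the user vector against every stored item vector (exact scan)."""
    user = normalize(user_embedding)

    def cosine(item):
        return sum(a * b for a, b in zip(user, ITEM_INDEX[item]))

    return heapq.nlargest(k, ITEM_INDEX, key=cosine)


print(top_k([1.0, 0.0, 0.0], k=2))   # ['jazz_live', 'jazz_docu']
```

Only the user vector changes per request; the item side is computed once and reused, which is what makes the lookup cheap.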

The feature store is the critical infrastructure that bridges offline training and online serving. Models are trained offline on historical features. At serving time, the same features must be available in real-time (<10ms). The feature store pre-computes and caches user features (updated every few minutes) and item features (updated on catalog change). Without it, you either train on different features than you serve (training-serving skew) or compute features on every request (latency explosion).
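One way to picture the "same features for both paths" property is a registry in which each feature has exactly one definition. The sketch below is an in-memory toy, not a real feature-store API: `FEATURES`, `training_row`, `refresh_online`, and `serve_features` are hypothetical names.

```python
FEATURES = {}       # feature name -> the single function that defines it
ONLINE_CACHE = {}   # user_id -> latest pre-computed feature values


def feature(name):
    """Register one definition per feature name."""
    def register(fn):
        FEATURES[name] = fn
        return fn
    return register


@feature("watch_completion_rate")
def watch_completion_rate(events):
    # events: list of (seconds_watched, video_length_seconds), lengths > 0
    if not events:
        return 0.0
    return sum(min(watched / length, 1.0) for watched, length in events) / len(events)


def training_row(events):
    """Offline path: a batch job computes features from logged events."""
    return {name: fn(events) for name, fn in FEATURES.items()}


def refresh_online(user_id, events):
    """Background job (every few minutes): recompute and cache for serving."""
    ONLINE_CACHE[user_id] = {name: fn(events) for name, fn in FEATURES.items()}


def serve_features(user_id):
    """Online path: a cache read, with no feature logic of its own."""
    return ONLINE_CACHE[user_id]


events = [(30, 60), (120, 100)]    # half-watched, then fully watched
refresh_online("u1", events)
print(training_row(events) == serve_features("u1"))   # True
```

Because the serving path only reads cached values, there is no second implementation that could drift from the training one.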

Figure: Feature Store (Bridging Offline Training and Online Serving)
  • Offline (training): raw event logs → feature engineering (Spark) → write to feature store → model training reads from it
  • Online (serving): user request arrives → read features (~5ms) → model inference → recommendation served
  • The feature store is the single source of truth: the SAME features feed both paths
Without a feature store, a feature is computed differently in training (batch SQL) vs serving (real-time query) → training-serving skew → model underperforms in production.
Chapter 4 Summary
  • Multi-stage funnel: retrieval (100M→1K) → ranking (1K→100) → re-rank (100→20).
  • Retrieval: cheap, recall-focused. Multiple sources (collab filter, ANN, content-based).
  • Ranking: expensive deep learning model. Precision-focused. Rich features.
  • Feature store: bridges offline training and online serving. Prevents training-serving skew.
05
Chapter Five

Alternative Approaches

Recommendation Strategies
Collaborative Filtering
  • "Users who liked X also liked Y": behavior patterns
  • Matrix factorization: user × item → latent factor embeddings
  • Powerful: discovers non-obvious connections between items
  • Cold start problem: new users/items have no behavior data
  • Popularity bias: already-popular items get recommended more
  • Used by: Netflix (early), Amazon (item-item CF)

Content-Based Filtering
  • "Similar items based on features": metadata matching
  • Item features: genre, author, price range, description embeddings
  • No cold start for items: new item with metadata can be recommended immediately
  • Limited serendipity: recommends same type (thriller → more thrillers)
  • Need good metadata/features for items
  • Used by: Spotify (audio features), news (topic matching)
Figure: Collaborative Filtering (User-Item Matrix Factorization)
          Item A  Item B  Item C  Item D  Item E
User 1      5       4       ?       1       ?
User 2      4       ?       2       ?       5
User 3      ?       3       ?       4       2
User 4      1       ?       5       ?       3
Numbers are known ratings; "?" marks an unknown rating to predict. Example prediction for User 1 on Item C: 3.8, computed as User1 latent · ItemC latent. Matrix ≈ User Matrix × Item Matrix (each reduced to k latent factors, e.g., k=50). User 1 and User 2 both like Item A (5, 4) → similar tastes → recommend User 2's liked items (E=5) to User 1. The model finds latent patterns: users who like {A, B} tend to also like {E}.
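A minimal sketch of matrix factorization on the ratings from the figure, trained with plain stochastic gradient descent. The hyperparameters (k=2, learning rate, regularization, epoch count) are assumptions chosen for a toy example, and the prediction it prints is not expected to match the 3.8 shown in the figure.

```python
import random


def factorize(ratings, n_users, n_items, k=2, lr=0.02, reg=0.02, epochs=500, seed=0):
    """Learn user factors P and item factors Q so that P[u] . Q[i] approximates r."""
    rng = random.Random(seed)
    P = [[rng.uniform(0, 1) for _ in range(k)] for _ in range(n_users)]
    Q = [[rng.uniform(0, 1) for _ in range(k)] for _ in range(n_items)]
    for _ in range(epochs):
        for u, i, r in ratings:
            err = r - sum(P[u][f] * Q[i][f] for f in range(k))
            for f in range(k):
                pu, qi = P[u][f], Q[i][f]
                P[u][f] += lr * (err * qi - reg * pu)   # gradient step on user factor
                Q[i][f] += lr * (err * pu - reg * qi)   # gradient step on item factor
    return P, Q


def predict(P, Q, u, i):
    return sum(a * b for a, b in zip(P[u], Q[i]))


def rmse(P, Q, ratings):
    return (sum((r - predict(P, Q, u, i)) ** 2 for u, i, r in ratings) / len(ratings)) ** 0.5


# Known cells of the figure's matrix as (user, item, rating); items A..E are 0..4.
ratings = [
    (0, 0, 5), (0, 1, 4), (0, 3, 1),
    (1, 0, 4), (1, 2, 2), (1, 4, 5),
    (2, 1, 3), (2, 3, 4), (2, 4, 2),
    (3, 0, 1), (3, 2, 5), (3, 4, 3),
]
untrained = factorize(ratings, 4, 5, epochs=0)   # random initial factors only
trained = factorize(ratings, 4, 5)               # same seed, after SGD
print("RMSE before:", round(rmse(*untrained, ratings), 3))
print("RMSE after: ", round(rmse(*trained, ratings), 3))
print("User 1, Item C:", round(predict(*trained, 0, 2), 2))
```

The unknown cells are never used in training; a prediction for them falls out of the learned factors, which is how the model fills in the "?" entries.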
Deep Learning (Two-Tower)
  • One tower encodes user, one tower encodes item → dot product = score
  • Trained end-to-end on engagement data
  • Embeddings capture complex non-linear relationships
  • Efficient serving: pre-compute item embeddings, ANN for retrieval
  • Dominates production systems at scale
  • Used by: YouTube, TikTok, Instagram Explore

Multi-Armed Bandit (Explore/Exploit)
  • Treat each recommendation slot as a decision under uncertainty
  • Exploit: show highest-predicted item. Explore: show uncertain item.
  • Thompson Sampling, UCB, or ε-greedy policies
  • Automatically balances showing known-good vs discovering new
  • Handles cold start: uncertain items get explored automatically
  • Used by: Netflix (cover art), Spotify (Discover Weekly)
Figure: Explore vs Exploit (The Recommendation Dilemma)
  • Pure exploitation: Item A gets 95% of slots, Item B 5%, Item C 0% (never shown). High short-term CTR. Never discovers better items. Filter bubble.
  • ε-greedy (10% exploration): Item A 80%, Item B 10%, Item C 10% (exploration). Slightly lower short-term CTR. Discovers new preferences over time.
  • Real-world exploration budgets: Netflix reserves 10-15% of homepage slots for exploration. Spotify: Discover Weekly is 100% exploration.
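An ε-greedy policy is the simplest of the bandit options listed above and can be sketched in a few lines. The click rates in `true_ctr` are invented for the simulation, and the class is an illustration rather than a production policy.

```python
import random


class EpsilonGreedy:
    def __init__(self, arms, epsilon=0.1, seed=0):
        self.epsilon = epsilon
        self.rng = random.Random(seed)
        self.shows = {arm: 0 for arm in arms}
        self.clicks = {arm: 0 for arm in arms}

    def estimate(self, arm):
        """Observed click-through rate so far (0.0 before any impression)."""
        return self.clicks[arm] / self.shows[arm] if self.shows[arm] else 0.0

    def select(self):
        if self.rng.random() < self.epsilon:
            return self.rng.choice(list(self.shows))   # explore: any arm, uniformly
        return max(self.shows, key=self.estimate)      # exploit: best estimate so far

    def update(self, arm, clicked):
        self.shows[arm] += 1
        self.clicks[arm] += int(clicked)


# Simulated world with made-up click rates; item C is the best but starts unknown.
true_ctr = {"A": 0.05, "B": 0.04, "C": 0.12}
bandit = EpsilonGreedy(true_ctr, epsilon=0.1, seed=0)
world = random.Random(1)
for _ in range(5000):
    arm = bandit.select()
    bandit.update(arm, world.random() < true_ctr[arm])

print(bandit.shows)   # every arm gets some impressions thanks to exploration
```

Thompson Sampling and UCB follow the same select/update shape but choose the exploration target by uncertainty instead of uniformly at random.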

The two-tower model solves the fundamental scaling problem of recommendation retrieval. Scoring every user-item pair is O(users × items) = 500M × 100M = 5 × 10^16 (50 quadrillion) operations, and even a single request scored against the full catalog costs 100M model evaluations, which is infeasible in real time. The two-tower model pre-computes item embeddings and indexes them with Approximate Nearest Neighbor (ANN) search. At serving time, only the user embedding is computed (milliseconds), then ANN retrieves the nearest 1000 item vectors in roughly O(log N) time, trading a small amount of recall for speed. This collapses a full-catalog scan into a single fast lookup, and is why deep learning now dominates recommendation retrieval at major platforms.
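Working through the arithmetic with the scale figures from Chapter 1 makes the gap concrete. The `log2` line is only an order-of-magnitude illustration of logarithmic growth, not a measured cost of any particular ANN index.

```python
import math

users = 500_000_000
items = 100_000_000
peak_rps = 1_000_000

all_pairs = users * items                 # every user scored against every item
brute_force_per_sec = peak_rps * items    # full-catalog scan on every request at peak
log_scale_steps = math.log2(items)        # growth rate of an O(log N) lookup

print(f"all user-item pairs:      {all_pairs:.1e}")             # 5.0e+16
print(f"brute-force scores/sec:   {brute_force_per_sec:.1e}")   # 1.0e+14
print(f"log2(catalog size):       {log_scale_steps:.1f}")       # 26.6
```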

Chapter 5 Summary
  • Collaborative filtering: behavior-based, powerful but cold-start vulnerable.
  • Content-based: feature-matching, handles cold start, limited serendipity.
  • Two-tower deep learning: dominant at scale. Pre-compute embeddings, ANN retrieval.
  • Multi-armed bandit: explore/exploit balance. Handles cold start naturally.
06
Chapter Six

What Real Companies Did

Production Recommendation Systems

YouTube

  • Two-stage deep architecture: candidate generation, then ranking
  • Candidate gen: produce 100s from billions of videos (ANN)
  • Ranking: deep network with 100+ features
  • Watch time prediction (not just click): optimize engagement quality
  • Published: "Deep Neural Networks for YouTube Recommendations" (2016)

Netflix

  • Personalized homepage: each row is a different model output
  • Artwork personalization: different thumbnails per user
  • Offline batch (Spark) + online lightweight re-ranking
  • Thompson Sampling for explore/exploit on new content
  • $1B/year saved from reduced churn attributed to recommendations

Amazon

  • Item-to-item collaborative filtering (original patent)
  • "Customers who bought X also bought Y": simple but effective
  • Real-time: updates recommendations as cart changes
  • 35% of revenue attributed to recommendations
  • Personalized per surface: homepage vs product page vs cart

Spotify

  • Discover Weekly: collaborative filtering + audio features
  • Audio embeddings: analyze raw audio (ML model on waveform)
  • NLP on playlists: treat playlists as "documents" (and tracks as words) for NLP models
  • Bandits for podcast recommendations (cold start domain)
  • Blends collaborative (behavior) and content (audio features)
Production Recommendation Systems: Comparison
Company | Retrieval | Ranking | Special Pattern | Key Metric
YouTube | Deep candidate generation + ANN | Deep network, 100+ features | Watch-time prediction (not clicks) | Billions of videos ranked/day
Netflix | CF + content hybrid | Offline batch + online re-rank | Thompson Sampling, per-surface models | $1B/year churn reduction
Amazon | Item-to-item CF | Lightweight scoring + rules | Real-time cart-aware, 35% revenue | Intent-aware (changes with cart)
Spotify | Audio embeddings + CF | Playlist NLP + audio features | Bandits for cold-start podcasts | Weekly full personalization
Chapter 6 Summary
  • YouTube: deep candidate generation and ranking, watch-time optimization, ANN retrieval at billion scale.
  • Netflix: personalized rows + artwork, Thompson Sampling, $1B/year churn reduction.
  • Amazon: item-to-item CF, real-time cart-aware, 35% revenue from recommendations.
  • Spotify: audio embeddings + collaborative filtering + bandits for cold start.
07
Chapter Seven

Best Practices Extracted

Transferable Lessons

Multi-Stage Funnel

  • Never rank all items with an expensive model: infeasible at scale
  • Cheap retrieval → expensive ranking → business filtering
  • Each stage 10-100x reduction in candidates
  • Different models optimized for each stage's objective
  • Transfers to: search ranking, ad selection, feed ranking

Feature Store Pattern

  • Pre-compute features offline, serve online at low latency
  • Same feature definitions for training and serving (no skew)
  • User features: updated every N minutes (near-real-time)
  • Item features: updated on catalog change (batch)
  • Transfers to: any ML system with online inference

Online Experimentation

  • A/B test every model change on live traffic
  • Metric: not just clicks, but long-term engagement and retention
  • Guardrail metrics: ensure no regression on safety, diversity
  • Small % traffic → ramp → full rollout over weeks
  • Transfers to: any product change validated by user behavior
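The "small % traffic → ramp → full rollout" step is commonly implemented with deterministic hash bucketing, sketched below. The function name, the experiment key format, and the 10,000-bucket granularity are assumptions for illustration.

```python
import hashlib


def in_treatment(user_id: str, experiment: str, rollout_percent: float) -> bool:
    """Deterministically assign a user to the treatment group of one experiment."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) % 10_000      # 0..9999, stable per (experiment, user)
    return bucket < rollout_percent * 100      # e.g. 1% -> buckets 0..99


users = [f"user-{i}" for i in range(10_000)]
for percent in (1, 10, 50):
    share = sum(in_treatment(u, "new_ranker_v2", percent) for u in users) / len(users)
    print(f"{percent}% rollout -> {share:.1%} of users in treatment")
```

Because the bucket depends only on the experiment name and user id, a user who sees the new model at 1% keeps seeing it as the rollout ramps, and different experiments get independent splits.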
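Figure: Multi-Stage Recommendation Funnel (Candidate Reduction)
100,000,000 items in catalog → Candidate Retrieval (ANN + CF + trending; ~10ms, cheap per item) → 1,000 candidates → Ranking Model (DNN; ~50ms, expensive per item) → 100 ranked items → Re-rank (~5ms) → Top 20.
Why the funnel matters: at an illustrative 5ms per item, scored one at a time, ranking ALL 100M items would take 100M × 5ms ≈ 139 hours. Ranking only the 1,000 retrieved candidates is 100,000x less work (1,000 × 5ms = 5 seconds if done sequentially), and batched, parallel scoring is what brings that within the ~50ms ranking budget.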

Optimizing for clicks is not the same as optimizing for value. Click-optimized systems promote clickbait: sensational thumbnails and titles that disappoint after clicking. Watch-time or satisfaction-optimized systems promote genuinely good content. YouTube's shift from click prediction to watch-time prediction significantly improved content quality on the platform. The metric you optimize becomes the system's behavior, so choose carefully.

Chapter 7 Summary
  • Multi-stage funnel: retrieval → ranking → filtering. Each stage: different model, different budget.
  • Feature store: bridge between offline training and online serving. Prevent skew.
  • Experimentation: A/B test everything. Optimize long-term metrics, not just clicks.
  • Two-tower serving: pre-compute all item embeddings. At request time, compute only user embedding → ANN lookup → O(log N) retrieval of top-K candidates.
08
Chapter Eight

What Could Go Wrong

Common Failure Patterns

Filter Bubble

  • System only recommends what user already likes → echo chamber
  • User's world shrinks: never exposed to new genres/topics
  • Engagement appears fine short-term, but long-term retention drops
  • Fix: explicit diversity injection (MMR), exploration budget (5-10% slots for discovery), track long-term satisfaction metrics.
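Maximal Marginal Relevance (MMR), named in the fix above, greedily picks the item with the best trade-off between relevance and similarity to what has already been selected. The sketch below uses a toy same-genre similarity and an assumed trade-off weight `lam=0.7`; real systems would use embedding similarity.

```python
def mmr(candidates, relevance, similarity, k, lam=0.7):
    """Greedy MMR: lam weights relevance, (1 - lam) penalizes redundancy."""
    selected = []
    remaining = list(candidates)
    while remaining and len(selected) < k:
        def mmr_score(item):
            # Redundancy = similarity to the closest already-selected item.
            redundancy = max((similarity(item, s) for s in selected), default=0.0)
            return lam * relevance[item] - (1 - lam) * redundancy
        best = max(remaining, key=mmr_score)
        selected.append(best)
        remaining.remove(best)
    return selected


relevance = {"thriller_1": 0.95, "thriller_2": 0.93, "thriller_3": 0.90, "comedy_1": 0.80}


def same_genre(a, b):
    """Toy similarity: 1.0 if the genre prefix matches, else 0.0."""
    return 1.0 if a.rsplit("_", 1)[0] == b.rsplit("_", 1)[0] else 0.0


print(mmr(relevance, relevance, same_genre, k=3))
# ['thriller_1', 'comedy_1', 'thriller_2']
```

Sorting by relevance alone would return three thrillers; the redundancy penalty pulls the comedy into second place.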

Training-Serving Skew

  • Model trained on features computed differently than serving
  • Example: "user_click_count" computed differently in batch vs real-time
  • Model performs great offline, terrible in production
  • Fix: feature store with single definition for train and serve. Log serving features and compare to training features. Alert on distribution shift.
Figure: Training-Serving Skew (What Goes Wrong Without a Feature Store)
  • Training (offline): batch SQL over the last 7 calendar days gives user_clicks_7d = 47; the model is trained with 47
  • Serving (online), the bug: a real-time 7 × 24h rolling window gives user_clicks_7d = 52; the model receives 52 and makes wrong predictions
  • With a feature store: a single definition (rolling 7 × 24h); training reads 47 and serving reads 47
Training-serving skew is one of the most common silent killers of ML model performance. The model looks great in offline evaluation (same feature pipeline) but underperforms in production. It often takes weeks to diagnose because metrics degrade slowly, not catastrophically. Fix: feature store + log serving features + compare distributions + automated alerts.
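The calendar-day versus rolling-window mismatch is easy to reproduce. In the sketch below the two functions, the sample click times, and the 5% alert threshold are all made up for illustration; the point is only that two reasonable-looking definitions of the same feature name disagree.

```python
from datetime import datetime, timedelta


def clicks_calendar_7d(click_times, now):
    """Batch-style definition: the 7 full calendar days before today."""
    today = now.replace(hour=0, minute=0, second=0, microsecond=0)
    return sum(1 for t in click_times if today - timedelta(days=7) <= t < today)


def clicks_rolling_7d(click_times, now):
    """Serving-style definition: the last 7 x 24 hours up to now."""
    return sum(1 for t in click_times if now - timedelta(days=7) < t <= now)


def relative_gap(train_value, serve_value):
    """Relative difference between the training and serving value of a feature."""
    return abs(train_value - serve_value) / max(abs(train_value), 1e-9)


now = datetime(2024, 1, 8, 18, 0)
clicks = [
    datetime(2024, 1, 1, 9, 0),    # inside the calendar window only
    datetime(2024, 1, 5, 12, 0),   # inside both windows
    datetime(2024, 1, 8, 10, 0),   # today: inside the rolling window only
    datetime(2024, 1, 8, 17, 0),   # today: inside the rolling window only
]
train_value = clicks_calendar_7d(clicks, now)   # 2
serve_value = clicks_rolling_7d(clicks, now)    # 3
print(train_value, serve_value)

ALERT_THRESHOLD = 0.05   # assumed tolerance
if relative_gap(train_value, serve_value) > ALERT_THRESHOLD:
    print("skew alert: user_clicks_7d differs between training and serving")
```

Logging the serving-time value next to the training-time value for the same user and timestamp is what makes this comparison possible in practice.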

Feedback Loop Amplification

  • System recommends item A → users click A → more signal for A
  • A gets recommended more → even more clicks → dominates recommendations
  • Items not initially recommended never get clicks → never recommended
  • Fix: position-debiased training (account for placement effect), propensity scoring, counterfactual evaluation, exploration budget.
Figure: Feedback Loop Amplification (Self-Reinforcing Bias)
Self-reinforcing loop for Item A:
  • Step 1: System recommends Item A heavily
  • Step 2: Users click A (because it's shown)
  • Step 3: High click signal → more training data
  • Step 4: Model learns "A = high engagement"
  • Step 5: System recommends A even MORE → back to step 1
Items B, C, D: not recommended → no clicks → no signal → extinction.
Over time: 10% of items receive 90% of recommendations. 90% of catalog becomes invisible. Diversity collapses.
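Propensity scoring, one of the fixes listed above, reweights each click by how likely the user was to look at that position at all. The sketch below uses inverse propensity scoring with made-up examination probabilities and a made-up impression log; in practice the propensities have to be estimated, for example from randomized position swaps.

```python
# Assumed examination propensities: chance a user even looks at each position.
PROPENSITY = {1: 1.0, 2: 0.6, 3: 0.4, 4: 0.25}


def naive_ctr(logs, item):
    """Raw clicks / impressions, ignoring where the item was shown."""
    outcomes = [clicked for it, _, clicked in logs if it == item]
    return sum(outcomes) / len(outcomes)


def ips_ctr(logs, item):
    """Inverse-propensity-weighted CTR: clicks at rarely seen positions count more."""
    rows = [(pos, clicked) for it, pos, clicked in logs if it == item]
    return sum(clicked / PROPENSITY[pos] for pos, clicked in rows) / len(rows)


# Log rows: (item, position, clicked). A always sits in slot 1, B always in slot 4.
logs = (
    [("A", 1, 1)] * 10 + [("A", 1, 0)] * 90
    + [("B", 4, 1)] * 4 + [("B", 4, 0)] * 96
)
print("naive:", naive_ctr(logs, "A"), naive_ctr(logs, "B"))   # A looks better
print("ips:  ", ips_ctr(logs, "A"), ips_ctr(logs, "B"))       # B looks better
```

Trained on the naive numbers, the model would keep A on top; the debiased estimate suggests B deserves a chance in a better slot, which breaks the loop.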
โ„๏ธ

Cold Start Failure

  • New user: no history → collaborative filtering returns nothing
  • New item: no interactions → never enters candidate generation
  • Roughly 20% of users/items affected at any time (constant influx)
  • Fix: content-based fallback, popularity baseline, onboarding preferences, metadata-based item features, bandit for new items.

Feedback loops are the most dangerous long-term failure because they are self-reinforcing. The system creates the data it trains on: recommend → user clicks → more training signal → recommend more. Over time, this converges to a tiny set of items dominating the entire catalog. The fix requires actively fighting the system's natural convergence: exploration budgets, position debiasing, and measuring whether recommendations are surfacing catalog diversity.

Chapter 8 Summary
  • Filter bubble: diversity injection + exploration budget + long-term satisfaction metrics.
  • Training-serving skew: feature store with unified definitions. Monitor feature distributions.
  • Feedback loops: position debiasing + propensity scoring + exploration. Fight convergence.
  • Cold start: content-based fallback + popularity + bandit exploration for new items.
  • Principle: the metric you optimize becomes the system's behavior. Choose wisely.