
Case Study: Recommendation Engine

Design, trade-offs, and alternatives for a recommendation engine at scale.

Chapter One

Problem Statement

What We Are Building

A recommendation engine predicts which items a user will engage with (products to buy, videos to watch, songs to listen to, people to follow) based on their past behavior, similar users, and item characteristics. The core challenge is ranking a very large candidate pool for each user in real time, using signals that range from explicit (ratings, purchases) to implicit (dwell time, scroll speed, skip patterns). At scale, you are computing personalized rankings for 500M+ users across 100M+ items and serving each request in under 100ms.

Scale Requirements

Traffic & Scale

500M daily active users
100M+ items in catalog
1M recommendation requests/sec at peak
10B+ user-item interactions/day (impressions, clicks, purchases)
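A quick back-of-envelope check helps make these figures concrete. The sketch below only does arithmetic on the numbers listed above; the uniform-load assumption and the variable names are illustrative simplifications, not measurements.

```python
# Back-of-envelope arithmetic from the stated scale figures.
DAILY_ACTIVE_USERS = 500_000_000
INTERACTIONS_PER_DAY = 10_000_000_000
PEAK_REQUESTS_PER_SEC = 1_000_000
SECONDS_PER_DAY = 86_400

# Average interaction events each active user generates per day.
interactions_per_user = INTERACTIONS_PER_DAY / DAILY_ACTIVE_USERS

# Average event-ingest rate, assuming (unrealistically) uniform load over the day.
avg_events_per_sec = INTERACTIONS_PER_DAY / SECONDS_PER_DAY

# Upper bound on requests per user per day if the peak rate were sustained all day.
max_requests_per_user = PEAK_REQUESTS_PER_SEC * SECONDS_PER_DAY / DAILY_ACTIVE_USERS

print(f"{interactions_per_user:.0f} interactions per user per day")
print(f"{avg_events_per_sec:,.0f} events/sec on average")
print(f"{max_requests_per_user:.1f} requests per user per day at sustained peak")
```

The event stream averages roughly 116K events/sec, about a tenth of the peak request rate, so the read path (serving) and the write path (interaction logging) can reasonably be sized separately.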

Requirements

Latency: <100ms per recommendation request
Freshness: new interactions reflected in <30 minutes
Relevance: personalized, not just popular
Diversity: avoid filter bubbles (explore vs exploit)

A recommendation engine is a real-time ranking system that must balance exploitation (show what we know the user likes) with exploration (show new items to learn preferences). Pure exploitation creates filter bubbles — users see the same type of content forever. Pure exploration feels random. The best systems use multi-armed bandit approaches or explicit diversity injection to balance both. This is not just an engineering problem — it is a product and ethical decision that affects what 500M people see every day.

📋 Chapter 1 — Summary

500M users, 100M+ items, 1M rec requests/sec. Personalized in <100ms.
Signals: clicks, purchases, dwell time, ratings, skip patterns.
Balance exploitation (relevance) with exploration (diversity).
New interactions must update recommendations within 30 minutes.

Chapter Two

Questions to Ask

Clarifying Before Designing

🎯

Recommendation Context

Homepage (cold start) or in-session (already browsing)?
"Similar items" or "personalized for you"?
Single domain (movies) or cross-domain (movies + books)?
Real-time context (time, location, device)?

📊

Signal Types

Explicit feedback (ratings 1-5)?
Implicit feedback (views, clicks, dwell time)?
Negative signals (skip, hide, dislike)?
Social signals (friends' activity)?

⚖️

Business Constraints

Revenue optimization or engagement?
Fairness constraints (expose all sellers)?
Content moderation (don't recommend harmful content)?
Explainability ("Because you watched...")?

The cold start problem determines your architecture's complexity. A new user with no history can't get collaborative filtering recommendations. A new item with no interactions can't be recommended by any behavior-based model. You need content-based fallbacks (recommend based on item features), popularity-based defaults, and rapid learning from the first few interactions. Cold start is not an edge case: it is 20%+ of your traffic (new users, new items daily).
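One way to picture the fallback strategy is as a chain that picks the richest approach the available data supports. This is a minimal sketch: the `MIN_HISTORY` threshold, the function name, and the idea of onboarding interests keyed by topic are all assumptions made for illustration.

```python
from typing import Callable, Dict, List

MIN_HISTORY = 5  # assumed threshold for trusting behavior-based models


def recommend_with_fallback(
    user_history: List[str],
    declared_interests: List[str],
    personalized: Callable[[List[str]], List[str]],
    by_interest: Dict[str, List[str]],
    popular: List[str],
    k: int = 3,
) -> List[str]:
    """Pick the richest strategy the available data supports."""
    if len(user_history) >= MIN_HISTORY:
        # Enough behavior data: use the personalized model.
        return personalized(user_history)[:k]
    if declared_interests:
        # Content-based fallback: items tagged with onboarding interests.
        picks: List[str] = []
        for topic in declared_interests:
            for item in by_interest.get(topic, []):
                if item not in picks and item not in user_history:
                    picks.append(item)
        if picks:
            return picks[:k]
    # Last resort: global popularity, skipping items already seen.
    return [item for item in popular if item not in user_history][:k]
```

A brand-new user with no declared interests falls through to the popularity list, while a user who picked "cooking" during onboarding gets cooking items before any behavior has been logged.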

For This Case Study, Our Answers Are:

Context: homepage recommendations (cold open, no session context)
Item domain: single domain — video content (YouTube-style)
Primary signals: implicit (watch time, completion rate, skip) + explicit (likes, saves)
Negative signals: yes — skip and explicit "not interested" used
Social signals: no — individual behavior only (no friend graph)
Cold start: yes — 20%+ of users are new or returning after 30+ days
Business constraints: no harmful content, max 2 sponsored slots in top-20
Personalization freshness: user model updated every 30 minutes
Explainability: "Because you watched X" shown for top recommendations
Catalog size: 100M+ items, ~10K new items added daily

📋 Chapter 2 — Summary

Context matters: homepage vs in-session recommendations have different models.
Implicit signals (clicks, dwell time) far more abundant than explicit (ratings).
Cold start (new users, new items) is 20%+ of traffic — needs fallback strategy.
Business constraints (fairness, revenue, safety) override pure relevance.

Chapter Three

Naive Design

Global Popularity + Simple Rules

The simplest recommendation: show everyone the most popular items. Sort by total purchases/views in the last 7 days, return top-20. Add basic rules: "if user bought X, show category Y." No personalization, no machine learning, no user modeling. Works surprisingly well for a new service (popular items are popular for a reason). Breaks when you need to differentiate — everyone sees the same recommendations, engagement plateaus, and long-tail items never get surfaced.
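The popularity baseline fits in a few lines. The sketch below counts views in a trailing 7-day window and returns the top items; the `(item_id, timestamp)` event shape and the function name are assumptions for illustration, and a production version would typically be a scheduled aggregate query rather than an in-memory count.

```python
from collections import Counter
from datetime import datetime, timedelta
from typing import Iterable, List, Tuple


def top_popular(
    events: Iterable[Tuple[str, datetime]],
    now: datetime,
    window_days: int = 7,
    k: int = 20,
) -> List[str]:
    """Return the k most-viewed item ids within the trailing window."""
    cutoff = now - timedelta(days=window_days)
    counts = Counter(item for item, seen_at in events if seen_at >= cutoff)
    return [item for item, _ in counts.most_common(k)]


now = datetime(2024, 1, 15)
events = [
    ("video_a", now - timedelta(days=1)),
    ("video_a", now - timedelta(days=2)),
    ("video_b", now - timedelta(days=3)),
    ("video_c", now - timedelta(days=30)),  # outside the 7-day window
    ("video_c", now - timedelta(days=31)),
    ("video_c", now - timedelta(days=32)),
]
print(top_popular(events, now, k=2))  # ['video_a', 'video_b']
```

Note that the function takes no user argument at all, which is exactly the limitation described above: every caller gets the same list.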

✅

What Works

Simple — one SQL query sorted by popularity
No cold start problem (works for all users equally)
Decent baseline: popular items have broad appeal
No ML infrastructure needed

💥

What Breaks

Zero personalization — everyone sees the same list
Rich-get-richer: popular items stay popular, new items invisible
Engagement plateau: users see nothing new or surprising
Long-tail items never recommended (90% of catalog ignored)
No learning: system never improves from user behavior

Popularity-based recommendations also remain the correct fallback for cold-start users — when no behavior history exists, popular items are unambiguously better than a personalized model with no data to work from.

Naive Design — Global Popularity Ranking

📋 Chapter 3 — Summary

Popularity-based: simple, no personalization, decent baseline.
Everyone sees same list — engagement plateaus quickly.
Long-tail items (90% of catalog) never surfaced.
No feedback loop: system never learns from user behavior.

Chapter Four

Refined Design

Multi-Stage Retrieval + Ranking Pipeline

The refined design uses a funnel approach: from 100M candidate items, narrow to 1000 candidates (retrieval), score those 1000 (ranking), apply business rules (filtering), then return the top 20. Each stage uses a different model optimized for its task: retrieval prioritizes recall (don't miss good items), while ranking prioritizes precision (order the best items correctly). This separation gives each stage an appropriate compute budget: retrieval is cheap per item, and ranking is expensive per item but runs on only a few items.
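The funnel can be sketched as three small steps in one function. Everything here is illustrative: the retrieval sources and the scoring function are passed in as plain callables, and the names (`recommend`, `blocked`, `max_candidates`) are assumptions rather than a real system's API.

```python
from typing import Callable, Iterable, List, Set


def recommend(
    user_id: str,
    sources: List[Callable[[str], Iterable[str]]],
    score: Callable[[str, str], float],
    blocked: Set[str],
    max_candidates: int = 1000,
    k: int = 20,
) -> List[str]:
    # Stage 1, retrieval: merge cheap sources, de-duplicating in arrival order.
    # Truncating after the merge favors earlier sources; real systems often
    # give each source its own quota instead.
    candidates: List[str] = []
    seen: Set[str] = set()
    for source in sources:
        for item in source(user_id):
            if item not in seen:
                seen.add(item)
                candidates.append(item)
    candidates = candidates[:max_candidates]

    # Stage 2, ranking: the expensive scorer only sees the shortlist.
    ranked = sorted(candidates, key=lambda item: score(user_id, item), reverse=True)

    # Stage 3, filtering: business rules remove disallowed items.
    allowed = [item for item in ranked if item not in blocked]
    return allowed[:k]
```

The important property is that `score` is called at most `max_candidates` times per request, regardless of catalog size.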

Refined Design — Multi-Stage Recommendation Pipeline

🔍

Candidate Retrieval

Goal: recall — find 1000 potentially relevant items from 100M
Source 1: Collaborative filtering (users like you liked...)
Source 2: Content-based (similar item features)
Source 3: ANN vector search (user embedding → nearest item embeddings)
Source 4: Popularity (trending, new arrivals)
Multiple sources merged — maximize recall at low compute cost

📊

Ranking Model

Goal: precision — correctly order the 1000 candidates
Deep neural network with rich features (user + item + context)
Features from feature store: user history, item attributes, real-time context
Predict P(click), P(purchase), P(watch_completion)
Score = weighted combination of predicted objectives
Heavy compute but only runs on 1000 items (not 100M)
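The "weighted combination" can be as simple as a dot product between predicted probabilities and objective weights. The weights below are hypothetical placeholders; in practice they would be tuned through experiments and business review.

```python
from typing import Dict

# Hypothetical objective weights, not values from any real system.
WEIGHTS: Dict[str, float] = {"click": 1.0, "watch_completion": 3.0, "purchase": 10.0}


def blended_score(predictions: Dict[str, float], weights: Dict[str, float] = WEIGHTS) -> float:
    """Combine per-objective probabilities into one ranking score."""
    return sum(weight * predictions.get(name, 0.0) for name, weight in weights.items())


# A clickbait-style item: high P(click), low P(watch_completion).
clickbait = {"click": 0.30, "watch_completion": 0.05}
# A solid item: fewer clicks, but viewers tend to finish it.
solid = {"click": 0.10, "watch_completion": 0.60}

print(blended_score(clickbait) < blended_score(solid))  # True
```

With these weights the solid item outranks the clickbait one even though it is clicked a third as often, which shows how the weight choice encodes product priorities.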

Two-Tower Model — How Retrieval Works

The feature store is the critical infrastructure that bridges offline training and online serving. Models are trained offline on historical features. At serving time, the same features must be available in real-time (<10ms). The feature store pre-computes and caches user features (updated every few minutes) and item features (updated on catalog change). Without it, you either train on different features than you serve (training-serving skew) or compute features on every request (latency explosion).
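The core idea, one feature definition shared by the training and serving paths, can be shown with a toy in-memory store. The class and the `click_count_7d` feature are hypothetical; a real feature store adds persistence, versioning, and point-in-time correctness.

```python
from typing import Dict, List


def click_count_7d(event_days_ago: List[int]) -> int:
    """Single shared definition: clicks in the last 7 days (days 0..6)."""
    return sum(1 for days in event_days_ago if 0 <= days < 7)


class FeatureStore:
    """Toy in-memory store; stands in for a low-latency key-value cache."""

    def __init__(self) -> None:
        self._online: Dict[str, Dict[str, float]] = {}

    def materialize(self, user_id: str, event_days_ago: List[int]) -> Dict[str, float]:
        """Compute features once, cache them for serving, and return the
        same values so they can be written to the training log."""
        features = {"click_count_7d": float(click_count_7d(event_days_ago))}
        self._online[user_id] = features
        return dict(features)

    def get_online(self, user_id: str) -> Dict[str, float]:
        """Serving path: a lookup, with a default for unseen users."""
        return self._online.get(user_id, {"click_count_7d": 0.0})


store = FeatureStore()
training_row = store.materialize("u1", [0, 3, 6, 7, 10])
assert training_row == store.get_online("u1")  # train and serve agree by construction
```

Because both paths read what `materialize` produced, there is no second implementation of the feature that could drift.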

Feature Store — Bridging Offline Training and Online Serving

📋 Chapter 4 — Summary

Multi-stage funnel: retrieval (100M→1K) → ranking (score the 1K) → business filtering → top-20.
Retrieval: cheap, recall-focused. Multiple sources (collab filter, ANN, content-based).
Ranking: expensive deep learning model. Precision-focused. Rich features.
Feature store: bridges offline training and online serving. Prevents training-serving skew.

Chapter Five

Alternative Approaches

Recommendation Strategies

Collaborative Filtering

"Users who liked X also liked Y": behavior patterns
Matrix factorization: user × item → latent factor embeddings
Powerful: discovers non-obvious connections between items
Cold start problem: new users/items have no behavior data
Popularity bias: already-popular items get recommended more
Used by: Netflix (early), Amazon (item-item CF)

Content-Based Filtering

"Similar items based on features": metadata matching
Item features: genre, author, price range, description embeddings
No cold start for items: new item with metadata can be recommended immediately
Limited serendipity: recommends same type (thriller → more thrillers)
Need good metadata/features for items
Used by: Spotify (audio features), news (topic matching)

Collaborative Filtering — User-Item Matrix Factorization
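Matrix factorization learns a small vector per user and per item so that their dot product approximates the observed rating. The sketch below uses plain stochastic gradient descent on a toy dataset; the hyperparameters, helper names, and ratings are assumptions chosen for illustration, not a tuned configuration.

```python
import random
from typing import List, Tuple

Rating = Tuple[int, int, float]  # (user index, item index, observed rating)


def predict(P: List[List[float]], Q: List[List[float]], u: int, i: int) -> float:
    """Predicted rating is the dot product of user and item factors."""
    return sum(pu * qi for pu, qi in zip(P[u], Q[i]))


def mse(P, Q, ratings: List[Rating]) -> float:
    return sum((r - predict(P, Q, u, i)) ** 2 for u, i, r in ratings) / len(ratings)


def init_factors(n_rows: int, k: int, rng: random.Random) -> List[List[float]]:
    return [[rng.uniform(-0.1, 0.1) for _ in range(k)] for _ in range(n_rows)]


def train(P, Q, ratings: List[Rating], lr: float = 0.02, reg: float = 0.01, epochs: int = 500) -> None:
    """In-place SGD on squared error with L2 regularization."""
    k = len(P[0])
    for _ in range(epochs):
        for u, i, r in ratings:
            err = r - predict(P, Q, u, i)
            for f in range(k):
                pu, qi = P[u][f], Q[i][f]
                P[u][f] += lr * (err * qi - reg * pu)
                Q[i][f] += lr * (err * pu - reg * qi)


# Toy data: 4 users x 4 items, only some cells observed.
ratings: List[Rating] = [
    (0, 0, 5.0), (0, 2, 1.0), (1, 0, 5.0), (1, 1, 5.0), (1, 3, 1.0),
    (2, 1, 1.0), (2, 2, 5.0), (2, 3, 5.0), (3, 0, 1.0), (3, 3, 5.0),
]
rng = random.Random(0)
P, Q = init_factors(4, 2, rng), init_factors(4, 2, rng)
before = mse(P, Q, ratings)
train(P, Q, ratings)
after = mse(P, Q, ratings)
print(f"training MSE before={before:.2f} after={after:.2f}")
```

Once trained, `predict(P, Q, u, i)` can be called for user-item pairs that were never observed, which is how the model fills in the empty cells of the matrix.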

Deep Learning (Two-Tower)

One tower encodes user, one tower encodes item → dot product = score
Trained end-to-end on engagement data
Embeddings capture complex non-linear relationships
Efficient serving: pre-compute item embeddings, ANN for retrieval
Dominates production systems at scale
Used by: YouTube, TikTok, Instagram Explore

Multi-Armed Bandit (Explore/Exploit)

Treat each recommendation slot as a decision under uncertainty
Exploit: show highest-predicted item. Explore: show uncertain item.
Thompson Sampling, UCB, or ε-greedy policies
Automatically balances showing known-good vs discovering new
Handles cold start: uncertain items get explored automatically
Used by: Netflix (cover art), Spotify (Discover Weekly)
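Thompson Sampling can be sketched with a Beta posterior per item: sample a plausible click rate for each item and show the one with the highest draw. The class below is a minimal illustration that assumes binary click feedback and a uniform Beta(1, 1) prior; the class and method names are made up for this example.

```python
import random
from typing import Dict, Optional, Tuple


class BetaBandit:
    """Thompson Sampling over items with click / no-click feedback."""

    def __init__(self, seed: Optional[int] = None) -> None:
        self._rng = random.Random(seed)
        self._stats: Dict[str, Tuple[int, int]] = {}  # item -> (clicks, skips)

    def add_item(self, item: str) -> None:
        # New items start with no evidence, so their sampled rates vary
        # widely and they get explored without any special-casing.
        self._stats.setdefault(item, (0, 0))

    def choose(self) -> str:
        # Draw one plausible click rate per item from Beta(1 + clicks, 1 + skips).
        draws = {
            item: self._rng.betavariate(1 + clicks, 1 + skips)
            for item, (clicks, skips) in self._stats.items()
        }
        return max(draws, key=draws.get)

    def update(self, item: str, clicked: bool) -> None:
        clicks, skips = self._stats[item]
        self._stats[item] = (clicks + 1, skips) if clicked else (clicks, skips + 1)
```

As evidence accumulates, each item's posterior narrows, so weak items are sampled as the winner less and less often while strong items take most of the impressions.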

Explore vs Exploit — The Recommendation Dilemma

The two-tower model solves the fundamental scaling problem of recommendation retrieval. Scoring every user-item pair is O(users × items) = 500M × 100M = 5 × 10^16 (50 quadrillion) operations, which is infeasible to compute in real time. Even per request, exhaustively scoring all 100M items at 1M requests/sec would mean 10^14 item scores per second. The two-tower model pre-computes item embeddings and indexes them with Approximate Nearest Neighbor (ANN) search. At serving time, only the user embedding is computed (milliseconds), then ANN retrieves the nearest 1000 item vectors in sub-linear time (often approximated as O(log N)). This replaces an exhaustive scan with a single fast index lookup, and is a major reason deep learning now dominates recommendation retrieval at large platforms.
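The serving path can be sketched with precomputed item vectors and a top-k lookup. The code below uses an exact brute-force scan as a stand-in for the ANN index, so it is O(N) rather than sub-linear; the tiny 2-dimensional embeddings and item names are invented for illustration.

```python
import heapq
from typing import Dict, List, Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def top_k_items(
    user_vec: Sequence[float],
    item_vecs: Dict[str, Sequence[float]],
    k: int,
) -> List[str]:
    """Exact nearest-by-dot-product scan. A production system would swap
    this loop for an ANN index built offline over the same item vectors."""
    return heapq.nlargest(k, item_vecs, key=lambda item: dot(user_vec, item_vecs[item]))


# Item tower output, precomputed offline (hypothetical values).
item_vecs = {
    "cooking_101": [0.9, 0.1],
    "car_review": [0.1, 0.9],
    "baking_basics": [0.8, 0.3],
}
# User tower output, computed at request time.
user_vec = [1.0, 0.0]

print(top_k_items(user_vec, item_vecs, k=2))  # ['cooking_101', 'baking_basics']
```

The split matters because the item side never runs at request time: only one forward pass (the user tower) and one index lookup sit on the latency path.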

📋 Chapter 5 — Summary

Collaborative filtering: behavior-based, powerful but cold-start vulnerable.
Content-based: feature-matching, handles cold start, limited serendipity.
Two-tower deep learning: dominant at scale. Pre-compute embeddings, ANN retrieval.
Multi-armed bandit: explore/exploit balance. Handles cold start naturally.

Chapter Six

What Real Companies Did

Production Recommendation Systems

▶️

YouTube

Two-tower model: deep candidate generation + ranking
Candidate gen: produce 100s from billions of videos (ANN)
Ranking: wide & deep network with 100+ features
Watch time prediction (not just click): optimize engagement quality
Published: "Deep Neural Networks for YouTube Recommendations" (2016)

🎬

Netflix

Personalized homepage: each row is a different model output
Artwork personalization: different thumbnails per user
Offline batch (Spark) + online lightweight re-ranking
Thompson Sampling for explore/exploit on new content
$1B/year saved from reduced churn attributed to recommendations

🛒

Amazon

Item-to-item collaborative filtering (original patent)
"Customers who bought X also bought Y" — simple but effective
Real-time: updates recommendations as cart changes
35% of revenue attributed to recommendations
Personalized per surface: homepage vs product page vs cart
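The "customers who bought X also bought Y" idea can be illustrated with raw co-purchase counts. This is a simplified sketch, not Amazon's published algorithm, which normalizes by item popularity using a similarity measure; the function names and baskets here are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import combinations
from typing import DefaultDict, Iterable, List


def build_co_purchases(baskets: Iterable[List[str]]) -> DefaultDict[str, Counter]:
    """Count how often each pair of items appears in the same basket."""
    co: DefaultDict[str, Counter] = defaultdict(Counter)
    for basket in baskets:
        for a, b in combinations(sorted(set(basket)), 2):
            co[a][b] += 1
            co[b][a] += 1
    return co


def also_bought(co: DefaultDict[str, Counter], item: str, k: int = 3) -> List[str]:
    """Items most often bought together with `item`; empty for unknown items."""
    return [other for other, _ in co.get(item, Counter()).most_common(k)]


baskets = [
    ["tent", "stove"],
    ["tent", "stove"],
    ["tent", "lamp"],
    ["stove", "kettle"],
]
co = build_co_purchases(baskets)
print(also_bought(co, "tent"))  # ['stove', 'lamp']
```

Because the table is keyed by item rather than by user, it can be built offline and looked up instantly when the cart changes.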

🎵

Spotify

Discover Weekly: collaborative filtering + audio features
Audio embeddings: analyze raw audio (ML model on waveform)
NLP on playlists: treat playlist names as "documents" for NLP models
Bandits for podcast recommendations (cold start domain)
Blends collaborative (behavior) and content (audio features)

Production Recommendation Systems — Comparison

Company	Retrieval	Ranking	Special Pattern	Key Metric
YouTube	Two-tower DNN + ANN	Wide & Deep, 100+ features	Watch-time prediction (not clicks)	Billions of videos ranked/day
Netflix	CF + content hybrid	Offline batch + online re-rank	Thompson Sampling, per-surface models	$1B/year churn reduction
Amazon	Item-to-item CF	Lightweight scoring + rules	Real-time, cart-aware (intent changes with cart)	35% of revenue attributed to recommendations
Spotify	Audio embeddings + CF	Playlist NLP + audio features	Bandits for cold-start podcasts	Weekly full personalization

📋 Chapter 6 — Summary

YouTube: two-tower deep model, watch-time optimization, ANN retrieval at billion scale.
Netflix: personalized rows + artwork, Thompson Sampling, $1B/year churn reduction.
Amazon: item-to-item CF, real-time cart-aware, 35% revenue from recommendations.
Spotify: audio embeddings + collaborative filtering + bandits for cold start.

Chapter Seven

Best Practices Extracted

Transferable Lessons

🔀

Multi-Stage Funnel

Never rank all items with expensive model — infeasible at scale
Cheap retrieval → expensive ranking → business filtering
Retrieval makes the largest cut (100M→1K, about 100,000x); later stages reduce by 10-100x
Different models optimized for each stage's objective
Transfers to: search ranking, ad selection, feed ranking

📦

Feature Store Pattern

Pre-compute features offline, serve online at low latency
Same feature definitions for training and serving (no skew)
User features: updated every N minutes (near-real-time)
Item features: updated on catalog change (batch)
Transfers to: any ML system with online inference

🧪

Online Experimentation

A/B test every model change on live traffic
Metric: not just clicks — long-term engagement, retention
Guardrail metrics: ensure no regression on safety, diversity
Small % traffic → ramp → full rollout over weeks
Transfers to: any product change validated by user behavior
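A gradual ramp is usually implemented with deterministic hashing, so a user stays in the same group as the rollout percentage grows. The sketch below is one common approach under stated assumptions: the bucket count, the salt format, and the function names are illustrative choices.

```python
import hashlib

BUCKETS = 10_000  # assumed granularity: each bucket is 0.01% of traffic


def bucket(user_id: str, experiment: str) -> int:
    """Stable bucket in [0, BUCKETS). Salting with the experiment name keeps
    assignments independent across experiments."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).hexdigest()
    return int(digest, 16) % BUCKETS


def in_treatment(user_id: str, experiment: str, rollout_percent: float) -> bool:
    """True if the user falls inside the first rollout_percent of buckets."""
    return bucket(user_id, experiment) < rollout_percent * BUCKETS / 100
```

Since the threshold only moves upward during a ramp, any user who saw the new model at 1% still sees it at 10% and at 100%, which keeps the experience consistent and the metrics comparable.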

Multi-Stage Recommendation Funnel — Candidate Reduction

Optimizing for clicks is not the same as optimizing for value. Click-optimized systems promote clickbait — sensational thumbnails and titles that disappoint after clicking. Watch-time or satisfaction-optimized systems promote genuinely good content. YouTube's shift from click prediction to watch-time prediction significantly improved content quality on the platform. The metric you optimize becomes the system's behavior — choose carefully.

📋 Chapter 7 — Summary

Multi-stage funnel: retrieval → ranking → filtering. Each stage: different model, different budget.
Feature store: bridge between offline training and online serving. Prevent skew.
Experimentation: A/B test everything. Optimize long-term metrics, not just clicks.
Two-tower serving: pre-compute all item embeddings. At request time, compute only user embedding → ANN lookup → O(log N) retrieval of top-K candidates.

Chapter Eight

What Could Go Wrong

Common Failure Patterns

🫧

Filter Bubble

System only recommends what user already likes → echo chamber
User's world shrinks — never exposed to new genres/topics
Engagement appears fine short-term, but long-term retention drops
Fix: explicit diversity injection (MMR), exploration budget (5-10% slots for discovery), track long-term satisfaction metrics.
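Maximal Marginal Relevance (MMR) re-ranks a list by trading relevance against similarity to items already selected. The sketch below is a minimal greedy version; the `lam` default, the genre-based similarity, and the item names are assumptions for illustration.

```python
from typing import Callable, Dict, List


def mmr_rerank(
    candidates: List[str],
    relevance: Dict[str, float],
    similarity: Callable[[str, str], float],
    k: int,
    lam: float = 0.7,
) -> List[str]:
    """Greedy MMR: lam=1.0 is pure relevance, lower values favor diversity."""
    selected: List[str] = []
    remaining = list(candidates)
    while remaining and len(selected) < k:

        def mmr_score(item: str) -> float:
            # Penalize items that resemble something already chosen.
            max_sim = max((similarity(item, s) for s in selected), default=0.0)
            return lam * relevance[item] - (1 - lam) * max_sim

        best = max(remaining, key=mmr_score)
        selected.append(best)
        remaining.remove(best)
    return selected


genre = {"thriller_1": "thriller", "thriller_2": "thriller",
         "thriller_3": "thriller", "comedy_1": "comedy"}
relevance = {"thriller_1": 0.9, "thriller_2": 0.85, "thriller_3": 0.8, "comedy_1": 0.6}
same_genre = lambda a, b: 1.0 if genre[a] == genre[b] else 0.0

print(mmr_rerank(list(genre), relevance, same_genre, k=3))
# ['thriller_1', 'comedy_1', 'thriller_2']
```

With pure relevance the top three would all be thrillers; the similarity penalty pulls the comedy into second place.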

📉

Training-Serving Skew

Model trained on features computed differently than serving
Example: "user_click_count" computed differently in batch vs real-time
Model performs great offline, terrible in production
Fix: feature store with single definition for train and serve. Log serving features and compare to training features. Alert on distribution shift.
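Comparing logged serving features against training features can be done with a simple distribution-distance metric. The sketch below uses the Population Stability Index (PSI) over fixed bins; the bin edges, the sample values, and the 0.25 alert threshold (a common rule of thumb) are assumptions, and both inputs are assumed non-empty.

```python
import math
from typing import List, Sequence


def psi(
    train: Sequence[float],
    serve: Sequence[float],
    edges: Sequence[float],
    eps: float = 1e-6,
) -> float:
    """Population Stability Index between two samples of one feature."""

    def fractions(values: Sequence[float]) -> List[float]:
        counts = [0] * (len(edges) + 1)
        for value in values:
            # Bin index = number of edges at or below the value.
            counts[sum(1 for edge in edges if value >= edge)] += 1
        # Floor at eps so empty bins do not produce log(0).
        return [max(count / len(values), eps) for count in counts]

    p, q = fractions(train), fractions(serve)
    return sum((qi - pi) * math.log(qi / pi) for pi, qi in zip(p, q))


train_values = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10] * 10
# Hypothetical bug: the serving path counts all events, not only clicks.
serve_values = [value * 3 for value in train_values]

drift = psi(train_values, serve_values, edges=[3, 6, 9])
ALERT_THRESHOLD = 0.25  # rule-of-thumb cutoff, assumed here
print("alert" if drift > ALERT_THRESHOLD else "ok")  # alert
```

Running this check per feature on a schedule turns silent skew into an alert before it shows up as a drop in online metrics.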

Training-Serving Skew — What Goes Wrong Without a Feature Store

🔄

Feedback Loop Amplification

System recommends item A → users click A → more signal for A
A gets recommended more → even more clicks → dominates recommendations
Items not initially recommended never get clicks → never recommended
Fix: position-debiased training (account for placement effect), propensity scoring, counterfactual evaluation, exploration budget.
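Position debiasing via inverse propensity weighting can be shown on click logs: a click in a rarely-examined slot counts for more than a click in the top slot. The examination probabilities below are hypothetical; in practice they are estimated from randomized experiments or a click model.

```python
from typing import Dict, Iterable, Tuple

# Hypothetical probability that a user even looks at each slot.
POSITION_PROPENSITY: Dict[int, float] = {1: 1.0, 2: 0.6, 3: 0.4, 4: 0.25, 5: 0.15}


def debiased_click_rates(
    logs: Iterable[Tuple[str, int, bool]],
    propensity: Dict[int, float] = POSITION_PROPENSITY,
) -> Dict[str, float]:
    """Inverse-propensity-weighted clicks per impression for each item.
    Each log row is (item_id, position_shown, clicked)."""
    weighted_clicks: Dict[str, float] = {}
    impressions: Dict[str, int] = {}
    for item, position, clicked in logs:
        impressions[item] = impressions.get(item, 0) + 1
        if clicked:
            weighted_clicks[item] = weighted_clicks.get(item, 0.0) + 1.0 / propensity[position]
    return {item: weighted_clicks.get(item, 0.0) / n for item, n in impressions.items()}


# Item A always shown in slot 1: 4 clicks in 10 impressions (raw CTR 0.40).
# Item B always shown in slot 5: 1 click in 10 impressions (raw CTR 0.10).
logs = (
    [("A", 1, True)] * 4 + [("A", 1, False)] * 6
    + [("B", 5, True)] * 1 + [("B", 5, False)] * 9
)
rates = debiased_click_rates(logs)
print(rates["B"] > rates["A"])  # True
```

Raw click-through rate ranks A far above B, but after correcting for placement B looks at least as appealing, which is the signal a feedback loop would otherwise never see.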

Feedback Loop Amplification — Self-Reinforcing Bias

❄️

Cold Start Failure

New user: no history → collaborative filtering returns nothing
New item: no interactions → never enters candidate generation
20%+ of traffic affected at any time (constant influx of new users and items)
Fix: content-based fallback, popularity baseline, onboarding preferences, metadata-based item features, bandit for new items.

Feedback loops are the most dangerous long-term failure because they are self-reinforcing. The system creates the data it trains on: recommend → user clicks → more training signal → recommend more. Over time, this converges to a tiny set of items dominating the entire catalog. The fix requires actively fighting the system's natural convergence: exploration budgets, position debiasing, and measuring whether recommendations are surfacing catalog diversity.

📋 Chapter 8 — Summary

Filter bubble: diversity injection + exploration budget + long-term satisfaction metrics.
Training-serving skew: feature store with unified definitions. Monitor feature distributions.
Feedback loops: position debiasing + propensity scoring + exploration. Fight convergence.
Cold start: content-based fallback + popularity + bandit exploration for new items.
Principle: the metric you optimize becomes the system's behavior. Choose wisely.
