Case Study: Recommendation Engine
Design, trade-offs, and alternatives for a recommendation engine at scale.
Problem Statement
A recommendation engine predicts what items a user will engage with (products to buy, videos to watch, songs to listen to, people to follow) based on their past behavior, similar users, and item characteristics. The core challenge is ranking millions of candidate items for each user in real time, using signals that range from explicit (ratings, purchases) to implicit (dwell time, scroll speed, skip patterns). At scale, you are computing personalized rankings for 500M+ users across 100M+ items, serving recommendations in under 100ms per request.
Traffic & Scale
- 500M daily active users
- 100M+ items in catalog
- 1M recommendation requests/sec at peak
- 10B+ user-item interactions/day (impressions, clicks, purchases)
Requirements
- Latency: <100ms per recommendation request
- Freshness: new interactions reflected in <30 minutes
- Relevance: personalized, not just popular
- Diversity: avoid filter bubbles (explore vs exploit)
A recommendation engine is a real-time ranking system that must balance exploitation (show what we know the user likes) with exploration (show new items to learn preferences). Pure exploitation creates filter bubbles: users see the same type of content forever. Pure exploration feels random. The best systems use multi-armed bandit approaches or explicit diversity injection to balance both. This is not just an engineering problem; it is a product and ethical decision that affects what 500M people see every day.
- 500M users, 100M+ items, 1M rec requests/sec. Personalized in <100ms.
- Signals: clicks, purchases, dwell time, ratings, skip patterns.
- Balance exploitation (relevance) with exploration (diversity).
- New interactions must update recommendations within 30 minutes.
Questions to Ask
Recommendation Context
- Homepage (cold start) or in-session (already browsing)?
- "Similar items" or "personalized for you"?
- Single domain (movies) or cross-domain (movies + books)?
- Real-time context (time, location, device)?
Signal Types
- Explicit feedback (ratings 1-5)?
- Implicit feedback (views, clicks, dwell time)?
- Negative signals (skip, hide, dislike)?
- Social signals (friends' activity)?
Business Constraints
- Revenue optimization or engagement?
- Fairness constraints (expose all sellers)?
- Content moderation (don't recommend harmful content)?
- Explainability ("Because you watched...")?
The cold start problem determines your architecture's complexity. A new user with no history can't get collaborative filtering recommendations. A new item with no interactions can't be recommended by any behavior-based model. You need content-based fallbacks (recommend based on item features), popularity-based defaults, and rapid learning from the first few interactions. Cold start is not an edge case; it is 20%+ of your traffic (new users, new items daily).
For This Case Study, Our Answers Are:
- Context: homepage recommendations (cold open, no session context)
- Item domain: single domain, video content (YouTube-style)
- Primary signals: implicit (watch time, completion rate, skip) + explicit (likes, saves)
- Negative signals: yes, skip and explicit "not interested" are used
- Social signals: no, individual behavior only (no friend graph)
- Cold start: yes, 20%+ of users are new or returning after 30+ days
- Business constraints: no harmful content, max 2 sponsored slots in top-20
- Personalization freshness: user model updated every 30 minutes
- Explainability: "Because you watched X" shown for top recommendations
- Catalog size: 100M+ items, ~10K new items added daily
- Context matters: homepage vs in-session recommendations have different models.
- Implicit signals (clicks, dwell time) far more abundant than explicit (ratings).
- Cold start (new users, new items) is 20%+ of traffic and needs a fallback strategy.
- Business constraints (fairness, revenue, safety) override pure relevance.
Naive Design
The simplest recommendation: show everyone the most popular items. Sort by total purchases/views in the last 7 days, return top-20. Add basic rules: "if user bought X, show category Y." No personalization, no machine learning, no user modeling. Works surprisingly well for a new service (popular items are popular for a reason). Breaks when you need to differentiate: everyone sees the same recommendations, engagement plateaus, and long-tail items never get surfaced.
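The popularity baseline described above can be sketched in a few lines. This is a minimal illustration, assuming interaction events arrive as `(item_id, timestamp)` pairs; that shape and the function name `top_popular` are assumptions made for the example, not part of the original design.

```python
from collections import Counter
from datetime import datetime, timedelta

def top_popular(events, now, window_days=7, k=20):
    """Return the k most-interacted item ids within the trailing window.

    `events` is an iterable of (item_id, timestamp) pairs (assumed shape).
    """
    cutoff = now - timedelta(days=window_days)
    counts = Counter(item for item, ts in events if ts >= cutoff)
    return [item for item, _ in counts.most_common(k)]

now = datetime(2024, 1, 8)
events = [
    ("a", datetime(2024, 1, 7)),
    ("a", datetime(2024, 1, 6)),
    ("b", datetime(2024, 1, 5)),
    ("c", datetime(2023, 12, 1)),  # older than 7 days, so it is ignored
]
print(top_popular(events, now, k=2))
```

Every user receives the same list, which is exactly the limitation discussed in the rest of this section.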
What Works
- Simple: one SQL query sorted by popularity
- No cold start problem (works for all users equally)
- Decent baseline: popular items have broad appeal
- No ML infrastructure needed
What Breaks
- Zero personalization: everyone sees the same list
- Rich-get-richer: popular items stay popular, new items invisible
- Engagement plateau: users see nothing new or surprising
- Long-tail items never recommended (90% of catalog ignored)
- No learning: system never improves from user behavior
Popularity-based recommendations also remain the correct fallback for cold-start users โ when no behavior history exists, popular items are unambiguously better than a personalized model with no data to work from.
- Popularity-based: simple, no personalization, decent baseline.
- Everyone sees the same list, so engagement plateaus quickly.
- Long-tail items (90% of catalog) never surfaced.
- No feedback loop: system never learns from user behavior.
Refined Design
The refined design uses a funnel approach: from 100M candidate items, narrow to 1000 candidates (retrieval), score those 1000 (ranking), apply business rules (filtering), then return top-20. Each stage uses a different model optimized for its task: retrieval prioritizes recall (don't miss good items), ranking prioritizes precision (order the best items correctly). This separation allows each stage to use an appropriate compute budget: retrieval is cheap per item, ranking is expensive but runs on few items.
Candidate Retrieval
- Goal: recall (find 1000 potentially relevant items from 100M)
- Source 1: Collaborative filtering (users like you liked...)
- Source 2: Content-based (similar item features)
- Source 3: ANN vector search (user embedding → nearest item embeddings)
- Source 4: Popularity (trending, new arrivals)
- Multiple sources merged → maximize recall at low compute cost
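Merging the retrieval sources can be sketched as follows. The source does not say how the lists are combined; round-robin interleaving with de-duplication is one simple policy, chosen here purely for illustration, and the source names in the example are hypothetical.

```python
_DONE = object()  # sentinel marking an exhausted source

def merge_candidates(sources, limit=1000):
    """Round-robin merge of ranked candidate lists, dropping duplicates.

    `sources` maps a source name to a ranked list of item ids.
    """
    merged, seen = [], set()
    iterators = [iter(items) for items in sources.values()]
    while iterators and len(merged) < limit:
        still_active = []
        for it in iterators:
            item = next(it, _DONE)
            if item is _DONE:
                continue  # this source has no more candidates
            still_active.append(it)
            if item not in seen and len(merged) < limit:
                seen.add(item)
                merged.append(item)
        iterators = still_active
    return merged

sources = {
    "collaborative": ["v1", "v2", "v3"],
    "ann": ["v2", "v4"],
    "popular": ["v5"],
}
print(merge_candidates(sources))
```

Interleaving gives every source a share of the candidate budget, so a single source with a long list cannot crowd out the others before ranking runs.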
Ranking Model
- Goal: precision (correctly order the 1000 candidates)
- Deep neural network with rich features (user + item + context)
- Features from feature store: user history, item attributes, real-time context
- Predict P(click), P(purchase), P(watch_completion)
- Score = weighted combination of predicted objectives
- Heavy compute but only runs on 1000 items (not 100M)
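The "weighted combination of predicted objectives" can be sketched as a plain weighted sum. The weights and the prediction values below are hypothetical, and the prediction dictionaries stand in for the output of the ranking model.

```python
def blended_score(preds, weights):
    """Weighted sum of per-objective predictions for one candidate."""
    return sum(weights[name] * preds[name] for name in weights)

def rank(candidates, weights, k=20):
    """Order candidates by blended score and return the top-k item ids."""
    ordered = sorted(
        candidates,
        key=lambda c: blended_score(c["preds"], weights),
        reverse=True,
    )
    return [c["id"] for c in ordered[:k]]

# Hypothetical weights: favor watch completion over raw clicks.
weights = {"click": 0.2, "purchase": 0.3, "watch_completion": 0.5}
candidates = [
    {"id": "v1", "preds": {"click": 0.9, "purchase": 0.0, "watch_completion": 0.1}},
    {"id": "v2", "preds": {"click": 0.4, "purchase": 0.1, "watch_completion": 0.8}},
    {"id": "v3", "preds": {"click": 0.5, "purchase": 0.5, "watch_completion": 0.5}},
]
print(rank(candidates, weights))
```

In this example the item with the highest click probability (`v1`) ranks last, because the weights reward completion more than clicks.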
The feature store is the critical infrastructure that bridges offline training and online serving. Models are trained offline on historical features. At serving time, the same features must be available in real-time (<10ms). The feature store pre-computes and caches user features (updated every few minutes) and item features (updated on catalog change). Without it, you either train on different features than you serve (training-serving skew) or compute features on every request (latency explosion).
- Multi-stage funnel: retrieval (100M→1K) → ranking (1K→100) → re-rank (100→20).
- Retrieval: cheap, recall-focused. Multiple sources (collab filter, ANN, content-based).
- Ranking: expensive deep learning model. Precision-focused. Rich features.
- Feature store: bridges offline training and online serving. Prevents training-serving skew.
Alternative Approaches
Collaborative Filtering
- "Users who liked X also liked Y": behavior patterns
- Matrix factorization: user × item matrix → latent factor embeddings
- Powerful: discovers non-obvious connections between items
- Cold start problem: new users/items have no behavior data
- Popularity bias: already-popular items get recommended more
- Used by: Netflix (early), Amazon (item-item CF)
Content-Based Filtering
- "Similar items based on features": metadata matching
- Item features: genre, author, price range, description embeddings
- No cold start for items: new item with metadata can be recommended immediately
- Limited serendipity: recommends same type (thriller → more thrillers)
- Need good metadata/features for items
- Used by: Spotify (audio features), news (topic matching)
Two-Tower Deep Learning
- One tower encodes the user, one tower encodes the item → dot product = score
- Trained end-to-end on engagement data
- Embeddings capture complex non-linear relationships
- Efficient serving: pre-compute item embeddings, ANN for retrieval
- Dominates production systems at scale
- Used by: YouTube, TikTok, Instagram Explore
Multi-Armed Bandit
- Treat each recommendation slot as a decision under uncertainty
- Exploit: show highest-predicted item. Explore: show uncertain item.
- Thompson Sampling, UCB, or ε-greedy policies
- Automatically balances showing known-good vs discovering new
- Handles cold start: uncertain items get explored automatically
- Used by: Netflix (cover art), Spotify (Discover Weekly)
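Thompson Sampling, one of the bandit policies listed above, can be sketched with a Beta-Bernoulli model. The class name `BetaBandit`, the click rates in the simulation, and the uniform Beta(1, 1) prior are assumptions for illustration only.

```python
import random

class BetaBandit:
    """Beta-Bernoulli Thompson Sampling over a fixed set of items."""

    def __init__(self, item_ids, seed=None):
        self.rng = random.Random(seed)
        # Beta(1, 1) prior: one pseudo-success and one pseudo-failure each.
        self.successes = {i: 1 for i in item_ids}
        self.failures = {i: 1 for i in item_ids}

    def choose(self):
        # Sample a plausible click rate per item and show the highest sample.
        samples = {
            i: self.rng.betavariate(self.successes[i], self.failures[i])
            for i in self.successes
        }
        return max(samples, key=samples.get)

    def update(self, item_id, clicked):
        if clicked:
            self.successes[item_id] += 1
        else:
            self.failures[item_id] += 1

def simulate(true_rates, rounds=2000, seed=0):
    """Run the bandit against hypothetical true click rates; count picks."""
    bandit = BetaBandit(true_rates, seed=seed)
    env = random.Random(seed + 1)
    picks = {i: 0 for i in true_rates}
    for _ in range(rounds):
        item = bandit.choose()
        picks[item] += 1
        bandit.update(item, env.random() < true_rates[item])
    return picks

picks = simulate({"a": 0.05, "b": 0.5})
print(picks)
```

Items with little data have wide posteriors, so they are occasionally sampled high and shown. That is how uncertain items get explored without a separate exploration rule.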
The two-tower model solves the fundamental scaling problem of recommendation retrieval. Scoring every user-item pair is O(users × items) = 500M × 100M = 5 × 10^16 (50 quadrillion) operations, which is infeasible in real time. The two-tower model pre-computes item embeddings and indexes them with Approximate Nearest Neighbor (ANN) search. At serving time, only the user embedding is computed (milliseconds), then ANN retrieves the nearest 1000 item vectors in roughly O(log N) time. This replaces an exhaustive scan of the catalog with a single fast lookup per request, and is why deep learning now dominates recommendation retrieval at large platforms.
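The serving path can be sketched as a dot-product lookup against pre-computed item vectors. This sketch uses an exact linear scan as a stand-in for the ANN index, and the two-dimensional embeddings are made-up values; a real system would use learned embeddings and an ANN library.

```python
import heapq

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def retrieve_top_k(user_vec, item_vecs, k):
    """Exact top-k items by dot product with the user embedding.

    The linear scan over `item_vecs` is only a placeholder for an ANN
    index; it returns the same kind of result, just without the speedup.
    """
    return heapq.nlargest(
        k, item_vecs, key=lambda item_id: dot(user_vec, item_vecs[item_id])
    )

# Item embeddings are computed offline, once per item (values are made up).
item_vecs = {
    "cooking": [0.9, 0.1],
    "gaming": [0.1, 0.9],
    "travel": [0.6, 0.5],
}
# Only the user embedding is computed at request time.
user_vec = [1.0, 0.2]
print(retrieve_top_k(user_vec, item_vecs, k=2))
```

Because the item side never depends on the user, the item vectors can be indexed ahead of time, and the per-request work is one user-tower forward pass plus one lookup.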
- Collaborative filtering: behavior-based, powerful but cold-start vulnerable.
- Content-based: feature-matching, handles cold start, limited serendipity.
- Two-tower deep learning: dominant at scale. Pre-compute embeddings, ANN retrieval.
- Multi-armed bandit: explore/exploit balance. Handles cold start naturally.
What Real Companies Did
YouTube
- Two-tower model: deep candidate generation + ranking
- Candidate gen: produce 100s from billions of videos (ANN)
- Ranking: wide & deep network with 100+ features
- Watch time prediction (not just click): optimize engagement quality
- Published: "Deep Neural Networks for YouTube Recommendations" (2016)
Netflix
- Personalized homepage: each row is a different model output
- Artwork personalization: different thumbnails per user
- Offline batch (Spark) + online lightweight re-ranking
- Thompson Sampling for explore/exploit on new content
- $1B/year saved from reduced churn attributed to recommendations
Amazon
- Item-to-item collaborative filtering (original patent)
- "Customers who bought X also bought Y": simple but effective
- Real-time: updates recommendations as cart changes
- 35% of revenue attributed to recommendations
- Personalized per surface: homepage vs product page vs cart
Spotify
- Discover Weekly: collaborative filtering + audio features
- Audio embeddings: analyze raw audio (ML model on waveform)
- NLP on playlists: treat playlist names as "documents" for NLP models
- Bandits for podcast recommendations (cold start domain)
- Blends collaborative (behavior) and content (audio features)
| Company | Retrieval | Ranking | Special Pattern | Key Metric |
|---|---|---|---|---|
| YouTube | Two-tower DNN + ANN | Wide & Deep, 100+ features | Watch-time prediction (not clicks) | Billions of videos ranked/day |
| Netflix | CF + content hybrid | Offline batch + online re-rank | Thompson Sampling, per-surface models | $1B/year churn reduction |
| Amazon | Item-to-item CF | Lightweight scoring + rules | Real-time cart-aware, 35% revenue | Intent-aware (changes with cart) |
| Spotify | Audio embeddings + CF | Playlist NLP + audio features | Bandits for cold-start podcasts | Weekly full personalization |
- YouTube: two-tower deep model, watch-time optimization, ANN retrieval at billion scale.
- Netflix: personalized rows + artwork, Thompson Sampling, $1B/year churn reduction.
- Amazon: item-to-item CF, real-time cart-aware, 35% revenue from recommendations.
- Spotify: audio embeddings + collaborative filtering + bandits for cold start.
Best Practices Extracted
Multi-Stage Funnel
- Never rank all items with an expensive model; it is infeasible at scale
- Cheap retrieval → expensive ranking → business filtering
- Each stage 10-100x reduction in candidates
- Different models optimized for each stage's objective
- Transfers to: search ranking, ad selection, feed ranking
Feature Store Pattern
- Pre-compute features offline, serve online at low latency
- Same feature definitions for training and serving (no skew)
- User features: updated every N minutes (near-real-time)
- Item features: updated on catalog change (batch)
- Transfers to: any ML system with online inference
Online Experimentation
- A/B test every model change on live traffic
- Metric: not just clicks, but long-term engagement and retention
- Guardrail metrics: ensure no regression on safety, diversity
- Small % traffic → ramp → full rollout over weeks
- Transfers to: any product change validated by user behavior
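The ramp from a small percentage of traffic to full rollout is commonly implemented with deterministic hash bucketing. The sketch below is one such scheme; the function name, the experiment name, and the 0.01% bucket granularity are assumptions, not details from the source.

```python
import hashlib

def in_treatment(user_id, experiment, rollout_pct):
    """Deterministically assign a user to the treatment arm.

    Hashing (experiment, user_id) keeps the assignment stable across
    requests and independent between experiments. Raising rollout_pct
    only adds users; nobody already in treatment is removed.
    """
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) % 10_000  # 0..9999, i.e. 0.01% steps
    return bucket < rollout_pct * 100

users = [f"user-{i}" for i in range(10_000)]
share = sum(in_treatment(u, "ranker-v2", 10) for u in users) / len(users)
print(f"share of users in treatment at 10%: {share:.3f}")
```

Stable assignment matters for the long-term metrics mentioned above: a user who flips between arms on every request cannot be attributed to either model.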
Optimizing for clicks is not the same as optimizing for value. Click-optimized systems promote clickbait: sensational thumbnails and titles that disappoint after clicking. Watch-time or satisfaction-optimized systems promote genuinely good content. YouTube's shift from click prediction to watch-time prediction significantly improved content quality on the platform. The metric you optimize becomes the system's behavior, so choose carefully.
- Multi-stage funnel: retrieval → ranking → filtering. Each stage: different model, different budget.
- Feature store: bridge between offline training and online serving. Prevent skew.
- Experimentation: A/B test everything. Optimize long-term metrics, not just clicks.
- Two-tower serving: pre-compute all item embeddings. At request time, compute only the user embedding → ANN lookup → O(log N) retrieval of top-K candidates.
What Could Go Wrong
Filter Bubble
- System only recommends what the user already likes → echo chamber
- User's world shrinks: never exposed to new genres/topics
- Engagement appears fine short-term, but long-term retention drops
- Fix: explicit diversity injection (MMR), exploration budget (5-10% slots for discovery), track long-term satisfaction metrics.
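The diversity injection named in the fix (MMR, Maximal Marginal Relevance) can be sketched as a greedy re-ranker. The relevance scores, the genre labels, the same-genre similarity function, and the trade-off value `lam=0.7` are all hypothetical.

```python
def mmr_rerank(relevance, similarity, k, lam=0.7):
    """Greedy Maximal Marginal Relevance.

    relevance:  dict mapping item id -> relevance score
    similarity: function (a, b) -> similarity in [0, 1]
    lam:        1.0 means pure relevance, lower values favor diversity
    """
    selected = []
    remaining = set(relevance)
    while remaining and len(selected) < k:
        def mmr(item):
            # Penalize an item by its closest match among chosen items.
            redundancy = max(
                (similarity(item, s) for s in selected), default=0.0
            )
            return lam * relevance[item] - (1 - lam) * redundancy
        best = max(sorted(remaining), key=mmr)  # sorted() fixes tie order
        selected.append(best)
        remaining.remove(best)
    return selected

genre = {"t1": "thriller", "t2": "thriller", "d1": "documentary"}
relevance = {"t1": 0.95, "t2": 0.90, "d1": 0.80}

def same_genre(a, b):
    return 1.0 if genre[a] == genre[b] else 0.0

print(mmr_rerank(relevance, same_genre, k=2))
```

Pure relevance ordering would return the two thrillers. With the redundancy penalty, the second slot goes to the documentary even though its relevance score is lower.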
Training-Serving Skew
- Model trained on features computed differently than serving
- Example: "user_click_count" computed differently in batch vs real-time
- Model performs great offline, terrible in production
- Fix: feature store with single definition for train and serve. Log serving features and compare to training features. Alert on distribution shift.
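Comparing logged serving features against training features needs a drift statistic. The source does not name one; the Population Stability Index (PSI) used below is a common choice and is an assumption here, as are the bin count and the sample data.

```python
import math

def psi(train_values, serve_values, bins=10):
    """Population Stability Index between two samples of one feature.

    Bin edges come from the training sample; serving values outside that
    range are clamped into the first or last bin.
    """
    lo, hi = min(train_values), max(train_values)
    width = (hi - lo) / bins or 1.0  # guard against a constant feature

    def histogram(values):
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), bins - 1)] += 1
        # A small floor avoids log(0) for empty bins.
        return [max(c / len(values), 1e-6) for c in counts]

    p, q = histogram(train_values), histogram(serve_values)
    return sum((qi - pi) * math.log(qi / pi) for pi, qi in zip(p, q))

train = list(range(100))
serve_shifted = [v + 40 for v in train]  # e.g. a differently computed count
print(round(psi(train, train), 4), round(psi(train, serve_shifted), 4))
```

A frequently quoted rule of thumb treats a PSI above roughly 0.25 as a significant shift, but the alert threshold should be tuned per feature.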
Feedback Loop Amplification
- System recommends item A → users click A → more signal for A
- A gets recommended more → even more clicks → dominates recommendations
- Items not initially recommended never get clicks → never recommended
- Fix: position-debiased training (account for placement effect), propensity scoring, counterfactual evaluation, exploration budget.
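Propensity scoring for position bias can be sketched with inverse propensity weighting. The sketch assumes a position-based click model (a click requires the slot to be examined) and uses made-up examination probabilities. In practice these propensities have to be estimated, for example from randomized placements.

```python
def ipw_ctr(logs, position_propensity):
    """Position-debiased click rate per item via inverse propensity weights.

    logs:                iterable of (item_id, position, clicked)
    position_propensity: position -> probability the slot is examined
    """
    weighted_clicks, impressions = {}, {}
    for item, pos, clicked in logs:
        impressions[item] = impressions.get(item, 0) + 1
        if clicked:
            # A click in a rarely examined slot counts for more.
            weight = 1.0 / position_propensity[pos]
            weighted_clicks[item] = weighted_clicks.get(item, 0.0) + weight
    return {i: weighted_clicks.get(i, 0.0) / n for i, n in impressions.items()}

propensity = {1: 1.0, 5: 0.2}  # hypothetical examination probabilities
logs = (
    [("A", 1, True)] * 5 + [("A", 1, False)] * 5    # slot 1: raw CTR 0.5
    + [("B", 5, True)] * 1 + [("B", 5, False)] * 9  # slot 5: raw CTR 0.1
)
print(ipw_ctr(logs, propensity))
```

Raw click-through makes item B look five times worse than item A. After weighting by how often slot 5 is actually examined, both items have the same estimated appeal.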
Cold Start Failure
- New user: no history → collaborative filtering returns nothing
- New item: no interactions → never enters candidate generation
- 15-20% of users/items affected at any time (constant influx)
- Fix: content-based fallback, popularity baseline, onboarding preferences, metadata-based item features, bandit for new items.
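The fallback chain in the fix can be sketched as a small routing function. The strategy names and the `min_history` threshold are hypothetical; the 30-day staleness cutoff mirrors the "returning after 30+ days" definition used earlier in this case study.

```python
def choose_strategy(n_interactions, days_since_last_visit,
                    min_history=5, stale_after_days=30):
    """Pick a recommendation strategy from how much we know about a user."""
    if n_interactions == 0:
        return "popularity"      # nothing to personalize on yet
    if n_interactions < min_history or days_since_last_visit >= stale_after_days:
        return "content_based"   # thin or stale history: lean on item features
    return "personalized"        # enough fresh behavior for the full model

print(choose_strategy(0, 0))
print(choose_strategy(3, 1))
print(choose_strategy(200, 45))
print(choose_strategy(200, 1))
```

A production system would more likely blend these sources than switch between them, but the routing view makes the fallback order explicit.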
Feedback loops are the most dangerous long-term failure because they are self-reinforcing. The system creates the data it trains on: recommend → user clicks → more training signal → recommend more. Over time, this converges to a tiny set of items dominating the entire catalog. The fix requires actively fighting the system's natural convergence: exploration budgets, position debiasing, and measuring whether recommendations are surfacing catalog diversity.
- Filter bubble: diversity injection + exploration budget + long-term satisfaction metrics.
- Training-serving skew: feature store with unified definitions. Monitor feature distributions.
- Feedback loops: position debiasing + propensity scoring + exploration. Fight convergence.
- Cold start: content-based fallback + popularity + bandit exploration for new items.
- Principle: the metric you optimize becomes the system's behavior. Choose wisely.