System Design · Case Studies

Case Study: Social Feed

Design, trade-offs, and alternatives for a social news feed at scale.

Chapter One

Problem Statement

What We Are Building

A social news feed aggregates posts from everyone a user follows and presents them in a personalized, ranked order. When you open Instagram, Twitter, or LinkedIn, the feed is already there — hundreds of candidate posts filtered, ranked, and ready within milliseconds. The challenge is not displaying posts — it is computing a personalized feed for each of 500M users from billions of candidate posts, in under 200ms, while handling the "celebrity problem" where one account has 100M followers and every post must reach all of them.

Scale Requirements

Traffic & Scale

500M daily active users
Each user follows ~500 accounts on average
300M new posts/day (~3,500 posts/sec)
Feed request: 10B/day (~115K reads/sec)
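
The per-second figures above follow from simple division. A quick sketch of the arithmetic, using daily averages only (peak traffic would be higher; the variable names are illustrative):

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

daily_active_users = 500_000_000
posts_per_day = 300_000_000
feed_reads_per_day = 10_000_000_000

posts_per_sec = posts_per_day / SECONDS_PER_DAY
reads_per_sec = feed_reads_per_day / SECONDS_PER_DAY
reads_per_user_per_day = feed_reads_per_day / daily_active_users

print(f"posts/sec (average): {posts_per_sec:,.0f}")
print(f"feed reads/sec (average): {reads_per_sec:,.0f}")
print(f"feed reads per user per day: {reads_per_user_per_day:.0f}")
```

The read-to-write ratio is roughly 33:1, which is why most of the design effort goes into the read path.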

Requirements

Feed latency: <200ms (pre-computed or fast assembly)
Freshness: new post visible in feed within 30 seconds
Ranking: ML-based relevance, not just reverse chronological
Celebrity handling: accounts with 100M+ followers

The fundamental question is: when do you compute the feed? You can compute it when the author posts (fan-out on write) or when the reader opens the app (fan-out on read). This single decision determines your entire architecture: storage model, latency profile, cost structure, and how you handle celebrities. Neither pure approach is right on its own; at this scale, production systems typically end up with a hybrid.

📋 Chapter 1 — Summary

500M DAU, 300M posts/day, 10B feed reads/day.
Core decision: fan-out on write (pre-compute) vs fan-out on read (compute at request time).
Celebrity problem: 1 post from a 100M-follower account = 100M feed updates.
Feed must be ranked (not just chronological), fresh (30s), and fast (<200ms).

Chapter Two

Questions to Ask

Clarifying Before Designing

A Twitter-style reverse-chronological timeline is architecturally simpler than a Facebook-style ranked feed with ML scoring. The questions below determine whether you need a pre-computed cache or a real-time assembly system — and how much infrastructure the ranking layer requires.

📰

Feed Model

Reverse chronological or ML-ranked?
Single feed or multiple (home, explore, following)?
Content types: text, images, videos, stories?
Ads mixed into feed? (sponsored posts)

👥

Social Graph

Follow model (asymmetric) or friend model (symmetric)?
Average followers per user? Max followers?
Celebrity accounts? (1M+ followers)
Groups/communities in addition to follows?

⚡

Freshness & Features

How fast must a post appear in followers' feeds?
Engagement counters real-time? (likes, comments)
Seen/read status tracking?
Infinite scroll pagination or fixed pages?

The celebrity problem is the defining constraint. If your max follower count is 5K (like early Facebook), fan-out on write works well. If one account has 100M followers (like a Twitter celebrity), fan-out on write means one post triggers 100M cache writes, which can take minutes to complete and stores 100M copies of the post ID. This is why large production systems typically use a hybrid: fan-out on write for normal users, fan-out on read for celebrities.
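
To put rough numbers on that, here is a back-of-envelope sketch. The sustained rate of 1M cache writes/sec and the 8-byte post ID are assumptions chosen for illustration, not figures from any real platform.

```python
def fan_out_cost(followers, writes_per_sec, bytes_per_entry=8):
    """Return (seconds to finish the fan-out, bytes written) for one post.

    writes_per_sec and bytes_per_entry are hypothetical capacity figures.
    """
    seconds = followers / writes_per_sec
    bytes_written = followers * bytes_per_entry
    return seconds, bytes_written


normal = fan_out_cost(followers=5_000, writes_per_sec=1_000_000)
celebrity = fan_out_cost(followers=100_000_000, writes_per_sec=1_000_000)

print(f"5K followers:   {normal[0]:.3f}s, {normal[1] / 1e3:.0f} KB")
print(f"100M followers: {celebrity[0]:.0f}s, {celebrity[1] / 1e6:.0f} MB")
```

Under these assumptions a single celebrity post occupies the whole write budget for over a minute and a half, which is far outside a 30-second freshness target.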

For This Case Study, Our Answers Are:

Feed model: ML-ranked (not chronological)
Social graph: follow model (asymmetric) — like Twitter, not Facebook friends
Max followers: up to 100M (celebrity accounts exist)
Content types: text + images + short video
Freshness SLA: new post visible in followers' feeds within 30 seconds
Engagement counters: real-time likes/comments count (not pre-computed)
Pagination: infinite scroll — cursor-based
Celebrity threshold: >500K followers → fan-out on read (no pre-computation)
Active user definition: seen in last 7 days (for selective fan-out)
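
Cursor-based pagination, as chosen above, can be sketched as follows. This is a minimal illustration, assuming the feed is an in-memory list of post IDs (newest first) and the cursor is an opaque token wrapping the last post ID the client saw; `encode_cursor`, `decode_cursor`, and `get_page` are hypothetical names.

```python
import base64
import json


def encode_cursor(last_post_id):
    """Wrap the last-seen post ID in an opaque, URL-safe token."""
    raw = json.dumps({"after": last_post_id}).encode()
    return base64.urlsafe_b64encode(raw).decode()


def decode_cursor(cursor):
    return json.loads(base64.urlsafe_b64decode(cursor.encode()))["after"]


def get_page(feed_ids, cursor=None, limit=20):
    """Return (page, next_cursor); next_cursor is None on the last page."""
    start = 0
    if cursor is not None:
        after = decode_cursor(cursor)
        # Resume right after the last item the client saw.
        # If that item is gone, fall back to the top of the feed.
        if after in feed_ids:
            start = feed_ids.index(after) + 1
    page = feed_ids[start:start + limit]
    has_more = start + limit < len(feed_ids)
    next_cursor = encode_cursor(page[-1]) if has_more else None
    return page, next_cursor
```

Unlike an offset, the cursor still points at the right place when new posts are prepended between requests, so an infinite scroll does not show duplicates.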

📋 Chapter 2 — Summary

Chronological vs ranked: ranked requires ML scoring layer on read path.
Follow model (asymmetric) creates the celebrity problem. Friend model (symmetric, with capped friend counts) largely avoids it.
Max follower count determines viability of fan-out on write.
Freshness SLA (30s vs 5min) affects whether pre-computation can use batch processing.

Chapter Three

Naive Design

Fan-Out on Read (Pull Model)

The simplest design: when a user opens the app, query the database for all accounts they follow, fetch recent posts from each, merge and sort by timestamp, and return the top N. This is a pure "pull" model: nothing is pre-computed. It works well for 1,000 users. For 500M users each following 500 accounts, it means 500 queries per feed request × 115K requests/sec = 57.5M database queries per second just for feed generation. The database melts. Beyond the query volume, the posts table requires a composite index on (author_id, created_at) across hundreds of billions of rows, and that index becomes increasingly expensive to maintain as write volume grows.
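
The pull model can be sketched in a few lines. This is an illustration only: `posts_by_author` is a hypothetical in-memory stand-in for the posts table, with each author's posts stored as `(created_at, post_id)` tuples sorted newest first.

```python
import heapq
from itertools import islice


def pull_feed(following, posts_by_author, limit=10):
    """Build a chronological feed at request time (fan-out on read)."""
    # One lookup per followed account: this is the "500 queries per request".
    timelines = [posts_by_author.get(author, []) for author in following]
    # k-way merge of timelines that are already sorted newest first.
    merged = heapq.merge(*timelines, key=lambda post: post[0], reverse=True)
    return [post_id for _, post_id in islice(merged, limit)]


posts_by_author = {
    "alice": [(30, "a2"), (10, "a1")],
    "bob": [(20, "b1")],
    "carol": [],
}
print(pull_feed(["alice", "bob", "carol"], posts_by_author, limit=2))
```

The merge itself is cheap. The cost is in the per-author lookups, which grow linearly with the number of accounts followed.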

Naive Design — Fan-Out on Read

✅

What Works

Simple — no pre-computation infrastructure needed
Always fresh — queries live data every time
No storage overhead (no pre-computed feeds)
Celebrity posts appear instantly (no fan-out delay)

💥

What Breaks

500 queries per feed request — high latency (2-5s)
57M+ DB queries/sec — database cannot handle this
Ranking requires fetching all posts then scoring — slow
Caching helps but invalidation is complex (new posts constantly)
User experience: slow feed load → users leave

📋 Chapter 3 — Summary

Fan-out on read: compute feed at request time. Simple but slow and expensive.
500 queries per feed × 115K req/sec = unsustainable DB load.
Latency 2-5 seconds: unacceptable for mobile feed refresh.
Advantage: always fresh, no storage cost, celebrities handled naturally.

Chapter Four

Refined Design

Hybrid Fan-Out with Pre-Computed Feeds

The refined design uses fan-out on write for normal users (pre-compute feeds when a post is created) and fan-out on read for celebrities (merge their posts at read time). When Alice posts, a fan-out service pushes the post ID into the pre-computed feed cache of each of Alice's followers. When a user opens the app, their feed is already waiting in cache — just read and return. Celebrity posts are merged on the fly from a small "celebrity posts" list. Result: feed served in <50ms from cache, with celebrity freshness maintained.

Refined Design — Hybrid Fan-Out Architecture

✍️

Write Path (Fan-Out)

Author creates post → stored in Post DB
Fan-out service gets author's follower list
If author has ≤500K followers: push post_id to each follower's cache
If author has >500K followers: write to "celebrity posts" store only
Average fan-out: 500 followers × 3,500 posts/sec = 1.75M cache writes/sec
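
The write path above can be sketched as follows. The dictionaries are in-memory stand-ins for the feed cache and the celebrity store, and `on_post_created` is a hypothetical name; a real fan-out service would run this asynchronously from a queue.

```python
from collections import defaultdict

# From the case study: more than 500K followers means fan-out on read.
CELEBRITY_THRESHOLD = 500_000

# In-memory stand-ins (assumptions) for the cache and the celebrity store.
feed_cache = defaultdict(list)       # follower_id -> post_ids, newest first
celebrity_posts = defaultdict(list)  # author_id -> post_ids, newest first


def on_post_created(author_id, post_id, followers, threshold=CELEBRITY_THRESHOLD):
    """Fan out one new post; return the number of cache writes performed."""
    if len(followers) > threshold:
        # Celebrity: store once and let readers merge it at read time.
        celebrity_posts[author_id].insert(0, post_id)
        return 0
    for follower_id in followers:
        feed_cache[follower_id].insert(0, post_id)
    return len(followers)
```

The threshold is passed as a parameter so it can be tuned (or lowered in tests) without touching the routing logic.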

📖

Read Path (Assembly)

Read pre-computed feed from Redis cache (list of post_ids)
Merge in celebrity posts from users this reader follows
Score/rank merged candidates (ML model)
Hydrate top-N post_ids with full content from Post Store
Result: <50ms (cache read) + ~100ms (ranking), leaving ~50ms for merge and hydration to stay under 200ms total
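
A minimal sketch of the read path, assuming plain dictionaries in place of Redis and the Post Store, and a caller-supplied `score` function standing in for the ML model (all names are hypothetical):

```python
def build_feed(user_id, feed_cache, celebrity_posts, following, post_store,
               score, limit=20):
    """Assemble a feed: cached IDs + celebrity merge, then rank and hydrate."""
    # 1. Read the pre-computed feed (list of post_ids).
    candidates = list(feed_cache.get(user_id, []))
    # 2. Merge in posts from followed celebrities (empty for normal authors).
    for author_id in following.get(user_id, []):
        candidates.extend(celebrity_posts.get(author_id, []))
    # Drop duplicate IDs while keeping first-seen order.
    candidates = list(dict.fromkeys(candidates))
    # 3. Rank the merged candidates, highest score first.
    ranked = sorted(candidates, key=lambda pid: score(user_id, pid), reverse=True)
    # 4. Hydrate top-N; posts missing from the store (e.g. deleted) are skipped.
    feed = []
    for post_id in ranked:
        post = post_store.get(post_id)
        if post is not None:
            feed.append(post)
        if len(feed) == limit:
            break
    return feed
```

Because hydration happens last, a deleted post simply drops out of the result without any cache cleanup.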

Fan-Out Decision: Normal User vs Celebrity

The threshold between fan-out and no-fan-out is a tunable parameter. Facebook's friend model caps connections at ~5K, so its fan-out on write stays bounded. Twitter uses a dynamic threshold based on how many followers are currently online. The higher the threshold, the more you pre-compute (faster reads, more storage and write cost). The lower the threshold, the more you compute on read (slower reads, less storage). This case study starts at 500K and tunes based on infrastructure capacity.

📋 Chapter 4 — Summary

Hybrid: fan-out on write for normal users (pre-computed, fast read), fan-out on read for celebrities.
Feed cache in Redis: list of post_ids per user. Read = single cache lookup.
Celebrity posts merged on the fly — only ~10-20 celebrity accounts per user to merge.
Ranking happens at read time: score merged candidates, return top N.
Feed latency: <200ms (cache read + merge + rank + hydrate).

Chapter Five

Alternative Approaches

Fan-Out Strategies Compared

The three canonical approaches to feed generation each optimize for different constraints. Pure fan-out on write optimizes for read speed. Pure fan-out on read optimizes for write simplicity. The hybrid approach — used by every major platform — trades implementation complexity for the best of both worlds.

Fan-Out on Write (Push)

When post created → write to every follower's feed cache
Read is a single cache lookup: O(1) and fast
Write amplification: 1 post = N writes (N = follower count)
Celebrity problem: 100M followers = 100M writes per post
Storage cost: N copies of each post_id across all follower feeds
Good for: mostly-equal follower counts, fast reads

Fan-Out on Read (Pull)

When user requests feed → query all followed accounts' posts
Read is expensive: N queries per feed (N = followed accounts)
Write is simple: just store the post once
No celebrity problem: celebrity post stored once regardless of followers
Storage efficient: no duplication
Good for: write-heavy, celebrity-heavy platforms

Write Amplification: Fan-Out on Write vs Read

Chronological Feed

Sort by timestamp, newest first
Deterministic: no ML model, no A/B testing needed
Pre-computable: just prepend new posts to the list
Problem: user misses important posts from close friends
Users who follow 1000+ accounts: most posts never seen
Used by: Twitter (optional "Latest"), Mastodon

Ranked Feed (ML-based)

ML model scores each candidate: P(engagement | user, post)
Features: author affinity, content type, recency, past interaction
Higher engagement: users spend more time, see relevant content
Requires ML inference infrastructure at read time (100ms budget)
Controversial: filter bubbles, engagement over quality
Used by: Facebook, Instagram, TikTok, LinkedIn

Ranking Pipeline: Candidate Reduction at Each Stage

Ranked feeds increase engagement by 2-5x over chronological. But they require an ML scoring service that can rank hundreds of candidates in under 100ms per request. This is a massive infrastructure investment: Facebook's ranking system uses 1000+ features per candidate and multiple ML models in cascade, and serves billions of predictions per day. Start chronological, and add ranking when you have the data and infrastructure.

📋 Chapter 5 — Summary

Fan-out on write: fast reads, expensive writes. Breaks for celebrities.
Fan-out on read: simple writes, slow reads. No celebrity problem.
Hybrid: fan-out on write for normal, on read for celebrities. Production standard.
Ranked vs chronological: ranking increases engagement 2-5x but needs ML infrastructure.

Chapter Six

What Real Companies Did

Production Feed Systems

The platforms below have described their feed architectures in papers, engineering blogs, or talks. The common theme: they generally started simple (chronological, fan-out on write) and evolved toward hybrid systems with ML ranking as they scaled. Nobody ships a perfect feed system on day one.

📘

Facebook / Meta

Fan-out on write for friends (symmetric, max ~5K)
Multi-stage ranking: coarse filter → fine rank → final reorder
1000+ features per candidate post for ML scoring
TAO: distributed data store for the social graph (objects and associations such as friendships)
Aggregator: fetches ~2000 candidates, ranks, returns 20-50

🐦

Twitter / X

Hybrid: fan-out on write for users with <threshold followers
Timeline mixer: merges pre-computed + celebrity + algorithmic
Manhattan: custom distributed key-value store for timelines
GraphJet: real-time recommendation engine for "For You"
~400K fan-out operations/sec at peak

📸

Instagram

Shifted from chronological to ranked feed (2016)
ML model predicts: P(like), P(comment), P(share), P(save)
Weighted combo of predictions → final rank score
Cassandra for feed storage (high write throughput)
Engagement increased significantly after ranking launch
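
The "weighted combo" idea can be illustrated with a short sketch. The weights below are invented for the example and are not Instagram's actual values; `rank_score` is a hypothetical name, and the per-action probabilities would come from the ML model.

```python
# Hypothetical weights: rarer, higher-intent actions count for more.
WEIGHTS = {"like": 1.0, "comment": 2.0, "share": 3.0, "save": 2.5}


def rank_score(predictions, weights=WEIGHTS):
    """Combine per-action probabilities into a single ranking score."""
    return sum(w * predictions.get(action, 0.0) for action, w in weights.items())


candidates = {
    "post_a": {"like": 0.5, "comment": 0.1, "share": 0.0, "save": 0.2},
    "post_b": {"like": 0.9, "comment": 0.0, "share": 0.0, "save": 0.0},
}
ordered = sorted(candidates, key=lambda pid: rank_score(candidates[pid]), reverse=True)
print(ordered)
```

With these made-up weights, post_a outranks post_b even though post_b has the higher like probability, which is the point of weighting actions differently.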

🎵

TikTok (For You Page)

Pure fan-out on read + recommendation (no follow-based feed)
Content-based: rank ALL content, not just followed accounts
Watch time is the primary signal (not likes/follows)
Massive candidate retrieval → multi-stage ranking + filtering
Cold start: new users get personalized feed in <10 videos

Production Feed Systems — Comparison

📋 Chapter 6 — Summary

Facebook: fan-out on write (friends), multi-stage ML ranking with 1000+ features.
Twitter: hybrid fan-out, Manhattan KV store, GraphJet for recommendations.
Instagram: chronological → ranked (2016). ML predicting multiple engagement types.
TikTok: pure recommendation (no follow-based), watch time as primary signal.

Chapter Seven

Best Practices Extracted

Transferable Lessons

Feed systems teach patterns that apply to any system doing personalized content assembly: recommendation engines, email inboxes, notification centers, and content discovery pages. The principles of pre-computation, tiered ranking, and hybrid push/pull apply far beyond social media.

🔀

Tiered Ranking

Stage 1: Candidate retrieval (thousands → hundreds)
Stage 2: Coarse ranking (hundreds → tens, light model)
Stage 3: Fine ranking (tens → final order, heavy model)
Stage 4: Policy filters (remove duplicates, enforce diversity)
Transfers to: search, recommendations, ad selection
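
Stages 2 to 4 of the funnel can be sketched as below, assuming stage 1 has already produced `candidates` as a list of dicts with `id` and `author` keys. The scoring functions, cut-off sizes, and the per-author cap used as the diversity rule are all illustrative assumptions.

```python
def tiered_rank(candidates, light_score, heavy_score,
                coarse_keep=100, final_keep=20, max_per_author=2):
    """Coarse rank with a cheap model, fine rank with an expensive one,
    then apply policy filters (dedupe + per-author diversity cap)."""
    # Stage 2: cheap model over everything, keep the best coarse_keep.
    coarse = sorted(candidates, key=light_score, reverse=True)[:coarse_keep]
    # Stage 3: expensive model only over the survivors.
    fine = sorted(coarse, key=heavy_score, reverse=True)
    # Stage 4: policy filters applied in ranked order.
    seen_ids, per_author, result = set(), {}, []
    for post in fine:
        if post["id"] in seen_ids:
            continue
        if per_author.get(post["author"], 0) >= max_per_author:
            continue
        seen_ids.add(post["id"])
        per_author[post["author"]] = per_author.get(post["author"], 0) + 1
        result.append(post)
        if len(result) == final_keep:
            break
    return result
```

The heavy model never sees candidates the light model discarded, which is what keeps the expensive stage inside its latency budget.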

💾

Feed Cache Design

Store only post_ids in feed cache (not full content)
Hydrate content separately (post store lookup)
Cap feed length: keep last 500-1000 post_ids per user
Evict oldest when cap reached (FIFO)
Transfers to: any personalized list/inbox system
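
A capped, ID-only feed cache can be sketched with a bounded deque. This is an in-memory illustration with a hypothetical `FeedCache` class; with Redis lists, a similar effect is commonly achieved with LPUSH followed by LTRIM.

```python
from collections import deque

FEED_CAP = 1000  # upper end of the 500-1000 range suggested above


class FeedCache:
    """Per-user list of post_ids, newest first, with FIFO eviction at the cap."""

    def __init__(self, cap=FEED_CAP):
        self.cap = cap
        self.feeds = {}

    def push(self, user_id, post_id):
        feed = self.feeds.setdefault(user_id, deque(maxlen=self.cap))
        # appendleft on a full deque silently drops the oldest ID on the right.
        feed.appendleft(post_id)

    def read(self, user_id, limit=20):
        return list(self.feeds.get(user_id, ()))[:limit]


cache = FeedCache(cap=3)
for post_id in [1, 2, 3, 4, 5]:
    cache.push("u1", post_id)
print(cache.read("u1"))
```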

⚡

Selective Fan-Out

Fan out only to active users (seen in last 7 days)
Inactive users: build feed on demand when they return
Saves 40-60% of fan-out writes (many users dormant)
Trade-off: returning users get slightly stale first feed
Transfers to: any event distribution with inactive subscribers
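
Selecting fan-out targets by activity can be sketched as a simple filter. `last_seen` is a hypothetical mapping from user ID to the time the user last opened the app; users with no record are treated as inactive.

```python
from datetime import datetime, timedelta, timezone

ACTIVE_WINDOW = timedelta(days=7)  # "active" = seen in the last 7 days


def active_followers(followers, last_seen, now):
    """Return only the followers who should receive a fan-out write."""
    cutoff = now - ACTIVE_WINDOW
    return [
        follower_id
        for follower_id in followers
        if follower_id in last_seen and last_seen[follower_id] >= cutoff
    ]


now = datetime(2024, 1, 15, tzinfo=timezone.utc)
last_seen = {
    "u1": now - timedelta(days=1),   # active
    "u2": now - timedelta(days=30),  # dormant
}
print(active_followers(["u1", "u2", "u3"], last_seen, now))
```

Dormant users skipped here need their feed rebuilt on demand when they return, as noted above.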

Tiered Ranking: Candidate Funnel

Store post_ids, not content, in feed caches. A post might get edited, deleted, or enriched with engagement counts. If you store content in the feed cache, every edit requires updating millions of copies. Storing only IDs means the post lives in one canonical location — the feed cache is just an ordered list of pointers. Hydrate at read time. This separation is the single most important feed cache design decision.

📋 Chapter 7 — Summary

Tiered ranking: retrieve → coarse rank → fine rank → policy filter. Each stage reduces candidates.
IDs not content: feed cache stores post_ids only. Hydrate separately.
Selective fan-out: only push to active users (seen in last 7 days). Build on demand for returning dormant users. Saves 40-60% of write volume.
Feed cap: 500-1000 items max per user. FIFO eviction.

Chapter Eight

What Could Go Wrong

Common Failure Patterns

Feed failures are subtle: users don't see errors, they just see stale content, missing posts from close friends, or an empty feed. These are worse than explicit errors because users don't report them; they just gradually disengage. Failures like the ones below can go undetected for weeks because the symptom is "fewer users opening the app", not "500 errors."

⭐

Celebrity Fan-Out Storm

Celebrity posts without threshold → 100M writes per post
Fan-out service queue grows to hours of lag
Normal users' posts delayed because queue is full of celebrity fan-outs
Fix: celebrity threshold (skip fan-out for high-follower accounts), separate queues for normal vs high-follower fan-outs.

Celebrity Storm: Without vs With Threshold

🕳️

Feed Staleness

Feed cache not updated for some users (fan-out worker fell behind)
User sees 12-hour-old content — thinks platform is dead
Silent failure: no errors, just stale data. Hard to detect.
Fix: monitor fan-out lag by percentile. Alert on p99 > 2min. Put a TTL on the feed cache to force a refresh.

What to monitor for fan-out lag:

fan_out_lag_p50: median time from post creation until the post lands in a follower's cache. Alert if > 10s.
fan_out_lag_p99: tail latency, i.e. how long delivery takes for the slowest 1% of followers. Alert if > 2min.
fan_out_queue_depth: number of pending fan-out jobs. Alert if growing (means workers falling behind).
feed_cache_hit_rate: should be > 99%. Drop indicates fan-out not keeping up or cache eviction.

These metrics are the difference between detecting staleness in minutes vs discovering it from user complaints days later.
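
The two lag alerts can be sketched as below, assuming lag samples (seconds from post creation to cache write) are already being collected. The nearest-rank percentile and the function names are illustrative choices; a production setup would normally rely on its metrics system for percentiles.

```python
def percentile(samples, pct):
    """Nearest-rank percentile for an integer pct in 1..100."""
    ordered = sorted(samples)
    rank = max(1, -(-pct * len(ordered) // 100))  # integer ceiling
    return ordered[rank - 1]


def check_fan_out_lag(lag_seconds, p50_limit=10, p99_limit=120):
    """Return the names of the lag alerts that should fire."""
    alerts = []
    if not lag_seconds:
        return alerts
    if percentile(lag_seconds, 50) > p50_limit:
        alerts.append("fan_out_lag_p50")
    if percentile(lag_seconds, 99) > p99_limit:
        alerts.append("fan_out_lag_p99")
    return alerts


# Most deliveries are fast, but a slow tail breaches the p99 limit.
samples = [2] * 97 + [200, 300, 400]
print(check_fan_out_lag(samples))
```

A healthy median with a firing p99 alert is the typical signature of a few stuck fan-out workers rather than a system-wide slowdown.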

🤖

Ranking Model Degradation

ML model starts promoting low-quality content (engagement bait)
Or: model bug causes same 5 posts shown repeatedly
User engagement drops 20% over days — slow detection
Fix: diversity constraints in ranking, content freshness signals, A/B testing all model changes, kill-switch to revert to chronological.
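
A kill-switch can be as small as a flag checked before ranking. The sketch below assumes posts are dicts with a `created_at` field and that `ranker` is a scoring function; falling back on a ranker exception as well is an extra assumption beyond the flag itself.

```python
def order_feed(posts, ranker, ranking_enabled=True):
    """Rank with the model, or revert to newest-first when it is disabled or fails."""
    chronological = sorted(posts, key=lambda p: p["created_at"], reverse=True)
    if not ranking_enabled:
        return chronological  # kill-switch flipped by an operator
    try:
        return sorted(posts, key=ranker, reverse=True)
    except Exception:
        # A broken model should degrade the feed, not blank it.
        return chronological
```

In practice the flag would live in a dynamic config system so it can be flipped without a deploy.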

📭

Cold Start (New User Empty Feed)

New user follows 0 accounts → empty feed → leaves immediately
Or: user follows accounts but no pre-computed feed exists yet
Fan-out hasn't run yet → cache is empty → blank screen
Fix: onboarding follows (suggest popular accounts), trending/explore content as fallback, immediate fan-out for new follows.
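
The trending fallback can be sketched as padding a short feed up to a minimum length. `feed_with_fallback` and the `min_items` value are hypothetical; where trending content comes from is outside the scope of this sketch.

```python
def feed_with_fallback(feed_ids, trending_ids, min_items=10):
    """Pad an empty or short feed with trending posts, avoiding duplicates."""
    if len(feed_ids) >= min_items:
        return list(feed_ids)
    already_in_feed = set(feed_ids)
    padding = [pid for pid in trending_ids if pid not in already_in_feed]
    return list(feed_ids) + padding[:min_items - len(feed_ids)]


print(feed_with_fallback([], ["t1", "t2", "t3"], min_items=2))
```

The user's own feed items always come first, so the fallback fades out naturally as their cache fills up.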

Feed quality problems are invisible to traditional monitoring. No 500 errors. No latency spikes. The system appears healthy — but users see stale, irrelevant, or repetitive content and silently leave. You need engagement metrics (time spent, scroll depth, posts seen) as the real health signal, not just infrastructure metrics. If average scroll depth drops 15% — something is wrong with feed quality, even if all servers are green.

📋 Chapter 8 — Summary

Celebrity storm: 100M fan-out writes. Fix: threshold + separate queues.
Staleness: silent failure — stale feeds with no errors. Fix: lag monitoring + TTL.
Model degradation: bad ranking causes slow engagement drop. Fix: diversity rules + kill-switch.
Cold start: new user gets empty feed. Fix: trending fallback + immediate fan-out.
Principle: engagement metrics are the real feed health signal, not server metrics.
