Case Study: Notification System

Design, trade-offs, and alternatives for a notification system at scale.

Chapter One

Problem Statement

What We Are Building

A notification system delivers messages to users through multiple channels — push notifications, SMS, email, and in-app alerts. It seems simple: something happens, tell the user. But at scale, you are coordinating millions of messages per hour across unreliable third-party providers (APNs, FCM, Twilio, SendGrid), each with different delivery semantics, rate limits, and failure modes. The challenge is not sending one notification — it is sending the right notification, to the right user, through the right channel, exactly once, within seconds.

Scale Requirements

Traffic & Scale

500M daily active users
10B notifications/day (~115K/sec average)
Peak: 5x average → ~580K notifications/sec
3 channels: push (70%), email (25%), SMS (5%)
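The arithmetic behind these figures can be checked with a short back-of-envelope script. This is only a sketch: the variable names are illustrative, and applying the same channel mix at peak is an assumption, not something the requirements state.

```python
# Back-of-envelope capacity estimate using the figures listed above.
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

notifications_per_day = 10_000_000_000
peak_multiplier = 5
channel_share = {"push": 0.70, "email": 0.25, "sms": 0.05}

average_per_sec = notifications_per_day / SECONDS_PER_DAY
peak_per_sec = average_per_sec * peak_multiplier

# Assumption: the channel mix at peak is the same as the daily mix.
peak_by_channel = {ch: peak_per_sec * share for ch, share in channel_share.items()}

if __name__ == "__main__":
    print(f"average: {average_per_sec:,.0f}/sec")
    print(f"peak:    {peak_per_sec:,.0f}/sec")
    for ch, rate in peak_by_channel.items():
        print(f"  {ch:<5} {rate:,.0f}/sec at peak")
```

The average comes to about 115.7K/sec and the 5x peak to about 579K/sec, of which SMS is only around 29K/sec.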

Requirements

Latency: push <2 sec, email <30 sec, SMS <10 sec
Delivery guarantee: at-least-once (with dedup)
No duplicates: user must not receive same notification twice
User preferences: opt-out per channel, quiet hours, frequency caps

A notification system is a write-dominated, fan-out-heavy pipeline. Unlike a URL shortener (read-heavy) or rate limiter (inline), notifications are fire-and-forget from the caller's perspective — but the system internally must guarantee reliable delivery across unreliable external channels. The core challenge is reliability, not latency.

📋 Chapter 1 — Summary

10B notifications/day across push, email, and SMS channels.
At-least-once delivery with deduplication — no missed and no duplicated notifications.
User preferences (opt-out, quiet hours, frequency caps) must be respected.
External providers (APNs, FCM, Twilio) are unreliable — system must handle their failures.

Chapter Two

Questions to Ask

Clarifying Before Designing

Notification systems vary enormously by use case. A ride-hailing app sending "your driver is arriving" has completely different requirements than a social network sending "someone liked your post." The questions below determine whether you need a real-time pipeline or a batch system, strict delivery or best-effort, and how much complexity the preference layer adds.

📬

Channel & Delivery

Which channels? Push, email, SMS, in-app, webhook?
Is the system responsible for rendering content?
Are delivery receipts required?
Do we need retry on failure? How many times?

🎯

Priority & Timing

Priority tiers? (critical alert vs marketing email)
Quiet hours per timezone?
Scheduled notifications? (send at 9am user-local)
Frequency caps? (max 3 emails/day per user)

👤

User Preferences

Per-channel opt-out? (email yes, SMS no)
Per-category opt-out? (marketing no, security yes)
Can users choose notification grouping/digest?
Legal compliance? (GDPR, CAN-SPAM, TCPA)

For This Case Study, Our Answers Are:

Channels: push (70%), email (25%), SMS (5%)
Priority tiers: yes — P0 security, P1 transactional, P2 engagement, P3 marketing
Delivery guarantee: at-least-once with Redis dedup
Quiet hours: yes, enforced per user timezone
Frequency caps: yes, per-category per-day limits
Delivery receipts: tracked in Analytics DB (sent / delivered / opened)
Legal: GDPR + CAN-SPAM compliance enforced at preference layer

Priority tiers change the entire architecture. A critical security alert ("someone logged into your account") must bypass frequency caps, quiet hours, and batch queues — it goes straight to the fastest channel immediately. A marketing notification ("new feature!") can be batched, delayed, and capped. Mixing these in one queue causes critical alerts to wait behind marketing blasts.
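One way to make this concrete is a small policy table keyed by priority tier, plus a queue-naming rule that keeps tiers apart. This is a minimal sketch: `DeliveryPolicy`, `POLICIES`, and `queue_name` are hypothetical names, and the P1 and P2 rows are assumptions, since the text only spells out the behavior of security alerts and marketing.

```python
from dataclasses import dataclass
from enum import IntEnum


class Priority(IntEnum):
    P0_SECURITY = 0
    P1_TRANSACTIONAL = 1
    P2_ENGAGEMENT = 2
    P3_MARKETING = 3


@dataclass(frozen=True)
class DeliveryPolicy:
    bypass_quiet_hours: bool
    bypass_frequency_caps: bool
    batchable: bool


# Hypothetical policy table. The P0 and P3 rows follow the text above;
# the P1 and P2 rows are assumptions for illustration.
POLICIES = {
    Priority.P0_SECURITY: DeliveryPolicy(True, True, False),
    Priority.P1_TRANSACTIONAL: DeliveryPolicy(False, False, False),
    Priority.P2_ENGAGEMENT: DeliveryPolicy(False, False, True),
    Priority.P3_MARKETING: DeliveryPolicy(False, False, True),
}


def queue_name(channel: str, priority: Priority) -> str:
    """One queue per (channel, priority) so marketing never sits ahead of security."""
    return f"{channel}.p{int(priority)}"
```

With this naming, a security push lands in `push.p0` and a marketing email in `email.p3`, so the two never share a backlog.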

📋 Chapter 2 — Summary

Channel selection: push, email, SMS each have different latency and cost profiles.
Priority tiers prevent critical alerts from being delayed by batch marketing sends.
User preferences (opt-out, quiet hours, frequency caps) add a filtering layer before delivery.
Legal compliance (GDPR, CAN-SPAM) makes opt-out enforcement non-optional.
Delivery receipts: required in this case study, so delivery status (sent / delivered / opened) must be tracked.

Chapter Three

Naive Design

Synchronous Send on the Request Path

The simplest design: when an event occurs (user gets a new message), the application server directly calls the notification providers — APNs, SendGrid, Twilio — in the same request path. Send the push, send the email, respond to the caller. Works great for a prototype. At scale, a single slow provider (Twilio having a bad day) blocks your entire application and creates cascading failures across unrelated features.

Naive Design — Synchronous Direct Send

✅

What Works

Simple to implement — direct HTTP calls
Immediate feedback (know if send succeeded)
No infrastructure beyond the providers
Fine for <1K notifications/hour

💥

What Breaks

Provider slowdown blocks application threads
Provider outage = notification failure (no retry)
No deduplication — retry on timeout = user gets 2 emails
No user preference filtering — opt-outs ignored or scattered
No priority separation — critical alerts queue behind marketing

📋 Chapter 3 — Summary

Synchronous send: application blocks waiting for each provider response.
Provider slowdown cascades into application-level failures.
No retry mechanism — if a send fails, the notification is lost.
No dedup — timeout + retry = duplicate notification.
No preference filtering, no priority queues. Everything is tangled.

Chapter Four

Refined Design

Async Queue-Based Architecture

The refined design decouples notification creation from delivery. A caller publishes a notification event to a message queue and returns immediately. Dedicated workers per channel consume from their queues, apply user preferences, deduplicate, and deliver through the appropriate provider. Failed sends go to a retry queue with exponential backoff. This architecture isolates provider failures, enables independent scaling per channel, and never blocks the application. For events that require all channels simultaneously — such as a security alert — the notification service fans out to multiple queues in parallel rather than sequentially, ensuring push, email, and SMS are dispatched at the same time.

Refined Design — Async Queue-Based Notification Pipeline

📨

Send Path

Step 1: Caller publishes notification event (returns immediately)
Step 2: Notification service validates, deduplicates (Redis idempotency key)
Step 3: Check user preferences — opt-out? quiet hours? frequency cap?
Step 4: Route to channel-specific queue (priority-aware)
Step 5: Worker delivers via provider, records status
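Steps 2 through 4 can be sketched in a few lines. This is an illustrative, in-memory model only: the set stands in for Redis, the deques stand in for real queues, the preference check covers opt-outs but not quiet hours or caps, and every class and field name here is hypothetical.

```python
from collections import defaultdict, deque
from dataclasses import dataclass

CHANNELS = {"push", "email", "sms"}


@dataclass(frozen=True)
class Event:
    idempotency_key: str
    user_id: str
    channel: str   # "push", "email", or "sms"
    priority: int  # 0 (security) .. 3 (marketing)
    body: str


class NotificationService:
    def __init__(self, opted_out: set[tuple[str, str]]):
        self.seen_keys: set[str] = set()   # stand-in for Redis
        self.opted_out = opted_out         # {(user_id, channel)}
        self.queues: dict[str, deque] = defaultdict(deque)

    def accept(self, event: Event) -> str:
        # Step 2: validate, then deduplicate on the idempotency key.
        if event.channel not in CHANNELS:
            return "rejected"
        if event.idempotency_key in self.seen_keys:
            return "duplicate"
        self.seen_keys.add(event.idempotency_key)

        # Step 3: preference check (opt-out only in this sketch).
        if (event.user_id, event.channel) in self.opted_out:
            return "suppressed"

        # Step 4: route to a channel-specific, priority-aware queue.
        self.queues[f"{event.channel}.p{event.priority}"].append(event)
        return "queued"
```

A channel worker (step 5) would then consume from its own queue, for example `push.p1`, and call the provider.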

🔄

Retry Path

Provider returns error → message goes to retry queue
Exponential backoff: 1s, 5s, 30s, 5min, 30min
Max 5 retries then → dead letter queue (DLQ)
DLQ monitored by ops: manual review + alerting
Idempotency key prevents duplicate sends on retry
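The retry schedule above can be written as a lookup table. The sketch below is an assumption-laden illustration: `send` and `sleep` are injected callables so the logic can be tested without real delays, and a production worker would normally requeue the message with a delay rather than sleep in-process. Systems often add random jitter to each delay as well, which is omitted here.

```python
from typing import Callable, Optional

BACKOFF_SECONDS = [1, 5, 30, 5 * 60, 30 * 60]  # schedule from the list above
MAX_RETRIES = len(BACKOFF_SECONDS)              # 5 retries, then DLQ


def next_delay(retries_so_far: int) -> Optional[int]:
    """Delay before the next retry, or None when the message belongs in the DLQ."""
    if retries_so_far >= MAX_RETRIES:
        return None
    return BACKOFF_SECONDS[retries_so_far]


def deliver_with_retries(send: Callable[[], bool],
                         sleep: Callable[[float], None]) -> str:
    """Try once, then retry on the schedule. `send` returns True on success."""
    retries = 0
    while True:
        if send():
            return "delivered"
        delay = next_delay(retries)
        if delay is None:
            return "dead-lettered"
        sleep(delay)
        retries += 1
```

A send that always fails is attempted six times in total (one initial attempt plus five retries) before it is dead-lettered.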

The key insight: separate queues per channel enable independent scaling and failure isolation. If Twilio has a 2-hour outage, only SMS notifications queue up. Push and email continue unaffected. With a single queue, a stuck SMS consumer blocks push notifications too — a failure in a $0.01 SMS blocks a free push notification.

📋 Chapter 4 — Summary

Async: caller publishes event and returns immediately. No blocking.
Separate queues per channel: push, email, SMS scale and fail independently.
Priority queues: critical alerts bypass normal queue ordering.
Deduplication via Redis idempotency keys — prevents duplicate sends.
Retry with exponential backoff. Dead letter queue for persistent failures.
User preference check happens before queuing — never send what user opted out of.
Fan-out: security alerts publish to all channel queues simultaneously, not sequentially.

Chapter Five

Alternative Approaches

Unified Queue vs Per-Channel Queues

The two dominant architectural choices differ in how you organize the message pipeline. A single unified queue is simpler to operate but creates coupling between channels. Per-channel queues add operational overhead but provide isolation, independent scaling, and channel-specific optimization. The right choice depends on scale and operational maturity.

Single Unified Queue

One queue, workers read and route by channel field
Simpler infrastructure — one topic to monitor
Workers must handle all channel types
Problem: slow channel blocks fast channels
Head-of-line blocking: SMS timeout delays push delivery
Scaling: must scale for slowest channel's throughput

Per-Channel Queues

Separate queue per channel: push, email, SMS
Workers specialized per channel — optimized batching
Independent scaling: push (100 workers) vs SMS (10)
Isolation: Twilio outage only affects SMS queue
More infrastructure: N queues to configure and monitor
Used by: Most production systems at scale
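Head-of-line blocking is easy to see in a toy model. The sketch below assumes a single FIFO worker per queue and made-up service times (a 5-second SMS timeout and a 10 ms push call); it is a simulation for intuition, not a benchmark.

```python
def completion_times(jobs: list[tuple[str, float]]) -> dict[str, float]:
    """Single FIFO worker: time at which the last job of each channel finishes."""
    clock = 0.0
    finished: dict[str, float] = {}
    for channel, service_time in jobs:
        clock += service_time
        finished[channel] = clock
    return finished


# Hypothetical service times: SMS calls hang until a 5 s timeout, push takes 10 ms.
SMS_TIMEOUT, PUSH_TIME = 5.0, 0.01
jobs = [("sms", SMS_TIMEOUT)] * 10 + [("push", PUSH_TIME)] * 10

# Unified queue: pushes wait behind every stuck SMS call.
unified = completion_times(jobs)

# Per-channel queue: the push worker only ever sees push jobs.
push_only = completion_times([job for job in jobs if job[0] == "push"])

if __name__ == "__main__":
    print(f"unified queue, last push done at     {unified['push']:.2f}s")
    print(f"per-channel queue, last push done at {push_only['push']:.2f}s")
```

In this model the last push finishes after about 50 seconds on the unified queue and after about 0.1 seconds on its own queue.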

Unified Queue vs Per-Channel Queues — Visual Comparison

Pull-Based (Workers Poll)

Workers poll the queue at intervals (SQS long-polling)
Workers control their own pace, which gives natural backpressure
Slight latency from polling interval (50-250ms)
Simple to implement: no connection management
Good for: email, SMS (latency tolerance)
Used by: AWS SQS-based architectures, Kafka consumer groups (Kafka consumers poll the broker)

Push-Based (Event-Driven)

Broker pushes messages to workers in real-time
Lower latency: message delivered as soon as produced
Need flow control to prevent overwhelming slow consumers
Connection management complexity (reconnects, heartbeats)
Good for: push notifications (latency-sensitive)
Used by: RabbitMQ (broker pushes to subscribed consumers)

Per-channel queues win at scale because channels have fundamentally different characteristics. Push notifications are cheap and fast (HTTP/2 multiplexing to APNs). Email has complex rendering and throttling. SMS is expensive ($0.01/msg) and rate-limited by carriers. Forcing them through one pipeline means optimizing for nothing.

📋 Chapter 5 — Summary

Unified queue: simple, but head-of-line blocking across channels. OK for small scale.
Per-channel queues: isolation, independent scaling, channel-specific optimization. Production standard.
Pull-based: natural backpressure, slight latency. Good for email/SMS.
Push-based: real-time delivery, needs flow control. Good for push notifications.

Chapter Six

What Real Companies Did

Production Notification Systems

Every major platform has built a notification system — and every one has learned the hard way about duplicate sends, provider outages, and user notification fatigue. Their approaches differ, but common patterns emerge: event-driven architectures, separate processing per channel, and preference engines that sit between event generation and delivery.

📘

LinkedIn

1B+ members, millions of notifications/day
Kafka-based pipeline: event → filter → route → deliver
Preference engine checks 100+ rules per notification
Aggregation: "5 people viewed your profile" instead of 5 separate pushes
ML model predicts optimal send time per user

📱

Facebook / Meta

Billions of notifications/day across apps
Custom notification pipeline: "Notiflood" (internal)
Priority lanes: P0 (security), P1 (social), P2 (marketing)
Frequency capping prevents notification fatigue
Device-aware: collapse stale pushes on delivery

🏠

Airbnb

Multi-channel: push, email, SMS per booking lifecycle
Template service: content rendering separate from delivery
i18n: notifications in 60+ languages (template per locale)
Channel preference cascade: try push → fall back to email → SMS
Published architecture: "Scaling Notifications at Airbnb"
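A channel cascade of this kind can be sketched as an ordered list of senders tried in turn. This is not Airbnb's implementation; `send_with_cascade` and the `Sender` signature are hypothetical, and treating any exception as "try the next channel" is a simplifying assumption.

```python
from typing import Callable, Optional

# Hypothetical sender signature: (user_id, message) -> True if delivered.
Sender = Callable[[str, str], bool]


def send_with_cascade(user_id: str, message: str,
                      senders: list[tuple[str, Sender]]) -> Optional[str]:
    """Try each channel in order; return the first channel that delivers, else None."""
    for channel, send in senders:
        try:
            if send(user_id, message):
                return channel
        except Exception:
            # Simplification: any provider error falls through to the next channel.
            continue
    return None
```

Called with senders ordered push, email, SMS, the function returns the name of the first channel that reports success, so the more expensive channels are only used when the cheaper ones fail.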

🛒

Amazon

SNS + SQS + Lambda: serverless notification pipeline
SNS fan-out: one publish triggers push + email + SMS
Per-channel SQS queues with independent consumers
DLQ with auto-replay after TTL for transient failures
Pinpoint for user segmentation and campaign management

Quick Comparison — Real Company Notification Architectures

Company	Queue Tech	Priority Tiers	Aggregation	Special Pattern
LinkedIn	Kafka	Yes	Yes (ML-timed sends)	Preference engine (100+ rules)
Facebook	Custom (Notiflood)	P0/P1/P2	Yes (collapse on device)	Device-aware stale push collapse
Airbnb	Not disclosed	No explicit tiers	No	Channel cascade (push → email → SMS)
Amazon	SNS + SQS	No explicit tiers	No	Serverless fan-out via Lambda

📋 Chapter 6 — Summary

LinkedIn: Kafka pipeline, ML-based send-time optimization, notification aggregation.
Facebook: priority lanes (P0/P1/P2), frequency capping, device-aware collapse.
Airbnb: template service for i18n, channel cascade (push → email → SMS).
Amazon: SNS fan-out → per-channel SQS → Lambda workers. Serverless at scale.

Chapter Seven

Best Practices Extracted

Transferable Lessons

Notification systems teach patterns that apply to any system involving multi-channel delivery, user-facing reliability, and asynchronous processing. These are not notification-specific — they are principles of reliable event-driven architectures.

🔑

Idempotent Delivery

Every notification gets a unique idempotency key
Before sending: check Redis — already sent?
After sending: write key with TTL (24h)
Retry-safe: same key → skip, no duplicate to user
Transfers to: any at-least-once delivery system
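The flow above can be sketched with an in-memory stand-in for Redis. Note one deliberate difference from the list: this sketch claims the key before sending instead of writing it afterwards, which closes the window in which two workers both pass the check and both send. The trade-off is that a worker crashing between claim and send suppresses the notification until the key expires. In Redis the claim would typically be a single `SET key value NX EX 86400` command; `DedupStore` and `send_once` are hypothetical names.

```python
import time
from typing import Callable


class DedupStore:
    """In-memory stand-in for Redis, for illustration only (not thread-safe)."""

    def __init__(self, ttl_seconds: float = 24 * 3600,
                 clock: Callable[[], float] = time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self._expires_at: dict[str, float] = {}

    def claim(self, key: str) -> bool:
        """Record the key if absent or expired. True means 'go ahead and send'."""
        now = self.clock()
        expiry = self._expires_at.get(key)
        if expiry is not None and expiry > now:
            return False
        self._expires_at[key] = now + self.ttl
        return True

    def release(self, key: str) -> None:
        self._expires_at.pop(key, None)


def send_once(store: DedupStore, key: str, send: Callable[[], None]) -> str:
    if not store.claim(key):
        return "skipped"      # same key seen within the TTL: no duplicate to the user
    try:
        send()
    except Exception:
        store.release(key)    # let a later retry claim the key again
        raise
    return "sent"
```

A retry that carries the same key within 24 hours is skipped, while a failed send releases the key so the retry path can still deliver it.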

👤

Preference Center

Centralized service for all user notification preferences
Per-channel, per-category opt-out (matrix)
Quiet hours per timezone (user-local time)
Frequency caps: max N per day/week per category
Transfers to: any system with user-configurable behavior
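A preference check combining these rules might look like the sketch below. Several things are assumptions: the user's timezone is modeled as a fixed UTC offset (no daylight-saving handling), quiet hours default to 22:00 through 07:00, the caller supplies `sent_today`, and `bypass` models the P0 behavior of skipping quiet hours and caps. All names are hypothetical.

```python
from dataclasses import dataclass, field
from datetime import datetime, time, timedelta, timezone


@dataclass
class Preferences:
    utc_offset_minutes: int = 0                                    # assumption: fixed offset
    opted_out: set[tuple[str, str]] = field(default_factory=set)   # (channel, category)
    quiet_start: time = time(22, 0)
    quiet_end: time = time(7, 0)
    daily_caps: dict[str, int] = field(default_factory=dict)       # category -> max/day


def in_quiet_hours(prefs: Preferences, now_utc: datetime) -> bool:
    """`now_utc` must be timezone-aware; it is converted to the user's local time."""
    user_tz = timezone(timedelta(minutes=prefs.utc_offset_minutes))
    local = now_utc.astimezone(user_tz).time()
    if prefs.quiet_start <= prefs.quiet_end:
        return prefs.quiet_start <= local < prefs.quiet_end
    # Window wraps past midnight, e.g. 22:00 to 07:00.
    return local >= prefs.quiet_start or local < prefs.quiet_end


def decide(prefs: Preferences, channel: str, category: str,
           sent_today: int, now_utc: datetime, bypass: bool = False) -> str:
    if (channel, category) in prefs.opted_out:
        return "drop: opted out"
    if bypass:  # e.g. security alerts skip quiet hours and caps
        return "send"
    if in_quiet_hours(prefs, now_utc):
        return "defer: quiet hours"
    if sent_today >= prefs.daily_caps.get(category, float("inf")):
        return "drop: frequency cap"
    return "send"
```

Returning "defer" rather than "drop" for quiet hours is a design choice in this sketch: the notification can be rescheduled for when the window ends.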

🚨

Priority Queuing

P0: security alerts — bypass all caps, send immediately
P1: transactional — order confirmed, message received
P2: engagement — someone liked your post
P3: marketing — batched, capped, delayed
Transfers to: any system with mixed urgency workloads

Idempotency Key Flow — Preventing Duplicate Sends

Aggregate related notifications: instead of 10 pushes for 10 likes, send one: "10 people liked your post." This reduces provider costs (fewer API calls), respects user attention (less noise), and reduces unsubscribe rate (less fatigue). Aggregating by entity and time window is simple to build and has a large impact.
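Grouping by entity and time window can be sketched as a bucketing pass over pending events. The event shape `(timestamp, entity_id, actor)`, the fixed (non-sliding) windows, and the message wording are all assumptions for illustration.

```python
from collections import defaultdict


def aggregate(events: list[tuple[float, str, str]], window_seconds: float) -> list[str]:
    """events: (timestamp, entity_id, actor). Groups by entity and fixed time window."""
    buckets: dict[tuple[str, int], list[str]] = defaultdict(list)
    for ts, entity_id, actor in events:
        window_index = int(ts // window_seconds)
        buckets[(entity_id, window_index)].append(actor)

    messages = []
    for (entity_id, _window), actors in sorted(buckets.items()):
        if len(actors) == 1:
            messages.append(f"{actors[0]} liked {entity_id}")
        else:
            messages.append(f"{len(actors)} people liked {entity_id}")
    return messages
```

Ten likes on the same post inside one window collapse into a single "10 people liked ..." message, while a lone like on another post still names the actor.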

📋 Chapter 7 — Summary

Idempotent sends: Redis-backed dedup keys. Retry-safe without duplicates.
Preference center: centralized per-channel, per-category, per-timezone control.
Priority tiers: security alerts never wait behind marketing. Separate queues per priority.
Aggregation: batch related notifications. Better UX, lower cost, less fatigue.

Chapter Eight

What Could Go Wrong

Common Failure Patterns

Notification failures are uniquely visible to users — a missed password reset email or a duplicate charge notification erodes trust immediately. The failures below are not theoretical: every major platform has experienced them, and the fixes are well-established patterns that separate production-grade systems from prototypes.

👯

Duplicate Notifications

At-least-once delivery + retry = user gets 2-3 copies
Especially bad for SMS (user charged per text in some countries)
Root cause: no idempotency key, or key TTL too short
Fix: idempotency key in Redis with 24h TTL. Check before every send.

📵

Silent Delivery Failures

Provider accepts message but never delivers (stale device token)
APNs returns 200 OK but token is expired — push never arrives
Team thinks system works; users complain they get nothing
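Fix: treat provider invalid-token responses (APNs 410 Unregistered, FCM UNREGISTERED) as stale-token signals; the legacy APNs feedback service has been retired. Track delivery rate per channel. Alert on drops.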

😤

Notification Fatigue

Too many notifications → user disables all push permissions
Once push is disabled, you lose the highest-value channel
Recovering push permission is nearly impossible
Fix: frequency caps per user per category. Aggregate related events. A/B test volume thresholds.

📱

Device Token Staleness

User reinstalls app → old device token invalid
Millions of stale tokens burn API quota sending to void
APNs/FCM rate-limit you for too many invalid sends
Fix: process APNs/FCM feedback. Purge tokens with 3+ consecutive failures. Refresh on app open.
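The purge rule can be sketched as a per-token counter of consecutive failures. `TokenRegistry` is a hypothetical in-memory model; a real system would persist this state and would usually purge immediately when the provider explicitly reports a token as unregistered.

```python
from dataclasses import dataclass, field

MAX_CONSECUTIVE_FAILURES = 3  # threshold from the fix above


@dataclass
class TokenRegistry:
    failures: dict[str, int] = field(default_factory=dict)  # token -> consecutive failures

    def register(self, token: str) -> None:
        """Called on app open: adds or refreshes the token with a clean failure count."""
        self.failures[token] = 0

    def record_result(self, token: str, delivered: bool) -> None:
        if token not in self.failures:
            return  # unknown or already purged
        if delivered:
            self.failures[token] = 0  # a success resets the streak
            return
        self.failures[token] += 1
        if self.failures[token] >= MAX_CONSECUTIVE_FAILURES:
            del self.failures[token]  # purge the stale token

    def active_tokens(self) -> list[str]:
        return list(self.failures)
```

Because a successful delivery resets the counter, an occasional transient failure does not cost a healthy device its token.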

The most expensive failure is notification fatigue. Technical failures (duplicates, missed sends) are fixable with engineering. Losing push notification permission because you annoyed users is a business problem — once they toggle it off, it stays off. Frequency capping isn't just a feature; it is infrastructure protection for your most valuable delivery channel.

📋 Chapter 8 — Summary

Duplicates: idempotency keys with TTL. Check before every send attempt.
Silent failures: track delivery rates. Process provider feedback (stale tokens).
Notification fatigue: frequency caps + aggregation. Lost push permission is rarely regained.
Token staleness: purge invalid tokens. Process APNs/FCM feedback loops.
Principle: protect the user's attention — it is more valuable than your message.