Case Study: Notification System
Design, trade-offs, and alternatives for a notification system at scale.
Problem Statement
A notification system delivers messages to users through multiple channels: push notifications, SMS, email, and in-app alerts. It seems simple: something happens, tell the user. But at scale, you are coordinating hundreds of millions of messages per hour across unreliable third-party providers (APNs, FCM, Twilio, SendGrid), each with different delivery semantics, rate limits, and failure modes. The challenge is not sending one notification; it is sending the right notification, to the right user, through the right channel, exactly once, within seconds.
Traffic & Scale
- 500M daily active users
- 10B notifications/day (~115K/sec average)
- Peak: 5x average → ~580K notifications/sec
- 3 channels: push (70%), email (25%), SMS (5%)
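The throughput figures can be checked with a few lines of arithmetic. The sketch below uses only the numbers from the list above, plus the 86,400 seconds in a day, and computes the rates without rounding. The variable names are illustrative.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

daily_notifications = 10_000_000_000
peak_multiplier = 5
channel_share = {"push": 0.70, "email": 0.25, "sms": 0.05}

# Average rate if traffic were spread evenly over the day.
average_per_sec = daily_notifications / SECONDS_PER_DAY
# Peak rate, taken as a fixed multiple of the average.
peak_per_sec = average_per_sec * peak_multiplier

print(f"average: {average_per_sec:,.0f}/sec")
print(f"peak:    {peak_per_sec:,.0f}/sec")
for channel, share in channel_share.items():
    print(f"{channel:>5} at peak: {peak_per_sec * share:,.0f}/sec")
```

The average comes out just under 116K/sec and the peak just under 580K/sec. The per-channel split matters for capacity planning because each channel is served by a different provider with its own rate limits.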
Requirements
- Latency: push <2 sec, email <30 sec, SMS <10 sec
- Delivery guarantee: at-least-once (with dedup)
- No duplicates: user must not receive same notification twice
- User preferences: opt-out per channel, quiet hours, frequency caps
A notification system is a write-dominated, fan-out-heavy pipeline. Unlike a URL shortener (read-heavy) or rate limiter (inline), notifications are fire-and-forget from the caller's perspective, but the system internally must guarantee reliable delivery across unreliable external channels. The core challenge is reliability, not latency.
- 10B notifications/day across push, email, and SMS channels.
- At-least-once delivery with deduplication: no missed and no duplicated notifications.
- User preferences (opt-out, quiet hours, frequency caps) must be respected.
- External providers (APNs, FCM, Twilio) are unreliable, so the system must handle their failures.
Questions to Ask
Notification systems vary enormously by use case. A ride-hailing app sending "your driver is arriving" has completely different requirements than a social network sending "someone liked your post." The questions below determine whether you need a real-time pipeline or a batch system, strict delivery or best-effort, and how much complexity the preference layer adds.
Channel & Delivery
- Which channels? Push, email, SMS, in-app, webhook?
- Is the system responsible for rendering content?
- Are delivery receipts required?
- Do we need retry on failure? How many times?
Priority & Timing
- Priority tiers? (critical alert vs marketing email)
- Quiet hours per timezone?
- Scheduled notifications? (send at 9am user-local)
- Frequency caps? (max 3 emails/day per user)
User Preferences
- Per-channel opt-out? (email yes, SMS no)
- Per-category opt-out? (marketing no, security yes)
- Can users choose notification grouping/digest?
- Legal compliance? (GDPR, CAN-SPAM, TCPA)
For This Case Study, Our Answers Are:
- Channels: push (70%), email (25%), SMS (5%)
- Priority tiers: yes (P0 security, P1 transactional, P2 engagement, P3 marketing)
- Delivery guarantee: at-least-once with Redis dedup
- Quiet hours: yes, enforced per user timezone
- Frequency caps: yes, per-category per-day limits
- Delivery receipts: tracked in Analytics DB (sent / delivered / opened)
- Legal: GDPR + CAN-SPAM compliance enforced at preference layer
Priority tiers change the entire architecture. A critical security alert ("someone logged into your account") must bypass frequency caps, quiet hours, and batch queues; it goes straight to the fastest channel immediately. A marketing notification ("new feature!") can be batched, delayed, and capped. Mixing these in one queue causes critical alerts to wait behind marketing blasts.
- Channel selection: push, email, SMS each have different latency and cost profiles.
- Priority tiers prevent critical alerts from being delayed by batch marketing sends.
- User preferences (opt-out, quiet hours, frequency caps) add a filtering layer before delivery.
- Legal compliance (GDPR, CAN-SPAM) makes opt-out enforcement non-optional.
- Delivery receipts needed? Determines whether you need acknowledgment tracking.
Naive Design
The simplest design: when an event occurs (user gets a new message), the application server directly calls the notification providers (APNs, SendGrid, Twilio) in the same request path. Send the push, send the email, respond to the caller. Works great for a prototype. At scale, a single slow provider (Twilio having a bad day) blocks your entire application and creates cascading failures across unrelated features.
What Works
- Simple to implement: direct HTTP calls
- Immediate feedback (know if send succeeded)
- No infrastructure beyond the providers
- Fine for <1K notifications/hour
What Breaks
- Provider slowdown blocks application threads
- Provider outage = notification failure (no retry)
- No deduplication: retry on timeout = user gets 2 emails
- No user preference filtering: opt-outs ignored or scattered
- No priority separation: critical alerts queue behind marketing
- Synchronous send: application blocks waiting for each provider response.
- Provider slowdown cascades into application-level failures.
- No retry mechanism: if a send fails, the notification is lost.
- No dedup: timeout + retry = duplicate notification.
- No preference filtering, no priority queues. Everything is tangled.
Refined Design
The refined design decouples notification creation from delivery. A caller publishes a notification event to a message queue and returns immediately. Dedicated workers per channel consume from their queues, apply user preferences, deduplicate, and deliver through the appropriate provider. Failed sends go to a retry queue with exponential backoff. This architecture isolates provider failures, enables independent scaling per channel, and never blocks the application. For events that require all channels simultaneously, such as a security alert, the notification service fans out to multiple queues in parallel rather than sequentially, so push, email, and SMS are dispatched at the same time.
Send Path
- Step 1: Caller publishes notification event (returns immediately)
- Step 2: Notification service validates, deduplicates (Redis idempotency key)
- Step 3: Check user preferences: opt-out? quiet hours? frequency cap?
- Step 4: Route to channel-specific queue (priority-aware)
- Step 5: Worker delivers via provider, records status
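Steps 2 through 4 can be sketched in a few lines. This is a minimal in-memory illustration, not a production design. The `Notification` fields, the `accept` function, and the Python sets and deques standing in for Redis, the preference service, and the message broker are all assumptions made for the example.

```python
from collections import defaultdict, deque
from dataclasses import dataclass

VALID_CHANNELS = ("push", "email", "sms")

@dataclass(frozen=True)
class Notification:
    idempotency_key: str
    user_id: str
    channel: str   # "push", "email", or "sms"
    priority: int  # 0 = P0 security ... 3 = P3 marketing
    body: str

seen_keys = set()                    # stand-in for the Redis idempotency store
opted_out = {("u2", "email")}        # stand-in for the preference service
channel_queues = defaultdict(deque)  # stand-in for per-channel, per-priority queues

def accept(n: Notification) -> str:
    """Run validation, dedup, preference check, and routing for one event."""
    if n.channel not in VALID_CHANNELS:
        return "rejected: unknown channel"       # step 2: validate
    if n.idempotency_key in seen_keys:
        return "dropped: duplicate"              # step 2: deduplicate
    seen_keys.add(n.idempotency_key)
    if (n.user_id, n.channel) in opted_out:
        return "dropped: opted out"              # step 3: preferences
    channel_queues[(n.channel, n.priority)].append(n)  # step 4: route
    return "queued"
```

A worker for a given channel would then drain its own queues, lowest priority number first, and perform step 5 against the provider.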
Retry Path
- Provider returns error → message goes to retry queue
- Exponential backoff: 1s, 5s, 30s, 5min, 30min
- Max 5 retries, then dead letter queue (DLQ)
- DLQ monitored by ops: manual review + alerting
- Idempotency key prevents duplicate sends on retry
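The retry schedule above maps naturally onto a lookup table. The following sketch assumes a hypothetical `next_action` helper that a worker calls after each failed attempt. The delays and the retry limit come from the list above.

```python
from typing import Optional, Tuple

# Delay before each retry, in seconds: 1s, 5s, 30s, 5min, 30min.
BACKOFF_SECONDS = [1, 5, 30, 300, 1800]
MAX_RETRIES = len(BACKOFF_SECONDS)

def next_action(retries_done: int) -> Tuple[str, Optional[int]]:
    """Decide what to do after a failed send.

    retries_done is the number of retries already attempted
    (0 right after the initial send fails).
    """
    if retries_done < MAX_RETRIES:
        return ("retry", BACKOFF_SECONDS[retries_done])
    return ("dead_letter", None)

print(next_action(0))
print(next_action(4))
print(next_action(5))
```

A fixed table keeps the schedule easy to audit. Whether to add random jitter to these delays is a separate design decision that the list above does not address.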
The key insight: separate queues per channel enable independent scaling and failure isolation. If Twilio has a 2-hour outage, only SMS notifications queue up. Push and email continue unaffected. With a single queue, a stuck SMS consumer blocks push notifications too: a failure in a $0.01 SMS blocks a free push notification.
- Async: caller publishes event and returns immediately. No blocking.
- Separate queues per channel: push, email, SMS scale and fail independently.
- Priority queues: critical alerts bypass normal queue ordering.
- Deduplication via Redis idempotency keys prevents duplicate sends.
- Retry with exponential backoff. Dead letter queue for persistent failures.
- User preference check happens before queuing β never send what user opted out of.
- Fan-out: security alerts publish to all channel queues simultaneously, not sequentially.
Alternative Approaches
The two dominant architectural choices differ in how you organize the message pipeline. A single unified queue is simpler to operate but creates coupling between channels. Per-channel queues add operational overhead but provide isolation, independent scaling, and channel-specific optimization. The right choice depends on scale and operational maturity.
Single Unified Queue
- One queue, workers read and route by channel field
- Simpler infrastructure: one topic to monitor
- Workers must handle all channel types
- Problem: slow channel blocks fast channels
- Head-of-line blocking: SMS timeout delays push delivery
- Scaling: must scale for slowest channel's throughput
Per-Channel Queues
- Separate queue per channel: push, email, SMS
- Workers specialized per channel, with optimized batching
- Independent scaling: push (100 workers) vs SMS (10)
- Isolation: Twilio outage only affects SMS queue
- More infrastructure: N queues to configure and monitor
- Used by: Most production systems at scale
Pull-Based Consumers
- Workers poll the queue at intervals (SQS long-polling)
- Workers control their own pace, which gives natural backpressure
- Slight latency from polling interval (50-250ms)
- Simple to implement: no connection management
- Good for: email, SMS (latency tolerance)
- Used by: AWS SQS-based architectures
Push-Based Consumers
- Broker pushes messages to workers in real-time
- Lower latency: message delivered as soon as produced
- Need flow control to prevent overwhelming slow consumers
- Connection management complexity (reconnects, heartbeats)
- Good for: push notifications (latency-sensitive)
- Used by: RabbitMQ consumers (Kafka consumer groups, by contrast, pull from the broker)
Per-channel queues win at scale because channels have fundamentally different characteristics. Push notifications are cheap and fast (HTTP/2 multiplexing to APNs). Email has complex rendering and throttling. SMS is expensive ($0.01/msg) and rate-limited by carriers. Forcing them through one pipeline means optimizing for nothing.
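Head-of-line blocking is easy to see in a toy model. The sketch below assumes one worker per queue and made-up service times (a 30-second SMS timeout and 50 ms push sends). The `completion_times` helper is hypothetical and models only a single worker draining a FIFO queue.

```python
def completion_times(jobs):
    """Return (channel, finish_time) for one worker draining a FIFO queue."""
    clock, finished = 0.0, []
    for channel, service_seconds in jobs:
        clock += service_seconds
        finished.append((channel, clock))
    return finished

# Hypothetical workload: one SMS call that hits a 30s timeout, then two pushes.
jobs = [("sms", 30.0), ("push", 0.05), ("push", 0.05)]

# Unified queue: every job waits behind whatever arrived first.
unified = completion_times(jobs)

# Per-channel queues: each channel is drained by its own worker.
per_channel = {
    channel: completion_times([job for job in jobs if job[0] == channel])
    for channel in ("sms", "push")
}

print("unified:    ", unified)
print("per-channel:", per_channel)
```

In the unified case both pushes finish only after the SMS timeout expires. With per-channel queues they finish within a fraction of a second while the SMS call is still pending.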
- Unified queue: simple, but head-of-line blocking across channels. OK for small scale.
- Per-channel queues: isolation, independent scaling, channel-specific optimization. Production standard.
- Pull-based: natural backpressure, slight latency. Good for email/SMS.
- Push-based: real-time delivery, needs flow control. Good for push notifications.
What Real Companies Did
Every major platform has built a notification system, and every one has learned the hard way about duplicate sends, provider outages, and user notification fatigue. Their approaches differ, but common patterns emerge: event-driven architectures, separate processing per channel, and preference engines that sit between event generation and delivery.
LinkedIn
- 1B+ members, millions of notifications/day
- Kafka-based pipeline: event → filter → route → deliver
- Preference engine checks 100+ rules per notification
- Aggregation: "5 people viewed your profile" instead of 5 separate pushes
- ML model predicts optimal send time per user
Facebook / Meta
- Billions of notifications/day across apps
- Custom notification pipeline: "Notiflood" (internal)
- Priority lanes: P0 (security), P1 (social), P2 (marketing)
- Frequency capping prevents notification fatigue
- Device-aware: collapse stale pushes on delivery
Airbnb
- Multi-channel: push, email, SMS per booking lifecycle
- Template service: content rendering separate from delivery
- i18n: notifications in 60+ languages (template per locale)
- Channel preference cascade: try push → fall back to email → SMS
- Published architecture: "Scaling Notifications at Airbnb"
Amazon
- SNS + SQS + Lambda: serverless notification pipeline
- SNS fan-out: one publish triggers push + email + SMS
- Per-channel SQS queues with independent consumers
- DLQ with auto-replay after TTL for transient failures
- Pinpoint for user segmentation and campaign management
| Company | Queue Tech | Priority Tiers | Aggregation | Special Pattern |
|---|---|---|---|---|
| LinkedIn | Kafka | Yes | Yes, ML-timed | Preference engine (100+ rules) |
| Facebook / Meta | Custom (Notiflood) | P0/P1/P2 | Yes, collapse on device | Device-aware stale push collapse |
| Airbnb | Not disclosed | No explicit tiers | No | Channel cascade (push → email → SMS) |
| Amazon | SNS + SQS | No explicit tiers | No | Serverless fan-out via Lambda |
- LinkedIn: Kafka pipeline, ML-based send-time optimization, notification aggregation.
- Facebook: priority lanes (P0/P1/P2), frequency capping, device-aware collapse.
- Airbnb: template service for i18n, channel cascade (push → email → SMS).
- Amazon: SNS fan-out → per-channel SQS → Lambda workers. Serverless at scale.
Best Practices Extracted
Notification systems teach patterns that apply to any system involving multi-channel delivery, user-facing reliability, and asynchronous processing. These are not notification-specific β they are principles of reliable event-driven architectures.
Idempotent Delivery
- Every notification gets a unique idempotency key
- Before sending: check Redis. Already sent?
- After sending: write key with TTL (24h)
- Retry-safe: same key → skip, no duplicate to user
- Transfers to: any at-least-once delivery system
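A minimal sketch of the idempotency check follows, using an in-memory class as a stand-in for Redis. It deliberately differs from the bullets above in one respect: instead of checking before the send and writing after it, it claims the key atomically before sending and releases it if the send fails. This is the behavior Redis offers through `SET key value NX EX seconds`, and it closes the window in which two workers both see "not sent". `IdempotencyStore` and `send_once` are hypothetical names.

```python
import time

class IdempotencyStore:
    """In-memory stand-in for Redis; claim() mimics SET key value NX EX ttl."""

    def __init__(self, ttl_seconds: float = 24 * 3600, clock=time.monotonic):
        self.ttl_seconds = ttl_seconds
        self.clock = clock
        self._expires_at = {}

    def claim(self, key: str) -> bool:
        """Return True if the caller is the first to claim key within the TTL."""
        now = self.clock()
        expiry = self._expires_at.get(key)
        if expiry is not None and expiry > now:
            return False
        self._expires_at[key] = now + self.ttl_seconds
        return True

    def release(self, key: str) -> None:
        """Forget a claim so that a later retry can claim the key again."""
        self._expires_at.pop(key, None)

def send_once(store: IdempotencyStore, key: str, send) -> str:
    """Call send() at most once per key within the TTL window."""
    if not store.claim(key):
        return "skipped"
    try:
        send()
    except Exception:
        store.release(key)  # failed sends must stay retryable
        raise
    return "sent"
```

The 24-hour default mirrors the TTL in the list above. A TTL shorter than the longest retry window would let a late retry through as a duplicate.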
Preference Center
- Centralized service for all user notification preferences
- Per-channel, per-category opt-out (matrix)
- Quiet hours per timezone (user-local time)
- Frequency caps: max N per day/week per category
- Transfers to: any system with user-configurable behavior
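The preference checks can be combined into one decision function. The sketch below is an illustration under simplifying assumptions: the user's timezone is a fixed UTC offset (a real system would more likely store an IANA zone name to handle daylight saving), the caller supplies the count of notifications already sent today, and the `security` category bypasses quiet hours and caps as described for P0 alerts. All names are hypothetical.

```python
from dataclasses import dataclass, field
from datetime import datetime, time, timedelta, timezone

@dataclass
class Preferences:
    opted_out: set = field(default_factory=set)     # {(channel, category)}
    utc_offset_hours: int = 0                       # simplification, see lead-in
    quiet_start: time = time(22, 0)
    quiet_end: time = time(8, 0)
    daily_caps: dict = field(default_factory=dict)  # {category: max per day}

def in_quiet_hours(prefs: Preferences, now_utc: datetime) -> bool:
    """Check the user's local clock against their quiet-hours window."""
    user_tz = timezone(timedelta(hours=prefs.utc_offset_hours))
    local = now_utc.astimezone(user_tz).time()
    if prefs.quiet_start <= prefs.quiet_end:
        return prefs.quiet_start <= local < prefs.quiet_end
    # The window wraps past midnight, e.g. 22:00 to 08:00.
    return local >= prefs.quiet_start or local < prefs.quiet_end

def decide(prefs, channel, category, sent_today, now_utc) -> str:
    """Return 'send', 'defer: ...', or 'drop: ...' for one notification."""
    if (channel, category) in prefs.opted_out:
        return "drop: opted out"
    if category == "security":
        return "send"  # assumed P0 behavior: bypass quiet hours and caps
    if in_quiet_hours(prefs, now_utc):
        return "defer: quiet hours"
    if sent_today >= prefs.daily_caps.get(category, float("inf")):
        return "drop: frequency cap"
    return "send"
```

Deferring during quiet hours and dropping once a cap is reached are policy choices made for this example. Either outcome could reasonably be swapped for the other.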
Priority Queuing
- P0: security alerts (bypass all caps, send immediately)
- P1: transactional (order confirmed, message received)
- P2: engagement (someone liked your post)
- P3: marketing (batched, capped, delayed)
- Transfers to: any system with mixed urgency workloads
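The tier ordering can be illustrated with a single in-process priority queue built on Python's `heapq`. This is a sketch of the ordering only. A production system would more likely use separate broker queues per tier, and the `PriorityDispatcher` class is a hypothetical name.

```python
import heapq
import itertools

class PriorityDispatcher:
    """Serve lower priority numbers first, FIFO within a tier."""

    def __init__(self):
        self._heap = []
        self._seq = itertools.count()  # tie-breaker that preserves arrival order

    def publish(self, priority: int, message: str) -> None:
        heapq.heappush(self._heap, (priority, next(self._seq), message))

    def next_message(self) -> str:
        return heapq.heappop(self._heap)[2]

    def __len__(self) -> int:
        return len(self._heap)

dispatcher = PriorityDispatcher()
dispatcher.publish(3, "promo: new feature")
dispatcher.publish(2, "someone liked your post")
dispatcher.publish(0, "new login to your account")
dispatcher.publish(1, "order confirmed")

while len(dispatcher):
    print(dispatcher.next_message())
```

The security alert is served first even though it was published third. Strict priority like this can starve P3 under sustained load, which is one reason to give each tier its own queue and worker pool.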
- Idempotent sends: Redis-backed dedup keys. Retry-safe without duplicates.
- Preference center: centralized per-channel, per-category, per-timezone control.
- Priority tiers: security alerts never wait behind marketing. Separate queues per priority.
- Aggregation: batch related notifications. Better UX, lower cost, less fatigue.
What Could Go Wrong
Notification failures are uniquely visible to users: a missed password reset email or a duplicate charge notification erodes trust immediately. The failures below are not theoretical: every major platform has experienced them, and the fixes are well-established patterns that separate production-grade systems from prototypes.
Duplicate Notifications
- At-least-once delivery + retry = user gets 2-3 copies
- Especially bad for SMS (user charged per text in some countries)
- Root cause: no idempotency key, or key TTL too short
- Fix: idempotency key in Redis with 24h TTL. Check before every send.
Silent Delivery Failures
- Provider accepts message but never delivers (stale device token)
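- APNs returns 200 OK (request accepted, not delivered) but the token is expired, so the push never arrives
- Team thinks system works; users complain they get nothing
- Fix: treat invalid-token responses as stale-token signals (with the HTTP/2 APNs API, status 410 "Unregistered" replaces the legacy feedback service). Track delivery rate per channel. Alert on drops.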
Notification Fatigue
- Too many notifications → user disables all push permissions
- Once push is disabled, you lose the highest-value channel
- Recovering push permission is nearly impossible
- Fix: frequency caps per user per category. Aggregate related events. A/B test volume thresholds.
Device Token Staleness
- User reinstalls app → old device token invalid
- Millions of stale tokens burn API quota sending to void
- APNs/FCM rate-limit you for too many invalid sends
- Fix: process APNs/FCM feedback. Purge tokens with 3+ consecutive failures. Refresh on app open.
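The purge rule can be sketched as a small registry that counts consecutive failures per token. The `TokenRegistry` class and its methods are hypothetical. How a worker learns that a send failed because of an invalid token depends on the provider and is outside this sketch.

```python
from collections import defaultdict

PURGE_AFTER_FAILURES = 3  # threshold taken from the fix described above

class TokenRegistry:
    def __init__(self):
        self.tokens = {}                   # token -> user_id
        self._failures = defaultdict(int)  # token -> consecutive failures

    def register(self, token: str, user_id: str) -> None:
        """Called on app open: store the token and clear its failure count."""
        self.tokens[token] = user_id
        self._failures.pop(token, None)

    def record_result(self, token: str, delivered: bool) -> bool:
        """Record one send outcome; return True if the token was purged."""
        if token not in self.tokens:
            return False
        if delivered:
            self._failures.pop(token, None)  # a success resets the streak
            return False
        self._failures[token] += 1
        if self._failures[token] >= PURGE_AFTER_FAILURES:
            del self.tokens[token]
            del self._failures[token]
            return True
        return False
```

Resetting the counter on success keeps a token alive through isolated transient failures. Only an unbroken run of failures removes it.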
The most expensive failure is notification fatigue. Technical failures (duplicates, missed sends) are fixable with engineering. Losing push notification permission because you annoyed users is a business problem: once they toggle it off, it stays off. Frequency capping isn't just a feature; it is infrastructure protection for your most valuable delivery channel.
- Duplicates: idempotency keys with TTL. Check before every send attempt.
- Silent failures: track delivery rates. Process provider feedback (stale tokens).
- Notification fatigue: frequency caps + aggregation. Losing push permission is permanent.
- Token staleness: purge invalid tokens. Process APNs/FCM feedback loops.
- Principle: protect the user's attention; it is more valuable than your message.