System Design ยท Case Studies

Case Study: Chat System

Design, trade-offs, and alternatives for a real-time chat system at scale.

01
Chapter One

Problem Statement

What We Are Building

A real-time chat system lets users exchange messages instantly โ€” one-on-one or in groups. The defining challenge is bidirectional, low-latency communication: when Alice sends a message, Bob must see it within milliseconds, not seconds. This is fundamentally different from request-response systems. You need persistent connections, ordered delivery, presence tracking, and offline message queuing โ€” all while keeping millions of concurrent connections alive on stateful servers that cannot simply be load-balanced like stateless HTTP services.

Scale Requirements

Traffic & Scale

  • 500M daily active users
  • 50M concurrent connections at peak
  • 20B messages/day (~230K messages/sec)
  • Average user sends 40 messages/day

Requirements

  • Delivery latency: <200ms for online recipient
  • Ordering: messages in a conversation arrive in send order
  • Offline delivery: queue messages, deliver on reconnect
  • History: persistent, searchable, 30-day+ retention

Chat is a connection-oriented, stateful system. Unlike HTTP APIs where any server can handle any request, a chat server holds an open WebSocket per user. This makes load balancing, failover, and deployment fundamentally harder. You cannot just restart a server โ€” doing so disconnects 100K users simultaneously.

๐Ÿ“‹ Chapter 1 โ€” Summary
  • 50M concurrent WebSocket connections at peak. 230K messages/sec throughput.
  • Sub-200ms delivery latency for online users. Offline queuing for disconnected users.
  • Message ordering within a conversation must be preserved.
  • Stateful servers (WebSocket connections) make deployment and scaling fundamentally different from HTTP.
02
Chapter Two

Questions to Ask

Clarifying Before Designing

A Slack-like workspace chat has fundamentally different requirements than a WhatsApp-like personal messenger. Group size, message retention, encryption model, and whether you show typing indicators all change the architecture in significant ways. Get these wrong and you build the wrong system.

๐Ÿ’ฌ

Message Model

  • 1-on-1 only, or group chat too?
  • Max group size? (10 friends vs 100K server)
  • Message types: text only? images, files, voice?
  • Message editing and deletion supported?
๐ŸŸข

Presence & Indicators

  • Online/offline presence required?
  • Typing indicators? ("Alice is typing...")
  • Read receipts? (double check marks)
  • "Last seen" timestamp?
๐Ÿ”’

Security & Compliance

  • End-to-end encryption? (server cannot read messages)
  • Message retention vs right-to-deletion (GDPR)?
  • Multi-device sync? (phone + desktop + web)
  • Push notifications for offline users?

Group size determines your fan-out strategy. A 5-person group chat: fan out the message to 4 recipients โ€” trivial. A 100K-member Discord server: fan out to 100K connections โ€” you need pub/sub, not point-to-point delivery. The same architecture cannot serve both without a fan-out abstraction layer.

For This Case Study, Our Answers Are:

  • Chat type: 1-on-1 and group (up to 500 members โ€” no massive Discord-style servers)
  • Presence: yes โ€” online/offline + typing indicators
  • Read receipts: yes โ€” per-message, per-user
  • E2E encryption: no โ€” server can read messages (enables search and moderation)
  • Multi-device: yes โ€” sync across phone, desktop, web
  • Retention: 30 days full history, searchable
  • Push notifications: yes โ€” for offline users (integrates with notification system)
  • Scale: 50M concurrent, 230K messages/sec
๐Ÿ“‹ Chapter 2 โ€” Summary
  • 1-on-1 vs group chat: group size changes fan-out strategy entirely.
  • Presence and typing indicators add constant heartbeat traffic โ€” significant at scale.
  • E2E encryption means server stores ciphertext โ€” cannot search, moderate, or index.
  • Multi-device sync requires message ordering across devices per user.
  • Read receipts are Nยฒ messages in a group โ€” expensive to fan out.
03
Chapter Three

Naive Design

HTTP Polling for Messages

The simplest chat design: clients periodically send HTTP requests asking "any new messages for me?" The server queries the database and returns new messages. This is how early web chat worked. It is trivially simple โ€” and catastrophically wasteful. With 50M users polling every 2 seconds, you generate 25M requests/sec, of which 95%+ return "no new messages." You are burning server resources to learn that nothing happened.

Naive Design โ€” HTTP Short Polling
Client (Alice) API Server GET /messages every 2 seconds SELECT * WHERE id > last_seen Database messages table poll every 2s 200 OK (empty) โ€” 95% of polls 200 OK (1 message) โ€” 5% of polls 50M users ร— poll every 2s = 25M requests/sec 95% of polls return nothing โ€” pure waste 2-second polling interval = 2-second message delay (unacceptable for chat) Database queried 25M times/sec for "anything new?" โ€” the DB melts Reduce interval to 100ms for "real-time"? โ†’ 500M req/sec. Impossible.
โœ…

What Works

  • Dead simple โ€” standard HTTP, no special infrastructure
  • Works through firewalls and proxies (plain HTTP)
  • Stateless servers โ€” any server handles any request
  • Fine for <1K concurrent users
๐Ÿ’ฅ

What Breaks

  • 25M empty responses/sec โ€” wasted compute and bandwidth
  • 2s polling = 2s message delay (not "real-time")
  • Database overwhelmed by constant polling queries
  • No presence detection โ€” can't tell who's online
  • Battery drain on mobile (constant HTTP requests)
๐Ÿ“‹ Chapter 3 โ€” Summary
  • HTTP polling: simple but catastrophically wasteful at scale.
  • 25M req/sec with 95% empty โ€” burning server resources for nothing.
  • Polling interval = message delay. Faster polling = more waste.
  • No persistent connection = no presence, no typing indicators, no push delivery.
04
Chapter Four

Refined Design

WebSocket Servers with Pub/Sub Fan-Out

The refined design replaces polling with persistent WebSocket connections. Each user holds exactly one open connection to a chat server. When Alice sends a message, it is written to the database, then published to a message broker (Redis Pub/Sub or Kafka). The broker delivers the message to whichever chat server holds Bob's connection โ€” that server pushes it down Bob's WebSocket instantly. No polling. No wasted requests. Sub-100ms delivery.

Refined Design โ€” WebSocket + Pub/Sub Chat Architecture
Alice Bob WS Gateway / LB sticky routing (IP hash) Chat Server 1 Alice's WS conn Chat Server 2 Bob's WS conn WebSocket WebSocket Session Registry user โ†’ server map register: aliceโ†’s1 lookup: bobโ†’s2? Message Broker Redis Pub/Sub publish deliver Message Store Cassandra / Scylla async persist Offline Queue delivered on reconnect delivered=false Presence Service Redis heartbeats โ‘  Alice sends โ†’ โ‘ก Chat Server 1 persists + publishes โ†’ โ‘ข Broker routes โ‘ฃ Chat Server 2 pushes โ†’ โ‘ค Bob receives Session registry tracks userโ†’server mapping. Offline messages queued and delivered on reconnect.
๐Ÿ“จ

Message Send Flow

  • 1. Alice sends message via her WebSocket connection
  • 2. Chat Server 1 validates, assigns message ID + timestamp
  • 3. Write to message store (async, write-behind)
  • 4. Publish to message broker (Redis Pub/Sub or Kafka)
  • 5. Broker delivers to Chat Server 2 (holds Bob's connection)
  • 6. Chat Server 2 pushes down Bob's WebSocket โ€” instant delivery
๐Ÿ“ต

Offline Delivery

  • Session registry shows Bob is offline โ€” no active server
  • Message persisted to DB with delivered=false
  • Push notification sent via notification system
  • On reconnect: Bob's new server queries undelivered messages
  • Messages delivered in order, marked as delivered

The session registry is the critical piece. When a message arrives for Bob, the system must know: (1) Is Bob online? (2) Which chat server holds Bob's connection? The session registry (Redis hash: user_id โ†’ server_id) answers both questions in one lookup. Without it, you must broadcast every message to every server โ€” O(N) fan-out instead of O(1) targeted delivery.

๐Ÿ“‹ Chapter 4 โ€” Summary
  • WebSocket connections: persistent, bidirectional, sub-100ms delivery.
  • Message broker (Redis Pub/Sub): routes messages between chat servers holding different users.
  • Session registry: maps user โ†’ server for targeted delivery (not broadcast).
  • Offline queue: undelivered messages persisted, delivered on reconnect.
  • Message store: write-behind to Cassandra/Scylla for history and search.
  • Presence service: heartbeat-based online/offline detection.
05
Chapter Five

Alternative Approaches

Connection & Delivery Strategies

WebSocket is the dominant protocol for real-time chat, but it is not the only option. Long polling was the standard before WebSocket support was universal. Server-Sent Events (SSE) offer a simpler one-way channel. Each has trade-offs in latency, complexity, mobile friendliness, and infrastructure cost.

WebSocket (Bidirectional)
Long Polling (HTTP-based)
  • Full-duplex: client and server send anytime
  • Single TCP connection per user โ€” minimal overhead
  • Sub-100ms latency for message delivery
  • Requires sticky sessions (connection = state)
  • Proxies/firewalls may terminate idle connections
  • Used by: WhatsApp Web, Slack, Discord
  • Client opens HTTP request; server holds until new message or timeout
  • Works through all proxies and firewalls (plain HTTP)
  • ~100-500ms latency (connection setup overhead)
  • Each response requires new connection โ€” overhead per message
  • Server holds open connection = thread/memory per user
  • Used by: Facebook Chat (early), fallback for WebSocket
Server-Sent Events (SSE)
MQTT (IoT Protocol)
  • Server โ†’ client only (one-way push)
  • Client sends via regular HTTP POST
  • Auto-reconnect built into browser EventSource API
  • Simpler than WebSocket โ€” no handshake upgrade
  • Good for: notifications, live feeds (not bidirectional chat)
  • Limitation: 6 connection limit per domain in HTTP/1.1
  • Lightweight pub/sub protocol โ€” tiny packet overhead
  • QoS levels: 0 (fire&forget), 1 (at-least-once), 2 (exactly-once)
  • Built for low-bandwidth, unreliable networks (mobile, IoT)
  • Persistent sessions: broker queues messages for offline clients
  • Good for: mobile-first chat, IoT messaging
  • Used by: Facebook Messenger (mobile), IoT platforms

WebSocket is the production standard for web and desktop chat. MQTT wins for mobile-first apps on unreliable networks (3G, frequent disconnects) because of its tiny overhead and built-in session resumption. In practice, most chat systems use WebSocket for web/desktop and push notifications (APNs/FCM) for mobile offline delivery.

Protocol Trade-offs: Latency vs Complexity
Implementation Complexity โ†’ Low High Delivery Latency โ†’ Low High MQTT mobile-optimized SSE serverโ†’client only Long Polling HTTP-based fallback Web Socket โ˜… production choice for this case study
๐Ÿ“‹ Chapter 5 โ€” Summary
  • WebSocket: bidirectional, low-latency, production standard. Requires sticky sessions.
  • Long polling: fallback for restricted environments. Higher latency, more overhead.
  • SSE: server-to-client only. Good for feeds, not bidirectional chat.
  • MQTT: lightweight pub/sub, mobile-optimized. Facebook Messenger uses it on mobile.
06
Chapter Six

What Real Companies Did

Production Chat Systems

Real-time chat at billion-user scale has been solved by multiple companies โ€” each making different trade-offs based on their constraints. WhatsApp optimized for minimal infrastructure. Slack optimized for searchability and integrations. Discord optimized for massive concurrent voice + text rooms. Their architectures are strikingly different.

๐Ÿ“ฑ

WhatsApp

  • 2B+ users, 100B+ messages/day
  • Erlang/OTP for connection handling (~2M connections/server)
  • XMPP-based protocol (customized)
  • End-to-end encryption (Signal Protocol) โ€” server stores ciphertext
  • Only 50 engineers for 2B users โ€” efficiency through Erlang concurrency
๐Ÿ’ผ

Slack

  • WebSocket for real-time + REST API for history
  • MySQL + Vitess for message storage (sharded by workspace)
  • Full-text search across message history (Elasticsearch)
  • Channel-based pub/sub: subscribe to channels, not individual users
  • Flannel: custom edge cache for connection management
๐ŸŽฎ

Discord

  • 150M+ monthly active users, millions concurrent in voice
  • Elixir/Erlang for WebSocket gateway (high concurrency)
  • Cassandra โ†’ migrated to ScyllaDB for message storage
  • Servers (guilds) up to 1M members โ€” massive fan-out challenge
  • Lazy loading: only deliver messages for channels user is viewing
๐Ÿ“ง

Facebook Messenger

  • 1B+ users, MQTT for mobile, WebSocket for web
  • Iris: ordered message queue per user (like a personal Kafka topic)
  • Messages stored in HBase (wide-column, append-only)
  • Multi-device: Iris pointer tracks per-device read position
  • Zero-downtime deploys via connection draining
Production Chat Systems โ€” Comparison
Company Protocol Storage Special Pattern Max Scale WhatsApp XMPP over WebSocket Custom (Erlang Mnesia) Erlang 2M conn/server, E2E encryption 2B users Slack WebSocket + REST MySQL/Vitess (by workspace) Flannel edge cache, full-text search 20M DAU Discord WebSocket (Elixir) ScyllaDB Lazy loading for 1M-member servers 150M MAU Messenger MQTT (mobile) + WebSocket (web) HBase Iris ordered queue, per-device read pointer 1B users
๐Ÿ“‹ Chapter 6 โ€” Summary
  • WhatsApp: Erlang for 2M conn/server, E2E encryption, 50 engineers for 2B users.
  • Slack: WebSocket + REST, MySQL/Vitess sharded by workspace, full-text search.
  • Discord: Elixir gateway, ScyllaDB storage, lazy loading for 1M-member servers.
  • Messenger: MQTT (mobile) + WebSocket (web), Iris ordered queue, HBase storage.
07
Chapter Seven

Best Practices Extracted

Transferable Lessons

Chat systems push you to solve problems that appear in every real-time, connection-oriented system: connection lifecycle management, ordered delivery over unreliable networks, and presence tracking at scale. These patterns transfer to live collaboration tools, multiplayer games, real-time dashboards, and any system where "push" beats "poll."

๐Ÿ”—

Connection Management

  • Heartbeat interval: 30s client โ†’ server (detect dead connections)
  • Graceful drain on deploy: stop accepting new, finish existing
  • Reconnect with exponential backoff + jitter
  • Session resumption: reconnect without re-fetching all state
  • Transfers to: any WebSocket/long-lived connection system
๐Ÿ”ข

Message Ordering

  • Server-assigned sequence number per conversation
  • Client detects gaps: "I have msg 5, 6, 8 โ€” where's 7?"
  • Request missing messages on gap detection
  • Timestamp + sequence for total ordering
  • Transfers to: any ordered event stream system
๐Ÿ†”

Client-Generated IDs

  • Client generates UUID for each message before sending
  • Server uses it as idempotency key โ€” retry-safe
  • If network drops after send: client retries with same ID
  • Server deduplicates: same ID = already processed
  • Transfers to: any at-least-once delivery system

Graceful deployment is the hardest operational challenge in chat systems. You cannot just restart a server โ€” it holds 100K active WebSocket connections. You must: (1) stop routing new connections to it, (2) wait for existing connections to naturally close or timeout, (3) drain remaining connections โ€” each client auto-reconnects to a different server. This "connection draining" takes minutes, not seconds.

Graceful Deployment: Connection Draining Flow
Deploy triggered LB: stop routing NEW connections to Server A Wait: existing connections drain (clients disconnect naturally or timeout) Force-close remaining โ†’ clients reconnect to B/C Server A: 0 connections โ†’ safe to restart Bring Server A back โ†’ LB resumes routing Total drain time: 2โ€“5 minutes (not seconds)
๐Ÿ“‹ Chapter 7 โ€” Summary
  • Heartbeats: 30s interval detects dead connections without wasting bandwidth.
  • Connection draining: graceful deploy = drain existing connections, route new ones elsewhere.
  • Message ordering: server-assigned sequence numbers + client gap detection.
  • Client-generated IDs: retry-safe message sending. Dedup on server.
08
Chapter Eight

What Could Go Wrong

Common Failure Patterns

Chat system failures are immediately visible โ€” users see messages out of order, presence shows "online" when someone is offline, or messages simply vanish. These failures erode trust in the platform faster than almost any other system failure because users notice instantly. Every pattern below has happened in production at major chat platforms.

๐Ÿ”€

Message Ordering Violations

  • Alice sends "Are you free?" then "For dinner tonight?"
  • Bob sees "For dinner tonight?" first โ€” context lost
  • Root cause: messages routed through different servers with clock skew
  • Fix: per-conversation sequence numbers assigned by a single coordinator. Lamport timestamps for distributed ordering.
๐Ÿ‘ป

Presence Inaccuracy

  • User closes laptop lid โ€” connection drops silently
  • Presence still shows "online" until heartbeat timeout (30-60s)
  • Or: user on flaky mobile โ€” rapidly toggles online/offline
  • Fix: heartbeat timeout + grace period. Debounce presence changes (wait 15s before broadcasting "offline").
โšก

Thundering Herd on Reconnect

  • Chat server crashes โ†’ 100K users reconnect simultaneously
  • All hit the same gateway โ†’ new server overwhelmed
  • Each reconnect triggers "fetch unread messages" โ†’ DB flooded
  • Fix: reconnect with random jitter (0-30s), connection rate limiting at gateway, progressive history loading.
๐Ÿ”’

Split Brain in Message Delivery

  • Network partition: broker thinks user is on Server A, actually on Server B
  • Messages routed to wrong server โ€” user doesn't receive them
  • Session registry stale after partition heals
  • Fix: session registry TTL with frequent refresh. Delivery acknowledgment: if no ACK in 5s, re-route via push notification.

The thundering herd is the most common chat outage pattern. One server dies โ†’ 100K reconnects โ†’ next server dies โ†’ 200K reconnects โ†’ cascading failure. The fix is simple but critical: random jitter on reconnect (each client waits 0-30 seconds randomly), connection rate limiting at the gateway, and circuit breaking when new connection rate exceeds capacity.

Thundering Herd: Without vs With Reconnect Jitter
Reconnects/sec No Jitter (Cascade Failure) capacity limit T=0 100K T=1 200K T=2 300K TOTAL OUTAGE With Jitter (0โ€“30s Window) capacity ~3K connections/sec (spread evenly) T=0 T=30s โœ“ Stays healthy well below capacity โ€” no cascade
๐Ÿ“‹ Chapter 8 โ€” Summary
  • Message ordering: per-conversation sequence numbers. Never rely on timestamps alone.
  • Presence ghost: heartbeat timeout + debounce. Don't broadcast rapid online/offline flips.
  • Thundering herd: random reconnect jitter + rate limiting at gateway. Prevent cascade.
  • Split brain: session registry TTL + delivery ACK. Re-route unacknowledged messages.
  • Principle: design for the reconnect storm, not just the steady state.