System Design · Case Studies

Case Study: Chat System

Design, trade-offs, and alternatives for a real-time chat system at scale.

Chapter One

Problem Statement

What We Are Building

A real-time chat system lets users exchange messages instantly — one-on-one or in groups. The defining challenge is bidirectional, low-latency communication: when Alice sends a message, Bob must see it within milliseconds, not seconds. This is fundamentally different from request-response systems. You need persistent connections, ordered delivery, presence tracking, and offline message queuing — all while keeping millions of concurrent connections alive on stateful servers that cannot simply be load-balanced like stateless HTTP services.

Scale Requirements

Traffic & Scale

500M daily active users
50M concurrent connections at peak
20B messages/day (~230K messages/sec)
Average user sends 40 messages/day

Requirements

Delivery latency: <200ms for online recipient
Ordering: messages in a conversation arrive in send order
Offline delivery: queue messages, deliver on reconnect
History: persistent, searchable, 30-day+ retention

Chat is a connection-oriented, stateful system. Unlike HTTP APIs where any server can handle any request, a chat server holds an open WebSocket per user. This makes load balancing, failover, and deployment fundamentally harder. You cannot just restart a server — doing so disconnects 100K users simultaneously.

📋 Chapter 1 — Summary

50M concurrent WebSocket connections at peak. 230K messages/sec throughput.
Sub-200ms delivery latency for online users. Offline queuing for disconnected users.
Message ordering within a conversation must be preserved.
Stateful servers (WebSocket connections) make deployment and scaling fundamentally different from HTTP.

Chapter Two

Questions to Ask

Clarifying Before Designing

A Slack-like workspace chat has fundamentally different requirements than a WhatsApp-like personal messenger. Group size, message retention, encryption model, and whether you show typing indicators all change the architecture in significant ways. Get these wrong and you build the wrong system.

💬

Message Model

1-on-1 only, or group chat too?
Max group size? (10 friends vs 100K server)
Message types: text only? images, files, voice?
Message editing and deletion supported?

🟢

Presence & Indicators

Online/offline presence required?
Typing indicators? ("Alice is typing...")
Read receipts? (double check marks)
"Last seen" timestamp?

🔒

Security & Compliance

End-to-end encryption? (server cannot read messages)
Message retention vs right-to-deletion (GDPR)?
Multi-device sync? (phone + desktop + web)
Push notifications for offline users?

Group size determines your fan-out strategy. A 5-person group chat: fan out the message to 4 recipients — trivial. A 100K-member Discord server: fan out to 100K connections — you need pub/sub, not point-to-point delivery. The same architecture cannot serve both without a fan-out abstraction layer.

For This Case Study, Our Answers Are:

Chat type: 1-on-1 and group (up to 500 members — no massive Discord-style servers)
Presence: yes — online/offline + typing indicators
Read receipts: yes — per-message, per-user
E2E encryption: no — server can read messages (enables search and moderation)
Multi-device: yes — sync across phone, desktop, web
Retention: 30 days full history, searchable
Push notifications: yes — for offline users (integrates with notification system)
Scale: 50M concurrent, 230K messages/sec

📋 Chapter 2 — Summary

1-on-1 vs group chat: group size changes fan-out strategy entirely.
Presence and typing indicators add constant heartbeat traffic — significant at scale.
E2E encryption means server stores ciphertext — cannot search, moderate, or index.
Multi-device sync requires message ordering across devices per user.
Read receipts are N² messages in a group — expensive to fan out.

Chapter Three

Naive Design

HTTP Polling for Messages

The simplest chat design: clients periodically send HTTP requests asking "any new messages for me?" The server queries the database and returns new messages. This is how early web chat worked. It is trivially simple — and catastrophically wasteful. With 50M users polling every 2 seconds, you generate 25M requests/sec, of which 95%+ return "no new messages." You are burning server resources to learn that nothing happened.

Naive Design — HTTP Short Polling

✅

What Works

Dead simple — standard HTTP, no special infrastructure
Works through firewalls and proxies (plain HTTP)
Stateless servers — any server handles any request
Fine for <1K concurrent users

💥

What Breaks

25M empty responses/sec — wasted compute and bandwidth
2s polling = 2s message delay (not "real-time")
Database overwhelmed by constant polling queries
No presence detection — can't tell who's online
Battery drain on mobile (constant HTTP requests)

📋 Chapter 3 — Summary

HTTP polling: simple but catastrophically wasteful at scale.
25M req/sec with 95% empty — burning server resources for nothing.
Polling interval = message delay. Faster polling = more waste.
No persistent connection = no presence, no typing indicators, no push delivery.

Chapter Four

Refined Design

WebSocket Servers with Pub/Sub Fan-Out

The refined design replaces polling with persistent WebSocket connections. Each user holds exactly one open connection to a chat server. When Alice sends a message, it is written to the database, then published to a message broker (Redis Pub/Sub or Kafka). The broker delivers the message to whichever chat server holds Bob's connection — that server pushes it down Bob's WebSocket instantly. No polling. No wasted requests. Sub-100ms delivery.

Refined Design — WebSocket + Pub/Sub Chat Architecture

📨

Message Send Flow

1. Alice sends message via her WebSocket connection
2. Chat Server 1 validates, assigns message ID + timestamp
3. Write to message store (async, write-behind)
4. Publish to message broker (Redis Pub/Sub or Kafka)
5. Broker delivers to Chat Server 2 (holds Bob's connection)
6. Chat Server 2 pushes down Bob's WebSocket — instant delivery

📵

Offline Delivery

Session registry shows Bob is offline — no active server
Message persisted to DB with delivered=false
Push notification sent via notification system
On reconnect: Bob's new server queries undelivered messages
Messages delivered in order, marked as delivered

The session registry is the critical piece. When a message arrives for Bob, the system must know: (1) Is Bob online? (2) Which chat server holds Bob's connection? The session registry (Redis hash: user_id → server_id) answers both questions in one lookup. Without it, you must broadcast every message to every server — O(N) fan-out instead of O(1) targeted delivery.

📋 Chapter 4 — Summary

WebSocket connections: persistent, bidirectional, sub-100ms delivery.
Message broker (Redis Pub/Sub): routes messages between chat servers holding different users.
Session registry: maps user → server for targeted delivery (not broadcast).
Offline queue: undelivered messages persisted, delivered on reconnect.
Message store: write-behind to Cassandra/Scylla for history and search.
Presence service: heartbeat-based online/offline detection.

Chapter Five

Alternative Approaches

Connection & Delivery Strategies

WebSocket is the dominant protocol for real-time chat, but it is not the only option. Long polling was the standard before WebSocket support was universal. Server-Sent Events (SSE) offer a simpler one-way channel. Each has trade-offs in latency, complexity, mobile friendliness, and infrastructure cost.

WebSocket (Bidirectional)

Long Polling (HTTP-based)

Full-duplex: client and server send anytime
Single TCP connection per user — minimal overhead
Sub-100ms latency for message delivery
Requires sticky sessions (connection = state)
Proxies/firewalls may terminate idle connections
Used by: WhatsApp Web, Slack, Discord

Client opens HTTP request; server holds until new message or timeout
Works through all proxies and firewalls (plain HTTP)
~100-500ms latency (connection setup overhead)
Each response requires new connection — overhead per message
Server holds open connection = thread/memory per user
Used by: Facebook Chat (early), fallback for WebSocket

Server-Sent Events (SSE)

MQTT (IoT Protocol)

Server → client only (one-way push)
Client sends via regular HTTP POST
Auto-reconnect built into browser EventSource API
Simpler than WebSocket — no handshake upgrade
Good for: notifications, live feeds (not bidirectional chat)
Limitation: 6 connection limit per domain in HTTP/1.1

Lightweight pub/sub protocol — tiny packet overhead
QoS levels: 0 (fire&forget), 1 (at-least-once), 2 (exactly-once)
Built for low-bandwidth, unreliable networks (mobile, IoT)
Persistent sessions: broker queues messages for offline clients
Good for: mobile-first chat, IoT messaging
Used by: Facebook Messenger (mobile), IoT platforms

WebSocket is the production standard for web and desktop chat. MQTT wins for mobile-first apps on unreliable networks (3G, frequent disconnects) because of its tiny overhead and built-in session resumption. In practice, most chat systems use WebSocket for web/desktop and push notifications (APNs/FCM) for mobile offline delivery.

Protocol Trade-offs: Latency vs Complexity

📋 Chapter 5 — Summary

WebSocket: bidirectional, low-latency, production standard. Requires sticky sessions.
Long polling: fallback for restricted environments. Higher latency, more overhead.
SSE: server-to-client only. Good for feeds, not bidirectional chat.
MQTT: lightweight pub/sub, mobile-optimized. Facebook Messenger uses it on mobile.

Chapter Six

What Real Companies Did

Production Chat Systems

Real-time chat at billion-user scale has been solved by multiple companies — each making different trade-offs based on their constraints. WhatsApp optimized for minimal infrastructure. Slack optimized for searchability and integrations. Discord optimized for massive concurrent voice + text rooms. Their architectures are strikingly different.

📱

2B+ users, 100B+ messages/day
Erlang/OTP for connection handling (~2M connections/server)
XMPP-based protocol (customized)
End-to-end encryption (Signal Protocol) — server stores ciphertext
Only 50 engineers for 2B users — efficiency through Erlang concurrency

💼

Slack

WebSocket for real-time + REST API for history
MySQL + Vitess for message storage (sharded by workspace)
Full-text search across message history (Elasticsearch)
Channel-based pub/sub: subscribe to channels, not individual users
Flannel: custom edge cache for connection management

🎮

Discord

150M+ monthly active users, millions concurrent in voice
Elixir/Erlang for WebSocket gateway (high concurrency)
Cassandra → migrated to ScyllaDB for message storage
Servers (guilds) up to 1M members — massive fan-out challenge
Lazy loading: only deliver messages for channels user is viewing

📧

Facebook Messenger

1B+ users, MQTT for mobile, WebSocket for web
Iris: ordered message queue per user (like a personal Kafka topic)
Messages stored in HBase (wide-column, append-only)
Multi-device: Iris pointer tracks per-device read position
Zero-downtime deploys via connection draining

Production Chat Systems — Comparison

📋 Chapter 6 — Summary

WhatsApp: Erlang for 2M conn/server, E2E encryption, 50 engineers for 2B users.
Slack: WebSocket + REST, MySQL/Vitess sharded by workspace, full-text search.
Discord: Elixir gateway, ScyllaDB storage, lazy loading for 1M-member servers.
Messenger: MQTT (mobile) + WebSocket (web), Iris ordered queue, HBase storage.

Chapter Seven

Best Practices Extracted

Transferable Lessons

Chat systems push you to solve problems that appear in every real-time, connection-oriented system: connection lifecycle management, ordered delivery over unreliable networks, and presence tracking at scale. These patterns transfer to live collaboration tools, multiplayer games, real-time dashboards, and any system where "push" beats "poll."

🔗

Connection Management

Heartbeat interval: 30s client → server (detect dead connections)
Graceful drain on deploy: stop accepting new, finish existing
Reconnect with exponential backoff + jitter
Session resumption: reconnect without re-fetching all state
Transfers to: any WebSocket/long-lived connection system

🔢

Message Ordering

Server-assigned sequence number per conversation
Client detects gaps: "I have msg 5, 6, 8 — where's 7?"
Request missing messages on gap detection
Timestamp + sequence for total ordering
Transfers to: any ordered event stream system

🆔

Client-Generated IDs

Client generates UUID for each message before sending
Server uses it as idempotency key — retry-safe
If network drops after send: client retries with same ID
Server deduplicates: same ID = already processed
Transfers to: any at-least-once delivery system

Graceful deployment is the hardest operational challenge in chat systems. You cannot just restart a server — it holds 100K active WebSocket connections. You must: (1) stop routing new connections to it, (2) wait for existing connections to naturally close or timeout, (3) drain remaining connections — each client auto-reconnects to a different server. This "connection draining" takes minutes, not seconds.

Graceful Deployment: Connection Draining Flow

📋 Chapter 7 — Summary

Heartbeats: 30s interval detects dead connections without wasting bandwidth.
Connection draining: graceful deploy = drain existing connections, route new ones elsewhere.
Message ordering: server-assigned sequence numbers + client gap detection.
Client-generated IDs: retry-safe message sending. Dedup on server.

Chapter Eight

What Could Go Wrong

Common Failure Patterns

Chat system failures are immediately visible — users see messages out of order, presence shows "online" when someone is offline, or messages simply vanish. These failures erode trust in the platform faster than almost any other system failure because users notice instantly. Every pattern below has happened in production at major chat platforms.

🔀

Message Ordering Violations

Alice sends "Are you free?" then "For dinner tonight?"
Bob sees "For dinner tonight?" first — context lost
Root cause: messages routed through different servers with clock skew
Fix: per-conversation sequence numbers assigned by a single coordinator. Lamport timestamps for distributed ordering.

👻

Presence Inaccuracy

User closes laptop lid — connection drops silently
Presence still shows "online" until heartbeat timeout (30-60s)
Or: user on flaky mobile — rapidly toggles online/offline
Fix: heartbeat timeout + grace period. Debounce presence changes (wait 15s before broadcasting "offline").

⚡

Thundering Herd on Reconnect

Chat server crashes → 100K users reconnect simultaneously
All hit the same gateway → new server overwhelmed
Each reconnect triggers "fetch unread messages" → DB flooded
Fix: reconnect with random jitter (0-30s), connection rate limiting at gateway, progressive history loading.

🔒

Split Brain in Message Delivery

Network partition: broker thinks user is on Server A, actually on Server B
Messages routed to wrong server — user doesn't receive them
Session registry stale after partition heals
Fix: session registry TTL with frequent refresh. Delivery acknowledgment: if no ACK in 5s, re-route via push notification.

The thundering herd is the most common chat outage pattern. One server dies → 100K reconnects → next server dies → 200K reconnects → cascading failure. The fix is simple but critical: random jitter on reconnect (each client waits 0-30 seconds randomly), connection rate limiting at the gateway, and circuit breaking when new connection rate exceeds capacity.

Thundering Herd: Without vs With Reconnect Jitter

📋 Chapter 8 — Summary

Message ordering: per-conversation sequence numbers. Never rely on timestamps alone.
Presence ghost: heartbeat timeout + debounce. Don't broadcast rapid online/offline flips.
Thundering herd: random reconnect jitter + rate limiting at gateway. Prevent cascade.
Split brain: session registry TTL + delivery ACK. Re-route unacknowledged messages.
Principle: design for the reconnect storm, not just the steady state.

← Notification System Video Streaming →