Case Study: Chat System
Design, trade-offs, and alternatives for a real-time chat system at scale.
Problem Statement
A real-time chat system lets users exchange messages instantly โ one-on-one or in groups. The defining challenge is bidirectional, low-latency communication: when Alice sends a message, Bob must see it within milliseconds, not seconds. This is fundamentally different from request-response systems. You need persistent connections, ordered delivery, presence tracking, and offline message queuing โ all while keeping millions of concurrent connections alive on stateful servers that cannot simply be load-balanced like stateless HTTP services.
Traffic & Scale
- 500M daily active users
- 50M concurrent connections at peak
- 20B messages/day (~230K messages/sec)
- Average user sends 40 messages/day
Requirements
- Delivery latency: <200ms for online recipient
- Ordering: messages in a conversation arrive in send order
- Offline delivery: queue messages, deliver on reconnect
- History: persistent, searchable, 30-day+ retention
Chat is a connection-oriented, stateful system. Unlike HTTP APIs where any server can handle any request, a chat server holds an open WebSocket per user. This makes load balancing, failover, and deployment fundamentally harder. You cannot just restart a server โ doing so disconnects 100K users simultaneously.
- 50M concurrent WebSocket connections at peak. 230K messages/sec throughput.
- Sub-200ms delivery latency for online users. Offline queuing for disconnected users.
- Message ordering within a conversation must be preserved.
- Stateful servers (WebSocket connections) make deployment and scaling fundamentally different from HTTP.
Questions to Ask
A Slack-like workspace chat has fundamentally different requirements than a WhatsApp-like personal messenger. Group size, message retention, encryption model, and whether you show typing indicators all change the architecture in significant ways. Get these wrong and you build the wrong system.
Message Model
- 1-on-1 only, or group chat too?
- Max group size? (10 friends vs 100K server)
- Message types: text only? images, files, voice?
- Message editing and deletion supported?
Presence & Indicators
- Online/offline presence required?
- Typing indicators? ("Alice is typing...")
- Read receipts? (double check marks)
- "Last seen" timestamp?
Security & Compliance
- End-to-end encryption? (server cannot read messages)
- Message retention vs right-to-deletion (GDPR)?
- Multi-device sync? (phone + desktop + web)
- Push notifications for offline users?
Group size determines your fan-out strategy. A 5-person group chat: fan out the message to 4 recipients โ trivial. A 100K-member Discord server: fan out to 100K connections โ you need pub/sub, not point-to-point delivery. The same architecture cannot serve both without a fan-out abstraction layer.
For This Case Study, Our Answers Are:
- Chat type: 1-on-1 and group (up to 500 members โ no massive Discord-style servers)
- Presence: yes โ online/offline + typing indicators
- Read receipts: yes โ per-message, per-user
- E2E encryption: no โ server can read messages (enables search and moderation)
- Multi-device: yes โ sync across phone, desktop, web
- Retention: 30 days full history, searchable
- Push notifications: yes โ for offline users (integrates with notification system)
- Scale: 50M concurrent, 230K messages/sec
- 1-on-1 vs group chat: group size changes fan-out strategy entirely.
- Presence and typing indicators add constant heartbeat traffic โ significant at scale.
- E2E encryption means server stores ciphertext โ cannot search, moderate, or index.
- Multi-device sync requires message ordering across devices per user.
- Read receipts are Nยฒ messages in a group โ expensive to fan out.
Naive Design
The simplest chat design: clients periodically send HTTP requests asking "any new messages for me?" The server queries the database and returns new messages. This is how early web chat worked. It is trivially simple โ and catastrophically wasteful. With 50M users polling every 2 seconds, you generate 25M requests/sec, of which 95%+ return "no new messages." You are burning server resources to learn that nothing happened.
What Works
- Dead simple โ standard HTTP, no special infrastructure
- Works through firewalls and proxies (plain HTTP)
- Stateless servers โ any server handles any request
- Fine for <1K concurrent users
What Breaks
- 25M empty responses/sec โ wasted compute and bandwidth
- 2s polling = 2s message delay (not "real-time")
- Database overwhelmed by constant polling queries
- No presence detection โ can't tell who's online
- Battery drain on mobile (constant HTTP requests)
- HTTP polling: simple but catastrophically wasteful at scale.
- 25M req/sec with 95% empty โ burning server resources for nothing.
- Polling interval = message delay. Faster polling = more waste.
- No persistent connection = no presence, no typing indicators, no push delivery.
Refined Design
The refined design replaces polling with persistent WebSocket connections. Each user holds exactly one open connection to a chat server. When Alice sends a message, it is written to the database, then published to a message broker (Redis Pub/Sub or Kafka). The broker delivers the message to whichever chat server holds Bob's connection โ that server pushes it down Bob's WebSocket instantly. No polling. No wasted requests. Sub-100ms delivery.
Message Send Flow
- 1. Alice sends message via her WebSocket connection
- 2. Chat Server 1 validates, assigns message ID + timestamp
- 3. Write to message store (async, write-behind)
- 4. Publish to message broker (Redis Pub/Sub or Kafka)
- 5. Broker delivers to Chat Server 2 (holds Bob's connection)
- 6. Chat Server 2 pushes down Bob's WebSocket โ instant delivery
Offline Delivery
- Session registry shows Bob is offline โ no active server
- Message persisted to DB with
delivered=false - Push notification sent via notification system
- On reconnect: Bob's new server queries undelivered messages
- Messages delivered in order, marked as delivered
The session registry is the critical piece. When a message arrives for Bob, the system must know: (1) Is Bob online? (2) Which chat server holds Bob's connection? The session registry (Redis hash: user_id โ server_id) answers both questions in one lookup. Without it, you must broadcast every message to every server โ O(N) fan-out instead of O(1) targeted delivery.
- WebSocket connections: persistent, bidirectional, sub-100ms delivery.
- Message broker (Redis Pub/Sub): routes messages between chat servers holding different users.
- Session registry: maps user โ server for targeted delivery (not broadcast).
- Offline queue: undelivered messages persisted, delivered on reconnect.
- Message store: write-behind to Cassandra/Scylla for history and search.
- Presence service: heartbeat-based online/offline detection.
Alternative Approaches
WebSocket is the dominant protocol for real-time chat, but it is not the only option. Long polling was the standard before WebSocket support was universal. Server-Sent Events (SSE) offer a simpler one-way channel. Each has trade-offs in latency, complexity, mobile friendliness, and infrastructure cost.
- Full-duplex: client and server send anytime
- Single TCP connection per user โ minimal overhead
- Sub-100ms latency for message delivery
- Requires sticky sessions (connection = state)
- Proxies/firewalls may terminate idle connections
- Used by: WhatsApp Web, Slack, Discord
- Client opens HTTP request; server holds until new message or timeout
- Works through all proxies and firewalls (plain HTTP)
- ~100-500ms latency (connection setup overhead)
- Each response requires new connection โ overhead per message
- Server holds open connection = thread/memory per user
- Used by: Facebook Chat (early), fallback for WebSocket
- Server โ client only (one-way push)
- Client sends via regular HTTP POST
- Auto-reconnect built into browser EventSource API
- Simpler than WebSocket โ no handshake upgrade
- Good for: notifications, live feeds (not bidirectional chat)
- Limitation: 6 connection limit per domain in HTTP/1.1
- Lightweight pub/sub protocol โ tiny packet overhead
- QoS levels: 0 (fire&forget), 1 (at-least-once), 2 (exactly-once)
- Built for low-bandwidth, unreliable networks (mobile, IoT)
- Persistent sessions: broker queues messages for offline clients
- Good for: mobile-first chat, IoT messaging
- Used by: Facebook Messenger (mobile), IoT platforms
WebSocket is the production standard for web and desktop chat. MQTT wins for mobile-first apps on unreliable networks (3G, frequent disconnects) because of its tiny overhead and built-in session resumption. In practice, most chat systems use WebSocket for web/desktop and push notifications (APNs/FCM) for mobile offline delivery.
- WebSocket: bidirectional, low-latency, production standard. Requires sticky sessions.
- Long polling: fallback for restricted environments. Higher latency, more overhead.
- SSE: server-to-client only. Good for feeds, not bidirectional chat.
- MQTT: lightweight pub/sub, mobile-optimized. Facebook Messenger uses it on mobile.
What Real Companies Did
Real-time chat at billion-user scale has been solved by multiple companies โ each making different trade-offs based on their constraints. WhatsApp optimized for minimal infrastructure. Slack optimized for searchability and integrations. Discord optimized for massive concurrent voice + text rooms. Their architectures are strikingly different.
- 2B+ users, 100B+ messages/day
- Erlang/OTP for connection handling (~2M connections/server)
- XMPP-based protocol (customized)
- End-to-end encryption (Signal Protocol) โ server stores ciphertext
- Only 50 engineers for 2B users โ efficiency through Erlang concurrency
Slack
- WebSocket for real-time + REST API for history
- MySQL + Vitess for message storage (sharded by workspace)
- Full-text search across message history (Elasticsearch)
- Channel-based pub/sub: subscribe to channels, not individual users
- Flannel: custom edge cache for connection management
Discord
- 150M+ monthly active users, millions concurrent in voice
- Elixir/Erlang for WebSocket gateway (high concurrency)
- Cassandra โ migrated to ScyllaDB for message storage
- Servers (guilds) up to 1M members โ massive fan-out challenge
- Lazy loading: only deliver messages for channels user is viewing
Facebook Messenger
- 1B+ users, MQTT for mobile, WebSocket for web
- Iris: ordered message queue per user (like a personal Kafka topic)
- Messages stored in HBase (wide-column, append-only)
- Multi-device: Iris pointer tracks per-device read position
- Zero-downtime deploys via connection draining
- WhatsApp: Erlang for 2M conn/server, E2E encryption, 50 engineers for 2B users.
- Slack: WebSocket + REST, MySQL/Vitess sharded by workspace, full-text search.
- Discord: Elixir gateway, ScyllaDB storage, lazy loading for 1M-member servers.
- Messenger: MQTT (mobile) + WebSocket (web), Iris ordered queue, HBase storage.
Best Practices Extracted
Chat systems push you to solve problems that appear in every real-time, connection-oriented system: connection lifecycle management, ordered delivery over unreliable networks, and presence tracking at scale. These patterns transfer to live collaboration tools, multiplayer games, real-time dashboards, and any system where "push" beats "poll."
Connection Management
- Heartbeat interval: 30s client โ server (detect dead connections)
- Graceful drain on deploy: stop accepting new, finish existing
- Reconnect with exponential backoff + jitter
- Session resumption: reconnect without re-fetching all state
- Transfers to: any WebSocket/long-lived connection system
Message Ordering
- Server-assigned sequence number per conversation
- Client detects gaps: "I have msg 5, 6, 8 โ where's 7?"
- Request missing messages on gap detection
- Timestamp + sequence for total ordering
- Transfers to: any ordered event stream system
Client-Generated IDs
- Client generates UUID for each message before sending
- Server uses it as idempotency key โ retry-safe
- If network drops after send: client retries with same ID
- Server deduplicates: same ID = already processed
- Transfers to: any at-least-once delivery system
Graceful deployment is the hardest operational challenge in chat systems. You cannot just restart a server โ it holds 100K active WebSocket connections. You must: (1) stop routing new connections to it, (2) wait for existing connections to naturally close or timeout, (3) drain remaining connections โ each client auto-reconnects to a different server. This "connection draining" takes minutes, not seconds.
- Heartbeats: 30s interval detects dead connections without wasting bandwidth.
- Connection draining: graceful deploy = drain existing connections, route new ones elsewhere.
- Message ordering: server-assigned sequence numbers + client gap detection.
- Client-generated IDs: retry-safe message sending. Dedup on server.
What Could Go Wrong
Chat system failures are immediately visible โ users see messages out of order, presence shows "online" when someone is offline, or messages simply vanish. These failures erode trust in the platform faster than almost any other system failure because users notice instantly. Every pattern below has happened in production at major chat platforms.
Message Ordering Violations
- Alice sends "Are you free?" then "For dinner tonight?"
- Bob sees "For dinner tonight?" first โ context lost
- Root cause: messages routed through different servers with clock skew
- Fix: per-conversation sequence numbers assigned by a single coordinator. Lamport timestamps for distributed ordering.
Presence Inaccuracy
- User closes laptop lid โ connection drops silently
- Presence still shows "online" until heartbeat timeout (30-60s)
- Or: user on flaky mobile โ rapidly toggles online/offline
- Fix: heartbeat timeout + grace period. Debounce presence changes (wait 15s before broadcasting "offline").
Thundering Herd on Reconnect
- Chat server crashes โ 100K users reconnect simultaneously
- All hit the same gateway โ new server overwhelmed
- Each reconnect triggers "fetch unread messages" โ DB flooded
- Fix: reconnect with random jitter (0-30s), connection rate limiting at gateway, progressive history loading.
Split Brain in Message Delivery
- Network partition: broker thinks user is on Server A, actually on Server B
- Messages routed to wrong server โ user doesn't receive them
- Session registry stale after partition heals
- Fix: session registry TTL with frequent refresh. Delivery acknowledgment: if no ACK in 5s, re-route via push notification.
The thundering herd is the most common chat outage pattern. One server dies โ 100K reconnects โ next server dies โ 200K reconnects โ cascading failure. The fix is simple but critical: random jitter on reconnect (each client waits 0-30 seconds randomly), connection rate limiting at the gateway, and circuit breaking when new connection rate exceeds capacity.
- Message ordering: per-conversation sequence numbers. Never rely on timestamps alone.
- Presence ghost: heartbeat timeout + debounce. Don't broadcast rapid online/offline flips.
- Thundering herd: random reconnect jitter + rate limiting at gateway. Prevent cascade.
- Split brain: session registry TTL + delivery ACK. Re-route unacknowledged messages.
- Principle: design for the reconnect storm, not just the steady state.