System Design · Case Studies

Case Study: Rate Limiter

Design, trade-offs, and alternatives for a rate limiter at scale.

Chapter One

Problem Statement

Why Rate Limiting Exists

A rate limiter controls how many requests a client can make to a service within a given time window. Without it, a single misbehaving client — malicious or buggy — can consume all resources and bring down the system for everyone. Rate limiting is not optional for any public-facing API. It protects against DDoS attacks, prevents abuse, ensures fair usage, and keeps your infrastructure costs predictable.

Scale Requirements

Traffic & Scale

10M API keys to track simultaneously
500K requests/sec peak across all clients
Decision latency: <1ms added per request
Multiple rate limit rules per client (per endpoint, per hour, per day)

Requirements

Accuracy: never allow significantly more than limit (hard limit)
Distributed: work across multiple server instances
Low memory: 10M keys × multiple counters must fit in memory
Fail-open vs fail-closed: configurable per service

Storage Estimation for 10M Keys

Memory estimation: 10M keys × multiple counters per key. For token bucket: 2 values per key (tokens_remaining + last_refill_timestamp) = approximately 16 bytes per rule per key. With 3 rules per client (per-second, per-minute, per-day) = 48 bytes per client. 10M clients × 48 bytes = approximately 480MB. Redis comfortably holds this in memory with room for overhead. This calculation justifies why Redis (memory-first) is the right store rather than a disk-based database that would never achieve sub-millisecond access at this scale.

Decision Latency Breakdown

The 1ms latency budget is tight. A Redis round-trip in the same datacenter is 0.3–0.5ms. A Lua script execution adds ~0.1ms. Network variance adds another 0.1ms. Total: approximately 0.5–0.7ms for the happy path. This leaves no room for multiple Redis calls per request — the entire check-and-decrement must be one atomic operation. It also means the rate limiter cannot call a database or any service with higher latency. Redis is not a preference here — it is the only option that fits the latency budget.

The fundamental tension: accuracy vs latency. A perfectly accurate rate limiter requires coordination (expensive). A fast one allows brief over-limit bursts. Every production system makes a deliberate choice on this spectrum — there is no free lunch.

📋 Chapter 1 — Summary

Rate limiting protects against abuse, ensures fairness, and controls cost.
Must add <1ms latency per request — on the critical path of every API call.
10M concurrent keys, 500K req/sec peak. Memory efficiency matters.
Memory: 10M keys × 3 rules × 16 bytes = ~480MB. Redis fits this comfortably. No disk-based store achieves sub-millisecond access.
Latency budget: 1ms total allows one Redis round-trip (~0.5ms). No room for multiple operations — must be one atomic Lua script.
Core tension: accuracy (coordination) vs speed (local decisions).

Chapter Two

Questions to Ask

Clarifying the Design Space

Rate limiting sounds simple until you realize there are dozens of valid configurations. The questions below determine whether you need a simple in-memory counter or a distributed coordination system — and the answer often depends on business context more than technology.

🎯

Granularity

Per user? Per IP? Per API key?
Per endpoint or global across all endpoints?
Different limits for different tiers (free/paid)?
Per-second, per-minute, or per-day windows?
Request count or request cost? (GraphQL APIs need cost-based limiting — a single complex query can do the work of 100 simple requests)

⚡

Behavior

Hard limit (strict) or soft limit (occasional burst)?
Fail-open (allow on error) or fail-closed (deny on error)?
Return 429 immediately or queue and delay?
Inform client of remaining quota? (headers)

🌐

Architecture

Single data center or multi-region?
Client-side enforcement or server-side (centralized)?
Inline (on request path) or sidecar (separate process)?
Shared state (Redis) or eventual sync (local + gossip)?
What observability is required? Logging every decision at 500K req/sec generates 500K log entries/sec — sampling (1% of allows, 100% of denies) is the production approach.

Fail-open vs fail-closed is a business decision, not a technical one. If your Redis cluster goes down, do you allow all traffic through (fail-open — better UX, risk of overload) or deny all traffic (fail-closed — safe, but every customer is blocked)? Payment systems typically fail-closed. Content delivery typically fails-open.

For This Case Study, Our Answers Are:

Granularity: per API key (authenticated) + per IP (unauthenticated)
Scope: per endpoint (different limits for read vs write operations)
Window: per-second (burst) + per-day (quota)
Algorithm: token bucket — allows burst tolerance, steady rate enforcement
Hard vs soft: soft limit (token bucket allows controlled burst, not hard ceiling)
Fail behavior: fail-open (rate limiter down → allow traffic, alert ops immediately)
Response: 429 with Retry-After header and X-RateLimit-* headers
Tiers: free (100 req/min), paid (1000 req/min), enterprise (10K req/min)
Multi-region: accept ~10% over-limit from sync lag (non-financial API, acceptable)
Cost-based: no — uniform request cost (not GraphQL, not variable-cost operations)
Observability: log all 429s, sample 1% of allows, alert on Redis latency >2ms

📋 Chapter 2 — Summary

Granularity: per-user, per-IP, per-endpoint, per-tier — each changes the design.
Hard vs soft limits: strict enforcement vs graceful bursting.
Fail-open vs fail-closed: business risk tolerance determines the answer.
Single-region vs multi-region: multi-region adds replication lag complexity.
Cost-based vs count-based: flat request count for uniform APIs, complexity score for GraphQL or variable-cost operations.
Observability requirements: sample allow decisions (1%), log all denies. 500K/sec generates 500K log entries/sec at full rate.

Chapter Three

Naive Design

In-Memory Counter on a Single Server

The simplest rate limiter: a HashMap in application memory. Key = client ID, value = request count + window start time. On each request, check if count exceeds limit. Simple, fast (~0.1ms), zero external dependencies. Works perfectly on a single server — and fails immediately when you have multiple servers behind a load balancer, because each server has its own counter.

Naive Design — In-Memory Counter (Single Node)

✅

What Works

Extremely fast — no network call, pure memory
Simple implementation (~50 lines of code)
No external dependencies to fail
Perfect for single-server applications

💥

What Breaks

Multiple servers = split counters (N servers = N× limit)
Server restart = all counters lost (burst allowed)
No shared state = no accurate global enforcement
Sticky sessions partially fix it but reduce LB effectiveness

Why Sticky Sessions Don't Solve It

Sticky sessions route all requests from a specific client to the same server, effectively making the rate limiter work correctly for that client. This partially solves the split counter problem but introduces new issues: sticky sessions reduce load balancer effectiveness (one server may get all requests from a high-volume client while others sit idle), they break on server restart (the sticky client gets a fresh counter), and they require the load balancer to track client identity (adding complexity and becoming a single point of failure for that routing decision). Sticky sessions are a workaround that trades one problem for three others.

Sticky Sessions — Three Problems It Causes

Window Reset Burst Problem

The in-memory counter approach uses a fixed window: reset the counter every minute. This creates a boundary burst problem: a client can make their full limit of requests at 11:59:59, then their full limit again at 12:00:00. They have made 2× their limit in two seconds. At the naive design stage this is less important than the multi-server problem, but it is a reason why the naive approach cannot simply be "fixed" by adding shared state — the algorithm itself needs to change.

📋 Chapter 3 — Summary

In-memory counter: fast, simple, zero-dependency. Perfect for single server.
Breaks with multiple servers: counters split across instances.
N servers = client effectively gets N× the intended rate limit.
Server restart loses state — temporary unlimited access.
Sticky sessions: partially fix split counters but hurt LB effectiveness, break on restart, and add routing complexity.
Window boundary burst: in-memory fixed window allows 2× limit in 2 seconds at window boundary — algorithm problem, not just state.

Chapter Four

Refined Design

Distributed Rate Limiter with Redis

The solution to split counters is shared state. Redis provides the perfect combination: in-memory speed (~0.5ms per operation), atomic operations (INCR + EXPIRE), and cluster mode for availability. The rate limiter becomes a thin layer that queries Redis on every request — adding minimal latency while maintaining a single global counter per client.

Refined Design — Centralized Redis Rate Limiter

What the Lua Script Actually Does

The atomic rate limit check in one Redis round-trip: the script receives the key, current time, bucket capacity, and refill rate. It reads the current token count and last refill timestamp, calculates how many tokens have been refilled since the last request (elapsed_seconds × refill_rate), caps the total at bucket capacity, checks if at least one token remains, decrements if allowed or returns denied if not, writes the new state, and returns the result. This entire sequence runs on the Redis server without any intermediate network calls. Without Lua, a client would need to do GET (read state), local calculation, and SET (write state) — with a race window between GET and SET where another request could read the same stale state.

🪣

Token Bucket (Recommended)

Bucket holds N tokens. Each request consumes 1 token.
Tokens refill at rate R per second.
Allows bursts up to bucket size, then steady rate.
Redis: store tokens_remaining + last_refill_time
Lua script: calculate refill → check → decrement atomically
Used by: AWS API Gateway, Stripe, most production systems

📐

Sliding Window Counter

Combines fixed window simplicity with sliding accuracy
Track counts for current + previous window
Weighted sum: prev × overlap% + current
Memory efficient: 2 counters per key per rule
~0.1% error vs true sliding window (acceptable)
Used by: Cloudflare, Kong

Redis Cluster for rate limiting: use consistent hashing to shard rate limit keys across cluster nodes. Each key (client API key) maps to exactly one shard, so there is no cross-shard coordination for a single client's counter. The Lua script runs on the shard that owns that key. This scales linearly — adding nodes increases total throughput proportionally. Cluster mode also provides high availability: each shard has replicas. If a primary shard fails, its replica takes over within seconds. During failover, rate limit checks for keys on that shard either fail-open or fail-closed depending on configuration — this is the most likely moment when your fail-open/fail-closed policy is actually tested.

Local Cache as First Layer

For extremely high-traffic clients, a local in-memory cache of the rate limit decision can reduce Redis calls dramatically. After a Redis check confirms a client is within limits, the application server caches the result locally for 100ms. Subsequent requests from the same client within that window skip the Redis call entirely. The trade-off: a client could make up to 100ms × local_rate requests more than their limit before the next Redis check. For most APIs this overage is acceptable. For strict financial APIs (Stripe), it is not. The local cache approach reduces Redis calls by 80–90% for high-frequency clients, directly reducing latency and Redis cluster load.

Redis Lua scripts are the secret to distributed rate limiting. A Lua script executes atomically on the Redis server — no race conditions between read-check-write. One network round trip, one atomic operation, sub-millisecond. Without Lua: you need optimistic locking (WATCH/MULTI) which retries under contention and degrades at high load.

📋 Chapter 4 — Summary

Redis as centralized state: all servers share one counter per client.
Lua scripts: atomic INCR + EXPIRE in one round trip. No race conditions.
Lua script: one Redis round-trip for read-calculate-write. Without it, a race window between GET and SET allows limit bypass.
Token bucket: allows controlled bursts, steady-state rate. Most versatile.
Sliding window counter: memory-efficient approximation with ~0.1% error.
Local cache first layer: cache Redis decision for 100ms locally. Reduces Redis calls 80–90% for high-frequency clients. Trade: up to 100ms of overage for non-strict limits.
Total added latency: ~0.5-1ms per request (Redis round trip).

Chapter Five

Alternative Approaches

Algorithm Trade-offs

Five algorithms dominate rate limiter implementations. Each trades off between memory usage, accuracy, burst handling, and complexity. The right choice depends on whether you care more about strict enforcement or graceful burst handling.

Fixed Window Counter

Sliding Window Log

Divide time into fixed windows (e.g., 1-min blocks)
One counter per window per client
Simple: INCR key with TTL = window size
Problem: burst at window boundary = 2× limit
Memory: 1 counter per key — extremely efficient
Accuracy: poor at boundaries, acceptable otherwise

Store timestamp of every request in a sorted set
On each request: remove old entries, count remaining
Perfectly accurate — true sliding window
Problem: memory explosion (store every timestamp)
Redis ZRANGEBYSCORE + ZCARD per check
1000 req/min × 10M clients = 10B entries

Leaky Bucket

Token Bucket

Requests enter a queue processed at fixed rate
If bucket full → reject request
Output rate is perfectly constant (smooth)
No bursting allowed — strict rate enforcement
Good for: callers needing predictable downstream load
Used by: network traffic shaping (QoS)

Bucket of tokens refills at rate R. Request takes 1 token.
Burst up to bucket size, then throttled to refill rate
Allows controlled bursts (good UX for bursty clients)
Most flexible — burst + steady state both configurable
Good for: API rate limiting with burst tolerance
Used by: AWS, Stripe, most API platforms

Rate Limiting Algorithms — Behavior Comparison

Algorithm Selection Guide

Use token bucket when: clients are well-behaved and occasionally burst (page loads, mobile app cold starts). The burst tolerance improves user experience without enabling sustained abuse. This is the right default for developer APIs.

Use leaky bucket when: the downstream service you are protecting cannot handle bursts at all. Constant-rate output protects fragile services. Less common in API rate limiting, more common in network traffic shaping.

Use sliding window counter when: you need better accuracy than fixed window but cannot afford the memory of sliding window log. The 0.1% error is acceptable for almost all API rate limiting use cases. Cloudflare uses this at global scale.

Use fixed window counter when: you need absolute minimum memory and can tolerate the 2× boundary burst. Internal service-to-service rate limiting where clients are trusted and burst tolerance is acceptable.

Use sliding window log only when: you need perfect accuracy and have a very small number of clients (the memory cost is too high at scale). Useful for high-value enterprise customers where contractual accuracy matters more than memory efficiency.

The sliding window log is a cautionary tale about exact solutions. It is the only algorithm that gives perfectly accurate rate limiting with no boundary burst artifacts and no approximation error. It is also the only algorithm that is completely impractical at scale: storing every request timestamp for every client means 1000 req/min × 10M clients = 10 billion entries in memory. Redis sorted sets at that scale would require 800+ GB of RAM just for rate limit tracking — more than the entire production cluster of most companies. The lesson: theoretical perfection is not a design goal. Sliding window counter's 0.1% error is the correct trade-off.

Cost-Based Limiting (Sixth Approach)

Instead of counting requests, assign a cost to each request based on its computational complexity and limit the total cost per time window. A simple GET request costs 1. A complex aggregation query costs 50. The client's budget depletes by the cost of each request, not by a flat count. This prevents a single expensive query from being treated the same as a trivial one. Implementation: calculate cost at request parse time (before execution), then apply token bucket or sliding window on the cost value rather than request count. Shopify's GraphQL API uses this approach — query complexity is calculated from the AST before execution begins.

Token bucket is the production standard for API rate limiting. It provides the best UX: clients can burst briefly (page load with 20 parallel requests) without getting throttled, while sustained abuse is caught quickly. Leaky bucket is better when you need constant-rate output (protecting a fragile downstream).

📋 Chapter 5 — Summary

Fixed window: simple, memory-efficient, 2× burst at boundaries.
Sliding window log: perfectly accurate, memory-expensive.
Sliding window counter: best balance — low memory, ~0.1% error.
Leaky bucket: constant output rate, no bursting. Network shaping.
Token bucket: controlled bursts + steady rate. Production standard.
Selection: token bucket for developer APIs (burst tolerance), leaky bucket for protecting fragile downstreams, sliding window counter for global scale with memory constraints.
Cost-based limiting: assign computational cost per request, limit total cost not count. Shopify GraphQL uses complexity scoring.

Chapter Six

What Real Companies Did

Production Rate Limiting at Scale

Every major API platform has a rate limiter. Their approaches differ based on constraints — Stripe needs absolute accuracy (money involved), GitHub needs to be lenient (developer experience), Cloudflare needs extreme speed (inline on every request globally).

💳

Stripe

Token bucket per API key per endpoint
Redis-backed with Lua scripts for atomicity
Headers: X-RateLimit-Limit, Remaining, Reset
Different limits per plan (free: 25/sec, paid: 100/sec)
Hard limit — never allows over (financial correctness)

🐙

GitHub

5000 requests/hour per authenticated user
60 requests/hour for unauthenticated (per IP)
Fixed window counter (hourly reset)
Headers expose remaining + reset time
Abuse detection layer on top for scraping patterns

🌐

Cloudflare

Sliding window counter at edge PoPs globally
Local counters + async sync between PoPs
Accepts ~10% over-limit during sync lag (soft limit)
Ultra-low latency: decision in <0.1ms (local memory)
Published paper: "Generic Cell Rate Algorithm" variant
Scale: 50M+ HTTP requests/sec globally. Central Redis is physically impossible — local counters with sync lag is the only viable approach

🛒

Shopify

Leaky bucket for GraphQL API (cost-based)
Each query has a "cost" — complex queries consume more
Bucket refills at fixed rate (restore points/sec)
Prevents expensive queries from starving simple ones
Cost calculated at query parse time, not execution

🐦

Twitter/X

Per-endpoint limits with 15-minute rolling windows
Different limits per endpoint — timeline reads more expensive to serve, so lower limits
Application-level (per app credential) and user-level (per user token) tracked separately — both must pass
During 2023 restructuring, limits reduced dramatically to cut infrastructure costs — demonstrating rate limits also control costs, not just abuse
Standard headers: X-Rate-Limit-Limit, X-Rate-Limit-Remaining, X-Rate-Limit-Reset (Unix timestamp)

Production Rate Limiters — Comparison

Company	Algorithm	Granularity	Hard/Soft	Special Pattern
Stripe	Token bucket	Per key per endpoint	Hard (financial)	Redis + Lua, 25/s free, 100/s paid
GitHub	Fixed window	Per user / per IP	Soft (DX)	Hourly reset, abuse detection layer
Cloudflare	Sliding window ctr	Per IP at edge	Soft (~10%)	Local counters + async sync. 50M req/sec
Shopify	Leaky bucket (cost)	Per shop per API	Soft	Query complexity at AST parse time
Twitter/X	Fixed window (15m)	Per app + per user	Hard	Both must pass. Limits cut in 2023

📋 Chapter 6 — Summary

Stripe: token bucket + Redis + hard limits. Financial accuracy.
GitHub: fixed window counter. Simple, lenient, good DX.
Cloudflare: local counters + async sync. Speed over accuracy at edge. 50M+ req/sec globally — central Redis physically impossible.
Shopify: cost-based leaky bucket. Complex queries cost more.
Twitter/X: per-endpoint limits, 15-minute rolling windows. Separate app-level and user-level limits. Rate limits also control costs.

Chapter Seven

Best Practices Extracted

Transferable Patterns

Rate limiting patterns transfer to any system where you need to control resource consumption: API calls, database queries, message queue publishing, email sending, or feature flag rollouts. The principles below apply far beyond their rate limiting origin.

📋

Rate Limit Headers

X-RateLimit-Limit: max allowed
X-RateLimit-Remaining: left in window
X-RateLimit-Reset: window reset time
Retry-After: seconds to wait (on 429)
Transfers to: any API needing client cooperation

🔄

Client-Side Backoff

Exponential backoff on 429 response
Jitter prevents thundering herd on retry
SDKs should handle this automatically
Max retries: 3-5, then surface error
Transfers to: any client of rate-limited service

⚙️

Rules Engine

Separate rate limit rules from logic
Rules as config (YAML/DB): change without deploy
Multiple rules per client: per-sec + per-day + per-endpoint
Priority: most specific rule wins
Transfers to: any configurable policy enforcement

🕵️

Distributed Client Identity

Sophisticated clients bypass per-IP limits via IP rotation (botnets, residential proxies)
Fingerprint using multiple signals: IP + User-Agent + TLS fingerprint (JA3) + request timing patterns
No single signal is reliable alone
For authenticated APIs, API key is the strongest signal — cannot be rotated without account creation (which can be rate-limited separately)
Transfers to: any abuse prevention system

Token Bucket — Mechanical Operation

Monitor the rate limiter's own health: track Redis latency per call (alert if p99 exceeds 2ms), track the rate of 429 responses per client (spike = either abusive client or incorrectly configured limits), track fail-open vs fail-closed fallback activation rate (should be near zero — any activation means Redis is degraded), and track the percentage of requests that hit local cache vs Redis (helps tune cache TTL). A rate limiter that is silently failing open is worse than no rate limiter — you have the operational cost without the protection.

A good rate limiter is invisible to well-behaved clients. If your users are constantly hitting limits under normal usage, the limits are too low or your API design forces too many requests. Rate limiting should catch abuse and accidents — not punish normal usage patterns.

📋 Chapter 7 — Summary

Headers: always communicate limits, remaining, and reset time.
Client backoff: exponential + jitter. Build into SDKs.
Rules engine: separate policy from mechanism. Change without deploys.
Multi-signal fingerprinting: IP + User-Agent + TLS fingerprint. No single signal is reliable. API key is the strongest signal.
Monitor the rate limiter: Redis p99 latency, 429 rate per client, fallback activation rate. Silent fail-open is worse than no limiter.
Design principle: rate limits catch abuse, not restrict normal usage.

Chapter Eight

What Could Go Wrong

Common Failure Patterns

Rate limiters sit on the critical path of every request. When they fail, they either block all legitimate traffic (fail-closed) or allow unlimited abuse (fail-open). The failures below have taken down production systems — usually because the rate limiter itself became the bottleneck it was designed to prevent.

🏚️

Redis as Single Point of Failure

Redis down → all rate limit checks fail
Fail-closed: every request gets 429 (total outage)
Fail-open: no rate limiting (DDoS proceeds)
Fix: Redis Cluster for HA, local fallback counters, circuit breaker on Redis calls

⚡

Race Conditions Without Lua

GET count → check → SET new count (3 operations)
Between GET and SET: another request reads stale count
Result: N requests slip through in the gap
Fix: Redis Lua script (atomic), or WATCH/MULTI (less efficient)

🌍

Multi-Region Sync Lag

Client hits US region: count=8/10
Same client hits EU region: count=0/10 (not synced)
Client effectively gets 2× limit across regions
Fix: sticky routing by client, or accept ~10% over-limit
When ~10% is NOT acceptable: financial transaction APIs, metered billing APIs, regulatory compliance limits. For these, sticky routing is required — all requests from a client route to one region. Reduces availability in exchange for accuracy.

Multi-Region Rate Limiting — Sync Lag Problem

🎭

False Positives (Shared IP)

Per-IP limiting: all users behind corporate NAT share one IP
1000 users → 1000× natural traffic from one IP
Legitimate users blocked because of shared identity
Fix: combine IP + User-Agent + API key. Use authenticated identity.

🤖

Distributed Bypass

Sophisticated client uses many source IPs (residential proxies, botnets)
1000 IPs × 10 req/min each = 10,000 req/min from what appears to be 10,000 different legitimate users
Per-IP rate limiting is completely ineffective against this pattern
Fix: layer rate limiting with behavioral analysis. Cluster requests sharing unusual characteristics (same User-Agent, timing patterns, target endpoints) and rate-limit as a group. Requires behavioral features + ML/rule-based anomaly detection — significant step up in complexity.

🕐

Time Window Synchronization

Fixed window counters reset at specific clock times
Clock skew between servers or NTP correction → different servers reset local caches at different times
A client that has exhausted their limit could hit a server that thinks the window just reset
Less severe in Redis-backed system (single authoritative clock) but local cache layers using server time can exhibit this
Fix: use Redis TIME command as authoritative clock for all window calculations. Encode window start time in key name (key:minute:1703721600) rather than relying on TTL expiry timing.

The rate limiter must never become the thing it protects against. If your Redis-based limiter adds 50ms latency under load, you have replaced a DDoS with self-inflicted degradation. Monitor rate limiter latency. Set circuit breaker: if Redis > 5ms, fall back to local counters.

📋 Chapter 8 — Summary

Redis SPOF: cluster mode + local fallback + circuit breaker.
Race conditions: always use Lua scripts for atomic check-and-increment.
Multi-region: accept some over-limit or use sticky routing. 10% acceptable for most APIs, not acceptable for financial, metered billing, or regulated APIs.
False positives: don't limit by IP alone. Use authenticated identity.
Distributed bypass: many IPs × small count = large total. Requires behavioral clustering and anomaly detection beyond basic counting.
Clock synchronization: use Redis TIME command, not server clock, for all window calculations. Encode window start in key name.
Principle: the rate limiter must not be slower than what it protects.

← URL Shortener Notification System →