System Design Β· Case Studies

Case Study: Rate Limiter

Design, trade-offs, and alternatives for a rate limiter at scale.

01
Chapter One

Problem Statement

Why Rate Limiting Exists

A rate limiter controls how many requests a client can make to a service within a given time window. Without it, a single misbehaving client β€” malicious or buggy β€” can consume all resources and bring down the system for everyone. Rate limiting is not optional for any public-facing API. It protects against DDoS attacks, prevents abuse, ensures fair usage, and keeps your infrastructure costs predictable.

Scale Requirements

Traffic & Scale

  • 10M API keys to track simultaneously
  • 500K requests/sec peak across all clients
  • Decision latency: <1ms added per request
  • Multiple rate limit rules per client (per endpoint, per hour, per day)

Requirements

  • Accuracy: never allow significantly more than limit (hard limit)
  • Distributed: work across multiple server instances
  • Low memory: 10M keys Γ— multiple counters must fit in memory
  • Fail-open vs fail-closed: configurable per service
Storage Estimation for 10M Keys

Memory estimation: 10M keys Γ— multiple counters per key. For token bucket: 2 values per key (tokens_remaining + last_refill_timestamp) = approximately 16 bytes per rule per key. With 3 rules per client (per-second, per-minute, per-day) = 48 bytes per client. 10M clients Γ— 48 bytes = approximately 480MB. Redis comfortably holds this in memory with room for overhead. This calculation justifies why Redis (memory-first) is the right store rather than a disk-based database that would never achieve sub-millisecond access at this scale.

Decision Latency Breakdown

The 1ms latency budget is tight. A Redis round-trip in the same datacenter is 0.3–0.5ms. A Lua script execution adds ~0.1ms. Network variance adds another 0.1ms. Total: approximately 0.5–0.7ms for the happy path. This leaves no room for multiple Redis calls per request β€” the entire check-and-decrement must be one atomic operation. It also means the rate limiter cannot call a database or any service with higher latency. Redis is not a preference here β€” it is the only option that fits the latency budget.

The fundamental tension: accuracy vs latency. A perfectly accurate rate limiter requires coordination (expensive). A fast one allows brief over-limit bursts. Every production system makes a deliberate choice on this spectrum β€” there is no free lunch.

πŸ“‹ Chapter 1 β€” Summary
  • Rate limiting protects against abuse, ensures fairness, and controls cost.
  • Must add <1ms latency per request β€” on the critical path of every API call.
  • 10M concurrent keys, 500K req/sec peak. Memory efficiency matters.
  • Memory: 10M keys Γ— 3 rules Γ— 16 bytes = ~480MB. Redis fits this comfortably. No disk-based store achieves sub-millisecond access.
  • Latency budget: 1ms total allows one Redis round-trip (~0.5ms). No room for multiple operations β€” must be one atomic Lua script.
  • Core tension: accuracy (coordination) vs speed (local decisions).
02
Chapter Two

Questions to Ask

Clarifying the Design Space

Rate limiting sounds simple until you realize there are dozens of valid configurations. The questions below determine whether you need a simple in-memory counter or a distributed coordination system β€” and the answer often depends on business context more than technology.

🎯

Granularity

  • Per user? Per IP? Per API key?
  • Per endpoint or global across all endpoints?
  • Different limits for different tiers (free/paid)?
  • Per-second, per-minute, or per-day windows?
  • Request count or request cost? (GraphQL APIs need cost-based limiting β€” a single complex query can do the work of 100 simple requests)
⚑

Behavior

  • Hard limit (strict) or soft limit (occasional burst)?
  • Fail-open (allow on error) or fail-closed (deny on error)?
  • Return 429 immediately or queue and delay?
  • Inform client of remaining quota? (headers)
🌐

Architecture

  • Single data center or multi-region?
  • Client-side enforcement or server-side (centralized)?
  • Inline (on request path) or sidecar (separate process)?
  • Shared state (Redis) or eventual sync (local + gossip)?
  • What observability is required? Logging every decision at 500K req/sec generates 500K log entries/sec β€” sampling (1% of allows, 100% of denies) is the production approach.

Fail-open vs fail-closed is a business decision, not a technical one. If your Redis cluster goes down, do you allow all traffic through (fail-open β€” better UX, risk of overload) or deny all traffic (fail-closed β€” safe, but every customer is blocked)? Payment systems typically fail-closed. Content delivery typically fails-open.

For This Case Study, Our Answers Are:

  • Granularity: per API key (authenticated) + per IP (unauthenticated)
  • Scope: per endpoint (different limits for read vs write operations)
  • Window: per-second (burst) + per-day (quota)
  • Algorithm: token bucket β€” allows burst tolerance, steady rate enforcement
  • Hard vs soft: soft limit (token bucket allows controlled burst, not hard ceiling)
  • Fail behavior: fail-open (rate limiter down β†’ allow traffic, alert ops immediately)
  • Response: 429 with Retry-After header and X-RateLimit-* headers
  • Tiers: free (100 req/min), paid (1000 req/min), enterprise (10K req/min)
  • Multi-region: accept ~10% over-limit from sync lag (non-financial API, acceptable)
  • Cost-based: no β€” uniform request cost (not GraphQL, not variable-cost operations)
  • Observability: log all 429s, sample 1% of allows, alert on Redis latency >2ms
πŸ“‹ Chapter 2 β€” Summary
  • Granularity: per-user, per-IP, per-endpoint, per-tier β€” each changes the design.
  • Hard vs soft limits: strict enforcement vs graceful bursting.
  • Fail-open vs fail-closed: business risk tolerance determines the answer.
  • Single-region vs multi-region: multi-region adds replication lag complexity.
  • Cost-based vs count-based: flat request count for uniform APIs, complexity score for GraphQL or variable-cost operations.
  • Observability requirements: sample allow decisions (1%), log all denies. 500K/sec generates 500K log entries/sec at full rate.
03
Chapter Three

Naive Design

In-Memory Counter on a Single Server

The simplest rate limiter: a HashMap in application memory. Key = client ID, value = request count + window start time. On each request, check if count exceeds limit. Simple, fast (~0.1ms), zero external dependencies. Works perfectly on a single server β€” and fails immediately when you have multiple servers behind a load balancer, because each server has its own counter.

Naive Design β€” In-Memory Counter (Single Node)
Client Load Balancer Server 1 count: 3 (limit: 10) Server 2 count: 4 (limit: 10) Total: 3 + 4 = 7 Problem: Client made 7 requests total but each server thinks only 3-4. Limit: 10/window. Effective limit: 10 + 10 = 20 (each server enforces independently). β†’ N servers = NΓ— the intended limit
βœ…

What Works

  • Extremely fast β€” no network call, pure memory
  • Simple implementation (~50 lines of code)
  • No external dependencies to fail
  • Perfect for single-server applications
πŸ’₯

What Breaks

  • Multiple servers = split counters (N servers = NΓ— limit)
  • Server restart = all counters lost (burst allowed)
  • No shared state = no accurate global enforcement
  • Sticky sessions partially fix it but reduce LB effectiveness
Why Sticky Sessions Don't Solve It

Sticky sessions route all requests from a specific client to the same server, effectively making the rate limiter work correctly for that client. This partially solves the split counter problem but introduces new issues: sticky sessions reduce load balancer effectiveness (one server may get all requests from a high-volume client while others sit idle), they break on server restart (the sticky client gets a fresh counter), and they require the load balancer to track client identity (adding complexity and becoming a single point of failure for that routing decision). Sticky sessions are a workaround that trades one problem for three others.

Sticky Sessions β€” Three Problems It Causes
1. Load Imbalance Client A (1000 req/min) β†’ always β†’ Server 1: overloaded Client B (10 req/min) β†’ always β†’ Server 2: idle 2. Restart Resets Counter t=0: Client β†’ Server 1: count=8/10 t=1: Server 1 restarts t=2: Client β†’ Server 1: count=0/10 ← fresh counter, extra 10 free 3. LB Becomes SPOF LB must track: client_Aβ†’server_1, client_Bβ†’server_2 ... If LB state lost β†’ all sticky assignments break β†’ burst allowed for all clients simultaneously
Window Reset Burst Problem

The in-memory counter approach uses a fixed window: reset the counter every minute. This creates a boundary burst problem: a client can make their full limit of requests at 11:59:59, then their full limit again at 12:00:00. They have made 2Γ— their limit in two seconds. At the naive design stage this is less important than the multi-server problem, but it is a reason why the naive approach cannot simply be "fixed" by adding shared state β€” the algorithm itself needs to change.

πŸ“‹ Chapter 3 β€” Summary
  • In-memory counter: fast, simple, zero-dependency. Perfect for single server.
  • Breaks with multiple servers: counters split across instances.
  • N servers = client effectively gets NΓ— the intended rate limit.
  • Server restart loses state β€” temporary unlimited access.
  • Sticky sessions: partially fix split counters but hurt LB effectiveness, break on restart, and add routing complexity.
  • Window boundary burst: in-memory fixed window allows 2Γ— limit in 2 seconds at window boundary β€” algorithm problem, not just state.
04
Chapter Four

Refined Design

Distributed Rate Limiter with Redis

The solution to split counters is shared state. Redis provides the perfect combination: in-memory speed (~0.5ms per operation), atomic operations (INCR + EXPIRE), and cluster mode for availability. The rate limiter becomes a thin layer that queries Redis on every request β€” adding minimal latency while maintaining a single global counter per client.

Refined Design β€” Centralized Redis Rate Limiter
Client API Gateway Local Cache (100ms TTL) rate limit check (inline) Redis Cluster INCR + EXPIRE Lua script (atomic) ~0.5ms per op Backend Service on miss ← X-RateLimit-Limit: 100 ← X-RateLimit-Remaining: 47 ← X-RateLimit-Reset: 1703721600 allowed? β†’ route to backend denied? β†’ return 429 Too Many Requests Redis down? fail-open: allow fail-closed: 429 Rules Config (YAML) policies loaded 80-90% fewer Redis calls
What the Lua Script Actually Does

The atomic rate limit check in one Redis round-trip: the script receives the key, current time, bucket capacity, and refill rate. It reads the current token count and last refill timestamp, calculates how many tokens have been refilled since the last request (elapsed_seconds Γ— refill_rate), caps the total at bucket capacity, checks if at least one token remains, decrements if allowed or returns denied if not, writes the new state, and returns the result. This entire sequence runs on the Redis server without any intermediate network calls. Without Lua, a client would need to do GET (read state), local calculation, and SET (write state) β€” with a race window between GET and SET where another request could read the same stale state.

πŸͺ£

Token Bucket (Recommended)

  • Bucket holds N tokens. Each request consumes 1 token.
  • Tokens refill at rate R per second.
  • Allows bursts up to bucket size, then steady rate.
  • Redis: store tokens_remaining + last_refill_time
  • Lua script: calculate refill β†’ check β†’ decrement atomically
  • Used by: AWS API Gateway, Stripe, most production systems
πŸ“

Sliding Window Counter

  • Combines fixed window simplicity with sliding accuracy
  • Track counts for current + previous window
  • Weighted sum: prev Γ— overlap% + current
  • Memory efficient: 2 counters per key per rule
  • ~0.1% error vs true sliding window (acceptable)
  • Used by: Cloudflare, Kong

Redis Cluster for rate limiting: use consistent hashing to shard rate limit keys across cluster nodes. Each key (client API key) maps to exactly one shard, so there is no cross-shard coordination for a single client's counter. The Lua script runs on the shard that owns that key. This scales linearly β€” adding nodes increases total throughput proportionally. Cluster mode also provides high availability: each shard has replicas. If a primary shard fails, its replica takes over within seconds. During failover, rate limit checks for keys on that shard either fail-open or fail-closed depending on configuration β€” this is the most likely moment when your fail-open/fail-closed policy is actually tested.

Local Cache as First Layer

For extremely high-traffic clients, a local in-memory cache of the rate limit decision can reduce Redis calls dramatically. After a Redis check confirms a client is within limits, the application server caches the result locally for 100ms. Subsequent requests from the same client within that window skip the Redis call entirely. The trade-off: a client could make up to 100ms Γ— local_rate requests more than their limit before the next Redis check. For most APIs this overage is acceptable. For strict financial APIs (Stripe), it is not. The local cache approach reduces Redis calls by 80–90% for high-frequency clients, directly reducing latency and Redis cluster load.

Redis Lua scripts are the secret to distributed rate limiting. A Lua script executes atomically on the Redis server β€” no race conditions between read-check-write. One network round trip, one atomic operation, sub-millisecond. Without Lua: you need optimistic locking (WATCH/MULTI) which retries under contention and degrades at high load.

πŸ“‹ Chapter 4 β€” Summary
  • Redis as centralized state: all servers share one counter per client.
  • Lua scripts: atomic INCR + EXPIRE in one round trip. No race conditions.
  • Lua script: one Redis round-trip for read-calculate-write. Without it, a race window between GET and SET allows limit bypass.
  • Token bucket: allows controlled bursts, steady-state rate. Most versatile.
  • Sliding window counter: memory-efficient approximation with ~0.1% error.
  • Local cache first layer: cache Redis decision for 100ms locally. Reduces Redis calls 80–90% for high-frequency clients. Trade: up to 100ms of overage for non-strict limits.
  • Total added latency: ~0.5-1ms per request (Redis round trip).
05
Chapter Five

Alternative Approaches

Algorithm Trade-offs

Five algorithms dominate rate limiter implementations. Each trades off between memory usage, accuracy, burst handling, and complexity. The right choice depends on whether you care more about strict enforcement or graceful burst handling.

Fixed Window Counter
Sliding Window Log
  • Divide time into fixed windows (e.g., 1-min blocks)
  • One counter per window per client
  • Simple: INCR key with TTL = window size
  • Problem: burst at window boundary = 2Γ— limit
  • Memory: 1 counter per key β€” extremely efficient
  • Accuracy: poor at boundaries, acceptable otherwise
  • Store timestamp of every request in a sorted set
  • On each request: remove old entries, count remaining
  • Perfectly accurate β€” true sliding window
  • Problem: memory explosion (store every timestamp)
  • Redis ZRANGEBYSCORE + ZCARD per check
  • 1000 req/min Γ— 10M clients = 10B entries
Leaky Bucket
Token Bucket
  • Requests enter a queue processed at fixed rate
  • If bucket full β†’ reject request
  • Output rate is perfectly constant (smooth)
  • No bursting allowed β€” strict rate enforcement
  • Good for: callers needing predictable downstream load
  • Used by: network traffic shaping (QoS)
  • Bucket of tokens refills at rate R. Request takes 1 token.
  • Burst up to bucket size, then throttled to refill rate
  • Allows controlled bursts (good UX for bursty clients)
  • Most flexible β€” burst + steady state both configurable
  • Good for: API rate limiting with burst tolerance
  • Used by: AWS, Stripe, most API platforms
Rate Limiting Algorithms β€” Behavior Comparison
Fixed Window β–ˆβ–ˆβ–ˆβ–ˆ 8 req β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ 10 req ← burst at boundary |β€”window 1β€”|β€”window 2β€”| β†’ 2Γ— limit in 2 seconds at boundary Token Bucket t=0: 10 tokens burst of 10 β†’ all served t=1: 2 refilled 2 more t=2: steady 2/sec Burst absorbed. Steady rate after. Leaky Bucket Input: ●●●●● (any rate) β†’ ● β†’ ● β†’ ● constant output Excess queued or dropped. Perfectly smooth output. No bursting. Fixed window (simple, boundary burst) Token bucket (burst + steady) Leaky bucket (constant rate)
Algorithm Selection Guide

Use token bucket when: clients are well-behaved and occasionally burst (page loads, mobile app cold starts). The burst tolerance improves user experience without enabling sustained abuse. This is the right default for developer APIs.

Use leaky bucket when: the downstream service you are protecting cannot handle bursts at all. Constant-rate output protects fragile services. Less common in API rate limiting, more common in network traffic shaping.

Use sliding window counter when: you need better accuracy than fixed window but cannot afford the memory of sliding window log. The 0.1% error is acceptable for almost all API rate limiting use cases. Cloudflare uses this at global scale.

Use fixed window counter when: you need absolute minimum memory and can tolerate the 2Γ— boundary burst. Internal service-to-service rate limiting where clients are trusted and burst tolerance is acceptable.

Use sliding window log only when: you need perfect accuracy and have a very small number of clients (the memory cost is too high at scale). Useful for high-value enterprise customers where contractual accuracy matters more than memory efficiency.

The sliding window log is a cautionary tale about exact solutions. It is the only algorithm that gives perfectly accurate rate limiting with no boundary burst artifacts and no approximation error. It is also the only algorithm that is completely impractical at scale: storing every request timestamp for every client means 1000 req/min Γ— 10M clients = 10 billion entries in memory. Redis sorted sets at that scale would require 800+ GB of RAM just for rate limit tracking β€” more than the entire production cluster of most companies. The lesson: theoretical perfection is not a design goal. Sliding window counter's 0.1% error is the correct trade-off.

Cost-Based Limiting (Sixth Approach)

Instead of counting requests, assign a cost to each request based on its computational complexity and limit the total cost per time window. A simple GET request costs 1. A complex aggregation query costs 50. The client's budget depletes by the cost of each request, not by a flat count. This prevents a single expensive query from being treated the same as a trivial one. Implementation: calculate cost at request parse time (before execution), then apply token bucket or sliding window on the cost value rather than request count. Shopify's GraphQL API uses this approach β€” query complexity is calculated from the AST before execution begins.

Token bucket is the production standard for API rate limiting. It provides the best UX: clients can burst briefly (page load with 20 parallel requests) without getting throttled, while sustained abuse is caught quickly. Leaky bucket is better when you need constant-rate output (protecting a fragile downstream).

πŸ“‹ Chapter 5 β€” Summary
  • Fixed window: simple, memory-efficient, 2Γ— burst at boundaries.
  • Sliding window log: perfectly accurate, memory-expensive.
  • Sliding window counter: best balance β€” low memory, ~0.1% error.
  • Leaky bucket: constant output rate, no bursting. Network shaping.
  • Token bucket: controlled bursts + steady rate. Production standard.
  • Selection: token bucket for developer APIs (burst tolerance), leaky bucket for protecting fragile downstreams, sliding window counter for global scale with memory constraints.
  • Cost-based limiting: assign computational cost per request, limit total cost not count. Shopify GraphQL uses complexity scoring.
06
Chapter Six

What Real Companies Did

Production Rate Limiting at Scale

Every major API platform has a rate limiter. Their approaches differ based on constraints β€” Stripe needs absolute accuracy (money involved), GitHub needs to be lenient (developer experience), Cloudflare needs extreme speed (inline on every request globally).

πŸ’³

Stripe

  • Token bucket per API key per endpoint
  • Redis-backed with Lua scripts for atomicity
  • Headers: X-RateLimit-Limit, Remaining, Reset
  • Different limits per plan (free: 25/sec, paid: 100/sec)
  • Hard limit β€” never allows over (financial correctness)
πŸ™

GitHub

  • 5000 requests/hour per authenticated user
  • 60 requests/hour for unauthenticated (per IP)
  • Fixed window counter (hourly reset)
  • Headers expose remaining + reset time
  • Abuse detection layer on top for scraping patterns
🌐

Cloudflare

  • Sliding window counter at edge PoPs globally
  • Local counters + async sync between PoPs
  • Accepts ~10% over-limit during sync lag (soft limit)
  • Ultra-low latency: decision in <0.1ms (local memory)
  • Published paper: "Generic Cell Rate Algorithm" variant
  • Scale: 50M+ HTTP requests/sec globally. Central Redis is physically impossible β€” local counters with sync lag is the only viable approach
πŸ›’

Shopify

  • Leaky bucket for GraphQL API (cost-based)
  • Each query has a "cost" β€” complex queries consume more
  • Bucket refills at fixed rate (restore points/sec)
  • Prevents expensive queries from starving simple ones
  • Cost calculated at query parse time, not execution
🐦

Twitter/X

  • Per-endpoint limits with 15-minute rolling windows
  • Different limits per endpoint β€” timeline reads more expensive to serve, so lower limits
  • Application-level (per app credential) and user-level (per user token) tracked separately β€” both must pass
  • During 2023 restructuring, limits reduced dramatically to cut infrastructure costs β€” demonstrating rate limits also control costs, not just abuse
  • Standard headers: X-Rate-Limit-Limit, X-Rate-Limit-Remaining, X-Rate-Limit-Reset (Unix timestamp)
Production Rate Limiters β€” Comparison
Company Algorithm Granularity Hard/Soft Special Pattern
StripeToken bucketPer key per endpointHard (financial)Redis + Lua, 25/s free, 100/s paid
GitHubFixed windowPer user / per IPSoft (DX)Hourly reset, abuse detection layer
CloudflareSliding window ctrPer IP at edgeSoft (~10%)Local counters + async sync. 50M req/sec
ShopifyLeaky bucket (cost)Per shop per APISoftQuery complexity at AST parse time
Twitter/XFixed window (15m)Per app + per userHardBoth must pass. Limits cut in 2023
πŸ“‹ Chapter 6 β€” Summary
  • Stripe: token bucket + Redis + hard limits. Financial accuracy.
  • GitHub: fixed window counter. Simple, lenient, good DX.
  • Cloudflare: local counters + async sync. Speed over accuracy at edge. 50M+ req/sec globally β€” central Redis physically impossible.
  • Shopify: cost-based leaky bucket. Complex queries cost more.
  • Twitter/X: per-endpoint limits, 15-minute rolling windows. Separate app-level and user-level limits. Rate limits also control costs.
07
Chapter Seven

Best Practices Extracted

Transferable Patterns

Rate limiting patterns transfer to any system where you need to control resource consumption: API calls, database queries, message queue publishing, email sending, or feature flag rollouts. The principles below apply far beyond their rate limiting origin.

πŸ“‹

Rate Limit Headers

  • X-RateLimit-Limit: max allowed
  • X-RateLimit-Remaining: left in window
  • X-RateLimit-Reset: window reset time
  • Retry-After: seconds to wait (on 429)
  • Transfers to: any API needing client cooperation
πŸ”„

Client-Side Backoff

  • Exponential backoff on 429 response
  • Jitter prevents thundering herd on retry
  • SDKs should handle this automatically
  • Max retries: 3-5, then surface error
  • Transfers to: any client of rate-limited service
βš™οΈ

Rules Engine

  • Separate rate limit rules from logic
  • Rules as config (YAML/DB): change without deploy
  • Multiple rules per client: per-sec + per-day + per-endpoint
  • Priority: most specific rule wins
  • Transfers to: any configurable policy enforcement
πŸ•΅οΈ

Distributed Client Identity

  • Sophisticated clients bypass per-IP limits via IP rotation (botnets, residential proxies)
  • Fingerprint using multiple signals: IP + User-Agent + TLS fingerprint (JA3) + request timing patterns
  • No single signal is reliable alone
  • For authenticated APIs, API key is the strongest signal β€” cannot be rotated without account creation (which can be rate-limited separately)
  • Transfers to: any abuse prevention system
Token Bucket β€” Mechanical Operation
t=0 Bucket: 10/10 tokens. 1 request β†’ consume 1. 9/10 βœ“ t=1 3 requests arrive. Consume 3 tokens. 6/10 βœ“ t=2 8 requests. 6+2(refill)=8 tokens. Consume all. 0/10 βœ“Γ—8 t=3 2 requests. 2 tokens refilled. Consume both. 0/10 βœ“Γ—2 t=4 5 requests arrive. Only 3 tokens (2s Γ— 1/s + leftover). βœ“Γ—3 βœ—Γ—2 = tokens available (ALLOW) = empty slots βœ— = DENY (429) Refill rate: 2 tokens/sec. Bucket capacity: 10. Burst allowed up to capacity.

Monitor the rate limiter's own health: track Redis latency per call (alert if p99 exceeds 2ms), track the rate of 429 responses per client (spike = either abusive client or incorrectly configured limits), track fail-open vs fail-closed fallback activation rate (should be near zero β€” any activation means Redis is degraded), and track the percentage of requests that hit local cache vs Redis (helps tune cache TTL). A rate limiter that is silently failing open is worse than no rate limiter β€” you have the operational cost without the protection.

A good rate limiter is invisible to well-behaved clients. If your users are constantly hitting limits under normal usage, the limits are too low or your API design forces too many requests. Rate limiting should catch abuse and accidents β€” not punish normal usage patterns.

πŸ“‹ Chapter 7 β€” Summary
  • Headers: always communicate limits, remaining, and reset time.
  • Client backoff: exponential + jitter. Build into SDKs.
  • Rules engine: separate policy from mechanism. Change without deploys.
  • Multi-signal fingerprinting: IP + User-Agent + TLS fingerprint. No single signal is reliable. API key is the strongest signal.
  • Monitor the rate limiter: Redis p99 latency, 429 rate per client, fallback activation rate. Silent fail-open is worse than no limiter.
  • Design principle: rate limits catch abuse, not restrict normal usage.
08
Chapter Eight

What Could Go Wrong

Common Failure Patterns

Rate limiters sit on the critical path of every request. When they fail, they either block all legitimate traffic (fail-closed) or allow unlimited abuse (fail-open). The failures below have taken down production systems β€” usually because the rate limiter itself became the bottleneck it was designed to prevent.

🏚️

Redis as Single Point of Failure

  • Redis down β†’ all rate limit checks fail
  • Fail-closed: every request gets 429 (total outage)
  • Fail-open: no rate limiting (DDoS proceeds)
  • Fix: Redis Cluster for HA, local fallback counters, circuit breaker on Redis calls
⚑

Race Conditions Without Lua

  • GET count β†’ check β†’ SET new count (3 operations)
  • Between GET and SET: another request reads stale count
  • Result: N requests slip through in the gap
  • Fix: Redis Lua script (atomic), or WATCH/MULTI (less efficient)
🌍

Multi-Region Sync Lag

  • Client hits US region: count=8/10
  • Same client hits EU region: count=0/10 (not synced)
  • Client effectively gets 2Γ— limit across regions
  • Fix: sticky routing by client, or accept ~10% over-limit
  • When ~10% is NOT acceptable: financial transaction APIs, metered billing APIs, regulatory compliance limits. For these, sticky routing is required β€” all requests from a client route to one region. Reduces availability in exchange for accuracy.
Multi-Region Rate Limiting β€” Sync Lag Problem
Without Sticky Routing βœ— Client limit: 10 req/min US: count=8/10 βœ“ sync not yet propagated EU: count=0/10 βœ“ stale β€” 8 not synced! Total: 16 served. Limit: 10. 1.6Γ— over! With Sticky Routing βœ“ Client limit: 10 req/min US: count=8/10 βœ“ all requests β†’ US US: 2 βœ“, 1 βœ— (10/10) 3 more requests Total: 10 served. Limit enforced. βœ“ Sticky routing: accuracy at the cost of reduced availability (region failover = client rerouted = burst) For non-financial APIs: accept ~10% over-limit. For financial: sticky routing required.
🎭

False Positives (Shared IP)

  • Per-IP limiting: all users behind corporate NAT share one IP
  • 1000 users β†’ 1000Γ— natural traffic from one IP
  • Legitimate users blocked because of shared identity
  • Fix: combine IP + User-Agent + API key. Use authenticated identity.
πŸ€–

Distributed Bypass

  • Sophisticated client uses many source IPs (residential proxies, botnets)
  • 1000 IPs Γ— 10 req/min each = 10,000 req/min from what appears to be 10,000 different legitimate users
  • Per-IP rate limiting is completely ineffective against this pattern
  • Fix: layer rate limiting with behavioral analysis. Cluster requests sharing unusual characteristics (same User-Agent, timing patterns, target endpoints) and rate-limit as a group. Requires behavioral features + ML/rule-based anomaly detection β€” significant step up in complexity.
πŸ•

Time Window Synchronization

  • Fixed window counters reset at specific clock times
  • Clock skew between servers or NTP correction β†’ different servers reset local caches at different times
  • A client that has exhausted their limit could hit a server that thinks the window just reset
  • Less severe in Redis-backed system (single authoritative clock) but local cache layers using server time can exhibit this
  • Fix: use Redis TIME command as authoritative clock for all window calculations. Encode window start time in key name (key:minute:1703721600) rather than relying on TTL expiry timing.

The rate limiter must never become the thing it protects against. If your Redis-based limiter adds 50ms latency under load, you have replaced a DDoS with self-inflicted degradation. Monitor rate limiter latency. Set circuit breaker: if Redis > 5ms, fall back to local counters.

πŸ“‹ Chapter 8 β€” Summary
  • Redis SPOF: cluster mode + local fallback + circuit breaker.
  • Race conditions: always use Lua scripts for atomic check-and-increment.
  • Multi-region: accept some over-limit or use sticky routing. 10% acceptable for most APIs, not acceptable for financial, metered billing, or regulated APIs.
  • False positives: don't limit by IP alone. Use authenticated identity.
  • Distributed bypass: many IPs Γ— small count = large total. Requires behavioral clustering and anomaly detection beyond basic counting.
  • Clock synchronization: use Redis TIME command, not server clock, for all window calculations. Encode window start in key name.
  • Principle: the rate limiter must not be slower than what it protects.