System Design Β· Case Studies

Case Study: Ride Sharing

Design, trade-offs, and alternatives for a ride-sharing platform at scale.

01
Chapter One

Problem Statement

What We Are Building

A ride-sharing platform matches riders requesting trips with nearby drivers in real-time. The core challenge is a spatial-temporal matching problem: given a rider at location (lat, lng) requesting a ride now, find the best available driver within a few kilometers β€” considering driver location, direction of travel, ETA, and driver rating β€” all within 3-5 seconds. Meanwhile, drivers' locations update every 3-5 seconds, creating a constantly moving dataset of millions of points that must be queried geospatially at thousands of times per second.

Scale Requirements

Traffic & Scale

  • 100M monthly active riders
  • 5M active drivers (concurrent online at peak)
  • 20M rides/day (~230 matches/sec)
  • Location updates: 5M drivers Γ— every 4s = 1.25M updates/sec

Requirements

  • Match latency: driver found in <5 seconds
  • ETA accuracy: Β±2 minutes at time of match
  • Location freshness: driver position <5 seconds stale
  • Availability: 99.99% (rides are time-critical)

Ride-sharing is a real-time geospatial matching system with strict time constraints. Unlike a restaurant search (where results can be cached), driver positions are moving constantly. The index is never stable. And unlike a chat system (where latency of 200ms vs 500ms is acceptable), a 30-second delay in matching means the rider leaves and opens a competitor's app. Speed of matching directly equals revenue.

πŸ“‹ Chapter 1 β€” Summary
  • 5M concurrent drivers sending location updates every 4 seconds = 1.25M updates/sec.
  • Match a rider to nearest driver in <5 seconds with accurate ETA.
  • Geospatial index is constantly moving β€” never cacheable.
  • 20M rides/day. Matching speed = revenue. Downtime = riders switch to competitor.
02
Chapter Two

Questions to Ask

Clarifying Before Designing

A basic taxi dispatch system and a full Uber-like platform differ enormously. The questions below determine whether you need surge pricing, ride pooling (shared rides), scheduled rides, multi-modal transport, and how sophisticated the matching algorithm needs to be.

πŸš—

Matching Model

  • Single rider ↔ single driver? Or ride pooling?
  • Match by nearest or by optimal ETA?
  • Driver can reject? Or auto-assigned?
  • Scheduled rides (future) or real-time only?
πŸ’°

Pricing & Economics

  • Fixed pricing or dynamic (surge)?
  • Upfront fare estimate before ride?
  • Multiple vehicle types (economy, premium, XL)?
  • Tips, cancellation fees, toll handling?
πŸ“

Geography & Routing

  • Urban only or suburban/rural too?
  • ETA based on straight-line or real road routing?
  • Traffic-aware ETA?
  • Multi-city / global deployment?

Ride pooling changes the matching problem from O(N) to NP-hard. Single matching is: find nearest available driver. Pooled matching is: given multiple riders going in similar directions, find a driver whose route can serve all of them with acceptable detour. This is a constraint optimization problem that requires fundamentally different algorithms (not just nearest-neighbor).

For This Case Study, Our Answers Are:

  • Ride type: single rider ↔ single driver (no pooling β€” keeps matching O(N))
  • Match strategy: nearest driver by ETA (not pure distance)
  • Driver rejection: yes β€” driver can reject, 15s timeout, cascade to next
  • Scheduled rides: no β€” real-time only
  • Pricing: dynamic surge pricing per geographic zone
  • Fare estimate: yes β€” shown upfront before rider confirms
  • Vehicle types: 3 tiers (economy, comfort, premium) β€” separate matching pools
  • ETA: road-routing based (not straight-line), traffic-aware
  • Geographic scope: multi-city global deployment β€” partitioned by city
  • Location freshness SLA: driver position must be <5 seconds stale at time of match
πŸ“‹ Chapter 2 β€” Summary
  • Single vs pooled rides: pooling is exponentially harder to match optimally.
  • Surge pricing requires real-time supply/demand tracking per geographic zone.
  • ETA accuracy depends on road routing + traffic data β€” not just distance.
  • Driver rejection adds timeout handling and fallback to next-best driver.
03
Chapter Three

Naive Design

SQL Query on All Drivers

The simplest design: store all driver locations in a PostgreSQL table with PostGIS. When a rider requests a ride, run a spatial query: SELECT * FROM drivers WHERE status='available' ORDER BY ST_Distance(location, rider_point) LIMIT 10. This scans the table, computes distances, sorts, and returns nearest drivers. Works for 100 drivers. With 5M active drivers updating every 4 seconds, the table is being written 1.25M times/sec while being queried for nearest neighbors β€” the database cannot handle both simultaneously.

Naive Design β€” SQL Spatial Query on All Drivers
Rider App API Server SELECT * FROM drivers WHERE status='available' ORDER BY ST_Distance(...) LIMIT 10 PostgreSQL + PostGIS drivers table 5M rows, index rebuilt constantly full table scan 1.25M location writes/sec + 230 spatial queries/sec = index thrashing β†’ DB melts Write path (location updates) and read path (matching queries) compete on same table Single DB = no geographic partitioning. San Francisco query scans Tokyo drivers too. Spatial index must be rebuilt with every location update β€” never converges.
βœ…

What Works

  • Simple β€” one table, one query, standard SQL
  • PostGIS spatial indexes handle small datasets well
  • ACID guarantees on driver status
  • Fine for <10K drivers in a single city
πŸ’₯

What Breaks

  • 1.25M location writes/sec β†’ write throughput exceeded
  • Spatial index constantly rebuilt as drivers move β†’ index thrashing
  • 230 match queries/sec each scanning millions of rows
  • Single database = SPOF + no geographic distribution
  • ETA not computable from SQL (needs routing engine)
πŸ“‹ Chapter 3 β€” Summary
  • SQL + PostGIS: works for small scale. Fails at 1.25M writes/sec + spatial queries.
  • Index thrashing: spatial index rebuilt constantly as every driver moves.
  • No separation between hot path (matching) and warm path (location ingestion).
04
Chapter Four

Refined Design

Geospatial Index + Matching Service

The refined design separates concerns: a high-throughput location service ingests 1.25M updates/sec into an in-memory geospatial index (geohash-based grid or S2 cells). A separate matching service queries this index to find nearby candidates, then scores them by ETA (via routing service). A dispatch service sends the match offer to the selected driver and handles accept/reject/timeout. The ride lifecycle (request β†’ match β†’ pickup β†’ trip β†’ dropoff) is managed as a state machine.

Refined Design β€” Ride-Sharing Platform
Rider App API Gateway auth + rate limit Matching Service Find candidates Score by ETA + rating Routing / ETA OSRM / Valhalla ETA per candidate Dispatch Service offer β†’ accept/reject requested β†’ matched β†’ pickup β†’ in_transit β†’ completed WebSocket / FCM Push Driver App dispatch offer Location Service 1.25M updates/sec GPS every 4s Geo Index S2 cells / Geohash write query: drivers in nearby cells Rider β†’ Gateway β†’ Matching queries Geo Index β†’ Scores by ETA β†’ Dispatches to driver Driver locations stream into Geo Index (in-memory). Match latency: <3 seconds.
πŸ“

Location Ingestion

  • Drivers send GPS every 3-5 seconds via WebSocket/MQTT
  • Location service writes to in-memory geospatial index
  • Index partitioned by geographic region (city/zone)
  • S2 cells or geohash for spatial bucketing
  • Old locations expire (TTL 30s) β€” no explicit delete needed
πŸ”—

Matching Flow

  • 1. Rider requests ride β†’ matching service receives (lat, lng, type)
  • 2. Query geo index: all available drivers within 3km radius
  • 3. For top-10 candidates: compute ETA via routing service
  • 4. Score: ETA Γ— 0.7 + rating Γ— 0.2 + direction Γ— 0.1
  • 5. Dispatch to best driver. 15s timeout β†’ try next.

The geospatial index must be in-memory and partitioned by city/region. You cannot query a global index for "nearest drivers in San Francisco" when the index also contains drivers in Tokyo. Partition by city. Within a city, use S2 cells (Google's spherical geometry library) or geohash prefixes to bucket drivers into geographic tiles. Querying "nearby" = check the rider's cell + adjacent cells. This is O(1) per query regardless of total driver count.

S2 Cell Spatial Query β€” How Nearby Driver Lookup Works
9q8v 9q8w 9q8x 9q8y 9q8z 9q9a 9q9b 9q9c β–² driver 9q9d 9q9e 9q9f 9q9g 9q9h ● rider 9q9i β–² driver 9q9j 9q9k 9q9l 9q9m β–² 9q9n 9q9o Rider's cell Adjacent (searched) Not searched Query = check 9 cells total. O(1) regardless of total driver count. 3 drivers found in searched cells β†’ compute ETA for each β†’ select best Contrast: SQL ST_Distance scans all 5M drivers globally
πŸ“‹ Chapter 4 β€” Summary
  • Location service: 1.25M updates/sec into in-memory geospatial index (S2 cells).
  • Matching: query nearby cells β†’ get candidates β†’ score by ETA + rating β†’ dispatch.
  • Partitioned by city/region: each city is an independent matching domain.
  • Dispatch with timeout: 15s to accept, then cascade to next driver.
  • Match latency target: <3 seconds from request to driver notification.
05
Chapter Five

Alternative Approaches

Matching Strategies
Greedy (Nearest Driver)
Batch Matching (Optimization)
  • Match each rider to their nearest available driver immediately
  • Simple, fast β€” <1 second per match
  • Locally optimal: best for THIS rider right now
  • Globally suboptimal: may leave another rider with no drivers nearby
  • Good for: low demand, or when supply >> demand
  • Used by: early Uber, lower-traffic markets
  • Collect ride requests for 2-5 seconds, batch match all at once
  • Optimization: minimize total wait time across ALL riders in batch
  • Globally optimal: better aggregate experience
  • Higher latency: rider waits 2-5 seconds before matching starts
  • Complex: Hungarian algorithm or linear programming
  • Used by: Uber (high-demand), Lyft (peak hours)
Matching Strategy: Greedy vs Batch Optimization
Greedy (Sequential) A B Drv 1 B waits 4 min... 😞 Locally optimal for A, globally bad for B. Batch (2-second window) A B Drv 1 Drv 2 Globally optimal. Both riders matched. βœ“ Batch adds 2s latency but minimizes total wait time across all riders.
Geohash Indexing
S2 Geometry (Google)
  • Encode lat/lng into string prefix. Nearby = shared prefix.
  • Simple to implement β€” Redis sorted sets by geohash
  • Non-uniform cell sizes near poles (distortion)
  • Edge cases at cell boundaries (neighbors need special handling)
  • Good for: most applications, simple infrastructure
  • Used by: MongoDB geospatial, Elasticsearch
  • Spherical geometry β€” divides earth into hierarchical cells
  • Uniform coverage: no distortion at poles or equator
  • Precise containment and intersection queries
  • More complex to implement β€” requires S2 library
  • Good for: precision-critical, global-scale geospatial
  • Used by: Google Maps, Uber H3 (Hexagonal variant)
Geospatial Index: Geohash vs S2/H3 Cell Coverage
Geohash (Rectangular) 9q8y 9q8z ↑ driver at edge of 9q8y not found searching 9q8z! Cells distort at high latitudes Irregular adjacency edges S2 / H3 Hexagonal center All 6 neighbors equidistant Uniform coverage, no pole distortion Clean adjacency β€” no boundary cases

Uber's switch from geohash to H3 hexagonal cells solved a structural problem. Rectangular geohash cells have irregular adjacency β€” a cell at the edge of "9q8y" is much closer to the center of "9q8z" than to the center of its own cell. Hexagonal grids (H3) have a key property: every cell's center is equidistant from all 6 neighboring cell centers. This eliminates boundary artifacts and makes surge pricing per-cell more accurate β€” a hexagon's area doesn't abruptly end at a street corner.

πŸ“‹ Chapter 5 β€” Summary
  • Greedy: fast, simple, locally optimal. Good for low demand.
  • Batch matching: globally optimal, higher latency. Good for high demand.
  • Geohash: simple, Redis-friendly. Non-uniform near poles.
  • S2/H3: uniform cells, precision. Used by Uber and Google at global scale.
06
Chapter Six

What Real Companies Did

Production Ride-Sharing Platforms
πŸš—

Uber

  • H3: custom hexagonal grid system for geospatial indexing
  • DISCO: dispatch optimization (batch matching at scale)
  • Ringpop: consistent hashing for partitioning by city
  • Schemaless: custom distributed database for trip data
  • Surge pricing computed per H3 hexagon cell
πŸš—

Lyft

  • S2 cells for geospatial indexing
  • Custom matching with ML-based scoring
  • Shared rides: constraint optimization for route compatibility
  • Envoy service mesh for inter-service communication
  • Predictive positioning: suggest drivers where demand will appear
πŸ›΅

Grab (Southeast Asia)

  • Multi-modal: cars, bikes, tuk-tuks β€” different matching
  • Optimized for dense urban + rural (varied road quality)
  • Real-time traffic from driver GPS data (crowdsourced)
  • Kafka for all location event streaming
  • GrabPlatform: shared infrastructure across food/ride/deliver
πŸš•

DiDi (China)

  • 600M users, 30M rides/day (peak)
  • AI-based demand prediction per zone (15-min windows)
  • Order dispatch: 3-second batch windows, global optimization
  • Handles extreme density: 100K+ concurrent drivers in single city
  • Real-time repositioning suggestions for idle drivers
Production Ride-Sharing Platforms β€” Comparison
Company Geo Index Matching Strategy Special Pattern Scale Uber H3 hexagonal grid Batch (DISCO optimizer) Ringpop consistent hashing by city, surge per H3 cell 20M+ rides/day Lyft S2 cells ML-scored matching Predictive driver repositioning ~1M rides/day Grab S2 cells + Kafka streaming Multi-modal per vehicle type Crowdsourced traffic from driver GPS 35M+ rides/day DiDi Custom grid + AI prediction 3-second batch windows Global optimization, 100K+ drivers per city 30M rides/day
πŸ“‹ Chapter 6 β€” Summary
  • Uber: H3 hexagonal grid, DISCO batch optimization, Ringpop partitioning.
  • Lyft: S2 cells, ML-based matching, predictive driver positioning.
  • Grab: multi-modal, crowdsourced traffic, shared platform across verticals.
  • DiDi: 30M rides/day, AI demand prediction, 3-second batch dispatch.
07
Chapter Seven

Best Practices Extracted

Transferable Lessons
πŸ—ΊοΈ

Geospatial Partitioning

  • Partition data by geographic region (city/zone)
  • Each partition serves independently β€” no cross-city queries
  • Scale horizontally: add servers per city as demand grows
  • Isolation: outage in one city doesn't affect others
  • Transfers to: delivery, food ordering, fleet management
⏱️

State Machine Lifecycle

  • Ride states: requested β†’ matched β†’ pickup β†’ in_transit β†’ completed
  • Each transition has timeout + fallback
  • Audit log of every state change (dispute resolution)
  • Exactly-once state transitions (idempotent)
  • Transfers to: order fulfillment, workflow engines
πŸ“Š

Supply/Demand Balancing

  • Real-time supply (available drivers) per zone
  • Predicted demand (ride requests) per zone per time window
  • Imbalance β†’ surge pricing (demand > supply) or repositioning (supply > demand)
  • Heat maps for driver repositioning guidance
  • Transfers to: any marketplace with dynamic pricing
Ride Lifecycle State Machine
requested match found matched en route pickup arrived in_transit trip ends completed payment rated timeout 15s β†’ try next driver cancelled no drivers timeout 10min driver no-show Happy path Failure / timeout path Every state transition logged for audit. Each state has timeout + fallback. Idempotent transitions prevent duplicate actions.
πŸ“‹ Chapter 7 β€” Summary
  • Geo-partition: data and compute partitioned by city. Independent scaling and failure isolation.
  • State machine: ride lifecycle with timeouts and fallbacks at every transition.
  • Supply/demand: real-time tracking per zone enables surge pricing and repositioning.
  • Simultaneous offers: send dispatch to top-3 drivers at once. First to accept wins. Eliminates cascade rejection delay.
08
Chapter Eight

What Could Go Wrong

Common Failure Patterns
πŸ‘»

Ghost Drivers

  • Driver app crashes β†’ no location updates β†’ still shows as available
  • Rider matched to "ghost" β†’ dispatch sent but never accepted
  • 15s timeout β†’ retry β†’ another ghost β†’ 45 seconds wasted
  • Fix: TTL on location records (30s). No update = marked offline. Heartbeat requirement for "available" status.
Ghost Driver: TTL-Based Availability Expiry
t=0 t=4s t=8s t=12s CRASH t=38s t=42s ping ping ping βœ• ← no pings received β†’ Without TTL: Still shows as AVAILABLE β€” matched to rider at t=20s β†’ 15s timeout wasted With TTL (30s): Available (pings active) Grace period (last ping + 30s) TTL expires β†’ marked OFFLINE Rider never matched to this driver βœ“
πŸ”„

Dispatch Cascade Failure

  • Best driver rejects β†’ send to 2nd β†’ they reject β†’ 3rd...
  • 5 rejections = 75 seconds elapsed β†’ rider cancels
  • All nearby drivers busy rejecting offers from other riders
  • Fix: batch matching (assign optimally in one shot), timeout reduction (10s), simultaneous offers to multiple drivers (first to accept wins).
πŸ“

GPS Drift and Inaccuracy

  • Urban canyons: GPS shows driver in wrong street (50-100m off)
  • ETA computed to wrong location β†’ driver arrives but rider doesn't see them
  • Matching selects "nearby" driver who is actually on the highway above
  • Fix: snap GPS to road network, use multi-point averaging, consider altitude/road level.
πŸ’₯

Surge Pricing Oscillation

  • Demand spike β†’ surge 3x β†’ riders stop requesting β†’ demand drops
  • Surge removed β†’ riders flood back β†’ surge again
  • Oscillating surge confuses both riders and drivers
  • Fix: smooth surge transitions (ramp up/down over minutes, not seconds), minimum surge duration, demand prediction (anticipate vs react).
Surge Pricing: Oscillation vs Smooth Transition
Time β†’ Demand Low High equilibrium Oscillating surge (bad UX) Smoothed surge (stable) surge on β†’ riders leave β†’ surge off β†’ riders return β†’ repeat

The most damaging failure is matching a rider to a ghost driver. The rider waits, nothing happens, and they leave for a competitor. This single failure mode has caused more user churn than any other issue in ride-sharing. The fix is simple: aggressive TTL on driver availability. If no heartbeat in 30 seconds, the driver is dead to the matching system. Better to have fewer candidates than to match to a ghost.

πŸ“‹ Chapter 8 β€” Summary
  • Ghost drivers: TTL on availability. No heartbeat = offline. Never match to stale data.
  • Cascade rejection: batch matching or simultaneous offers. Don't serial-dispatch.
  • GPS drift: snap to road network. Don't trust raw GPS in urban environments.
  • Surge oscillation: smooth transitions + minimum duration. Predict, don't react.