System Design · Case Studies

Case Study: Ride Sharing

Design, trade-offs, and alternatives for a ride-sharing platform at scale.

Chapter One

Problem Statement

What We Are Building

A ride-sharing platform matches riders requesting trips with nearby drivers in real-time. The core challenge is a spatial-temporal matching problem: given a rider at location (lat, lng) requesting a ride now, find the best available driver within a few kilometers — considering driver location, direction of travel, ETA, and driver rating — all within 3-5 seconds. Meanwhile, drivers' locations update every 3-5 seconds, creating a constantly moving dataset of millions of points that must be queried geospatially at thousands of times per second.

Scale Requirements

Traffic & Scale

100M monthly active riders
5M active drivers (concurrent online at peak)
20M rides/day (~230 matches/sec)
Location updates: 5M drivers × every 4s = 1.25M updates/sec

Requirements

Match latency: driver found in <5 seconds
ETA accuracy: ±2 minutes at time of match
Location freshness: driver position <5 seconds stale
Availability: 99.99% (rides are time-critical)

Ride-sharing is a real-time geospatial matching system with strict time constraints. Unlike a restaurant search (where results can be cached), driver positions are moving constantly. The index is never stable. And unlike a chat system (where latency of 200ms vs 500ms is acceptable), a 30-second delay in matching means the rider leaves and opens a competitor's app. Speed of matching directly equals revenue.

📋 Chapter 1 — Summary

5M concurrent drivers sending location updates every 4 seconds = 1.25M updates/sec.
Match a rider to nearest driver in <5 seconds with accurate ETA.
Geospatial index is constantly moving — never cacheable.
20M rides/day. Matching speed = revenue. Downtime = riders switch to competitor.

Chapter Two

Questions to Ask

Clarifying Before Designing

A basic taxi dispatch system and a full Uber-like platform differ enormously. The questions below determine whether you need surge pricing, ride pooling (shared rides), scheduled rides, multi-modal transport, and how sophisticated the matching algorithm needs to be.

🚗

Matching Model

Single rider ↔ single driver? Or ride pooling?
Match by nearest or by optimal ETA?
Driver can reject? Or auto-assigned?
Scheduled rides (future) or real-time only?

💰

Pricing & Economics

Fixed pricing or dynamic (surge)?
Upfront fare estimate before ride?
Multiple vehicle types (economy, premium, XL)?
Tips, cancellation fees, toll handling?

📍

Geography & Routing

Urban only or suburban/rural too?
ETA based on straight-line or real road routing?
Traffic-aware ETA?
Multi-city / global deployment?

Ride pooling changes the matching problem from O(N) to NP-hard. Single matching is: find nearest available driver. Pooled matching is: given multiple riders going in similar directions, find a driver whose route can serve all of them with acceptable detour. This is a constraint optimization problem that requires fundamentally different algorithms (not just nearest-neighbor).

For This Case Study, Our Answers Are:

Ride type: single rider ↔ single driver (no pooling — keeps matching O(N))
Match strategy: nearest driver by ETA (not pure distance)
Driver rejection: yes — driver can reject, 15s timeout, cascade to next
Scheduled rides: no — real-time only
Pricing: dynamic surge pricing per geographic zone
Fare estimate: yes — shown upfront before rider confirms
Vehicle types: 3 tiers (economy, comfort, premium) — separate matching pools
ETA: road-routing based (not straight-line), traffic-aware
Geographic scope: multi-city global deployment — partitioned by city
Location freshness SLA: driver position must be <5 seconds stale at time of match

📋 Chapter 2 — Summary

Single vs pooled rides: pooling is exponentially harder to match optimally.
Surge pricing requires real-time supply/demand tracking per geographic zone.
ETA accuracy depends on road routing + traffic data — not just distance.
Driver rejection adds timeout handling and fallback to next-best driver.

Chapter Three

Naive Design

SQL Query on All Drivers

The simplest design: store all driver locations in a PostgreSQL table with PostGIS. When a rider requests a ride, run a spatial query: SELECT * FROM drivers WHERE status='available' ORDER BY ST_Distance(location, rider_point) LIMIT 10. This scans the table, computes distances, sorts, and returns nearest drivers. Works for 100 drivers. With 5M active drivers updating every 4 seconds, the table is being written 1.25M times/sec while being queried for nearest neighbors — the database cannot handle both simultaneously.

Naive Design — SQL Spatial Query on All Drivers

✅

What Works

Simple — one table, one query, standard SQL
PostGIS spatial indexes handle small datasets well
ACID guarantees on driver status
Fine for <10K drivers in a single city

💥

What Breaks

1.25M location writes/sec → write throughput exceeded
Spatial index constantly rebuilt as drivers move → index thrashing
230 match queries/sec each scanning millions of rows
Single database = SPOF + no geographic distribution
ETA not computable from SQL (needs routing engine)

📋 Chapter 3 — Summary

SQL + PostGIS: works for small scale. Fails at 1.25M writes/sec + spatial queries.
Index thrashing: spatial index rebuilt constantly as every driver moves.
No separation between hot path (matching) and warm path (location ingestion).

Chapter Four

Refined Design

Geospatial Index + Matching Service

The refined design separates concerns: a high-throughput location service ingests 1.25M updates/sec into an in-memory geospatial index (geohash-based grid or S2 cells). A separate matching service queries this index to find nearby candidates, then scores them by ETA (via routing service). A dispatch service sends the match offer to the selected driver and handles accept/reject/timeout. The ride lifecycle (request → match → pickup → trip → dropoff) is managed as a state machine.

Refined Design — Ride-Sharing Platform

📍

Location Ingestion

Drivers send GPS every 3-5 seconds via WebSocket/MQTT
Location service writes to in-memory geospatial index
Index partitioned by geographic region (city/zone)
S2 cells or geohash for spatial bucketing
Old locations expire (TTL 30s) — no explicit delete needed

🔗

Matching Flow

1. Rider requests ride → matching service receives (lat, lng, type)
2. Query geo index: all available drivers within 3km radius
3. For top-10 candidates: compute ETA via routing service
4. Score: ETA × 0.7 + rating × 0.2 + direction × 0.1
5. Dispatch to best driver. 15s timeout → try next.

The geospatial index must be in-memory and partitioned by city/region. You cannot query a global index for "nearest drivers in San Francisco" when the index also contains drivers in Tokyo. Partition by city. Within a city, use S2 cells (Google's spherical geometry library) or geohash prefixes to bucket drivers into geographic tiles. Querying "nearby" = check the rider's cell + adjacent cells. This is O(1) per query regardless of total driver count.

S2 Cell Spatial Query — How Nearby Driver Lookup Works

📋 Chapter 4 — Summary

Location service: 1.25M updates/sec into in-memory geospatial index (S2 cells).
Matching: query nearby cells → get candidates → score by ETA + rating → dispatch.
Partitioned by city/region: each city is an independent matching domain.
Dispatch with timeout: 15s to accept, then cascade to next driver.
Match latency target: <3 seconds from request to driver notification.

Chapter Five

Alternative Approaches

Matching Strategies

Greedy (Nearest Driver)

Batch Matching (Optimization)

Match each rider to their nearest available driver immediately
Simple, fast — <1 second per match
Locally optimal: best for THIS rider right now
Globally suboptimal: may leave another rider with no drivers nearby
Good for: low demand, or when supply >> demand
Used by: early Uber, lower-traffic markets

Collect ride requests for 2-5 seconds, batch match all at once
Optimization: minimize total wait time across ALL riders in batch
Globally optimal: better aggregate experience
Higher latency: rider waits 2-5 seconds before matching starts
Complex: Hungarian algorithm or linear programming
Used by: Uber (high-demand), Lyft (peak hours)

Matching Strategy: Greedy vs Batch Optimization

Geohash Indexing

S2 Geometry (Google)

Encode lat/lng into string prefix. Nearby = shared prefix.
Simple to implement — Redis sorted sets by geohash
Non-uniform cell sizes near poles (distortion)
Edge cases at cell boundaries (neighbors need special handling)
Good for: most applications, simple infrastructure
Used by: MongoDB geospatial, Elasticsearch

Spherical geometry — divides earth into hierarchical cells
Uniform coverage: no distortion at poles or equator
Precise containment and intersection queries
More complex to implement — requires S2 library
Good for: precision-critical, global-scale geospatial
Used by: Google Maps, Uber H3 (Hexagonal variant)

Geospatial Index: Geohash vs S2/H3 Cell Coverage

Uber's switch from geohash to H3 hexagonal cells solved a structural problem. Rectangular geohash cells have irregular adjacency — a cell at the edge of "9q8y" is much closer to the center of "9q8z" than to the center of its own cell. Hexagonal grids (H3) have a key property: every cell's center is equidistant from all 6 neighboring cell centers. This eliminates boundary artifacts and makes surge pricing per-cell more accurate — a hexagon's area doesn't abruptly end at a street corner.

📋 Chapter 5 — Summary

Greedy: fast, simple, locally optimal. Good for low demand.
Batch matching: globally optimal, higher latency. Good for high demand.
Geohash: simple, Redis-friendly. Non-uniform near poles.
S2/H3: uniform cells, precision. Used by Uber and Google at global scale.

Chapter Six

What Real Companies Did

Production Ride-Sharing Platforms

🚗

Uber

H3: custom hexagonal grid system for geospatial indexing
DISCO: dispatch optimization (batch matching at scale)
Ringpop: consistent hashing for partitioning by city
Schemaless: custom distributed database for trip data
Surge pricing computed per H3 hexagon cell

🚗

Lyft

S2 cells for geospatial indexing
Custom matching with ML-based scoring
Shared rides: constraint optimization for route compatibility
Envoy service mesh for inter-service communication
Predictive positioning: suggest drivers where demand will appear

🛵

Grab (Southeast Asia)

Multi-modal: cars, bikes, tuk-tuks — different matching
Optimized for dense urban + rural (varied road quality)
Real-time traffic from driver GPS data (crowdsourced)
Kafka for all location event streaming
GrabPlatform: shared infrastructure across food/ride/deliver

🚕

DiDi (China)

600M users, 30M rides/day (peak)
AI-based demand prediction per zone (15-min windows)
Order dispatch: 3-second batch windows, global optimization
Handles extreme density: 100K+ concurrent drivers in single city
Real-time repositioning suggestions for idle drivers

Production Ride-Sharing Platforms — Comparison

📋 Chapter 6 — Summary

Uber: H3 hexagonal grid, DISCO batch optimization, Ringpop partitioning.
Lyft: S2 cells, ML-based matching, predictive driver positioning.
Grab: multi-modal, crowdsourced traffic, shared platform across verticals.
DiDi: 30M rides/day, AI demand prediction, 3-second batch dispatch.

Chapter Seven

Best Practices Extracted

Transferable Lessons

🗺️

Geospatial Partitioning

Partition data by geographic region (city/zone)
Each partition serves independently — no cross-city queries
Scale horizontally: add servers per city as demand grows
Isolation: outage in one city doesn't affect others
Transfers to: delivery, food ordering, fleet management

⏱️

State Machine Lifecycle

Ride states: requested → matched → pickup → in_transit → completed
Each transition has timeout + fallback
Audit log of every state change (dispute resolution)
Exactly-once state transitions (idempotent)
Transfers to: order fulfillment, workflow engines

📊

Supply/Demand Balancing

Real-time supply (available drivers) per zone
Predicted demand (ride requests) per zone per time window
Imbalance → surge pricing (demand > supply) or repositioning (supply > demand)
Heat maps for driver repositioning guidance
Transfers to: any marketplace with dynamic pricing

Ride Lifecycle State Machine

📋 Chapter 7 — Summary

Geo-partition: data and compute partitioned by city. Independent scaling and failure isolation.
State machine: ride lifecycle with timeouts and fallbacks at every transition.
Supply/demand: real-time tracking per zone enables surge pricing and repositioning.
Simultaneous offers: send dispatch to top-3 drivers at once. First to accept wins. Eliminates cascade rejection delay.

Chapter Eight

What Could Go Wrong

Common Failure Patterns

👻

Ghost Drivers

Driver app crashes → no location updates → still shows as available
Rider matched to "ghost" → dispatch sent but never accepted
15s timeout → retry → another ghost → 45 seconds wasted
Fix: TTL on location records (30s). No update = marked offline. Heartbeat requirement for "available" status.

Ghost Driver: TTL-Based Availability Expiry

🔄

Dispatch Cascade Failure

Best driver rejects → send to 2nd → they reject → 3rd...
5 rejections = 75 seconds elapsed → rider cancels
All nearby drivers busy rejecting offers from other riders
Fix: batch matching (assign optimally in one shot), timeout reduction (10s), simultaneous offers to multiple drivers (first to accept wins).

📐

GPS Drift and Inaccuracy

Urban canyons: GPS shows driver in wrong street (50-100m off)
ETA computed to wrong location → driver arrives but rider doesn't see them
Matching selects "nearby" driver who is actually on the highway above
Fix: snap GPS to road network, use multi-point averaging, consider altitude/road level.

💥

Surge Pricing Oscillation

Demand spike → surge 3x → riders stop requesting → demand drops
Surge removed → riders flood back → surge again
Oscillating surge confuses both riders and drivers
Fix: smooth surge transitions (ramp up/down over minutes, not seconds), minimum surge duration, demand prediction (anticipate vs react).

Surge Pricing: Oscillation vs Smooth Transition

The most damaging failure is matching a rider to a ghost driver. The rider waits, nothing happens, and they leave for a competitor. This single failure mode has caused more user churn than any other issue in ride-sharing. The fix is simple: aggressive TTL on driver availability. If no heartbeat in 30 seconds, the driver is dead to the matching system. Better to have fewer candidates than to match to a ghost.

📋 Chapter 8 — Summary

Ghost drivers: TTL on availability. No heartbeat = offline. Never match to stale data.
Cascade rejection: batch matching or simultaneous offers. Don't serial-dispatch.
GPS drift: snap to road network. Don't trust raw GPS in urban environments.
Surge oscillation: smooth transitions + minimum duration. Predict, don't react.

← Social Feed Payment System →