System Design · Case Studies

Case Study: Payment System

Design, trade-offs, and alternatives for a payment system at scale.

01
Chapter One

Problem Statement

What We Are Building

A payment system processes financial transactions: charging customers, settling with merchants, handling refunds, and maintaining an auditable ledger. The defining constraint is correctness over speed. Unlike chat or feeds, where eventual consistency is acceptable, a payment must be processed exactly once: never duplicated, never lost. A double charge loses customer trust instantly. A lost payment loses merchant trust. The system must stay correct in the presence of network failures, timeouts, retries, and partial outages.

Scale Requirements

Traffic & Scale

  • 1B transactions/day (~12K TPS average)
  • Peak: 5x average → 60K TPS (Black Friday)
  • $1T+ annual payment volume
  • Multiple currencies, payment methods, and regions

Requirements

  • Exactly-once processing: no double charges, no lost payments
  • Latency: <2 seconds end-to-end for card payment
  • Availability: 99.999% (about 5 minutes of downtime per year)
  • Audit trail: every state change immutably logged
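
The traffic and availability figures above follow from simple arithmetic. A minimal sketch, assuming a 365-day year and evenly spread traffic; the function and constant names are illustrative only:

```python
SECONDS_PER_DAY = 24 * 60 * 60
MINUTES_PER_YEAR = 365 * 24 * 60  # assumption: 365-day year


def average_tps(transactions_per_day: int) -> float:
    """Average transactions per second if traffic were perfectly even."""
    return transactions_per_day / SECONDS_PER_DAY


def downtime_budget_minutes(availability: float) -> float:
    """Minutes per year the system may be down at the given availability."""
    return (1.0 - availability) * MINUTES_PER_YEAR


avg = average_tps(1_000_000_000)         # about 11,574 TPS, rounded to ~12K above
peak = 5 * avg                           # about 57,870 TPS, rounded to 60K above
budget = downtime_budget_minutes(0.99999)  # about 5.26 minutes per year

print(f"average {avg:,.0f} TPS, peak {peak:,.0f} TPS, downtime budget {budget:.2f} min/year")
```

Real traffic is bursty, so the peak figure is a planning number rather than a measured one.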

Payment systems are among the few systems where "eventual consistency" can mean legal liability. If you double-charge a customer, you may violate consumer protection laws. If you lose a transaction record, you risk failing a compliance audit. If your reconciliation is off by $0.01, your financial reports are incorrect. The other systems in this series can tolerate small inconsistencies. Payments cannot. Correctness is not a goal; it is the only acceptable outcome.

📋 Chapter 1: Summary
  • 1B transactions/day, 60K TPS at peak. $1T+ annual volume.
  • Exactly-once: no duplicate charges, no lost payments. Non-negotiable.
  • 99.999% availability: about 5 minutes of downtime per year.
  • Full audit trail: every state change logged immutably for compliance.
02
Chapter Two

Questions to Ask

Clarifying Before Designing
💳

Payment Methods

  • Credit/debit cards only? Or wallets, bank transfers, crypto?
  • Multiple PSPs (Stripe, Adyen, PayPal) or single?
  • Token vault for stored cards?
  • 3D Secure / SCA required? (EU regulation)
🔄

Payment Lifecycle

  • Authorize + capture separate? Or single charge?
  • Refunds supported? Partial refunds?
  • Recurring/subscription payments?
  • Multi-currency with FX conversion?
📋

Compliance & Reporting

  • PCI-DSS compliance level?
  • End-of-day reconciliation with banks/PSPs?
  • Regulatory reporting (regional)?
  • Fraud detection inline or async?

Authorize vs capture is the most important lifecycle question. In an authorize-then-capture model, you place a hold on funds but don't charge until fulfillment (e.g., Amazon charges when the item ships). This means you need two-phase transaction handling, hold expiry management, and the ability to void uncaptured authorizations. Single-charge is simpler but means you charge before fulfilling, which creates refund volume if fulfillment fails.
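
To make the two-phase lifecycle concrete, here is a minimal sketch of an authorization that can be captured, voided, or left to expire. The `Authorization` class, its state names, and the 7-day hold are assumptions for illustration; real hold lengths vary by card network and issuer.

```python
from datetime import datetime, timedelta


class Authorization:
    """Hypothetical model of a hold on funds (not a real PSP object)."""

    def __init__(self, amount_cents, now, hold_days=7):  # hold length is an assumption
        self.amount_cents = amount_cents
        self.expires_at = now + timedelta(days=hold_days)
        self.state = "AUTHORIZED"

    def capture(self, now):
        """Turn the hold into a charge, e.g. when the item ships."""
        if self.state != "AUTHORIZED":
            raise ValueError(f"cannot capture from state {self.state}")
        if now >= self.expires_at:
            self.state = "EXPIRED"
            raise ValueError("hold expired; a new authorization is needed")
        self.state = "CAPTURED"

    def void(self):
        """Release the hold without charging, e.g. when the order is cancelled."""
        if self.state != "AUTHORIZED":
            raise ValueError(f"cannot void from state {self.state}")
        self.state = "VOIDED"


start = datetime(2024, 1, 1)

shipped = Authorization(10_000, start)
shipped.capture(start + timedelta(days=2))

cancelled = Authorization(5_000, start)
cancelled.void()

late = Authorization(7_500, start)
try:
    late.capture(start + timedelta(days=8))
except ValueError:
    pass  # the hold lapsed before fulfillment
```

The extra states (`VOIDED`, `EXPIRED`) are exactly the complexity the single-charge model avoids.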

For This Case Study, Our Answers Are:

  • Payment methods: credit/debit cards (Visa, Mastercard, Amex) + digital wallets (Apple Pay, Google Pay)
  • PSP model: multi-PSP (Stripe primary, Adyen fallback, PayPal for wallet payments)
  • Lifecycle: single charge (not auth + capture), which simplifies the state machine
  • Refunds: yes, full and partial. Processed as separate transactions in the ledger.
  • Recurring payments: no, out of scope for this design
  • Currency: multi-currency with FX conversion at transaction time
  • PCI-DSS: Level 1 compliance. No raw card data touches our servers (tokenization via PSP)
  • Fraud detection: inline, synchronous, must complete in <200ms before PSP call
  • Reconciliation: daily automated, with manual review queue for exceptions
  • Audit trail: every state change immutably logged, with 7-year retention (regulatory requirement)
📋 Chapter 2: Summary
  • Auth + capture vs single charge: determines transaction lifecycle complexity.
  • Multiple PSPs add routing decisions + fallback complexity.
  • PCI-DSS compliance determines where card data can live (tokenization required).
  • Reconciliation: daily matching of your records vs PSP records vs bank statements.
03
Chapter Three

Naive Design

Direct PSP Call Without Idempotency

The simplest design: user clicks "Pay" → app server calls Stripe's charge API synchronously → returns success/failure to user. No idempotency key. No local transaction record before calling the PSP. No handling of network timeouts. This works for a hackathon. In production, a network timeout after the PSP has charged the card means you don't know whether the charge succeeded, and if the user retries, they get charged twice. Without an idempotency key, the PSP cannot tell that the second request is a retry: it looks identical to a new payment, so the PSP processes a second charge.

Naive Design: Direct PSP Call Without Idempotency
User → App Server → Stripe API
  • User sends POST /pay; the app server calls charge($100) on Stripe
  • ⚡ Network timeout: the charge may have succeeded
  • User retries → Stripe charges again → DOUBLE CHARGE ($200)
  • User doesn't retry → payment lost, no record
  • No idempotency key: retry = duplicate charge
  • No local record: if the app server crashes post-charge, money is in limbo
  • No fallback: Stripe outage = 100% payment failure
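
The double charge is easy to reproduce. The sketch below uses a hypothetical `FakePSP` stand-in (not Stripe's API) that charges the card and then loses the response, which is what a timeout looks like from the app server's side:

```python
class FakePSP:
    """Stand-in for a PSP called without an idempotency key (hypothetical)."""

    def __init__(self):
        self.charges = []
        self.drop_next_response = False

    def charge(self, card, amount_cents):
        self.charges.append((card, amount_cents))  # the card IS charged here
        if self.drop_next_response:
            self.drop_next_response = False
            raise TimeoutError("response lost in transit")
        return "ok"


def naive_pay(psp, card, amount_cents):
    # No local record and no idempotency key: just call the PSP.
    return psp.charge(card, amount_cents)


psp = FakePSP()
psp.drop_next_response = True
try:
    naive_pay(psp, "card_1", 10_000)
except TimeoutError:
    # The user sees an error and clicks "Pay" again.
    naive_pay(psp, "card_1", 10_000)

total_charged = sum(amount for _, amount in psp.charges)
print(total_charged)  # 20000 cents: $200 charged for a $100 order
```

From the PSP's point of view both requests are valid, distinct payments, so nothing on its side can prevent the duplicate.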
✅

What Works

  • Simple: one HTTP call to Stripe
  • No local database needed (PSP is the source of truth)
  • Works for low volume, manual dispute resolution
  • PSP handles PCI, fraud, retry logic
💥

What Breaks

  • Network timeout: charge succeeded but you don't know
  • User retries: double charge (no idempotency key)
  • No local record: can't reconcile, can't audit
  • PSP outage: entire payment system down (no fallback)
  • Refund path unknown: no record of what to refund
📋 Chapter 3: Summary
  • Direct PSP call without idempotency: network timeout = unknown state = double charge risk.
  • No local transaction record: can't reconcile, can't audit, can't refund reliably.
  • Single PSP: outage = complete payment failure. No fallback routing.
04
Chapter Four

Refined Design

Idempotent Payment Pipeline with Double-Entry Ledger

The refined design treats every payment as a state machine with exactly-once guarantees. Before calling any PSP, the system creates a local transaction record with a unique idempotency key. The PSP call includes this key, so retries are safe. On success, the double-entry ledger records debits and credits. On failure or timeout, a reconciliation service resolves the ambiguity by querying the PSP with the idempotency key. The ledger is the source of truth, not the PSP.

Refined Design: Payment Pipeline
  • Payment Service: 1. Create txn (PENDING) 2. Check idempotency 3. Route to PSP
  • State machine: PENDING → PROCESSING → COMPLETED, FAILED, or UNKNOWN (→ reconcile)
  • Fraud Detection: inline pre-auth check (<100ms), returns approve/decline
  • Idempotency Store: Redis (TTL 24h). Check key: yes → return cached result / no → proceed
  • PSP Router: smart routing across Stripe, Adyen, and PayPal. 503/timeout → failover
  • Double-Entry Ledger: debit + credit per txn, written on COMPLETED
  • Reconciliation: queries the PSP for UNKNOWN/PENDING txns, resolves them, and writes the result
Every payment: Create PENDING record → Fraud check → Idempotency check → PSP call → Update ledger
On timeout: reconciliation queries PSP with idempotency key. Ledger = source of truth.
💰

Payment Flow

  • 1. Create transaction record (status: PENDING, with idempotency key)
  • 2. Route to optimal PSP (based on currency, card type, success rate)
  • 3. Call PSP with idempotency key (retries are safe)
  • 4. On success: update status → COMPLETED, write to ledger
  • 5. On failure: update status → FAILED, no ledger entry
  • 6. On timeout: status stays PENDING → reconciliation resolves
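
The six steps can be sketched as a single function. This is a simplified illustration, not a production design: `txns` and `ledger` are in-memory stand-ins for durable stores, `psp_charge` is a hypothetical callable, routing (step 2) is elided, and the `Declined` exception is an assumed way for the PSP client to signal a definite failure.

```python
import uuid
from enum import Enum


class Status(Enum):
    PENDING = "PENDING"
    COMPLETED = "COMPLETED"
    FAILED = "FAILED"


class Declined(Exception):
    """Assumed signal for a definite PSP decline."""


def process_payment(txns, ledger, psp_charge, key, amount_cents):
    # Step 1: write the local record BEFORE any PSP call.
    txns[key] = {"amount": amount_cents, "status": Status.PENDING}
    try:
        # Step 3: the PSP receives the same key on every retry.
        psp_charge(key, amount_cents)
    except Declined:
        txns[key]["status"] = Status.FAILED      # step 5: no ledger entry
    except TimeoutError:
        pass                                     # step 6: stays PENDING for reconciliation
    else:
        txns[key]["status"] = Status.COMPLETED   # step 4
        ledger.append((key, amount_cents))
    return txns[key]["status"]


def psp_ok(key, amount_cents):
    return "ok"


def psp_declined(key, amount_cents):
    raise Declined()


def psp_timeout(key, amount_cents):
    raise TimeoutError()


txns, ledger = {}, []
results = [
    process_payment(txns, ledger, fake_psp, str(uuid.uuid4()), 10_000)
    for fake_psp in (psp_ok, psp_declined, psp_timeout)
]
print([status.value for status in results])
```

The important property is that a timeout never produces a guess: the record stays PENDING until something authoritative resolves it.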
📒

Double-Entry Ledger

  • Every transaction = at least 2 entries (debit + credit)
  • Example: Customer account -$100, Merchant account +$97, Fee account +$3
  • Sum of all entries always = 0 (balanced books)
  • Append-only: never modify, only add correcting entries
  • Enables instant balance calculation and full audit trail
Double-Entry Ledger: Sample Payment of $100
Entry | Account             | Debit | Credit
1     | Customer wallet     | $100  | -
2     | Merchant receivable | -     | $97
3     | Platform fee        | -     | $3
Total |                     | $100  | $100

Sum of debits = Sum of credits = $100. Books always balance. Any discrepancy = bug detected immediately.
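
A minimal sketch of an append-only ledger that enforces the balance invariant on every write. The `Ledger` class is illustrative only; it uses signed integer cents (negative = debit, positive = credit, matching the -$100 / +$97 / +$3 example above) to avoid floating-point rounding.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    txn_id: str
    account: str
    amount_cents: int  # signed: negative = debit, positive = credit


class Ledger:
    """Append-only, in-memory ledger (a stand-in for a durable store)."""

    def __init__(self):
        self._entries = []

    def post(self, txn_id, movements):
        """Append every entry of a transaction, or none if they do not sum to zero."""
        if sum(movements.values()) != 0:
            raise ValueError(f"unbalanced transaction {txn_id}: {movements}")
        self._entries.extend(
            Entry(txn_id, account, amount) for account, amount in movements.items()
        )

    def balance(self, account):
        return sum(e.amount_cents for e in self._entries if e.account == account)

    def total(self):
        """Global invariant: must always be zero."""
        return sum(e.amount_cents for e in self._entries)


ledger = Ledger()
ledger.post("t1", {"customer": -10_000, "merchant": 9_700, "fee": 300})

try:
    ledger.post("t2", {"customer": -5_000, "merchant": 5_100})  # off by $1
    rejected = False
except ValueError:
    rejected = True

print(ledger.balance("merchant"), ledger.total(), rejected)
```

Rejecting unbalanced writes at the boundary means an imbalance can only come from a bug that bypasses `post`, which a background check on `total()` would then catch.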

The idempotency key is the single most critical design element. It is generated client-side (UUID) and sent with every payment request. The server stores it with the transaction. If the same key arrives again (retry after timeout), the server returns the existing result without re-processing. This makes the entire system retry-safe. Without it, every network timeout becomes a potential double charge. With it, retries are free and safe.

Idempotency Key Flow: Making Retries Safe
  • Client generates UUID: idempotency_key = "abc-123"
  • Client sends POST /payments { amount: $100, idempotency_key: "abc-123" }
  • Level 1, server: is key "abc-123" in Redis?
      YES → return cached result, no re-processing ("Already processed: $100, COMPLETED")
      NO → store key in Redis (TTL 24h), continue to Level 2
  • Level 2, PSP dedup: call PSP with key "abc-123". Has the PSP seen the key?
      YES → return same result
      NO → process charge
Two-level dedup: server Redis + PSP. Both must see the same key for bulletproof retry safety.
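
One detail the flow leaves implicit is that "check key, then store key" has to be a single atomic step, or two concurrent retries can both see "NO" and both proceed. The sketch below uses an in-memory `IdempotencyStore` with a lock as a stand-in for Redis; in Redis the same effect is usually achieved with `SET` plus the `NX` and `EX` options. The class, the `IN_PROGRESS` marker, and `handle_payment` are assumptions for illustration.

```python
import threading
import time


class IdempotencyStore:
    """In-memory stand-in for the Redis idempotency store (hypothetical API)."""

    IN_PROGRESS = object()

    def __init__(self, ttl_seconds=24 * 3600, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._lock = threading.Lock()
        self._data = {}  # key -> (expires_at, result)

    def claim(self, key):
        """Atomically claim a key.

        Returns (True, None) to the first caller, otherwise
        (False, stored result or IN_PROGRESS).
        """
        with self._lock:
            now = self._clock()
            hit = self._data.get(key)
            if hit is not None and hit[0] > now:
                return False, hit[1]
            self._data[key] = (now + self._ttl, self.IN_PROGRESS)
            return True, None

    def complete(self, key, result):
        with self._lock:
            expires_at, _ = self._data[key]
            self._data[key] = (expires_at, result)


def handle_payment(store, charge, key, amount_cents):
    first, cached = store.claim(key)
    if not first:
        return cached  # level 1: a retry never reaches the PSP from here
    result = charge(key, amount_cents)  # level 2: the PSP gets the same key
    store.complete(key, result)
    return result


psp_calls = []


def fake_charge(key, amount_cents):
    psp_calls.append(key)
    return {"status": "COMPLETED", "amount": amount_cents}


store = IdempotencyStore()
first_response = handle_payment(store, fake_charge, "abc-123", 10_000)
retry_response = handle_payment(store, fake_charge, "abc-123", 10_000)
print(len(psp_calls), first_response == retry_response)
```

If `charge` raises (for example on a timeout), the key stays `IN_PROGRESS`, so a retry gets "still processing" rather than a second charge; resolving that state is the reconciliation service's job.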
📋 Chapter 4: Summary
  • Idempotency key: client-generated UUID sent with every request. Retries are safe.
  • Local transaction record created BEFORE calling PSP. Source of truth is local.
  • PSP router: choose optimal provider per transaction (cost, success rate, currency).
  • Double-entry ledger: debit + credit for every transaction. Sum always zero.
  • Reconciliation: resolve PENDING transactions by querying PSP with idempotency key.
05
Chapter Five

Alternative Approaches

Transaction Processing Models
Synchronous (Inline Processing)
  • Process payment in the API request path: user waits
  • Immediate success/failure feedback to user (~2s)
  • Simple to reason about: request-response model
  • Problem: PSP timeout blocks user for 30+ seconds
  • Good for: card payments (fast PSP response)
  • Used by: Standard checkout flows
Asynchronous (Queue-Based)
  • Accept payment intent → queue for processing → notify on completion
  • User gets "Processing..." immediately, notification when done
  • Can retry, route to backup PSP, handle complex workflows
  • More complex UX (user not sure if it worked immediately)
  • Good for: bank transfers, crypto, complex multi-step payments
  • Used by: ACH transfers, SEPA, batch settlements
Single PSP (Simple)
  • All payments go to one provider (e.g., just Stripe)
  • Simple integration: one SDK, one dashboard
  • Single point of failure: PSP outage = no payments
  • No cost optimization (can't route to cheapest)
  • Good for: startups, low volume, single region
Multi-PSP with Smart Routing
  • Multiple PSPs. Route each transaction to optimal provider.
  • Routing criteria: cost, success rate, currency, card type, region
  • Failover: if primary PSP fails, route to secondary
  • A/B testing of PSPs to measure real success rates
  • Good for: high volume, global, cost-sensitive
  • Used by: Uber, Airbnb, large marketplaces
PSP Smart Routing: Decision Logic
  • Transaction arrives → check currency
  • EUR → Adyen (EU-optimized)
  • USD / other → check card type
  • Amex → Stripe (Amex rates)
  • Visa/MC → Stripe if it is healthy, otherwise Adyen (fallback)
Routing rules change dynamically based on real-time PSP health and success rates. A/B test PSPs per (card_type, region) tuple to measure actual authorization rates.

PSP success rates vary by card type, region, and time of day. As an illustration, a Visa card issued in Germany might see a 95% success rate on Adyen (EU-optimized) but only 88% on Stripe for that region. By routing intelligently and measuring real success rates per (PSP, card_type, region) tuple, large payment processors can improve overall authorization rates by several percentage points (3-7 is the range cited here). At $1T annual volume, a 1% improvement in success rate means $10B more processed. Smart routing is not just cost optimization; it is revenue.
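
A minimal sketch of routing on measured success rates per (PSP, card_type, region) tuple. The `SuccessRateRouter` class is hypothetical, and it deliberately ignores cost, minimum sample sizes, and the A/B exploration a real router would need. The sample counts reuse the 95% vs 88% illustration above.

```python
from collections import defaultdict


class SuccessRateRouter:
    """Picks the healthy PSP with the best observed authorization rate (illustrative)."""

    def __init__(self, psps):
        self._psps = list(psps)  # list order doubles as the tie-break preference
        # (psp, card_type, region) -> [approved, attempts]
        self._stats = defaultdict(lambda: [0, 0])

    def record(self, psp, card_type, region, approved):
        stat = self._stats[(psp, card_type, region)]
        stat[0] += int(approved)
        stat[1] += 1

    def rate(self, psp, card_type, region):
        approved, attempts = self._stats.get((psp, card_type, region), (0, 0))
        return approved / attempts if attempts else 0.0

    def choose(self, card_type, region, healthy):
        candidates = [p for p in self._psps if p in healthy]
        if not candidates:
            raise RuntimeError("no healthy PSP available")
        return max(candidates, key=lambda p: self.rate(p, card_type, region))


router = SuccessRateRouter(["stripe", "adyen"])
for i in range(100):
    router.record("adyen", "visa", "DE", approved=i < 95)   # 95% observed
    router.record("stripe", "visa", "DE", approved=i < 88)  # 88% observed

normal_choice = router.choose("visa", "DE", healthy={"stripe", "adyen"})
failover_choice = router.choose("visa", "DE", healthy={"stripe"})

# The revenue arithmetic from the paragraph above, in whole dollars.
extra_volume = 1_000_000_000_000 * 1 // 100

print(normal_choice, failover_choice, extra_volume)
```

The `healthy` set is where circuit breakers or health checks would plug in: an unhealthy PSP is simply removed from the candidates.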

📋 Chapter 5: Summary
  • Synchronous: user waits for result. Simple, good UX for fast PSPs (cards).
  • Asynchronous: accept + process later. Good for slow methods (bank transfers).
  • Single PSP: simple. SPOF risk, no cost optimization.
  • Multi-PSP routing: failover + cost optimization + higher success rates.
06
Chapter Six

What Real Companies Did

Production Payment Systems
💳

Stripe

  • Payment intent model: create intent → confirm → charge
  • Built-in idempotency key support on all API calls
  • Webhook-based async notifications for payment status
  • Radar: ML-based fraud detection (inline, <100ms)
  • Multi-region: processes in EU, US, APAC independently
🛒

Amazon Pay

  • Authorize at checkout, capture at shipment (2-phase)
  • 1-Click: stored tokens with session-based auth
  • Custom ledger system: handles refunds, chargebacks inline
  • Multi-PSP routing across 100+ global acquirers
  • Reconciliation: automated daily match of 100M+ transactions
📱

Square

  • Offline-first: POS payments stored locally, synced later
  • Idempotency built into every SDK call
  • Instant deposits: pre-fund from cash reserves before settlement
  • Hardware + software: custom card readers with E2E encryption
  • Handles intermittent connectivity gracefully
🏦

Wise (TransferWise)

  • Cross-border payments: match buyers and sellers of currency
  • Local payment rails: ACH (US), SEPA (EU), FPS (UK)
  • Pool-based: collect in one currency, payout in another from local pool
  • Reconciliation across 50+ banking partners daily
  • Transparency: show real mid-market rate, explicit fee
Production Payment Systems: Comparison
Company | PSP Model | Idempotency | Special Pattern | Scale
Stripe | Is the PSP, direct card network access | Built into every API endpoint | Payment Intents model, Radar ML fraud | Millions of businesses
Amazon | 100+ acquirers, own routing | Session-based + request-level | 2-phase auth+capture, instant reconciliation | 100M+ txns/day
Square | Offline-first with local queue | SDK-level per request | Hardware E2E encryption, instant deposits from reserve | ~$200B annual vol.
Wise | Local bank rails (ACH, SEPA, FPS) | Pool-matching idempotency | Currency pooling avoids FX, local-to-local transfers | 16M+ customers, 50+ currencies
📋 Chapter 6: Summary
  • Stripe: payment intents, built-in idempotency, ML fraud (Radar).
  • Amazon: 2-phase (auth+capture), 100+ acquirers, automated reconciliation.
  • Square: offline-first POS, hardware encryption, instant deposits.
  • Wise: cross-border pooling, local payment rails, transparent FX.
07
Chapter Seven

Best Practices Extracted

Transferable Lessons
🔑

Idempotency Everywhere

  • Every mutation endpoint accepts an idempotency key
  • Server checks: already processed? Return cached result.
  • Client retries freely without fear of duplicates
  • Store keys with TTL (24-72h)
  • Transfers to: any system where duplicates cause harm
📒

Double-Entry Ledger

  • Every money movement = debit one account, credit another
  • Sum of all entries = 0 (invariant check)
  • Append-only: never delete or modify entries
  • Corrections via reversing entries (not edits)
  • Transfers to: any system tracking balance/inventory
🔄

Reconciliation

  • Daily comparison: your ledger vs PSP records vs bank statements
  • Differences flagged for manual review
  • Automated resolution for common patterns (timeouts)
  • Nothing is "eventually consistent": differences must be zero
  • Transfers to: any multi-system data consistency
Daily Reconciliation: Three-Way Match
Txn | Your Ledger | PSP Records | Bank Statement | Result
txn_001 | $100 | $100 | settlement: $97 | ✓ matched
txn_002 | $50 | $50 | settlement: $47 | ✓ matched
txn_003 | $75 | ??? | settlement: $72 | ✗ missing
txn_004 | ??? | $200 | ??? | ✗ missing
  • txn_003 missing from PSP: PSP webhook missed → query API to resolve
  • txn_004 missing from ledger: PSP charged but we never recorded → money in limbo
Reconciliation target: zero unmatched rows. Differences = bugs or fraud. Run daily. Alert immediately on any mismatch. Manual review queue for exceptions.
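
The three-way match reduces to comparing three keyed data sets. A minimal sketch using the four sample transactions above; the `reconcile` function and the dict-based inputs are assumptions, and the check that the bank's net settlement equals the gross amount minus fees is left out for brevity.

```python
def reconcile(ledger, psp, bank):
    """Return {txn_id: reason} for every row that needs review.

    Each argument maps txn_id -> amount in cents. For the bank, the amount
    is the settled figure net of fees, so it is only checked for presence here.
    """
    sources = (("ledger", ledger), ("psp", psp), ("bank", bank))
    exceptions = {}
    for txn_id in sorted(set(ledger) | set(psp) | set(bank)):
        missing = [name for name, records in sources if txn_id not in records]
        if missing:
            exceptions[txn_id] = "missing from " + ", ".join(missing)
        elif ledger[txn_id] != psp[txn_id]:
            exceptions[txn_id] = "amount mismatch: ledger vs psp"
    return exceptions


ledger_rows = {"txn_001": 10_000, "txn_002": 5_000, "txn_003": 7_500}
psp_rows = {"txn_001": 10_000, "txn_002": 5_000, "txn_004": 20_000}
bank_rows = {"txn_001": 9_700, "txn_002": 4_700, "txn_003": 7_200}

review_queue = reconcile(ledger_rows, psp_rows, bank_rows)
for txn_id, reason in review_queue.items():
    print(txn_id, reason)
```

An empty result is the target state; anything else goes to automated resolution (for known patterns such as timeouts) or to the manual review queue.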

The double-entry ledger is accounting's 700-year-old distributed systems pattern. Every financial movement creates at least two entries that sum to zero. If your books don't balance, something went wrong, and you know immediately. This is not just accounting convention; it is an invariant that makes bugs self-revealing. Apply this pattern anywhere you track quantities that must be conserved: inventory, credits, loyalty points.

📋 Chapter 7: Summary
  • Idempotency: every write endpoint. Retries are safe. Keys stored with TTL.
  • Double-entry: debit + credit per movement. Sum = 0 invariant catches bugs immediately.
  • Reconciliation: daily truth-matching across all systems. Zero tolerance for differences.
08
Chapter Eight

What Could Go Wrong

Common Failure Patterns
👯

Double Charge

  • User clicks "Pay", timeout, clicks again → two charges
  • Root cause: no idempotency key, or a key generated server-side per request (so each retry gets a new key)
  • Customer sees two charges → disputes → chargeback → you pay a fee per dispute ($25 in this example)
  • Fix: client-generated idempotency key sent with request. Server deduplicates. Disable button after first click.
Double Charge: Without vs With Idempotency Key
Without Idempotency Key (broken):
  • t=0: POST /pay {$100}
  • t=1: Stripe charges ✓
  • t=2: ⚡ timeout (no response)
  • t=3: Retry: NEW /pay {$100}
  • t=4: Stripe charges AGAIN ✓
  • Result: user charged $200. Dispute filed. $25 chargeback fee.
With Idempotency Key (correct):
  • t=0: POST /pay {$100, key:"abc"}
  • t=1: Stripe charges ✓
  • t=2: ⚡ timeout
  • t=3: Retry: SAME key "abc"
  • t=4: Redis: key exists → cached ✓
  • Result: no second charge. User charged $100. Correct. ✓
Same user action, same click. Only difference: a UUID sent with the request.
🕳️

Money in Limbo

  • PSP charged the customer but your system crashed before recording success
  • Customer charged but no order confirmed. Money "disappeared."
  • Manual resolution required: check PSP dashboard, match to customer
  • Fix: write transaction record (PENDING) BEFORE calling PSP. Reconciliation service resolves unknowns. Webhook confirmation as secondary signal.
Write-Before-Call: Why PENDING Must Exist Before the PSP Call
Wrong Order (crash risk):
  • 1. Call PSP → charge succeeds ($100)
  • 2. 💥 CRASH: system never writes local record
  • Result: customer charged, no order, no record. MONEY IN LIMBO ✕
Correct Order (recoverable):
  • 1. Write record: {txn: "t123", status: PENDING}
  • 2. Call PSP with idempotency key → charge ✓
  • 3. Update: status → COMPLETED, write ledger
  • If 💥 CRASH between steps 2 and 3: reconciliation sees PENDING txn → queries PSP → finds it succeeded → updates to COMPLETED ✓
  • Result: ALWAYS RECOVERABLE ✓
The PENDING record is the safety net. Without it, crashes after PSP calls create unrecoverable limbo.
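
The recovery path can be sketched as a periodic sweep over PENDING records. Everything here is an assumption for illustration: `psp_lookup` stands for "query the PSP by idempotency key" and returns "succeeded", "failed", or None when the PSP has never seen the key; the 15-minute grace period is an arbitrary choice.

```python
def sweep_pending(txns, psp_lookup, ledger, now, grace_seconds=900):
    """Resolve PENDING transactions against the PSP's view of the world.

    txns maps key -> {"amount": int, "status": str, "created_at": float}.
    """
    for key, txn in txns.items():
        if txn["status"] != "PENDING":
            continue
        outcome = psp_lookup(key)
        if outcome == "succeeded":
            txn["status"] = "COMPLETED"
            ledger.append((key, txn["amount"]))
        elif outcome == "failed":
            txn["status"] = "FAILED"
        elif outcome is None and now - txn["created_at"] > grace_seconds:
            # Assumption: after the grace period the original request can no
            # longer reach the PSP, so it is safe to mark the payment failed.
            txn["status"] = "FAILED"
        # Otherwise leave it PENDING and look again on the next sweep.


psp_view = {"t1": "succeeded", "t2": "failed"}  # t3 and t4 never reached the PSP
txns = {
    "t1": {"amount": 10_000, "status": "PENDING", "created_at": 0.0},
    "t2": {"amount": 5_000, "status": "PENDING", "created_at": 0.0},
    "t3": {"amount": 7_500, "status": "PENDING", "created_at": 950.0},
    "t4": {"amount": 2_000, "status": "PENDING", "created_at": 0.0},
}
ledger = []
sweep_pending(txns, psp_view.get, ledger, now=1_000.0)
print({key: txn["status"] for key, txn in txns.items()})
```

Only a "succeeded" answer from the PSP produces a ledger entry, so the sweep can run repeatedly without creating duplicates for already resolved transactions.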
📊

Ledger Imbalance

  • Bug creates credit entry without corresponding debit (or vice versa)
  • Books don't balance → financial reports are wrong → audit failure
  • Undetected for days if invariant not checked continuously
  • Fix: assert sum=0 invariant on every write. Continuous background audit. Alert immediately on imbalance.
🔥

PSP Outage Cascade

  • Primary PSP goes down → all payments fail → revenue stops
  • Retry storms to failing PSP → exhaust connection pools
  • No fallback configured → 100% payment failure rate
  • Fix: multi-PSP with automatic failover. Circuit breaker on each PSP. Health-check routing: route away from degraded PSPs within seconds.
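
A minimal sketch of a per-PSP circuit breaker with failover. The class, thresholds, and the `charge_with_failover` helper are illustrative assumptions. One deliberate choice: the sketch fails over only on `ConnectionError` (the request definitely did not go through). A timeout is ambiguous, and retrying an ambiguous charge on a different PSP could double charge, because each PSP deduplicates only its own keys; here a `TimeoutError` simply propagates so the payment stays PENDING.

```python
import time


class CircuitBreaker:
    """Opens after repeated failures; allows a trial call after a cool-down."""

    def __init__(self, failure_threshold=5, reset_after=30.0, clock=time.monotonic):
        self._threshold = failure_threshold
        self._reset_after = reset_after
        self._clock = clock
        self._failures = 0
        self._opened_at = None

    def allow(self):
        if self._opened_at is None:
            return True
        # Half-open: once the cool-down has passed, let a trial request through.
        return self._clock() - self._opened_at >= self._reset_after

    def record_success(self):
        self._failures = 0
        self._opened_at = None

    def record_failure(self):
        self._failures += 1
        if self._failures >= self._threshold:
            self._opened_at = self._clock()


def charge_with_failover(psps, breakers, key, amount_cents):
    """psps is an ordered {name: charge_fn}; try each PSP whose breaker allows it."""
    for name, charge in psps.items():
        breaker = breakers[name]
        if not breaker.allow():
            continue  # skip a PSP known to be failing: no retry storm
        try:
            result = charge(key, amount_cents)
        except ConnectionError:
            breaker.record_failure()
            continue
        breaker.record_success()
        return name, result
    raise RuntimeError("no PSP available")


fake_now = [0.0]
stripe_attempts = []


def stripe_down(key, amount_cents):
    stripe_attempts.append(key)
    raise ConnectionError("503 from primary")


def adyen_up(key, amount_cents):
    return "ok"


psps = {"stripe": stripe_down, "adyen": adyen_up}
breakers = {
    name: CircuitBreaker(failure_threshold=2, reset_after=30.0, clock=lambda: fake_now[0])
    for name in psps
}
served_by = [charge_with_failover(psps, breakers, f"key-{i}", 10_000)[0] for i in range(3)]
print(served_by, len(stripe_attempts))
```

After two failures the primary's breaker opens, so the third payment goes straight to the fallback without touching the failing PSP.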

Every payment system failure has legal and financial consequences. A double charge is not just a bug: it can trigger a chargeback (a $25 fee in the example above), damage your merchant reputation score (too many chargebacks and the PSP may drop you), and in some jurisdictions violate consumer protection laws. Payment system bugs cost real money. This is why exactly-once processing, idempotency, and reconciliation are not optional; they are the foundation.

📋 Chapter 8: Summary
  • Double charge: client-side idempotency key. Server deduplicates. Button disable on click.
  • Money in limbo: write PENDING record before PSP call. Reconciliation resolves unknowns.
  • Ledger imbalance: assert sum=0 on every write. Continuous background audit.
  • PSP outage: multi-PSP routing + circuit breaker + health-check failover.
  • Principle: payment bugs are legal liability. Correctness is the only option.