Case Study: Payment System
Design, trade-offs, and alternatives for a payment system at scale.
Problem Statement
A payment system processes financial transactions: charging customers, settling with merchants, handling refunds, and maintaining an auditable ledger. The defining constraint is correctness over speed. Unlike chat or feeds, where eventual consistency is acceptable, a payment must be processed exactly once: never duplicated, never lost. A double charge loses customer trust instantly. A lost payment loses merchant trust. The system must be correct in the presence of network failures, timeouts, retries, and partial outages.
Traffic & Scale
- 1B transactions/day (~12K TPS average)
- Peak: 5x average, roughly 60K TPS (Black Friday)
- $1T+ annual payment volume
- Multiple currencies, payment methods, and regions
Requirements
- Exactly-once processing: no double charges, no lost payments
- Latency: <2 seconds end-to-end for card payment
- Availability: 99.999% (about 5 minutes of downtime per year)
- Audit trail: every state change immutably logged
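The scale and availability figures above follow from simple arithmetic. The sketch below reproduces them; the function names are illustrative and not part of any real system.

```python
SECONDS_PER_DAY = 24 * 60 * 60      # 86,400
MINUTES_PER_YEAR = 365 * 24 * 60    # 525,600 (non-leap year)


def average_tps(transactions_per_day: int) -> float:
    """Average transactions per second for a given daily volume."""
    return transactions_per_day / SECONDS_PER_DAY


def downtime_minutes_per_year(availability: float) -> float:
    """Yearly downtime budget implied by an availability target."""
    return (1.0 - availability) * MINUTES_PER_YEAR


avg = average_tps(1_000_000_000)            # 1B transactions/day
peak = 5 * avg                              # 5x peak multiplier
budget = downtime_minutes_per_year(0.99999)

print(f"average ~{avg:,.0f} TPS, peak ~{peak:,.0f} TPS, "
      f"downtime budget ~{budget:.2f} min/year")
```

The exact values are about 11.6K TPS on average and about 58K TPS at peak; the 12K and 60K figures in the list are the same numbers rounded up for planning. Five nines allows roughly 5.26 minutes of downtime per year.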
Payment systems are the one kind of system in this series where "eventual consistency" can mean legal liability. If you double-charge a customer, you may violate consumer protection laws. If you lose a transaction record, you can fail a PCI-DSS audit. If your reconciliation is off by $0.01, your financial reports are incorrect. Every other system in this series can tolerate small inconsistencies. Payments cannot. Correctness is not a goal; it is the only acceptable outcome.
- 1B transactions/day, 60K TPS at peak. $1T+ annual volume.
- Exactly-once: no duplicate charges, no lost payments. Non-negotiable.
- 99.999% availability: about 5 minutes of downtime per year.
- Full audit trail: every state change logged immutably for compliance.
Questions to Ask
Payment Methods
- Credit/debit cards only? Or wallets, bank transfers, crypto?
- Multiple PSPs (Stripe, Adyen, PayPal) or single?
- Token vault for stored cards?
- 3D Secure / SCA required? (EU regulation)
Payment Lifecycle
- Authorize + capture separate? Or single charge?
- Refunds supported? Partial refunds?
- Recurring/subscription payments?
- Multi-currency with FX conversion?
Compliance & Reporting
- PCI-DSS compliance level?
- End-of-day reconciliation with banks/PSPs?
- Regulatory reporting (regional)?
- Fraud detection inline or async?
Authorize vs capture is the most important lifecycle question. In an auth + capture model, you place a hold on funds but don't charge until fulfillment (e.g., Amazon charges when the item ships). This means you need two-phase transaction handling, hold expiry management, and the ability to void uncaptured authorizations. Single-charge is simpler but means you charge before fulfilling, which creates refund volume if fulfillment fails.
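To make the extra lifecycle complexity concrete, here is a minimal sketch of the states an authorization hold can move through. The state names and the `transition` helper are assumptions chosen for illustration; real PSPs define their own states and expiry rules.

```python
from enum import Enum


class AuthState(Enum):
    AUTHORIZED = "authorized"   # funds are held, not yet charged
    CAPTURED = "captured"       # hold converted into a charge
    VOIDED = "voided"           # hold released by the merchant
    EXPIRED = "expired"         # hold lapsed before capture


# Only an open authorization can change state; the other three are terminal.
ALLOWED = {
    AuthState.AUTHORIZED: {AuthState.CAPTURED, AuthState.VOIDED, AuthState.EXPIRED},
    AuthState.CAPTURED: set(),
    AuthState.VOIDED: set(),
    AuthState.EXPIRED: set(),
}


def transition(current: AuthState, target: AuthState) -> AuthState:
    """Return the new state, or raise if the move is not allowed."""
    if target not in ALLOWED[current]:
        raise ValueError(f"illegal transition {current.name} -> {target.name}")
    return target
```

A single-charge model collapses all of this into one call, which is the simplification the case study opts for.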
For This Case Study, Our Answers Are:
- Payment methods: credit/debit cards (Visa, Mastercard, Amex) + digital wallets (Apple Pay, Google Pay)
- PSP model: multi-PSP (Stripe primary, Adyen fallback, PayPal for wallet payments)
- Lifecycle: single charge (not auth + capture), which simplifies the state machine
- Refunds: yes, full and partial. Processed as separate transactions in the ledger.
- Recurring payments: no β out of scope for this design
- Currency: multi-currency with FX conversion at transaction time
- PCI-DSS: Level 1 compliance. No raw card data touches our servers (tokenization via PSP)
- Fraud detection: inline, synchronous, must complete in <200ms before PSP call
- Reconciliation: daily automated, with manual review queue for exceptions
- Audit trail: every state change immutably logged, with 7-year retention (regulatory requirement)
- Auth + capture vs single charge: determines transaction lifecycle complexity.
- Multiple PSPs add routing decisions + fallback complexity.
- PCI-DSS compliance determines where card data can live (tokenization required).
- Reconciliation: daily matching of your records vs PSP records vs bank statements.
Naive Design
The simplest design: user clicks "Pay" → app server calls Stripe's charge API synchronously → returns success/failure to user. No idempotency key. No local transaction record before calling the PSP. No handling of network timeouts. This works for a hackathon. In production, a network timeout after the PSP charged the card means you don't know if the charge succeeded, and if the user retries, they get charged twice. The PSP has no way to know the second request is a retry: without an idempotency key, it looks identical to a new payment, so the PSP processes a second charge.
What Works
- Simple: one HTTP call to Stripe
- No local database needed (PSP is the source of truth)
- Works for low volume, manual dispute resolution
- PSP handles PCI, fraud, retry logic
What Breaks
- Network timeout: charge succeeded but you don't know
- User retries: double charge (no idempotency key)
- No local record: can't reconcile, can't audit
- PSP outage: entire payment system down (no fallback)
- Refund path unknown: no record of what to refund
- Direct PSP call without idempotency: network timeout = unknown state = double charge risk.
- No local transaction record: can't reconcile, can't audit, can't refund reliably.
- Single PSP: outage = complete payment failure. No fallback routing.
Refined Design
The refined design treats every payment as a state machine with exactly-once guarantees. Before calling any PSP, the system creates a local transaction record with a unique idempotency key. The PSP call includes this key, so retries are safe. On success, the double-entry ledger records debits and credits. On failure or timeout, a reconciliation service resolves the ambiguity by querying the PSP with the idempotency key. The ledger is the source of truth, not the PSP.
Payment Flow
- 1. Create transaction record (status: PENDING, with idempotency key)
- 2. Route to optimal PSP (based on currency, card type, success rate)
- 3. Call PSP with idempotency key (retries are safe)
- 4. On success: update status → COMPLETED, write to ledger
- 5. On failure: update status → FAILED, no ledger entry
- 6. On timeout: status stays PENDING → reconciliation resolves
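The flow above can be sketched as a small state machine. Everything here is a simplified assumption: `charge` stands in for whichever PSP client the router selected, `PspDeclined` and `PspTimeout` are hypothetical exception types, and the in-memory `store` replaces a real database.

```python
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    PENDING = "PENDING"
    COMPLETED = "COMPLETED"
    FAILED = "FAILED"


class PspDeclined(Exception):
    """Hypothetical: the PSP gave a definite 'no'."""


class PspTimeout(Exception):
    """Hypothetical: no answer arrived, so the outcome is unknown."""


@dataclass
class Transaction:
    idempotency_key: str
    amount_cents: int
    currency: str
    status: Status = Status.PENDING


def create_transaction(store, idempotency_key, amount_cents, currency):
    # Step 1: persist the PENDING record before any PSP call is made.
    txn = Transaction(idempotency_key, amount_cents, currency)
    store[idempotency_key] = txn
    return txn


def process_payment(txn, charge, write_ledger):
    # Steps 3-6. Step 2 (routing) is assumed to have picked `charge`.
    try:
        charge(txn)                      # PSP call carries the idempotency key
    except PspTimeout:
        return txn                       # unknown outcome: stays PENDING
    except PspDeclined:
        txn.status = Status.FAILED       # definite failure: no ledger entry
        return txn
    write_ledger(txn)                    # success: record the money movement
    txn.status = Status.COMPLETED
    return txn
```

In a real system the ledger write and the status update would share one database transaction so that a crash cannot leave one without the other.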
Double-Entry Ledger
- Every transaction = at least 2 entries (debit + credit)
- Example: Customer account -$100, Merchant account +$97, Fee account +$3
- Sum of all entries always = 0 (balanced books)
- Append-only: never modify, only add correcting entries
- Enables instant balance calculation and full audit trail
| Entry | Account | Debit | Credit |
|---|---|---|---|
| 1 | Customer wallet | $100 | |
| 2 | Merchant receivable | | $97 |
| 3 | Platform fee | | $3 |
| Total | | $100 | $100 |
Sum of debits = Sum of credits = $100. Books always balance. Any discrepancy = bug detected immediately.
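A minimal sketch of an append-only ledger that enforces this invariant on every write. The `Entry` and `Ledger` classes and the account names are illustrative assumptions; amounts use `Decimal` to avoid floating-point rounding.

```python
from dataclasses import dataclass
from decimal import Decimal

ZERO = Decimal("0")


@dataclass(frozen=True)
class Entry:
    account: str
    debit: Decimal = ZERO
    credit: Decimal = ZERO


class Ledger:
    def __init__(self):
        self._entries = []  # append-only: entries are never edited or removed

    def post(self, entries):
        """Append a group of entries only if debits equal credits."""
        debits = sum((e.debit for e in entries), ZERO)
        credits = sum((e.credit for e in entries), ZERO)
        if debits != credits:
            raise ValueError(f"unbalanced posting: debits {debits} != credits {credits}")
        self._entries.extend(entries)

    def balance(self, account):
        """Credits minus debits for one account."""
        return sum((e.credit - e.debit for e in self._entries if e.account == account), ZERO)


ledger = Ledger()
ledger.post([
    Entry("customer_wallet", debit=Decimal("100")),
    Entry("merchant_receivable", credit=Decimal("97")),
    Entry("platform_fee", credit=Decimal("3")),
])
```

An unbalanced posting is rejected as a whole, so a bug that forgets one side of a movement fails loudly at write time instead of surfacing later in reconciliation.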
The idempotency key is the single most critical design element. It is generated client-side (UUID) and sent with every payment request. The server stores it with the transaction. If the same key arrives again (retry after timeout), the server returns the existing result without re-processing. This makes the entire system retry-safe. Without it, every network timeout becomes a potential double charge. With it, retries are free and safe.
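The server-side check described above can be sketched as follows. The `IdempotentPayments` class and its `charge` callback are hypothetical; a production version would back the lookup with a unique database constraint and would also handle two requests with the same key arriving concurrently.

```python
import uuid


class IdempotentPayments:
    def __init__(self, charge):
        self._charge = charge    # assumed callable that performs the real charge
        self._results = {}       # idempotency key -> stored result

    def pay(self, idempotency_key, amount_cents):
        # Same key seen before: return the stored result, do not charge again.
        if idempotency_key in self._results:
            return self._results[idempotency_key]
        result = self._charge(amount_cents)
        self._results[idempotency_key] = result
        return result


# Client side: generate the key once per payment and reuse it on every retry.
calls = []
service = IdempotentPayments(lambda cents: calls.append(cents) or {"charged": cents})
key = str(uuid.uuid4())
first = service.pay(key, 10_000)
retry = service.pay(key, 10_000)   # e.g. the user retried after a timeout
```

Here `calls` ends up with a single entry even though `pay` ran twice, which is the property that makes retries safe.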
- Idempotency key: client-generated UUID sent with every request. Retries are safe.
- Local transaction record created BEFORE calling PSP. Source of truth is local.
- PSP router: choose optimal provider per transaction (cost, success rate, currency).
- Double-entry ledger: debit + credit for every transaction. Sum always zero.
- Reconciliation: resolve PENDING transactions by querying PSP with idempotency key.
Alternative Approaches
Synchronous Processing
- Process payment in the API request path: user waits
- Immediate success/failure feedback to user (~2s)
- Simple to reason about: request-response model
- Problem: PSP timeout blocks user for 30+ seconds
- Good for: card payments (fast PSP response)
- Used by: Standard checkout flows
Asynchronous Processing
- Accept payment intent → queue for processing → notify on completion
- User gets "Processing..." immediately, notification when done
- Can retry, route to backup PSP, handle complex workflows
- More complex UX (user not sure if it worked immediately)
- Good for: bank transfers, crypto, complex multi-step payments
- Used by: ACH transfers, SEPA, batch settlements
Single PSP
- All payments go to one provider (e.g., just Stripe)
- Simple integration: one SDK, one dashboard
- Single point of failure: PSP outage = no payments
- No cost optimization (can't route to cheapest)
- Good for: startups, low volume, single region
Multi-PSP Routing
- Multiple PSPs. Route each transaction to optimal provider.
- Routing criteria: cost, success rate, currency, card type, region
- Failover: if primary PSP fails, route to secondary
- A/B testing of PSPs to measure real success rates
- Good for: high volume, global, cost-sensitive
- Used by: Uber, Airbnb, large marketplaces
PSP success rates vary by card type, region, and time of day. A Visa card issued in Germany may have a 95% success rate on Adyen (EU-optimized) but only 88% on Stripe for that region. By routing intelligently and measuring real success rates per (PSP, card_type, region) tuple, large payment processors improve overall authorization rates by 3-7 percentage points. At $1T annual volume, a 1% improvement in success rate = $10B more processed. Smart routing is not just cost optimization; it is revenue.
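One way to sketch per-tuple measurement is a router that tracks outcomes per (PSP, card_type, region) and picks the provider with the best estimated rate. The `SuccessRateRouter` class is an illustrative assumption, and the add-one smoothing is a choice made here so that a PSP with no data starts at 50% rather than dividing by zero. The 95% and 88% figures are the illustrative numbers from the paragraph above, not measurements.

```python
class SuccessRateRouter:
    def __init__(self, psps):
        self._psps = list(psps)
        self._stats = {}  # (psp, card_type, region) -> (successes, attempts)

    def record(self, psp, card_type, region, success):
        """Record the outcome of one authorization attempt."""
        key = (psp, card_type, region)
        successes, attempts = self._stats.get(key, (0, 0))
        self._stats[key] = (successes + (1 if success else 0), attempts + 1)

    def estimated_rate(self, psp, card_type, region):
        """Smoothed success rate: (successes + 1) / (attempts + 2)."""
        successes, attempts = self._stats.get((psp, card_type, region), (0, 0))
        return (successes + 1) / (attempts + 2)

    def choose(self, card_type, region):
        """Pick the PSP with the highest estimate; ties go to the first listed."""
        return max(self._psps, key=lambda p: self.estimated_rate(p, card_type, region))


router = SuccessRateRouter(["stripe", "adyen"])
for i in range(100):
    router.record("adyen", "visa", "DE", success=i < 95)   # 95 of 100 succeed
    router.record("stripe", "visa", "DE", success=i < 88)  # 88 of 100 succeed

best = router.choose("visa", "DE")
```

A production router would also weigh cost and currency support, and would keep sending a small share of traffic to the lower-ranked PSP so its estimate stays current.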
- Synchronous: user waits for result. Simple, good UX for fast PSPs (cards).
- Asynchronous: accept + process later. Good for slow methods (bank transfers).
- Single PSP: simple. SPOF risk, no cost optimization.
- Multi-PSP routing: failover + cost optimization + higher success rates.
What Real Companies Did
Stripe
- Payment intent model: create intent → confirm → charge
- Built-in idempotency key support on all API calls
- Webhook-based async notifications for payment status
- Radar: ML-based fraud detection (inline, <100ms)
- Multi-region: processes in EU, US, APAC independently
Amazon Pay
- Authorize at checkout, capture at shipment (2-phase)
- 1-Click: stored tokens with session-based auth
- Custom ledger system: handles refunds, chargebacks inline
- Multi-PSP routing across 100+ global acquirers
- Reconciliation: automated daily match of 100M+ transactions
Square
- Offline-first: POS payments stored locally, synced later
- Idempotency built into every SDK call
- Instant deposits: pre-fund from cash reserves before settlement
- Hardware + software: custom card readers with E2E encryption
- Handles intermittent connectivity gracefully
Wise (TransferWise)
- Cross-border payments: match buyers and sellers of currency
- Local payment rails: ACH (US), SEPA (EU), FPS (UK)
- Pool-based: collect in one currency, payout in another from local pool
- Reconciliation across 50+ banking partners daily
- Transparency: show real mid-market rate, explicit fee
- Stripe: payment intents, built-in idempotency, ML fraud (Radar).
- Amazon: 2-phase (auth+capture), 100+ acquirers, automated reconciliation.
- Square: offline-first POS, hardware encryption, instant deposits.
- Wise: cross-border pooling, local payment rails, transparent FX.
Best Practices Extracted
Idempotency Everywhere
- Every mutation endpoint accepts an idempotency key
- Server checks: already processed? Return cached result.
- Client retries freely without fear of duplicates
- Store keys with TTL (24-72h)
- Transfers to: any system where duplicates cause harm
Double-Entry Ledger
- Every money movement = debit one account, credit another
- Sum of all entries = 0 (invariant check)
- Append-only: never delete or modify entries
- Corrections via reversing entries (not edits)
- Transfers to: any system tracking balance/inventory
Reconciliation
- Daily comparison: your ledger vs PSP records vs bank statements
- Differences flagged for manual review
- Automated resolution for common patterns (timeouts)
- Nothing is "eventually consistent": differences must be zero
- Transfers to: any multi-system data consistency
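A minimal sketch of the daily comparison, assuming both sides have already been reduced to a mapping of transaction ID to amount in cents. The `reconcile` function and its result keys are illustrative; a real job would compare a third source (bank statements) and more fields than the amount.

```python
def reconcile(ledger_records, psp_records):
    """Compare two {transaction_id: amount_cents} mappings and list every difference."""
    ours, theirs = set(ledger_records), set(psp_records)
    return {
        # We recorded it, the PSP did not.
        "missing_at_psp": sorted(ours - theirs),
        # The PSP recorded it, we did not (possible money in limbo).
        "missing_in_ledger": sorted(theirs - ours),
        # Both recorded it, but the amounts disagree.
        "amount_mismatch": sorted(
            txn_id for txn_id in ours & theirs
            if ledger_records[txn_id] != psp_records[txn_id]
        ),
    }


ledger_side = {"t1": 10_000, "t2": 2_500, "t3": 700}
psp_side = {"t1": 10_000, "t2": 2_400, "t4": 900}
differences = reconcile(ledger_side, psp_side)
```

Every non-empty list is work for either automated resolution or the manual review queue; a clean day is one where all three lists are empty.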
The double-entry ledger is accounting's 700-year-old distributed systems pattern. Every financial movement creates at least two entries whose debits and credits net to zero. If your books don't balance, something went wrong, and you know immediately. This is not just accounting convention; it is an invariant that makes bugs self-revealing. Apply this pattern anywhere you track quantities that must be conserved: inventory, credits, loyalty points.
- Idempotency: every write endpoint. Retries are safe. Keys stored with TTL.
- Double-entry: debit + credit per movement. Sum = 0 invariant catches bugs immediately.
- Reconciliation: daily truth-matching across all systems. Zero tolerance for differences.
What Could Go Wrong
Double Charge
- User clicks "Pay", timeout, clicks again → two charges
- Root cause: no idempotency key, or key generated server-side (new key per retry)
- Customer sees two charges → disputes → chargeback → you pay a $25 fee per dispute
- Fix: client-generated idempotency key sent with request. Server deduplicates. Disable button after first click.
Money in Limbo
- PSP charged the customer but your system crashed before recording success
- Customer charged but no order confirmed. Money "disappeared."
- Manual resolution required: check PSP dashboard, match to customer
- Fix: write transaction record (PENDING) BEFORE calling PSP. Reconciliation service resolves unknowns. Webhook confirmation as secondary signal.
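A sketch of how a reconciliation pass might resolve stale PENDING records. `lookup_charge` is a hypothetical callback that asks the PSP about an idempotency key and returns `"succeeded"`, `"failed"`, or anything else when the PSP has no definite answer; the 60-second minimum age is an assumed value meant to avoid touching requests that are still in flight.

```python
def resolve_pending(transactions, lookup_charge, now, min_age_seconds=60.0):
    """Settle old PENDING records from the PSP's answer; return keys needing review."""
    needs_review = []
    for txn in transactions:
        if txn["status"] != "PENDING":
            continue
        if now - txn["created_at"] < min_age_seconds:
            continue  # may still be in flight; leave it alone for now
        outcome = lookup_charge(txn["idempotency_key"])
        if outcome == "succeeded":
            txn["status"] = "COMPLETED"   # a real job would also write the ledger entries
        elif outcome == "failed":
            txn["status"] = "FAILED"
        else:
            # No definite answer: never guess with money, escalate instead.
            needs_review.append(txn["idempotency_key"])
    return needs_review


pending = [
    {"idempotency_key": "a", "status": "PENDING", "created_at": 0.0},
    {"idempotency_key": "b", "status": "PENDING", "created_at": 0.0},
    {"idempotency_key": "c", "status": "PENDING", "created_at": 0.0},
    {"idempotency_key": "d", "status": "PENDING", "created_at": 95.0},
]
psp_answers = {"a": "succeeded", "b": "failed"}
review = resolve_pending(pending, psp_answers.get, now=100.0)
```

Records the PSP cannot account for stay PENDING and go to the manual review queue rather than being marked failed automatically.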
Ledger Imbalance
- Bug creates credit entry without corresponding debit (or vice versa)
- Books don't balance → financial reports are wrong → audit failure
- Undetected for days if invariant not checked continuously
- Fix: assert sum=0 invariant on every write. Continuous background audit. Alert immediately on imbalance.
PSP Outage Cascade
- Primary PSP goes down → all payments fail → revenue stops
- Retry storms to failing PSP → exhaust connection pools
- No fallback configured → 100% payment failure rate
- Fix: multi-PSP with automatic failover. Circuit breaker on each PSP. Health-check routing: route away from degraded PSPs within seconds.
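A minimal circuit-breaker sketch for the failover fix above. The thresholds, the `CircuitBreaker` class, and the `pick_psp` helper are illustrative assumptions; the clock is injectable so the behavior can be exercised without waiting.

```python
import time


class CircuitBreaker:
    def __init__(self, failure_threshold=5, reset_after=30.0, clock=time.monotonic):
        self._threshold = failure_threshold
        self._reset_after = reset_after
        self._clock = clock
        self._failures = 0
        self._opened_at = None   # None means the circuit is closed (healthy)

    def allow(self):
        """True if a request may be sent to this PSP right now."""
        if self._opened_at is None:
            return True
        # After the cool-down, let a trial request through (half-open).
        return self._clock() - self._opened_at >= self._reset_after

    def record_success(self):
        self._failures = 0
        self._opened_at = None

    def record_failure(self):
        self._failures += 1
        if self._failures >= self._threshold:
            self._opened_at = self._clock()   # open, or re-open after a failed trial


def pick_psp(breakers, preference):
    """Return the first PSP in preference order whose breaker allows traffic."""
    for name in preference:
        if breakers[name].allow():
            return name
    return None
```

While a breaker is open, traffic skips that PSP entirely, which prevents retry storms from exhausting connection pools against a provider that is already failing.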
Every payment system failure has legal and financial consequences. A double charge is not just a bug: it can trigger a chargeback ($25 fee), damage your merchant reputation score (too many chargebacks = PSP drops you), and in some jurisdictions, violate consumer protection laws. Payment system bugs are $$ bugs. This is why exactly-once processing, idempotency, and reconciliation are not optional; they are the foundation.
- Double charge: client-side idempotency key. Server deduplicates. Button disable on click.
- Money in limbo: write PENDING record before PSP call. Reconciliation resolves unknowns.
- Ledger imbalance: assert sum=0 on every write. Continuous background audit.
- PSP outage: multi-PSP routing + circuit breaker + health-check failover.
- Principle: payment bugs are legal liability. Correctness is the only option.