System Design · Building Blocks · DNS & Load Balancing

DNS & Load Balancing

How traffic finds your system and gets distributed across servers.

01
Chapter One

What DNS & Load Balancing Are

The Two Questions Every Request Asks

Every request to your system starts with a question: where is this thing? DNS answers that question by translating api.yourcompany.com into an IP address. The follow-up question arrives a millisecond later: which of the many instances of that thing should handle this? That is the job of the load balancer. Together they form the entry point of every system that ever served more than one user from more than one machine.

📖

DNS — The Phone Book of the Internet

A hierarchical, distributed system that translates human-readable names into IP addresses. Root servers point to TLD servers, which point to authoritative nameservers, which return the actual IP.

Key records: A (IPv4), AAAA (IPv6), CNAME (alias), MX (mail), TXT (verification, SPF).

TTL: Time-to-live tells resolvers how long to cache the answer. This is why DNS changes do not propagate instantly.
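The TTL mechanics can be sketched as a tiny resolver-side cache. This is an illustrative model, not a real resolver API; `DnsCache` and its `lookup` callable are invented for the example.

```python
import time

class DnsCache:
    """Minimal sketch of resolver-side caching. Assumes a pluggable
    `lookup` callable returning (ip, ttl_seconds); names are illustrative."""

    def __init__(self, lookup):
        self._lookup = lookup          # upstream resolution (hypothetical)
        self._entries = {}             # name -> (ip, expires_at)

    def resolve(self, name, now=None):
        now = time.monotonic() if now is None else now
        entry = self._entries.get(name)
        if entry and now < entry[1]:
            return entry[0]            # cache hit: upstream never sees this query
        ip, ttl = self._lookup(name)   # cache miss: ask upstream, honor its TTL
        self._entries[name] = (ip, now + ttl)
        return ip

# Flipping the record upstream is invisible until the TTL expires --
# this is exactly why DNS changes take a while to "propagate".
records = {"api.example.com": ("1.2.3.4", 300)}
cache = DnsCache(lambda name: records[name])
assert cache.resolve("api.example.com", now=0.0) == "1.2.3.4"
records["api.example.com"] = ("5.6.7.8", 300)   # record changed upstream
assert cache.resolve("api.example.com", now=100.0) == "1.2.3.4"  # still cached
assert cache.resolve("api.example.com", now=301.0) == "5.6.7.8"  # TTL expired
```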

⚖️

Load Balancer — The Traffic Director

Sits between clients and your server pool. Distributes incoming requests across multiple backend instances so no single server becomes a bottleneck or a single point of failure.

Where it sits: Public IP exposed by DNS → LB receives the request → LB picks a healthy backend and forwards it.

Key benefit: Decouples public-facing endpoints from the actual servers behind them.

Request Flow — From URL Bar to Backend Server
Browser → DNS resolver (recursive lookup) → authoritative DNS returns the IP → load balancer picks a healthy server (Server 1 / 2 / 3) → HTTP request served. DNS answers “where”. The load balancer answers “which”.

Analogy: DNS is the city map that gets you to the right building. The load balancer is the receptionist inside who directs you to the right desk. Both are needed. Both are invisible when working. Both cause spectacular outages when broken.

📋 Chapter 1 — Summary
  • DNS resolves human-readable names to IP addresses through a hierarchical chain of nameservers.
  • TTL is the cache lifetime — the reason DNS changes take minutes to hours to propagate globally.
  • Load balancers sit at the system entry point and distribute traffic across many backend instances.
  • Together they decouple your public endpoint from the actual servers behind it — the foundation of horizontal scale and HA.

02
Chapter Two

How They Work Internally

The DNS Resolution Chain

Every DNS query you have ever made passed through as many as three layers of cache before any nameserver actually got involved. Most of the time the answer came from a cache within the first millisecond. The interesting question is what happens when it doesn't: the chain runs all the way to the authoritative server, takes 100 milliseconds, and determines whether your page loads in time.

DNS Resolution Chain — Cache Layers Before the Real Lookup
Browser cache (~0.1 ms) → OS cache (~1 ms) → recursive resolver (~10 ms) → root nameserver (~30 ms) → TLD nameserver (~50 ms) → authoritative nameserver returns the A record. Each layer checks its cache first and escalates only on a miss. Cached lookup ≈ 1 ms; full uncached chain 20–120 ms.
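The escalate-on-miss behaviour can be modelled as a chain of cache layers. A minimal sketch, with the root/TLD walk and TTLs omitted for brevity; all names here are illustrative.

```python
def authoritative(host, trace):
    """Stand-in for the authoritative nameserver (illustrative data)."""
    trace.append("authoritative")
    return {"api.example.com": "1.2.3.4"}[host]

class CacheLayer:
    """One layer in the chain: check own cache, escalate upstream on miss."""
    def __init__(self, name, upstream):
        self.name, self.upstream, self.store = name, upstream, {}

    def resolve(self, host, trace):
        trace.append(self.name)
        if host not in self.store:                    # miss: escalate
            self.store[host] = self.upstream(host, trace)
        return self.store[host]

    __call__ = resolve   # so a layer can serve as another layer's upstream

# Build the chain: browser -> OS -> recursive resolver -> authoritative
resolver = CacheLayer("resolver", authoritative)
os_cache = CacheLayer("os", resolver)
browser  = CacheLayer("browser", os_cache)

cold, warm = [], []
browser.resolve("api.example.com", cold)   # full uncached walk
browser.resolve("api.example.com", warm)   # answered at the first layer
assert cold == ["browser", "os", "resolver", "authoritative"]
assert warm == ["browser"]
```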
Load Balancing Algorithms & Layers
🔁

Algorithms — How the LB Picks a Server

Round Robin: sequential. Simple, stateless. Ignores server capacity.

Weighted Round Robin: traffic proportional to weight. Use for heterogeneous hardware.

Least Connections: route to server with fewest active connections. Best for long-lived connections (WebSockets, gRPC streams).

IP Hash: same client always routes to same server. Useful for affinity, but uneven if traffic is skewed.

Least Response Time: latency + connection count. Best perf, most complex, requires accurate health probes.
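Two of the algorithms above, as minimal sketches. Server names and tie-breaking are illustrative; production balancers layer health checks and weights on top of this.

```python
import itertools

class RoundRobin:
    """Sequential rotation: simple, capacity-blind."""
    def __init__(self, servers):
        self._cycle = itertools.cycle(servers)

    def pick(self):
        return next(self._cycle)

class LeastConnections:
    """Route to the backend with the fewest active connections --
    suits long-lived connections (WebSockets, gRPC streams)."""
    def __init__(self, servers):
        self.active = {s: 0 for s in servers}

    def pick(self):
        server = min(self.active, key=self.active.get)  # ties: insertion order
        self.active[server] += 1
        return server

    def release(self, server):
        self.active[server] -= 1

rr = RoundRobin(["s1", "s2", "s3"])
assert [rr.pick() for _ in range(4)] == ["s1", "s2", "s3", "s1"]

lc = LeastConnections(["s1", "s2"])
assert lc.pick() == "s1"    # tie broken by insertion order
assert lc.pick() == "s2"
lc.release("s2")            # s2's request finished
assert lc.pick() == "s2"    # s2 now has fewer active connections
```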

🔌

L4 vs L7 — Where the LB Operates

L4 (Transport): routes on IP + TCP/UDP port. Fast, no content inspection. Cannot route by URL or header. Works for any protocol.

L7 (Application): inspects HTTP. Can route /api to API servers, /images to image servers, do header-based A/B testing. Slower, but vastly more flexible.

Rule of thumb: use L7 unless you genuinely need raw TCP performance or are routing non-HTTP traffic.

L4 vs L7 — What the Load Balancer Can See
L4 (transport layer) sees only IP and port: src 10.1.2.3:54321 → dst 1.2.3.4:443. All traffic on :443 looks identical, so backend-A and backend-B are interchangeable; the LB cannot distinguish /api from /images. L7 (application layer) inspects the HTTP request itself (GET /api/users HTTP/1.1, Host, User-Agent), so it can send /api/* to the api-pool and /images/* to the image-pool, enabling URL-based routing, A/B testing, and rewrites.

The mental model: L4 sees envelopes. L7 reads the letters inside. Modern systems use L7 for nearly all HTTP traffic because the cost of TLS termination and parsing is dwarfed by the value of intelligent routing.
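The L7 half of that picture can be sketched as longest-prefix path routing. Pool names and routes below are illustrative assumptions, not any specific balancer's API.

```python
class PathRouter:
    """L7-style prefix routing sketch: longest matching prefix wins.
    An L4 balancer cannot do this -- it never sees the path."""
    def __init__(self, routes, default):
        # sort longest-first so /api/v2 would beat /api
        self.routes = sorted(routes.items(), key=lambda kv: -len(kv[0]))
        self.default = default

    def route(self, path):
        for prefix, pool in self.routes:
            if path.startswith(prefix):
                return pool
        return self.default

router = PathRouter({"/api": "api-pool", "/images": "image-pool"}, "web-pool")
assert router.route("/api/users") == "api-pool"
assert router.route("/images/logo.png") == "image-pool"
assert router.route("/") == "web-pool"
```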

📋 Chapter 2 — Summary
  • DNS checks three cache layers (browser, OS, recursive resolver) before walking root → TLD → authoritative. Cached lookups are ~1 ms; the full chain is 20–120 ms.
  • Round Robin is simple but ignores capacity. Least Connections wins for long-lived connections.
  • L4 routes on IP + port and is fast. L7 reads HTTP and is flexible. Use L7 for HTTP.
  • Algorithm choice should match traffic shape: short stateless requests → round robin; streams → least connections.

03
Chapter Three

When to Use — and When Not To

Both Sides of the Decision

Engineers reach for load balancers reflexively because every architecture diagram has one. That instinct is right at production scale and wrong before it. The real question isn't whether you need a load balancer eventually. It's whether you need one now, what kind, and whether you have already solved the problems it depends on.

USE a Load Balancer When…

You run more than one server — which is essentially always in production.

You need high availability — one server fails, traffic drains to the others.

You scale horizontally — add capacity by adding instances behind the LB.

You deploy without downtime — drain traffic from one node, deploy, re-enable, repeat.

You terminate TLS centrally — one place to manage certificates and ciphers.

DO NOT Add One When…

You have a single-server prototype. Adds operational cost with zero benefit.

You are adding it for prestige. “Real systems have load balancers.” So do simple ones — when they need them.

Your app has heavy server-side session state. Fix the sessions first; an LB plus sticky-by-default is a band-aid that masks deeper coupling.

You only have one backend instance. An LB in front of one server is not HA — it's a single point of failure with extra steps.

🌐

USE DNS Load Balancing (GeoDNS) When…

Users on multiple continents. Latency to a single region kills UX.

Regional failover required. Direct EU users to EU; US users to US; route around regional outages.

Compliance forces residency. EU traffic must stay in EU.

⚠️

DO NOT Rely on DNS for Fast Failover

TTL is the limit. Even at 60s TTL, real propagation often takes minutes — some resolvers ignore short TTLs entirely.

Sub-minute failover requires the LB layer. Use LB health checks for fast cutover. Reserve DNS changes for slow, deliberate region shifts.

Don't flap DNS. Constantly flipping records confuses clients and pollutes caches.

The principle: A load balancer solves a specific problem — distributing load across many backends. If you don't have many backends, or you have other problems first (state, deployment, observability), an LB makes those problems harder, not easier.

📋 Chapter 3 — Summary
  • Load balancers earn their keep when you run multiple instances and need HA, horizontal scale, or zero-downtime deploys.
  • Stateful sessions are an anti-pattern in front of an LB — fix the state model before adding the balancer.
  • GeoDNS is for slow, deliberate regional routing. It is not a fast-failover mechanism.
  • Sub-minute failover lives at the LB layer with health checks — not at the DNS layer with short TTLs.

04
Chapter Four

Trade-offs & Comparisons

Hardware vs Software vs Cloud

The load balancer market split into three layers a long time ago. Each layer optimises for a different constraint — raw throughput, flexibility, or operational simplicity — and each carries a different cost. The right answer depends less on technical capability and more on what your team is willing to operate.

🏛️

Hardware (F5, Citrix)

Pros: extreme throughput, dedicated ASICs, mature features, predictable latency.

Cons: six-figure price tags, slow to scale (buy more boxes), inflexible config, vendor support contracts.

Use: legacy enterprise, regulated industries, on-prem datacenters with extreme TPS.

💻

Software (HAProxy, Nginx, Envoy)

Pros: commodity hardware, deep configurability, integrates with anything, free or cheap.

Cons: you operate it — HA, monitoring, upgrades are your problem.

Use: industry standard for most modern stacks — especially when you want full control.

☁️

Cloud Managed (ALB, NLB, GCLB)

Pros: fully managed, auto-scales, native cloud integration (auto-scaling groups, IAM, certs).

Cons: vendor lock-in, less flexibility for niche features, harder to debug at the platform level.

Use: the default choice in cloud-native stacks unless you have a specific reason to operate your own.

The Single Load Balancer Problem

The whole point of a load balancer is to remove a single point of failure. Then engineers put a single load balancer in front of their server pool and create a brand new one. The fix is well-understood but easy to skip: run them in pairs (active-passive or active-active), with a virtual IP that fails over between them, and ideally backed by anycast routing at the network layer.

HA Load Balancer — Active/Passive with Virtual IP
A virtual IP (203.0.113.10) floats between the pair. The active LB handles 100% of traffic and heartbeats to the passive, which sits idle mirroring config. On heartbeat loss the VIP fails over to the passive; the backends never know. Independently, health checks drain unhealthy backends (backend-4) while healthy ones (backend-1 through backend-3) keep serving.

The load balancer that removes your single point of failure must not itself become a single point of failure. Active/passive with VIP is the minimum viable HA. Active/active behind anycast is the cloud-native standard.
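The failover decision itself is small enough to sketch. Real deployments use keepalived/VRRP or the cloud provider's equivalent; the timings and names below are illustrative.

```python
class VipPair:
    """Active/passive sketch: the passive claims the VIP when the
    active's heartbeat goes stale. Timings are illustrative."""
    HEARTBEAT_TIMEOUT = 3.0   # seconds without a beat before failover

    def __init__(self):
        self.vip_owner = "lb-active"
        self.last_beat = 0.0

    def heartbeat(self, now):
        self.last_beat = now          # active announces it is alive

    def passive_tick(self, now):
        # passive checks heartbeat freshness; claims the VIP on loss
        if now - self.last_beat > self.HEARTBEAT_TIMEOUT:
            self.vip_owner = "lb-passive"
        return self.vip_owner

pair = VipPair()
pair.heartbeat(now=0.0)
assert pair.passive_tick(now=2.0) == "lb-active"    # heartbeat fresh
assert pair.passive_tick(now=6.0) == "lb-passive"   # heartbeat lost: failover
```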

Health Checks Done Right

A health check that only verifies the HTTP port is open will route traffic to a server whose database connection has been broken for an hour. The endpoint should verify what actually matters: database connectivity, cache reachability, dependency liveness. Active checks (LB pings a /health endpoint every few seconds) detect failures fast. Passive checks (mark unhealthy on failed live requests) catch the failures active checks miss. Use both.
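A deep health endpoint is essentially an aggregation of dependency probes. A minimal sketch, with hypothetical probe names:

```python
def deep_health(checks):
    """Aggregate dependency probes into one verdict.
    `checks` maps dependency name -> zero-arg callable returning True/False.
    Returns (http_status, failing): 200 only when every probe passes."""
    failing = [name for name, probe in checks.items() if not probe()]
    return (200 if not failing else 503), failing

# A port-open check would have said "healthy" here; the DB probe disagrees.
status, failing = deep_health({
    "db":    lambda: False,   # e.g. connection pool exhausted
    "cache": lambda: True,
})
assert status == 503 and failing == ["db"]
```

Wire this behind the /health route the LB's active checks hit, and let passive checks (failed live requests) catch whatever the probes miss.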

📋 Chapter 4 — Summary
  • Hardware = throughput + cost; Software = flexibility + ops burden; Cloud = simplicity + lock-in.
  • The LB itself must be HA — active/passive with floating VIP is the baseline; active/active with anycast is the gold standard.
  • Health checks must verify real app health (DB, cache, dependencies) — not just “is the port open?”
  • Combine active and passive health checks: active for speed, passive for accuracy.

05
Chapter Five

Production Patterns & Common Mistakes

Patterns That Survive Contact with Real Traffic

Almost every production load balancer outage I have debugged came down to one of three things: a health check that lied, a deployment that didn't drain, or a session model that fought the load balancer instead of working with it. The patterns below are not exotic — they are the standard playbook, and skipping them is what gets you paged at 3am.

🍪

Sticky Sessions (Session Affinity)

What it is: same user always routes to the same backend. Implemented via cookie or IP hash.

When needed: legacy apps with server-side session state that genuinely cannot be moved.

Better: move sessions to Redis. Make the app stateless. The LB stops fighting your traffic shape.

Gotcha: one heavy user can pin to one backend and starve it.
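IP-hash affinity fits in a few lines. The hashing scheme below is illustrative, not a specific balancer's implementation:

```python
import hashlib

def ip_hash_pick(client_ip, backends):
    """Affinity via client-keyed hashing: the same client IP always lands
    on the same backend (while the pool membership is unchanged)."""
    digest = hashlib.sha256(client_ip.encode()).digest()
    return backends[int.from_bytes(digest[:8], "big") % len(backends)]

backends = ["s1", "s2", "s3"]
first = ip_hash_pick("203.0.113.7", backends)
# stable: every request from this client pins to the same backend --
# which is also how one heavy client can starve a single server
assert all(ip_hash_pick("203.0.113.7", backends) == first for _ in range(100))
```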

💧

Connection Draining

What it is: when removing a backend, stop sending new connections but allow existing ones to finish.

When needed: every deployment. Required for zero-downtime.

Tuning: drain timeout 30–300s. Long enough for in-flight requests; short enough that deploys aren't blocked.

Gotcha: WebSockets and gRPC streams can hold connections for hours — cap the drain window.
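The drain state machine is small enough to sketch directly. Names and deadline handling are illustrative:

```python
class DrainingBackend:
    """Drain sketch: refuse new connections, let in-flight ones finish,
    and cap long-lived streams at the drain deadline."""
    def __init__(self, drain_timeout=60.0):
        self.draining = False
        self.deadline = None
        self.active = set()
        self.drain_timeout = drain_timeout

    def accept(self, conn_id):
        if self.draining:
            return False            # LB should route this elsewhere
        self.active.add(conn_id)
        return True

    def start_drain(self, now):
        self.draining = True
        self.deadline = now + self.drain_timeout

    def finish(self, conn_id):
        self.active.discard(conn_id)

    def safe_to_stop(self, now):
        # done when drained naturally, or when the deadline caps stragglers
        return self.draining and (not self.active or now >= self.deadline)

b = DrainingBackend(drain_timeout=60.0)
b.accept("req-1")
b.start_drain(now=0.0)
assert not b.accept("req-2")          # new connections refused
assert not b.safe_to_stop(now=10.0)   # req-1 still in flight
b.finish("req-1")
assert b.safe_to_stop(now=10.0)       # drained before the deadline
```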

🔐

SSL/TLS Termination at the LB

What it is: LB decrypts HTTPS, forwards plaintext HTTP to backends.

Pros: centralised cert management, frees backends from crypto overhead, easier rotation.

Cons: LB→backend traffic is unencrypted by default. Mitigate with a private network (VPC) or by re-encrypting from the LB to the backends; if backends must see the original TLS stream, use TLS passthrough instead (the LB never decrypts, at the cost of L7 routing).

Gotcha: if your LB uses HTTP to talk to backends, don't expose backends to the public internet.

The Five Mistakes That Cause Outages
💀

Mistake 1 — No Real Health Checks

Dead backends keep receiving traffic until someone notices the alerts. Fix: health endpoint that touches DB, cache, and any critical dependency. Status code reflects real health.

Mistake 2 — DNS TTL Too Long

Failover takes hours instead of minutes. Fix: 60–300s TTL on records that may need to fail over. Accept slight extra DNS load as the cost of fast recovery.

🍬

Mistake 3 — Sticky Sessions as a Crutch

Hides stateful coupling instead of fixing it. Hot users pin to single backends; deployment becomes painful. Fix: move state to Redis; make backends interchangeable.

💥

Mistake 4 — Single LB Instance

Removed server SPOF, created LB SPOF. Fix: active/passive with VIP at minimum; active/active behind anycast at scale.

👁️

Mistake 5 — Port-Open Health Checks

Server appears healthy while the app is broken internally. TCP connect succeeds, every request 500s. Fix: HTTP-level health check that exercises real code paths.

📝

Bonus — No Connection Draining

Deploys drop in-flight requests. Users see random 502s during every release. Fix: configure drain timeout on the LB and respect SIGTERM in the app for graceful shutdown.
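The app-side half of graceful shutdown, respecting SIGTERM, can be sketched as a flag the serving loop consults. A minimal illustration, not a full server:

```python
import signal

class GracefulApp:
    """SIGTERM handling sketch: flip a flag so the serving loop stops
    accepting new work and finishes in-flight requests before exiting."""
    def __init__(self):
        self.shutting_down = False
        signal.signal(signal.SIGTERM, self._handle_term)

    def _handle_term(self, signum, frame):
        self.shutting_down = True    # loop drains instead of dying mid-request

    def should_accept(self):
        return not self.shutting_down

app = GracefulApp()
assert app.should_accept()
app._handle_term(signal.SIGTERM, None)   # simulate the LB's drain signal
assert not app.should_accept()
```

Pair this with the LB's drain timeout: the balancer stops sending new requests, the app finishes what it has, and only then does the process exit.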

The pattern across all five: the load balancer is doing exactly what you told it to. The bug is in what you told it. Health checks, drain timeouts, and session models are configuration — treat them as production code.

📋 Chapter 5 — Summary
  • Sticky sessions are a band-aid for stateful apps — move state to Redis and remove them.
  • Connection draining is non-negotiable for zero-downtime deployments.
  • TLS termination at the LB centralises certificates but requires private networking between LB and backends.
  • The five outage mistakes: weak health checks, long TTL, sticky-as-crutch, single LB, port-open health checks. Audit your config for all five.
  • Configuration is code. Review it. Test failover quarterly. Deploys must be boring.