System Design · Security & Observability

Security & Observability

Protecting systems and understanding what is happening inside them.

Chapter One

Authentication & Authorization

Who Are You, and What Can You Do?

Authentication (AuthN) and authorization (AuthZ) are the two questions every request must answer before it touches your business logic. They are distinct concerns — confusing them is one of the most common security mistakes in distributed systems. AuthN is identity: proving you are who you claim to be. AuthZ is permission: determining what that identity is allowed to do. Getting either wrong usually means getting breached.

Authentication (AuthN)

Answers: "Who are you?"
Verifies identity: username/password, MFA, certificate
Results in: a trusted identity (user ID, service identity)
Happens once at login (then token-based for subsequent requests)
Standards: OpenID Connect (built on OAuth 2.0), SAML

Authorization (AuthZ)

Answers: "What can you do?"
Checks permissions: roles, policies, attributes
Results in: allow or deny for a specific action
Happens on every request (enforcement point)
Models: RBAC, ABAC, ReBAC, policy engines (OPA)

JWT & Token-Based Authentication

JWT Validation at API Gateway — Stateless Auth
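
The checks a gateway runs on each token can be sketched with the standard library alone. This is an illustration under stated assumptions, not a production validator: it uses HS256 (a shared secret) so that it runs without third-party packages, whereas a gateway validating against a JWKS endpoint would normally verify RS256 or ES256 signatures with a maintained JWT library. The function names, the demo key, the issuer and audience values, and the assumption that `aud` is a single string are all hypothetical.

```python
import base64
import hashlib
import hmac
import json
import time
from typing import Optional


def b64url_decode(segment: str) -> bytes:
    # JWT segments are base64url without padding; restore it before decoding.
    return base64.urlsafe_b64decode(segment + "=" * (-len(segment) % 4))


def b64url_encode(raw: bytes) -> str:
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")


def sign_hs256(claims: dict, key: bytes) -> str:
    # Demo helper only: in practice the identity provider issues the token.
    header = b64url_encode(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    payload = b64url_encode(json.dumps(claims).encode())
    sig = hmac.new(key, f"{header}.{payload}".encode(), hashlib.sha256).digest()
    return f"{header}.{payload}.{b64url_encode(sig)}"


def verify_hs256(token: str, key: bytes, issuer: str, audience: str,
                 now: Optional[float] = None) -> dict:
    """Return the claims if the token is valid, otherwise raise ValueError."""
    try:
        header_b64, payload_b64, sig_b64 = token.split(".")
        header = json.loads(b64url_decode(header_b64))
        claims = json.loads(b64url_decode(payload_b64))
        signature = b64url_decode(sig_b64)
    except Exception as exc:
        raise ValueError("malformed token") from exc
    # Pin the algorithm; do not let the token's own header choose it.
    if header.get("alg") != "HS256":
        raise ValueError("unexpected algorithm")
    expected = hmac.new(key, f"{header_b64}.{payload_b64}".encode(), hashlib.sha256).digest()
    if not hmac.compare_digest(signature, expected):
        raise ValueError("bad signature")
    now = time.time() if now is None else now
    exp = claims.get("exp")
    if not isinstance(exp, (int, float)) or exp <= now:
        raise ValueError("token expired or missing exp")
    if claims.get("iss") != issuer or claims.get("aud") != audience:
        raise ValueError("wrong issuer or audience")
    return claims


demo_key = b"demo-shared-secret"  # hypothetical key for the example
demo_token = sign_hs256(
    {"sub": "user-42", "iss": "https://idp.example", "aud": "orders-api", "exp": 2000},
    demo_key,
)
```

Because every check uses only the token and the key, the gateway needs no session store; the trade-off is that a token stays valid until `exp` unless a separate revocation mechanism exists.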

🏷️

RBAC — Role-Based Access

Users assigned roles: admin, editor, viewer
Roles have fixed permissions
Simple, auditable, well-understood
Limitation: "role explosion" when permissions are complex
Best for: most B2B SaaS, internal tools
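
A minimal sketch of an RBAC check, assuming the three roles named above and a hypothetical set of permission strings (`doc:read` and so on are invented for the example):

```python
# Role -> fixed permission set. The permission names are illustrative.
ROLE_PERMISSIONS = {
    "admin": {"doc:read", "doc:write", "doc:delete", "user:manage"},
    "editor": {"doc:read", "doc:write"},
    "viewer": {"doc:read"},
}


def rbac_allows(user_roles, permission):
    """Allow if any of the user's roles grants the permission; deny otherwise."""
    return any(permission in ROLE_PERMISSIONS.get(role, set()) for role in user_roles)
```

The whole model is one lookup table, which is why it is easy to audit. It is also why it strains when a permission depends on something other than the role, such as which document is being edited.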

🧮

ABAC — Attribute-Based Access

Access based on attributes: user.dept=engineering AND resource.env=prod
Flexible, fine-grained, context-aware
Policy engine evaluates rules at runtime (OPA, Cedar)
More complex: harder to audit "who can do what"
Best for: multi-tenant, regulatory environments
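
The attribute rule above can be sketched as a small policy function. This is a hand-rolled illustration, not the OPA or Cedar policy language; the `on_call` attribute, the action names, and the second policy are assumptions added to show how rules combine.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class User:
    dept: str
    on_call: bool = False  # hypothetical extra attribute


@dataclass(frozen=True)
class Resource:
    env: str


def engineers_read_prod(user, resource, action):
    # The rule from the text: user.dept=engineering AND resource.env=prod.
    return action == "read" and user.dept == "engineering" and resource.env == "prod"


def on_call_can_restart(user, resource, action):
    # A context-dependent rule: only the current on-call engineer may restart.
    return action == "restart" and user.dept == "engineering" and user.on_call


POLICIES = [engineers_read_prod, on_call_can_restart]


def abac_allows(user, resource, action):
    """Deny by default; allow if any policy matches the request attributes."""
    return any(policy(user, resource, action) for policy in POLICIES)
```

Answering "who can restart prod?" now means evaluating rules against every user's current attributes, which is the auditability cost noted above.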

🕸️

ReBAC — Relationship-Based Access

Access determined by the relationship between user and resource
Example: a user can edit a document if they own it or belong to a group with edit access
This is how Google Docs sharing works
Tools: OpenFGA, SpiceDB (both modeled on Google's Zanzibar)
Best for: permissions that are graph-like and contextual
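
A relationship check can be sketched as a lookup over (object, relation, subject) tuples. This is a toy in-memory version of the idea, not the OpenFGA or SpiceDB API; the tuple contents and the one-level group expansion are assumptions for the example.

```python
# Relationship tuples: (object, relation, subject). All names are hypothetical.
TUPLES = {
    ("doc:roadmap", "owner", "user:alice"),
    ("doc:roadmap", "editor", "group:platform"),
    ("group:platform", "member", "user:bob"),
}


def can_edit(user, doc, tuples=TUPLES):
    """A user can edit if they own the doc, are a direct editor,
    or are a member of a group that holds the editor relation."""
    if (doc, "owner", user) in tuples or (doc, "editor", user) in tuples:
        return True
    editor_groups = {
        subject
        for (obj, relation, subject) in tuples
        if obj == doc and relation == "editor" and subject.startswith("group:")
    }
    return any((group, "member", user) in tuples for group in editor_groups)
```

Granting access is a matter of adding a tuple rather than defining a new role, which is what keeps this model manageable when permissions follow a sharing graph.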

OAuth 2.0 and OpenID Connect

OAuth 2.0 is an authorization framework — it allows a user to grant a third-party application limited access to their resources without sharing their password. The key concept is delegation: the user authorizes the app, the app receives an access token, the app uses that token to call APIs on the user's behalf. OAuth 2.0 alone does not tell the app who the user is — it only grants access.

OpenID Connect (OIDC) is an identity layer built on top of OAuth 2.0. It adds an ID token (a JWT containing user identity claims: sub, email, name) alongside the access token. This is how "Sign in with Google" works — OAuth 2.0 handles the authorization flow, OIDC provides the identity.

🔑

Authorization Code Flow + PKCE

For user-facing web and mobile apps.

User logs in at the IdP (Google, Auth0, Cognito)
IdP issues an authorization code to the app
App exchanges code for access token + ID token
PKCE (Proof Key for Code Exchange) prevents code interception attacks — required for all public clients
Access token: short-lived (15 min – 1 hr), used to call APIs
Refresh token: long-lived, obtains new access tokens silently
ID token: JWT with user claims — parsed by app, never sent to APIs
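
The PKCE step in the flow above comes down to two values: a random `code_verifier` the app keeps to itself, and a `code_challenge` derived from it and sent with the authorization request. A minimal sketch of the S256 method from RFC 7636, with hypothetical function names:

```python
import base64
import hashlib
import secrets


def make_code_verifier(n_bytes: int = 32) -> str:
    # 32 random bytes encode to 43 base64url characters,
    # the minimum verifier length RFC 7636 allows.
    return base64.urlsafe_b64encode(secrets.token_bytes(n_bytes)).rstrip(b"=").decode("ascii")


def code_challenge_s256(verifier: str) -> str:
    # code_challenge = BASE64URL(SHA256(ASCII(code_verifier))), without padding.
    digest = hashlib.sha256(verifier.encode("ascii")).digest()
    return base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")


verifier = make_code_verifier()             # kept by the app, sent only at token exchange
challenge = code_challenge_s256(verifier)   # sent with the initial authorization request
```

An attacker who intercepts the authorization code cannot redeem it, because the token endpoint recomputes the challenge from the verifier and only the original app holds the verifier.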

⚙️

Client Credentials Flow

For service-to-service (machine-to-machine) authentication.

No user involved — Service A authenticates as itself to call Service B
Service presents its client_id + client_secret to the IdP
Receives an access token scoped to the service's permissions
Used for background jobs, microservice communication, CI/CD pipelines
Rotate client_secret regularly — treat it as a production secret

⚠️ Token Storage Security

Where you store tokens determines your attack surface. Access tokens in localStorage are vulnerable to XSS — any injected script can read them. Access tokens in memory (JS variable) are lost on page refresh. Recommended pattern for web apps: store refresh tokens in HttpOnly, Secure, SameSite=Strict cookies (not accessible to JavaScript), and keep short-lived access tokens in memory only. For mobile: use iOS Keychain or Android Keystore. Never log tokens. Never send tokens in URL parameters — they appear in server logs and browser history.
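
The cookie attributes recommended above can be sketched with the standard library. The cookie name, the `/auth/refresh` path, and the seven-day lifetime are assumptions for the example; a real framework would usually expose the same attributes through its own response API.

```python
from http.cookies import SimpleCookie


def refresh_cookie_header(token: str, max_age_s: int = 7 * 24 * 3600) -> str:
    """Build the value of a Set-Cookie header for a refresh token."""
    cookie = SimpleCookie()
    cookie["refresh_token"] = token
    morsel = cookie["refresh_token"]
    morsel["httponly"] = True          # not readable from JavaScript
    morsel["secure"] = True            # sent over HTTPS only
    morsel["samesite"] = "Strict"      # not sent on cross-site requests
    morsel["path"] = "/auth/refresh"   # hypothetical: limit to the refresh endpoint
    morsel["max-age"] = max_age_s
    return morsel.OutputString()


header_value = refresh_cookie_header("opaque-token-123")
```

Scoping the cookie path to the refresh endpoint is an extra hardening choice in this sketch: the browser then attaches the refresh token only to the one request that needs it.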

Authentication is solved — use an identity provider (Auth0, Cognito, Keycloak). Do not build your own. Authorization is where the real design decisions live. The auth model you choose determines whether your permissions scale with your product complexity or collapse under it.

📋 Chapter 1 — Summary

AuthN = who are you. AuthZ = what can you do. They are separate concerns.
JWT: stateless token. The gateway verifies the signature and the expiry, issuer, and audience claims. No session store needed.
OAuth 2.0: Authorization Code + PKCE for user apps, Client Credentials for service-to-service. OIDC adds identity (ID token) on top.
Token storage: refresh tokens in HttpOnly cookies, access tokens in memory only. Never localStorage.
RBAC: roles with fixed permissions. Simple, auditable. Best default.
ABAC: attribute + context rules. Flexible but harder to audit. Use when RBAC can't express your needs.
ReBAC: relationship-based — user can act on a resource if a graph relationship permits it. Google Docs model. OpenFGA / SpiceDB.

Chapter Two

Zero Trust & Defense in Depth

Never Trust, Always Verify

The traditional network security model was simple: everything inside the firewall is trusted, everything outside is not. That model is dead. Cloud services, remote workers, microservices spanning multiple networks, third-party integrations — the perimeter no longer exists. Zero Trust assumes that any request — internal or external — could be malicious, and verifies every single one. It is not a product you buy; it is an architecture principle you implement layer by layer.

Zero Trust — mTLS Between All Services

🌐

Network Layer

VPCs, security groups, NACLs
WAF for application-layer attacks
DDoS protection (AWS Shield, Cloudflare)
Private subnets for internal services

🔐

Application Layer

mTLS for service-to-service identity
Input validation and parameterized queries (prevent SQL injection)
CSRF: SameSite=Strict cookies + CSRF tokens for state-changing ops
XSS: encode output, sanitize untrusted input, CSP headers restrict script execution
Rate limiting per identity
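
The difference a parameterized query makes can be shown with an in-memory SQLite database. The table, the rows, and the injection payload are made up for the example:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)")
conn.executemany(
    "INSERT INTO users (email) VALUES (?)",
    [("a@example.com",), ("b@example.com",)],
)


def find_user_unsafe(email: str):
    # Vulnerable: attacker-controlled text becomes part of the SQL statement.
    return conn.execute(f"SELECT id FROM users WHERE email = '{email}'").fetchall()


def find_user_safe(email: str):
    # Parameterized: the driver passes the value separately from the SQL text,
    # so it is always treated as data and never as syntax.
    return conn.execute("SELECT id FROM users WHERE email = ?", (email,)).fetchall()


payload = "nobody' OR '1'='1"
leaked_rows = find_user_unsafe(payload)   # the WHERE clause becomes always-true
safe_rows = find_user_safe(payload)       # no user has this literal email
```

With the payload above, the string-built query returns every row in the table, while the parameterized query returns none.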

🗄️

Data Layer

Encryption at rest (AES-256)
Encryption in transit (TLS 1.3)
Column-level encryption for PII
Audit logs on all data access

Principle of Least Privilege

Every service, user, and process should have the minimum permissions required to do its job — nothing more. If a service only needs to read from one database table, it should not have write access, and certainly not admin access. This principle is foundational to Zero Trust — limiting blast radius when a component is compromised.

🚫

Overly Permissive (Wrong)

One shared IAM role with broad permissions across all services
When this role is compromised: blast radius = entire system
Application DB user has CREATE, DROP, TRUNCATE, schema admin
Services can reach any other service on the internal network
Done because it is faster in the short term — costs everything in a breach

✅

Least Privilege (Correct)

Each service gets its own IAM role with only its required operations
When one role is compromised: blast radius = one service only
Application DB user has SELECT, INSERT, UPDATE on specific tables only
Network paths defined explicitly — analytics service has no route to payment service
Reviewed quarterly — permissions that are no longer needed are removed

Defense in depth means that no single layer's failure causes a breach. The network can be breached — the app validates. The app has a bug — the data is encrypted. The encryption key leaks — the audit trail catches it. Each layer assumes the layer above has already failed.

📋 Chapter 2 — Summary

Zero Trust: no implicit trust for internal traffic. Verify identity + check policy on every call.
mTLS: both sides of a connection present certificates. Service identity, not just network location.
Defense in depth: network → application → data. Each layer assumes layers above have failed.
Principle of least privilege: per-service IAM roles, table-specific DB users, network paths only where needed. Broad permissions = large blast radius on compromise.
Web attacks: CSRF — SameSite=Strict cookies. SQL injection — parameterized queries. XSS — sanitize input + CSP headers.
Tools: service mesh (Istio/Linkerd) for mTLS, OPA for policy, WAF for edge protection.

Chapter Three

Secrets Management

Credentials Don't Belong in Code

Many of the most damaging breaches of the past decade trace back to a mishandled credential: an API key committed to Git, a database password in an environment variable exposed through a debug endpoint, a certificate or key that nobody rotated. Secrets management is not glamorous work, but it is the difference between a contained security incident and an existential breach. The principle is simple: no human should know production secrets, and no secret should be stored where code is stored.

Secrets Manager — Service Integration

🚫

Where Secrets Should NOT Live

Git repositories (even private ones)
Docker images or container env vars in manifests
Config files checked into source control
.env files on developer laptops
Slack messages, emails, wikis

✅

Where Secrets Should Live

HashiCorp Vault (self-hosted, lease-based)
AWS Secrets Manager / Parameter Store
GCP Secret Manager / Azure Key Vault
Kubernetes Secrets (encrypted at rest via KMS)
Injected at runtime, never baked into images

Rotation: secrets should rotate automatically on a schedule (30–90 days). The secrets manager updates the credential at both ends: the service picks up the new value on its next fetch, and the target (DB, API) accepts the new credential. Zero downtime.

Detection: tools like git-secrets and truffleHog, run as pre-commit hooks, scan for accidentally committed secrets before they reach the repository.

💡 Dynamic Secrets

Instead of storing a long-lived database password and rotating it periodically, Vault can generate a unique short-lived credential on demand for each service instance that needs it. The credential is valid for the duration of the lease (e.g., 1 hour) and automatically revoked when the lease expires. No rotation needed — the credential never lives long enough to become a liability. If a credential leaks, it expires in hours, not months. This is the pattern for database access in production Vault deployments.
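
The lease idea can be illustrated with a small in-memory simulation. This is not Vault's API: the class, its methods, and the explicit `now` argument are hypothetical, chosen so that the expiry behaviour is easy to see and test.

```python
import secrets
from dataclasses import dataclass


@dataclass(frozen=True)
class Lease:
    username: str
    password: str
    expires_at: float


class DynamicCredentialIssuer:
    """Toy stand-in for a secrets engine that mints per-instance credentials."""

    def __init__(self, ttl_s: float = 3600):
        self.ttl_s = ttl_s
        self._active = {}

    def issue(self, service: str, now: float) -> Lease:
        # Each call mints a fresh, unique credential bound to a lease.
        lease = Lease(
            username=f"{service}-{secrets.token_hex(4)}",
            password=secrets.token_urlsafe(24),
            expires_at=now + self.ttl_s,
        )
        self._active[lease.username] = lease
        return lease

    def is_valid(self, username: str, password: str, now: float) -> bool:
        lease = self._active.get(username)
        if lease is None or now >= lease.expires_at:
            self._active.pop(username, None)  # expired leases are revoked
            return False
        return secrets.compare_digest(lease.password, password)
```

Because each service instance holds its own credential, a leaked one can be traced to a single instance and stops working when its lease ends, without touching any other instance.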

🔑 The Secret Zero Problem

To access the secrets manager, the service needs a credential. But where does that initial credential come from?

Cloud IAM (preferred): on AWS, a service running on EC2 or Lambda uses its instance profile (IAM role). No credential stored anywhere — the cloud provider handles authentication. The instance identity IS the credential. This eliminates secret zero entirely.
On-premises: Vault's AppRole method with a trusted delivery mechanism (Kubernetes service accounts, CI/CD injected at runtime). The initial AppRole secret ID is short-lived and single-use.

A secret that cannot be rotated without downtime is a ticking time bomb. Design your systems so that credential rotation is automated, audited, and zero-downtime. If rotating a secret requires a deployment, you have a design problem — not an operations problem.

📋 Chapter 3 — Summary

Never store secrets in code, Git, images, or config files. Use a dedicated secrets manager.
Tools: Vault (lease-based), AWS Secrets Manager, GCP Secret Manager.
Rotation: automatic, on schedule, zero-downtime. The secrets manager updates both ends.
Dynamic secrets: Vault generates unique short-lived credentials on demand for each service instance. No rotation needed, because credentials expire automatically.
Secret zero: cloud IAM instance profiles eliminate the problem. The instance identity is the credential.
Detection: pre-commit hooks (git-secrets, truffleHog) catch leaks before push.

Chapter Four

Observability — Logs, Metrics, Traces

The Three Pillars of Knowing What's Happening

Monitoring tells you when something is broken. Observability tells you why it is broken. In a monolith, a stack trace is usually enough. In a distributed system with 50 services, a single request touches a dozen processes across multiple machines — you need correlated signals across all of them to understand a failure. The three pillars — logs, metrics, and traces — are not redundant; they answer fundamentally different questions.

📝

Logs

Discrete events with context
Structured (JSON) > unstructured text
Correlation IDs tie events across services
High volume — needs retention policies
Tools: ELK, Loki, CloudWatch Logs
Answers: "What happened?"

📊

Metrics

Numerical time-series data (counters, gauges, histograms)
Cheap to collect, aggregate, and alert on
RED method: Rate, Errors, Duration
USE method: Utilization, Saturation, Errors
Tools: Prometheus, Grafana, Datadog
Answers: "Is something wrong?"

🔗

Traces

End-to-end request path across services
Spans represent individual operations
Shows where time is spent (latency breakdown)
Critical for debugging distributed systems
Tools: Jaeger, Tempo, OpenTelemetry, Zipkin
Answers: "Where is it slow/failing?"

Structured Logging

❌

Unstructured Logging (Wrong)

"User 12345 failed login at 14:32:01"
Requires fragile regex to parse — breaks when format changes
"Find all failed logins for user X in the last hour" = multi-step grep
No consistent fields — impossible to build reliable alerts
Every service has a different format — no cross-service correlation

✅

Structured Logging (Correct)

JSON: {"ts":"2026-05-03T14:32:01Z","level":"WARN","event":"login_failed","user_id":"12345","ip":"1.2.3.4","reason":"bad_password","trace_id":"abc123","span_id":"def456","service":"auth-svc"}
Query (LogQL): {service="auth-svc"} | json | event="login_failed" | user_id="12345"
Required fields in every log entry: timestamp (ISO 8601 UTC), level, service, trace_id, span_id
Never log: passwords, tokens, PII (name, email, credit card). Log user_id — look up PII separately.
Do log: requests (method, path, status, duration), errors (full exception + context), significant business events
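
One way to produce log lines of this shape is a JSON formatter on top of the standard `logging` module. The formatter class, the `fields` key passed through `extra`, and the sample IDs are assumptions for this sketch; note that Python names the level `WARNING` rather than `WARN`.

```python
import io
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object with a fixed set of base fields."""

    def __init__(self, service: str):
        super().__init__()
        self.service = service

    def format(self, record: logging.LogRecord) -> str:
        ts = datetime.fromtimestamp(record.created, tz=timezone.utc)
        entry = {
            "ts": ts.isoformat(timespec="seconds").replace("+00:00", "Z"),
            "level": record.levelname,
            "service": self.service,
            "event": record.getMessage(),
        }
        # Per-event context (user_id, trace_id, ...) arrives via extra={"fields": {...}}.
        entry.update(getattr(record, "fields", {}))
        return json.dumps(entry)


def build_logger(service: str, stream) -> logging.Logger:
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter(service))
    logger = logging.getLogger(f"{service}.json")
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


buf = io.StringIO()  # stands in for stdout or a log shipper
log = build_logger("auth-svc", buf)
log.warning(
    "login_failed",
    extra={"fields": {"user_id": "12345", "reason": "bad_password",
                      "trace_id": "abc123", "span_id": "def456"}},
)
```

Keeping the event name as the message and everything else as fields means a query can filter on `event` and `user_id` directly instead of parsing free text.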

📐 RED and USE Methods

RED — for measuring service health (user-facing):

Rate: requests per second the service is receiving
Errors: percentage of requests failing
Duration: latency distribution (p50, p95, p99)

Rate drops, errors increase, or duration rises → something is wrong

USE — for measuring resource health:

Utilization: % of resource capacity in use
Saturation: work queued waiting for the resource
Errors: error events from the resource

Apply to CPU, memory, disk, network, DB connection pools. Saturation is the leading indicator.
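
The three RED numbers can be computed from raw request samples as follows. In practice a metrics system such as Prometheus derives them from counters and histograms; this sketch works on an in-memory list, and the `Request` type, the nearest-rank percentile, and the synthetic data are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    status: int
    duration_ms: float


def percentile(values, pct: int):
    """Nearest-rank percentile; pct is an integer from 1 to 100."""
    ordered = sorted(values)
    rank = max(1, (pct * len(ordered) + 99) // 100)  # integer ceil(pct * n / 100)
    return ordered[rank - 1]


def red_summary(requests, window_s: float) -> dict:
    durations = [r.duration_ms for r in requests]
    errors = sum(1 for r in requests if r.status >= 500)
    return {
        "rate_rps": len(requests) / window_s,          # Rate
        "error_pct": 100 * errors / len(requests),     # Errors (5xx only)
        "p50_ms": percentile(durations, 50),           # Duration
        "p95_ms": percentile(durations, 95),
        "p99_ms": percentile(durations, 99),
    }


# Synthetic window: 100 requests over 10 seconds, every 20th one fails.
sample = [Request(500 if i % 20 == 0 else 200, float(i)) for i in range(1, 101)]
summary = red_summary(sample, window_s=10)
```

Reporting p95 and p99 alongside p50 matters because a healthy median can hide a slow tail that a meaningful share of users still experience.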

🎯 Trace Sampling Strategy

Capturing every request as a trace at high traffic is prohibitively expensive. Two strategies:

Head-based sampling: decision made at request start, before the outcome is known. Simple, but misses rare errors: at a 1% sample rate, roughly 99% of errors go untraced.
Tail-based sampling: decision made after the request completes. Collector buffers spans; if the request had an error or high latency, it retains the trace — otherwise discards. Ensures all errors are captured. Implemented by OTel Collector and Grafana Tempo.

Recommended: tail-based for production (captures all errors), head-based for development and debugging.
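
The two decisions can be contrasted in a few lines. This is a simplified sketch, not the OTel Collector's configuration or API: the span dictionary keys, the hash-based head sampler, and the 500 ms threshold are assumptions for the example.

```python
import hashlib


def head_sample(trace_id: str, rate: float) -> bool:
    """Decide at request start, using only the trace ID.
    Hashing the ID makes the decision deterministic, so every service
    that sees the same trace reaches the same answer."""
    bucket = int.from_bytes(hashlib.sha256(trace_id.encode()).digest()[:8], "big")
    return bucket < int(rate * 2**64)


def tail_sample(spans, latency_threshold_ms: float = 500) -> bool:
    """Decide after the request completes, using the buffered spans.
    Keep the trace if any span failed or the whole request was slow."""
    has_error = any(span["error"] for span in spans)
    total_ms = max(s["end_ms"] for s in spans) - min(s["start_ms"] for s in spans)
    return has_error or total_ms > latency_threshold_ms


healthy = [{"start_ms": 0, "end_ms": 40, "error": False},
           {"start_ms": 5, "end_ms": 30, "error": False}]
failed = [{"start_ms": 0, "end_ms": 40, "error": False},
          {"start_ms": 5, "end_ms": 30, "error": True}]
slow = [{"start_ms": 0, "end_ms": 900, "error": False}]
```

The cost of the tail-based approach is visible in the signature: it needs every span of a trace in one place before it can decide, which is why the collector has to buffer.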

Observability Stack — Collect, Store, Query, Alert

Observability is not about collecting more data — it is about asking questions you didn't anticipate. Monitoring alerts on known failure modes. Observability lets you investigate unknown unknowns — the failures you never predicted. That requires correlated logs + metrics + traces with a unified query layer.

📋 Chapter 4 — Summary

Logs: discrete events. Structured JSON, correlation IDs. "What happened."
Structured logging: JSON with fixed fields (timestamp, level, service, trace_id). Required for queryable log aggregation. Never log PII — log user_id instead.
Metrics: numerical time-series. RED/USE methods. Cheap to alert on. "Is something wrong."
RED (Rate, Errors, Duration): user-facing service health. USE (Utilization, Saturation, Errors): resource health.
Traces: end-to-end request path. Latency breakdown. "Where is it slow."
Trace sampling: tail-based preferred — captures all errors, discards healthy traces after completion.
OpenTelemetry: single SDK for all three — vendor-neutral, future-proof instrumentation.
Stack: OTel SDK → Collector → Backends (Loki/Prometheus/Tempo) → Grafana → Alerts.

Chapter Five

SLOs, SLAs, SLIs & Error Budgets

Reliability as a Measured Commitment

Reliability without measurement is just hope. SLIs, SLOs, and SLAs turn reliability into a concrete, measurable engineering discipline. They answer three questions: What are we measuring? What target are we committing to? And what happens when we miss? The error budget framework then gives you a rational mechanism for deciding when to ship features versus when to focus on reliability — ending the eternal argument between product and engineering.

Choosing the Right SLIs

Not all metrics make good SLIs. A good SLI directly measures what the user experiences — not internal system health metrics.

❌

Poor SLIs

CPU utilization of web servers — users don't experience CPU. 95% CPU can still serve users happily.
Memory usage % — internal resource metric, not user experience.
Disk I/O — infrastructure signal, not user outcome.
Number of deploys per day — activity metric, not reliability measurement.

✅

Good SLIs

Availability: (successful requests / total requests) over a rolling window. Measured at the load balancer, not inside the app.
Latency: % of requests completing under threshold. Based on user research or benchmarks — not engineering intuition. p99, not just p50.
Error rate: % returning 5xx only. 4xx errors are client errors — not your SLO. Never penalize users for their own mistakes.
Freshness (data systems): % of time the most recently ingested data is within acceptable age.

📏

SLI — Indicator

The measurement. A quantitative metric of service behavior as experienced by users.

Examples: % requests < 200ms, % requests returning 2xx, uptime percentage over 30 days.

🎯

SLO — Objective

The internal target. The threshold your team commits to maintain for an SLI.

Example: 99.9% of requests complete in < 200ms over a rolling 30-day window.

📜

SLA — Agreement

The contract. External promise to customers with financial consequences if breached.

Set the SLA looser than the SLO to leave margin (for example, a 99.9% SLO behind a 99.5% SLA). SLA breach = credits/refunds to customers.

Error Budget = (1 − SLO) × Time Window

Example: SLO = 99.9% over 30 days. Error budget = 0.1% × 43,200 minutes = 43.2 minutes of allowed downtime. Every incident consumes from this budget. When the budget is exhausted: feature freeze, only reliability work until budget regenerates. This is the mechanism Google uses to balance innovation speed with system stability.
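
The arithmetic above, written out as a small helper. The function names are hypothetical, and the model assumes a time-based budget (minutes of downtime) as in the example:

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Allowed downtime in minutes: (1 - SLO) x time window."""
    return (1 - slo) * window_days * 24 * 60


def budget_remaining_fraction(slo: float, downtime_minutes: float,
                              window_days: int = 30) -> float:
    """Share of the budget still unspent; negative means the SLO is already missed."""
    return 1 - downtime_minutes / error_budget_minutes(slo, window_days)


budget = error_budget_minutes(0.999)
print(f"{budget:.1f} minutes")  # prints "43.2 minutes"
```

Adding one more nine shrinks the budget tenfold: at 99.99% the same 30-day window allows roughly 4.3 minutes, which is why each extra nine is so much more expensive to operate.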

Error Budget — The Release/Reliability Decision

100% reliability is the wrong target. It is infinitely expensive and prevents all change. Error budgets make the trade-off explicit: "We accept X minutes of failure per month in exchange for the ability to ship features at Y velocity." When budget is healthy, ship faster. When budget is burned, slow down. No arguments — the math decides.

🔔 Multi-Window Burn Rate Alerting

A single SLO threshold creates alert fatigue. A burn rate that exhausts your budget in 30 minutes is a crisis; one that exhausts it in 30 days is a concern. Alert based on how fast the budget is burning:

Fast burn — page immediately: 5% of budget consumed in 1 hour. Catastrophic — requires immediate response. Wake someone up.
Slow burn — create a ticket: 10% of budget consumed in 6 hours. Concerning but not emergency-level.
Very slow burn — plan reliability work: budget on track to exhaust in 3 days. Schedule reliability work for next sprint.

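This follows the burn-rate alerting model in Google's SRE Workbook, whose suggested starting thresholds are 2% of budget in 1 hour or 5% in 6 hours for a page, and 10% in 3 days for a ticket. Tune the numbers to your service. Done well, it eliminates both false-alarm paging and budget-exhaustion surprises.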
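
A burn rate of 1 means the budget lasts exactly the SLO window; higher rates exhaust it proportionally sooner. The sketch below converts "X% of budget in Y hours" thresholds into burn rates, assuming a 30-day window; the function names are hypothetical.

```python
SLO_WINDOW_H = 30 * 24  # 720 hours in a 30-day window


def burn_rate(budget_fraction_consumed: float, alert_window_h: float,
              slo_window_h: float = SLO_WINDOW_H) -> float:
    """Burn rate implied by consuming a fraction of the budget in a given window."""
    return budget_fraction_consumed * slo_window_h / alert_window_h


def hours_to_exhaustion(rate: float, slo_window_h: float = SLO_WINDOW_H) -> float:
    """How long a full budget lasts if this burn rate is sustained."""
    return slo_window_h / rate


def observed_burn_rate(error_ratio: float, slo: float) -> float:
    """Burn rate seen in live traffic: error ratio relative to the allowed ratio."""
    return error_ratio / (1 - slo)


fast = burn_rate(0.05, alert_window_h=1)   # about 36: budget gone in about 20 hours
slow = burn_rate(0.10, alert_window_h=6)   # about 12: budget gone in about 60 hours
```

An alert rule then compares `observed_burn_rate` over the matching window with these thresholds. For a 99.9% SLO, a sustained 3.6% error ratio corresponds to the fast-burn rate of 36.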

📋 Chapter 5 — Summary

SLI: the measurement (latency, error rate, availability). Must measure user experience directly.
SLO: the target (99.9% of requests < 200ms). Internal commitment.
SLA: the contract (with financial consequences). Set below SLO for margin.
Good SLIs measure user experience: request success rate, latency percentile, error rate (5xx only). Not CPU or memory.
Error Budget = (1 − SLO) × time window. Exhausted = feature freeze.
Mechanism: budget healthy → ship fast. Budget burned → reliability focus only.
Burn rate alerting: fast burn (5% in 1hr) = page. Slow burn (10% in 6hr) = ticket. Avoids single-threshold alert fatigue.

Chapter Six

Incident Response Patterns

What You Do When Things Break at 3am

Incidents are not a sign of failure — they are inevitable in complex systems. The difference between teams that recover in minutes versus hours is not talent — it is preparation. Runbooks written when calm, severity definitions agreed upon before the fire, escalation paths that don't require remembering who is on call. Incident response is a practiced discipline, not an ad-hoc panic reaction.

Incident Lifecycle — Detect to Learn

On-Call Design

An on-call rotation that pages engineers every night is not an on-call rotation — it is a burnout program that destroys teams. On-call health is a leading indicator of engineering retention. Design for sustainability from day one.

🚫

Unsustainable On-Call Patterns

Alerts that fire but require no human action — noise drowns signals
Ops team separate from dev team — no feedback loop, no incentive to build reliably
No handoff: incoming engineer starts from zero every shift
More than 50% of on-call time spent on repetitive manual tasks (toil)
Consequence: alert fatigue → real alerts ignored → longer MTTD

✅

Sustainable On-Call Practices

Every alert must be actionable — if it resolves itself, downgrade or eliminate it
You build it, you run it — team that builds the service owns its on-call rotation
Written shift handoff: what happened, what is in progress, what needs attention next
Measure toil — if repetitive manual work exceeds 50% of engineering time, automate before building features
On-call health reviewed in retrospectives alongside feature velocity

📋

Runbooks & Playbooks

Written by humans at rest, for humans under stress
Step-by-step: what to check, what to run, who to call
Maintained alongside the service they document
Decision trees for common scenarios (DB full, OOM, spike)
Tested during game days — not discovered during incidents

📝

Blameless Post-Mortems

Focus on systemic causes, not individual mistakes
Timeline: what happened, when, how detected
Root cause: why the system allowed this failure
Action items: concrete, assigned, deadlined
Shared openly — learning benefits the entire org

🔴

SEV-1: Critical

Customer-facing outage. All hands. War room. Exec informed. Target: resolve in < 30 min.

🟡

SEV-2: Major

Degraded service, partial impact. On-call + team lead. Target: resolve in < 2 hours.

🟢

SEV-3: Minor

No user impact, internal issue. On-call handles during business hours. Fix within sprint.

🎮 Game Days & Chaos Engineering

Game days are scheduled exercises where failures are deliberately introduced in a controlled environment to test whether runbooks, alerts, and escalation paths work as expected. Teams that only discover runbook gaps during real incidents pay the cost in user impact and team stress.

Common scenarios: simulate a database failover, kill 50% of service instances, block network traffic between two services, exhaust disk space on a worker node. Each scenario has a hypothesis (what do we expect?), an observation (what actually happened?), and action items for gaps found.

This is the same philosophy as chaos engineering in the availability domain — validate resilience assumptions before production does it for you.

The goal of incident response is not "never have incidents." It is to minimize MTTD (mean time to detect) and MTTR (mean time to resolve). Invest in faster detection (better alerts), faster mitigation (rollback automation), and faster learning (blameless post-mortems). Incidents become data points, not disasters.

📋 Chapter 6 — Summary

Lifecycle: Detect → Triage → Mitigate → Resolve → Post-Mortem. Mitigate first, root-cause later.
Sustainable on-call: every alert actionable, you build it you run it, toil under 50% of engineering time.
Runbooks: written calm, used under stress. Decision trees, not novels. Tested on game days.
Severity levels: SEV-1 (outage, all hands) → SEV-2 (degraded) → SEV-3 (internal, no user impact).
Post-mortems: blameless, systemic, action-item-driven. Shared openly for org-wide learning.
Game days: deliberate controlled failures to validate runbooks and alerts before real incidents expose the gaps.
Goal: minimize MTTD + MTTR. Incidents are inevitable — fast recovery is the skill.

Security & Observability at a Glance

01 · AuthN & AuthZ

Identity First, Permissions Second

AuthN = who. AuthZ = what. Separate concerns
JWT: stateless, verified at gateway via JWKS
OAuth 2.0: Auth Code + PKCE for user apps, Client Credentials for service-to-service
Token storage: refresh tokens in HttpOnly cookies, access tokens in memory only
RBAC: simple, auditable. ReBAC: relationship-based (OpenFGA).

02 · Zero Trust

Never Trust, Always Verify

No implicit trust for internal network traffic
mTLS between every service pair
Defense in depth: network → app → data layers
Least privilege: per-service roles, table-specific DB users
CSRF: SameSite=Strict. SQLi: parameterized queries. XSS: CSP headers

03 · Secrets Management

No Human Knows Production Secrets

Never in code, Git, images, or env files
Vault / AWS SM: encrypted, rotated, audited
Dynamic secrets: Vault generates short-lived creds on demand
Secret zero: cloud IAM instance profile IS the credential

04 · Observability

Logs + Metrics + Traces = Understanding

Structured JSON logs: timestamp, level, service, trace_id. Never log PII
RED (Rate, Errors, Duration): service health. USE: resource health
Tail-based trace sampling: captures all errors, discards healthy traces
OTel → Collector → Loki/Prometheus/Tempo → Grafana

05 · SLOs & Error Budgets

Reliability as Engineering Math

Good SLIs: request success rate, latency p99, 5xx error rate only
SLI = measurement. SLO = target. SLA = contract
Error budget = (1 − SLO) × time window
Burn rate alerting: fast burn = page, slow burn = ticket

06 · Incident Response

Minimize MTTD + MTTR

Detect → Triage → Mitigate → Resolve → Post-Mortem
Sustainable on-call: every alert actionable, you build it you run it
Game days: deliberate failures to validate runbooks before real incidents
Blameless post-mortems: systemic, action-item-driven
