System Design · Security & Observability

Security & Observability

Protecting systems and understanding what is happening inside them.

01
Chapter One

Authentication & Authorization

Who Are You, and What Can You Do?

Authentication (AuthN) and authorization (AuthZ) are the two questions every request must answer before it touches your business logic. They are distinct concerns, and confusing them is one of the most common security mistakes in distributed systems. AuthN is identity: proving you are who you claim to be. AuthZ is permission: determining what that identity is allowed to do. Getting either wrong usually means getting breached.

Authentication (AuthN)
  • Answers: "Who are you?"
  • Verifies identity: username/password, MFA, certificate
  • Results in: a trusted identity (user ID, service identity)
  • Happens once at login (then token-based for subsequent requests)
  • Standards: OpenID Connect (on top of OAuth 2.0), SAML
Authorization (AuthZ)
  • Answers: "What can you do?"
  • Checks permissions: roles, policies, attributes
  • Results in: allow or deny for a specific action
  • Happens on every request (enforcement point)
  • Models: RBAC, ABAC, ReBAC, policy engines (OPA)
JWT & Token-Based Authentication
JWT Validation at API Gateway: Stateless Auth
Diagram: Client (Bearer: JWT) → API Gateway (1. verify signature, 2. check expiry, 3. extract claims) → Service A and Service B. The Auth Server (IdP) issues JWTs and publishes its public keys via JWKS.
Stateless: no session lookup needed. The gateway verifies the JWT using cached public keys. Services receive pre-validated claims and never talk to the auth server themselves.
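The three gateway steps (verify signature, check expiry, extract claims) can be sketched with the standard library alone. This is a minimal sketch, not a production validator: to stay self-contained it uses HS256 with a shared secret instead of the RS256/JWKS public-key setup the diagram describes, and the function names `sign_jwt` and `verify_jwt` are illustrative assumptions. A real gateway would normally rely on a maintained JWT library.

```python
import base64
import hashlib
import hmac
import json
import time
from typing import Optional


def _b64url_encode(raw: bytes) -> str:
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")


def _b64url_decode(text: str) -> bytes:
    # Restore the padding that JWTs strip from base64url segments.
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def sign_jwt(claims: dict, secret: bytes) -> str:
    """Stand-in for the auth server: issue an HS256-signed token (assumption: shared secret)."""
    header = {"alg": "HS256", "typ": "JWT"}
    signing_input = ".".join(
        _b64url_encode(json.dumps(part, separators=(",", ":")).encode("utf-8"))
        for part in (header, claims)
    )
    signature = hmac.new(secret, signing_input.encode("ascii"), hashlib.sha256).digest()
    return f"{signing_input}.{_b64url_encode(signature)}"


def verify_jwt(token: str, secret: bytes, now: Optional[float] = None) -> dict:
    """Gateway side: 1. verify signature, 2. check expiry, 3. return claims."""
    header_b64, payload_b64, signature_b64 = token.split(".")
    header = json.loads(_b64url_decode(header_b64))
    if header.get("alg") != "HS256":
        # Never let the token choose its own algorithm.
        raise ValueError("unexpected algorithm")

    expected = hmac.new(
        secret, f"{header_b64}.{payload_b64}".encode("ascii"), hashlib.sha256
    ).digest()
    if not hmac.compare_digest(expected, _b64url_decode(signature_b64)):
        raise ValueError("bad signature")

    claims = json.loads(_b64url_decode(payload_b64))
    current = time.time() if now is None else now
    if "exp" not in claims or claims["exp"] <= current:
        raise ValueError("token expired")
    return claims
```

Downstream services would receive only the returned claims (for example as trusted headers), which is what keeps them from needing their own connection to the auth server.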

RBAC: Role-Based Access

  • Users assigned roles: admin, editor, viewer
  • Roles have fixed permissions
  • Simple, auditable, well-understood
  • Limitation: "role explosion" when permissions are complex
  • Best for: most B2B SaaS, internal tools

ABAC: Attribute-Based Access

  • Access based on attributes: user.dept=engineering AND resource.env=prod
  • Flexible, fine-grained, context-aware
  • Policy engine evaluates rules at runtime (OPA, Cedar)
  • More complex: harder to audit "who can do what"
  • Best for: multi-tenant, regulatory environments

ReBAC: Relationship-Based Access

  • Access determined by the relationship between user and resource
  • Can edit a document if: owner, or member of group with edit access
  • How Google Docs permissions work
  • Tools: OpenFGA and SpiceDB, both modeled on Google's Zanzibar
  • Best for: permissions that are graph-like and contextual
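To make the difference between the three models concrete, here is a small sketch of one check per model. The role table, the attribute rule, and the relationship tuples are illustrative assumptions drawn from the examples in the lists above; real systems would delegate these checks to a policy engine or a relationship store rather than in-process dictionaries.

```python
# RBAC: a role maps to a fixed set of permissions.
ROLE_PERMISSIONS = {
    "admin": {"read", "write", "delete"},
    "editor": {"read", "write"},
    "viewer": {"read"},
}


def rbac_allows(role: str, action: str) -> bool:
    return action in ROLE_PERMISSIONS.get(role, set())


# ABAC: the decision is a rule over attributes of the user and the resource.
def abac_allows(user: dict, resource: dict) -> bool:
    # Rule from the text: user.dept=engineering AND resource.env=prod
    return user.get("dept") == "engineering" and resource.get("env") == "prod"


# ReBAC: the decision follows edges in a relationship graph,
# stored here as (subject, relation, object) tuples.
RELATIONS = {
    ("alice", "owner", "doc:design"),
    ("bob", "member", "group:eng"),
    ("group:eng", "editor", "doc:design"),
}


def rebac_can_edit(user: str, doc: str, relations=RELATIONS) -> bool:
    # Direct ownership grants edit access.
    if (user, "owner", doc) in relations:
        return True
    # Otherwise, look for a group the user belongs to that has edit access.
    groups = {obj for subj, rel, obj in relations if subj == user and rel == "member"}
    return any((group, "editor", doc) in relations for group in groups)
```

Note how the RBAC check needs only the role, the ABAC check needs attributes of both sides, and the ReBAC check needs a graph lookup; that is also roughly the order of increasing operational cost.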
OAuth 2.0 and OpenID Connect

OAuth 2.0 is an authorization framework: it allows a user to grant a third-party application limited access to their resources without sharing their password. The key concept is delegation: the user authorizes the app, the app receives an access token, and the app uses that token to call APIs on the user's behalf. OAuth 2.0 alone does not tell the app who the user is; it only grants access.

OpenID Connect (OIDC) is an identity layer built on top of OAuth 2.0. It adds an ID token (a JWT containing user identity claims: sub, email, name) alongside the access token. This is how "Sign in with Google" works: OAuth 2.0 handles the authorization flow, OIDC provides the identity.


Authorization Code Flow + PKCE

For user-facing web and mobile apps.

  • User logs in at the IdP (Google, Auth0, Cognito)
  • IdP issues an authorization code to the app
  • App exchanges code for access token + ID token
  • PKCE (Proof Key for Code Exchange) prevents code interception attacks; required for all public clients
  • Access token: short-lived (15 min to 1 hr), used to call APIs
  • Refresh token: long-lived, obtains new access tokens silently
  • ID token: JWT with user claims; parsed by the app, never sent to APIs
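The PKCE step in the flow above comes down to two values: a random `code_verifier` the app keeps private, and a `code_challenge` derived from it that is sent with the authorization request. A minimal sketch of the S256 method from RFC 7636 follows; the function names are assumptions made for illustration.

```python
import base64
import hashlib
import secrets


def make_code_verifier(n_bytes: int = 32) -> str:
    """Random, URL-safe verifier; 32 random bytes encode to 43 characters."""
    raw = secrets.token_bytes(n_bytes)
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")


def code_challenge_s256(verifier: str) -> str:
    """code_challenge = BASE64URL(SHA256(ASCII(code_verifier))), without padding."""
    digest = hashlib.sha256(verifier.encode("ascii")).digest()
    return base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")
```

The app sends the challenge when it redirects the user to the IdP and sends the verifier when it exchanges the code. An attacker who intercepts only the authorization code cannot complete the exchange without the verifier.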

Client Credentials Flow

For service-to-service (machine-to-machine) authentication.

  • No user involved: Service A authenticates as itself to call Service B
  • Service presents its client_id + client_secret to the IdP
  • Receives an access token scoped to the service's permissions
  • Used for background jobs, microservice communication, CI/CD pipelines
  • Rotate client_secret regularly; treat it as a production secret

Token Storage Security

Where you store tokens determines your attack surface. Access tokens in localStorage are vulnerable to XSS: any injected script can read them. Access tokens in memory (a JS variable) are lost on page refresh. Recommended pattern for web apps: store refresh tokens in HttpOnly, Secure, SameSite=Strict cookies (not accessible to JavaScript), and keep short-lived access tokens in memory only. For mobile: use iOS Keychain or Android Keystore. Never log tokens. Never send tokens in URL parameters, because they appear in server logs and browser history.
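As a sketch of the recommended cookie attributes, the following builds the value of a `Set-Cookie` header for a refresh token using Python's standard `http.cookies` module. The cookie name `refresh_token`, the `/auth/refresh` path, and the seven-day lifetime are assumptions chosen for illustration.

```python
from http.cookies import SimpleCookie


def refresh_cookie_header(token: str, max_age_seconds: int = 7 * 24 * 3600) -> str:
    """Return a Set-Cookie header value for a refresh token."""
    cookie = SimpleCookie()
    cookie["refresh_token"] = token
    morsel = cookie["refresh_token"]
    morsel["httponly"] = True          # not readable from JavaScript
    morsel["secure"] = True            # only sent over HTTPS
    morsel["samesite"] = "Strict"      # not sent on cross-site requests
    morsel["path"] = "/auth/refresh"   # hypothetical refresh endpoint
    morsel["max-age"] = max_age_seconds
    return morsel.OutputString()
```

Scoping the cookie to the refresh endpoint's path means the browser does not attach the refresh token to ordinary API calls, which narrows where it can leak.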

Authentication is a solved problem: use an identity provider (Auth0, Cognito, Keycloak). Do not build your own. Authorization is where the real design decisions live. The auth model you choose determines whether your permissions scale with your product complexity or collapse under it.

Chapter 1: Summary
  • AuthN = who are you. AuthZ = what can you do. They are separate concerns.
  • JWT: stateless token, verified by signature check at gateway. No session store needed.
  • OAuth 2.0: Authorization Code + PKCE for user apps, Client Credentials for service-to-service. OIDC adds identity (ID token) on top.
  • Token storage: refresh tokens in HttpOnly cookies, access tokens in memory only. Never localStorage.
  • RBAC: roles with fixed permissions. Simple, auditable. Best default.
  • ABAC: attribute + context rules. Flexible but harder to audit. Use when RBAC can't express your needs.
  • ReBAC: relationship-based. A user can act on a resource if a graph relationship permits it. Google Docs model. OpenFGA / SpiceDB.
02
Chapter Two

Zero Trust & Defense in Depth

Never Trust, Always Verify

The traditional network security model was simple: everything inside the firewall is trusted, everything outside is not. That model is dead. Cloud services, remote workers, microservices spanning multiple networks, third-party integrations: the perimeter no longer exists. Zero Trust assumes that any request, internal or external, could be malicious, and verifies every single one. It is not a product you buy; it is an architecture principle you implement layer by layer.

Zero Trust: mTLS Between All Services
Diagram: inside an internal network that is NOT trusted, Service A (cert: svc-a.mesh), Service B (cert: svc-b.mesh), and Service C (cert: svc-c.mesh) talk to each other over mTLS. A policy engine (OPA / Istio) holds rules such as "svc-a → svc-b: ALLOW". Every service proves identity (mTLS) AND checks policy (authorization) on every call.

Network Layer

  • VPCs, security groups, NACLs
  • WAF for application-layer attacks
  • DDoS protection (AWS Shield, Cloudflare)
  • Private subnets for internal services

Application Layer

  • mTLS for service-to-service identity
  • Input validation and parameterized queries (prevent SQL injection)
  • CSRF: SameSite=Strict cookies + CSRF tokens for state-changing ops
  • XSS: sanitize input and encode output; CSP headers restrict script execution
  • Rate limiting per identity
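Parameterized queries are the easiest of these controls to show in code. The sketch below uses the standard `sqlite3` module with an in-memory database; the `users` table and helper names are assumptions for illustration. The point is that user input is passed as a bound parameter and never concatenated into the SQL string.

```python
import sqlite3


def make_demo_db() -> sqlite3.Connection:
    """Create a throwaway in-memory database with two users."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, username TEXT)")
    conn.executemany(
        "INSERT INTO users (username) VALUES (?)", [("alice",), ("bob",)]
    )
    return conn


def find_user(conn: sqlite3.Connection, username: str) -> list:
    # The "?" placeholder makes the driver treat `username` strictly as data,
    # so quotes or SQL keywords inside it cannot change the query structure.
    cursor = conn.execute(
        "SELECT id, username FROM users WHERE username = ?", (username,)
    )
    return cursor.fetchall()
```

With string concatenation, an input such as `' OR '1'='1` would turn the WHERE clause into a condition that matches every row. With a bound parameter it is simply a username that does not exist.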

Data Layer

  • Encryption at rest (AES-256)
  • Encryption in transit (TLS 1.3)
  • Column-level encryption for PII
  • Audit logs on all data access
Principle of Least Privilege

Every service, user, and process should have the minimum permissions required to do its job, and nothing more. If a service only needs to read from one database table, it should not have write access, and certainly not admin access. This principle is foundational to Zero Trust: it limits the blast radius when a component is compromised.


Overly Permissive (Wrong)

  • One shared IAM role with broad permissions across all services
  • When this role is compromised: blast radius = entire system
  • Application DB user has CREATE, DROP, TRUNCATE, schema admin
  • Services can reach any other service on the internal network
  • Done because it is faster in the short term; costs everything in a breach

Least Privilege (Correct)

  • Each service gets its own IAM role with only its required operations
  • When one role is compromised: blast radius = one service only
  • Application DB user has SELECT, INSERT, UPDATE on specific tables only
  • Network paths defined explicitly: the analytics service has no route to the payment service
  • Reviewed quarterly: permissions that are no longer needed are removed

Defense in depth means that no single layer's failure causes a breach. If the network is breached, the app still validates. If the app has a bug, the data is still encrypted. If the encryption key leaks, the audit trail catches it. Each layer assumes the layer above has already failed.

Chapter 2: Summary
  • Zero Trust: no implicit trust for internal traffic. Verify identity + check policy on every call.
  • mTLS: both sides of a connection present certificates. Service identity, not just network location.
  • Defense in depth: network → application → data. Each layer assumes layers above have failed.
  • Principle of least privilege: per-service IAM roles, table-specific DB users, network paths only where needed. Broad permissions = large blast radius on compromise.
  • Web attacks: CSRF (SameSite=Strict cookies), SQL injection (parameterized queries), XSS (sanitize input + CSP headers).
  • Tools: service mesh (Istio/Linkerd) for mTLS, OPA for policy, WAF for edge protection.
03
Chapter Three

Secrets Management

Credentials Don't Belong in Code

Many of the most damaging breaches of the past decade trace back to a mishandled credential: an API key committed to Git, a database password in an environment variable exposed through a debug endpoint, an expired certificate that nobody rotated. Secrets management is not glamorous work, but it is the difference between "security incident" and "existential breach." The principle is simple: no human should know production secrets, and no secret should be stored where code is stored.

Secrets Manager: Service Integration
Diagram: the application needs the DB password at startup and requests it from the secrets manager (Vault / AWS SM), which provides encryption at rest, auto-rotation, an audit trail, and lease-based access. The secrets manager rotates the password on the database. The app never stores the password; it requests a short-lived lease. Automatic rotation = zero-downtime credential changes.

Where Secrets Should NOT Live

  • Git repositories (even private ones)
  • Docker images or container env vars in manifests
  • Config files checked into source control
  • .env files on developer laptops
  • Slack messages, emails, wikis

Where Secrets Should Live

  • HashiCorp Vault (self-hosted, lease-based)
  • AWS Secrets Manager / Parameter Store
  • GCP Secret Manager / Azure Key Vault
  • Kubernetes Secrets (encrypted at rest via KMS)
  • Injected at runtime, never baked into images

Rotation: secrets should rotate automatically on a schedule (every 30–90 days). The secrets manager updates the credential at both ends: the service gets the new value on its next request, and the target (DB, API) accepts the new credential. Zero downtime. Detection: scanners such as git-secrets and truffleHog, typically run as pre-commit hooks, look for accidentally committed secrets before they reach the repository.
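To illustrate what such a scanner does, here is a deliberately tiny sketch that checks text for two well-known patterns: AWS access key IDs (which start with `AKIA` followed by 16 uppercase letters or digits) and PEM private key headers. It is not how git-secrets or truffleHog are implemented; the pattern set and the `scan_text` helper are assumptions for illustration, and real tools carry far more rules plus entropy checks.

```python
import re

# Two illustrative patterns; real scanners ship many more.
AWS_ACCESS_KEY_ID = re.compile(r"\bAKIA[0-9A-Z]{16}\b")
PRIVATE_KEY_HEADER = re.compile(r"-----BEGIN (?:RSA |EC |OPENSSH )?PRIVATE KEY-----")


def scan_text(text: str) -> list:
    """Return (line_number, finding_type) pairs for suspicious lines."""
    findings = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        if AWS_ACCESS_KEY_ID.search(line):
            findings.append((lineno, "aws-access-key-id"))
        if PRIVATE_KEY_HEADER.search(line):
            findings.append((lineno, "private-key"))
    return findings
```

A pre-commit hook would run a check like this over the staged diff and block the commit when the findings list is non-empty.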

Dynamic Secrets

Instead of storing a long-lived database password and rotating it periodically, Vault can generate a unique short-lived credential on demand for each service instance that needs it. The credential is valid for the duration of the lease (e.g., 1 hour) and automatically revoked when the lease expires. No rotation needed: the credential never lives long enough to become a liability. If a credential leaks, it expires in hours, not months. This is the pattern for database access in production Vault deployments.
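The lease idea can be shown with a toy in-memory broker. This is only a conceptual sketch: it does not use or imitate Vault's API, and the `ToyLeaseBroker` class, its methods, and the explicit `now` parameter are all assumptions made so the behavior is easy to follow and test.

```python
import secrets
from dataclasses import dataclass


@dataclass(frozen=True)
class Lease:
    username: str
    password: str
    expires_at: float  # seconds, on whatever clock the caller supplies


class ToyLeaseBroker:
    """Issues a unique short-lived credential per request and expires it by TTL."""

    def __init__(self, ttl_seconds: float = 3600.0):
        self.ttl_seconds = ttl_seconds
        self._active = {}

    def issue(self, service: str, now: float) -> Lease:
        # Every call creates a brand-new credential; nothing is shared or reused.
        lease = Lease(
            username=f"{service}-{secrets.token_hex(8)}",
            password=secrets.token_urlsafe(24),
            expires_at=now + self.ttl_seconds,
        )
        self._active[lease.username] = lease
        return lease

    def is_valid(self, username: str, now: float) -> bool:
        lease = self._active.get(username)
        if lease is None:
            return False
        if now >= lease.expires_at:
            # Revoke on expiry: a leaked credential stops working on its own.
            del self._active[username]
            return False
        return True
```

In a real deployment the broker would also create and drop the matching database user, so revocation is enforced by the database itself rather than by a lookup table.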

The Secret Zero Problem

To access the secrets manager, the service needs a credential. But where does that initial credential come from?

  • Cloud IAM (preferred): on AWS, a service running on EC2 or Lambda uses its instance profile (IAM role). No credential is stored anywhere; the cloud provider handles authentication. The instance identity IS the credential. This eliminates secret zero entirely.
  • On-premises: Vault's AppRole method with a trusted delivery mechanism (Kubernetes service accounts, CI/CD injected at runtime). The initial AppRole secret ID is short-lived and single-use.

A secret that cannot be rotated without downtime is a ticking time bomb. Design your systems so that credential rotation is automated, audited, and zero-downtime. If rotating a secret requires a deployment, you have a design problem, not an operations problem.

Chapter 3: Summary
  • Never store secrets in code, Git, images, or config files. Use a dedicated secrets manager.
  • Tools: Vault (lease-based), AWS Secrets Manager, GCP Secret Manager.
  • Rotation: automatic, on schedule, zero-downtime. Both ends updated atomically.
  • Dynamic secrets: Vault generates unique short-lived credentials on demand. No rotation needed; credentials expire automatically.
  • Secret zero: cloud IAM instance profiles eliminate the problem. The instance identity is the credential.
  • Detection: pre-commit hooks (git-secrets, truffleHog) catch leaks before push.
04
Chapter Four

Observability: Logs, Metrics, Traces

The Three Pillars of Knowing What's Happening

Monitoring tells you when something is broken. Observability tells you why it is broken. In a monolith, a stack trace is usually enough. In a distributed system with 50 services, a single request touches a dozen processes across multiple machines, so you need correlated signals across all of them to understand a failure. The three pillars (logs, metrics, and traces) are not redundant; they answer fundamentally different questions.


Logs

  • Discrete events with context
  • Structured (JSON) > unstructured text
  • Correlation IDs tie events across services
  • High volume: needs retention policies
  • Tools: ELK, Loki, CloudWatch Logs
  • Answers: "What happened?"

Metrics

  • Numerical time-series data (counters, gauges, histograms)
  • Cheap to collect, aggregate, and alert on
  • RED method: Rate, Errors, Duration
  • USE method: Utilization, Saturation, Errors
  • Tools: Prometheus, Grafana, Datadog
  • Answers: "Is something wrong?"

Traces

  • End-to-end request path across services
  • Spans represent individual operations
  • Shows where time is spent (latency breakdown)
  • Critical for debugging distributed systems
  • Tools: Jaeger, Tempo, OpenTelemetry, Zipkin
  • Answers: "Where is it slow/failing?"
Structured Logging

Unstructured Logging (Wrong)

  • "User 12345 failed login at 14:32:01"
  • Requires fragile regex to parse; breaks when the format changes
  • "Find all failed logins for user X in the last hour" = multi-step grep
  • No consistent fields: impossible to build reliable alerts
  • Every service has a different format: no cross-service correlation

Structured Logging (Correct)

  • JSON: {"ts":"2026-05-03T14:32:01Z","level":"WARN","event":"login_failed","user_id":"12345","ip":"1.2.3.4","reason":"bad_password","trace_id":"abc123","service":"auth-svc"}
  • Query (LogQL-style): {service="auth-svc"} | json | event="login_failed" | user_id="12345"
  • Required fields in every log line: timestamp (ISO 8601 UTC), level, service, trace_id, span_id
  • Never log: passwords, tokens, PII (name, email, credit card). Log user_id โ€” look up PII separately.
  • Do log: requests (method, path, status, duration), errors (full exception + context), significant business events
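One way to get log lines in this shape from Python's standard `logging` module is a small custom formatter. The `JsonFormatter` class and the convention of passing extra fields under a `fields` key are assumptions made for this sketch, not a standard API. Python also names the level `WARNING` rather than `WARN`.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object with a fixed set of base fields."""

    def __init__(self, service: str):
        super().__init__()
        self.service = service

    def format(self, record: logging.LogRecord) -> str:
        timestamp = datetime.fromtimestamp(record.created, tz=timezone.utc)
        entry = {
            # ISO 8601 in UTC, with "Z" instead of "+00:00".
            "ts": timestamp.isoformat(timespec="seconds").replace("+00:00", "Z"),
            "level": record.levelname,
            "service": self.service,
            "event": record.getMessage(),
        }
        # Structured context (user_id, trace_id, span_id, ...) passed via `extra`.
        entry.update(getattr(record, "fields", {}))
        return json.dumps(entry)


def build_logger(service: str) -> logging.Logger:
    logger = logging.getLogger(service)
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter(service))
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    return logger
```

Usage would look like `build_logger("auth-svc").warning("login_failed", extra={"fields": {"user_id": "12345", "trace_id": "abc123", "span_id": "def456"}})`, where the IDs are placeholder values. Keeping the event name short and stable and putting the variable parts into fields is what makes the logs queryable.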

RED and USE Methods

RED, for measuring service health (user-facing):

  • Rate: requests per second the service is receiving
  • Errors: percentage of requests failing
  • Duration: latency distribution (p50, p95, p99)

Rate drops, errors increase, or duration rises → something is wrong

USE, for measuring resource health:

  • Utilization: % of resource capacity in use
  • Saturation: work queued waiting for the resource
  • Errors: error events from the resource

Apply to CPU, memory, disk, network, DB connection pools. Saturation is the leading indicator.
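As a rough sketch of how the RED numbers fall out of raw request data, the function below summarizes a window of `(status_code, duration_ms)` pairs. The input shape, the function names, and the choice of the nearest-rank percentile method are assumptions; a metrics system such as Prometheus would compute these from counters and histograms instead.

```python
import math


def percentile(sorted_values: list, pct: int) -> float:
    """Nearest-rank percentile of an already sorted, non-empty list."""
    rank = max(1, math.ceil(pct * len(sorted_values) / 100))
    return sorted_values[rank - 1]


def red_summary(requests: list, window_seconds: float) -> dict:
    """Summarize Rate, Errors, Duration for (status_code, duration_ms) pairs."""
    if not requests:
        raise ValueError("no requests in window")
    durations = sorted(duration for _, duration in requests)
    server_errors = sum(1 for status, _ in requests if status >= 500)
    return {
        "rate_rps": len(requests) / window_seconds,          # Rate
        "error_pct": 100 * server_errors / len(requests),    # Errors
        "p50_ms": percentile(durations, 50),                 # Duration
        "p95_ms": percentile(durations, 95),
        "p99_ms": percentile(durations, 99),
    }
```

Reporting p95 and p99 alongside p50 matters because a healthy median can hide a slow tail that a meaningful share of users still experience.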

Trace Sampling Strategy

Capturing every request as a trace at high traffic is prohibitively expensive. Two strategies:

  • Head-based sampling: decision made at request start. Simple, but misses rare errors: if you sample 1%, roughly 99% of errors go untraced.
  • Tail-based sampling: decision made after the request completes. The collector buffers spans; if the request had an error or high latency, it retains the trace, otherwise it discards it. Ensures all errors are captured. Implemented, for example, by the OpenTelemetry Collector's tail sampling processor in front of a backend such as Grafana Tempo.

Recommended: tail-based for production (captures all errors), head-based for development and debugging.
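The tail-based decision itself is simple once all spans of a trace are buffered. The sketch below shows only that decision; the `Span` type, the 1000 ms threshold, and the `tail_sample` function are assumptions for illustration and do not reflect any collector's actual configuration or API.

```python
from dataclasses import dataclass


@dataclass
class Span:
    trace_id: str
    duration_ms: float
    is_error: bool = False


def tail_sample(buffered: dict, latency_threshold_ms: float = 1000.0) -> list:
    """Given trace_id -> list of completed spans, return the trace IDs to keep."""
    kept = []
    for trace_id, spans in buffered.items():
        has_error = any(span.is_error for span in spans)
        is_slow = any(span.duration_ms > latency_threshold_ms for span in spans)
        # Keep the whole trace if anything in it failed or was slow;
        # healthy traces are dropped only after they complete.
        if has_error or is_slow:
            kept.append(trace_id)
    return kept
```

The cost of this approach is the buffer: the collector has to hold every span until the trace finishes, which is why tail-based sampling needs more memory than a coin flip at request start.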

Observability Stack: Collect, Store, Query, Alert
Diagram: applications emit logs, metrics, and traces through the OTel SDK → OTel Collector (buffer, transform, route) → Loki (logs), Prometheus (metrics), Tempo (traces) → Grafana (dashboards, alerts, correlation) → alerts sent to PagerDuty.

Observability is not about collecting more data; it is about asking questions you didn't anticipate. Monitoring alerts on known failure modes. Observability lets you investigate unknown unknowns: the failures you never predicted. That requires correlated logs + metrics + traces with a unified query layer.

Chapter 4: Summary
  • Logs: discrete events. Structured JSON, correlation IDs. "What happened."
  • Structured logging: JSON with fixed fields (timestamp, level, service, trace_id). Required for queryable log aggregation. Never log PII; log user_id instead.
  • Metrics: numerical time-series. RED/USE methods. Cheap to alert on. "Is something wrong."
  • RED (Rate, Errors, Duration): user-facing service health. USE (Utilization, Saturation, Errors): resource health.
  • Traces: end-to-end request path. Latency breakdown. "Where is it slow."
  • Trace sampling: tail-based preferred. Captures all errors, discards healthy traces after completion.
  • OpenTelemetry: single SDK for all three. Vendor-neutral, future-proof instrumentation.
  • Stack: OTel SDK → Collector → Backends (Loki/Prometheus/Tempo) → Grafana → Alerts.
05
Chapter Five

SLOs, SLAs, SLIs & Error Budgets

Reliability as a Measured Commitment

Reliability without measurement is just hope. SLIs, SLOs, and SLAs turn reliability into a concrete, measurable engineering discipline. They answer three questions: What are we measuring? What target are we committing to? And what happens when we miss? The error budget framework then gives you a rational mechanism for deciding when to ship features versus when to focus on reliability, ending the eternal argument between product and engineering.

Choosing the Right SLIs

Not all metrics make good SLIs. A good SLI directly measures what the user experiences, not internal system health metrics.

Poor SLIs

  • CPU utilization of web servers: users don't experience CPU. A server at 95% CPU can still serve users happily.
  • Memory usage %: internal resource metric, not user experience.
  • Disk I/O: infrastructure signal, not user outcome.
  • Number of deploys per day: activity metric, not reliability measurement.

Good SLIs

  • Availability: (successful requests / total requests) over a rolling window. Measured at the load balancer, not inside the app.
  • Latency: % of requests completing under a threshold. Base the threshold on user research or benchmarks, not engineering intuition. Track p99, not just p50.
  • Error rate: % of requests returning 5xx only. 4xx responses are client errors and should not count against your SLO.
  • Freshness (data systems): % of time the most recently ingested data is within acceptable age.

SLI: Indicator

The measurement. A quantitative metric of service behavior as experienced by users.

Examples: % requests < 200ms, % requests returning 2xx, uptime percentage over 30 days.

SLO: Objective

The internal target. The threshold your team commits to maintain for an SLI.

Example: 99.9% of requests complete in < 200ms over a rolling 30-day window.

SLA: Agreement

The contract. External promise to customers with financial consequences if breached.

Always set the SLA looser than the SLO (for example, a 99.5% SLA against a 99.9% SLO) to leave margin. SLA breach = credits/refunds to customers.

Error Budget = (1 − SLO) × Time Window

Example: SLO = 99.9% over 30 days. Error budget = 0.1% × 43,200 minutes = 43.2 minutes of allowed downtime. Every incident consumes from this budget. When the budget is exhausted: feature freeze, only reliability work until the budget regenerates. This is the mechanism Google uses to balance innovation speed with system stability.
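The arithmetic is small enough to keep as a helper. This sketch reproduces the 30-day example and subtracts incident minutes from the budget; the function names are assumptions, and the incident durations in the usage note are the ones from the error budget diagram.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Error budget = (1 - SLO) x time window, expressed in minutes."""
    window_minutes = window_days * 24 * 60
    return (1 - slo) * window_minutes


def remaining_budget_minutes(
    slo: float, incident_minutes: list, window_days: int = 30
) -> float:
    """Budget left after subtracting the downtime of each incident in the window."""
    return error_budget_minutes(slo, window_days) - sum(incident_minutes)
```

For an SLO of 0.999 over 30 days this gives about 43.2 minutes, and after incidents of 15, 8, and 12 minutes about 8.2 minutes remain. Results are floating-point values, so compare them with a tolerance rather than for exact equality.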

Error Budget: The Release/Reliability Decision
Diagram: error budget of 43.2 minutes (0.1% of 30 days). Consumed so far: Incident 1 (15 min), Incident 2 (8 min), deploy rollback (12 min). Remaining: 8.2 min. Budget nearly exhausted → slow down releases, focus on reliability. Budget healthy → ship features faster, accept more risk.

100% reliability is the wrong target. It is infinitely expensive and prevents all change. Error budgets make the trade-off explicit: "We accept X minutes of failure per month in exchange for the ability to ship features at Y velocity." When budget is healthy, ship faster. When budget is burned, slow down. No arguments: the math decides.

Multi-Window Burn Rate Alerting

A single SLO threshold creates alert fatigue. A burn rate that exhausts your budget in 30 minutes is a crisis; one that exhausts it in 30 days is a concern. Alert based on how fast the budget is burning:

  • Fast burn, page immediately: 5% of budget consumed in 1 hour. Catastrophic; requires immediate response. Wake someone up.
  • Slow burn, create a ticket: 10% of budget consumed in 6 hours. Concerning but not emergency-level.
  • Very slow burn, plan reliability work: budget on track to exhaust in 3 days. Schedule reliability work for next sprint.

This follows the multi-window, multi-burn-rate approach described in Google's SRE Workbook, whose own suggested starting thresholds differ slightly (page at 2% of budget in 1 hour or 5% in 6 hours; ticket at 10% in 3 days). It reduces both false-alarm paging and budget-exhaustion surprises.
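Thresholds like "5% of budget in 1 hour" are easier to compare once converted to a burn rate, where a rate of 1 means the budget lasts exactly the SLO window. The sketch below does that conversion; the function names are assumptions, and a 30-day SLO window is assumed as in the earlier example.

```python
def burn_rate(
    budget_fraction_consumed: float,
    alert_window_hours: float,
    slo_window_days: int = 30,
) -> float:
    """How many times faster than 'exactly on budget' the service is burning."""
    slo_window_hours = slo_window_days * 24
    return budget_fraction_consumed * slo_window_hours / alert_window_hours


def hours_to_exhaustion(rate: float, slo_window_days: int = 30) -> float:
    """Time until the whole budget is gone if the current burn rate continues."""
    return slo_window_days * 24 / rate
```

Under these assumptions, 5% of budget in 1 hour is a burn rate of about 36 (budget gone in roughly 20 hours), and 10% in 6 hours is a burn rate of about 12 (budget gone in roughly 60 hours). Expressing alerts as burn rates is what lets one rule apply to services with different SLO targets.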

Chapter 5: Summary
  • SLI: the measurement (latency, error rate, availability). Must measure user experience directly.
  • SLO: the target (99.9% of requests < 200ms). Internal commitment.
  • SLA: the contract (with financial consequences). Set below SLO for margin.
  • Good SLIs measure user experience: request success rate, latency percentile, error rate (5xx only). Not CPU or memory.
  • Error Budget = (1 − SLO) × time window. Exhausted = feature freeze.
  • Mechanism: budget healthy → ship fast. Budget burned → reliability focus only.
  • Burn rate alerting: fast burn (5% in 1hr) = page. Slow burn (10% in 6hr) = ticket. Avoids single-threshold alert fatigue.
06
Chapter Six

Incident Response Patterns

What You Do When Things Break at 3am

Incidents are not a sign of failure; they are inevitable in complex systems. The difference between teams that recover in minutes versus hours is not talent; it is preparation. Runbooks written when calm, severity definitions agreed upon before the fire, escalation paths that don't require remembering who is on call. Incident response is a practiced discipline, not an ad-hoc panic reaction.

Incident Lifecycle: Detect to Learn
Diagram: 1. Detect (alert fires) → 2. Triage (severity, scope) → 3. Mitigate (stop the bleeding) → 4. Resolve (root cause fix) → 5. Post-Mortem (learn + prevent). Mitigate first (rollback, failover, scale); root-causing can wait until users are unblocked. MTTD (detect) + MTTR (resolve) = total user impact time.
On-Call Design

An on-call rotation that pages engineers every night is not an on-call rotation; it is a burnout program that destroys teams. On-call health is a leading indicator of engineering retention. Design for sustainability from day one.


Unsustainable On-Call Patterns

  • Alerts that fire but require no human action: noise drowns signals
  • Ops team separate from dev team: no feedback loop, no incentive to build reliably
  • No handoff: incoming engineer starts from zero every shift
  • More than 50% of on-call time spent on repetitive manual tasks (toil)
  • Consequence: alert fatigue → real alerts ignored → longer MTTD

Sustainable On-Call Practices

  • Every alert must be actionable; if it resolves itself, downgrade or eliminate it
  • You build it, you run it: the team that builds the service owns its on-call rotation
  • Written shift handoff: what happened, what is in progress, what needs attention next
  • Measure toil: if repetitive manual work exceeds 50% of engineering time, automate before building features
  • On-call health reviewed in retrospectives alongside feature velocity

Runbooks & Playbooks

  • Written by humans at rest, for humans under stress
  • Step-by-step: what to check, what to run, who to call
  • Maintained alongside the service they document
  • Decision trees for common scenarios (DB full, OOM, spike)
  • Tested during game days, not discovered during incidents

Blameless Post-Mortems

  • Focus on systemic causes, not individual mistakes
  • Timeline: what happened, when, how detected
  • Root cause: why the system allowed this failure
  • Action items: concrete, assigned, deadlined
  • Shared openly: learning benefits the entire org

SEV-1: Critical

Customer-facing outage. All hands. War room. Exec informed. Target: resolve in < 30 min.


SEV-2: Major

Degraded service, partial impact. On-call + team lead. Target: resolve in < 2 hours.


SEV-3: Minor

No user impact, internal issue. On-call handles during business hours. Fix within sprint.

Game Days & Chaos Engineering

Game days are scheduled exercises where failures are deliberately introduced in a controlled environment to test whether runbooks, alerts, and escalation paths work as expected. Teams that only discover runbook gaps during real incidents pay the cost in user impact and team stress.

Common scenarios: simulate a database failover, kill 50% of service instances, block network traffic between two services, exhaust disk space on a worker node. Each scenario has a hypothesis (what do we expect?), an observation (what actually happened?), and action items for gaps found.

This is the same philosophy as chaos engineering in the availability domain: validate resilience assumptions before production does it for you.

The goal of incident response is not "never have incidents." It is to minimize MTTD (time to detect) and MTTR (time to resolve). Invest in faster detection (better alerts), faster mitigation (rollback automation), and faster learning (blameless post-mortems). Incidents become data points, not disasters.

Chapter 6: Summary
  • Lifecycle: Detect → Triage → Mitigate → Resolve → Post-Mortem. Mitigate first, root-cause later.
  • Sustainable on-call: every alert actionable, you build it you run it, toil under 50% of engineering time.
  • Runbooks: written calm, used under stress. Decision trees, not novels. Tested on game days.
  • Severity levels: SEV-1 (outage, all hands) → SEV-2 (degraded) → SEV-3 (internal, no user impact).
  • Post-mortems: blameless, systemic, action-item-driven. Shared openly for org-wide learning.
  • Game days: deliberate controlled failures to validate runbooks and alerts before real incidents expose the gaps.
  • Goal: minimize MTTD + MTTR. Incidents are inevitable; fast recovery is the skill.
Security & Observability at a Glance
01 · AuthN & AuthZ

Identity First, Permissions Second

  • AuthN = who. AuthZ = what. Separate concerns
  • JWT: stateless, verified at gateway via JWKS
  • OAuth 2.0: Auth Code + PKCE for user apps, Client Credentials for service-to-service
  • Token storage: refresh tokens in HttpOnly cookies, access tokens in memory only
  • RBAC: simple, auditable. ReBAC: relationship-based (OpenFGA).
02 · Zero Trust

Never Trust, Always Verify

  • No implicit trust for internal network traffic
  • mTLS between every service pair
  • Defense in depth: network → app → data layers
  • Least privilege: per-service roles, table-specific DB users
  • CSRF: SameSite=Strict. SQLi: parameterized queries. XSS: CSP headers
03 · Secrets Management

No Human Knows Production Secrets

  • Never in code, Git, images, or env files
  • Vault / AWS SM: encrypted, rotated, audited
  • Dynamic secrets: Vault generates short-lived creds on demand
  • Secret zero: cloud IAM instance profile IS the credential
04 · Observability

Logs + Metrics + Traces = Understanding

  • Structured JSON logs: timestamp, level, service, trace_id. Never log PII
  • RED (Rate, Errors, Duration): service health. USE: resource health
  • Tail-based trace sampling: captures all errors, discards healthy traces
  • OTel → Collector → Loki/Prometheus/Tempo → Grafana
05 · SLOs & Error Budgets

Reliability as Engineering Math

  • Good SLIs: request success rate, latency p99, 5xx error rate only
  • SLI = measurement. SLO = target. SLA = contract
  • Error budget = (1 − SLO) × time window
  • Burn rate alerting: fast burn = page, slow burn = ticket
06 · Incident Response

Minimize MTTD + MTTR

  • Detect → Triage → Mitigate → Resolve → Post-Mortem
  • Sustainable on-call: every alert actionable, you build it you run it
  • Game days: deliberate failures to validate runbooks before real incidents
  • Blameless post-mortems: systemic, action-item-driven