Amazon CloudWatch

Amazon CloudWatch: Observability & Monitoring

The unified observability platform for AWS. CloudWatch collects metrics, aggregates logs, triggers alarms, builds dashboards, and traces requests, giving you visibility into the health of the resources in your cloud infrastructure.

01
Chapter One · Management

What is Amazon CloudWatch?

Amazon CloudWatch is a managed monitoring and observability service that collects data from AWS resources: metrics (numbers over time), logs (text events), traces (request paths), and events (state changes). It is not one tool but a platform with five pillars that together give you system-wide visibility.

The Five Pillars of CloudWatch [Introductory]

Metrics

  • Time-series numeric data
  • CPU, memory, network, disk, latency
  • Auto-collected from 70+ AWS services
  • Custom metrics via API/agent
🚨

Alarms

  • Threshold-based alerting on metrics
  • Trigger SNS, Auto Scaling, Lambda
  • Three states: OK, ALARM, INSUFFICIENT_DATA
  • Composite alarms (AND/OR logic)

Logs

  • Centralised log aggregation
  • From Lambda, ECS, EC2, any source
  • Logs Insights for SQL-like queries
  • Metric filters → alarms on log patterns

Dashboards

  • Custom visualisation panels
  • Cross-account, cross-region
  • Real-time and historical views
  • Shareable via link or embed

Events / EventBridge

  • React to AWS state changes
  • Schedule cron-like rules
  • Route events to Lambda, SNS, SQS
  • (Now part of Amazon EventBridge)
🧠 Mental Model β€” The Hospital Monitoring System

CloudWatch is like the monitoring station in a hospital. Metrics = the vital signs displays (heart rate, blood pressure). Alarms = the beeping alerts when vitals go critical. Logs = the patient's medical chart (detailed history). Dashboards = the nurse's station screen with all patients at a glance. Every AWS resource is a "patient" being monitored continuously.

What CloudWatch Monitors Automatically [Core]

Service | Auto-Collected Metrics | Resolution
EC2 | CPU, Network In/Out, Disk I/O, Status Checks | 5 min (basic) / 1 min (detailed)
RDS | CPU, Connections, Read/Write IOPS, Free Storage | 1 min
Lambda | Invocations, Duration, Errors, Throttles, Concurrency | 1 min
ALB | Request count, Latency, 4xx/5xx errors, Active connections | 1 min
S3 | Bucket size, Object count, Request metrics (if enabled) | 1 day (size) / 1 min (requests)
SQS | Messages visible, Messages sent/received, Age of oldest | 1 min
DynamoDB | Read/Write capacity used, Throttled requests, Latency | 1 min
ECS/Fargate | CPU, Memory utilisation per task/service | 1 min
CloudWatch Architecture [Core]
[Diagram: CloudWatch, How Data Flows. Data sources (EC2 instances, Lambda functions, RDS databases, ALB / ECS / EKS, S3 / SQS / SNS, the CloudWatch Agent, and the PutMetricData API; 70+ AWS services) send metrics, logs, and custom data into CloudWatch (Metrics, Alarms, Logs, Dashboards, Logs Insights), which feeds SNS notifications, Auto Scaling actions, Lambda / SSM actions, and S3 export / Kinesis. Sources emit metrics/logs → CloudWatch stores & evaluates → triggers automated actions.]
AWS services automatically push metrics and logs into CloudWatch. Alarms evaluate thresholds and trigger responses.
CloudWatch vs CloudTrail vs Config [Core]

Service | What It Answers | Data Type | Example
CloudWatch | "How is my system performing RIGHT NOW?" | Metrics, logs, alarms | CPU at 85%, 500 errors spiking
CloudTrail | "WHO did WHAT and WHEN?" | API audit logs (who called what) | User X deleted S3 bucket at 3:42PM
AWS Config | "What changed in my infrastructure?" | Resource configuration history | Security group rule was modified
The Trio Works Together

In production, you use all three: CloudWatch tells you something is wrong (alarm fires). CloudTrail tells you who made the change that caused it. Config tells you exactly what the configuration looked like before and after. They're complementary, not competing.

Key Terminology [Core]

Term | Definition | Example
Namespace | Container for metrics from one service | AWS/EC2, AWS/Lambda, Custom/MyApp
Metric | Time-series of data points | CPUUtilization, Errors, Duration
Dimension | Key-value pair identifying a metric stream | InstanceId=i-1234, FunctionName=myFunc
Statistic | Aggregation over a period | Average, Sum, Max, Min, p99
Period | Time granularity for aggregation | 60s, 300s (5 min), 3600s (1 hour)
Alarm | Watches a metric, changes state when threshold hit | CPU > 80% for 5 minutes → ALARM
Log Group | Collection of log streams from one source | /aws/lambda/my-function
Log Stream | Individual sequence of log events | One Lambda instance's output
Exam Insight
  • "Monitor CPU/memory/network" → CloudWatch Metrics
  • "Alert when threshold breached" → CloudWatch Alarms
  • "Centralise application logs" → CloudWatch Logs
  • "Query logs with SQL-like syntax" → CloudWatch Logs Insights
  • "CloudWatch vs CloudTrail" → CW = performance/metrics. CT = API audit/who-did-what.
  • "Custom metric" → use PutMetricData API or CloudWatch Agent
  • "EC2 memory metric" → NOT available by default. Requires CloudWatch Agent (custom metric).
  • "Default EC2 monitoring interval" → 5 minutes (basic). Enable "detailed monitoring" for 1-minute.
Chapter 01: Key Takeaway

CloudWatch is five services in one: Metrics (numbers over time), Alarms (threshold alerts), Logs (centralised text), Dashboards (visualisation), and Events (state change reactions). It monitors 70+ AWS services automatically. CloudWatch answers "how is my system performing?", which is distinct from CloudTrail (who did what) and Config (what changed). EC2 memory and disk-space utilisation require the CloudWatch Agent; they are NOT collected by default.

02
Chapter Two · Management

CloudWatch Metrics: Deep Dive

Metrics are the foundation of CloudWatch. A metric is a time-ordered set of data points, each representing a measurement (CPU%, latency ms, error count) at a specific time. Understanding namespaces, dimensions, resolution, and retention is critical for both production and exams.

Metric Anatomy: Namespace + Name + Dimensions [Core]

Every metric is uniquely identified by three things:

[Diagram: Metric Identity, How CloudWatch Locates a Metric. Namespace (AWS/EC2, "Which service?") + Metric Name (CPUUtilization, "What measurement?") + Dimensions (InstanceId=i-abc123, "Which specific resource?") = one unique time series.]
Standard vs Custom Metrics [Core]

Aspect | Standard (Built-in) | Custom (You publish)
Source | AWS services automatically (EC2, RDS, Lambda…) | Your application via API/Agent
Cost | Free (included with the service) | $0.30/metric/month (first 10K)
Resolution | 1 min or 5 min (service-dependent) | Standard (60s) or High-res (1s)
Namespace | AWS/ServiceName | Custom/YourApp (you choose)
Examples | CPUUtilization, NetworkIn, Invocations | ActiveUsers, OrdersPerMinute, QueueDepth
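
As a rough illustration of publishing a custom metric, the sketch below builds one entry for a PutMetricData request and shows where the boto3 call would go. The namespace `Custom/MyApp`, the metric `OrdersPerMinute`, and the `Environment` dimension are illustrative names, and `publish` assumes boto3 is installed and AWS credentials are configured.

```python
from datetime import datetime, timezone


def build_metric_datum(name, value, unit="Count", dimensions=None, high_resolution=False):
    """Build one entry for the MetricData list of a PutMetricData request."""
    return {
        "MetricName": name,
        "Dimensions": [{"Name": k, "Value": v} for k, v in (dimensions or {}).items()],
        "Timestamp": datetime.now(timezone.utc),
        "Value": float(value),
        "Unit": unit,
        # 1 = high-resolution (1-second) storage, 60 = standard resolution
        "StorageResolution": 1 if high_resolution else 60,
    }


def publish(namespace, data):
    """Send the data points to CloudWatch (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so the builder works without AWS access

    boto3.client("cloudwatch").put_metric_data(Namespace=namespace, MetricData=data)


# Hypothetical metric: namespace + name + dimensions identify the time series.
datum = build_metric_datum("OrdersPerMinute", 42, dimensions={"Environment": "prod"})
# publish("Custom/MyApp", [datum])  # uncomment to send for real
```

Changing any dimension value (for example `Environment=staging`) would create a separate metric stream, which is also what is billed as a separate custom metric.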
The CloudWatch Agent [Core]

The CloudWatch Agent is a small daemon installed on EC2 (or on-prem servers) that collects metrics not available by default:


Metrics the Agent Collects

  • Memory utilisation (% used, available)
  • Disk space (% used, free bytes per mount)
  • Disk I/O (reads/writes per second)
  • Swap usage
  • Network (detailed): packets, TCP connections
  • Process-level: CPU/memory per process

Logs the Agent Collects

  • Application log files (custom paths)
  • System logs (/var/log/syslog)
  • Windows Event Logs
  • Apache/Nginx access logs
  • Any text file you configure
  • Pushes to CloudWatch Logs groups
⚠️ Critical Exam Fact

EC2 Memory and Disk metrics are NOT available by default. You MUST install the CloudWatch Agent to get memory/disk utilisation. This is one of the most commonly tested facts. CPU and Network are default; Memory and Disk are not.
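
A minimal sketch of an agent configuration that collects the two metrics called out above, plus one log file, is shown here as a Python dict serialised to JSON. The log group name `my-app/syslog` is a made-up example, and the exact set of measurements is something to adapt to the workload.

```python
import json

# Sketch of a CloudWatch Agent configuration file (written out as JSON).
agent_config = {
    "metrics": {
        "namespace": "CWAgent",  # the agent's default namespace
        "metrics_collected": {
            "mem": {"measurement": ["mem_used_percent"]},
            "disk": {"measurement": ["used_percent"], "resources": ["*"]},
        },
    },
    "logs": {
        "logs_collected": {
            "files": {
                "collect_list": [
                    {
                        "file_path": "/var/log/syslog",
                        "log_group_name": "my-app/syslog",  # hypothetical name
                        "log_stream_name": "{instance_id}",
                    }
                ]
            }
        }
    },
}

config_json = json.dumps(agent_config, indent=2)
print(config_json)
```

The printed JSON is what would be saved as the agent's configuration file on the instance; the instance role also needs permission to publish metrics and logs.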

Metric Resolution & Retention [Core]

Resolution | Period | Retention | Use Case
Basic Monitoring | 5 minutes | 15 months | Default for EC2 (free)
Detailed Monitoring | 1 minute | 15 months | EC2 with detailed enabled ($)
High-Resolution | 1 second | 3 hours (1s), then rolls up | Custom metrics via API ($$$)

Retention rollup: CloudWatch keeps data at decreasing granularity over time:

  • 1-second data → retained for 3 hours
  • 1-minute data → retained for 15 days
  • 5-minute data → retained for 63 days
  • 1-hour data → retained for 15 months (455 days)
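
The rollup schedule can be read as a lookup: given the age of a data point, which is the finest period still available? The helper below is a small sketch of that reading. It assumes the metric was published at high resolution in the first place; a metric published at 5-minute resolution never has finer data to begin with.

```python
from datetime import timedelta

# (finest period still available, how long that granularity is kept)
RETENTION_TIERS = [
    (timedelta(seconds=1), timedelta(hours=3)),
    (timedelta(minutes=1), timedelta(days=15)),
    (timedelta(minutes=5), timedelta(days=63)),
    (timedelta(hours=1), timedelta(days=455)),
]


def finest_period(age):
    """Return the smallest period still queryable for data of this age, or None if expired."""
    for period, kept_for in RETENTION_TIERS:
        if age <= kept_for:
            return period
    return None


print(finest_period(timedelta(hours=1)))   # still within the 1-second tier
print(finest_period(timedelta(days=30)))   # only 5-minute aggregates remain
print(finest_period(timedelta(days=500)))  # older than 455 days: expired
```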
Metric Math & Anomaly Detection [Deep]

Metric Math lets you combine multiple metrics with arithmetic expressions:

  • Error rate: m1/m2 * 100 (errors ÷ total requests × 100)
  • Cost per request: METRICS("cost") / METRICS("requests")
  • Can be used in alarms: alarm on calculated expressions, not just raw metrics
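
To make the error-rate expression concrete, here is a sketch of the parameters for an alarm on a Metric Math expression, shaped for boto3's `put_metric_alarm`. The alarm name, the load balancer identifier, and the choice of ALB metrics are assumptions for illustration only.

```python
def metric_query(query_id, metric_name, load_balancer):
    """One raw input metric; ReturnData=False means it only feeds the expression."""
    return {
        "Id": query_id,
        "MetricStat": {
            "Metric": {
                "Namespace": "AWS/ApplicationELB",
                "MetricName": metric_name,
                "Dimensions": [{"Name": "LoadBalancer", "Value": load_balancer}],
            },
            "Period": 300,
            "Stat": "Sum",
        },
        "ReturnData": False,
    }


def error_rate_alarm(load_balancer, threshold_percent):
    """Alarm when (5xx count / request count) * 100 exceeds the threshold."""
    return {
        "AlarmName": "alb-error-rate",  # hypothetical name
        "ComparisonOperator": "GreaterThanThreshold",
        "EvaluationPeriods": 1,
        "Threshold": threshold_percent,
        "Metrics": [
            metric_query("m1", "HTTPCode_Target_5XX_Count", load_balancer),
            metric_query("m2", "RequestCount", load_balancer),
            # Exactly one entry returns data: the expression the alarm watches.
            {"Id": "e1", "Expression": "m1 / m2 * 100",
             "Label": "ErrorRatePercent", "ReturnData": True},
        ],
    }


alarm = error_rate_alarm("app/my-alb/0123456789abcdef", 5.0)  # hypothetical ALB id
# import boto3; boto3.client("cloudwatch").put_metric_alarm(**alarm)
```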

Anomaly Detection applies ML to establish a "normal" band for a metric. When the metric breaches the band, it triggers an alarm, even without setting a fixed threshold. Useful for metrics with variable baselines (e.g. traffic patterns that differ by day of week).

🎯 Exam Insight
  • "Memory not available on EC2" β†’ install CloudWatch Agent
  • "1-second resolution" β†’ High-Resolution custom metrics (extra cost)
  • "Metric retention" β†’ 15 months for 1-hour aggregated data
  • "PutMetricData" β†’ API to publish custom metrics programmatically
  • "Detailed monitoring" β†’ EC2 at 1-minute intervals (vs 5-min basic)
  • "Namespace AWS/EC2 vs Custom" β†’ AWS/ prefix = built-in. Custom/ = yours.
  • "Aggregate across instances" β†’ use statistics (Average, Sum) without dimension filtering
  • "Alarm on calculated value" β†’ Metric Math expressions in alarms
Chapter 02 β€” Key Takeaway

Metrics are time-series data identified by Namespace + Name + Dimensions. EC2 gives you CPU/Network for free but NOT memory/disk β€” install the CloudWatch Agent for those. Standard resolution = 1min or 5min. High-resolution custom metrics go down to 1-second. Data is retained for 15 months at 1-hour granularity. Use Metric Math to combine metrics and alarm on calculated values.

03
Chapter Three · Management

CloudWatch Alarms: Automated Response

CloudWatch Alarms watch a metric (or metric math expression) and change state when a threshold is breached. When an alarm fires, it can send notifications, trigger Auto Scaling, stop/terminate EC2 instances, or invoke Lambda functions, enabling automated incident response.

Alarm States [Core]
[Diagram: CloudWatch Alarm State Machine. States: OK (within threshold), ALARM (threshold breached), INSUFFICIENT_DATA (not enough data to evaluate). Transitions: breached, recovered, no data.]
Alarms cycle between OK ↔ ALARM. INSUFFICIENT_DATA means no data points in the evaluation period.
Alarm Configuration: Key Parameters [Core]

Parameter | What It Controls | Example
Metric | Which metric to watch | AWS/EC2 CPUUtilization InstanceId=i-123
Statistic | How to aggregate within a period | Average, Sum, Maximum, p99
Period | Evaluation window per data point | 60 seconds, 300 seconds
Evaluation Periods | How many of the most recent periods are evaluated (N) | 3 (with Datapoints to Alarm also 3, fires after 3 consecutive bad periods)
Datapoints to Alarm | How many of those N periods must breach (M) | 3 out of 5 (alarm if 3 of last 5 breach)
Threshold | The boundary value | > 80 (fires when metric exceeds 80)
Comparison Operator | Greater, Less, GreaterOrEqual, etc. | GreaterThanThreshold
Actions | What to do when state changes | SNS topic, Auto Scaling policy, EC2 action
🧠 The "3 out of 5" Pattern

The Datapoints to Alarm parameter is powerful. Instead of requiring 3 consecutive breaches (which a single recovery resets), you can set "3 out of 5" β€” meaning the alarm fires if 3 of the last 5 evaluation periods breach the threshold. This avoids false positives from brief recoveries during an ongoing issue.
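
The difference between "3 consecutive" and "3 out of 5" is easy to see in a toy simulation. The function below is a simplified model, not CloudWatch's actual evaluation logic: it ignores missing data and the INSUFFICIENT_DATA state and only counts breaches in a sliding window.

```python
def alarm_states(datapoints, threshold, evaluation_periods, datapoints_to_alarm):
    """Return 'ALARM' or 'OK' after each datapoint using an M-out-of-N rule."""
    states = []
    for i in range(len(datapoints)):
        # The last N datapoints up to and including the current one.
        window = datapoints[max(0, i - evaluation_periods + 1): i + 1]
        breaches = sum(1 for value in window if value > threshold)
        states.append("ALARM" if breaches >= datapoints_to_alarm else "OK")
    return states


# CPU stays high except for one brief dip (60) in the third period.
cpu = [85, 90, 60, 88, 92, 95]

consecutive = alarm_states(cpu, 80, evaluation_periods=3, datapoints_to_alarm=3)
flexible = alarm_states(cpu, 80, evaluation_periods=5, datapoints_to_alarm=3)

print(consecutive)  # ['OK', 'OK', 'OK', 'OK', 'OK', 'ALARM']
print(flexible)     # ['OK', 'OK', 'OK', 'ALARM', 'ALARM', 'ALARM']
```

With the consecutive rule the dip delays the alarm until the sixth datapoint; the 3-out-of-5 rule fires at the fourth.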

Alarm Actions: What Can Alarms Trigger? [Core]

Notification Actions

  • SNS Topic → email, SMS, Slack (via Lambda), PagerDuty
  • Separate actions for ALARM, OK, and INSUFFICIENT_DATA states
  • Can notify on recovery (OK) too, not just alarm

Auto Scaling Actions

  • Trigger scale-out (add instances) on high CPU
  • Trigger scale-in (remove instances) on low CPU
  • Target Tracking uses alarms internally
  • Step Scaling = multiple alarm thresholds

EC2 Actions

  • Stop instance
  • Terminate instance
  • Reboot instance
  • Recover instance (move to new host; used with StatusCheckFailed_System)

Other Actions

  • Lambda function (custom remediation)
  • Systems Manager (run automation doc)
  • EventBridge (route to many targets)
  • Create OpsItem in OpsCenter
Composite Alarms [Deep]

Composite Alarms combine multiple alarms using AND/OR logic. This prevents alarm noise:

  • Problem: You have 10 alarms monitoring different aspects. A single incident triggers all 10 → notification storm.
  • Solution: Create a composite alarm: "Fire only when AlarmA AND AlarmB are both in ALARM state". One notification for the combined condition.
  • Composite alarms can suppress actions on child alarms (only the composite sends notifications)
  • Support AND, OR, NOT logic between child alarms
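
A composite alarm is defined by a rule expression over the states of its child alarms. The sketch below builds such a rule and the parameters for boto3's `put_composite_alarm`; the alarm names, account ID, and SNS topic ARN are placeholders.

```python
def all_in_alarm(*alarm_names):
    """Build a rule that is true only when every named child alarm is in ALARM."""
    return " AND ".join(f'ALARM("{name}")' for name in alarm_names)


composite = {
    "AlarmName": "checkout-degraded",  # hypothetical names throughout
    "AlarmRule": all_in_alarm("checkout-high-5xx", "checkout-high-latency"),
    "AlarmActions": ["arn:aws:sns:us-east-1:123456789012:on-call"],
}

print(composite["AlarmRule"])
# import boto3; boto3.client("cloudwatch").put_composite_alarm(**composite)
```

Leaving the child alarms without actions of their own means only the composite sends a notification.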
Common Alarm Patterns [Core]

Pattern | Metric | Threshold | Action
CPU Scale-Out | CPUUtilization (Average) | > 70% for 3 periods | ASG: add 2 instances
CPU Scale-In | CPUUtilization (Average) | < 30% for 10 periods | ASG: remove 1 instance
Error Spike | 5xx errors (Sum) | > 100 in 5 minutes | SNS: alert on-call team
Disk Full | DiskSpaceUsed (custom agent) | > 90% | SNS: ops alert + Lambda cleanup
EC2 System Failure | StatusCheckFailed_System | = 1 for 2 periods | EC2: Recover instance
SQS Dead Letter Build-up | ApproximateNumberOfMessagesVisible | > 0 for 1 period | SNS: investigate DLQ
Lambda Throttles | Throttles (Sum) | > 0 | SNS: review concurrency limits
Billing Alert | EstimatedCharges | > $100 | SNS: budget warning
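
As one worked instance of the table, the "EC2 System Failure" row might translate into `put_metric_alarm` parameters roughly as follows. The alarm name pattern and instance ID are made up; the period and statistic are reasonable choices rather than required values.

```python
def recover_alarm(instance_id, region):
    """Parameters for an alarm that recovers an instance after a failed system check."""
    return {
        "AlarmName": f"recover-{instance_id}",  # hypothetical naming scheme
        "Namespace": "AWS/EC2",
        "MetricName": "StatusCheckFailed_System",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Statistic": "Maximum",
        "Period": 60,
        "EvaluationPeriods": 2,
        "Threshold": 1,
        # There is no "equals" operator, so "= 1" is expressed as ">= 1".
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        # Built-in EC2 action ARN; no SNS topic or Lambda is needed for recovery.
        "AlarmActions": [f"arn:aws:automate:{region}:ec2:recover"],
    }


params = recover_alarm("i-0123456789abcdef0", "us-east-1")
# import boto3; boto3.client("cloudwatch").put_metric_alarm(**params)
```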
Alarm Pricing [Core]

Type | Cost | Notes
Standard alarm | $0.10/alarm/month | Standard resolution (60s+)
High-resolution alarm | $0.30/alarm/month | 10s or 30s period
Composite alarm | $0.50/alarm/month | Combines multiple child alarms
Anomaly detection alarm | $0.30/alarm/month | ML-based band detection
Free tier | 10 alarms free | Standard resolution only
🎯 Exam Insight
  • "Alarm triggers Auto Scaling" β†’ CloudWatch Alarm with ASG scaling policy action
  • "Alert the team when errors spike" β†’ Alarm β†’ SNS Topic β†’ email/Slack
  • "Recover EC2 from system failure" β†’ Alarm on StatusCheckFailed_System β†’ EC2 Recover action
  • "Reduce alarm noise" β†’ Composite Alarms (AND/OR logic)
  • "3 out of 5" β†’ Datapoints to Alarm = 3, Evaluation Periods = 5
  • "Alarm on billing" β†’ EstimatedCharges metric in us-east-1 (billing metrics only there)
  • "INSUFFICIENT_DATA state" β†’ metric hasn't reported data in the evaluation period (instance stopped, metric not emitting)
  • "Alarm can invoke Lambda" β†’ yes, directly as an alarm action (no EventBridge needed)
Chapter 03 β€” Key Takeaway

Alarms watch metrics and change state (OK β†’ ALARM β†’ INSUFFICIENT) when thresholds breach. They trigger SNS notifications, Auto Scaling actions, EC2 recovery, or Lambda functions. Use "M out of N" evaluation to avoid false positives. Use Composite Alarms to reduce notification noise by combining conditions with AND/OR logic. The most common pattern: CPU alarm triggers ASG scale-out.

04
Chapter Four · Management

CloudWatch Logs: Centralised Log Management

CloudWatch Logs is a fully managed log aggregation and analysis service. It ingests logs from Lambda, ECS, EC2 (via agent), API Gateway, VPC Flow Logs, Route 53 DNS queries, and any custom source, then lets you search, filter, create metrics from patterns, and export for long-term storage.

Log Hierarchy: Groups, Streams, Events [Core]
[Diagram: CloudWatch Logs Organisational Hierarchy. A Log Group (e.g. /aws/lambda/my-function) carries the retention policy, KMS encryption, metric filters, subscription filters, and access policy, and contains 1:N Log Streams (one stream per source instance, e.g. instance i-abc / Lambda exec 1). Each stream contains Log Events (timestamp + message), e.g. "2026-05-07T10:23:45Z ERROR NullPointer...", "2026-05-07T10:23:46Z INFO Request completed", "2026-05-07T10:23:47Z WARN Slow query 2.3s".]
Log Group → Log Streams (one per source) → Log Events (timestamp + text)

Concept | Description | Example
Log Group | Top-level container. Defines retention, encryption, access. | /aws/lambda/order-service
Log Stream | Sequence of events from one source instance. | One Lambda execution container, one EC2 instance
Log Event | Single log entry: timestamp + message string. | 2026-05-07T10:23:45Z ERROR NullPointerException…
Log Sources: What Sends Logs to CloudWatch? [Core]

Automatic (Built-in)

  • Lambda function output
  • API Gateway access logs
  • ECS/Fargate container stdout
  • RDS/Aurora error & slow-query
  • VPC Flow Logs
  • Route 53 DNS query logs

Agent-Based (EC2/On-Prem)

  • CloudWatch Agent on EC2
  • Custom application log files
  • System logs (/var/log/syslog)
  • Windows Event Logs
  • On-premises servers
  • Any text file on disk

SDK/API (Programmatic)

  • PutLogEvents API
  • AWS SDKs (Boto3, Java, etc.)
  • Fluent Bit / Fluentd plugins
  • Docker logging drivers
  • Any HTTP client
Log Retention [Core]

By default, logs are retained forever (never expire). You set retention per log group:

Retention Period | Use Case | Cost Impact
1 day – 7 days | Development/debugging only | Lowest storage cost
30 days | Standard operational logs | Moderate
90 days | Compliance (short-term) | Higher
1 year – 10 years | Audit/compliance requirements | High; consider S3 export
Never expire | Default (dangerous for cost!) | Grows indefinitely
⚠️ Cost Trap

The default retention is Never Expire. This means log storage costs grow forever. Always set a retention policy on every log group. For long-term archival, export to S3 (much cheaper) or use S3 lifecycle rules to move to Glacier.
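
Setting retention is a single API call per log group. The sketch below validates the requested value before building the parameters for boto3's `put_retention_policy`. The list of allowed values is an assumption taken from the API documentation at the time of writing and should be checked against the current reference; the log group name is an example.

```python
# Retention values the API is understood to accept (assumption; verify in the docs).
ALLOWED_RETENTION_DAYS = (
    1, 3, 5, 7, 14, 30, 60, 90, 120, 150, 180, 365, 400, 545, 731,
    1096, 1827, 2192, 2557, 2922, 3288, 3653,
)


def retention_request(log_group, days):
    """Build put_retention_policy parameters, rejecting unsupported values early."""
    if days not in ALLOWED_RETENTION_DAYS:
        raise ValueError(f"{days} is not a supported retention period")
    return {"logGroupName": log_group, "retentionInDays": days}


request = retention_request("/aws/lambda/order-service", 30)
# import boto3; boto3.client("logs").put_retention_policy(**request)
```

Running this over every log group returned by `describe_log_groups` is a common way to close the "never expire" gap across an account.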

Metric Filters: Turn Logs into Metrics [Core]

Metric Filters scan incoming log events for patterns and emit CloudWatch metrics when matches occur. This lets you alarm on log content without reading logs manually.

[Diagram: Metric Filter, Log Pattern → Metric → Alarm. Log events ("ERROR", "Exception", "timeout", "OOM") pass through a metric filter (pattern "ERROR", emit value 1 per match), producing a custom metric ErrorCount (Sum) that an alarm evaluates (ErrorCount > 10 → SNS).]
Pattern matches in logs → custom metric emitted → alarm evaluates → action fires

Common metric filter patterns:

  • "ERROR": any line containing ERROR
  • [ip, user, timestamp, request, status_code=5*, bytes]: space-delimited pattern matching 5xx codes
  • { $.statusCode = 500 }: JSON filter for structured logs
  • "OutOfMemoryError": Java OOM detection
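
The first pattern above could be wired into a metric with parameters like the following, shaped for boto3's `put_metric_filter`. The log group, filter name, and metric namespace are illustrative.

```python
def error_count_filter(log_group, pattern='"ERROR"'):
    """Parameters for a metric filter that emits 1 for every matching log event."""
    return {
        "logGroupName": log_group,
        "filterName": "error-count",  # hypothetical name
        "filterPattern": pattern,
        "metricTransformations": [
            {
                "metricName": "ErrorCount",
                "metricNamespace": "Custom/MyApp",
                "metricValue": "1",   # value emitted per match
                "defaultValue": 0.0,  # emitted when nothing matches in a period
            }
        ],
    }


metric_filter = error_count_filter("/aws/lambda/order-service")
json_filter = error_count_filter("/aws/lambda/order-service", "{ $.statusCode = 500 }")
# import boto3; boto3.client("logs").put_metric_filter(**metric_filter)
```

Setting `defaultValue` to 0 keeps the metric reporting during quiet periods, so an alarm on it does not drift into INSUFFICIENT_DATA.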
Subscription Filters: Real-Time Log Streaming [Deep]

Subscription Filters stream matching log events in real-time to a destination for processing:


Destinations

  • Kinesis Data Streams: real-time analytics
  • Kinesis Data Firehose: load to S3/Redshift/OpenSearch
  • Lambda: custom processing per event
  • OpenSearch (via Firehose): log search UI

Use Cases

  • Stream error logs to Slack via Lambda
  • Build real-time security dashboards
  • Feed logs into third-party SIEM tools
  • Cross-account log aggregation
  • Limit: 2 subscription filters per log group
CloudWatch Logs Insights [Core]

Logs Insights is a purpose-built query language for searching and analysing log data interactively. It's like SQL for logs: fast, serverless, pay-per-query.

Logs Insights Query Examples

fields @timestamp, @message | filter @message like /ERROR/ | sort @timestamp desc | limit 20

filter @message like /Exception/ | stats count(*) as errorCount by bin(5m)

fields @timestamp, @message | parse @message "user=* action=* status=*" as user, action, status | filter status = "FAILED"

Feature | Details
Query language | Purpose-built (fields, filter, stats, sort, parse, limit)
Performance | Scans GB of logs in seconds
Pricing | $0.005 per GB of data scanned
Multi-group | Query up to 50 log groups at once
Visualisation | Auto-generates time-series charts from stats queries
Saved queries | Save & share commonly used queries
Auto-discovery | Automatically detects fields in JSON logs
Logs Insights: Query Reference [Core]

Quick-reference for common Logs Insights query patterns:

All errors (last hour):
  fields @timestamp, @message | filter @message like /Error|Exception/ | sort @timestamp desc | limit 100
Error count by 5-min bucket:
  filter @message like /ERROR/ | stats count() by bin(5m)
Lambda cold starts:
  filter @message like /Init Duration/ | parse @message "Init Duration: * ms" as initDuration | sort initDuration desc
Slow Lambda (>1s):
  filter @type = "REPORT" | parse @message "Duration: * ms" as duration | filter duration > 1000 | sort duration desc
Top 10 IPs:
  parse @message "client-ip=*" as ip | stats count() as hits by ip | sort hits desc | limit 10
HTTP status breakdown:
  parse @message "status=*" as status | stats count() by status
P95 latency per 5m:
  filter @type = "REPORT" | parse @message "Duration: * ms" as d | stats pct(d, 95) by bin(5m)

Syntax cheat-sheet:

Command | Purpose
fields | Select fields to display
filter | Regex (like /pat/) or exact match
stats | Aggregation: count(), avg(), max(), pct(field, N)
sort | Order results (asc / desc)
limit | Max results returned
parse | Extract values from unstructured text
bin(interval) | Group by time bucket (5m, 1h, etc.)
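
Queries like these can also be run from code. Logs Insights is asynchronous: a query is started, then polled until it finishes. The sketch below separates building the request (pure Python) from running it with boto3's `start_query` and `get_query_results`; the log group name is an example, and `run_query` assumes boto3 and credentials are available.

```python
import time


def insights_request(log_groups, query, minutes_back=60, now=None):
    """Build start_query parameters for the last `minutes_back` minutes."""
    end = int(now if now is not None else time.time())
    return {
        "logGroupNames": list(log_groups),
        "startTime": end - minutes_back * 60,  # epoch seconds
        "endTime": end,
        "queryString": query,
    }


def run_query(request, poll_seconds=1.0):
    """Start the query and poll until it leaves the Scheduled/Running states."""
    import boto3  # imported lazily; only needed when actually querying

    logs = boto3.client("logs")
    query_id = logs.start_query(**request)["queryId"]
    while True:
        response = logs.get_query_results(queryId=query_id)
        if response["status"] not in ("Scheduled", "Running"):
            return response
        time.sleep(poll_seconds)


request = insights_request(
    ["/aws/lambda/order-service"],  # hypothetical log group
    "filter @message like /ERROR/ | stats count() by bin(5m)",
    now=1_700_000_000,  # fixed timestamp so the example is reproducible
)
# results = run_query(request)["results"]
```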
CloudWatch Logs Live Tail [Core]

Live Tail streams log events in real-time as they arrive, like tail -f for CloudWatch Logs. Available directly in the console.

Capabilities

  • Real-time filter by pattern (show only ERROR lines)
  • Highlight matching terms
  • Pause / resume streaming
  • View across multiple log groups simultaneously
  • Set time window (1m, 5m, 15m)

Use Cases

  • Debugging a deployment as it happens
  • Investigating live incidents in real-time
  • Monitoring Lambda during load tests
  • Tracing a request across services
  • Verifying log format changes
How to Access

CloudWatch → Logs → Log Groups → Select group(s) → Actions → Start Live Tail. Live Tail sessions can also be started programmatically through the StartLiveTail API. Best-effort delivery, so not for audit/compliance capture.

Log Export & Cross-Account [Deep]

Method | Latency | Use Case
S3 Export (CreateExportTask) | Up to 12 hours | Batch archival, compliance
Subscription Filter → Firehose → S3 | Near real-time (~60s) | Continuous export for analytics
Cross-account subscription | Real-time | Centralise logs in a security account
Export vs Subscription

S3 Export is a one-time batch job (can take 12 hours); use it for archival. Subscription Filter → Firehose → S3 is near real-time continuous streaming; use it when you need ongoing export. For exam questions about "real-time log export to S3", the answer is Subscription Filter + Firehose, NOT CreateExportTask.
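
For the continuous path, the subscription filter itself is a small piece of configuration. This sketch builds the parameters for boto3's `put_subscription_filter`; both ARNs and the filter name are placeholders, and the IAM role must allow CloudWatch Logs to write to the delivery stream.

```python
def firehose_subscription(log_group, firehose_arn, role_arn, pattern=""):
    """Parameters that stream matching events from a log group to Firehose."""
    return {
        "logGroupName": log_group,
        "filterName": "to-firehose",  # hypothetical name
        "filterPattern": pattern,     # an empty pattern matches every event
        "destinationArn": firehose_arn,
        "roleArn": role_arn,          # role CloudWatch Logs assumes to deliver
    }


subscription = firehose_subscription(
    "/aws/lambda/order-service",
    "arn:aws:firehose:us-east-1:123456789012:deliverystream/logs-to-s3",
    "arn:aws:iam::123456789012:role/cwl-to-firehose",
)
# import boto3; boto3.client("logs").put_subscription_filter(**subscription)
```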

CloudWatch Logs Pricing [Core]

Component | Cost (US East) | Notes
Ingestion | $0.50 / GB | All data written to CW Logs
Storage | $0.03 / GB / month | Retained log data
Logs Insights | $0.005 / GB scanned | Per query
Vended logs (VPC Flow, etc.) | $0.10 / GB | Cheaper ingestion for AWS-generated
🎯 Exam Insight
  • "Centralise application logs" β†’ CloudWatch Logs (via agent or SDK)
  • "Query logs with SQL-like syntax" β†’ CloudWatch Logs Insights
  • "Create alarm from log pattern" β†’ Metric Filter β†’ custom metric β†’ alarm
  • "Real-time log export to S3" β†’ Subscription Filter + Kinesis Firehose (NOT CreateExportTask)
  • "Default log retention" β†’ Never expire (must explicitly set retention policy)
  • "Stream logs to OpenSearch" β†’ Subscription Filter β†’ Firehose/Lambda β†’ OpenSearch
  • "Cross-account logs" β†’ Subscription filter to destination in central account
  • "Cheapest long-term log storage" β†’ Export to S3 β†’ lifecycle to Glacier
  • "Lambda logs not appearing" β†’ Lambda execution role missing logs:CreateLogGroup / logs:PutLogEvents permissions
  • "Limit per log group" β†’ 2 subscription filters maximum
Chapter 04 β€” Key Takeaway

CloudWatch Logs organises data as Log Groups β†’ Log Streams β†’ Log Events. Default retention is forever (set it!). Metric Filters turn log patterns into metrics you can alarm on. Subscription Filters stream logs in real-time to Kinesis/Lambda/OpenSearch. Logs Insights provides SQL-like querying at $0.005/GB scanned. Live Tail gives real-time console streaming for debugging. For real-time S3 export, use Subscription Filter + Firehose β€” not CreateExportTask (which is batch, up to 12h delay).

05
Chapter Five · Management

CloudWatch Dashboards: Visualisation & Operational Views

CloudWatch Dashboards provide customisable real-time visualisation of metrics, logs, and alarms in a single pane of glass. They support cross-account, cross-region widgets, letting operations teams monitor entire multi-account architectures from one screen.

Dashboard Capabilities [Core]

Widget Types

  • Line graph: time-series trends (CPU, latency)
  • Stacked area: cumulative values
  • Number: single current value (big font)
  • Gauge: percentage with colour bands
  • Bar chart: comparisons
  • Log table: Logs Insights query results
  • Alarm status: red/green per alarm
  • Text (Markdown): labels & documentation

Key Features

  • Cross-account: metrics from multiple AWS accounts
  • Cross-region: global view in one dashboard
  • Auto-refresh: 10s, 1m, 5m intervals
  • Time range control: relative or absolute
  • Full-screen mode: NOC/war-room displays
  • Dark mode: built-in for control rooms
  • Annotations: mark deployments/incidents
  • Variables: dynamic filtering (region, env)
Dashboard Architecture: How It Works [Core]
[Diagram: CloudWatch Dashboard Widget Data Flow. Data sources (Account A metrics, Account B metrics, Region us-east-1, Region eu-west-1, Logs Insights, alarm states) feed a dashboard with example widgets: Line Chart (CPU over time), Number (Active users: 1,247), Alarm Status (3 OK, 1 ALARM), Log Table (Recent errors), Gauge (Disk: 72%), Markdown (Runbook links). Viewers: console / URL sharing, NOC monitors with auto-refresh loop, API / embedded.]
Cross-account, cross-region data → unified dashboard → shared with teams
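
Dashboards are defined by a JSON body, so they can be generated and version-controlled. The sketch below builds a two-widget, two-region dashboard body and the parameters for boto3's `put_dashboard`; the dashboard name and instance IDs are invented for the example.

```python
import json


def metric_widget(title, namespace, metric, dim_name, dim_value, region, x=0, y=0):
    """One line-graph widget; the dashboard grid is 24 units wide."""
    return {
        "type": "metric",
        "x": x, "y": y, "width": 12, "height": 6,
        "properties": {
            "title": title,
            "metrics": [[namespace, metric, dim_name, dim_value]],
            "stat": "Average",
            "period": 300,
            "region": region,  # each widget can read from a different region
        },
    }


body = {
    "widgets": [
        metric_widget("CPU (us-east-1)", "AWS/EC2", "CPUUtilization",
                      "InstanceId", "i-0123456789abcdef0", "us-east-1", x=0),
        metric_widget("CPU (eu-west-1)", "AWS/EC2", "CPUUtilization",
                      "InstanceId", "i-0fedcba9876543210", "eu-west-1", x=12),
    ]
}

dashboard = {"DashboardName": "ops-overview", "DashboardBody": json.dumps(body)}
# import boto3; boto3.client("cloudwatch").put_dashboard(**dashboard)
```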
Sharing & Access [Core]

Sharing Method | Access Control | Use Case
IAM (console) | IAM policies on cloudwatch:GetDashboard | Internal teams with AWS access
Share via link | Public URL (no auth required) | NOC screens, external stakeholders
SSO-enabled sharing | Third-party auth (Cognito, SAML) | Partner teams without IAM accounts
CloudWatch cross-account | Organization sharing setup | Central operations account sees all
Automatic Dashboards [Introductory]

CloudWatch provides automatic dashboards out of the box, with zero configuration required:

  • Service-level dashboards: auto-generated for EC2, Lambda, RDS, etc. showing key metrics
  • Cross-service dashboard: aggregated health across all services in use
  • Account-level overview: alarms, anomalies, recent changes
  • Can be used as starting point → clone & customise for your needs
Dashboard Best Practices [Core]

Do

  • Create separate dashboards per environment (prod/staging)
  • Use annotations to mark deployments
  • Include alarm status widgets for at-a-glance health
  • Add Markdown widgets with runbook/escalation links
  • Use variables for dynamic filtering
  • Set appropriate auto-refresh (10s for real-time ops)
❌

Don't

  • Overload a single dashboard with 50+ widgets (slow rendering)
  • Rely solely on dashboards for alerting (use alarms)
  • Share public links with sensitive metric data
  • Forget cross-region widgets incur cross-region data transfer
  • Create dashboards without clear ownership
Dashboard Pricing [Core]

Item | Cost | Notes
First 3 dashboards | Free | Up to 50 metrics each
Additional dashboards | $3.00/dashboard/month | Each can have up to 500 widgets
API calls | Included | GetMetricData calls for rendering
Cross-account | No extra charge | Requires sharing setup
CloudWatch ServiceLens & Container Insights [Deep]

Beyond basic dashboards, CloudWatch offers advanced observability features:


ServiceLens

  • Unified view: metrics + traces + logs + alarms
  • Service map showing dependencies
  • Integrates with X-Ray traces
  • Click a node → see latency, errors, logs
  • End-to-end request flow visualisation

Container Insights

  • Pre-built dashboards for ECS, EKS, Fargate
  • Cluster/service/task/pod-level metrics
  • CPU, memory, network, disk per container
  • Automatic discovery of running containers
  • Performance log events for deep analysis
Application Insights & Contributor Insights [Deep]

Application Insights

  • ML-powered monitoring for .NET/Java/SQL Server workloads
  • Auto-detects problems and correlates metrics
  • Creates automated dashboards for app stacks
  • Reduces MTTR by highlighting root cause

Contributor Insights

  • Identify top-N contributors to a pattern
  • "Top 10 IPs generating 5xx errors"
  • "Top 5 Lambda functions by duration"
  • Use with VPC Flow Logs, CloudTrail, any log
  • Helps find noisy neighbours / hot keys
ServiceLens + X-Ray: End-to-End Tracing [Deep]

ServiceLens combines CloudWatch metrics + logs + X-Ray traces into a single application view. X-Ray traces requests across services (API Gateway → Lambda → DynamoDB) and ServiceLens overlays operational data on top.

[Diagram: ServiceLens + X-Ray Request Trace Flow. Client (browser/app) → API Gateway (tracing on) → Lambda (active tracing) → DynamoDB (SDK instrumented), with ServiceLens providing a combined view of service map + metrics + logs + traces.]
X-Ray traces each hop; ServiceLens overlays CloudWatch metrics, logs, and alarms per node.

Component | What It Provides | How to Enable
X-Ray SDK | Trace segments per service (latency, errors, metadata) | Add SDK to app code + IAM permissions
Lambda Active Tracing | Auto-instrumented traces for Lambda | Enable checkbox in function config
API Gateway Tracing | Trace from API entry point | Stage settings → Enable X-Ray
X-Ray Daemon | Collects segments from EC2-based apps | Install daemon on EC2 instances
ServiceLens | Unified view of traces + metrics + logs | No extra setup (uses existing CW + X-Ray)

Pricing: X-Ray costs $5 per 1M traces recorded and $0.50 per 1M traces retrieved. ServiceLens has no additional charge.

CloudWatch Lambda Insights [Deep]

Lambda Insights is a performance monitoring solution for Lambda functions. It uses a Lambda Layer to collect detailed runtime metrics not available in basic Lambda metrics, without code changes.

Metric | What It Measures
memory_utilization | Actual memory used vs allocated (basic only shows max allocated)
cpu_total_time | CPU utilisation; identifies CPU-bound functions
tmp_used | /tmp disk space; detect when approaching the limit (512 MB by default, configurable higher)
init_duration | Cold start duration, separate from execution time
rx_bytes / tx_bytes | Network I/O per invocation
total_network | Total network bandwidth consumed
How to Enable

1. Add the Lambda Insights layer ARN to your function. 2. Grant the execution role the required permissions, typically by attaching the CloudWatchLambdaInsightsExecutionRolePolicy managed policy. 3. Metrics appear in the LambdaInsights namespace. Integrates with ServiceLens for combined trace + metrics view.

When to use: Out-of-memory troubleshooting, cold start optimisation, CPU-bound identification, high-concurrency monitoring. Cost: Standard CloudWatch metrics pricing ($0.30/metric) + log ingestion for performance logs.

CloudWatch Synthetics (Canaries) [Core]

Synthetics canaries are configurable Node.js or Python scripts that run on a schedule to simulate user behaviour, proactively detecting issues before customers do.

Canary Blueprints

  • Heartbeat: simple HTTP GET, verify endpoint is up
  • API Canary: test authenticated API endpoints
  • UI Canary: login → add to cart → checkout (Selenium/Playwright)
  • Broken Link Checker: crawl site for 404s
  • Visual Regression: screenshot comparison for UI changes

What Canaries Capture

  • Success / failure status per run
  • Screenshots of failure state
  • HAR file (full network request waterfall)
  • Execution logs for debugging
  • CloudWatch metrics (SuccessPercent, Duration)
  • Step-level timing breakdown
[Diagram: Synthetics Canary Monitoring Flow. Canary script (every 5 min) → your endpoint (API / website) → CloudWatch metrics + S3 artifacts → alarm → SNS if SuccessPercent < 100.]
Canary simulates user → results → CloudWatch metrics → alarm if failure detected

Pricing: ~$0.0012 per canary run. A 5-minute canary runs 8,640 times in a 30-day month, so it costs ≈ $10.37/month (8,640 × $0.0012). Multi-region: Run canaries from different regions to detect regional outages.
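
The monthly figure follows directly from the per-run price and the schedule. A small helper makes the arithmetic explicit and easy to repeat for other intervals; it assumes a 30-day month and the per-run price quoted in the text, and ignores any related charges such as S3 storage for artifacts.

```python
def monthly_canary_cost(interval_minutes, price_per_run=0.0012, days=30):
    """Return (runs per month, cost in dollars) for a canary on a fixed schedule."""
    runs = days * 24 * 60 // interval_minutes
    return runs, round(runs * price_per_run, 2)


print(monthly_canary_cost(5))   # (8640, 10.37)
print(monthly_canary_cost(1))   # (43200, 51.84)
```

Running the same canary from several regions multiplies the cost by the number of regions.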

🎯 Exam Scenario

"Company needs to proactively detect if their checkout page is broken before customers report it" β†’ CloudWatch Synthetics canary that logs in, adds item to cart, and completes checkout flow on a 5-minute schedule.
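To connect a canary to an alert, an alarm can watch the SuccessPercent metric that Synthetics publishes. The sketch below only builds the parameter dictionary for the CloudWatch `put_metric_alarm` API; the canary name, alarm name, and SNS topic ARN are hypothetical, and treating missing data as breaching is one possible choice rather than a requirement.

```python
def canary_alarm_params(canary_name: str, sns_topic_arn: str) -> dict:
    """Parameters for an alarm that fires when any run in a 5-minute period fails."""
    return {
        "AlarmName": f"{canary_name}-success-below-100",  # hypothetical naming scheme
        "Namespace": "CloudWatchSynthetics",
        "MetricName": "SuccessPercent",
        "Dimensions": [{"Name": "CanaryName", "Value": canary_name}],
        "Statistic": "Average",
        "Period": 300,                 # matches the 5-minute schedule
        "EvaluationPeriods": 1,
        "Threshold": 100.0,
        "ComparisonOperator": "LessThanThreshold",
        "TreatMissingData": "breaching",  # assumption: a silent canary counts as a failure
        "AlarmActions": [sns_topic_arn],
    }
```

With boto3 installed and credentials configured, the dictionary could be passed as `boto3.client("cloudwatch").put_metric_alarm(**params)`.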

CloudWatch RUM (Real User Monitoring) Core

CloudWatch RUM captures actual user performance data via a lightweight JavaScript snippet added to your web application, measuring what real users experience rather than what synthetic tests see.

👤

Metrics Captured

  • Core Web Vitals: LCP, FID, CLS (SEO-critical)
  • Page load timing (Navigation Timing API)
  • JavaScript errors (uncaught exceptions)
  • XHR/Fetch request failures and latency
  • Session and user journey tracking
🔗

X-Ray Integration

  • Correlate front-end sessions with backend X-Ray traces
  • End-to-end: user click → API GW → Lambda → DynamoDB
  • Identify if latency is client-side or server-side
  • Segment by region, browser, device type

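How to enable: Create an app monitor, then copy-paste the generated RUM JavaScript snippet into your web app. No server-side changes needed. Pricing: $1 per 100,000 events; a single page view typically produces several events (page load, errors, HTTP requests).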

🎯 Exam Scenario

"Application seems slow for users in Australia but synthetic tests from us-east-1 pass fine" β†’ Enable CloudWatch RUM to see real user data segmented by geographic region β€” reveals regional latency issues invisible to synthetic tests.

CloudWatch Evidently: Feature Experiments Introductory

Evidently provides feature flags, A/B testing, and controlled rollouts with automatic metric tracking, all integrated into CloudWatch.

Capability | Use Case
Feature Flags | Toggle features on/off without redeployment
Gradual Rollout | 1% → 5% → 20% → 100% of users over time
A/B Testing | Compare conversion, revenue, or latency between variants
Overrides | Target specific users (beta testers, internal teams)
Auto-Rollback | If an alarm triggers (error rate rises), revert to the safe variation

Pricing: $0.01 per 1,000 feature evaluations. $0.12 per 1,000 analysed events (experiments). Integrates with CloudWatch Alarms for automatic rollback if metrics degrade.

🎯 Exam Scenario

"Team wants to test new checkout UI on 10% of users first, with automatic rollback if error rate increases" β†’ CloudWatch Evidently with percentage-based launch + CW Alarm trigger for rollback.
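A percentage-based launch relies on assigning each user to a stable bucket so the same user always sees the same variation. The sketch below illustrates that general technique with a salted hash; it is not Evidently's actual implementation, and the salt, variation names, and override mechanism are assumptions made for the example.

```python
import hashlib
from typing import Dict, Optional


def bucket(user_id: str, salt: str = "checkout-ui") -> int:
    """Map a user to a stable bucket in the range 0..9999."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 10_000


def variation_for(user_id: str,
                  rollout_percent: float,
                  overrides: Optional[Dict[str, str]] = None,
                  salt: str = "checkout-ui") -> str:
    """Return 'new' for roughly rollout_percent of users, 'control' otherwise."""
    if overrides and user_id in overrides:
        return overrides[user_id]          # e.g. beta testers pinned to a variation
    return "new" if bucket(user_id, salt) < rollout_percent * 100 else "control"
```

Raising `rollout_percent` from 10 to 20 keeps every user who already had the new variation on it, because a user's bucket never changes; setting it to 0 acts as the rollback.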

🎯 Exam Insight
  • "Single pane of glass across accounts" β†’ Cross-account CloudWatch Dashboard
  • "Monitor without AWS console access" β†’ Dashboard sharing via public link or SSO
  • "Dashboard cost" β†’ First 3 free, then $3/month each
  • "Container monitoring" β†’ CloudWatch Container Insights (ECS/EKS)
  • "Service map with traces" β†’ ServiceLens (CloudWatch + X-Ray)
  • "Top contributors / hot keys" β†’ Contributor Insights
  • "Auto-generated dashboards" β†’ CloudWatch Automatic Dashboards (zero config)
  • "Cross-region view" β†’ Dashboard widgets can pull from any region
  • "Proactive endpoint monitoring" β†’ CloudWatch Synthetics canaries
  • "Real user performance data" β†’ CloudWatch RUM (client-side JavaScript)
  • "Feature flags with metrics" β†’ CloudWatch Evidently
  • "Lambda memory/CPU deep metrics" β†’ Lambda Insights (layer-based)
  • "Trace request across microservices" β†’ X-Ray + ServiceLens
  • "Cold start duration metric" β†’ Lambda Insights init_duration
Chapter 05 β€” Key Takeaway

CloudWatch Dashboards provide customisable, cross-account, cross-region visualisation with widgets (line, number, gauge, alarm, logs, markdown). First 3 dashboards are free, then $3/month. Share via console, public link, or SSO. For containers use Container Insights; for service maps with traces use ServiceLens + X-Ray; for top-N analysis use Contributor Insights. Synthetics canaries proactively test endpoints; RUM captures real user performance; Evidently enables safe feature rollouts. Lambda Insights provides deep per-function metrics (memory, CPU, cold starts).

06
Chapter Six · Management

Architecture Patterns & Cost Optimisation

CloudWatch becomes most powerful when you combine its primitives (metrics, alarms, logs, dashboards) into cohesive observability patterns. This chapter covers real-world architectures and cost strategies to keep monitoring affordable at scale.

Common Observability Patterns Core
🔁

Auto-Healing Pattern

  • Custom metric → CloudWatch Alarm → SNS → Lambda
  • Lambda restarts unhealthy EC2 / ECS task
  • Composite alarm gates on multiple health signals
  • EventBridge rule as alternative action trigger
📈

Auto-Scaling Pattern

  • Target tracking → CloudWatch alarm (auto-created)
  • Step scaling → manual alarm thresholds
  • Custom metric (queue depth / p99 latency) drives scaling
  • Cooldown period prevents rapid back-to-back scaling actions
🔎

Centralised Logging Pattern

  • All accounts → CloudWatch Logs via unified agent
  • Cross-account subscription → central Kinesis / S3
  • Metric filters extract KPIs from structured logs
  • Logs Insights for ad-hoc root-cause investigation
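Subscription filters deliver log events as a gzip-compressed JSON document. When the destination is a Lambda function, that document arrives base64-encoded under `awslogs.data`. The sketch below decodes such an event using only the standard library; the function names are illustrative, and a Kinesis consumer would need a slightly different unwrapping step.

```python
import base64
import gzip
import json
from typing import List, Tuple


def decode_subscription_event(event: dict) -> dict:
    """Decode the payload CloudWatch Logs sends to a Lambda subscription target."""
    compressed = base64.b64decode(event["awslogs"]["data"])
    return json.loads(gzip.decompress(compressed))


def extract_messages(event: dict) -> List[Tuple[str, str]]:
    """Return (log group, message) pairs, skipping control messages."""
    payload = decode_subscription_event(event)
    if payload.get("messageType") != "DATA_MESSAGE":
        return []
    return [(payload["logGroup"], e["message"]) for e in payload["logEvents"]]
```

A central-account function could forward the extracted messages to S3 or a search cluster, keyed by the `owner` account ID that the payload also carries.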
🛡️

Security Monitoring Pattern

  • CloudTrail → CloudWatch Logs → metric filters
  • Detect root login, IAM changes, SG modifications
  • Alarm → SNS → security team / incident workflow
  • Pair with GuardDuty for ML-based threat detection
Auto-Healing Pattern: Event Flow
Unhealthy EC2 (custom metric) → CW Alarm (ALARM state) → SNS topic (triggers) → remediation Lambda (restart / replace instance)
Unhealthy → Alarm → Notify → Remediate → Healthy: a fully automated recovery loop
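The remediation step in this flow is a Lambda function that receives the alarm notification wrapped in an SNS event. The sketch below extracts instance IDs from the alarm's dimensions and hands them to a remediation callable. The `reboot` parameter is an assumption introduced so the logic can run without AWS access; a deployed function would bind it to a real EC2 call.

```python
import json
from typing import Callable, List, Optional


def instance_ids_from_alarm(sns_event: dict) -> List[str]:
    """Collect InstanceId dimensions from alarm notifications that are in ALARM state."""
    ids: List[str] = []
    for record in sns_event.get("Records", []):
        alarm = json.loads(record["Sns"]["Message"])
        if alarm.get("NewStateValue") != "ALARM":
            continue  # ignore OK / INSUFFICIENT_DATA transitions
        for dim in alarm.get("Trigger", {}).get("Dimensions", []):
            if dim.get("name") == "InstanceId":
                ids.append(dim["value"])
    return ids


def handler(event: dict, context=None,
            reboot: Optional[Callable[[List[str]], None]] = None) -> dict:
    """Lambda-style entry point; `reboot` is injected for illustration and testing."""
    ids = instance_ids_from_alarm(event)
    if ids and reboot is not None:
        reboot(ids)
    return {"remediated": ids}
```

In a real deployment, `reboot` could wrap `boto3.client("ec2").reboot_instances(InstanceIds=ids)`, and the function's role would need the matching EC2 permission.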
Multi-Account & Multi-Region Observability In-Depth
Cross-Account Observability Architecture (OAM)
Source accounts: Account A, Account B, Account C (each sends metrics + logs via CW Agent / SDK / API)
OAM link → sink
Monitoring account: cross-account dashboards, composite alarms, Logs Insights queries, anomaly detection models
Actions: SNS notifications, Lambda remediation, Auto Scaling triggers, OpsCenter incidents, SSM Automation
OAM = Observability Access Manager: shares metrics, logs, and traces across AWS accounts

Key components for multi-account observability:

  • Observability Access Manager (OAM): create a monitoring account sink, then link source accounts. Shared data: metrics, logs, X-Ray traces
  • Cross-account dashboards: single dashboard with widgets pulling from multiple accounts and regions
  • Cross-account alarms: alarm in monitoring account evaluates metrics from source accounts
  • Cross-account log queries: Logs Insights spans multiple account log groups simultaneously
💡 OAM Setup

In the monitoring account, create a sink. In each source account, create a link pointing to that sink. Choose which telemetry types to share (metrics, logs, traces). OAM is region-scoped, so configure it per region.
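The sink and link steps can be sketched as request parameters. The code below builds a sink policy that allows named source accounts to create links, and the parameters a source account would pass when creating its link. The account IDs, sink ARN, and label template are placeholders, and the policy shape should be checked against the OAM documentation before use.

```python
import json
from typing import Iterable, List

# OAM resource type identifiers for the three telemetry types named above.
TELEMETRY_TYPES = {
    "metrics": "AWS::CloudWatch::Metric",
    "logs": "AWS::Logs::LogGroup",
    "traces": "AWS::XRay::Trace",
}


def _resource_types(share: Iterable[str]) -> List[str]:
    return [TELEMETRY_TYPES[name] for name in share]


def sink_policy(source_account_ids: List[str],
                share: Iterable[str] = ("metrics", "logs", "traces")) -> str:
    """Policy for the monitoring-account sink (attached with PutSinkPolicy)."""
    return json.dumps({
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {"AWS": list(source_account_ids)},
            "Action": ["oam:CreateLink", "oam:UpdateLink"],
            "Resource": "*",
            "Condition": {"ForAllValues:StringEquals": {
                "oam:ResourceTypes": _resource_types(share)}},
        }],
    })


def link_params(sink_arn: str,
                share: Iterable[str] = ("metrics", "logs", "traces"),
                label_template: str = "$AccountName") -> dict:
    """Parameters for CreateLink, run in each source account and region."""
    return {
        "LabelTemplate": label_template,
        "ResourceTypes": _resource_types(share),
        "SinkIdentifier": sink_arn,
    }
```

With boto3, a source account could call `boto3.client("oam").create_link(**link_params(sink_arn))` once per region, since OAM configuration does not span regions.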

CloudWatch Pricing Model Core
Dimension | Free Tier | Paid Pricing
Custom Metrics | 10 metrics/month | $0.30/metric/month (first 10K)
Alarms | 10 standard alarms | $0.10/alarm/month (standard), $0.30 (high-res), $0.50 (composite)
Dashboards | 3 dashboards (50 metrics each) | $3.00/dashboard/month
Log Ingestion | 5 GB/month | $0.50/GB ingested
Log Storage | 5 GB (first month) | $0.03/GB/month archived
Logs Insights | none | $0.005/GB scanned
API Requests | 1M requests (GetMetricData not included) | $0.01/1,000 requests; GetMetricData: $0.01/1,000 metrics requested
Anomaly Detection | none | $0.30/alarm/month (standard resolution; billed as three alarm metrics)
Contributor Insights | 1 rule (first month) | $0.50/rule/month + $0.02 per million matching log events
Metric Streams | none | $0.003/1,000 metric updates
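A rough bill can be estimated by subtracting the free tier from usage and multiplying by the unit price. The sketch below covers only four rows of the table (custom metrics, standard alarms, dashboards, log ingestion) and ignores volume tiers and regional differences, so treat the output as an order-of-magnitude figure.

```python
# Free-tier allowances and unit prices copied from the table above (USD per month).
FREE_TIER = {"custom_metrics": 10, "standard_alarms": 10, "dashboards": 3, "log_ingest_gb": 5}
UNIT_PRICE = {"custom_metrics": 0.30, "standard_alarms": 0.10, "dashboards": 3.00, "log_ingest_gb": 0.50}


def estimate_monthly_bill(usage: dict) -> dict:
    """Return a per-item cost breakdown plus a 'total' entry."""
    lines = {}
    for item, price in UNIT_PRICE.items():
        billable = max(0, usage.get(item, 0) - FREE_TIER[item])
        lines[item] = round(billable * price, 2)
    lines["total"] = round(sum(lines.values()), 2)
    return lines


if __name__ == "__main__":
    print(estimate_monthly_bill(
        {"custom_metrics": 110, "standard_alarms": 60, "dashboards": 5, "log_ingest_gb": 105}))
```

In this example log ingestion is the largest line item, which is why the log-focused savings in the next section usually matter most.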
⚠️ Metric Resolution & Retention

High-resolution (1s) metrics are stored at full fidelity for 3 hours, then aggregated to 1-min for 15 days, 5-min for 63 days, 1-hour for 455 days. High-res alarms cost 5Γ— more ($0.50 vs $0.10). Only use 1-second resolution for latency-critical workloads.
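The retention schedule determines the finest period you can still query for a data point of a given age. A minimal sketch of that lookup, using the tiers listed above:

```python
from datetime import timedelta
from typing import Optional

# (maximum age, finest period in seconds) for each retention tier.
RETENTION_TIERS = [
    (timedelta(hours=3), 1),      # high-resolution data only
    (timedelta(days=15), 60),
    (timedelta(days=63), 300),
    (timedelta(days=455), 3600),
]


def finest_period_seconds(age: timedelta, high_resolution: bool = True) -> Optional[int]:
    """Finest queryable period for data of this age, or None if it has expired."""
    for max_age, period in RETENTION_TIERS:
        if period == 1 and not high_resolution:
            continue  # standard metrics never have sub-minute data
        if age <= max_age:
            return period
    return None
```

For example, a dashboard looking back 30 days can show 5-minute points at best, whatever resolution the metric was published at.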

Cost Optimisation Strategies In-Depth
💰

Reduce Log Costs

  • Set retention policies: default is never expire; set 7/14/30/90 days per group
  • Use Infrequent Access class: 50% cheaper ingestion for compliance-only logs
  • Filter at agent level: drop DEBUG/TRACE before ingestion
  • Archive to S3: export old logs via subscription filter or export task
  • Compress payloads: CW agent supports gzip
📉

Reduce Metric Costs

  • Embedded Metric Format (EMF): extract metrics from logs without PutMetricData API calls
  • Avoid unnecessary high-res: 1-second metrics mean far more PutMetricData calls, and high-resolution alarms cost more than standard ones
  • Control dimension cardinality: every unique metric name + dimension combination is billed as a separate custom metric
  • Remove stale alarms: each alarm incurs monthly cost
  • Use metric math: derive values instead of publishing more raw metrics
🔧

Operational Savings

  • Automatic dashboards: free, zero-config service dashboards
  • Anomaly detection: fewer static thresholds to maintain
  • Composite alarms: one alarm tree replaces many SNS subscriptions
  • CloudWatch Agent: replaces third-party agents (no licence cost)
  • Metric Streams → S3: low-cost long-term metric storage beyond CloudWatch's 15-month retention
⚠️

Common Cost Traps

  • Verbose logging: DEBUG in production can generate TB/month
  • Unlimited retention: forgotten log groups accumulate storage cost forever
  • High-res everywhere: 1-second resolution on non-critical metrics
  • API polling dashboards: excessive GetMetricData calls from auto-refresh
  • Cross-region transfer: streaming logs/metrics across regions adds data-transfer fees
Embedded Metric Format (EMF) In-Depth

EMF lets you embed custom metric definitions inside structured JSON log events. CloudWatch automatically extracts and publishes the metrics, so no PutMetricData API calls are needed. You pay for log ingestion plus the standard custom-metric charge for each metric that is extracted.

EMF: Log-to-Metric Pipeline
Application (emits JSON log with _aws.CloudWatchMetrics to stdout) → CW Logs (stores log event + detects EMF block automatically) → CW Metrics (custom metric created, no API calls needed) → Alarm (evaluate + alert as normal)
EMF = publish metrics via logs; it avoids PutMetricData calls when you already ingest logs

EMF is supported in Lambda (natively), ECS, EKS, and EC2 (via CloudWatch agent). It suits high-volume emitters where PutMetricData API call volume would be expensive, and high-cardinality context can stay in the log event as plain properties instead of becoming dimensions.
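An EMF log event is ordinary JSON with an `_aws` metadata block that names which top-level keys are metrics and which are dimensions. The sketch below builds one such line; the namespace, metric, and property names in the usage example are hypothetical.

```python
import json
import time
from typing import Dict, Optional


def emf_record(namespace: str, metric_name: str, value: float,
               unit: str = "Milliseconds",
               dimensions: Optional[Dict[str, str]] = None,
               properties: Optional[Dict[str, object]] = None,
               timestamp_ms: Optional[int] = None) -> str:
    """Return one EMF-formatted JSON log line for a single metric value."""
    dimensions = dimensions or {}
    record: Dict[str, object] = {}
    record.update(properties or {})   # searchable context, not billed as dimensions
    record.update(dimensions)         # dimension values live at the top level
    record[metric_name] = value       # so does the metric value itself
    record["_aws"] = {
        "Timestamp": timestamp_ms if timestamp_ms is not None else int(time.time() * 1000),
        "CloudWatchMetrics": [{
            "Namespace": namespace,
            "Dimensions": [list(dimensions.keys())],
            "Metrics": [{"Name": metric_name, "Unit": unit}],
        }],
    }
    return json.dumps(record)


if __name__ == "__main__":
    # In Lambda, printing the line to stdout is enough for it to reach CloudWatch Logs.
    print(emf_record("Shop/Checkout", "Latency", 123,
                     dimensions={"Service": "checkout"},
                     properties={"requestId": "abc-123"}))
```

Keeping `requestId` as a property rather than a dimension leaves it queryable in Logs Insights without creating one custom metric per request.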

CloudWatch Metric Streams In-Depth

Metric Streams provide near real-time, continuous streaming of CloudWatch metrics to a destination, eliminating the need to poll the API.

Metric Streams: Delivery Architecture
CW Metrics (all namespaces) → Metric Stream (filter + OTel format, < 2 min latency) → Kinesis Firehose (batch + deliver) → Amazon S3, Datadog / Splunk, New Relic / Dynatrace
Metric Streams deliver in OpenTelemetry 0.7 or JSON format with sub-2-minute latency
Use Case | How Metric Streams Helps
Long-term retention (>15 months) | Stream to S3 → lifecycle to Glacier for compliance
Third-party monitoring | Direct delivery to Datadog/Splunk with no custom polling code
Reduce API costs | Push-based delivery replaces expensive GetMetricData polling
Cross-account aggregation | Stream to central Kinesis → unified analytics
Self-managed Prometheus | Stream → Firehose → Prometheus remote-write endpoint

Key features: Filter by namespace/metric/dimension. Supports OpenTelemetry 0.7 and JSON output formats. Automatic batching and compression. Pricing: $0.003 per 1,000 metric updates streamed.
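With the JSON output format, objects delivered to S3 contain one JSON record per line, each carrying summary statistics rather than a single value. The sketch below parses such a blob and derives the average. The field names follow the documented JSON layout as best understood here, so verify them against a real delivery before relying on them.

```python
import json
from typing import Dict, Iterator


def parse_metric_stream(blob: str) -> Iterator[Dict[str, object]]:
    """Yield one simplified row per newline-delimited metric stream record."""
    for line in blob.splitlines():
        if not line.strip():
            continue  # tolerate blank lines between records
        rec = json.loads(line)
        stats = rec["value"]
        count = stats.get("count", 0)
        yield {
            "namespace": rec["namespace"],
            "metric": rec["metric_name"],
            "dimensions": rec.get("dimensions", {}),
            "timestamp_ms": rec["timestamp"],
            "average": stats["sum"] / count if count else None,
            "max": stats.get("max"),
        }
```

A long-term archive job could run this over each delivered object and write the rows to a columnar format for cheaper querying.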

🎯 Exam Scenario

"Company needs to retain EC2 CPU metrics for 5 years (compliance), but CloudWatch only retains for 15 months" β†’ Metric Streams β†’ S3 bucket β†’ S3 lifecycle rule (30 days β†’ Glacier) for cost-effective long-term retention.
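The lifecycle rule in this scenario can be expressed as an S3 lifecycle configuration. The sketch below builds that configuration as a dictionary; the rule ID and key prefix are hypothetical, and the five-year expiry is approximated as 1,825 days, which is an assumption about how the compliance period is counted.

```python
def metric_archive_lifecycle(prefix: str = "metrics/",
                             glacier_after_days: int = 30,
                             retain_years: int = 5) -> dict:
    """Lifecycle configuration: move to Glacier after 30 days, expire after ~5 years."""
    return {
        "Rules": [{
            "ID": "archive-metric-stream-output",   # hypothetical rule name
            "Status": "Enabled",
            "Filter": {"Prefix": prefix},
            "Transitions": [{"Days": glacier_after_days, "StorageClass": "GLACIER"}],
            "Expiration": {"Days": retain_years * 365},  # ignores leap days
        }]
    }
```

With boto3, this could be applied via `boto3.client("s3").put_bucket_lifecycle_configuration(Bucket=bucket_name, LifecycleConfiguration=metric_archive_lifecycle())`, where `bucket_name` is the Firehose destination bucket.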

CloudWatch vs Third-Party Observability Core
Factor | CloudWatch | Third-Party (Datadog, New Relic, etc.)
Integration | Native, zero-setup for AWS services | Requires agents / API keys / IAM roles
Multi-cloud | AWS only | AWS + Azure + GCP + on-prem
Log querying | Logs Insights (good, not Splunk-level) | Advanced analytics, ML-powered search
Pricing | Pay per metric / GB / alarm | Per host / per-GB / per-user
Alerting | Alarms + composite + anomaly detection | Advanced correlation, AIOps
APM / Tracing | X-Ray (separate service, integrated via ServiceLens) | Built-in APM with code-level profiling
Best for | AWS-native, cost-sensitive workloads | Multi-cloud, advanced analytics needs
CloudWatch Unified Agent Core

The CloudWatch Unified Agent replaces the legacy CloudWatch Logs agent. It collects both metrics and logs from EC2 instances and on-premises servers, and it can also ingest metrics sent via collectd and StatsD.

📊

Agent: Metrics

  • CPU (per-core), RAM, disk I/O, network, swap
  • Process-level: memory, CPU per process name
  • collectd / StatsD protocol support
  • Publishes to custom namespace (e.g., CWAgent)
📝

Agent: Logs

  • Tail any file path → CloudWatch Logs group
  • Multi-line pattern matching (e.g., Java stack traces)
  • Timestamp extraction from log lines
  • Supports Windows Event Log collection
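🔧 Agent Configuration

Generate the JSON config with the amazon-cloudwatch-agent-config-wizard (recommended) or write the file by hand, and optionally store it in SSM Parameter Store. Use amazon-cloudwatch-agent-ctl to load the config and start/stop the agent. Deploy at scale with SSM Run Command across fleets.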
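The agent's JSON config has an `agent` section for global settings, a `metrics` section, and a `logs` section. The sketch below generates a small config that collects memory and root-disk usage and tails one log file; the file path, log group name, timestamp format, and retention value are placeholders chosen for the example.

```python
import json


def agent_config(log_path: str, log_group: str, interval_seconds: int = 60) -> dict:
    """Build a minimal CloudWatch agent config for one Linux host."""
    return {
        "agent": {"metrics_collection_interval": interval_seconds},
        "metrics": {
            "namespace": "CWAgent",
            "metrics_collected": {
                "mem": {"measurement": ["mem_used_percent"]},
                "disk": {"measurement": ["used_percent"], "resources": ["/"]},
            },
        },
        "logs": {
            "logs_collected": {"files": {"collect_list": [{
                "file_path": log_path,
                "log_group_name": log_group,
                "log_stream_name": "{instance_id}",          # placeholder expanded by the agent
                "timestamp_format": "%Y-%m-%d %H:%M:%S",     # assumed log line format
                "multi_line_start_pattern": "{timestamp_format}",  # stack traces stay in one event
                "retention_in_days": 30,
            }]}},
        },
    }


if __name__ == "__main__":
    print(json.dumps(agent_config("/var/log/app/app.log", "/app/web"), indent=2))
```

The printed JSON could be saved to a file or to an SSM parameter and then loaded with amazon-cloudwatch-agent-ctl.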

🎯 Exam Insight
  • "Reduce CloudWatch costs" β†’ Set log retention, use Infrequent Access log class, filter verbose logs at agent level
  • "Cheapest way to monitor" β†’ Use basic (5-min) resolution, stay within free tier (10 metrics, 10 alarms, 3 dashboards, 5 GB logs)
  • "Cross-account observability" β†’ OAM β€” Observability Access Manager (sink in monitoring account, links from source accounts)
  • "Embedded metric format" β†’ Publish custom metrics from structured log data without PutMetricData API calls
  • "Log class selection" β†’ Standard for real-time querying; Infrequent Access for storage-only / compliance
  • "Unified agent vs Logs agent" β†’ Unified Agent collects both metrics + logs; Logs agent is legacy (logs only)
  • "Metric resolution trade-off" β†’ High-res (1s) costs more, stored 3 hours at full fidelity; standard (60s) stored 15 days
  • "Auto-healing architecture" β†’ CW Alarm β†’ SNS β†’ Lambda β†’ restart/replace resource
  • "Metric Streams" β†’ Near real-time metric delivery to S3/Firehose/third-party (Datadog, Splunk)
Chapter 06 β€” Key Takeaway

CloudWatch costs scale with data volume, so control log ingestion via retention policies, Infrequent Access class, and agent-level filtering. Use EMF to extract metrics from logs without PutMetricData API calls. For multi-account setups, use OAM to centralise observability into a monitoring account. Common patterns include auto-healing (alarm → SNS → Lambda), auto-scaling (custom metric → alarm → ASG), and centralised logging (subscription filters → Kinesis/S3). Choose CloudWatch for AWS-native monitoring; complement with third-party tools for multi-cloud or advanced APM.

CloudWatch: Complete Domain Summary

  • Metrics: namespaces, dimensions, custom metrics, high-resolution (1s), anomaly detection, metric math, embedded metric format, Metric Streams
  • Alarms: standard, composite, anomaly-based; actions via SNS, Auto Scaling, EC2, SSM; evaluation periods, datapoints-to-alarm, missing data
  • Logs: log groups, streams, unified agent, retention, metric filters, subscription filters, Logs Insights, Live Tail, Infrequent Access class
  • Dashboards: widgets, cross-account, cross-region, sharing, automatic dashboards, Container Insights, ServiceLens + X-Ray, Contributor Insights
  • Advanced Observability: Synthetics canaries (proactive endpoint testing), RUM (real user metrics), Evidently (feature flags/A/B), Lambda Insights (deep function metrics)
  • Patterns: auto-healing, auto-scaling, centralised logging, security monitoring, multi-account via OAM
  • Cost: free tier limits, log retention & IA class, EMF, Metric Streams to S3, avoid high-res everywhere, remove unused alarms