
Amazon CloudWatch: Observability & Monitoring

The unified observability platform for AWS. CloudWatch collects metrics, aggregates logs, triggers alarms, builds dashboards, and traces requests — giving you complete visibility into the health of every resource in your cloud infrastructure.

Chapter One · Management

What is Amazon CloudWatch?

Amazon CloudWatch is a managed monitoring and observability service that collects data from every AWS resource — metrics (numbers over time), logs (text events), traces (request paths), and events (state changes). It's not one tool — it's a platform with five pillars that together give you full system visibility.

The Five Pillars of CloudWatch Introductory

📊

Metrics

Time-series numeric data
CPU, memory, network, disk, latency
Auto-collected from 70+ AWS services
Custom metrics via API/agent

🚨

Alarms

Threshold-based alerting on metrics
Trigger SNS, Auto Scaling, Lambda
Three states: OK, ALARM, INSUFFICIENT_DATA
Composite alarms (AND/OR logic)

📋

Logs

Centralised log aggregation
From Lambda, ECS, EC2, any source
Log Insights for SQL-like queries
Metric filters → alarms on log patterns

📈

Dashboards

Custom visualisation panels
Cross-account, cross-region
Real-time and historical views
Shareable via link or embed

🔗

Events / EventBridge

React to AWS state changes
Schedule cron-like rules
Route events to Lambda, SNS, SQS
(Now part of Amazon EventBridge)

🧠 Mental Model — The Hospital Monitoring System

CloudWatch is like the monitoring station in a hospital. Metrics = the vital signs displays (heart rate, blood pressure). Alarms = the beeping alerts when vitals go critical. Logs = the patient's medical chart (detailed history). Dashboards = the nurse's station screen with all patients at a glance. Every AWS resource is a "patient" being monitored continuously.

What CloudWatch Monitors Automatically Core

Service	Auto-Collected Metrics	Resolution
EC2	CPU, Network In/Out, Disk I/O, Status Checks	5 min (basic) / 1 min (detailed)
RDS	CPU, Connections, Read/Write IOPS, Free Storage	1 min
Lambda	Invocations, Duration, Errors, Throttles, Concurrency	1 min
ALB	Request count, Latency, 4xx/5xx errors, Active connections	1 min
S3	Bucket size, Object count, Request metrics (if enabled)	1 day (size) / 1 min (requests)
SQS	Messages visible, Messages sent/received, Age of oldest	1 min
DynamoDB	Read/Write capacity used, Throttled requests, Latency	1 min
ECS/Fargate	CPU, Memory utilisation per task/service	1 min

CloudWatch Architecture Core

CloudWatch — How Data Flows

AWS services automatically push metrics and logs into CloudWatch. Alarms evaluate thresholds and trigger responses.

CloudWatch vs CloudTrail vs Config Core

Service	What It Answers	Data Type	Example
CloudWatch	"How is my system performing RIGHT NOW?"	Metrics, logs, alarms	CPU at 85%, 500 errors spiking
CloudTrail	"WHO did WHAT and WHEN?"	API audit logs (who called what)	User X deleted S3 bucket at 3:42PM
AWS Config	"What changed in my infrastructure?"	Resource configuration history	Security group rule was modified

💡 The Trio Works Together

In production, you use all three: CloudWatch tells you something is wrong (alarm fires). CloudTrail tells you who made the change that caused it. Config tells you exactly what the configuration looked like before and after. They're complementary — not competing.

Key Terminology Core

Term	Definition	Example
Namespace	Container for metrics from one service	`AWS/EC2`, `AWS/Lambda`, `Custom/MyApp`
Metric	Time-series of data points	`CPUUtilization`, `Errors`, `Duration`
Dimension	Key-value pair identifying a metric stream	`InstanceId=i-1234`, `FunctionName=myFunc`
Statistic	Aggregation over a period	Average, Sum, Max, Min, p99
Period	Time granularity for aggregation	60s, 300s (5 min), 3600s (1 hour)
Alarm	Watches a metric, changes state when threshold hit	CPU > 80% for 5 minutes → ALARM
Log Group	Collection of log streams from one source	`/aws/lambda/my-function`
Log Stream	Individual sequence of log events	One Lambda instance's output

🎯 Exam Insight

"Monitor CPU/memory/network" → CloudWatch Metrics
"Alert when threshold breached" → CloudWatch Alarms
"Centralise application logs" → CloudWatch Logs
"Query logs with SQL-like syntax" → CloudWatch Logs Insights
"CloudWatch vs CloudTrail" → CW = performance/metrics. CT = API audit/who-did-what.
"Custom metric" → use PutMetricData API or CloudWatch Agent
"EC2 memory metric" → NOT available by default. Requires CloudWatch Agent (custom metric).
"Default EC2 monitoring interval" → 5 minutes (basic). Enable "detailed monitoring" for 1-minute.

Chapter 01 — Key Takeaway

CloudWatch is five services in one: Metrics (numbers over time), Alarms (threshold alerts), Logs (centralised text), Dashboards (visualisation), and Events (state change reactions). It monitors 70+ AWS services automatically. CloudWatch answers "how is my system performing?" — distinct from CloudTrail (who did what) and Config (what changed). EC2 memory/disk requires the CloudWatch Agent — it's NOT collected by default.

Chapter Two · Management

CloudWatch Metrics — Deep Dive

Metrics are the foundation of CloudWatch. A metric is a time-ordered set of data points — each representing a measurement (CPU%, latency ms, error count) at a specific time. Understanding namespaces, dimensions, resolution, and retention is critical for both production and exams.

Metric Anatomy — Namespace + Name + Dimensions Core

Every metric is uniquely identified by three things: its namespace, its metric name, and its set of dimensions (zero or more key-value pairs).

Metric Identity — How CloudWatch Locates a Metric
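To make the identity rule concrete, here is a minimal sketch that models a metric key in plain Python. The `MetricId` class and `metric_id` helper are hypothetical names used only for illustration; they are not part of any AWS SDK.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MetricId:
    """Illustrative identity key: namespace + metric name + dimensions."""
    namespace: str
    name: str
    dimensions: frozenset  # set of (key, value) pairs, order does not matter


def metric_id(namespace, name, **dimensions):
    """Build a hashable key from the three identifying parts."""
    return MetricId(namespace, name, frozenset(dimensions.items()))


a = metric_id("AWS/EC2", "CPUUtilization", InstanceId="i-1234")
b = metric_id("AWS/EC2", "CPUUtilization", InstanceId="i-5678")
c = metric_id("AWS/EC2", "CPUUtilization", InstanceId="i-1234")

# Same namespace and name, different dimension value: two separate metrics.
print(a == b)  # False
# All three parts equal: the same metric.
print(a == c)  # True
```

The practical consequence is that changing or adding a dimension value creates a new metric stream rather than updating an existing one.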

Standard vs Custom Metrics Core

Aspect	Standard (Built-in)	Custom (You publish)
Source	AWS services automatically (EC2, RDS, Lambda…)	Your application via API/Agent
Cost	Free (included with the service)	$0.30/metric/month (first 10K)
Resolution	1 min or 5 min (service-dependent)	Standard (60s) or High-res (1s)
Namespace	`AWS/ServiceName`	`Custom/YourApp` (you choose)
Examples	CPUUtilization, NetworkIn, Invocations	ActiveUsers, OrdersPerMinute, QueueDepth
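As a rough sketch of what publishing a custom metric involves, the helper below builds the keyword arguments for a PutMetricData call without sending anything. The function name `build_put_metric_data` and the `Environment` dimension are assumptions for illustration; the dictionary keys follow the PutMetricData request shape.

```python
def build_put_metric_data(namespace, metric_name, value, unit="Count",
                          dimensions=None, high_resolution=False):
    """Build keyword arguments for a PutMetricData call (nothing is sent)."""
    datum = {
        "MetricName": metric_name,
        "Value": float(value),
        "Unit": unit,
        # 1 = high-resolution (1-second), 60 = standard resolution.
        "StorageResolution": 1 if high_resolution else 60,
    }
    if dimensions:
        datum["Dimensions"] = [{"Name": k, "Value": v}
                               for k, v in dimensions.items()]
    return {"Namespace": namespace, "MetricData": [datum]}


request = build_put_metric_data("Custom/MyApp", "OrdersPerMinute", 42,
                                dimensions={"Environment": "prod"})
print(request)

# With boto3 installed and credentials configured, this could be sent as:
#   boto3.client("cloudwatch").put_metric_data(**request)
```

Keeping the payload construction separate from the API call makes it easy to unit-test the metric shape before any data reaches CloudWatch.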

The CloudWatch Agent Core

The CloudWatch Agent is a small daemon installed on EC2 (or on-prem servers) that collects metrics not available by default:

📊

Metrics the Agent Collects

Memory utilisation (% used, available)
Disk space (% used, free bytes per mount)
Disk I/O (reads/writes per second)
Swap usage
Network (detailed) — packets, TCP connections
Process-level — CPU/memory per process

📋

Logs the Agent Collects

Application log files (custom paths)
System logs (/var/log/syslog)
Windows Event Logs
Apache/Nginx access logs
Any text file you configure
Pushes to CloudWatch Logs groups

⚠️ Critical Exam Fact

EC2 Memory and Disk metrics are NOT available by default. You MUST install the CloudWatch Agent to get memory/disk utilisation. This is one of the most commonly tested facts. CPU and Network are default; Memory and Disk are not.

Metric Resolution & Retention Core

Resolution	Period	Retention	Use Case
Basic Monitoring	5 minutes	63 days at 5-min granularity; 1-hour rollups up to 15 months	Default for EC2 (free)
Detailed Monitoring	1 minute	15 days at 1-min granularity; rollups up to 15 months	EC2 with detailed enabled ($)
High-Resolution	1 second	3 hours (1s), then rolls up	Custom metrics via API ($$$)

Retention rollup — CloudWatch keeps data at decreasing granularity over time:

1-second data → retained for 3 hours
1-minute data → retained for 15 days
5-minute data → retained for 63 days
1-hour data → retained for 15 months (455 days)
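The rollup schedule can be read as a lookup: given how old a data point is, what is the finest period you can still query? The sketch below encodes the four tiers listed above; the function name is hypothetical.

```python
from datetime import timedelta

# (period in seconds, how long data at that period is kept), from the tiers above.
RETENTION_TIERS = [
    (1, timedelta(hours=3)),       # only exists for high-resolution custom metrics
    (60, timedelta(days=15)),
    (300, timedelta(days=63)),
    (3600, timedelta(days=455)),
]


def finest_period_available(age):
    """Return the smallest retained period (seconds) for data of this age, or None if expired."""
    for period, kept_for in RETENTION_TIERS:
        if age <= kept_for:
            return period
    return None


print(finest_period_available(timedelta(hours=1)))   # 1
print(finest_period_available(timedelta(days=30)))   # 300
print(finest_period_available(timedelta(days=500)))  # None
```

In practice this means a graph of last month's traffic cannot show 1-minute detail, even if detailed monitoring was enabled at the time.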

Metric Math & Anomaly Detection Deep

Metric Math lets you combine multiple metrics with arithmetic expressions:

Error rate: m1/m2 * 100 (errors ÷ total requests × 100)
Cost per request: METRICS("cost") / METRICS("requests")
Can be used in alarms — alarm on calculated expressions, not just raw metrics
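The error-rate expression above can be sketched as a GetMetricData query list: two raw metrics that are fetched but hidden, plus one expression that is returned. This is an illustrative sketch, assuming Lambda's `Errors` and `Invocations` metrics and a hypothetical function name `myFunc`; the helper name is also an assumption.

```python
def error_rate_queries(function_name, period=300):
    """Build MetricDataQueries that return errors / invocations * 100."""
    def raw(metric_id, metric_name):
        return {
            "Id": metric_id,
            "MetricStat": {
                "Metric": {
                    "Namespace": "AWS/Lambda",
                    "MetricName": metric_name,
                    "Dimensions": [{"Name": "FunctionName", "Value": function_name}],
                },
                "Period": period,
                "Stat": "Sum",
            },
            "ReturnData": False,  # inputs only; do not return the raw series
        }

    return [
        raw("m1", "Errors"),
        raw("m2", "Invocations"),
        {"Id": "e1", "Expression": "m1/m2*100",
         "Label": "ErrorRatePercent", "ReturnData": True},
    ]


queries = error_rate_queries("myFunc")
for q in queries:
    print(q["Id"], q.get("Expression", q.get("MetricStat", {}).get("Metric", {}).get("MetricName")))
```

The same query list shape is what an alarm on a calculated value uses, with the expression marked as the series to evaluate.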

Anomaly Detection applies ML to establish a "normal" band for a metric. When the metric breaches the band, it triggers an alarm — even without setting a fixed threshold. Useful for metrics with variable baselines (e.g. traffic patterns that differ by day of week).

🎯 Exam Insight

"Memory not available on EC2" → install CloudWatch Agent
"1-second resolution" → High-Resolution custom metrics (extra cost)
"Metric retention" → 15 months for 1-hour aggregated data
"PutMetricData" → API to publish custom metrics programmatically
"Detailed monitoring" → EC2 at 1-minute intervals (vs 5-min basic)
"Namespace AWS/EC2 vs Custom" → AWS/ prefix = built-in. Custom/ = yours.
"Aggregate across instances" → use statistics (Average, Sum) without dimension filtering
"Alarm on calculated value" → Metric Math expressions in alarms

Chapter 02 — Key Takeaway

Metrics are time-series data identified by Namespace + Name + Dimensions. EC2 gives you CPU/Network for free but NOT memory/disk — install the CloudWatch Agent for those. Standard resolution = 1min or 5min. High-resolution custom metrics go down to 1-second. Data is retained for 15 months at 1-hour granularity. Use Metric Math to combine metrics and alarm on calculated values.

Chapter Three · Management

CloudWatch Alarms — Automated Response

CloudWatch Alarms watch a metric (or metric math expression) and change state when a threshold is breached. When an alarm fires, it can send notifications, trigger Auto Scaling, stop/terminate EC2 instances, or invoke Lambda functions — enabling fully automated incident response.

Alarm States Core

CloudWatch Alarm — State Machine

Alarm Configuration — Key Parameters Core

Parameter	What It Controls	Example
Metric	Which metric to watch	`AWS/EC2 CPUUtilization InstanceId=i-123`
Statistic	How to aggregate within a period	Average, Sum, Maximum, p99
Period	Evaluation window per data point	60 seconds, 300 seconds
Evaluation Periods	How many consecutive periods must breach	3 (alarm fires after 3 bad periods)
Datapoints to Alarm	M out of N periods must breach (flexible)	3 out of 5 (alarm if 3 of last 5 breach)
Threshold	The boundary value	> 80 (fires when metric exceeds 80)
Comparison Operator	Greater, Less, GreaterOrEqual, etc.	`GreaterThanThreshold`
Actions	What to do when state changes	SNS topic, Auto Scaling policy, EC2 action
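Putting the parameters in the table together, an alarm definition might look like the sketch below. It only builds the keyword arguments for a PutMetricAlarm call; the alarm name, instance ID, and SNS topic ARN are placeholders invented for this example.

```python
def build_cpu_alarm(instance_id, sns_topic_arn):
    """Build keyword arguments for a PutMetricAlarm call (nothing is sent)."""
    return {
        "AlarmName": f"high-cpu-{instance_id}",
        "Namespace": "AWS/EC2",
        "MetricName": "CPUUtilization",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Statistic": "Average",
        "Period": 60,                 # seconds per data point
        "EvaluationPeriods": 5,       # N: look at the last 5 periods
        "DatapointsToAlarm": 3,       # M: 3 of those 5 must breach
        "Threshold": 80.0,
        "ComparisonOperator": "GreaterThanThreshold",
        "AlarmActions": [sns_topic_arn],
    }


# Placeholder ARN, not a real topic.
alarm = build_cpu_alarm("i-1234", "arn:aws:sns:us-east-1:123456789012:ops-alerts")
print(alarm["AlarmName"], alarm["EvaluationPeriods"], alarm["DatapointsToAlarm"])

# With boto3 installed and credentials configured:
#   boto3.client("cloudwatch").put_metric_alarm(**alarm)
```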

🧠 The "3 out of 5" Pattern

The Datapoints to Alarm parameter is powerful. Instead of requiring 3 consecutive breaches (which a single recovery resets), you can set "3 out of 5" — meaning the alarm fires if 3 of the last 5 evaluation periods breach the threshold. This avoids false positives from brief recoveries during an ongoing issue.
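The difference between "3 consecutive" and "3 out of 5" is easiest to see by simulating both on the same series. This is a simplified model written for illustration: it ignores missing data and the INSUFFICIENT_DATA state, and the function name is hypothetical.

```python
def alarm_states(values, threshold, evaluation_periods, datapoints_to_alarm):
    """Return OK/ALARM per data point using an M-of-N sliding window."""
    states = []
    for i in range(len(values)):
        window = values[max(0, i - evaluation_periods + 1): i + 1]
        breaches = sum(1 for v in window if v > threshold)
        states.append("ALARM" if breaches >= datapoints_to_alarm else "OK")
    return states


# CPU dips briefly at the third point during an ongoing incident.
cpu = [85, 90, 60, 88, 70, 65, 60, 55]

consecutive = alarm_states(cpu, 80, evaluation_periods=3, datapoints_to_alarm=3)
m_of_n = alarm_states(cpu, 80, evaluation_periods=5, datapoints_to_alarm=3)

print(consecutive)  # never reaches ALARM: the dip resets the streak
print(m_of_n)       # reaches ALARM at the fourth and fifth points
```

With 3-of-3, the single dip to 60 means no window ever contains three breaches. With 3-of-5, the three breaches among the first four points are enough to fire.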

Alarm Actions — What Can Alarms Trigger? Core

🚨

Notification Actions

SNS Topic → email, SMS, Slack (via Lambda), PagerDuty
Separate actions for ALARM, OK, and INSUFFICIENT states
Can notify on recovery (OK) too, not just alarm

⚙️

Auto Scaling Actions

Trigger scale-out (add instances) on high CPU
Trigger scale-in (remove instances) on low CPU
Target Tracking uses alarms internally
Step Scaling = multiple alarm thresholds

🖥️

EC2 Actions

Stop instance (e.g. idle instance with sustained low CPU)
Terminate instance
Reboot instance
Recover instance (on StatusCheckFailed_System; moves to new host)

🔗

Other Actions

Lambda function (custom remediation)
Systems Manager (run automation doc)
EventBridge (route to many targets)
Create OpsItem in OpsCenter

Composite Alarms Deep

Composite Alarms combine multiple alarms using AND/OR logic. This prevents alarm noise:

Problem: You have 10 alarms monitoring different aspects. A single incident triggers all 10 → notification storm.
Solution: Create a composite alarm: "Fire only when AlarmA AND AlarmB are both in ALARM state". One notification for the combined condition.
Composite alarms can suppress actions on child alarms (only the composite sends notifications)
Support AND, OR, NOT logic between child alarms
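A composite alarm is defined by a rule string over its child alarms. The sketch below builds such a rule for the "AlarmA AND AlarmB" example; the helper name, the composite alarm name, and the SNS ARN are placeholders assumed for illustration.

```python
def composite_rule(operator, *alarm_names):
    """Join child alarms into a rule such as ALARM("A") AND ALARM("B")."""
    if operator not in ("AND", "OR"):
        raise ValueError("operator must be 'AND' or 'OR'")
    if not alarm_names:
        raise ValueError("at least one child alarm is required")
    return f" {operator} ".join(f'ALARM("{name}")' for name in alarm_names)


rule = composite_rule("AND", "AlarmA", "AlarmB")
print(rule)  # ALARM("AlarmA") AND ALARM("AlarmB")

# Keyword arguments for a PutCompositeAlarm call (placeholder ARN, nothing is sent).
composite = {
    "AlarmName": "service-degraded",
    "AlarmRule": rule,
    "AlarmActions": ["arn:aws:sns:us-east-1:123456789012:ops-alerts"],
}
```

Only the composite carries the notification action here, which is how one incident produces one page instead of ten.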

Common Alarm Patterns Core

Pattern	Metric	Threshold	Action
CPU Scale-Out	CPUUtilization (Average)	> 70% for 3 periods	ASG: add 2 instances
CPU Scale-In	CPUUtilization (Average)	< 30% for 10 periods	ASG: remove 1 instance
Error Spike	5xx errors (Sum)	> 100 in 5 minutes	SNS: alert on-call team
Disk Full	DiskSpaceUsed (custom agent)	> 90%	SNS: ops alert + Lambda cleanup
EC2 System Failure	StatusCheckFailed_System	= 1 for 2 periods	EC2: Recover instance
SQS Dead Letter Build-up	ApproximateNumberOfMessagesVisible	> 0 for 1 period	SNS: investigate DLQ
Lambda Throttles	Throttles (Sum)	> 0	SNS: review concurrency limits
Billing Alert	EstimatedCharges	> $100	SNS: budget warning

Alarm Pricing Core

Type	Cost	Notes
Standard alarm	$0.10/alarm/month	Standard resolution (60s+)
High-resolution alarm	$0.30/alarm/month	10s or 30s period
Composite alarm	$0.50/alarm/month	Combines multiple child alarms
Anomaly detection alarm	$0.30/alarm/month	ML-based band detection
Free tier	10 alarms free	Standard resolution only

🎯 Exam Insight

"Alarm triggers Auto Scaling" → CloudWatch Alarm with ASG scaling policy action
"Alert the team when errors spike" → Alarm → SNS Topic → email/Slack
"Recover EC2 from system failure" → Alarm on StatusCheckFailed_System → EC2 Recover action
"Reduce alarm noise" → Composite Alarms (AND/OR logic)
"3 out of 5" → Datapoints to Alarm = 3, Evaluation Periods = 5
"Alarm on billing" → EstimatedCharges metric in us-east-1 (billing metrics only there)
"INSUFFICIENT_DATA state" → metric hasn't reported data in the evaluation period (instance stopped, metric not emitting)
"Alarm can invoke Lambda" → yes, directly as an alarm action (no EventBridge needed)

Chapter 03 — Key Takeaway

Alarms watch metrics and move between three states (OK, ALARM, INSUFFICIENT_DATA) as thresholds are breached or data goes missing. They trigger SNS notifications, Auto Scaling actions, EC2 recovery, or Lambda functions. Use "M out of N" evaluation to avoid false positives. Use Composite Alarms to reduce notification noise by combining conditions with AND/OR logic. The most common pattern: CPU alarm triggers ASG scale-out.

Chapter Four · Management

CloudWatch Logs — Centralised Log Management

CloudWatch Logs is a fully managed log aggregation and analysis service. It ingests logs from Lambda, ECS, EC2 (via agent), API Gateway, VPC Flow Logs, Route 53 DNS queries, and any custom source — then lets you search, filter, create metrics from patterns, and export for long-term storage.

Log Hierarchy — Groups, Streams, Events Core

CloudWatch Logs — Organisational Hierarchy

Concept	Description	Example
Log Group	Top-level container. Defines retention, encryption, access.	`/aws/lambda/order-service`
Log Stream	Sequence of events from one source instance.	One Lambda execution container, one EC2 instance
Log Event	Single log entry: timestamp + message string.	`2026-05-07T10:23:45Z ERROR NullPointerException…`

Log Sources — What Sends Logs to CloudWatch? Core

⚡

Automatic (Built-in)

Lambda function output
API Gateway access logs
ECS/Fargate container stdout
RDS/Aurora error & slow-query
VPC Flow Logs
Route 53 DNS query logs

🔧

Agent-Based (EC2/On-Prem)

CloudWatch Agent on EC2
Custom application log files
System logs (/var/log/syslog)
Windows Event Logs
On-premises servers
Any text file on disk

📡

SDK/API (Programmatic)

PutLogEvents API
AWS SDKs (Boto3, Java, etc.)
Fluent Bit / Fluentd plugins
Docker logging drivers
Any HTTP client

Log Retention Core

By default, logs are retained forever (never expire). You set retention per log group:

Retention Period	Use Case	Cost Impact
1 day – 7 days	Development/debugging only	Lowest storage cost
30 days	Standard operational logs	Moderate
90 days	Compliance (short-term)	Higher
1 year – 10 years	Audit/compliance requirements	High — consider S3 export
Never expire	Default (dangerous for cost!)	Grows indefinitely

⚠️ Cost Trap

The default retention is Never Expire. This means log storage costs grow forever. Always set a retention policy on every log group. For long-term archival, export to S3 (much cheaper) or use S3 lifecycle rules to move to Glacier.
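One practical wrinkle when setting retention in code is that the API accepts only specific day counts, not arbitrary numbers. The sketch below rounds a requirement up to the next accepted value. The list of values is reproduced from memory of the PutRetentionPolicy reference, so treat it as an assumption and confirm it before relying on it; the helper name and log group are illustrative.

```python
# Accepted retentionInDays values (assumption: verify against the API reference).
ALLOWED_RETENTION_DAYS = [
    1, 3, 5, 7, 14, 30, 60, 90, 120, 150, 180, 365, 400, 545,
    731, 1096, 1827, 2192, 2557, 2922, 3288, 3653,
]


def nearest_allowed_retention(min_days):
    """Return the smallest accepted retention that covers min_days."""
    for days in ALLOWED_RETENTION_DAYS:
        if days >= min_days:
            return days
    raise ValueError("requirement exceeds the longest fixed retention period")


# A 45-day requirement has to be rounded up to 60.
request = {
    "logGroupName": "/aws/lambda/order-service",
    "retentionInDays": nearest_allowed_retention(45),
}
print(request)

# With boto3 installed and credentials configured:
#   boto3.client("logs").put_retention_policy(**request)
```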

Metric Filters — Turn Logs into Metrics Core

Metric Filters scan incoming log events for patterns and emit CloudWatch metrics when matches occur. This lets you alarm on log content without reading logs manually.

Metric Filter — Log Pattern → Metric → Alarm
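As an illustration of the log pattern to metric step, the sketch below builds the keyword arguments for a PutMetricFilter call that counts HTTP 500 responses in structured JSON logs. The helper name, filter name, metric name, and log group are assumptions chosen for the example.

```python
def build_metric_filter(log_group, filter_name, pattern, metric_name,
                        namespace="Custom/MyApp"):
    """Build keyword arguments for a PutMetricFilter call (nothing is sent)."""
    return {
        "logGroupName": log_group,
        "filterName": filter_name,
        "filterPattern": pattern,
        "metricTransformations": [{
            "metricName": metric_name,
            "metricNamespace": namespace,
            "metricValue": "1",   # add 1 to the metric per matching log event
            "defaultValue": 0.0,  # emit 0 when no events match in a period
        }],
    }


request = build_metric_filter(
    "/aws/lambda/order-service",
    "http-500-count",
    "{ $.statusCode = 500 }",
    "Http500Count",
)
print(request["filterPattern"])

# With boto3 installed and credentials configured:
#   boto3.client("logs").put_metric_filter(**request)
```

An alarm can then watch `Http500Count` in the `Custom/MyApp` namespace like any other metric.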

Common metric filter patterns:

"ERROR" — any line containing ERROR
"[ip, user, timestamp, request, status_code=5*, bytes]" — space-delimited pattern matching 5xx codes
{ $.statusCode = 500 } — JSON filter for structured logs
"OutOfMemoryError" — Java OOM detection

Subscription Filters — Real-Time Log Streaming Deep

Subscription Filters stream matching log events in real-time to a destination for processing:

🔄

Destinations

Kinesis Data Streams — real-time analytics
Kinesis Data Firehose — load to S3/Redshift/OpenSearch
Lambda — custom processing per event
OpenSearch (via Firehose) — log search UI

📋

Use Cases

Stream error logs to Slack via Lambda
Build real-time security dashboards
Feed logs into third-party SIEM tools
Cross-account log aggregation
Limit: 2 subscription filters per log group

CloudWatch Logs Insights Core

Logs Insights is a purpose-built query language for searching and analyzing log data interactively. It's like SQL for logs — fast, serverless, pay-per-query.

📝 Logs Insights Query Examples

fields @timestamp, @message | filter @message like /ERROR/ | sort @timestamp desc | limit 20

filter @message like /Exception/ | stats count(*) as errorCount by bin(5m)

fields @timestamp, @message | parse @message "user=* action=* status=*" as user, action, status | filter status = "FAILED"
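Queries like these can also be run programmatically. The sketch below assembles the keyword arguments for a StartQuery call over the last hour; note that `filter` comes before `stats`, because fields such as `@message` are no longer available once events have been aggregated. The helper name and log group are assumptions for illustration.

```python
import time


def build_start_query(log_groups, stages, hours=1, now=None):
    """Build keyword arguments for a StartQuery call (nothing is sent)."""
    end = int(time.time() if now is None else now)
    return {
        "logGroupNames": list(log_groups),
        "startTime": end - hours * 3600,  # epoch seconds
        "endTime": end,
        "queryString": " | ".join(stages),
    }


request = build_start_query(
    ["/aws/lambda/order-service"],
    ["filter @message like /Exception/",
     "stats count(*) as errorCount by bin(5m)"],
    now=1_700_000_000,  # fixed timestamp so the example is reproducible
)
print(request["queryString"])

# With boto3 installed and credentials configured, the query is asynchronous:
#   query_id = boto3.client("logs").start_query(**request)["queryId"]
#   then poll get_query_results(queryId=query_id) until it completes.
```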

Feature	Details
Query language	Purpose-built (fields, filter, stats, sort, parse, limit)
Performance	Scans GB of logs in seconds
Pricing	$0.005 per GB of data scanned
Multi-group	Query up to 50 log groups at once
Visualisation	Auto-generates time-series charts from stats queries
Saved queries	Save & share commonly used queries
Auto-discovery	Automatically detects fields in JSON logs

Logs Insights — Query Reference Core

Quick-reference for common Logs Insights query patterns:

What You Want	Query
All errors (last hour)	`fields @timestamp, @message | filter @message like /Error|Exception/ | sort @timestamp desc | limit 100`
Error count by 5-min bucket	`filter @message like /ERROR/ | stats count() by bin(5m)`
Lambda cold starts	`filter @message like /Init Duration/ | parse @message "Init Duration: * ms" as initDuration | sort initDuration desc`
Slow Lambda (>1s)	`filter @type = "REPORT" | parse @message "Duration: * ms" as duration | filter duration > 1000 | sort duration desc`
Top 10 IPs	`parse @message "client-ip=*" as ip | stats count() as requests by ip | sort requests desc | limit 10`
HTTP status breakdown	`parse @message "status=*" as status | stats count() by status`
P95 latency per 5m	`filter @type = "REPORT" | parse @message "Duration: * ms" as d | stats pct(d, 95) by bin(5m)`

Syntax cheat-sheet:

Command	Purpose
`fields`	Select fields to display
`filter`	Regex (`like /pat/`) or exact match
`stats`	Aggregation — `count()`, `avg()`, `max()`, `pct(field, N)`
`sort`	Order results (`asc` / `desc`)
`limit`	Max results returned
`parse`	Extract values from unstructured text
`bin(interval)`	Group by time bucket (5m, 1h, etc.)

CloudWatch Logs Live Tail Core

Live Tail streams log events in real-time as they arrive — like tail -f for CloudWatch Logs. Available directly in the console.

📡

Capabilities

Real-time filter by pattern (show only ERROR lines)
Highlight matching terms
Pause / resume streaming
View across multiple log groups simultaneously
Set time window (1m, 5m, 15m)

🛠️

Use Cases

Debugging a deployment as it happens
Investigating live incidents in real-time
Monitoring Lambda during load tests
Tracing a request across services
Verifying log format changes

📍 How to Access

CloudWatch → Logs → Log Groups → Select group(s) → Actions → Start Live Tail. Also available outside the console through the StartLiveTail API and the AWS CLI (aws logs start-live-tail). Best-effort delivery, so not suitable for audit/compliance capture.

Log Export & Cross-Account Deep

Method	Latency	Use Case
S3 Export (CreateExportTask)	Up to 12 hours	Batch archival, compliance
Subscription Filter → Firehose → S3	Near real-time (~60s)	Continuous export for analytics
Cross-account subscription	Real-time	Centralise logs in a security account

🧠 Export vs Subscription

S3 Export is a one-time batch job (can take 12 hours) — use for archival. Subscription Filter → Firehose → S3 is near real-time continuous streaming — use when you need ongoing export. For exam questions about "real-time log export to S3", the answer is Subscription Filter + Firehose, NOT CreateExportTask.

CloudWatch Logs Pricing Core

Component	Cost (US East)	Notes
Ingestion	$0.50 / GB	All data written to CW Logs
Storage	$0.03 / GB / month	Retained log data
Logs Insights	$0.005 / GB scanned	Per query
Vended logs (VPC Flow, etc.)	$0.10 / GB	Cheaper ingestion for AWS-generated

🎯 Exam Insight

"Centralise application logs" → CloudWatch Logs (via agent or SDK)
"Query logs with SQL-like syntax" → CloudWatch Logs Insights
"Create alarm from log pattern" → Metric Filter → custom metric → alarm
"Real-time log export to S3" → Subscription Filter + Kinesis Firehose (NOT CreateExportTask)
"Default log retention" → Never expire (must explicitly set retention policy)
"Stream logs to OpenSearch" → Subscription Filter → Firehose/Lambda → OpenSearch
"Cross-account logs" → Subscription filter to destination in central account
"Cheapest long-term log storage" → Export to S3 → lifecycle to Glacier
"Lambda logs not appearing" → Lambda execution role missing logs:CreateLogGroup / logs:PutLogEvents permissions
"Limit per log group" → 2 subscription filters maximum

Chapter 04 — Key Takeaway

CloudWatch Logs organises data as Log Groups → Log Streams → Log Events. Default retention is forever (set it!). Metric Filters turn log patterns into metrics you can alarm on. Subscription Filters stream logs in real-time to Kinesis/Lambda/OpenSearch. Logs Insights provides SQL-like querying at $0.005/GB scanned. Live Tail gives real-time console streaming for debugging. For real-time S3 export, use Subscription Filter + Firehose — not CreateExportTask (which is batch, up to 12h delay).

Chapter Five · Management

CloudWatch Dashboards — Visualisation & Operational Views

CloudWatch Dashboards provide customisable real-time visualisation of metrics, logs, and alarms in a single pane of glass. They support cross-account, cross-region widgets — letting operations teams monitor entire multi-account architectures from one screen.

Dashboard Capabilities Core

📊

Widget Types

Line graph — time-series trends (CPU, latency)
Stacked area — cumulative values
Number — single current value (big font)
Gauge — percentage with colour bands
Bar chart — comparisons
Log table — Logs Insights query results
Alarm status — red/green per alarm
Text (Markdown) — labels & documentation

🌐

Key Features

Cross-account — metrics from multiple AWS accounts
Cross-region — global view in one dashboard
Auto-refresh — 10s, 1m, 5m intervals
Time range control — relative or absolute
Full-screen mode — NOC/war-room displays
Dark mode — built-in for control rooms
Annotations — mark deployments/incidents
Variables — dynamic filtering (region, env)
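Dashboards are defined by a JSON body, which makes them easy to keep in version control. The sketch below builds a small body with a metric graph, an alarm status widget, and a Markdown text widget on the 24-column grid. The instance ID, alarm ARN, and widget titles are placeholders assumed for this example.

```python
import json


def build_dashboard_body(instance_id, alarm_arn, region="us-east-1"):
    """Return a dashboard body (JSON string) with three widgets."""
    widgets = [
        {"type": "metric", "x": 0, "y": 0, "width": 12, "height": 6,
         "properties": {
             "title": "EC2 CPU",
             "region": region,
             "stat": "Average",
             "period": 300,
             # [namespace, metric name, dimension name, dimension value]
             "metrics": [["AWS/EC2", "CPUUtilization", "InstanceId", instance_id]],
         }},
        {"type": "alarm", "x": 12, "y": 0, "width": 12, "height": 3,
         "properties": {"title": "Alarm status", "alarms": [alarm_arn]}},
        {"type": "text", "x": 12, "y": 3, "width": 12, "height": 3,
         "properties": {"markdown": "## Runbook\nEscalation steps go here."}},
    ]
    return json.dumps({"widgets": widgets})


body = build_dashboard_body(
    "i-1234",
    "arn:aws:cloudwatch:us-east-1:123456789012:alarm:high-cpu-i-1234",  # placeholder
)
print(len(json.loads(body)["widgets"]))  # 3

# With boto3 installed and credentials configured:
#   boto3.client("cloudwatch").put_dashboard(DashboardName="prod-overview",
#                                            DashboardBody=body)
```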

Dashboard Architecture — How It Works Core

CloudWatch Dashboard — Widget Data Flow

Sharing & Access Core

Sharing Method	Access Control	Use Case
IAM (console)	IAM policies on `cloudwatch:GetDashboard`	Internal teams with AWS access
Share via link	Public URL (no auth required)	NOC screens, external stakeholders
SSO-enabled sharing	Third-party auth (Cognito, SAML)	Partner teams without IAM accounts
CloudWatch cross-account	Organization sharing setup	Central operations account sees all

Automatic Dashboards Introductory

CloudWatch provides automatic dashboards out of the box — zero configuration required:

Service-level dashboards — auto-generated for EC2, Lambda, RDS, etc. showing key metrics
Cross-service dashboard — aggregated health across all services in use
Account-level overview — alarms, anomalies, recent changes
Can be used as starting point → clone & customise for your needs

Dashboard Best Practices Core

✅

Do

Create separate dashboards per environment (prod/staging)
Use annotations to mark deployments
Include alarm status widgets for at-a-glance health
Add Markdown widgets with runbook/escalation links
Use variables for dynamic filtering
Set appropriate auto-refresh (10s for real-time ops)

❌

Don't

Overload a single dashboard with 50+ widgets (slow rendering)
Rely solely on dashboards for alerting (use alarms)
Share public links with sensitive metric data
Forget that cross-region widgets incur cross-region data transfer
Create dashboards without clear ownership

Dashboard Pricing Core

Item	Cost	Notes
First 3 dashboards	Free	Up to 50 metrics each
Additional dashboards	$3.00/dashboard/month	Each can have up to 500 widgets
API calls	Included	GetMetricData calls for rendering
Cross-account	No extra charge	Requires sharing setup

CloudWatch ServiceLens & Container Insights Deep

Beyond basic dashboards, CloudWatch offers advanced observability features:

🔍

ServiceLens

Unified view: metrics + traces + logs + alarms
Service map showing dependencies
Integrates with X-Ray traces
Click a node → see latency, errors, logs
End-to-end request flow visualisation

🐳

Container Insights

Pre-built dashboards for ECS, EKS, Fargate
Cluster/service/task/pod-level metrics
CPU, memory, network, disk per container
Automatic discovery of running containers
Performance log events for deep analysis

Application Insights & Contributor Insights Deep

🤖

Application Insights

ML-powered monitoring for .NET/Java/SQL Server workloads
Auto-detects problems and correlates metrics
Creates automated dashboards for app stacks
Reduces MTTR by highlighting root cause

📊

Contributor Insights

Identify top-N contributors to a pattern
"Top 10 IPs generating 5xx errors"
"Top 5 Lambda functions by duration"
Use with VPC Flow Logs, CloudTrail, any log
Helps find noisy neighbours / hot keys

ServiceLens + X-Ray — End-to-End Tracing Deep

ServiceLens combines CloudWatch metrics + logs + X-Ray traces into a single application view. X-Ray traces requests across services (API Gateway → Lambda → DynamoDB) and ServiceLens overlays operational data on top.

ServiceLens + X-Ray — Request Trace Flow

Component	What It Provides	How to Enable
X-Ray SDK	Trace segments per service (latency, errors, metadata)	Add SDK to app code + IAM permissions
Lambda Active Tracing	Auto-instrumented traces for Lambda	Enable checkbox in function config
API Gateway Tracing	Trace from API entry point	Stage settings → Enable X-Ray
X-Ray Daemon	Collects segments from EC2-based apps	Install daemon on EC2 instances
ServiceLens	Unified view of traces + metrics + logs	No extra setup (uses existing CW + X-Ray)

Pricing: X-Ray — $5 per 1M traces recorded, $0.50 per 1M traces retrieved. ServiceLens has no additional charge.

CloudWatch Lambda Insights Deep

Lambda Insights is a performance monitoring solution for Lambda functions. It uses a Lambda Layer to collect detailed runtime metrics not available in basic Lambda metrics — without code changes.

Metric	What It Measures
`memory_utilization`	Actual memory used vs allocated (basic only shows max allocated)
`cpu_total_time`	CPU utilisation — identifies CPU-bound functions
`tmp_used`	/tmp disk space; detect when approaching the limit (512 MB by default)
`init_duration`	Cold start duration — separate from execution time
`rx_bytes` / `tx_bytes`	Network I/O per invocation
`total_network`	Total network bandwidth consumed

🔧 How to Enable

1. Add the Lambda Insights layer ARN to your function. 2. Grant IAM permissions by attaching the CloudWatchLambdaInsightsExecutionRolePolicy managed policy to the function's execution role. 3. Metrics appear in the LambdaInsights namespace. Integrates with ServiceLens for combined trace + metrics view.

When to use: Out-of-memory troubleshooting, cold start optimisation, CPU-bound identification, high-concurrency monitoring. Cost: Standard CloudWatch metrics pricing ($0.30/metric) + log ingestion for performance logs.

CloudWatch Synthetics (Canaries) Core

Synthetics canaries are configurable Node.js or Python scripts that run on a schedule to simulate user behaviour — proactively detecting issues before customers do.

🐤

Canary Blueprints

Heartbeat — simple HTTP GET, verify endpoint is up
API Canary — test authenticated API endpoints
UI Canary — login → add to cart → checkout (Selenium/Playwright)
Broken Link Checker — crawl site for 404s
Visual Regression — screenshot comparison for UI changes

📋

What Canaries Capture

Success / failure status per run
Screenshots of failure state
HAR file (full network request waterfall)
Execution logs for debugging
CloudWatch metrics (SuccessPercent, Duration)
Step-level timing breakdown

Synthetics Canary — Monitoring Flow

Pricing: ~$0.0012 per canary run. A canary running every 5 minutes makes about 8,640 runs in a 30-day month, or roughly $10.37/month. Multi-region: run canaries from different regions to detect regional outages.
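A quick way to sanity-check a canary budget is to multiply the number of runs by the per-run price. The sketch below does that for a 30-day month; the function name is hypothetical and the default price is the approximate per-run figure quoted in this section.

```python
def monthly_canary_cost(interval_minutes, price_per_run=0.0012, days=30):
    """Return (runs per month, estimated cost in USD) for one canary."""
    runs = days * 24 * 60 // interval_minutes
    return runs, runs * price_per_run


runs, cost = monthly_canary_cost(5)
print(runs, round(cost, 2))  # 8640 10.37

# The same canary run from three regions triples both numbers.
print(round(3 * cost, 2))  # 31.1
```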

🎯 Exam Scenario

"Company needs to proactively detect if their checkout page is broken before customers report it" → CloudWatch Synthetics canary that logs in, adds item to cart, and completes checkout flow on a 5-minute schedule.

CloudWatch RUM (Real User Monitoring) Core

CloudWatch RUM captures actual user performance data via a lightweight JavaScript snippet added to your web application — measuring what real users experience, not just synthetic tests.

👤

Metrics Captured

Core Web Vitals — LCP, FID, CLS (SEO-critical)
Page load timing (Navigation Timing API)
JavaScript errors (uncaught exceptions)
XHR/Fetch request failures and latency
Session and user journey tracking

🔗

X-Ray Integration

Correlate front-end sessions with backend X-Ray traces
End-to-end: user click → API GW → Lambda → DynamoDB
Identify if latency is client-side or server-side
Segment by region, browser, device type

How to enable: Copy-paste the RUM JavaScript snippet into your web app. No server-side changes needed. Pricing: $1 per 100,000 events (1 page view = 1 event).

🎯 Exam Scenario

"Application seems slow for users in Australia but synthetic tests from us-east-1 pass fine" → Enable CloudWatch RUM to see real user data segmented by geographic region — reveals regional latency issues invisible to synthetic tests.

CloudWatch Evidently — Feature Experiments Introductory

Evidently provides feature flags, A/B testing, and controlled rollouts with automatic metric tracking — all integrated into CloudWatch.

Capability	Use Case
Feature Flags	Toggle features on/off without redeployment
Gradual Rollout	1% → 5% → 20% → 100% of users over time
A/B Testing	Compare conversion, revenue, or latency between variants
Overrides	Target specific users (beta testers, internal teams)
Auto-Rollback	If alarm triggers (error rate ↑), revert to safe variation

Pricing: $0.01 per 1,000 feature evaluations. $0.12 per 1,000 analysed events (experiments). Integrates with CloudWatch Alarms for automatic rollback if metrics degrade.

🎯 Exam Scenario

"Team wants to test new checkout UI on 10% of users first, with automatic rollback if error rate increases" → CloudWatch Evidently with percentage-based launch + CW Alarm trigger for rollback.

🎯 Exam Insight

"Single pane of glass across accounts" → Cross-account CloudWatch Dashboard
"Monitor without AWS console access" → Dashboard sharing via public link or SSO
"Dashboard cost" → First 3 free, then $3/month each
"Container monitoring" → CloudWatch Container Insights (ECS/EKS)
"Service map with traces" → ServiceLens (CloudWatch + X-Ray)
"Top contributors / hot keys" → Contributor Insights
"Auto-generated dashboards" → CloudWatch Automatic Dashboards (zero config)
"Cross-region view" → Dashboard widgets can pull from any region
"Proactive endpoint monitoring" → CloudWatch Synthetics canaries
"Real user performance data" → CloudWatch RUM (client-side JavaScript)
"Feature flags with metrics" → CloudWatch Evidently
"Lambda memory/CPU deep metrics" → Lambda Insights (layer-based)
"Trace request across microservices" → X-Ray + ServiceLens
"Cold start duration metric" → Lambda Insights init_duration

Chapter 05 — Key Takeaway

CloudWatch Dashboards provide customisable, cross-account, cross-region visualisation with widgets (line, number, gauge, alarm, logs, markdown). First 3 dashboards are free, then $3/month. Share via console, public link, or SSO. For containers use Container Insights; for service maps with traces use ServiceLens + X-Ray; for top-N analysis use Contributor Insights. Synthetics canaries proactively test endpoints; RUM captures real user performance; Evidently enables safe feature rollouts. Lambda Insights provides deep per-function metrics (memory, CPU, cold starts).

Chapter Six · Management

Architecture Patterns & Cost Optimisation

CloudWatch becomes most powerful when you combine its primitives — metrics, alarms, logs, dashboards — into cohesive observability patterns. This chapter covers real-world architectures and cost strategies to keep monitoring affordable at scale.

Common Observability Patterns Core

🔁

Auto-Healing Pattern

Custom metric → CloudWatch Alarm → SNS → Lambda
Lambda restarts unhealthy EC2 / ECS task
Composite alarm gates on multiple health signals
EventBridge rule as alternative action trigger

📈

Auto-Scaling Pattern

Target tracking → CloudWatch alarm (auto-created)
Step scaling → manual alarm thresholds
Custom metric (queue depth / p99 latency) drives scaling
Cooldown period prevents rapid scale-out / scale-in oscillation

🔎

Centralised Logging Pattern

All accounts → CloudWatch Logs via unified agent
Cross-account subscription → central Kinesis / S3
Metric filters extract KPIs from structured logs
Logs Insights for ad-hoc root-cause investigation

🛡️

Security Monitoring Pattern

CloudTrail → CloudWatch Logs → metric filters
Detect root login, IAM changes, SG modifications
Alarm → SNS → security team / incident workflow
Pair with GuardDuty for ML-based threat detection
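The security monitoring flow above can be sketched as two request payloads: a metric filter that counts root-user activity in the CloudTrail log group, and an alarm on the resulting metric. This is a minimal sketch, not a definitive setup. The filter, metric, namespace, and alarm names are hypothetical, and the dictionaries are meant to be passed as keyword arguments to the boto3 `logs.put_metric_filter` and `cloudwatch.put_metric_alarm` calls.

```python
# Filter pattern matching CloudTrail events generated by the root user
# (excludes events invoked by AWS services on the account's behalf).
ROOT_USAGE_PATTERN = (
    '{ $.userIdentity.type = "Root" '
    '&& $.userIdentity.invokedBy NOT EXISTS '
    '&& $.eventType != "AwsServiceEvent" }'
)

# Hypothetical names chosen for illustration only.
NAMESPACE = "Security/CloudTrail"
METRIC_NAME = "RootAccountUsageCount"


def root_usage_metric_filter(log_group: str) -> dict:
    """Keyword arguments for logs.put_metric_filter."""
    return {
        "logGroupName": log_group,
        "filterName": "RootAccountUsage",
        "filterPattern": ROOT_USAGE_PATTERN,
        "metricTransformations": [{
            "metricName": METRIC_NAME,
            "metricNamespace": NAMESPACE,
            "metricValue": "1",   # emit 1 per matching log event
            "defaultValue": 0,    # emit 0 when nothing matches
        }],
    }


def root_usage_alarm(sns_topic_arn: str) -> dict:
    """Keyword arguments for cloudwatch.put_metric_alarm."""
    return {
        "AlarmName": "root-account-usage",
        "Namespace": NAMESPACE,
        "MetricName": METRIC_NAME,
        "Statistic": "Sum",
        "Period": 300,
        "EvaluationPeriods": 1,
        "Threshold": 1,
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        "TreatMissingData": "notBreaching",
        "AlarmActions": [sns_topic_arn],  # SNS topic for the security team
    }
```

The same shape works for the IAM-change and security-group detections; only the filter pattern and names change.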

Auto-Healing Pattern — Event Flow
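A minimal sketch of the Lambda end of this flow, assuming the alarm is defined on a metric with an `InstanceId` dimension and delivered through SNS. The handler parses the alarm notification and hands the affected instance IDs to a `reboot` callable. That callable is a hypothetical injection point so the logic can be tested without AWS access; in a deployed function it would wrap `boto3.client("ec2").reboot_instances(InstanceIds=ids)`.

```python
import json
from typing import Callable, List


def instances_in_alarm(event: dict) -> List[str]:
    """Return the InstanceId dimension values of SNS-delivered alarms in ALARM state."""
    ids: List[str] = []
    for record in event.get("Records", []):
        # The SNS message body is the alarm notification serialised as JSON.
        message = json.loads(record["Sns"]["Message"])
        if message.get("NewStateValue") != "ALARM":
            continue  # ignore OK / INSUFFICIENT_DATA transitions
        for dim in message.get("Trigger", {}).get("Dimensions", []):
            if dim.get("name") == "InstanceId":
                ids.append(dim["value"])
    return ids


def handler(event, context=None,
            reboot: Callable[[List[str]], None] = lambda ids: None):
    """Lambda entry point: restart every instance named by an alarm in ALARM state."""
    ids = instances_in_alarm(event)
    if ids:
        reboot(ids)  # assumption: wraps ec2.reboot_instances in production
    return {"rebooted": ids}
```

Filtering on `NewStateValue` matters because the same SNS topic usually receives the recovery notification too, and a restart on the OK transition would undo the healing.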

Multi-Account & Multi-Region Observability In-Depth

Cross-Account Observability Architecture (OAM)

Key components for multi-account observability:

Observability Access Manager (OAM) — create a monitoring account sink, then link source accounts. Shared data: metrics, logs, X-Ray traces
Cross-account dashboards — single dashboard with widgets pulling from multiple accounts and regions
Cross-account alarms — alarm in monitoring account evaluates metrics from source accounts
Cross-account log queries — Logs Insights spans multiple account log groups simultaneously

💡 OAM Setup

In the monitoring account, create a sink. In each source account, create a link pointing to that sink. Choose which telemetry types to share (metrics, logs, traces). OAM is region-scoped — configure per-region.
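The sink-and-link setup can be sketched as two payloads: a sink policy attached in the monitoring account (via the boto3 `oam` client's `put_sink_policy`), and a link request issued in each source account (`create_link`). This is an illustrative sketch; the account IDs, sink ARN, and the `$AccountName` label template are example values, and the calls must be repeated in every region you want to share.

```python
import json
from typing import Iterable, List

# Resource type identifiers OAM uses for each telemetry type.
RESOURCE_TYPES = {
    "metrics": "AWS::CloudWatch::Metric",
    "logs": "AWS::Logs::LogGroup",
    "traces": "AWS::XRay::Trace",
}


def _types(telemetry: Iterable[str]) -> List[str]:
    return [RESOURCE_TYPES[t] for t in telemetry]


def sink_policy(source_account_ids: Iterable[str], telemetry: Iterable[str]) -> str:
    """Policy document letting the given source accounts link to the sink."""
    return json.dumps({
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {"AWS": list(source_account_ids)},
            "Action": ["oam:CreateLink", "oam:UpdateLink"],
            "Resource": "*",
            # Restrict links to the telemetry types the sink owner allows.
            "Condition": {
                "ForAllValues:StringEquals": {"oam:ResourceTypes": _types(telemetry)}
            },
        }],
    })


def link_request(sink_arn: str, telemetry: Iterable[str]) -> dict:
    """Keyword arguments for oam.create_link, run in each source account."""
    return {
        "LabelTemplate": "$AccountName",  # how the source is labelled in the monitoring account
        "ResourceTypes": _types(telemetry),
        "SinkIdentifier": sink_arn,
    }
```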

CloudWatch Pricing Model Core

Dimension	Free Tier	Paid Pricing
Custom Metrics	10 metrics/month	$0.30/metric/month (first 10K)
Alarms	10 standard alarms	$0.10/alarm/month (standard), $0.30 (high-res)
Dashboards	3 dashboards (50 metrics each)	$3.00/dashboard/month
Log Ingestion	5 GB/month	$0.50/GB ingested
Log Storage	5 GB (first month)	$0.03/GB/month archived
Logs Insights	—	$0.005/GB scanned
API Requests	1M requests (GetMetricData not included)	$0.01/1,000 requests; GetMetricData billed at $0.01/1,000 metrics requested
Anomaly Detection	—	$0.30/metric/month (same as custom)
Contributor Insights	1 rule (first month)	$0.50/rule/month + $0.02 per 1M matching log events
Metric Streams	—	$0.003/1,000 metric updates
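To see how the table turns into a bill, here is a rough estimator covering five of the rows (custom metrics, standard alarms, dashboards, log ingestion, log storage). It is a simplified sketch: rates are the ones listed above, the first-10K custom metric tier is assumed, the logs free tier is applied to ingestion only, and the usage figures in the example are made up.

```python
# Free-tier allowances and per-unit monthly rates taken from the table above.
FREE = {"custom_metrics": 10, "standard_alarms": 10, "dashboards": 3, "log_ingest_gb": 5}
RATE = {
    "custom_metrics": 0.30,   # per metric (first 10K tier)
    "standard_alarms": 0.10,  # per alarm
    "dashboards": 3.00,       # per dashboard
    "log_ingest_gb": 0.50,    # per GB ingested
    "log_storage_gb": 0.03,   # per GB archived
}


def monthly_cost(usage: dict) -> dict:
    """Return a per-item cost breakdown plus a total, in USD."""
    lines = {}
    for item, rate in RATE.items():
        # Only usage above the free allowance is billable.
        billable = max(usage.get(item, 0) - FREE.get(item, 0), 0)
        lines[item] = round(billable * rate, 2)
    lines["total"] = round(sum(lines.values()), 2)
    return lines


# Hypothetical workload for illustration.
example = monthly_cost({
    "custom_metrics": 50,
    "standard_alarms": 25,
    "dashboards": 5,
    "log_ingest_gb": 100,
    "log_storage_gb": 200,
})
print(example)
```

In this example log ingestion accounts for well over half of the total, which is typical and is why the cost strategies later in the chapter focus on logs first.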

⚠️ Metric Resolution & Retention

High-resolution (1s) metrics are stored at full fidelity for 3 hours, then aggregated to 1-min for 15 days, 5-min for 63 days, and 1-hour for 455 days. High-resolution alarms cost 3× more than standard alarms ($0.30 vs $0.10 per alarm metric per month). Reserve 1-second resolution for latency-critical workloads.
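The retention schedule amounts to a lookup from data-point age to the finest period still queryable. A small sketch of that lookup, using the tiers stated above (the function name is illustrative):

```python
from datetime import timedelta
from typing import Optional

# (maximum age, finest period in seconds) in order of increasing age.
TIERS = [
    (timedelta(hours=3), 1),
    (timedelta(days=15), 60),
    (timedelta(days=63), 300),
    (timedelta(days=455), 3600),
]


def finest_period_seconds(age: timedelta, high_resolution: bool = True) -> Optional[int]:
    """Finest granularity available for a data point of the given age.

    Returns None once the data point is older than the 455-day retention limit.
    """
    for max_age, period in TIERS:
        if age <= max_age:
            # Standard-resolution metrics never had sub-minute data to begin with.
            if period == 1 and not high_resolution:
                return 60
            return period
    return None
```

For example, a dashboard looking back 30 days can show 5-minute points at best, whatever resolution the metric was published at.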

Cost Optimisation Strategies In-Depth

💰

Reduce Log Costs

Set retention policies — default is never expire; set 7/14/30/90 days per group
Use Infrequent Access class — 50% cheaper ingestion for compliance-only logs
Filter at agent level — drop DEBUG/TRACE before ingestion
Archive to S3 — export old logs via subscription filter or export task
Compress payloads — CW agent supports gzip

📉

Reduce Metric Costs

Embedded Metric Format (EMF) — extract metrics from logs without PutMetricData API calls
Avoid unnecessary high-res — 1-second publishing multiplies PutMetricData call volume, and high-resolution alarms cost more than standard ones
Consolidate with dimensions — one metric name + dimensions vs. many metric names
Remove stale alarms — each alarm incurs monthly cost
Use metric math — derive values instead of publishing more raw metrics

🔧

Operational Savings

Automatic dashboards — free, zero-config service dashboards
Anomaly detection — fewer static thresholds to maintain
Composite alarms — one parent alarm notifies once instead of many child alarms each sending their own notification
CloudWatch Agent — replaces third-party agents (no licence cost)
Metric Streams → S3 — cheaper long-term metric storage than CW retention

⚠️

Common Cost Traps

Verbose logging — DEBUG in production can generate TB/month
Unlimited retention — forgotten log groups accumulate storage cost forever
High-res everywhere — 1-second resolution on non-critical metrics
API polling dashboards — excessive GetMetricData calls from auto-refresh
Cross-region transfer — streaming logs/metrics across regions adds data-transfer fees
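Unlimited retention is the easiest trap to close programmatically. The sketch below builds the request for the boto3 `logs.put_retention_policy` call. Because the API accepts only specific day counts, the helper rounds a desired retention up to the next accepted value; the list of accepted values reflects the API documentation as I understand it and should be checked against the current reference.

```python
import bisect

# Day counts accepted by PutRetentionPolicy (assumption: verify against current docs).
VALID_RETENTION_DAYS = [
    1, 3, 5, 7, 14, 30, 60, 90, 120, 150, 180, 365, 400, 545, 731,
    1096, 1827, 2192, 2557, 2922, 3288, 3653,
]


def nearest_valid_retention(days: int) -> int:
    """Round a desired retention up to the next value the API accepts."""
    if days < 1:
        raise ValueError("retention must be at least 1 day")
    index = bisect.bisect_left(VALID_RETENTION_DAYS, days)
    if index == len(VALID_RETENTION_DAYS):
        raise ValueError("retention exceeds the maximum; leave the group as never-expire")
    return VALID_RETENTION_DAYS[index]


def retention_request(log_group: str, days: int) -> dict:
    """Keyword arguments for logs.put_retention_policy."""
    return {"logGroupName": log_group, "retentionInDays": nearest_valid_retention(days)}
```

Rounding up rather than down is a deliberate choice here: it never deletes data earlier than the caller asked for.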

Embedded Metric Format (EMF) In-Depth

EMF lets you embed custom metric definitions inside structured JSON log events. CloudWatch automatically extracts and publishes the metrics, so no PutMetricData API calls are needed. You pay for log ingestion and for the resulting custom metrics, but not for API requests.

EMF — Log-to-Metric Pipeline

EMF is supported in Lambda (natively), ECS, EKS, and EC2 (via CloudWatch agent). Ideal for high-cardinality scenarios where PutMetricData API call volume would be expensive.
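A minimal sketch of what an EMF log event looks like when built by hand. The `_aws` envelope declares the namespace, dimension set, and metric definitions, while the actual values sit at the top level of the same JSON object. The namespace, dimension, and metric names below are made up for illustration; in Lambda, printing the line to stdout is enough for it to reach CloudWatch Logs.

```python
import json
import time
from typing import Dict, Optional, Tuple


def emf_line(namespace: str,
             dimensions: Dict[str, str],
             metrics: Dict[str, Tuple[float, str]],
             timestamp_ms: Optional[int] = None) -> str:
    """Serialise one EMF log event.

    metrics maps a metric name to a (value, unit) pair.
    """
    if timestamp_ms is None:
        timestamp_ms = int(time.time() * 1000)
    event = {
        "_aws": {
            "Timestamp": timestamp_ms,
            "CloudWatchMetrics": [{
                "Namespace": namespace,
                "Dimensions": [list(dimensions)],  # one dimension set using every key
                "Metrics": [{"Name": name, "Unit": unit}
                            for name, (_, unit) in metrics.items()],
            }],
        },
    }
    event.update(dimensions)                                   # dimension values
    event.update({name: value for name, (value, _) in metrics.items()})  # metric values
    return json.dumps(event)


# Hypothetical service and metric names.
print(emf_line("MyApp", {"Service": "checkout"}, {"Latency": (123.0, "Milliseconds")}))
```

Any extra top-level keys (a request ID, for instance) are kept as searchable log fields without becoming dimensions, which is how EMF keeps high-cardinality context out of the metric bill.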

CloudWatch Metric Streams In-Depth

Metric Streams provide near real-time, continuous streaming of CloudWatch metrics to a destination — eliminating the need for polling via API.

Metric Streams — Delivery Architecture

Use Case	How Metric Streams Helps
Long-term retention (>15 months)	Stream to S3 → lifecycle to Glacier for compliance
Third-party monitoring	Direct delivery to Datadog/Splunk — no custom polling code
Reduce API costs	Push-based replaces expensive GetMetricData polling
Cross-account aggregation	Stream to central Kinesis → unified analytics
Self-managed Prometheus	Stream → Firehose → Prometheus remote-write endpoint

Key features: Filter by namespace/metric/dimension. Supports OpenTelemetry 0.7 and JSON output formats. Automatic batching and compression. Pricing: $0.003 per 1,000 metric updates streamed.
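On the consuming side, the JSON output format arrives as newline-delimited records, each carrying summary statistics for one metric and timestamp. The sketch below assumes the record fields `namespace`, `metric_name`, and a `value` object with `sum` and `count`, which matches the documented JSON format as I understand it; treat the field names as an assumption to verify against a real delivery.

```python
import json
from collections import defaultdict
from typing import Dict, Tuple


def average_by_metric(ndjson: str) -> Dict[Tuple[str, str], float]:
    """Average value per (namespace, metric_name) across a batch of stream records."""
    totals = defaultdict(lambda: [0.0, 0.0])  # key -> [sum, count]
    for line in ndjson.splitlines():
        if not line.strip():
            continue  # tolerate blank lines between records
        record = json.loads(line)
        key = (record["namespace"], record["metric_name"])
        totals[key][0] += record["value"]["sum"]
        totals[key][1] += record["value"]["count"]
    # Skip metrics whose records carried no samples to avoid dividing by zero.
    return {key: s / c for key, (s, c) in totals.items() if c}
```

Because each record carries `sum` and `count` rather than a single value, averages can be recomputed correctly over any window after the data lands in S3.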

🎯 Exam Scenario

"Company needs to retain EC2 CPU metrics for 5 years (compliance), but CloudWatch only retains for 15 months" → Metric Streams → Kinesis Data Firehose → S3 bucket → S3 lifecycle rule (30 days → Glacier) for cost-effective long-term retention.

CloudWatch vs Third-Party Observability Core

Factor	CloudWatch	Third-Party (Datadog, New Relic, etc.)
Integration	Native — zero-setup for AWS services	Requires agents / API keys / IAM roles
Multi-cloud	AWS only	AWS + Azure + GCP + on-prem
Log querying	Logs Insights (good, not Splunk-level)	Advanced analytics, ML-powered search
Pricing	Pay per metric / GB / alarm	Per host / per-GB / per-user
Alerting	Alarms + composite + anomaly detection	Advanced correlation, AIOps
APM / Tracing	X-Ray (separate service, integrated via ServiceLens)	Built-in APM with code-level profiling
Best for	AWS-native, cost-sensitive workloads	Multi-cloud, advanced analytics needs

CloudWatch Unified Agent Core

The CloudWatch Unified Agent replaces the legacy CloudWatch Logs agent. It collects both metrics and logs from EC2 instances and on-premises servers, and can also ingest metrics sent over the collectd and StatsD protocols.

📊

Agent — Metrics

CPU (per-core), RAM, disk I/O, network, swap
Process-level: memory, CPU per process name
collectd / StatsD protocol support
Publishes to custom namespace (e.g., CWAgent)

📝

Agent — Logs

Tail any file path → CloudWatch Logs group
Multi-line pattern matching (e.g., Java stack traces)
Timestamp extraction from log lines
Supports Windows Event Log collection

🔧 Agent Configuration

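Generate the JSON config file with the agent configuration wizard (recommended) or write it by hand, and store it in SSM Parameter Store so every instance can fetch the same config. Use amazon-cloudwatch-agent-ctl to load a config and start/stop the agent. Deploy at scale with SSM Run Command across fleets.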
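For orientation, here is a small sketch that generates an agent config covering both halves described above: memory and disk metrics in the `CWAgent` namespace, plus one tailed log file with multi-line handling, timestamp extraction, a retention setting, and an exclude filter that drops DEBUG/TRACE lines before ingestion. The file path, log group name, and chosen measurements are illustrative assumptions, not recommended defaults.

```python
import json


def agent_config(log_path: str, log_group: str) -> dict:
    """Build a CloudWatch agent configuration as a Python dict."""
    return {
        "agent": {"metrics_collection_interval": 60},
        "metrics": {
            "namespace": "CWAgent",
            "metrics_collected": {
                "mem": {"measurement": ["mem_used_percent"]},
                "disk": {"measurement": ["used_percent"], "resources": ["/"]},
            },
        },
        "logs": {
            "logs_collected": {
                "files": {
                    "collect_list": [{
                        "file_path": log_path,
                        "log_group_name": log_group,
                        "log_stream_name": "{instance_id}",
                        "retention_in_days": 30,
                        # A new event starts wherever a timestamp is found, so
                        # stack-trace continuation lines stay with their event.
                        "timestamp_format": "%Y-%m-%d %H:%M:%S",
                        "multi_line_start_pattern": "{timestamp_format}",
                        # Drop verbose lines on the host, before they are billed.
                        "filters": [{"type": "exclude", "expression": "DEBUG|TRACE"}],
                    }]
                }
            }
        },
    }


# Hypothetical path and log group, printed as the JSON the agent would load.
print(json.dumps(agent_config("/var/log/app/app.log", "/app/production"), indent=2))
```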

🎯 Exam Insight

"Reduce CloudWatch costs" → Set log retention, use Infrequent Access log class, filter verbose logs at agent level
"Cheapest way to monitor" → Use basic (5-min) resolution, stay within free tier (10 metrics, 10 alarms, 3 dashboards, 5 GB logs)
"Cross-account observability" → OAM — Observability Access Manager (sink in monitoring account, links from source accounts)
"Embedded metric format" → Publish custom metrics from structured log data without PutMetricData API calls
"Log class selection" → Standard for real-time querying; Infrequent Access for storage-only / compliance
"Unified agent vs Logs agent" → Unified Agent collects both metrics + logs; Logs agent is legacy (logs only)
"Metric resolution trade-off" → High-res (1s) costs more, stored 3 hours at full fidelity; standard (60s) stored 15 days
"Auto-healing architecture" → CW Alarm → SNS → Lambda → restart/replace resource
"Metric Streams" → Near real-time metric delivery to S3/Firehose/third-party (Datadog, Splunk)

Chapter 06 — Key Takeaway

CloudWatch costs scale with data volume — control log ingestion via retention policies, Infrequent Access class, and agent-level filtering. Use EMF to extract metrics from logs without API costs. For multi-account setups, use OAM to centralise observability into a monitoring account. Common patterns include auto-healing (alarm → SNS → Lambda), auto-scaling (custom metric → alarm → ASG), and centralised logging (subscription filters → Kinesis/S3). Choose CloudWatch for AWS-native monitoring; complement with third-party tools for multi-cloud or advanced APM.

CloudWatch — Complete Domain Summary

Metrics — namespaces, dimensions, custom metrics, high-resolution (1s), anomaly detection, metric math, embedded metric format, Metric Streams
Alarms — standard, composite, anomaly-based; actions via SNS, Auto Scaling, EC2, SSM; evaluation periods, datapoints-to-alarm, missing data
Logs — log groups, streams, unified agent, retention, metric filters, subscription filters, Logs Insights, Live Tail, Infrequent Access class
Dashboards — widgets, cross-account, cross-region, sharing, automatic dashboards, Container Insights, ServiceLens + X-Ray, Contributor Insights
Advanced Observability — Synthetics canaries (proactive endpoint testing), RUM (real user metrics), Evidently (feature flags/A/B), Lambda Insights (deep function metrics)
Patterns — auto-healing, auto-scaling, centralised logging, security monitoring, multi-account via OAM
Cost — free tier limits, log retention & IA class, EMF, Metric Streams to S3, avoid high-res everywhere, remove unused alarms