Amazon CloudWatch
Observability & Monitoring
The unified observability platform for AWS. CloudWatch collects metrics, aggregates logs, triggers alarms, builds dashboards, and traces requests, giving you complete visibility into the health of every resource in your cloud infrastructure.
What is Amazon CloudWatch?
Amazon CloudWatch is a managed monitoring and observability service that collects data from every AWS resource: metrics (numbers over time), logs (text events), traces (request paths), and events (state changes). It's not one tool; it's a platform with five pillars that together give you full system visibility.
Metrics
- Time-series numeric data
- CPU, memory, network, disk, latency
- Auto-collected from 70+ AWS services
- Custom metrics via API/agent
Alarms
- Threshold-based alerting on metrics
- Trigger SNS, Auto Scaling, Lambda
- OK / ALARM / INSUFFICIENT_DATA states
- Composite alarms (AND/OR logic)
Logs
- Centralised log aggregation
- From Lambda, ECS, EC2, any source
- Log Insights for SQL-like queries
- Metric filters → alarms on log patterns
Dashboards
- Custom visualisation panels
- Cross-account, cross-region
- Real-time and historical views
- Shareable via link or embed
Events / EventBridge
- React to AWS state changes
- Schedule cron-like rules
- Route events to Lambda, SNS, SQS
- (Now part of Amazon EventBridge)
CloudWatch is like the monitoring station in a hospital. Metrics = the vital signs displays (heart rate, blood pressure). Alarms = the beeping alerts when vitals go critical. Logs = the patient's medical chart (detailed history). Dashboards = the nurse's station screen with all patients at a glance. Every AWS resource is a "patient" being monitored continuously.
| Service | Auto-Collected Metrics | Resolution |
|---|---|---|
| EC2 | CPU, Network In/Out, Disk I/O, Status Checks | 5 min (basic) / 1 min (detailed) |
| RDS | CPU, Connections, Read/Write IOPS, Free Storage | 1 min |
| Lambda | Invocations, Duration, Errors, Throttles, Concurrency | 1 min |
| ALB | Request count, Latency, 4xx/5xx errors, Active connections | 1 min |
| S3 | Bucket size, Object count, Request metrics (if enabled) | 1 day (size) / 1 min (requests) |
| SQS | Messages visible, Messages sent/received, Age of oldest | 1 min |
| DynamoDB | Read/Write capacity used, Throttled requests, Latency | 1 min |
| ECS/Fargate | CPU, Memory utilisation per task/service | 1 min |
| Service | What It Answers | Data Type | Example |
|---|---|---|---|
| CloudWatch | "How is my system performing RIGHT NOW?" | Metrics, logs, alarms | CPU at 85%, 500 errors spiking |
| CloudTrail | "WHO did WHAT and WHEN?" | API audit logs (who called what) | User X deleted S3 bucket at 3:42PM |
| AWS Config | "What changed in my infrastructure?" | Resource configuration history | Security group rule was modified |
In production, you use all three: CloudWatch tells you something is wrong (alarm fires). CloudTrail tells you who made the change that caused it. Config tells you exactly what the configuration looked like before and after. They're complementary, not competing.
| Term | Definition | Example |
|---|---|---|
| Namespace | Container for metrics from one service | AWS/EC2, AWS/Lambda, Custom/MyApp |
| Metric | Time-series of data points | CPUUtilization, Errors, Duration |
| Dimension | Key-value pair identifying a metric stream | InstanceId=i-1234, FunctionName=myFunc |
| Statistic | Aggregation over a period | Average, Sum, Max, Min, p99 |
| Period | Time granularity for aggregation | 60s, 300s (5 min), 3600s (1 hour) |
| Alarm | Watches a metric, changes state when threshold hit | CPU > 80% for 5 minutes → ALARM |
| Log Group | Collection of log streams from one source | /aws/lambda/my-function |
| Log Stream | Individual sequence of log events | One Lambda instance's output |
- "Monitor CPU/memory/network" → CloudWatch Metrics
- "Alert when threshold breached" → CloudWatch Alarms
- "Centralise application logs" → CloudWatch Logs
- "Query logs with SQL-like syntax" → CloudWatch Logs Insights
- "CloudWatch vs CloudTrail" → CW = performance/metrics. CT = API audit/who-did-what.
- "Custom metric" → use PutMetricData API or CloudWatch Agent
- "EC2 memory metric" → NOT available by default. Requires CloudWatch Agent (custom metric).
- "Default EC2 monitoring interval" → 5 minutes (basic). Enable "detailed monitoring" for 1-minute.
CloudWatch is five services in one: Metrics (numbers over time), Alarms (threshold alerts), Logs (centralised text), Dashboards (visualisation), and Events (state change reactions). It monitors 70+ AWS services automatically. CloudWatch answers "how is my system performing?", which is distinct from CloudTrail (who did what) and Config (what changed). EC2 memory/disk requires the CloudWatch Agent; it's NOT collected by default.
CloudWatch Metrics β Deep Dive
Metrics are the foundation of CloudWatch. A metric is a time-ordered set of data points, each representing a measurement (CPU%, latency ms, error count) at a specific time. Understanding namespaces, dimensions, resolution, and retention is critical for both production and exams.
Every metric is uniquely identified by three things: its namespace, its metric name, and its dimensions.
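As a rough illustration of that identity rule (namespace + name + dimensions), the sketch below models a metric key in plain Python. The `MetricKey` class is hypothetical and exists only to show the idea; it is not part of any AWS SDK.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MetricKey:
    """Hypothetical model of what makes a CloudWatch metric unique."""
    namespace: str
    name: str
    dimensions: frozenset  # frozenset of (key, value) pairs

    @classmethod
    def of(cls, namespace, name, **dimensions):
        return cls(namespace, name, frozenset(dimensions.items()))


a = MetricKey.of("AWS/EC2", "CPUUtilization", InstanceId="i-1234")
b = MetricKey.of("AWS/EC2", "CPUUtilization", InstanceId="i-5678")
c = MetricKey.of("AWS/EC2", "CPUUtilization", InstanceId="i-1234")

print(a == b)  # False: same namespace and name, different dimension value
print(a == c)  # True: all three parts match
```

Changing any one of the three parts, including a single dimension value, yields a separate metric stream.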
| Aspect | Standard (Built-in) | Custom (You publish) |
|---|---|---|
| Source | AWS services automatically (EC2, RDS, Lambda, …) | Your application via API/Agent |
| Cost | Free (included with the service) | $0.30/metric/month (first 10K) |
| Resolution | 1 min or 5 min (service-dependent) | Standard (60s) or High-res (1s) |
| Namespace | AWS/ServiceName | Custom/YourApp (you choose) |
| Examples | CPUUtilization, NetworkIn, Invocations | ActiveUsers, OrdersPerMinute, QueueDepth |
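To make the "Custom" column concrete, here is a minimal sketch of the request you might pass to the PutMetricData API. It only builds the keyword arguments, so it runs without AWS credentials; the `build_put_metric_data` helper and the `Environment` dimension are illustrative assumptions, while the field names follow the PutMetricData request shape.

```python
from datetime import datetime, timezone


def build_put_metric_data(namespace, name, value, unit="Count",
                          dimensions=None, high_resolution=False):
    """Build keyword arguments for CloudWatch PutMetricData (hypothetical helper)."""
    datum = {
        "MetricName": name,
        "Timestamp": datetime.now(timezone.utc),
        "Value": float(value),
        "Unit": unit,
        # 1 = high-resolution (1 s), 60 = standard resolution
        "StorageResolution": 1 if high_resolution else 60,
    }
    if dimensions:
        datum["Dimensions"] = [
            {"Name": k, "Value": v} for k, v in sorted(dimensions.items())
        ]
    return {"Namespace": namespace, "MetricData": [datum]}


params = build_put_metric_data(
    "Custom/MyApp", "OrdersPerMinute", 42,
    dimensions={"Environment": "prod"},  # illustrative dimension
)
print(params["Namespace"], params["MetricData"][0]["MetricName"])

# With boto3 installed and credentials configured, the call would be:
#   boto3.client("cloudwatch").put_metric_data(**params)
```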
The CloudWatch Agent is a small daemon installed on EC2 (or on-prem servers) that collects metrics not available by default:
Metrics the Agent Collects
- Memory utilisation (% used, available)
- Disk space (% used, free bytes per mount)
- Disk I/O (reads/writes per second)
- Swap usage
- Network (detailed): packets, TCP connections
- Process-level: CPU/memory per process
Logs the Agent Collects
- Application log files (custom paths)
- System logs (/var/log/syslog)
- Windows Event Logs
- Apache/Nginx access logs
- Any text file you configure
- Pushes to CloudWatch Logs groups
EC2 Memory and Disk metrics are NOT available by default. You MUST install the CloudWatch Agent to get memory/disk utilisation. This is one of the most commonly tested facts. CPU and Network are default; Memory and Disk are not.
| Resolution | Period | Retention | Use Case |
|---|---|---|---|
| Basic Monitoring | 5 minutes | 15 months | Default for EC2 (free) |
| Detailed Monitoring | 1 minute | 15 months | EC2 with detailed enabled ($) |
| High-Resolution | 1 second | 3 hours (1s), then rolls up | Custom metrics via API ($$$) |
Retention rollup. CloudWatch keeps data at decreasing granularity over time:
- 1-second data: retained for 3 hours
- 1-minute data: retained for 15 days
- 5-minute data: retained for 63 days
- 1-hour data: retained for 15 months (455 days)
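The rollup schedule can be read as a lookup: given how old a data point is, what is the finest period you can still query? The sketch below encodes the four tiers listed above; the function name is an assumption, and the 1-second tier only applies to metrics published at high resolution.

```python
from datetime import timedelta

# Retention tiers as listed above: (period in seconds, how long it is kept).
RETENTION_TIERS = [
    (1, timedelta(hours=3)),
    (60, timedelta(days=15)),
    (300, timedelta(days=63)),
    (3600, timedelta(days=455)),
]


def finest_period_available(age):
    """Return the smallest period (seconds) still retained for data of this
    age, or None once the data has aged out entirely."""
    for period, kept_for in RETENTION_TIERS:
        if age <= kept_for:
            return period
    return None


print(finest_period_available(timedelta(hours=1)))   # 1
print(finest_period_available(timedelta(days=30)))   # 300
print(finest_period_available(timedelta(days=500)))  # None
```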
Metric Math lets you combine multiple metrics with arithmetic expressions:
- Error rate: m1/m2 * 100 (errors ÷ total requests × 100)
- Cost per request: METRICS("cost") / METRICS("requests")
- Can be used in alarms: alarm on calculated expressions, not just raw metrics
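As a sketch of the error-rate expression, the helper below builds a `MetricDataQueries` list of the kind GetMetricData accepts: two raw metrics (`m1`, `m2`) and one expression (`e1`). The helper name and the choice of Lambda's `Errors` and `Invocations` metrics are illustrative assumptions; nothing here calls AWS.

```python
def error_rate_queries(namespace, errors_metric, requests_metric,
                       dimensions, period=300):
    """Build MetricDataQueries for GetMetricData: e1 = m1 / m2 * 100."""
    dims = [{"Name": k, "Value": v} for k, v in sorted(dimensions.items())]

    def raw(query_id, metric_name):
        return {
            "Id": query_id,
            "MetricStat": {
                "Metric": {
                    "Namespace": namespace,
                    "MetricName": metric_name,
                    "Dimensions": dims,
                },
                "Period": period,
                "Stat": "Sum",
            },
            "ReturnData": False,  # inputs only; do not return these series
        }

    return [
        raw("m1", errors_metric),
        raw("m2", requests_metric),
        {"Id": "e1", "Expression": "m1 / m2 * 100",
         "Label": "ErrorRatePercent", "ReturnData": True},
    ]


queries = error_rate_queries(
    "AWS/Lambda", "Errors", "Invocations", {"FunctionName": "myFunc"}
)
print([q["Id"] for q in queries])  # ['m1', 'm2', 'e1']

# With boto3: boto3.client("cloudwatch").get_metric_data(
#     MetricDataQueries=queries, StartTime=..., EndTime=...)
```

Setting `ReturnData` to false on the inputs means only the calculated series comes back.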
Anomaly Detection applies ML to establish a "normal" band for a metric. When the metric breaches the band, it triggers an alarm, even without a fixed threshold. Useful for metrics with variable baselines (e.g. traffic patterns that differ by day of week).
- "Memory not available on EC2" → install CloudWatch Agent
- "1-second resolution" → High-Resolution custom metrics (extra cost)
- "Metric retention" → 15 months for 1-hour aggregated data
- "PutMetricData" → API to publish custom metrics programmatically
- "Detailed monitoring" → EC2 at 1-minute intervals (vs 5-min basic)
- "Namespace AWS/EC2 vs Custom" → AWS/ prefix = built-in. Custom/ = yours.
- "Aggregate across instances" → use statistics (Average, Sum) without dimension filtering
- "Alarm on calculated value" → Metric Math expressions in alarms
Metrics are time-series data identified by Namespace + Name + Dimensions. EC2 gives you CPU/Network for free but NOT memory/disk; install the CloudWatch Agent for those. Standard resolution = 1min or 5min. High-resolution custom metrics go down to 1-second. Data is retained for 15 months at 1-hour granularity. Use Metric Math to combine metrics and alarm on calculated values.
CloudWatch Alarms β Automated Response
CloudWatch Alarms watch a metric (or metric math expression) and change state when a threshold is breached. When an alarm fires, it can send notifications, trigger Auto Scaling, stop/terminate EC2 instances, or invoke Lambda functions, enabling fully automated incident response.
| Parameter | What It Controls | Example |
|---|---|---|
| Metric | Which metric to watch | AWS/EC2 CPUUtilization InstanceId=i-123 |
| Statistic | How to aggregate within a period | Average, Sum, Maximum, p99 |
| Period | Evaluation window per data point | 60 seconds, 300 seconds |
| Evaluation Periods | How many consecutive periods must breach | 3 (alarm fires after 3 bad periods) |
| Datapoints to Alarm | M out of N periods must breach (flexible) | 3 out of 5 (alarm if 3 of last 5 breach) |
| Threshold | The boundary value | > 80 (fires when metric exceeds 80) |
| Comparison Operator | Greater, Less, GreaterOrEqual, etc. | GreaterThanThreshold |
| Actions | What to do when state changes | SNS topic, Auto Scaling policy, EC2 action |
The Datapoints to Alarm parameter is powerful. Instead of requiring 3 consecutive breaches (which a single recovery resets), you can set "3 out of 5", meaning the alarm fires if 3 of the last 5 evaluation periods breach the threshold. This avoids false positives from brief recoveries during an ongoing issue.
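To see why "M out of N" behaves differently from consecutive breaches, here is a simplified local simulation. It is an approximation for illustration only: it ignores CloudWatch's missing-data handling and treats any window shorter than N as INSUFFICIENT_DATA. The function name and sample values are made up.

```python
def alarm_states(values, threshold, evaluation_periods, datapoints_to_alarm):
    """Simplified M-out-of-N evaluation: ALARM when at least
    `datapoints_to_alarm` of the last `evaluation_periods` values exceed
    `threshold`. Missing-data handling is deliberately ignored here."""
    states = []
    for i in range(len(values)):
        window = values[max(0, i - evaluation_periods + 1): i + 1]
        if len(window) < evaluation_periods:
            states.append("INSUFFICIENT_DATA")
            continue
        breaches = sum(1 for v in window if v > threshold)
        states.append("ALARM" if breaches >= datapoints_to_alarm else "OK")
    return states


cpu = [85, 90, 60, 88, 91, 50, 40]  # one brief dip to 60 mid-incident

# 3 consecutive breaches required: the dip keeps resetting the count,
# so this configuration never reaches ALARM on this data.
print(alarm_states(cpu, 80, 3, 3))

# 3 out of 5: the same dip no longer hides the ongoing problem.
print(alarm_states(cpu, 80, 5, 3))
```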
Notification Actions
- SNS Topic → email, SMS, Slack (via Lambda), PagerDuty
- Separate actions for ALARM, OK, and INSUFFICIENT_DATA states
- Can notify on recovery (OK) too, not just alarm
Auto Scaling Actions
- Trigger scale-out (add instances) on high CPU
- Trigger scale-in (remove instances) on low CPU
- Target Tracking uses alarms internally
- Step Scaling = multiple alarm thresholds
EC2 Actions
- Stop instance (StatusCheckFailed_System)
- Terminate instance
- Reboot instance
- Recover instance (move to new host)
Other Actions
- Lambda function (custom remediation)
- Systems Manager (run automation doc)
- EventBridge (route to many targets)
- Create OpsItem in OpsCenter
Composite Alarms combine multiple alarms using AND/OR logic. This prevents alarm noise:
- Problem: You have 10 alarms monitoring different aspects. A single incident triggers all 10 → notification storm.
- Solution: Create a composite alarm: "Fire only when AlarmA AND AlarmB are both in ALARM state". One notification for the combined condition.
- Composite alarms can suppress actions on child alarms (only the composite sends notifications)
- Support AND, OR, NOT logic between child alarms
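A composite alarm is defined by a rule expression over child alarm states. The sketch below builds such expressions as strings and assembles arguments in the shape PutCompositeAlarm expects; the helper names, the alarm name, and the SNS ARN are placeholders, not values from a real account.

```python
def all_in_alarm(*alarm_names):
    """AlarmRule text that is true only when every child alarm is in ALARM."""
    return " AND ".join(f'ALARM("{name}")' for name in alarm_names)


def any_in_alarm(*alarm_names):
    """AlarmRule text that is true when at least one child alarm is in ALARM."""
    return " OR ".join(f'ALARM("{name}")' for name in alarm_names)


rule = all_in_alarm("AlarmA", "AlarmB")
print(rule)  # ALARM("AlarmA") AND ALARM("AlarmB")

composite_params = {
    "AlarmName": "checkout-degraded",  # hypothetical name
    "AlarmRule": rule,
    # placeholder ARN; substitute your own SNS topic
    "AlarmActions": ["arn:aws:sns:us-east-1:123456789012:on-call"],
}
# With boto3: boto3.client("cloudwatch").put_composite_alarm(**composite_params)
```

Only the composite carries the notification action here, which is what collapses a notification storm into a single alert.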
| Pattern | Metric | Threshold | Action |
|---|---|---|---|
| CPU Scale-Out | CPUUtilization (Average) | > 70% for 3 periods | ASG: add 2 instances |
| CPU Scale-In | CPUUtilization (Average) | < 30% for 10 periods | ASG: remove 1 instance |
| Error Spike | 5xx errors (Sum) | > 100 in 5 minutes | SNS: alert on-call team |
| Disk Full | DiskSpaceUsed (custom agent) | > 90% | SNS: ops alert + Lambda cleanup |
| EC2 System Failure | StatusCheckFailed_System | = 1 for 2 periods | EC2: Recover instance |
| SQS Dead Letter Build-up | ApproximateNumberOfMessagesVisible | > 0 for 1 period | SNS: investigate DLQ |
| Lambda Throttles | Throttles (Sum) | > 0 | SNS: review concurrency limits |
| Billing Alert | EstimatedCharges | > $100 | SNS: budget warning |
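As one worked instance of the table, the "CPU Scale-Out" row could be expressed as PutMetricAlarm arguments roughly like this. The helper, the alarm naming scheme, and the 60-second period are assumptions; the scaling policy ARN is something you would obtain from your own Auto Scaling group.

```python
def cpu_scale_out_alarm(asg_name, scaling_policy_arn):
    """Keyword arguments for PutMetricAlarm matching the CPU Scale-Out row."""
    return {
        "AlarmName": f"{asg_name}-cpu-high",  # hypothetical naming scheme
        "Namespace": "AWS/EC2",
        "MetricName": "CPUUtilization",
        "Dimensions": [{"Name": "AutoScalingGroupName", "Value": asg_name}],
        "Statistic": "Average",
        "Period": 60,             # seconds per data point (assumed)
        "EvaluationPeriods": 3,   # "for 3 periods"
        "DatapointsToAlarm": 3,   # all 3 must breach
        "Threshold": 70.0,
        "ComparisonOperator": "GreaterThanThreshold",
        "AlarmActions": [scaling_policy_arn],
    }


alarm = cpu_scale_out_alarm("web-asg", "arn:aws:autoscaling:placeholder")
print(alarm["AlarmName"], alarm["Threshold"])

# With boto3: boto3.client("cloudwatch").put_metric_alarm(**alarm)
```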
| Type | Cost | Notes |
|---|---|---|
| Standard alarm | $0.10/alarm/month | Standard resolution (60s+) |
| High-resolution alarm | $0.30/alarm/month | 10s or 30s period |
| Composite alarm | $0.50/alarm/month | Combines multiple child alarms |
| Anomaly detection alarm | $0.30/alarm/month | ML-based band detection |
| Free tier | 10 alarms free | Standard resolution only |
- "Alarm triggers Auto Scaling" → CloudWatch Alarm with ASG scaling policy action
- "Alert the team when errors spike" → Alarm → SNS Topic → email/Slack
- "Recover EC2 from system failure" → Alarm on StatusCheckFailed_System → EC2 Recover action
- "Reduce alarm noise" → Composite Alarms (AND/OR logic)
- "3 out of 5" → Datapoints to Alarm = 3, Evaluation Periods = 5
- "Alarm on billing" → EstimatedCharges metric in us-east-1 (billing metrics only there)
- "INSUFFICIENT_DATA state" → metric hasn't reported data in the evaluation period (instance stopped, metric not emitting)
- "Alarm can invoke Lambda" → yes, directly as an alarm action (no EventBridge needed)
Alarms watch metrics and move between three states (OK, ALARM, INSUFFICIENT_DATA) when thresholds breach. They trigger SNS notifications, Auto Scaling actions, EC2 recovery, or Lambda functions. Use "M out of N" evaluation to avoid false positives. Use Composite Alarms to reduce notification noise by combining conditions with AND/OR logic. The most common pattern: CPU alarm triggers ASG scale-out.
CloudWatch Logs β Centralised Log Management
CloudWatch Logs is a fully managed log aggregation and analysis service. It ingests logs from Lambda, ECS, EC2 (via agent), API Gateway, VPC Flow Logs, Route 53 DNS queries, and any custom source, then lets you search, filter, create metrics from patterns, and export for long-term storage.
| Concept | Description | Example |
|---|---|---|
| Log Group | Top-level container. Defines retention, encryption, access. | /aws/lambda/order-service |
| Log Stream | Sequence of events from one source instance. | One Lambda execution container, one EC2 instance |
| Log Event | Single log entry: timestamp + message string. | 2026-05-07T10:23:45Z ERROR NullPointerException… |
Automatic (Built-in)
- Lambda function output
- API Gateway access logs
- ECS/Fargate container stdout
- RDS/Aurora error & slow-query
- VPC Flow Logs
- Route 53 DNS query logs
Agent-Based (EC2/On-Prem)
- CloudWatch Agent on EC2
- Custom application log files
- System logs (/var/log/syslog)
- Windows Event Logs
- On-premises servers
- Any text file on disk
SDK/API (Programmatic)
- PutLogEvents API
- AWS SDKs (Boto3, Java, etc.)
- Fluent Bit / Fluentd plugins
- Docker logging drivers
- Any HTTP client
By default, logs are retained forever (never expire). You set retention per log group:
| Retention Period | Use Case | Cost Impact |
|---|---|---|
| 1 to 7 days | Development/debugging only | Lowest storage cost |
| 30 days | Standard operational logs | Moderate |
| 90 days | Compliance (short-term) | Higher |
| 1 to 10 years | Audit/compliance requirements | High; consider S3 export |
| Never expire | Default (dangerous for cost!) | Grows indefinitely |
The default retention is Never Expire. This means log storage costs grow forever. Always set a retention policy on every log group. For long-term archival, export to S3 (much cheaper) or use S3 lifecycle rules to move to Glacier.
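Retention is set per log group with PutRetentionPolicy, which only accepts specific day counts. The sketch below rounds a requested minimum up to the next allowed value. The allowed list is reproduced from memory of the API documentation and should be verified before you rely on it; the helper name is an assumption.

```python
import bisect

# Day counts accepted by PutRetentionPolicy (assumed list; verify against
# the current API reference).
ALLOWED_RETENTION_DAYS = [
    1, 3, 5, 7, 14, 30, 60, 90, 120, 150, 180, 365, 400, 545, 731,
    1096, 1827, 2192, 2557, 2922, 3288, 3653,
]


def retention_params(log_group, min_days):
    """Pick the smallest allowed retention that is >= min_days."""
    i = bisect.bisect_left(ALLOWED_RETENTION_DAYS, min_days)
    if i == len(ALLOWED_RETENTION_DAYS):
        raise ValueError(f"{min_days} days exceeds the longest allowed retention")
    return {"logGroupName": log_group,
            "retentionInDays": ALLOWED_RETENTION_DAYS[i]}


print(retention_params("/aws/lambda/order-service", 45))
# {'logGroupName': '/aws/lambda/order-service', 'retentionInDays': 60}

# With boto3: boto3.client("logs").put_retention_policy(**params)
```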
Metric Filters scan incoming log events for patterns and emit CloudWatch metrics when matches occur. This lets you alarm on log content without reading logs manually.
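A minimal sketch of a metric filter definition, using the JSON-style pattern `{ $.statusCode = 500 }`. The dictionary follows the PutMetricFilter request shape; the filter and metric names are hypothetical. The small `matches_status_500` function is only a local stand-in for what the pattern selects, not CloudWatch's actual matcher.

```python
import json


def matches_status_500(raw_event):
    """Local stand-in for the JSON filter pattern { $.statusCode = 500 }."""
    try:
        event = json.loads(raw_event)
    except ValueError:
        return False  # not JSON, so a JSON pattern cannot match
    return isinstance(event, dict) and event.get("statusCode") == 500


metric_filter_params = {
    "logGroupName": "/aws/lambda/order-service",
    "filterName": "http-500-count",          # hypothetical name
    "filterPattern": "{ $.statusCode = 500 }",
    "metricTransformations": [{
        "metricName": "Http500Count",        # hypothetical metric
        "metricNamespace": "Custom/MyApp",
        "metricValue": "1",                  # emit 1 per matching event
        "defaultValue": 0.0,                 # emit 0 when nothing matches
    }],
}

print(matches_status_500('{"statusCode": 500, "path": "/orders"}'))  # True
print(matches_status_500('{"statusCode": 200}'))                      # False

# With boto3: boto3.client("logs").put_metric_filter(**metric_filter_params)
```

An alarm on `Http500Count` then closes the loop from log pattern to notification.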
Common metric filter patterns:
- "ERROR": any line containing ERROR
- "[ip, user, timestamp, request, status_code=5*, bytes]": space-delimited pattern matching 5xx codes
- { $.statusCode = 500 }: JSON filter for structured logs
- "OutOfMemoryError": Java OOM detection
Subscription Filters stream matching log events in real-time to a destination for processing:
Destinations
- Kinesis Data Streams → real-time analytics
- Kinesis Data Firehose → load to S3/Redshift/OpenSearch
- Lambda → custom processing per event
- OpenSearch (via Firehose) → log search UI
Use Cases
- Stream error logs to Slack via Lambda
- Build real-time security dashboards
- Feed logs into third-party SIEM tools
- Cross-account log aggregation
- Limit: 2 subscription filters per log group
Logs Insights is a purpose-built query language for searching and analyzing log data interactively. It's like SQL for logs: fast, serverless, pay-per-query.
fields @timestamp, @message | filter @message like /ERROR/ | sort @timestamp desc | limit 20
filter @message like /Exception/ | stats count(*) as errorCount by bin(5m)
fields @timestamp, @message | parse @message "user=* action=* status=*" as user, action, status | filter status = "FAILED"
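The `parse` command with a glob pattern can be hard to picture, so here is a local approximation in Python: each `*` becomes a capture group bound to the next name. This is a sketch of the idea, not the Logs Insights engine; in particular, treating a trailing `*` as "rest of the line" is an assumption.

```python
import re


def parse_glob(message, pattern, names):
    """Approximate Logs Insights `parse` with a glob pattern.
    Returns a dict of extracted fields, or None if the pattern does not match."""
    pieces = pattern.split("*")
    regex = ""
    for i, piece in enumerate(pieces):
        regex += re.escape(piece)
        if i < len(pieces) - 1:
            # A trailing * captures the rest of the line (assumed behaviour);
            # every other * captures as little as possible.
            trailing = (i == len(pieces) - 2 and pieces[-1] == "")
            regex += "(.*)" if trailing else "(.*?)"
    match = re.search(regex, message)
    if match is None:
        return None
    return dict(zip(names, match.groups()))


line = "2026-05-07T10:23:45Z user=alice action=login status=FAILED"
fields = parse_glob(line, "user=* action=* status=*",
                    ["user", "action", "status"])
print(fields)  # {'user': 'alice', 'action': 'login', 'status': 'FAILED'}

# Equivalent of `| filter status = "FAILED"`:
print(fields["status"] == "FAILED")  # True
```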
| Feature | Details |
|---|---|
| Query language | Purpose-built (fields, filter, stats, sort, parse, limit) |
| Performance | Scans GB of logs in seconds |
| Pricing | $0.005 per GB of data scanned |
| Multi-group | Query up to 50 log groups at once |
| Visualisation | Auto-generates time-series charts from stats queries |
| Saved queries | Save & share commonly used queries |
| Auto-discovery | Automatically detects fields in JSON logs |
Quick-reference for common Logs Insights query patterns:
| What You Want | Query |
|---|---|
| All errors (last hour) | `fields @timestamp, @message \| filter @message like /Error\|Exception/ \| sort @timestamp desc \| limit 100` |
| Error count by 5-min bucket | `filter @message like /ERROR/ \| stats count() by bin(5m)` |
| Lambda cold starts | `filter @message like /Init Duration/ \| parse @message "Init Duration: * ms" as initDuration \| sort initDuration desc` |
| Slow Lambda (>1s) | `filter @type = "REPORT" \| parse @message "Duration: * ms" as duration \| filter duration > 1000 \| sort duration desc` |
| Top 10 IPs | `parse @message "client-ip=*" as ip \| stats count() by ip \| sort count() desc \| limit 10` |
| HTTP status breakdown | `parse @message "status=*" as status \| stats count() by status` |
| P95 latency per 5m | `filter @type = "REPORT" \| parse @message "Duration: * ms" as d \| stats pct(d, 95) by bin(5m)` |
Syntax cheat-sheet:
| Command | Purpose |
|---|---|
| fields | Select fields to display |
| filter | Regex (like /pat/) or exact match |
| stats | Aggregation: count(), avg(), max(), pct(field, N) |
| sort | Order results (asc / desc) |
| limit | Max results returned |
| parse | Extract values from unstructured text |
| bin(interval) | Group by time bucket (5m, 1h, etc.) |
Live Tail streams log events in real-time as they arrive, like tail -f for CloudWatch Logs. Available directly in the console.
Capabilities
- Real-time filter by pattern (show only ERROR lines)
- Highlight matching terms
- Pause / resume streaming
- View across multiple log groups simultaneously
- Set time window (1m, 5m, 15m)
Use Cases
- Debugging a deployment as it happens
- Investigating live incidents in real-time
- Monitoring Lambda during load tests
- Tracing a request across services
- Verifying log format changes
CloudWatch → Logs → Log Groups → Select group(s) → Actions → Start Live Tail. Console-only feature (not available via API). Best-effort delivery; not for audit/compliance capture.
| Method | Latency | Use Case |
|---|---|---|
| S3 Export (CreateExportTask) | Up to 12 hours | Batch archival, compliance |
| Subscription Filter → Firehose → S3 | Near real-time (~60s) | Continuous export for analytics |
| Cross-account subscription | Real-time | Centralise logs in a security account |
S3 Export is a one-time batch job (can take 12 hours); use it for archival. Subscription Filter → Firehose → S3 is near real-time continuous streaming; use it when you need ongoing export. For exam questions about "real-time log export to S3", the answer is Subscription Filter + Firehose, NOT CreateExportTask.
| Component | Cost (US East) | Notes |
|---|---|---|
| Ingestion | $0.50 / GB | All data written to CW Logs |
| Storage | $0.03 / GB / month | Retained log data |
| Logs Insights | $0.005 / GB scanned | Per query |
| Vended logs (VPC Flow, etc.) | $0.10 / GB | Cheaper ingestion for AWS-generated |
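A quick way to sanity-check a logging bill is to plug volumes into the three main rates from the table. This sketch assumes the US East prices listed above, ignores the free tier and the cheaper vended-log rate, and uses made-up volumes.

```python
def monthly_logs_cost(ingested_gb, stored_gb, scanned_gb,
                      ingest_price=0.50, storage_price=0.03, scan_price=0.005):
    """Estimate monthly CloudWatch Logs cost in USD (free tier ignored)."""
    total = (ingested_gb * ingest_price
             + stored_gb * storage_price
             + scanned_gb * scan_price)
    return round(total, 2)


# Example: 100 GB ingested, 300 GB retained, 500 GB scanned by Logs Insights.
# 100 * 0.50 + 300 * 0.03 + 500 * 0.005 = 50 + 9 + 2.5
print(monthly_logs_cost(100, 300, 500))  # 61.5
```

Ingestion dominates in this example, which is why trimming noisy log lines at the source usually saves more than shortening retention.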
- "Centralise application logs" → CloudWatch Logs (via agent or SDK)
- "Query logs with SQL-like syntax" → CloudWatch Logs Insights
- "Create alarm from log pattern" → Metric Filter → custom metric → alarm
- "Real-time log export to S3" → Subscription Filter + Kinesis Firehose (NOT CreateExportTask)
- "Default log retention" → Never expire (must explicitly set retention policy)
- "Stream logs to OpenSearch" → Subscription Filter → Firehose/Lambda → OpenSearch
- "Cross-account logs" → Subscription filter to destination in central account
- "Cheapest long-term log storage" → Export to S3 → lifecycle to Glacier
- "Lambda logs not appearing" → Lambda execution role missing logs:CreateLogGroup/logs:PutLogEvents permissions
- "Limit per log group" → 2 subscription filters maximum
CloudWatch Logs organises data as Log Groups → Log Streams → Log Events. Default retention is forever (set it!). Metric Filters turn log patterns into metrics you can alarm on. Subscription Filters stream logs in real-time to Kinesis/Lambda/OpenSearch. Logs Insights provides SQL-like querying at $0.005/GB scanned. Live Tail gives real-time console streaming for debugging. For real-time S3 export, use Subscription Filter + Firehose, not CreateExportTask (which is batch, up to 12h delay).
CloudWatch Dashboards β Visualisation & Operational Views
CloudWatch Dashboards provide customisable real-time visualisation of metrics, logs, and alarms in a single pane of glass. They support cross-account, cross-region widgets, letting operations teams monitor entire multi-account architectures from one screen.
Widget Types
- Line graph: time-series trends (CPU, latency)
- Stacked area: cumulative values
- Number: single current value (big font)
- Gauge: percentage with colour bands
- Bar chart: comparisons
- Log table: Logs Insights query results
- Alarm status: red/green per alarm
- Text (Markdown): labels & documentation
Key Features
- Cross-account: metrics from multiple AWS accounts
- Cross-region: global view in one dashboard
- Auto-refresh: 10s, 1m, 5m intervals
- Time range control: relative or absolute
- Full-screen mode: NOC/war-room displays
- Dark mode: built-in for control rooms
- Annotations: mark deployments/incidents
- Variables: dynamic filtering (region, env)
| Sharing Method | Access Control | Use Case |
|---|---|---|
| IAM (console) | IAM policies on cloudwatch:GetDashboard | Internal teams with AWS access |
| Share via link | Public URL (no auth required) | NOC screens, external stakeholders |
| SSO-enabled sharing | Third-party auth (Cognito, SAML) | Partner teams without IAM accounts |
| CloudWatch cross-account | Organization sharing setup | Central operations account sees all |
CloudWatch provides automatic dashboards out of the box, with zero configuration required:
- Service-level dashboards: auto-generated for EC2, Lambda, RDS, etc. showing key metrics
- Cross-service dashboard: aggregated health across all services in use
- Account-level overview: alarms, anomalies, recent changes
- Can be used as a starting point: clone & customise for your needs
Do
- Create separate dashboards per environment (prod/staging)
- Use annotations to mark deployments
- Include alarm status widgets for at-a-glance health
- Add Markdown widgets with runbook/escalation links
- Use variables for dynamic filtering
- Set appropriate auto-refresh (10s for real-time ops)
Don't
- Overload a single dashboard with 50+ widgets (slow rendering)
- Rely solely on dashboards for alerting (use alarms)
- Share public links with sensitive metric data
- Forget cross-region widgets incur cross-region data transfer
- Create dashboards without clear ownership
| Item | Cost | Notes |
|---|---|---|
| First 3 dashboards | Free | Up to 50 metrics each |
| Additional dashboards | $3.00/dashboard/month | Each can have up to 500 widgets |
| API calls | Included | GetMetricData calls for rendering |
| Cross-account | No extra charge | Requires sharing setup |
Beyond basic dashboards, CloudWatch offers advanced observability features:
ServiceLens
- Unified view: metrics + traces + logs + alarms
- Service map showing dependencies
- Integrates with X-Ray traces
- Click a node → see latency, errors, logs
- End-to-end request flow visualisation
Container Insights
- Pre-built dashboards for ECS, EKS, Fargate
- Cluster/service/task/pod-level metrics
- CPU, memory, network, disk per container
- Automatic discovery of running containers
- Performance log events for deep analysis
Application Insights
- ML-powered monitoring for .NET/Java/SQL Server workloads
- Auto-detects problems and correlates metrics
- Creates automated dashboards for app stacks
- Reduces MTTR by highlighting root cause
Contributor Insights
- Identify top-N contributors to a pattern
- "Top 10 IPs generating 5xx errors"
- "Top 5 Lambda functions by duration"
- Use with VPC Flow Logs, CloudTrail, any log
- Helps find noisy neighbours / hot keys
ServiceLens combines CloudWatch metrics + logs + X-Ray traces into a single application view. X-Ray traces requests across services (API Gateway → Lambda → DynamoDB) and ServiceLens overlays operational data on top.
| Component | What It Provides | How to Enable |
|---|---|---|
| X-Ray SDK | Trace segments per service (latency, errors, metadata) | Add SDK to app code + IAM permissions |
| Lambda Active Tracing | Auto-instrumented traces for Lambda | Enable checkbox in function config |
| API Gateway Tracing | Trace from API entry point | Stage settings → Enable X-Ray |
| X-Ray Daemon | Collects segments from EC2-based apps | Install daemon on EC2 instances |
| ServiceLens | Unified view of traces + metrics + logs | No extra setup (uses existing CW + X-Ray) |
Pricing: X-Ray costs $5 per 1M traces recorded and $0.50 per 1M traces retrieved. ServiceLens has no additional charge.
Lambda Insights is a performance monitoring solution for Lambda functions. It uses a Lambda Layer to collect detailed runtime metrics not available in basic Lambda metrics, without code changes.
| Metric | What It Measures |
|---|---|
| memory_utilization | Actual memory used vs allocated (basic only shows max allocated) |
| cpu_total_time | CPU utilisation: identifies CPU-bound functions |
| tmp_used | /tmp disk space: detect when approaching 512 MB limit |
| init_duration | Cold start duration: separate from execution time |
| rx_bytes / tx_bytes | Network I/O per invocation |
| total_network | Total network bandwidth consumed |
1. Add the Lambda Insights layer ARN to your function. 2. Grant IAM permissions (cloudwatch:PutMetricData). 3. Metrics appear in the LambdaInsights namespace. Integrates with ServiceLens for combined trace + metrics view.
When to use: Out-of-memory troubleshooting, cold start optimisation, CPU-bound identification, high-concurrency monitoring. Cost: Standard CloudWatch metrics pricing ($0.30/metric) + log ingestion for performance logs.
Synthetics canaries are configurable Node.js or Python scripts that run on a schedule to simulate user behaviour, proactively detecting issues before customers do.
Canary Blueprints
- Heartbeat: simple HTTP GET, verify endpoint is up
- API Canary: test authenticated API endpoints
- UI Canary: login → add to cart → checkout (Selenium/Playwright)
- Broken Link Checker: crawl site for 404s
- Visual Regression: screenshot comparison for UI changes
What Canaries Capture
- Success / failure status per run
- Screenshots of failure state
- HAR file (full network request waterfall)
- Execution logs for debugging
- CloudWatch metrics (SuccessPercent, Duration)
- Step-level timing breakdown
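Canary cost scales directly with run frequency, so it is worth doing the arithmetic before picking a schedule. The sketch below assumes a per-run price of roughly $0.0012 (the figure this guide quotes) and a 30-day month; the function name is made up.

```python
def canary_monthly_cost(interval_minutes, price_per_run=0.0012, days=30):
    """Return (runs per month, estimated USD per month) for one canary."""
    runs = (days * 24 * 60) // interval_minutes
    return runs, round(runs * price_per_run, 2)


runs, cost = canary_monthly_cost(5)
print(runs, cost)  # 8640 10.37

# Running the same canary from 3 regions triples both numbers.
print(runs * 3, round(cost * 3, 2))  # 25920 31.11
```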
Pricing: ~$0.0012 per canary run. A 5-minute canary makes 8,640 runs in a 30-day month, which comes to about $10.37/month (roughly $0.35/day). Multi-region: Run canaries from different regions to detect regional outages.
"Company needs to proactively detect if their checkout page is broken before customers report it" → CloudWatch Synthetics canary that logs in, adds item to cart, and completes checkout flow on a 5-minute schedule.
CloudWatch RUM captures actual user performance data via a lightweight JavaScript snippet added to your web application, measuring what real users experience, not just synthetic tests.
Metrics Captured
- Core Web Vitals: LCP, FID, CLS (SEO-critical)
- Page load timing (Navigation Timing API)
- JavaScript errors (uncaught exceptions)
- XHR/Fetch request failures and latency
- Session and user journey tracking
X-Ray Integration
- Correlate front-end sessions with backend X-Ray traces
- End-to-end: user click → API GW → Lambda → DynamoDB
- Identify if latency is client-side or server-side
- Segment by region, browser, device type
How to enable: Copy-paste the RUM JavaScript snippet into your web app. No server-side changes needed. Pricing: $1 per 100,000 events (1 page view = 1 event).
"Application seems slow for users in Australia but synthetic tests from us-east-1 pass fine" → Enable CloudWatch RUM to see real user data segmented by geographic region, which reveals regional latency issues invisible to synthetic tests.
Evidently provides feature flags, A/B testing, and controlled rollouts with automatic metric tracking, all integrated into CloudWatch.
| Capability | Use Case |
|---|---|
| Feature Flags | Toggle features on/off without redeployment |
| Gradual Rollout | 1% → 5% → 20% → 100% of users over time |
| A/B Testing | Compare conversion, revenue, or latency between variants |
| Overrides | Target specific users (beta testers, internal teams) |
| Auto-Rollback | If alarm triggers (error rate rising), revert to safe variation |
Pricing: $0.01 per 1,000 feature evaluations. $0.12 per 1,000 analysed events (experiments). Integrates with CloudWatch Alarms for automatic rollback if metrics degrade.
"Team wants to test new checkout UI on 10% of users first, with automatic rollback if error rate increases" → CloudWatch Evidently with percentage-based launch + CW Alarm trigger for rollback.
- "Single pane of glass across accounts" → Cross-account CloudWatch Dashboard
- "Monitor without AWS console access" → Dashboard sharing via public link or SSO
- "Dashboard cost" → First 3 free, then $3/month each
- "Container monitoring" → CloudWatch Container Insights (ECS/EKS)
- "Service map with traces" → ServiceLens (CloudWatch + X-Ray)
- "Top contributors / hot keys" → Contributor Insights
- "Auto-generated dashboards" → CloudWatch Automatic Dashboards (zero config)
- "Cross-region view" → Dashboard widgets can pull from any region
- "Proactive endpoint monitoring" → CloudWatch Synthetics canaries
- "Real user performance data" → CloudWatch RUM (client-side JavaScript)
- "Feature flags with metrics" → CloudWatch Evidently
- "Lambda memory/CPU deep metrics" → Lambda Insights (layer-based)
- "Trace request across microservices" → X-Ray + ServiceLens
- "Cold start duration metric" → Lambda Insights init_duration
CloudWatch Dashboards provide customisable, cross-account, cross-region visualisation with widgets (line, number, gauge, alarm, logs, markdown). First 3 dashboards are free, then $3/month. Share via console, public link, or SSO. For containers use Container Insights; for service maps with traces use ServiceLens + X-Ray; for top-N analysis use Contributor Insights. Synthetics canaries proactively test endpoints; RUM captures real user performance; Evidently enables safe feature rollouts. Lambda Insights provides deep per-function metrics (memory, CPU, cold starts).
Architecture Patterns & Cost Optimisation
CloudWatch becomes most powerful when you combine its primitives (metrics, alarms, logs, dashboards) into cohesive observability patterns. This chapter covers real-world architectures and cost strategies to keep monitoring affordable at scale.
Auto-Healing Pattern
- Custom metric → CloudWatch Alarm → SNS → Lambda
- Lambda restarts unhealthy EC2 / ECS task
- Composite alarm gates on multiple health signals
- EventBridge rule as alternative action trigger
Auto-Scaling Pattern
- Target tracking → CloudWatch alarm (auto-created)
- Step scaling → manual alarm thresholds
- Custom metric (queue depth / p99 latency) drives scaling
- Cooldown period prevents alarm flapping
Centralised Logging Pattern
- All accounts → CloudWatch Logs via unified agent
- Cross-account subscription → central Kinesis / S3
- Metric filters extract KPIs from structured logs
- Logs Insights for ad-hoc root-cause investigation
Security Monitoring Pattern
- CloudTrail → CloudWatch Logs → metric filters
- Detect root login, IAM changes, SG modifications
- Alarm → SNS → security team / incident workflow
- Pair with GuardDuty for ML-based threat detection
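The root-login detection above might be wired up roughly as follows. The sketch only builds request parameters for the CloudWatch Logs `put_metric_filter` and CloudWatch `put_metric_alarm` calls. The filter, metric, and namespace names are hypothetical, and the filter pattern is the commonly used root-account-usage pattern for CloudTrail events.

```python
ROOT_USAGE_PATTERN = (
    '{ $.userIdentity.type = "Root" '
    '&& $.userIdentity.invokedBy NOT EXISTS '
    '&& $.eventType != "AwsServiceEvent" }'
)


def root_usage_metric_filter(log_group: str) -> dict:
    """Parameters for logs.put_metric_filter on a CloudTrail log group."""
    return {
        "logGroupName": log_group,
        "filterName": "RootAccountUsage",          # hypothetical name
        "filterPattern": ROOT_USAGE_PATTERN,
        "metricTransformations": [{
            "metricName": "RootAccountUsageCount",  # hypothetical name
            "metricNamespace": "Security",          # hypothetical namespace
            "metricValue": "1",                     # count one per matching event
        }],
    }


def root_usage_alarm(topic_arn: str) -> dict:
    """Parameters for cloudwatch.put_metric_alarm that notifies an SNS topic."""
    return {
        "AlarmName": "RootAccountUsage",
        "Namespace": "Security",
        "MetricName": "RootAccountUsageCount",
        "Statistic": "Sum",
        "Period": 300,
        "EvaluationPeriods": 1,
        "Threshold": 1,
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        "TreatMissingData": "notBreaching",  # no events means no root activity
        "AlarmActions": [topic_arn],
    }
```

Treating missing data as not breaching keeps the alarm in OK during quiet periods, which is usually what a security team wants for a count-style metric.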
Key components for multi-account observability:
- Observability Access Manager (OAM): create a monitoring account sink, then link source accounts. Shared data: metrics, logs, X-Ray traces
- Cross-account dashboards: single dashboard with widgets pulling from multiple accounts and regions
- Cross-account alarms: alarm in monitoring account evaluates metrics from source accounts
- Cross-account log queries: Logs Insights spans multiple account log groups simultaneously
In the monitoring account, create a sink. In each source account, create a link pointing to that sink. Choose which telemetry types to share (metrics, logs, traces). OAM is region-scoped, so configure it in each region.
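The source-account side of that setup could look something like this sketch. It builds the parameters for the OAM `create_link` call. The sink ARN is a made-up placeholder, and the short telemetry names (`metrics`, `logs`, `traces`) are a convenience introduced here, not part of the API.

```python
from typing import Iterable

# Maps the telemetry types named in the text to OAM resource type strings.
RESOURCE_TYPES = {
    "metrics": "AWS::CloudWatch::Metric",
    "logs": "AWS::Logs::LogGroup",
    "traces": "AWS::XRay::Trace",
}


def link_request(sink_arn: str, telemetry: Iterable[str],
                 label_template: str = "$AccountName") -> dict:
    """Parameters for oam.create_link, run in each source account and region."""
    return {
        "SinkIdentifier": sink_arn,
        "LabelTemplate": label_template,  # how the source shows up in the monitoring account
        "ResourceTypes": [RESOURCE_TYPES[t] for t in telemetry],
    }


# Hypothetical sink ARN; a real one comes from creating the sink in the monitoring account.
request = link_request(
    "arn:aws:oam:us-east-1:111111111111:sink/EXAMPLE",
    ["metrics", "logs"],
)
```

Because OAM is region-scoped, the same request would be repeated with a region-specific sink ARN for every region that should be shared.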
| Dimension | Free Tier | Paid Pricing |
|---|---|---|
| Custom Metrics | 10 metrics/month | $0.30/metric/month (first 10K) |
| Alarms | 10 standard alarms | $0.10/alarm metric/month (standard), $0.30 (high-res), $0.50 (composite) |
| Dashboards | 3 dashboards (50 metrics each) | $3.00/dashboard/month |
| Log Ingestion | 5 GB/month | $0.50/GB ingested |
| Log Storage | 5 GB (first month) | $0.03/GB/month archived |
| Logs Insights | None | $0.005/GB scanned |
| API Requests | 1M requests (GetMetricData excluded) | $0.01 per 1,000 metrics requested via GetMetricData |
| Anomaly Detection | None | $0.30/metric/month (same as custom) |
| Contributor Insights | 1 rule (first month) | $0.50/rule/month + $0.02 per million matching log events |
| Metric Streams | None | $0.003/1,000 metric updates |
High-resolution (1s) metrics are stored at full fidelity for 3 hours, then aggregated to 1-min for 15 days, 5-min for 63 days, 1-hour for 455 days. High-res alarms cost 3× as much as standard alarms ($0.30 vs $0.10). Only use 1-second resolution for latency-critical workloads.
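The retention tiers described above can be expressed as a small lookup. This is only a sketch of the roll-up schedule as stated in the text; the function name is made up, and the 1-second tier applies only to metrics that were published at high resolution.

```python
from datetime import timedelta
from typing import Optional

# (maximum data age, finest period in seconds still available)
RETENTION_TIERS = [
    (timedelta(hours=3), 1),
    (timedelta(days=15), 60),
    (timedelta(days=63), 300),
    (timedelta(days=455), 3600),
]


def finest_period_seconds(age: timedelta) -> Optional[int]:
    """Return the finest granularity retained for data of the given age."""
    for max_age, period in RETENTION_TIERS:
        if age <= max_age:
            return period
    return None  # older than 455 days: no longer retained by CloudWatch
```

For example, a data point from 30 days ago can be queried at 5-minute granularity at best, which matters when choosing the `Period` for a historical dashboard widget.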
Reduce Log Costs
- Set retention policies: default is never expire; set 7/14/30/90 days per group
- Use Infrequent Access class: 50% cheaper ingestion for compliance-only logs
- Filter at agent level: drop DEBUG/TRACE before ingestion
- Archive to S3: export old logs via subscription filter or export task
- Compress payloads: CW agent supports gzip
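To get a feel for how retention and log class affect the bill, here is a back-of-the-envelope estimator. It uses the list prices from the table above ($0.50/GB ingestion, $0.03/GB-month storage) and the 50% Infrequent Access ingestion discount. It assumes a 30-day month, ignores the free tier, and approximates steady-state storage as daily volume times retention days, so treat the output as a rough guide only.

```python
STANDARD_INGEST_PER_GB = 0.50
IA_INGEST_PER_GB = 0.25       # 50% cheaper ingestion, per the list above
STORAGE_PER_GB_MONTH = 0.03


def monthly_log_cost(gb_per_day: float, retention_days: int,
                     infrequent_access: bool = False) -> float:
    """Rough monthly cost in USD for one log group at steady state."""
    ingest_rate = IA_INGEST_PER_GB if infrequent_access else STANDARD_INGEST_PER_GB
    ingested_gb = gb_per_day * 30             # assumed 30-day month
    stored_gb = gb_per_day * retention_days   # steady-state volume on disk
    return round(ingested_gb * ingest_rate + stored_gb * STORAGE_PER_GB_MONTH, 2)
```

With these assumptions, ingestion dominates the total, which is why dropping DEBUG/TRACE at the agent usually saves more than shortening retention.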
Reduce Metric Costs
- Embedded Metric Format (EMF): extract metrics from logs without PutMetricData API calls
- Avoid unnecessary high-res: high-resolution alarms cost more than standard ones, and per-second publishing multiplies PutMetricData calls
- Consolidate with dimensions: one metric name + dimensions vs. many metric names
- Remove stale alarms: each alarm incurs monthly cost
- Use metric math: derive values instead of publishing more raw metrics
Operational Savings
- Automatic dashboards: free, zero-config service dashboards
- Anomaly detection: fewer static thresholds to maintain
- Composite alarms: one alarm tree replaces many SNS subscriptions
- CloudWatch Agent: replaces third-party agents (no licence cost)
- Metric Streams → S3: cheaper long-term metric storage than CW retention
Common Cost Traps
- Verbose logging: DEBUG in production can generate TB/month
- Unlimited retention: forgotten log groups accumulate storage cost forever
- High-res everywhere: 1-second resolution on non-critical metrics
- API polling dashboards: excessive GetMetricData calls from auto-refresh
- Cross-region transfer: streaming logs/metrics across regions adds data-transfer fees
EMF lets you embed custom metric definitions inside structured JSON log events. CloudWatch automatically extracts and publishes the metrics, so no PutMetricData API calls are needed. You still pay for log ingestion, and each extracted metric is billed as a custom metric.
EMF is supported in Lambda (natively), ECS, EKS, and EC2 (via CloudWatch agent). Ideal for high-volume or high-cardinality scenarios where PutMetricData API call volume would be expensive.
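A minimal sketch of what an EMF log event looks like, built with the standard library only. The namespace, dimension, and metric names are hypothetical. In Lambda, printing the resulting JSON line to stdout is typically enough for CloudWatch to pick it up.

```python
import json
import time
from typing import Dict, Optional, Tuple


def emf_record(namespace: str, dimensions: Dict[str, str],
               metrics: Dict[str, Tuple[float, str]],
               timestamp_ms: Optional[int] = None) -> str:
    """Serialise one EMF log event; `metrics` maps name -> (value, unit)."""
    record: dict = {
        "_aws": {
            "Timestamp": timestamp_ms if timestamp_ms is not None
            else int(time.time() * 1000),
            "CloudWatchMetrics": [{
                "Namespace": namespace,
                # One dimension set containing every dimension key.
                "Dimensions": [list(dimensions)],
                "Metrics": [{"Name": name, "Unit": unit}
                            for name, (_, unit) in metrics.items()],
            }],
        },
    }
    # Dimension values and metric values live at the top level of the event.
    record.update(dimensions)
    record.update({name: value for name, (value, _) in metrics.items()})
    return json.dumps(record)


# Hypothetical usage inside a request handler:
print(emf_record("MyApp", {"Service": "checkout"},
                 {"Latency": (123.0, "Milliseconds")}))
```

Extra top-level keys that are not declared as dimensions or metrics (a request ID, for instance) stay searchable in Logs Insights without creating additional metrics.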
Metric Streams provide near real-time, continuous streaming of CloudWatch metrics to a destination, eliminating the need for polling via API.
| Use Case | How Metric Streams Helps |
|---|---|
| Long-term retention (>15 months) | Stream to S3 → lifecycle to Glacier for compliance |
| Third-party monitoring | Direct delivery to Datadog/Splunk; no custom polling code |
| Reduce API costs | Push-based replaces expensive GetMetricData polling |
| Cross-account aggregation | Stream to central Kinesis → unified analytics |
| Self-managed Prometheus | Stream → Firehose → Prometheus remote-write endpoint |
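For the long-term retention row, the S3 side might be configured along these lines. The sketch builds a lifecycle configuration suitable for the S3 `put_bucket_lifecycle_configuration` call. The rule ID and key prefix are hypothetical, and the 30-day transition is an illustrative default.

```python
def glacier_lifecycle(prefix: str, transition_days: int = 30) -> dict:
    """Lifecycle configuration that moves streamed metric objects to Glacier."""
    return {
        "Rules": [{
            "ID": "metrics-to-glacier",        # hypothetical rule name
            "Status": "Enabled",
            "Filter": {"Prefix": prefix},      # only objects written by the stream
            "Transitions": [
                {"Days": transition_days, "StorageClass": "GLACIER"},
            ],
        }],
    }


# Hypothetical prefix used by the delivery stream that receives the metrics.
lifecycle = glacier_lifecycle("cloudwatch-metrics/")
```

The dict would be passed as `LifecycleConfiguration` together with the bucket name. No expiration rule is included, so objects are kept until one is added.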
Key features: Filter by namespace/metric/dimension. Supports OpenTelemetry 0.7 and JSON output formats. Automatic batching and compression. Pricing: $0.003 per 1,000 metric updates streamed.
"Company needs to retain EC2 CPU metrics for 5 years (compliance), but CloudWatch only retains for 15 months" → Metric Streams → S3 bucket → S3 lifecycle rule (30 days → Glacier) for cost-effective long-term retention.
| Factor | CloudWatch | Third-Party (Datadog, New Relic, etc.) |
|---|---|---|
| Integration | Native, zero-setup for AWS services | Requires agents / API keys / IAM roles |
| Multi-cloud | AWS only | AWS + Azure + GCP + on-prem |
| Log querying | Logs Insights (good, not Splunk-level) | Advanced analytics, ML-powered search |
| Pricing | Pay per metric / GB / alarm | Per host / per-GB / per-user |
| Alerting | Alarms + composite + anomaly detection | Advanced correlation, AIOps |
| APM / Tracing | X-Ray (separate service, integrated via ServiceLens) | Built-in APM with code-level profiling |
| Best for | AWS-native, cost-sensitive workloads | Multi-cloud, advanced analytics needs |
The CloudWatch Unified Agent replaces the legacy CloudWatch Logs agent and can also ingest collectd and StatsD metrics. It collects both metrics and logs from EC2 instances and on-premises servers.
Agent: Metrics
- CPU (per-core), RAM, disk I/O, network, swap
- Process-level: memory, CPU per process name
- collectd / StatsD protocol support
- Publishes to custom namespace (e.g., CWAgent)
Agent: Logs
- Tail any file path → CloudWatch Logs group
- Multi-line pattern matching (e.g., Java stack traces)
- Timestamp extraction from log lines
- Supports Windows Event Log collection
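Generate the JSON config file with the amazon-cloudwatch-agent-config-wizard or write it by hand, and store it in SSM Parameter Store (recommended) so that fleets share one definition. Use amazon-cloudwatch-agent-ctl to load the config and start/stop the agent. Deploy at scale with SSM Run Command across fleets.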
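A hand-written agent config might look roughly like the output of this sketch, which assembles the JSON in Python so the structure is easy to see. The log path, log group name, and timestamp format are hypothetical; the section and key names follow the agent's configuration file schema.

```python
import json


def agent_config(log_path: str, log_group: str, namespace: str = "CWAgent") -> dict:
    """Build a minimal unified-agent config with host metrics and one log file."""
    return {
        "agent": {"metrics_collection_interval": 60},
        "metrics": {
            "namespace": namespace,
            "metrics_collected": {
                "cpu": {"measurement": ["cpu_usage_idle", "cpu_usage_user"],
                        "totalcpu": True},
                "mem": {"measurement": ["mem_used_percent"]},
                "disk": {"measurement": ["used_percent"], "resources": ["/"]},
                "swap": {"measurement": ["swap_used_percent"]},
                "statsd": {"service_address": ":8125"},
            },
        },
        "logs": {
            "logs_collected": {"files": {"collect_list": [{
                "file_path": log_path,
                "log_group_name": log_group,
                "log_stream_name": "{instance_id}",
                # Assumed line format, e.g. "2024-01-31 12:00:00 ERROR ..."
                "timestamp_format": "%Y-%m-%d %H:%M:%S",
                # Lines that do not start with a timestamp (stack traces)
                # are appended to the previous event.
                "multi_line_start_pattern": "{timestamp_format}",
            }]}},
        },
    }


config_json = json.dumps(agent_config("/var/log/app/app.log", "my-app-logs"), indent=2)
```

Once saved to a file, the config can be applied with `amazon-cloudwatch-agent-ctl -a fetch-config -m ec2 -s -c file:<path>`; the same JSON can instead be stored as an SSM parameter and referenced with the `ssm:` prefix.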
- "Reduce CloudWatch costs" → Set log retention, use Infrequent Access log class, filter verbose logs at agent level
- "Cheapest way to monitor" → Use basic (5-min) resolution, stay within free tier (10 metrics, 10 alarms, 3 dashboards, 5 GB logs)
- "Cross-account observability" → OAM (Observability Access Manager): sink in monitoring account, links from source accounts
- "Embedded metric format" → Publish custom metrics from structured log data without PutMetricData API calls
- "Log class selection" → Standard for real-time querying; Infrequent Access for storage-only / compliance
- "Unified agent vs Logs agent" → Unified Agent collects both metrics + logs; Logs agent is legacy (logs only)
- "Metric resolution trade-off" → High-res (1s) costs more, stored 3 hours at full fidelity; standard (60s) stored 15 days
- "Auto-healing architecture" → CW Alarm → SNS → Lambda → restart/replace resource
- "Metric Streams" → Near real-time metric delivery to S3/Firehose/third-party (Datadog, Splunk)
CloudWatch costs scale with data volume, so control log ingestion via retention policies, Infrequent Access class, and agent-level filtering. Use EMF to extract metrics from logs without PutMetricData API calls. For multi-account setups, use OAM to centralise observability into a monitoring account. Common patterns include auto-healing (alarm → SNS → Lambda), auto-scaling (custom metric → alarm → ASG), and centralised logging (subscription filters → Kinesis/S3). Choose CloudWatch for AWS-native monitoring; complement with third-party tools for multi-cloud or advanced APM.
CloudWatch: Complete Domain Summary
- Metrics: namespaces, dimensions, custom metrics, high-resolution (1s), anomaly detection, metric math, embedded metric format, Metric Streams
- Alarms: standard, composite, anomaly-based; actions via SNS, Auto Scaling, EC2, SSM; evaluation periods, datapoints-to-alarm, missing data
- Logs: log groups, streams, unified agent, retention, metric filters, subscription filters, Logs Insights, Live Tail, Infrequent Access class
- Dashboards: widgets, cross-account, cross-region, sharing, automatic dashboards, Container Insights, ServiceLens + X-Ray, Contributor Insights
- Advanced Observability: Synthetics canaries (proactive endpoint testing), RUM (real user metrics), Evidently (feature flags/A/B), Lambda Insights (deep function metrics)
- Patterns: auto-healing, auto-scaling, centralised logging, security monitoring, multi-account via OAM
- Cost: free tier limits, log retention & IA class, EMF, Metric Streams to S3, avoid high-res everywhere, remove unused alarms