Amazon ECS

Amazon ECS: Elastic Container Service

Fully managed container orchestration. Run Docker containers at scale with EC2 or Fargate: you define your containers, and ECS handles scheduling, placement, and lifecycle.

⚑ ECS in 30 Seconds

  • Run Docker containers on AWS β€” ECS handles orchestration, scheduling, and placement
  • Fargate = serverless containers (no servers to manage). EC2 launch type = you manage the instances
  • Services keep desired task count running, auto-restart failed tasks, integrate with ALB
  • Task Definitions define your container blueprint: image, CPU, memory, networking, IAM roles
  • Deep integration with ALB, ECR, CloudWatch, IAM, Secrets Manager, and X-Ray
01
Chapter One

What is ECS

What Are Containers Introductory

A container packages your application code together with all its dependencies (runtime, libraries, system tools) into a single, portable unit. Unlike a virtual machine, a container shares the host operating system's kernel. This makes containers lightweight (megabytes, not gigabytes), fast to start (seconds, not minutes), and identical across environments (your laptop = staging = production).

Docker is the standard container runtime. You define a Dockerfile, build an image, and run it as a container. The image is immutable, so the same image produces the same behavior everywhere.

🖥️

Virtual Machine

  • Full OS per VM (kernel + userspace)
  • Gigabytes in size
  • Minutes to boot
  • Strong isolation (separate kernels)
  • Managed by hypervisor (e.g., Nitro)
📦

Container

  • Shares host OS kernel
  • Megabytes in size
  • Seconds to start
  • Process-level isolation (cgroups, namespaces)
  • Managed by container runtime (Docker)

👉 Key mental model: A VM virtualizes the hardware. A container virtualizes the OS. Containers are lighter-weight but share a kernel, so a kernel vulnerability exposes every container on that host. VMs have stronger isolation boundaries.

Why You Need Orchestration Introductory

Running one container on your laptop is easy. Running 200 containers across 50 servers in production (keeping them healthy, distributing traffic, replacing failures, scaling up at peak, and rolling out updates without downtime) is not something Docker alone can do. That is the orchestration problem.

📍

Scheduling

Which server should this container run on? The orchestrator picks the best host based on available CPU, memory, and placement constraints.

💚

Health & Recovery

Container crashed? The orchestrator detects the failure and starts a replacement automatically. No manual intervention.

📈

Scaling

Traffic spikes? The orchestrator launches more container instances. Traffic drops? It scales back down. It keeps the desired count running at all times.

An orchestrator solves: where to place containers, how many to run, when to replace them, and how to update them without downtime. ECS is AWS's answer to this problem.

What ECS Provides Core

Amazon ECS (Elastic Container Service) is a fully managed container orchestration service. You define your containers, ECS handles the rest:

✅

ECS Manages

  • Control plane: scheduling, placement, lifecycle
  • Task management: run, stop, replace containers
  • Service management: maintain desired task count
  • Load balancer integration: register/deregister targets
  • Rolling deployments: update without downtime
  • Scaling: auto-adjust task count based on metrics
👀

You Define

  • Container image: Docker image from ECR or Docker Hub
  • Resource requirements: CPU and memory per container
  • Networking: VPC, subnets, security groups
  • IAM roles: permissions for your containers
  • Launch type: EC2 (you manage servers) or Fargate (serverless)
  • Desired count: how many task copies to run

The critical point: ECS is the control plane only. It does not run your containers itself. It tells EC2 instances or Fargate to run them. Think of ECS as the "brain" that decides what runs where; the compute comes from your chosen launch type.

ECS vs Docker Compose vs Kubernetes Core

When should you pick ECS over other container orchestrators? This comparison covers the three most common choices on AWS:

Feature | Docker Compose | ECS | EKS (Kubernetes)
What it is | Local multi-container tool | AWS-managed orchestrator | AWS-managed Kubernetes
Scale | 1 machine | Thousands of containers | Thousands of containers
Learning curve | Low | Medium | High
Multi-host | No | Yes | Yes
Auto healing | Basic restart | Full (replace + reschedule) | Full (pod restart + reschedule)
AWS integration | None | Deep (IAM, ALB, ECR, CloudWatch) | Good (via add-ons)
Portability | Docker standard | AWS-only | Multi-cloud (K8s standard)
Control plane cost | Free | Free | ~$72/month per cluster
Best for | Local dev, small projects | AWS-native production workloads | Multi-cloud, existing K8s teams

👉 Rule of thumb: If your team is on AWS and does not already use Kubernetes, choose ECS. It is simpler, has a free control plane, and integrates more deeply with AWS. Choose EKS only if you need Kubernetes portability across clouds or have existing K8s expertise. Choose Docker Compose only for local development.

Mental Model for ECS Introductory

Think of ECS as a restaurant kitchen:

📋

Task Definition = Recipe

Specifies what to cook: ingredients (image), portion size (CPU/memory), instructions (environment variables, commands).

🍽️

Task = A Plate of Food

One running instance of the recipe. Each plate is independent. If one drops, the kitchen makes another.

👨‍🍳

Service = The Head Chef

Ensures "always 5 plates ready." If a plate breaks, the chef makes a new one. If demand spikes, the chef makes more. ECS Service = desired count manager.

🏗️

Cluster = The Kitchen

The physical space where everything runs. Can be your own equipment (EC2 instances) or the restaurant's built-in kitchen (Fargate, where you don't manage the ovens).

📦

Container = One Dish Component

A single container inside a task. A task can have multiple containers, like a plate with main course + side dish running together.

Concept Diagram: Container vs VM Introductory
Containers vs Virtual Machines: Architecture Comparison
[Diagram. Virtual machines: infrastructure (hardware), then a hypervisor (Nitro / Xen), then three VMs, each with its own app, bins/libs, and guest OS (full kernel, ~GB size, minutes to boot). Containers: infrastructure (hardware), then a shared host OS kernel, then a container runtime (Docker), then three containers, each with only an app and bins/libs (no guest OS, ~MB size, seconds to start).]
AWS Diagram: ECS in the Container Ecosystem Core
ECS Container Ecosystem: Build, Store, Run, Serve
[Diagram. A developer runs docker build and docker push to ECR (container registry, image store). Amazon ECS (control plane: schedule, place, manage lifecycle) pulls images from ECR and runs them on the EC2 or Fargate launch types. An ALB (load balancer) routes user traffic. CloudWatch (Logs · Container Insights) and X-Ray observe. Summary: ECR stores images · ECS orchestrates · ALB routes traffic · CloudWatch observes.]
Architecture Diagram: Simple Web App on ECS Core
Production Web App: ALB + ECS Fargate across 2 Availability Zones
[Diagram. Internet users reach an Application Load Balancer over HTTPS :443. Inside the VPC, ECS service web-api (desired: 4) runs Fargate tasks 1 and 2 in an AZ-a private subnet and tasks 3 and 4 in an AZ-b private subnet. Each task runs web-api:v2 with 0.5 vCPU · 1GB. The database is RDS PostgreSQL (Multi-AZ). With 4 tasks spread across 2 AZs, either AZ can fail without downtime.]

This is the most common ECS production pattern: an ALB distributes traffic to Fargate tasks running in private subnets across two Availability Zones. If an entire AZ goes down, the remaining tasks continue serving traffic. The ECS Service automatically replaces failed tasks and maintains the desired count of 4.

ECS vs Lambda vs EC2: When to Use Which Core
Feature | Lambda | ECS (Fargate) | EC2
Model | Serverless functions | Serverless containers | Virtual machines
Max duration | 15 minutes | Unlimited | Unlimited
Max memory | 10 GB | 120 GB (16 vCPU) | Terabytes (instance-dependent)
Startup latency | ~100ms (warm) / 1-10s (cold) | 30-60 seconds | Minutes
Pricing | Per request + duration | Per vCPU/memory per second | Per instance-hour
Scaling | Auto (per-request, 1000s concurrently) | Auto (task count, seconds) | Auto (instance count, minutes)
Container support | Container images (read-only) | Full Docker support | Full Docker / any runtime
Persistent storage | /tmp (ephemeral, up to 10 GB); EFS mount | EFS (shared) | EBS, EFS, instance store
GPU | No | No (Fargate) / Yes (EC2 type) | Yes
Best for | Event handlers, APIs <15min, glue code | Microservices, APIs, workers | Stateful apps, GPU, full OS control
When NOT to Use ECS Core
⚑

Simple APIs / Event Handlers

If your workload is short-lived (<15 min), event-driven, and stateless β†’ use Lambda. No container to manage, no task definitions, no service configuration. Lambda is simpler and cheaper for request-response patterns.

☸️

Existing Kubernetes Teams

If your team already uses Kubernetes and needs multi-cloud portability β†’ use EKS. ECS is AWS-only. Migrating K8s manifests to ECS task definitions is non-trivial.

πŸ–₯️

Stateful / Legacy Workloads

If your app needs persistent local disk, specific OS configuration, or isn't containerized β†’ use EC2 directly. ECS requires Docker images. Some legacy middleware won't containerize easily.

🎓 Exam Tips: Chapter 01
  • ECS control plane is free. You only pay for the EC2 instances or Fargate tasks, not for ECS itself. EKS charges ~$72/month per cluster.
  • ECS vs EKS: If the question says "simplest" or "least operational overhead" for containers on AWS → ECS + Fargate. If it says "Kubernetes" or "multi-cloud" → EKS.
  • ECS vs Lambda: Lambda is per-request, max 15 min, max 10GB memory. ECS is for long-running services, larger workloads, or when you need full Docker compatibility.
  • Fargate = serverless containers. If the question mentions "no server management" with containers → Fargate. If it says "GPU" or "daemon" → must use EC2 launch type.
  • Distractor: "Docker Compose can scale to production on AWS" is wrong. Compose is single-host only and has no auto-healing or multi-AZ support.
📋 Chapter 1: Summary
  • Containers package app + dependencies into portable units. Lighter than VMs, seconds to start.
  • Orchestration solves scheduling, health recovery, scaling, and zero-downtime deploys.
  • ECS is the AWS-managed orchestrator: free control plane, deep AWS integration.
  • Two launch types: EC2 (you manage servers) and Fargate (serverless).
  • ECS vs EKS: ECS for AWS-native simplicity, EKS for Kubernetes portability.
  • Production pattern: ALB → ECS Fargate tasks across Multi-AZ.
02
Chapter Two

Core Concepts

The Five Building Blocks Introductory

ECS has five core entities. Four form a clear hierarchy (Cluster → Service → Task → Container), and the fifth, the Task Definition, serves as the blueprint. Understanding how these relate is the single most important concept in ECS.

Cluster Core

A cluster is the logical grouping that holds all your ECS resources. It is the top-level boundary: services, tasks, and capacity all live inside a cluster. A cluster does not contain compute by itself. You register EC2 instances to it, or use Fargate, which provisions compute on demand.

📦

What a Cluster Contains

  • One or more services (long-running containers)
  • Standalone tasks (one-off jobs)
  • Registered EC2 instances (EC2 launch type) or Fargate capacity
  • Capacity provider strategies
💡

Key Facts

  • A cluster is free: there is no cost for the cluster itself
  • You can have multiple clusters per account (one per environment is common)
  • A cluster can mix EC2 and Fargate launch types
  • A default cluster is created automatically, but create named clusters for production

Common pattern: one cluster per environment (dev, staging, production). Each cluster has its own services and capacity, providing isolation between environments.

Task Definition Core

A task definition is a JSON document that describes how to run your container(s). Think of it as a blueprint or recipe: it specifies the Docker image, CPU and memory requirements, networking mode, IAM roles, environment variables, log configuration, and more. You never run a task definition directly; you use it to launch tasks.

🖼️

Container Image

Which Docker image to pull, from ECR, Docker Hub, or any registry. Example: 123456.dkr.ecr.us-east-1.amazonaws.com/web-api:v2

⚙️

Resource Limits

CPU and memory per task. Fargate has fixed combinations (e.g., 0.5 vCPU / 1GB). EC2 launch type is more flexible.

🔌

Networking & Ports

Network mode (awsvpc, bridge, host), port mappings, and security group assignments.

👉 Task definitions are versioned. Each update creates a new revision (e.g., web-api:1, web-api:2, web-api:3). You point your service at a specific revision. Rolling back = pointing the service to a previous revision. Old revisions are never deleted automatically.

The key fields of a minimal JSON task definition, which every ECS user must understand:

Task Definition: Key Fields (Visual Breakdown)
Task Definition: web-api:3 (revision 3 · Fargate · awsvpc)

Task-Level Settings
  • "cpu": 512 (0.5 vCPU)
  • "memory": 1024 (1 GB)
  • "networkMode": "awsvpc"
  • "requiresCompatibilities": ["FARGATE"]

IAM Roles
  • "taskRoleArn": your container's AWS permissions (access S3, DynamoDB, SQS...)
  • "executionRoleArn": used by the ECS agent to pull the image and push logs

containerDefinitions[0]: web-api (essential: true)
  • "image": "123456.dkr.ecr...amazonaws.com/web-api:v2"
  • "portMappings": [{"containerPort": 8080}]
  • "environment": [{"name":"DB_HOST","value":"rds.xxx"}]
  • "secrets": [{"name":"DB_PASS","valueFrom":"arn:aws:ssm..."}]
  • "logConfiguration": {"logDriver": "awslogs", "options": {"awslogs-group": "/ecs/web-api"}}

Sidecar: xray-daemon (essential: false, optional)
  • "image": "amazon/aws-xray-daemon"
  • "portMappings": [{"containerPort":2000, "protocol":"udp"}]

Multi-container task: the app and the sidecar share the network.
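The fields above can be assembled into a complete document. The following is a minimal sketch in Python that builds the task definition as a dictionary and prints it as JSON. The account ID, region, role names, DB host, and SSM parameter path are hypothetical placeholders chosen for illustration, not values from this guide.

```python
import json

# Hypothetical placeholders for illustration only.
ACCOUNT = "123456789012"
REGION = "us-east-1"

task_definition = {
    "family": "web-api",
    "requiresCompatibilities": ["FARGATE"],
    "networkMode": "awsvpc",
    "cpu": "512",      # task-level CPU units: 512 = 0.5 vCPU
    "memory": "1024",  # task-level memory in MiB: 1024 = 1 GB
    # Task role: permissions for the application code.
    "taskRoleArn": f"arn:aws:iam::{ACCOUNT}:role/web-api-task-role",
    # Execution role: permissions ECS needs to pull the image and push logs.
    "executionRoleArn": f"arn:aws:iam::{ACCOUNT}:role/ecs-execution-role",
    "containerDefinitions": [
        {
            "name": "web-api",
            "image": f"{ACCOUNT}.dkr.ecr.{REGION}.amazonaws.com/web-api:v2",
            "essential": True,
            "portMappings": [{"containerPort": 8080, "protocol": "tcp"}],
            "environment": [{"name": "DB_HOST", "value": "db.example.internal"}],
            "secrets": [{
                "name": "DB_PASS",
                "valueFrom": f"arn:aws:ssm:{REGION}:{ACCOUNT}:parameter/web-api/db-pass",
            }],
            "logConfiguration": {
                "logDriver": "awslogs",
                "options": {
                    "awslogs-group": "/ecs/web-api",
                    "awslogs-region": REGION,
                    "awslogs-stream-prefix": "web-api",
                },
            },
        },
        {
            "name": "xray-daemon",
            "image": "amazon/aws-xray-daemon",
            "essential": False,  # sidecar may exit without stopping the task
            "portMappings": [{"containerPort": 2000, "protocol": "udp"}],
        },
    ],
}


def essential_containers(td):
    """Names of containers whose exit stops the whole task."""
    return [c["name"] for c in td["containerDefinitions"] if c.get("essential", True)]


if __name__ == "__main__":
    print(json.dumps(task_definition, indent=2))
    print("essential:", essential_containers(task_definition))
```

The printed JSON has the shape accepted when registering a task definition; each registration under the same family would create the next revision.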
Task Core

A task is a running instance of a task definition. When ECS launches a task, it pulls the Docker image, allocates CPU/memory, assigns an ENI (in awsvpc mode), and starts the container(s). A task can contain one container (most common) or multiple (sidecar pattern).

🔄

Task Lifecycle

  • PROVISIONING → allocating resources (ENI, storage)
  • PENDING → pulling image, starting containers
  • RUNNING → containers executing
  • STOPPED → container exited (success or failure)
💡

Key Facts

  • Each task gets its own private IP (awsvpc mode)
  • Tasks are ephemeral: they can be replaced anytime
  • Essential container exits → entire task stops
  • Non-essential sidecar can crash without killing the task
Service Core

An ECS service maintains a desired count of running tasks. If a task crashes, the service replaces it. If you want 4 copies running at all times, the service continuously works to keep 4 healthy tasks running. Services also integrate with load balancers, automatically registering and deregistering tasks as targets.

🎯

Desired Count

"Run 4 tasks." If one dies, service launches a 5th to replace it. If you scale to 8, service launches 4 more. Always maintains the target.

βš–οΈ

Load Balancer

Service registers each task's IP:port with the ALB target group. When a task starts β†’ registered. When it stops β†’ deregistered. Zero manual work.

πŸš€

Deployments

Update the service's task definition β†’ rolling deployment. New tasks start, old tasks drain. Configurable via minimumHealthyPercent and maximumPercent.

👉 Service vs standalone task: Use a service for long-running workloads (web servers, APIs, workers). Use a standalone task for one-off jobs (database migration, scheduled batch, data export). The service restarts failed tasks. A standalone task runs once and stops.
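To make the two deployment settings concrete, here is a small sketch of how minimumHealthyPercent and maximumPercent translate into task-count limits during a rolling update. It assumes the documented rounding (lower bound rounded up, upper bound rounded down) and the common defaults of 100 and 200; the function name is made up for this example.

```python
def deployment_bounds(desired_count, minimum_healthy_percent=100, maximum_percent=200):
    """Return (min_running, max_running) task counts allowed during a rolling update."""
    # Lower bound rounds up: ceil(desired * pct / 100), done with integer math.
    min_running = -(-desired_count * minimum_healthy_percent // 100)
    # Upper bound rounds down: floor(desired * pct / 100).
    max_running = desired_count * maximum_percent // 100
    return min_running, max_running


if __name__ == "__main__":
    # (minimumHealthyPercent, maximumPercent) pairs for a service with 4 tasks.
    for pcts in [(100, 200), (50, 100), (50, 150)]:
        print(pcts, deployment_bounds(4, *pcts))
```

With 100/200, ECS can start up to 4 new tasks before stopping any old ones. With 50/100, it must stop old tasks first, because the total may never exceed 4.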

Failure Handling & Self-Healing Core

ECS services are self-healing by default. You don't configure recovery; it is built into the service abstraction. If anything goes wrong with a running task, ECS replaces it automatically. Combined with ALB health checks, this creates a resilient system that recovers from failures without human intervention.

🔄

What ECS Heals Automatically

  • Task crashes (exit code ≠ 0): service launches replacement
  • ALB health check fails: task deregistered → replaced
  • EC2 instance dies: tasks rescheduled to healthy instances
  • AZ goes down: tasks rebalanced across remaining AZs
  • Spot interruption: interrupted Fargate Spot task replaced once capacity is available
⏱️

Recovery Timeline

  • Task crash: ~30-60s to launch replacement (Fargate)
  • Health check failure: deregistration delay + new task start
  • ALB update: automatic (new task registered, old drained)
  • No manual intervention: service maintains desired count
  • Deployment rollback: circuit breaker auto-reverts bad deploys
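The self-healing behavior can be pictured as a reconciliation loop: compare what is running and healthy against the desired count, then act on the difference. The sketch below is a simplified model for intuition, not the actual ECS scheduler; the Task class and reconcile function are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Task:
    task_id: str
    status: str            # "RUNNING" or "STOPPED"
    healthy: bool = True   # result of the load balancer health check


def reconcile(desired_count, tasks):
    """One reconciliation pass: return (tasks_to_stop, number_to_launch)."""
    # Running but failing health checks: deregister and stop these.
    unhealthy = [t for t in tasks if t.status == "RUNNING" and not t.healthy]
    # Only running, healthy tasks count toward the desired count.
    serving = [t for t in tasks if t.status == "RUNNING" and t.healthy]
    to_launch = max(0, desired_count - len(serving))
    return unhealthy, to_launch


if __name__ == "__main__":
    tasks = [
        Task("a", "RUNNING"),
        Task("b", "STOPPED"),                 # crashed
        Task("c", "RUNNING", healthy=False),  # failing ALB health check
        Task("d", "RUNNING"),
    ]
    to_stop, to_launch = reconcile(4, tasks)
    print([t.task_id for t in to_stop], to_launch)
```

In this example one task has crashed and one is unhealthy, so a single pass stops task "c" and launches two replacements to return to four serving tasks.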
Container Introductory

A container is a single Docker container inside a task. Most tasks run one container (your application). But ECS supports multi-container tasks, a common pattern for sidecars like log routers, tracing agents (X-Ray daemon), or Envoy proxies. Containers in the same task share the network namespace (they can communicate over localhost) and can share volumes.

⚠️

Essential Containers

If a container marked "essential": true exits, the entire task stops. Your main app container should always be essential. Sidecar containers can be non-essential.

🔗

Multi-Container Patterns

  • Sidecar: X-Ray daemon, Datadog agent, Envoy proxy
  • Log router: Fluent Bit forwarding to CloudWatch/S3
  • Init-style container: runs to completion before the main app starts (configured through container dependencies, dependsOn)
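The essential flag is easiest to understand as a simple rule. The sketch below models it, assuming the ECS default that a container with no explicit flag is treated as essential; the function name and container list are illustrative.

```python
def task_should_stop(containers, exited_name):
    """Return True if the exit of `exited_name` stops the whole task.

    containers: list of dicts with 'name' and optional 'essential'
    (a missing flag is treated as essential).
    """
    for container in containers:
        if container["name"] == exited_name:
            return container.get("essential", True)
    raise KeyError(f"unknown container: {exited_name}")


if __name__ == "__main__":
    containers = [
        {"name": "web-api", "essential": True},
        {"name": "xray-daemon", "essential": False},
        {"name": "fluent-bit", "essential": False},
    ]
    for name in ("web-api", "xray-daemon"):
        print(name, "exit stops task:", task_should_stop(containers, name))
```

One design consequence: a sidecar that the application cannot function without (for example, a required proxy) is arguably better marked essential, so that the service replaces the whole task when it fails.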
Task Role vs Execution Role In-Depth

This is the most exam-tested ECS concept, and the most commonly confused. ECS uses two separate IAM roles with completely different purposes:

Aspect | Task Role | Execution Role
Who uses it | Your application code inside the container | The ECS agent (not your code)
Purpose | Access AWS services from your app | Infrastructure setup: pull images, push logs
Example permissions | s3:GetObject, dynamodb:PutItem, sqs:SendMessage | ecr:GetAuthorizationToken, logs:CreateLogStream
JSON field | taskRoleArn | executionRoleArn
Required? | Only if your app calls AWS APIs | Yes, always needed for Fargate
Analogy | Employee badge: what rooms they can enter | Building manager: keeps the lights and doors working
If missing | App gets "Access Denied" calling AWS services | Task fails to start (can't pull image or push logs)

👉 Exam trap: "The container needs to write to S3. Which role?" → Task Role (your app's permissions). "The container fails to start because it can't pull from ECR" → Execution Role is missing or wrong. Never confuse the two; the exam does this deliberately.
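As a concrete illustration of the split, the sketch below writes out one policy per role as Python dictionaries in IAM policy JSON shape. Both roles are assumed by the ecs-tasks.amazonaws.com service principal. The bucket name is a hypothetical placeholder, and the execution-role actions mirror the usual ECR pull and CloudWatch Logs permissions rather than an exhaustive list.

```python
# Trust policy shared by both roles: lets ECS tasks assume the role.
TRUST_POLICY = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"Service": "ecs-tasks.amazonaws.com"},
        "Action": "sts:AssumeRole",
    }],
}

# Task role: what the application code may do (bucket name is hypothetical).
TASK_ROLE_POLICY = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": ["s3:GetObject", "s3:PutObject"],
        "Resource": "arn:aws:s3:::example-app-bucket/*",
    }],
}

# Execution role: what ECS needs to start the task (pull image, write logs).
EXECUTION_ROLE_POLICY = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": [
            "ecr:GetAuthorizationToken",
            "ecr:BatchCheckLayerAvailability",
            "ecr:GetDownloadUrlForLayer",
            "ecr:BatchGetImage",
            "logs:CreateLogStream",
            "logs:PutLogEvents",
        ],
        "Resource": "*",
    }],
}


def allowed_actions(policy):
    """Collect every action granted by the Allow statements of a policy."""
    actions = set()
    for statement in policy["Statement"]:
        if statement["Effect"] == "Allow":
            action = statement["Action"]
            actions.update([action] if isinstance(action, str) else action)
    return actions


if __name__ == "__main__":
    # The S3 write lives on the task role, never on the execution role.
    print("s3:PutObject" in allowed_actions(TASK_ROLE_POLICY))
    print("s3:PutObject" in allowed_actions(EXECUTION_ROLE_POLICY))
```

The two permission sets do not overlap: adding S3 access to the execution role would not help the application, and adding ECR pull access to the task role would not help the task start.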

Concept Diagram: Entity Hierarchy Introductory
ECS Entity Hierarchy: Cluster → Service → Task → Container
[Diagram. ECS cluster "production" (logical grouping of all resources) contains service web-api (desired: 3; maintains count · restarts failed · integrates ALB) with Task 1 (10.0.1.15), Task 2 (10.0.2.22), and Task 3 (10.0.1.38), each running web-api:v2 plus an xray-daemon sidecar. The cluster also holds a standalone task, db-migrate, which runs once and exits and is not managed by a service. Tasks are created from task definition web-api:3 (blueprint: image, CPU, memory, roles, ports, logs). Task Role → your app calls S3, DynamoDB, SQS. Execution Role → ECS pulls the image and pushes logs.]
AWS Diagram: Service with ALB across 2 AZs Core
Running Service: 3 Tasks across 2 AZs with ALB Integration
[Diagram. An Application Load Balancer with target group web-api-tg routes traffic to task IPs inside the VPC. AZ-a private subnet: Task 1 (web-api:v2, 10.0.1.15:8080) and Task 2 (web-api:v2, 10.0.1.22:8080). AZ-b private subnet: Task 3 (web-api:v2, 10.0.2.38:8080). The ECS service distributes tasks across AZs and auto-registers them in the target group. The Execution Role pulls images from ECR and pushes logs and metrics to CloudWatch.]
Architecture Diagram: Multi-Container Task Detail In-Depth
Multi-Container Task: Sidecar Pattern with X-Ray + Fluent Bit
[Diagram. One ECS task in awsvpc mode (ENI: 10.0.1.15 · SG: sg-0abc123 · 1 vCPU / 2GB RAM) with three containers:
  • web-api (essential: true, so its exit stops the task): image 123456.dkr.ecr.../web-api:v2, containerPort 8080, 768 CPU units, 1.5GB memory; uses the Task Role to access S3 and DynamoDB
  • xray-daemon (essential: false, can crash independently): image amazon/aws-xray-daemon, port 2000/udp, 128 CPU units, 256MB memory; receives traces via localhost:2000 and forwards them to the X-Ray API
  • fluent-bit log router (essential: false): image amazon/aws-for-fluent-bit; reads container stdout, applies custom log parsing and filtering, routes to CloudWatch Logs and S3
The containers share the network (localhost), volumes, and the task-level CPU/memory budget.]
🎓 Exam Tips: Chapter 02
  • Task Role vs Execution Role is the #1 most tested concept. Task Role = your app's permissions. Execution Role = ECS agent's permissions (pulling images, pushing logs).
  • "Container can't pull image from ECR" → Missing or incorrect Execution Role. Not the Task Role.
  • "App returns Access Denied when writing to S3" → Missing or incorrect Task Role. Not the Execution Role.
  • Essential container exits → entire task stops. Non-essential sidecars can fail without killing the task.
  • Task Definition is versioned. Each update = new revision. Rollback = point service to older revision number.
  • Containers in the same task share network (communicate via localhost) and share the CPU/memory budget.
  • Distractor: "Use EC2 instance role for container AWS access" is wrong. ECS containers use Task Role, not the EC2 instance profile (even on EC2 launch type).
📋 Chapter 2: Summary
  • Cluster: logical grouping. Free. One per environment is common.
  • Task Definition: JSON blueprint (image, CPU, memory, roles, ports, logs). Versioned with revisions.
  • Task: running instance of a task definition. Gets its own IP (awsvpc). Ephemeral.
  • Service: maintains desired task count. Auto-restarts failed tasks. Integrates with ALB.
  • Container: Docker container inside a task. Essential flag controls task lifecycle.
  • Task Role: your app's AWS permissions (S3, DynamoDB). Execution Role: ECS agent's permissions (ECR pull, CW logs).
  • Multi-container tasks: sidecar pattern, where the X-Ray daemon, log router, or Envoy proxy shares the network with the main app.
03
Chapter Three

Launch Types

Two Ways to Run Containers Introductory

ECS gives you two main choices for where your containers physically run: EC2 launch type (you manage the servers) or Fargate launch type (AWS manages the servers). This is the single most impactful architectural decision in ECS: it determines your pricing model, operational burden, scaling behavior, and what features are available.

EC2 Launch Type Core

With the EC2 launch type, you provision and manage a fleet of EC2 instances. You register these instances with your ECS cluster by installing the ECS container agent (pre-installed on the Amazon ECS-optimized AMI). ECS places your containers on these instances based on available CPU and memory. You are responsible for patching, scaling, and monitoring the instances themselves.

✅

Strengths

  • Full control: instance type, AMI, OS patches, SSH access
  • GPU support: P3, P4, G4 instances for ML workloads
  • Persistent EBS volumes: attach to specific instances
  • Daemon scheduling: run one agent per instance (monitoring, logging)
  • Higher task density: pack many small tasks on one large instance
  • Cheaper for steady-state: Reserved Instances / Savings Plans work
⚠️

Trade-offs

  • You manage instances: patching, AMI updates, agent upgrades
  • Capacity planning: must provision enough instances for peak
  • ENI limits: each awsvpc task consumes one ENI. Small instances (t3.micro) may support only 1-2 tasks. Enable ENI trunking to increase the limit.
  • Idle waste: pay for the full instance even if it is half-empty
  • Scaling is two-layer: scale tasks AND scale instances (Auto Scaling Group)
👉 The ECS container agent is itself a Docker container that runs on every EC2 instance. It communicates with the ECS control plane, receives task placement instructions, starts/stops containers, and reports health. Use the ECS-optimized AMI (Amazon Linux 2023), which comes pre-configured with Docker and the agent.

Fargate Launch Type Core

With Fargate, you do not provision or manage any servers. You specify CPU and memory requirements in the task definition, and AWS provisions a compute environment for each task. You never see the underlying instance. Each task runs in its own isolated micro-VM (using Firecracker), providing strong security isolation: one customer's task cannot affect another's.

✅

Strengths

  • Zero server management: no instances to patch, scale, or monitor
  • Per-task pricing: pay only for the vCPU and memory your task uses
  • No idle waste: no instance running empty at 2 AM
  • Task-level isolation: Firecracker micro-VM per task
  • Scaling is one-layer: just change desired count, Fargate handles capacity
  • Fargate Spot: up to 70% discount for interruptible tasks
⚠️

Trade-offs

  • No GPU support: cannot use GPU instance types
  • No daemon scheduling: can't run one agent per "host"
  • No EBS volumes: ephemeral storage only (20GB default, up to 200GB)
  • Fixed CPU/memory combos: limited set of valid pairings
  • No SSH access: debug via ECS Exec only
  • Higher per-unit cost: ~20% more expensive per vCPU-hour than EC2

The Fargate pricing model is straightforward: you pay per vCPU-second and per GB-second your task runs. There is no cost when no tasks are running (unlike EC2, where the instance bill continues). This makes Fargate ideal for variable workloads, because the cost tracks actual usage.
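The pricing formula can be worked through with a short calculation. The sketch below takes the hourly rates as parameters; the rates used in the example ($0.04 per vCPU-hour, $0.004 per GB-hour) are rounded placeholders for illustration, so substitute the current rates for your region.

```python
def fargate_cost(vcpu, memory_gb, seconds, tasks=1, *, vcpu_hour_rate, gb_hour_rate):
    """Cost of running `tasks` identical Fargate tasks for `seconds` each."""
    # Billing is per second, so convert the hourly rates first.
    per_second = (vcpu * vcpu_hour_rate + memory_gb * gb_hour_rate) / 3600
    return per_second * seconds * tasks


if __name__ == "__main__":
    thirty_days = 30 * 24 * 3600
    # Placeholder rates; not official pricing.
    monthly = fargate_cost(0.5, 1, thirty_days, tasks=4,
                           vcpu_hour_rate=0.04, gb_hour_rate=0.004)
    print(f"4 tasks at 0.5 vCPU / 1 GB for 30 days: ${monthly:.2f}")
```

Because the formula is linear in running time, halving the hours a task runs halves its cost, which is the "no idle waste" property in numeric form.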

Fargate CPU/Memory Combinations Core

Fargate does not let you specify arbitrary CPU and memory. There are fixed valid combinations, and if you specify an invalid pairing, the task definition fails to register.

vCPU | Memory Options (GB) | Typical Use Case
0.25 vCPU | 0.5, 1, 2 | Tiny microservices, health checkers
0.5 vCPU | 1, 2, 3, 4 | APIs, lightweight web servers
1 vCPU | 2, 3, 4, 5, 6, 7, 8 | Standard APIs, workers
2 vCPU | 4 to 16 (in 1GB steps) | Batch jobs, heavier services
4 vCPU | 8 to 30 (in 1GB steps) | Data processing, analytics
8 vCPU | 16 to 60 (in 4GB steps) | ML inference, heavy compute
16 vCPU | 32 to 120 (in 8GB steps) | Large in-memory workloads

👉 Exam tip: If a question says "the task requires 3 vCPU and 6GB memory", remember that there is no 3 vCPU option in Fargate. You must round up to 4 vCPU. This is a common exam trap. Know the valid vCPU values: 0.25, 0.5, 1, 2, 4, 8, 16.
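The table of valid combinations can be encoded directly, which makes the rounding rule mechanical. This is a sketch built from the table in this section, with CPU in ECS CPU units (1024 = 1 vCPU) and memory in MiB; the helper names are made up for the example.

```python
def _gb_range(lo, hi, step=1):
    """Memory options from lo to hi GB inclusive, expressed in MiB."""
    return {gb * 1024 for gb in range(lo, hi + 1, step)}


# CPU units -> valid memory sizes in MiB, per the table above.
FARGATE_COMBOS = {
    256: {512, 1024, 2048},
    512: _gb_range(1, 4),
    1024: _gb_range(2, 8),
    2048: _gb_range(4, 16),
    4096: _gb_range(8, 30),
    8192: _gb_range(16, 60, 4),
    16384: _gb_range(32, 120, 8),
}


def is_valid(cpu_units, memory_mib):
    """True if the CPU/memory pair is a valid Fargate task size."""
    return memory_mib in FARGATE_COMBOS.get(cpu_units, set())


def smallest_fit(vcpu_needed, memory_gb_needed):
    """Smallest valid (cpu_units, memory_mib) that covers the requirement, or None."""
    for cpu in sorted(FARGATE_COMBOS):
        if cpu >= vcpu_needed * 1024:
            fits = [m for m in FARGATE_COMBOS[cpu] if m >= memory_gb_needed * 1024]
            if fits:
                return cpu, min(fits)
    return None


if __name__ == "__main__":
    print(is_valid(512, 1024))   # 0.5 vCPU / 1 GB
    print(is_valid(3072, 6144))  # 3 vCPU / 6 GB
    print(smallest_fit(3, 6))
```

For the "3 vCPU and 6GB" case, the smallest fit is 4 vCPU with 8 GB: rounding the CPU up also raises the minimum memory, because the 4 vCPU tier starts at 8 GB.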

EC2 vs Fargate: Full Comparison Core
Feature | EC2 Launch Type | Fargate Launch Type
Server management | You manage EC2 instances | AWS manages (serverless)
Pricing | Pay for EC2 instances (running or not) | Pay per vCPU-second + GB-second per task
Isolation | Instance-level (shared host for tasks) | Task-level (Firecracker micro-VM)
GPU support | ✅ Full GPU access (P3, P4, G4, G5) | ❌ No GPU available
Persistent storage (EBS) | ✅ EBS volumes attachable | ❌ Ephemeral only (20-200GB)
EFS (shared file system) | ✅ Supported | ✅ Supported
Spot instances | ✅ EC2 Spot (up to 90% discount) | ✅ Fargate Spot (up to 70% discount)
Daemon scheduling | ✅ One task per instance | ❌ Not supported
Task density | Pack multiple tasks per instance | One micro-VM per task
SSH access | ✅ Direct SSH to instance | ❌ ECS Exec only (SSM-based)
Scaling layers | 2 layers: tasks + instances (ASG) | 1 layer: tasks only
Cold start | ~minutes (if ASG needs new instance) | ~30-60s (Fargate provisions infra)
Best for | Large steady workloads, GPU, tight cost control | Variable/spiky workloads, simplicity, microservices
Hybrid: EC2 + Fargate in the Same Cluster In-Depth

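A single ECS cluster can use both launch types simultaneously. This is a common production pattern for cost optimization: run steady-state workloads on EC2 with Reserved Instances (cheapest baseline), and run variable or overflow workloads on Fargate (no pre-provisioning needed). Capacity provider strategies let you define the mix for each service, for example "a base of 2 tasks on Fargate, then 1 Fargate task for every 3 on Fargate Spot." A single strategy cannot combine Fargate and EC2 (Auto Scaling group) capacity providers, so the EC2/Fargate split is made service by service: "batch jobs on Fargate Spot, web tier on Fargate, workers on EC2."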

🌐

Web Tier → Fargate

Variable traffic, auto-scales, no servers to manage. Simplest operational model for customer-facing services.

⚙️

Workers → EC2

Steady-state processing, Reserved Instances for cost. Pack multiple worker tasks per large instance for efficiency.

📊

Batch → Fargate Spot

Interruptible batch jobs get up to 70% discount. Task retries handle interruptions naturally.

👉 Decision framework: Start with Fargate. It is simpler and scales naturally. Move to EC2 launch type only when you need: (1) GPU, (2) EBS persistent volumes, (3) daemon scheduling, (4) cost optimization on large steady-state fleets, or (5) specific instance types. Fargate is the default for most new workloads on ECS.
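A capacity provider strategy expresses the mix through two settings per provider: base (a minimum number of tasks placed there first) and weight (the relative share of the remaining tasks). The sketch below approximates how a desired count would split under those settings. It is a simplified model for intuition, and the real scheduler's rounding and ordering may differ.

```python
def split_tasks(desired_count, strategy):
    """Approximate the task split for a capacity provider strategy.

    strategy: list of dicts with 'provider', 'weight', and optional 'base'.
    """
    counts = {s["provider"]: 0 for s in strategy}
    remaining = desired_count

    # Step 1: satisfy each base before applying weights.
    for s in strategy:
        take = min(s.get("base", 0), remaining)
        counts[s["provider"]] += take
        remaining -= take

    # Step 2: place the rest one task at a time, always choosing the
    # provider that is furthest below its weighted share.
    weighted = [s for s in strategy if s["weight"] > 0]
    if not weighted:
        return counts
    placed = {s["provider"]: 0 for s in weighted}
    for _ in range(remaining):
        target = min(weighted, key=lambda s: placed[s["provider"]] / s["weight"])
        placed[target["provider"]] += 1
        counts[target["provider"]] += 1
    return counts


if __name__ == "__main__":
    strategy = [
        {"provider": "FARGATE", "base": 2, "weight": 1},
        {"provider": "FARGATE_SPOT", "weight": 3},
    ]
    print(split_tasks(10, strategy))
```

With a base of 2 on FARGATE and a 1:3 weight ratio, 10 tasks split into 4 on-demand and 6 Spot: the base guarantees a floor of uninterruptible capacity, and the weights control how aggressively the remainder uses the discount.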

Task Placement Strategies (EC2 Launch Type) In-Depth

When using the EC2 launch type, ECS decides which instance gets each new task. Task placement strategies control this decision. They apply only to EC2; Fargate handles placement internally (one micro-VM per task, AWS chooses the host).

Strategy | How It Works | Best For
spread | Distribute tasks evenly across the specified field (e.g., attribute:ecs.availability-zone or instanceId) | High availability: an AZ failure affects the fewest tasks
binpack | Pack tasks onto the fewest instances possible (by CPU or memory) | Cost optimization: fewer instances running, lower EC2 bill
random | Place tasks on random instances | Simple workloads, testing (no preference)
🎯

Combining Strategies

You can chain strategies in order of priority. Example: spread(az) first, then binpack(memory). This spreads across AZs for HA, then packs tightly within each AZ for cost savings.

🚧

Placement Constraints

Constraints filter which instances are eligible: distinctInstance (no two tasks on same instance) or memberOf (custom expressions like attribute:ecs.instance-type == g4dn.xlarge).

👉 Default behavior: for services, ECS spreads tasks across Availability Zones by default. This is the safest default because it maximizes availability. Switch to binpack when cost optimization is the priority and you can tolerate reduced AZ spread.
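The difference between spread and binpack shows up clearly in a toy simulation. The sketch below is a simplified model, not the ECS placement engine: it considers memory only, spreads by AZ, and uses made-up instance IDs and sizes.

```python
from collections import Counter


def new_fleet():
    """Three hypothetical instances with 4096 MiB free each, across two AZs."""
    return [
        {"id": "i-a1", "az": "a", "free_mem": 4096, "tasks": 0},
        {"id": "i-a2", "az": "a", "free_mem": 4096, "tasks": 0},
        {"id": "i-b1", "az": "b", "free_mem": 4096, "tasks": 0},
    ]


def place(instances, task_mem, strategy):
    """Place one task; return the chosen instance id, or None if nothing fits."""
    eligible = [i for i in instances if i["free_mem"] >= task_mem]
    if not eligible:
        return None
    if strategy == "binpack":
        # Fill the instance with the least free memory that still fits.
        chosen = min(eligible, key=lambda i: i["free_mem"])
    elif strategy == "spread":
        # Prefer the AZ with the fewest tasks, then the emptiest instance.
        per_az = Counter()
        for i in instances:
            per_az[i["az"]] += i["tasks"]
        chosen = min(eligible, key=lambda i: (per_az[i["az"]], i["tasks"]))
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    chosen["free_mem"] -= task_mem
    chosen["tasks"] += 1
    return chosen["id"]


if __name__ == "__main__":
    for strategy in ("spread", "binpack"):
        fleet = new_fleet()
        print(strategy, [place(fleet, 1024, strategy) for _ in range(4)])
```

In this model, spread alternates between the two AZs, while binpack fills the first instance completely and leaves the other two empty, so they could be scaled in.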

EC2 vs Fargate: Who Manages What

EC2 Launch Type: You Manage
  • EC2 instances (provision, patch, AMI updates)
  • Auto Scaling Group (instance count)
  • ECS agent updates
  • OS security patches
  • Instance type selection + capacity planning
  • Docker version management

EC2 Launch Type: AWS Manages
  • ECS control plane (scheduling, placement)
  • Task lifecycle management
  • Service desired count + ALB integration

Fargate Launch Type: You Manage
  • Task definition (image, CPU, memory, roles)
  • Service desired count + scaling policies

Fargate Launch Type: AWS Manages
  • Infrastructure provisioning (Firecracker VMs)
  • OS patching + security updates
  • Docker runtime management
  • Capacity (always available, no planning)
  • ECS control plane (scheduling, placement)
  • Task lifecycle management
  • Service desired count + ALB integration
AWS Diagram: EC2 Cluster vs Fargate Cluster Core
EC2 Launch Type (with ASG) vs Fargate Launch Type
[Diagram. EC2 launch type: an Auto Scaling Group (min: 2 · max: 6 · desired: 3) of m5.xlarge instances, each running the ECS agent. Instance 1 holds Tasks A, B, and C plus an idle slot; Instance 2 holds Tasks D and E plus an idle slot. You manage instances, ASG, and patching. Scaling has 2 layers (tasks + instances). You pay for instances, including idle slots. GPU, EBS, daemon scheduling, and SSH are available. Cost: $$ (Reserved Instances save up to 72%). The per-instance ENI limit caps awsvpc tasks. Fargate launch type: AWS-managed infrastructure (Firecracker), with each task in its own isolated micro-VM: Task A (0.5 vCPU · 1GB), Task B (1 vCPU · 2GB), Task C (0.5 vCPU · 1GB), Task D (2 vCPU · 4GB). AWS manages all infrastructure. Scaling has 1 layer (just change desired count). You pay per task runtime, with no idle waste. No GPU, EBS, daemon scheduling, or SSH. Cost: $$$ per unit but zero waste.]
Architecture Diagram: Mixed Workload Cluster In-Depth
Hybrid Cluster: Web on Fargate + Batch on EC2 Spot + ML on EC2 GPU

ECS Cluster: production (hybrid)

Web Tier (Fargate)
  • ALB routes to the web tier
  • Service: web-api (desired: 6)
  • 0.5 vCPU / 1GB per task
  • Auto-scales 2-20 by CPU utilization
  • ✅ No servers · pay per task
  • ✅ Customer-facing · variable traffic
  • $0.04/vCPU-hr · zero idle cost

Batch (EC2 Spot)
  • Service: data-pipeline (desired: 10)
  • c5.2xlarge Spot instances
  • EventBridge triggers hourly jobs; an SQS queue carries the job messages
  • ✅ 90% cheaper with Spot
  • ✅ Interruption-tolerant (retries)
  • $0.03/vCPU-hr (Spot discount)

ML Inference (EC2 GPU)
  • Service: model-server (desired: 2)
  • p3.2xlarge (1× V100 GPU each)
  • Reserved Instances (1-year)
  • ✅ GPU required, which Fargate can't provide
  • ✅ Dedicated instances for isolation
  • $2.30/hr RI (save 40% vs on-demand)

All three workloads in one cluster. Capacity Providers control the mix. Each optimized for its cost/feature profile.

This architecture uses each launch type where it shines: Fargate for the web tier (simple, no servers, auto-scales), EC2 Spot for batch processing (cheapest compute, interruption-tolerant), and EC2 GPU for ML inference (needs hardware that Fargate can't provide). All managed through a single ECS cluster with capacity provider strategies.

🎓 Exam Tips: Chapter 03
  • "No server management" + containers → Always Fargate. This is the exam's favorite phrase for Fargate.
  • "Requires GPU" → Must use EC2 launch type. Fargate does not support GPU instances.
  • "Run one monitoring agent per host" → Daemon scheduling on EC2 launch type. Fargate doesn't support daemons.
  • "Need persistent block storage (EBS)" → EC2 launch type. Fargate only has ephemeral storage.
  • "Need shared file storage across tasks" → EFS works with both EC2 and Fargate. Don't pick EC2 just for shared storage.
  • Fargate Spot: up to 70% discount, but tasks can be interrupted with a 2-minute warning. Good for batch. Not for web servers.
  • Valid vCPU values: 0.25, 0.5, 1, 2, 4, 8, 16. If a question says "3 vCPU", that's invalid and must be rounded up to 4.
  • Fargate cold start ~30-60s. If the question requires "sub-second scaling" → EC2 with pre-warmed instances.
  • Distractor: "Fargate is always cheaper than EC2" is wrong. For large steady-state workloads, EC2 with Reserved Instances is cheaper per unit.
📋 Chapter 3: Summary
  • EC2 launch type: you manage instances. Full control, GPU, EBS, daemon scheduling. Cheaper for steady-state (RI/SP).
  • Fargate: serverless containers. Zero server management. Per-task pricing. No GPU, no EBS, no SSH.
  • Fargate isolation: each task runs in its own Firecracker micro-VM (task-level isolation vs instance-level).
  • Fargate CPU/memory: fixed combinations. vCPU options: 0.25, 0.5, 1, 2, 4, 8, 16.
  • Hybrid clusters: use both launch types. Web on Fargate, batch on EC2 Spot, ML on EC2 GPU.
  • Default choice: start with Fargate. Move to EC2 only when you need GPU, EBS, daemons, or cost optimization at scale.
  • Fargate Spot: up to 70% discount for interruptible batch workloads.
☁️
Deep Dive

Fargate: Complete Understanding

What is Fargate: Behind the Scenes Core

Fargate is a serverless compute engine for containers. It removes the server layer entirely: you define what you want to run and how much CPU/memory it needs. AWS handles everything else: provisioning compute, patching the OS, managing the container runtime, and isolating your workload.

🧠

The Right Mental Model

Most people think Fargate = "Lambda for containers." That's not quite right.

👉 Better mental model:

"Fargate = EC2 without access to EC2"

Your container gets a VPC, an ENI, security groups, and a private IP, just like EC2. You just can't SSH in, can't pick the instance type, can't install host-level agents. The EC2 exists; you just don't see it.

🔧

What Happens Under the Hood

  • AWS provisions a Firecracker micro-VM per task
  • AWS manages the host OS, container runtime (containerd), and ECS agent
  • You never see the underlying EC2 instance
  • Isolation: each task is a separate micro-VM (not just a container on a shared host)
  • You only define: CPU, memory, container image, networking

Internally: Fargate still runs on EC2 hardware (Nitro instances). It's EC2 that AWS manages for you, not a different compute technology.

👉 Key insight: Fargate is NOT a separate compute platform. It's an abstraction layer over EC2. AWS is running EC2 instances, launching Firecracker micro-VMs on them, and exposing only the container interface to you. This is why Fargate tasks behave like EC2 instances (own IP, security groups, VPC placement): under the hood, they ARE running on EC2.

Fargate Networking Model Deep

Every Fargate task gets its own Elastic Network Interface (ENI) with a private IP address in your VPC. This has major implications:

🌐

How It Works

  • Each task = own ENI = own private IP
  • ENI lives in your subnet (public or private)
  • You attach security groups directly to the task
  • Tasks can communicate using standard VPC networking
  • Always uses awsvpc network mode (no other option)
💡

Implications

  • No port conflicts: every task has its own IP, so all can use port 80
  • Security group per task: fine-grained firewall rules
  • Task behaves like an EC2 instance from a networking perspective
  • ALB targets individual tasks by IP (not instance + port)

⚠️ Critical requirement: Fargate tasks in private subnets need a NAT Gateway for internet access (pulling images from Docker Hub, calling external APIs). Without a route to the registry, the image pull fails and the task never reaches RUNNING; it typically stops with an error such as CannotPullContainerError. For ECR image pulls, you can alternatively use VPC Endpoints (PrivateLink) to avoid NAT Gateway costs.

Fargate Resource Model: CPU/Memory Combinations Core

Fargate does NOT allow arbitrary resource values. You must choose from predefined CPU/memory combinations. If you specify an invalid pair, the task definition will fail to register.

vCPU | Memory Options (GB) | Typical Use Case
0.25 vCPU | 0.5, 1, 2 | Microservices, health checks, lightweight APIs
0.5 vCPU | 1, 2, 3, 4 | Small web apps, background workers
1 vCPU | 2, 3, 4, 5, 6, 7, 8 | Standard web apps, APIs
2 vCPU | 4-16 (in 1GB increments) | Medium workloads, data processing
4 vCPU | 8-30 (in 1GB increments) | Large apps, compute-heavy tasks
8 vCPU | 16-60 (in 4GB increments) | Heavy processing, in-memory caching
16 vCPU | 32-120 (in 8GB increments) | Max power (rare, expensive)

👉 Exam trap: "The task requires 3 vCPU and 6GB memory." There is no 3 vCPU option, so you must round up to 4 vCPU. Valid vCPU values: 0.25, 0.5, 1, 2, 4, 8, 16. Nothing in between. This is a frequently tested concept.
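The rounding rule can be sketched as a small lookup. This is a minimal illustration built only from the table above; the function names are hypothetical and nothing here calls AWS.

```python
# vCPU -> (min GB, max GB, step GB), taken from the table above.
FARGATE_MEMORY_RANGES = {
    0.5: (1, 4, 1),
    1: (2, 8, 1),
    2: (4, 16, 1),
    4: (8, 30, 1),
    8: (16, 60, 4),
    16: (32, 120, 8),
}
QUARTER_VCPU_MEMORY = [0.5, 1, 2]  # 0.25 vCPU has a fixed list, not a range


def valid_memory_options(vcpu):
    """Return the memory sizes (GB) allowed for one vCPU value."""
    if vcpu == 0.25:
        return list(QUARTER_VCPU_MEMORY)
    low, high, step = FARGATE_MEMORY_RANGES[vcpu]
    return list(range(low, high + 1, step))


def smallest_fargate_size(need_vcpu, need_mem_gb):
    """Round a requirement up to the smallest valid (vCPU, GB) pair."""
    for vcpu in [0.25, 0.5, 1, 2, 4, 8, 16]:
        if vcpu < need_vcpu:
            continue
        for mem in valid_memory_options(vcpu):
            if mem >= need_mem_gb:
                return vcpu, mem
    raise ValueError("request exceeds the largest Fargate size")


if __name__ == "__main__":
    print(smallest_fargate_size(3, 6))
```

For the exam-trap case, 3 vCPU rounds up to 4 vCPU, and because 4 vCPU starts at 8 GB the memory rounds up as well, so the 6GB request ends up as 4 vCPU / 8 GB.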

Fargate Startup Behavior Core

Fargate task startup is not instant. Understanding the timeline helps you set realistic expectations for scaling and health check grace periods:

⏱️

Startup Time: ~30-60 seconds

Typical cold start. Includes compute provisioning + image pull + container start. Larger images = longer startup.

📦

What Happens During Startup

  • AWS provisions Firecracker micro-VM
  • Attaches ENI to your subnet
  • Pulls container image from ECR/registry
  • Starts your container process
  • Health check grace period begins
📊

Compared To

  • Lambda cold start: 100ms-3s (faster)
  • Fargate: 30-60s
  • EC2 launch: 2-5 min (slower)

For latency-sensitive scaling, keep min tasks > 0 to avoid cold starts.

Fargate Storage Core

Storage in Fargate is fundamentally different from EC2: there are no instance disks or self-managed EBS volumes to rely on. Understanding what you get (and don't get) prevents painful surprises:

💾

Ephemeral Storage

  • Default: 20 GB per task
  • Configurable: up to 200 GB
  • Lifecycle: destroyed when task stops
  • Fast local SSD, good for temp files, caching, scratch space
  • Shared across all containers in the task
📂

Persistent Storage: EFS

  • Amazon EFS = the shared persistent storage option for Fargate
  • Shared filesystem: multiple tasks read/write simultaneously
  • Survives task restarts
  • Mount as a volume in task definition
  • Use for: shared config, uploaded files, ML models

⚠️ No EBS on Fargate (the classic rule). If your workload requires EBS volumes that you manage yourself (high IOPS, block storage, databases), you must use EC2 launch type. Since 2024, ECS can also attach an ECS-managed EBS volume to a Fargate task at launch, but that volume is scoped to a single task, so this chapter keeps treating Fargate storage as ephemeral + EFS. This is a common exam question and a real-world constraint.

Fargate Limitations Core

Fargate is excellent, but it's NOT always the right choice. These limitations are critical for architecture decisions and exam answers:

🚫

What Fargate Cannot Do

  • No GPU support: ML training, rendering → use EC2
  • No EBS volumes: only ephemeral + EFS
  • No daemon containers: can't run node-level agents (Datadog agent, Fluentd)
  • No SSH access: cannot log into the host
  • No custom AMI: can't customize the underlying OS
  • No privileged mode: can't run containers with root-level host access
  • Windows containers: supported, but with fewer options than Linux
  • Fixed CPU/memory combos: can't choose arbitrary values
💰

Cost Considerations

  • Fargate is ~20-40% more expensive per vCPU-hour than EC2 On-Demand
  • EC2 with Reserved Instances / Savings Plans = much cheaper for steady workloads
  • Fargate Spot helps (up to 70% off) but can be interrupted
  • Break-even point: if task utilization is >70% consistently for 24/7 workloads, EC2 is cheaper
  • Fargate wins: for variable workloads, burst traffic, short-lived tasks
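The ~70% break-even figure follows from the price premium. A simplified model, assuming you pay for a whole EC2 instance regardless of how much of it your tasks use (so the effective EC2 price per used vCPU-hour is the list price divided by utilization), gives a break-even utilization of 1 / (1 + premium). The function names are illustrative only.

```python
def break_even_utilization(fargate_premium: float) -> float:
    """Utilization above which EC2 On-Demand beats Fargate.

    fargate_premium is the extra cost of Fargate per vCPU-hour as a
    fraction (0.4 means 40% more expensive). Simplified model: ignores
    memory pricing, Spot, and Savings Plans.
    """
    return 1.0 / (1.0 + fargate_premium)


def cheaper_option(utilization: float, fargate_premium: float) -> str:
    """Pick the cheaper launch type under the simplified model."""
    if utilization > break_even_utilization(fargate_premium):
        return "EC2"
    return "Fargate"


if __name__ == "__main__":
    for premium in (0.2, 0.4):
        print(f"premium {premium:.0%}: break-even at "
              f"{break_even_utilization(premium):.0%} utilization")
```

With the 20-40% premium quoted above, the break-even lands between roughly 83% and 71% utilization, which is where the ">70%" rule of thumb comes from. Reserved Instances or Savings Plans lower the EC2 price further and push the break-even down.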
Fargate vs EC2: Decision Guide Deep
☁️

Choose Fargate When

  • You want zero infrastructure management
  • Workloads are variable or bursty
  • Fast setup and iteration speed matter
  • Small team with no dedicated DevOps
  • Security isolation per task is important
  • You're running microservices (many small containers)
  • Development and staging environments
🖥️

Choose EC2 When

  • You need GPU instances (ML training, rendering)
  • You want cost optimization at scale (RI/SP at 50-60% off)
  • You need EBS volumes (databases, high IOPS)
  • You run daemon containers (one log or monitoring agent per host)
  • You need privileged mode or custom OS configs
  • Workloads are steady-state 24/7 at high utilization
  • You need instance types Fargate doesn't match (compute-optimized, memory-optimized)

👉 Golden rule: Start with Fargate. Move to EC2 only when you hit a specific limitation (GPU, EBS, cost, daemons). Don't pre-optimize: Fargate's operational simplicity saves engineering time that often exceeds the compute cost difference.

Fargate Execution Flow Core
Fargate Task Lifecycle: From Request to Running

  • 0s: ECS control plane issues "run task"
  • ~10s: Fargate provisions a Firecracker μVM with the requested CPU/memory (~10-20s)
  • ~15s: ENI attached to the VPC (private IP + SG)
  • ~30s: container image pulled from ECR (~10-30s)
  • ~30-60s: RUNNING. Container active on its own IP (port 80), ALB can route, ✓ healthy

ECS requests task → Fargate provisions micro-VM → ENI attaches (IP + SG) → image pulls from ECR → container starts → ~30-60s total cold start
Common Fargate Mistakes Core
❌

Assuming GPU Support

Fargate does NOT support GPU workloads. For ML training, inference with GPU, or rendering, you MUST use EC2 launch type with P/G instance families.

❌

Forgetting NAT Gateway

Fargate tasks in private subnets cannot reach the internet (or ECR) without a NAT Gateway or VPC Endpoints. The image pull fails and the task never reaches RUNNING.

❌

Overprovisioning Resources

Choosing 4 vCPU / 8GB when the app uses 0.5 vCPU / 512MB. Fargate bills per-second, so oversized tasks = wasted money every second they run.

❌

Expecting EBS

Fargate only has ephemeral storage + EFS. If you need high-IOPS block storage (databases, caches with persistence), use EC2 launch type.

❌

Large Image + Cold Start

Using 2GB+ images on Fargate → 60+ second startup times. Keep images lean (<500MB). Use multi-stage Docker builds to minimize image size.

❌

Daemon Scheduling

Trying to run "one per host" containers (log agents, monitoring) on Fargate, where there's no concept of "host." Use ECS daemon service with EC2 launch type instead.

☁️ Fargate Deep Dive: Summary

Fargate = EC2 without access to EC2. Serverless containers with full VPC networking.

  • Behind the scenes: Firecracker micro-VMs on AWS-managed EC2. Each task = isolated VM, own ENI, own IP.
  • Networking: awsvpc mode only. Each task gets an ENI in your subnet with security groups. Private subnets need a NAT Gateway or VPC Endpoints.
  • Resources: Fixed CPU/memory combos. Valid vCPU: 0.25, 0.5, 1, 2, 4, 8, 16. No arbitrary values.
  • Startup: ~30-60 seconds (provision + ENI + image pull + start). Not instant like Lambda.
  • Storage: Ephemeral 20-200GB (destroyed on stop) + EFS (persistent, shared). NO EBS.
  • Limitations: No GPU, no EBS, no daemons, no SSH, no privileged mode, no custom AMI.
  • Cost: ~20-40% more than EC2 On-Demand. Wins for variable/burst workloads. Loses for 24/7 steady-state at scale.
  • Golden rule: Start with Fargate. Move to EC2 only when you hit a limitation.
04
Chapter Four

Networking & Storage

Networking Modes Core

How your ECS tasks connect to the network determines their security posture, IP behavior, and load balancer integration. ECS supports three networking modes, but for all practical purposes awsvpc is the only one you should use (and the only one that works with Fargate).

Network Mode | How It Works | Launch Type | Use Case
awsvpc | Each task gets its own ENI (Elastic Network Interface) with a private IP in your VPC | EC2 + Fargate | Production standard: all new workloads
bridge | Tasks share the host's network via Docker bridge. Dynamic port mapping. | EC2 only | Legacy. Only if migrating from Docker Compose
host | Task uses the host EC2 instance's network directly. No isolation. | EC2 only | Maximum performance (no NAT overhead). Rare.
awsvpc Mode: The Standard Core

In awsvpc mode, each ECS task gets its own Elastic Network Interface (ENI), a real VPC network interface with a private IP address. This means each task has its own security group, appears as a distinct network entity in your VPC, and can be targeted directly by load balancers. There is no port conflict: every task listens on the same container port (e.g., 8080) because each has its own IP.

✅

Benefits

  • Task-level security groups: different rules per service
  • No port conflicts: every task uses port 8080, own IP
  • VPC Flow Logs per task: full network visibility
  • Direct ALB targeting by IP: no dynamic port mapping needed
  • Required for Fargate: the only option that works
⚠️

ENI Limits (EC2 only)

  • Each EC2 instance type has a max ENI count
  • Each task in awsvpc mode consumes one ENI
  • t3.micro: 2 ENIs → only 1 task (1 ENI for the instance itself)
  • m5.xlarge: 4 ENIs → 3 tasks max
  • ENI trunking (opt-in) increases the limit significantly
  • Not a concern with Fargate, because AWS manages this

👉 ENI trunking is an opt-in feature that lets you run more tasks per EC2 instance in awsvpc mode. It creates a "trunk" ENI with multiple "branch" ENIs sharing it. Enable it via account settings: aws ecs put-account-setting --name awsvpcTrunking --value enabled. With trunking, an m5.xlarge can support about 20 tasks instead of 3; the exact limit depends on the instance type, and not every instance type supports trunking.
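The ENI arithmetic without trunking is simple enough to sketch. The ENI counts below are the two examples from the list above; the helper names are illustrative, and other instance types would need their limits looked up.

```python
import math

# ENI limits quoted in this chapter (without trunking).
ENI_LIMITS = {"t3.micro": 2, "m5.xlarge": 4}


def max_awsvpc_tasks(eni_limit: int) -> int:
    """One ENI is reserved for the instance itself; each task takes one more."""
    return max(eni_limit - 1, 0)


def instances_needed(task_count: int, instance_type: str) -> int:
    """Instances required if ENIs are the only constraint (ignores CPU/memory)."""
    per_instance = max_awsvpc_tasks(ENI_LIMITS[instance_type])
    return math.ceil(task_count / per_instance)


if __name__ == "__main__":
    print(max_awsvpc_tasks(ENI_LIMITS["m5.xlarge"]))
    print(instances_needed(10, "m5.xlarge"))
```

In this model ten awsvpc tasks need four m5.xlarge instances even if CPU and memory would fit on fewer, which is the situation trunking is meant to relieve.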

👉 Mental model: In awsvpc mode, each task behaves exactly like a standalone EC2 instance from a networking perspective: it has its own private IP, its own security group, its own entry in VPC Flow Logs, and can be directly addressed by other services. The only difference: it is a container, not a VM. This is why awsvpc is required for Fargate; it provides the clean network isolation that serverless containers need.

Security Groups for Tasks Core

In awsvpc mode, security groups are assigned at the task level (via the service's network configuration), not at the instance level. This gives you fine-grained control:

🌐

Web Tier SG

  • Inbound: port 8080 from ALB SG
  • Outbound: port 5432 to DB SG
  • Outbound: port 443 to internet (HTTPS)
⚙️

Worker Tier SG

  • Inbound: none (pulls from SQS)
  • Outbound: port 443 to SQS/S3
  • Outbound: port 5432 to DB SG
🗄️

Database SG

  • Inbound: port 5432 from Web SG + Worker SG
  • Outbound: none
  • Reference SGs by ID (not IP ranges)

The key insight: reference security groups by their group ID, not by IP ranges. Since task IPs change on every restart, IP-based rules would constantly break. SG-to-SG references are stable regardless of task IP churn.
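An SG-to-SG rule can be sketched as the permission structure that boto3's EC2 client accepts in `authorize_security_group_ingress(GroupId=..., IpPermissions=[...])`. The group IDs below are made up, and the snippet only builds the dictionary; it does not call AWS.

```python
def sg_reference_rule(port: int, source_sg_id: str, description: str = "") -> dict:
    """Build one ingress permission that references a security group, not a CIDR."""
    return {
        "IpProtocol": "tcp",
        "FromPort": port,
        "ToPort": port,
        # A group reference stays valid no matter which IPs the tasks get.
        "UserIdGroupPairs": [{"GroupId": source_sg_id, "Description": description}],
    }


# Hypothetical IDs for the Web and Worker tier security groups.
WEB_SG = "sg-0aaa1111bbbb2222c"
WORKER_SG = "sg-0ddd3333eeee4444f"

# Database SG: inbound 5432 from the Web SG and the Worker SG.
database_ingress = [
    sg_reference_rule(5432, WEB_SG, "Postgres from web tier"),
    sg_reference_rule(5432, WORKER_SG, "Postgres from worker tier"),
]

if __name__ == "__main__":
    for rule in database_ingress:
        print(rule["UserIdGroupPairs"][0]["GroupId"], "->", rule["FromPort"])
```

Because neither rule contains an IP range, replacing every web or worker task changes nothing in the database security group.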

Load Balancer Integration Core

ECS integrates with ALB (Application Load Balancer) and NLB (Network Load Balancer) through target groups. When you create a service with a load balancer, ECS automatically registers each task's IP:port as a target. When a task is replaced, the old target is deregistered and the new one registered, seamlessly.

⚖️

ALB (Layer 7)

  • Path-based routing: /api/* → API service, /web/* → frontend
  • Host-based routing: api.example.com → one service
  • Health checks: HTTP GET /health → 200 OK
  • Sticky sessions: route same user to same task
  • Target type: ip (required for awsvpc + Fargate)
  • Best for: HTTP/HTTPS workloads, microservices
🔌

NLB (Layer 4)

  • TCP/UDP pass-through: no HTTP awareness
  • Ultra-low latency: millions of requests/sec
  • Static IP: one IP per AZ (great for allowlisting)
  • TLS termination or pass-through
  • Target type: ip (for awsvpc mode)
  • Best for: gRPC, WebSocket, non-HTTP protocols

👉 Target type must be ip for Fargate. The ALB target group must use target type: ip (not instance). With awsvpc mode, ECS registers the task's ENI IP directly. A target group of type instance is incompatible with awsvpc tasks, so ECS rejects the service configuration rather than registering targets. The target type cannot be changed after the target group is created, so choose ip from the start.
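A small guard can keep the target type in step with the network mode. The sketch below builds keyword arguments shaped for boto3's `elbv2.create_target_group`; the function name, the example names and IDs, and the `/health` path are assumptions for illustration, and nothing is sent to AWS.

```python
def target_group_params(name: str, vpc_id: str, container_port: int,
                        network_mode: str) -> dict:
    """Pick the target type from the task's network mode."""
    # awsvpc tasks (including all Fargate tasks) register by ENI IP.
    target_type = "ip" if network_mode == "awsvpc" else "instance"
    return {
        "Name": name,
        "Protocol": "HTTP",
        "Port": container_port,
        "VpcId": vpc_id,
        "TargetType": target_type,
        "HealthCheckPath": "/health",  # assumed health endpoint
    }


if __name__ == "__main__":
    params = target_group_params("web-api-tg", "vpc-0123456789abcdef0", 8080, "awsvpc")
    print(params["TargetType"])
```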

Service Discovery (AWS Cloud Map) In-Depth

For service-to-service communication without a load balancer, ECS integrates with AWS Cloud Map to provide DNS-based service discovery. When a task starts, it registers a DNS record (e.g., web-api.production.local). Other services resolve this name to get the task's current IP address. When the task stops, the record is removed.

🗺️

Cloud Map (Service Discovery)

  • DNS A records pointing to task IPs
  • Private DNS namespace (e.g., production.local)
  • Auto-register on task start, deregister on stop
  • Health checks to remove unhealthy instances
  • Works with both EC2 and Fargate
🔗

Service Connect (newer, simpler)

  • Built on Cloud Map + Envoy proxy sidecar
  • Service-to-service via logical names (not IPs)
  • Automatic retries, timeouts, circuit breaking
  • Traffic metrics out of the box
  • Recommended over raw Cloud Map for new services

Service Connect is the newer recommended approach. It injects an Envoy sidecar proxy into your tasks automatically. Your code calls http://web-api:8080, and the proxy handles discovery, load balancing, retries, and telemetry. Think of it as a lightweight service mesh managed by ECS, without Kubernetes or App Mesh complexity.

👉 Service-to-service communication: Cloud Map eliminates hardcoded IPs entirely. Example: your orders service calls http://payments.production.local:8080/charge and DNS resolves to the current task IP. No load balancer needed for internal calls. For exam: if the question says "internal service communication without ALB" → Service Discovery via Cloud Map.
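From the caller's side, DNS-based discovery is an ordinary name lookup. The sketch below resolves a Cloud Map name to its A records and picks one task IP at random; the function names are illustrative, and the resolver is injectable so the logic can be exercised outside a VPC.

```python
import random
import socket


def resolve_task_ips(hostname, port, resolver=socket.getaddrinfo):
    """Return the distinct IPv4 addresses registered for a service name."""
    infos = resolver(hostname, port, socket.AF_INET, socket.SOCK_STREAM)
    # Each entry is (family, type, proto, canonname, (ip, port)).
    return sorted({info[4][0] for info in infos})


def pick_endpoint(hostname, port, resolver=socket.getaddrinfo):
    """Choose one task at random: simple client-side load balancing."""
    ips = resolve_task_ips(hostname, port, resolver)
    if not ips:
        raise LookupError(f"no tasks registered for {hostname}")
    return f"http://{random.choice(ips)}:{port}"


if __name__ == "__main__":
    # Stand-in resolver returning two task IPs, as Cloud Map would inside the VPC.
    def fake_resolver(host, port, family, socktype):
        return [(family, socktype, 6, "", ("10.0.1.15", port)),
                (family, socktype, 6, "", ("10.0.1.22", port))]

    print(pick_endpoint("payments.production.local", 8080, fake_resolver))
```

In practice most HTTP clients do this lookup for you; the point is that the caller only ever knows the name, and DNS caching determines how quickly it notices replaced tasks.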

Concept Diagram: awsvpc Network Mode Introductory
awsvpc Mode: Each Task Gets Its Own ENI and Private IP

VPC: 10.0.0.0/16 · Private Subnet
  • Task 1 (web-api): ENI eni-abc123 · IP 10.0.1.15 · SG sg-web (inbound :8080 from ALB) · container port 8080
  • Task 2 (web-api): ENI eni-def456 · IP 10.0.1.22 · SG sg-web (same rules, own ENI) · container port 8080 ← no conflict!
  • Task 3 (worker): ENI eni-ghi789 · IP 10.0.2.38 · SG sg-worker (different rules!) · no inbound, outbound to the SQS queue only
  • ALB → target type: ip, routes to 10.0.1.15:8080 + 10.0.1.22:8080

Key: each task = own ENI + own IP + own security group. No port conflicts (both web tasks use :8080 on different IPs). Different services get different SGs.
Storage Options Core

ECS containers need storage for application data, temp files, shared state, and logs. The options depend heavily on your launch type:

Storage Type | Persistence | EC2 | Fargate | Shared Across Tasks | Best For
Ephemeral (container layer) | Deleted on task stop | ✅ | ✅ (20-200GB) | ❌ | Temp files, caches, scratch space
EBS Volume | Persists beyond task lifecycle | ✅ | ❌ | ❌ (one AZ only) | Database data, stateful single-task workloads
EFS (Elastic File System) | Persistent, durable | ✅ | ✅ | ✅ (multi-AZ, multi-task) | Shared config, ML models, CMS uploads
Instance Store (NVMe) | Lost on instance stop/terminate | ✅ | ❌ | ❌ | High-IOPS scratch (ML training, video encode)
Docker Volumes | Depends on driver | ✅ | ❌ | Between containers in same task | Sidecar data sharing within a task
EFS: Shared Storage for ECS In-Depth

Amazon EFS is the most important storage integration for ECS because it works with both EC2 and Fargate and supports concurrent access from multiple tasks across multiple AZs. Mount an EFS file system in your task definition, and every task gets read/write access to the same files, no matter which AZ it runs in.

📂

When to Use EFS

  • Shared configuration files across multiple tasks
  • ML model files (load once, serve from many tasks)
  • CMS file uploads (WordPress media, user uploads)
  • Log aggregation (multiple writers, one reader)
  • Any workload needing shared persistent storage on Fargate
⚠️

Gotchas

  • Latency: EFS is network-attached, so latency is higher than local SSD
  • Throughput: scales with data stored (or use provisioned throughput)
  • Cost: $0.30/GB-month (standard). Use Infrequent Access for cold data
  • Security: must configure SG to allow NFS (port 2049) from task SG
  • IAM auth: use EFS access points for per-task directory isolation

👉 Fargate ephemeral storage: each Fargate task gets 20GB of ephemeral storage by default (stored on the micro-VM's local disk). You can configure up to 200GB in the task definition. This data is fast (local NVMe) but deleted when the task stops. Use it for temp files, build artifacts, or caching, not for anything you need to persist.
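Both storage settings live in the task definition. The sketch below assembles the relevant fragment as a plain dictionary using the task definition field names (`volumes`, `efsVolumeConfiguration`, `mountPoints`, `ephemeralStorage`); the helper name, the volume name, the IDs, and the mount path are placeholders, and the 21 GiB lower bound reflects the assumption that the size is only specified when raising it above the 20 GB default.

```python
def task_storage_fragment(file_system_id, access_point_id, ephemeral_gib=None):
    """Build the storage-related parts of a Fargate task definition."""
    fragment = {
        "volumes": [{
            "name": "shared-uploads",  # placeholder volume name
            "efsVolumeConfiguration": {
                "fileSystemId": file_system_id,
                "transitEncryption": "ENABLED",  # required when using an access point
                "authorizationConfig": {
                    "accessPointId": access_point_id,
                    "iam": "ENABLED",
                },
            },
        }],
        # mountPoints belongs inside a container definition; shown at
        # top level here only to keep the sketch short.
        "mountPoints": [{
            "sourceVolume": "shared-uploads",
            "containerPath": "/mnt/shared/uploads",
            "readOnly": False,
        }],
    }
    if ephemeral_gib is not None:
        if not 21 <= ephemeral_gib <= 200:
            raise ValueError("ephemeral storage must be between 21 and 200 GiB")
        fragment["ephemeralStorage"] = {"sizeInGiB": ephemeral_gib}
    return fragment


if __name__ == "__main__":
    frag = task_storage_fragment("fs-0123456789abcdef0", "fsap-0123456789abcdef0", 100)
    print(sorted(frag))
```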

AWS Diagram: ECS Service with ALB + Service Discovery Core
ECS Networking: ALB for External Traffic + Cloud Map for Internal

  • 🌐 External clients → ALB :443 → VPC · Private Subnets
  • Service: web-api (Fargate): Task 1 :8080, Task 2 :8080 · DNS: web-api.prod.local
  • Service: order-svc (Fargate): Task 1 :3000, Task 2 :3000 · DNS: order-svc.prod.local · called over HTTP
  • Service: worker (EC2): Task 1, Task 2 · DNS: worker.prod.local
  • AWS Cloud Map: namespace prod.local · auto-registers A records per task
  • Backing stores: RDS (port 5432), EFS (shared files)
  • Legend: ALB = external traffic · Cloud Map DNS = internal service-to-service · dashed lines = auto-registration
Architecture Diagram: Web Tier + API Tier + Shared EFS In-Depth
Multi-Tier Architecture with Shared EFS Storage

  • 🌐 ALB → /api/* and /uploads/*
  • AZ-a: API Task (reads/writes to EFS at /mnt/shared/uploads) · Processor Task (reads uploads from EFS, generates thumbnails)
  • AZ-b: API Task (same EFS mount, /mnt/shared/uploads) · Processor Task (same EFS mount, generates thumbnails)
  • Amazon EFS: Multi-AZ · Shared · Persistent · /uploads · /models · /config
  • API tasks write uploads to EFS; Processor tasks read from the same EFS
  • All 4 tasks share one EFS filesystem across 2 AZs

This pattern is common for file processing: API tasks accept uploads and write to EFS, processor tasks read from EFS and generate thumbnails or transcodes. EFS is shared across all tasks and all AZs, so there is no need to copy files between tasks or use S3 as an intermediary for simple file sharing.

🎓 Exam Tips: Chapter 04
  • awsvpc = required for Fargate. If using Fargate, awsvpc is the only networking mode. If the exam says "bridge mode" + Fargate, that's impossible.
  • ALB target type must be ip for Fargate. Not instance. This is a common configuration error tested in exams.
  • "Need shared storage across Fargate tasks" → EFS. It's the only persistent shared storage that works with Fargate.
  • "Need persistent block storage" → EBS, which means EC2 launch type only. Fargate ephemeral storage is deleted on stop.
  • "Tasks can't communicate with each other" → Check security groups. In awsvpc mode, each task has its own SG. The SG must allow the needed ports.
  • ENI limits on EC2: each awsvpc task uses one ENI. Small instances (t3.micro) may only support 1 task. Enable ENI trunking for more.
  • Service Discovery vs ALB: Use ALB for external-facing traffic. Use Cloud Map/Service Connect for internal service-to-service calls.
  • Fargate ephemeral storage: 20GB default, configurable up to 200GB. Fast (local NVMe) but non-persistent.
  • EFS security: task SG must allow outbound to port 2049 (NFS). EFS SG must allow inbound port 2049 from task SG.
📋 Chapter 4: Summary
  • awsvpc: production standard. Each task gets own ENI + private IP + security group. Required for Fargate.
  • Security groups: applied per task (not per instance). Reference by SG ID, not IP ranges.
  • ALB: target type must be ip for Fargate/awsvpc. Auto-registers task IPs in target group.
  • Service Discovery: Cloud Map provides DNS records per task. Service Connect adds Envoy proxy for retries/metrics.
  • EFS: shared persistent storage across tasks and AZs. Works with both EC2 and Fargate.
  • EBS: persistent block storage, EC2 only. Single-AZ. For stateful single-instance workloads.
  • Fargate ephemeral: 20-200GB, fast NVMe, deleted on task stop. Great for temp/scratch data.
  • ENI trunking: opt-in to run more awsvpc tasks per EC2 instance by sharing trunk ENI.
05
Chapter Five

Capacity Providers

What Is a Capacity Provider Introductory

A capacity provider is the bridge between your ECS tasks and the infrastructure they run on. It answers a simple question: "When ECS needs to launch a new task, where does the compute come from?" Without capacity providers you must manually ensure enough EC2 instances exist. With them, ECS automatically provisions capacity, either by scaling an Auto Scaling Group (EC2) or by simply requesting Fargate resources from AWS.

☁️

FARGATE

Built-in. AWS provisions compute per task. No configuration needed; always available by default.

💰

FARGATE_SPOT

Built-in. Same as Fargate but uses spare capacity at up to 70% discount. Tasks can be interrupted with 2-minute warning.

🖥️

ASG Capacity Provider

Links an Auto Scaling Group to ECS. When tasks need capacity, ECS tells the ASG to scale out. You manage the instance fleet.

Fargate + Fargate Spot Core

The FARGATE and FARGATE_SPOT capacity providers are built into ECS; you don't create them. They are available on every cluster. Fargate is the default: every task you launch on Fargate uses this provider unless you configure otherwise.

✅

Fargate (On-Demand)

  • Always available: AWS provisions capacity on demand
  • No interruptions: task runs until it exits or you stop it
  • Full per-second billing for vCPU + memory
  • Use for: production web services, customer-facing APIs
💰

Fargate Spot

  • Up to 70% cheaper than on-demand Fargate
  • Uses spare AWS capacity, which can be reclaimed anytime
  • 2-minute warning, delivered as SIGTERM, before the task is terminated
  • ECS service replaces interrupted tasks on Fargate Spot when capacity is available; it does not fall back to on-demand by itself
  • Use for: batch jobs, queue workers, data processing

👉 Fargate Spot interruption handling: When AWS reclaims your Spot task, ECS sends SIGTERM, waits for the container's stopTimeout (30 seconds by default, configurable up to 120 seconds on Fargate), and then sends SIGKILL. Set stopTimeout to 120 to use the full 2-minute warning. Your app should handle SIGTERM gracefully (finish current work, checkpoint state). The ECS service keeps working toward the desired count, but Spot tasks are not replaced with on-demand Fargate tasks. If you need guaranteed capacity, put a base on FARGATE in the capacity provider strategy.

🐍

Python: SIGTERM Handler

import signal
import sys

def graceful_shutdown(signum, frame):
    """Runs when ECS sends SIGTERM (task stop or Spot interruption)."""
    print("SIGTERM received, finishing work...")
    # flush queues, save checkpoint, close DB
    sys.exit(0)

# Register the handler before starting the main loop.
signal.signal(signal.SIGTERM, graceful_shutdown)
🟢

Node.js: SIGTERM Handler

// server, flushQueues and db are your application's own objects.
process.on('SIGTERM', async () => {
  console.log('SIGTERM received, draining...');
  server.close(); // stop accepting new requests
  await flushQueues();
  await db.close();
  process.exit(0);
});
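For queue workers, exiting inside the handler can cut a job off halfway. An alternative sketch, assuming jobs are short enough to finish within the stop timeout, sets a flag in the handler and lets the loop stop between jobs. `run_worker` and `handle_job` are illustrative names, not part of any ECS SDK.

```python
import signal
import threading

# Set by the signal handler, checked by the worker loop between jobs.
stop_requested = threading.Event()


def request_shutdown(signum, frame):
    """SIGTERM handler: only record the request, do no work here."""
    stop_requested.set()


def run_worker(jobs, handle_job):
    """Process jobs in order until done or until shutdown is requested.

    Returns the jobs that were not started, so the caller can
    re-queue them (for example, back to SQS).
    """
    pending = list(jobs)
    while pending and not stop_requested.is_set():
        handle_job(pending.pop(0))
    return pending


if __name__ == "__main__":
    signal.signal(signal.SIGTERM, request_shutdown)
    leftover = run_worker(range(3), lambda job: print("processed", job))
    print("left to re-queue:", leftover)
```

The job in flight always completes; only jobs that were never started are handed back, which matches the "checkpoint and re-queue" approach used for Spot workloads.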
ASG Capacity Provider (EC2 Launch Type) In-Depth

For the EC2 launch type, capacity providers link your ECS cluster to an Auto Scaling Group. This enables Cluster Auto Scaling (CAS): when ECS needs to place tasks but no instance has enough room, CAS triggers the ASG to launch new instances. When instances are underutilized, CAS scales them in. This eliminates manual capacity planning.

⚙️

How CAS Works

  • 1. ECS receives a task placement request
  • 2. No instance has enough CPU/memory available
  • 3. CAS calculates how many instances are needed
  • 4. CAS sets the ASG's desired count → ASG launches instances
  • 5. New instances register with ECS → tasks placed
  • Scale-in: ECS drains tasks first → then terminates instance
📊

Configuration

  • Target capacity %: how full instances should be
  • 100% = pack instances fully (maximize density)
  • 80% = leave 20% headroom for burst (faster placement)
  • Managed scaling: on/off toggle for CAS
  • Managed termination protection: prevents ASG from terminating instances that still have running tasks

The target capacity % is the most important CAS parameter. Set it to 100% for maximum cost efficiency (every instance fully packed, but new tasks wait for scale-out). Set it to 70-80% for responsiveness (headroom means tasks place instantly, but you pay for idle capacity).
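The effect of the target can be sketched with the reservation metric CAS tracks, which compares the instances needed to run all tasks with the instances currently running. This is a simplified steady-state model: real CAS also applies scaling step limits and warm-up delays, and the function names here are illustrative.

```python
def capacity_provider_reservation(instances_needed: int, instances_running: int) -> float:
    """Reservation percentage: 100 means the fleet is exactly the size needed."""
    return 100 * instances_needed / instances_running


def steady_state_instances(instances_needed: int, target_capacity_pct: int) -> int:
    """Fleet size at which the reservation settles on the target."""
    if not 1 <= target_capacity_pct <= 100:
        raise ValueError("target capacity must be between 1 and 100")
    # Ceiling division without floats.
    return -(-instances_needed * 100 // target_capacity_pct)


if __name__ == "__main__":
    for target in (100, 80):
        fleet = steady_state_instances(8, target)
        print(f"target {target}%: {fleet} instances, "
              f"reservation {capacity_provider_reservation(8, fleet):.0f}%")
```

With tasks that need 8 instances, a 100% target keeps exactly 8 running, while an 80% target keeps 10 so that two instances' worth of headroom is always free.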

Capacity Provider Strategy In-Depth

A capacity provider strategy defines how tasks are distributed across multiple capacity providers. You assign weights and an optional base count per provider. This is how you build hybrid workloads, for example: "run 2 tasks on on-demand Fargate as baseline, then spread additional tasks 80% to Fargate Spot and 20% to on-demand."

Strategy Example | Provider | Base | Weight | Behavior
Cost-optimized batch | FARGATE_SPOT | 0 | 4 | 80% of tasks → Spot (cheap)
Cost-optimized batch | FARGATE | 0 | 1 | 20% on-demand (fallback)
HA web service | FARGATE | 2 | 1 | Always 2 on-demand tasks (base)
HA web service | FARGATE_SPOT | 0 | 3 | Extra tasks 75% Spot (save $)
Hybrid EC2 + overflow | ASG (EC2 RI) | 0 | 3 | 75% on Reserved EC2 instances
Hybrid EC2 + overflow | FARGATE | 0 | 1 | 25% overflow to Fargate (burst)

The base count guarantees a minimum number of tasks on that provider (placed first, before weights apply). After base is filled, additional tasks distribute according to the weight ratio. This gives you predictable baseline capacity with elastic overflow. One constraint to keep in mind: a single strategy can list either Fargate providers or ASG providers, not both, so the hybrid EC2 + Fargate split is built from separate services (each with its own strategy) in the same cluster.
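The base-then-weight rule can be worked through in a few lines. This is a sketch of the arithmetic only: the function name is illustrative, and the largest-remainder rounding for counts that do not divide evenly is an assumption, not documented ECS behavior.

```python
def distribute_tasks(desired, strategy):
    """Split a desired count across providers.

    strategy is a list of (provider, base, weight) tuples. Base counts
    are filled first, then the rest is split by weight.
    """
    counts = {name: 0 for name, _, _ in strategy}
    remaining = desired
    for name, base, _ in strategy:
        placed = min(base, remaining)
        counts[name] += placed
        remaining -= placed

    total_weight = sum(weight for _, _, weight in strategy)
    if remaining == 0 or total_weight == 0:
        return counts

    remainders = []
    for name, _, weight in strategy:
        whole, rest = divmod(remaining * weight, total_weight)
        counts[name] += whole
        remainders.append((rest, name))

    # Assumed tie-break: hand leftover tasks to the largest remainders.
    leftover = desired - sum(counts.values())
    for _, name in sorted(remainders, reverse=True)[:leftover]:
        counts[name] += 1
    return counts


if __name__ == "__main__":
    ha_web = [("FARGATE", 2, 1), ("FARGATE_SPOT", 0, 3)]
    print(distribute_tasks(10, ha_web))
```

For the HA web service row at a desired count of 10, the 2 base tasks go to FARGATE and the remaining 8 split 1:3, giving 4 on-demand and 6 Spot tasks.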

Concept Diagram: Capacity Provider as Bridge Introductory
Capacity Providers: Bridge Between ECS Tasks and Infrastructure

ECS Service (desired: 10). Capacity provider strategy: base=2 on FARGATE, weight 3:1 (FARGATE_SPOT : FARGATE)
  • FARGATE: base 2 tasks guaranteed, weight 1 (25% of extra) → 4 tasks (2 base + 2 extra)
  • FARGATE_SPOT: base 0, weight 3 (75% of extra) → 6 tasks (all from extra pool)
  • ASG (EC2): not used in this strategy, available if configured → 0 tasks

Result: 4 on-demand + 6 Spot = 10 tasks. Baseline HA + up to 70% cost saving on 6 tasks.
AWS Diagram β€” EC2 Cluster with ASG Capacity Provider Core
Cluster Auto Scaling β€” ASG Capacity Provider + Fargate Overflow
ECS Cluster: production Service: api-server Β· desired: 8 Β· Strategy: weight 3 ASG-CP, weight 1 FARGATE ASG Capacity Provider (target: 80%) m5.large (RI) Task 1 Task 2 80% utilized βœ“ m5.large (RI) Task 3 Task 4 80% utilized βœ“ m5.large (new!) Task 5 Task 6 ← CAS scaled out ASG 6 tasks on EC2 (75% of total per weight=3) Fargate (overflow) Task 7 Task 8 No instances needed Instant capacity 2 tasks on Fargate (25% per weight=1) β–  ASG Capacity Provider: CAS scales EC2 instances automatically β–  Fargate: instant overflow, no capacity planning Strategy weight 3:1 β†’ 75% on EC2 (cheap RI) + 25% on Fargate (burst). CAS maintains 80% target utilization.
Architecture Diagram: Spot-Heavy Batch with Fargate Overflow In-Depth
Batch Processing: EC2 Spot + Fargate Spot + On-Demand Fallback
[Diagram: EventBridge (hourly trigger) and an SQS queue feed the data-pipeline workload (desired: 20). Target split: base=2 on FARGATE, then weights 4 (EC2 Spot ASG) : 2 (FARGATE_SPOT) : 1 (FARGATE) across the remaining 18 tasks.
  • EC2 Spot ASG (weight 4, about 57% of the extra tasks): c5.2xlarge Spot instances added by CAS, up to 90% off on-demand, about 10 tasks. Interrupted tasks retry via SQS; managed termination protection is on and tasks drain before an instance stops.
  • Fargate Spot (weight 2, about 29%): 2 vCPU / 4 GB per task, AWS manages the infrastructure, up to 70% off, about 5 tasks. Tasks get a 2-minute interruption warning, then checkpoint and re-queue to SQS.
  • Fargate on-demand (base 2, weight 1, about 14%): $0.04/vCPU-hr, 2 base + about 3 extra = about 5 tasks. Never interrupted; guarantees minimum throughput.
Total: 20 tasks, roughly 75% on Spot (EC2 + Fargate). Estimated cost: about 60% less than all on-demand Fargate. Note: ASG and Fargate providers cannot share one capacity provider strategy, so the EC2 Spot share runs as its own service.]

This pattern maximizes cost savings for batch workloads: EC2 Spot gives the deepest discount (up to 90%), Fargate Spot adds overflow without managing instances, and 2 on-demand Fargate tasks guarantee a minimum processing rate even during Spot capacity shortages.

🎓 Exam Tips: Chapter 05
  • FARGATE and FARGATE_SPOT are built-in. You don't create them; they are available to every cluster.
  • "Reduce cost for batch processing on ECS" → Fargate Spot (up to 70% off) or EC2 Spot ASG capacity provider (up to 90% off).
  • "Ensure minimum availability while minimizing cost" → Capacity provider strategy with base on FARGATE (guaranteed) and weight on FARGATE_SPOT (cheap excess).
  • Cluster Auto Scaling (CAS) only works with the EC2 launch type via an ASG capacity provider. Fargate doesn't need CAS because AWS handles capacity.
  • Target capacity % = 100% means "pack instances fully before scaling out." 80% means "keep headroom for faster placement."
  • Managed termination protection prevents the ASG from terminating instances that still have running ECS tasks during scale-in. Always enable this.
  • Fargate Spot interruption: 2-minute warning with SIGTERM, then SIGKILL once the container's stopTimeout expires (default 30 s, maximum 120 s). The service scheduler launches replacements on Fargate Spot when capacity is available; it does not fall back to on-demand on its own. Design for graceful shutdown.
  • Distractor: "Fargate Spot is the same as EC2 Spot": no. Fargate Spot discount is ~70%, EC2 Spot can reach ~90%. EC2 Spot also has diversified instance fleets for better availability.
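Graceful shutdown on a Spot interruption comes down to reacting to SIGTERM before the container is killed. The following is a minimal sketch, assuming a worker that processes a list of items; `process_batch` and `handle_item` are hypothetical names, and real checkpointing (for example, re-queueing to SQS) is left to the caller.

```python
import signal
import threading

stop_requested = threading.Event()


def handle_sigterm(signum, frame):
    # ECS sends SIGTERM when the task is being stopped, including on a
    # Fargate Spot interruption. Only record the request here.
    stop_requested.set()


# Install the handler (must run in the main thread of the process).
signal.signal(signal.SIGTERM, handle_sigterm)


def process_batch(items, handle_item):
    """Process items until finished or until shutdown is requested.

    Returns the items that were NOT processed so the caller can
    checkpoint or re-queue them before the container is killed.
    """
    pending = list(items)
    while pending and not stop_requested.is_set():
        handle_item(pending.pop(0))
    return pending
```

The loop checks the flag between items, so each item should take well under the stop timeout to finish.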
📋 Chapter 5: Summary
  • Capacity providers: bridge between ECS and compute. Fargate (on-demand) · Fargate Spot (70% off) · ASG (EC2, managed by CAS).
  • Capacity provider strategy: base (guaranteed) + weight (ratio). Distribute tasks across providers for cost/HA balance.
  • Cluster Auto Scaling (CAS): auto-scales EC2 instances based on task demand. Target capacity % controls utilization.
  • Fargate Spot: up to 70% cheaper. 2-minute warning with SIGTERM before termination. Service replaces interrupted tasks when Spot capacity is available.
  • EC2 Spot via ASG: up to 90% cheaper. CAS manages the ASG. Managed termination protection keeps scale-in from terminating instances that still run tasks.
  • Hybrid pattern: baseline on-demand (guaranteed) + Spot overflow (cheap). Best cost-to-availability ratio for batch.
06
Chapter Six

Scaling & Deployment

Service Auto Scaling Core

ECS Service Auto Scaling adjusts the desired task count automatically based on CloudWatch metrics. It uses Application Auto Scaling, the same system that scales DynamoDB tables and Aurora replicas. You define a target value (e.g., "keep average CPU at 70%"), and the system adds or removes tasks to maintain it.

🎯

Target Tracking

  • Set target: "Average CPU = 70%"
  • System auto-creates CloudWatch alarms
  • Scales out when above, in when below
  • Simplest, most common approach
  • Supported metrics: CPU, Memory, ALB request count
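A target tracking policy is mostly configuration. The sketch below builds the arguments that Application Auto Scaling's `put_scaling_policy` call accepts; the cluster and service names and the cooldown values are placeholders, and the service is assumed to be registered as a scalable target (with minimum and maximum capacity) beforehand.

```python
def target_tracking_policy(cluster, service, target_cpu=70.0):
    """Build arguments for a CPU target tracking policy on an ECS service."""
    return {
        "PolicyName": f"{service}-cpu-target-tracking",
        "ServiceNamespace": "ecs",
        "ResourceId": f"service/{cluster}/{service}",
        "ScalableDimension": "ecs:service:DesiredCount",
        "PolicyType": "TargetTrackingScaling",
        "TargetTrackingScalingPolicyConfiguration": {
            "TargetValue": target_cpu,
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "ECSServiceAverageCPUUtilization",
            },
            # Placeholder cooldowns: react quickly, shrink slowly.
            "ScaleOutCooldown": 60,
            "ScaleInCooldown": 300,
        },
    }


policy = target_tracking_policy("production", "api-server")
# With boto3 installed and credentials configured, this could be applied as:
#   boto3.client("application-autoscaling").put_scaling_policy(**policy)
```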
📐

Step Scaling

  • Define steps: "CPU 70-80% → add 1, 80-90% → add 3, 90%+ → add 5"
  • More control over scaling aggressiveness
  • Requires manual CloudWatch alarm setup
  • Good for: bursty workloads needing fast scale-out
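The step example above can be read as a lookup from metric value to adjustment. This sketch only illustrates that mapping: `tasks_to_add` is a hypothetical helper, the boundary handling (lower bound inclusive) is an assumption, and a real step scaling policy expresses its bounds as offsets from the CloudWatch alarm threshold.

```python
# (lower bound inclusive, upper bound exclusive, tasks to add).
# None means "no upper bound". Values mirror the example steps above.
STEPS = [(70, 80, 1), (80, 90, 3), (90, None, 5)]


def tasks_to_add(cpu_percent, steps=STEPS):
    """Return how many tasks the matching step would add, or 0 if none."""
    for lower, upper, adjustment in steps:
        if cpu_percent >= lower and (upper is None or cpu_percent < upper):
            return adjustment
    return 0
```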
📅

Scheduled Scaling

  • "Scale to 20 tasks at 9am, back to 5 at 6pm"
  • Cron-based, predictable patterns
  • Use with target tracking (scheduled sets min, TT adjusts within range)
  • Good for: known traffic patterns (business hours, events)
| Scaling Metric | Target Suggestion | When to Use |
|---|---|---|
| ECSServiceAverageCPUUtilization | 60-75% | CPU-bound workloads (computation, encoding) |
| ECSServiceAverageMemoryUtilization | 70-80% | Memory-bound (caching, JVM, data processing) |
| ALBRequestCountPerTarget | 1000 req/target | Request-driven APIs (scale per request volume) |
| Custom CloudWatch metric | Varies | SQS queue depth, business metric, latency P99 |

👉 Scale on ALBRequestCountPerTarget for web APIs, not CPU. Web APIs are often I/O-bound: CPU stays low while request count climbs. If you scale on CPU alone, you can end up under-provisioned, with requests queuing and latency spiking before the CPU target is breached. ALBRequestCountPerTarget tracks actual request volume, which correlates more directly with user experience.

Deployment Strategies Core

When you update a service (new image version, config change), ECS must replace old tasks with new ones. How it does this determines whether your users experience downtime, mixed versions, or seamless updates.

🔄

Rolling Update (default)

  • ECS launches new tasks → waits for health check → drains old tasks
  • Controlled by minimumHealthyPercent and maximumPercent
  • minimumHealthyPercent: 100 = never go below desired count (add new before removing old)
  • maximumPercent: 200 = can double task count temporarily during deploy
  • No additional cost (uses ECS built-in controller)
  • Rollback: manual (deploy previous revision)
🔵🟢

Blue/Green (CodeDeploy)

  • Two target groups: blue (current) and green (new)
  • CodeDeploy shifts traffic: 100% blue → 100% green
  • Options: all-at-once, linear (10% every 5min), canary (10% → 100%)
  • Instant rollback: shift traffic back to blue
  • Both old and new tasks run simultaneously
  • Requires ALB with two target groups + CodeDeploy setup
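The difference between canary and linear shifting is easiest to see as a schedule of (minute, percent of traffic on green). This is only an illustration of the two shapes under assumed parameters; `canary_schedule` and `linear_schedule` are hypothetical helpers, not part of CodeDeploy.

```python
def canary_schedule(first_percent, wait_minutes):
    """Two steps: a small slice first, then everything."""
    return [(0, first_percent), (wait_minutes, 100)]


def linear_schedule(step_percent, interval_minutes):
    """Equal increments at a fixed interval until 100% is reached."""
    schedule = []
    minute, percent = 0, 0
    while percent < 100:
        percent = min(100, percent + step_percent)
        schedule.append((minute, percent))
        minute += interval_minutes
    return schedule


print(canary_schedule(10, 5))  # [(0, 10), (5, 100)]
```

A linear shift of 10% every 5 minutes takes ten steps and reaches 100% at minute 45, which gives alarms far more time to catch a regression than a canary does.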

👉 Blue/Green = zero downtime with instant rollback: In a rolling update, old tasks are replaced batch by batch, so rolling back means running another deployment. Blue/Green keeps the full blue fleet running until green is validated, so rollback is just shifting the ALB listener back to blue. For the exam: if a question requires "zero-downtime deployment with instant rollback" → Blue/Green with CodeDeploy. If it says "simplest deployment" → Rolling Update (default, no extra setup).

Rolling Update Parameters In-Depth
| minimumHealthyPercent | maximumPercent | Behavior | Best For |
|---|---|---|---|
| 100% | 200% | Launch new tasks first, then drain old. Never below desired count. Temporarily doubles cost. | Production services needing zero-downtime |
| 50% | 100% | Stop half the old tasks, then start new ones. Brief capacity reduction. | Cost-sensitive, can tolerate brief capacity dip |
| 0% | 100% | Stop ALL old tasks, then start new. Full downtime during deploy. | Dev/staging only. Never production. |
| 100% | 150% | Launch 50% new tasks, drain some old, repeat. Moderate overhead. | Balance between speed and cost |
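The two percentages translate into task-count limits for a given desired count. A minimal sketch of that arithmetic, assuming the documented rounding (the lower limit rounds up, the upper limit rounds down); `deployment_bounds` is a hypothetical helper name.

```python
def deployment_bounds(desired_count, minimum_healthy_percent, maximum_percent):
    """Return (min tasks that must stay running, max tasks allowed in total)."""
    # Ceiling division for the lower limit, floor division for the upper.
    min_running = -(-desired_count * minimum_healthy_percent // 100)
    max_total = desired_count * maximum_percent // 100
    return min_running, max_total


print(deployment_bounds(4, 100, 200))  # (4, 8)
print(deployment_bounds(4, 50, 100))   # (2, 4)
```

Note that when both limits equal the desired count (for example 100%/100%), there is no room to start or stop anything, so the deployment cannot make progress.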

👉 Full-stop deployments: a 0%/100% configuration stops every old task before starting new ones, so it always causes downtime, whatever the launch type. For zero-downtime deploys on Fargate, use 100%/200% or Blue/Green with CodeDeploy.

Deployment Circuit Breaker Core

The deployment circuit breaker automatically detects when a deployment is failing (new tasks keep crashing) and rolls back to the previous stable version. Without it, a bad deployment loops endlessly: ECS launches new task → task crashes → ECS launches another → crashes → repeat forever, burning compute.

🛡️

How It Works

  • ECS monitors new tasks during deployment
  • If tasks repeatedly fail to reach RUNNING state...
  • Circuit breaker triggers: stops launching new tasks
  • If rollback: true → automatically reverts to last stable
  • Based on failure threshold (number of consecutive task failures)
⚙️

Configuration

  • Enable: deploymentCircuitBreaker: {enable: true, rollback: true}
  • Works only with the ECS rolling update deployment type. CodeDeploy blue/green uses its own alarm-based automatic rollback instead.
  • Always enable for production. Default is disabled.
  • Failure reasons detected: OOM, crash loop, health check failure
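The setting above lives in the service's deployment configuration. A minimal sketch of that structure for a service that uses the ECS rolling update controller; the 100/200 percentages are just the zero-downtime values from this chapter, and `deployment_configuration` is a hypothetical helper.

```python
def deployment_configuration(rollback=True):
    """Deployment settings with the circuit breaker switched on."""
    return {
        "deploymentCircuitBreaker": {"enable": True, "rollback": rollback},
        "minimumHealthyPercent": 100,
        "maximumPercent": 200,
    }


config = deployment_configuration()
# With boto3 installed and credentials configured, this could be applied as:
#   boto3.client("ecs").update_service(
#       cluster="production", service="api-server",
#       deploymentConfiguration=config)
```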

👉 Always enable deployment circuit breaker with rollback in production. Without it, a bad image tag or misconfigured environment variable causes endless task restarts, and your service degrades while ECS keeps trying to deploy the broken version. Circuit breaker + rollback stops the deployment after a bounded number of failed task launches and reverts automatically.

Concept Diagram: Rolling Update Stages Introductory
Rolling Update: minimumHealthyPercent=100%, maximumPercent=200%
[Diagram: four stages of a rolling update for a service with desired count 4.
  • Stage 1 (Before): 4 v1 tasks, 100% healthy.
  • Stage 2 (New launching): 4 v1 tasks + 4 v2 tasks starting up (8 total, the 200% maximum).
  • Stage 3 (New healthy): v2 tasks pass health checks; v1 tasks are deregistered while the ALB drains their connections.
  • Stage 4 (Complete): 4 v2 tasks, deploy complete.
Zero downtime: old tasks stay until new tasks pass health checks, and ALB connection draining lets in-flight requests complete before old tasks stop.]
AWS Diagram: Service Auto Scaling with CloudWatch Core
Service Auto Scaling: Target Tracking on ALB Request Count
[Diagram: The ALB receives 1,500 requests spread over api-service (3 tasks), so ALBRequestCountPerTarget is 500 against a target of 300. CloudWatch reports the metric, the target tracking policy's alarm fires, and Application Auto Scaling scales the service out from 3 to 5 tasks. With 5 tasks the metric drops to 300 per target, back at the target, and scaling is complete.
Flow: traffic rises → CloudWatch metric breaches the target → Auto Scaling adds tasks → load rebalances. Cooldown: 300 s (default) between scale actions to avoid oscillation.]
Architecture Diagram: Blue/Green Deployment In-Depth
Blue/Green Deployment: CodeDeploy Shifts Traffic Between Target Groups
[Diagram: Users reach an ALB with a production listener (:443) and a test listener (:8443). The BLUE target group (v1, 4 tasks) currently receives 100% of production traffic; the GREEN target group (v2, 4 tasks) receives test traffic only. AWS CodeDeploy drives the shift: 1. deploy v2 tasks → 2. register them in the green target group → 3. shift traffic → 4. monitor → 5. complete or roll back. Canary example: 10% to green for 5 minutes, then 100% if healthy; if unhealthy, roll back by switching the ALB back to the blue target group (seconds).]
🎓 Exam Tips: Chapter 06
  • "Zero downtime deployment" → Rolling update with minimumHealthyPercent=100%, maximumPercent=200%, or Blue/Green with CodeDeploy.
  • "Automatically rollback failed deployments" → Deployment circuit breaker with rollback=true (rolling update), or CodeDeploy blue/green with an automatic rollback alarm.
  • "Scale based on SQS queue depth" → Scale on the SQS metric ApproximateNumberOfMessagesVisible (or a backlog-per-task metric derived from it), using step scaling or target tracking with a customized metric.
  • Service Auto Scaling ≠ Cluster Auto Scaling. Service AS changes task count. Cluster AS (CAS) changes EC2 instance count. They work together but are separate.
  • Target tracking is "set and forget." You specify the target value; AWS creates and manages the CloudWatch alarms. Step scaling requires you to create alarms manually.
  • Blue/Green with CodeDeploy requires a load balancer with two target groups. ALB is the usual choice; NLB is supported but less common. You cannot do blue/green without a load balancer.
  • "Gradual traffic shift" → CodeDeploy canary or linear deployment. Not possible with ECS rolling update (which is binary per task).
  • Cooldown period: default 300s between scaling actions. Too short = oscillation (scale up/down/up/down). Too long = slow reaction.
  • Distractor: "ECS rolling update supports canary deployment": false. Canary requires CodeDeploy blue/green with traffic shifting.
📋 Chapter 6: Summary
  • Service Auto Scaling: target tracking (CPU, memory, ALB requests, custom) adjusts task count automatically.
  • Scale on ALBRequestCountPerTarget for APIs, not CPU. Request volume correlates better with user experience.
  • Rolling update: minHealthy=100%, max=200% → zero downtime. New tasks must pass health check before old ones drain.
  • Blue/Green (CodeDeploy): two target groups. Traffic shift: all-at-once, linear, or canary. Instant rollback to blue.
  • Circuit breaker: detects failed deployments, auto-reverts. Always enable in production.
  • Scheduled scaling: predictable patterns (business hours). Combine with target tracking for best results.
07
Chapter Seven

Integrations

Amazon ECR: Container Registry Core

Amazon ECR (Elastic Container Registry) is a fully managed Docker container image registry. It stores and manages your container images. ECS pulls images from ECR during task launch; this is the standard production pattern. ECR integrates with IAM for access control, encrypts images at rest, and scans for known vulnerabilities.

📦

Key Features

  • Private repositories: IAM-based access, no public exposure
  • Image scanning: Basic scanning (free, on push, Clair-based CVE detection). Enhanced scanning (uses Amazon Inspector, continuous, per-image cost)
  • Lifecycle policies: auto-delete untagged/old images (save cost)
  • Cross-region replication: replicate images to multiple regions
  • Image immutability: prevent tag overwrites (tag=v1 always same image)
🔧

ECS + ECR Flow

  • 1. Build: docker build -t my-api:v2 .
  • 2. Tag: docker tag my-api:v2 123456.dkr.ecr.us-east-1.amazonaws.com/my-api:v2
  • 3. Auth: aws ecr get-login-password | docker login...
  • 4. Push: docker push 123456.dkr.ecr.../my-api:v2
  • 5. ECS task definition references the ECR image URI
  • 6. ECS Execution Role must allow ecr:GetAuthorizationToken, ecr:BatchCheckLayerAvailability, ecr:GetDownloadUrlForLayer, and ecr:BatchGetImage
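The image URI that the task definition references follows a fixed pattern. A small sketch of how it is assembled; the account ID, region, repository, and tag below are placeholder values, and `ecr_image_uri` is a hypothetical helper.

```python
def ecr_image_uri(account_id, region, repository, tag):
    """Assemble the URI of an image in a private ECR repository."""
    return f"{account_id}.dkr.ecr.{region}.amazonaws.com/{repository}:{tag}"


print(ecr_image_uri("123456789012", "us-east-1", "my-api", "v2"))
# 123456789012.dkr.ecr.us-east-1.amazonaws.com/my-api:v2
```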
ECR Image Pull Flow Core

When ECS launches a task, the image pull follows a precise sequence. Understanding this flow helps debug "CannotPullContainerError", one of the most common task startup failures:

Image Pull Flow: What Happens at Task Launch
[Diagram: 1. ECS scheduler decides to start the task → 2. Execution Role is assumed for ECR auth → 3. Image layers are downloaded from ECR → 4. Container starts and the task enters RUNNING. A failure at step 2 or 3 surfaces as "CannotPullContainerError": check Execution Role permissions and VPC endpoints / NAT.]

👉 Most common fix: If tasks fail with CannotPullContainerError, (1) verify the Execution Role has ecr:GetAuthorizationToken + ecr:BatchCheckLayerAvailability + ecr:GetDownloadUrlForLayer + ecr:BatchGetImage, and (2) ensure the task's subnet has a route to ECR: either a NAT Gateway, or VPC endpoints for ECR (com.amazonaws.region.ecr.dkr + com.amazonaws.region.ecr.api) plus an S3 gateway endpoint, because image layers are stored in S3.

Load Balancer Integration Core

ECS services integrate with ALB and NLB via target groups. When a task starts, ECS automatically registers it with the target group. When a task stops, ECS deregisters it after the ALB drains active connections. For Fargate (awsvpc mode), the target type must be ip (not instance).

| Feature | ALB (Application) | NLB (Network) |
|---|---|---|
| Layer | Layer 7 (HTTP/HTTPS) | Layer 4 (TCP/UDP/TLS) |
| Routing | Path-based, host-based, header-based | Port-based only |
| Health checks | HTTP GET /health (path + status code) | TCP connect or HTTP |
| WebSocket | ✅ Native support | ✅ TCP passthrough |
| Static IP | ❌ DNS only (changes) | ✅ Elastic IP per AZ |
| Sticky sessions | ✅ Cookie-based | ✅ Source-IP-based |
| Best for ECS | REST APIs, web apps, microservices | gRPC, real-time, extreme throughput |

👉 ALB path-based routing is the standard pattern for ECS microservices. One ALB, multiple listener rules: /api/users/* → user-service target group, /api/orders/* → order-service target group. Each service registers its own target group. This avoids one-LB-per-service cost while keeping services independently deployable.
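Path-based routing amounts to "first matching rule wins". The sketch below imitates that behavior with shell-style wildcards; it is a simplified model for intuition, not how the ALB is implemented, and `route`, `RULES`, and the default target group name are hypothetical.

```python
from fnmatch import fnmatchcase

# Rules are checked in order, like listener rules sorted by priority.
RULES = [
    ("/api/users/*", "user-service"),
    ("/api/orders/*", "order-service"),
]


def route(path, rules=RULES, default="default-target-group"):
    """Return the target group of the first rule whose pattern matches."""
    for pattern, target_group in rules:
        if fnmatchcase(path, pattern):
            return target_group
    return default


print(route("/api/users/42"))  # user-service
```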

IAM Roles: Task Role vs Execution Role Core

ECS tasks use two different IAM roles. Confusing them is one of the most common ECS mistakes and a frequent exam question.

🔐

Task Execution Role

  • Who uses it: ECS agent (not your application)
  • Purpose: pull images, push logs, read secrets
  • Permissions needed:
    • ecr:GetAuthorizationToken
    • ecr:BatchGetImage
    • logs:CreateLogStream
    • logs:PutLogEvents
    • ssm:GetParameters (if injecting from Parameter Store)
    • secretsmanager:GetSecretValue (if injecting secrets)
  • AWS provides managed policy: AmazonECSTaskExecutionRolePolicy
🗝️

Task Role

  • Who uses it: your application code (inside the container)
  • Purpose: access AWS services from your app
  • Examples:
    • s3:PutObject (upload files)
    • dynamodb:PutItem (write data)
    • sqs:SendMessage (queue messages)
    • sns:Publish (send notifications)
  • Follow least privilege: grant only what your app actually needs
  • Credentials come from the ECS task credentials endpoint, not the EC2 instance metadata endpoint (SDK auto-discovers)
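The split between the two roles can be made concrete as two separate policy documents that share one trust policy. This is a minimal sketch: the action lists repeat the examples above (plus the ECR layer actions), and `Resource: "*"` is a placeholder that a real least-privilege policy would narrow to specific ARNs.

```python
import json

# Both roles are assumed by ECS tasks, so they share this trust policy.
TRUST_POLICY = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"Service": "ecs-tasks.amazonaws.com"},
        "Action": "sts:AssumeRole",
    }],
}

# Used by ECS itself: pull the image and ship logs.
EXECUTION_ROLE_ACTIONS = [
    "ecr:GetAuthorizationToken",
    "ecr:BatchCheckLayerAvailability",
    "ecr:GetDownloadUrlForLayer",
    "ecr:BatchGetImage",
    "logs:CreateLogStream",
    "logs:PutLogEvents",
]

# Used by the application code inside the container.
TASK_ROLE_ACTIONS = ["s3:PutObject", "dynamodb:PutItem", "sqs:SendMessage"]


def policy_document(actions, resources="*"):
    """Wrap a list of actions in a single-statement IAM policy."""
    return {
        "Version": "2012-10-17",
        "Statement": [{"Effect": "Allow", "Action": actions, "Resource": resources}],
    }


print(json.dumps(policy_document(TASK_ROLE_ACTIONS), indent=2))
```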
Secrets Manager & Parameter Store Core

Never hardcode secrets (database passwords, API keys) in your Docker image or task definition environment variables. Instead, reference them from AWS Secrets Manager or SSM Parameter Store. ECS injects the secret value at task launch time: your container sees the value as a regular environment variable, but the actual secret never appears in the task definition.

🔒

Secrets Manager

  • Designed specifically for secrets (credentials, tokens, keys)
  • Automatic rotation (Lambda-based, $0.40/secret/month)
  • Reference in task def: "valueFrom": "arn:aws:secretsmanager:..."
  • Execution Role needs secretsmanager:GetSecretValue
📝

SSM Parameter Store

  • Config values + secrets (Standard tier free, up to 10K params)
  • SecureString type encrypts with KMS
  • Reference: "valueFrom": "arn:aws:ssm:...:parameter/db_host"
  • Execution Role needs ssm:GetParameters
  • Free for standard params (cheaper than Secrets Manager)
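In a container definition, plain settings go under `environment` and secret references go under `secrets`. The sketch below shows the shape; the ARNs, names, and the `container_definition` helper are made-up placeholders for illustration.

```python
def container_definition(name, image, secret_arns, plain_env=None):
    """Build a container definition that references secrets by ARN."""
    return {
        "name": name,
        "image": image,
        # Non-sensitive settings are stored in the task definition as-is.
        "environment": [
            {"name": key, "value": value}
            for key, value in (plain_env or {}).items()
        ],
        # Only the ARN is stored; ECS resolves the value at task launch.
        "secrets": [
            {"name": key, "valueFrom": arn}
            for key, arn in secret_arns.items()
        ],
    }


api = container_definition(
    "api-service",
    "123456789012.dkr.ecr.us-east-1.amazonaws.com/api-service:v3",
    secret_arns={
        "DB_PASS": "arn:aws:secretsmanager:us-east-1:123456789012:secret:prod/db-pass",
        "DB_HOST": "arn:aws:ssm:us-east-1:123456789012:parameter/db_host",
    },
    plain_env={"LOG_LEVEL": "info"},
)
```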
CloudWatch Logs & Container Insights Core

ECS containers send logs to CloudWatch via the awslogs log driver. Each container gets its own log stream within a log group. Container Insights provides CPU, memory, network, and disk metrics at the task and container level, which is critical for troubleshooting and capacity planning.

📋

awslogs Driver

  • Configured in task definition per container
  • Options: awslogs-group, awslogs-region, awslogs-stream-prefix
  • Log stream name: prefix/container-name/task-id
  • Execution Role needs logs:CreateLogStream, logs:PutLogEvents
  • Set log group retention (default: never expires → cost grows forever)
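The awslogs options listed above map onto a `logConfiguration` block, and the stream name follows from the prefix. A small sketch with placeholder group, region, and task ID values; both helper names are hypothetical.

```python
def awslogs_configuration(group, region, prefix):
    """Build the logConfiguration block for one container."""
    return {
        "logDriver": "awslogs",
        "options": {
            "awslogs-group": group,
            "awslogs-region": region,
            "awslogs-stream-prefix": prefix,
        },
    }


def log_stream_name(prefix, container_name, task_id):
    """Name of the stream a container writes to: prefix/container-name/task-id."""
    return f"{prefix}/{container_name}/{task_id}"


log_config = awslogs_configuration("/ecs/api-service", "us-east-1", "ecs")
print(log_stream_name("ecs", "api-service", "0123456789abcdef"))
# ecs/api-service/0123456789abcdef
```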
📊

Container Insights

  • Enable per cluster: containerInsights: enabled
  • Metrics: CPU/memory utilization per task, per service, per cluster
  • Network: bytes in/out, packet errors
  • Storage: ephemeral storage utilization (Fargate)
  • Costs ~$0.30/task/month (CloudWatch custom metrics pricing)
AWS X-Ray: Distributed Tracing In-Depth

X-Ray traces requests across your microservices, showing where time is spent, which service is slow, and where errors occur. For ECS, you run the X-Ray daemon as a sidecar container in the same task. Your application sends trace data to the daemon (localhost:2000/udp), and the daemon forwards it to the X-Ray service.

🔍

Setup Steps

  • 1. Add X-Ray daemon container to task definition (sidecar)
  • 2. Image: amazon/aws-xray-daemon
  • 3. Port: 2000/UDP
  • 4. Task Role needs: xray:PutTraceSegments, xray:PutTelemetryRecords
  • 5. Your app uses X-Ray SDK (or OpenTelemetry) to instrument requests
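Steps 1-3 can be sketched as one extra entry in the task definition's container list. The container name and the `essential: False` choice are assumptions for illustration (a tracing sidecar arguably should not take the task down); the image and port come from the steps above.

```python
def xray_sidecar():
    """Container definition for the X-Ray daemon sidecar."""
    return {
        "name": "xray-daemon",
        "image": "amazon/aws-xray-daemon",
        # Assumption: losing tracing should not stop the whole task.
        "essential": False,
        "portMappings": [{"containerPort": 2000, "protocol": "udp"}],
    }


def with_xray_sidecar(container_definitions):
    """Return a new container list with the sidecar appended."""
    return list(container_definitions) + [xray_sidecar()]


containers = with_xray_sidecar([{"name": "api-service", "essential": True}])
```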
📈

What You Get

  • Service map: visual graph of all services and their connections
  • Latency breakdown: where each millisecond was spent
  • Error rates per service
  • Trace filtering by URL, status code, duration
  • Integration with CloudWatch ServiceLens for unified view

👉 Complete observability stack: CloudWatch Logs (what happened: container stdout/stderr), Container Insights (how it's performing: CPU/memory metrics), X-Ray (where time is spent: distributed traces). For exam: "how to view container logs" → awslogs driver + CloudWatch. "How to find slow microservice" → X-Ray. "How to set up CPU-based auto scaling" → target tracking on the built-in ECS service CPU metric (ECSServiceAverageCPUUtilization); Container Insights is not required for this.

ECS Integration Ecosystem: Build → Deploy → Serve → Observe
[Diagram: four stages connected by data/control flow.
  • BUILD (ECR): image registry, CVE scanning, lifecycle policies, cross-region replication.
  • ORCHESTRATE (ECS): Task Execution Role, Task Role (IAM), Secrets Manager, Parameter Store.
  • SERVE (ALB / NLB): path-based routing, health checks, target groups (ip type), SSL termination.
  • OBSERVE (CloudWatch): logs (awslogs driver), Container Insights, X-Ray tracing, alarms → Auto Scaling.
Flow: build image → push to ECR → ECS pulls the image and injects secrets → ALB routes traffic → CloudWatch observes. Execution Role: pull ECR images + push logs + read secrets. Task Role: the app's own AWS permissions.]
AWS Diagram: Secure Microservice with All Integrations Core
Complete ECS Workload: ECR + ALB + Secrets + X-Ray + CloudWatch
[Diagram: Inside a VPC's private subnets, HTTPS traffic reaches an ALB (target group of type ip, /health check) that forwards to an ECS task on Fargate. The task runs api-service:v3 on port 8080 with a Task Role for S3 and DynamoDB and DB_PASS injected as a secret, plus an X-Ray daemon sidecar on port 2000/udp. Secrets Manager supplies DB_PASS and API_KEY at launch, ECR supplies the api-service:v3 image, and CloudWatch collects logs (awslogs), Container Insights metrics, X-Ray traces, and alarms that trigger Auto Scaling.
Execution Role: pull from ECR + push logs + read secrets. Task Role: the app's DynamoDB/S3 permissions. The X-Ray sidecar shares the task network (localhost:2000). Secrets are injected at launch and never stored in the image or in task definition plaintext.]
Architecture Diagram: ALB Path-Based Routing to Multiple Services In-Depth
Microservice Routing: One ALB, Multiple ECS Services via Path Rules
[Diagram: One ALB with several listener rules.
  • Rule 1: /api/users/* → user-service (3 tasks), TG: user-svc-tg (port 8080)
  • Rule 2: /api/orders/* → order-service (5 tasks), TG: order-svc-tg (port 8080)
  • Rule 3: /api/products/* → product-service (2 tasks), TG: product-svc-tg (port 8080)
  • Default: /static/*
Data stores shown: RDS (users), DynamoDB (orders), ElastiCache. One ALB → path rules → separate target groups → independent services. Each service deploys independently (different image versions, scaling policies, task roles).]
🎓 Exam Tips: Chapter 07
  • Task Execution Role ≠ Task Role. Execution Role = ECS agent (pull images, push logs, read secrets). Task Role = your application code (DynamoDB, S3, SQS).
  • "Container cannot pull image from ECR" → check the Execution Role has the ECR pull permissions (ecr:GetAuthorizationToken, ecr:BatchGetImage, ecr:GetDownloadUrlForLayer).
  • "Application needs to write to S3" → add S3 permissions to the Task Role, not the Execution Role.
  • "Inject database password securely" → Secrets Manager or SSM SecureString referenced in task definition. Execution Role needs read permission.
  • ALB target type must be ip for Fargate (awsvpc mode). instance type only works with EC2 launch type bridge/host networking.
  • X-Ray for ECS: run the daemon as a sidecar on port 2000/UDP, not as a standalone service. Task Role needs xray:PutTraceSegments.
  • ECR lifecycle policies auto-delete old/untagged images, which prevents storage cost creep. Example: keep only the last 10 tagged images.
  • "Logs not appearing in CloudWatch" → check Execution Role has logs:CreateLogStream + logs:PutLogEvents, and check log group exists.
  • Distractor: "Task Role is needed to pull images from ECR": false. Image pull uses the Execution Role.
📋 Chapter 7: Summary
  • ECR: managed Docker registry. Build → tag → push → ECS pulls. Enable scanning + lifecycle policies.
  • ALB: path-based routing to multiple ECS services via target groups (ip type for Fargate).
  • Execution Role vs Task Role: Execution = infrastructure (ECR, logs, secrets). Task = application (S3, DynamoDB, SQS).
  • Secrets: inject from Secrets Manager or SSM Parameter Store at task launch. Never hardcode in images.
  • CloudWatch: awslogs driver for logs. Container Insights for metrics. Set log retention to avoid cost creep.
  • X-Ray: sidecar daemon for distributed tracing. Task Role needs xray permissions.
08
Chapter Eight

Architecture Patterns

When to Use Which Pattern Introductory

ECS is flexible enough to support many application styles, from long-running web services to one-shot batch jobs. The key is matching the right ECS features (service vs standalone task, Fargate vs EC2, Spot vs On-Demand) to each workload's requirements.

| Pattern | ECS Feature | Launch Type | Scaling Trigger | Example |
|---|---|---|---|---|
| Microservices | Service + ALB + Service Discovery | Fargate | ALBRequestCountPerTarget | E-commerce (user, order, product services) |
| API Backend | Service + ALB + Auto Scaling | Fargate | ALBRequestCountPerTarget or CPU | Mobile app backend |
| Batch Processing | Standalone task (RunTask API) | Fargate Spot | EventBridge schedule or SQS | Nightly reports, video transcoding |
| Event-Driven | Service + SQS polling | Fargate Spot | SQS queue depth (custom metric) | Order processing, image resizing |
| Scheduled Tasks | RunTask triggered by EventBridge | Fargate Spot | Cron schedule | DB cleanup, daily sync, report generation |
| Web App + API | Service + CloudFront + ALB | Fargate | ALBRequestCountPerTarget | SPA frontend + REST API |
Pattern 1: Microservices Platform Core

The most common ECS architecture: multiple independent services, each in its own ECS service with its own task definition, scaling policy, and deployment lifecycle. An ALB routes requests by path to the correct target group. Services discover each other via AWS Cloud Map (Service Discovery) for internal communication.

✅

When to Use

  • Multiple teams owning different services
  • Services scale independently (orders spike on sales, users steady)
  • Independent deployment: deploy user-service without touching order-service
  • Different tech stacks per service (Node.js, Java, Python in same cluster)
⚙️

ECS Features Used

  • ALB with path-based routing (one LB, many target groups)
  • Service Discovery (Cloud Map) for service-to-service calls
  • Fargate per-service with independent scaling
  • ECR separate repository per service
  • Secrets Manager per-service credentials

👉 Use Service Discovery (Cloud Map) for internal calls, ALB for external. Service A calls Service B via DNS (order-service.local:8080), and Cloud Map maintains the DNS records. This avoids routing internal traffic through the ALB (extra hop, extra cost). External traffic still goes ALB → target group → service.

Pattern 2: Event-Driven Queue Processing Core

A service polls SQS for messages and processes them. When the queue grows, Auto Scaling adds tasks. When the queue empties, it scales back down. This pattern decouples producers from consumers and handles traffic spikes gracefully: the queue absorbs the burst while consumers process at their own pace.

✅

When to Use

  • Async processing: order placed → process payment, send email
  • Unpredictable bursts: 10K images uploaded at once → resize queue
  • Decoupled: producer doesn't wait for consumer to finish
  • Retry built-in: failed messages go to DLQ for investigation
⚙️

ECS Features Used

  • ECS Service with desired count = min workers
  • Step Scaling on SQS ApproximateNumberOfMessagesVisible
  • Fargate Spot for cost savings (interruptible processing is OK)
  • SQS DLQ for failed messages
  • Task Role with sqs:ReceiveMessage, sqs:DeleteMessage
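Scaling on queue depth usually reduces to "how many messages should each task be responsible for". A minimal sketch of that calculation; `desired_workers` is a hypothetical helper, and the 250 messages-per-task figure is an assumed capacity, not a recommendation.

```python
import math


def desired_workers(queue_depth, messages_per_task, min_tasks, max_tasks):
    """Task count needed to keep the backlog per task at or below the limit."""
    needed = math.ceil(queue_depth / messages_per_task)
    # Clamp to the service's configured minimum and maximum.
    return max(min_tasks, min(max_tasks, needed))


print(desired_workers(5000, 250, min_tasks=5, max_tasks=20))  # 20
print(desired_workers(0, 250, min_tasks=5, max_tasks=20))     # 5
```

In practice the same ratio can be published as a custom CloudWatch metric (backlog per task) and used by the scaling policy instead of raw queue depth.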
Pattern 3: Batch Processing & Scheduled Tasks Core

One-shot tasks triggered by a schedule (EventBridge cron) or an event. Unlike services, batch tasks run to completion and exit; they are not restarted. Perfect for nightly reports, database migrations, ETL jobs, and data exports.

✅

When to Use

  • Scheduled jobs: "run nightly at 2am UTC"
  • Finite workloads: process file, generate report, exit
  • Cost-sensitive: Fargate Spot for up to 70% savings
  • No load balancer needed: tasks run independently
⚙️

ECS Features Used

  • EventBridge rule or Scheduler → ecs:RunTask
  • Standalone task (not a service; it exits when done)
  • Fargate Spot for cost optimization
  • EFS for shared data across batch tasks
  • CloudWatch Logs for output capture
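A standalone Fargate Spot task is started with a RunTask request that names the capacity provider instead of a launch type. The sketch below builds such a request; the cluster, task definition, subnet, and security group IDs are placeholders, and `run_task_request` is a hypothetical helper.

```python
def run_task_request(cluster, task_definition, subnets, security_groups):
    """Build RunTask arguments for a one-shot task on Fargate Spot."""
    return {
        "cluster": cluster,
        "taskDefinition": task_definition,
        "count": 1,
        # A capacity provider strategy is used instead of launchType;
        # the two cannot be specified together.
        "capacityProviderStrategy": [
            {"capacityProvider": "FARGATE_SPOT", "weight": 1},
        ],
        "networkConfiguration": {
            "awsvpcConfiguration": {
                "subnets": subnets,
                "securityGroups": security_groups,
                "assignPublicIp": "DISABLED",
            }
        },
    }


request = run_task_request(
    "production", "report-generator:1",
    subnets=["subnet-0123456789abcdef0"],
    security_groups=["sg-0123456789abcdef0"],
)
# With boto3 installed and credentials configured, this could be sent as:
#   boto3.client("ecs").run_task(**request)
```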
Pattern 4: Web App with Static Frontend In-Depth

A modern web application with a static frontend (React/Vue SPA) served from S3 + CloudFront, and an API backend running on ECS behind ALB. CloudFront routes /api/* to ALB origin and everything else to S3. This separates the static delivery (CDN-optimized) from the dynamic API (container-optimized).

🌐

Frontend (Static)

  • React/Vue/Angular SPA built → uploaded to S3
  • CloudFront CDN for global low-latency delivery
  • Origin Access Control: S3 bucket not publicly accessible
  • Cache-Control headers: immutable assets cached at edge
⚑

Backend (ECS)

  • REST API on ECS Fargate behind ALB
  • CloudFront origin: /api/* → ALB
  • Auto Scaling on request count per target
  • Private subnets: not directly internet accessible
Concept Diagram: Microservices Communication Introductory
Microservices Communication: External (ALB) vs Internal (Service Discovery)
[Diagram: External traffic: client → ALB → user-service (/api/users), order-service (/api/orders), or product-service, each on port 8080, via path-based routing. Internal traffic: services call each other by DNS name (user-svc.local, order-svc.local, product-svc.local) resolved through AWS Cloud Map (Service Discovery), with no ALB hop.]
AWS Diagram: Event-Driven Processing with SQS Core
Event-Driven Architecture: SQS Queue → ECS Service → DLQ for Failures
[Diagram: A producer (web API) sends order.placed messages to the SQS queue order-processing-queue (visibility timeout 60 s, about 5,000 messages pending). The order-processor ECS service (5 tasks on Fargate Spot) polls the queue; each task processes a message and then deletes it, with DynamoDB as the data store. Auto Scaling scales the service from 5 to 20 tasks when queue depth exceeds 1,000. Messages that fail after 3 retries go to a DLQ, which has an alarm on its message count. Fargate Spot gives up to 70% cost savings on the queue workers.]
Architecture Diagram: Production Microservices Platform In-Depth
Production Platform: CloudFront + ALB + ECS Microservices + Event Queue + Database
[Diagram: CloudFront serves /static from S3 (SPA) and forwards /api/* to an ALB. In the VPC's private subnets (Multi-AZ), long-running Fargate services handle requests: api-gateway (auth + routing to services, 4 tasks), user-service (CRUD users + auth, 3 tasks), and order-service (create + query orders, 5 tasks). order-service publishes order.placed events to SQS, which order-processor polls on Fargate Spot. EventBridge triggers report-generator on Fargate Spot on a cron schedule. Data stores: RDS (users), DynamoDB (orders), ElastiCache. Each service has its own task definition, scaling policy, Task Role, and ECR image, with an independent lifecycle.]
πŸŽ“ Exam Tips β€” Chapter 08
  • "Decouple order processing from API" β†’ SQS queue between order-service and order-processor. ECS service polls SQS. Scale on queue depth.
  • "Run a task on a schedule" β†’ EventBridge Scheduler rule with ecs:RunTask target. NOT an ECS service (services are long-running). Use Fargate Spot for cost.
  • "Service-to-service communication inside ECS" β†’ AWS Cloud Map (Service Discovery). DNS-based: order-svc.local:8080. No ALB needed for internal traffic.
  • Service vs Standalone Task: Service = long-running, auto-restarts, load balanced. Task = one-shot, exits when done, no restart.
  • "Cheapest way to run batch container jobs" β†’ Fargate Spot + EventBridge trigger. If interruptible, Spot saves up to 70%.
  • SQS + ECS scaling: use ApproximateNumberOfMessagesVisible as the scaling metric, typically with step scaling. Target tracking has no predefined SQS metric; it only works here with a customized metric such as backlog per task.
  • "Static website + API on same domain" → CloudFront + S3 (static) + ALB origin (/api/*). Not served from ECS containers.
  • Distractor: "Lambda is always cheaper than Fargate for event processing" is false. For sustained high-throughput queues, Fargate Spot can cost less than millions of Lambda invocations.
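The queue-depth scaling rule above can be sketched as a small calculation. This is only an illustration, not an AWS API: the function name and the `target_backlog_per_task` value are hypothetical, and the 5 and 20 task bounds simply mirror the scaling example in this chapter.

```python
import math


def desired_task_count(visible_messages: int,
                       target_backlog_per_task: int = 250,
                       min_tasks: int = 5,
                       max_tasks: int = 20) -> int:
    """Return how many worker tasks to run for a given queue depth.

    visible_messages stands in for the SQS metric
    ApproximateNumberOfMessagesVisible. target_backlog_per_task is a
    hypothetical tuning value: how many pending messages one task may own
    before more tasks are added.
    """
    if visible_messages < 0:
        raise ValueError("visible_messages must be non-negative")
    needed = math.ceil(visible_messages / target_backlog_per_task)
    # Clamp to the service's configured minimum and maximum task counts.
    return max(min_tasks, min(max_tasks, needed))


if __name__ == "__main__":
    for depth in (0, 1000, 2500, 5000, 12000):
        print(depth, "->", desired_task_count(depth))
```

In a real service this decision is made by Application Auto Scaling from CloudWatch metrics; the sketch only shows the arithmetic behind a backlog-per-task target.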
📋 Chapter 8: Summary
  • Microservices: ALB path-routing + Service Discovery (Cloud Map). Each service independently deployed and scaled.
  • Event-driven: SQS → ECS service polling. Scale on queue depth. Fargate Spot for cost. DLQ for failures.
  • Batch/scheduled: EventBridge → RunTask (standalone, not a service). Fargate Spot. Exits when complete.
  • Web app: CloudFront → S3 (static), CloudFront → ALB (/api/*) → ECS. Separate static and dynamic delivery.
  • Internal comms: Cloud Map DNS for service-to-service. Service Connect (built on Cloud Map + Envoy) is the modern alternative: it adds retries, timeouts, and circuit breaking automatically. ALB only for external traffic.
  • Cost pattern: long-running APIs on Fargate On-Demand. Queue workers and batch on Fargate Spot.
09
Chapter Nine

Troubleshooting & Observability

Stopped Reason Codes Core

When an ECS task stops unexpectedly, ECS records a stopped reason that tells you what went wrong. This is the first place to look when debugging: run aws ecs describe-tasks and check the stoppedReason and containers[].reason fields.

Stopped Reason | What Happened | Fix
EssentialContainerExited | A container marked essential: true exited (crashed or exited with a non-zero code). | Check container exit code + CloudWatch Logs for stack trace. Fix the application bug.
OutOfMemoryError | Container exceeded its memory limit. Killed by OOM killer. | Increase memory in task definition. Check for memory leaks. JVM: set -Xmx to 75% of container memory.
CannotPullContainerError | ECS cannot pull image from ECR or Docker Hub. | Check: (1) Image exists in ECR (2) Execution Role has ecr:BatchGetImage (3) VPC has NAT gateway or ECR VPC endpoint (4) Image tag is correct.
ResourceInitializationError | Task could not attach ENI (awsvpc) or mount volume. | Check: (1) Subnet has available IPs (2) Security group allows traffic (3) EFS mount target exists in task's AZ.
TaskFailedToStart | Task launch failed before any container started. | Usually infrastructure issue: no capacity, ENI limit reached, or secret injection failure. Check Execution Role permissions.
AGENT | ECS agent on EC2 instance is unreachable or unhealthy. | EC2 launch type only. Check instance health, ECS agent logs (/var/log/ecs/ecs-agent.log). Restart agent or replace instance.
SERVICE_SCHEDULER_INITIATED | Service deliberately stopped the task (deployment, scale-in, health check failure). | Normal during deployments. If unexpected: check ALB health check config, ensure /health endpoint returns 200.

👉 The debugging command you'll use most: aws ecs describe-tasks --cluster my-cluster --tasks <task-id> → look at stoppedReason + each container's reason and exitCode. Exit code 137 = killed with SIGKILL, most often by the OOM killer (it also appears when a container ignores SIGTERM until its stop timeout expires). Exit code 1 = application error. Exit code 0 = normal shutdown.
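As a rough illustration of reading that output, the sketch below summarizes one task entry from the JSON that describe-tasks returns. It assumes the response has already been parsed into a dict with the stoppedReason, containers[].exitCode, and containers[].reason fields named above; the helper names and the sample task are hypothetical.

```python
from typing import Optional

# Exit-code hints taken from this chapter (128 + signal number convention).
EXIT_CODE_HINTS = {
    0: "normal shutdown",
    1: "application error, check CloudWatch Logs",
    137: "SIGKILL, most often the OOM killer",
    143: "SIGTERM, graceful stop (deployment, scale-in, Spot interruption)",
}


def describe_exit_code(code: Optional[int]) -> str:
    """Map a container exit code to a short hint."""
    if code is None:
        # describe-tasks omits exitCode when the container never ran.
        return "container never started"
    return EXIT_CODE_HINTS.get(code, "no hint for this exit code")


def summarize_stopped_task(task: dict) -> list:
    """Turn one entry of the 'tasks' array into readable lines."""
    lines = ["stoppedReason: " + task.get("stoppedReason", "n/a")]
    for container in task.get("containers", []):
        code = container.get("exitCode")
        line = f"{container.get('name', '?')}: exitCode={code}, {describe_exit_code(code)}"
        reason = container.get("reason")
        if reason:
            line += f" [{reason}]"
        lines.append(line)
    return lines


if __name__ == "__main__":
    sample = {  # hypothetical task, shaped like a describe-tasks entry
        "stoppedReason": "Essential container in task exited",
        "containers": [{"name": "my-app", "exitCode": 137,
                        "reason": "OutOfMemoryError: Container killed due to memory usage"}],
    }
    print("\n".join(summarize_stopped_task(sample)))
```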

ECS Exec: Shell Into Running Containers Core

ECS Exec lets you open a shell in a running container, like docker exec -it but for containers running on Fargate or EC2. It uses AWS Systems Manager Session Manager under the hood. This is essential for debugging running containers that aren't behaving as expected.

🔧

Setup Requirements

  • 1. Enable execute command on service: --enable-execute-command
  • 2. Task Role needs: ssmmessages:CreateControlChannel, ssmmessages:CreateDataChannel, ssmmessages:OpenControlChannel, ssmmessages:OpenDataChannel
  • 3. SSM agent is bundled with Fargate platform version 1.4.0+
  • 4. VPC needs NAT gateway or SSM VPC endpoints
💻

Usage

  • Open shell: aws ecs execute-command --cluster my-cluster --task <task-id> --container my-app --interactive --command "/bin/sh"
  • Check env vars, filesystem, network connectivity
  • Test DNS resolution: nslookup order-svc.local
  • Check if secrets injected: echo $DB_PASSWORD
  • Audit: each ExecuteCommand API call is recorded in CloudTrail; session output can additionally be logged to CloudWatch Logs or S3
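The Task Role permissions from the setup requirements can be written as an IAM policy document. The sketch below builds one in Python and prints it as JSON. The four actions come from the list above; the function name is hypothetical, and the wildcard resource is an assumption for illustration that you may want to review against your own security requirements.

```python
import json

# The four ssmmessages actions listed in the setup requirements.
ECS_EXEC_ACTIONS = [
    "ssmmessages:CreateControlChannel",
    "ssmmessages:CreateDataChannel",
    "ssmmessages:OpenControlChannel",
    "ssmmessages:OpenDataChannel",
]


def ecs_exec_policy() -> dict:
    """Build an IAM policy document intended for the Task Role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": list(ECS_EXEC_ACTIONS),
                "Resource": "*",  # assumption: not scoped to specific resources
            }
        ],
    }


if __name__ == "__main__":
    print(json.dumps(ecs_exec_policy(), indent=2))
```

Attach the resulting document to the Task Role, not the Execution Role.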
Observability Stack Core
📋

Logs (CloudWatch)

  • awslogs driver captures stdout/stderr
  • Log group: /ecs/my-service
  • Log stream: prefix/container/task-id
  • Filter patterns for error detection
  • Metric filters: count errors → alarm
📊

Metrics (Container Insights)

  • CpuUtilized / CpuReserved per task
  • MemoryUtilized / MemoryReserved per task
  • NetworkRxBytes / NetworkTxBytes
  • RunningTaskCount per service
  • EphemeralStorageUtilized (Fargate ephemeral storage)
🔍

Traces (X-Ray)

  • End-to-end request traces across services
  • Latency breakdown per service hop
  • Error rate visualization
  • Service map: which service calls which
  • Sidecar daemon + SDK instrumentation

👉 Health check failures are the #1 cause of "task keeps restarting." The ALB health check calls your /health endpoint. If the endpoint fails the configured number of consecutive checks (the unhealthy threshold, for example three in a row), the ALB marks the target unhealthy and ECS stops the task and starts a new one. The replacement has not warmed up yet, fails its health checks too, and the service enters a restart loop. Fix: (1) ensure /health is fast (<5s response), (2) set a health check grace period (give the app time to start before failed checks count against it), (3) check that the security group allows ALB → task traffic.
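The restart loop comes down to simple timing arithmetic, which the rough sketch below illustrates. It is a simplified model, not how ECS or the ALB compute anything internally: the function names are hypothetical, and the interval and threshold defaults are example values, so substitute your target group's actual settings.

```python
def seconds_until_unhealthy(interval_s: int, unhealthy_threshold: int) -> int:
    """Approximate time for a consistently failing target to be marked unhealthy."""
    return interval_s * unhealthy_threshold


def restart_loop_risk(app_startup_s: int,
                      grace_period_s: int,
                      interval_s: int = 30,
                      unhealthy_threshold: int = 3) -> bool:
    """Return True if the task is likely to be stopped before the app is ready.

    Simplified model: failed checks are ignored during the grace period, so the
    app has roughly the grace period plus one full failure window to start.
    """
    budget = grace_period_s + seconds_until_unhealthy(interval_s, unhealthy_threshold)
    return app_startup_s > budget


if __name__ == "__main__":
    # A 60s startup with 10s checks, threshold 3, and no grace period is at risk.
    print(restart_loop_risk(60, grace_period_s=0, interval_s=10))
    # The same app with a 60s grace period has enough time to start.
    print(restart_loop_risk(60, grace_period_s=60, interval_s=10))
```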

Concept Diagram: Troubleshooting Decision Tree Introductory
ECS Troubleshooting: Where to Look Based on Symptom
Task not running. Why?
  • Task never started? Infrastructure issue (before the app starts). CannotPullContainer → ECR perms, image tag, NAT/VPC endpoint. ResourceInitError → ENI/IP exhaustion, SG rules, EFS mount.
  • Started then crashed? Application issue (app crashes). Exit code 137 → OutOfMemory, increase memory limit. Exit code 1 → app error/exception, check CloudWatch Logs.
  • Running but keeps restarting? Health check issue (app running but unhealthy). Health check failing → /health returns non-200, SG blocks ALB → task, or app not ready in time.
  • Diagnostic commands: (1) aws ecs describe-tasks → stoppedReason + exitCode (2) CloudWatch Logs → application stack traces (3) aws ecs execute-command → shell into running task (4) ALB target health → check registration + health status.
AWS Diagram: Observability Stack Core
ECS Observability: Logs + Metrics + Traces + Alarms
  • ECS Service: container api-service, sidecar X-Ray daemon, log driver awslogs, Container Insights ON, health check /health.
  • CloudWatch Logs: /ecs/api-service → stdout/stderr. Metric filter: count "ERROR" logs.
  • Container Insights: CPU, memory, network per task. RunningTaskCount per service.
  • AWS X-Ray: service map + latency traces. Error rate per service hop.
  • CloudWatch Alarms: CPU > 80% → scale out. ERROR count > 10 → SNS alert (SNS → PagerDuty).
  • CloudWatch Dashboard: unified view of logs + metrics.
Logs: application output. Metric filters detect patterns. Insights: per-task CPU/memory/network metrics. X-Ray: distributed traces across services. Alarms trigger scaling actions and SNS notifications.
Architecture Diagram: ECS Exec Debugging Session In-Depth
ECS Exec: Shell into Running Fargate Container via SSM Session Manager
  • Flow: Developer runs aws ecs execute-command (API call) → SSM Session Manager (encrypted channel, logged in CloudTrail) → SSM Agent in the Fargate task (built into Fargate 1.4+) → /bin/sh session in api-service, for example $ echo $DB_HOST.
  • Requirements: ✓ --enable-execute-command on service | ✓ Task Role: ssmmessages:* | ✓ NAT gateway or SSM endpoints | ✓ Fargate platform 1.4+
🎓 Exam Tips: Chapter 09
  • "Task keeps failing to start + CannotPullContainerError" → check: (1) ECR image exists (2) Execution Role permissions (3) NAT gateway in private subnet or ECR VPC endpoint.
  • "Container killed with exit code 137" → OOM. Container exceeded memory limit. Increase memory in task definition.
  • "Container exited with exit code 143" → SIGTERM received (graceful shutdown). Normal during service scaling, deployments, or Fargate Spot interruptions. Not an error: it means your app received a shutdown signal.
  • "How to debug a running ECS container" → ECS Exec (aws ecs execute-command). Requires SSM permissions on Task Role + enable-execute-command on service.
  • "Logs not appearing" → Execution Role missing logs:CreateLogStream or logs:PutLogEvents. Also check log group exists and awslogs driver is configured.
  • ECS Exec requires SSM permissions on the Task Role (not the Execution Role). This is a common exam distractor.
  • Container Insights costs extra. It's not free: it generates CloudWatch custom metrics. Budget ~$0.30/task/month.
  • "Service never reaches steady state" → aws ecs describe-services → check events field for recent messages. Usually: health check failures, insufficient capacity, or image pull errors.
  • Health check grace period: seconds to wait before first health check after task registration. Set to app startup time (e.g., 60s for Java Spring Boot). Default: 0 (immediate check).
📋 Chapter 9: Summary
  • Stopped reasons: describe-tasks → stoppedReason + exitCode. 137 = OOM. 1 = app error. CannotPullContainer = ECR permissions/networking.
  • ECS Exec: shell into running containers via SSM. Requires enable-execute-command + Task Role ssmmessages permissions.
  • Observability: CloudWatch Logs (awslogs), Container Insights (metrics), X-Ray (traces), Alarms (auto-scaling + alerts).
  • Health check loop: most common "task keeps restarting" cause. Fix: grace period, check /health endpoint, verify security group rules.
  • describe-services events: first place to check when service won't stabilize. Shows recent scheduling failures and reasons.
📚 ECS Cheatsheet Core
💻

Key CLI Commands

  • aws ecs create-cluster --cluster-name my-cluster
  • aws ecs register-task-definition --cli-input-json file://task-def.json
  • aws ecs create-service --cluster my-cluster --service-name my-svc ...
  • aws ecs update-service --cluster my-cluster --service my-svc --desired-count 5
  • aws ecs run-task --cluster my-cluster --task-definition my-task:3
  • aws ecs describe-tasks --cluster my-cluster --tasks <id>
  • aws ecs describe-services --cluster my-cluster --services my-svc
  • aws ecs execute-command --cluster my-cluster --task <id> --command "/bin/sh" --interactive
  • aws ecs list-tasks --cluster my-cluster --service-name my-svc
  • aws ecs stop-task --cluster my-cluster --task <id>
🏷️

ARN Formats

  • Cluster: arn:aws:ecs:region:account:cluster/name
  • Task Definition: arn:aws:ecs:region:account:task-definition/family:revision
  • Service: arn:aws:ecs:region:account:service/cluster/service-name
  • Task: arn:aws:ecs:region:account:task/cluster/task-id
  • Container Instance: arn:aws:ecs:region:account:container-instance/cluster/id
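When scripting against these formats, it can help to split an ARN into its parts. The sketch below is one minimal way to do that with plain string handling; the `EcsArn` type, the function name, and the example account ID and resource names are all hypothetical.

```python
from typing import NamedTuple


class EcsArn(NamedTuple):
    region: str
    account: str
    resource_type: str   # e.g. "cluster", "service", "task-definition"
    resource_path: str   # e.g. "my-cluster/my-svc" or "my-task:3"


def parse_ecs_arn(arn: str) -> EcsArn:
    """Split an ECS ARN into its parts; raise ValueError if it is not one."""
    # maxsplit=5 keeps the colon in "family:revision" inside the last part.
    parts = arn.split(":", 5)
    if len(parts) != 6 or parts[0] != "arn" or parts[2] != "ecs":
        raise ValueError(f"not an ECS ARN: {arn!r}")
    region, account, resource = parts[3], parts[4], parts[5]
    resource_type, _, resource_path = resource.partition("/")
    if not resource_path:
        raise ValueError(f"ARN has no resource path: {arn!r}")
    return EcsArn(region, account, resource_type, resource_path)


if __name__ == "__main__":
    print(parse_ecs_arn("arn:aws:ecs:us-east-1:123456789012:service/my-cluster/my-svc"))
    print(parse_ecs_arn("arn:aws:ecs:us-east-1:123456789012:task-definition/my-task:3"))
```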
Stopped Reason | Exit Code | Quick Fix
EssentialContainerExited | 1 | Check CloudWatch Logs for stack trace
OutOfMemoryError | 137 | Increase task memory
CannotPullContainerError | - | ECR perms + NAT/VPC endpoint
ResourceInitializationError | - | Subnet IPs + SG rules + EFS mounts
TaskFailedToStart | - | Execution Role + capacity