Amazon ECS: Elastic Container Service

Fully managed container orchestration. Run Docker containers at scale with EC2 or Fargate — define your containers, ECS handles scheduling, placement, and lifecycle.

⚡ ECS in 30 Seconds

Run Docker containers on AWS — ECS handles orchestration, scheduling, and placement
Fargate = serverless containers (no servers to manage). EC2 launch type = you manage the instances
Services keep desired task count running, auto-restart failed tasks, integrate with ALB
Task Definitions define your container blueprint: image, CPU, memory, networking, IAM roles
Deep integration with ALB, ECR, CloudWatch, IAM, Secrets Manager, and X-Ray

Chapter One

What is ECS

What Are Containers Introductory

A container packages your application code together with all its dependencies — runtime, libraries, system tools — into a single, portable unit. Unlike a virtual machine, a container shares the host operating system's kernel. This makes containers lightweight (megabytes, not gigabytes), fast to start (seconds, not minutes), and identical across environments (your laptop = staging = production).

Docker is the standard container runtime. You define a Dockerfile, build an image, and run it as a container. The image is immutable — the same image produces the same behavior everywhere.

🖥️

Virtual Machine

Full OS per VM (kernel + userspace)
Gigabytes in size
Minutes to boot
Strong isolation (separate kernels)
Managed by hypervisor (e.g., Nitro)

📦

Container

Shares host OS kernel
Megabytes in size
Seconds to start
Process-level isolation (cgroups, namespaces)
Managed by container runtime (Docker)

👉 Key mental model: A VM virtualizes the hardware. A container virtualizes the OS. Containers are lighter-weight but share a kernel — if the kernel has a vulnerability, all containers are affected. VMs have stronger isolation boundaries.

Why You Need Orchestration Introductory

Running one container on your laptop is easy. Running 200 containers across 50 servers in production — keeping them healthy, distributing traffic, replacing failures, scaling up at peak, and rolling out updates without downtime — is not something Docker alone can do. That is the orchestration problem.

📍

Scheduling

Which server should this container run on? Orchestrator picks the best host based on available CPU, memory, and placement constraints.

💚

Health & Recovery

Container crashed? Orchestrator detects the failure and starts a replacement automatically. No manual intervention.

📈

Scaling

Traffic spikes? Orchestrator launches more container instances. Traffic drops? Scales back down. Keeps desired count running at all times.

An orchestrator solves: where to place containers, how many to run, when to replace them, and how to update them without downtime. ECS is AWS's answer to this problem.

What ECS Provides Core

Amazon ECS (Elastic Container Service) is a fully managed container orchestration service. You define your containers, ECS handles the rest:

✅

ECS Manages

Control plane — scheduling, placement, lifecycle
Task management — run, stop, replace containers
Service management — maintain desired task count
Load balancer integration — register/deregister targets
Rolling deployments — update without downtime
Scaling — auto-adjust task count based on metrics

👤

You Define

Container image — Docker image from ECR or Docker Hub
Resource requirements — CPU and memory per container
Networking — VPC, subnets, security groups
IAM roles — permissions for your containers
Launch type — EC2 (you manage servers) or Fargate (serverless)
Desired count — how many task copies to run

The critical point: ECS is the control plane only. It does not run your containers itself. It tells EC2 instances or Fargate to run them. Think of ECS as the "brain" that decides what runs where — the compute comes from your chosen launch type.

ECS vs Docker Compose vs Kubernetes Core

When should you pick ECS over other container orchestrators? This comparison covers the three most common choices on AWS:

Feature	Docker Compose	ECS	EKS (Kubernetes)
What it is	Local multi-container tool	AWS-managed orchestrator	AWS-managed Kubernetes
Scale	1 machine	Thousands of containers	Thousands of containers
Learning curve	Low	Medium	High
Multi-host	No	Yes	Yes
Auto healing	Basic restart	Full (replace + reschedule)	Full (pod restart + reschedule)
AWS integration	None	Deep (IAM, ALB, ECR, CloudWatch)	Good (via add-ons)
Portability	Docker standard	AWS-only	Multi-cloud (K8s standard)
Control plane cost	Free	Free	~$72/month per cluster
Best for	Local dev, small projects	AWS-native production workloads	Multi-cloud, existing K8s teams

👉 Rule of thumb: If your team is on AWS and does not already use Kubernetes, choose ECS. It is simpler, has a free control plane, and integrates more deeply with AWS. Choose EKS only if you need Kubernetes portability across clouds or have existing K8s expertise. Choose Docker Compose only for local development.

Mental Model for ECS Introductory

Think of ECS as a restaurant kitchen:

📋

Task Definition = Recipe

Specifies what to cook — ingredients (image), portion size (CPU/memory), instructions (environment variables, commands).

🍽️

Task = A Plate of Food

One running instance of the recipe. Each plate is independent. If one drops, the kitchen makes another.

👨‍🍳

Service = The Head Chef

Ensures "always 5 plates ready." If a plate breaks, chef makes a new one. If demand spikes, chef makes more. ECS Service = desired count manager.

🏗️

Cluster = The Kitchen

The physical space where everything runs. Can be your own equipment (EC2 instances) or the restaurant's built-in kitchen (Fargate — you don't manage the ovens).

📦

Container = One Dish Component

A single container inside a task. A task can have multiple containers — like a plate with main course + side dish running together.

Concept Diagram — Container vs VM Introductory

Containers vs Virtual Machines — Architecture Comparison

AWS Diagram — ECS in the Container Ecosystem Core

ECS Container Ecosystem — Build, Store, Run, Serve

Architecture Diagram — Simple Web App on ECS Core

Production Web App — ALB + ECS Fargate across 2 Availability Zones

This is the most common ECS production pattern: an ALB distributes traffic to Fargate tasks running in private subnets across two Availability Zones. If an entire AZ goes down, the remaining tasks continue serving traffic. The ECS Service automatically replaces failed tasks and maintains the desired count of 4.

ECS vs Lambda vs EC2 — When to Use Which Core

Feature	Lambda	ECS (Fargate)	EC2
Model	Serverless functions	Serverless containers	Virtual machines
Max duration	15 minutes	Unlimited	Unlimited
Max memory	10 GB	120 GB (16 vCPU)	Terabytes (instance-dependent)
Startup latency	~100ms (warm) / 1-10s (cold)	30-60 seconds	Minutes
Pricing	Per request + duration	Per vCPU/memory per second	Per instance-hour
Scaling	Auto (per-request, 1000s concurrently)	Auto (task count, seconds)	Auto (instance count, minutes)
Container support	Container images (read-only)	Full Docker support	Full Docker / any runtime
Persistent storage	/tmp is ephemeral (up to 10 GB); mount EFS for persistence	EFS (shared)	EBS, EFS, instance store
GPU	No	No (Fargate) / Yes (EC2 type)	Yes
Best for	Event handlers, APIs <15min, glue code	Microservices, APIs, workers	Stateful apps, GPU, full OS control

When NOT to Use ECS Core

⚡

Simple APIs / Event Handlers

If your workload is short-lived (<15 min), event-driven, and stateless → use Lambda. No container to manage, no task definitions, no service configuration. Lambda is usually simpler, and often cheaper, for request-response patterns with modest or bursty traffic.

☸️

Existing Kubernetes Teams

If your team already uses Kubernetes and needs multi-cloud portability → use EKS. ECS is AWS-only. Migrating K8s manifests to ECS task definitions is non-trivial.

🖥️

Stateful / Legacy Workloads

If your app needs persistent local disk, specific OS configuration, or isn't containerized → use EC2 directly. ECS requires Docker images. Some legacy middleware won't containerize easily.

🎓 Exam Tips — Chapter 01

ECS control plane is free. You only pay for the EC2 instances or Fargate tasks — not for ECS itself. EKS charges ~$72/month per cluster.
ECS vs EKS: If the question says "simplest" or "least operational overhead" for containers on AWS → ECS + Fargate. If it says "Kubernetes" or "multi-cloud" → EKS.
ECS vs Lambda: Lambda is per-request, max 15 min, max 10GB memory. ECS is for long-running services, larger workloads, or when you need full Docker compatibility.
Fargate = serverless containers. If the question mentions "no server management" with containers → Fargate. If it says "GPU" or "daemon" → must use EC2 launch type.
Distractor: "Docker Compose can scale to production on AWS" — wrong. Compose is single-host only and has no auto-healing or multi-AZ support.

📋 Chapter 1 — Summary

Containers package app + dependencies into portable units. Lighter than VMs, seconds to start.
Orchestration solves scheduling, health recovery, scaling, and zero-downtime deploys.
ECS is the AWS-managed orchestrator — free control plane, deep AWS integration.
Two launch types: EC2 (you manage servers) and Fargate (serverless).
ECS vs EKS: ECS for AWS-native simplicity, EKS for Kubernetes portability.
Production pattern: ALB → ECS Fargate tasks across Multi-AZ.

Chapter Two

Core Concepts

The Five Building Blocks Introductory

ECS has five core entities that form a clear hierarchy: Cluster → Service → Task → Container, plus a Task Definition that serves as the blueprint. Understanding how these relate is the single most important concept in ECS.

Cluster Core

A cluster is the logical grouping that holds all your ECS resources. It is the top-level boundary — services, tasks, and capacity all live inside a cluster. A cluster does not contain compute by itself — you register EC2 instances to it, or use Fargate which provisions compute on demand.

📦

What a Cluster Contains

One or more services (long-running containers)
Standalone tasks (one-off jobs)
Registered EC2 instances (EC2 launch type) or Fargate capacity
Capacity provider strategies

💡

Key Facts

A cluster is free — no cost for the cluster itself
You can have multiple clusters per account (one per environment is common)
A cluster can mix EC2 and Fargate launch types
Default cluster auto-created, but create named clusters for production

Common pattern: one cluster per environment — dev, staging, production. Each cluster has its own services and capacity, providing isolation between environments.

Task Definition Core

A task definition is a JSON document that describes how to run your container(s). Think of it as a blueprint or recipe — it specifies the Docker image, CPU and memory requirements, networking mode, IAM roles, environment variables, log configuration, and more. You never run a task definition directly — you use it to launch tasks.

🖼️

Container Image

Which Docker image to pull: from ECR, Docker Hub, or any registry. Example: 123456789012.dkr.ecr.us-east-1.amazonaws.com/web-api:v2 (AWS account IDs are 12 digits).

⚙️

Resource Limits

CPU and memory per task. Fargate has fixed combinations (e.g., 0.5 vCPU / 1GB). EC2 launch type is more flexible.

🔌

Networking & Ports

Network mode (awsvpc, bridge, host), port mappings, and security group assignments.

👉 Task definitions are versioned. Each update creates a new revision (e.g., web-api:1, web-api:2, web-api:3). You point your service at a specific revision. Rolling back = pointing the service to a previous revision. Old revisions are never deleted automatically.

Here is a minimal task definition in JSON — the key fields every ECS user must understand:

Task Definition — Key Fields (Visual Breakdown)
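As a rough sketch of those key fields, the Python below builds a minimal Fargate task definition as a dictionary and prints it as JSON. The field names follow the ECS task definition schema; the family name, account ID, role names, port, and log group are placeholders invented for illustration.

```python
import json

# Placeholder values (assumptions for illustration only).
ACCOUNT_ID = "123456789012"
REGION = "us-east-1"

task_definition = {
    "family": "web-api",                     # revisions become web-api:1, web-api:2, ...
    "networkMode": "awsvpc",                 # each task gets its own ENI and private IP
    "requiresCompatibilities": ["FARGATE"],
    "cpu": "512",                            # CPU units: 512 = 0.5 vCPU
    "memory": "1024",                        # MiB: 1024 = 1 GB
    # Execution role: used by ECS to pull the image and push logs.
    "executionRoleArn": f"arn:aws:iam::{ACCOUNT_ID}:role/ecsTaskExecutionRole",
    # Task role: used by the application code inside the container.
    "taskRoleArn": f"arn:aws:iam::{ACCOUNT_ID}:role/webApiTaskRole",
    "containerDefinitions": [
        {
            "name": "web-api",
            "image": f"{ACCOUNT_ID}.dkr.ecr.{REGION}.amazonaws.com/web-api:v2",
            "essential": True,               # if this container exits, the task stops
            "portMappings": [{"containerPort": 8080, "protocol": "tcp"}],
            "environment": [{"name": "APP_ENV", "value": "production"}],
            "logConfiguration": {
                "logDriver": "awslogs",
                "options": {
                    "awslogs-group": "/ecs/web-api",
                    "awslogs-region": REGION,
                    "awslogs-stream-prefix": "web",
                },
            },
        }
    ],
}

task_definition_json = json.dumps(task_definition, indent=2)
print(task_definition_json)
```

The printed JSON has the general shape you would register as a new task definition revision; real deployments usually add health checks, secrets, and tags on top of this.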

Task Core

A task is a running instance of a task definition. When ECS launches a task, it pulls the Docker image, allocates CPU/memory, assigns an ENI (in awsvpc mode), and starts the container(s). A task can contain one container (most common) or multiple (sidecar pattern).

🔄

Task Lifecycle

PROVISIONING → allocating resources (ENI, storage)
PENDING → pulling image, starting containers
RUNNING → containers executing
STOPPED → container exited (success or failure)

💡

Key Facts

Each task gets its own private IP (awsvpc mode)
Tasks are ephemeral — they can be replaced anytime
Essential container exits → entire task stops
Non-essential sidecar can crash without killing the task

Service Core

An ECS service maintains a desired count of running tasks. If a task crashes, the service replaces it. If you want 4 copies running at all times, the service keeps working toward 4 healthy tasks, launching a replacement whenever one stops or fails its health checks. Services also integrate with load balancers, automatically registering and deregistering tasks as targets.

🎯

Desired Count

"Run 4 tasks." If one dies, the service launches a replacement to bring the count back to 4. If you scale to 8, the service launches 4 more. It always works toward the target.

⚖️

Load Balancer

Service registers each task's IP:port with the ALB target group. When a task starts → registered. When it stops → deregistered. Zero manual work.

🚀

Deployments

Update the service's task definition → rolling deployment. New tasks start, old tasks drain. Configurable via minimumHealthyPercent and maximumPercent.
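To make the two deployment settings concrete, here is a small sketch of the arithmetic. It assumes the lower bound is rounded up and the upper bound is rounded down, which is how I understand the ECS documentation; the function name is made up for this example.

```python
import math

def deployment_bounds(desired_count, minimum_healthy_percent=100, maximum_percent=200):
    """Return (min_running, max_running) task counts allowed during a rolling deployment.

    Assumption: the minimum is rounded up and the maximum is rounded down.
    """
    min_running = math.ceil(desired_count * minimum_healthy_percent / 100)
    max_running = math.floor(desired_count * maximum_percent / 100)
    return min_running, max_running

# With 4 desired tasks and the settings 100 / 200, ECS may start up to 4 new
# tasks before stopping any old ones.
print(deployment_bounds(4, 100, 200))
# With 50 / 100, ECS must stop old tasks first to make room for new ones.
print(deployment_bounds(4, 50, 100))
```

A lower minimumHealthyPercent speeds up deployments on capacity-constrained clusters at the cost of temporarily reduced capacity; a higher maximumPercent does the opposite.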

👉 Service vs standalone task: Use a service for long-running workloads (web servers, APIs, workers). Use a standalone task for one-off jobs (database migration, scheduled batch, data export). The service restarts failed tasks. A standalone task runs once and stops.

Failure Handling & Self-Healing Core

ECS services are self-healing by default. You don't configure recovery — it is built into the service abstraction. If anything goes wrong with a running task, ECS replaces it automatically. Combined with ALB health checks, this creates a resilient system that recovers from failures without human intervention.

🔄

What ECS Heals Automatically

Task crashes (exit code ≠ 0): service launches replacement
ALB health check fails: task deregistered → replaced
EC2 instance dies: tasks rescheduled to healthy instances
AZ goes down: tasks rebalanced across remaining AZs
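Spot interruption: service launches a replacement according to its capacity provider strategy (Fargate Spot capacity is not automatically swapped for on-demand)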

⏱️

Recovery Timeline

Task crash: ~30-60s to launch replacement (Fargate)
Health check failure: deregistration delay + new task start
ALB update: automatic — new task registered, old drained
No manual intervention: service maintains desired count
Deployment rollback: the deployment circuit breaker reverts bad deploys automatically once it is enabled with rollback on the service
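The self-healing behavior above can be pictured as a reconciliation loop. The sketch below is a deliberately simplified model, not the actual ECS scheduler: it compares the desired count with the current tasks and decides what to start and stop. The task dictionaries and the function name are hypothetical.

```python
def reconcile(desired_count, tasks):
    """Decide which tasks to stop and how many to start.

    tasks: list of dicts like {"id": "t1", "healthy": True}.
    Simplified model: unhealthy tasks are always replaced, and any
    healthy surplus beyond the desired count is stopped.
    """
    unhealthy_ids = [t["id"] for t in tasks if not t["healthy"]]
    healthy_ids = [t["id"] for t in tasks if t["healthy"]]

    to_start = max(desired_count - len(healthy_ids), 0)
    surplus = max(len(healthy_ids) - desired_count, 0)
    to_stop = unhealthy_ids + healthy_ids[:surplus]
    return {"start": to_start, "stop": to_stop}

current = [
    {"id": "t1", "healthy": True},
    {"id": "t2", "healthy": True},
    {"id": "t3", "healthy": False},   # failed its ALB health check
    {"id": "t4", "healthy": True},
]
print(reconcile(4, current))
```

Running the same function repeatedly until it returns nothing to start or stop is the essence of "maintain desired count".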

Container Introductory

A container is a single Docker container inside a task. Most tasks run one container (your application). But ECS supports multi-container tasks — a common pattern for sidecars like log routers, tracing agents (X-Ray daemon), or envoy proxies. Containers in the same task share the network namespace (they can communicate over localhost) and can share volumes.

⚠️

Essential Containers

If a container marked "essential": true exits, the entire task stops. Your main app container should always be essential. Sidecar containers can be non-essential.

🔗

Multi-Container Patterns

Sidecar: X-Ray daemon, Datadog agent, Envoy proxy
Log router: Fluent Bit forwarding to CloudWatch/S3
Init-style container: runs to completion before the main app starts, expressed with dependsOn and a SUCCESS or COMPLETE condition
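A small sketch of how the essential flag and startup ordering fit together. The container names and images are invented for illustration; `essential` and `dependsOn` are real task definition fields, while `task_keeps_running` is only a toy model of the rule described above.

```python
# Three containers in one task: an init-style migration step, the main app,
# and a tracing sidecar. Names and images are placeholders.
container_definitions = [
    {"name": "migrate", "image": "example/migrate:1", "essential": False},
    {
        "name": "app",
        "image": "example/app:1",
        "essential": True,
        # Start only after the migration container has exited successfully.
        "dependsOn": [{"containerName": "migrate", "condition": "SUCCESS"}],
    },
    {"name": "xray-daemon", "image": "example/xray:1", "essential": False},
]

def task_keeps_running(definitions, exited_names):
    """Return False as soon as any exited container is marked essential."""
    essential = {d["name"]: d["essential"] for d in definitions}
    return not any(essential[name] for name in exited_names)

print(task_keeps_running(container_definitions, ["xray-daemon"]))
print(task_keeps_running(container_definitions, ["app"]))
```

Marking the migration container as non-essential matters here: it is expected to exit, and that exit should not stop the task.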

Task Role vs Execution Role In-Depth

This is the most exam-tested ECS concept, and the most commonly confused. ECS uses two separate IAM roles with completely different purposes:

Aspect	Task Role	Execution Role
Who uses it	Your application code inside the container	The ECS agent (not your code)
Purpose	Access AWS services from your app	Infrastructure setup: pull images, push logs
Example permissions	s3:GetObject, dynamodb:PutItem, sqs:SendMessage	ecr:GetAuthorizationToken, logs:CreateLogStream
JSON field	`taskRoleArn`	`executionRoleArn`
Required?	Only if your app calls AWS APIs	In practice yes on Fargate: needed to pull from ECR, send logs with awslogs, or inject secrets
Analogy	Employee badge — what rooms they can enter	Building manager — keeps the lights and doors working
If missing	App gets "Access Denied" calling AWS services	Task fails to start (can't pull image or push logs)

👉 Exam trap: "The container needs to write to S3 — which role?" → Task Role (your app's permissions). "The container fails to start because it can't pull from ECR" → Execution Role is missing or wrong. Never confuse the two — the exam does this deliberately.
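The split between the two roles can be sketched as two separate IAM policy documents that share one trust policy. The action lists below are examples (the execution-role list mirrors the usual ECR and CloudWatch Logs permissions); the helper names are hypothetical and the wildcard resource is only for brevity.

```python
import json

# Both roles are assumed by the ECS tasks service principal.
TRUST_POLICY = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"Service": "ecs-tasks.amazonaws.com"},
        "Action": "sts:AssumeRole",
    }],
}

# Execution role: infrastructure setup done on your behalf (image pull, logs).
EXECUTION_ROLE_ACTIONS = [
    "ecr:GetAuthorizationToken",
    "ecr:BatchCheckLayerAvailability",
    "ecr:GetDownloadUrlForLayer",
    "ecr:BatchGetImage",
    "logs:CreateLogStream",
    "logs:PutLogEvents",
]

# Task role: whatever your application code calls (example actions).
TASK_ROLE_ACTIONS = ["s3:GetObject", "dynamodb:PutItem", "sqs:SendMessage"]

def policy_document(actions, resource="*"):
    """Build a minimal allow policy; scope `resource` down in real use."""
    return {
        "Version": "2012-10-17",
        "Statement": [{"Effect": "Allow", "Action": list(actions), "Resource": resource}],
    }

def role_for_action(action):
    """Tell which task definition field should carry a given permission."""
    wanted = action.lower()  # IAM action names are case-insensitive
    if wanted in (a.lower() for a in EXECUTION_ROLE_ACTIONS):
        return "executionRoleArn"
    if wanted in (a.lower() for a in TASK_ROLE_ACTIONS):
        return "taskRoleArn"
    return None

print(json.dumps(policy_document(TASK_ROLE_ACTIONS), indent=2))
print(role_for_action("ecr:GetAuthorizationToken"))
```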

Concept Diagram — Entity Hierarchy Introductory

ECS Entity Hierarchy — Cluster → Service → Task → Container

AWS Diagram — Service with ALB across 2 AZs Core

Running Service — 3 Tasks across 2 AZs with ALB Integration

Architecture Diagram — Multi-Container Task Detail In-Depth

Multi-Container Task — Sidecar Pattern with X-Ray + Fluent Bit

🎓 Exam Tips — Chapter 02

Task Role vs Execution Role — the #1 most tested concept. Task Role = your app's permissions. Execution Role = ECS agent's permissions (pulling images, pushing logs).
"Container can't pull image from ECR" → Missing or incorrect Execution Role. Not the Task Role.
"App returns Access Denied when writing to S3" → Missing or incorrect Task Role. Not the Execution Role.
Essential container exits → entire task stops. Non-essential sidecars can fail without killing the task.
Task Definition is versioned. Each update = new revision. Rollback = point service to older revision number.
Containers in the same task share network (communicate via localhost) and share the CPU/memory budget.
Distractor: "Use EC2 instance role for container AWS access" — wrong. ECS containers use Task Role, not the EC2 instance profile (even on EC2 launch type).

📋 Chapter 2 — Summary

Cluster: logical grouping. Free. One per environment is common.
Task Definition: JSON blueprint — image, CPU, memory, roles, ports, logs. Versioned with revisions.
Task: running instance of a task definition. Gets its own IP (awsvpc). Ephemeral.
Service: maintains desired task count. Auto-restarts failed tasks. Integrates with ALB.
Container: Docker container inside a task. Essential flag controls task lifecycle.
Task Role: your app's AWS permissions (S3, DynamoDB). Execution Role: ECS agent's permissions (ECR pull, CW logs).
Multi-container tasks: sidecar pattern — X-Ray daemon, log router, envoy proxy share network with main app.

Chapter Three

Launch Types

Two Ways to Run Containers Introductory

ECS gives you two main choices for where your containers physically run: EC2 launch type (you manage the servers) or Fargate launch type (AWS manages the servers). This is the single most impactful architectural decision in ECS: it determines your pricing model, operational burden, scaling behavior, and what features are available.

EC2 Launch Type Core

With the EC2 launch type, you provision and manage a fleet of EC2 instances. You register these instances with your ECS cluster by installing the ECS container agent (pre-installed on the Amazon ECS-optimized AMI). ECS places your containers on these instances based on available CPU and memory. You are responsible for patching, scaling, and monitoring the instances themselves.

✅

Strengths

Full control — instance type, AMI, OS patches, SSH access
GPU support — P3, P4, G4 instances for ML workloads
Persistent EBS volumes — attach to specific instances
Daemon scheduling — run one agent per instance (monitoring, logging)
Higher task density — pack many small tasks on one large instance
Cheaper for steady-state — Reserved Instances / Savings Plans work

⚠️

Trade-offs

You manage instances — patching, AMI updates, agent upgrades
Capacity planning — must provision enough instances for peak
ENI limits: each awsvpc task consumes one ENI. Small instances (t3.micro) may support only 1-2 tasks. ENI trunking raises the limit on supported instance types.
Idle waste — pay for full instance even if half-empty
Scaling is two-layer — scale tasks AND scale instances (Auto Scaling Group)

👉 The ECS container agent is a Docker container itself that runs on every EC2 instance. It communicates with the ECS control plane, receives task placement instructions, starts/stops containers, and reports health. Use the ECS-optimized AMI (Amazon Linux 2023) — it comes pre-configured with Docker and the agent.

Fargate Launch Type Core

With Fargate, you do not provision or manage any servers. You specify CPU and memory requirements in the task definition, and AWS provisions a compute environment for each task. You never see the underlying instance. Each task runs in its own isolated micro-VM (using Firecracker), providing strong security isolation — one customer's task cannot affect another's.

✅

Strengths

Zero server management — no instances to patch, scale, or monitor
Per-task pricing — pay only for the vCPU and memory your task uses
No idle waste — no instance running empty at 2 AM
Task-level isolation — Firecracker micro-VM per task
Scaling is one-layer — just change desired count, Fargate handles capacity
Fargate Spot — up to 70% discount for interruptible tasks

⚠️

Trade-offs

No GPU support — cannot use GPU instance types
No daemon scheduling — can't run one agent per "host"
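No long-lived EBS volumes: 20 GB of ephemeral storage by default (up to 200 GB); any EBS volume attached to a Fargate task is created for that task rather than reattached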
Fixed CPU/memory combos — limited set of valid pairings
No SSH access — debug via ECS Exec only
Higher per-unit cost — ~20% more expensive per vCPU-hour than EC2

The Fargate pricing model is straightforward: you pay per vCPU-second and per GB-second your task runs. There is no cost when no tasks are running (unlike EC2 where the instance bill continues). This makes Fargate ideal for variable workloads — the cost matches the actual usage precisely.
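The pricing formula can be written out as a short calculation. The rates in the example call are round placeholder numbers, not real AWS prices; look up the current per-vCPU-hour and per-GB-hour rates for your region before using it.

```python
def fargate_task_cost(vcpu, memory_gb, seconds, vcpu_hour_rate, gb_hour_rate):
    """Estimate the cost of one task run from per-hour rates, billed by the second."""
    hours = seconds / 3600
    return hours * (vcpu * vcpu_hour_rate + memory_gb * gb_hour_rate)

# Placeholder rates (assumptions, NOT actual AWS pricing).
EXAMPLE_VCPU_HOUR_RATE = 0.04
EXAMPLE_GB_HOUR_RATE = 0.004

one_hour = fargate_task_cost(1, 2, 3600, EXAMPLE_VCPU_HOUR_RATE, EXAMPLE_GB_HOUR_RATE)
idle = fargate_task_cost(1, 2, 0, EXAMPLE_VCPU_HOUR_RATE, EXAMPLE_GB_HOUR_RATE)
print(f"1 vCPU / 2 GB for one hour: ${one_hour:.4f}")
print(f"No running time: ${idle:.4f}")
```

Multiplying the per-task figure by the average number of running tasks gives a first-order monthly estimate to compare against an EC2 fleet.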

Fargate CPU/Memory Combinations Core

Fargate does not let you specify arbitrary CPU and memory — there are fixed valid combinations. If you specify an invalid pairing, the task definition fails to register.

vCPU	Memory Options (GB)	Typical Use Case
0.25 vCPU	0.5, 1, 2	Tiny microservices, health checkers
0.5 vCPU	1, 2, 3, 4	APIs, lightweight web servers
1 vCPU	2, 3, 4, 5, 6, 7, 8	Standard APIs, workers
2 vCPU	4 – 16 (in 1GB steps)	Batch jobs, heavier services
4 vCPU	8 – 30 (in 1GB steps)	Data processing, analytics
8 vCPU	16 – 60 (in 4GB steps)	ML inference, heavy compute
16 vCPU	32 – 120 (in 8GB steps)	Large in-memory workloads

👉 Exam tip: If a question says "the task requires 3 vCPU and 6GB memory" — there is no 3 vCPU option in Fargate. You must round up to 4 vCPU. This is a common exam trap. Know the valid vCPU values: 0.25, 0.5, 1, 2, 4, 8, 16.
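The table above translates directly into a lookup. The sketch below encodes the combinations exactly as listed in this chapter and adds a helper that rounds a requirement up to the smallest valid pairing; both function names are made up for this example.

```python
def _steps(start, stop, step):
    """Inclusive range of memory sizes in GB."""
    return [float(v) for v in range(start, stop + 1, step)]

# vCPU -> valid memory sizes in GB, as listed in the table above.
FARGATE_MEMORY_GB = {
    0.25: [0.5, 1.0, 2.0],
    0.5: _steps(1, 4, 1),
    1: _steps(2, 8, 1),
    2: _steps(4, 16, 1),
    4: _steps(8, 30, 1),
    8: _steps(16, 60, 4),
    16: _steps(32, 120, 8),
}

def is_valid_combo(vcpu, memory_gb):
    """True if the vCPU/memory pair is one of the listed Fargate combinations."""
    return memory_gb in FARGATE_MEMORY_GB.get(vcpu, [])

def smallest_fit(vcpu_needed, memory_needed_gb):
    """Smallest listed combination that covers the requirement, or None."""
    for vcpu in sorted(FARGATE_MEMORY_GB):
        if vcpu < vcpu_needed:
            continue
        for memory in FARGATE_MEMORY_GB[vcpu]:
            if memory >= memory_needed_gb:
                return vcpu, memory
    return None

print(is_valid_combo(3, 6))   # 3 vCPU is not an option
print(smallest_fit(3, 6))     # rounds up to the 4 vCPU tier
```

Note that rounding up the vCPU also raises the minimum memory: the 4 vCPU tier starts at 8 GB, so a "3 vCPU / 6 GB" requirement lands on 4 vCPU / 8 GB.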

EC2 vs Fargate — Full Comparison Core

Feature	EC2 Launch Type	Fargate Launch Type
Server management	You manage EC2 instances	AWS manages (serverless)
Pricing	Pay for EC2 instances (running or not)	Pay per vCPU-second + GB-second per task
Isolation	Instance-level (shared host for tasks)	Task-level (Firecracker micro-VM)
GPU support	✅ Full GPU access (P3, P4, G4, G5)	❌ No GPU available
Persistent storage (EBS)	✅ EBS volumes attachable	⚠️ Task-scoped EBS volume only; ephemeral storage 20-200GB
EFS (shared file system)	✅ Supported	✅ Supported
Spot instances	✅ EC2 Spot (up to 90% discount)	✅ Fargate Spot (up to 70% discount)
Daemon scheduling	✅ One task per instance	❌ Not supported
Task density	Pack multiple tasks per instance	One micro-VM per task
SSH access	✅ Direct SSH to instance	❌ ECS Exec only (SSM-based)
Scaling layers	2 layers: tasks + instances (ASG)	1 layer: tasks only
Cold start	~minutes (if ASG needs new instance)	~30-60s (Fargate provisions infra)
Best for	Large steady workloads, GPU, tight cost control	Variable/spiky workloads, simplicity, microservices

Hybrid: EC2 + Fargate in the Same Cluster In-Depth

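A single ECS cluster can use both launch types simultaneously. This is a common production pattern for cost optimization: run steady-state services on EC2 capacity covered by Reserved Instances or Savings Plans (cheapest baseline), and run variable or bursty services on Fargate (no pre-provisioning needed). Capacity provider strategies define the mix for each service. A single strategy can weight Fargate against Fargate Spot, or weight several Auto Scaling group providers, but it cannot combine Fargate and Auto Scaling group providers, so the EC2/Fargate split is made service by service: for example, "workers on EC2, batch jobs on Fargate Spot, web tier on Fargate."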
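As an illustration of how a capacity provider strategy distributes tasks, the sketch below uses the real `capacityProvider`, `base`, and `weight` fields with the two Fargate providers. The distribution function is a simplified model: the base is filled first, the remainder is split by weight, and leftovers go to the largest fractional share. That rounding rule is an assumption of this sketch, not documented ECS behavior.

```python
# Keep a guaranteed base on regular Fargate, then favor Fargate Spot 3:1.
strategy = [
    {"capacityProvider": "FARGATE", "base": 2, "weight": 1},
    {"capacityProvider": "FARGATE_SPOT", "weight": 3},
]

def split_tasks(task_count, strategy):
    """Approximate how many tasks land on each capacity provider."""
    placed = {item["capacityProvider"]: 0 for item in strategy}
    remaining = task_count

    # 1. Satisfy the base first.
    for item in strategy:
        take = min(item.get("base", 0), remaining)
        placed[item["capacityProvider"]] += take
        remaining -= take

    # 2. Split the rest in proportion to the weights.
    total_weight = sum(item["weight"] for item in strategy)
    if remaining and total_weight:
        exact = {
            item["capacityProvider"]: remaining * item["weight"] / total_weight
            for item in strategy
        }
        for name, share in exact.items():
            placed[name] += int(share)
        leftover = remaining - sum(int(share) for share in exact.values())
        # Assumed rounding rule: largest fractional share gets the leftovers.
        by_fraction = sorted(exact, key=lambda n: exact[n] - int(exact[n]), reverse=True)
        for name in by_fraction[:leftover]:
            placed[name] += 1
    return placed

print(split_tasks(10, strategy))
```

With 10 tasks, 2 go to the base, and the remaining 8 split 2 and 6, so the on-demand share shrinks as the service scales out.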

🌐

Web Tier → Fargate

Variable traffic, auto-scales, no servers to manage. Simplest operational model for customer-facing services.

⚙️

Workers → EC2

Steady-state processing, Reserved Instances for cost. Pack multiple worker tasks per large instance for efficiency.

📊

Batch → Fargate Spot

Interruptible batch jobs get up to 70% discount. Task retries handle interruptions naturally.

👉 Decision framework: Start with Fargate. It is simpler and scales naturally. Move to EC2 launch type only when you need: (1) GPU, (2) EBS persistent volumes, (3) daemon scheduling, (4) cost optimization on large steady-state fleets, or (5) specific instance types. Fargate is the default for most new workloads on ECS.
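The decision framework reads naturally as a checklist. The function below is just that checklist in code, with hypothetical parameter names; it returns the suggested launch type and the reasons that pushed the choice toward EC2.

```python
def choose_launch_type(needs_gpu=False, needs_persistent_ebs=False,
                       needs_daemon=False, large_steady_fleet=False,
                       needs_specific_instance_type=False):
    """Default to Fargate; collect any reasons that call for EC2 instead."""
    checks = [
        (needs_gpu, "GPU"),
        (needs_persistent_ebs, "EBS persistent volumes"),
        (needs_daemon, "daemon scheduling"),
        (large_steady_fleet, "cost optimization on a large steady-state fleet"),
        (needs_specific_instance_type, "specific instance types"),
    ]
    reasons = [label for flag, label in checks if flag]
    if reasons:
        return "EC2", reasons
    return "FARGATE", []

print(choose_launch_type())
print(choose_launch_type(needs_gpu=True, needs_daemon=True))
```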

Task Placement Strategies (EC2 Launch Type) In-Depth

When using the EC2 launch type, ECS decides which instance gets each new task. Task placement strategies control this decision. They apply only to EC2 — Fargate handles placement internally (one micro-VM per task, AWS chooses the host).

Strategy	How It Works	Best For
spread	Distribute tasks evenly across the specified field (e.g., `attribute:ecs.availability-zone` or `instanceId`)	High availability — ensures AZ failure impacts minimum tasks
binpack	Pack tasks onto the fewest instances possible (by CPU or memory)	Cost optimization — fewer instances running, lower EC2 bill
random	Place tasks on random instances	Simple workloads, testing — no preference

🎯

Combining Strategies

You can chain strategies in order of priority. Example: spread(az) first, then binpack(memory). This spreads across AZs for HA, then packs tightly within each AZ for cost savings.

🚧

Placement Constraints

Constraints filter which instances are eligible: distinctInstance (no two tasks on same instance) or memberOf (custom expressions like attribute:ecs.instance-type == g4dn.xlarge).

👉 Default behavior: for services, ECS spreads tasks across Availability Zones by default. This is the safest default because it maximizes availability. Switch to binpack when cost optimization is the priority and you can tolerate reduced AZ spread.
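Here is how a chained strategy and a constraint look as request-shaped data, followed by a toy placement function. The `type`, `field`, and `expression` values match the ones named in this section; `pick_instance` is a simplified illustration of "spread by AZ, then binpack by memory", not the real ECS placement engine, and the instance records are invented.

```python
placement_strategy = [
    {"type": "spread", "field": "attribute:ecs.availability-zone"},
    {"type": "binpack", "field": "memory"},
]
placement_constraints = [
    {"type": "memberOf", "expression": "attribute:ecs.instance-type == g4dn.xlarge"},
]

def pick_instance(instances, task_memory):
    """Pick an instance: least-loaded AZ first, then least free memory that still fits."""
    candidates = [i for i in instances if i["free_memory"] >= task_memory]
    if not candidates:
        return None
    tasks_per_az = {}
    for inst in instances:
        tasks_per_az[inst["az"]] = tasks_per_az.get(inst["az"], 0) + inst["tasks"]
    best = min(candidates, key=lambda i: (tasks_per_az[i["az"]], i["free_memory"]))
    return best["id"]

fleet = [
    {"id": "i-a", "az": "us-east-1a", "free_memory": 4096, "tasks": 2},
    {"id": "i-b", "az": "us-east-1b", "free_memory": 8192, "tasks": 0},
    {"id": "i-c", "az": "us-east-1b", "free_memory": 2048, "tasks": 0},
]
print(pick_instance(fleet, 1024))
```

In the sample fleet, the emptier AZ wins first, and within it the instance with the least free memory that still fits the task is chosen, which leaves the larger instance free for bigger tasks.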

EC2 vs Fargate — Who Manages What

AWS Diagram — EC2 Cluster vs Fargate Cluster Core

EC2 Launch Type (with ASG) vs Fargate Launch Type

Architecture Diagram — Mixed Workload Cluster In-Depth

Hybrid Cluster — Web on Fargate + Batch on EC2 Spot + ML on EC2 GPU

This architecture uses each launch type where it shines: Fargate for the web tier (simple, no servers, auto-scales), EC2 Spot for batch processing (cheapest compute, interruption-tolerant), and EC2 GPU for ML inference (needs hardware that Fargate can't provide). All managed through a single ECS cluster with capacity provider strategies.

🎓 Exam Tips — Chapter 03

"No server management" + containers → Always Fargate. This is the exam's favorite phrase for Fargate.
"Requires GPU" → Must use EC2 launch type. Fargate does not support GPU instances.
"Run one monitoring agent per host" → Daemon scheduling on EC2 launch type. Fargate doesn't support daemons.
"Need persistent block storage (EBS)" → EC2 launch type. Fargate only has ephemeral storage.
"Need shared file storage across tasks" → EFS works with both EC2 and Fargate. Don't pick EC2 just for shared storage.
Fargate Spot — up to 70% discount but tasks can be interrupted with 2-minute warning. Good for batch. Not for web servers.
Valid vCPU values: 0.25, 0.5, 1, 2, 4, 8, 16. If a question says "3 vCPU" — that's invalid, must round up to 4.
Fargate cold start ~30-60s. If the question requires "sub-second scaling" → EC2 with pre-warmed instances.
Distractor: "Fargate is always cheaper than EC2" — wrong. For large steady-state workloads, EC2 with Reserved Instances is cheaper per unit.

📋 Chapter 3 — Summary

EC2 launch type: you manage instances. Full control, GPU, EBS, daemon scheduling. Cheaper for steady-state (RI/SP).
Fargate: serverless containers. Zero server management. Per-task pricing. No GPU, no long-lived EBS volumes, no SSH.
Fargate isolation: each task runs in its own Firecracker micro-VM (task-level isolation vs instance-level).
Fargate CPU/memory: fixed combinations. vCPU options: 0.25, 0.5, 1, 2, 4, 8, 16.
Hybrid clusters: use both launch types. Web on Fargate, batch on EC2 Spot, ML on EC2 GPU.
Default choice: start with Fargate. Move to EC2 only when you need GPU, EBS, daemons, or cost optimization at scale.
Fargate Spot: up to 70% discount for interruptible batch workloads.

☁️

Deep Dive

Fargate — Complete Understanding

What is Fargate — Behind the Scenes Core

Fargate is a serverless compute engine for containers. It removes the server layer entirely — you define what you want to run and how much CPU/memory it needs. AWS handles everything else: provisioning compute, patching the OS, managing the container runtime, and isolating your workload.

🧠

The Right Mental Model

Most people think Fargate = "Lambda for containers." That's not quite right.

👉 Better mental model:

"Fargate = EC2 without access to EC2"

Your container gets a VPC, an ENI, security groups, private IP — just like EC2. You just can't SSH in, can't pick the instance type, can't install host-level agents. The EC2 exists — you just don't see it.

🔧

What Happens Under the Hood

AWS provisions a Firecracker micro-VM per task
AWS manages the host OS, container runtime (containerd), and ECS agent
You never see the underlying EC2 instance
Isolation: each task is a separate micro-VM (not just a container on a shared host)
You only define: CPU, memory, container image, networking

Internally: Fargate still runs on EC2 hardware (Nitro instances). It's EC2 that AWS manages for you — not a different compute technology.

👉 Key insight: Fargate is NOT a separate compute platform. It's an abstraction layer over EC2. AWS is running EC2 instances, launching Firecracker micro-VMs on them, and exposing only the container interface to you. This is why Fargate tasks behave like EC2 instances (own IP, security groups, VPC placement) — because under the hood, they ARE running on EC2.

Fargate Networking Model Deep

Every Fargate task gets its own Elastic Network Interface (ENI) with a private IP address in your VPC. This has major implications:

🌐

How It Works

Each task = own ENI = own private IP
ENI lives in your subnet (public or private)
You attach security groups directly to the task
Tasks can communicate using standard VPC networking
Always uses awsvpc network mode (no other option)

💡

Implications

No port conflicts — every task has its own IP, so all can use port 80
Security group per task — fine-grained firewall rules
Task behaves like an EC2 instance from a networking perspective
ALB targets individual tasks by IP (not instance + port)

⚠️ Critical requirement: Fargate tasks in private subnets need a route to the image registry and to any external APIs they call, usually through a NAT Gateway. Without one, the task cannot pull its image: it typically sits in PENDING and then stops with an error such as CannotPullContainerError. For ECR image pulls, you can instead use VPC endpoints (interface endpoints for ECR plus a gateway endpoint for S3, where image layers are stored) to avoid NAT Gateway costs.
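The networking choices for a task can be sketched as the `awsvpcConfiguration` block plus a quick reachability check. The subnet and security group IDs are placeholders, and `image_pull_route` is a hypothetical helper that only encodes the rules described in this section.

```python
# Shape of the network settings passed when running a task or creating a service.
network_configuration = {
    "awsvpcConfiguration": {
        "subnets": ["subnet-aaaa1111", "subnet-bbbb2222"],  # placeholder IDs, one per AZ
        "securityGroups": ["sg-cccc3333"],                  # placeholder ID
        "assignPublicIp": "DISABLED",                       # private subnets
    }
}

def image_pull_route(image_in_ecr, has_nat_gateway=False,
                     has_ecr_vpc_endpoints=False,
                     public_subnet_with_public_ip=False):
    """Return how the task can reach its image registry, or None if it cannot."""
    if public_subnet_with_public_ip:
        return "internet gateway"
    if has_nat_gateway:
        return "NAT gateway"
    if image_in_ecr and has_ecr_vpc_endpoints:
        return "VPC endpoints"
    return None  # the image pull will fail and the task will stop

print(image_pull_route(image_in_ecr=True, has_ecr_vpc_endpoints=True))
print(image_pull_route(image_in_ecr=False, has_ecr_vpc_endpoints=True))
```

The second call returns None because VPC endpoints for ECR do not help with an image hosted on Docker Hub; that case still needs NAT or a public IP.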

Fargate Resource Model — CPU/Memory Combinations Core

Fargate does NOT allow arbitrary resource values. You must choose from predefined CPU/memory combinations. If you specify an invalid pair, the task definition will fail to register.

vCPU	Memory Options (GB)	Typical Use Case
0.25 vCPU	0.5, 1, 2	Microservices, health checks, lightweight APIs
0.5 vCPU	1, 2, 3, 4	Small web apps, background workers
1 vCPU	2, 3, 4, 5, 6, 7, 8	Standard web apps, APIs
2 vCPU	4–16 (in 1GB increments)	Medium workloads, data processing
4 vCPU	8–30 (in 1GB increments)	Large apps, compute-heavy tasks
8 vCPU	16–60 (in 4GB increments)	Heavy processing, in-memory caching
16 vCPU	32–120 (in 8GB increments)	Max power (rare, expensive)

👉 Exam trap: "The task requires 3 vCPU and 6GB memory." There is no 3 vCPU option — you must round up to 4 vCPU. Valid vCPU values: 0.25, 0.5, 1, 2, 4, 8, 16. Nothing in between. This is a frequently tested concept.

Fargate Startup Behavior Core

Fargate task startup is not instant — understanding the timeline helps you set realistic expectations for scaling and health check grace periods:

⏱️

Startup Time: ~30–60 seconds

Typical cold start. Includes compute provisioning + image pull + container start. Larger images = longer startup.

📦

What Happens During Startup

AWS provisions Firecracker micro-VM
Attaches ENI to your subnet
Pulls container image from ECR/registry
Starts your container process
Health check grace period begins

📊

Compared To

Lambda cold start: 100ms–3s (faster)
Fargate: 30–60s
EC2 launch: 2–5 min (slower)

For latency-sensitive scaling, keep min tasks > 0 to avoid cold starts.

Fargate Storage Core

Storage in Fargate is fundamentally different from EC2 — there is no EBS available. Understanding what you get (and don't get) prevents painful surprises:

💾

Ephemeral Storage

Default: 20 GB per task
Configurable: up to 200 GB
Lifecycle: destroyed when task stops
Fast local SSD — good for temp files, caching, scratch space
Shared across all containers in the task

📂

Persistent Storage: EFS

Amazon EFS = only persistent storage option for Fargate
Shared filesystem — multiple tasks read/write simultaneously
Survives task restarts
Mount as a volume in task definition
Use for: shared config, uploaded files, ML models

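⚠️ No EBS on Fargate (classic model). This guide, like most exam material, treats Fargate storage as ephemeral storage + EFS only: if your workload requires EBS volumes (high IOPS, block storage, databases), the expected answer is EC2 launch type. AWS added task-attached EBS volumes for ECS in 2024, so check the current documentation before ruling Fargate out in a real design.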
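As a rough illustration of how the two storage options appear in a task definition, the sketch below builds the relevant JSON keys as Python dictionaries. The key names (ephemeralStorage, efsVolumeConfiguration, mountPoints) follow the ECS task definition format; the volume name, file system ID, and mount path are placeholders.

```python
import json


def fargate_storage_config(ephemeral_gib, efs_file_system_id, container_path):
    """Return (task-level keys, container mountPoints) for a Fargate task definition."""
    if not 21 <= ephemeral_gib <= 200:
        # The 20 GiB default needs no setting; explicit sizes run from 21 to 200 GiB.
        raise ValueError("sizeInGiB must be between 21 and 200")
    task_level = {
        "ephemeralStorage": {"sizeInGiB": ephemeral_gib},
        "volumes": [
            {
                "name": "shared-efs",  # placeholder volume name
                "efsVolumeConfiguration": {
                    "fileSystemId": efs_file_system_id,
                    "transitEncryption": "ENABLED",
                },
            }
        ],
    }
    # mountPoints goes inside the container definition that uses the volume.
    mount_points = [
        {"sourceVolume": "shared-efs", "containerPath": container_path, "readOnly": False}
    ]
    return task_level, mount_points


# "fs-0123456789abcdef0" is a placeholder file system ID.
task_level, mount_points = fargate_storage_config(50, "fs-0123456789abcdef0", "/mnt/shared")
print(json.dumps(task_level, indent=2))
print(json.dumps(mount_points, indent=2))
```

Anything written under the mount path survives task restarts; anything written elsewhere lands on ephemeral storage and is lost when the task stops.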

Fargate Limitations Core

Fargate is excellent — but it's NOT always the right choice. These limitations are critical for architecture decisions and exam answers:

🚫

What Fargate Cannot Do

No GPU support — ML training, rendering → use EC2
No EBS volumes — only ephemeral + EFS
No daemon containers — can't run node-level agents (Datadog agent, Fluentd)
No SSH access — cannot log into the host
No custom AMI — can't customize the underlying OS
No privileged mode — can't run containers with root-level host access
Limited Windows support: Windows containers run on Fargate, but with a smaller feature set than Linux
Fixed CPU/memory combos — can't choose arbitrary values

💰

Cost Considerations

Fargate is ~20-40% more expensive per vCPU-hour than EC2 On-Demand
EC2 with Reserved Instances / Savings Plans = much cheaper for steady workloads
Fargate Spot helps (up to 70% off) but can be interrupted
Break-even point: if task utilization is >70% consistently for 24/7 workloads, EC2 is cheaper
Fargate wins: for variable workloads, burst traffic, short-lived tasks

Fargate vs EC2 — Decision Guide Deep

☁️

Choose Fargate When

You want zero infrastructure management
Workloads are variable or bursty
Fast setup and iteration speed matter
Small team with no dedicated DevOps
Security isolation per task is important
You're running microservices (many small containers)
Development and staging environments

🖥️

Choose EC2 When

You need GPU instances (ML training, rendering)
You want cost optimization at scale (RI/SP at 50-60% off)
You need EBS volumes (databases, high IOPS)
You run daemon containers (per-host log and monitoring agents)
You need privileged mode or custom OS configs
Workloads are steady-state 24/7 at high utilization
You need instance types Fargate doesn't match (compute-optimized, memory-optimized)

👉 Golden rule: Start with Fargate. Move to EC2 only when you hit a specific limitation (GPU, EBS, cost, daemons). Don't pre-optimize — Fargate's operational simplicity saves engineering time that often exceeds the compute cost difference.

Fargate Execution Flow Core

Fargate Task Lifecycle — From Request to Running

Common Fargate Mistakes Core

❌

Assuming GPU Support

Fargate does NOT support GPU workloads. For ML training, inference with GPU, or rendering — you MUST use EC2 launch type with P/G instance families.

❌

Forgetting NAT Gateway

Fargate tasks in private subnets cannot reach the internet (or ECR) without a NAT Gateway or VPC Endpoints. The task never reaches RUNNING: the image pull times out, the task stops with CannotPullContainerError, and the service keeps retrying.

❌

Overprovisioning Resources

Choosing 4 vCPU / 8GB when the app uses 0.5 vCPU / 512MB. Fargate bills per-second — oversized tasks = wasted money every second they run.

❌

Expecting EBS

Fargate only has ephemeral storage + EFS. If you need high-IOPS block storage (databases, caches with persistence) — use EC2 launch type.

❌

Large Image + Cold Start

Using 2GB+ images on Fargate → 60+ second startup times. Keep images lean (<500MB). Use multi-stage Docker builds to minimize image size.

❌

Daemon Scheduling

Trying to run "one per host" containers (log agents, monitoring) on Fargate — there's no concept of "host." Use ECS daemon service with EC2 launch type instead.

☁️ Fargate Deep Dive — Summary

Fargate = EC2 without access to EC2. Serverless containers with full VPC networking.

Behind the scenes: Firecracker micro-VMs on AWS-managed EC2. Each task = isolated VM, own ENI, own IP.
Networking: awsvpc mode only. Each task gets an ENI in your subnet with security groups. Private subnets need a NAT Gateway or VPC Endpoints.
Resources: Fixed CPU/memory combos. Valid vCPU: 0.25, 0.5, 1, 2, 4, 8, 16. No arbitrary values.
Startup: ~30-60 seconds (provision + ENI + image pull + start). Not instant like Lambda.
Storage: Ephemeral 20-200GB (destroyed on stop) + EFS (persistent, shared). NO EBS.
Limitations: No GPU, no EBS, no daemons, no SSH, no privileged mode, no custom AMI.
Cost: ~20-40% more than EC2 On-Demand. Wins for variable/burst workloads. Loses for 24/7 steady-state at scale.
Golden rule: Start with Fargate. Move to EC2 only when you hit a limitation.

Chapter Four

Networking & Storage

Networking Modes Core

How your ECS tasks connect to the network determines their security posture, IP behavior, and load balancer integration. ECS supports three main networking modes on Linux (a fourth, none, simply disables external networking), but for all practical purposes, awsvpc is the one you should use (and the only one that works with Fargate).

Network Mode	How It Works	Launch Type	Use Case
awsvpc	Each task gets its own ENI (Elastic Network Interface) with a private IP in your VPC	EC2 + Fargate	Production standard — all new workloads
bridge	Tasks share the host's network via Docker bridge. Dynamic port mapping.	EC2 only	Legacy. Only if migrating from Docker Compose
host	Task uses the host EC2 instance's network directly. No isolation.	EC2 only	Maximum performance (no NAT overhead). Rare.

awsvpc Mode — The Standard Core

In awsvpc mode, each ECS task gets its own Elastic Network Interface (ENI) — a real VPC network interface with a private IP address. This means each task has its own security group, appears as a distinct network entity in your VPC, and can be targeted directly by load balancers. There is no port conflict — every task listens on the same container port (e.g., 8080) because each has its own IP.

✅

Benefits

Task-level security groups — different rules per service
No port conflicts — every task uses port 8080, own IP
VPC Flow Logs per task — full network visibility
Direct ALB targeting by IP — no dynamic port mapping needed
Required for Fargate — the only option that works

⚠️

ENI Limits (EC2 only)

Each EC2 instance type has a max ENI count
Each task in awsvpc mode consumes one ENI
t3.micro: 2 ENIs → only 1 task (1 ENI for the instance itself)
m5.xlarge: 4 ENIs → 3 tasks max
ENI trunking (opt-in) increases the limit significantly
Not a concern with Fargate — AWS manages this

👉 ENI trunking is an opt-in feature that lets you run more tasks per EC2 instance in awsvpc mode. It creates a "trunk" ENI with multiple "branch" ENIs sharing it. Enable it via account settings: aws ecs put-account-setting --name awsvpcTrunking --value enabled. Trunking applies only to supported instance types and to instances launched after you opt in. With trunking, an m5.xlarge can support about 20 tasks instead of 3.
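The ENI arithmetic above is simple enough to script when sizing a cluster. This is a minimal sketch that uses only the two ENI limits quoted in the list and ignores CPU and memory; the function names are illustrative.

```python
import math

# ENI limits quoted in the list above.
ENI_LIMITS = {"t3.micro": 2, "m5.xlarge": 4}


def tasks_per_instance(instance_type):
    """awsvpc tasks per instance without trunking: one ENI stays with the instance."""
    return ENI_LIMITS[instance_type] - 1


def instances_needed(task_count, instance_type):
    """Instances required if ENIs are the only constraint (CPU and memory ignored)."""
    return math.ceil(task_count / tasks_per_instance(instance_type))


print(tasks_per_instance("t3.micro"))      # 1
print(instances_needed(10, "m5.xlarge"))   # 4
```

In practice the real limit is whichever runs out first: ENIs, CPU, or memory.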

👉 Mental model: In awsvpc mode, each task behaves exactly like a standalone EC2 instance from a networking perspective — it has its own private IP, its own security group, its own entry in VPC Flow Logs, and can be directly addressed by other services. The only difference: it is a container, not a VM. This is why awsvpc is required for Fargate — it provides the clean network isolation that serverless containers need.

Security Groups for Tasks Core

In awsvpc mode, security groups are assigned at the task level (via the service's network configuration), not at the instance level. This gives you fine-grained control:

🌐

Web Tier SG

Inbound: port 8080 from ALB SG
Outbound: port 5432 to DB SG
Outbound: port 443 to internet (HTTPS)

⚙️

Worker Tier SG

Inbound: none (pulls from SQS)
Outbound: port 443 to SQS/S3
Outbound: port 5432 to DB SG

🗄️

Database SG

Inbound: port 5432 from Web SG + Worker SG
Outbound: none
Reference SGs by ID (not IP ranges)

The key insight: reference security groups by their group ID, not by IP ranges. Since task IPs change on every restart, IP-based rules would constantly break. SG-to-SG references are stable regardless of task IP churn.
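To make the SG-to-SG idea concrete, the sketch below builds inbound rules that name a source security group instead of a CIDR range. The dictionary shape follows the IpPermissions structure accepted by the EC2 AuthorizeSecurityGroupIngress API; the group IDs are placeholders and the helper name is illustrative.

```python
def ingress_from_group(port, source_group_id):
    """One inbound TCP rule that trusts a security group instead of an IP range."""
    return {
        "IpProtocol": "tcp",
        "FromPort": port,
        "ToPort": port,
        "UserIdGroupPairs": [{"GroupId": source_group_id}],
    }


# Placeholder security group IDs for the three tiers plus the ALB.
ALB_SG = "sg-0aaaaaaaaaaaaaaa1"
WEB_SG = "sg-0bbbbbbbbbbbbbbb2"
WORKER_SG = "sg-0ccccccccccccccc3"

# Web tier: port 8080 from the ALB only.
web_ingress = [ingress_from_group(8080, ALB_SG)]

# Database: port 5432 from the web and worker tiers, no IP ranges anywhere.
database_ingress = [
    ingress_from_group(5432, WEB_SG),
    ingress_from_group(5432, WORKER_SG),
]

for rule in database_ingress:
    print(rule)
```

With boto3, a list like database_ingress could be passed as the IpPermissions argument of authorize_security_group_ingress. Because no rule mentions an IP address, task restarts never invalidate it.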

Load Balancer Integration Core

ECS integrates with ALB (Application Load Balancer) and NLB (Network Load Balancer) through target groups. When you create a service with a load balancer, ECS automatically registers each task's IP:port as a target. When a task is replaced, the old target is deregistered and the new one registered — seamlessly.

⚖️

ALB (Layer 7)

Path-based routing: /api/* → API service, /web/* → frontend
Host-based routing: api.example.com → one service
Health checks: HTTP GET /health → 200 OK
Sticky sessions: route same user to same task
Target type: ip (required for awsvpc + Fargate)
Best for: HTTP/HTTPS workloads, microservices

🔌

NLB (Layer 4)

TCP/UDP pass-through: no HTTP awareness
Ultra-low latency: millions of requests/sec
Static IP: one IP per AZ (great for whitelisting)
TLS termination or pass-through
Target type: ip (for awsvpc mode)
Best for: gRPC, WebSocket, non-HTTP protocols

👉 Target type must be ip for Fargate. The ALB target group must use target type: ip (not instance). With awsvpc mode, ECS registers the task's ENI IP directly. If the target group was created with type instance, ECS rejects the service because that target type is incompatible with awsvpc. Target type cannot be changed after creation, so you have to create a new target group.
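Since the target type is fixed when the target group is created, it helps to set it explicitly. The sketch below assembles the arguments for an Elastic Load Balancing v2 CreateTargetGroup call; the parameter names are the API's own, while the group name, VPC ID, and helper name are placeholders.

```python
def ecs_target_group_params(name, vpc_id, container_port, health_path="/health"):
    """Arguments for a target group that awsvpc/Fargate tasks can register with."""
    return {
        "Name": name,
        "Protocol": "HTTP",
        "Port": container_port,
        "VpcId": vpc_id,
        "TargetType": "ip",  # required for awsvpc mode; "instance" will not work
        "HealthCheckPath": health_path,
    }


# Placeholder name and VPC ID.
params = ecs_target_group_params("web-api-tg", "vpc-0123456789abcdef0", 8080)
print(params)
```

With boto3 these could be passed as elbv2.create_target_group(**params).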

Service Discovery (AWS Cloud Map) In-Depth

For service-to-service communication without a load balancer, ECS integrates with AWS Cloud Map to provide DNS-based service discovery. When a task starts, it registers a DNS record (e.g., web-api.production.local). Other services resolve this name to get the task's current IP address. When the task stops, the record is removed.

🗺️

Cloud Map (Service Discovery)

DNS A records pointing to task IPs
Private DNS namespace (e.g., production.local)
Auto-register on task start, deregister on stop
Health checks to remove unhealthy instances
Works with both EC2 and Fargate

🔗

Service Connect (newer, simpler)

Built on Cloud Map + Envoy proxy sidecar
Service-to-service via logical names (not IPs)
Automatic retries, timeouts, circuit breaking
Traffic metrics out of the box
Recommended over raw Cloud Map for new services

Service Connect is the newer recommended approach. It injects an Envoy sidecar proxy into your tasks automatically. Your code calls http://web-api:8080, and the proxy handles discovery, load balancing, retries, and telemetry. Think of it as a lightweight service mesh managed by ECS — no Kubernetes or App Mesh complexity.

👉 Service-to-service communication: Cloud Map eliminates hardcoded IPs entirely. Example: your orders service calls http://payments.production.local:8080/charge — DNS resolves to the current task IP. No load balancer needed for internal calls. For exam: if the question says "internal service communication without ALB" → Service Discovery via Cloud Map.

Concept Diagram — awsvpc Network Mode Introductory

awsvpc Mode — Each Task Gets Its Own ENI and Private IP

Storage Options Core

ECS containers need storage for application data, temp files, shared state, and logs. The options depend heavily on your launch type:

Storage Type	Persistence	EC2	Fargate	Shared Across Tasks	Best For
Ephemeral (container layer)	Deleted on task stop	✅	✅ (20-200GB)	❌	Temp files, caches, scratch space
EBS Volume	Persists beyond task lifecycle	✅	❌	❌ (one AZ only)	Database data, stateful single-task workloads
EFS (Elastic File System)	Persistent, durable	✅	✅	✅ (multi-AZ, multi-task)	Shared config, ML models, CMS uploads
Instance Store (NVMe)	Lost on instance stop/terminate	✅	❌	❌	High-IOPS scratch (ML training, video encode)
Docker Volumes	Depends on driver	✅	❌	Between containers in same task	Sidecar data sharing within a task

EFS — Shared Storage for ECS In-Depth

Amazon EFS is the most important storage integration for ECS because it works with both EC2 and Fargate and supports concurrent access from multiple tasks across multiple AZs. Mount an EFS file system in your task definition, and every task gets read/write access to the same files — no matter which AZ it runs in.

📂

When to Use EFS

Shared configuration files across multiple tasks
ML model files (load once, serve from many tasks)
CMS file uploads (WordPress media, user uploads)
Log aggregation (multiple writers, one reader)
Any workload needing shared persistent storage on Fargate

⚠️

Gotchas

Latency: EFS is network-attached — higher latency than local SSD
Throughput: scales with data stored (or use provisioned throughput)
Cost: $0.30/GB-month (standard). Use Infrequent Access for cold data
Security: must configure SG to allow NFS (port 2049) from task SG
IAM auth: use EFS access points for per-task directory isolation

👉 Fargate ephemeral storage — each Fargate task gets 20GB of ephemeral storage by default (stored on the micro-VM's local disk). You can configure up to 200GB in the task definition. This data is fast (local NVMe) but deleted when the task stops. Use it for temp files, build artifacts, or caching — not for anything you need to persist.

AWS Diagram — ECS Service with ALB + Service Discovery Core

ECS Networking — ALB for External Traffic + Cloud Map for Internal

Architecture Diagram — Web Tier + API Tier + Shared EFS In-Depth

Multi-Tier Architecture with Shared EFS Storage

This pattern is common for file processing: API tasks accept uploads and write to EFS, processor tasks read from EFS and generate thumbnails or transcodes. EFS is shared across all tasks and all AZs — no need to copy files between tasks or use S3 as an intermediary for simple file sharing.

🎓 Exam Tips — Chapter 04

awsvpc = required for Fargate. If using Fargate, awsvpc is the only networking mode. If the exam says "bridge mode" + Fargate — that's impossible.
ALB target type must be ip for Fargate. Not instance. This is a common configuration error tested in exams.
"Need shared storage across Fargate tasks" → EFS. It's the only persistent shared storage that works with Fargate.
"Need persistent block storage" → EBS, which means EC2 launch type only. Fargate ephemeral storage is deleted on stop.
"Tasks can't communicate with each other" → Check security groups. In awsvpc mode, each task has its own SG. The SG must allow the needed ports.
ENI limits on EC2: each awsvpc task uses one ENI. Small instances (t3.micro) may only support 1 task. Enable ENI trunking for more.
Service Discovery vs ALB: Use ALB for external-facing traffic. Use Cloud Map/Service Connect for internal service-to-service calls.
Fargate ephemeral storage: 20GB default, configurable up to 200GB. Fast (local NVMe) but non-persistent.
EFS security: task SG must allow outbound to port 2049 (NFS). EFS SG must allow inbound port 2049 from task SG.

📋 Chapter 4 — Summary

awsvpc: production standard. Each task gets own ENI + private IP + security group. Required for Fargate.
Security groups: applied per task (not per instance). Reference by SG ID, not IP ranges.
ALB: target type must be ip for Fargate/awsvpc. Auto-registers task IPs in target group.
Service Discovery: Cloud Map provides DNS records per task. Service Connect adds Envoy proxy for retries/metrics.
EFS: shared persistent storage across tasks and AZs. Works with both EC2 and Fargate.
EBS: persistent block storage, EC2 only. Single-AZ. For stateful single-instance workloads.
Fargate ephemeral: 20-200GB, fast NVMe, deleted on task stop. Great for temp/scratch data.
ENI trunking: opt-in to run more awsvpc tasks per EC2 instance by sharing trunk ENI.

Chapter Five

Capacity Providers

What Is a Capacity Provider Introductory

A capacity provider is the bridge between your ECS tasks and the infrastructure they run on. It answers a simple question: "When ECS needs to launch a new task, where does the compute come from?" Without capacity providers you must manually ensure enough EC2 instances exist. With them, ECS automatically provisions capacity — either by scaling an Auto Scaling Group (EC2) or by simply requesting Fargate resources from AWS.

☁️

FARGATE

Built-in. AWS provisions compute per task. No configuration needed — always available by default.

💰

FARGATE_SPOT

Built-in. Same as Fargate but uses spare capacity at up to 70% discount. Tasks can be interrupted with 2-minute warning.

🖥️

ASG Capacity Provider

Links an Auto Scaling Group to ECS. When tasks need capacity, ECS tells the ASG to scale out. You manage the instance fleet.

Fargate + Fargate Spot Core

The FARGATE and FARGATE_SPOT capacity providers are built into ECS: you don't create them, you only associate them with a cluster (the console typically does this for you on new clusters). On-demand FARGATE is what a task uses when you launch it with the Fargate launch type and no capacity provider strategy.

✅

Fargate (On-Demand)

Always available — AWS guarantees capacity
No interruptions — task runs until it exits or you stop it
Full per-second billing for vCPU + memory
Use for: production web services, customer-facing APIs

💰

Fargate Spot

Up to 70% cheaper than on-demand Fargate
Uses spare AWS capacity — can be reclaimed anytime
2-minute warning (EventBridge event + SIGTERM) before the task is stopped
ECS service launches replacements per its capacity provider strategy (no automatic fallback to on-demand)
Use for: batch jobs, queue workers, data processing

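👉 Fargate Spot interruption handling: When AWS reclaims your Spot task, you get a 2-minute warning as a task state change event and a SIGTERM to the container. SIGKILL follows once the container's stopTimeout expires (30 seconds by default, 120 seconds maximum on Fargate), so set stopTimeout to 120 if you need the full window. Your app should handle SIGTERM gracefully (finish current work, checkpoint state). The ECS service launches a replacement according to its capacity provider strategy; it does not substitute on-demand Fargate for Spot, so keep a base of on-demand tasks if you need a guaranteed floor.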

🐍

Python — SIGTERM Handler

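import signal
import sys

def graceful_shutdown(signum, frame):
    # Runs when ECS sends SIGTERM (task stop or Spot interruption).
    print("SIGTERM received, finishing work...")
    # flush queues, save checkpoint, close DB
    sys.exit(0)

# Register the handler once at startup, before the work loop begins.
signal.signal(signal.SIGTERM, graceful_shutdown)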

🟢

Node.js — SIGTERM Handler

process.on('SIGTERM', async () => {
  console.log('SIGTERM received — draining...');
  server.close(); // stop accepting new requests
  await flushQueues();
  await db.close();
  process.exit(0);
});
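The handlers above exit from inside the signal handler. A common alternative, sketched here, is to let the handler only set a flag and have the work loop stop at a safe point, so an item is never abandoned halfway. The names run_worker and request_shutdown are illustrative, and the demo raises SIGTERM against itself to simulate an interruption.

```python
import signal
import threading

shutdown_requested = threading.Event()


def request_shutdown(signum, frame):
    """Signal handler: only set a flag; the work loop does the cleanup."""
    shutdown_requested.set()


def run_worker(process_item, items):
    """Process items until the list is empty or SIGTERM arrives.

    Returns the items that were not processed so they can be checkpointed.
    """
    pending = list(items)
    while pending and not shutdown_requested.is_set():
        process_item(pending.pop(0))
    return pending


signal.signal(signal.SIGTERM, request_shutdown)

processed = []


def handle(item):
    processed.append(item)
    if item == 2:
        # Simulate a Spot interruption arriving while item 2 is being handled.
        signal.raise_signal(signal.SIGTERM)


leftover = run_worker(handle, [1, 2, 3, 4])
print("processed:", processed, "left to checkpoint:", leftover)
```

The in-flight item finishes, the loop stops before starting the next one, and the leftover list is what a real worker would return to its queue or write to a checkpoint within the stopTimeout window.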

ASG Capacity Provider (EC2 Launch Type) In-Depth

For the EC2 launch type, capacity providers link your ECS cluster to an Auto Scaling Group. This enables Cluster Auto Scaling (CAS) — when ECS needs to place tasks but no instance has enough room, CAS triggers the ASG to launch new instances. When instances are underutilized, CAS scales them in. This eliminates manual capacity planning.

⚙️

How CAS Works

1. ECS receives a task placement request
2. No instance has enough CPU/memory available
3. CAS calculates how many instances are needed
4. CAS sets the ASG's desired count → ASG launches instances
5. New instances register with ECS → tasks placed
Scale-in: ECS drains tasks first → then terminates instance

📊

Configuration

Target capacity %: how full instances should be
100% = pack instances fully (maximize density)
80% = leave 20% headroom for burst (faster placement)
Managed scaling: on/off toggle for CAS
Managed termination protection: prevents ASG from terminating instances that still have running tasks

The target capacity % is the most important CAS parameter. Set it to 100% for maximum cost efficiency (every instance fully packed, but new tasks wait for scale-out). Set it to 70-80% for responsiveness (headroom means tasks place instantly, but you pay for idle capacity).
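The trade-off can be put in numbers. As a simplified model (an assumption, not the exact CAS algorithm), treat target capacity as the ratio of instances needed to instances running, so the fleet size CAS aims for is the needed count divided by the target percentage, rounded up.

```python
def instances_for_target_capacity(instances_needed, target_capacity_pct):
    """Fleet size at which needed/running equals the target capacity percentage."""
    if not 1 <= target_capacity_pct <= 100:
        raise ValueError("target capacity must be between 1 and 100")
    # Ceiling division, kept in integers to avoid float rounding surprises.
    return -(-instances_needed * 100 // target_capacity_pct)


for target in (100, 80, 70):
    fleet = instances_for_target_capacity(8, target)
    print(f"target {target}%: {fleet} instances, {fleet - 8} kept as headroom")
```

For a workload that needs 8 instances, 100% keeps exactly 8, 80% keeps 10, and 70% keeps 12. The extra instances are the price of faster task placement.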

Capacity Provider Strategy In-Depth

A capacity provider strategy defines how tasks are distributed across multiple capacity providers. You assign weights and an optional base count per provider. This is how you build hybrid workloads — for example, "run 2 tasks on on-demand Fargate as baseline, then spread additional tasks 80% to Fargate Spot and 20% to on-demand."

Strategy Example	Provider	Base	Weight	Behavior
Cost-optimized batch	FARGATE_SPOT	0	4	80% of tasks → Spot (cheap)
Cost-optimized batch	FARGATE	0	1	20% on-demand (fallback)
HA web service	FARGATE	2	1	Always 2 on-demand tasks (base)
HA web service	FARGATE_SPOT	0	3	Extra tasks 75% Spot (save $)
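Mixed EC2 fleet	ASG (EC2 On-Demand/RI)	0	3	75% on Reserved EC2 instances
Mixed EC2 fleet	ASG (EC2 Spot)	0	1	25% on EC2 Spot (cheap overflow)

The base count guarantees a minimum number of tasks on that provider (placed first, before weights apply). After base is filled, additional tasks distribute according to the weight ratio. This gives you predictable baseline capacity with elastic overflow. A single strategy can contain Fargate providers or ASG providers, not both: a cluster can hold both kinds, but overflowing from EC2 to Fargate takes a separate service with its own strategy.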
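The base-then-weight rule can be simulated in a few lines. This is a simplified model of the distribution, not the ECS scheduler itself: it fills base counts first, then hands out the remaining tasks one at a time to whichever provider is furthest below its weighted share. The function name and dictionary layout are illustrative.

```python
from fractions import Fraction


def distribute_tasks(task_count, strategy):
    """Split task_count across providers: base first, then by weight ratio."""
    counts = {item["provider"]: 0 for item in strategy}
    remaining = task_count

    # Step 1: fill each provider's base before any weights apply.
    for item in strategy:
        placed = min(item["base"], remaining)
        counts[item["provider"]] += placed
        remaining -= placed

    # Step 2: give each remaining task to the provider furthest below its
    # weighted share (ties go to the provider with the higher weight).
    weighted = [item for item in strategy if item["weight"] > 0]
    if remaining and not weighted:
        raise ValueError("no provider with a positive weight for the remaining tasks")
    above_base = {item["provider"]: 0 for item in weighted}
    for _ in range(remaining):
        pick = min(
            weighted,
            key=lambda item: (
                Fraction(above_base[item["provider"]], item["weight"]),
                -item["weight"],
            ),
        )
        above_base[pick["provider"]] += 1
        counts[pick["provider"]] += 1
    return counts


ha_web = [
    {"provider": "FARGATE", "base": 2, "weight": 1},
    {"provider": "FARGATE_SPOT", "base": 0, "weight": 3},
]
print(distribute_tasks(10, ha_web))   # {'FARGATE': 4, 'FARGATE_SPOT': 6}
```

With 10 tasks, the HA web strategy puts 2 base tasks on FARGATE and splits the other 8 in a 1:3 ratio, which gives 4 on-demand and 6 Spot.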

Concept Diagram — Capacity Provider as Bridge Introductory

Capacity Providers — Bridge Between ECS Tasks and Infrastructure

AWS Diagram — EC2 Cluster with ASG Capacity Provider Core

Cluster Auto Scaling — ASG Capacity Provider + Fargate Overflow

Architecture Diagram — Spot-Heavy Batch with Fargate Overflow In-Depth

Batch Processing — EC2 Spot + Fargate Spot + On-Demand Fallback

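This pattern maximizes cost savings for batch workloads: EC2 Spot gives the deepest discount (up to 90%), Fargate Spot adds overflow without managing instances, and 2 on-demand Fargate tasks guarantee a minimum processing rate even during Spot capacity shortages. Because one capacity provider strategy cannot mix ASG and Fargate providers, the EC2 Spot tier and the Fargate tier run as separate services (or separate task launches) against the same workload.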

🎓 Exam Tips — Chapter 05

FARGATE and FARGATE_SPOT are built-in — you don't create them. They exist on every cluster.
"Reduce cost for batch processing on ECS" → Fargate Spot (up to 70% off) or EC2 Spot ASG capacity provider (up to 90% off).
"Ensure minimum availability while minimizing cost" → Capacity provider strategy with base on FARGATE (guaranteed) and weight on FARGATE_SPOT (cheap excess).
Cluster Auto Scaling (CAS) — only works with EC2 launch type via ASG capacity provider. Fargate doesn't need CAS because AWS handles capacity.
Target capacity % = 100% means "pack instances fully before scaling out." 80% means "keep headroom for faster placement."
Managed termination protection prevents ASG from terminating instances that still have running ECS tasks. Always enable this.
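Fargate Spot interruption: 2-minute warning with SIGTERM, then SIGKILL after stopTimeout (max 120s). The service replaces the task per its strategy and does not fall back to on-demand automatically. Design for graceful shutdown.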
Distractor: "Fargate Spot is the same as EC2 Spot" — no. Fargate Spot discount is ~70%, EC2 Spot can reach ~90%. EC2 Spot also has diversified instance fleets for better availability.

📋 Chapter 5 — Summary

Capacity providers: bridge between ECS and compute. Fargate (on-demand) · Fargate Spot (70% off) · ASG (EC2, managed by CAS).
Capacity provider strategy: base (guaranteed) + weight (ratio). Distribute tasks across providers for cost/HA balance.
Cluster Auto Scaling (CAS): auto-scales EC2 instances based on task demand. Target capacity % controls utilization.
Fargate Spot: up to 70% cheaper. 2-minute warning (SIGTERM) before termination. Service replaces interrupted tasks when capacity is available.
EC2 Spot via ASG: up to 90% cheaper. CAS manages the ASG. Managed termination protection keeps instances with running tasks from being scaled in.
Hybrid pattern: baseline on-demand (guaranteed) + Spot overflow (cheap). Best cost-to-availability ratio for batch.

Chapter Six

Scaling & Deployment

Service Auto Scaling Core

ECS Service Auto Scaling adjusts the desired task count automatically based on CloudWatch metrics. It uses Application Auto Scaling — the same system that scales DynamoDB tables and Aurora replicas. You define a target value (e.g., "keep average CPU at 70%"), and the system adds or removes tasks to maintain it.

🎯

Target Tracking

Set target: "Average CPU = 70%"
System auto-creates CloudWatch alarms
Scales out when above, in when below
Simplest, most common approach
Supported metrics: CPU, Memory, ALB request count

📐

Step Scaling

Define steps: "CPU 70-80% → add 1, 80-90% → add 3, 90%+ → add 5"
More control over scaling aggressiveness
Requires manual CloudWatch alarm setup
Good for: bursty workloads needing fast scale-out

📅

Scheduled Scaling

"Scale to 20 tasks at 9am, back to 5 at 6pm"
Cron-based, predictable patterns
Use with target tracking (scheduled sets min, TT adjusts within range)
Good for: known traffic patterns (business hours, events)

Scaling Metric	Target Suggestion	When to Use
ECSServiceAverageCPUUtilization	60-75%	CPU-bound workloads (computation, encoding)
ECSServiceAverageMemoryUtilization	70-80%	Memory-bound (caching, JVM, data processing)
ALBRequestCountPerTarget	1000 req/target	Request-driven APIs (scale per request volume)
Custom CloudWatch metric	Varies	SQS queue depth, business metric, latency P99

👉 Scale on ALBRequestCountPerTarget for web APIs, not CPU. Web APIs often have low CPU but high request count. If you scale on CPU alone, you'll be under-provisioned — requests queue up and latency spikes before CPU triggers. ALBRequestCountPerTarget scales based on actual request volume, which directly correlates with user experience.
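The arithmetic behind request-based scaling is straightforward. As a rough sketch (the real target tracking algorithm is managed by AWS and also applies cooldowns), the task count that keeps each target at or below the target value is the total request count divided by the per-target goal, rounded up and clamped to the service's min and max. The function and parameter names are illustrative.

```python
def desired_tasks(total_requests, target_per_task, min_tasks, max_tasks):
    """Task count that keeps requests per task at or below the target value."""
    if target_per_task <= 0:
        raise ValueError("target_per_task must be positive")
    needed = -(-total_requests // target_per_task)  # ceiling division
    return max(min_tasks, min(max_tasks, needed))


# Target of 1000 requests per target, service bounded to 2..20 tasks.
print(desired_tasks(7500, 1000, 2, 20))    # 8
print(desired_tasks(500, 1000, 2, 20))     # 2 (floor set by min_tasks)
print(desired_tasks(50000, 1000, 2, 20))   # 20 (capped by max_tasks)
```

The cap in the last case is worth watching: once max_tasks is reached, extra traffic raises requests per target instead of task count.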

Deployment Strategies Core

When you update a service (new image version, config change), ECS must replace old tasks with new ones. How it does this determines whether your users experience downtime, mixed versions, or seamless updates.

🔄

Rolling Update (default)

ECS launches new tasks → waits for health check → drains old tasks
Controlled by minimumHealthyPercent and maximumPercent
minimumHealthyPercent: 100 = never go below desired count (add new before removing old)
maximumPercent: 200 = can double task count temporarily during deploy
No additional cost (uses ECS built-in controller)
Rollback: manual (deploy previous revision)

🔵🟢

Blue/Green (CodeDeploy)

Two target groups: blue (current) and green (new)
CodeDeploy shifts traffic: 100% blue → 100% green
Options: all-at-once, linear (10% every 5min), canary (10% → 100%)
Instant rollback: shift traffic back to blue
Both old and new tasks run simultaneously
Requires ALB with two target groups + CodeDeploy setup

👉 Blue/Green = true zero-downtime: Unlike rolling updates where old+new tasks coexist briefly, Blue/Green keeps the full blue fleet running until green is 100% validated. Rollback is instant — just flip the ALB listener back. For exam: if a question requires "zero-downtime deployment with instant rollback" → Blue/Green with CodeDeploy. If it says "simplest deployment" → Rolling Update (default, no extra setup).

Rolling Update Parameters In-Depth

minimumHealthyPercent	maximumPercent	Behavior	Best For
100%	200%	Launch new tasks first, then drain old. Never below desired count. Temporarily doubles cost.	Production services needing zero-downtime
50%	100%	Stop half the old tasks, then start new ones. Brief capacity reduction.	Cost-sensitive, can tolerate brief capacity dip
0%	100%	Stop ALL old tasks, then start new. Full downtime during deploy.	Dev/staging only. Never production.
100%	150%	Launch 50% new tasks, drain some old, repeat. Moderate overhead.	Balance between speed and cost

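👉 Fargate note: with the CodeDeploy (blue/green) or EXTERNAL deployment controller, minimumHealthyPercent is not used for Fargate tasks, although it is still returned when you describe the service. With rolling updates, a 0% setting stops every old task before new ones start, so keep it for dev/staging. For Fargate zero-downtime deploys, use 100%/200% or Blue/Green with CodeDeploy.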
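The two percentages translate into a floor and a ceiling on running tasks during a deployment. The sketch below computes both for a given desired count, following the ECS convention that the minimum rounds up and the maximum rounds down; the function name is illustrative.

```python
def rolling_update_bounds(desired_count, minimum_healthy_percent, maximum_percent):
    """Return (min_running, max_running) task counts during a rolling update."""
    # minimumHealthyPercent rounds up, maximumPercent rounds down.
    min_running = -(-desired_count * minimum_healthy_percent // 100)
    max_running = desired_count * maximum_percent // 100
    return min_running, max_running


for min_pct, max_pct in [(100, 200), (50, 100), (0, 100), (100, 150)]:
    low, high = rolling_update_bounds(4, min_pct, max_pct)
    print(f"{min_pct}%/{max_pct}% with 4 tasks: between {low} and {high} running")
```

The gap between the two numbers is the batch size ECS has to work with. If both bounds equal the desired count, there is no room to start or stop anything and the deployment cannot progress.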

Deployment Circuit Breaker Core

The deployment circuit breaker automatically detects when a deployment is failing (new tasks keep crashing) and rolls back to the previous stable version. Without it, a bad deployment loops endlessly: ECS launches new task → task crashes → ECS launches another → crashes → repeat forever, burning compute.

🛡️

How It Works

ECS monitors new tasks during deployment
If tasks repeatedly fail to reach RUNNING state...
Circuit breaker triggers: stops launching new tasks
If rollback: true → automatically reverts to last stable
Based on failure threshold (number of consecutive task failures)

⚙️

Configuration

Enable: deploymentCircuitBreaker: {enable: true, rollback: true}
Works with the ECS rolling update controller only (CodeDeploy blue/green uses its own alarm-based rollback)
Always enable for production. Default is disabled.
Failure reasons detected: OOM, crash loop, health check failure

👉 Always enable deployment circuit breaker with rollback in production. Without it, a bad image tag or misconfigured environment variable causes endless task restarts. Your service degrades while ECS keeps trying to deploy the broken version. Circuit breaker + rollback catches this within minutes and reverts automatically.
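For reference, the circuit breaker sits inside the service's deployment configuration next to the rolling update percentages. This sketch builds that structure as a dictionary; the key names follow the ECS deploymentConfiguration format, and the helper name and defaults are illustrative.

```python
import json


def rolling_deployment_config(min_healthy_pct=100, max_pct=200, auto_rollback=True):
    """deploymentConfiguration for a rolling update with the circuit breaker on."""
    return {
        "deploymentCircuitBreaker": {"enable": True, "rollback": auto_rollback},
        "minimumHealthyPercent": min_healthy_pct,
        "maximumPercent": max_pct,
    }


config = rolling_deployment_config()
print(json.dumps(config, indent=2))
```

With boto3, a dictionary like this could be passed as the deploymentConfiguration argument of create_service or update_service.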

Concept Diagram — Rolling Update Stages Introductory

Rolling Update — minimumHealthy=100%, maximumPercent=200%

AWS Diagram — Service Auto Scaling with CloudWatch Core

Service Auto Scaling — Target Tracking on ALB Request Count

Architecture Diagram — Blue/Green Deployment In-Depth

Blue/Green Deployment — CodeDeploy Shifts Traffic Between Target Groups

🎓 Exam Tips — Chapter 06

"Zero downtime deployment" → Rolling update with minimumHealthyPercent=100%, maximumPercent=200%. Or Blue/Green with CodeDeploy.
"Automatically rollback failed deployments" → Deployment circuit breaker with rollback=true. Or CodeDeploy blue/green with automatic rollback alarm.
"Scale based on SQS queue depth" → Custom CloudWatch metric for ApproximateNumberOfMessagesVisible, step scaling policy.
Service Auto Scaling ≠ Cluster Auto Scaling. Service AS changes task count. Cluster AS (CAS) changes EC2 instance count. They work together but are separate.
Target tracking is "set and forget." You specify the target value; AWS creates and manages the CloudWatch alarms. Step scaling requires you to create alarms manually.
Blue/Green requires ALB with two target groups. NLB is supported but less common for blue/green. Cannot do blue/green without a load balancer.
"Gradual traffic shift" → CodeDeploy canary or linear deployment. Not possible with ECS rolling update (which is binary per task).
Cooldown period: default 300s between scaling actions. Too short = oscillation (scale up/down/up/down). Too long = slow reaction.
Distractor: "ECS rolling update supports canary deployment" — false. Canary requires CodeDeploy blue/green with traffic shifting.

📋 Chapter 6 — Summary

Service Auto Scaling: target tracking (CPU, memory, ALB requests, custom) adjusts task count automatically.
Scale on ALBRequestCountPerTarget for APIs, not CPU. Request volume correlates better with user experience.
Rolling update: minHealthy=100%, max=200% → zero downtime. New tasks must pass health check before old ones drain.
Blue/Green (CodeDeploy): two target groups. Traffic shift: all-at-once, linear, or canary. Instant rollback to blue.
Circuit breaker: detects failed deployments, auto-reverts. Always enable in production.
Scheduled scaling: predictable patterns (business hours). Combine with target tracking for best results.

Chapter Seven

Integrations

Amazon ECR — Container Registry Core

Amazon ECR (Elastic Container Registry) is a fully managed Docker container image registry. It stores, manages, and deploys your container images. ECS pulls images from ECR during task launch — this is the standard production pattern. ECR integrates with IAM for access control, encrypts images at rest, and scans for known vulnerabilities.

📦

Key Features

Private repositories: IAM-based access, no public exposure
Image scanning: Basic scanning (free, on push, Clair-based CVE detection). Enhanced scanning (uses Amazon Inspector, continuous, per-image cost)
Lifecycle policies: auto-delete untagged/old images (save cost)
Cross-region replication: replicate images to multiple regions
Image immutability: prevent tag overwrites (tag=v1 always same image)

🔧

ECS + ECR Flow

1. Build: docker build -t my-api:v2 .
2. Tag: docker tag my-api:v2 123456.dkr.ecr.us-east-1.amazonaws.com/my-api:v2
3. Auth: aws ecr get-login-password | docker login...
4. Push: docker push 123456.dkr.ecr.../my-api:v2
5. ECS task definition references the ECR image URI
6. ECS Execution Role must have ecr:GetAuthorizationToken, ecr:BatchCheckLayerAvailability, ecr:GetDownloadUrlForLayer, and ecr:BatchGetImage
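The image URI used in the tag and push steps follows a fixed pattern. As a minimal sketch (the helper name is hypothetical, and the account ID below is a 12-digit placeholder rather than the shortened one in the steps above):

```python
import re


def ecr_image_uri(account_id, region, repository, tag):
    """Build the private ECR image URI that a task definition references."""
    if not re.fullmatch(r"\d{12}", account_id):
        raise ValueError("AWS account IDs are exactly 12 digits")
    return f"{account_id}.dkr.ecr.{region}.amazonaws.com/{repository}:{tag}"


uri = ecr_image_uri("123456789012", "us-east-1", "my-api", "v2")
```

The same string goes into docker tag, docker push, and the image field of the container definition.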

ECR Image Pull Flow Core

When ECS launches a task, the image pull follows a precise sequence. Understanding this flow helps debug "CannotPullContainerError" — the most common task startup failure:

Image Pull Flow — What Happens at Task Launch

👉 Most common fix: If tasks fail with CannotPullContainerError: (1) verify the Execution Role has ecr:GetAuthorizationToken + ecr:BatchGetImage + ecr:GetDownloadUrlForLayer, (2) ensure the task's subnet has a route to ECR, either through a NAT Gateway or through VPC endpoints for ECR (com.amazonaws.region.ecr.dkr + com.amazonaws.region.ecr.api) plus an S3 gateway endpoint, because image layers are downloaded from S3.
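For the networking half of that fix, the endpoint service names depend only on the region. The sketch below lists them for a private subnet without a NAT Gateway; the function name is hypothetical. It includes an S3 gateway endpoint because ECR serves image layers from S3, and optionally a CloudWatch Logs endpoint for tasks that use the awslogs driver.

```python
def ecr_pull_endpoints(region, with_logs=False):
    """Map VPC endpoint service names to endpoint types for private image pulls."""
    endpoints = {
        f"com.amazonaws.{region}.ecr.api": "Interface",  # ECR API calls
        f"com.amazonaws.{region}.ecr.dkr": "Interface",  # Docker registry traffic
        f"com.amazonaws.{region}.s3": "Gateway",         # image layer downloads
    }
    if with_logs:
        # Needed only when containers ship logs via awslogs without a NAT route.
        endpoints[f"com.amazonaws.{region}.logs"] = "Interface"
    return endpoints


endpoints = ecr_pull_endpoints("us-east-1", with_logs=True)
```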

Load Balancer Integration Core

ECS services integrate with ALB and NLB via target groups. When a task starts, ECS automatically registers it with the target group. When a task is being stopped, ECS deregisters it first, and the load balancer drains in-flight connections (the deregistration delay) before the task shuts down. For Fargate (awsvpc mode), the target type must be ip (not instance).

Feature	ALB (Application)	NLB (Network)
Layer	Layer 7 (HTTP/HTTPS)	Layer 4 (TCP/UDP/TLS)
Routing	Path-based, host-based, header-based	Port-based only
Health checks	HTTP GET /health (path + status code)	TCP connect or HTTP
WebSocket	✅ Native support	✅ TCP passthrough
Static IP	❌ DNS only (changes)	✅ Elastic IP per AZ
Sticky sessions	✅ Cookie-based	✅ Source-IP based (no cookies)
Best for ECS	REST APIs, web apps, microservices	gRPC, real-time, extreme throughput

👉 ALB path-based routing is the standard pattern for ECS microservices. One ALB, multiple listener rules: /api/users/* → user-service target group, /api/orders/* → order-service target group. Each service registers its own target group. This avoids one-LB-per-service cost while keeping services independently deployable.
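To illustrate how listener rules pick a target group, here is a toy simulation in Python. It is not how the ALB is implemented; it only mimics priority-ordered pattern matching with fnmatch, whose wildcard syntax is a superset of ALB path patterns. The rule priorities and target group names are hypothetical.

```python
from fnmatch import fnmatchcase

# (priority, path pattern, target group); lower priority numbers are evaluated first.
RULES = [
    (10, "/api/users/*", "user-service-tg"),
    (20, "/api/orders/*", "order-service-tg"),
]


def route(path, rules=RULES, default="default-tg"):
    """Return the target group of the first matching rule, or the default action."""
    for _priority, pattern, target_group in sorted(rules):
        if fnmatchcase(path, pattern):
            return target_group
    return default
```

Note that /api/users without a trailing segment does not match /api/users/* and falls through to the default action, which is a common source of surprise 404s.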

IAM Roles — Task Role vs Execution Role Core

ECS tasks use two different IAM roles. Confusing them is one of the most common ECS mistakes and a frequent exam question.

🔐

Task Execution Role

Who uses it: ECS agent (not your application)
Purpose: pull images, push logs, read secrets
Permissions needed:
- ecr:GetAuthorizationToken
- ecr:BatchCheckLayerAvailability
- ecr:GetDownloadUrlForLayer
- ecr:BatchGetImage
- logs:CreateLogStream
- logs:PutLogEvents
- ssm:GetParameters (if injecting from Parameter Store)
- secretsmanager:GetSecretValue (if injecting secrets)
AWS provides managed policy: AmazonECSTaskExecutionRolePolicy

🗝️

Task Role

Who uses it: your application code (inside the container)
Purpose: access AWS services from your app
Examples:
- s3:PutObject (upload files)
- dynamodb:PutItem (write data)
- sqs:SendMessage (queue messages)
- sns:Publish (send notifications)
Follow least privilege — only what your app actually needs
Credentials come from the ECS container credentials endpoint (169.254.170.2), not EC2 instance metadata (SDK auto-discovers)
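The split between the two roles can be sketched as two separate policy documents plus the two task definition fields that reference them. Role names, the bucket name, and the account ID are hypothetical placeholders.

```python
def allow(actions, resource="*"):
    """Build a single-statement IAM policy document."""
    return {
        "Version": "2012-10-17",
        "Statement": [{"Effect": "Allow", "Action": sorted(actions), "Resource": resource}],
    }


# Used by the ECS agent: image pull and log delivery.
EXECUTION_ROLE_POLICY = allow([
    "ecr:GetAuthorizationToken",
    "ecr:BatchCheckLayerAvailability",
    "ecr:GetDownloadUrlForLayer",
    "ecr:BatchGetImage",
    "logs:CreateLogStream",
    "logs:PutLogEvents",
])

# Used by application code: only what the app itself calls.
TASK_ROLE_POLICY = allow(["s3:PutObject"], "arn:aws:s3:::my-upload-bucket/*")

# The task definition points at one role of each kind.
TASK_DEFINITION_ROLES = {
    "executionRoleArn": "arn:aws:iam::123456789012:role/ecsTaskExecutionRole",
    "taskRoleArn": "arn:aws:iam::123456789012:role/my-api-task-role",
}
```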

Secrets Manager & Parameter Store Core

Never hardcode secrets (database passwords, API keys) in your Docker image or task definition environment variables. Instead, reference them from AWS Secrets Manager or SSM Parameter Store. ECS injects the secret value at task launch time — your container sees the value as a regular environment variable, but the actual secret never appears in the task definition.

🔒

Secrets Manager

Designed specifically for secrets (credentials, tokens, keys)
Automatic rotation (Lambda-based, $0.40/secret/month)
Reference in task def: "valueFrom": "arn:aws:secretsmanager:..."
Execution Role needs secretsmanager:GetSecretValue

📝

SSM Parameter Store

Config values + secrets (Standard tier free, up to 10K params)
SecureString type encrypts with KMS
Reference: "valueFrom": "arn:aws:ssm:...:parameter/db_host"
Execution Role needs ssm:GetParameters (plus kms:Decrypt for SecureString values encrypted with a customer managed key)
Free for standard params (cheaper than Secrets Manager)
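In a container definition, injected secrets live in a secrets array next to the plain environment array. A minimal sketch follows; the ARNs, image, and variable names are placeholders, and the helper is hypothetical.

```python
def secret_env(name, value_from):
    """Return one entry for the container definition's secrets array."""
    if not value_from.startswith(("arn:aws:secretsmanager:", "arn:aws:ssm:")):
        raise ValueError("valueFrom should be a Secrets Manager or SSM parameter ARN")
    return {"name": name, "valueFrom": value_from}


container_definition = {
    "name": "my-api",
    "image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/my-api:v2",
    # Non-sensitive settings stay as plain environment variables.
    "environment": [{"name": "LOG_LEVEL", "value": "info"}],
    # Sensitive values are resolved by ECS at task launch.
    "secrets": [
        secret_env("DB_PASSWORD",
                   "arn:aws:secretsmanager:us-east-1:123456789012:secret:prod/db-password-AbCdEf"),
        secret_env("DB_HOST",
                   "arn:aws:ssm:us-east-1:123456789012:parameter/db_host"),
    ],
}
```

Only the reference appears in the task definition; the value itself is fetched with the Execution Role when the task starts.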

CloudWatch Logs & Container Insights Core

ECS containers send logs to CloudWatch via the awslogs log driver. Each container gets its own log stream within a log group. Container Insights adds CPU, memory, network, and storage metrics at the cluster, service, and task level, which is critical for troubleshooting and capacity planning.

📋

awslogs Driver

Configured in task definition per container
Options: awslogs-group, awslogs-region, awslogs-stream-prefix
Log stream name: prefix/container-name/task-id
Execution Role needs logs:CreateLogStream, logs:PutLogEvents
Set log group retention (default: never expires → cost grows forever)
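The awslogs options above map directly onto the logConfiguration block of a container definition. A small sketch, with hypothetical helper names and placeholder values:

```python
def awslogs_config(group, region, stream_prefix):
    """Build the logConfiguration block for one container."""
    return {
        "logDriver": "awslogs",
        "options": {
            "awslogs-group": group,
            "awslogs-region": region,
            "awslogs-stream-prefix": stream_prefix,
        },
    }


def log_stream_name(stream_prefix, container_name, task_id):
    """Predict the log stream name for a given container and task."""
    return f"{stream_prefix}/{container_name}/{task_id}"


log_configuration = awslogs_config("/ecs/my-service", "us-east-1", "ecs")
stream = log_stream_name("ecs", "my-api", "0123456789abcdef0123456789abcdef")
```

Knowing the stream name pattern makes it easy to jump from a task ID in describe-tasks output to the matching log stream.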

📊

Container Insights

Enable per cluster: containerInsights: enabled
Metrics: CPU/memory utilization per task, per service, per cluster
Network: bytes in/out, packet errors
Storage: ephemeral storage utilization (Fargate)
Costs ~$0.30/task/month (CloudWatch custom metrics pricing)

AWS X-Ray — Distributed Tracing In-Depth

X-Ray traces requests across your microservices — showing where time is spent, which service is slow, and where errors occur. For ECS, you run the X-Ray daemon as a sidecar container in the same task. Your application sends trace data to the daemon (localhost:2000/udp), and the daemon forwards it to the X-Ray service.

🔍

Setup Steps

1. Add X-Ray daemon container to task definition (sidecar)
2. Image: amazon/aws-xray-daemon
3. Port: 2000/UDP
4. Task Role needs: xray:PutTraceSegments, xray:PutTelemetryRecords
5. Your app uses X-Ray SDK (or OpenTelemetry) to instrument requests
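The first three setup steps amount to one extra entry in containerDefinitions. The sketch below builds that entry; the cpu and memoryReservation values and the non-essential flag are assumptions chosen for illustration, and the application container is a placeholder.

```python
def xray_sidecar(cpu=32, memory_reservation=256):
    """Build a container definition for the X-Ray daemon sidecar."""
    return {
        "name": "xray-daemon",
        "image": "amazon/aws-xray-daemon",
        "cpu": cpu,
        "memoryReservation": memory_reservation,
        # Assumption: losing the tracer should not stop the whole task.
        "essential": False,
        "portMappings": [{"containerPort": 2000, "protocol": "udp"}],
    }


container_definitions = [
    {"name": "my-api", "image": "my-api:v2", "essential": True},  # placeholder app
    xray_sidecar(),
]
```

Because both containers share the task's network namespace in awsvpc mode, the application reaches the daemon on localhost port 2000.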

📈

What You Get

Service map: visual graph of all services and their connections
Latency breakdown: where each millisecond was spent
Error rates per service
Trace filtering by URL, status code, duration
Integration with CloudWatch ServiceLens for unified view

👉 Complete observability stack: CloudWatch Logs (what happened: container stdout/stderr), Container Insights (how it's performing: CPU/memory metrics), X-Ray (where time is spent: distributed traces). For exam: "how to view container logs" → awslogs driver + CloudWatch. "How to find slow microservice" → X-Ray. "How to see CPU and memory per task" → Container Insights metrics. (CPU-based auto scaling itself uses the service-level ECSServiceAverageCPUUtilization metric and works without Container Insights.)

ECS Integration Ecosystem — Build → Deploy → Serve → Observe

AWS Diagram — Secure Microservice with All Integrations Core

Complete ECS Workload — ECR + ALB + Secrets + X-Ray + CloudWatch

Architecture Diagram — ALB Path-Based Routing to Multiple Services In-Depth

Microservice Routing — One ALB, Multiple ECS Services via Path Rules

🎓 Exam Tips — Chapter 07

Task Execution Role ≠ Task Role. Execution Role = ECS agent (pull images, push logs, read secrets). Task Role = your application code (DynamoDB, S3, SQS).
"Container cannot pull image from ECR" → check Execution Role has ecr:GetAuthorizationToken + ecr:BatchGetImage.
"Application needs to write to S3" → add S3 permissions to the Task Role, not the Execution Role.
"Inject database password securely" → Secrets Manager or SSM SecureString referenced in task definition. Execution Role needs read permission.
ALB target type must be ip for Fargate (awsvpc mode). instance type only works with EC2 launch type bridge/host networking.
X-Ray for ECS: run the daemon as a sidecar, not as a standalone service. It listens on port 2000/UDP. Task Role needs xray:PutTraceSegments.
ECR lifecycle policies auto-delete old/untagged images, which prevents storage cost creep. A typical rule keeps only the last 10 tagged images.
"Logs not appearing in CloudWatch" → check Execution Role has logs:CreateLogStream + logs:PutLogEvents, and check log group exists.
Distractor: "Task Role is needed to pull images from ECR" — false. Image pull uses the Execution Role.

📋 Chapter 7 — Summary

ECR: managed Docker registry. Build → tag → push → ECS pulls. Enable scanning + lifecycle policies.
ALB: path-based routing to multiple ECS services via target groups (ip type for Fargate).
Execution Role vs Task Role: Execution = infrastructure (ECR, logs, secrets). Task = application (S3, DynamoDB, SQS).
Secrets: inject from Secrets Manager or SSM Parameter Store at task launch. Never hardcode in images.
CloudWatch: awslogs driver for logs. Container Insights for metrics. Set log retention to avoid cost creep.
X-Ray: sidecar daemon for distributed tracing. Task Role needs xray permissions.

Chapter Eight

Architecture Patterns

When to Use Which Pattern Introductory

ECS is flexible enough to support many application styles — from long-running web services to one-shot batch jobs. The key is matching the right ECS features (service vs standalone task, Fargate vs EC2, Spot vs On-Demand) to each workload's requirements.

Pattern	ECS Feature	Launch Type	Scaling Trigger	Example
Microservices	Service + ALB + Service Discovery	Fargate	ALBRequestCountPerTarget	E-commerce (user, order, product services)
API Backend	Service + ALB + Auto Scaling	Fargate	ALBRequestCountPerTarget or CPU	Mobile app backend
Batch Processing	Standalone task (RunTask API)	Fargate Spot	EventBridge schedule or SQS	Nightly reports, video transcoding
Event-Driven	Service + SQS polling	Fargate Spot	SQS queue depth (custom metric)	Order processing, image resizing
Scheduled Tasks	RunTask triggered by EventBridge	Fargate Spot	Cron schedule	DB cleanup, daily sync, report generation
Web App + API	Service + CloudFront + ALB	Fargate	ALBRequestCountPerTarget	SPA frontend + REST API

Pattern 1 — Microservices Platform Core

The most common ECS architecture: multiple independent services, each in its own ECS service with its own task definition, scaling policy, and deployment lifecycle. An ALB routes requests by path to the correct target group. Services discover each other via AWS Cloud Map (Service Discovery) for internal communication.

✅

When to Use

Multiple teams owning different services
Services scale independently (orders spike on sales, users steady)
Independent deployment — deploy user-service without touching order-service
Different tech stacks per service (Node.js, Java, Python in same cluster)

⚙️

ECS Features Used

ALB with path-based routing (one LB, many target groups)
Service Discovery (Cloud Map) for service-to-service calls
Fargate per-service with independent scaling
ECR separate repository per service
Secrets Manager per-service credentials

👉 Use Service Discovery (Cloud Map) for internal calls, ALB for external. Service A calls Service B via DNS: order-service.local:8080 — Cloud Map maintains the DNS records. This avoids routing internal traffic through the ALB (extra hop, extra cost). External traffic still goes ALB → target group → service.

Pattern 2 — Event-Driven Queue Processing Core

A service polls SQS for messages and processes them. When the queue grows, Auto Scaling adds tasks. When the queue empties, it scales back down. This pattern decouples producers from consumers and handles traffic spikes gracefully — the queue absorbs the burst while consumers process at their own pace.

✅

When to Use

Async processing: order placed → process payment, send email
Unpredictable bursts: 10K images uploaded at once → resize queue
Decoupled: producer doesn't wait for consumer to finish
Retry built-in: an unprocessed message reappears after the visibility timeout; after maxReceiveCount failed attempts it moves to the DLQ for investigation

⚙️

ECS Features Used

ECS Service with desired count = min workers
Step Scaling on SQS ApproximateNumberOfMessagesVisible
Fargate Spot for cost savings (interruptible processing is OK)
SQS DLQ for failed messages
Task Role with sqs:ReceiveMessage, sqs:DeleteMessage
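How many workers a given queue depth calls for can be estimated with backlog-per-task arithmetic. This is a sizing sketch, not the algorithm that step scaling itself runs; the function name, the per-task capacity, and the task limits are all assumptions.

```python
import math


def desired_workers(visible_messages, messages_per_task, min_tasks=1, max_tasks=20):
    """Estimate a task count that keeps the backlog per task at or below a target."""
    if messages_per_task <= 0:
        raise ValueError("messages_per_task must be positive")
    needed = math.ceil(visible_messages / messages_per_task)
    # Clamp to the service's configured minimum and maximum capacity.
    return max(min_tasks, min(max_tasks, needed))
```

For example, with each task expected to absorb 100 queued messages, a backlog of 950 suggests 10 tasks, while an empty queue falls back to the minimum.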

Pattern 3 — Batch Processing & Scheduled Tasks Core

One-shot tasks triggered by a schedule (EventBridge cron) or an event. Unlike services, batch tasks run to completion and exit — they are not restarted. Perfect for nightly reports, database migrations, ETL jobs, and data exports.

✅

When to Use

Scheduled jobs: "run nightly at 2am UTC"
Finite workloads: process file, generate report, exit
Cost-sensitive: Fargate Spot for up to 70% savings
No load balancer needed — tasks run independently

⚙️

ECS Features Used

EventBridge rule or Scheduler → ecs:RunTask
Standalone task (not a service — exits when done)
Fargate Spot for cost optimization
EFS for shared data across batch tasks
CloudWatch Logs for output capture
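A standalone batch task on Fargate Spot is launched with a RunTask request. The sketch below only assembles the request parameters as a dict; the function name, subnet and security group IDs, cluster, and task definition are placeholders.

```python
def run_task_request(cluster, task_definition, subnets, security_groups):
    """Assemble RunTask parameters for a one-shot Fargate Spot task."""
    return {
        "cluster": cluster,
        "taskDefinition": task_definition,
        "count": 1,
        # A capacity provider strategy is used instead of launchType;
        # the two cannot be combined in one request.
        "capacityProviderStrategy": [{"capacityProvider": "FARGATE_SPOT", "weight": 1}],
        "networkConfiguration": {
            "awsvpcConfiguration": {
                "subnets": list(subnets),
                "securityGroups": list(security_groups),
                "assignPublicIp": "DISABLED",
            }
        },
    }


request = run_task_request(
    "my-cluster", "nightly-report:3",
    subnets=["subnet-0abc1234"], security_groups=["sg-0abc1234"],
)
```

With boto3 installed and credentials configured, the dict can be passed as keyword arguments: boto3.client("ecs").run_task(**request).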

Pattern 4 — Web App with Static Frontend In-Depth

A modern web application with a static frontend (React/Vue SPA) served from S3 + CloudFront, and an API backend running on ECS behind ALB. CloudFront routes /api/* to ALB origin and everything else to S3. This separates the static delivery (CDN-optimized) from the dynamic API (container-optimized).

🌐

Frontend (Static)

React/Vue/Angular SPA built → uploaded to S3
CloudFront CDN for global low-latency delivery
Origin Access Control: S3 bucket not publicly accessible
Cache-Control headers: immutable assets cached at edge

⚡

Backend (ECS)

REST API on ECS Fargate behind ALB
CloudFront origin: /api/* → ALB
Auto Scaling on request count per target
Private subnets — not directly internet accessible

Concept Diagram — Microservices Communication Introductory

Microservices Communication — External (ALB) vs Internal (Service Discovery)

AWS Diagram — Event-Driven Processing with SQS Core

Event-Driven Architecture — SQS Queue → ECS Service → DLQ for Failures

Architecture Diagram — Production Microservices Platform In-Depth

Production Platform — CloudFront + ALB + ECS Microservices + Event Queue + Database

🎓 Exam Tips — Chapter 08

"Decouple order processing from API" → SQS queue between order-service and order-processor. ECS service polls SQS. Scale on queue depth.
"Run a task on a schedule" → EventBridge Scheduler rule with ecs:RunTask target. NOT an ECS service (services are long-running). Use Fargate Spot for cost.
"Service-to-service communication inside ECS" → AWS Cloud Map (Service Discovery). DNS-based: order-svc.local:8080. No ALB needed for internal traffic.
Service vs Standalone Task: Service = long-running, auto-restarts, load balanced. Task = one-shot, exits when done, no restart.
"Cheapest way to run batch container jobs" → Fargate Spot + EventBridge trigger. If interruptible, Spot saves up to 70%.
SQS + ECS scaling: there is no predefined SQS metric for target tracking. Use step scaling on the queue's ApproximateNumberOfMessagesVisible metric, or target tracking with a customized metric such as backlog per task.
"Static website + API on same domain" → CloudFront + S3 (static) + ALB origin (/api/*). Not served from ECS containers.
Distractor: "Lambda is always cheaper than Fargate for event processing" — false. For sustained high-throughput queues, Fargate Spot costs less than millions of Lambda invocations.

📋 Chapter 8 — Summary

Microservices: ALB path-routing + Service Discovery (Cloud Map). Each service independently deployed and scaled.
Event-driven: SQS → ECS service polling. Scale on queue depth. Fargate Spot for cost. DLQ for failures.
Batch/scheduled: EventBridge → RunTask (standalone, not a service). Fargate Spot. Exits when complete.
Web app: CloudFront → S3 (static), CloudFront → ALB (/api/*) → ECS. Separate static and dynamic delivery.
Internal comms: Cloud Map DNS for service-to-service. Service Connect (built on Cloud Map + Envoy) is the modern alternative — adds retries, timeouts, and circuit breaking automatically. ALB only for external traffic.
Cost pattern: long-running APIs on Fargate On-Demand. Queue workers and batch on Fargate Spot.

Chapter Nine

Troubleshooting & Observability

Stopped Reason Codes Core

When an ECS task stops unexpectedly, ECS records a stopped reason that tells you what went wrong. This is the first place to look when debugging — run aws ecs describe-tasks and check the stoppedReason and containers[].reason fields.

Stopped Reason	What Happened	Fix
EssentialContainerExited	A container marked `essential: true` exited (crashed, exited with non-zero code)	Check container exit code + CloudWatch Logs for stack trace. Fix the application bug.
OutOfMemoryError	Container exceeded its memory limit. Killed by OOM killer.	Increase `memory` in task definition. Check for memory leaks. JVM: set -Xmx to 75% of container memory.
CannotPullContainerError	ECS cannot pull image from ECR or Docker Hub.	Check: (1) Image exists in ECR (2) Execution Role has ecr:BatchGetImage (3) VPC has NAT gateway or ECR VPC endpoint (4) Image tag is correct.
ResourceInitializationError	Task could not attach ENI (awsvpc) or mount volume.	Check: (1) Subnet has available IPs (2) Security group allows traffic (3) EFS mount target exists in task's AZ.
TaskFailedToStart	Task launch failed before any container started.	Usually infrastructure issue: no capacity, ENI limit reached, or secret injection failure. Check Execution Role permissions.
AGENT	ECS agent on EC2 instance is unreachable or unhealthy.	EC2 launch type only. Check instance health, ECS agent logs (`/var/log/ecs/ecs-agent.log`). Restart agent or replace instance.
SERVICE_SCHEDULER_INITIATED	Service deliberately stopped the task (deployment, scale-in, health check failure).	Normal during deployments. If unexpected: check ALB health check config, ensure /health endpoint returns 200.

👉 The debugging command you'll use most: aws ecs describe-tasks --cluster my-cluster --tasks <task-id> → look at stoppedReason + each container's reason and exitCode. Exit code 137 = SIGKILL (usually the OOM killer, sometimes a forced kill after the stop timeout). Exit code 143 = SIGTERM (graceful shutdown request). Exit code 1 = application error. Exit code 0 = normal shutdown.
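Exit codes above 128 follow the shell convention of 128 plus the signal number, which is where 137 comes from. A small decoder, as a sketch (the function name and the short signal table are illustrative, not exhaustive):

```python
# Common Linux signal numbers seen in container exit codes.
SIGNALS = {2: "SIGINT", 9: "SIGKILL", 11: "SIGSEGV", 15: "SIGTERM"}


def explain_exit_code(code):
    """Translate a container exit code into a short human-readable hint."""
    if code == 0:
        return "clean exit"
    if code > 128:
        number = code - 128
        return f"terminated by {SIGNALS.get(number, f'signal {number}')}"
    return "application error"
```

Feeding it the exitCode from describe-tasks gives a quick first classification before opening the logs.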

ECS Exec — Shell Into Running Containers Core

ECS Exec lets you exec into a running container — like docker exec -it but for containers running on Fargate or EC2. It uses AWS Systems Manager Session Manager under the hood. This is essential for debugging running containers that aren't behaving as expected.

🔧

Setup Requirements

1. Enable execute command on service: --enable-execute-command
2. Task Role needs: ssmmessages:CreateControlChannel, ssmmessages:CreateDataChannel, ssmmessages:OpenControlChannel, ssmmessages:OpenDataChannel
3. SSM agent is bundled with Fargate platform version 1.4.0+
4. VPC needs NAT gateway or an ssmmessages VPC endpoint (com.amazonaws.region.ssmmessages)

💻

Usage

Open shell: aws ecs execute-command --cluster my-cluster --task <task-id> --container my-app --interactive --command "/bin/sh"
Check env vars, filesystem, network connectivity
Test DNS resolution: nslookup order-svc.local
Check if secrets injected: [ -n "$DB_PASSWORD" ] && echo set (confirms presence without printing the value)
Audit: ExecuteCommand calls are recorded in CloudTrail; session output can additionally be logged to CloudWatch Logs or S3
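The Task Role permissions and the CLI invocation from the two lists above can be sketched as follows. The helper names are hypothetical; the argument list is meant for subprocess.run on a machine with the AWS CLI and Session Manager plugin installed.

```python
ECS_EXEC_ACTIONS = [
    "ssmmessages:CreateControlChannel",
    "ssmmessages:CreateDataChannel",
    "ssmmessages:OpenControlChannel",
    "ssmmessages:OpenDataChannel",
]


def ecs_exec_statement():
    """IAM statement to add to the Task Role (not the Execution Role)."""
    return {"Effect": "Allow", "Action": list(ECS_EXEC_ACTIONS), "Resource": "*"}


def execute_command_argv(cluster, task_id, container, command="/bin/sh"):
    """Build the argument list for an interactive ECS Exec session."""
    return [
        "aws", "ecs", "execute-command",
        "--cluster", cluster,
        "--task", task_id,
        "--container", container,
        "--interactive",
        "--command", command,
    ]


argv = execute_command_argv("my-cluster", "0123456789abcdef", "my-app")
```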

Observability Stack Core

📋

Logs (CloudWatch)

awslogs driver captures stdout/stderr
Log group: /ecs/my-service
Log stream: prefix/container/task-id
Filter patterns for error detection
Metric filters: count errors → alarm

📊

Metrics (Container Insights)

CpuUtilized / CpuReserved per task
MemoryUtilized / MemoryReserved per task
NetworkRxBytes / NetworkTxBytes
RunningTaskCount per service
EphemeralStorageUtilized (Fargate ephemeral)

🔍

Traces (X-Ray)

End-to-end request traces across services
Latency breakdown per service hop
Error rate visualization
Service map: which service calls which
Sidecar daemon + SDK instrumentation

👉 Health check failures are the #1 cause of "task keeps restarting." The ALB health check calls your /health endpoint. Once it fails the target group's unhealthy threshold (2 consecutive checks by default for an ALB), the ALB marks the target unhealthy and ECS stops the task and starts a new one. The new task hasn't warmed up yet, fails its health checks too, and the service enters a restart loop. Fix: (1) ensure /health is fast (<5s response), (2) set health check grace period (give app time to start before failed checks count), (3) check that security group allows ALB → task traffic.
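Whether a slow-starting app gets caught in that loop comes down to simple timing. The sketch below is a rough model, assuming a 30s check interval and an unhealthy threshold of 2 as defaults; real behavior also depends on check timeouts and registration timing, and both function names are hypothetical.

```python
def seconds_until_unhealthy(interval_s=30, unhealthy_threshold=2):
    """Approximate time for consecutive failed checks to mark a target unhealthy."""
    return interval_s * unhealthy_threshold


def stopped_before_ready(startup_s, grace_period_s, interval_s=30, unhealthy_threshold=2):
    """Return True if the task would likely be replaced before the app finishes starting."""
    deadline = grace_period_s + seconds_until_unhealthy(interval_s, unhealthy_threshold)
    return startup_s > deadline
```

Under these assumptions, an app that needs 90s to start is replaced when there is no grace period, and survives once the grace period is set to 60s.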

Concept Diagram — Troubleshooting Decision Tree Introductory

ECS Troubleshooting — Where to Look Based on Symptom

1. aws ecs describe-tasks → stoppedReason + exitCode
2. CloudWatch Logs → application stack traces
3. aws ecs execute-command → shell into running task
4. ALB target health → check registration + health status
Symptom categories: Infrastructure issues (before app starts) · Application issues (app crashes) · Health check issues (app running but unhealthy)

AWS Diagram — Observability Stack Core

ECS Observability — Logs + Metrics + Traces + Alarms

Architecture Diagram — ECS Exec Debugging Session In-Depth

ECS Exec — Shell into Running Fargate Container via SSM Session Manager

🎓 Exam Tips — Chapter 09

"Task keeps failing to start + CannotPullContainerError" → check: (1) ECR image exists (2) Execution Role permissions (3) NAT gateway in private subnet or ECR VPC endpoint.
"Container killed with exit code 137" → OOM. Container exceeded memory limit. Increase memory in task definition.
"Container exited with exit code 143" → SIGTERM received (graceful shutdown). Normal during service scaling, deployments, or Fargate Spot interruptions. Not an error — means your app received a shutdown signal.
"How to debug a running ECS container" → ECS Exec (aws ecs execute-command). Requires SSM permissions on Task Role + enable-execute-command on service.
"Logs not appearing" → Execution Role missing logs:CreateLogStream or logs:PutLogEvents. Also check log group exists and awslogs driver is configured.
ECS Exec requires SSM permissions on the Task Role (not the Execution Role). This is a common exam distractor.
Container Insights costs extra. It's not free — it generates CloudWatch custom metrics. Budget ~$0.30/task/month.
"Service never reaches steady state" → aws ecs describe-services → check events field for recent messages. Usually: health check failures, insufficient capacity, or image pull errors.
Health check grace period: seconds after a task starts during which ECS ignores failing load balancer health checks. Set to app startup time (e.g., 60s for Java Spring Boot). Default: 0 (failures count immediately).

📋 Chapter 9 — Summary

Stopped reasons: describe-tasks → stoppedReason + exitCode. 137 = SIGKILL (usually OOM). 143 = SIGTERM. 1 = app error. CannotPullContainer = ECR permissions/networking.
ECS Exec: shell into running containers via SSM. Requires enable-execute-command + Task Role ssmmessages permissions.
Observability: CloudWatch Logs (awslogs), Container Insights (metrics), X-Ray (traces), Alarms (auto-scaling + alerts).
Health check loop: most common "task keeps restarting" cause. Fix: grace period, check /health endpoint, verify security group rules.
describe-services events: first place to check when service won't stabilize. Shows recent scheduling failures and reasons.

📚 ECS Cheatsheet Core

💻

Key CLI Commands

aws ecs create-cluster --cluster-name my-cluster
aws ecs register-task-definition --cli-input-json file://task-def.json
aws ecs create-service --cluster my-cluster --service-name my-svc ...
aws ecs update-service --cluster my-cluster --service my-svc --desired-count 5
aws ecs run-task --cluster my-cluster --task-definition my-task:3
aws ecs describe-tasks --cluster my-cluster --tasks <id>
aws ecs describe-services --cluster my-cluster --services my-svc
aws ecs execute-command --cluster my-cluster --task <id> --command "/bin/sh" --interactive
aws ecs list-tasks --cluster my-cluster --service-name my-svc
aws ecs stop-task --cluster my-cluster --task <id>

🏷️

ARN Formats

Cluster: arn:aws:ecs:region:account:cluster/name
Task Definition: arn:aws:ecs:region:account:task-definition/family:revision
Service: arn:aws:ecs:region:account:service/cluster/service-name
Task: arn:aws:ecs:region:account:task/cluster/task-id
Container Instance: arn:aws:ecs:region:account:container-instance/cluster/id
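These formats are regular enough to split mechanically, which helps when scripting around CLI output. A minimal parser sketch (the function name is hypothetical and the ARNs are placeholders):

```python
def parse_ecs_arn(arn):
    """Split an ECS ARN into region, account, resource type, and resource id."""
    parts = arn.split(":", 5)  # keep any ':' inside the resource (e.g. family:revision)
    if len(parts) != 6 or parts[0] != "arn" or parts[2] != "ecs":
        raise ValueError(f"not an ECS ARN: {arn}")
    _, _partition, _service, region, account, resource = parts
    resource_type, _, resource_id = resource.partition("/")
    return {"region": region, "account": account, "type": resource_type, "id": resource_id}


service = parse_ecs_arn("arn:aws:ecs:us-east-1:123456789012:service/my-cluster/my-svc")
task_def = parse_ecs_arn("arn:aws:ecs:us-east-1:123456789012:task-definition/my-task:3")
```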

Stopped Reason	Exit Code	Quick Fix
EssentialContainerExited	1	Check CloudWatch Logs for stack trace
OutOfMemoryError	137	Increase task memory
CannotPullContainerError	—	ECR perms + NAT/VPC endpoint
ResourceInitializationError	—	Subnet IPs + SG rules + EFS mounts
TaskFailedToStart	—	Execution Role + capacity