Amazon ECS
Elastic Container Service
Fully managed container orchestration. Run Docker containers at scale with EC2 or Fargate: you define your containers, and ECS handles scheduling, placement, and lifecycle.
⚡ ECS in 30 Seconds
- Run Docker containers on AWS: ECS handles orchestration, scheduling, and placement
- Fargate = serverless containers (no servers to manage). EC2 launch type = you manage the instances
- Services keep desired task count running, auto-restart failed tasks, integrate with ALB
- Task Definitions define your container blueprint: image, CPU, memory, networking, IAM roles
- Deep integration with ALB, ECR, CloudWatch, IAM, Secrets Manager, and X-Ray
What is ECS
A container packages your application code together with all its dependencies (runtime, libraries, system tools) into a single, portable unit. Unlike a virtual machine, a container shares the host operating system's kernel. This makes containers lightweight (megabytes, not gigabytes), fast to start (seconds, not minutes), and identical across environments (your laptop = staging = production).
Docker is the standard container runtime. You define a Dockerfile, build an image, and run it as a container. The image is immutable: the same image produces the same behavior everywhere.
Virtual Machine
- Full OS per VM (kernel + userspace)
- Gigabytes in size
- Minutes to boot
- Strong isolation (separate kernels)
- Managed by hypervisor (e.g., Nitro)
Container
- Shares host OS kernel
- Megabytes in size
- Seconds to start
- Process-level isolation (cgroups, namespaces)
- Managed by container runtime (Docker)
Key mental model: A VM virtualizes the hardware. A container virtualizes the OS. Containers are lighter-weight but share a kernel, so a kernel vulnerability can affect every container on the host. VMs have stronger isolation boundaries.
Running one container on your laptop is easy. Running 200 containers across 50 servers in production (keeping them healthy, distributing traffic, replacing failures, scaling up at peak, and rolling out updates without downtime) is not something Docker alone can do. That is the orchestration problem.
Scheduling
Which server should this container run on? Orchestrator picks the best host based on available CPU, memory, and placement constraints.
Health & Recovery
Container crashed? Orchestrator detects the failure and starts a replacement automatically. No manual intervention.
Scaling
Traffic spikes? Orchestrator launches more container instances. Traffic drops? Scales back down. Keeps desired count running at all times.
An orchestrator solves: where to place containers, how many to run, when to replace them, and how to update them without downtime. ECS is AWS's answer to this problem.
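The "keep the desired count running" idea can be sketched as a reconciliation step: compare what should be running with what is running, then decide what to stop and launch. This is a toy model for intuition only, not how ECS is implemented; `Task` and `reconcile` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Task:
    task_id: str
    healthy: bool


def reconcile(desired_count: int, tasks: list[Task]) -> dict[str, object]:
    """Compare desired state with observed state and return the actions to take."""
    healthy = [t for t in tasks if t.healthy]
    unhealthy_ids = [t.task_id for t in tasks if not t.healthy]
    surplus = max(0, len(healthy) - desired_count)
    return {
        # Stop every unhealthy task, plus any healthy tasks beyond the target.
        "stop": unhealthy_ids + [t.task_id for t in healthy[:surplus]],
        # Launch enough new tasks to get back to the target.
        "launch": max(0, desired_count - len(healthy)),
    }


observed = [Task("t1", True), Task("t2", False), Task("t3", True)]
print(reconcile(desired_count=4, tasks=observed))
```

A real orchestrator runs this kind of comparison continuously and also decides where each new task should be placed.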
Amazon ECS (Elastic Container Service) is a fully managed container orchestration service. You define your containers, ECS handles the rest:
ECS Manages
- Control plane: scheduling, placement, lifecycle
- Task management: run, stop, replace containers
- Service management: maintain desired task count
- Load balancer integration: register/deregister targets
- Rolling deployments: update without downtime
- Scaling: auto-adjust task count based on metrics
You Define
- Container image: Docker image from ECR or Docker Hub
- Resource requirements: CPU and memory per container
- Networking: VPC, subnets, security groups
- IAM roles: permissions for your containers
- Launch type: EC2 (you manage servers) or Fargate (serverless)
- Desired count: how many task copies to run
The critical point: ECS is the control plane only. It does not run your containers itself. It tells EC2 instances or Fargate to run them. Think of ECS as the "brain" that decides what runs where; the compute comes from your chosen launch type.
When should you pick ECS over other container orchestrators? This comparison covers the three most common choices on AWS:
| Feature | Docker Compose | ECS | EKS (Kubernetes) |
|---|---|---|---|
| What it is | Local multi-container tool | AWS-managed orchestrator | AWS-managed Kubernetes |
| Scale | 1 machine | Thousands of containers | Thousands of containers |
| Learning curve | Low | Medium | High |
| Multi-host | No | Yes | Yes |
| Auto healing | Basic restart | Full (replace + reschedule) | Full (pod restart + reschedule) |
| AWS integration | None | Deep (IAM, ALB, ECR, CloudWatch) | Good (via add-ons) |
| Portability | Docker standard | AWS-only | Multi-cloud (K8s standard) |
| Control plane cost | Free | Free | ~$72/month per cluster |
| Best for | Local dev, small projects | AWS-native production workloads | Multi-cloud, existing K8s teams |
Rule of thumb: If your team is on AWS and does not already use Kubernetes, choose ECS. It is simpler, has a free control plane, and integrates more deeply with AWS. Choose EKS only if you need Kubernetes portability across clouds or have existing K8s expertise. Choose Docker Compose only for local development.
Think of ECS as a restaurant kitchen:
Task Definition = Recipe
Specifies what to cook: ingredients (image), portion size (CPU/memory), instructions (environment variables, commands).
Task = A Plate of Food
One running instance of the recipe. Each plate is independent. If one drops, the kitchen makes another.
Service = The Head Chef
Ensures "always 5 plates ready." If a plate breaks, chef makes a new one. If demand spikes, chef makes more. ECS Service = desired count manager.
Cluster = The Kitchen
The physical space where everything runs. Can be your own equipment (EC2 instances) or the restaurant's built-in kitchen (Fargate, where you don't manage the ovens).
Container = One Dish Component
A single container inside a task. A task can have multiple containers, like a plate with main course + side dish running together.
This is the most common ECS production pattern: an ALB distributes traffic to Fargate tasks running in private subnets across two Availability Zones. If an entire AZ goes down, the remaining tasks continue serving traffic. The ECS Service automatically replaces failed tasks and maintains the desired count of 4.
| Feature | Lambda | ECS (Fargate) | EC2 |
|---|---|---|---|
| Model | Serverless functions | Serverless containers | Virtual machines |
| Max duration | 15 minutes | Unlimited | Unlimited |
| Max memory | 10 GB | 120 GB (16 vCPU) | Terabytes (instance-dependent) |
| Startup latency | ~100ms (warm) / 1-10s (cold) | 30-60 seconds | Minutes |
| Pricing | Per request + duration | Per vCPU/memory per second | Per instance-hour |
| Scaling | Auto (per-request, 1000s concurrently) | Auto (task count, seconds) | Auto (instance count, minutes) |
| Container support | Container images (read-only) | Full Docker support | Full Docker / any runtime |
| Persistent storage | /tmp only (10 GB) | EFS (shared) | EBS, EFS, instance store |
| GPU | No | No (Fargate) / Yes (EC2 type) | Yes |
| Best for | Event handlers, APIs <15min, glue code | Microservices, APIs, workers | Stateful apps, GPU, full OS control |
Simple APIs / Event Handlers
If your workload is short-lived (<15 min), event-driven, and stateless, use Lambda. No container to manage, no task definitions, no service configuration. Lambda is simpler and cheaper for request-response patterns.
Existing Kubernetes Teams
If your team already uses Kubernetes and needs multi-cloud portability, use EKS. ECS is AWS-only. Migrating K8s manifests to ECS task definitions is non-trivial.
Stateful / Legacy Workloads
If your app needs persistent local disk, specific OS configuration, or isn't containerized, use EC2 directly. ECS requires Docker images. Some legacy middleware won't containerize easily.
- ECS control plane is free. You only pay for the EC2 instances or Fargate tasks, not for ECS itself. EKS charges ~$72/month per cluster.
- ECS vs EKS: If the question says "simplest" or "least operational overhead" for containers on AWS → ECS + Fargate. If it says "Kubernetes" or "multi-cloud" → EKS.
- ECS vs Lambda: Lambda is per-request, max 15 min, max 10GB memory. ECS is for long-running services, larger workloads, or when you need full Docker compatibility.
- Fargate = serverless containers. If the question mentions "no server management" with containers → Fargate. If it says "GPU" or "daemon" → must use EC2 launch type.
- Distractor: "Docker Compose can scale to production on AWS" is wrong. Compose is single-host only and has no auto-healing or multi-AZ support.
- Containers package app + dependencies into portable units. Lighter than VMs, seconds to start.
- Orchestration solves scheduling, health recovery, scaling, and zero-downtime deploys.
- ECS is the AWS-managed orchestrator: free control plane, deep AWS integration.
- Two launch types: EC2 (you manage servers) and Fargate (serverless).
- ECS vs EKS: ECS for AWS-native simplicity, EKS for Kubernetes portability.
- Production pattern: ALB → ECS Fargate tasks across Multi-AZ.
Core Concepts
ECS has five core entities that form a clear hierarchy: Cluster → Service → Task → Container, plus a Task Definition that serves as the blueprint. Understanding how these relate is the single most important concept in ECS.
A cluster is the logical grouping that holds all your ECS resources. It is the top-level boundary: services, tasks, and capacity all live inside a cluster. A cluster does not contain compute by itself; you register EC2 instances to it, or use Fargate, which provisions compute on demand.
What a Cluster Contains
- One or more services (long-running containers)
- Standalone tasks (one-off jobs)
- Registered EC2 instances (EC2 launch type) or Fargate capacity
- Capacity provider strategies
Key Facts
- A cluster is free: no cost for the cluster itself
- You can have multiple clusters per account (one per environment is common)
- A cluster can mix EC2 and Fargate launch types
- A default cluster is created automatically, but create named clusters for production
Common pattern: one cluster per environment (dev, staging, production). Each cluster has its own services and capacity, providing isolation between environments.
A task definition is a JSON document that describes how to run your container(s). Think of it as a blueprint or recipe: it specifies the Docker image, CPU and memory requirements, networking mode, IAM roles, environment variables, log configuration, and more. You never run a task definition directly; you use it to launch tasks.
Container Image
Which Docker image to pull β from ECR, Docker Hub, or any registry. Example: 123456.dkr.ecr.us-east-1.amazonaws.com/web-api:v2
Resource Limits
CPU and memory per task. Fargate has fixed combinations (e.g., 0.5 vCPU / 1GB). EC2 launch type is more flexible.
Networking & Ports
Network mode (awsvpc, bridge, host), port mappings, and security group assignments.
Task definitions are versioned. Each update creates a new revision (e.g., web-api:1, web-api:2, web-api:3). You point your service at a specific revision. Rolling back = pointing the service to a previous revision. Old revisions are never deleted automatically.
Here is a minimal task definition in JSON, showing the key fields every ECS user must understand:
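As an illustrative sketch, a Fargate-compatible task definition can be built as a Python dict and serialized to JSON. The field names follow the ECS task definition schema; the account ID, role names, port, environment variable, and log group are placeholders assumed for the example.

```python
import json

# Minimal Fargate-style task definition. All ARNs and names are placeholders.
task_definition = {
    "family": "web-api",
    "requiresCompatibilities": ["FARGATE"],
    "networkMode": "awsvpc",
    "cpu": "512",      # CPU units: 1024 = 1 vCPU, so this is 0.5 vCPU
    "memory": "1024",  # MiB
    "executionRoleArn": "arn:aws:iam::123456789012:role/exampleExecutionRole",
    "taskRoleArn": "arn:aws:iam::123456789012:role/exampleTaskRole",
    "containerDefinitions": [
        {
            "name": "web-api",
            "image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/web-api:v2",
            "essential": True,
            "portMappings": [{"containerPort": 8080, "protocol": "tcp"}],
            "environment": [{"name": "APP_ENV", "value": "production"}],
            "logConfiguration": {
                "logDriver": "awslogs",
                "options": {
                    "awslogs-group": "/ecs/web-api",
                    "awslogs-region": "us-east-1",
                    "awslogs-stream-prefix": "web",
                },
            },
        }
    ],
}

print(json.dumps(task_definition, indent=2))
```

The printed JSON is the shape you would register as a new revision of the `web-api` family.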
A task is a running instance of a task definition. When ECS launches a task, it pulls the Docker image, allocates CPU/memory, assigns an ENI (in awsvpc mode), and starts the container(s). A task can contain one container (most common) or multiple (sidecar pattern).
Task Lifecycle
- PROVISIONING: allocating resources (ENI, storage)
- PENDING: pulling image, starting containers
- RUNNING: containers executing
- STOPPED: container exited (success or failure)
Key Facts
- Each task gets its own private IP (awsvpc mode)
- Tasks are ephemeral: they can be replaced anytime
- Essential container exits → entire task stops
- Non-essential sidecar can crash without killing the task
An ECS service maintains a desired count of running tasks. If a task crashes, the service replaces it. If you want 4 copies running at all times, the service keeps launching replacements until 4 healthy tasks are running again. Services also integrate with load balancers, automatically registering and deregistering tasks as targets.
Desired Count
"Run 4 tasks." If one dies, service launches a 5th to replace it. If you scale to 8, service launches 4 more. Always maintains the target.
Load Balancer
Service registers each task's IP:port with the ALB target group. When a task starts → registered. When it stops → deregistered. Zero manual work.
Deployments
Update the service's task definition → rolling deployment. New tasks start, old tasks drain. Configurable via minimumHealthyPercent and maximumPercent.
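To see how minimumHealthyPercent and maximumPercent bound a rolling deployment, here is a small sketch. It assumes the lower bound rounds up and the upper bound rounds down, and it uses 100 and 200 as the default percentages; `deployment_bounds` is a hypothetical helper, not an ECS API.

```python
def deployment_bounds(desired_count, minimum_healthy_percent=100, maximum_percent=200):
    """Return (min_running, max_running) task counts allowed during a rolling deploy."""
    # Lower bound: percentage of desired count, rounded up (integer ceiling).
    min_running = -(-desired_count * minimum_healthy_percent // 100)
    # Upper bound: percentage of desired count, rounded down.
    max_running = desired_count * maximum_percent // 100
    return min_running, max_running


print(deployment_bounds(4))           # assumed defaults
print(deployment_bounds(4, 50, 100))  # no extra capacity; replace in place
print(deployment_bounds(3, 50, 150))
```

With 100/200 the service may briefly run double the desired count so that old tasks stop only after new ones are healthy. With 50/100 it never exceeds the desired count, so it stops old tasks first to make room.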
Service vs standalone task: Use a service for long-running workloads (web servers, APIs, workers). Use a standalone task for one-off jobs (database migration, scheduled batch, data export). The service restarts failed tasks. A standalone task runs once and stops.
ECS services are self-healing by default. You don't configure recovery; it is built into the service abstraction. If anything goes wrong with a running task, ECS replaces it automatically. Combined with ALB health checks, this creates a resilient system that recovers from failures without human intervention.
What ECS Heals Automatically
- Task crashes (exit code ≠ 0): service launches replacement
- ALB health check fails: task deregistered → replaced
- EC2 instance dies: tasks rescheduled to healthy instances
- AZ goes down: tasks rebalanced across remaining AZs
- Spot interruption: service launches a replacement Fargate Spot task once Spot capacity is available (on-demand Fargate is used only if the capacity provider strategy includes it)
Recovery Timeline
- Task crash: ~30-60s to launch replacement (Fargate)
- Health check failure: deregistration delay + new task start
- ALB update: automatic; new task registered, old drained
- No manual intervention: service maintains desired count
- Deployment rollback: the deployment circuit breaker, when enabled with rollback, auto-reverts bad deploys
A container is a single Docker container inside a task. Most tasks run one container (your application). But ECS supports multi-container tasks, a common pattern for sidecars like log routers, tracing agents (X-Ray daemon), or Envoy proxies. Containers in the same task share the network namespace (they can communicate over localhost) and can share volumes.
Essential Containers
If a container marked "essential": true exits, the entire task stops. Your main app container should always be essential. Sidecar containers can be non-essential.
Multi-Container Patterns
- Sidecar: X-Ray daemon, Datadog agent, Envoy proxy
- Log router: Fluent Bit forwarding to CloudWatch/S3
- Init-style container: runs to completion before the main app starts, ordered with container dependencies (dependsOn)
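The essential flag can be illustrated with a short sketch. The container names are examples, and `task_keeps_running` is a hypothetical helper that models the rule rather than calling ECS. It treats a missing `essential` key as true, which matches the ECS default.

```python
# Example containerDefinitions fragment: one main app plus two sidecars.
containers = [
    {"name": "app", "essential": True},
    {"name": "xray-daemon", "essential": False},
    {"name": "log-router", "essential": False},
]


def task_keeps_running(container_definitions, exited):
    """Return False if any exited container is marked essential."""
    essential = {
        c["name"] for c in container_definitions if c.get("essential", True)
    }
    return not (essential & set(exited))


print(task_keeps_running(containers, ["xray-daemon"]))  # sidecar crash
print(task_keeps_running(containers, ["app"]))          # main app crash
```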
This is the most exam-tested ECS concept, and the most commonly confused. ECS uses two separate IAM roles with completely different purposes:
| Aspect | Task Role | Execution Role |
|---|---|---|
| Who uses it | Your application code inside the container | The ECS agent (not your code) |
| Purpose | Access AWS services from your app | Infrastructure setup: pull images, push logs |
| Example permissions | s3:GetObject, dynamodb:PutItem, sqs:SendMessage | ecr:GetAuthorizationToken, logs:CreateLogStream |
| JSON field | taskRoleArn | executionRoleArn |
| Required? | Only if your app calls AWS APIs | Yes, always needed for Fargate |
| Analogy | Employee badge: what rooms they can enter | Building manager: keeps the lights and doors working |
| If missing | App gets "Access Denied" calling AWS services | Task fails to start (can't pull image or push logs) |
Exam trap: "The container needs to write to S3. Which role?" → Task Role (your app's permissions). "The container fails to start because it can't pull from ECR" → Execution Role is missing or wrong. Never confuse the two; the exam does this deliberately.
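A rough sketch of what the two roles might look like as IAM policy documents, written as Python dicts. Both roles trust the `ecs-tasks.amazonaws.com` service principal; the bucket name and the exact action lists are illustrative assumptions, so scope them to your own workload.

```python
import json

# Trust policy shared by both roles: ECS tasks assume the role.
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"Service": "ecs-tasks.amazonaws.com"},
        "Action": "sts:AssumeRole",
    }],
}

# Execution role: infrastructure permissions (pull image, write logs).
execution_role_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": [
            "ecr:GetAuthorizationToken",
            "ecr:BatchCheckLayerAvailability",
            "ecr:GetDownloadUrlForLayer",
            "ecr:BatchGetImage",
            "logs:CreateLogStream",
            "logs:PutLogEvents",
        ],
        "Resource": "*",
    }],
}

# Task role: what the application code itself may do (placeholder bucket).
task_role_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": ["s3:PutObject"],
        "Resource": "arn:aws:s3:::example-bucket/*",
    }],
}

print(json.dumps(task_role_policy, indent=2))
```

Note that no S3 action appears in the execution role and no ECR action appears in the task role, which mirrors the split described in the table.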
- Task Role vs Execution Role: the #1 most tested concept. Task Role = your app's permissions. Execution Role = ECS agent's permissions (pulling images, pushing logs).
- "Container can't pull image from ECR" → Missing or incorrect Execution Role. Not the Task Role.
- "App returns Access Denied when writing to S3" → Missing or incorrect Task Role. Not the Execution Role.
- Essential container exits → entire task stops. Non-essential sidecars can fail without killing the task.
- Task Definition is versioned. Each update = new revision. Rollback = point service to older revision number.
- Containers in the same task share network (communicate via localhost) and share the CPU/memory budget.
- Distractor: "Use EC2 instance role for container AWS access" is wrong. ECS containers use Task Role, not the EC2 instance profile (even on EC2 launch type).
- Cluster: logical grouping. Free. One per environment is common.
- Task Definition: JSON blueprint with image, CPU, memory, roles, ports, logs. Versioned with revisions.
- Task: running instance of a task definition. Gets its own IP (awsvpc). Ephemeral.
- Service: maintains desired task count. Auto-restarts failed tasks. Integrates with ALB.
- Container: Docker container inside a task. Essential flag controls task lifecycle.
- Task Role: your app's AWS permissions (S3, DynamoDB). Execution Role: ECS agent's permissions (ECR pull, CW logs).
- Multi-container tasks: sidecar pattern. X-Ray daemon, log router, and Envoy proxy share network with main app.
Launch Types
ECS gives you exactly two choices for where your containers physically run: EC2 launch type (you manage the servers) or Fargate launch type (AWS manages the servers). This is the single most impactful architectural decision in ECS: it determines your pricing model, operational burden, scaling behavior, and what features are available.
With the EC2 launch type, you provision and manage a fleet of EC2 instances. You register these instances with your ECS cluster by installing the ECS container agent (pre-installed on the Amazon ECS-optimized AMI). ECS places your containers on these instances based on available CPU and memory. You are responsible for patching, scaling, and monitoring the instances themselves.
Strengths
- Full control: instance type, AMI, OS patches, SSH access
- GPU support: P3, P4, G4 instances for ML workloads
- Persistent EBS volumes: attach to specific instances
- Daemon scheduling: run one agent per instance (monitoring, logging)
- Higher task density: pack many small tasks on one large instance
- Cheaper for steady-state: Reserved Instances / Savings Plans work
Trade-offs
- You manage instances: patching, AMI updates, agent upgrades
- Capacity planning: must provision enough instances for peak
- ENI limits: each awsvpc task consumes one ENI. Small instances (t3.micro) may support only 1-2 tasks. Enable ENI trunking to increase the limit.
- Idle waste: pay for full instance even if half-empty
- Scaling is two-layer: scale tasks AND scale instances (Auto Scaling Group)
The ECS container agent is a Docker container itself that runs on every EC2 instance. It communicates with the ECS control plane, receives task placement instructions, starts/stops containers, and reports health. Use the ECS-optimized AMI (Amazon Linux 2023); it comes pre-configured with Docker and the agent.
With Fargate, you do not provision or manage any servers. You specify CPU and memory requirements in the task definition, and AWS provisions a compute environment for each task. You never see the underlying instance. Each task runs in its own isolated micro-VM (using Firecracker), providing strong security isolation: one customer's task cannot affect another's.
Strengths
- Zero server management: no instances to patch, scale, or monitor
- Per-task pricing: pay only for the vCPU and memory your task uses
- No idle waste: no instance running empty at 2 AM
- Task-level isolation: Firecracker micro-VM per task
- Scaling is one-layer: just change desired count, Fargate handles capacity
- Fargate Spot: up to 70% discount for interruptible tasks
Trade-offs
- No GPU support: cannot use GPU instance types
- No daemon scheduling: can't run one agent per "host"
- No EBS volumes: ephemeral storage only (20GB default, up to 200GB)
- Fixed CPU/memory combos: limited set of valid pairings
- No SSH access: debug via ECS Exec only
- Higher per-unit cost: ~20% more expensive per vCPU-hour than EC2
The Fargate pricing model is straightforward: you pay per vCPU-second and per GB-second your task runs. There is no cost when no tasks are running (unlike EC2, where the instance bill continues). This makes Fargate ideal for variable workloads: the cost tracks actual usage.
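The per-second model reduces to a one-line formula. The sketch below uses made-up hourly rates purely for illustration; real prices vary by region and change over time, so treat `VCPU_HOUR` and `GB_HOUR` as assumptions.

```python
def fargate_task_cost(vcpu, memory_gb, seconds, price_per_vcpu_hour, price_per_gb_hour):
    """Cost of one task: (vCPU rate + memory rate) multiplied by running time."""
    hours = seconds / 3600
    return hours * (vcpu * price_per_vcpu_hour + memory_gb * price_per_gb_hour)


# Hypothetical rates for illustration only; look up current regional pricing.
VCPU_HOUR = 0.04
GB_HOUR = 0.004

# One 0.5 vCPU / 1 GB task running for 10 hours.
cost = fargate_task_cost(0.5, 1, 10 * 3600, VCPU_HOUR, GB_HOUR)
print(f"{cost:.4f}")
```

A task that runs zero seconds costs zero, which is the "no idle waste" property in numeric form.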
Fargate does not let you specify arbitrary CPU and memory; there are fixed valid combinations. If you specify an invalid pairing, the task definition fails to register.
| vCPU | Memory Options (GB) | Typical Use Case |
|---|---|---|
| 0.25 vCPU | 0.5, 1, 2 | Tiny microservices, health checkers |
| 0.5 vCPU | 1, 2, 3, 4 | APIs, lightweight web servers |
| 1 vCPU | 2, 3, 4, 5, 6, 7, 8 | Standard APIs, workers |
| 2 vCPU | 4 to 16 (in 1GB steps) | Batch jobs, heavier services |
| 4 vCPU | 8 to 30 (in 1GB steps) | Data processing, analytics |
| 8 vCPU | 16 to 60 (in 4GB steps) | ML inference, heavy compute |
| 16 vCPU | 32 to 120 (in 8GB steps) | Large in-memory workloads |
Exam tip: If a question says "the task requires 3 vCPU and 6GB memory", there is no 3 vCPU option in Fargate. You must round up to 4 vCPU, and memory then starts at 8GB, the smallest size offered at 4 vCPU. This is a common exam trap. Know the valid vCPU values: 0.25, 0.5, 1, 2, 4, 8, 16.
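The table above can be encoded directly, which makes both the validity check and the round-up rule easy to test. This is a sketch built only from the combinations listed here; the function names are hypothetical.

```python
# Valid Fargate sizes from the table: vCPU -> allowed memory values in GB.
FARGATE_MEMORY_GB = {
    0.25: [0.5, 1, 2],
    0.5: [1, 2, 3, 4],
    1: list(range(2, 9)),         # 2 to 8 in 1GB steps
    2: list(range(4, 17)),        # 4 to 16 in 1GB steps
    4: list(range(8, 31)),        # 8 to 30 in 1GB steps
    8: list(range(16, 61, 4)),    # 16 to 60 in 4GB steps
    16: list(range(32, 121, 8)),  # 32 to 120 in 8GB steps
}


def is_valid_fargate_size(vcpu, memory_gb):
    """True if the vCPU/memory pair appears in the table."""
    return memory_gb in FARGATE_MEMORY_GB.get(vcpu, [])


def smallest_fargate_size(required_vcpu, required_gb):
    """Smallest valid (vCPU, GB) pair covering the requirement, or None."""
    for vcpu in sorted(FARGATE_MEMORY_GB):
        if vcpu < required_vcpu:
            continue
        for memory_gb in FARGATE_MEMORY_GB[vcpu]:
            if memory_gb >= required_gb:
                return vcpu, memory_gb
    return None


print(is_valid_fargate_size(3, 6))
print(smallest_fargate_size(3, 6))
```

For the exam-style "3 vCPU and 6GB" request, the check fails and the round-up lands on 4 vCPU with 8GB.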
| Feature | EC2 Launch Type | Fargate Launch Type |
|---|---|---|
| Server management | You manage EC2 instances | AWS manages (serverless) |
| Pricing | Pay for EC2 instances (running or not) | Pay per vCPU-second + GB-second per task |
| Isolation | Instance-level (shared host for tasks) | Task-level (Firecracker micro-VM) |
| GPU support | Yes: full GPU access (P3, P4, G4, G5) | No: no GPU available |
| Persistent storage (EBS) | Yes: EBS volumes attachable | No: ephemeral only (20-200GB) |
| EFS (shared file system) | Yes: supported | Yes: supported |
| Spot instances | Yes: EC2 Spot (up to 90% discount) | Yes: Fargate Spot (up to 70% discount) |
| Daemon scheduling | Yes: one task per instance | No: not supported |
| Task density | Pack multiple tasks per instance | One micro-VM per task |
| SSH access | Yes: direct SSH to instance | No: ECS Exec only (SSM-based) |
| Scaling layers | 2 layers: tasks + instances (ASG) | 1 layer: tasks only |
| Cold start | ~minutes (if ASG needs new instance) | ~30-60s (Fargate provisions infra) |
| Best for | Large steady workloads, GPU, tight cost control | Variable/spiky workloads, simplicity, microservices |
A single ECS cluster can use both launch types simultaneously. This is the production-standard pattern for cost optimization: run steady-state workloads on EC2 Reserved Instances (cheapest baseline), and run bursty or overflow services on Fargate (no pre-provisioning needed). Capacity provider strategies let you define the mix for each service, for example "a base of 2 tasks on Fargate, the rest weighted 1:3 toward Fargate Spot". A single strategy references either Fargate providers or Auto Scaling group providers, not both, so the EC2/Fargate split is made service by service.
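To make base and weight concrete, here is a sketch of a capacity provider strategy and a rough model of how tasks might be divided. The `capacityProvider`, `base`, and `weight` keys follow the ECS API. The split function is an approximation written for illustration, since the exact rounding ECS applies is not described in this guide.

```python
# Strategy: first 2 tasks on Fargate, then 1 Fargate task per 3 Fargate Spot tasks.
capacity_provider_strategy = [
    {"capacityProvider": "FARGATE", "base": 2, "weight": 1},
    {"capacityProvider": "FARGATE_SPOT", "weight": 3},
]


def approximate_split(strategy, task_count):
    """Approximate allocation: satisfy base first, then split the rest by weight."""
    allocation = {item["capacityProvider"]: 0 for item in strategy}
    remaining = task_count
    for item in strategy:
        take = min(item.get("base", 0), remaining)
        allocation[item["capacityProvider"]] += take
        remaining -= take
    total_weight = sum(item.get("weight", 0) for item in strategy)
    if remaining and total_weight:
        shares = [remaining * item.get("weight", 0) // total_weight for item in strategy]
        # Assumption: any rounding remainder goes to the highest-weight provider.
        top = max(range(len(strategy)), key=lambda i: strategy[i].get("weight", 0))
        shares[top] += remaining - sum(shares)
        for item, share in zip(strategy, shares):
            allocation[item["capacityProvider"]] += share
    return allocation


print(approximate_split(capacity_provider_strategy, 10))
```

The base guarantees a floor of on-demand tasks that survive Spot interruptions, while the weights control how the remaining tasks are divided.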
Web Tier → Fargate
Variable traffic, auto-scales, no servers to manage. Simplest operational model for customer-facing services.
Workers → EC2
Steady-state processing, Reserved Instances for cost. Pack multiple worker tasks per large instance for efficiency.
Batch → Fargate Spot
Interruptible batch jobs get up to 70% discount. Task retries handle interruptions naturally.
Decision framework: Start with Fargate. It is simpler and scales naturally. Move to EC2 launch type only when you need: (1) GPU, (2) EBS persistent volumes, (3) daemon scheduling, (4) cost optimization on large steady-state fleets, or (5) specific instance types. Fargate is the default for most new workloads on ECS.
When using the EC2 launch type, ECS decides which instance gets each new task. Task placement strategies control this decision. They apply only to EC2; Fargate handles placement internally (one micro-VM per task, AWS chooses the host).
| Strategy | How It Works | Best For |
|---|---|---|
| spread | Distribute tasks evenly across the specified field (e.g., attribute:ecs.availability-zone or instanceId) | High availability: ensures an AZ failure impacts the fewest tasks |
| binpack | Pack tasks onto the fewest instances possible (by CPU or memory) | Cost optimization: fewer instances running, lower EC2 bill |
| random | Place tasks on random instances | Simple workloads, testing: no preference |
Combining Strategies
You can chain strategies in order of priority. Example: spread(az) first, then binpack(memory). This spreads across AZs for HA, then packs tightly within each AZ for cost savings.
Placement Constraints
Constraints filter which instances are eligible: distinctInstance (no two tasks on same instance) or memberOf (custom expressions like attribute:ecs.instance-type == g4dn.xlarge).
Default behavior: ECS uses spread across Availability Zones by default. This is the safest default because it maximizes availability. Switch to binpack when cost optimization is the priority and you can tolerate reduced AZ spread.
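The chained "spread by AZ, then binpack by memory" strategy can be simulated on a toy fleet. The `placement_strategy` list uses the ECS field names. `place_task` is a simplified model written for this example, and the instance IDs and memory sizes are invented.

```python
# How the chained strategy is expressed in a service definition.
placement_strategy = [
    {"type": "spread", "field": "attribute:ecs.availability-zone"},
    {"type": "binpack", "field": "memory"},
]


def place_task(instances, task_memory):
    """Pick the AZ with the fewest tasks, then the fullest instance that still fits."""
    candidates = [i for i in instances if i["free_memory"] >= task_memory]
    if not candidates:
        return None
    tasks_per_az = {}
    for inst in instances:
        tasks_per_az[inst["az"]] = tasks_per_az.get(inst["az"], 0) + inst["tasks"]
    # Sort key: (spread across AZs first, then least free memory = binpack).
    chosen = min(candidates, key=lambda i: (tasks_per_az[i["az"]], i["free_memory"]))
    chosen["free_memory"] -= task_memory
    chosen["tasks"] += 1
    return chosen["id"]


fleet = [
    {"id": "a1", "az": "us-east-1a", "free_memory": 4096, "tasks": 0},
    {"id": "a2", "az": "us-east-1a", "free_memory": 2048, "tasks": 0},
    {"id": "b1", "az": "us-east-1b", "free_memory": 4096, "tasks": 0},
]
placements = [place_task(fleet, 1024) for _ in range(4)]
print(placements)
```

In this run the tasks alternate between the two AZs, and within us-east-1a the smaller instance fills up while a1 stays empty, which is what would let an Auto Scaling group remove it.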
This architecture uses each launch type where it shines: Fargate for the web tier (simple, no servers, auto-scales), EC2 Spot for batch processing (cheapest compute, interruption-tolerant), and EC2 GPU for ML inference (needs hardware that Fargate can't provide). All managed through a single ECS cluster with capacity provider strategies.
- "No server management" + containers → Always Fargate. This is the exam's favorite phrase for Fargate.
- "Requires GPU" → Must use EC2 launch type. Fargate does not support GPU instances.
- "Run one monitoring agent per host" → Daemon scheduling on EC2 launch type. Fargate doesn't support daemons.
- "Need persistent block storage (EBS)" → EC2 launch type. Fargate only has ephemeral storage.
- "Need shared file storage across tasks" → EFS works with both EC2 and Fargate. Don't pick EC2 just for shared storage.
- Fargate Spot: up to 70% discount but tasks can be interrupted with 2-minute warning. Good for batch. Not for web servers.
- Valid vCPU values: 0.25, 0.5, 1, 2, 4, 8, 16. If a question says "3 vCPU", that's invalid; you must round up to 4.
- Fargate cold start ~30-60s. If the question requires "sub-second scaling" → EC2 with pre-warmed instances.
- Distractor: "Fargate is always cheaper than EC2" is wrong. For large steady-state workloads, EC2 with Reserved Instances is cheaper per unit.
- EC2 launch type: you manage instances. Full control, GPU, EBS, daemon scheduling. Cheaper for steady-state (RI/SP).
- Fargate: serverless containers. Zero server management. Per-task pricing. No GPU, no EBS, no SSH.
- Fargate isolation: each task runs in its own Firecracker micro-VM (task-level isolation vs instance-level).
- Fargate CPU/memory: fixed combinations. vCPU options: 0.25, 0.5, 1, 2, 4, 8, 16.
- Hybrid clusters: use both launch types. Web on Fargate, batch on EC2 Spot, ML on EC2 GPU.
- Default choice: start with Fargate. Move to EC2 only when you need GPU, EBS, daemons, or cost optimization at scale.
- Fargate Spot: up to 70% discount for interruptible batch workloads.
Fargate: Complete Understanding
Fargate is a serverless compute engine for containers. It removes the server layer entirely: you define what you want to run and how much CPU/memory it needs. AWS handles everything else: provisioning compute, patching the OS, managing the container runtime, and isolating your workload.
The Right Mental Model
Most people think Fargate = "Lambda for containers." That's not quite right.
Better mental model:
"Fargate = EC2 without access to EC2"
Your container gets a VPC, an ENI, security groups, and a private IP, just like EC2. You just can't SSH in, can't pick the instance type, can't install host-level agents. The EC2 exists; you just don't see it.
What Happens Under the Hood
- AWS provisions a Firecracker micro-VM per task
- AWS manages the host OS, container runtime (containerd), and ECS agent
- You never see the underlying EC2 instance
- Isolation: each task is a separate micro-VM (not just a container on a shared host)
- You only define: CPU, memory, container image, networking
Internally: Fargate still runs on EC2 hardware (Nitro instances). It's EC2 that AWS manages for you, not a different compute technology.
Key insight: Fargate is NOT a separate compute platform. It's an abstraction layer over EC2. AWS is running EC2 instances, launching Firecracker micro-VMs on them, and exposing only the container interface to you. This is why Fargate tasks behave like EC2 instances (own IP, security groups, VPC placement): under the hood, they ARE running on EC2.
Every Fargate task gets its own Elastic Network Interface (ENI) with a private IP address in your VPC. This has major implications:
How It Works
- Each task = own ENI = own private IP
- ENI lives in your subnet (public or private)
- You attach security groups directly to the task
- Tasks can communicate using standard VPC networking
- Always uses awsvpc network mode (no other option)
Implications
- No port conflicts: every task has its own IP, so all can use port 80
- Security group per task: fine-grained firewall rules
- Task behaves like an EC2 instance from a networking perspective
- ALB targets individual tasks by IP (not instance + port)
⚠️ Critical requirement: Fargate tasks in private subnets need a NAT Gateway for internet access (pulling images from Docker Hub, calling external APIs). Without a route out, the task cannot pull its image: it stalls before reaching RUNNING and eventually stops with an image-pull or resource-initialization error. For ECR image pulls, you can alternatively use VPC Endpoints (PrivateLink) to avoid NAT Gateway costs.
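A sketch of the networking pieces for a private-subnet Fargate service. The `awsvpcConfiguration` keys follow the ECS API; the subnet and security group IDs are placeholders. The endpoint list reflects the services typically needed to pull from ECR and write CloudWatch logs without NAT, and `private_pull_endpoints` is a hypothetical helper.

```python
# networkConfiguration for a service running in private subnets (IDs are placeholders).
network_configuration = {
    "awsvpcConfiguration": {
        "subnets": ["subnet-0aaa1111", "subnet-0bbb2222"],  # one per AZ
        "securityGroups": ["sg-0ccc3333"],
        "assignPublicIp": "DISABLED",  # private subnet: no public IP
    }
}


def private_pull_endpoints(region):
    """VPC endpoint service names commonly needed to pull from ECR without NAT."""
    return [
        f"com.amazonaws.{region}.ecr.api",  # ECR API calls (auth token)
        f"com.amazonaws.{region}.ecr.dkr",  # Docker registry endpoint
        f"com.amazonaws.{region}.s3",       # image layers are stored in S3
        f"com.amazonaws.{region}.logs",     # awslogs driver -> CloudWatch Logs
    ]


for name in private_pull_endpoints("us-east-1"):
    print(name)
```

Endpoints cover only AWS services. Images hosted on Docker Hub or calls to external APIs still need a NAT Gateway or another egress path.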
Fargate does NOT allow arbitrary resource values. You must choose from predefined CPU/memory combinations. If you specify an invalid pair, the task definition will fail to register.
| vCPU | Memory Options (GB) | Typical Use Case |
|---|---|---|
| 0.25 vCPU | 0.5, 1, 2 | Microservices, health checks, lightweight APIs |
| 0.5 vCPU | 1, 2, 3, 4 | Small web apps, background workers |
| 1 vCPU | 2, 3, 4, 5, 6, 7, 8 | Standard web apps, APIs |
| 2 vCPU | 4-16 (in 1GB increments) | Medium workloads, data processing |
| 4 vCPU | 8-30 (in 1GB increments) | Large apps, compute-heavy tasks |
| 8 vCPU | 16-60 (in 4GB increments) | Heavy processing, in-memory caching |
| 16 vCPU | 32-120 (in 8GB increments) | Max power (rare, expensive) |
Exam trap: "The task requires 3 vCPU and 6GB memory." There is no 3 vCPU option, so you must round up to 4 vCPU. Valid vCPU values: 0.25, 0.5, 1, 2, 4, 8, 16. Nothing in between. This is a frequently tested concept.
Fargate task startup is not instant. Understanding the timeline helps you set realistic expectations for scaling and health check grace periods:
Startup Time: ~30-60 seconds
Typical cold start. Includes compute provisioning + image pull + container start. Larger images = longer startup.
What Happens During Startup
- AWS provisions Firecracker micro-VM
- Attaches ENI to your subnet
- Pulls container image from ECR/registry
- Starts your container process
- Health check grace period begins
Compared To
- Lambda cold start: 100ms-3s (faster)
- Fargate: 30-60s
- EC2 launch: 2-5 min (slower)
For latency-sensitive scaling, keep min tasks > 0 to avoid cold starts.
Storage in Fargate is fundamentally different from EC2: there is no EBS available. Understanding what you get (and don't get) prevents painful surprises:
Ephemeral Storage
- Default: 20 GB per task
- Configurable: up to 200 GB
- Lifecycle: destroyed when task stops
- Fast local SSD: good for temp files, caching, scratch space
- Shared across all containers in the task
Persistent Storage: EFS
- Amazon EFS = only persistent storage option for Fargate
- Shared filesystem: multiple tasks read/write simultaneously
- Survives task restarts
- Mount as a volume in task definition
- Use for: shared config, uploaded files, ML models
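Both storage options are declared in the task definition. The fragment below is a sketch using the ECS field names (`ephemeralStorage`, `volumes`, `efsVolumeConfiguration`, `mountPoints`); the file system ID, volume name, and mount path are placeholders, and `undefined_volumes` is a hypothetical sanity check.

```python
# Task-level storage settings (fragment of a task definition).
task_storage = {
    # Raise ephemeral storage above the 20 GB default (valid up to 200).
    "ephemeralStorage": {"sizeInGiB": 100},
    "volumes": [
        {
            "name": "shared-data",
            "efsVolumeConfiguration": {
                "fileSystemId": "fs-0123456789abcdef0",  # placeholder ID
                "transitEncryption": "ENABLED",
            },
        }
    ],
}

# Container-level mount that refers to the volume by name.
container_mounts = [
    {"sourceVolume": "shared-data", "containerPath": "/mnt/shared", "readOnly": False}
]


def undefined_volumes(storage, mounts):
    """Return mount sources that do not match any declared volume name."""
    declared = {v["name"] for v in storage.get("volumes", [])}
    return [m["sourceVolume"] for m in mounts if m["sourceVolume"] not in declared]


print(undefined_volumes(task_storage, container_mounts))
```

Data written under `/mnt/shared` lives on EFS and survives task replacement. Everything else on the task's disk is ephemeral.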
⚠️ No EBS on Fargate. If your workload requires EBS volumes (high IOPS, block storage, databases), you must use EC2 launch type. Fargate only supports ephemeral storage + EFS. This is a common exam question and a real-world constraint.
Fargate is excellent, but it's NOT always the right choice. These limitations are critical for architecture decisions and exam answers:
What Fargate Cannot Do
- No GPU support β ML training, rendering β use EC2
- No EBS volumes β only ephemeral + EFS
- No daemon containers β can't run node-level agents (Datadog agent, Fluentd)
- No SSH access β cannot log into the host
- No custom AMI β can't customize the underlying OS
- No privileged mode β can't run containers with root-level host access
- No Windows containers (limited support, still maturing)
- Fixed CPU/memory combos β can't choose arbitrary values
Cost Considerations
- Fargate is ~20-40% more expensive per vCPU-hour than EC2 On-Demand
- EC2 with Reserved Instances / Savings Plans = much cheaper for steady workloads
- Fargate Spot helps (up to 70% off) but can be interrupted
- Break-even point: if task utilization is >70% consistently for 24/7 workloads, EC2 is cheaper
- Fargate wins: for variable workloads, burst traffic, short-lived tasks
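The break-even bullet can be checked with simple arithmetic. This is a rough sketch under a deliberately simplified model (stated in the docstring), not AWS's pricing formula; real bills also depend on instance family, region, and commitment discounts.

```python
def break_even_utilization(fargate_premium: float) -> float:
    """Utilization above which EC2 On-Demand is cheaper than Fargate.

    Assumed model (a simplification):
    - Fargate bills only the vCPU a task requests, at (1 + premium) x the EC2 rate.
    - EC2 bills the whole instance, so cost per *used* vCPU-hour is rate / utilization.
    The two are equal when utilization = 1 / (1 + premium).
    """
    if fargate_premium < 0:
        raise ValueError("premium must be non-negative")
    return 1 / (1 + fargate_premium)


for premium in (0.20, 0.30, 0.40):
    pct = break_even_utilization(premium) * 100
    print(f"premium {premium:.0%}: EC2 wins above ~{pct:.0f}% utilization")
```

With the 20-40% premium quoted above, the break-even lands between roughly 71% and 83% utilization, which is where the ">70%" rule of thumb comes from.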
Choose Fargate When
- You want zero infrastructure management
- Workloads are variable or bursty
- Fast setup and iteration speed matter
- Small team with no dedicated DevOps
- Security isolation per task is important
- You're running microservices (many small containers)
- Development and staging environments
Choose EC2 When
- You need GPU instances (ML training, rendering)
- You want cost optimization at scale (RI/SP at 50-60% off)
- You need EBS volumes (databases, high IOPS)
- You run daemon containers (log agents, monitoring sidecars)
- You need privileged mode or custom OS configs
- Workloads are steady-state 24/7 at high utilization
- You need instance types Fargate doesn't match (compute-optimized, memory-optimized)
Golden rule: Start with Fargate. Move to EC2 only when you hit a specific limitation (GPU, EBS, cost, daemons). Don't pre-optimize: Fargate's operational simplicity saves engineering time that often exceeds the compute cost difference.
Assuming GPU Support
Fargate does NOT support GPU workloads. For ML training, inference with GPU, or rendering, you MUST use EC2 launch type with P/G instance families.
Forgetting NAT Gateway
Fargate tasks in private subnets cannot reach the internet (or ECR) without a NAT Gateway or VPC Endpoints. The task never reaches RUNNING: it stalls in PENDING and eventually stops with an image pull error (CannotPullContainerError).
Overprovisioning Resources
Choosing 4 vCPU / 8GB when the app uses 0.5 vCPU / 512MB. Fargate bills per-second, so oversized tasks waste money every second they run.
Expecting EBS
Fargate only has ephemeral storage + EFS. If you need high-IOPS block storage (databases, caches with persistence), use EC2 launch type.
Large Image + Cold Start
Using 2GB+ images on Fargate → 60+ second startup times. Keep images lean (<500MB). Use multi-stage Docker builds to minimize image size.
Daemon Scheduling
Trying to run "one per host" containers (log agents, monitoring) on Fargate fails because there's no concept of "host." Use ECS daemon service with EC2 launch type instead.
Fargate = EC2 without access to EC2. Serverless containers with full VPC networking.
- Behind the scenes: Firecracker micro-VMs on AWS-managed EC2. Each task = isolated VM, own ENI, own IP.
- Networking: awsvpc mode only. Each task gets an ENI in your subnet with security groups. NAT Gateway required for private subnets.
- Resources: Fixed CPU/memory combos. Valid vCPU: 0.25, 0.5, 1, 2, 4, 8, 16. No arbitrary values.
- Startup: ~30-60 seconds (provision + ENI + image pull + start). Not instant like Lambda.
- Storage: Ephemeral 20-200GB (destroyed on stop) + EFS (persistent, shared). NO EBS.
- Limitations: No GPU, no EBS, no daemons, no SSH, no privileged mode, no custom AMI.
- Cost: ~20-40% more than EC2 On-Demand. Wins for variable/burst workloads. Loses for 24/7 steady-state at scale.
- Golden rule: Start with Fargate. Move to EC2 only when you hit a limitation.
Networking & Storage
How your ECS tasks connect to the network determines their security posture, IP behavior, and load balancer integration. ECS supports three networking modes, but for all practical purposes, awsvpc is the only one you should use (and the only one that works with Fargate).
| Network Mode | How It Works | Launch Type | Use Case |
|---|---|---|---|
| awsvpc | Each task gets its own ENI (Elastic Network Interface) with a private IP in your VPC | EC2 + Fargate | Production standard for all new workloads |
| bridge | Tasks share the host's network via Docker bridge. Dynamic port mapping. | EC2 only | Legacy. Only if migrating from Docker Compose |
| host | Task uses the host EC2 instance's network directly. No isolation. | EC2 only | Maximum performance (no NAT overhead). Rare. |
In awsvpc mode, each ECS task gets its own Elastic Network Interface (ENI), a real VPC network interface with a private IP address. This means each task has its own security group, appears as a distinct network entity in your VPC, and can be targeted directly by load balancers. There is no port conflict: every task can listen on the same container port (e.g., 8080) because each has its own IP.
Benefits
- Task-level security groups: different rules per service
- No port conflicts: every task uses port 8080 on its own IP
- VPC Flow Logs per task: full network visibility
- Direct ALB targeting by IP: no dynamic port mapping needed
- Required for Fargate: the only option that works
ENI Limits (EC2 only)
- Each EC2 instance type has a max ENI count
- Each task in awsvpc mode consumes one ENI
- t3.micro: 2 ENIs → only 1 task (1 ENI for the instance itself)
- m5.xlarge: 4 ENIs → 3 tasks max
- ENI trunking (opt-in) increases the limit significantly
- Not a concern with Fargate: AWS manages this
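The ENI arithmetic behind those two examples, as a small sketch. The function name is illustrative; the ENI limits are the ones quoted in the list, and trunking is not modeled.

```python
def max_awsvpc_tasks(instance_eni_limit: int) -> int:
    """Tasks that fit on one EC2 instance in awsvpc mode without ENI trunking.

    One ENI is the instance's own primary interface; each task needs one more.
    """
    if instance_eni_limit < 1:
        raise ValueError("an instance has at least one ENI")
    return instance_eni_limit - 1


# ENI limits quoted in the list above.
for instance_type, eni_limit in {"t3.micro": 2, "m5.xlarge": 4}.items():
    print(instance_type, max_awsvpc_tasks(eni_limit))
```

CPU and memory can still cap the task count below this number; the ENI limit is only an upper bound.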
ENI trunking is an opt-in feature that lets you run more tasks per EC2 instance in awsvpc mode. It creates a "trunk" ENI with multiple "branch" ENIs sharing it. Enable it via account settings: aws ecs put-account-setting --name awsvpcTrunking --value enabled. With trunking, an m5.xlarge can support roughly 20 tasks instead of 3.
Mental model: In awsvpc mode, each task behaves exactly like a standalone EC2 instance from a networking perspective: it has its own private IP, its own security group, its own entry in VPC Flow Logs, and can be directly addressed by other services. The only difference: it is a container, not a VM. This is why awsvpc is required for Fargate: it provides the clean network isolation that serverless containers need.
In awsvpc mode, security groups are assigned at the task level (via the service's network configuration), not at the instance level. This gives you fine-grained control:
Web Tier SG
- Inbound: port 8080 from ALB SG
- Outbound: port 5432 to DB SG
- Outbound: port 443 to internet (HTTPS)
Worker Tier SG
- Inbound: none (pulls from SQS)
- Outbound: port 443 to SQS/S3
- Outbound: port 5432 to DB SG
Database SG
- Inbound: port 5432 from Web SG + Worker SG
- Outbound: none
- Reference SGs by ID (not IP ranges)
The key insight: reference security groups by their group ID, not by IP ranges. Since task IPs change on every restart, IP-based rules would constantly break. SG-to-SG references are stable regardless of task IP churn.
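A sketch of what an SG-to-SG rule looks like as data. The dict follows the IpPermissions shape used by the EC2 AuthorizeSecurityGroupIngress API (for example through boto3's authorize_security_group_ingress); the helper name and the group IDs are hypothetical, and no AWS call is made.

```python
def sg_to_sg_rule(port: int, source_sg_id: str, description: str = "") -> dict:
    """Build one ingress permission allowing TCP `port` from another security group."""
    pair = {"GroupId": source_sg_id}
    if description:
        pair["Description"] = description
    return {
        "IpProtocol": "tcp",
        "FromPort": port,
        "ToPort": port,
        # Referencing a group ID instead of a CIDR keeps the rule valid
        # even when task IPs change.
        "UserIdGroupPairs": [pair],
    }


# Hypothetical group IDs for the web and worker tiers from the example above.
WEB_SG, WORKER_SG = "sg-0aaa1111web", "sg-0bbb2222worker"
db_ingress = [
    sg_to_sg_rule(5432, WEB_SG, "Postgres from web tier"),
    sg_to_sg_rule(5432, WORKER_SG, "Postgres from worker tier"),
]
print(db_ingress)
```

Note that the rule contains no IpRanges entry at all, which is the point: nothing in it depends on a task's current address.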
ECS integrates with ALB (Application Load Balancer) and NLB (Network Load Balancer) through target groups. When you create a service with a load balancer, ECS automatically registers each task's IP:port as a target. When a task is replaced, the old target is deregistered and the new one registered, with no manual steps.
ALB (Layer 7)
- Path-based routing: /api/* → API service, /web/* → frontend
- Host-based routing: api.example.com → one service
- Health checks: HTTP GET /health → 200 OK
- Sticky sessions: route same user to same task
- Target type: ip (required for awsvpc + Fargate)
- Best for: HTTP/HTTPS workloads, microservices
NLB (Layer 4)
- TCP/UDP pass-through: no HTTP awareness
- Ultra-low latency: millions of requests/sec
- Static IP: one IP per AZ (great for whitelisting)
- TLS termination or pass-through
- Target type: ip (for awsvpc mode)
- Best for: gRPC, WebSocket, non-HTTP protocols
Target type must be ip for Fargate. The ALB target group must use target type: ip (not instance). With awsvpc mode, ECS registers the task's ENI IP directly. A target group of type instance cannot hold awsvpc tasks, so ECS rejects the service configuration with an error saying the target type is incompatible with the awsvpc network mode.
For service-to-service communication without a load balancer, ECS integrates with AWS Cloud Map to provide DNS-based service discovery. When a task starts, it registers a DNS record (e.g., web-api.production.local). Other services resolve this name to get the task's current IP address. When the task stops, the record is removed.
Cloud Map (Service Discovery)
- DNS A records pointing to task IPs
- Private DNS namespace (e.g., production.local)
- Auto-register on task start, deregister on stop
- Health checks to remove unhealthy instances
- Works with both EC2 and Fargate
Service Connect (newer, simpler)
- Built on Cloud Map + Envoy proxy sidecar
- Service-to-service via logical names (not IPs)
- Automatic retries, timeouts, circuit breaking
- Traffic metrics out of the box
- Recommended over raw Cloud Map for new services
Service Connect is the newer recommended approach. It injects an Envoy sidecar proxy into your tasks automatically. Your code calls http://web-api:8080, and the proxy handles discovery, load balancing, retries, and telemetry. Think of it as a lightweight service mesh managed by ECS, without Kubernetes or App Mesh complexity.
Service-to-service communication: Cloud Map eliminates hardcoded IPs entirely. Example: your orders service calls http://payments.production.local:8080/charge → DNS resolves to the current task IP. No load balancer needed for internal calls. For exam: if the question says "internal service communication without ALB" → Service Discovery via Cloud Map.
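From the caller's side, Cloud Map discovery is an ordinary DNS lookup. A minimal sketch using only the standard library; the helper name is illustrative, and "localhost" stands in for a Cloud Map name so the snippet runs outside a VPC.

```python
import socket


def resolve_task_ips(hostname: str) -> list[str]:
    """Return the IPv4 addresses currently published for a service name."""
    infos = socket.getaddrinfo(hostname, None, family=socket.AF_INET,
                               type=socket.SOCK_STREAM)
    # Each entry is (family, type, proto, canonname, (ip, port)); keep unique IPs.
    return sorted({info[4][0] for info in infos})


# Inside the VPC this would be a Cloud Map name such as
# "payments.production.local"; "localhost" is used so the sketch runs anywhere.
print(resolve_task_ips("localhost"))
```

Because records change as tasks start and stop, callers should resolve the name per connection (or honor the DNS TTL) instead of caching an IP for the life of the process.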
ECS containers need storage for application data, temp files, shared state, and logs. The options depend heavily on your launch type:
| Storage Type | Persistence | EC2 | Fargate | Shared Across Tasks | Best For |
|---|---|---|---|---|---|
| Ephemeral (container layer) | Deleted on task stop | Yes | Yes (20-200GB) | No | Temp files, caches, scratch space |
| EBS Volume | Persists beyond task lifecycle | Yes | No | No (one AZ only) | Database data, stateful single-task workloads |
| EFS (Elastic File System) | Persistent, durable | Yes | Yes | Yes (multi-AZ, multi-task) | Shared config, ML models, CMS uploads |
| Instance Store (NVMe) | Lost on instance stop/terminate | Yes | No | No | High-IOPS scratch (ML training, video encode) |
| Docker Volumes | Depends on driver | Yes | No | Between containers in same task | Sidecar data sharing within a task |
Amazon EFS is the most important storage integration for ECS because it works with both EC2 and Fargate and supports concurrent access from multiple tasks across multiple AZs. Mount an EFS file system in your task definition, and every task gets read/write access to the same files, no matter which AZ it runs in.
When to Use EFS
- Shared configuration files across multiple tasks
- ML model files (load once, serve from many tasks)
- CMS file uploads (WordPress media, user uploads)
- Log aggregation (multiple writers, one reader)
- Any workload needing shared persistent storage on Fargate
Gotchas
- Latency: EFS is network-attached, so latency is higher than local SSD
- Throughput: scales with data stored (or use provisioned throughput)
- Cost: $0.30/GB-month (standard). Use Infrequent Access for cold data
- Security: must configure SG to allow NFS (port 2049) from task SG
- IAM auth: use EFS access points for per-task directory isolation
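A sketch of how an EFS mount is expressed in a task definition, written as Python dicts so the pieces are easy to see. The field names follow the ECS task definition schema (efsVolumeConfiguration, authorizationConfig, mountPoints); the helper names and all IDs are hypothetical placeholders.

```python
def efs_volume(name: str, file_system_id: str, access_point_id: str) -> dict:
    """Task-definition `volumes` entry that mounts EFS through an access point."""
    return {
        "name": name,
        "efsVolumeConfiguration": {
            "fileSystemId": file_system_id,
            "transitEncryption": "ENABLED",  # must be on when an access point is used
            "authorizationConfig": {
                "accessPointId": access_point_id,
                "iam": "ENABLED",  # authorize the mount with the task role
            },
        },
    }


def mount_point(volume_name: str, container_path: str) -> dict:
    """Container-definition `mountPoints` entry that attaches the volume."""
    return {"sourceVolume": volume_name, "containerPath": container_path,
            "readOnly": False}


# Hypothetical IDs for illustration only.
volume = efs_volume("uploads", "fs-0123456789abcdef0", "fsap-0123456789abcdef0")
print(volume, mount_point("uploads", "/mnt/uploads"))
```

The volume is declared once per task definition; each container that needs it adds its own mount point referencing the volume by name.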
Fargate ephemeral storage: each Fargate task gets 20GB of ephemeral storage by default (stored on the micro-VM's local disk). You can configure up to 200GB in the task definition. This data is fast (local NVMe) but deleted when the task stops. Use it for temp files, build artifacts, or caching, not for anything you need to persist.
A common file-processing pattern: API tasks accept uploads and write to EFS, while processor tasks read from EFS and generate thumbnails or transcodes. EFS is shared across all tasks and all AZs, so there is no need to copy files between tasks or use S3 as an intermediary for simple file sharing.
- awsvpc = required for Fargate. If using Fargate, awsvpc is the only networking mode. If the exam says "bridge mode" + Fargate → that's impossible.
- ALB target type must be ip for Fargate. Not instance. This is a common configuration error tested in exams.
- "Need shared storage across Fargate tasks" → EFS. It's the only persistent shared storage that works with Fargate.
- "Need persistent block storage" → EBS, which means EC2 launch type only. Fargate ephemeral storage is deleted on stop.
- "Tasks can't communicate with each other" → Check security groups. In awsvpc mode, each task has its own SG. The SG must allow the needed ports.
- ENI limits on EC2: each awsvpc task uses one ENI. Small instances (t3.micro) may only support 1 task. Enable ENI trunking for more.
- Service Discovery vs ALB: Use ALB for external-facing traffic. Use Cloud Map/Service Connect for internal service-to-service calls.
- Fargate ephemeral storage: 20GB default, configurable up to 200GB. Fast (local NVMe) but non-persistent.
- EFS security: task SG must allow outbound to port 2049 (NFS). EFS SG must allow inbound port 2049 from task SG.
- awsvpc: production standard. Each task gets own ENI + private IP + security group. Required for Fargate.
- Security groups: applied per task (not per instance). Reference by SG ID, not IP ranges.
- ALB: target type must be ip for Fargate/awsvpc. Auto-registers task IPs in target group.
- Service Discovery: Cloud Map provides DNS records per task. Service Connect adds Envoy proxy for retries/metrics.
- EFS: shared persistent storage across tasks and AZs. Works with both EC2 and Fargate.
- EBS: persistent block storage, EC2 only. Single-AZ. For stateful single-instance workloads.
- Fargate ephemeral: 20-200GB, fast NVMe, deleted on task stop. Great for temp/scratch data.
- ENI trunking: opt-in to run more awsvpc tasks per EC2 instance by sharing trunk ENI.
Capacity Providers
A capacity provider is the bridge between your ECS tasks and the infrastructure they run on. It answers a simple question: "When ECS needs to launch a new task, where does the compute come from?" Without capacity providers you must manually ensure enough EC2 instances exist. With them, ECS automatically provisions capacity, either by scaling an Auto Scaling Group (EC2) or by simply requesting Fargate resources from AWS.
FARGATE
Built-in. AWS provisions compute per task. No configuration needed: always available by default.
FARGATE_SPOT
Built-in. Same as Fargate but uses spare capacity at up to 70% discount. Tasks can be interrupted with 2-minute warning.
ASG Capacity Provider
Links an Auto Scaling Group to ECS. When tasks need capacity, ECS tells the ASG to scale out. You manage the instance fleet.
The FARGATE and FARGATE_SPOT capacity providers are built into ECS, so you don't create them. They are available on every cluster. Fargate is the default: every task you launch on Fargate uses this provider unless you configure otherwise.
Fargate (On-Demand)
- Always available: AWS guarantees capacity
- No interruptions: task runs until it exits or you stop it
- Full per-second billing for vCPU + memory
- Use for: production web services, customer-facing APIs
Fargate Spot
- Up to 70% cheaper than on-demand Fargate
- Uses spare AWS capacity: can be reclaimed anytime
- 2-minute warning (delivered as SIGTERM) before task is terminated
- ECS service auto-replaces interrupted tasks, following the service's capacity provider strategy
- Use for: batch jobs, queue workers, data processing
Fargate Spot interruption handling: When AWS reclaims your Spot task, you get a 2-minute warning and ECS sends SIGTERM. The container then has its stopTimeout (default 30 seconds, configurable up to 120) before SIGKILL. Your app should handle SIGTERM gracefully (finish current work, checkpoint state). The ECS service launches a replacement according to its capacity provider strategy, so desired count recovers once capacity is available. It does not switch Spot tasks to on-demand Fargate on its own; keep a base on FARGATE if you need a guaranteed minimum.
Python: SIGTERM Handler
```python
import signal
import sys


def graceful_shutdown(signum, frame):
    print("SIGTERM received, finishing work...")
    # flush queues, save checkpoint, close DB
    sys.exit(0)


signal.signal(signal.SIGTERM, graceful_shutdown)
```
Node.js: SIGTERM Handler
```javascript
process.on('SIGTERM', async () => {
  console.log('SIGTERM received, draining...');
  server.close(); // stop accepting new requests
  await flushQueues();
  await db.close();
  process.exit(0);
});
```
For the EC2 launch type, capacity providers link your ECS cluster to an Auto Scaling Group. This enables Cluster Auto Scaling (CAS): when ECS needs to place tasks but no instance has enough room, CAS triggers the ASG to launch new instances. When instances are underutilized, CAS scales them in. This eliminates manual capacity planning.
How CAS Works
- 1. ECS receives a task placement request
- 2. No instance has enough CPU/memory available
- 3. CAS calculates how many instances are needed
- 4. CAS sets the ASG's desired count → ASG launches instances
- 5. New instances register with ECS → tasks placed
- Scale-in: ECS drains tasks first → then terminates instance
Configuration
- Target capacity %: how full instances should be
- 100% = pack instances fully (maximize density)
- 80% = leave 20% headroom for burst (faster placement)
- Managed scaling: on/off toggle for CAS
- Managed termination protection: prevents ASG from terminating instances that still have running tasks
The target capacity % is the most important CAS parameter. Set it to 100% for maximum cost efficiency (every instance fully packed, but new tasks wait for scale-out). Set it to 70-80% for responsiveness (headroom means tasks place instantly, but you pay for idle capacity).
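To see what the target capacity percentage does to fleet size, here is a simplified sketch. It only models the proportional idea (instances needed divided by the target fraction); the real CAS logic runs through a CloudWatch metric and scaling policy, and the function name is illustrative.

```python
def instances_for_target(instances_needed: int, target_capacity_pct: int) -> int:
    """Instance count the ASG should run so utilization lands at the target.

    instances_needed is how many fully packed instances the current tasks
    require. With a target below 100 the result includes spare headroom.
    """
    if not 1 <= target_capacity_pct <= 100:
        raise ValueError("target capacity must be between 1 and 100")
    # Integer ceiling of instances_needed * 100 / target_capacity_pct.
    return -(-(instances_needed * 100) // target_capacity_pct)


print(instances_for_target(8, 100))  # 8
print(instances_for_target(8, 80))   # 10
```

At 80% the fleet carries two extra instances for the same tasks: that is the idle headroom you pay for in exchange for instant placement.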
A capacity provider strategy defines how tasks are distributed across multiple capacity providers. You assign weights and an optional base count per provider. This is how you build hybrid workloads: for example, "run 2 tasks on on-demand Fargate as baseline, then spread additional tasks 80% to Fargate Spot and 20% to on-demand."
| Strategy Example | Provider | Base | Weight | Behavior |
|---|---|---|---|---|
| Cost-optimized batch | FARGATE_SPOT | 0 | 4 | 80% of tasks → Spot (cheap) |
| | FARGATE | 0 | 1 | 20% on-demand (fallback) |
| HA web service | FARGATE | 2 | 1 | Always 2 on-demand tasks (base) |
| | FARGATE_SPOT | 0 | 3 | Extra tasks 75% Spot (save $) |
| Hybrid EC2 + overflow | ASG (EC2 RI) | 0 | 3 | 75% on Reserved EC2 instances |
| | FARGATE | 0 | 1 | 25% overflow to Fargate (burst) |
The base count guarantees a minimum number of tasks on that provider (placed first, before weights apply). After base is filled, additional tasks distribute according to the weight ratio. This gives you predictable baseline capacity with elastic overflow.
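The base-then-weight rule can be sketched in a few lines. This is an illustration of the ratio, not ECS's actual placement code: the function name is made up, and the tie-breaking and rounding choices here are assumptions.

```python
from fractions import Fraction


def distribute_tasks(desired: int, strategy: list[dict]) -> dict[str, int]:
    """Split `desired` tasks across providers using base counts, then weights."""
    placed = {item["provider"]: 0 for item in strategy}
    remaining = desired

    # 1. Fill each provider's base first.
    for item in strategy:
        take = min(item.get("base", 0), remaining)
        placed[item["provider"]] += take
        remaining -= take

    # 2. Hand out the rest one task at a time, always to the provider that is
    #    furthest below its weighted share (ties go to the earlier entry).
    weighted = [item for item in strategy if item.get("weight", 0) > 0]
    extra = {item["provider"]: 0 for item in weighted}
    while remaining > 0 and weighted:
        best = min(weighted,
                   key=lambda item: Fraction(extra[item["provider"]] + 1, item["weight"]))
        extra[best["provider"]] += 1
        placed[best["provider"]] += 1
        remaining -= 1
    return placed


# The "HA web service" row from the table above.
ha_web = [
    {"provider": "FARGATE", "base": 2, "weight": 1},
    {"provider": "FARGATE_SPOT", "base": 0, "weight": 3},
]
print(distribute_tasks(10, ha_web))
```

For 10 tasks this puts 2 base tasks on FARGATE, then splits the remaining 8 in a 1:3 ratio, giving 4 on-demand and 6 Spot.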
One cost-focused pattern for batch workloads combines all three: EC2 Spot gives the deepest discount (up to 90%), Fargate Spot adds overflow without managing instances, and 2 on-demand Fargate tasks guarantee a minimum processing rate even during Spot capacity shortages.
- FARGATE and FARGATE_SPOT are built-in, so you don't create them. They exist on every cluster.
- "Reduce cost for batch processing on ECS" → Fargate Spot (up to 70% off) or EC2 Spot ASG capacity provider (up to 90% off).
- "Ensure minimum availability while minimizing cost" → Capacity provider strategy with base on FARGATE (guaranteed) and weight on FARGATE_SPOT (cheap excess).
- Cluster Auto Scaling (CAS): only works with EC2 launch type via ASG capacity provider. Fargate doesn't need CAS because AWS handles capacity.
- Target capacity % = 100% means "pack instances fully before scaling out." 80% means "keep headroom for faster placement."
- Managed termination protection prevents ASG from terminating instances that still have running ECS tasks. Always enable this.
- Fargate Spot interruption: 2-minute warning, then SIGTERM → stopTimeout → SIGKILL. Service replaces tasks per its capacity provider strategy. Design for graceful shutdown.
- Distractor: "Fargate Spot is the same as EC2 Spot." False. Fargate Spot discount is ~70%, EC2 Spot can reach ~90%. EC2 Spot also has diversified instance fleets for better availability.
- Capacity providers: bridge between ECS and compute. Fargate (on-demand) · Fargate Spot (70% off) · ASG (EC2, managed by CAS).
- Capacity provider strategy: base (guaranteed) + weight (ratio). Distribute tasks across providers for cost/HA balance.
- Cluster Auto Scaling (CAS): auto-scales EC2 instances based on task demand. Target capacity % controls utilization.
- Fargate Spot: up to 70% cheaper. 2-minute SIGTERM before termination. Service auto-replaces interrupted tasks.
- EC2 Spot via ASG: up to 90% cheaper. CAS manages the ASG. Managed termination protection keeps instances with running tasks from being terminated during scale-in.
- Hybrid pattern: baseline on-demand (guaranteed) + Spot overflow (cheap). Best cost-to-availability ratio for batch.
Scaling & Deployment
ECS Service Auto Scaling adjusts the desired task count automatically based on CloudWatch metrics. It uses Application Auto Scaling, the same system that scales DynamoDB tables and Aurora replicas. You define a target value (e.g., "keep average CPU at 70%"), and the system adds or removes tasks to maintain it.
Target Tracking
- Set target: "Average CPU = 70%"
- System auto-creates CloudWatch alarms
- Scales out when above, in when below
- Simplest, most common approach
- Supported metrics: CPU, Memory, ALB request count
Step Scaling
- Define steps: "CPU 70-80% → add 1, 80-90% → add 3, 90%+ → add 5"
- More control over scaling aggressiveness
- Requires manual CloudWatch alarm setup
- Good for: bursty workloads needing fast scale-out
Scheduled Scaling
- "Scale to 20 tasks at 9am, back to 5 at 6pm"
- Cron-based, predictable patterns
- Use with target tracking (scheduled sets min, TT adjusts within range)
- Good for: known traffic patterns (business hours, events)
| Scaling Metric | Target Suggestion | When to Use |
|---|---|---|
| ECSServiceAverageCPUUtilization | 60-75% | CPU-bound workloads (computation, encoding) |
| ECSServiceAverageMemoryUtilization | 70-80% | Memory-bound (caching, JVM, data processing) |
| ALBRequestCountPerTarget | 1000 req/target | Request-driven APIs (scale per request volume) |
| Custom CloudWatch metric | Varies | SQS queue depth, business metric, latency P99 |
Scale on ALBRequestCountPerTarget for web APIs, not CPU. Web APIs often have low CPU but high request count. If you scale on CPU alone, you'll be under-provisioned: requests queue up and latency spikes before CPU triggers. ALBRequestCountPerTarget scales based on actual request volume, which directly correlates with user experience.
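A target tracking policy is a small piece of configuration. The sketch below builds the keyword arguments for Application Auto Scaling's PutScalingPolicy call as a plain dict, so nothing is sent to AWS. The helper name, cluster and service names, cooldown values, and the resource label are illustrative assumptions.

```python
from typing import Optional


def target_tracking_policy(cluster: str, service: str, metric_type: str,
                           target_value: float,
                           resource_label: Optional[str] = None) -> dict:
    """Build keyword arguments for Application Auto Scaling's PutScalingPolicy."""
    metric_spec = {"PredefinedMetricType": metric_type}
    if resource_label is not None:
        # Needed only for ALBRequestCountPerTarget; identifies the ALB + target group.
        metric_spec["ResourceLabel"] = resource_label
    return {
        "PolicyName": f"{service}-{metric_type}",
        "ServiceNamespace": "ecs",
        "ResourceId": f"service/{cluster}/{service}",
        "ScalableDimension": "ecs:service:DesiredCount",
        "PolicyType": "TargetTrackingScaling",
        "TargetTrackingScalingPolicyConfiguration": {
            "TargetValue": target_value,
            "PredefinedMetricSpecification": metric_spec,
            "ScaleOutCooldown": 60,   # assumed values; tune per workload
            "ScaleInCooldown": 300,
        },
    }


cpu_policy = target_tracking_policy("prod", "web-api",
                                    "ECSServiceAverageCPUUtilization", 70.0)
print(cpu_policy["ResourceId"])
```

The service must also be registered as a scalable target (with min and max task counts) before a policy like this can be attached.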
When you update a service (new image version, config change), ECS must replace old tasks with new ones. How it does this determines whether your users experience downtime, mixed versions, or seamless updates.
Rolling Update (default)
- ECS launches new tasks → waits for health check → drains old tasks
- Controlled by minimumHealthyPercent and maximumPercent
- minimumHealthyPercent: 100 = never go below desired count (add new before removing old)
- maximumPercent: 200 = can double task count temporarily during deploy
- No additional cost (uses ECS built-in controller)
- Rollback: manual (deploy previous revision)
Blue/Green (CodeDeploy)
- Two target groups: blue (current) and green (new)
- CodeDeploy shifts traffic: 100% blue → 100% green
- Options: all-at-once, linear (10% every 5min), canary (10% → 100%)
- Instant rollback: shift traffic back to blue
- Both old and new tasks run simultaneously
- Requires ALB with two target groups + CodeDeploy setup
Blue/Green = true zero-downtime: Unlike rolling updates where old+new tasks coexist briefly, Blue/Green keeps the full blue fleet running until green is 100% validated. Rollback is instant: just flip the ALB listener back. For exam: if a question requires "zero-downtime deployment with instant rollback" → Blue/Green with CodeDeploy. If it says "simplest deployment" → Rolling Update (default, no extra setup).
| minimumHealthyPercent | maximumPercent | Behavior | Best For |
|---|---|---|---|
| 100% | 200% | Launch new tasks first, then drain old. Never below desired count. Temporarily doubles cost. | Production services needing zero-downtime |
| 50% | 100% | Stop half the old tasks, then start new ones. Brief capacity reduction. | Cost-sensitive, can tolerate brief capacity dip |
| 0% | 100% | Stop ALL old tasks, then start new. Full downtime during deploy. | Dev/staging only. Never production. |
| 100% | 150% | Launch 50% new tasks, drain some old, repeat. Moderate overhead. | Balance between speed and cost |
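The two percentages translate into concrete task-count limits during a deploy. A small sketch, assuming the documented rounding (lower limit rounds up, upper limit rounds down); the function name is illustrative.

```python
def rolling_update_bounds(desired: int, min_healthy_pct: int, max_pct: int) -> tuple[int, int]:
    """Return (fewest, most) tasks allowed to run while a rolling update is in progress."""
    fewest = -(-(desired * min_healthy_pct) // 100)  # integer ceiling
    most = (desired * max_pct) // 100                # integer floor
    return fewest, most


# The four rows of the table above, for a service with 4 desired tasks.
for min_pct, max_pct in [(100, 200), (50, 100), (0, 100), (100, 150)]:
    print(min_pct, max_pct, rolling_update_bounds(4, min_pct, max_pct))
```

For 4 desired tasks, 100/200 allows between 4 and 8 running tasks, while 50/100 allows the service to dip to 2.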
Full-stop deployments: a minimumHealthyPercent of 0% stops every old task before any new one starts, so the service is down for the duration of the deploy. Keep that setting for dev and staging. For Fargate zero-downtime deploys, use 100%/200% or Blue/Green with CodeDeploy.
The deployment circuit breaker automatically detects when a deployment is failing (new tasks keep crashing) and rolls back to the previous stable version. Without it, a bad deployment loops endlessly: ECS launches new task → task crashes → ECS launches another → crashes → repeat forever, burning compute.
How It Works
- ECS monitors new tasks during deployment
- If tasks repeatedly fail to reach RUNNING state...
- Circuit breaker triggers: stops launching new tasks
- If rollback: true → automatically reverts to last stable
- Based on failure threshold (number of consecutive task failures)
Configuration
- Enable: deploymentCircuitBreaker: {enable: true, rollback: true}
- Works with the ECS rolling update deployment type (CodeDeploy blue/green has its own alarm-based rollback)
- Always enable for production. Default is disabled.
- Failure reasons detected: OOM, crash loop, health check failure
Always enable deployment circuit breaker with rollback in production. Without it, a bad image tag or misconfigured environment variable causes infinite task restarts. Your service degrades while ECS keeps trying to deploy the broken version. Circuit breaker + rollback stops the loop after a bounded number of failed tasks and reverts automatically.
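Put together, the rolling-update settings form one deploymentConfiguration block on the service. A sketch of that block as a Python dict, using the field names from the ECS API; the helper name and defaults are illustrative choices.

```python
def deployment_configuration(min_healthy_pct: int = 100, max_pct: int = 200,
                             rollback: bool = True) -> dict:
    """Build the deploymentConfiguration block for an ECS rolling-update service."""
    if not 0 <= min_healthy_pct <= 100:
        raise ValueError("minimumHealthyPercent must be between 0 and 100")
    if max_pct < 100:
        raise ValueError("maximumPercent must be at least 100")
    return {
        "minimumHealthyPercent": min_healthy_pct,
        "maximumPercent": max_pct,
        "deploymentCircuitBreaker": {"enable": True, "rollback": rollback},
    }


print(deployment_configuration())
```

The defaults here match the zero-downtime row (100%/200%) with automatic rollback turned on.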
- "Zero downtime deployment" → Rolling update with minimumHealthyPercent=100%, maximumPercent=200%. Or Blue/Green with CodeDeploy.
- "Automatically rollback failed deployments" → Deployment circuit breaker with rollback=true. Or CodeDeploy blue/green with automatic rollback alarm.
- "Scale based on SQS queue depth" → Custom CloudWatch metric for ApproximateNumberOfMessagesVisible, step scaling policy.
- Service Auto Scaling ≠ Cluster Auto Scaling. Service AS changes task count. Cluster AS (CAS) changes EC2 instance count. They work together but are separate.
- Target tracking is "set and forget." You specify the target value; AWS creates and manages the CloudWatch alarms. Step scaling requires you to create alarms manually.
- Blue/Green requires a load balancer with two target groups. ALB is the usual choice; NLB is supported but less common. Cannot do blue/green without a load balancer.
- "Gradual traffic shift" → CodeDeploy canary or linear deployment. Not possible with ECS rolling update (which is binary per task).
- Cooldown period: default 300s between scaling actions. Too short = oscillation (scale up/down/up/down). Too long = slow reaction.
- Distractor: "ECS rolling update supports canary deployment." False. Canary requires CodeDeploy blue/green with traffic shifting.
- Service Auto Scaling: target tracking (CPU, memory, ALB requests, custom) adjusts task count automatically.
- Scale on ALBRequestCountPerTarget for APIs, not CPU. Request volume correlates better with user experience.
- Rolling update: minHealthy=100%, max=200% → zero downtime. New tasks must pass health check before old ones drain.
- Blue/Green (CodeDeploy): two target groups. Traffic shift: all-at-once, linear, or canary. Instant rollback to blue.
- Circuit breaker: detects failed deployments, auto-reverts. Always enable in production.
- Scheduled scaling: predictable patterns (business hours). Combine with target tracking for best results.
Integrations
Amazon ECR (Elastic Container Registry) is a fully managed Docker container image registry. It stores, manages, and deploys your container images. ECS pulls images from ECR during task launch; this is the standard production pattern. ECR integrates with IAM for access control, encrypts images at rest, and scans for known vulnerabilities.
Key Features
- Private repositories: IAM-based access, no public exposure
- Image scanning: Basic scanning (free, on push, Clair-based CVE detection). Enhanced scanning (uses Amazon Inspector, continuous, per-image cost)
- Lifecycle policies: auto-delete untagged/old images (save cost)
- Cross-region replication: replicate images to multiple regions
- Image immutability: prevent tag overwrites (tag=v1 always same image)
ECS + ECR Flow
- 1. Build: docker build -t my-api:v2 .
- 2. Tag: docker tag my-api:v2 123456.dkr.ecr.us-east-1.amazonaws.com/my-api:v2
- 3. Auth: aws ecr get-login-password | docker login...
- 4. Push: docker push 123456.dkr.ecr.../my-api:v2
- 5. ECS task definition references the ECR image URI
- 6. ECS Execution Role must have ecr:GetAuthorizationToken + ecr:BatchGetImage + ecr:GetDownloadUrlForLayer
When ECS launches a task, the image pull follows a precise sequence. Understanding this flow helps debug "CannotPullContainerError", the most common task startup failure:
Most common fix: If tasks fail with CannotPullContainerError → (1) verify the Execution Role has ecr:GetAuthorizationToken + ecr:BatchGetImage + ecr:GetDownloadUrlForLayer, (2) ensure the task's subnet has a NAT Gateway or VPC endpoints for ECR (com.amazonaws.region.ecr.dkr + com.amazonaws.region.ecr.api, plus an S3 gateway endpoint, because image layers are served from S3).
ECS services integrate with ALB and NLB via target groups. When a task starts, ECS automatically registers it with the target group. When a task stops, ECS deregisters it after the ALB drains active connections. For Fargate (awsvpc mode), the target type must be ip (not instance).
| Feature | ALB (Application) | NLB (Network) |
|---|---|---|
| Layer | Layer 7 (HTTP/HTTPS) | Layer 4 (TCP/UDP/TLS) |
| Routing | Path-based, host-based, header-based | Port-based only |
| Health checks | HTTP GET /health (path + status code) | TCP connect or HTTP |
| WebSocket | Yes (native support) | Yes (TCP passthrough) |
| Static IP | No (DNS only, IPs change) | Yes (Elastic IP per AZ) |
| Sticky sessions | Yes (cookie-based) | Source-IP based only (no cookies) |
| Best for ECS | REST APIs, web apps, microservices | gRPC, real-time, extreme throughput |
ALB path-based routing is the standard pattern for ECS microservices. One ALB, multiple listener rules: /api/users/* → user-service target group, /api/orders/* → order-service target group. Each service registers its own target group. This avoids one-LB-per-service cost while keeping services independently deployable.
ECS tasks use two different IAM roles. Confusing them is one of the most common ECS mistakes and a frequent exam question.
Task Execution Role
- Who uses it: ECS agent (not your application)
- Purpose: pull images, push logs, read secrets
- Permissions needed:
  - ecr:GetAuthorizationToken
  - ecr:BatchGetImage
  - logs:CreateLogStream
  - logs:PutLogEvents
  - ssm:GetParameters (if injecting from Parameter Store)
  - secretsmanager:GetSecretValue (if injecting secrets)
- AWS provides managed policy: AmazonECSTaskExecutionRolePolicy
Task Role
- Who uses it: your application code (inside the container)
- Purpose: access AWS services from your app
- Examples:
  - s3:PutObject (upload files)
  - dynamodb:PutItem (write data)
  - sqs:SendMessage (queue messages)
  - sns:Publish (send notifications)
- Follow least privilege: only what your app actually needs
- Credentials are served by the ECS container credentials endpoint, not the EC2 instance metadata endpoint (the SDK discovers them automatically)
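The split between the two roles is easiest to see as two separate policy documents. A minimal sketch: the helper is illustrative, the action lists come from the bullets above, and the bucket ARN is a hypothetical example.

```python
import json


def policy(*actions: str, resource: str = "*") -> dict:
    """Wrap a list of IAM actions in a minimal single-statement policy document."""
    return {
        "Version": "2012-10-17",
        "Statement": [{"Effect": "Allow", "Action": list(actions), "Resource": resource}],
    }


# Execution role: used by ECS itself to pull the image and ship logs.
execution_role_policy = policy(
    "ecr:GetAuthorizationToken",
    "ecr:BatchGetImage",
    "ecr:GetDownloadUrlForLayer",
    "logs:CreateLogStream",
    "logs:PutLogEvents",
)

# Task role: used by the application. The bucket ARN is a hypothetical example.
task_role_policy = policy("s3:PutObject", resource="arn:aws:s3:::example-uploads/*")

print(json.dumps(task_role_policy, indent=2))
```

If the app cannot write to S3, look at the task role. If the task cannot start at all (image pull, log stream, secret injection), look at the execution role.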
Never hardcode secrets (database passwords, API keys) in your Docker image or task definition environment variables. Instead, reference them from AWS Secrets Manager or SSM Parameter Store. ECS injects the secret value at task launch time β your container sees the value as a regular environment variable, but the actual secret never appears in the task definition.
Secrets Manager
- Designed specifically for secrets (credentials, tokens, keys)
- Automatic rotation (Lambda-based, $0.40/secret/month)
- Reference in task def: "valueFrom": "arn:aws:secretsmanager:..."
- Execution Role needs secretsmanager:GetSecretValue
SSM Parameter Store
- Config values + secrets (Standard tier free, up to 10K params)
- SecureString type encrypts with KMS
- Reference: "valueFrom": "arn:aws:ssm:...:parameter/db_host"
- Execution Role needs ssm:GetParameters
- Free for standard params (cheaper than Secrets Manager)
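As a rough sketch of how the two reference styles look side by side, the container definition below uses a `secrets` list with `valueFrom` ARNs. The account ID, region, secret name, image, and the `required_execution_role_actions` helper are all hypothetical; the helper simply derives which read permissions the Execution Role would need from the ARNs.

```python
# Hypothetical container definition fragment; ARNs and names are placeholders.
container_definition = {
    "name": "my-app",
    "image": "my-app:latest",
    "environment": [{"name": "LOG_LEVEL", "value": "info"}],  # non-sensitive config only
    "secrets": [
        {
            "name": "DB_PASSWORD",
            "valueFrom": "arn:aws:secretsmanager:us-east-1:123456789012:secret:prod/db-password",
        },
        {
            "name": "DB_HOST",
            "valueFrom": "arn:aws:ssm:us-east-1:123456789012:parameter/db_host",
        },
    ],
}

# Read action the Execution Role needs for each backing service.
ACTION_BY_SERVICE = {
    "secretsmanager": "secretsmanager:GetSecretValue",
    "ssm": "ssm:GetParameters",
}


def required_execution_role_actions(container: dict) -> set:
    """Collect the read actions implied by the container's secret references."""
    actions = set()
    for secret in container.get("secrets", []):
        service = secret["valueFrom"].split(":")[2]  # arn:partition:service:...
        if service not in ACTION_BY_SERVICE:
            raise ValueError(f"unsupported secret source: {service}")
        actions.add(ACTION_BY_SERVICE[service])
    return actions


print(sorted(required_execution_role_actions(container_definition)))
```

Inside the container both values appear as ordinary environment variables (`DB_PASSWORD`, `DB_HOST`); only the references live in the task definition.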
ECS containers send logs to CloudWatch via the awslogs log driver. Each container gets its own log stream within a log group. Container Insights provides CPU, memory, network, and disk metrics at the task and container level, which is critical for troubleshooting and capacity planning.
awslogs Driver
- Configured in task definition per container
- Options: awslogs-group, awslogs-region, awslogs-stream-prefix
- Log stream name: prefix/container-name/task-id
- Execution Role needs logs:CreateLogStream, logs:PutLogEvents
- Set log group retention (default: never expires, so storage cost keeps growing)
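The driver options listed above fit into a container's `logConfiguration` roughly as follows. This is a sketch with made-up values: the log group `/ecs/my-service`, the region, the prefix, and both helper functions are assumptions for illustration.

```python
def awslogs_config(group: str, region: str, stream_prefix: str) -> dict:
    """Build the logConfiguration block for one container (hypothetical helper)."""
    return {
        "logDriver": "awslogs",
        "options": {
            "awslogs-group": group,
            "awslogs-region": region,
            "awslogs-stream-prefix": stream_prefix,
        },
    }


def log_stream_name(stream_prefix: str, container_name: str, task_id: str) -> str:
    """Compose the stream name in the prefix/container-name/task-id form."""
    return f"{stream_prefix}/{container_name}/{task_id}"


config = awslogs_config("/ecs/my-service", "us-east-1", "ecs")
print(config)
print(log_stream_name("ecs", "my-app", "0123456789abcdef"))
```

Knowing the stream name format makes it straightforward to jump from a task ID in `describe-tasks` output to the matching log stream.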
Container Insights
- Enable per cluster: containerInsights: enabled
- Metrics: CPU/memory utilization per task, per service, per cluster
- Network: bytes in/out, packet errors
- Storage: ephemeral storage utilization (Fargate)
- Costs ~$0.30/task/month (CloudWatch custom metrics pricing)
X-Ray traces requests across your microservices, showing where time is spent, which service is slow, and where errors occur. For ECS, you run the X-Ray daemon as a sidecar container in the same task. Your application sends trace data to the daemon (localhost:2000/udp), and the daemon forwards it to the X-Ray service.
Setup Steps
- 1. Add X-Ray daemon container to task definition (sidecar)
- 2. Image: amazon/aws-xray-daemon
- 3. Port: 2000/UDP
- 4. Task Role needs: xray:PutTraceSegments, xray:PutTelemetryRecords
- 5. Your app uses X-Ray SDK (or OpenTelemetry) to instrument requests
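Steps 1 to 3 amount to appending one more container to the task definition. The sketch below assumes a hypothetical `order-service` task with a single `app` container; the sidecar name and the choice to mark it non-essential are assumptions, not requirements.

```python
import copy

XRAY_SIDECAR = {
    "name": "xray-daemon",  # hypothetical container name
    "image": "amazon/aws-xray-daemon",
    "essential": False,  # assumption: losing tracing should not stop the task
    "portMappings": [{"containerPort": 2000, "protocol": "udp"}],
}


def add_sidecar(task_definition: dict, sidecar: dict) -> dict:
    """Return a copy of the task definition with the sidecar appended."""
    updated = copy.deepcopy(task_definition)
    updated["containerDefinitions"].append(copy.deepcopy(sidecar))
    return updated


task_definition = {
    "family": "order-service",
    "containerDefinitions": [
        {"name": "app", "image": "order-service:latest", "essential": True},
    ],
}

with_tracing = add_sidecar(task_definition, XRAY_SIDECAR)
print([c["name"] for c in with_tracing["containerDefinitions"]])
```

Because both containers share the task's network namespace in awsvpc mode, the application can reach the daemon on localhost:2000/udp without extra configuration.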
What You Get
- Service map: visual graph of all services and their connections
- Latency breakdown: where each millisecond was spent
- Error rates per service
- Trace filtering by URL, status code, duration
- Integration with CloudWatch ServiceLens for unified view
Complete observability stack: CloudWatch Logs (what happened: container stdout/stderr), Container Insights (how it's performing: CPU/memory metrics), X-Ray (where time is spent: distributed traces). For the exam: "how to view container logs" → awslogs driver + CloudWatch. "How to find a slow microservice" → X-Ray. "How to see per-task CPU and memory usage" → Container Insights metrics. Service-level CPU auto scaling itself relies on the built-in ECS service CPU metric, so it works without Container Insights.
- Task Execution Role ≠ Task Role. Execution Role = ECS agent (pull images, push logs, read secrets). Task Role = your application code (DynamoDB, S3, SQS).
- "Container cannot pull image from ECR" → check Execution Role has ecr:GetAuthorizationToken + ecr:BatchGetImage.
- "Application needs to write to S3" → add S3 permissions to the Task Role, not the Execution Role.
- "Inject database password securely" → Secrets Manager or SSM SecureString referenced in task definition. Execution Role needs read permission.
- ALB target type must be ip for Fargate (awsvpc mode). The instance target type only works with EC2 launch type bridge/host networking.
- X-Ray for ECS: run the daemon as a sidecar, not as a standalone service. It listens on port 2000/UDP. Task Role needs xray:PutTraceSegments.
- ECR lifecycle policies auto-delete old/untagged images, which prevents storage cost creep. For example, keep only the last 10 tagged images.
- "Logs not appearing in CloudWatch" → check Execution Role has logs:CreateLogStream + logs:PutLogEvents, and check that the log group exists.
- Distractor: "Task Role is needed to pull images from ECR" is false. Image pull uses the Execution Role.
- ECR: managed Docker registry. Build → tag → push → ECS pulls. Enable scanning + lifecycle policies.
- ALB: path-based routing to multiple ECS services via target groups (ip type for Fargate).
- Execution Role vs Task Role: Execution = infrastructure (ECR, logs, secrets). Task = application (S3, DynamoDB, SQS).
- Secrets: inject from Secrets Manager or SSM Parameter Store at task launch. Never hardcode in images.
- CloudWatch: awslogs driver for logs. Container Insights for metrics. Set log retention to avoid cost creep.
- X-Ray: sidecar daemon for distributed tracing. Task Role needs xray permissions.
Architecture Patterns
ECS is flexible enough to support many application styles, from long-running web services to one-shot batch jobs. The key is matching the right ECS features (service vs standalone task, Fargate vs EC2, Spot vs On-Demand) to each workload's requirements.
| Pattern | ECS Feature | Launch Type | Scaling Trigger | Example |
|---|---|---|---|---|
| Microservices | Service + ALB + Service Discovery | Fargate | ALBRequestCountPerTarget | E-commerce (user, order, product services) |
| API Backend | Service + ALB + Auto Scaling | Fargate | ALBRequestCountPerTarget or CPU | Mobile app backend |
| Batch Processing | Standalone task (RunTask API) | Fargate Spot | EventBridge schedule or SQS | Nightly reports, video transcoding |
| Event-Driven | Service + SQS polling | Fargate Spot | SQS queue depth (custom metric) | Order processing, image resizing |
| Scheduled Tasks | RunTask triggered by EventBridge | Fargate Spot | Cron schedule | DB cleanup, daily sync, report generation |
| Web App + API | Service + CloudFront + ALB | Fargate | ALBRequestCountPerTarget | SPA frontend + REST API |
The most common ECS architecture: multiple independent services, each in its own ECS service with its own task definition, scaling policy, and deployment lifecycle. An ALB routes requests by path to the correct target group. Services discover each other via AWS Cloud Map (Service Discovery) for internal communication.
When to Use
- Multiple teams owning different services
- Services scale independently (orders spike on sales, users steady)
- Independent deployment: deploy user-service without touching order-service
- Different tech stacks per service (Node.js, Java, Python in same cluster)
ECS Features Used
- ALB with path-based routing (one LB, many target groups)
- Service Discovery (Cloud Map) for service-to-service calls
- Fargate per-service with independent scaling
- ECR separate repository per service
- Secrets Manager per-service credentials
Use Service Discovery (Cloud Map) for internal calls, ALB for external. Service A calls Service B via DNS: order-service.local:8080 (Cloud Map maintains the DNS records). This avoids routing internal traffic through the ALB (extra hop, extra cost). External traffic still goes ALB → target group → service.
A service polls SQS for messages and processes them. When the queue grows, Auto Scaling adds tasks. When the queue empties, it scales back down. This pattern decouples producers from consumers and handles traffic spikes gracefully: the queue absorbs the burst while consumers process at their own pace.
When to Use
- Async processing: order placed → process payment, send email
- Unpredictable bursts: 10K images uploaded at once → resize queue
- Decoupled: producer doesn't wait for consumer to finish
- Retry built-in: failed messages go to DLQ for investigation
ECS Features Used
- ECS Service with desired count = min workers
- Step Scaling on SQS ApproximateNumberOfMessagesVisible
- Fargate Spot for cost savings (interruptible processing is OK)
- SQS DLQ for failed messages
- Task Role with sqs:ReceiveMessage, sqs:DeleteMessage
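To make the queue-depth scaling idea concrete, here is a simplified model of step scaling that maps the number of visible messages to a task count. The step boundaries, task counts, and limits are invented for illustration, and the model treats each step as an exact capacity, which is only one of the adjustment styles a real policy can use.

```python
from bisect import bisect_right

# Hypothetical steps: (lower bound of visible messages, tasks to run).
STEPS = [(0, 1), (100, 3), (1000, 10), (5000, 25)]


def desired_tasks(visible_messages: int, min_tasks: int = 1, max_tasks: int = 25) -> int:
    """Pick the task count for the step that contains the current queue depth."""
    if visible_messages < 0:
        raise ValueError("queue depth cannot be negative")
    lower_bounds = [bound for bound, _ in STEPS]
    step_index = bisect_right(lower_bounds, visible_messages) - 1
    target = STEPS[step_index][1]
    # Clamp to the service's configured minimum and maximum.
    return max(min_tasks, min(max_tasks, target))


for depth in (0, 150, 2500, 12000):
    print(depth, "messages ->", desired_tasks(depth), "tasks")
```

The minimum corresponds to the service's baseline worker count, so an empty queue scales back down to it rather than to zero.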
One-shot tasks triggered by a schedule (EventBridge cron) or an event. Unlike services, batch tasks run to completion and exit; they are not restarted. Perfect for nightly reports, database migrations, ETL jobs, and data exports.
When to Use
- Scheduled jobs: "run nightly at 2am UTC"
- Finite workloads: process file, generate report, exit
- Cost-sensitive: Fargate Spot for up to 70% savings
- No load balancer needed: tasks run independently
ECS Features Used
- EventBridge rule or Scheduler → ecs:RunTask
- Standalone task (not a service; exits when done)
- Fargate Spot for cost optimization
- EFS for shared data across batch tasks
- CloudWatch Logs for output capture
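A standalone batch run boils down to one RunTask request. The sketch below only builds the request parameters; the cluster name, task definition, subnet and security group IDs, container name, and the `run_task_request` helper are all hypothetical, and actually submitting the request (for example via an SDK's RunTask call) is left out.

```python
def run_task_request(cluster, task_definition, subnets, security_groups,
                     container_name=None, command=None):
    """Build RunTask parameters for a one-shot Fargate Spot task (hypothetical helper)."""
    request = {
        "cluster": cluster,
        "taskDefinition": task_definition,
        "count": 1,
        # A capacity provider strategy is used instead of a launch type.
        "capacityProviderStrategy": [{"capacityProvider": "FARGATE_SPOT", "weight": 1}],
        "networkConfiguration": {
            "awsvpcConfiguration": {
                "subnets": list(subnets),
                "securityGroups": list(security_groups),
                "assignPublicIp": "DISABLED",
            }
        },
    }
    if container_name and command:
        # Optional per-run command override, e.g. a report date argument.
        request["overrides"] = {
            "containerOverrides": [{"name": container_name, "command": list(command)}]
        }
    return request


request = run_task_request(
    "my-cluster", "nightly-report:3",
    subnets=["subnet-aaaa1111"], security_groups=["sg-bbbb2222"],
    container_name="report", command=["python", "report.py", "--date", "2024-01-01"],
)
print(request["capacityProviderStrategy"])
```

An EventBridge schedule would carry equivalent settings in its ECS target configuration, so the same task definition serves both manual and scheduled runs.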
A modern web application with a static frontend (React/Vue SPA) served from S3 + CloudFront, and an API backend running on ECS behind ALB. CloudFront routes /api/* to ALB origin and everything else to S3. This separates the static delivery (CDN-optimized) from the dynamic API (container-optimized).
Frontend (Static)
- React/Vue/Angular SPA built → uploaded to S3
- CloudFront CDN for global low-latency delivery
- Origin Access Control: S3 bucket not publicly accessible
- Cache-Control headers: immutable assets cached at edge
Backend (ECS)
- REST API on ECS Fargate behind ALB
- CloudFront origin: /api/* → ALB
- Auto Scaling on request count per target
- Private subnets: not directly internet accessible
- "Decouple order processing from API" → SQS queue between order-service and order-processor. ECS service polls SQS. Scale on queue depth.
- "Run a task on a schedule" → EventBridge Scheduler rule with ecs:RunTask target. NOT an ECS service (services are long-running). Use Fargate Spot for cost.
- "Service-to-service communication inside ECS" → AWS Cloud Map (Service Discovery). DNS-based: order-svc.local:8080. No ALB needed for internal traffic.
- Service vs Standalone Task: Service = long-running, auto-restarts, load balanced. Task = one-shot, exits when done, no restart.
- "Cheapest way to run batch container jobs" → Fargate Spot + EventBridge trigger. If interruptible, Spot saves up to 70%.
- SQS + ECS scaling: alarm on the SQS ApproximateNumberOfMessagesVisible metric and attach step scaling. Target tracking has no predefined SQS metric; it would need a custom metric such as backlog per task.
- "Static website + API on same domain" → CloudFront + S3 (static) + ALB origin (/api/*). Not served from ECS containers.
- Distractor: "Lambda is always cheaper than Fargate for event processing" is false. For sustained high-throughput queues, Fargate Spot can cost less than millions of Lambda invocations.
- Microservices: ALB path-routing + Service Discovery (Cloud Map). Each service independently deployed and scaled.
- Event-driven: SQS → ECS service polling. Scale on queue depth. Fargate Spot for cost. DLQ for failures.
- Batch/scheduled: EventBridge → RunTask (standalone, not a service). Fargate Spot. Exits when complete.
- Web app: CloudFront → S3 (static), CloudFront → ALB (/api/*) → ECS. Separate static and dynamic delivery.
- Internal comms: Cloud Map DNS for service-to-service. Service Connect (built on Cloud Map + Envoy) is the modern alternative; it adds retries, timeouts, and circuit breaking automatically. ALB only for external traffic.
- Cost pattern: long-running APIs on Fargate On-Demand. Queue workers and batch on Fargate Spot.
Troubleshooting & Observability
When an ECS task stops unexpectedly, ECS records a stopped reason that tells you what went wrong. This is the first place to look when debugging: run aws ecs describe-tasks and check the stoppedReason and containers[].reason fields.
| Stopped Reason | What Happened | Fix |
|---|---|---|
| EssentialContainerExited | A container marked essential: true exited (crashed, exited with non-zero code) | Check container exit code + CloudWatch Logs for stack trace. Fix the application bug. |
| OutOfMemoryError | Container exceeded its memory limit. Killed by OOM killer. | Increase memory in task definition. Check for memory leaks. JVM: set -Xmx to 75% of container memory. |
| CannotPullContainerError | ECS cannot pull image from ECR or Docker Hub. | Check: (1) Image exists in ECR (2) Execution Role has ecr:BatchGetImage (3) VPC has NAT gateway or ECR VPC endpoint (4) Image tag is correct. |
| ResourceInitializationError | Task could not attach ENI (awsvpc) or mount volume. | Check: (1) Subnet has available IPs (2) Security group allows traffic (3) EFS mount target exists in task's AZ. |
| TaskFailedToStart | Task launch failed before any container started. | Usually infrastructure issue: no capacity, ENI limit reached, or secret injection failure. Check Execution Role permissions. |
| AGENT | ECS agent on EC2 instance is unreachable or unhealthy. | EC2 launch type only. Check instance health, ECS agent logs (/var/log/ecs/ecs-agent.log). Restart agent or replace instance. |
| SERVICE_SCHEDULER_INITIATED | Service deliberately stopped the task (deployment, scale-in, health check failure). | Normal during deployments. If unexpected: check ALB health check config, ensure /health endpoint returns 200. |
The debugging command you'll use most: aws ecs describe-tasks --cluster my-cluster --tasks <task-id>. Look at stoppedReason plus each container's reason and exitCode. Exit code 137 = killed by SIGKILL (most often the OOM killer). Exit code 1 = application error. Exit code 0 = normal shutdown.
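Exit codes above 128 follow the common shell convention of 128 plus the signal number, which is why 137 and 143 show up so often. A small decoder along those lines might look like this; the signal table is deliberately partial and the wording of the returned hints is an assumption.

```python
# Partial signal table; numbers follow the usual Linux assignments.
SIGNAL_NAMES = {2: "SIGINT", 9: "SIGKILL", 11: "SIGSEGV", 15: "SIGTERM"}


def describe_exit_code(code: int) -> str:
    """Translate a container exit code into a short troubleshooting hint."""
    if code == 0:
        return "normal shutdown"
    if code > 128:
        signum = code - 128
        name = SIGNAL_NAMES.get(signum, f"signal {signum}")
        if name == "SIGKILL":
            return "killed by SIGKILL (often the OOM killer: check memory limits)"
        if name == "SIGTERM":
            return "stopped by SIGTERM (deployment, scale-in, or Spot interruption)"
        return f"terminated by {name}"
    return "application error (check CloudWatch Logs)"


for exit_code in (0, 1, 137, 143):
    print(exit_code, "->", describe_exit_code(exit_code))
```

Feeding it the `exitCode` values from `describe-tasks` output gives a quick first guess before opening the logs.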
ECS Exec lets you exec into a running container, like docker exec -it, but for containers running on Fargate or EC2. It uses AWS Systems Manager Session Manager under the hood. This is essential for debugging running containers that aren't behaving as expected.
Setup Requirements
- 1. Enable execute command on service: --enable-execute-command
- 2. Task Role needs: ssmmessages:CreateControlChannel, ssmmessages:CreateDataChannel, ssmmessages:OpenControlChannel, ssmmessages:OpenDataChannel
- 3. SSM agent is bundled with Fargate platform version 1.4.0+
- 4. VPC needs NAT gateway or SSM VPC endpoints
Usage
- Open shell: aws ecs execute-command --cluster my-cluster --task <task-id> --container my-app --interactive --command "/bin/sh"
- Check env vars, filesystem, network connectivity
- Test DNS resolution: nslookup order-svc.local
- Check if secrets injected: echo $DB_PASSWORD
- Audit: all exec sessions logged in CloudTrail
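When ECS Exec is scripted, it helps to assemble the CLI invocation as an argument list so that commands containing spaces are quoted correctly. The helper below is a hypothetical wrapper that only builds the command; it uses the same flags shown in the usage list and does not run anything.

```python
import shlex


def execute_command_argv(cluster: str, task_id: str, container: str,
                         command: str = "/bin/sh") -> list:
    """Build the argument list for an interactive ECS Exec session."""
    return [
        "aws", "ecs", "execute-command",
        "--cluster", cluster,
        "--task", task_id,
        "--container", container,
        "--interactive",
        "--command", command,
    ]


argv = execute_command_argv("my-cluster", "0123456789abcdef", "my-app",
                            command="nslookup order-svc.local")
# shlex.join quotes the multi-word command so it can be pasted into a shell.
print(shlex.join(argv))
```

Passing the list to a process runner (rather than a single string) avoids shell-quoting mistakes around the `--command` value.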
Logs (CloudWatch)
- awslogs driver captures stdout/stderr
- Log group: /ecs/my-service
- Log stream: prefix/container/task-id
- Filter patterns for error detection
- Metric filters: count errors → alarm
Metrics (Container Insights)
- CpuUtilized / CpuReserved per task
- MemoryUtilized / MemoryReserved per task
- NetworkRxBytes / NetworkTxBytes
- RunningTaskCount per service
- StorageUtilized (Fargate ephemeral)
Traces (X-Ray)
- End-to-end request traces across services
- Latency breakdown per service hop
- Error rate visualization
- Service map: which service calls which
- Sidecar daemon + SDK instrumentation
Health check failures are the #1 cause of "task keeps restarting." The ALB health check calls your /health endpoint. If it fails enough consecutive checks to cross the unhealthy threshold, the ALB marks the target unhealthy and ECS stops the task and starts a new one, which hasn't warmed up yet and fails the health check again: a restart loop. Fix: (1) ensure /health is fast (<5s response), (2) set health check grace period (give app time to start before first check), (3) check that security group allows ALB → task traffic.
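The restart loop can be reasoned about with a toy timeline. The model below is a deliberate simplification under stated assumptions: checks fire at fixed intervals after registration, every check before the app finishes starting fails, and failures inside the grace period are ignored. The 30-second interval and threshold of 2 are example values, not a statement about your target group's settings.

```python
def survives_startup(startup_s: int, grace_s: int,
                     interval_s: int = 30, unhealthy_threshold: int = 2) -> bool:
    """Return True if the task is still running once the app has finished starting."""
    failures = 0
    check_time = interval_s  # assumption: first check one interval after registration
    while check_time < startup_s:  # app is not ready yet, so this check fails
        if check_time >= grace_s:  # failures during the grace period do not count
            failures += 1
            if failures >= unhealthy_threshold:
                return False  # task is stopped and replaced: the loop begins
        check_time += interval_s
    return True


print("90s startup, no grace period:", survives_startup(90, grace_s=0))
print("90s startup, 90s grace period:", survives_startup(90, grace_s=90))
```

With these example numbers, a 90-second startup fails without a grace period and survives once the grace period covers the startup time, which is the intuition behind fix (2).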
aws ecs describe-tasks → stoppedReason + exitCode
aws ecs execute-command → shell into running task
- "Task keeps failing to start + CannotPullContainerError" → check: (1) ECR image exists (2) Execution Role permissions (3) NAT gateway in private subnet or ECR VPC endpoint.
- "Container killed with exit code 137" → OOM. Container exceeded memory limit. Increase memory in task definition.
- "Container exited with exit code 143" → SIGTERM received (graceful shutdown). Normal during service scaling, deployments, or Fargate Spot interruptions. Not an error; it means your app received a shutdown signal.
- "How to debug a running ECS container" → ECS Exec (aws ecs execute-command). Requires SSM permissions on Task Role + enable-execute-command on service.
- "Logs not appearing" → Execution Role missing logs:CreateLogStream or logs:PutLogEvents. Also check log group exists and awslogs driver is configured.
- ECS Exec requires SSM permissions on the Task Role (not the Execution Role). This is a common exam distractor.
- Container Insights costs extra. It's not free; it generates CloudWatch custom metrics. Budget ~$0.30/task/month.
- "Service never reaches steady state" → aws ecs describe-services → check the events field for recent messages. Usually: health check failures, insufficient capacity, or image pull errors.
- Health check grace period: seconds to wait before first health check after task registration. Set to app startup time (e.g., 60s for Java Spring Boot). Default: 0 (immediate check).
- Stopped reasons: describe-tasks → stoppedReason + exitCode. 137 = OOM. 1 = app error. CannotPullContainer = ECR permissions/networking.
- ECS Exec: shell into running containers via SSM. Requires enable-execute-command + Task Role ssmmessages permissions.
- Observability: CloudWatch Logs (awslogs), Container Insights (metrics), X-Ray (traces), Alarms (auto-scaling + alerts).
- Health check loop: most common "task keeps restarting" cause. Fix: grace period, check /health endpoint, verify security group rules.
- describe-services events: first place to check when service won't stabilize. Shows recent scheduling failures and reasons.
Key CLI Commands
aws ecs create-cluster --cluster-name my-cluster
aws ecs register-task-definition --cli-input-json file://task-def.json
aws ecs create-service --cluster my-cluster --service-name my-svc ...
aws ecs update-service --cluster my-cluster --service my-svc --desired-count 5
aws ecs run-task --cluster my-cluster --task-definition my-task:3
aws ecs describe-tasks --cluster my-cluster --tasks <id>
aws ecs describe-services --cluster my-cluster --services my-svc
aws ecs execute-command --cluster my-cluster --task <id> --command "/bin/sh" --interactive
aws ecs list-tasks --cluster my-cluster --service-name my-svc
aws ecs stop-task --cluster my-cluster --task <id>
ARN Formats
- Cluster: arn:aws:ecs:region:account:cluster/name
- Task Definition: arn:aws:ecs:region:account:task-definition/family:revision
- Service: arn:aws:ecs:region:account:service/cluster/service-name
- Task: arn:aws:ecs:region:account:task/cluster/task-id
- Container Instance: arn:aws:ecs:region:account:container-instance/cluster/id
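Since every format above shares the same prefix, a small parser can split an ECS ARN into its parts. This is an illustrative sketch: the `EcsArn` type and `parse_ecs_arn` function are hypothetical names, and the sample ARNs use placeholder account and resource values.

```python
from typing import NamedTuple


class EcsArn(NamedTuple):
    region: str
    account: str
    resource_type: str  # e.g. "cluster", "service", "task-definition"
    resource_path: str  # everything after the first "/"


def parse_ecs_arn(arn: str) -> EcsArn:
    """Split an ECS ARN into region, account, resource type, and resource path."""
    # Limit the split so a ":revision" suffix stays inside the resource part.
    parts = arn.split(":", 5)
    if len(parts) != 6 or parts[0] != "arn" or parts[2] != "ecs":
        raise ValueError(f"not an ECS ARN: {arn}")
    resource_type, _, resource_path = parts[5].partition("/")
    return EcsArn(parts[3], parts[4], resource_type, resource_path)


print(parse_ecs_arn("arn:aws:ecs:us-east-1:123456789012:service/my-cluster/my-svc"))
print(parse_ecs_arn("arn:aws:ecs:us-east-1:123456789012:task-definition/my-task:3"))
```

For cluster-scoped resources such as services and tasks, `resource_path` still contains the cluster name, so a further split on "/" recovers the cluster and the resource ID.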
| Stopped Reason | Exit Code | Quick Fix |
|---|---|---|
| EssentialContainerExited | 1 | Check CloudWatch Logs for stack trace |
| OutOfMemoryError | 137 | Increase task memory |
| CannotPullContainerError | n/a | ECR perms + NAT/VPC endpoint |
| ResourceInitializationError | n/a | Subnet IPs + SG rules + EFS mounts |
| TaskFailedToStart | n/a | Execution Role + capacity |