Amazon SQS

Amazon SQS: Simple Queue Service

A fully managed message queue that decouples distributed systems: producers drop work in, and consumers process it on their own terms. No direct connections, no cascading failures, no traffic spikes crashing your services.

🗂️ SQS in 30 Seconds

  • Managed message queue: producers enqueue, consumers dequeue and process independently
  • Pull-based (polling): consumers ask for messages at their own pace, unlike SNS push
  • Messages stored durably for up to 14 days, so they survive consumer downtime
  • Standard: near-unlimited throughput, best-effort order · FIFO: strict order, exactly-once processing, 3,000 msg/s with batching
  • Dead-Letter Queue (DLQ) catches messages that fail repeatedly; essential for production
01
Chapter One

What is Amazon SQS

The Problem: Systems That Talk Directly Introductory

In a naive microservices architecture, every service calls other services directly. The Order API calls the Email Service, the Inventory Service, and the Shipping Service, all synchronously and in real time. This causes three serious problems:

💥

Traffic Spikes Crash Services

A flash sale sends 100× normal orders. The Order API overwhelms the Inventory Service, which can't scale fast enough. Requests fail. Orders are lost.

🌊

Failures Cascade

Email Service goes down for 2 minutes. Every order fails, even though the warehouse is running fine. One broken service breaks everything downstream.

🔗

Tight Coupling

Adding a new fraud detection service means changing the Order API and redeploying. Every new consumer makes the producer more complex.

The Mental Model: A Warehouse Receiving Area Introductory

👉 SQS is like a warehouse receiving dock. Delivery trucks (producers) drop off packages at any time. The dock holds them safely. Warehouse workers (consumers) pick up packages when they're ready, at their own pace. If the workers are busy, packages wait. Nothing is lost. Nobody blocks waiting for each other.

More everyday analogies:

🎫

Ticket Queue

Customers take a number and wait. Service agents handle one at a time. A rush of customers doesn't overwhelm agents; it just lengthens the queue temporarily.

📬

Mailbox

The postman drops letters in your mailbox regardless of whether you're home. You read them when you're ready. The postman doesn't wait for you.

🏭

Assembly Line Buffer

Parts accumulate on a conveyor between two stations. Station B processes at its own speed. Station A never blocks waiting for B to be free.

What SQS Actually Is Introductory

Amazon SQS (Simple Queue Service) is a fully managed message queue that enables asynchronous communication between distributed components. The core model:

  • Producers send messages into a queue
  • Messages are stored durably until a consumer processes them
  • Consumers poll the queue and process messages at their own rate
  • Once processed successfully, the consumer deletes the message from the queue
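The four steps above map onto three SQS API calls. A minimal sketch, assuming a boto3-style SQS client is passed in as `sqs` (for example `boto3.client("sqs")`); the queue URL and the `handle` callback are placeholders for illustration, not part of the original text.

```python
import json


def enqueue(sqs, queue_url, payload):
    """Producer side: serialize the payload and hand it to the queue."""
    resp = sqs.send_message(QueueUrl=queue_url, MessageBody=json.dumps(payload))
    return resp["MessageId"]


def drain_once(sqs, queue_url, handle):
    """Consumer side: receive a batch, process each message, delete on success.

    `handle` is a hypothetical callback supplied by the caller. If it raises,
    the message is NOT deleted, so SQS makes it visible again later.
    """
    resp = sqs.receive_message(QueueUrl=queue_url, MaxNumberOfMessages=10)
    processed = 0
    for msg in resp.get("Messages", []):
        try:
            handle(json.loads(msg["Body"]))
        except Exception:
            continue  # leave the message in the queue for a retry
        sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=msg["ReceiptHandle"])
        processed += 1
    return processed
```

The delete call sits after the handler on purpose: deleting first would lose the message if processing then failed.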
Core Concept Diagram Introductory
[Diagram: SQS flow, Producer → Queue → Consumer]
① PRODUCER
Publishes a message to the queue and returns immediately, with no waiting
② QUEUE
Stores the message durably (up to 14 days). Survives consumer downtime
③ CONSUMER
Polls the queue at its own pace, processes a message, then deletes it
④ DELETE
Message is deleted only after successful processing; if processing fails, the message returns to the queue
SQS vs Direct Service Calls Core
Concern | Direct Synchronous Call | SQS Queue
Traffic spike | Service overwhelmed, requests dropped | Messages buffer in the queue; consumer processes steadily
Consumer downtime | Calls fail, data is lost | Messages wait safely for up to 14 days
Consumer speed | Producer must wait for response | Producer returns as soon as the message is enqueued
Adding new consumer | Change producer code, redeploy | Point new consumer at the queue
Retry on failure | Manual retry logic needed | Built-in visibility timeout + DLQ
Scaling | Producer and consumer must scale together | Consumer scales independently based on queue depth
Queue Processing Flow Core
[Diagram: Queue-Based Processing, Smoothing a Traffic Spike]
✗ WITHOUT SQS
Traffic spike hits the service directly. It has no buffer, gets overwhelmed, and requests are dropped
✓ WITH SQS
Spike enters the queue. Worker processes at a steady, sustainable rate. Queue drains over time with no data lost
Why This Matters: The Three Superpowers of Queues Core
🧱

Buffering

The queue holds messages during traffic spikes, consumer slowdowns, and deployments. Work is not dropped; it waits safely until capacity is available.

🔄

Retry Capability

If processing fails, the message returns to the queue automatically. Built-in retry logic means transient failures are handled without custom code.

Independent Scaling

Consumers scale based on queue depth rather than producer rate. Add more workers when the queue grows, completely independent of the producer.

🎓 Exam Insight

When an exam question mentions "decouple services", "handle traffic spikes", "buffer requests", or "asynchronous processing", the answer is usually an SQS queue. SQS is the AWS answer to workload isolation and async processing, while SNS is the answer to broadcasting events to many consumers.

👉 Key Takeaway

SQS breaks the direct dependency between producers and consumers: work is stored durably in a queue and processed reliably, regardless of traffic spikes or consumer failures.

02
Chapter Two

Why Distributed Systems Need Queues

The Fundamental Problem: Synchronous Systems Don't Scale Introductory

Every distributed system eventually faces the same question: what happens when two services need to communicate but operate at different speeds, different scales, or different availability levels? Synchronous direct calls work fine at small scale. They break down at large scale.

⚡

Speed Mismatch

Service A can produce 10,000 events/sec. Service B can process 500/sec. Without a queue, either 9,500 events per second are dropped or Service A must slow down. Both are unacceptable in production.

🕐

Availability Mismatch

Service B deploys every Tuesday. Service A cannot stop accepting user requests for 3 minutes while B restarts. With direct calls, A's availability is limited by B's availability.

📈

Scale Mismatch

Service A auto-scales to 50 instances during peak. Service B can only handle 5x load. Direct calls flood B during spikes; B crashes, the failure cascades back to A, and the entire system fails.

🔁

Retry Complexity

When Service B is temporarily down, Service A must implement exponential backoff, retry logic, and circuit breakers, all as custom code. Every service pair adds more complexity.

👉 Queues address all four problems at once. They act as a shock absorber between services: absorbing speed mismatches, surviving availability gaps, smoothing scale spikes, and removing most of the need for custom retry logic.

Real-World Workloads That Require Queues Core
πŸ›’

E-Commerce Order Processing

  • Orders arrive in bursts (flash sales, promotions)
  • Inventory, email, shipping, fraud all need to react
  • Queue absorbs the burst, so all downstream systems stay stable
  • Pattern: Order Service → SQS → multiple workers
🎬

Video Processing Pipeline

  • User uploads video; transcoding takes 30 seconds
  • Can't make the user wait synchronously
  • Upload → SQS → transcoding worker → CDN publish
  • Pattern: Fan-out to multiple resolution workers
💳

Payment Processing

  • Payment accepted instantly, settlement is async
  • Fraud check, bank transfer, receipt: all non-blocking
  • FIFO queue ensures transaction ordering
  • DLQ captures failed transactions for manual review
📧

Email / Notification Systems

  • Sending 1M emails takes minutes, so it is never done synchronously
  • SQS buffers all send requests
  • Email workers scale based on queue depth
  • Failed sends retry automatically via visibility timeout
How Queues Help Systems "Slow Down Safely" Core

One of the most underrated queue properties: a queue lets a system slow down without losing work. This is not possible with synchronous calls. Without a queue, when a service is overwhelmed it drops requests. With a queue, work accumulates and drains as capacity becomes available.

[Chart: Workload Buffering, Queue Absorbs Spikes and Drains Smoothly. Axes: time (x) vs load (y). Series: incoming work (burst), queue depth (builds, then drains), consumer (steady)]
INCOMING WORK
Arrives in bursts: spiky, unpredictable, can peak at 50–100× baseline during sales
QUEUE DEPTH
Grows during the spike, then gradually drains as the consumer catches up. Nothing is lost.
CONSUMER
Processes at a steady, predictable rate. Can scale out if queue depth grows too large.
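The build-then-drain behaviour can be checked with simple arithmetic. The following is a toy simulation, not an SQS API; the arrival numbers and the consumer rate of 50 messages per second are made up for illustration.

```python
def simulate_backlog(arrivals_per_sec, consumer_rate):
    """Return the queue depth at the end of each one-second tick.

    arrivals_per_sec: messages arriving in each tick.
    consumer_rate: messages the consumer can process per tick.
    """
    depth = 0
    history = []
    for arrived in arrivals_per_sec:
        depth += arrived
        depth -= min(depth, consumer_rate)  # cannot process more than is queued
        history.append(depth)
    return history


# Hypothetical load: 3 quiet seconds, a 2-second spike, then quiet again.
load = [10, 10, 10, 500, 500] + [10] * 25
depths = simulate_backlog(load, consumer_rate=50)
print(max(depths))  # 900: peak backlog at the end of the spike
print(depths[-1])   # 0: the consumer has caught up
```

The consumer never runs faster than 50 per second, yet every message is eventually processed; the spike shows up as temporary queue depth rather than as dropped work.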
Five Benefits Every Architect Knows Core
Benefit | What It Means in Practice
Workload buffering | Message queue absorbs traffic spikes. Consumer processes at its own pace. No dropped requests.
Decoupling | Producer doesn't know about consumer. Add, replace, or scale consumers without touching the producer.
Retry handling | Failed messages return to queue automatically. No custom retry code in your services.
Fault tolerance | Consumer downtime doesn't cause data loss. Messages wait. System resumes where it left off.
Independent scaling | Scale consumers based on queue depth metric. Auto Scaling reacts to backlog, not to producer rate.
Common Architectural Mistakes Core
❌

Processing in the producer

Doing heavy work (DB writes, API calls) in the producer before queuing defeats the purpose. The producer should enqueue and return immediately. Heavy lifting belongs in the consumer.

❌

Not handling duplicates

Standard queues deliver at-least-once, meaning occasionally a message arrives twice. Consumers that don't handle this idempotently can process orders twice, send two emails, charge twice.

❌

Visibility timeout too short

If the consumer takes 10 seconds to process but the timeout is 5 seconds, the message becomes visible again while the first consumer is still working, causing duplicate processing.

❌

No Dead-Letter Queue

Without a DLQ, a "poison" message that always fails processing keeps returning to the queue until its retention period expires, consuming compute in failed retries and delaying healthy messages.

🎓 Exam Insight

Common exam pattern: "An application receives variable traffic: 10 requests/sec normally, 10,000/sec during promotions. The processing backend can only handle 50 req/sec max. How do you architect this?" Answer: SQS queue between the frontend and backend. The queue absorbs the spike; the backend processes at 50 req/sec; requests are not dropped as long as the backlog drains within the retention period. Scale the backend with Auto Scaling on the ApproximateNumberOfMessagesVisible CloudWatch metric.

👉 Key Takeaway

Queues decouple the rate of work arrival from the rate of work processing, enabling services to operate independently, survive each other's failures, and scale without coordination.

03
Chapter Three

How SQS Works

Pull vs Push: Why SQS is Pull-Based Introductory

SQS uses a polling (pull) model: the consumer actively asks "do you have messages for me?" at regular intervals. This is the opposite of SNS, which pushes messages to subscribers. The pull model gives the consumer full control over its processing rate.

📤

Push (SNS): Producer controls pace

  • SNS delivers immediately to all subscribers
  • Consumer must handle any rate it receives
  • Good for broadcasting events to many subscribers
  • Consumer can be overwhelmed during spikes
📥

Pull (SQS): Consumer controls pace

  • Consumer decides when to ask for messages
  • Consumer processes at its own maximum rate
  • Good for workload processing at controlled speed
  • Queue absorbs backlog when consumer is slow
Message Lifecycle: 5 Stages Core
[Diagram: SQS Message Lifecycle. ① Send (producer calls SendMessage) → ② Store (message visible in queue) → ③ Poll (consumer calls ReceiveMessage) → ④ Process (message hidden, in-flight) → ⑤ Delete (consumer calls DeleteMessage)]
① SEND
Producer puts message in queue. Returns immediately. Max 256 KB.
② STORE
Message is visible and waiting. Any consumer can pick it up.
③ POLL
Consumer requests messages. SQS returns up to 10 at a time.
④ PROCESS
Message becomes invisible to others during processing (visibility timeout).
⑤ DELETE
Consumer must explicitly delete. If not deleted → message reappears for retry.
Visibility Timeout: The Key Safety Mechanism Core

When a consumer receives a message, SQS doesn't delete it immediately. Instead, it makes the message invisible to all other consumers for a configurable period, the visibility timeout. This is SQS's built-in retry mechanism.

[Diagram: Visibility Timeout, Success Path vs Failure Path. Success: message visible → polled → in-flight (invisible, timeout 30s) → processed → deleted. Failure: message in-flight → consumer crashes or is slow → timeout expires → message becomes visible again → retry; after N retries → Dead-Letter Queue (DLQ) for inspection]
VISIBILITY TIMEOUT
  • Default: 30 seconds. Range: 0 seconds – 12 hours
  • Set it longer than your worst-case processing time
  • Too short → duplicates; too long → slow retries on crash
  • Consumer can extend it via the ChangeMessageVisibility API while still working
AT-LEAST-ONCE DELIVERY
  • Standard queue: messages may be delivered more than once
  • Consumers must be idempotent: same message twice = same result
  • Use a unique message ID or database upsert to handle duplicates
  • FIFO queue provides exactly-once processing within a 5-minute deduplication window
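Idempotent handling by message ID can be sketched in a few lines. This is an illustration under stated assumptions, not a production recipe: `seen_ids` is an in-memory set standing in for a durable store, and `process_once` is a hypothetical helper name.

```python
def process_once(message_id, payload, seen_ids, handler):
    """Run `handler` only the first time a given message ID is seen.

    seen_ids: a set standing in for a durable store (an assumption of this
    sketch; process memory does not survive restarts or span workers).
    Returns True if the handler ran, False if the message was a duplicate.
    """
    if message_id in seen_ids:
        return False
    handler(payload)
    seen_ids.add(message_id)  # record only after the handler succeeded
    return True


seen = set()
charges = []
process_once("msg-1", {"amount": 20}, seen, charges.append)
process_once("msg-1", {"amount": 20}, seen, charges.append)  # redelivery
print(len(charges))  # 1
```

With several workers, the check-then-record step has a race between them, so the lookup and the write need to be one atomic operation in a shared store; a conditional write that fails when the ID already exists is one common way to get that.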
Long Polling vs Short Polling Core

By default, SQS uses short polling: it samples a subset of servers and returns immediately, even if the queue is empty. Long polling waits up to 20 seconds for a message before returning. Prefer long polling in production.

Feature | Short Polling | Long Polling (recommended)
Wait time | Returns immediately (0s) | Waits up to 20s for a message
Empty responses | Many; wastes API calls | Minimal; returns as soon as a message arrives or the wait expires
Cost | Higher: many empty polls billed | Lower: fewer API requests
Latency | Near-zero when queue is active | Near-zero when message available; waits only when empty
How to enable | Default | Set WaitTimeSeconds=20
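Long polling can be switched on per call or as a queue-level default. A minimal sketch, assuming a boto3-style SQS client is passed in as `sqs`; the function names are illustrative.

```python
def enable_long_polling(sqs, queue_url, wait_seconds=20):
    """Set the queue-level default so ReceiveMessage calls long-poll."""
    if not 0 <= wait_seconds <= 20:
        raise ValueError("SQS long polling wait must be between 0 and 20 seconds")
    sqs.set_queue_attributes(
        QueueUrl=queue_url,
        Attributes={"ReceiveMessageWaitTimeSeconds": str(wait_seconds)},
    )


def long_poll(sqs, queue_url, wait_seconds=20):
    """Per-call alternative: block up to `wait_seconds` waiting for messages."""
    resp = sqs.receive_message(
        QueueUrl=queue_url,
        MaxNumberOfMessages=10,
        WaitTimeSeconds=wait_seconds,
    )
    return resp.get("Messages", [])
```

A `WaitTimeSeconds` value on the individual call takes precedence over the queue-level attribute, so the two approaches can be mixed.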
Message Batching: Cost & Throughput Optimization In-Depth

Batching sends multiple messages in a single API request instead of one at a time. This is one of the most important cost optimization techniques: it reduces API calls by up to 90%.

Aspect | Single Send | Batch Send
API calls per 10 messages | 10 calls | 1 call (90% reduction)
Cost per 1M messages | $0.40 | $0.04 (90% reduction)
Max messages per request | 1 | 10
Max payload | 256 KB per message | 256 KB total across batch
📤

SendMessageBatch

Send up to 10 messages in one request. Each can have different body, attributes, and delay.

🗑️

DeleteMessageBatch

Delete up to 10 messages in one request. Pass receipt handles from processing.

📥

ReceiveMessage

Already returns up to 10 messages per poll. Set MaxNumberOfMessages=10.

👉 Cost example: With 10M messages/day, single sends cost $4/day and full batches of 10 cost $0.40/day. That saves $3.60/day, or about $108 over a 30-day month. Always batch when possible.
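The batching rule and the cost arithmetic above can be sketched together. This assumes a boto3-style SQS client passed in as `sqs` and the $0.40 per million requests price quoted in this section; it does not check the total payload size of a batch, which real code would also need to respect.

```python
def chunks(items, size=10):
    """Yield successive slices of at most `size` items (SQS batch limit is 10)."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def send_in_batches(sqs, queue_url, bodies):
    """Send message bodies with SendMessageBatch; return the number of API calls."""
    calls = 0
    for batch in chunks(bodies):
        entries = [{"Id": str(i), "MessageBody": body} for i, body in enumerate(batch)]
        resp = sqs.send_message_batch(QueueUrl=queue_url, Entries=entries)
        calls += 1
        if resp.get("Failed"):
            # A batch call can partially fail; real code should retry these entries.
            raise RuntimeError(f"{len(resp['Failed'])} entries failed")
    return calls


def daily_request_cost(messages_per_day, batch_size, price_per_million=0.40):
    """Send-request cost in dollars for one day at the given batch size."""
    requests = -(-messages_per_day // batch_size)  # ceiling division
    return requests * price_per_million / 1_000_000
```

For 10M messages a day, `daily_request_cost(10_000_000, 1)` gives about 4.0 and `daily_request_cost(10_000_000, 10)` gives about 0.4, matching the figures in the cost example.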

ChangeMessageVisibility: Extending Processing Time In-Depth

What if processing time varies? Some messages take 2 seconds, one takes 30 seconds. Use the ChangeMessageVisibility API to extend the timeout while processing.

⏱️

The Problem

  • Timeout too short → message reappears mid-processing → duplicate work
  • Timeout too long → if consumer dies, message waits unnecessarily
  • Variable processing time → no single timeout fits all
✅

The Solution

  • Start with timeout = expected time × 1.5
  • During processing, periodically check remaining time
  • If less than 30% of the timeout remains, call ChangeMessageVisibility
  • Maximum visibility timeout: 12 hours
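The "extend when under 30% remains" rule can be written as a small helper that a worker calls between processing steps. A sketch, assuming a boto3-style SQS client passed in as `sqs`; `extend_if_needed` and its parameters are hypothetical names for illustration.

```python
def extend_if_needed(sqs, queue_url, receipt_handle, deadline, now, timeout=60):
    """Extend the visibility timeout when under 30% of it remains.

    deadline and now are timestamps in seconds (for example from time.monotonic()).
    Returns the (possibly updated) deadline.
    """
    remaining = deadline - now
    if remaining < 0.3 * timeout:
        # The new timeout counts from the moment of this call, not from the old deadline.
        sqs.change_message_visibility(
            QueueUrl=queue_url,
            ReceiptHandle=receipt_handle,
            VisibilityTimeout=timeout,
        )
        return now + timeout
    return deadline
```

Passing `now` in rather than reading the clock inside the function keeps the helper easy to test; a real worker would call it on a timer or between units of work.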
Message Attributes: Metadata Without Touching the Body In-Depth

Message attributes are key-value pairs attached to a message, separate from the body. Use them for routing, filtering, and tagging without parsing the body.

📋

Use Cases

  • Routing: Which service should process this?
  • Priority: Process high-priority first
  • Source tracking: Which system sent this?
  • Versioning: Schema version for consumers

Supported Types

  • String: text values
  • Number: integers, floats
  • Binary: raw binary data
  • Custom type labels appended to a base type (e.g., "Binary.jpeg", "Number.float")
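Each attribute is a small dictionary with a `DataType` and a matching value field. A sketch of building and attaching them, assuming a boto3-style SQS client passed in as `sqs`; the attribute names `source`, `schema_version`, and `priority` are examples, not required keys.

```python
def build_attributes(source, schema_version, priority):
    """Build the MessageAttributes structure SQS expects.

    Every attribute needs a DataType plus the matching value field.
    Number values are sent as strings.
    """
    return {
        "source": {"DataType": "String", "StringValue": source},
        "schema_version": {"DataType": "Number", "StringValue": str(schema_version)},
        "priority": {"DataType": "String", "StringValue": priority},
    }


def send_with_attributes(sqs, queue_url, body, attributes):
    """Attach attributes alongside the body; the body itself is untouched."""
    return sqs.send_message(
        QueueUrl=queue_url, MessageBody=body, MessageAttributes=attributes
    )
```

On the consumer side, attributes are only returned if the receive call asks for them, for example with `MessageAttributeNames=["All"]`.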
🎓 Exam Insight
  • Visibility timeout too short → the same message can be processed by two consumers at once → data corruption risk. Set it to max expected processing time × 1.5.
  • At-least-once delivery → consumers must be idempotent. Exam question: "how to prevent duplicate processing?" → use a DynamoDB conditional write to track processed message IDs.
  • Long polling → reduces cost and cuts down empty receive calls. Exam scenario: "reduce SQS API costs" → enable long polling (ReceiveMessage WaitTimeSeconds=20).
👉 Key Takeaway

SQS's visibility timeout makes retry automatic and safe: if a consumer crashes mid-processing, the message reappears and another consumer picks it up. A received message leaves the queue only when a consumer deletes it, it moves to a DLQ, or its retention period expires.

04
Chapter Four

Standard Queue vs FIFO Queue

Two Queue Types: Choose Based on Your Needs Introductory

SQS offers two fundamentally different queue types. Standard is the default and fits most use cases. FIFO adds strict ordering and exactly-once processing, but with throughput limits.

🚀

Standard Queue

  • Near-unlimited throughput (nearly unlimited API calls per second)
  • Best-effort ordering: messages may arrive out of order
  • At-least-once delivery: occasional duplicates possible
  • Lower latency, higher availability
  • Use when: order doesn't matter, duplicates are handled
📋

FIFO Queue

  • 3,000 msg/sec with batching (300 without)
  • Strict ordering: first-in-first-out guaranteed within a message group
  • Exactly-once processing: no duplicates in 5-min window
  • Slightly higher latency
  • Use when: order matters, duplicates are unacceptable
Detailed Comparison Core
Feature | Standard Queue | FIFO Queue
Throughput | Nearly unlimited | 3,000 msg/sec (with batching)
Message ordering | Best-effort (not guaranteed) | Strict FIFO within message group
Delivery guarantee | At-least-once (can duplicate) | Exactly-once (within 5-min window)
Deduplication | None; consumer must handle | Content-based or ID-based
Queue name | Any name | Must end with .fifo
Message groups | Not applicable | Required; orders messages within group
Cost | $0.40 per million requests | $0.50 per million requests
Use cases | Log processing, fan-out, async jobs | Financial transactions, inventory updates
When to Use Standard vs FIFO Core
✅

Use Standard When

  • Order doesn't matter: email sending, image thumbnails
  • You need massive throughput: millions of messages
  • Your consumer is idempotent: same message twice = same result
  • Cost matters: Standard is 20% cheaper
  • You're doing fan-out to multiple independent workers
✅

Use FIFO When

  • Order is critical: transaction ledgers, command sequences
  • Duplicates are unacceptable: payment processing
  • You need exactly-once for compliance reasons
  • Throughput is under 3,000 msg/sec
  • You have distinct message groups (e.g., per-customer)
FIFO Message Groups: Parallelism Within Order In-Depth

FIFO doesn't mean all messages are processed one at a time globally. You can have multiple message groups, and each group is ordered independently. Messages from different groups can be processed in parallel.

[Diagram: FIFO Message Groups, Parallel Processing with Per-Group Order. Group A (user-123): A1 → A2 → A3, strict order, processed by Worker 1. Group B (user-456): B1 → B2 → B3, strict order, processed by Worker 2. Groups are processed in parallel]
MESSAGE GROUP ID
Required on every FIFO message. Use customer ID, order ID, or any partition key. Messages with the same group ID are strictly ordered.
DEDUPLICATION ID
Either provide explicitly or enable content-based deduplication. Within the 5-minute window, same ID = same message → discarded.
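Both IDs are ordinary parameters on the send call. A minimal sketch, assuming a boto3-style SQS client passed in as `sqs` and a queue whose name ends in `.fifo`; `send_fifo` is an illustrative helper name.

```python
def send_fifo(sqs, queue_url, body, group_id, dedup_id=None):
    """Send to a FIFO queue. `queue_url` must point at a queue named *.fifo."""
    params = {
        "QueueUrl": queue_url,
        "MessageBody": body,
        "MessageGroupId": group_id,  # ordering is guaranteed per group ID
    }
    if dedup_id is not None:
        # Omit this only if content-based deduplication is enabled on the queue.
        params["MessageDeduplicationId"] = dedup_id
    return sqs.send_message(**params)
```

Choosing a fine-grained group ID such as a customer ID keeps ordering where it matters while letting unrelated groups be processed in parallel.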
FIFO Deduplication: Exactly-Once Mechanics In-Depth

FIFO queues automatically discard duplicate messages within a 5-minute deduplication window. Two methods are available:

Method | How It Works | When to Use
Explicit deduplication ID | You provide a unique ID with each message | You control ID generation (idempotency keys, request IDs)
Content-based deduplication | SHA-256 hash of message body (not attributes) | Simpler setup, body uniquely identifies message
⏱️

5-Minute Window

  • Same deduplication ID within 5 min → duplicate discarded
  • After 5 minutes, same ID is accepted (new window)
  • SQS returns success (silent deduplication, no error)
⚠️

Important Gotchas

  • Content-based dedup ignores message attributes; only the body is hashed
  • Different attributes + same body = still duplicate
  • Retry after 5 min → message accepted and delivered again (plan for this)

👉 Best practice: For order confirmations, use deduplication ID = "order-12345-confirmation". This prevents duplicate emails within 5 minutes. If you need longer deduplication, track processed IDs in DynamoDB.
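The window behaviour is easy to misread, so here is a toy model of it. This is not how SQS is implemented; `DedupWindow` is a hypothetical class that only mirrors the rule "same ID within 5 minutes is discarded".

```python
DEDUP_WINDOW_SECONDS = 5 * 60


class DedupWindow:
    """Toy model of FIFO deduplication (not the SQS implementation)."""

    def __init__(self):
        self._first_seen = {}  # dedup ID -> time the ID was last accepted

    def accept(self, dedup_id, now):
        """Return True if the message is accepted, False if discarded as a duplicate."""
        first = self._first_seen.get(dedup_id)
        if first is not None and now - first < DEDUP_WINDOW_SECONDS:
            return False
        self._first_seen[dedup_id] = now
        return True


w = DedupWindow()
print(w.accept("order-12345-confirmation", now=0))    # True: first send
print(w.accept("order-12345-confirmation", now=120))  # False: inside the window
print(w.accept("order-12345-confirmation", now=301))  # True: window has passed
```

The third call is the case to plan for: a producer retry that arrives after the window is treated as a new message, which is why longer-lived deduplication needs its own record of processed IDs.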

🎓 Exam Insight
  • "Strict ordering required" → FIFO queue with message group ID
  • "Exactly-once processing" → FIFO queue with deduplication ID
  • "High throughput + async" → Standard queue + idempotent consumer
  • FIFO limitation: 3,000 msg/sec with batching (300 without) by default. If you need more, enable high-throughput FIFO mode or use Standard.
  • FIFO name: must end with the .fifo suffix, e.g., orders.fifo
👉 Key Takeaway

Standard queue for most workloads: high throughput, handle duplicates in your consumer. FIFO queue when ordering or exactly-once is a hard requirement, but accept the 3,000 msg/sec default limit.

05
Chapter Five

SQS Architecture Patterns

Pattern 1: Queue-Based Load Leveling Core

The most fundamental pattern: put a queue between a variable-rate producer and a fixed-rate consumer. The queue absorbs traffic spikes so the backend processes at a steady pace. This is the solution to every "traffic spike crashes our service" problem.

[Diagram: Queue-Based Load Leveling, Variable In, Steady Out. API Gateway (variable rate, 10–10,000 req/s) → SQS (buffers the spike; depth grows, then drains) → Backend ASG (steady rate, 500 req/s max) → DB]
Pattern 2: Worker Pool Pattern Core

Multiple consumers (workers) poll the same queue in parallel. Each message is delivered to one worker at a time. Scale the worker pool based on queue depth, using the ApproximateNumberOfMessagesVisible CloudWatch metric.

[Diagram: Worker Pool, Multiple Consumers, One Queue. Producer (jobs to process) → Job Queue (messages wait for any worker) → Worker 1, Worker 2, Worker 3 (each processing a message), in an ASG that scales on queue depth]
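Scaling on queue depth comes down to one division: backlog over what a worker can clear in the time you are willing to wait. A sketch of that calculation; the function name, the default limits, and the example numbers are assumptions for illustration, not values from the original text.

```python
import math


def desired_workers(backlog, per_worker_rate, target_drain_seconds,
                    min_workers=1, max_workers=20):
    """Workers needed to clear `backlog` messages within the target time.

    per_worker_rate: messages one worker processes per second (measured, not guessed).
    """
    needed = math.ceil(backlog / (per_worker_rate * target_drain_seconds))
    return max(min_workers, min(max_workers, needed))


print(desired_workers(backlog=12_000, per_worker_rate=10, target_drain_seconds=300))  # 4
print(desired_workers(backlog=0, per_worker_rate=10, target_drain_seconds=300))       # 1
```

In practice the same idea is usually expressed as a backlog-per-instance target for the Auto Scaling policy rather than computed by hand.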
Pattern 3: Dead-Letter Queue (DLQ) Core

A DLQ catches "poison" messages that fail repeatedly. After N failed processing attempts, SQS automatically moves the message to the DLQ. This prevents a single bad message from blocking your entire queue and consuming infinite retry compute.

[Diagram: Dead-Letter Queue, Isolating Failed Messages. Main Queue (maxReceiveCount=3) → Consumer processes message. Success → message deleted. Fail → retry (up to 3×). After 3 failures → Dead-Letter Queue for inspection, alerting, and manual fix]
WHEN TO USE DLQ
  • Always in production β€” no exceptions
  • Alert on DLQ message count > 0
  • Inspect failed messages for debugging
  • Redrive to main queue after fix
CONFIGURATION
  • maxReceiveCount: failures before DLQ (e.g., 3)
  • DLQ must be same type (Standard → Standard, FIFO → FIFO)
  • Set DLQ retention longer (14 days) for analysis
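The DLQ link is configured on the source queue through its `RedrivePolicy` attribute, which is a JSON string. A sketch, assuming a boto3-style SQS client passed in as `sqs` and an already created DLQ whose ARN is known; `attach_dlq` is an illustrative helper name.

```python
import json


def attach_dlq(sqs, source_queue_url, dlq_arn, max_receive_count=3):
    """Point a source queue at a dead-letter queue via its RedrivePolicy attribute."""
    policy = {
        "deadLetterTargetArn": dlq_arn,
        "maxReceiveCount": str(max_receive_count),
    }
    sqs.set_queue_attributes(
        QueueUrl=source_queue_url,
        Attributes={"RedrivePolicy": json.dumps(policy)},
    )
    return policy
```

With `max_receive_count=3`, a message that has been received three times without being deleted is moved to the DLQ instead of being handed out a fourth time.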
Pattern 4: Microservice Decoupling Core

Replace direct service-to-service HTTP calls with queue-based async messaging. Services communicate through queues instead of knowing about each other. This is the foundation of event-driven microservice architecture.

❌

Tightly Coupled (HTTP)

  • Order Service calls Inventory via HTTP
  • If Inventory is slow → Order is slow
  • If Inventory is down → Order fails
  • Scaling Inventory requires rebalancing
  • Adding Shipping requires changing Order
✅

Decoupled (SQS)

  • Order publishes "order.placed" to queue
  • Inventory polls queue at own pace
  • If Inventory is down → messages wait
  • Scale Inventory independently
  • Add Shipping by giving it its own queue (for example via SNS fan-out), since consumers on one queue compete for messages
Pattern 5: Priority Queue In-Depth

SQS doesn't have native priority support. Implement it with multiple queues: high-priority, normal-priority, and low-priority. Configure your consumer to poll high first, then normal, then low.

🔴

High Priority

Critical alerts, VIP customers, payment failures. Consumer checks this queue first on every poll cycle.

🟡

Normal Priority

Standard workload. Consumer checks after high queue is empty or quota reached.

🔵

Low Priority

Batch jobs, reports, cleanup tasks. Processed only when higher queues are empty.
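The polling order can be sketched as a loop over queue URLs from highest to lowest priority. This assumes a boto3-style SQS client passed in as `sqs`; the one-second wait per queue is an arbitrary choice for illustration.

```python
def receive_by_priority(sqs, queue_urls):
    """Poll queues in priority order and return the first non-empty batch.

    queue_urls: list ordered from highest to lowest priority.
    Returns (queue_url, messages), or (None, []) if every queue was empty.
    """
    for url in queue_urls:
        # Short wait on each queue so an empty high-priority queue
        # does not delay lower-priority work for long.
        resp = sqs.receive_message(
            QueueUrl=url, MaxNumberOfMessages=10, WaitTimeSeconds=1
        )
        messages = resp.get("Messages", [])
        if messages:
            return url, messages
    return None, []
```

Strict ordering like this can starve the low-priority queue under sustained load; the "quota reached" rule mentioned for the normal queue is one way to guarantee lower tiers some share of each cycle.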

Pattern 6: Request-Reply In-Depth

Need async processing but also need to return a response? Use two queues: a request queue and a response queue. The caller sends a request and waits on its own reply queue.

[Diagram: Request-Reply Pattern, Async with Response. ① Service A (caller) sends a request with a correlation ID and reply-to URL to the Request Queue → ② request is delivered to Service B → ③ Service B processes it → ④ reply is sent to the Response Queue → ⑤ Service A receives the reply]
🔑

Key Components

  • Correlation ID: UUID that ties request to response
  • Reply-to queue: Included in request message
  • Long polling: Caller waits on reply queue
  • Timeout: Caller can timeout and retry
📋

When to Use

  • Async processing but caller needs result
  • Long-running work (>30 seconds)
  • Decouple request from response latency
  • Alternative: AWS Step Functions for orchestration
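The correlation ID and reply-to queue travel well as message attributes. A sketch of the two pieces the caller needs; the attribute names `correlation_id` and `reply_to` are conventions chosen for this example, not anything SQS defines.

```python
import json
import uuid


def build_request(payload, reply_queue_url):
    """Return (correlation_id, send_message kwargs) for a request message."""
    correlation_id = str(uuid.uuid4())
    kwargs = {
        "MessageBody": json.dumps(payload),
        "MessageAttributes": {
            "correlation_id": {"DataType": "String", "StringValue": correlation_id},
            "reply_to": {"DataType": "String", "StringValue": reply_queue_url},
        },
    }
    return correlation_id, kwargs


def find_reply(messages, correlation_id):
    """Pick the reply matching our correlation ID out of a received batch."""
    for msg in messages:
        attrs = msg.get("MessageAttributes", {})
        if attrs.get("correlation_id", {}).get("StringValue") == correlation_id:
            return msg
    return None
```

The caller would pass the returned kwargs to `send_message` along with the request queue URL, then long-poll its reply queue and run `find_reply` on each batch until a match arrives or its own timeout expires.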
🎓 Exam Insight
  • "Scale backend based on queue" → Use Auto Scaling with the ApproximateNumberOfMessagesVisible CloudWatch metric
  • "Messages failing repeatedly" → Configure Dead-Letter Queue with maxReceiveCount
  • "Decouple microservices" → SQS between services instead of HTTP calls
  • "Process orders in priority" → Multiple queues (high/normal/low) with weighted polling
👉 Key Takeaway

SQS has a pattern for most distributed system challenges: load leveling for spikes, worker pools for throughput, DLQs for resilience, and multiple queues for priority. Master these and you can architect most async systems.

06
Chapter Six

SQS + SNS: The Fan-Out Pattern

Why Combine SNS and SQS? Introductory

SNS and SQS are not competitors; they're complementary. SNS broadcasts (one event → many subscribers). SQS buffers (store and process at own pace). Combined, you get the best of both: reliable fan-out to multiple independent consumers, each with their own buffer and retry capability.

📢

SNS Alone

  • Broadcasts to multiple subscribers
  • Push-based: immediate delivery
  • If subscriber is down → message can be lost once delivery retries are exhausted
  • If subscriber is slow → backed up
  • Good for: real-time alerts, Lambda triggers
🗂️

SQS Alone

  • Single queue → single consumer (or pool)
  • Pull-based: consumer controls pace
  • Messages survive consumer downtime
  • Consumer processes at own rate
  • Good for: workload processing, jobs

👉 SNS + SQS = Fan-out with durability. SNS broadcasts to multiple SQS queues. Each queue buffers independently. Each consumer processes at its own pace. One slow consumer doesn't affect others. One down consumer catches up when it recovers.

The Fan-Out Architecture Core
[Diagram: SNS + SQS Fan-Out, One Event, Multiple Independent Consumers. Order Service publishes the order.placed event to SNS topic order-events, which delivers to three queues that buffer independently: Email Queue → Email Service (send confirmation), Inventory Queue → Inventory (update stock), Analytics Queue → Analytics (track metrics)]
SNS → BROADCAST
One publish to SNS fans out instantly to all subscribed queues. Producer doesn't know how many consumers exist.
SQS → BUFFER
Each queue buffers independently. If Analytics is down for an hour, its messages wait. Email and Inventory are unaffected.
CONSUMERS → INDEPENDENT
Each consumer scales separately. Email might need 2 workers, Inventory needs 10, Analytics needs 1. No coordination.
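Wiring the fan-out takes one subscribe call per queue and one publish call per event. A sketch, assuming a boto3-style SNS client passed in as `sns`, and topic and queue ARNs that already exist; the helper names and the event shape are illustrative.

```python
import json


def subscribe_queues(sns, topic_arn, queue_arns):
    """Subscribe each SQS queue to the topic; return the subscription ARNs."""
    arns = []
    for queue_arn in queue_arns:
        resp = sns.subscribe(
            TopicArn=topic_arn,
            Protocol="sqs",
            Endpoint=queue_arn,
            # Deliver the published body as-is instead of wrapping it in an SNS envelope.
            Attributes={"RawMessageDelivery": "true"},
            ReturnSubscriptionArn=True,
        )
        arns.append(resp["SubscriptionArn"])
    return arns


def publish_order_placed(sns, topic_arn, order_id):
    """One publish call; SNS copies the event to every subscribed queue."""
    return sns.publish(
        TopicArn=topic_arn,
        Message=json.dumps({"event": "order.placed", "order_id": order_id}),
    )
```

Each queue also needs an access policy that allows the topic to send messages to it; without that, the subscription exists but deliveries are rejected.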
Fan-Out Benefits Core
Benefit | How Fan-Out Delivers It
Isolation | One slow or failed consumer doesn't affect others. Email being slow doesn't delay Inventory updates.
Independent scaling | Scale each consumer based on its own queue depth. Email might have 2 workers, Inventory 10.
No message loss | If a consumer is down, its queue buffers messages. Catches up when recovered.
Add consumers easily | Subscribe a new SQS queue to the SNS topic. No changes to the producer.
Different processing speeds | Email (fast, 100/sec) and Video Transcode (slow, 2/sec) work from the same event stream.
Real-World Example: E-Commerce Order Event Core
🔔

Event Published

Order Service publishes order.placed to SNS topic order-events. Returns immediately; it doesn't know or care who subscribes.

📧

Email Queue

Receives event → Lambda sends confirmation email. Fast: 500 emails/sec. Small queue, clears quickly.

📦

Inventory Queue

Receives event → EC2 worker updates stock DB. Medium speed, complex logic. 5 workers in ASG.

📊

Analytics Queue

Receives event → Lambda writes to data lake. Batch processing that runs hourly. Queue grows, drains in batches.

🎓 Exam Insight
  • "One event, multiple consumers, each at own pace" → SNS + SQS fan-out
  • "Decouple event producer from consumers" → SNS topic, consumers subscribe queues
  • "Consumer failures shouldn't affect others" → Each consumer has its own SQS queue
  • This is one of the most common integration patterns for AWS microservices; expect it to come up often on exams
👉 Key Takeaway

SNS + SQS fan-out is a standard architecture for event-driven systems: SNS broadcasts, SQS buffers, consumers stay isolated and independently scalable.

07
Chapter Seven

SQS vs SNS vs EventBridge vs Kafka

Four Messaging Services: Different Jobs Introductory

AWS has multiple messaging services because they solve different problems. They're not competitors; they complement each other. Understanding when to use which is a core architecture skill.

🗂️

SQS: Queue

  • Job: Buffer and decouple workloads
  • Model: Pull (consumer polls)
  • Consumers: One queue → one consumer (or pool)
  • Use: Async processing, load leveling
📢

SNS: Broadcast

  • Job: Fan-out events to many subscribers
  • Model: Push (SNS delivers)
  • Consumers: One topic → many subscribers
  • Use: Notifications, alerts, pub-sub
🎯

EventBridge: Router

  • Job: Route events with complex rules
  • Model: Push (EventBridge delivers)
  • Consumers: Rule-based routing to targets
  • Use: Event-driven architecture, SaaS integrations
🚀

Kafka (MSK): Stream

  • Job: High-throughput event streaming
  • Model: Pull (consumer reads from log)
  • Consumers: Multiple consumer groups, replay
  • Use: Real-time analytics, log aggregation
Detailed Comparison Core
Feature | SQS | SNS | EventBridge | Kafka (MSK)
Primary use | Buffer workloads | Broadcast events | Route events | Stream events
Delivery model | Pull (poll) | Push | Push | Pull (read log)
Message retention | Up to 14 days | None (immediate) | None (immediate) | Configurable (days–forever)
Message replay | No | No | Archive → replay | Yes (offset-based)
Throughput | Near-unlimited | Near-unlimited | 10K events/sec (soft limit) | Millions/sec
Ordering | FIFO queue option | FIFO topic option | No guarantee | Per-partition ordering
Content filtering | No | Subscription filters | Rich rule patterns | Consumer logic
Management | Fully managed | Fully managed | Fully managed | Managed (MSK) or self-managed
Cost model | Per request + data | Per request + data | Per event | Per broker-hour + storage
When to Use What: Decision Guide Core
✅

Use SQS When

  • You need to buffer workloads (jobs, tasks)
  • Consumer needs to process at its own pace
  • You need retry + DLQ for failed messages
  • Single consumer (or competing consumer pool)
  • Messages can be deleted after processing
βœ…

Use SNS When

  • One event needs to reach multiple subscribers
  • You want push delivery (immediate)
  • Subscribers are Lambda, HTTP, Email, SMS
  • Simple pub-sub pattern
  • Combined with SQS for durable fan-out
βœ…

Use EventBridge When

  • You need content-based routing rules
  • You're integrating with SaaS (Zendesk, Datadog)
  • You want schema registry + discovery
  • You're building event-driven architecture
  • You need to archive and replay events
βœ…

Use Kafka (MSK) When

  • You need millions of events per second
  • Multiple consumers need to read same stream
  • You need message replay / reprocessing
  • You're doing real-time analytics / ML
  • You already have Kafka expertise
Common Combinations Core

| Pattern | Services Used | Why |
|---|---|---|
| Durable fan-out | SNS + SQS | SNS broadcasts, SQS buffers per consumer |
| Event-driven microservices | EventBridge + SQS | EventBridge routes, SQS buffers processing |
| Real-time + batch | Kafka + S3 + Athena | Kafka streams, S3 stores, Athena queries |
| SaaS integration | EventBridge + Lambda | EventBridge receives SaaS events, Lambda processes |
| Transactional + analytics | SQS + Kinesis | SQS for transactions, Kinesis for analytics stream |

SQS vs Kinesis: Queue vs Stream Core

Both are pull-based, but they serve fundamentally different purposes. This is a common source of confusion:

| Feature | SQS | Kinesis Data Streams |
|---|---|---|
| Primary use | Work queue, task processing | Real-time streaming analytics |
| Data model | Messages (deleted after processing) | Persistent log (retention 1-365 days) |
| Replay capability | No: message is gone after delete | Yes: re-read from any position within the retention period |
| Multiple consumers | Competing consumers (one gets each message) | Multiple consumer applications (each reads all data) |
| Message size | 256 KB | 1 MB |
| Ordering | FIFO queue option | Per-shard ordering |
| Throughput scaling | Auto-scales | Shard-based (manual in provisioned mode, automatic in on-demand mode) |
| Retention | Up to 14 days | Up to 365 days |
| Cost model | Per request | Per shard-hour + data |

🗂️

Use SQS When

  • You have a queue of work/tasks to process
  • Message can be deleted after successful processing
  • One consumer (or competing pool) per message
  • You need retry + DLQ for failures
📊

Use Kinesis When

  • Multiple consumers need to read the same stream
  • You need to replay / reprocess historical events
  • Real-time analytics, ML, dashboards
  • Audit logs requiring long retention

👉 They work together: Kinesis for ingestion + SQS for work distribution. Pattern: Kinesis → Lambda → SQS → worker pool. Kinesis handles high-throughput ingestion, SQS provides reliable per-item processing.
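
The queue-versus-stream difference can be made concrete with a toy, in-memory model. This is only a sketch of the two consumption semantics, not how SQS or Kinesis are implemented: the class names `WorkQueue` and `EventLog` are made up for illustration, and details such as visibility timeouts, shards, and retention limits are left out.

```python
from collections import deque


class WorkQueue:
    """Queue semantics: each message goes to one consumer and is then gone."""

    def __init__(self):
        self._messages = deque()

    def send(self, body):
        self._messages.append(body)

    def receive(self):
        # Competing consumers: whoever polls first takes the message.
        return self._messages.popleft() if self._messages else None


class EventLog:
    """Stream semantics: records stay in the log; each reader tracks its own position."""

    def __init__(self):
        self._records = []
        self._positions = {}  # reader name -> index of the next record to read

    def put(self, record):
        self._records.append(record)

    def read(self, reader):
        start = self._positions.get(reader, 0)
        self._positions[reader] = len(self._records)
        return self._records[start:]

    def rewind(self, reader, position=0):
        # Replay: move the reader back so retained records are read again.
        self._positions[reader] = position


queue = WorkQueue()
log = EventLog()
for item in ["a", "b", "c"]:
    queue.send(item)
    log.put(item)

worker_1 = queue.receive()  # one worker takes the first message
worker_2 = queue.receive()  # the other worker gets the next one, never the first

analytics = log.read("analytics")  # every record
audit = log.read("audit")          # the same records, read independently
log.rewind("analytics")
replayed = log.read("analytics")   # the same records again
```

The queue hands each item to exactly one worker, while the log lets every reader see everything and go back in time, which is why the table above pairs SQS with task processing and Kinesis with analytics and replay.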

Quick Decision Flowchart Core
  • Need to buffer work for later processing? → SQS
  • One event, many consumers immediately? → SNS (or SNS + SQS for durability)
  • Complex event routing rules? → EventBridge
  • SaaS integrations? → EventBridge (has native connectors)
  • Real-time streaming at massive scale? → Kafka (MSK) or Kinesis
  • Need to replay events? → Kafka (log retained for a configurable period) or EventBridge Archive
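
As a rough illustration, the flowchart can be read as an ordered list of questions where the first "yes" wins. The requirement keys below (`buffer_work`, `fan_out`, and so on) are invented labels for this sketch, not AWS terminology, and a real design decision would weigh more factors than a single lookup.

```python
# Ordered (requirement, suggestion) pairs mirroring the flowchart, top to bottom.
FLOWCHART = [
    ("buffer_work", "SQS"),
    ("fan_out", "SNS (or SNS + SQS for durability)"),
    ("routing_rules", "EventBridge"),
    ("saas_integration", "EventBridge"),
    ("massive_streaming", "Kafka (MSK) or Kinesis"),
    ("replay", "Kafka or EventBridge Archive"),
]


def choose_service(needs):
    """Return the suggestion for the first flowchart question that applies."""
    for requirement, suggestion in FLOWCHART:
        if requirement in needs:
            return suggestion
    return "no match: revisit the requirements"


print(choose_service({"buffer_work"}))
print(choose_service({"replay", "routing_rules"}))
```

Because the questions are checked in order, a workload that needs both routing rules and replay lands on EventBridge, which also happens to offer archive and replay.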
🎓 Exam Insight
  • "Decouple services, buffer requests" → SQS
  • "Fan-out to multiple consumers" → SNS (or SNS + SQS)
  • "Route events based on content" → EventBridge
  • "Real-time analytics, millions/sec" → Kinesis Data Streams or MSK (Kafka)
  • "Integrate with third-party SaaS" → EventBridge (has partner sources)
  • These services complement each other; combinations are common and expected
👉 Key Takeaway

SQS = buffer, SNS = broadcast, EventBridge = route, Kafka = stream. They solve different problems and often work together, so choose based on your specific pattern rather than treating them as competitors.

08
Chapter Eight

Security, Reliability & Scaling

Access Control: IAM vs Queue Policy Core

SQS access is controlled by two mechanisms: IAM policies (attached to users/roles) and SQS queue policies (attached to queues). Within a single account, an Allow in either policy is enough, provided nothing explicitly denies the action. For cross-account access, both the caller's IAM policy and the queue policy must allow it.

👤

IAM Policy

  • Attached to IAM user, role, or group
  • Controls what that identity can do
  • "Role X can send to any queue in account"
  • Use for: same-account access, EC2/Lambda roles
🗂️

Queue Policy

  • Attached to the queue itself
  • Controls who can access this queue
  • "Allow Account B to send to this queue"
  • Use for: cross-account access, AWS service access

| Scenario | Use IAM Policy | Use Queue Policy |
|---|---|---|
| Lambda in same account sends to queue | Yes: attach to Lambda role | Not required |
| Another AWS account sends to your queue | Not sufficient alone | Yes: must allow principal |
| SNS topic sends to queue | Not required | Yes: allow SNS service |
| S3 event sends to queue | Not required | Yes: allow S3 service |
| Restrict which queues a role can access | Yes: specify queue ARN | Not the right tool |

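
To make the queue-policy rows concrete, here is a minimal sketch of a policy document that lets one SNS topic and one other account send to a queue. The ARNs and the account ID are placeholders invented for this example; the `aws:SourceArn` condition is a common way to limit the SNS grant to a single topic rather than every topic.

```python
import json

# Hypothetical identifiers, used only for illustration.
QUEUE_ARN = "arn:aws:sqs:us-east-1:111122223333:orders-queue"
TOPIC_ARN = "arn:aws:sns:us-east-1:111122223333:order-events"
PARTNER_ACCOUNT_ID = "444455556666"

queue_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            # Service access: only the named topic may deliver to the queue.
            "Sid": "AllowSnsTopicToSend",
            "Effect": "Allow",
            "Principal": {"Service": "sns.amazonaws.com"},
            "Action": "sqs:SendMessage",
            "Resource": QUEUE_ARN,
            "Condition": {"ArnEquals": {"aws:SourceArn": TOPIC_ARN}},
        },
        {
            # Cross-account access: the other account still needs its own IAM allow.
            "Sid": "AllowPartnerAccountToSend",
            "Effect": "Allow",
            "Principal": {"AWS": f"arn:aws:iam::{PARTNER_ACCOUNT_ID}:root"},
            "Action": "sqs:SendMessage",
            "Resource": QUEUE_ARN,
        },
    ],
}

policy_json = json.dumps(queue_policy, indent=2)
print(policy_json)
```

The resulting JSON string is what would be stored in the queue's `Policy` attribute.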
Encryption: At Rest and In Transit Core
🔐

Encryption at Rest (SSE)

  • Enable Server-Side Encryption (SSE)
  • SQS-managed key (SSE-SQS): free, automatic
  • KMS key (SSE-KMS): AWS managed or customer managed, more control, KMS charges apply
  • Messages encrypted when stored in SQS
  • Decrypted transparently when received (with SSE-KMS, consumers also need kms:Decrypt on the key)
🔒

Encryption in Transit

  • AWS SDKs and the CLI call the SQS API over HTTPS by default
  • TLS 1.2+ for all connections
  • No configuration required
  • To enforce it: deny requests where aws:SecureTransport is false in the queue policy
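
The `aws:SecureTransport` condition is usually applied as an explicit Deny in the queue policy. The sketch below builds such a statement; the queue ARN is a placeholder, and in practice the statement would be merged with whatever Allow statements the queue already has.

```python
import json

QUEUE_ARN = "arn:aws:sqs:us-east-1:111122223333:orders-queue"  # hypothetical

deny_insecure_transport = {
    "Sid": "DenyInsecureTransport",
    "Effect": "Deny",
    "Principal": "*",
    "Action": "sqs:*",
    "Resource": QUEUE_ARN,
    # The key is "false" when the request did not arrive over TLS.
    "Condition": {"Bool": {"aws:SecureTransport": "false"}},
}

queue_policy = {"Version": "2012-10-17", "Statement": [deny_insecure_transport]}
print(json.dumps(queue_policy, indent=2))
```

Because an explicit Deny overrides any Allow, this blocks plain-HTTP requests even from principals that are otherwise permitted.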
SQS Extended Client: Messages Larger Than 256 KB In-Depth

The SQS message size limit is 256 KB. For larger payloads, use the SQS Extended Client, a separate open-source library that wraps the AWS SDK's SQS client. It automatically stores large payloads in S3 and puts only a reference in the SQS message.

📦

How It Works

  • Large payload (>256 KB) → stored in S3
  • SQS message contains S3 reference (s3://bucket/key)
  • Consumer client retrieves from S3 automatically
  • Messages up to 2 GB (the library's limit; S3 itself allows larger objects)
⚠️

Limitations

  • Not supported for FIFO queues
  • Additional S3 cost (storage + GET/PUT)
  • AWS publishes the library for Java and Python; Node.js relies on community packages
  • Alternative: store in S3 manually, send URI in message
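
The manual alternative (store the payload, send a pointer) can be sketched without any AWS calls. In the example below a plain dictionary stands in for the S3 bucket, and `build_message`, `read_message`, and the bucket name are hypothetical names chosen for illustration; a real implementation would replace the dictionary writes and reads with S3 PutObject and GetObject calls. It also assumes the payload is UTF-8 text.

```python
import json
import uuid

MAX_SQS_BYTES = 256 * 1024  # message size limit quoted in this section


def build_message(payload, store, bucket="example-payload-bucket"):
    """Return an SQS message body: the payload inline, or a pointer if too large."""
    inline_body = json.dumps({"inline": payload.decode("utf-8")})
    if len(inline_body.encode("utf-8")) <= MAX_SQS_BYTES:
        return inline_body
    key = f"payloads/{uuid.uuid4()}"
    store[(bucket, key)] = payload  # stand-in for an S3 PutObject call
    return json.dumps({"s3_uri": f"s3://{bucket}/{key}"})


def read_message(body, store):
    """Resolve a message body back into the original payload bytes."""
    message = json.loads(body)
    if "inline" in message:
        return message["inline"].encode("utf-8")
    bucket, key = message["s3_uri"][len("s3://"):].split("/", 1)
    return store[(bucket, key)]  # stand-in for an S3 GetObject call


fake_s3 = {}
small_body = build_message(b"order 42", fake_s3)
large_body = build_message(b"x" * (300 * 1024), fake_s3)

print(len(small_body), len(large_body), len(fake_s3))
```

The large message body stays tiny because it carries only the `s3://` reference, which is the same idea the Extended Client automates.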
VPC Endpoint: Private Access In-Depth

By default, SQS API calls go to the service's public endpoint, so the caller needs a route out of the VPC. For EC2/Lambda in private subnets with no NAT, create a VPC interface endpoint (AWS PrivateLink) for SQS. Traffic then stays within the AWS network.

🌐

Without VPC Endpoint

  • Traffic goes via Internet Gateway or NAT
  • Private subnet workloads need NAT Gateway
  • NAT adds cost and can become a throughput bottleneck
🔐

With VPC Endpoint

  • Traffic stays within AWS network
  • No Internet Gateway or NAT required
  • Lower latency, higher security
  • Cost: ~$0.01/hr per AZ + data fees
Message Retention & Durability Core

| Setting | Default | Range | Notes |
|---|---|---|---|
| Message retention | 4 days | 1 minute to 14 days | Messages deleted after this period if not processed |
| Visibility timeout | 30 seconds | 0 to 12 hours | How long a message is hidden during processing |
| Message size | n/a | 1 byte to 256 KB | For larger payloads, store in S3 and send a pointer |
| Delay queue | 0 seconds | 0 to 15 minutes | Messages invisible for this period after send |
| Receive wait time | 0 seconds | 0 to 20 seconds | Long polling wait time (set to 20 for efficiency) |

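
These settings map onto queue attributes that take values in seconds and are passed as strings. The sketch below validates a set of values against the ranges in the table and produces an attribute map in that shape. The helper `queue_attributes` is hypothetical; the attribute names are the ones SQS uses for these settings.

```python
# (minimum, maximum) in seconds, taken from the table above.
LIMITS = {
    "MessageRetentionPeriod": (60, 14 * 24 * 3600),
    "VisibilityTimeout": (0, 12 * 3600),
    "DelaySeconds": (0, 15 * 60),
    "ReceiveMessageWaitTimeSeconds": (0, 20),
}


def queue_attributes(**settings):
    """Check each setting against its range and return string-valued attributes."""
    attributes = {}
    for name, value in settings.items():
        low, high = LIMITS[name]
        if not low <= value <= high:
            raise ValueError(f"{name}={value} is outside {low}..{high} seconds")
        attributes[name] = str(value)
    return attributes


attrs = queue_attributes(
    MessageRetentionPeriod=4 * 24 * 3600,  # 4 days, the default
    VisibilityTimeout=30,
    DelaySeconds=0,
    ReceiveMessageWaitTimeSeconds=20,      # long polling
)
print(attrs)
```

A map like `attrs` is the kind of value a create-queue or set-attributes call would receive; validating locally just surfaces out-of-range values before the API rejects them.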
Scaling Consumers: Auto Scaling on Queue Depth Core

The best way to scale SQS consumers is based on queue depth, the number of messages waiting. Use the CloudWatch metric ApproximateNumberOfMessagesVisible (the CloudWatch name for the queue attribute ApproximateNumberOfMessages) to trigger Auto Scaling.

📊

Key CloudWatch Metrics

  • ApproximateNumberOfMessagesVisible: messages waiting
  • ApproximateNumberOfMessagesNotVisible: in-flight
  • ApproximateAgeOfOldestMessage: queue lag
  • NumberOfMessagesReceived: throughput
  • NumberOfMessagesSent: producer rate
⚡

Auto Scaling Strategy

  • Target tracking: "keep backlog per instance at 1000"
  • Scale out when: ApproximateNumberOfMessagesVisible / InService instance count > 1000
  • Scale in when: backlog cleared
  • Use ApproximateAgeOfOldestMessage for SLA alarms

👉 Best practice formula: Target = (Acceptable latency in seconds) × (Messages processed per second per instance). If each instance processes 10 msg/sec and you want max 60s latency, target = 600 messages per instance.
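
The formula is simple enough to check by hand, but a short worked example shows how the target turns into an instance count. The function names and the 15,000-message backlog are made up for this sketch.

```python
import math


def backlog_target(acceptable_latency_s, msgs_per_sec_per_instance):
    """Messages one instance can have queued and still meet the latency goal."""
    return acceptable_latency_s * msgs_per_sec_per_instance


def instances_needed(visible_messages, target_per_instance):
    """Smallest fleet size that keeps backlog per instance at or under the target."""
    return max(1, math.ceil(visible_messages / target_per_instance))


target = backlog_target(60, 10)            # 60 s latency, 10 msg/sec per instance
needed = instances_needed(15_000, target)  # hypothetical backlog of 15,000 messages

print(target, needed)
```

With a target of 600 messages per instance, a backlog of 15,000 messages calls for 25 instances; a target-tracking policy converges on the same number by adjusting capacity until the backlog-per-instance metric sits at the target.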

Lambda as Consumer: Event Source Mapping Core

Lambda can poll SQS automatically via an event source mapping. No need for EC2 workers. Lambda scales the number of concurrent invocations as the backlog grows.

✅

Lambda + SQS Benefits

  • No infrastructure to manage
  • Auto-scales with queue depth
  • Pay only for invocations
  • Built-in retry + DLQ support (visibility timeout plus the queue's redrive policy)
  • Default batch size of 10 messages; standard queues allow up to 10,000 with a batching window, FIFO queues up to 10
⚠️

Lambda + SQS Limits

  • Max 15 min execution time per invocation (the whole batch, not each message)
  • 1000 concurrent executions default (account-wide per Region, can increase)
  • FIFO queue: concurrency is capped by the number of active message groups (one batch in flight per group)
  • Cold starts add latency on scale-out
  • Not ideal for very long-running jobs
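
A handler for SQS batches might look like the sketch below. It reports only the failed messages back to Lambda using the `batchItemFailures` response shape, so the successful ones in the batch are not retried. This assumes the event source mapping has `ReportBatchItemFailures` enabled; without that setting, the return value is ignored and any raised exception retries the whole batch. The `process` function and the sample event are hypothetical.

```python
import json


def process(order):
    """Hypothetical business logic; raises when the message is unusable."""
    if "order_id" not in order:
        raise ValueError("missing order_id")


def handler(event, context):
    failures = []
    for record in event["Records"]:
        try:
            process(json.loads(record["body"]))
        except Exception:
            # Only this message becomes visible again for a retry.
            failures.append({"itemIdentifier": record["messageId"]})
    return {"batchItemFailures": failures}


# Local smoke test with a hand-built event containing one good and one bad message.
sample_event = {
    "Records": [
        {"messageId": "m-1", "body": json.dumps({"order_id": 1})},
        {"messageId": "m-2", "body": json.dumps({"sku": "no-id"})},
    ]
}
result = handler(sample_event, None)
print(result)
```

Messages that keep failing are eventually moved to the DLQ by the queue's redrive policy once they exceed its maximum receive count.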
Monitoring & Alarms Core

| Alarm | Metric | Threshold Example | Why |
|---|---|---|---|
| Queue backlog | ApproximateNumberOfMessagesVisible | > 10,000 for 5 min | Consumers falling behind |
| Processing lag | ApproximateAgeOfOldestMessage | > 300 sec | SLA violation risk |
| DLQ messages | ApproximateNumberOfMessagesVisible (DLQ) | > 0 | Messages failing repeatedly |
| Empty receives | NumberOfEmptyReceives | > 1000/min | Enable long polling |

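
The alarm table translates fairly directly into CloudWatch alarm definitions. The sketch below only builds the parameter dictionaries (in the shape the PutMetricAlarm API expects) and makes no AWS calls. The queue names and the `sqs_alarm` helper are invented for illustration. Note that CloudWatch publishes queue depth under the metric name `ApproximateNumberOfMessagesVisible`.

```python
def sqs_alarm(name, queue_name, metric, threshold, periods=1, statistic="Maximum"):
    """Return alarm parameters for one SQS metric on one queue."""
    return {
        "AlarmName": name,
        "Namespace": "AWS/SQS",
        "MetricName": metric,
        "Dimensions": [{"Name": "QueueName", "Value": queue_name}],
        "Statistic": statistic,
        "Period": 60,                 # evaluate one-minute data points
        "EvaluationPeriods": periods,
        "Threshold": threshold,
        "ComparisonOperator": "GreaterThanThreshold",
    }


alarms = [
    sqs_alarm("orders-backlog", "orders-queue",
              "ApproximateNumberOfMessagesVisible", 10_000, periods=5),
    sqs_alarm("orders-lag", "orders-queue",
              "ApproximateAgeOfOldestMessage", 300),
    sqs_alarm("orders-dlq-not-empty", "orders-dlq",
              "ApproximateNumberOfMessagesVisible", 0),
    sqs_alarm("orders-empty-receives", "orders-queue",
              "NumberOfEmptyReceives", 1_000, statistic="Sum"),
]

for alarm in alarms:
    print(alarm["AlarmName"], alarm["MetricName"], alarm["Threshold"])
```

Each dictionary could be passed as keyword arguments to a CloudWatch client's alarm-creation call; the thresholds are the examples from the table and should be tuned per workload.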
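Cost Calculation Examples In-Depth

SQS pricing is simple: $0.40 per 1M requests for Standard, $0.50 per 1M for FIFO. Batching and long polling are the main optimization levers. The figures below assume two billed requests per message (one send, one receive), batches of 10 in the batching column, and no free tier; also counting the DeleteMessage call raises each figure by half.

| Workload | Messages/Day | Without Batching | With Batching |
|---|---|---|---|
| Small e-commerce | 1,000 orders | $0.0008/day | $0.00008/day |
| Medium app | 10,000 msg | $0.008/day | $0.0008/day |
| Large scale | 10M msg | $8/day ($240/mo) | $0.80/day ($24/mo) |
| FIFO | 100K msg | $0.10/day | $0.01/day |
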
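
The table's numbers can be reproduced with a few lines of arithmetic. The sketch below assumes two billed requests per message (one send and one receive), which is the assumption that matches the table; counting the delete call as a third request would raise every figure by 50%. The function name `daily_cost` is made up for this example.

```python
import math

PRICE_PER_MILLION = {"standard": 0.40, "fifo": 0.50}  # USD, from the text above
REQUESTS_PER_MESSAGE = 2  # assumption: one send + one receive per message


def daily_cost(messages_per_day, queue_type="standard", batch_size=1):
    """Estimate request charges per day, ignoring the free tier and data transfer."""
    batches = math.ceil(messages_per_day / batch_size)
    requests = batches * REQUESTS_PER_MESSAGE
    return requests / 1_000_000 * PRICE_PER_MILLION[queue_type]


workloads = [
    ("Small e-commerce", 1_000, "standard"),
    ("Medium app", 10_000, "standard"),
    ("Large scale", 10_000_000, "standard"),
    ("FIFO", 100_000, "fifo"),
]

for name, volume, queue_type in workloads:
    single = daily_cost(volume, queue_type)
    batched = daily_cost(volume, queue_type, batch_size=10)
    print(f"{name}: ${single:.5f}/day unbatched, ${batched:.5f}/day batched")
```

Batching ten messages per request divides the request count, and therefore the bill, by ten, which is where the tenfold saving in the right-hand column comes from.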
💰

Cost Optimization Checklist

  • ✅ Use batch APIs: up to 10x fewer requests
  • ✅ Enable long polling (WaitTimeSeconds=20)
  • ✅ Same region for sender and consumer
  • ✅ Delete unused queues
  • ✅ Monitor with Cost Explorer (filter: "Requests")
💡

Data Transfer Notes

  • SQS → Lambda (same region): free
  • SQS → EC2 (same region): free
  • Cross-region: ~$0.02/GB (avoid if possible)
  • SQS → Internet: ~$0.09/GB
Security Best Practices Core
🔐

Enable SSE

Always enable encryption at rest. Use SSE-SQS (free) or SSE-KMS (for compliance). No excuse for unencrypted queues.

🔒

Least Privilege

Grant only needed actions: sqs:SendMessage for producers, sqs:ReceiveMessage + sqs:DeleteMessage for consumers.

📝

Enable Logging

Use CloudTrail to log all SQS API calls. Monitor for unexpected access patterns or unauthorized attempts.
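
The least-privilege advice above can be expressed as two small IAM policy documents, one per role. This is a sketch with a placeholder queue ARN; the `iam_policy` helper is hypothetical, and a consumer that extends processing time would also need `sqs:ChangeMessageVisibility`.

```python
import json

QUEUE_ARN = "arn:aws:sqs:us-east-1:111122223333:orders-queue"  # hypothetical


def iam_policy(actions):
    """Build an identity policy that allows only the given actions on one queue."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {"Effect": "Allow", "Action": actions, "Resource": QUEUE_ARN}
        ],
    }


producer_policy = iam_policy(["sqs:SendMessage"])
consumer_policy = iam_policy(["sqs:ReceiveMessage", "sqs:DeleteMessage"])

print(json.dumps(producer_policy, indent=2))
print(json.dumps(consumer_policy, indent=2))
```

Scoping `Resource` to a single queue ARN, rather than `*`, keeps a compromised producer or consumer from touching other queues in the account.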

🎓 Exam Insight
  • "Cross-account queue access" → Requires queue policy (not just IAM)
  • "SNS publishes to SQS" → Queue policy must allow sns.amazonaws.com
  • "Encrypt messages at rest" → Enable SSE-SQS or SSE-KMS
  • "Private subnet access to SQS" → VPC Interface Endpoint
  • "Scale consumers on queue size" → Auto Scaling on ApproximateNumberOfMessages (CloudWatch metric: ApproximateNumberOfMessagesVisible)
  • "Reduce SQS costs" → Enable long polling (WaitTimeSeconds=20)
👉 Key Takeaway

SQS is designed for production: enable SSE for encryption, use queue policies for cross-account/service access, scale consumers on queue depth, and always configure a DLQ. Monitor ApproximateAgeOfOldestMessage for SLA compliance.

Amazon SQS · Complete
  • SQS is a managed message queue: producers enqueue, consumers poll and process at their own pace
  • Core model: Send → Store (up to 14 days) → Poll → Process → Delete
  • Visibility timeout: message hidden during processing; returns to queue if not deleted
  • Standard queue: near-unlimited throughput, best-effort order, at-least-once delivery
  • FIFO queue: strict ordering, exactly-once processing, 3,000 msg/sec with batching (more in high-throughput mode)
  • Dead-Letter Queue: catches messages that fail repeatedly; essential in production
  • SNS + SQS fan-out: SNS broadcasts, each consumer has its own SQS buffer
  • Security: IAM + queue policies, SSE encryption, VPC endpoints for private access
  • Scaling: Auto Scaling on queue depth (CloudWatch metric ApproximateNumberOfMessagesVisible); Lambda event source mapping for serverless
  • vs other services: SQS = buffer, SNS = broadcast, EventBridge = route, Kafka = stream