Amazon S3
Simple Storage Service
Unlimited object storage in the cloud. The backbone of data lakes, backups, static websites, and CDN origins: infinitely scalable, 11 nines durable.
S3 in 30 Seconds
- Object storage: store any file of any size, retrieved by a unique key (URL)
- Unlimited capacity: no pre-provisioning, no disk management
- 99.999999999% (11 nines) durability: data replicated across at least 3 AZs automatically
- Multiple storage classes: optimize cost from millisecond access to archival
- Integrated with almost every AWS service: the default data layer for AWS
What is S3
Amazon S3 (Simple Storage Service) is AWS's object storage service. It lets you store and retrieve any amount of data (files, images, videos, backups, logs, ML datasets) from anywhere on the internet. Unlike a hard drive with folders and files, S3 stores data as objects inside buckets, each identified by a unique key.
Think of S3 as: an infinite hard drive in the cloud. Pay only for what you store, access from anywhere.
S3 was one of the first AWS services, launched in 2006. Today it stores trillions of objects and handles millions of requests per second across AWS customers. It is the most-used AWS storage service and the foundation of most data architectures on AWS.
Traditional File Storage Problems
- Fixed disk capacity: buy hardware before you need it
- Disks fail: complex RAID and backup setups required
- Not globally accessible: VPN or network share required
- Scaling is slow: days/weeks to add capacity
- High upfront capital cost
S3 Solves
- Unlimited capacity: grows automatically with your data
- AWS manages replication and durability: 11 nines
- Accessible over HTTPS from anywhere, any device
- Add storage instantly: no provisioning required
- Pay per GB stored + requests: no upfront cost
Understanding the storage type is critical for choosing the right service:
| Type | How It Works | AWS Service | Best For |
|---|---|---|---|
| Object Storage | Flat namespace: key → object. No folders. Access via HTTP. | S3 | Files, images, backups, data lakes, logs |
| Block Storage | Raw disk blocks. OS mounts it like a hard drive. Low latency. | EBS | Databases, boot volumes, OS-level read/write |
| File Storage | Shared filesystem with directories. NFS protocol. | EFS | Shared access across multiple EC2 instances |
S3 is not a filesystem. You cannot "mount" S3 like a drive or run a database on it. It is optimized for storing and retrieving whole objects via HTTP, not for random read/write of small byte ranges.
S3 is referenced by almost every AWS service:
Data & Analytics
Data lakes (Athena, Glue, Redshift Spectrum). S3 is the raw storage layer: query data in-place without loading into a database.
Web & Applications
Static website hosting (HTML/CSS/JS), user uploads, media assets, and application configuration files stored in S3.
DevOps & Infrastructure
CloudFormation templates, Lambda deployment packages, CodePipeline artifacts, and EC2 AMI snapshots are all stored in S3.
Backup & Compliance
AWS Backup destinations, CloudTrail audit logs, VPC flow logs, config history, and compliance archives all land in S3.
Machine Learning
SageMaker training datasets, model artifacts, and inference results. S3 is the default ML data store on AWS.
CDN Origin
CloudFront uses S3 as an origin to cache and serve content globally with low latency: the standard pattern for static assets.
Think of S3 like a post office with infinite numbered mailboxes:
The Post Office = Bucket
- A named container for objects
- Name must be globally unique across all AWS accounts
- Lives in one AWS region: data does not leave unless you replicate
- You own and control the bucket policies and access
- Up to 100 buckets per account (soft limit, can be raised)
The Package = Object
- Any file: image, video, CSV, zip, binary, JSON
- Up to 5 TB per object (use Multipart Upload above 100 MB)
- Identified by a unique key (like a full file path)
- Includes metadata: content-type, custom tags, system attributes
- Immutable: to update, you replace the entire object
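A minimal boto3 sketch of this bucket/object model. Bucket name, key, and local file are placeholders; the point is that you PUT whole objects under a key, with a content type and optional user metadata, and GET them back the same way.

```python
import boto3

s3 = boto3.client("s3")

# Upload an object. The key "images/2026/logo.png" is just a string,
# not a real folder path; the slashes are cosmetic.
with open("logo.png", "rb") as f:
    s3.put_object(
        Bucket="my-example-bucket",
        Key="images/2026/logo.png",
        Body=f,
        ContentType="image/png",          # system metadata
        Metadata={"team": "frontend"},    # user-defined metadata
    )

# Read it back. Objects are immutable: re-uploading the same key
# replaces the whole object rather than patching bytes in place.
obj = s3.get_object(Bucket="my-example-bucket", Key="images/2026/logo.png")
print(obj["ContentType"], len(obj["Body"].read()))
```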
Two different guarantees, both important, often confused on the exam:
Durability: 99.999999999%
- Will your data survive? Yes, 11 nines
- AWS stores multiple copies across at least 3 AZs automatically
- Designed to tolerate concurrent loss of data in 2 facilities
- Losing stored data in S3 Standard is essentially impossible
- Same for all storage classes except S3 One Zone-IA (single AZ)
Availability: 99.99%
- Can you access it right now? 99.99% of the time
- ~52 minutes downtime per year on S3 Standard
- Varies by storage class: S3 Standard-IA = 99.9%, One Zone-IA = 99.5%
- Glacier classes are slower to reach: retrievals take minutes to hours (except Glacier Instant Retrieval)
| Use Case | How S3 Is Used | Why It Works |
|---|---|---|
| Static Website Hosting | Serve HTML/CSS/JS from a bucket with public access | No server needed; scales to any traffic automatically |
| Database Backups | Dump files pushed to S3 on a schedule | Cheap, durable, cross-region replication available |
| User Uploads | Presigned URLs let users upload directly to S3 | Bypass your app server for large files |
| Data Lake | Raw data (JSON, Parquet, CSV) stored in S3, queried with Athena | Decouple storage from compute; pay per query |
| Log Archive | CloudTrail, ALB access logs, VPC flow logs → S3 | Long-term storage, lifecycle to Glacier after 90 days |
| CDN Origin | CloudFront serves from S3 origin globally | Edge caching + S3 durability = best of both worlds |
Since December 2020, Amazon S3 provides strong read-after-write consistency for all operations, at no additional cost and with no performance impact. This was a major change from S3's original eventual consistency model.
Current Behavior (Strong Consistency)
- PUT a new object → immediately readable by all subsequent GETs
- Overwrite an existing object → next GET returns the new version
- DELETE an object → next GET returns 404
- LIST operations reflect the latest state
- Applies to all storage classes, all regions
Old Behavior (Pre-2020, No Longer Applies)
- New objects: read-after-write consistent (same as now)
- Overwrites and deletes: eventually consistent, so you could read stale data
- LIST after PUT: object might not appear immediately
- This is in many older study guides; it is outdated
Exam note: S3 is now strongly consistent for all operations. If a question references eventual consistency for S3, the correct answer is strong read-after-write consistency. Older materials mentioning eventual consistency for overwrites are outdated.
S3 is unlimited, durable object storage and the default data layer for AWS. If you need to store a file in AWS, S3 is the answer 90% of the time.
- Object storage: files stored as objects with a unique key, retrieved via HTTP. Not a filesystem or database.
- Unlimited capacity: no pre-provisioning. Pay per GB stored + per request made.
- 11 nines durability: data replicated across at least 3 AZs automatically. AWS manages it.
- Object vs Block vs File: S3 = objects (HTTP). EBS = block (disk). EFS = file (NFS mount).
- Strong consistency: all operations (PUT, DELETE, LIST) are strongly consistent since 2020. No eventual consistency.
- Used everywhere: backups, data lakes, static websites, ML datasets, CDN origin, DevOps artifacts.
- Durability ≠ Availability: 11 nines = data won't disappear. 99.99% = you can access it almost always.
Core Concepts & Storage Model
A bucket is the top-level container for objects in S3. Every object lives inside a bucket. Buckets are created in a specific AWS region and data does not leave that region unless you explicitly configure replication.
Globally Unique Name
Bucket names must be unique across all AWS accounts globally, not just your account. If my-company-data is taken by anyone in the world, you cannot use it.
Regional Resource
A bucket is created in one region (e.g., us-east-1). Choose the region closest to your users or compute workload to minimize latency and data transfer costs.
Naming Rules
- 3-63 characters long
- Lowercase letters, numbers, hyphens only
- Cannot start or end with a hyphen
- Cannot be formatted as an IP address
An object is the fundamental unit of data in S3. It consists of the data itself plus metadata. Every object is identified by a key: a string that uniquely identifies the object within its bucket.
| Component | What It Is | Example |
|---|---|---|
| Key | The full "path" of the object within the bucket | images/2026/logo.png |
| Value | The actual data β any bytes, any format | Binary PNG file data |
| Version ID | Unique ID per version (when versioning is enabled) | ab3c4de5fg6h |
| Metadata | Key-value pairs describing the object | Content-Type: image/png |
| Tags | User-defined labels for cost allocation or access control | env=prod, team=frontend |
| ETag | Hash of the object (an MD5 digest for single-part, non-KMS uploads) used to verify integrity | d41d8cd98f00b204e9800998ecf8427e |
S3 has no real folders: the key images/2026/logo.png is just a string. The AWS console displays the slash as a folder, but it is purely cosmetic. This matters for prefix-based performance optimization.
Single PUT Upload
Max 5 GB per PUT request. For anything larger, use Multipart Upload. AWS recommends Multipart for objects above 100 MB.
Multipart Upload
Split large files into parts (min 5 MB, max 10,000 parts). Upload parts in parallel. Combine on S3. Required for objects above 5 GB.
Maximum Object Size
A single object can be up to 5 TB. No limit on bucket total size: store petabytes in one bucket if needed.
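A short sketch of the upload limits in practice: boto3's high-level transfer manager switches to Multipart Upload automatically once a file crosses a configurable threshold. The bucket, key, and file names are placeholders.

```python
import boto3
from boto3.s3.transfer import TransferConfig

s3 = boto3.client("s3")

# Use multipart above 100 MB, upload 64 MB parts in parallel.
# (Minimum part size is 5 MB, except for the last part.)
config = TransferConfig(
    multipart_threshold=100 * 1024 * 1024,
    multipart_chunksize=64 * 1024 * 1024,
    max_concurrency=8,
)

s3.upload_file("backup.tar.gz", "my-example-bucket",
               "backups/backup.tar.gz", Config=config)
```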
S3 offers multiple storage classes, each optimized for different access frequency and cost profiles. You pay less per GB for classes you access less frequently, but you pay a retrieval fee when you do access them.
| Storage Class | Access Pattern | Availability | Retrieval Fee | Best For |
|---|---|---|---|---|
| S3 Standard | Frequent access | 99.99% | None | Active data, websites, apps |
| S3 Intelligent-Tiering | Unknown / changing | 99.9% | None | Data with unpredictable patterns |
| S3 Standard-IA | Infrequent (monthly) | 99.9% | Per GB retrieved | Backups, disaster recovery |
| S3 One Zone-IA | Infrequent, single AZ | 99.5% | Per GB retrieved | Re-creatable data, secondary backups |
| S3 Glacier Instant | Rare (quarterly) | 99.9% | Per GB retrieved | Archive with instant access |
| S3 Glacier Flexible | Rare; minutes to hours | 99.99% | Per GB + request | Compliance archives, tape replacement |
| S3 Glacier Deep Archive | Very rare; 12h retrieval | 99.99% | Per GB + request | 7-10 year regulatory archives |
Versioning keeps multiple versions of an object in the same bucket. Every time you overwrite or delete an object, S3 creates a new version instead of destroying the old one.
Why Enable Versioning
- Recover from accidental overwrites and deletes
- Required prerequisite for S3 Replication
- Required for S3 Object Lock (compliance)
- Enables audit trail β who changed what, when
- Deletes create a "delete marker"; the data is still there
Versioning Trade-offs
- Storage cost grows: every version is billed separately
- Once enabled, cannot be fully disabled, only suspended
- Need lifecycle rules to expire old versions automatically
- Deleting a versioned object requires deleting ALL versions
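A minimal boto3 sketch of versioning in practice (bucket and prefix are placeholders): enable it once, then list versions to see how overwrites and deletes accumulate.

```python
import boto3

s3 = boto3.client("s3")

# Turn versioning on; afterwards it can only be suspended, never removed.
s3.put_bucket_versioning(
    Bucket="my-example-bucket",
    VersioningConfiguration={"Status": "Enabled"},
)

# Every overwrite now creates a new version; deletes add a delete marker.
resp = s3.list_object_versions(Bucket="my-example-bucket", Prefix="reports/")
for v in resp.get("Versions", []):
    print(v["Key"], v["VersionId"], v["IsLatest"])

# Recovering from an accidental delete means deleting the delete marker,
# e.g. s3.delete_object(..., VersionId="<delete-marker-id>")  # hypothetical ID
```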
System Metadata
- Set by AWS: Content-Type, Content-Length, Last-Modified
- Content-Type is critical: browsers use it to render objects correctly
- Set at upload time, cannot always be changed retroactively
User-Defined Tags
- Up to 10 key-value pairs per object
- Used for cost allocation reports (group by team, env, project)
- Used in lifecycle rules: apply rules to tagged objects only
- Used in IAM/bucket policies: grant access based on tags
Understanding request types matters for cost calculation, because you pay per request:
| Request Type | Operation | Relative Cost |
|---|---|---|
| PUT / COPY / POST / LIST | Write or list operations | Higher ($0.005 per 1,000) |
| GET / SELECT | Read object data | Lower ($0.0004 per 1,000) |
| DELETE | Delete object | Free |
| Lifecycle transitions | Move object between storage classes | Per-transition fee |
S3's storage model is simple: buckets hold objects, objects have keys and metadata. The storage class you choose determines cost and access speed, so match it to how frequently you access the data.
- Buckets: globally unique named containers, tied to one region. Up to 100 per account (soft limit).
- Objects: data + metadata + tags. Max 5 TB. Use Multipart Upload above 100 MB.
- Keys: the full "path" string identifying an object. No real folders; slashes are cosmetic.
- Storage classes: Standard (frequent) → IA (monthly) → Glacier (rare) → Deep Archive (years). Lower cost = retrieval fee.
- Versioning: keeps all versions on overwrite/delete. Enables recovery. Required for replication and Object Lock.
- Metadata & Tags: Content-Type is critical. Tags drive cost allocation, lifecycle rules, and access control.
Security & Access Control
By default, all S3 buckets and objects are private. Nothing is publicly accessible unless you explicitly allow it. Access to S3 is controlled through multiple overlapping layers, and understanding which layer applies when is the key to both security and the SAA-C03 exam.
IAM Policies
Attached to users, groups, or roles. Define what AWS identities can do to S3. Evaluated by IAM before the request even reaches S3.
Bucket Policies
Attached to the bucket itself. Resource-based policy in JSON. Can grant access to other AWS accounts, services, and the public. Most powerful S3 access tool.
ACLs (Legacy)
Object or bucket-level access control lists. Predates IAM. AWS recommends disabling ACLs and using bucket policies instead. Still appears on exams.
IAM policies grant S3 permissions to AWS identities. The identity must have permissions AND the bucket policy must allow (or at least not deny) the request.
| IAM Action | What It Allows |
|---|---|
| s3:GetObject | Download / read an object |
| s3:PutObject | Upload / write an object |
| s3:DeleteObject | Delete an object |
| s3:ListBucket | List objects in a bucket |
| s3:GetBucketPolicy | Read the bucket policy |
| s3:PutBucketPolicy | Write / replace the bucket policy |
| s3:* | Full access to all S3 actions (admin) |
Bucket policies are JSON documents attached directly to a bucket. They can grant or deny access to specific AWS accounts, IAM users/roles, services, or the public. They are the primary mechanism for cross-account access and public access.
Common Bucket Policy Use Cases
- Grant another AWS account read access to a bucket
- Force all uploads to use HTTPS (deny HTTP)
- Allow CloudFront OAC to read from a private bucket
- Restrict access to specific IP address ranges
- Require server-side encryption on all PUT requests
- Make a bucket publicly readable for static website hosting
Policy Structure
- Effect: Allow or Deny
- Principal: who (IAM user, account, * for public)
- Action: which S3 operations (s3:GetObject)
- Resource: which bucket/object (arn:aws:s3:::my-bucket/*)
- Condition: optional constraints (IP, MFA, HTTPS)
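A sketch of one of the listed use cases, applied with boto3: a bucket policy that denies any request not made over HTTPS. The bucket name is a placeholder; the policy structure (Effect/Principal/Action/Resource/Condition) follows the elements above.

```python
import json
import boto3

s3 = boto3.client("s3")

policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Sid": "DenyInsecureTransport",
        "Effect": "Deny",
        "Principal": "*",
        "Action": "s3:*",
        "Resource": [
            "arn:aws:s3:::my-example-bucket",
            "arn:aws:s3:::my-example-bucket/*",
        ],
        # Deny matches whenever the request was not sent over TLS.
        "Condition": {"Bool": {"aws:SecureTransport": "false"}},
    }],
}

s3.put_bucket_policy(Bucket="my-example-bucket", Policy=json.dumps(policy))
```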
Block Public Access is a safety switch that sits above bucket policies and ACLs. Even if your bucket policy grants public access, Block Public Access will override and deny it.
What It Does
- 4 independent settings that can be toggled on/off
- Can be set at account level (all buckets) or per bucket
- Account-level setting overrides bucket-level
- Enabled by default on all new buckets since 2023
- Protects against misconfigured bucket policies accidentally exposing data
When to Disable
- Static website hosting that needs to be publicly readable
- Public software distribution buckets
- Any intentional public access scenario
- Must be explicitly and deliberately turned off, never by accident
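The four settings map directly onto one API call. A hedged sketch of applying them at the bucket level (account-wide enforcement uses the s3control client with an account ID instead):

```python
import boto3

s3 = boto3.client("s3")

# All four Block Public Access toggles, enabled on one bucket.
s3.put_public_access_block(
    Bucket="my-example-bucket",
    PublicAccessBlockConfiguration={
        "BlockPublicAcls": True,
        "IgnorePublicAcls": True,
        "BlockPublicPolicy": True,
        "RestrictPublicBuckets": True,
    },
)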
S3 Access Points simplify managing access to shared datasets in S3. Instead of one complex bucket policy that handles every application, each application gets its own named endpoint with its own access policy β scoped to exactly what it needs.
How It Works
- Each access point has a unique DNS name (endpoint)
- Each has its own IAM-style policy for permissions
- Multiple access points on one bucket, one per app/team
- Access point ARN used in place of bucket ARN
VPC-Restricted Access Points
- Access point can be restricted to a specific VPC
- Requests from outside the VPC are automatically denied
- No need for complex bucket policy VPC conditions
- Combines with VPC Endpoints for fully private access
When to Use
- Data lake: different teams query different prefixes
- Multi-tenant: each tenant's app gets scoped access
- Compliance: audit access per application
- At scale: 10,000 access points per bucket supported
A VPC Gateway Endpoint allows EC2 instances and other resources in a private subnet to access S3 without going through the internet: no NAT Gateway, no Internet Gateway, no public IP required.
How It Works
- Create a Gateway Endpoint for S3 in your VPC
- Attach route table entries directing S3 traffic to the endpoint
- Traffic to S3 stays on the AWS private network and never touches the internet
- Free: no hourly charge, no data processing charge
- Works with bucket policies: add an aws:sourceVpce condition to restrict access to the endpoint only
Benefits
- Security: data never traverses the public internet
- Cost: no NAT Gateway data processing fees (saves $0.045/GB)
- Performance: lower latency, higher throughput within AWS
- Exam: "How to access S3 from a private subnet securely" → Gateway Endpoint
Exam tip: S3 and DynamoDB use Gateway Endpoints (free, route table-based). Most other AWS services use Interface Endpoints (ENI-based, hourly charge). "Private S3 access from a private subnet" → VPC Gateway Endpoint. This appears on nearly every AWS exam.
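A minimal sketch of creating the endpoint with boto3; the VPC, route table, and region are placeholders.

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Gateway endpoint for S3: traffic from subnets using this route table
# reaches S3 over the AWS network instead of the internet.
ec2.create_vpc_endpoint(
    VpcEndpointType="Gateway",
    VpcId="vpc-0123456789abcdef0",
    ServiceName="com.amazonaws.us-east-1.s3",
    RouteTableIds=["rtb-0123456789abcdef0"],
)
```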
A presigned URL grants temporary access to a private S3 object without making the bucket public. Any identity with the right IAM permissions can generate one.
How It Works
S3 embeds the credentials and expiry time into the URL itself. The URL is signed with the creator's AWS credentials. Anyone with the URL can access the object until it expires.
Use Cases
- User downloads a private file from your app
- User uploads directly to S3 without credentials
- Sharing a large file temporarily
- Email attachment links that expire
Expiry
- Default: 1 hour
- Max: 7 days when signed with long-term IAM credentials; URLs signed with temporary STS credentials stop working when those credentials expire
- URL becomes invalid after expiry; no revocation needed
- Revoke early by invalidating the signing credentials
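A short sketch of generating a download link with boto3 (bucket, key, and expiry are placeholders):

```python
import boto3

s3 = boto3.client("s3")

# Temporary download link for a private object; expires in 15 minutes.
url = s3.generate_presigned_url(
    ClientMethod="get_object",
    Params={"Bucket": "my-example-bucket", "Key": "reports/q1.pdf"},
    ExpiresIn=900,  # seconds
)
print(url)  # anyone holding this URL can GET the object until it expires
```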
S3 supports encryption at rest and in transit. Since January 2023, all new objects are encrypted by default with SSE-S3.
| Type | Key Management | Use Case | Exam Note |
|---|---|---|---|
| SSE-S3 | AWS manages keys completely | Default; zero management overhead | Header: x-amz-server-side-encryption: AES256 |
| SSE-KMS | AWS KMS; you control the key policy | Compliance, audit trail, cross-account control | CloudTrail logs every key usage. Adds KMS API call cost. |
| SSE-C | You provide the key on every request | You manage keys outside AWS completely | AWS never stores your key; it must be sent with every PUT/GET. |
| Client-side | You encrypt before upload | Zero trust; AWS never sees plaintext | Application owns the full encryption lifecycle. |
Encryption in Transit
- All S3 endpoints support HTTPS (TLS 1.2+)
- HTTP requests are also accepted by default unless you deny them
- Force HTTPS with a bucket policy condition: aws:SecureTransport = false → Deny
- HTTPS is always recommended and required for compliance workloads
SSE-KMS Considerations
- Every S3 GET/PUT = a KMS API call (GenerateDataKey / Decrypt)
- KMS has request rate limits; heavy S3 workloads can hit KMS throttling
- Use KMS key policies to restrict who can use the key
- Audit all data access via CloudTrail; every decrypt is logged
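A hedged sketch of setting SSE-KMS as the bucket default so every new object is encrypted with your key; the key ARN and bucket are placeholders. Enabling the S3 Bucket Key is the usual mitigation for the KMS throttling concern above.

```python
import boto3

s3 = boto3.client("s3")

s3.put_bucket_encryption(
    Bucket="my-example-bucket",
    ServerSideEncryptionConfiguration={
        "Rules": [{
            "ApplyServerSideEncryptionByDefault": {
                "SSEAlgorithm": "aws:kms",
                "KMSMasterKeyID": "arn:aws:kms:us-east-1:111122223333:key/example-key-id",
            },
            "BucketKeyEnabled": True,  # fewer KMS API calls per object
        }]
    },
)
```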
Access Control Lists (ACLs) are the original S3 access mechanism. AWS now recommends disabling ACLs and using bucket policies instead. However, ACLs still appear on certifications.
| ACL Permission | What It Allows |
|---|---|
| READ | List objects (bucket) or download object |
| WRITE | Upload/delete objects in bucket |
| READ_ACP | Read the ACL itself |
| WRITE_ACP | Modify the ACL |
| FULL_CONTROL | All of the above |
New accounts have ACLs disabled by default. Use bucket policies for access control; they are more expressive, easier to audit, and don't require understanding legacy ACL semantics.
Bucket Hardening
- Enable Block Public Access at account level
- Enable versioning: protects against ransomware and accidental deletes
- Enable Object Lock for compliance data (WORM)
- Require SSE-KMS for sensitive data via bucket policy
- Enable S3 access logging: record all requests to the bucket
IAM & Network
- Use IAM roles; never hardcode credentials in apps
- Apply least-privilege policies: grant only the actions needed
- Use VPC Endpoints (Gateway type) for private access from EC2
- Force HTTPS with an aws:SecureTransport deny condition
- Enable AWS Macie for sensitive data discovery (PII detection)
S3 security has three layers: IAM (who can act), Bucket Policy (what the bucket allows), and Block Public Access (safety override). All three must align for access to succeed. When in doubt, Block Public Access wins.
- Default private: all buckets and objects are private. Nothing is public unless you explicitly allow it.
- IAM Policies: attached to identities. Control what users/roles can do in S3.
- Bucket Policies: attached to the bucket. JSON resource policy. Best for cross-account and public access.
- Block Public Access: account or bucket-level override. Enabled by default. Always overrides bucket policy.
- Access Points: per-application named endpoints with individual policies. VPC-restricted for private access. Scale to 10,000 per bucket.
- VPC Gateway Endpoint: private S3 access from VPC without internet. Free. Route table-based. Common exam topic.
- Presigned URLs: temporary signed URLs for private object access. Max 7 days. No bucket policy change needed.
- Encryption: SSE-S3 (default, AWS manages), SSE-KMS (audit trail, your key policy), SSE-C (you manage key).
- Force HTTPS via bucket policy. Disable legacy ACLs. Use VPC endpoints for private access.
Data Management & Lifecycle
Lifecycle rules automate the movement and deletion of objects over time. They eliminate the need to manually manage aging data: define the rules once, and S3 handles the transitions and expirations automatically.
Transition Actions
- Move objects to a cheaper storage class after N days
- Example: Standard → Standard-IA after 30 days
- Example: Standard-IA → Glacier after 90 days
- Example: Glacier → Deep Archive after 365 days
- Can be scoped to a prefix or object tags
Expiration Actions
- Delete objects after N days for automatic cleanup
- Delete expired delete markers (versioned buckets)
- Delete non-current versions after N days
- Abort incomplete multipart uploads after N days
- Prevents unbounded storage cost growth
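A boto3 sketch combining both kinds of actions; bucket name, prefixes, and day counts are placeholders chosen to match the examples above.

```python
import boto3

s3 = boto3.client("s3")

s3.put_bucket_lifecycle_configuration(
    Bucket="my-example-bucket",
    LifecycleConfiguration={
        "Rules": [
            {   # transition + expiration, scoped to a prefix
                "ID": "archive-logs",
                "Status": "Enabled",
                "Filter": {"Prefix": "logs/"},
                "Transitions": [
                    {"Days": 30, "StorageClass": "STANDARD_IA"},
                    {"Days": 90, "StorageClass": "GLACIER"},
                ],
                "Expiration": {"Days": 365},
            },
            {   # housekeeping for the whole bucket
                "ID": "housekeeping",
                "Status": "Enabled",
                "Filter": {"Prefix": ""},
                "AbortIncompleteMultipartUpload": {"DaysAfterInitiation": 7},
                "NoncurrentVersionExpiration": {"NoncurrentDays": 90},
            },
        ]
    },
)
```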
S3 Replication automatically and asynchronously copies objects from one bucket to another. Versioning must be enabled on both source and destination buckets.
CRR: Cross-Region Replication
- Source and destination in different AWS regions
- Use case: disaster recovery across regions
- Use case: low-latency access from another geography
- Use case: compliance (data residency requirements)
- Incurs inter-region data transfer cost
SRR: Same-Region Replication
- Source and destination in the same AWS region
- Use case: copy data between accounts in the same region
- Use case: log aggregation from multiple source buckets
- Use case: test environment with live data copy
- No inter-region transfer cost
| Replication Behaviour | Detail |
|---|---|
| What replicates | New objects after replication is enabled. Existing objects need S3 Batch Replication. |
| Delete behaviour | Delete markers are NOT replicated by default (can be enabled). Permanent deletes never replicate. |
| Storage class | Destination uses same class by default. Can override to a cheaper class. |
| Ownership | Replicated objects are owned by source account by default. Use Object Ownership setting to change. |
| Chaining | Replication is not transitive: A→B→C does NOT automatically replicate A to C. |
| Encryption | SSE-S3 and SSE-KMS objects can be replicated. SSE-C objects cannot. |
Replication Time Control (RTC)
Standard replication is asynchronous with no SLA on timing: most objects replicate in seconds, but some may take hours. S3 Replication Time Control (RTC) guarantees that 99.99% of objects replicate within 15 minutes, with S3 metrics to track replication lag. Use RTC when you have compliance or disaster recovery requirements that demand a guaranteed replication SLA. RTC adds cost, so enable it only for buckets where the timing guarantee matters.
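A hedged boto3 sketch of a basic replication configuration (CRR in this case). Bucket names, the replication IAM role ARN, and the destination storage class are placeholders; versioning must already be enabled on both buckets.

```python
import boto3

s3 = boto3.client("s3")

s3.put_bucket_replication(
    Bucket="source-bucket",
    ReplicationConfiguration={
        # Role S3 assumes to read from the source and write to the destination
        "Role": "arn:aws:iam::111122223333:role/s3-replication-role",
        "Rules": [{
            "ID": "dr-copy",
            "Status": "Enabled",
            "Priority": 1,
            "Filter": {"Prefix": ""},                       # replicate everything
            "DeleteMarkerReplication": {"Status": "Disabled"},
            "Destination": {
                "Bucket": "arn:aws:s3:::dr-bucket-us-west-2",
                "StorageClass": "STANDARD_IA",              # cheaper class at destination
            },
        }],
    },
)
```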
Object Lock prevents objects from being deleted or overwritten for a defined period. It implements WORM (Write Once Read Many) storage β required for SEC 17a-4, HIPAA, and financial compliance workloads.
Retention Modes
- Compliance mode: nobody can delete or change the object, including the root user. Period cannot be shortened. Used for strict regulatory requirements.
- Governance mode: only users with the s3:BypassGovernanceRetention permission can override. Lighter enforcement for internal policies.
Legal Hold
- Prevents deletion independent of any retention period
- No expiry date β stays locked until explicitly removed
- Requires the s3:PutObjectLegalHold permission to apply or remove
- Used during litigation: preserve evidence without a known end date
Object Lock must be enabled when the bucket is created; it cannot be added to an existing bucket. Compliance mode retention periods cannot be shortened even by AWS Support.
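A sketch of applying retention and a legal hold to one object in an Object Lock-enabled bucket; bucket, key, mode, and date are placeholders.

```python
from datetime import datetime, timezone
import boto3

s3 = boto3.client("s3")

# Governance-mode retention until a fixed date.
s3.put_object_retention(
    Bucket="compliance-bucket",
    Key="records/trade-001.json",
    Retention={
        "Mode": "GOVERNANCE",
        "RetainUntilDate": datetime(2027, 1, 1, tzinfo=timezone.utc),
    },
)

# Legal hold: independent of retention, stays until explicitly removed.
s3.put_object_legal_hold(
    Bucket="compliance-bucket",
    Key="records/trade-001.json",
    LegalHold={"Status": "ON"},
)
```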
S3 can publish events when objects are created, deleted, restored, or replicated. This enables event-driven architectures where downstream systems react to data changes automatically.
SNS
Fan out notification to multiple subscribers. Email alerts, SMS, or trigger multiple SQS queues from one S3 event.
SQS
Decouple processing from uploads. Workers poll SQS and process each uploaded object independently. Handles volume spikes gracefully.
Lambda
Trigger serverless processing immediately on upload. Image resizing, virus scanning, data validation, format conversion, all without a server.
| Event Type | Triggered When | Common Use |
|---|---|---|
| s3:ObjectCreated:* | Any object is uploaded (PUT, POST, COPY, multipart) | Trigger processing pipeline on upload |
| s3:ObjectRemoved:* | Object is deleted | Audit deletion, update downstream index |
| s3:ObjectRestore:* | Glacier object restore initiated/completed | Notify when archive is available |
| s3:Replication:* | Replication failure or missed threshold | Alert on replication health issues |
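A sketch of wiring an ObjectCreated event to Lambda with boto3. The function ARN, bucket, and prefix are placeholders, and the Lambda function must already allow s3.amazonaws.com to invoke it (via lambda add-permission).

```python
import boto3

s3 = boto3.client("s3")

s3.put_bucket_notification_configuration(
    Bucket="my-example-bucket",
    NotificationConfiguration={
        "LambdaFunctionConfigurations": [{
            "LambdaFunctionArn": "arn:aws:lambda:us-east-1:111122223333:function:process-upload",
            "Events": ["s3:ObjectCreated:*"],
            # Only fire for objects under uploads/
            "Filter": {"Key": {"FilterRules": [{"Name": "prefix", "Value": "uploads/"}]}},
        }]
    },
)
```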
S3 Batch Operations runs large-scale jobs across billions of objects with a single API call. Instead of writing scripts to iterate through objects, you describe the operation and S3 runs it at scale.
Supported Operations
- Copy objects between buckets
- Replace object tags or ACLs
- Restore objects from Glacier
- Invoke Lambda on every object
- Replicate existing objects (Batch Replication)
- Set Object Lock retention on existing objects
How It Works
- Provide an object manifest (S3 Inventory report or CSV)
- Define the operation and parameters
- S3 processes all listed objects and tracks progress and errors
- Generates a completion report to S3
- Full audit trail in CloudTrail
Lifecycle rules + Replication + Object Lock form your data governance foundation. Automate transitions to save cost, replicate for resilience, and lock for compliance. Event notifications turn S3 into a trigger for your entire data pipeline.
- Lifecycle rules: automate transitions (Standard → IA → Glacier) and expirations. Scope by prefix or tag.
- CRR: cross-region replication for DR, compliance, and latency. Adds inter-region transfer cost.
- SRR: same-region replication for cross-account copy, log aggregation. No transfer cost.
- Replication nuances: versioning required, new objects only, delete markers not replicated by default, not transitive.
- Object Lock: WORM storage. Compliance mode = nobody can delete. Governance mode = privileged users can override. Must enable at bucket creation.
- Event notifications: S3 → SNS / SQS / Lambda on create/delete/restore. Foundation of event-driven data pipelines.
- Batch Operations: run jobs on billions of objects. Copy, tag, restore, invoke Lambda at scale.
Performance & Scaling
S3 scales automatically: there are no capacity limits to configure, no partitions to manage, and no pre-warming required. AWS manages the infrastructure horizontally behind the scenes. However, understanding S3's performance characteristics helps you avoid hitting rate limits on high-throughput workloads.
PUT / COPY / DELETE
3,500 requests/sec per prefix. Writing 100K objects/sec requires ~29 prefixes with evenly distributed keys.
GET / HEAD
5,500 requests/sec per prefix. A single prefix can serve ~5,500 reads per second before S3 automatically scales further.
No Hard Limits
These are baseline per-prefix rates. S3 will scale beyond these automatically as traffic increases; no pre-warming needed.
A prefix is the part of an object key before the final filename, essentially the "path". S3 uses prefixes to distribute requests across its internal infrastructure. More distinct prefixes = more parallelism = higher throughput.
Bad Pattern: Single Prefix
- All objects under uploads/2026/
- All requests go to the same partition
- Hits 3,500 PUT/sec limit quickly
- No horizontal scaling benefit
Good Pattern: Multiple Prefixes
- Distribute across a/uploads/, b/uploads/, c/uploads/
- Or use hash prefixes: a3f/, 7b2/, 9d1/
- Each prefix gets its own 3,500/5,500 rate budget
- 10 prefixes = 35,000 PUT/sec, 55,000 GET/sec
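A tiny sketch of the hash-prefix idea: derive a short, evenly distributed prefix from the key so writes fan out across partitions. The helper name is hypothetical.

```python
import hashlib

def prefixed_key(original_key: str) -> str:
    """Prepend a short hash so requests spread across many S3 prefixes."""
    digest = hashlib.md5(original_key.encode()).hexdigest()[:3]
    return f"{digest}/{original_key}"

print(prefixed_key("uploads/2026/photo-123.jpg"))  # e.g. "a3f/uploads/2026/photo-123.jpg"
```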
Multipart Upload splits large objects into parts, uploads them in parallel, and reassembles them on S3. It is the correct mechanism for any object above 100 MB.
| Feature | Detail |
|---|---|
| Minimum part size | 5 MB (except the last part) |
| Maximum parts | 10,000 parts per object |
| Maximum object size | 5 TB (requires multipart) |
| Parallel uploads | Upload all parts simultaneously; dramatically faster on high-bandwidth connections |
| Resume on failure | Only the failed part needs to be retried, not the entire object |
| Incomplete uploads | Parts are billed even if never completed; use a lifecycle rule to abort after N days |
Transfer Acceleration routes uploads through AWS CloudFront edge locations instead of going directly to the S3 regional endpoint. Data enters the AWS backbone at the nearest edge location, then travels on AWS's private network to S3, which is faster and more reliable than routing over the public internet for long distances.
When Transfer Acceleration Helps
- Users uploading from distant geographies (EU → us-east-1)
- Large file uploads over high-latency internet connections
- Consistent performance from multiple global locations to one bucket
- Can provide 50-500% speed improvement over direct upload
When It Does NOT Help
- Uploads from within the same region as the bucket
- Small files: the overhead of edge routing is not worth it
- Adds per-GB transfer cost on top of standard S3 pricing
- Test with the S3 Transfer Acceleration Speed Comparison tool first
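A hedged sketch of turning acceleration on and using the accelerate endpoint from boto3; bucket and file names are placeholders.

```python
import boto3
from botocore.config import Config

s3 = boto3.client("s3")

# One-time switch on the bucket.
s3.put_bucket_accelerate_configuration(
    Bucket="my-example-bucket",
    AccelerateConfiguration={"Status": "Enabled"},
)

# Point the client at the accelerate endpoint for the actual transfers.
s3_accel = boto3.client("s3", config=Config(s3={"use_accelerate_endpoint": True}))
s3_accel.upload_file("video.mp4", "my-example-bucket", "media/video.mp4")
```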
S3 Select allows you to retrieve only the subset of data you need from an object using SQL expressions, without downloading the entire file. Instead of downloading a 5 GB CSV and filtering locally, S3 filters on the server and returns only matching rows.
How It Works
- Supported formats: CSV, JSON, Parquet
- Optional compression: GZIP, BZIP2
- Run SQL SELECT and WHERE against the object server-side
- S3 returns only matching rows, not the full file
- Reduces data transfer cost and client-side processing time
Why It Matters
- A 5 GB CSV with 10 matching rows → transfer 10 rows, not 5 GB
- Faster for Lambda functions operating on large S3 files
- Glacier Select brings the same capability to archived data
- Not a replacement for Athena: no joins, no aggregations
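A sketch of an S3 Select call with boto3; bucket, key, and the column names in the SQL expression are placeholders, and the object is assumed to be an uncompressed CSV with a header row.

```python
import boto3

s3 = boto3.client("s3")

resp = s3.select_object_content(
    Bucket="my-example-bucket",
    Key="data/orders.csv",
    ExpressionType="SQL",
    Expression="SELECT s.order_id, s.total FROM S3Object s WHERE s.country = 'DE'",
    InputSerialization={"CSV": {"FileHeaderInfo": "USE"}, "CompressionType": "NONE"},
    OutputSerialization={"CSV": {}},
)

# The response is an event stream; only matching rows come back.
for event in resp["Payload"]:
    if "Records" in event:
        print(event["Records"]["Payload"].decode(), end="")
```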
You can retrieve specific byte ranges of an object using the HTTP Range header. This enables parallel downloads and efficient partial reads without fetching the entire object.
Parallel Download
Split a 10 GB object into 10 × 1 GB ranges. Download all 10 in parallel. Combine client-side. Significantly faster than a single sequential download.
Read Header Only
Fetch just the first few KB of a file to read its header metadata (e.g., Parquet footer, image EXIF). Avoid downloading 500 MB to read 4 KB of metadata.
Resume Downloads
If a download fails mid-way, resume from the last successful byte. No need to restart from zero for large objects.
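A minimal byte-range sketch: fetch only the first 4 KB of a large object (bucket and key are placeholders).

```python
import boto3

s3 = boto3.client("s3")

# HTTP Range header: first 4096 bytes only, e.g. to read a file header.
head = s3.get_object(
    Bucket="my-example-bucket",
    Key="datasets/big-file.parquet",
    Range="bytes=0-4095",
)
print(len(head["Body"].read()))  # 4096 bytes transferred, not the whole object
```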
S3 scales to any throughput automatically, but you must spread load across prefixes to use it. Use Multipart Upload for anything above 100 MB. Use Transfer Acceleration for global users. Use S3 Select to minimize data transfer on large objects.
- Baseline rates: 3,500 PUT/sec and 5,500 GET/sec per prefix. Spread load across prefixes to scale linearly.
- Prefix partitioning: hash-based prefixes distribute requests across S3 partitions. 10 prefixes = 10× throughput.
- Multipart Upload: required above 5 GB, recommended above 100 MB. Parallel parts + resume on failure. Set lifecycle rule to abort incomplete uploads.
- Transfer Acceleration: edge location → AWS backbone → S3. 50-500% faster for distant geographies. Adds per-GB cost.
- S3 Select: server-side SQL filter on CSV/JSON/Parquet. Transfer only matching rows. Not a query engine; no joins.
- Byte-range fetches: parallel downloads, header-only reads, and resume support via HTTP Range header.
Cost Optimization
S3 has no up-front cost and no minimum fee. You pay only for what you use across four dimensions:
Storage Cost
- Per GB stored per month
- Varies by storage class: Standard is most expensive, Deep Archive cheapest
- Billed by actual bytes; fractional GBs charged proportionally
- Versioned objects: every version is billed separately
Request & Retrieval Cost
- PUT/COPY/POST/LIST: ~$0.005 per 1,000 requests
- GET/SELECT: ~$0.0004 per 1,000 requests
- Retrieval fee for IA, Glacier classes (per GB retrieved)
- Lifecycle transition requests: small per-object fee
Data Transfer Cost
- Inbound (upload to S3): free
- S3 → internet: ~$0.09/GB (first 10 TB/month)
- S3 → same-region EC2: free
- S3 → different region (CRR): ~$0.02/GB
- S3 → CloudFront: free (use CF to avoid egress)
Management & Features
- S3 Inventory reports: per million objects listed
- S3 Analytics (Storage Class Analysis): per million objects
- Replication: per-GB data transfer + request fees
- Transfer Acceleration: additional per-GB fee
These are approximate US East (N. Virginia) prices to illustrate relative costs. Always check the AWS pricing page for current rates in your region.
| Storage Class | Storage ($/GB/month) | Retrieval ($/GB) | Min Storage Duration | Min Object Size |
|---|---|---|---|---|
| S3 Standard | ~$0.023 | Free | None | None |
| S3 Intelligent-Tiering | ~$0.023 (frequent tier) | Free | None | 128 KB (smaller = Standard) |
| S3 Standard-IA | ~$0.0125 | ~$0.01/GB | 30 days | 128 KB billed minimum |
| S3 One Zone-IA | ~$0.01 | ~$0.01/GB | 30 days | 128 KB billed minimum |
| S3 Glacier Instant | ~$0.004 | ~$0.03/GB | 90 days | 128 KB billed minimum |
| S3 Glacier Flexible | ~$0.0036 | ~$0.01-0.03/GB | 90 days | 40 KB billed minimum |
| S3 Glacier Deep Archive | ~$0.00099 | ~$0.02/GB | 180 days | 40 KB billed minimum |
Minimum storage duration traps are real. If you store a 1 GB file in Standard-IA for only 10 days and delete it, you are still billed for 30 days. Do not use IA classes for short-lived or frequently changed objects.
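A quick worked comparison of that trap, using the approximate prices from the table above (illustrative only; real rates vary by region):

```python
# 1 GB kept for 10 days, then deleted.
standard_rate = 0.023   # $/GB-month (S3 Standard)
ia_rate = 0.0125        # $/GB-month (Standard-IA)
ia_retrieval = 0.01     # $/GB retrieved (Standard-IA)

gb, days = 1, 10
standard_cost = gb * standard_rate * (days / 30)        # billed only for 10 days
ia_cost = gb * ia_rate + gb * ia_retrieval              # billed for the full 30-day minimum + retrieval
print(f"Standard: ${standard_cost:.4f}  Standard-IA: ${ia_cost:.4f}")
# Standard ends up cheaper for short-lived objects despite the higher per-GB rate.
```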
Intelligent-Tiering (INT) automatically moves objects between access tiers based on actual usage β no retrieval fees, no lifecycle rules to manage. It is the right choice when access patterns are unknown or unpredictable.
How It Works
- Objects start in the Frequent Access tier (same cost as Standard)
- Move to Infrequent Access tier after 30 days of no access
- Move to Archive Instant tier after 90 days (optional)
- Move to Archive tier after 90-180 days (optional, configure)
- Object accessed → immediately moved back to Frequent Access tier
Cost Considerations
- Small monitoring fee per object per month (~$0.0025/1,000 objects)
- Objects smaller than 128 KB are billed as Standard, so INT is not worth it for them
- No retrieval fees within Frequent and Infrequent tiers
- Archive tiers have retrieval fees (like Glacier)
- No minimum storage duration, so no early deletion penalty
Reduce Storage Cost
- Set lifecycle rules to transition to cheaper classes automatically
- Enable S3 Analytics to identify infrequently accessed data
- Use Intelligent-Tiering for data with unknown access patterns
- Expire old object versions automatically with lifecycle rules
- Abort incomplete multipart uploads (lifecycle rule after 7 days)
- Compress files before upload (GZIP, Snappy, ZSTD)
Reduce Transfer Cost
- Serve S3 content via CloudFront: S3→CloudFront is free, CloudFront→internet is cheaper
- Keep compute (EC2/Lambda) in the same region as S3 for free transfer
- Use VPC Gateway Endpoints for free S3 access from within the VPC
- Use S3 Select to transfer only needed rows, not full objects
- Enable Requester Pays for public datasets so the consumer pays for retrieval
S3 Storage Lens provides org-wide visibility into S3 usage, activity trends, and cost optimization recommendations across all buckets and accounts in your AWS Organization.
Usage Metrics
Total storage bytes, object count, average object size, and incomplete multipart uploads, aggregated across your entire organization.
Activity Metrics
GET/PUT/DELETE request counts, bytes downloaded. Identify hot buckets and cold buckets that should be transitioned to cheaper storage classes.
Recommendations
S3 Storage Lens surfaces cost optimization tips: objects that qualify for lifecycle transitions, buckets with no lifecycle rules, and incomplete multipart upload accumulation.
| Scenario | Right Choice | Reason |
|---|---|---|
| Frequently accessed app data | S3 Standard | No retrieval fee, no min duration |
| Access pattern is unknown | S3 Intelligent-Tiering | Auto-optimizes without lifecycle rules |
| Backup accessed once/month | S3 Standard-IA | 50% cheaper storage, low retrieval frequency |
| Replicated data (can re-create) | S3 One Zone-IA | 20% cheaper than Standard-IA, acceptable single-AZ risk |
| Compliance archive, instant access | S3 Glacier Instant | ~83% cheaper than Standard, ms retrieval |
| 7+ year regulatory archive | S3 Glacier Deep Archive | ~96% cheaper than Standard, 12h retrieval acceptable |
| Short-lived temp files (<30 days) | S3 Standard | IA min-duration billing makes IA more expensive |
S3 cost optimization is primarily about storage class selection and lifecycle automation. Serve via CloudFront to eliminate egress. Set lifecycle rules on day one; retroactively optimizing storage is expensive and slow. Use Storage Lens to find what you missed.
- Four cost dimensions: storage ($/GB/month), requests (per 1,000), retrieval ($/GB for IA/Glacier), data transfer (free inbound, ~$0.09/GB egress).
- S3→CloudFront is free. Use CloudFront for public content to eliminate S3 egress cost entirely.
- Minimum duration traps: Standard-IA = 30 days, Glacier = 90 days, Deep Archive = 180 days. Don't use IA for short-lived objects.
- Intelligent-Tiering: auto-moves objects based on actual access. No retrieval fee. Best for unknown access patterns. Objects <128 KB billed as Standard.
- Lifecycle rules: set on day one. Expire old versions. Abort incomplete multipart uploads. Transition logs to Glacier after 90 days.
- VPC Gateway Endpoint: free S3 access from within a VPC. Eliminates NAT Gateway data processing costs for S3 traffic.
- Storage Lens: org-wide dashboard for usage, activity, and automatic cost optimization recommendations.
Architecture Patterns
S3 can serve HTML, CSS, JavaScript, and image files directly as a website: no web server, no EC2, no maintenance. For read-heavy static content, this is the simplest and cheapest architecture on AWS.
Architecture
- Enable static website hosting on the S3 bucket
- Set index document (index.html) and error document (404.html)
- Bucket policy grants s3:GetObject to * (public read)
- Disable Block Public Access to allow the public policy
- Use custom domain via Route 53 CNAME or alias
With CloudFront (Recommended)
- CloudFront distribution in front of S3 origin
- S3 bucket stays private; CloudFront uses OAC to access it
- HTTPS via ACM certificate on CloudFront (S3 website endpoint is HTTP only)
- Global edge caching serves from the PoP nearest to the user
- Eliminates S3 egress cost: S3→CloudFront transfer is free
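A hedged sketch of the bucket-side configuration for the simple (publicly readable) variant; the bucket name and document keys are placeholders.

```python
import boto3

s3 = boto3.client("s3")

# Website endpoint configuration: index and error documents.
s3.put_bucket_website(
    Bucket="my-site-bucket",
    WebsiteConfiguration={
        "IndexDocument": {"Suffix": "index.html"},
        "ErrorDocument": {"Key": "404.html"},
    },
)
# For the CloudFront variant, skip public access entirely: keep Block Public
# Access on and attach a bucket policy that allows only the distribution's OAC.
```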
Users upload files directly to S3, bypassing your application server entirely. Your backend generates a short-lived presigned URL and returns it to the client. The client uploads directly to S3, and your server never touches the bytes.
Flow
- Client requests upload permission from your API
- Your API generates a presigned PUT URL (e.g., 15 minutes)
- API returns the presigned URL to the client
- Client uploads the file directly to S3 using the URL
- S3 sends an event notification to Lambda on completion
- Lambda processes the uploaded file (resize, scan, index)
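The backend half of this flow in boto3 (bucket, key, content type, and expiry are placeholders):

```python
import boto3

s3 = boto3.client("s3")

# Step 2: generate a short-lived PUT URL scoped to one key.
upload_url = s3.generate_presigned_url(
    ClientMethod="put_object",
    Params={
        "Bucket": "user-uploads-bucket",
        "Key": "incoming/avatar-42.png",
        "ContentType": "image/png",
    },
    ExpiresIn=900,  # 15 minutes
)

# Step 4: the client uploads without AWS credentials, e.g.
#   requests.put(upload_url, data=file_bytes, headers={"Content-Type": "image/png"})
# The Content-Type must match what was signed into the URL.
```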
Benefits
- Your servers handle zero upload bandwidth
- Files go directly to S3, which is faster for users on high-bandwidth connections
- Bucket stays private; the presigned URL grants temporary access only
- Lambda event trigger enables automatic downstream processing
- Scales to thousands of concurrent uploads without bottleneck
S3 is the standard storage layer for data lakes on AWS. Raw data lands in S3, is catalogued with Glue, and queried in-place with Athena: no database to provision, no ETL until you need it.
Architecture Layers
- Landing zone: raw data as-is (JSON, CSV, logs, API dumps)
- Processed zone: cleaned, partitioned Parquet files (columnar format)
- Curated zone: aggregated, business-ready datasets
- Each zone is a separate S3 prefix or bucket
- AWS Glue Crawlers auto-discover schema and update the Glue Catalog
- Athena queries directly against Parquet files using SQL
Why This Pattern Works
- Storage is decoupled from compute; scale each independently
- Pay per query with Athena; no always-on database cluster
- Parquet columnar format reduces Athena scan cost by 10-100×
- Partition by date/region so Athena skips irrelevant partitions entirely
- Lake Formation adds fine-grained table/column access control
Database Backups
- RDS automated backups export to S3
- DynamoDB exports to S3 (point-in-time)
- EC2 snapshots stored via EBS then exported to S3
- Lifecycle: transition to Glacier after 30 days
Cross-Region DR
- Enable CRR to a secondary region bucket
- RPO: near-zero (async replication, seconds lag)
- RTO: immediate, since the data is already in the secondary region
- S3 Object Lock protects against ransomware
Versioning for Recovery
- Versioning = built-in point-in-time recovery
- Restore any object to any previous state
- Lifecycle rules expire old versions to control cost
- MFA Delete for extra protection on versioned buckets
S3 events drive serverless processing pipelines: no polling, no scheduler, no idle workers. Every object upload automatically triggers the next stage of processing, as in the handler sketch below.
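A minimal Lambda handler sketch for the consuming side of such a pipeline: it walks the S3 event records delivered on each invocation. The processing step is left as a comment.

```python
import urllib.parse

def handler(event, context):
    """React to each S3 ObjectCreated record in the event."""
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        # Keys arrive URL-encoded in S3 event notifications.
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
        size = record["s3"]["object"].get("size", 0)
        print(f"New object: s3://{bucket}/{key} ({size} bytes)")
        # ... resize / scan / index the object, write results downstream
    return {"processed": len(event["Records"])}
```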
| Mistake | Why It's Bad | Fix |
|---|---|---|
| Making bucket public for all content | Exposes all objects, including future uploads | Keep bucket private, use CloudFront OAC + presigned URLs |
| No lifecycle rules | Storage cost grows unbounded over months/years | Set lifecycle rules on day one for every bucket |
| Using S3 as a database | No query capability, no indexing; extremely slow lookups | Store metadata in DynamoDB/RDS, store files in S3 |
| Ignoring incomplete multipart uploads | Parts accumulate silently and are billed indefinitely | Lifecycle rule: abort incomplete multipart after 7 days |
| Moving to IA too aggressively | Min-duration billing + retrieval fees make it more expensive for frequent access | Use S3 Analytics or Intelligent-Tiering to identify true access patterns |
| Not enabling versioning on important buckets | One accidental delete or overwrite = permanent data loss | Enable versioning + lifecycle expire old versions |
| Storing credentials in S3 objects | Exposed if bucket is ever misconfigured | Use Secrets Manager or SSM Parameter Store |
S3's patterns all follow one principle: S3 is storage, not compute. Let CloudFront serve it, let Lambda process it, let Athena query it, let your API control access to it. S3 itself just stores; everything else is glue.
- Static website: S3 + CloudFront + ACM + Route 53. Bucket stays private. OAC grants CloudFront access. Zero server cost.
- User uploads: API generates presigned PUT URL → client uploads directly to S3 → Lambda processes on event. Your server handles zero bytes.
- Data lake: Landing (raw) → Processed (Parquet) → Curated zones in S3. Glue Catalog autodiscovers schema. Athena queries in-place.
- Backup & DR: CRR to secondary region. Versioning for point-in-time recovery. Object Lock for ransomware protection.
- Event-driven pipeline: ObjectCreated → Lambda → processed S3 → SQS → downstream. No polling, scales to any volume.
- Common mistakes: no lifecycle rules, ignoring incomplete multipart uploads, moving data to IA too aggressively, no versioning on critical buckets.