LearningTree · AWS · Analytics

Amazon Athena —
Serverless Analytics Directly on S3

Query petabytes of data stored in S3 using standard SQL — without provisioning a single server, loading data into a database, or managing infrastructure. Athena is the engine of the modern AWS data lake.

Chapter One · Analytics

What is Amazon Athena?

Amazon Athena is a serverless, interactive query service that lets you analyze data stored in Amazon S3 using standard SQL — without loading it into a database, provisioning servers, or managing any infrastructure. You point Athena at files in S3, write a SQL query, and results appear in seconds.

The Core Idea in One Sentence Introductory

Traditional analytics requires you to move data into a database before you can query it. Athena flips this model — the query goes to the data, not the data to the query engine.

🏛️ Mental Model — The Digital Library

Imagine a giant digital library with millions of books stored on shelves (S3). In a traditional database, you'd need to move all the books into a reading room, catalogue them, and maintain the room before anyone can search. Athena is a librarian who reads the books exactly where they are on the shelves — you just describe what you're looking for, and the answer comes back instantly. No moving. No room to maintain. No staff to hire.

Traditional vs Athena — The Paradigm Shift Introductory

❌ Traditional Analytics

Provision a database cluster (RDS, Redshift)
Design and create tables and schemas
ETL pipeline loads data into the database
Pay for the cluster 24/7, even when idle
Scale the cluster as data grows
Manage backups, patches, engine versions
Time-to-first-query: days or weeks

✅ With Amazon Athena

Data stays in S3 exactly where it already is
Define a schema in Glue Catalog (metadata only)
Write a SQL query — Athena executes it directly
Pay only for what you query (per GB scanned)
No infrastructure to scale — Athena scales itself
Nothing to patch, manage, or maintain
Time-to-first-query: minutes

What Kinds of Data Can Athena Query? Core

📋

Log Files

CloudTrail audit logs
ALB / CloudFront access logs
VPC Flow Logs
Application logs exported to S3

📊

Business Data

CSV exports from databases
JSON event streams
AWS Cost & Usage Reports (CUR)
Parquet data from ETL pipelines

🔬

Analytics Datasets

Clickstream and user event data
IoT sensor readings
Financial transaction records
Machine learning feature stores

Athena is Serverless — What Does That Actually Mean? Core

"Serverless" is overused — here is exactly what it means for Athena:

Concern	Traditional DB	Amazon Athena
Servers	You provision EC2 or DB instances	AWS owns and operates all compute
Scaling	You resize instances or add nodes	Athena scales automatically per query
Idle cost	You pay even when no queries run	Zero cost when not querying
Patching	You apply engine/OS updates	AWS maintains all infrastructure
Capacity planning	Required — guess future load	Not required — on-demand
Startup time	Minutes to hours to provision	Instant — always ready

Concept Diagram — How Athena Fits Introductory

Amazon Athena — Serverless Query Model

Athena reads schema from Glue Catalog, scans files in S3, returns SQL results — no database, no loading, no servers

Query Flow Diagram — Step by Step Core

Athena Query Execution Flow — What Happens When You Run a Query

Each query is fully isolated — Athena spins up distributed workers, scans only required S3 partitions, returns results

Athena's Place in the AWS Ecosystem Core

Athena is not a standalone service — it is the query layer of the modern AWS data lake stack:

💾

Storage Layer — Amazon S3

S3 holds the actual data files. It is the source of truth — cheap, durable, infinitely scalable. Athena never moves this data; it reads it in place.

🗂️

Metadata Layer — AWS Glue Catalog

Glue Catalog holds the schema — what tables exist, what columns they have, where the S3 partitions are. Athena reads this catalog before scanning any data.

🔍

Query Layer — Amazon Athena

Athena is the SQL engine. It reads schemas from Glue, figures out which S3 files to read, distributes the scan across workers, and returns results.

📈

Visualisation Layer — QuickSight / BI Tools

Tools like Amazon QuickSight, Tableau, or Power BI connect to Athena via JDBC/ODBC to build dashboards and reports on top of S3 data.

Why Athena is Important — The Big Picture Introductory

Before services like Athena, running analytics on large datasets required — at minimum — a running database cluster, an ETL pipeline to load data, and a team to maintain both. Small companies couldn't afford it. Even large companies struggled with idle infrastructure.

Athena changed this permanently. Today, a startup can store billions of events in S3 for pennies per GB per month, and query that data with SQL on demand, paying only for each query. There is no minimum spend, no committed capacity, no infrastructure team required.

Why This Matters for Architecture

Athena is a foundational component of the modern AWS data lake pattern: S3 (storage) + Glue (catalog) + Athena (query) + QuickSight (visualise). This stack replaces what used to require a full data warehouse team and six-figure infrastructure budgets. You'll see this pattern in almost every serious AWS analytics architecture.

🎯 Exam Insight

"Query data directly in S3 using SQL" → Athena
"Serverless analytics" or "no infrastructure to manage" → Athena
"Pay per query" or "pay per GB scanned" → Athena pricing model
Athena uses Glue Data Catalog for table schemas — this is the key integration to remember
Results are always written to an S3 output location — they are not stored in Athena itself

Chapter 01 — Key Takeaway

Athena is a serverless SQL query engine that reads data directly from S3. No servers to provision, no data to load, no infrastructure to manage. You pay per query, per GB scanned. It is the query layer of the modern AWS data lake — built on top of S3 for storage and Glue Catalog for schema management.

Chapter Two · Analytics

Why Serverless Analytics Matters

Before serverless analytics, every organisation that wanted to query large datasets had to become an infrastructure company by accident. They spent more time managing databases than getting value from data. Athena changes the economics entirely — the cost of asking a question is nearly zero, which changes how teams think about data.

The Traditional Analytics Problem Introductory

Classic analytics stacks require you to solve a set of hard operational problems before you can even run a single query:

🏗️

Infrastructure Overhead

Provision database clusters upfront
Estimate capacity for peak load (often wrong)
Over-provision because under-provisioning is worse
Pay for idle capacity every hour of every day
Maintain OS, engine patches, version upgrades

🐢

Operational Friction

Schema must be defined before data arrives
ETL pipelines needed to load data into DB
Pipeline failures = stale or missing data
Schema changes require expensive migrations
Scaling down is as hard as scaling up

⚠️ The Real Cost of Traditional Analytics

A small data team's analytics database running 24/7 on a 4-node Redshift cluster costs ~$7,000–$15,000 per month — whether anyone is querying it or not. Most queries happen during business hours (8 hours/day, 5 days/week). That means 75% of the cost is paying for idle infrastructure.

How Athena Changes the Economics Core

Serverless analytics decouples when data is stored from when it is queried. With Athena:

Traditional Approach	With Athena	Impact
Pay 24/7 for cluster	Pay per query ($5/TB scanned)	80–95% cost reduction for ad-hoc workloads
Provision before you know the load	No provisioning needed	Zero capacity risk
ETL before first query	Query files directly as-is	Hours → minutes to first insight
One cluster = one team's bottleneck	Unlimited parallel queries	No queue, no contention
Schema must match expectations	Schema-on-read; evolve freely	Flexible exploration

Real-World Use Cases Where Athena Shines Core

🔍

Log Analytics

Query ALB, CloudFront, or VPC Flow Logs
Find which IPs hit which endpoints
Analyse error rate trends over time
No log pipeline needed — files already in S3

🔐

CloudTrail Security Audit

Query who called which AWS API and when
Find all IAM key usage across accounts
Detect unusual cross-region activity
Native Athena table for CloudTrail exists

💰

Cost Reporting

Query AWS Cost & Usage Reports (CUR)
Break down spending by team, tag, service
Find idle resources burning budget
CUR is JSON/Parquet — perfect for Athena

📊

Ad-Hoc Reporting

Business teams query S3 data directly
One-off analysis without data engineering
Connect BI tools via JDBC/ODBC
No change to underlying data storage

🚨

Security Investigations

Athena + GuardDuty findings in S3
Query network flow logs during incidents
Correlate events across multiple log sources
Forensic queries on historical data

🏭

Data Lake Analytics

Explore raw data before building pipelines
Validate data quality with SQL
Query across multiple S3 datasets at once
Power dashboards via QuickSight

The Right Mental Model — Athena is Optimised For Core

✅ Athena is Great For

Exploration — discovering patterns in data you don't yet understand
Ad-hoc analytics — irregular, one-off queries by many users
Occasional reporting — dashboards that run a few times a day
Large scans on cold data — history, logs, archives
Security auditing — query CloudTrail and VPC Flow on-demand
Cost-effective BI — when query volume is low to medium

≠

❌ Athena is NOT For

Transactional workloads — INSERT/UPDATE/DELETE, OLTP → use RDS
Sub-second latency queries — real-time dashboards → use ElastiCache
High-concurrency BI (thousands of users) → use Redshift
Streaming data — continuous real-time feeds → use Kinesis
Application backends — Athena is not a transactional DB

Architecture Insight — Where Athena Lives in a Data Platform Core

Serverless Analytics Platform — Where Athena Fits

A full serverless analytics platform — no running databases, no ETL clusters, no idle infrastructure

Common Mistakes — What Teams Get Wrong Core

❌

Mistake 1: Using Athena Like a Database

Teams try to run real-time application queries through Athena. Athena executes in seconds, not milliseconds. It is an analytics tool, not a database backend. Use RDS/DynamoDB for application queries.

❌

Mistake 2: Querying Unpartitioned Data

Running queries on a flat S3 bucket with millions of small files is catastrophically expensive. Without partitioning, every query scans every file. This is the most common and costly Athena mistake.

❌

Mistake 3: Storing Data as CSV/JSON at Scale

CSV and JSON are row-based formats. A query for two columns out of 200 must read every column. Using Parquet or ORC reduces costs by 60–80% and query time dramatically. (More in Chapter 4.)

❌

Mistake 4: Expecting Consistent Sub-Second Results

Athena has a query startup overhead of 1–3 seconds. If your dashboard needs to refresh every second, Athena is not the right tool. For interactive sub-second dashboards, cache results or use ElastiCache.

🎯 Exam Insight

"Ad-hoc SQL queries on S3 data with no infrastructure" → Athena (not Redshift)
"Analyse CloudTrail logs with SQL" → Athena (built-in CloudTrail table)
"OLTP workload, INSERT/UPDATE, millisecond latency" → NOT Athena → RDS or DynamoDB
"Reduce Athena query cost" → add partitioning + use Parquet/ORC format
Athena is not suitable for transactional workloads — this distinction is frequently tested

Chapter 02 — Key Takeaway

Serverless analytics eliminates the infrastructure tax on data exploration. Athena's pay-per-query model makes asking questions nearly free — turning analytics from an expensive IT project into a standard engineering capability. Use it for exploration, log analysis, auditing, and ad-hoc reporting. Never use it for transactional workloads or sub-second latency requirements.

Chapter Three · Analytics

How Athena Works

Athena is not magic — it is a distributed SQL execution engine (Apache Trino under the hood) that reads files directly from S3 using metadata from the Glue Data Catalog. Understanding these three components — S3, Glue Catalog, and Athena's query engine — is the key to using Athena effectively in production.

The Three-Layer Architecture Core

Every Athena query interacts with three distinct layers. You must understand all three to architect a production data lake.

💾

Layer 1 — S3 (Data Store)

Physical files: CSV, JSON, Parquet, ORC, Avro
Organised in folder hierarchies (used for partitions)
Data never moves — Athena reads it in place
S3 is also where query results are saved

🗂️

Layer 2 — Glue Catalog (Metadata)

Stores table definitions: column names, types, formats
Tracks S3 partitions (folders mapped to table partitions)
Acts as a Hive Metastore — schema-on-read
Shared across Athena, Glue ETL, EMR, Redshift Spectrum

⚡

Layer 3 — Athena Engine (Query)

Distributed query engine (Apache Trino / Presto)
Parses SQL, plans execution, distributes workers
Workers read S3 files in parallel
Aggregates results and writes to S3 output

Deep Dive — The AWS Glue Data Catalog Core

The Glue Catalog is the most important concept to understand because it is the bridge between raw S3 files and queryable SQL tables. Without Glue, Athena would not know what schema a file has, what columns to expose, or where partitions live.

🧠 Mental Model — The Card Catalogue

Think of S3 as warehouse shelving and Glue Catalog as the index card system that says: "Shelf B-7 contains boxes of CSV files with columns [user_id, event_type, timestamp]. Box labels are dates." Athena uses the card catalogue to find the right shelf, then opens only the boxes it needs. Without the catalogue, it would have to open every box in the warehouse.

Glue Catalog Concept	What It Stores	Why It Matters
Database	A namespace grouping related tables	Organise tables by domain (e.g. `logs_db`, `billing_db`)
Table	Schema: column names, data types, S3 location, file format	Maps an S3 path to a queryable SQL table
Partition	Sub-folder path mapped to partition key values (e.g. year=2026)	Athena skips partitions not matching the WHERE clause
Crawler	A Glue job that scans S3 and auto-populates the catalog	Automate schema discovery; update partitions as new data arrives
SerDe	Serializer/Deserializer — how to parse a file format	Athena uses the SerDe to read CSV, JSON, Parquet, ORC correctly

Schema-on-Read — A Crucial Concept Core

Traditional databases use schema-on-write — the schema must be defined before data is loaded, and data must conform to it. Athena uses schema-on-read — the data exists in S3 in any shape, and you define a schema (via Glue) that describes how to interpret it at query time.

📝

Schema-on-Write (Traditional DB)

Define table structure first
ETL transforms data to match schema
Any schema change = migration
Data must be "clean" before loading
Rigid, but fast query execution

📖

Schema-on-Read (Athena / Data Lake)

Store data first in S3 as-is
Define schema in Glue Catalog at query time
Add columns to Glue without changing S3 files
Raw, messy data can still be queried
Flexible, but schema quality affects results

AWS Architecture Diagram — Athena + S3 + Glue Core

Amazon Athena — Full AWS Architecture with Glue Catalog

Athena, Glue Catalog, and S3 form a three-layer architecture — metadata, storage, and execution are fully separate and independently scalable

Data Lake Architecture — End-to-End Flow Core

Data Lake Architecture — Sources → S3 → Glue → Athena → BI

Ingest raw data → store in S3 → Glue ETL converts to Parquet with partitions → Athena queries via Glue Catalog → BI tools visualise

How Query Results Are Returned Core

Athena does not return results directly in memory like a traditional database client. Every query result is written to an S3 location (your chosen "query result bucket"), then served back to the client. This means:

📤

Results in S3

Output always goes to a configured S3 path
Stored as CSV (or JSON for some query types)
Results available for 45 days in Athena history
You control access via S3 bucket policies
Large results: download from S3 directly

💡

Workgroups — Cost & Access Control

Workgroups isolate teams: different result buckets
Set per-query data scan limits (cost control)
CloudWatch metrics per workgroup
Enforce encryption on query results
IAM policies per workgroup

Athena Query Engine Versions Deep

Engine	Based On	Key Capability	Use
Athena Engine v2	Presto 0.217	Standard SQL analytics	Default for older workgroups
Athena Engine v3	Apache Trino	Faster, more SQL functions, DML support	Current default — use this
Athena for Apache Spark	PySpark	Python notebook analytics, ML prep	Data engineering / ML workflows

💡 Engine v3 (Trino) — What's New

Athena Engine v3 adds DML operations (INSERT INTO, CTAS with partitioning), more SQL functions, better performance on complex joins, and support for Apache Iceberg and Hudi table formats — enabling true ACID transactions on S3 data files.

🎯 Exam Insight

Glue Data Catalog is the metadata store Athena uses — know this integration cold
Schema-on-read = define schema at query time, not before data lands — Athena's model
Query results are always written to S3 — not stored inside Athena
Workgroups = isolation + cost control for different teams
Glue Catalog is shared across Athena, EMR, Glue ETL, and Redshift Spectrum — one catalog, many tools
Athena reads data in place from S3 — data never moves or gets copied into Athena

Chapter 03 — Key Takeaway

Athena is a three-layer system: S3 holds the data, Glue Catalog holds the schema, and Athena's distributed engine executes SQL across them. The Glue Catalog is the critical integration — it tells Athena what tables exist, what columns they have, and where partitions are in S3. Understanding schema-on-read is the key insight: data lives in S3 in any format, and you define how to interpret it at query time via Glue.

Chapter Four · Analytics

Data Formats & Performance Optimization

This is the most operationally important Athena chapter. Athena charges $5 per TB scanned. How your data is stored — file format, compression, and partitioning — directly determines both query speed and cost. A poorly structured dataset costs 10–100× more to query than a well-structured one.

The Critical Insight — You Pay for Data Scanned Core

Athena pricing is not based on query complexity or time. It is based on how many bytes of S3 data your query reads. This means:

💸

Expensive Queries

SELECT * on unpartitioned CSV = scans EVERYTHING
Row-based formats (CSV/JSON) = reads ALL columns
Uncompressed files = maximum bytes scanned
Millions of tiny files = overhead per file

✅

Cheap Queries

Partitioned data = skip irrelevant folders
Columnar formats (Parquet/ORC) = read only needed columns
Compressed files = fewer bytes to scan
Right-sized files (128MB–512MB each)

⚠️ Real Cost Example

1 TB of CSV log data. You run SELECT user_id, event FROM logs WHERE date = '2026-05-07'.
Without optimization: Athena scans all 1 TB → costs $5.00, takes ~45 seconds.
With Parquet + partitioning: Athena reads ~2 GB → costs $0.01, takes ~3 seconds.
Same query. 500× cost reduction.

File Format Comparison Core

Format	Type	Compression	Column Pruning	When to Use	Athena Cost Impact
CSV	Row-based	Optional (GZIP)	❌ No — reads all columns	Quick ingestion, human-readable, small datasets	❌ Most expensive
JSON	Row-based	Optional (GZIP)	❌ No — reads all columns	API outputs, nested data, small files	❌ Most expensive
Parquet	Columnar	Built-in (Snappy)	✅ Yes — reads only selected columns	Analytics workloads, BI, large datasets	✅ 60–90% cheaper
ORC	Columnar	Built-in (ZLIB/Snappy)	✅ Yes — reads only selected columns	Hive ecosystem, high compression ratio	✅ 60–90% cheaper
Avro	Row-based	Built-in (Snappy)	❌ No	Schema evolution, Kafka/streaming pipelines	⚠️ Moderate

Why Columnar Formats Matter — Visual Explanation Core

Row-Based (CSV) vs Columnar (Parquet) — What Gets Read

Columnar formats store each column as a separate block — Athena reads only the columns in your SELECT

Partitioning — The Most Important Optimization Core

Partitioning organises S3 data into folder hierarchies that Athena can use to skip entire sections that don't match your query. Without partitioning, every query scans all files. With partitioning, Athena reads only the folders that match your WHERE clause.

🧠 Mental Model — Filing Cabinets

Imagine 100 filing cabinets labelled by month. If someone asks for "all invoices from March" — you open only the March cabinet (1 of 100). Without labels, you'd need to open every single drawer in every cabinet. Partitioning = labelling your S3 folders so Athena knows which to open and which to skip.

S3 Partitioning — Folder Structure Enables Partition Pruning

Query: WHERE year=2026 AND month=5 — Athena reads ONLY the green folder, skips everything else

Partitioning Best Practices Deep

Practice	Why	Example
Partition by time (date)	Most queries filter by date — massive scan reduction	`/year=2026/month=05/day=07/`
Partition by high-cardinality filter	If queries always filter by region or account, partition by it	`/region=us-east-1/`
Don't over-partition	Too many tiny files = overhead. Aim for 128MB–512MB per partition	Don't partition by hour if files are only 1KB
Use Hive-style naming	Athena auto-detects Hive partition format (`key=value/`)	`s3://bucket/data/year=2026/month=05/`
Register partitions in Glue	Athena won't see new partitions until Glue knows about them	Use Glue Crawler or `MSCK REPAIR TABLE`

File Size Matters — The Goldilocks Zone Deep

❌

Too Small (<10 MB)

Thousands of tiny files
Excessive S3 LIST + GET overhead
Athena spends more time opening files than reading data
Common with raw event streams

✅

Just Right (128–512 MB)

Optimal for Athena's parallel reading
Each file is meaningful work for one worker
Good balance of parallelism and efficiency
Use Glue ETL to consolidate small files

⚠️

Too Large (>2 GB)

Fewer files limits parallelism
One failed read = restart large file
Still works, but not optimal
Acceptable for cold archive data

Compression — Free Performance Boost Core

Compression	Splittable?	Ratio	Speed	Best For
Snappy	Yes (in Parquet/ORC)	Medium (2–4×)	Very fast decompress	Default for Parquet — best general choice
GZIP	No (raw), Yes (in Parquet)	High (5–8×)	Slower decompress	CSV/JSON when max compression needed
ZSTD	Yes (in Parquet)	High (5–7×)	Fast decompress	Best balance of ratio + speed (newer)
LZO	Yes (standalone)	Medium (2–3×)	Very fast	Hadoop/EMR ecosystems (legacy)
None	Yes	1× (uncompressed)	No CPU overhead	Never for analytics — wastes cost + time

The Optimization Pyramid — Summary Core

🏗️ Four Layers of Athena Cost Optimization (Most Impact First)

① Partitioning — eliminate 80–99% of data scanned by filtering folder paths
② Columnar format (Parquet/ORC) — read only needed columns, skip the rest
③ Compression (Snappy/ZSTD) — reduce bytes within each column block
④ Right-sized files (128–512 MB) — balance parallelism and per-file overhead

🎯 Exam Insight

"Reduce Athena cost" → partition data + convert to Parquet/ORC (always the answer)
"Athena scans too much data" → data is not partitioned or is stored in CSV/JSON
"Columnar format" = Parquet or ORC — enables column pruning (reads only needed columns)
Athena charges $5 per TB scanned with a 10 MB minimum per query
Parquet with Snappy is the default best-practice format for Athena
MSCK REPAIR TABLE or Glue Crawler adds new partitions to the catalog

Chapter 04 — Key Takeaway

Athena cost = data scanned. Reduce what Athena reads and you reduce both cost and latency. Partition by date (eliminates irrelevant data), use Parquet (reads only needed columns), compress with Snappy (fewer bytes per column), and consolidate files to 128–512 MB each. A well-optimized dataset can be 100× cheaper to query than a raw CSV dump.

Chapter Five · Analytics

Athena vs Redshift vs OpenSearch

AWS has many analytics and query services. Understanding when to use which — and what problem each actually solves — is the key to choosing the right tool. Athena, Redshift, and OpenSearch are the three most commonly compared and confused services.

The Core Distinction Introductory

🔍

Athena — Serverless Ad-Hoc

Query data in S3 with SQL on-demand
No infrastructure to manage
Pay per query ($5/TB scanned)
Best for: exploration, logs, ad-hoc
Latency: seconds to minutes

🏢

Redshift — Enterprise Warehouse

Managed columnar data warehouse
Petabyte-scale with sub-second queries
Pay per cluster (or Redshift Serverless per RPU-hr)
Best for: BI dashboards, complex joins, heavy analytics
Latency: sub-second to seconds

🔎

OpenSearch — Search & Logs

Full-text search + real-time log analytics
Managed Elasticsearch / OpenSearch cluster
Pay per instance + storage
Best for: text search, log dashboards, observability
Latency: milliseconds (real-time)

Full Comparison Table Core

Dimension	Athena	Redshift	OpenSearch
Model	Serverless — no infra	Provisioned cluster (or Serverless)	Managed cluster
Data location	S3 (query in place)	Local SSD + can extend to S3 (Spectrum)	Local storage (indexes)
Query language	Standard SQL (Trino)	PostgreSQL-compatible SQL	Query DSL + SQL plugin
Latency	2–30 seconds	Sub-second to seconds	Milliseconds
Concurrency	~20 concurrent queries	Hundreds (with WLM)	Thousands (search)
Pricing	$5/TB scanned	~$0.25/hr per node (or RPU-hr)	~$0.10/hr per instance
Best for	Ad-hoc, exploration, infrequent	Complex BI, heavy repeated queries	Full-text search, real-time logs
Idle cost	$0 (pay per query)	Running cluster = cost 24/7	Running cluster = cost 24/7
Schema	Schema-on-read (Glue)	Schema-on-write (tables)	Index mappings
Data format	Any (CSV, JSON, Parquet, ORC)	Proprietary columnar	JSON documents (inverted index)

Decision Guide — When to Use Each Core

If Your Scenario Is…	Use	Why
Occasional SQL queries on S3 data	Athena	Zero idle cost, no setup needed
Data exploration by analysts	Athena	Schema-on-read, flexible, low barrier
Heavy BI dashboard (Tableau/PowerBI) with sub-second response	Redshift	Optimised for repeated complex queries at scale
Thousands of concurrent dashboard users	Redshift	Better concurrency management (WLM)
Full-text search across logs or documents	OpenSearch	Inverted index = millisecond text search
Real-time observability (Kibana dashboards)	OpenSearch	Near-real-time indexing + visualisations
CloudTrail / VPC Flow Log analysis (occasional)	Athena	Files already in S3, pay only when querying
Petabyte star-schema analytics with complex joins	Redshift	Distribution keys + sort keys = optimal join performance
Query budget is near zero but query volume is low	Athena	Literal $0 idle cost; pennies per query

They Work Together — Not Competing Core

These services are complementary, not alternatives. A production analytics stack often uses multiple:

🔗

Athena + Redshift Spectrum

Hot data in Redshift local storage (fast, repeated queries)
Cold/historical data in S3 (queried via Redshift Spectrum or Athena)
Same Glue Catalog shared between both
Move data from hot → cold as it ages

🔗

Athena + OpenSearch

OpenSearch for real-time log dashboards (last 7 days)
Athena for historical deep-dives (query months of logs in S3)
Kinesis Firehose ships logs to both: OpenSearch (real-time) + S3 (archive)
Security teams use Athena for forensic investigations on historical data

Athena vs RDS/Aurora — A Common Confusion Core

Aspect	Athena	RDS / Aurora
Purpose	Analytics — read-heavy, scan-heavy	Transactional — read/write OLTP
Operations	SELECT only (read-only analytics)	INSERT, UPDATE, DELETE, SELECT
Latency	Seconds	Milliseconds
Data source	S3 files (external)	Local storage (managed disks)
Use case	Reports, auditing, exploration	Application backend database
Analogy	Library catalogue search	Cash register (record transactions)

⚠️ Common Exam Trap

Athena is not a replacement for RDS. If the question describes an application that needs INSERT/UPDATE/DELETE with millisecond latency — that is NOT Athena. Athena is for analytics. RDS/Aurora/DynamoDB are for transactional workloads. This is the most common confusion in certification exams.

🎯 Exam Insight

"Serverless SQL on S3" → Athena (not Redshift)
"Sub-second BI dashboard with complex joins" → Redshift
"Full-text search across log data in real-time" → OpenSearch
"Zero idle cost analytics" → Athena (Redshift has running cluster cost)
"OLTP workload with transactions" → NOT Athena → RDS/Aurora/DynamoDB
"Redshift Spectrum" = Redshift querying S3 data (uses same Glue Catalog as Athena)
"Athena or Redshift for occasional queries?" → Athena (lower cost when queries are infrequent)
"Athena or Redshift for thousands of concurrent BI users?" → Redshift (better concurrency)

Chapter 05 — Key Takeaway

Athena = serverless ad-hoc SQL on S3 (pay-per-query, zero idle cost). Redshift = enterprise data warehouse for heavy, repeated, complex analytics (sub-second, high concurrency). OpenSearch = real-time search and log observability (millisecond full-text search). They complement each other — use Athena for exploration and cold data, Redshift for hot BI, and OpenSearch for real-time logs.

Chapter Six · Analytics

Athena Architecture Patterns

Athena typically doesn't exist in isolation — it's part of a larger analytics architecture. This chapter covers the five most common production patterns where Athena provides the query layer: data lake analytics, CloudTrail auditing, serverless reporting, cost analysis, and security investigation workflows.

Pattern 1 — Data Lake Analytics Core

The canonical Athena use case. Raw data lands in S3, gets processed into Parquet with partitioning, then queried on-demand via Athena. This is the modern replacement for a traditional data warehouse at a fraction of the cost.

Production Data Lake — S3 + Glue + Athena + QuickSight

Sources → Raw S3 → Glue ETL (Parquet + partition) → Curated S3 → Athena queries → QuickSight dashboards

Pattern 2 — CloudTrail Log Analysis Core

CloudTrail logs every AWS API call in your account and delivers them to S3 automatically. Athena has a built-in CloudTrail table template — you can query who did what, when, from where, in seconds.

🔐

Common CloudTrail Queries

Who deleted that S3 bucket / EC2 instance?
All API calls from a specific IP address
All root account usage in the last 30 days
All IAM policy changes in a time range
Failed authentication attempts

⚡

How It Works

CloudTrail delivers JSON logs to S3 (automatic)
Create Athena table pointing at CloudTrail S3 path
Partition by region + year + month + day
Query with SQL: WHERE eventname = 'DeleteBucket'
Cost: pennies per query (small scan with partitions)

Pattern 3 — Serverless Reporting System Core

Build a reporting system with zero running infrastructure. An API Gateway triggers Lambda, which runs an Athena query, returns results to the client or stores them for a dashboard.

Serverless Reporting — API Gateway + Lambda + Athena

Pattern 4 — AWS Cost & Usage Report Analysis Core

💰

What Is CUR?

AWS Cost & Usage Report — most detailed billing data
Delivered as Parquet files to S3 (hourly/daily)
Line-item detail: service, usage, tags, accounts
Can be queried directly by Athena (no ETL needed!)

📊

Example Queries

Total spend by team/project (using cost allocation tags)
Top 10 most expensive resources this month
Daily spending trend for EC2 vs Lambda vs S3
Unused EBS volumes costing money
Reserved Instance utilization tracking

Pattern 5 — Security Investigation Workflow Deep

During a security incident, you need to query massive volumes of logs across accounts in minutes. Athena enables forensic investigation across CloudTrail, VPC Flow Logs, GuardDuty findings, and DNS logs — all stored in S3.

Security Investigation — Multi-Source Forensic Analysis

Multi-log forensics: join CloudTrail + VPC Flow + GuardDuty findings using Athena SQL — no infrastructure to set up during an incident

Pattern Summary Table Core

Pattern	Data Sources	Key Benefit	Cost Profile
Data Lake Analytics	Any — ingested via Kinesis/DMS/direct S3	Replaces traditional data warehouse	$5/TB per query, $0 idle
CloudTrail Audit	CloudTrail JSON logs in S3	Who did what, when — instant forensics	Pennies per query (small partitions)
Serverless Reporting	Any S3 data + API Gateway + Lambda	Zero-infra reporting API, $0 idle	Lambda + Athena per invocation
CUR Cost Analysis	AWS Cost & Usage Report (Parquet in S3)	Granular billing insight, tag-based allocations	Near-free (CUR is already Parquet)
Security Investigation	CloudTrail + VPC Flow + GuardDuty + DNS	Multi-source forensics, cross-log JOINs	Variable (depends on time range scanned)

🎯 Exam Insight

"Analyse CloudTrail logs with SQL" → Athena + CloudTrail table in Glue
"Query AWS Cost & Usage Report" → Athena (CUR is Parquet in S3 — perfect match)
"Serverless reporting API, zero idle cost" → API Gateway + Lambda + Athena
"Cross-account security forensics" → Athena with org-wide CloudTrail S3 bucket
"Data lake query layer" → Athena + Glue Catalog + S3

Chapter 06 — Key Takeaway

Athena's power comes from pairing it with the right architecture. Data Lake (S3 + Glue + Athena + QuickSight) replaces expensive data warehouses. CloudTrail + Athena gives instant security auditing. Lambda + Athena builds zero-cost serverless reporting APIs. CUR + Athena enables granular cost governance. In every case, the pattern is the same: data in S3, metadata in Glue, query in Athena.

Chapter Seven · Analytics

Security & Governance

Athena security is not one thing — it is the intersection of three permission systems: IAM (who can run queries), S3 (who can read/write data), and Glue Catalog (who can see table metadata). Understanding these three layers is critical for secure production use.

The Three Security Layers Core

🔑

Layer 1 — IAM Permissions

athena:StartQueryExecution
athena:GetQueryResults
athena:GetWorkGroup
Controls WHO can run queries and in which workgroup
Use IAM policies to restrict by workgroup

📁

Layer 2 — S3 Access

s3:GetObject on data bucket
s3:PutObject on results bucket
s3:GetBucketLocation
Controls WHAT data Athena can read
Use bucket policies + IAM to restrict

🗂️

Layer 3 — Glue Catalog

glue:GetDatabase, glue:GetTable
glue:GetPartitions
Controls WHICH tables/databases are visible
Use resource-based Glue policies
Lake Formation for fine-grained column access

Security Architecture Diagram Core

Athena Security — Triple-Lock Access Model

A query must pass IAM (can you run queries?), Glue (can you see this table?), and S3 (can you read these files?) — all three must allow access

Encryption Core

What	Encryption Options	Configuration
Source data in S3	SSE-S3, SSE-KMS, CSE-KMS	Set on S3 bucket default encryption or per-object
Query results in S3	SSE-S3, SSE-KMS	Configured per workgroup (enforce via workgroup settings)
Data in transit	TLS 1.2 (automatic)	JDBC/ODBC connections are encrypted by default

Workgroups — Team Isolation & Governance Core

👥

What Workgroups Control

Separate query result locations per team
Per-query data scan limits (cost cap)
Enforce encryption on results
CloudWatch metrics per workgroup
IAM policies scoped to workgroup

🏗️

Best Practice: Multi-Team Setup

One workgroup per team (marketing, engineering, security)
Set scan limit: e.g. 100 GB max per query
Each team's results → separate S3 prefix
Audit trail: CloudTrail logs query execution per workgroup
Tag workgroups for cost allocation tracking

Lake Formation — Fine-Grained Access Deep

For enterprise environments that need column-level or row-level security, AWS Lake Formation adds another governance layer on top of Glue Catalog:

Feature	Without Lake Formation	With Lake Formation
Table access	IAM + Glue policies (coarse)	Fine-grained: per-column, per-row, per-user
Column masking	Not possible natively	Hide sensitive columns from specific users
Row filtering	Not possible natively	Users see only rows matching their filter
Cross-account sharing	Complex IAM + S3 policies	One-click table sharing across accounts
Audit	CloudTrail only	Centralised access audit log in Lake Formation

💡 When to Use Lake Formation

Use Lake Formation when you need column-level security (e.g. hide SSN column from certain teams), row-level filtering (e.g. each team sees only their region's data), or cross-account data sharing without complex IAM. For simple setups with 1–2 teams, standard IAM + Glue + S3 policies are sufficient.

Query Auditing & Monitoring Core

📋

CloudTrail

Logs every Athena API call
StartQueryExecution events
Who ran what query, when
IP source, IAM identity

📊

CloudWatch Metrics

Data scanned per workgroup
Query execution time
Number of queries (throttled)
Set alarms on cost thresholds

📝

Query History

45-day history in Athena console
Query text, status, data scanned
Duration, result location
Useful for cost attribution

🎯 Exam Insight

Athena security = IAM + S3 + Glue — all three must allow access for a query to succeed
Workgroups = team isolation + enforce encryption + cost control (per-query scan limit)
Lake Formation = column-level and row-level access control on Glue tables
Query result encryption can be enforced at the workgroup level (SSE-S3 or SSE-KMS)
"Restrict analyst to seeing only their team's data" → Lake Formation row-level security
"Limit Athena spend per team" → workgroup with per-query scan limit

Chapter 07 — Key Takeaway

Athena security is the union of three systems: IAM (who can query), S3 (what data is readable), and Glue Catalog (which tables are visible). All three must permit access. Use workgroups to isolate teams, enforce encryption, and set cost limits. For enterprise column/row-level security, add AWS Lake Formation. Every query is auditable via CloudTrail + CloudWatch.

Chapter Eight · Analytics

Cost Optimization & Best Practices

Athena's pricing is deceptively simple: $5 per TB of data scanned. But in practice, the difference between a well-architected data lake and a naive one is 100× cost difference for the same query. This chapter consolidates every optimization into a production-ready checklist.

Pricing Model — What You Actually Pay Core

Component	Cost	Notes
Data scanned per query	$5.00 per TB	Rounded up to 10 MB minimum per query
Cancelled queries	Charged for data scanned before cancellation	Cancel early to limit cost
DDL statements (CREATE TABLE, etc.)	Free	No data scanned
Failed queries	Free	Syntax errors, permission errors = $0
Glue Data Catalog	Free first 1M objects/month	Then $1 per 100K objects
S3 storage	~$0.023/GB/month (Standard)	Separate from Athena — you'd store it anyway
S3 GET requests	$0.0004 per 1000 requests	Athena generates many GETs — usually negligible

💡 Real-World Cost Example

Scenario: 100 analysts, each running 5 queries/day, scanning 500 MB per query (well-optimized Parquet + partitioning).
Monthly cost: 100 × 5 × 30 × 0.5 GB = 7,500 GB = 7.5 TB × $5 = $37.50/month.
Compare to a Redshift cluster doing the same work: $2,000–$8,000/month.

The Complete Cost Optimization Checklist Core

#	Optimization	Impact	Effort	How
1	Partition by date	80–99% reduction	Low	Hive-style folders: `/year=YYYY/month=MM/day=DD/`
2	Use Parquet or ORC	60–90% reduction	Medium	Glue ETL job to convert raw → Parquet
3	Compress with Snappy/ZSTD	20–40% further	Free (Parquet default)	Parquet uses Snappy by default — just use Parquet
4	Consolidate small files	10–50% improvement	Medium	Glue ETL or S3 BatchOps to merge into 128–512 MB
5	SELECT only needed columns	Proportional	Free	Never `SELECT *` — list specific columns
6	Use workgroup scan limits	Cost safety net	Low	Set max per-query scan (e.g. 100 GB)
7	CTAS for repeated queries	Avoids repeated scans	Low	CREATE TABLE AS to materialize results
8	Use Athena query result reuse	$0 for cache hits	Free	Enable "Reuse query results" in workgroup (7-day cache)

CTAS — Materialize Expensive Queries Deep

CREATE TABLE AS SELECT (CTAS) is a powerful technique: run an expensive query once, store the results as a new Parquet table, then all subsequent queries read the smaller materialized result instead of rescanning raw data.

❌

Without CTAS

Dashboard refreshes 10×/day
Each refresh scans 200 GB of raw data
Daily cost: 10 × 200 GB × $5/TB = $10/day
Monthly: ~$300 just for one dashboard

✅

With CTAS

CTAS runs once/day → 200 GB scan = $1/day
Materialized result: 2 GB Parquet table
Dashboard reads 2 GB × 10 = 20 GB = $0.10/day
Monthly: ~$33 (90% savings)

Production Best Practices Summary Core

✅

Data Engineering Practices

Store analytics data as Parquet with Snappy
Partition by the most common filter (usually date)
File size target: 128–512 MB per file
Use Glue Crawlers to auto-update partitions
Separate raw zone (JSON/CSV) from curated zone (Parquet)
Use lifecycle rules to move old data to Glacier

🏗️

Operational Governance

One workgroup per team — separate cost tracking
Set per-query scan limits to prevent runaway queries
Encrypt results (enforce at workgroup level)
Enable query result reuse (7-day cache)
Monitor with CloudWatch: data scanned, query time
Tag workgroups for FinOps cost allocation

🔍

Query Writing Practices

Never SELECT * — list only needed columns
Always filter by partition keys in WHERE clause
Use LIMIT during exploration
Materialise expensive joins with CTAS
Use views for common access patterns
Check "Data scanned" after each query — optimise if high

⚠️

Anti-Patterns to Avoid

❌ SELECT * on large tables
❌ Querying unpartitioned data at scale
❌ Storing analytics data as CSV/JSON long-term
❌ Millions of tiny files (< 1 MB each)
❌ Missing partition keys in WHERE clause
❌ Using Athena as a transactional database

When NOT to Use Athena Core

Scenario	Why Not Athena	Better Choice
Sub-second interactive dashboards (thousands of users)	2–30s query latency, limited concurrency	Redshift + SPICE (QuickSight) or ElastiCache
Transactional workloads (INSERT/UPDATE/DELETE)	Athena is read-only analytics	RDS, Aurora, DynamoDB
Real-time streaming analytics	Athena is batch, not streaming	Kinesis Analytics, Flink, OpenSearch
Very high query volume (100+ concurrent queries)	Default concurrency limit ~20–25 DML queries	Redshift Serverless (higher concurrency)
Application backend (millisecond reads)	Not a database — seconds of latency	DynamoDB, ElastiCache, Aurora

🎯 Exam Insight

"Reduce Athena cost" → partition + Parquet + compression (ALWAYS the answer)
"Pay per query, $5/TB" → Athena pricing (vs Redshift per-hour cluster cost)
CTAS = CREATE TABLE AS — materializes results, avoids repeated scans
"Millions of small files slow down Athena" → consolidate to 128–512 MB files
"Limit Athena spend" → workgroup per-query byte scan limit
"Cache Athena results" → query result reuse (workgroup setting)
Athena minimum charge: 10 MB per query even if data scanned is less

Full Page Summary — Amazon Athena Core

📋 Amazon Athena — Complete Recap
What: Serverless SQL query engine that reads data directly from S3 — no servers, no databases, no loading.
How: Three-layer architecture — S3 (data) + Glue Catalog (metadata/schema) + Athena Engine (distributed SQL execution via Trino).
Cost: $5 per TB of data scanned. Optimize with partitioning (80–99% reduction), Parquet (60–90%), and compression.
Best for: Ad-hoc analytics, log analysis, CloudTrail auditing, CUR billing queries, data lake exploration, serverless reporting.
Not for: OLTP workloads, sub-second latency, high-concurrency dashboards, real-time streaming.
vs Redshift: Athena = serverless, $0 idle, ad-hoc. Redshift = managed warehouse, sub-second, high concurrency.
vs OpenSearch: Athena = SQL analytics on S3. OpenSearch = real-time search/log observability.
Security: IAM + S3 + Glue — triple-lock model. Workgroups for team isolation. Lake Formation for column/row-level access.
Key integration: Glue Data Catalog is the bridge between raw S3 files and queryable SQL tables.

👉 Final Takeaway — Athena in One Thought

Amazon Athena turns every S3 bucket into a queryable database — without provisioning, loading, or managing anything. Store data cheaply in S3, describe it in Glue, query it with SQL, pay only for what you scan. Optimize with Parquet and partitioning, and your analytics cost drops to near-zero. It is the query engine that makes the AWS data lake practical, accessible, and affordable.