Amazon Athena Amazon S3 AWS Glue
LearningTree ยท AWS ยท Analytics

Amazon Athena โ€”
Serverless Analytics Directly on S3

Query petabytes of data stored in S3 using standard SQL โ€” without provisioning a single server, loading data into a database, or managing infrastructure. Athena is the engine of the modern AWS data lake.

01
Chapter One ยท Analytics

What is Amazon Athena?

Amazon Athena is a serverless, interactive query service that lets you analyze data stored in Amazon S3 using standard SQL โ€” without loading it into a database, provisioning servers, or managing any infrastructure. You point Athena at files in S3, write a SQL query, and results appear in seconds.

The Core Idea in One Sentence Introductory

Traditional analytics requires you to move data into a database before you can query it. Athena flips this model โ€” the query goes to the data, not the data to the query engine.

๐Ÿ›๏ธ Mental Model โ€” The Digital Library

Imagine a giant digital library with millions of books stored on shelves (S3). In a traditional database, you'd need to move all the books into a reading room, catalogue them, and maintain the room before anyone can search. Athena is a librarian who reads the books exactly where they are on the shelves โ€” you just describe what you're looking for, and the answer comes back instantly. No moving. No room to maintain. No staff to hire.

Traditional vs Athena โ€” The Paradigm Shift Introductory
โŒ Traditional Analytics
  • Provision a database cluster (RDS, Redshift)
  • Design and create tables and schemas
  • ETL pipeline loads data into the database
  • Pay for the cluster 24/7, even when idle
  • Scale the cluster as data grows
  • Manage backups, patches, engine versions
  • Time-to-first-query: days or weeks
vs
โœ… With Amazon Athena
  • Data stays in S3 exactly where it already is
  • Define a schema in Glue Catalog (metadata only)
  • Write a SQL query โ€” Athena executes it directly
  • Pay only for what you query (per GB scanned)
  • No infrastructure to scale โ€” Athena scales itself
  • Nothing to patch, manage, or maintain
  • Time-to-first-query: minutes
What Kinds of Data Can Athena Query? Core
๐Ÿ“‹

Log Files

  • CloudTrail audit logs
  • ALB / CloudFront access logs
  • VPC Flow Logs
  • Application logs exported to S3
๐Ÿ“Š

Business Data

  • CSV exports from databases
  • JSON event streams
  • AWS Cost & Usage Reports (CUR)
  • Parquet data from ETL pipelines
๐Ÿ”ฌ

Analytics Datasets

  • Clickstream and user event data
  • IoT sensor readings
  • Financial transaction records
  • Machine learning feature stores
Athena is Serverless โ€” What Does That Actually Mean? Core

"Serverless" is overused โ€” here is exactly what it means for Athena:

ConcernTraditional DBAmazon Athena
ServersYou provision EC2 or DB instancesAWS owns and operates all compute
ScalingYou resize instances or add nodesAthena scales automatically per query
Idle costYou pay even when no queries runZero cost when not querying
PatchingYou apply engine/OS updatesAWS maintains all infrastructure
Capacity planningRequired โ€” guess future loadNot required โ€” on-demand
Startup timeMinutes to hours to provisionInstant โ€” always ready
Concept Diagram โ€” How Athena Fits Introductory
Amazon Athena โ€” Serverless Query Model
GLUE CATALOG Table metadata Schema & partitions ANALYST SQL Query Console / API SQL ATHENA Serverless Query Engine (Trino) $5 / TB scanned reads schema scans files S3 DATA LAKE CSV / JSON Parquet / ORC $0.023/GB/month results Analyst โ†’ Athena (SQL) โ†’ reads schema from Glue โ†’ scans S3 files โ†’ returns results No database ยท No loading ยท No servers
Athena reads schema from Glue Catalog, scans files in S3, returns SQL results โ€” no database, no loading, no servers
Query Flow Diagram โ€” Step by Step Core
Athena Query Execution Flow โ€” What Happens When You Run a Query
โ‘  SUBMIT SQL query via Console / API / JDBC โ‘ก PARSE Athena parses SQL, validates against Glue โ‘ข PLAN Query planner selects partitions to scan โ‘ฃ EXECUTE Distributed workers scan S3 in parallel โ‘ค RESULTS Written to S3 output location You trigger Glue metadata Partition pruning S3 file reads CSV or JSON
Each query is fully isolated โ€” Athena spins up distributed workers, scans only required S3 partitions, returns results
Athena's Place in the AWS Ecosystem Core

Athena is not a standalone service โ€” it is the query layer of the modern AWS data lake stack:

๐Ÿ’พ

Storage Layer โ€” Amazon S3

S3 holds the actual data files. It is the source of truth โ€” cheap, durable, infinitely scalable. Athena never moves this data; it reads it in place.

๐Ÿ—‚๏ธ

Metadata Layer โ€” AWS Glue Catalog

Glue Catalog holds the schema โ€” what tables exist, what columns they have, where the S3 partitions are. Athena reads this catalog before scanning any data.

๐Ÿ”

Query Layer โ€” Amazon Athena

Athena is the SQL engine. It reads schemas from Glue, figures out which S3 files to read, distributes the scan across workers, and returns results.

๐Ÿ“ˆ

Visualisation Layer โ€” QuickSight / BI Tools

Tools like Amazon QuickSight, Tableau, or Power BI connect to Athena via JDBC/ODBC to build dashboards and reports on top of S3 data.

Why Athena is Important โ€” The Big Picture Introductory

Before services like Athena, running analytics on large datasets required โ€” at minimum โ€” a running database cluster, an ETL pipeline to load data, and a team to maintain both. Small companies couldn't afford it. Even large companies struggled with idle infrastructure.

Athena changed this permanently. Today, a startup can store billions of events in S3 for pennies per GB per month, and query that data with SQL on demand, paying only for each query. There is no minimum spend, no committed capacity, no infrastructure team required.

Why This Matters for Architecture

Athena is a foundational component of the modern AWS data lake pattern: S3 (storage) + Glue (catalog) + Athena (query) + QuickSight (visualise). This stack replaces what used to require a full data warehouse team and six-figure infrastructure budgets. You'll see this pattern in almost every serious AWS analytics architecture.

๐ŸŽฏ Exam Insight
  • "Query data directly in S3 using SQL" โ†’ Athena
  • "Serverless analytics" or "no infrastructure to manage" โ†’ Athena
  • "Pay per query" or "pay per GB scanned" โ†’ Athena pricing model
  • Athena uses Glue Data Catalog for table schemas โ€” this is the key integration to remember
  • Results are always written to an S3 output location โ€” they are not stored in Athena itself
Chapter 01 โ€” Key Takeaway

Athena is a serverless SQL query engine that reads data directly from S3. No servers to provision, no data to load, no infrastructure to manage. You pay per query, per GB scanned. It is the query layer of the modern AWS data lake โ€” built on top of S3 for storage and Glue Catalog for schema management.

02
Chapter Two ยท Analytics

Why Serverless Analytics Matters

Before serverless analytics, every organisation that wanted to query large datasets had to become an infrastructure company by accident. They spent more time managing databases than getting value from data. Athena changes the economics entirely โ€” the cost of asking a question is nearly zero, which changes how teams think about data.

The Traditional Analytics Problem Introductory

Classic analytics stacks require you to solve a set of hard operational problems before you can even run a single query:

๐Ÿ—๏ธ

Infrastructure Overhead

  • Provision database clusters upfront
  • Estimate capacity for peak load (often wrong)
  • Over-provision because under-provisioning is worse
  • Pay for idle capacity every hour of every day
  • Maintain OS, engine patches, version upgrades
๐Ÿข

Operational Friction

  • Schema must be defined before data arrives
  • ETL pipelines needed to load data into DB
  • Pipeline failures = stale or missing data
  • Schema changes require expensive migrations
  • Scaling down is as hard as scaling up
โš ๏ธ The Real Cost of Traditional Analytics

A small data team's analytics database running 24/7 on a 4-node Redshift cluster costs ~$7,000โ€“$15,000 per month โ€” whether anyone is querying it or not. Most queries happen during business hours (8 hours/day, 5 days/week). That means 75% of the cost is paying for idle infrastructure.

How Athena Changes the Economics Core

Serverless analytics decouples when data is stored from when it is queried. With Athena:

Traditional ApproachWith AthenaImpact
Pay 24/7 for clusterPay per query ($5/TB scanned)80โ€“95% cost reduction for ad-hoc workloads
Provision before you know the loadNo provisioning neededZero capacity risk
ETL before first queryQuery files directly as-isHours โ†’ minutes to first insight
One cluster = one team's bottleneckUnlimited parallel queriesNo queue, no contention
Schema must match expectationsSchema-on-read; evolve freelyFlexible exploration
Real-World Use Cases Where Athena Shines Core
๐Ÿ”

Log Analytics

  • Query ALB, CloudFront, or VPC Flow Logs
  • Find which IPs hit which endpoints
  • Analyse error rate trends over time
  • No log pipeline needed โ€” files already in S3
๐Ÿ”

CloudTrail Security Audit

  • Query who called which AWS API and when
  • Find all IAM key usage across accounts
  • Detect unusual cross-region activity
  • Native Athena table for CloudTrail exists
๐Ÿ’ฐ

Cost Reporting

  • Query AWS Cost & Usage Reports (CUR)
  • Break down spending by team, tag, service
  • Find idle resources burning budget
  • CUR is JSON/Parquet โ€” perfect for Athena
๐Ÿ“Š

Ad-Hoc Reporting

  • Business teams query S3 data directly
  • One-off analysis without data engineering
  • Connect BI tools via JDBC/ODBC
  • No change to underlying data storage
๐Ÿšจ

Security Investigations

  • Athena + GuardDuty findings in S3
  • Query network flow logs during incidents
  • Correlate events across multiple log sources
  • Forensic queries on historical data
๐Ÿญ

Data Lake Analytics

  • Explore raw data before building pipelines
  • Validate data quality with SQL
  • Query across multiple S3 datasets at once
  • Power dashboards via QuickSight
The Right Mental Model โ€” Athena is Optimised For Core
โœ… Athena is Great For
  • Exploration โ€” discovering patterns in data you don't yet understand
  • Ad-hoc analytics โ€” irregular, one-off queries by many users
  • Occasional reporting โ€” dashboards that run a few times a day
  • Large scans on cold data โ€” history, logs, archives
  • Security auditing โ€” query CloudTrail and VPC Flow on-demand
  • Cost-effective BI โ€” when query volume is low to medium
โ‰ 
โŒ Athena is NOT For
  • Transactional workloads โ€” INSERT/UPDATE/DELETE, OLTP โ†’ use RDS
  • Sub-second latency queries โ€” real-time dashboards โ†’ use ElastiCache
  • High-concurrency BI (thousands of users) โ†’ use Redshift
  • Streaming data โ€” continuous real-time feeds โ†’ use Kinesis
  • Application backends โ€” Athena is not a transactional DB
Architecture Insight โ€” Where Athena Lives in a Data Platform Core
Serverless Analytics Platform โ€” Where Athena Fits
DATA SOURCES App logs CloudTrail VPC Flow Logs CUR Reports DB exports Kinesis streams S3 Data Lake Raw storage AWS GLUE Catalog + ETL (optional โ€” enrich or convert format) ATHENA Serverless SQL Query engine Pay per query BI & REPORTING QuickSight Tableau / PowerBI Custom dashboards JDBC / ODBC Athena: $5 per TB scanned ยท S3: $0.023/GB/month ยท Glue Catalog: free for first 1M objects
A full serverless analytics platform โ€” no running databases, no ETL clusters, no idle infrastructure
Common Mistakes โ€” What Teams Get Wrong Core
โŒ

Mistake 1: Using Athena Like a Database

Teams try to run real-time application queries through Athena. Athena executes in seconds, not milliseconds. It is an analytics tool, not a database backend. Use RDS/DynamoDB for application queries.

โŒ

Mistake 2: Querying Unpartitioned Data

Running queries on a flat S3 bucket with millions of small files is catastrophically expensive. Without partitioning, every query scans every file. This is the most common and costly Athena mistake.

โŒ

Mistake 3: Storing Data as CSV/JSON at Scale

CSV and JSON are row-based formats. A query for two columns out of 200 must read every column. Using Parquet or ORC reduces costs by 60โ€“80% and query time dramatically. (More in Chapter 4.)

โŒ

Mistake 4: Expecting Consistent Sub-Second Results

Athena has a query startup overhead of 1โ€“3 seconds. If your dashboard needs to refresh every second, Athena is not the right tool. For interactive sub-second dashboards, cache results or use ElastiCache.

๐ŸŽฏ Exam Insight
  • "Ad-hoc SQL queries on S3 data with no infrastructure" โ†’ Athena (not Redshift)
  • "Analyse CloudTrail logs with SQL" โ†’ Athena (built-in CloudTrail table)
  • "OLTP workload, INSERT/UPDATE, millisecond latency" โ†’ NOT Athena โ†’ RDS or DynamoDB
  • "Reduce Athena query cost" โ†’ add partitioning + use Parquet/ORC format
  • Athena is not suitable for transactional workloads โ€” this distinction is frequently tested
Chapter 02 โ€” Key Takeaway

Serverless analytics eliminates the infrastructure tax on data exploration. Athena's pay-per-query model makes asking questions nearly free โ€” turning analytics from an expensive IT project into a standard engineering capability. Use it for exploration, log analysis, auditing, and ad-hoc reporting. Never use it for transactional workloads or sub-second latency requirements.

03
Chapter Three ยท Analytics

How Athena Works

Athena is not magic โ€” it is a distributed SQL execution engine (Apache Trino under the hood) that reads files directly from S3 using metadata from the Glue Data Catalog. Understanding these three components โ€” S3, Glue Catalog, and Athena's query engine โ€” is the key to using Athena effectively in production.

The Three-Layer Architecture Core

Every Athena query interacts with three distinct layers. You must understand all three to architect a production data lake.

๐Ÿ’พ

Layer 1 โ€” S3 (Data Store)

  • Physical files: CSV, JSON, Parquet, ORC, Avro
  • Organised in folder hierarchies (used for partitions)
  • Data never moves โ€” Athena reads it in place
  • S3 is also where query results are saved
๐Ÿ—‚๏ธ

Layer 2 โ€” Glue Catalog (Metadata)

  • Stores table definitions: column names, types, formats
  • Tracks S3 partitions (folders mapped to table partitions)
  • Acts as a Hive Metastore โ€” schema-on-read
  • Shared across Athena, Glue ETL, EMR, Redshift Spectrum
โšก

Layer 3 โ€” Athena Engine (Query)

  • Distributed query engine (Apache Trino / Presto)
  • Parses SQL, plans execution, distributes workers
  • Workers read S3 files in parallel
  • Aggregates results and writes to S3 output
Deep Dive โ€” The AWS Glue Data Catalog Core

The Glue Catalog is the most important concept to understand because it is the bridge between raw S3 files and queryable SQL tables. Without Glue, Athena would not know what schema a file has, what columns to expose, or where partitions live.

๐Ÿง  Mental Model โ€” The Card Catalogue

Think of S3 as warehouse shelving and Glue Catalog as the index card system that says: "Shelf B-7 contains boxes of CSV files with columns [user_id, event_type, timestamp]. Box labels are dates." Athena uses the card catalogue to find the right shelf, then opens only the boxes it needs. Without the catalogue, it would have to open every box in the warehouse.

Glue Catalog ConceptWhat It StoresWhy It Matters
DatabaseA namespace grouping related tablesOrganise tables by domain (e.g. logs_db, billing_db)
TableSchema: column names, data types, S3 location, file formatMaps an S3 path to a queryable SQL table
PartitionSub-folder path mapped to partition key values (e.g. year=2026)Athena skips partitions not matching the WHERE clause
CrawlerA Glue job that scans S3 and auto-populates the catalogAutomate schema discovery; update partitions as new data arrives
SerDeSerializer/Deserializer โ€” how to parse a file formatAthena uses the SerDe to read CSV, JSON, Parquet, ORC correctly
Schema-on-Read โ€” A Crucial Concept Core

Traditional databases use schema-on-write โ€” the schema must be defined before data is loaded, and data must conform to it. Athena uses schema-on-read โ€” the data exists in S3 in any shape, and you define a schema (via Glue) that describes how to interpret it at query time.

๐Ÿ“

Schema-on-Write (Traditional DB)

  • Define table structure first
  • ETL transforms data to match schema
  • Any schema change = migration
  • Data must be "clean" before loading
  • Rigid, but fast query execution
๐Ÿ“–

Schema-on-Read (Athena / Data Lake)

  • Store data first in S3 as-is
  • Define schema in Glue Catalog at query time
  • Add columns to Glue without changing S3 files
  • Raw, messy data can still be queried
  • Flexible, but schema quality affects results
AWS Architecture Diagram โ€” Athena + S3 + Glue Core
Amazon Athena โ€” Full AWS Architecture with Glue Catalog
QUERY CONSUMERS Console AWS Console JDBC/ODBC BI Tools / SQL IDE QuickSight Dashboards API / SDK Programmatic Lambda Automated queries AMAZON ATHENA Serverless Query Engine (Trino) Workgroups ยท Query history ยท Cost controls reads schema GLUE CATALOG Databases ยท Tables Partitions ยท SerDe Shared metastore GLUE CRAWLER Auto-detects schemas scans files writes results AMAZON S3 ๐Ÿ“ /data/raw/ ๐Ÿ“ /data/processed/ ๐Ÿ“ /athena/results/ Parquet ยท ORC ยท CSV JSON ยท Avro ยท GZIP Output location Glue Catalog = metadata bridge ยท S3 = actual data ยท Athena = distributed SQL engine
Athena, Glue Catalog, and S3 form a three-layer architecture โ€” metadata, storage, and execution are fully separate and independently scalable
Data Lake Architecture โ€” End-to-End Flow Core
Data Lake Architecture โ€” Sources โ†’ S3 โ†’ Glue โ†’ Athena โ†’ BI
โ‘  INGEST Kinesis Firehose DMS (DB export) Direct S3 upload CloudTrail auto App log shipping Snowball (batch) RAW S3 CSV / JSON as-received GLUE ETL Jobs Convert to Parquet + partition PROCESSED Parquet / ORC partitioned ATHENA SQL queries via Glue Catalog $5 / TB scanned BI LAYER QuickSight Tableau PowerBI Custom apps
Ingest raw data โ†’ store in S3 โ†’ Glue ETL converts to Parquet with partitions โ†’ Athena queries via Glue Catalog โ†’ BI tools visualise
How Query Results Are Returned Core

Athena does not return results directly in memory like a traditional database client. Every query result is written to an S3 location (your chosen "query result bucket"), then served back to the client. This means:

๐Ÿ“ค

Results in S3

  • Output always goes to a configured S3 path
  • Stored as CSV (or JSON for some query types)
  • Results available for 45 days in Athena history
  • You control access via S3 bucket policies
  • Large results: download from S3 directly
๐Ÿ’ก

Workgroups โ€” Cost & Access Control

  • Workgroups isolate teams: different result buckets
  • Set per-query data scan limits (cost control)
  • CloudWatch metrics per workgroup
  • Enforce encryption on query results
  • IAM policies per workgroup
Athena Query Engine Versions Deep
EngineBased OnKey CapabilityUse
Athena Engine v2Presto 0.217Standard SQL analyticsDefault for older workgroups
Athena Engine v3Apache TrinoFaster, more SQL functions, DML supportCurrent default โ€” use this
Athena for Apache SparkPySparkPython notebook analytics, ML prepData engineering / ML workflows
๐Ÿ’ก Engine v3 (Trino) โ€” What's New

Athena Engine v3 adds DML operations (INSERT INTO, CTAS with partitioning), more SQL functions, better performance on complex joins, and support for Apache Iceberg and Hudi table formats โ€” enabling true ACID transactions on S3 data files.

๐ŸŽฏ Exam Insight
  • Glue Data Catalog is the metadata store Athena uses โ€” know this integration cold
  • Schema-on-read = define schema at query time, not before data lands โ€” Athena's model
  • Query results are always written to S3 โ€” not stored inside Athena
  • Workgroups = isolation + cost control for different teams
  • Glue Catalog is shared across Athena, EMR, Glue ETL, and Redshift Spectrum โ€” one catalog, many tools
  • Athena reads data in place from S3 โ€” data never moves or gets copied into Athena
Chapter 03 โ€” Key Takeaway

Athena is a three-layer system: S3 holds the data, Glue Catalog holds the schema, and Athena's distributed engine executes SQL across them. The Glue Catalog is the critical integration โ€” it tells Athena what tables exist, what columns they have, and where partitions are in S3. Understanding schema-on-read is the key insight: data lives in S3 in any format, and you define how to interpret it at query time via Glue.

04
Chapter Four ยท Analytics

Data Formats & Performance Optimization

This is the most operationally important Athena chapter. Athena charges $5 per TB scanned. How your data is stored โ€” file format, compression, and partitioning โ€” directly determines both query speed and cost. A poorly structured dataset costs 10โ€“100ร— more to query than a well-structured one.

The Critical Insight โ€” You Pay for Data Scanned Core

Athena pricing is not based on query complexity or time. It is based on how many bytes of S3 data your query reads. This means:

๐Ÿ’ธ

Expensive Queries

  • SELECT * on unpartitioned CSV = scans EVERYTHING
  • Row-based formats (CSV/JSON) = reads ALL columns
  • Uncompressed files = maximum bytes scanned
  • Millions of tiny files = overhead per file
โœ…

Cheap Queries

  • Partitioned data = skip irrelevant folders
  • Columnar formats (Parquet/ORC) = read only needed columns
  • Compressed files = fewer bytes to scan
  • Right-sized files (128MBโ€“512MB each)
โš ๏ธ Real Cost Example

1 TB of CSV log data. You run SELECT user_id, event FROM logs WHERE date = '2026-05-07'.
Without optimization: Athena scans all 1 TB โ†’ costs $5.00, takes ~45 seconds.
With Parquet + partitioning: Athena reads ~2 GB โ†’ costs $0.01, takes ~3 seconds.
Same query. 500ร— cost reduction.

File Format Comparison Core
FormatTypeCompressionColumn PruningWhen to UseAthena Cost Impact
CSVRow-basedOptional (GZIP)โŒ No โ€” reads all columnsQuick ingestion, human-readable, small datasetsโŒ Most expensive
JSONRow-basedOptional (GZIP)โŒ No โ€” reads all columnsAPI outputs, nested data, small filesโŒ Most expensive
ParquetColumnarBuilt-in (Snappy)โœ… Yes โ€” reads only selected columnsAnalytics workloads, BI, large datasetsโœ… 60โ€“90% cheaper
ORCColumnarBuilt-in (ZLIB/Snappy)โœ… Yes โ€” reads only selected columnsHive ecosystem, high compression ratioโœ… 60โ€“90% cheaper
AvroRow-basedBuilt-in (Snappy)โŒ NoSchema evolution, Kafka/streaming pipelinesโš ๏ธ Moderate
Why Columnar Formats Matter โ€” Visual Explanation Core
Row-Based (CSV) vs Columnar (Parquet) โ€” What Gets Read
CSV (Row-Based) SELECT user_id, event โ†’ must read ALL columns user_id event ip_addr browser country u001, click, 10.0.0.1, Chrome, US ... row 1 u002, view, 10.0.0.2, Safari, UK ... row 2 u003, buy, 10.0.0.3, Firefox, DE ... row 3 โฌ† Reads 100% of data (all 5 columns) You pay for ip_addr, browser, country even though unused Parquet (Columnar) SELECT user_id, event โ†’ reads ONLY 2 columns user_id u001 u002 u003 โœ… READ event click view buy โœ… READ ip_addr SKIPPED browser SKIPPED country SKIPPED โฌ† Reads 40% of data (2 of 5 columns) You pay ONLY for user_id + event columns RESULT: Same query, same answer CSV costs $5.00/TB ยท Parquet costs $0.50โ€“$1.00/TB for this query (60โ€“90% savings)
Columnar formats store each column as a separate block โ€” Athena reads only the columns in your SELECT
Partitioning โ€” The Most Important Optimization Core

Partitioning organises S3 data into folder hierarchies that Athena can use to skip entire sections that don't match your query. Without partitioning, every query scans all files. With partitioning, Athena reads only the folders that match your WHERE clause.

๐Ÿง  Mental Model โ€” Filing Cabinets

Imagine 100 filing cabinets labelled by month. If someone asks for "all invoices from March" โ€” you open only the March cabinet (1 of 100). Without labels, you'd need to open every single drawer in every cabinet. Partitioning = labelling your S3 folders so Athena knows which to open and which to skip.

S3 Partitioning โ€” Folder Structure Enables Partition Pruning
s3://my-data-lake/events/ year=2026/ month=04/ day=01/ โ†’ files day=02/ โ†’ files ... day=30/ โ†’ files โŒ SKIPPED (not in WHERE) month=05/ day=01/ โ†’ files day=02/ โ†’ files ... day=07/ โ†’ files โœ… SCANNED (matches WHERE) year=2025/ โŒ ENTIRE YEAR SKIPPED (not in WHERE clause) 12 months ร— 30 days = 360 folders 0 bytes scanned, $0 cost
Query: WHERE year=2026 AND month=5 โ€” Athena reads ONLY the green folder, skips everything else
Partitioning Best Practices Deep
PracticeWhyExample
Partition by time (date)Most queries filter by date โ€” massive scan reduction/year=2026/month=05/day=07/
Partition by high-cardinality filterIf queries always filter by region or account, partition by it/region=us-east-1/
Don't over-partitionToo many tiny files = overhead. Aim for 128MBโ€“512MB per partitionDon't partition by hour if files are only 1KB
Use Hive-style namingAthena auto-detects Hive partition format (key=value/)s3://bucket/data/year=2026/month=05/
Register partitions in GlueAthena won't see new partitions until Glue knows about themUse Glue Crawler or MSCK REPAIR TABLE
File Size Matters โ€” The Goldilocks Zone Deep
โŒ

Too Small (<10 MB)

  • Thousands of tiny files
  • Excessive S3 LIST + GET overhead
  • Athena spends more time opening files than reading data
  • Common with raw event streams
โœ…

Just Right (128โ€“512 MB)

  • Optimal for Athena's parallel reading
  • Each file is meaningful work for one worker
  • Good balance of parallelism and efficiency
  • Use Glue ETL to consolidate small files
โš ๏ธ

Too Large (>2 GB)

  • Fewer files limits parallelism
  • One failed read = restart large file
  • Still works, but not optimal
  • Acceptable for cold archive data
Compression โ€” Free Performance Boost Core
CompressionSplittable?RatioSpeedBest For
SnappyYes (in Parquet/ORC)Medium (2โ€“4ร—)Very fast decompressDefault for Parquet โ€” best general choice
GZIPNo (raw), Yes (in Parquet)High (5โ€“8ร—)Slower decompressCSV/JSON when max compression needed
ZSTDYes (in Parquet)High (5โ€“7ร—)Fast decompressBest balance of ratio + speed (newer)
LZOYes (standalone)Medium (2โ€“3ร—)Very fastHadoop/EMR ecosystems (legacy)
NoneYes1ร— (uncompressed)No CPU overheadNever for analytics โ€” wastes cost + time
The Optimization Pyramid โ€” Summary Core
๐Ÿ—๏ธ Four Layers of Athena Cost Optimization (Most Impact First)
  • โ‘  Partitioning โ€” eliminate 80โ€“99% of data scanned by filtering folder paths
  • โ‘ก Columnar format (Parquet/ORC) โ€” read only needed columns, skip the rest
  • โ‘ข Compression (Snappy/ZSTD) โ€” reduce bytes within each column block
  • โ‘ฃ Right-sized files (128โ€“512 MB) โ€” balance parallelism and per-file overhead
๐ŸŽฏ Exam Insight
  • "Reduce Athena cost" โ†’ partition data + convert to Parquet/ORC (always the answer)
  • "Athena scans too much data" โ†’ data is not partitioned or is stored in CSV/JSON
  • "Columnar format" = Parquet or ORC โ€” enables column pruning (reads only needed columns)
  • Athena charges $5 per TB scanned with a 10 MB minimum per query
  • Parquet with Snappy is the default best-practice format for Athena
  • MSCK REPAIR TABLE or Glue Crawler adds new partitions to the catalog
Chapter 04 โ€” Key Takeaway

Athena cost = data scanned. Reduce what Athena reads and you reduce both cost and latency. Partition by date (eliminates irrelevant data), use Parquet (reads only needed columns), compress with Snappy (fewer bytes per column), and consolidate files to 128โ€“512 MB each. A well-optimized dataset can be 100ร— cheaper to query than a raw CSV dump.

05
Chapter Five ยท Analytics

Athena vs Redshift vs OpenSearch

AWS has many analytics and query services. Understanding when to use which โ€” and what problem each actually solves โ€” is the key to choosing the right tool. Athena, Redshift, and OpenSearch are the three most commonly compared and confused services.

The Core Distinction Introductory
๐Ÿ”

Athena โ€” Serverless Ad-Hoc

  • Query data in S3 with SQL on-demand
  • No infrastructure to manage
  • Pay per query ($5/TB scanned)
  • Best for: exploration, logs, ad-hoc
  • Latency: seconds to minutes
๐Ÿข

Redshift โ€” Enterprise Warehouse

  • Managed columnar data warehouse
  • Petabyte-scale with sub-second queries
  • Pay per cluster (or Redshift Serverless per RPU-hr)
  • Best for: BI dashboards, complex joins, heavy analytics
  • Latency: sub-second to seconds
๐Ÿ”Ž

OpenSearch โ€” Search & Logs

  • Full-text search + real-time log analytics
  • Managed Elasticsearch / OpenSearch cluster
  • Pay per instance + storage
  • Best for: text search, log dashboards, observability
  • Latency: milliseconds (real-time)
Full Comparison Table Core
DimensionAthenaRedshiftOpenSearch
ModelServerless โ€” no infraProvisioned cluster (or Serverless)Managed cluster
Data locationS3 (query in place)Local SSD + can extend to S3 (Spectrum)Local storage (indexes)
Query languageStandard SQL (Trino)PostgreSQL-compatible SQLQuery DSL + SQL plugin
Latency2โ€“30 secondsSub-second to secondsMilliseconds
Concurrency~20 concurrent queriesHundreds (with WLM)Thousands (search)
Pricing$5/TB scanned~$0.25/hr per node (or RPU-hr)~$0.10/hr per instance
Best forAd-hoc, exploration, infrequentComplex BI, heavy repeated queriesFull-text search, real-time logs
Idle cost$0 (pay per query)Running cluster = cost 24/7Running cluster = cost 24/7
SchemaSchema-on-read (Glue)Schema-on-write (tables)Index mappings
Data formatAny (CSV, JSON, Parquet, ORC)Proprietary columnarJSON documents (inverted index)
Decision Guide โ€” When to Use Each Core
If Your Scenario Isโ€ฆUseWhy
Occasional SQL queries on S3 dataAthenaZero idle cost, no setup needed
Data exploration by analystsAthenaSchema-on-read, flexible, low barrier
Heavy BI dashboard (Tableau/PowerBI) with sub-second responseRedshiftOptimised for repeated complex queries at scale
Thousands of concurrent dashboard usersRedshiftBetter concurrency management (WLM)
Full-text search across logs or documentsOpenSearchInverted index = millisecond text search
Real-time observability (Kibana dashboards)OpenSearchNear-real-time indexing + visualisations
CloudTrail / VPC Flow Log analysis (occasional)AthenaFiles already in S3, pay only when querying
Petabyte star-schema analytics with complex joinsRedshiftDistribution keys + sort keys = optimal join performance
Query budget is near zero but query volume is lowAthenaLiteral $0 idle cost; pennies per query
They Work Together โ€” Not Competing Core

These services are complementary, not alternatives. A production analytics stack often uses multiple:

๐Ÿ”—

Athena + Redshift Spectrum

  • Hot data in Redshift local storage (fast, repeated queries)
  • Cold/historical data in S3 (queried via Redshift Spectrum or Athena)
  • Same Glue Catalog shared between both
  • Move data from hot โ†’ cold as it ages
๐Ÿ”—

Athena + OpenSearch

  • OpenSearch for real-time log dashboards (last 7 days)
  • Athena for historical deep-dives (query months of logs in S3)
  • Kinesis Firehose ships logs to both: OpenSearch (real-time) + S3 (archive)
  • Security teams use Athena for forensic investigations on historical data
Athena vs RDS/Aurora โ€” A Common Confusion Core
AspectAthenaRDS / Aurora
PurposeAnalytics โ€” read-heavy, scan-heavyTransactional โ€” read/write OLTP
OperationsSELECT only (read-only analytics)INSERT, UPDATE, DELETE, SELECT
LatencySecondsMilliseconds
Data sourceS3 files (external)Local storage (managed disks)
Use caseReports, auditing, explorationApplication backend database
AnalogyLibrary catalogue searchCash register (record transactions)
โš ๏ธ Common Exam Trap

Athena is not a replacement for RDS. If the question describes an application that needs INSERT/UPDATE/DELETE with millisecond latency โ€” that is NOT Athena. Athena is for analytics. RDS/Aurora/DynamoDB are for transactional workloads. This is the most common confusion in certification exams.

๐ŸŽฏ Exam Insight
  • "Serverless SQL on S3" โ†’ Athena (not Redshift)
  • "Sub-second BI dashboard with complex joins" โ†’ Redshift
  • "Full-text search across log data in real-time" โ†’ OpenSearch
  • "Zero idle cost analytics" โ†’ Athena (Redshift has running cluster cost)
  • "OLTP workload with transactions" โ†’ NOT Athena โ†’ RDS/Aurora/DynamoDB
  • "Redshift Spectrum" = Redshift querying S3 data (uses same Glue Catalog as Athena)
  • "Athena or Redshift for occasional queries?" โ†’ Athena (lower cost when queries are infrequent)
  • "Athena or Redshift for thousands of concurrent BI users?" โ†’ Redshift (better concurrency)
Chapter 05 โ€” Key Takeaway

Athena = serverless ad-hoc SQL on S3 (pay-per-query, zero idle cost). Redshift = enterprise data warehouse for heavy, repeated, complex analytics (sub-second, high concurrency). OpenSearch = real-time search and log observability (millisecond full-text search). They complement each other โ€” use Athena for exploration and cold data, Redshift for hot BI, and OpenSearch for real-time logs.

06
Chapter Six ยท Analytics

Athena Architecture Patterns

Athena typically doesn't exist in isolation โ€” it's part of a larger analytics architecture. This chapter covers the five most common production patterns where Athena provides the query layer: data lake analytics, CloudTrail auditing, serverless reporting, cost analysis, and security investigation workflows.

Pattern 1 โ€” Data Lake Analytics Core

The canonical Athena use case. Raw data lands in S3, gets processed into Parquet with partitioning, then queried on-demand via Athena. This is the modern replacement for a traditional data warehouse at a fraction of the cost.

Production Data Lake โ€” S3 + Glue + Athena + QuickSight
DATA SOURCES Kinesis Firehose DMS (databases) API exports IoT / events App logs RAW ZONE CSV / JSON as-received GLUE ETL โ†’ Parquet + partition CURATED Parquet partitioned ATHENA SQL queries $5/TB QUICKSIGHT Dashboards Reports SPICE caching AWS Glue Data Catalog โ€” shared metadata layer
Sources โ†’ Raw S3 โ†’ Glue ETL (Parquet + partition) โ†’ Curated S3 โ†’ Athena queries โ†’ QuickSight dashboards
Pattern 2 โ€” CloudTrail Log Analysis Core

CloudTrail logs every AWS API call in your account and delivers them to S3 automatically. Athena has a built-in CloudTrail table template โ€” you can query who did what, when, from where, in seconds.

๐Ÿ”

Common CloudTrail Queries

  • Who deleted that S3 bucket / EC2 instance?
  • All API calls from a specific IP address
  • All root account usage in the last 30 days
  • All IAM policy changes in a time range
  • Failed authentication attempts
โšก

How It Works

  • CloudTrail delivers JSON logs to S3 (automatic)
  • Create Athena table pointing at CloudTrail S3 path
  • Partition by region + year + month + day
  • Query with SQL: WHERE eventname = 'DeleteBucket'
  • Cost: pennies per query (small scan with partitions)
Pattern 3 โ€” Serverless Reporting System Core

Build a reporting system with zero running infrastructure. An API Gateway triggers Lambda, which runs an Athena query, returns results to the client or stores them for a dashboard.

Serverless Reporting โ€” API Gateway + Lambda + Athena
CLIENT GET /report API GATEWAY REST / HTTP LAMBDA Start query ATHENA SQL query S3 Data + Results Fully serverless โ€” $0 idle cost ยท Lambda polls Athena for completion, returns results to client
Pattern 4 โ€” AWS Cost & Usage Report Analysis Core
๐Ÿ’ฐ

What Is CUR?

  • AWS Cost & Usage Report โ€” most detailed billing data
  • Delivered as Parquet files to S3 (hourly/daily)
  • Line-item detail: service, usage, tags, accounts
  • Can be queried directly by Athena (no ETL needed!)
๐Ÿ“Š

Example Queries

  • Total spend by team/project (using cost allocation tags)
  • Top 10 most expensive resources this month
  • Daily spending trend for EC2 vs Lambda vs S3
  • Unused EBS volumes costing money
  • Reserved Instance utilization tracking
Pattern 5 โ€” Security Investigation Workflow Deep

During a security incident, you need to query massive volumes of logs across accounts in minutes. Athena enables forensic investigation across CloudTrail, VPC Flow Logs, GuardDuty findings, and DNS logs โ€” all stored in S3.

Security Investigation โ€” Multi-Source Forensic Analysis
LOG SOURCES (S3) ๐Ÿ“ CloudTrail ๐Ÿ“ VPC Flow Logs ๐Ÿ“ GuardDuty findings ๐Ÿ“ DNS query logs ๐Ÿ“ WAF logs ๐Ÿ“ Application logs GLUE CATALOG Tables for each log source ATHENA Cross-source SQL JOIN CloudTrail + VPC Flow by IP/time Forensic investigation OUTPUTS ๐Ÿ” Compromised IPs ๐Ÿ‘ค Affected users ๐Ÿ“… Timeline of events ๐Ÿ“‹ Evidence for IR โ†’ Automated SNS alerts โ†’ Security Hub findings
Multi-log forensics: join CloudTrail + VPC Flow + GuardDuty findings using Athena SQL โ€” no infrastructure to set up during an incident
Pattern Summary Table Core
PatternData SourcesKey BenefitCost Profile
Data Lake AnalyticsAny โ€” ingested via Kinesis/DMS/direct S3Replaces traditional data warehouse$5/TB per query, $0 idle
CloudTrail AuditCloudTrail JSON logs in S3Who did what, when โ€” instant forensicsPennies per query (small partitions)
Serverless ReportingAny S3 data + API Gateway + LambdaZero-infra reporting API, $0 idleLambda + Athena per invocation
CUR Cost AnalysisAWS Cost & Usage Report (Parquet in S3)Granular billing insight, tag-based allocationsNear-free (CUR is already Parquet)
Security InvestigationCloudTrail + VPC Flow + GuardDuty + DNSMulti-source forensics, cross-log JOINsVariable (depends on time range scanned)
๐ŸŽฏ Exam Insight
  • "Analyse CloudTrail logs with SQL" โ†’ Athena + CloudTrail table in Glue
  • "Query AWS Cost & Usage Report" โ†’ Athena (CUR is Parquet in S3 โ€” perfect match)
  • "Serverless reporting API, zero idle cost" โ†’ API Gateway + Lambda + Athena
  • "Cross-account security forensics" โ†’ Athena with org-wide CloudTrail S3 bucket
  • "Data lake query layer" โ†’ Athena + Glue Catalog + S3
Chapter 06 โ€” Key Takeaway

Athena's power comes from pairing it with the right architecture. Data Lake (S3 + Glue + Athena + QuickSight) replaces expensive data warehouses. CloudTrail + Athena gives instant security auditing. Lambda + Athena builds zero-cost serverless reporting APIs. CUR + Athena enables granular cost governance. In every case, the pattern is the same: data in S3, metadata in Glue, query in Athena.

07
Chapter Seven ยท Analytics

Security & Governance

Athena security is not one thing โ€” it is the intersection of three permission systems: IAM (who can run queries), S3 (who can read/write data), and Glue Catalog (who can see table metadata). Understanding these three layers is critical for secure production use.

The Three Security Layers Core
๐Ÿ”‘

Layer 1 โ€” IAM Permissions

  • athena:StartQueryExecution
  • athena:GetQueryResults
  • athena:GetWorkGroup
  • Controls WHO can run queries and in which workgroup
  • Use IAM policies to restrict by workgroup
๐Ÿ“

Layer 2 โ€” S3 Access

  • s3:GetObject on data bucket
  • s3:PutObject on results bucket
  • s3:GetBucketLocation
  • Controls WHAT data Athena can read
  • Use bucket policies + IAM to restrict
๐Ÿ—‚๏ธ

Layer 3 โ€” Glue Catalog

  • glue:GetDatabase, glue:GetTable
  • glue:GetPartitions
  • Controls WHICH tables/databases are visible
  • Use resource-based Glue policies
  • Lake Formation for fine-grained column access
Security Architecture Diagram Core
Athena Security โ€” Triple-Lock Access Model
USER IAM Role or IAM User ๐Ÿ”’ IAM CHECK athena:* permissions Can you query at all? ๐Ÿ”’ GLUE CHECK glue:GetTable, etc. Can you see this table? ๐Ÿ”’ S3 CHECK s3:GetObject on path Can you read the files? โœ… QUERY EXECUTES ALL THREE locks must pass โ€” any deny = query fails + Lake Formation for column/row-level access control
A query must pass IAM (can you run queries?), Glue (can you see this table?), and S3 (can you read these files?) โ€” all three must allow access
Encryption Core
WhatEncryption OptionsConfiguration
Source data in S3SSE-S3, SSE-KMS, CSE-KMSSet on S3 bucket default encryption or per-object
Query results in S3SSE-S3, SSE-KMSConfigured per workgroup (enforce via workgroup settings)
Data in transitTLS 1.2 (automatic)JDBC/ODBC connections are encrypted by default
Workgroups โ€” Team Isolation & Governance Core
๐Ÿ‘ฅ

What Workgroups Control

  • Separate query result locations per team
  • Per-query data scan limits (cost cap)
  • Enforce encryption on results
  • CloudWatch metrics per workgroup
  • IAM policies scoped to workgroup
๐Ÿ—๏ธ

Best Practice: Multi-Team Setup

  • One workgroup per team (marketing, engineering, security)
  • Set scan limit: e.g. 100 GB max per query
  • Each team's results โ†’ separate S3 prefix
  • Audit trail: CloudTrail logs query execution per workgroup
  • Tag workgroups for cost allocation tracking
Lake Formation โ€” Fine-Grained Access Deep

For enterprise environments that need column-level or row-level security, AWS Lake Formation adds another governance layer on top of Glue Catalog:

FeatureWithout Lake FormationWith Lake Formation
Table accessIAM + Glue policies (coarse)Fine-grained: per-column, per-row, per-user
Column maskingNot possible nativelyHide sensitive columns from specific users
Row filteringNot possible nativelyUsers see only rows matching their filter
Cross-account sharingComplex IAM + S3 policiesOne-click table sharing across accounts
AuditCloudTrail onlyCentralised access audit log in Lake Formation
๐Ÿ’ก When to Use Lake Formation

Use Lake Formation when you need column-level security (e.g. hide SSN column from certain teams), row-level filtering (e.g. each team sees only their region's data), or cross-account data sharing without complex IAM. For simple setups with 1โ€“2 teams, standard IAM + Glue + S3 policies are sufficient.

Query Auditing & Monitoring Core
๐Ÿ“‹

CloudTrail

  • Logs every Athena API call
  • StartQueryExecution events
  • Who ran what query, when
  • IP source, IAM identity
๐Ÿ“Š

CloudWatch Metrics

  • Data scanned per workgroup
  • Query execution time
  • Number of queries (throttled)
  • Set alarms on cost thresholds
๐Ÿ“

Query History

  • 45-day history in Athena console
  • Query text, status, data scanned
  • Duration, result location
  • Useful for cost attribution
๐ŸŽฏ Exam Insight
  • Athena security = IAM + S3 + Glue โ€” all three must allow access for a query to succeed
  • Workgroups = team isolation + enforce encryption + cost control (per-query scan limit)
  • Lake Formation = column-level and row-level access control on Glue tables
  • Query result encryption can be enforced at the workgroup level (SSE-S3 or SSE-KMS)
  • "Restrict analyst to seeing only their team's data" โ†’ Lake Formation row-level security
  • "Limit Athena spend per team" โ†’ workgroup with per-query scan limit
Chapter 07 โ€” Key Takeaway

Athena security is the union of three systems: IAM (who can query), S3 (what data is readable), and Glue Catalog (which tables are visible). All three must permit access. Use workgroups to isolate teams, enforce encryption, and set cost limits. For enterprise column/row-level security, add AWS Lake Formation. Every query is auditable via CloudTrail + CloudWatch.

08
Chapter Eight ยท Analytics

Cost Optimization & Best Practices

Athena's pricing is deceptively simple: $5 per TB of data scanned. But in practice, the difference between a well-architected data lake and a naive one is 100ร— cost difference for the same query. This chapter consolidates every optimization into a production-ready checklist.

Pricing Model โ€” What You Actually Pay Core
ComponentCostNotes
Data scanned per query$5.00 per TBRounded up to 10 MB minimum per query
Cancelled queriesCharged for data scanned before cancellationCancel early to limit cost
DDL statements (CREATE TABLE, etc.)FreeNo data scanned
Failed queriesFreeSyntax errors, permission errors = $0
Glue Data CatalogFree first 1M objects/monthThen $1 per 100K objects
S3 storage~$0.023/GB/month (Standard)Separate from Athena โ€” you'd store it anyway
S3 GET requests$0.0004 per 1000 requestsAthena generates many GETs โ€” usually negligible
๐Ÿ’ก Real-World Cost Example

Scenario: 100 analysts, each running 5 queries/day, scanning 500 MB per query (well-optimized Parquet + partitioning).
Monthly cost: 100 ร— 5 ร— 30 ร— 0.5 GB = 7,500 GB = 7.5 TB ร— $5 = $37.50/month.
Compare to a Redshift cluster doing the same work: $2,000โ€“$8,000/month.

The Complete Cost Optimization Checklist Core
#OptimizationImpactEffortHow
1Partition by date80โ€“99% reductionLowHive-style folders: /year=YYYY/month=MM/day=DD/
2Use Parquet or ORC60โ€“90% reductionMediumGlue ETL job to convert raw โ†’ Parquet
3Compress with Snappy/ZSTD20โ€“40% furtherFree (Parquet default)Parquet uses Snappy by default โ€” just use Parquet
4Consolidate small files10โ€“50% improvementMediumGlue ETL or S3 BatchOps to merge into 128โ€“512 MB
5SELECT only needed columnsProportionalFreeNever SELECT * โ€” list specific columns
6Use workgroup scan limitsCost safety netLowSet max per-query scan (e.g. 100 GB)
7CTAS for repeated queriesAvoids repeated scansLowCREATE TABLE AS to materialize results
8Use Athena query result reuse$0 for cache hitsFreeEnable "Reuse query results" in workgroup (7-day cache)
CTAS โ€” Materialize Expensive Queries Deep

CREATE TABLE AS SELECT (CTAS) is a powerful technique: run an expensive query once, store the results as a new Parquet table, then all subsequent queries read the smaller materialized result instead of rescanning raw data.

โŒ

Without CTAS

  • Dashboard refreshes 10ร—/day
  • Each refresh scans 200 GB of raw data
  • Daily cost: 10 ร— 200 GB ร— $5/TB = $10/day
  • Monthly: ~$300 just for one dashboard
โœ…

With CTAS

  • CTAS runs once/day โ†’ 200 GB scan = $1/day
  • Materialized result: 2 GB Parquet table
  • Dashboard reads 2 GB ร— 10 = 20 GB = $0.10/day
  • Monthly: ~$33 (90% savings)
Production Best Practices Summary Core
โœ…

Data Engineering Practices

  • Store analytics data as Parquet with Snappy
  • Partition by the most common filter (usually date)
  • File size target: 128โ€“512 MB per file
  • Use Glue Crawlers to auto-update partitions
  • Separate raw zone (JSON/CSV) from curated zone (Parquet)
  • Use lifecycle rules to move old data to Glacier
๐Ÿ—๏ธ

Operational Governance

  • One workgroup per team โ€” separate cost tracking
  • Set per-query scan limits to prevent runaway queries
  • Encrypt results (enforce at workgroup level)
  • Enable query result reuse (7-day cache)
  • Monitor with CloudWatch: data scanned, query time
  • Tag workgroups for FinOps cost allocation
๐Ÿ”

Query Writing Practices

  • Never SELECT * โ€” list only needed columns
  • Always filter by partition keys in WHERE clause
  • Use LIMIT during exploration
  • Materialise expensive joins with CTAS
  • Use views for common access patterns
  • Check "Data scanned" after each query โ€” optimise if high
โš ๏ธ

Anti-Patterns to Avoid

  • โŒ SELECT * on large tables
  • โŒ Querying unpartitioned data at scale
  • โŒ Storing analytics data as CSV/JSON long-term
  • โŒ Millions of tiny files (< 1 MB each)
  • โŒ Missing partition keys in WHERE clause
  • โŒ Using Athena as a transactional database
When NOT to Use Athena Core
ScenarioWhy Not AthenaBetter Choice
Sub-second interactive dashboards (thousands of users)2โ€“30s query latency, limited concurrencyRedshift + SPICE (QuickSight) or ElastiCache
Transactional workloads (INSERT/UPDATE/DELETE)Athena is read-only analyticsRDS, Aurora, DynamoDB
Real-time streaming analyticsAthena is batch, not streamingKinesis Analytics, Flink, OpenSearch
Very high query volume (100+ concurrent queries)Default concurrency limit ~20โ€“25 DML queriesRedshift Serverless (higher concurrency)
Application backend (millisecond reads)Not a database โ€” seconds of latencyDynamoDB, ElastiCache, Aurora
๐ŸŽฏ Exam Insight
  • "Reduce Athena cost" โ†’ partition + Parquet + compression (ALWAYS the answer)
  • "Pay per query, $5/TB" โ†’ Athena pricing (vs Redshift per-hour cluster cost)
  • CTAS = CREATE TABLE AS โ€” materializes results, avoids repeated scans
  • "Millions of small files slow down Athena" โ†’ consolidate to 128โ€“512 MB files
  • "Limit Athena spend" โ†’ workgroup per-query byte scan limit
  • "Cache Athena results" โ†’ query result reuse (workgroup setting)
  • Athena minimum charge: 10 MB per query even if data scanned is less
Full Page Summary โ€” Amazon Athena Core
๐Ÿ“‹ Amazon Athena โ€” Complete Recap
  • What: Serverless SQL query engine that reads data directly from S3 โ€” no servers, no databases, no loading.
  • How: Three-layer architecture โ€” S3 (data) + Glue Catalog (metadata/schema) + Athena Engine (distributed SQL execution via Trino).
  • Cost: $5 per TB of data scanned. Optimize with partitioning (80โ€“99% reduction), Parquet (60โ€“90%), and compression.
  • Best for: Ad-hoc analytics, log analysis, CloudTrail auditing, CUR billing queries, data lake exploration, serverless reporting.
  • Not for: OLTP workloads, sub-second latency, high-concurrency dashboards, real-time streaming.
  • vs Redshift: Athena = serverless, $0 idle, ad-hoc. Redshift = managed warehouse, sub-second, high concurrency.
  • vs OpenSearch: Athena = SQL analytics on S3. OpenSearch = real-time search/log observability.
  • Security: IAM + S3 + Glue โ€” triple-lock model. Workgroups for team isolation. Lake Formation for column/row-level access.
  • Key integration: Glue Data Catalog is the bridge between raw S3 files and queryable SQL tables.
๐Ÿ‘‰ Final Takeaway โ€” Athena in One Thought

Amazon Athena turns every S3 bucket into a queryable database โ€” without provisioning, loading, or managing anything. Store data cheaply in S3, describe it in Glue, query it with SQL, pay only for what you scan. Optimize with Parquet and partitioning, and your analytics cost drops to near-zero. It is the query engine that makes the AWS data lake practical, accessible, and affordable.