Amazon Athena โ
Serverless Analytics Directly on S3
Query petabytes of data stored in S3 using standard SQL โ without provisioning a single server, loading data into a database, or managing infrastructure. Athena is the engine of the modern AWS data lake.
What is Amazon Athena?
Amazon Athena is a serverless, interactive query service that lets you analyze data stored in Amazon S3 using standard SQL โ without loading it into a database, provisioning servers, or managing any infrastructure. You point Athena at files in S3, write a SQL query, and results appear in seconds.
Traditional analytics requires you to move data into a database before you can query it. Athena flips this model โ the query goes to the data, not the data to the query engine.
Imagine a giant digital library with millions of books stored on shelves (S3). In a traditional database, you'd need to move all the books into a reading room, catalogue them, and maintain the room before anyone can search. Athena is a librarian who reads the books exactly where they are on the shelves โ you just describe what you're looking for, and the answer comes back instantly. No moving. No room to maintain. No staff to hire.
- Provision a database cluster (RDS, Redshift)
- Design and create tables and schemas
- ETL pipeline loads data into the database
- Pay for the cluster 24/7, even when idle
- Scale the cluster as data grows
- Manage backups, patches, engine versions
- Time-to-first-query: days or weeks
- Data stays in S3 exactly where it already is
- Define a schema in Glue Catalog (metadata only)
- Write a SQL query โ Athena executes it directly
- Pay only for what you query (per GB scanned)
- No infrastructure to scale โ Athena scales itself
- Nothing to patch, manage, or maintain
- Time-to-first-query: minutes
Log Files
- CloudTrail audit logs
- ALB / CloudFront access logs
- VPC Flow Logs
- Application logs exported to S3
Business Data
- CSV exports from databases
- JSON event streams
- AWS Cost & Usage Reports (CUR)
- Parquet data from ETL pipelines
Analytics Datasets
- Clickstream and user event data
- IoT sensor readings
- Financial transaction records
- Machine learning feature stores
"Serverless" is overused โ here is exactly what it means for Athena:
| Concern | Traditional DB | Amazon Athena |
|---|---|---|
| Servers | You provision EC2 or DB instances | AWS owns and operates all compute |
| Scaling | You resize instances or add nodes | Athena scales automatically per query |
| Idle cost | You pay even when no queries run | Zero cost when not querying |
| Patching | You apply engine/OS updates | AWS maintains all infrastructure |
| Capacity planning | Required โ guess future load | Not required โ on-demand |
| Startup time | Minutes to hours to provision | Instant โ always ready |
Athena is not a standalone service โ it is the query layer of the modern AWS data lake stack:
Storage Layer โ Amazon S3
S3 holds the actual data files. It is the source of truth โ cheap, durable, infinitely scalable. Athena never moves this data; it reads it in place.
Metadata Layer โ AWS Glue Catalog
Glue Catalog holds the schema โ what tables exist, what columns they have, where the S3 partitions are. Athena reads this catalog before scanning any data.
Query Layer โ Amazon Athena
Athena is the SQL engine. It reads schemas from Glue, figures out which S3 files to read, distributes the scan across workers, and returns results.
Visualisation Layer โ QuickSight / BI Tools
Tools like Amazon QuickSight, Tableau, or Power BI connect to Athena via JDBC/ODBC to build dashboards and reports on top of S3 data.
Before services like Athena, running analytics on large datasets required โ at minimum โ a running database cluster, an ETL pipeline to load data, and a team to maintain both. Small companies couldn't afford it. Even large companies struggled with idle infrastructure.
Athena changed this permanently. Today, a startup can store billions of events in S3 for pennies per GB per month, and query that data with SQL on demand, paying only for each query. There is no minimum spend, no committed capacity, no infrastructure team required.
Athena is a foundational component of the modern AWS data lake pattern: S3 (storage) + Glue (catalog) + Athena (query) + QuickSight (visualise). This stack replaces what used to require a full data warehouse team and six-figure infrastructure budgets. You'll see this pattern in almost every serious AWS analytics architecture.
- "Query data directly in S3 using SQL" โ Athena
- "Serverless analytics" or "no infrastructure to manage" โ Athena
- "Pay per query" or "pay per GB scanned" โ Athena pricing model
- Athena uses Glue Data Catalog for table schemas โ this is the key integration to remember
- Results are always written to an S3 output location โ they are not stored in Athena itself
Athena is a serverless SQL query engine that reads data directly from S3. No servers to provision, no data to load, no infrastructure to manage. You pay per query, per GB scanned. It is the query layer of the modern AWS data lake โ built on top of S3 for storage and Glue Catalog for schema management.
Why Serverless Analytics Matters
Before serverless analytics, every organisation that wanted to query large datasets had to become an infrastructure company by accident. They spent more time managing databases than getting value from data. Athena changes the economics entirely โ the cost of asking a question is nearly zero, which changes how teams think about data.
Classic analytics stacks require you to solve a set of hard operational problems before you can even run a single query:
Infrastructure Overhead
- Provision database clusters upfront
- Estimate capacity for peak load (often wrong)
- Over-provision because under-provisioning is worse
- Pay for idle capacity every hour of every day
- Maintain OS, engine patches, version upgrades
Operational Friction
- Schema must be defined before data arrives
- ETL pipelines needed to load data into DB
- Pipeline failures = stale or missing data
- Schema changes require expensive migrations
- Scaling down is as hard as scaling up
A small data team's analytics database running 24/7 on a 4-node Redshift cluster costs ~$7,000โ$15,000 per month โ whether anyone is querying it or not. Most queries happen during business hours (8 hours/day, 5 days/week). That means 75% of the cost is paying for idle infrastructure.
Serverless analytics decouples when data is stored from when it is queried. With Athena:
| Traditional Approach | With Athena | Impact |
|---|---|---|
| Pay 24/7 for cluster | Pay per query ($5/TB scanned) | 80โ95% cost reduction for ad-hoc workloads |
| Provision before you know the load | No provisioning needed | Zero capacity risk |
| ETL before first query | Query files directly as-is | Hours โ minutes to first insight |
| One cluster = one team's bottleneck | Unlimited parallel queries | No queue, no contention |
| Schema must match expectations | Schema-on-read; evolve freely | Flexible exploration |
Log Analytics
- Query ALB, CloudFront, or VPC Flow Logs
- Find which IPs hit which endpoints
- Analyse error rate trends over time
- No log pipeline needed โ files already in S3
CloudTrail Security Audit
- Query who called which AWS API and when
- Find all IAM key usage across accounts
- Detect unusual cross-region activity
- Native Athena table for CloudTrail exists
Cost Reporting
- Query AWS Cost & Usage Reports (CUR)
- Break down spending by team, tag, service
- Find idle resources burning budget
- CUR is JSON/Parquet โ perfect for Athena
Ad-Hoc Reporting
- Business teams query S3 data directly
- One-off analysis without data engineering
- Connect BI tools via JDBC/ODBC
- No change to underlying data storage
Security Investigations
- Athena + GuardDuty findings in S3
- Query network flow logs during incidents
- Correlate events across multiple log sources
- Forensic queries on historical data
Data Lake Analytics
- Explore raw data before building pipelines
- Validate data quality with SQL
- Query across multiple S3 datasets at once
- Power dashboards via QuickSight
- Exploration โ discovering patterns in data you don't yet understand
- Ad-hoc analytics โ irregular, one-off queries by many users
- Occasional reporting โ dashboards that run a few times a day
- Large scans on cold data โ history, logs, archives
- Security auditing โ query CloudTrail and VPC Flow on-demand
- Cost-effective BI โ when query volume is low to medium
- Transactional workloads โ INSERT/UPDATE/DELETE, OLTP โ use RDS
- Sub-second latency queries โ real-time dashboards โ use ElastiCache
- High-concurrency BI (thousands of users) โ use Redshift
- Streaming data โ continuous real-time feeds โ use Kinesis
- Application backends โ Athena is not a transactional DB
Mistake 1: Using Athena Like a Database
Teams try to run real-time application queries through Athena. Athena executes in seconds, not milliseconds. It is an analytics tool, not a database backend. Use RDS/DynamoDB for application queries.
Mistake 2: Querying Unpartitioned Data
Running queries on a flat S3 bucket with millions of small files is catastrophically expensive. Without partitioning, every query scans every file. This is the most common and costly Athena mistake.
Mistake 3: Storing Data as CSV/JSON at Scale
CSV and JSON are row-based formats. A query for two columns out of 200 must read every column. Using Parquet or ORC reduces costs by 60โ80% and query time dramatically. (More in Chapter 4.)
Mistake 4: Expecting Consistent Sub-Second Results
Athena has a query startup overhead of 1โ3 seconds. If your dashboard needs to refresh every second, Athena is not the right tool. For interactive sub-second dashboards, cache results or use ElastiCache.
- "Ad-hoc SQL queries on S3 data with no infrastructure" โ Athena (not Redshift)
- "Analyse CloudTrail logs with SQL" โ Athena (built-in CloudTrail table)
- "OLTP workload, INSERT/UPDATE, millisecond latency" โ NOT Athena โ RDS or DynamoDB
- "Reduce Athena query cost" โ add partitioning + use Parquet/ORC format
- Athena is not suitable for transactional workloads โ this distinction is frequently tested
Serverless analytics eliminates the infrastructure tax on data exploration. Athena's pay-per-query model makes asking questions nearly free โ turning analytics from an expensive IT project into a standard engineering capability. Use it for exploration, log analysis, auditing, and ad-hoc reporting. Never use it for transactional workloads or sub-second latency requirements.
How Athena Works
Athena is not magic โ it is a distributed SQL execution engine (Apache Trino under the hood) that reads files directly from S3 using metadata from the Glue Data Catalog. Understanding these three components โ S3, Glue Catalog, and Athena's query engine โ is the key to using Athena effectively in production.
Every Athena query interacts with three distinct layers. You must understand all three to architect a production data lake.
Layer 1 โ S3 (Data Store)
- Physical files: CSV, JSON, Parquet, ORC, Avro
- Organised in folder hierarchies (used for partitions)
- Data never moves โ Athena reads it in place
- S3 is also where query results are saved
Layer 2 โ Glue Catalog (Metadata)
- Stores table definitions: column names, types, formats
- Tracks S3 partitions (folders mapped to table partitions)
- Acts as a Hive Metastore โ schema-on-read
- Shared across Athena, Glue ETL, EMR, Redshift Spectrum
Layer 3 โ Athena Engine (Query)
- Distributed query engine (Apache Trino / Presto)
- Parses SQL, plans execution, distributes workers
- Workers read S3 files in parallel
- Aggregates results and writes to S3 output
The Glue Catalog is the most important concept to understand because it is the bridge between raw S3 files and queryable SQL tables. Without Glue, Athena would not know what schema a file has, what columns to expose, or where partitions live.
Think of S3 as warehouse shelving and Glue Catalog as the index card system that says: "Shelf B-7 contains boxes of CSV files with columns [user_id, event_type, timestamp]. Box labels are dates." Athena uses the card catalogue to find the right shelf, then opens only the boxes it needs. Without the catalogue, it would have to open every box in the warehouse.
| Glue Catalog Concept | What It Stores | Why It Matters |
|---|---|---|
| Database | A namespace grouping related tables | Organise tables by domain (e.g. logs_db, billing_db) |
| Table | Schema: column names, data types, S3 location, file format | Maps an S3 path to a queryable SQL table |
| Partition | Sub-folder path mapped to partition key values (e.g. year=2026) | Athena skips partitions not matching the WHERE clause |
| Crawler | A Glue job that scans S3 and auto-populates the catalog | Automate schema discovery; update partitions as new data arrives |
| SerDe | Serializer/Deserializer โ how to parse a file format | Athena uses the SerDe to read CSV, JSON, Parquet, ORC correctly |
Traditional databases use schema-on-write โ the schema must be defined before data is loaded, and data must conform to it. Athena uses schema-on-read โ the data exists in S3 in any shape, and you define a schema (via Glue) that describes how to interpret it at query time.
Schema-on-Write (Traditional DB)
- Define table structure first
- ETL transforms data to match schema
- Any schema change = migration
- Data must be "clean" before loading
- Rigid, but fast query execution
Schema-on-Read (Athena / Data Lake)
- Store data first in S3 as-is
- Define schema in Glue Catalog at query time
- Add columns to Glue without changing S3 files
- Raw, messy data can still be queried
- Flexible, but schema quality affects results
Athena does not return results directly in memory like a traditional database client. Every query result is written to an S3 location (your chosen "query result bucket"), then served back to the client. This means:
Results in S3
- Output always goes to a configured S3 path
- Stored as CSV (or JSON for some query types)
- Results available for 45 days in Athena history
- You control access via S3 bucket policies
- Large results: download from S3 directly
Workgroups โ Cost & Access Control
- Workgroups isolate teams: different result buckets
- Set per-query data scan limits (cost control)
- CloudWatch metrics per workgroup
- Enforce encryption on query results
- IAM policies per workgroup
| Engine | Based On | Key Capability | Use |
|---|---|---|---|
| Athena Engine v2 | Presto 0.217 | Standard SQL analytics | Default for older workgroups |
| Athena Engine v3 | Apache Trino | Faster, more SQL functions, DML support | Current default โ use this |
| Athena for Apache Spark | PySpark | Python notebook analytics, ML prep | Data engineering / ML workflows |
Athena Engine v3 adds DML operations (INSERT INTO, CTAS with partitioning), more SQL functions, better performance on complex joins, and support for Apache Iceberg and Hudi table formats โ enabling true ACID transactions on S3 data files.
- Glue Data Catalog is the metadata store Athena uses โ know this integration cold
- Schema-on-read = define schema at query time, not before data lands โ Athena's model
- Query results are always written to S3 โ not stored inside Athena
- Workgroups = isolation + cost control for different teams
- Glue Catalog is shared across Athena, EMR, Glue ETL, and Redshift Spectrum โ one catalog, many tools
- Athena reads data in place from S3 โ data never moves or gets copied into Athena
Athena is a three-layer system: S3 holds the data, Glue Catalog holds the schema, and Athena's distributed engine executes SQL across them. The Glue Catalog is the critical integration โ it tells Athena what tables exist, what columns they have, and where partitions are in S3. Understanding schema-on-read is the key insight: data lives in S3 in any format, and you define how to interpret it at query time via Glue.
Data Formats & Performance Optimization
This is the most operationally important Athena chapter. Athena charges $5 per TB scanned. How your data is stored โ file format, compression, and partitioning โ directly determines both query speed and cost. A poorly structured dataset costs 10โ100ร more to query than a well-structured one.
Athena pricing is not based on query complexity or time. It is based on how many bytes of S3 data your query reads. This means:
Expensive Queries
- SELECT * on unpartitioned CSV = scans EVERYTHING
- Row-based formats (CSV/JSON) = reads ALL columns
- Uncompressed files = maximum bytes scanned
- Millions of tiny files = overhead per file
Cheap Queries
- Partitioned data = skip irrelevant folders
- Columnar formats (Parquet/ORC) = read only needed columns
- Compressed files = fewer bytes to scan
- Right-sized files (128MBโ512MB each)
1 TB of CSV log data. You run SELECT user_id, event FROM logs WHERE date = '2026-05-07'.
Without optimization: Athena scans all 1 TB โ costs $5.00, takes ~45 seconds.
With Parquet + partitioning: Athena reads ~2 GB โ costs $0.01, takes ~3 seconds.
Same query. 500ร cost reduction.
| Format | Type | Compression | Column Pruning | When to Use | Athena Cost Impact |
|---|---|---|---|---|---|
| CSV | Row-based | Optional (GZIP) | โ No โ reads all columns | Quick ingestion, human-readable, small datasets | โ Most expensive |
| JSON | Row-based | Optional (GZIP) | โ No โ reads all columns | API outputs, nested data, small files | โ Most expensive |
| Parquet | Columnar | Built-in (Snappy) | โ Yes โ reads only selected columns | Analytics workloads, BI, large datasets | โ 60โ90% cheaper |
| ORC | Columnar | Built-in (ZLIB/Snappy) | โ Yes โ reads only selected columns | Hive ecosystem, high compression ratio | โ 60โ90% cheaper |
| Avro | Row-based | Built-in (Snappy) | โ No | Schema evolution, Kafka/streaming pipelines | โ ๏ธ Moderate |
Partitioning organises S3 data into folder hierarchies that Athena can use to skip entire sections that don't match your query. Without partitioning, every query scans all files. With partitioning, Athena reads only the folders that match your WHERE clause.
Imagine 100 filing cabinets labelled by month. If someone asks for "all invoices from March" โ you open only the March cabinet (1 of 100). Without labels, you'd need to open every single drawer in every cabinet. Partitioning = labelling your S3 folders so Athena knows which to open and which to skip.
| Practice | Why | Example |
|---|---|---|
| Partition by time (date) | Most queries filter by date โ massive scan reduction | /year=2026/month=05/day=07/ |
| Partition by high-cardinality filter | If queries always filter by region or account, partition by it | /region=us-east-1/ |
| Don't over-partition | Too many tiny files = overhead. Aim for 128MBโ512MB per partition | Don't partition by hour if files are only 1KB |
| Use Hive-style naming | Athena auto-detects Hive partition format (key=value/) | s3://bucket/data/year=2026/month=05/ |
| Register partitions in Glue | Athena won't see new partitions until Glue knows about them | Use Glue Crawler or MSCK REPAIR TABLE |
Too Small (<10 MB)
- Thousands of tiny files
- Excessive S3 LIST + GET overhead
- Athena spends more time opening files than reading data
- Common with raw event streams
Just Right (128โ512 MB)
- Optimal for Athena's parallel reading
- Each file is meaningful work for one worker
- Good balance of parallelism and efficiency
- Use Glue ETL to consolidate small files
Too Large (>2 GB)
- Fewer files limits parallelism
- One failed read = restart large file
- Still works, but not optimal
- Acceptable for cold archive data
| Compression | Splittable? | Ratio | Speed | Best For |
|---|---|---|---|---|
| Snappy | Yes (in Parquet/ORC) | Medium (2โ4ร) | Very fast decompress | Default for Parquet โ best general choice |
| GZIP | No (raw), Yes (in Parquet) | High (5โ8ร) | Slower decompress | CSV/JSON when max compression needed |
| ZSTD | Yes (in Parquet) | High (5โ7ร) | Fast decompress | Best balance of ratio + speed (newer) |
| LZO | Yes (standalone) | Medium (2โ3ร) | Very fast | Hadoop/EMR ecosystems (legacy) |
| None | Yes | 1ร (uncompressed) | No CPU overhead | Never for analytics โ wastes cost + time |
- โ Partitioning โ eliminate 80โ99% of data scanned by filtering folder paths
- โก Columnar format (Parquet/ORC) โ read only needed columns, skip the rest
- โข Compression (Snappy/ZSTD) โ reduce bytes within each column block
- โฃ Right-sized files (128โ512 MB) โ balance parallelism and per-file overhead
- "Reduce Athena cost" โ partition data + convert to Parquet/ORC (always the answer)
- "Athena scans too much data" โ data is not partitioned or is stored in CSV/JSON
- "Columnar format" = Parquet or ORC โ enables column pruning (reads only needed columns)
- Athena charges $5 per TB scanned with a 10 MB minimum per query
- Parquet with Snappy is the default best-practice format for Athena
- MSCK REPAIR TABLE or Glue Crawler adds new partitions to the catalog
Athena cost = data scanned. Reduce what Athena reads and you reduce both cost and latency. Partition by date (eliminates irrelevant data), use Parquet (reads only needed columns), compress with Snappy (fewer bytes per column), and consolidate files to 128โ512 MB each. A well-optimized dataset can be 100ร cheaper to query than a raw CSV dump.
Athena vs Redshift vs OpenSearch
AWS has many analytics and query services. Understanding when to use which โ and what problem each actually solves โ is the key to choosing the right tool. Athena, Redshift, and OpenSearch are the three most commonly compared and confused services.
Athena โ Serverless Ad-Hoc
- Query data in S3 with SQL on-demand
- No infrastructure to manage
- Pay per query ($5/TB scanned)
- Best for: exploration, logs, ad-hoc
- Latency: seconds to minutes
Redshift โ Enterprise Warehouse
- Managed columnar data warehouse
- Petabyte-scale with sub-second queries
- Pay per cluster (or Redshift Serverless per RPU-hr)
- Best for: BI dashboards, complex joins, heavy analytics
- Latency: sub-second to seconds
OpenSearch โ Search & Logs
- Full-text search + real-time log analytics
- Managed Elasticsearch / OpenSearch cluster
- Pay per instance + storage
- Best for: text search, log dashboards, observability
- Latency: milliseconds (real-time)
| Dimension | Athena | Redshift | OpenSearch |
|---|---|---|---|
| Model | Serverless โ no infra | Provisioned cluster (or Serverless) | Managed cluster |
| Data location | S3 (query in place) | Local SSD + can extend to S3 (Spectrum) | Local storage (indexes) |
| Query language | Standard SQL (Trino) | PostgreSQL-compatible SQL | Query DSL + SQL plugin |
| Latency | 2โ30 seconds | Sub-second to seconds | Milliseconds |
| Concurrency | ~20 concurrent queries | Hundreds (with WLM) | Thousands (search) |
| Pricing | $5/TB scanned | ~$0.25/hr per node (or RPU-hr) | ~$0.10/hr per instance |
| Best for | Ad-hoc, exploration, infrequent | Complex BI, heavy repeated queries | Full-text search, real-time logs |
| Idle cost | $0 (pay per query) | Running cluster = cost 24/7 | Running cluster = cost 24/7 |
| Schema | Schema-on-read (Glue) | Schema-on-write (tables) | Index mappings |
| Data format | Any (CSV, JSON, Parquet, ORC) | Proprietary columnar | JSON documents (inverted index) |
| If Your Scenario Isโฆ | Use | Why |
|---|---|---|
| Occasional SQL queries on S3 data | Athena | Zero idle cost, no setup needed |
| Data exploration by analysts | Athena | Schema-on-read, flexible, low barrier |
| Heavy BI dashboard (Tableau/PowerBI) with sub-second response | Redshift | Optimised for repeated complex queries at scale |
| Thousands of concurrent dashboard users | Redshift | Better concurrency management (WLM) |
| Full-text search across logs or documents | OpenSearch | Inverted index = millisecond text search |
| Real-time observability (Kibana dashboards) | OpenSearch | Near-real-time indexing + visualisations |
| CloudTrail / VPC Flow Log analysis (occasional) | Athena | Files already in S3, pay only when querying |
| Petabyte star-schema analytics with complex joins | Redshift | Distribution keys + sort keys = optimal join performance |
| Query budget is near zero but query volume is low | Athena | Literal $0 idle cost; pennies per query |
These services are complementary, not alternatives. A production analytics stack often uses multiple:
Athena + Redshift Spectrum
- Hot data in Redshift local storage (fast, repeated queries)
- Cold/historical data in S3 (queried via Redshift Spectrum or Athena)
- Same Glue Catalog shared between both
- Move data from hot โ cold as it ages
Athena + OpenSearch
- OpenSearch for real-time log dashboards (last 7 days)
- Athena for historical deep-dives (query months of logs in S3)
- Kinesis Firehose ships logs to both: OpenSearch (real-time) + S3 (archive)
- Security teams use Athena for forensic investigations on historical data
| Aspect | Athena | RDS / Aurora |
|---|---|---|
| Purpose | Analytics โ read-heavy, scan-heavy | Transactional โ read/write OLTP |
| Operations | SELECT only (read-only analytics) | INSERT, UPDATE, DELETE, SELECT |
| Latency | Seconds | Milliseconds |
| Data source | S3 files (external) | Local storage (managed disks) |
| Use case | Reports, auditing, exploration | Application backend database |
| Analogy | Library catalogue search | Cash register (record transactions) |
Athena is not a replacement for RDS. If the question describes an application that needs INSERT/UPDATE/DELETE with millisecond latency โ that is NOT Athena. Athena is for analytics. RDS/Aurora/DynamoDB are for transactional workloads. This is the most common confusion in certification exams.
- "Serverless SQL on S3" โ Athena (not Redshift)
- "Sub-second BI dashboard with complex joins" โ Redshift
- "Full-text search across log data in real-time" โ OpenSearch
- "Zero idle cost analytics" โ Athena (Redshift has running cluster cost)
- "OLTP workload with transactions" โ NOT Athena โ RDS/Aurora/DynamoDB
- "Redshift Spectrum" = Redshift querying S3 data (uses same Glue Catalog as Athena)
- "Athena or Redshift for occasional queries?" โ Athena (lower cost when queries are infrequent)
- "Athena or Redshift for thousands of concurrent BI users?" โ Redshift (better concurrency)
Athena = serverless ad-hoc SQL on S3 (pay-per-query, zero idle cost). Redshift = enterprise data warehouse for heavy, repeated, complex analytics (sub-second, high concurrency). OpenSearch = real-time search and log observability (millisecond full-text search). They complement each other โ use Athena for exploration and cold data, Redshift for hot BI, and OpenSearch for real-time logs.
Athena Architecture Patterns
Athena typically doesn't exist in isolation โ it's part of a larger analytics architecture. This chapter covers the five most common production patterns where Athena provides the query layer: data lake analytics, CloudTrail auditing, serverless reporting, cost analysis, and security investigation workflows.
The canonical Athena use case. Raw data lands in S3, gets processed into Parquet with partitioning, then queried on-demand via Athena. This is the modern replacement for a traditional data warehouse at a fraction of the cost.
CloudTrail logs every AWS API call in your account and delivers them to S3 automatically. Athena has a built-in CloudTrail table template โ you can query who did what, when, from where, in seconds.
Common CloudTrail Queries
- Who deleted that S3 bucket / EC2 instance?
- All API calls from a specific IP address
- All root account usage in the last 30 days
- All IAM policy changes in a time range
- Failed authentication attempts
How It Works
- CloudTrail delivers JSON logs to S3 (automatic)
- Create Athena table pointing at CloudTrail S3 path
- Partition by region + year + month + day
- Query with SQL:
WHERE eventname = 'DeleteBucket' - Cost: pennies per query (small scan with partitions)
Build a reporting system with zero running infrastructure. An API Gateway triggers Lambda, which runs an Athena query, returns results to the client or stores them for a dashboard.
What Is CUR?
- AWS Cost & Usage Report โ most detailed billing data
- Delivered as Parquet files to S3 (hourly/daily)
- Line-item detail: service, usage, tags, accounts
- Can be queried directly by Athena (no ETL needed!)
Example Queries
- Total spend by team/project (using cost allocation tags)
- Top 10 most expensive resources this month
- Daily spending trend for EC2 vs Lambda vs S3
- Unused EBS volumes costing money
- Reserved Instance utilization tracking
During a security incident, you need to query massive volumes of logs across accounts in minutes. Athena enables forensic investigation across CloudTrail, VPC Flow Logs, GuardDuty findings, and DNS logs โ all stored in S3.
| Pattern | Data Sources | Key Benefit | Cost Profile |
|---|---|---|---|
| Data Lake Analytics | Any โ ingested via Kinesis/DMS/direct S3 | Replaces traditional data warehouse | $5/TB per query, $0 idle |
| CloudTrail Audit | CloudTrail JSON logs in S3 | Who did what, when โ instant forensics | Pennies per query (small partitions) |
| Serverless Reporting | Any S3 data + API Gateway + Lambda | Zero-infra reporting API, $0 idle | Lambda + Athena per invocation |
| CUR Cost Analysis | AWS Cost & Usage Report (Parquet in S3) | Granular billing insight, tag-based allocations | Near-free (CUR is already Parquet) |
| Security Investigation | CloudTrail + VPC Flow + GuardDuty + DNS | Multi-source forensics, cross-log JOINs | Variable (depends on time range scanned) |
- "Analyse CloudTrail logs with SQL" โ Athena + CloudTrail table in Glue
- "Query AWS Cost & Usage Report" โ Athena (CUR is Parquet in S3 โ perfect match)
- "Serverless reporting API, zero idle cost" โ API Gateway + Lambda + Athena
- "Cross-account security forensics" โ Athena with org-wide CloudTrail S3 bucket
- "Data lake query layer" โ Athena + Glue Catalog + S3
Athena's power comes from pairing it with the right architecture. Data Lake (S3 + Glue + Athena + QuickSight) replaces expensive data warehouses. CloudTrail + Athena gives instant security auditing. Lambda + Athena builds zero-cost serverless reporting APIs. CUR + Athena enables granular cost governance. In every case, the pattern is the same: data in S3, metadata in Glue, query in Athena.
Security & Governance
Athena security is not one thing โ it is the intersection of three permission systems: IAM (who can run queries), S3 (who can read/write data), and Glue Catalog (who can see table metadata). Understanding these three layers is critical for secure production use.
Layer 1 โ IAM Permissions
athena:StartQueryExecutionathena:GetQueryResultsathena:GetWorkGroup- Controls WHO can run queries and in which workgroup
- Use IAM policies to restrict by workgroup
Layer 2 โ S3 Access
s3:GetObjecton data buckets3:PutObjecton results buckets3:GetBucketLocation- Controls WHAT data Athena can read
- Use bucket policies + IAM to restrict
Layer 3 โ Glue Catalog
glue:GetDatabase,glue:GetTableglue:GetPartitions- Controls WHICH tables/databases are visible
- Use resource-based Glue policies
- Lake Formation for fine-grained column access
| What | Encryption Options | Configuration |
|---|---|---|
| Source data in S3 | SSE-S3, SSE-KMS, CSE-KMS | Set on S3 bucket default encryption or per-object |
| Query results in S3 | SSE-S3, SSE-KMS | Configured per workgroup (enforce via workgroup settings) |
| Data in transit | TLS 1.2 (automatic) | JDBC/ODBC connections are encrypted by default |
What Workgroups Control
- Separate query result locations per team
- Per-query data scan limits (cost cap)
- Enforce encryption on results
- CloudWatch metrics per workgroup
- IAM policies scoped to workgroup
Best Practice: Multi-Team Setup
- One workgroup per team (marketing, engineering, security)
- Set scan limit: e.g. 100 GB max per query
- Each team's results โ separate S3 prefix
- Audit trail: CloudTrail logs query execution per workgroup
- Tag workgroups for cost allocation tracking
For enterprise environments that need column-level or row-level security, AWS Lake Formation adds another governance layer on top of Glue Catalog:
| Feature | Without Lake Formation | With Lake Formation |
|---|---|---|
| Table access | IAM + Glue policies (coarse) | Fine-grained: per-column, per-row, per-user |
| Column masking | Not possible natively | Hide sensitive columns from specific users |
| Row filtering | Not possible natively | Users see only rows matching their filter |
| Cross-account sharing | Complex IAM + S3 policies | One-click table sharing across accounts |
| Audit | CloudTrail only | Centralised access audit log in Lake Formation |
Use Lake Formation when you need column-level security (e.g. hide SSN column from certain teams), row-level filtering (e.g. each team sees only their region's data), or cross-account data sharing without complex IAM. For simple setups with 1โ2 teams, standard IAM + Glue + S3 policies are sufficient.
CloudTrail
- Logs every Athena API call
- StartQueryExecution events
- Who ran what query, when
- IP source, IAM identity
CloudWatch Metrics
- Data scanned per workgroup
- Query execution time
- Number of queries (throttled)
- Set alarms on cost thresholds
Query History
- 45-day history in Athena console
- Query text, status, data scanned
- Duration, result location
- Useful for cost attribution
- Athena security = IAM + S3 + Glue โ all three must allow access for a query to succeed
- Workgroups = team isolation + enforce encryption + cost control (per-query scan limit)
- Lake Formation = column-level and row-level access control on Glue tables
- Query result encryption can be enforced at the workgroup level (SSE-S3 or SSE-KMS)
- "Restrict analyst to seeing only their team's data" โ Lake Formation row-level security
- "Limit Athena spend per team" โ workgroup with per-query scan limit
Athena security is the union of three systems: IAM (who can query), S3 (what data is readable), and Glue Catalog (which tables are visible). All three must permit access. Use workgroups to isolate teams, enforce encryption, and set cost limits. For enterprise column/row-level security, add AWS Lake Formation. Every query is auditable via CloudTrail + CloudWatch.
Cost Optimization & Best Practices
Athena's pricing is deceptively simple: $5 per TB of data scanned. But in practice, the difference between a well-architected data lake and a naive one is 100ร cost difference for the same query. This chapter consolidates every optimization into a production-ready checklist.
| Component | Cost | Notes |
|---|---|---|
| Data scanned per query | $5.00 per TB | Rounded up to 10 MB minimum per query |
| Cancelled queries | Charged for data scanned before cancellation | Cancel early to limit cost |
| DDL statements (CREATE TABLE, etc.) | Free | No data scanned |
| Failed queries | Free | Syntax errors, permission errors = $0 |
| Glue Data Catalog | Free first 1M objects/month | Then $1 per 100K objects |
| S3 storage | ~$0.023/GB/month (Standard) | Separate from Athena โ you'd store it anyway |
| S3 GET requests | $0.0004 per 1000 requests | Athena generates many GETs โ usually negligible |
Scenario: 100 analysts, each running 5 queries/day, scanning 500 MB per query (well-optimized Parquet + partitioning).
Monthly cost: 100 ร 5 ร 30 ร 0.5 GB = 7,500 GB = 7.5 TB ร $5 = $37.50/month.
Compare to a Redshift cluster doing the same work: $2,000โ$8,000/month.
| # | Optimization | Impact | Effort | How |
|---|---|---|---|---|
| 1 | Partition by date | 80โ99% reduction | Low | Hive-style folders: /year=YYYY/month=MM/day=DD/ |
| 2 | Use Parquet or ORC | 60โ90% reduction | Medium | Glue ETL job to convert raw โ Parquet |
| 3 | Compress with Snappy/ZSTD | 20โ40% further | Free (Parquet default) | Parquet uses Snappy by default โ just use Parquet |
| 4 | Consolidate small files | 10โ50% improvement | Medium | Glue ETL or S3 BatchOps to merge into 128โ512 MB |
| 5 | SELECT only needed columns | Proportional | Free | Never SELECT * โ list specific columns |
| 6 | Use workgroup scan limits | Cost safety net | Low | Set max per-query scan (e.g. 100 GB) |
| 7 | CTAS for repeated queries | Avoids repeated scans | Low | CREATE TABLE AS to materialize results |
| 8 | Use Athena query result reuse | $0 for cache hits | Free | Enable "Reuse query results" in workgroup (7-day cache) |
CREATE TABLE AS SELECT (CTAS) is a powerful technique: run an expensive query once, store the results as a new Parquet table, then all subsequent queries read the smaller materialized result instead of rescanning raw data.
Without CTAS
- Dashboard refreshes 10ร/day
- Each refresh scans 200 GB of raw data
- Daily cost: 10 ร 200 GB ร $5/TB = $10/day
- Monthly: ~$300 just for one dashboard
With CTAS
- CTAS runs once/day โ 200 GB scan = $1/day
- Materialized result: 2 GB Parquet table
- Dashboard reads 2 GB ร 10 = 20 GB = $0.10/day
- Monthly: ~$33 (90% savings)
Data Engineering Practices
- Store analytics data as Parquet with Snappy
- Partition by the most common filter (usually date)
- File size target: 128โ512 MB per file
- Use Glue Crawlers to auto-update partitions
- Separate raw zone (JSON/CSV) from curated zone (Parquet)
- Use lifecycle rules to move old data to Glacier
Operational Governance
- One workgroup per team โ separate cost tracking
- Set per-query scan limits to prevent runaway queries
- Encrypt results (enforce at workgroup level)
- Enable query result reuse (7-day cache)
- Monitor with CloudWatch: data scanned, query time
- Tag workgroups for FinOps cost allocation
Query Writing Practices
- Never
SELECT *โ list only needed columns - Always filter by partition keys in WHERE clause
- Use
LIMITduring exploration - Materialise expensive joins with CTAS
- Use views for common access patterns
- Check "Data scanned" after each query โ optimise if high
Anti-Patterns to Avoid
- โ
SELECT *on large tables - โ Querying unpartitioned data at scale
- โ Storing analytics data as CSV/JSON long-term
- โ Millions of tiny files (< 1 MB each)
- โ Missing partition keys in WHERE clause
- โ Using Athena as a transactional database
| Scenario | Why Not Athena | Better Choice |
|---|---|---|
| Sub-second interactive dashboards (thousands of users) | 2โ30s query latency, limited concurrency | Redshift + SPICE (QuickSight) or ElastiCache |
| Transactional workloads (INSERT/UPDATE/DELETE) | Athena is read-only analytics | RDS, Aurora, DynamoDB |
| Real-time streaming analytics | Athena is batch, not streaming | Kinesis Analytics, Flink, OpenSearch |
| Very high query volume (100+ concurrent queries) | Default concurrency limit ~20โ25 DML queries | Redshift Serverless (higher concurrency) |
| Application backend (millisecond reads) | Not a database โ seconds of latency | DynamoDB, ElastiCache, Aurora |
- "Reduce Athena cost" โ partition + Parquet + compression (ALWAYS the answer)
- "Pay per query, $5/TB" โ Athena pricing (vs Redshift per-hour cluster cost)
- CTAS = CREATE TABLE AS โ materializes results, avoids repeated scans
- "Millions of small files slow down Athena" โ consolidate to 128โ512 MB files
- "Limit Athena spend" โ workgroup per-query byte scan limit
- "Cache Athena results" โ query result reuse (workgroup setting)
- Athena minimum charge: 10 MB per query even if data scanned is less
- What: Serverless SQL query engine that reads data directly from S3 โ no servers, no databases, no loading.
- How: Three-layer architecture โ S3 (data) + Glue Catalog (metadata/schema) + Athena Engine (distributed SQL execution via Trino).
- Cost: $5 per TB of data scanned. Optimize with partitioning (80โ99% reduction), Parquet (60โ90%), and compression.
- Best for: Ad-hoc analytics, log analysis, CloudTrail auditing, CUR billing queries, data lake exploration, serverless reporting.
- Not for: OLTP workloads, sub-second latency, high-concurrency dashboards, real-time streaming.
- vs Redshift: Athena = serverless, $0 idle, ad-hoc. Redshift = managed warehouse, sub-second, high concurrency.
- vs OpenSearch: Athena = SQL analytics on S3. OpenSearch = real-time search/log observability.
- Security: IAM + S3 + Glue โ triple-lock model. Workgroups for team isolation. Lake Formation for column/row-level access.
- Key integration: Glue Data Catalog is the bridge between raw S3 files and queryable SQL tables.
Amazon Athena turns every S3 bucket into a queryable database โ without provisioning, loading, or managing anything. Store data cheaply in S3, describe it in Glue, query it with SQL, pay only for what you scan. Optimize with Parquet and partitioning, and your analytics cost drops to near-zero. It is the query engine that makes the AWS data lake practical, accessible, and affordable.