AWS Glue Amazon S3
LearningTree Β· AWS Β· Analytics

AWS Glue β€”
Serverless ETL & Data Catalog

The metadata backbone of every AWS data lake. Glue provides a shared Data Catalog that Athena, EMR, and Redshift all read from β€” plus serverless ETL jobs that transform raw data into analytics-ready formats without managing a single server.

01
Chapter One Β· Analytics

What is AWS Glue?

AWS Glue is a serverless data integration service with two distinct capabilities: a shared Data Catalog (metadata store for all your datasets) and serverless ETL jobs (Apache Spark/Python Shell that transform data at scale). Together, they make raw data in S3 queryable and analytics-ready β€” without managing infrastructure.

The Two Halves of AWS Glue Introductory

Glue is often misunderstood because it is two services marketed as one. Understanding the split is key:

πŸ—‚οΈ

Half 1 β€” Glue Data Catalog

  • Central metadata repository (like Hive Metastore)
  • Stores table definitions: column names, types, S3 locations
  • Tracks file formats and partition structures
  • Shared across Athena, EMR, Redshift Spectrum, Glue ETL
  • Updated by Crawlers or manually
  • This is what Athena reads to know what tables exist
βš™οΈ

Half 2 β€” Glue ETL Jobs

  • Serverless Apache Spark or Python Shell jobs
  • Extract data from S3/RDS/JDBC, Transform it, Load back to S3
  • Convert CSV/JSON β†’ Parquet with partitioning
  • Deduplicate, clean, flatten, join datasets
  • Runs on AWS-managed compute (DPUs)
  • This is what makes raw data analytics-ready
🧠 Mental Model β€” The Library Catalogue + Transformation Workshop

Think of Glue Data Catalog as a library catalogue system β€” it tells you what books (tables) exist, where they are (S3 paths), and what's in them (columns/types). Glue ETL is the book restoration workshop β€” it takes damaged old books (raw CSV/JSON), cleans and reformats them into pristine editions (Parquet), and puts them on the right shelves (partitioned folders). Athena then uses the catalogue to find and read the books.

Why Glue Exists β€” The Problem It Solves Introductory

Without Glue, the data lake stack has two major pain points:

ProblemWithout GlueWith Glue
How does Athena know what tables exist?Manually create DDL statements for every tableCrawlers auto-discover schemas β†’ Catalog updated
How do you track schema changes?Manual documentation, breaks silentlyCatalog versions schemas, detects drift
How do you convert CSV β†’ Parquet?Self-manage EMR/Spark cluster ($$$)Serverless ETL β€” pay per DPU-second
How do multiple tools share metadata?Each tool has its own metastore (fragmented)One Catalog shared by Athena, EMR, Redshift
How do you partition new data?Custom scripts, easy to misconfigureGlue jobs handle partitioning automatically
Glue Architecture Overview Core
AWS Glue β€” Components & How They Connect
S3 RAW CSV / JSON crawl GLUE DATA CATALOG Databases Β· Tables Β· Partitions Shared metadata layer CONSUMERS Athena Β· EMR Β· Redshift Spectrum Β· Lake Formation read schema GLUE ETL JOBS Spark / Python Shell Transform Β· Partition Β· Convert read write S3 CURATED Parquet Β· partitioned updates catalog GLUE CRAWLERS Auto-detect schemas Crawlers discover β†’ Catalog stores β†’ ETL transforms β†’ Consumers query
Glue Catalog is the shared brain; Crawlers populate it; ETL Jobs transform data; Athena/EMR/Redshift consume the metadata
Glue's Role in the Data Lake Stack Core
πŸ’Ύ

S3 = Storage

S3 holds the actual data files. Cheap, durable, infinitely scalable. Glue never replaces S3 β€” it reads from and writes to S3.

πŸ—‚οΈ

Glue = Metadata + Transform

Glue tells every tool what data exists, what shape it has, and transforms it into optimal formats. The glue that holds the lake together.

πŸ”

Athena = Query

Athena reads schema from Glue Catalog, then scans S3 directly. Without Glue, Athena wouldn't know what tables or partitions exist.

🎯 Exam Insight
  • "Central metadata repository for data lake" β†’ Glue Data Catalog
  • "Serverless ETL" or "convert CSV to Parquet automatically" β†’ Glue ETL Jobs
  • "Auto-discover schemas in S3" β†’ Glue Crawlers
  • "Shared metastore across Athena, EMR, Redshift" β†’ Glue Data Catalog
  • Glue Data Catalog = Hive-compatible metastore (this is key)
  • Glue ETL runs on DPUs (Data Processing Units) β€” billed per DPU-hour
Chapter 01 β€” Key Takeaway

AWS Glue is two things: (1) a shared Data Catalog that tells every analytics service what tables exist and where they live in S3, and (2) serverless ETL jobs that transform raw data into optimised formats. The Catalog is the metadata backbone of every AWS data lake β€” without it, Athena wouldn't know what to query.

02
Chapter Two Β· Analytics

The Glue Data Catalog

The Glue Data Catalog is the single source of truth for metadata across your entire data lake. It stores table definitions (column names, data types, file formats, S3 locations, partition keys) β€” and every AWS analytics service reads from it. Think of it as a Hive Metastore as a managed service.

Catalog Hierarchy β€” How Metadata is Organised Core
ConceptWhat It IsExampleAnalogy
CatalogTop-level container (one per AWS account per region)Your entire metadata storeThe library building
DatabaseA namespace grouping related tablesanalytics_db, logs_dbA floor/section of the library
TableSchema + S3 location + format + partition keyslogs_db.cloudtrail_eventsA bookshelf with the index card
PartitionA sub-folder in S3 mapped to key=valueyear=2026/month=05/day=07/A labelled drawer in the bookshelf
ColumnA field in the table with name + data typeuser_id STRING, amount DOUBLEA column header in a spreadsheet
What a Table Definition Contains Core
πŸ“‹

Table Metadata Fields

  • Table name and database
  • S3 location β€” where data files live
  • Columns β€” names and types (string, int, etc.)
  • Partition keys β€” which columns are partition keys
  • SerDe β€” how to parse the file format
  • Input/Output format β€” CSV, JSON, Parquet, ORC
  • Table properties β€” compression, skip headers, etc.
πŸ”—

Who Reads the Catalog

  • Athena β€” reads table schemas before querying S3
  • EMR β€” uses catalog as Hive Metastore
  • Redshift Spectrum β€” queries S3 via catalog tables
  • Glue ETL Jobs β€” reads source/target schemas
  • Lake Formation β€” column/row security on catalog tables
  • AWS Step Functions β€” orchestrate based on catalog info
Schema-on-Read β€” The Key Concept Core

The Catalog doesn't enforce schema when data arrives (unlike a traditional database). It defines how to interpret data at query time. This means:

βœ…

Advantages of Schema-on-Read

  • Store raw data first, define schema later
  • Multiple tables can point to the same S3 data with different schemas
  • Schema changes don't require data migration
  • New columns are added without rewriting files
⚠️

Gotchas

  • Schema mismatch = NULL values or query errors
  • Catalog doesn't validate data correctness
  • New partitions must be registered (crawler or MSCK REPAIR)
  • Deleted S3 files still appear in catalog until updated
Schema Versioning Deep

Glue Data Catalog automatically versions table schemas. Every time a crawler or manual update changes a table definition, a new version is created. This lets you:

  • See the history of schema changes over time
  • Roll back to a previous schema version if a crawler misconfigures something
  • Detect schema drift β€” when data shape changes unexpectedly
Catalog Pricing Core
ItemCostNotes
First 1 million objects storedFreeObjects = tables + partitions + databases
Above 1M objects$1 per 100,000 objects/monthMost accounts never hit this
First 1 million requests/monthFreeRequests = Athena queries, crawler reads, etc.
Above 1M requests$1 per 1 million requestsVery cheap β€” effectively free for most
πŸ’‘ Practical Implication

The Glue Data Catalog is effectively free for the vast majority of AWS accounts. You'll never worry about catalog cost β€” even with thousands of tables and partitions. The real cost of Glue comes from ETL jobs (DPU-hours), not the catalog.

🎯 Exam Insight
  • "Hive-compatible metastore" = Glue Data Catalog
  • "Where does Athena get table definitions?" = Glue Data Catalog
  • "One catalog shared across multiple services" β†’ Athena, EMR, Redshift Spectrum all use the same catalog
  • "New partitions not visible in Athena" β†’ run MSCK REPAIR TABLE or re-run crawler
  • Catalog does NOT store the actual data β€” only metadata (S3 paths, columns, types)
Chapter 02 β€” Key Takeaway

The Glue Data Catalog is a managed Hive Metastore β€” it stores table definitions (columns, types, S3 paths, partitions) and is shared across Athena, EMR, Redshift Spectrum, and more. It is schema-on-read, versions schemas automatically, and is effectively free for most accounts. Without the catalog, analytics tools wouldn't know what data exists or how to parse it.

03
Chapter Three Β· Analytics

Glue Crawlers β€” Auto-Discover Schemas

Glue Crawlers are automated jobs that scan your S3 data, infer schemas, detect partitions, and register or update tables in the Glue Catalog β€” without you writing any DDL. Point a crawler at an S3 path, run it, and your data becomes queryable in Athena immediately.

How a Crawler Works β€” Step by Step Core
Glue Crawler β€” Execution Flow
β‘  POINT Configure S3 path or JDBC source β‘‘ SCAN Read file samples Detect format + cols β‘’ CLASSIFY Infer schema: cols, types, partitions β‘£ REGISTER Create/update table in Glue Catalog β‘€ QUERYABLE Athena / EMR can now query the data Point β†’ Scan β†’ Classify β†’ Register β†’ Queryable in ~1–5 minutes
What Crawlers Detect Automatically Core
πŸ“„

File Format

  • CSV, TSV, JSON
  • Parquet, ORC, Avro
  • GZIP, BZIP2, Snappy compressed
  • Sets the correct SerDe
πŸ›οΈ

Schema (Columns)

  • Column names from headers or structure
  • Data types (string, int, double, etc.)
  • Nested structures in JSON
  • Array and map types
πŸ“

Partitions

  • Hive-style: key=value/ folders
  • Registers each partition in catalog
  • Partition keys become queryable columns
  • New partitions discovered on re-crawl
Crawler Configuration Options Core
SettingWhat It DoesBest Practice
Data sourceS3 path, JDBC connection, DynamoDB, or Catalog tableUse specific prefixes, not entire bucket roots
IAM RolePermissions to read S3 and write to CatalogLeast-privilege: only the paths crawler needs
ScheduleOn-demand, hourly, daily, or cron expressionDaily for growing datasets; on-demand for one-offs
Database targetWhich Glue database to create tables inOne database per data domain
Schema change policyUpdate table / add columns / ignore changes"Update in place" unless strict governance needed
Table groupingCreate one table per S3 prefix, or group similar filesGroup by logical dataset
ClassifiersCustom parsers for non-standard formatsOnly needed for unusual file structures
Crawlers vs Manual Table Creation Core
ApproachWhen to UseProsCons
CrawlerExploratory; data shape is unknown or evolvingAuto-detects everything; finds new partitionsMay misclassify; takes 1–5 min; costs DPU time
Manual DDL (CREATE TABLE)Schema is well-known and stableInstant; zero cost; precise controlMust maintain manually; won't detect new partitions
MSCK REPAIR TABLESchema known but new partitions arriveFast; adds new Hive-style partitionsOnly works for Hive-style naming
Glue API / SDKCI/CD pipeline automationProgrammatic control; integrates with IaCMore code to maintain
Common Crawler Issues & Fixes Deep
❌

Problem: Crawler Creates Too Many Tables

Crawler treats each S3 prefix as a separate table. Fix: Use table grouping behavior or point crawler at a more specific prefix. Set exclusion patterns for temp/staging folders.

❌

Problem: Wrong Data Types Inferred

Crawler sees "12345" and infers INT, but it's a ZIP code (should be STRING). Fix: Use a custom classifier, or manually correct the table after crawling, or create the table with DDL.

❌

Problem: New Partitions Not Appearing

Data arrives in new date partitions but Athena doesn't see them. Fix: Re-run crawler on schedule, or use MSCK REPAIR TABLE, or use ALTER TABLE ADD PARTITION in Athena.

❌

Problem: Crawler Costs Adding Up

Crawling large datasets with many files takes DPU time. Fix: Use incremental crawling (only scan new partitions), schedule crawls during off-peak, or switch to manual partition management for stable schemas.

Crawler Pricing Core
ComponentCostNotes
Crawler runtime$0.44 per DPU-hourMinimum 10 minutes per run; typically 1–5 min for small datasets
Minimum charge~$0.07 per run (10 min minimum)Running a crawler even briefly costs at least this
S3 request costsS3 LIST + GET chargesUsually negligible
πŸ’‘ Cost Tip

A daily crawler on a moderately-sized dataset costs ~$2–4/month. For datasets with stable schemas but new partitions, consider using MSCK REPAIR TABLE (free) or the Glue BatchCreatePartition API instead of re-crawling to save cost.

🎯 Exam Insight
  • "Automatically discover schemas in S3" β†’ Glue Crawler
  • "New partitions not visible" β†’ re-run crawler OR MSCK REPAIR TABLE
  • "Detect new data and update catalog" β†’ schedule crawler (hourly/daily)
  • Crawlers infer format, schema, and partitions β€” all three at once
  • Crawlers can scan S3, JDBC (RDS/Redshift), and DynamoDB
  • Crawler output goes into the Glue Data Catalog β€” not into S3 or another store
  • "Crawler creating wrong tables" β†’ adjust grouping behavior or use exclusion patterns
Chapter 03 β€” Key Takeaway

Glue Crawlers automatically scan S3, detect file formats, infer column schemas, discover partitions, and register everything in the Glue Data Catalog. They eliminate the need to manually write DDL β€” just point, run, and your data is queryable. For stable schemas, manual DDL or MSCK REPAIR TABLE is cheaper. For evolving or unknown data shapes, crawlers are the fastest path from raw files to queryable tables.

04
Chapter Four Β· Analytics

Glue ETL Jobs β€” Serverless Data Transformation

Glue ETL Jobs are serverless Apache Spark or Python jobs that extract data from sources (S3, RDS, JDBC), transform it (convert formats, clean, partition, deduplicate), and load it back to S3 or other targets. You write the logic β€” AWS manages the compute infrastructure, scaling, and cluster lifecycle.

Job Types β€” Choosing the Right Engine Core
Job TypeEngineBest ForDPU DefaultLanguages
Spark ETLApache Spark (distributed)Large-scale transforms, TBs of data, partitioning10 DPUsPython (PySpark), Scala
Spark StreamingSpark Structured StreamingNear-real-time micro-batch from Kinesis/Kafka10 DPUsPython, Scala
Python ShellSingle-node PythonSmall transforms, API calls, light orchestration0.0625 or 1 DPUPython only
RayRay distributed frameworkML data prep, distributed Python workloadsVariablePython
🧠 When to Choose What
  • Spark ETL β€” default choice for any serious data transformation (CSVβ†’Parquet, joins, aggregations)
  • Python Shell β€” simple tasks: small file moves, API calls, <1 GB data, notification scripts
  • Spark Streaming β€” continuous processing from Kinesis with micro-batch windows
  • Ray β€” ML preprocessing, distributed pandas, when Spark isn't needed
DynamicFrames β€” Glue's Core Abstraction Core

Glue ETL extends Spark with DynamicFrames β€” a more flexible alternative to Spark DataFrames that handles messy, schema-inconsistent data gracefully:

πŸ“Š

Spark DataFrame

  • Strict schema β€” all rows must match
  • Fails on schema inconsistencies
  • Standard PySpark / Spark SQL
  • You already know if you know Spark
πŸ”„

Glue DynamicFrame

  • Self-describing β€” each row carries its own schema
  • Handles mixed types (same column: int + string)
  • Built-in ResolveChoice to fix type conflicts
  • Convert to/from DataFrame freely
  • Reads directly from Glue Catalog
Common ETL Transform Operations Core
OperationWhat It DoesUse Case
ApplyMappingRename/retype columns, drop unwanted fieldsStandardize column names across sources
ResolveChoiceFix columns with multiple data typesMixed int/string fields from JSON crawl
FilterKeep only rows matching a conditionRemove test/null records
JoinMerge two DynamicFrames on a keyEnrich events with user dimension table
DropNullFieldsRemove columns that are entirely nullClean sparse datasets
RelationalizeFlatten nested JSON into relational tablesConvert API responses for analytics
write_dynamic_frameWrite output to S3 in target formatWrite Parquet with partitioning
ETL Job Flow Diagram Core
Glue ETL Job β€” Typical Execution Flow
EXTRACT Read from S3 / RDS β†’ DynamicFrame TRANSFORM ApplyMapping Β· Filter Β· Join ResolveChoice Β· Relationalize LOAD Write to S3 as Parquet + partitions CATALOG Update table + add new partitions Extract β†’ Transform β†’ Load β†’ Update Catalog (all serverless, pay per DPU-second)
Job Bookmarks β€” Incremental Processing Deep

Job Bookmarks let a Glue job remember where it left off, so the next run only processes new data:

πŸ“‘

Without Bookmarks

  • Every run processes ALL source data from scratch
  • Wastes DPU time re-processing old files
  • May create duplicate output records
  • Cost grows linearly with historical data size
βœ…

With Bookmarks Enabled

  • Glue tracks which files/partitions were already processed
  • Next run reads ONLY new files since last bookmark
  • Dramatic cost reduction for daily/hourly ETL
  • Enable via job parameter: --job-bookmark-option = job-bookmark-enable
Triggers & Workflows β€” Orchestrating Jobs Core
MechanismHow It WorksUse Case
Schedule triggerCron expression (e.g. daily at 2am)Nightly ETL batch
On-demand triggerManual start or API callOne-off backfill or testing
Event triggerRun when another job/crawler completesChain: Crawler β†’ ETL β†’ Second ETL
Glue WorkflowDAG of jobs + crawlers with dependenciesMulti-step pipeline: crawl β†’ transform β†’ validate
Step FunctionsExternal orchestrator calling Glue via APIComplex logic, human approval, cross-service
EventBridgeTrigger job on S3 event or scheduleProcess file on arrival in S3
Glue Studio β€” Visual ETL Introductory

Glue Studio is a visual drag-and-drop interface that generates Glue ETL code automatically. You connect source β†’ transforms β†’ target visually, and Glue generates the PySpark script behind the scenes. Useful for teams that prefer visual tools or rapid prototyping.

🎯 Exam Insight
  • "Serverless ETL to convert CSV to Parquet" β†’ Glue ETL Job (Spark)
  • "Process only new files since last run" β†’ enable Job Bookmarks
  • "DPU" = Data Processing Unit β€” unit of Glue compute billing
  • "DynamicFrame" = Glue's extension of Spark DataFrame (handles schema inconsistencies)
  • "Chain jobs: crawler finishes β†’ run ETL" β†’ Glue Workflow or event trigger
  • "Small script, <1GB data, API calls" β†’ Python Shell job (0.0625 DPU, cheapest)
  • "Large-scale data transformation" β†’ Spark ETL job (10+ DPUs)
Chapter 04 β€” Key Takeaway

Glue ETL Jobs are serverless Spark (or Python Shell) jobs that Extract, Transform, and Load data. Use Spark ETL for large-scale transforms (CSV→Parquet, joins, partitioning), Python Shell for tiny tasks. DynamicFrames handle messy data gracefully. Job Bookmarks enable incremental processing. Orchestrate with triggers, workflows, or Step Functions. You pay per DPU-second — no idle clusters.

05
Chapter Five Β· Analytics

Architecture Patterns

Glue rarely works alone. It is a middleware component that connects data sources to analytics tools. This chapter covers the four most common production patterns where Glue plays a central role.

Pattern 1 β€” Data Lake ETL Pipeline Core

The most common Glue pattern. Raw data arrives in S3, Glue converts it to an optimised format, and analytics tools query the curated layer.

Data Lake ETL β€” Raw β†’ Curated β†’ Analytics
SOURCES Kinesis / DMS Direct upload S3 RAW CSV / JSON as-received GLUE ETL β†’ Parquet + partition + clean S3 CURATED Parquet partitioned ATHENA SQL queries via Catalog BI QuickSight Sources β†’ S3 Raw β†’ Glue ETL β†’ S3 Curated (Parquet) β†’ Athena β†’ BI Dashboards
Pattern 2 β€” Database Migration to Data Lake Core
πŸ—οΈ

The Flow

  • DMS replicates RDS/on-prem DB β†’ S3 (CSV/Parquet)
  • Glue Crawler discovers the new tables in S3
  • Glue ETL joins/transforms for analytics use
  • Athena + QuickSight provide the query layer
πŸ’‘

Why This Pattern

  • Offload heavy analytics from production database
  • Query historical data without impacting OLTP performance
  • Combine data from multiple RDS instances
  • Retain full history (RDS may only keep recent data)
Pattern 3 β€” Schema Discovery & Governance Core

Use Glue as a governance tool β€” crawlers scan all S3 data, the Catalog becomes the single inventory of what data exists, and Lake Formation applies security on top:

πŸ”

Discover

Crawlers scan all S3 prefixes and register every dataset in the Catalog β€” automatic data inventory.

πŸ“‹

Catalogue

Catalog becomes the authoritative list of all data assets β€” searchable, versioned, with schema history.

πŸ”’

Govern

Lake Formation applies column/row-level access on catalog tables β€” centralised security for the data lake.

Pattern 4 β€” Event-Driven ETL Deep

Process data the moment it arrives in S3 β€” no polling, no schedules. An S3 event triggers the Glue job immediately:

ComponentRole
S3 Event NotificationFires when a new file is uploaded to a prefix
EventBridgeRoutes the S3 event to the correct target
Glue Workflow / Step FunctionsStarts the ETL job with the new file path as input
Glue ETL JobProcesses only the new file (with bookmarks or explicit path)
Glue CatalogUpdated with new partitions from the processed output
Glue vs Alternatives β€” When to Use What Core
ServiceBest ForWhen NOT to Use
Glue ETLS3-centric batch ETL, format conversion, catalog integrationReal-time (<1s latency), very complex Spark tuning needed
EMRCustom Spark/Hadoop/Presto with full cluster controlSimple transforms (Glue is easier); short-lived jobs
LambdaSmall files (<100MB), <15 min runtime, event-driven micro-transformsTBs of data, complex joins, long-running
Step Functions + LambdaOrchestration with conditional logic, human approvalHeavy data processing (orchestrate Glue instead)
Kinesis AnalyticsReal-time stream SQL (sub-second)Batch processing, historical data
🎯 Exam Insight
  • "Convert CSV to Parquet in S3 serverlessly" β†’ Glue ETL (not EMR, not Lambda for large data)
  • "Process new S3 files on arrival" β†’ S3 Event β†’ EventBridge β†’ Glue ETL (event-driven ETL)
  • "Offload analytics from production RDS" β†’ DMS β†’ S3 β†’ Glue β†’ Athena
  • "Full Spark cluster control needed" β†’ EMR (not Glue β€” Glue is managed)
  • "Small file <100MB transform" β†’ Lambda (cheaper than spinning up Glue DPUs)
  • "Data lake governance + discovery" β†’ Glue Crawlers + Catalog + Lake Formation
Chapter 05 β€” Key Takeaway

Glue fits into four main architecture patterns: (1) Data Lake ETL — the canonical raw→curated pipeline; (2) Database migration — replicate RDS to S3 for analytics; (3) Schema discovery + governance — Crawlers + Catalog + Lake Formation; (4) Event-driven ETL — process files on arrival. Choose Glue over EMR when you want serverless simplicity, over Lambda when data exceeds 100MB, and over Kinesis when batch is acceptable.

06
Chapter Six Β· Analytics

Cost Optimization & Best Practices

Glue costs come from two sources: ETL Jobs (DPU-hours β€” the expensive part) and Crawlers (DPU-hours at lower scale). The Data Catalog itself is effectively free. Understanding DPU billing and optimization techniques is critical to controlling Glue spend.

Complete Pricing Model Core
ComponentCostMinimumNotes
Spark ETL Job$0.44 per DPU-hour10 min, 2 DPUsDefault: 10 DPUs. Scale down to 2 for small jobs
Python Shell Job$0.44 per DPU-hour1 min, 0.0625 DPUTiny jobs: ~$0.0005 per run. Extreme savings
Crawler$0.44 per DPU-hour10 min~$0.07 minimum per crawl run
Data Catalog storageFree (first 1M objects)β€”$1 per 100K above 1M
Data Catalog requestsFree (first 1M/month)β€”$1 per 1M requests above free tier
Glue DataBrew$0.48 per node-hourβ€”Interactive sessions billed separately
Glue Interactive Sessions$0.44 per DPU-hourβ€”Notebook-style development environment
⚠️ Common Cost Trap

Leaving DPUs at the default (10) for small jobs that only need 2 DPUs wastes 80% of your ETL budget. A 10-DPU job running 10 minutes costs $0.73. The same job on 2 DPUs costs $0.15. Always right-size DPU allocation based on data volume.

Cost Optimization Checklist Core
#OptimizationImpactHow
1Right-size DPUs50–80% savingsMonitor "Max needed DPUs" in CloudWatch; reduce to observed peak
2Enable Job BookmarksDramatic for incremental jobsProcess only new data each run instead of full reprocess
3Use Python Shell for small tasks100Γ— cheaper than Spark0.0625 DPU vs 10 DPU for sub-1GB, simple Python transforms
4Replace crawlers with MSCK REPAIR$0 for partition updatesIf schema is stable, don't re-crawl β€” just register new partitions
5Set job timeoutCost safety netPrevent runaway jobs from consuming DPUs for hours
6Auto-scaling (Glue 3.0+)Avoid over-provisioningGlue dynamically adds/removes workers based on workload
7Flex execution~34% cheaperNon-urgent jobs use preemptible capacity at a discount
Production Best Practices Core
βœ…

ETL Job Best Practices

  • Right-size DPUs β€” start at 2, increase only if job is slow
  • Enable bookmarks for all incremental jobs
  • Write output as Parquet with Snappy compression
  • Partition output by date (most common query filter)
  • Set job timeout to prevent runaway execution
  • Monitor with CloudWatch: DPU utilisation, run duration
  • Use Glue 4.0 engine for best performance
πŸ—οΈ

Catalog & Crawler Best Practices

  • One Glue database per data domain (logs, billing, events)
  • Use Hive-style partition naming for auto-discovery
  • Schedule crawlers only if schema actually evolves
  • For stable schemas, use manual DDL + MSCK REPAIR
  • Enable schema versioning for audit trail
  • Use Lake Formation for column-level security
  • Tag tables for governance and cost allocation
Common Mistakes Core
❌

Mistake: 10 DPUs for Small Jobs

Default allocation (10 DPUs) processes TBs. For GBs of data, 2–3 DPUs is enough. Over-provisioning means paying 5Γ— more than needed for every single run.

❌

Mistake: No Job Bookmarks

Without bookmarks, a daily job re-processes ALL historical data every run. Cost grows linearly with time. Enable bookmarks and process only delta.

❌

Mistake: Crawling Stable Schemas Daily

If your data format hasn't changed in months, running a crawler daily wastes $2–4/month per dataset and takes time. Use MSCK REPAIR or API to add partitions instead.

❌

Mistake: Writing Output as CSV

Glue ETL's purpose is to produce analytics-ready data. Writing output as CSV defeats the purpose β€” always output Parquet or ORC with partitioning for downstream analytics.

🎯 Exam Insight
  • "Reduce Glue ETL cost" β†’ reduce DPUs, enable bookmarks, use Flex execution
  • "$0.44 per DPU-hour" β€” this is the Glue ETL pricing model
  • "Process only new data each run" β†’ Job Bookmarks
  • "Cheapest Glue job for small tasks" β†’ Python Shell (0.0625 DPU)
  • "Glue Data Catalog cost" β†’ effectively free (first 1M objects + requests)
  • "Glue auto-scaling" β†’ available in Glue 3.0+ (dynamically adjusts workers)
Chapter 06 β€” Key Takeaway

Glue costs come from DPU-hours β€” ETL jobs and crawlers. The Catalog is free. Optimize by right-sizing DPUs (start at 2, not 10), enabling job bookmarks (incremental), choosing Python Shell for small tasks, and replacing daily crawlers with MSCK REPAIR for stable schemas. Flex execution saves 34% for non-urgent jobs. Always output Parquet with partitioning.

07
Chapter Seven Β· Analytics

Glue DataBrew β€” Visual Data Preparation

AWS Glue DataBrew is a visual data preparation tool for analysts and data scientists who need to clean and normalise data without writing code. It provides 250+ built-in transforms (trim, filter, pivot, fill nulls, deduplicate) in a spreadsheet-like interface β€” then runs the transforms at scale as Glue jobs.

DataBrew vs Glue ETL β€” When to Use Each Core
AspectGlue DataBrewGlue ETL (Spark)
InterfaceVisual / no-code β€” spreadsheet-likeCode (PySpark / Scala / visual Glue Studio)
UsersData analysts, data scientists, non-engineersData engineers, developers
ComplexitySimple transforms: clean, filter, formatComplex: joins, aggregations, custom logic
ScaleModerate datasetsPetabyte-scale distributed processing
Pricing$0.48 per node-hour (interactive sessions)$0.44 per DPU-hour
OutputCleaned S3 data (CSV, JSON, Parquet)Transformed S3 data + catalog updates
ProfilingBuilt-in data profiling (statistics, distributions)Not built-in (must code yourself)
Key DataBrew Features Core
πŸ“Š

Data Profiling

  • Auto-generate statistics for every column
  • Detect missing values, outliers, distributions
  • Data quality scores per column
  • Preview results before running
πŸ”§

250+ Transforms

  • String: trim, pad, case, regex extract
  • Numeric: round, normalize, bin
  • Date: parse, format, extract parts
  • Structure: pivot, unpivot, flatten, split
  • Quality: deduplicate, fill nulls, validate
πŸ“‹

Recipes

  • Save transform steps as reusable "recipes"
  • Apply same recipe to new datasets
  • Version recipes for audit trail
  • Schedule recipe runs as jobs
DataBrew Architecture Flow Core
πŸ”„

How DataBrew Works

  • 1. Create a Project β€” connect to S3/Glue Catalog source
  • 2. Interactive Session β€” explore data, apply transforms visually
  • 3. Build a Recipe β€” ordered list of transform steps
  • 4. Run as Job β€” execute the recipe at scale on full dataset
  • 5. Output β€” cleaned data written to S3 in target format
🎯

Common Use Cases

  • Clean messy CSV exports before loading to data lake
  • Standardise date/phone/address formats
  • Remove PII columns before sharing datasets
  • Profile new data sources for quality assessment
  • Non-technical teams prepare data for ML
🎯 Exam Insight
  • "Visual data preparation without code" β†’ Glue DataBrew
  • "Data profiling and quality assessment" β†’ Glue DataBrew (built-in profiling)
  • "Non-technical users clean and prepare data" β†’ DataBrew (not Glue ETL)
  • "250+ built-in transforms" β†’ DataBrew
  • DataBrew is NOT for complex joins, aggregations, or petabyte-scale β€” use Glue ETL for that
  • DataBrew recipes = reusable, versioned sets of transform steps
Chapter 07 β€” Key Takeaway

Glue DataBrew is a visual, no-code data preparation tool for cleaning and normalising datasets. Use it for non-technical users, data profiling, and simple transforms (trim, filter, deduplicate, format). Use Glue ETL (Spark) for complex joins, aggregations, and petabyte-scale processing. DataBrew recipes are reusable, versioned, and can be scheduled as production jobs.

Full Page Summary β€” AWS Glue Core
πŸ“‹ AWS Glue β€” Complete Recap
  • What: Serverless data integration β€” shared Data Catalog (metadata) + ETL Jobs (transform) + Crawlers (auto-discover) + DataBrew (visual prep).
  • Data Catalog: The Hive-compatible metastore shared by Athena, EMR, Redshift Spectrum. Stores database β†’ table β†’ column β†’ partition hierarchy. Effectively free.
  • Crawlers: Scan S3/JDBC, infer format + schema + partitions, register in Catalog. Point and click to make raw data queryable.
  • ETL Jobs: Serverless Spark (or Python Shell) β€” extract from S3/RDS, transform, load as Parquet + partitions. Uses DynamicFrames for schema flexibility. Job Bookmarks for incremental.
  • Cost: $0.44/DPU-hour. Right-size DPUs (2 not 10), enable bookmarks, use Python Shell for small tasks, replace crawlers with MSCK REPAIR for stable schemas.
  • DataBrew: Visual/no-code data prep with 250+ transforms and built-in profiling. For analysts, not engineers.
  • Key Integration: Glue is the metadata backbone β€” Athena reads Glue Catalog to know what tables exist. Without Glue, there's no data lake.
πŸ‘‰ Final Takeaway β€” Glue in One Thought

AWS Glue is the invisible backbone that turns S3 from a storage bucket into a queryable data lake. The Data Catalog tells every analytics service what data exists and how to read it. Crawlers auto-populate the catalog. ETL Jobs transform raw data into optimised formats. Together, they eliminate the infrastructure tax of building data pipelines β€” no servers to manage, no clusters to provision, no metadata to maintain manually.