LearningTree · AWS · Analytics

AWS Glue: Serverless ETL & Data Catalog

The metadata backbone of every AWS data lake. Glue provides a shared Data Catalog that Athena, EMR, and Redshift all read from — plus serverless ETL jobs that transform raw data into analytics-ready formats without managing a single server.

Chapter One · Analytics

What is AWS Glue?

AWS Glue is a serverless data integration service with two distinct capabilities: a shared Data Catalog (a metadata store for all your datasets) and serverless ETL jobs (Apache Spark or Python Shell jobs that transform data at scale). Together, they make raw data in S3 queryable and analytics-ready without you managing infrastructure.

The Two Halves of AWS Glue Introductory

Glue is often misunderstood because it is two services marketed as one. Understanding the split is key:

🗂️

Half 1 — Glue Data Catalog

Central metadata repository (like Hive Metastore)
Stores table definitions: column names, types, S3 locations
Tracks file formats and partition structures
Shared across Athena, EMR, Redshift Spectrum, Glue ETL
Updated by Crawlers or manually
This is what Athena reads to know what tables exist

⚙️

Half 2 — Glue ETL Jobs

Serverless Apache Spark or Python Shell jobs
Extract data from S3/RDS/JDBC, Transform it, Load back to S3
Convert CSV/JSON → Parquet with partitioning
Deduplicate, clean, flatten, join datasets
Runs on AWS-managed compute (DPUs)
This is what makes raw data analytics-ready

🧠 Mental Model — The Library Catalogue + Transformation Workshop

Think of Glue Data Catalog as a library catalogue system — it tells you what books (tables) exist, where they are (S3 paths), and what's in them (columns/types). Glue ETL is the book restoration workshop — it takes damaged old books (raw CSV/JSON), cleans and reformats them into pristine editions (Parquet), and puts them on the right shelves (partitioned folders). Athena then uses the catalogue to find and read the books.

Why Glue Exists — The Problem It Solves Introductory

Without Glue, the data lake stack has two major pain points:

Problem	Without Glue	With Glue
How does Athena know what tables exist?	Manually create DDL statements for every table	Crawlers auto-discover schemas → Catalog updated
How do you track schema changes?	Manual documentation, breaks silently	Catalog versions schemas, detects drift
How do you convert CSV → Parquet?	Self-manage EMR/Spark cluster ($$$)	Serverless ETL — pay per DPU-second
How do multiple tools share metadata?	Each tool has its own metastore (fragmented)	One Catalog shared by Athena, EMR, Redshift
How do you partition new data?	Custom scripts, easy to misconfigure	Glue jobs write partitioned output from the partition keys you declare

Glue Architecture Overview Core

Figure: AWS Glue components and how they connect. The Glue Catalog is the shared brain; Crawlers populate it; ETL Jobs transform data; Athena, EMR, and Redshift consume the metadata.

Glue's Role in the Data Lake Stack Core

💾

S3 = Storage

S3 holds the actual data files. Cheap, durable, infinitely scalable. Glue never replaces S3 — it reads from and writes to S3.

🗂️

Glue = Metadata + Transform

Glue tells every tool what data exists, what shape it has, and transforms it into optimal formats. The glue that holds the lake together.

🔍

Athena = Query

Athena reads schema from Glue Catalog, then scans S3 directly. Without Glue, Athena wouldn't know what tables or partitions exist.

🎯 Exam Insight

"Central metadata repository for data lake" → Glue Data Catalog
"Serverless ETL" or "convert CSV to Parquet automatically" → Glue ETL Jobs
"Auto-discover schemas in S3" → Glue Crawlers
"Shared metastore across Athena, EMR, Redshift" → Glue Data Catalog
Glue Data Catalog = Hive-compatible metastore (this is key)
Glue ETL runs on DPUs (Data Processing Units), priced per DPU-hour and metered per second

Chapter 01 — Key Takeaway

AWS Glue is two things: (1) a shared Data Catalog that tells every analytics service what tables exist and where they live in S3, and (2) serverless ETL jobs that transform raw data into optimised formats. The Catalog is the metadata backbone of every AWS data lake — without it, Athena wouldn't know what to query.

Chapter Two · Analytics

The Glue Data Catalog

The Glue Data Catalog is the single source of truth for metadata across your entire data lake. It stores table definitions (column names, data types, file formats, S3 locations, partition keys), and AWS analytics services such as Athena, EMR, and Redshift Spectrum read from it. Think of it as a managed Hive Metastore.

Catalog Hierarchy — How Metadata is Organised Core

Concept	What It Is	Example	Analogy
Catalog	Top-level container (one per AWS account per region)	Your entire metadata store	The library building
Database	A namespace grouping related tables	`analytics_db`, `logs_db`	A floor/section of the library
Table	Schema + S3 location + format + partition keys	`logs_db.cloudtrail_events`	A bookshelf with the index card
Partition	A sub-folder in S3 mapped to key=value	`year=2026/month=05/day=07/`	A labelled drawer in the bookshelf
Column	A field in the table with name + data type	`user_id STRING`, `amount DOUBLE`	A column header in a spreadsheet

What a Table Definition Contains Core

📋

Table Metadata Fields

Table name and database
S3 location — where data files live
Columns — names and types (string, int, etc.)
Partition keys — which columns are partition keys
SerDe — how to parse the file format
Input/Output format — CSV, JSON, Parquet, ORC
Table properties — compression, skip headers, etc.
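The fields listed above map closely onto the `TableInput` structure accepted by the Glue `CreateTable` API. The following is a minimal sketch, assuming boto3 is available; the bucket name, column list, and the `register_table` helper are hypothetical and chosen only to match the `logs_db.cloudtrail_events` example used elsewhere on this page.

```python
# A table definition for logs_db.cloudtrail_events expressed as a Glue TableInput.
# Bucket name and columns are hypothetical, for illustration only.
TABLE_INPUT = {
    "Name": "cloudtrail_events",
    "TableType": "EXTERNAL_TABLE",
    # Partition keys are declared separately from regular columns.
    "PartitionKeys": [
        {"Name": "year", "Type": "string"},
        {"Name": "month", "Type": "string"},
        {"Name": "day", "Type": "string"},
    ],
    "StorageDescriptor": {
        "Location": "s3://example-bucket/cloudtrail_events/",
        "Columns": [
            {"Name": "user_id", "Type": "string"},
            {"Name": "amount", "Type": "double"},
        ],
        # Input/output format and SerDe tell readers how to parse the files.
        "InputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat",
        "OutputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat",
        "SerdeInfo": {
            "SerializationLibrary": "org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe"
        },
    },
    "Parameters": {"classification": "parquet"},
}


def register_table(database_name="logs_db", table_input=TABLE_INPUT):
    """Create the table in the Data Catalog (requires AWS credentials)."""
    import boto3  # imported lazily so the definition can be inspected offline

    glue = boto3.client("glue")
    glue.create_table(DatabaseName=database_name, TableInput=table_input)
```

Note that nothing in this definition contains data: it only describes where the files live and how to read them.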

🔗

Who Reads the Catalog

Athena — reads table schemas before querying S3
EMR — uses catalog as Hive Metastore
Redshift Spectrum — queries S3 via catalog tables
Glue ETL Jobs — reads source/target schemas
Lake Formation — column/row security on catalog tables
AWS Step Functions — orchestrate based on catalog info

Schema-on-Read — The Key Concept Core

The Catalog doesn't enforce schema when data arrives (unlike a traditional database). It defines how to interpret data at query time. This means:

✅

Advantages of Schema-on-Read

Store raw data first, define schema later
Multiple tables can point to the same S3 data with different schemas
Schema changes don't require data migration
New columns are added without rewriting files

⚠️

Gotchas

Schema mismatch = NULL values or query errors
Catalog doesn't validate data correctness
New partitions must be registered (crawler or MSCK REPAIR TABLE)
Deleted S3 files still appear in catalog until updated
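Schema-on-read can be illustrated without any AWS service. The sketch below is a toy model in plain Python, not how Athena or a SerDe is implemented: the same raw bytes are read through two different schemas, and a value that does not fit the declared type surfaces as a NULL (`None`). The sample data and the `read_with_schema` helper are made up for the example; real engines may return an error instead of NULL, depending on the format.

```python
import csv
import io

# Hypothetical raw file contents; nothing validated this data when it was stored.
RAW = "user_id,amount\nu1,19.99\nu2,not-a-number\n"


def read_with_schema(raw_text, schema):
    """Interpret raw CSV text through a {column: caster} schema at read time."""
    rows = []
    for record in csv.DictReader(io.StringIO(raw_text)):
        row = {}
        for column, cast in schema.items():
            try:
                row[column] = cast(record[column])
            except (KeyError, TypeError, ValueError):
                row[column] = None  # schema mismatch surfaces as NULL
        rows.append(row)
    return rows


# Two "tables" over the same data, differing only in declared types.
as_strings = read_with_schema(RAW, {"user_id": str, "amount": str})
as_numbers = read_with_schema(RAW, {"user_id": str, "amount": float})
```

With the string schema every value is readable; with the numeric schema the second row's `amount` becomes `None`, which is the "schema mismatch = NULL" gotcha in miniature.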

Schema Versioning Deep

Glue Data Catalog automatically versions table schemas. Every time a crawler or manual update changes a table definition, a new version is created. This lets you:

See the history of schema changes over time
Roll back to a previous schema version if a crawler misconfigures something
Detect schema drift — when data shape changes unexpectedly
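Drift detection boils down to comparing the column lists of two schema versions. Here is a minimal sketch, assuming the two versions have already been fetched (for example via the Glue `GetTableVersions` API) and reduced to `{column_name: type}` mappings; the `schema_diff` helper and the sample versions are hypothetical.

```python
def schema_diff(old_columns, new_columns):
    """Compare two {column_name: type} mappings and report what changed."""
    old_names, new_names = set(old_columns), set(new_columns)
    retyped = sorted(
        name
        for name in old_names & new_names
        if old_columns[name] != new_columns[name]
    )
    return {
        "added": sorted(new_names - old_names),
        "removed": sorted(old_names - new_names),
        "retyped": retyped,
    }


# Hypothetical version 1 and version 2 of the same table.
v1 = {"user_id": "string", "amount": "double"}
v2 = {"user_id": "string", "amount": "string", "country": "string"}

drift = schema_diff(v1, v2)
```

A non-empty `retyped` list is usually the change worth alerting on, since it can turn previously valid values into NULLs at query time.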

Catalog Pricing Core

Item	Cost	Notes
First 1 million objects stored	Free	Objects = tables + partitions + databases
Above 1M objects	$1 per 100,000 objects/month	Most accounts never hit this
First 1 million requests/month	Free	Requests = Athena queries, crawler reads, etc.
Above 1M requests	$1 per 1 million requests	Very cheap — effectively free for most

💡 Practical Implication

The Glue Data Catalog is effectively free for the vast majority of AWS accounts. Catalog cost is rarely a concern, even with thousands of tables and partitions. The real cost of Glue comes from ETL jobs (DPU-hours), not the catalog.
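The pricing table above can be turned into a quick estimate. This is a rough sketch using only the rates quoted in the table; it assumes pro-rata billing above the free tier (actual billing increments may differ), and the `catalog_monthly_cost` helper is hypothetical.

```python
FREE_OBJECTS = 1_000_000
FREE_REQUESTS = 1_000_000


def catalog_monthly_cost(objects, requests):
    """Estimate monthly Data Catalog cost in USD from the rates in the table."""
    # $1 per 100,000 objects stored above the first million.
    storage = max(0, objects - FREE_OBJECTS) / 100_000 * 1.00
    # $1 per 1 million requests above the first million.
    request = max(0, requests - FREE_REQUESTS) / 1_000_000 * 1.00
    return round(storage + request, 2)


small = catalog_monthly_cost(objects=50_000, requests=200_000)
large = catalog_monthly_cost(objects=1_500_000, requests=3_000_000)
```

Under these assumptions the small account pays nothing, and even 1.5 million objects with 3 million requests comes to $7 per month.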

🎯 Exam Insight

"Hive-compatible metastore" = Glue Data Catalog
"Where does Athena get table definitions?" = Glue Data Catalog
"One catalog shared across multiple services" → Athena, EMR, Redshift Spectrum all use the same catalog
"New partitions not visible in Athena" → run MSCK REPAIR TABLE or re-run crawler
Catalog does NOT store the actual data — only metadata (S3 paths, columns, types)

Chapter 02 — Key Takeaway

The Glue Data Catalog is a managed Hive Metastore — it stores table definitions (columns, types, S3 paths, partitions) and is shared across Athena, EMR, Redshift Spectrum, and more. It is schema-on-read, versions schemas automatically, and is effectively free for most accounts. Without the catalog, analytics tools wouldn't know what data exists or how to parse it.

Chapter Three · Analytics

Glue Crawlers — Auto-Discover Schemas

Glue Crawlers are automated jobs that scan your S3 data, infer schemas, detect partitions, and register or update tables in the Glue Catalog, all without you writing any DDL. Point a crawler at an S3 path, run it, and once the run finishes your data is queryable in Athena.

How a Crawler Works — Step by Step Core

Figure: Glue Crawler execution flow

What Crawlers Detect Automatically Core

📄

File Format

CSV, TSV, JSON
Parquet, ORC, Avro
GZIP, BZIP2, Snappy compressed
Sets the correct SerDe

🏛️

Schema (Columns)

Column names from headers or structure
Data types (string, int, double, etc.)
Nested structures in JSON
Array and map types

📁

Partitions

Hive-style: key=value/ folders
Registers each partition in catalog
Partition keys become queryable columns
New partitions discovered on re-crawl
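Hive-style partition detection is essentially path parsing. The sketch below is plain Python rather than the crawler's actual implementation; it shows how `key=value/` folder names in an S3 key become partition columns. The `parse_hive_partitions` helper and the sample key are hypothetical.

```python
def parse_hive_partitions(s3_key):
    """Return the key=value folder segments of an S3 key as a dict."""
    partitions = {}
    for segment in s3_key.split("/")[:-1]:  # the last segment is the file name
        name, sep, value = segment.partition("=")
        if sep and name and value:
            partitions[name] = value
    return partitions


key = "logs/year=2026/month=05/day=07/part-0000.parquet"
parts = parse_hive_partitions(key)
```

Folders without `key=value` naming (such as `logs/2026/05/07/`) yield no partition names here, which mirrors why Hive-style naming is recommended for auto-discovery.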

Crawler Configuration Options Core

Setting	What It Does	Best Practice
Data source	S3 path, JDBC connection, DynamoDB, or Catalog table	Use specific prefixes, not entire bucket roots
IAM Role	Permissions to read S3 and write to Catalog	Least-privilege: only the paths crawler needs
Schedule	On-demand, hourly, daily, or cron expression	Daily for growing datasets; on-demand for one-offs
Database target	Which Glue database to create tables in	One database per data domain
Schema change policy	Update table / add columns / ignore changes	"Update in place" unless strict governance needed
Table grouping	Create one table per S3 prefix, or group similar files	Group by logical dataset
Classifiers	Custom parsers for non-standard formats	Only needed for unusual file structures
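Most of the settings in the table correspond to parameters of the Glue `CreateCrawler` API. The following is a sketch, assuming boto3; the crawler name, role ARN, bucket, and exclusion patterns are placeholders, and the `create_crawler` wrapper is a hypothetical helper.

```python
import json

# Placeholder names and ARNs; replace with values from your own account.
CRAWLER_PARAMS = {
    "Name": "logs-raw-crawler",
    "Role": "arn:aws:iam::123456789012:role/GlueCrawlerRole",
    "DatabaseName": "logs_db",  # database target
    "Targets": {
        "S3Targets": [
            {
                # A specific prefix rather than the bucket root.
                "Path": "s3://example-bucket/logs/",
                # Skip temp/staging folders.
                "Exclusions": ["_tmp/**", "staging/**"],
            }
        ]
    },
    "Schedule": "cron(0 2 * * ? *)",  # daily at 02:00 UTC
    "SchemaChangePolicy": {
        "UpdateBehavior": "UPDATE_IN_DATABASE",
        "DeleteBehavior": "LOG",
    },
    # Table grouping: combine compatible schemas into one table.
    "Configuration": json.dumps(
        {"Version": 1.0, "Grouping": {"TableGroupingPolicy": "CombineCompatibleSchemas"}}
    ),
}


def create_crawler(params=CRAWLER_PARAMS):
    """Create the crawler (requires AWS credentials and an existing IAM role)."""
    import boto3  # imported lazily so the parameters can be reviewed offline

    boto3.client("glue").create_crawler(**params)
```

Keeping the parameters in a plain dictionary makes them easy to review in code review or to feed into an infrastructure-as-code pipeline.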

Crawlers vs Manual Table Creation Core

Approach	When to Use	Pros	Cons
Crawler	Exploratory; data shape is unknown or evolving	Auto-detects everything; finds new partitions	May misclassify; takes 1–5 min; costs DPU time
Manual DDL (CREATE TABLE)	Schema is well-known and stable	Instant; zero cost; precise control	Must maintain manually; won't detect new partitions
MSCK REPAIR TABLE	Schema known but new partitions arrive	Fast; adds new Hive-style partitions	Only works for Hive-style naming
Glue API / SDK	CI/CD pipeline automation	Programmatic control; integrates with IaC	More code to maintain

Common Crawler Issues & Fixes Deep

❌

Problem: Crawler Creates Too Many Tables

Crawler treats each S3 prefix as a separate table. Fix: Use table grouping behavior or point crawler at a more specific prefix. Set exclusion patterns for temp/staging folders.

❌

Problem: Wrong Data Types Inferred

Crawler sees "12345" and infers INT, but it's a ZIP code (should be STRING). Fix: Use a custom classifier, or manually correct the table after crawling, or create the table with DDL.
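The ZIP code problem is easy to reproduce. This toy example uses a deliberately naive inference rule (not the crawler's real classifier logic) to show why a digits-only column is not necessarily numeric; the sample values and helper name are made up.

```python
def naive_infer_type(values):
    """Toy rule: call a column 'int' if every sample consists only of digits."""
    return "int" if all(v.isdigit() for v in values) else "string"


zip_codes = ["12345", "02134", "90210"]

inferred = naive_infer_type(zip_codes)
as_int = [int(v) for v in zip_codes]   # what an INT column would store
round_trip = [str(v) for v in as_int]  # leading zero is gone
```

After the round trip, `"02134"` has become `"2134"`, which is why identifiers like ZIP codes should be declared as STRING even when they look numeric.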

❌

Problem: New Partitions Not Appearing

Data arrives in new date partitions but Athena doesn't see them. Fix: Re-run crawler on schedule, or use MSCK REPAIR TABLE, or use ALTER TABLE ADD PARTITION in Athena.
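When the schema is known and only a new partition needs registering, the `ALTER TABLE ADD PARTITION` route can be scripted. The sketch below only builds the SQL string; running it in Athena is left out. The `add_partition_sql` helper, table name, and bucket are hypothetical.

```python
def add_partition_sql(table, location_root, partition):
    """Build an Athena ALTER TABLE statement for one Hive-style partition."""
    spec = ", ".join(f"{name}='{value}'" for name, value in partition.items())
    path = "/".join(f"{name}={value}" for name, value in partition.items())
    root = location_root.rstrip("/")
    return (
        f"ALTER TABLE {table} ADD IF NOT EXISTS "
        f"PARTITION ({spec}) "
        f"LOCATION '{root}/{path}/'"
    )


sql = add_partition_sql(
    "logs_db.cloudtrail_events",
    "s3://example-bucket/cloudtrail_events/",
    {"year": "2026", "month": "05", "day": "07"},
)
```

`IF NOT EXISTS` makes the statement safe to re-run, so it can be issued by the same process that writes the new data.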

❌

Problem: Crawler Costs Adding Up

Crawling large datasets with many files takes DPU time. Fix: Use incremental crawling (only scan new partitions), run the crawler less often, or switch to manual partition management for stable schemas.

Crawler Pricing Core

Component	Cost	Notes
Crawler runtime	$0.44 per DPU-hour	Billed with a 10-minute minimum per run, even though small datasets typically finish in 1–5 min
Minimum charge	~$0.07 per DPU per run ($0.44 × 10/60)	Running a crawler even briefly costs at least this
S3 request costs	S3 LIST + GET charges	Usually negligible

💡 Cost Tip

A daily crawler on a moderately-sized dataset costs ~$2–4/month. For datasets with stable schemas but new partitions, consider using MSCK REPAIR TABLE (free) or the Glue BatchCreatePartition API instead of re-crawling to save cost.
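The monthly figure follows from the rate and the 10-minute minimum. Here is a quick check, assuming the crawler consumes a single DPU (the actual DPU count is chosen by the service and may be higher) and runs once a day for 30 days; the helper name is hypothetical.

```python
DPU_HOUR_RATE = 0.44       # USD, from the pricing table
CRAWLER_MIN_MINUTES = 10   # minimum billed duration per run


def crawler_run_cost(runtime_minutes, dpus=1):
    """Cost of one crawler run in USD, applying the 10-minute minimum."""
    billed_minutes = max(runtime_minutes, CRAWLER_MIN_MINUTES)
    return dpus * DPU_HOUR_RATE * billed_minutes / 60


per_run = crawler_run_cost(runtime_minutes=3)  # billed as 10 minutes
monthly = per_run * 30                         # one run per day
```

A 3-minute crawl is billed as 10 minutes, about $0.07, which adds up to roughly $2.20 per month at one DPU and lands in the quoted $2–4 range once a second DPU or longer runs are involved.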

🎯 Exam Insight

"Automatically discover schemas in S3" → Glue Crawler
"New partitions not visible" → re-run crawler OR MSCK REPAIR TABLE
"Detect new data and update catalog" → schedule crawler (hourly/daily)
Crawlers infer format, schema, and partitions — all three at once
Crawlers can scan S3, JDBC (RDS/Redshift), and DynamoDB
Crawler output goes into the Glue Data Catalog — not into S3 or another store
"Crawler creating wrong tables" → adjust grouping behavior or use exclusion patterns

Chapter 03 — Key Takeaway

Glue Crawlers automatically scan S3, detect file formats, infer column schemas, discover partitions, and register everything in the Glue Data Catalog. They eliminate the need to manually write DDL — just point, run, and your data is queryable. For stable schemas, manual DDL or MSCK REPAIR TABLE is cheaper. For evolving or unknown data shapes, crawlers are the fastest path from raw files to queryable tables.

Chapter Four · Analytics

Glue ETL Jobs — Serverless Data Transformation

Glue ETL Jobs are serverless Apache Spark or Python jobs that extract data from sources (S3, RDS, JDBC), transform it (convert formats, clean, partition, deduplicate), and load it back to S3 or other targets. You write the logic — AWS manages the compute infrastructure, scaling, and cluster lifecycle.

Job Types — Choosing the Right Engine Core

Job Type	Engine	Best For	DPU Default	Languages
Spark ETL	Apache Spark (distributed)	Large-scale transforms, TBs of data, partitioning	10 DPUs	Python (PySpark), Scala
Spark Streaming	Spark Structured Streaming	Near-real-time micro-batch from Kinesis/Kafka	10 DPUs	Python, Scala
Python Shell	Single-node Python	Small transforms, API calls, light orchestration	0.0625 or 1 DPU	Python only
Ray	Ray distributed framework	ML data prep, distributed Python workloads	Variable	Python

🧠 When to Choose What

Spark ETL — default choice for any serious data transformation (CSV→Parquet, joins, aggregations)
Python Shell — simple tasks: small file moves, API calls, <1 GB data, notification scripts
Spark Streaming — continuous processing from Kinesis with micro-batch windows
Ray — ML preprocessing, distributed pandas, when Spark isn't needed

DynamicFrames — Glue's Core Abstraction Core

Glue ETL extends Spark with DynamicFrames — a more flexible alternative to Spark DataFrames that handles messy, schema-inconsistent data gracefully:

📊

Spark DataFrame

Strict schema — all rows must match
Fails on schema inconsistencies
Standard PySpark / Spark SQL
Familiar if you already know Spark

🔄

Glue DynamicFrame

Self-describing — each row carries its own schema
Handles mixed types (same column: int + string)
Built-in ResolveChoice to fix type conflicts
Convert to/from DataFrame freely
Reads directly from Glue Catalog

Common ETL Transform Operations Core

Operation	What It Does	Use Case
`ApplyMapping`	Rename/retype columns, drop unwanted fields	Standardize column names across sources
`ResolveChoice`	Fix columns with multiple data types	Mixed int/string fields from JSON crawl
`Filter`	Keep only rows matching a condition	Remove test/null records
`Join`	Merge two DynamicFrames on a key	Enrich events with user dimension table
`DropNullFields`	Remove columns that are entirely null	Clean sparse datasets
`Relationalize`	Flatten nested JSON into relational tables	Convert API responses for analytics
`write_dynamic_frame`	Write output to S3 in target format	Write Parquet with partitioning
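The operations in the table typically appear together in one job script. The following is a sketch of such a script, assuming it runs inside a Glue Spark job where the `awsglue` library is provided; the database, table, column names, and output path are hypothetical. The Glue imports sit inside `run_job` so the file can be read and linted outside Glue.

```python
import sys

# (source column, source type, target column, target type); hypothetical columns.
MAPPINGS = [
    ("userId", "string", "user_id", "string"),
    ("amount", "double", "amount", "double"),
    ("year", "string", "year", "string"),
    ("month", "string", "month", "string"),
    ("day", "string", "day", "string"),
]


def run_job(argv):
    """Read from the Catalog, clean, and write partitioned Parquet to S3."""
    from awsglue.context import GlueContext
    from awsglue.job import Job
    from awsglue.transforms import ApplyMapping, DropNullFields, ResolveChoice
    from awsglue.utils import getResolvedOptions
    from pyspark.context import SparkContext

    args = getResolvedOptions(argv, ["JOB_NAME"])
    glue_context = GlueContext(SparkContext.getOrCreate())
    job = Job(glue_context)
    job.init(args["JOB_NAME"], args)

    # Extract: the schema comes from the Data Catalog, not from the script.
    raw = glue_context.create_dynamic_frame.from_catalog(
        database="logs_db", table_name="raw_events", transformation_ctx="raw"
    )
    # Transform: settle mixed types, rename columns, drop all-null fields.
    resolved = ResolveChoice.apply(
        frame=raw, specs=[("amount", "cast:double")], transformation_ctx="resolved"
    )
    mapped = ApplyMapping.apply(
        frame=resolved, mappings=MAPPINGS, transformation_ctx="mapped"
    )
    cleaned = DropNullFields.apply(frame=mapped, transformation_ctx="cleaned")
    # Load: Parquet output partitioned by date.
    glue_context.write_dynamic_frame.from_options(
        frame=cleaned,
        connection_type="s3",
        connection_options={
            "path": "s3://example-bucket/curated/events/",
            "partitionKeys": ["year", "month", "day"],
        },
        format="parquet",
        transformation_ctx="sink",
    )
    job.commit()


if __name__ == "__main__" and "--JOB_NAME" in sys.argv:
    run_job(sys.argv)
```

`ResolveChoice` runs before `ApplyMapping` here so that the mapping sees a single type per column; the partition columns are listed in the mapping because `ApplyMapping` drops any field it is not told about.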

ETL Job Flow Diagram Core

Figure: Glue ETL job, typical execution flow

Job Bookmarks — Incremental Processing Deep

Job Bookmarks let a Glue job remember where it left off, so the next run only processes new data:

📑

Without Bookmarks

Every run processes ALL source data from scratch
Wastes DPU time re-processing old files
May create duplicate output records
Cost grows linearly with historical data size

✅

With Bookmarks Enabled

Glue tracks which files/partitions were already processed
Next run reads ONLY new files since last bookmark
Dramatic cost reduction for daily/hourly ETL
Enable via job parameter: --job-bookmark-option = job-bookmark-enable
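The effect of a bookmark can be shown with a toy model. This is a conceptual sketch in plain Python, not Glue's internal mechanism: the "bookmark" is simply the set of files already processed, and each run handles only what is missing from it. File names and the helper are made up.

```python
def run_incremental(source_files, bookmark):
    """Process only files not yet in the bookmark; return them and the new bookmark."""
    new_files = [f for f in source_files if f not in bookmark]
    # ... a real job would transform new_files here ...
    return new_files, bookmark | set(new_files)


day1_listing = ["raw/2026-05-06.json"]
day2_listing = day1_listing + ["raw/2026-05-07.json"]

processed_day1, bookmark = run_incremental(day1_listing, set())
processed_day2, bookmark = run_incremental(day2_listing, bookmark)
```

On day two only the new file is processed, even though the source listing still contains the old one. In a real Glue job this state is kept by the service when bookmarks are enabled and the script calls `job.commit()`.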

Triggers & Workflows — Orchestrating Jobs Core

Mechanism	How It Works	Use Case
Schedule trigger	Cron expression (e.g. daily at 2am)	Nightly ETL batch
On-demand trigger	Manual start or API call	One-off backfill or testing
Conditional trigger	Runs when another job/crawler completes (Glue trigger type CONDITIONAL)	Chain: Crawler → ETL → Second ETL
Glue Workflow	DAG of jobs + crawlers with dependencies	Multi-step pipeline: crawl → transform → validate
Step Functions	External orchestrator calling Glue via API	Complex logic, human approval, cross-service
EventBridge	Trigger job on S3 event or schedule	Process file on arrival in S3
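Two of the mechanisms above, chaining on completion and a nightly schedule, can be expressed as parameters to the Glue `CreateTrigger` API. This is a sketch assuming boto3; the trigger, crawler, and job names are hypothetical, as is the `create_triggers` helper.

```python
# Start the ETL job when the crawler finishes successfully.
CHAIN_TRIGGER = {
    "Name": "run-etl-after-crawl",
    "Type": "CONDITIONAL",
    "StartOnCreation": True,
    "Predicate": {
        "Conditions": [
            {
                "LogicalOperator": "EQUALS",
                "CrawlerName": "logs-raw-crawler",
                "CrawlState": "SUCCEEDED",
            }
        ]
    },
    "Actions": [{"JobName": "logs-to-parquet"}],
}

# Run the same job every night at 02:00 UTC.
NIGHTLY_TRIGGER = {
    "Name": "nightly-etl",
    "Type": "SCHEDULED",
    "Schedule": "cron(0 2 * * ? *)",
    "StartOnCreation": True,
    "Actions": [{"JobName": "logs-to-parquet"}],
}


def create_triggers():
    """Create both triggers (requires AWS credentials and existing job/crawler)."""
    import boto3  # imported lazily so the definitions can be reviewed offline

    glue = boto3.client("glue")
    for params in (CHAIN_TRIGGER, NIGHTLY_TRIGGER):
        glue.create_trigger(**params)
```

For pipelines with more than two or three steps, a Glue Workflow or Step Functions state machine is usually easier to reason about than a chain of individual triggers.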

Glue Studio — Visual ETL Introductory

Glue Studio is a visual drag-and-drop interface that generates Glue ETL code automatically. You connect source → transforms → target visually, and Glue generates the PySpark script behind the scenes. Useful for teams that prefer visual tools or rapid prototyping.

🎯 Exam Insight

"Serverless ETL to convert CSV to Parquet" → Glue ETL Job (Spark)
"Process only new files since last run" → enable Job Bookmarks
"DPU" = Data Processing Unit — unit of Glue compute billing
"DynamicFrame" = Glue's extension of Spark DataFrame (handles schema inconsistencies)
"Chain jobs: crawler finishes → run ETL" → Glue Workflow or event trigger
"Small script, <1GB data, API calls" → Python Shell job (0.0625 DPU, cheapest)
"Large-scale data transformation" → Spark ETL job (10+ DPUs)

Chapter 04 — Key Takeaway

Glue ETL Jobs are serverless Spark (or Python Shell) jobs that Extract, Transform, and Load data. Use Spark ETL for large-scale transforms (CSV→Parquet, joins, partitioning), Python Shell for tiny tasks. DynamicFrames handle messy data gracefully. Job Bookmarks enable incremental processing. Orchestrate with triggers, workflows, or Step Functions. You pay per DPU-second — no idle clusters.

Chapter Five · Analytics

Architecture Patterns

Glue rarely works alone. It is a middleware component that connects data sources to analytics tools. This chapter covers the four most common production patterns where Glue plays a central role.

Pattern 1 — Data Lake ETL Pipeline Core

The most common Glue pattern. Raw data arrives in S3, Glue converts it to an optimised format, and analytics tools query the curated layer.

Figure: Data Lake ETL (Raw → Curated → Analytics)

Pattern 2 — Database Migration to Data Lake Core

🏗️

The Flow

DMS replicates RDS/on-prem DB → S3 (CSV/Parquet)
Glue Crawler discovers the new tables in S3
Glue ETL joins/transforms for analytics use
Athena + QuickSight provide the query layer

💡

Why This Pattern

Offload heavy analytics from production database
Query historical data without impacting OLTP performance
Combine data from multiple RDS instances
Retain full history (RDS may only keep recent data)

Pattern 3 — Schema Discovery & Governance Core

Use Glue as a governance tool — crawlers scan all S3 data, the Catalog becomes the single inventory of what data exists, and Lake Formation applies security on top:

🔍

Discover

Crawlers scan all S3 prefixes and register every dataset in the Catalog — automatic data inventory.

📋

Catalogue

Catalog becomes the authoritative list of all data assets — searchable, versioned, with schema history.

🔒

Govern

Lake Formation applies column/row-level access on catalog tables — centralised security for the data lake.

Pattern 4 — Event-Driven ETL Deep

Process data the moment it arrives in S3 — no polling, no schedules. An S3 event triggers the Glue job immediately:

Component	Role
S3 Event Notification	Fires when a new file is uploaded to a prefix
EventBridge	Routes the S3 event to the correct target
Glue Workflow / Step Functions	Starts the ETL job with the new file path as input
Glue ETL Job	Processes only the new file (with bookmarks or explicit path)
Glue Catalog	Updated with new partitions from the processed output
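One way to pass the new file path into the job is a small Lambda function targeted by the EventBridge rule. Note this swaps in Lambda as the starter instead of the Workflow / Step Functions route in the table. The sketch assumes the standard EventBridge "Object Created" event shape from S3; the job name and the `--input_path` argument are hypothetical and would need to be read by the job script.

```python
def job_arguments_from_event(event):
    """Extract the uploaded object's S3 path from an EventBridge S3 event."""
    detail = event["detail"]
    bucket = detail["bucket"]["name"]
    key = detail["object"]["key"]
    return {"--input_path": f"s3://{bucket}/{key}"}


def handler(event, context=None):
    """Lambda entry point: start the Glue job for the newly arrived file."""
    import boto3  # imported lazily so the parsing logic is testable offline

    glue = boto3.client("glue")
    response = glue.start_job_run(
        JobName="logs-to-parquet",  # hypothetical job name
        Arguments=job_arguments_from_event(event),
    )
    return response["JobRunId"]


# Trimmed example of an S3 "Object Created" event delivered via EventBridge.
sample_event = {
    "source": "aws.s3",
    "detail-type": "Object Created",
    "detail": {
        "bucket": {"name": "example-bucket"},
        "object": {"key": "raw/2026/05/07/events.json"},
    },
}

args = job_arguments_from_event(sample_event)
```

Passing an explicit path like this is the alternative to job bookmarks mentioned in the table: the job processes exactly the file named in its arguments.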

Glue vs Alternatives — When to Use What Core

Service	Best For	When NOT to Use
Glue ETL	S3-centric batch ETL, format conversion, catalog integration	Real-time (<1s latency), very complex Spark tuning needed
EMR	Custom Spark/Hadoop/Presto with full cluster control	Simple transforms (Glue is easier); short-lived jobs
Lambda	Small files (<100MB), <15 min runtime, event-driven micro-transforms	TBs of data, complex joins, long-running
Step Functions + Lambda	Orchestration with conditional logic, human approval	Heavy data processing (orchestrate Glue instead)
Kinesis Data Analytics (now Amazon Managed Service for Apache Flink)	Real-time stream SQL (sub-second)	Batch processing, historical data

🎯 Exam Insight

"Convert CSV to Parquet in S3 serverlessly" → Glue ETL (not EMR, not Lambda for large data)
"Process new S3 files on arrival" → S3 Event → EventBridge → Glue ETL (event-driven ETL)
"Offload analytics from production RDS" → DMS → S3 → Glue → Athena
"Full Spark cluster control needed" → EMR (not Glue — Glue is managed)
"Small file <100MB transform" → Lambda (cheaper than spinning up Glue DPUs)
"Data lake governance + discovery" → Glue Crawlers + Catalog + Lake Formation

Chapter 05 — Key Takeaway

Glue fits into four main architecture patterns: (1) Data Lake ETL — the canonical raw→curated pipeline; (2) Database migration — replicate RDS to S3 for analytics; (3) Schema discovery + governance — Crawlers + Catalog + Lake Formation; (4) Event-driven ETL — process files on arrival. Choose Glue over EMR when you want serverless simplicity, over Lambda when data exceeds 100MB, and over Kinesis when batch is acceptable.

Chapter Six · Analytics

Cost Optimization & Best Practices

Glue costs come from two sources: ETL Jobs (DPU-hours — the expensive part) and Crawlers (DPU-hours at lower scale). The Data Catalog itself is effectively free. Understanding DPU billing and optimization techniques is critical to controlling Glue spend.

Complete Pricing Model Core

Component	Cost	Minimum	Notes
Spark ETL Job	$0.44 per DPU-hour	1 min on Glue 2.0+ (10 min on Glue 0.9/1.0), 2 DPUs	Default: 10 DPUs. Scale down to 2 for small jobs
Python Shell Job	$0.44 per DPU-hour	1 min, 0.0625 DPU	Tiny jobs: ~$0.0005 per run. Extreme savings
Crawler	$0.44 per DPU-hour	10 min	~$0.07 minimum per crawl run
Data Catalog storage	Free (first 1M objects)	—	$1 per 100K above 1M
Data Catalog requests	Free (first 1M/month)	—	$1 per 1M requests above free tier
Glue DataBrew	$0.48 per node-hour	—	Interactive sessions billed separately
Glue Interactive Sessions	$0.44 per DPU-hour	—	Notebook-style development environment

⚠️ Common Cost Trap

Leaving DPUs at the default (10) for small jobs that only need 2 DPUs wastes 80% of your ETL budget. A 10-DPU job running 10 minutes costs $0.73. The same job on 2 DPUs costs $0.15. Always right-size DPU allocation based on data volume.
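The figures in this cost trap can be reproduced with simple arithmetic. The sketch assumes the job takes the same 10 minutes at either size, which is the simplification the comparison relies on; in practice a smaller allocation may run longer. The `job_cost` helper is hypothetical.

```python
DPU_HOUR_RATE = 0.44  # USD per DPU-hour, from the pricing table


def job_cost(dpus, minutes, rate=DPU_HOUR_RATE):
    """Cost of one job run in USD: DPUs x hourly rate x hours."""
    return dpus * rate * minutes / 60


default_run = job_cost(dpus=10, minutes=10)
right_sized = job_cost(dpus=2, minutes=10)
saving = 1 - right_sized / default_run
```

The default run comes to about $0.73, the right-sized run to about $0.15, and the saving to 80%. If the smaller job ran twice as long, the saving would drop to 60%, so check actual run duration after resizing.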

Cost Optimization Checklist Core

#	Optimization	Impact	How
1	Right-size DPUs	50–80% savings	Monitor "Max needed DPUs" in CloudWatch; reduce to observed peak
2	Enable Job Bookmarks	Dramatic for incremental jobs	Process only new data each run instead of full reprocess
3	Use Python Shell for small tasks	Up to 160× fewer DPUs than a default Spark job	0.0625 DPU vs 10 DPU for sub-1GB, simple Python transforms
4	Replace crawlers with MSCK REPAIR	$0 for partition updates	If schema is stable, don't re-crawl — just register new partitions
5	Set job timeout	Cost safety net	Prevent runaway jobs from consuming DPUs for hours
6	Auto-scaling (Glue 3.0+)	Avoid over-provisioning	Glue dynamically adds/removes workers based on workload
7	Flex execution	~34% cheaper	Non-urgent jobs use preemptible capacity at a discount

Production Best Practices Core

✅

ETL Job Best Practices

Right-size DPUs — start at 2, increase only if job is slow
Enable bookmarks for all incremental jobs
Write output as Parquet with Snappy compression
Partition output by date (most common query filter)
Set job timeout to prevent runaway execution
Monitor with CloudWatch: DPU utilisation, run duration
Use a recent Glue version (4.0 or later) for best performance

🏗️

Catalog & Crawler Best Practices

One Glue database per data domain (logs, billing, events)
Use Hive-style partition naming for auto-discovery
Schedule crawlers only if schema actually evolves
For stable schemas, use manual DDL + MSCK REPAIR
Use the Catalog's automatic schema versions as an audit trail
Use Lake Formation for column-level security
Tag tables for governance and cost allocation

Common Mistakes Core

❌

Mistake: 10 DPUs for Small Jobs

Default allocation (10 DPUs) processes TBs. For GBs of data, 2–3 DPUs is enough. Over-provisioning means paying 5× more than needed for every single run.

❌

Mistake: No Job Bookmarks

Without bookmarks, a daily job re-processes ALL historical data every run. Cost grows linearly with time. Enable bookmarks and process only delta.

❌

Mistake: Crawling Stable Schemas Daily

If your data format hasn't changed in months, running a crawler daily wastes $2–4/month per dataset and takes time. Use MSCK REPAIR or API to add partitions instead.

❌

Mistake: Writing Output as CSV

Glue ETL's purpose is to produce analytics-ready data. Writing output as CSV defeats the purpose — always output Parquet or ORC with partitioning for downstream analytics.

🎯 Exam Insight

"Reduce Glue ETL cost" → reduce DPUs, enable bookmarks, use Flex execution
"$0.44 per DPU-hour" — this is the Glue ETL pricing model
"Process only new data each run" → Job Bookmarks
"Cheapest Glue job for small tasks" → Python Shell (0.0625 DPU)
"Glue Data Catalog cost" → effectively free (first 1M objects + requests)
"Glue auto-scaling" → available in Glue 3.0+ (dynamically adjusts workers)

Chapter 06 — Key Takeaway

Glue costs come from DPU-hours consumed by ETL jobs and crawlers; the Catalog is effectively free. Optimize by right-sizing DPUs (start at 2, not 10), enabling job bookmarks for incremental runs, choosing Python Shell for small tasks, and replacing daily crawlers with MSCK REPAIR TABLE for stable schemas. Flex execution saves about 34% for non-urgent jobs. Always output Parquet with partitioning.

Chapter Seven · Analytics

Glue DataBrew — Visual Data Preparation

AWS Glue DataBrew is a visual data preparation tool for analysts and data scientists who need to clean and normalise data without writing code. It provides 250+ built-in transforms (trim, filter, pivot, fill nulls, deduplicate) in a spreadsheet-like interface — then runs the transforms at scale as Glue jobs.

DataBrew vs Glue ETL — When to Use Each Core

Aspect	Glue DataBrew	Glue ETL (Spark)
Interface	Visual / no-code — spreadsheet-like	Code (PySpark / Scala / visual Glue Studio)
Users	Data analysts, data scientists, non-engineers	Data engineers, developers
Complexity	Simple transforms: clean, filter, format	Complex: joins, aggregations, custom logic
Scale	Moderate datasets	Petabyte-scale distributed processing
Pricing	$0.48 per node-hour for jobs (interactive sessions billed separately)	$0.44 per DPU-hour
Output	Cleaned S3 data (CSV, JSON, Parquet)	Transformed S3 data + catalog updates
Profiling	Built-in data profiling (statistics, distributions)	Not built-in (must code yourself)

Key DataBrew Features Core

📊

Data Profiling

Auto-generate statistics for every column
Detect missing values, outliers, distributions
Data quality scores per column
Preview results before running

🔧

250+ Transforms

String: trim, pad, case, regex extract
Numeric: round, normalize, bin
Date: parse, format, extract parts
Structure: pivot, unpivot, flatten, split
Quality: deduplicate, fill nulls, validate

📋

Recipes

Save transform steps as reusable "recipes"
Apply same recipe to new datasets
Version recipes for audit trail
Schedule recipe runs as jobs

DataBrew Architecture Flow Core

🔄

How DataBrew Works

1. Create a Project — connect to S3/Glue Catalog source
2. Interactive Session — explore data, apply transforms visually
3. Build a Recipe — ordered list of transform steps
4. Run as Job — execute the recipe at scale on full dataset
5. Output — cleaned data written to S3 in target format

🎯

Common Use Cases

Clean messy CSV exports before loading to data lake
Standardise date/phone/address formats
Remove PII columns before sharing datasets
Profile new data sources for quality assessment
Non-technical teams prepare data for ML

🎯 Exam Insight

"Visual data preparation without code" → Glue DataBrew
"Data profiling and quality assessment" → Glue DataBrew (built-in profiling)
"Non-technical users clean and prepare data" → DataBrew (not Glue ETL)
"250+ built-in transforms" → DataBrew
DataBrew is NOT for complex joins, aggregations, or petabyte-scale — use Glue ETL for that
DataBrew recipes = reusable, versioned sets of transform steps

Chapter 07 — Key Takeaway

Glue DataBrew is a visual, no-code data preparation tool for cleaning and normalising datasets. Use it for non-technical users, data profiling, and simple transforms (trim, filter, deduplicate, format). Use Glue ETL (Spark) for complex joins, aggregations, and petabyte-scale processing. DataBrew recipes are reusable, versioned, and can be scheduled as production jobs.

Full Page Summary — AWS Glue Core

📋 AWS Glue — Complete Recap
What: Serverless data integration — shared Data Catalog (metadata) + ETL Jobs (transform) + Crawlers (auto-discover) + DataBrew (visual prep).
Data Catalog: The Hive-compatible metastore shared by Athena, EMR, Redshift Spectrum. Stores database → table → column → partition hierarchy. Effectively free.
Crawlers: Scan S3/JDBC, infer format + schema + partitions, register in Catalog. Point and click to make raw data queryable.
ETL Jobs: Serverless Spark (or Python Shell) — extract from S3/RDS, transform, load as Parquet + partitions. Uses DynamicFrames for schema flexibility. Job Bookmarks for incremental.
Cost: $0.44/DPU-hour. Right-size DPUs (2 not 10), enable bookmarks, use Python Shell for small tasks, replace crawlers with MSCK REPAIR for stable schemas.
DataBrew: Visual/no-code data prep with 250+ transforms and built-in profiling. For analysts, not engineers.
Key Integration: Glue is the metadata backbone. Athena reads the Glue Catalog to know what tables exist; without catalog entries, Athena has nothing to query.

👉 Final Takeaway — Glue in One Thought

AWS Glue is the invisible backbone that turns S3 from a storage bucket into a queryable data lake. The Data Catalog tells every analytics service what data exists and how to read it. Crawlers auto-populate the catalog. ETL Jobs transform raw data into optimised formats. Together, they eliminate the infrastructure tax of building data pipelines — no servers to manage, no clusters to provision, no metadata to maintain manually.