AWS Glue
Serverless ETL & Data Catalog
The metadata backbone of every AWS data lake. Glue provides a shared Data Catalog that Athena, EMR, and Redshift all read from, plus serverless ETL jobs that transform raw data into analytics-ready formats without managing a single server.
What is AWS Glue?
AWS Glue is a serverless data integration service with two distinct capabilities: a shared Data Catalog (metadata store for all your datasets) and serverless ETL jobs (Apache Spark/Python Shell that transform data at scale). Together, they make raw data in S3 queryable and analytics-ready, without managing infrastructure.
Glue is often misunderstood because it is two services marketed as one. Understanding the split is key:
Half 1: Glue Data Catalog
- Central metadata repository (like Hive Metastore)
- Stores table definitions: column names, types, S3 locations
- Tracks file formats and partition structures
- Shared across Athena, EMR, Redshift Spectrum, Glue ETL
- Updated by Crawlers or manually
- This is what Athena reads to know what tables exist
Half 2: Glue ETL Jobs
- Serverless Apache Spark or Python Shell jobs
- Extract data from S3/RDS/JDBC, Transform it, Load back to S3
- Convert CSV/JSON → Parquet with partitioning
- Deduplicate, clean, flatten, join datasets
- Runs on AWS-managed compute (DPUs)
- This is what makes raw data analytics-ready
Think of Glue Data Catalog as a library catalogue system: it tells you what books (tables) exist, where they are (S3 paths), and what's in them (columns/types). Glue ETL is the book restoration workshop: it takes damaged old books (raw CSV/JSON), cleans and reformats them into pristine editions (Parquet), and puts them on the right shelves (partitioned folders). Athena then uses the catalogue to find and read the books.
Without Glue, the data lake stack has two major pain points:
| Problem | Without Glue | With Glue |
|---|---|---|
| How does Athena know what tables exist? | Manually create DDL statements for every table | Crawlers auto-discover schemas → Catalog updated |
| How do you track schema changes? | Manual documentation, breaks silently | Catalog versions schemas, detects drift |
| How do you convert CSV → Parquet? | Self-manage EMR/Spark cluster ($$$) | Serverless ETL, billed per second of DPU use |
| How do multiple tools share metadata? | Each tool has its own metastore (fragmented) | One Catalog shared by Athena, EMR, Redshift |
| How do you partition new data? | Custom scripts, easy to misconfigure | Glue jobs handle partitioning automatically |
S3 = Storage
S3 holds the actual data files. Cheap, durable, infinitely scalable. Glue never replaces S3; it reads from and writes to S3.
Glue = Metadata + Transform
Glue tells every tool what data exists, what shape it has, and transforms it into optimal formats. The glue that holds the lake together.
Athena = Query
Athena reads schema from Glue Catalog, then scans S3 directly. Without Glue, Athena wouldn't know what tables or partitions exist.
- "Central metadata repository for data lake" → Glue Data Catalog
- "Serverless ETL" or "convert CSV to Parquet automatically" → Glue ETL Jobs
- "Auto-discover schemas in S3" → Glue Crawlers
- "Shared metastore across Athena, EMR, Redshift" → Glue Data Catalog
- Glue Data Catalog = Hive-compatible metastore (this is key)
- Glue ETL runs on DPUs (Data Processing Units), priced per DPU-hour and metered per second
AWS Glue is two things: (1) a shared Data Catalog that tells every analytics service what tables exist and where they live in S3, and (2) serverless ETL jobs that transform raw data into optimised formats. The Catalog is the metadata backbone of every AWS data lake; without it, Athena wouldn't know what to query.
The Glue Data Catalog
The Glue Data Catalog is the single source of truth for metadata across your entire data lake. It stores table definitions (column names, data types, file formats, S3 locations, partition keys), and AWS analytics services such as Athena, EMR, and Redshift Spectrum read from it. Think of it as a Hive Metastore offered as a managed service.
| Concept | What It Is | Example | Analogy |
|---|---|---|---|
| Catalog | Top-level container (one per AWS account per region) | Your entire metadata store | The library building |
| Database | A namespace grouping related tables | analytics_db, logs_db | A floor/section of the library |
| Table | Schema + S3 location + format + partition keys | logs_db.cloudtrail_events | A bookshelf with the index card |
| Partition | A sub-folder in S3 mapped to key=value | year=2026/month=05/day=07/ | A labelled drawer in the bookshelf |
| Column | A field in the table with name + data type | user_id STRING, amount DOUBLE | A column header in a spreadsheet |
Table Metadata Fields
- Table name and database
- S3 location: where data files live
- Columns: names and types (string, int, etc.)
- Partition keys: which columns are partition keys
- SerDe: how to parse the file format
- Input/Output format: CSV, JSON, Parquet, ORC
- Table properties: compression, skip headers, etc.
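As a rough sketch of how these fields fit together, the dictionary below mirrors the TableInput structure accepted by boto3's glue.create_table call. The helper name parquet_table_input, the bucket, and the column list are illustrative assumptions; the SerDe and format class names are the standard Hive Parquet classes.

```python
def parquet_table_input(name, location, columns, partition_keys):
    """Build a TableInput-shaped dict for a Parquet table (hypothetical helper)."""
    def as_cols(pairs):
        return [{"Name": col_name, "Type": col_type} for col_name, col_type in pairs]

    return {
        "Name": name,
        "TableType": "EXTERNAL_TABLE",
        "Parameters": {"classification": "parquet"},   # table properties
        "PartitionKeys": as_cols(partition_keys),      # partition key columns
        "StorageDescriptor": {
            "Columns": as_cols(columns),               # regular columns
            "Location": location,                      # S3 location of the data files
            "InputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat",
            "OutputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat",
            "SerdeInfo": {
                "SerializationLibrary": "org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe",
            },
        },
    }


table_input = parquet_table_input(
    name="cloudtrail_events",
    location="s3://example-bucket/curated/cloudtrail_events/",  # hypothetical bucket
    columns=[("user_id", "string"), ("amount", "double")],
    partition_keys=[("year", "string"), ("month", "string"), ("day", "string")],
)
# With AWS credentials configured, this could be registered with:
#   boto3.client("glue").create_table(DatabaseName="logs_db", TableInput=table_input)
```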
Who Reads the Catalog
- Athena: reads table schemas before querying S3
- EMR: uses catalog as Hive Metastore
- Redshift Spectrum: queries S3 via catalog tables
- Glue ETL Jobs: reads source/target schemas
- Lake Formation: column/row security on catalog tables
- AWS Step Functions: orchestrate based on catalog info
The Catalog doesn't enforce schema when data arrives (unlike a traditional database). It defines how to interpret data at query time. This means:
Advantages of Schema-on-Read
- Store raw data first, define schema later
- Multiple tables can point to the same S3 data with different schemas
- Schema changes don't require data migration
- New columns are added without rewriting files
Gotchas
- Schema mismatch = NULL values or query errors
- Catalog doesn't validate data correctness
- New partitions must be registered (crawler or MSCK REPAIR)
- Deleted S3 files still appear in catalog until updated
Glue Data Catalog automatically versions table schemas. Every time a crawler or manual update changes a table definition, a new version is created. This lets you:
- See the history of schema changes over time
- Roll back to a previous schema version if a crawler misconfigures something
- Detect schema drift (when data shape changes unexpectedly)
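To make the drift idea concrete, here is a minimal sketch that compares the column lists of two table versions. It assumes the columns are in the Name/Type shape that the Catalog uses (for example, as found under Table.StorageDescriptor.Columns in a glue.get_table_versions response); the function name diff_columns and the sample versions are hypothetical.

```python
def diff_columns(old_columns, new_columns):
    """Report added, removed, and retyped columns between two schema versions."""
    old = {c["Name"]: c["Type"] for c in old_columns}
    new = {c["Name"]: c["Type"] for c in new_columns}
    return {
        "added": sorted(set(new) - set(old)),
        "removed": sorted(set(old) - set(new)),
        "retyped": sorted(n for n in set(old) & set(new) if old[n] != new[n]),
    }


# Hypothetical versions: 'amount' changed type and 'currency' appeared.
v1 = [{"Name": "user_id", "Type": "string"}, {"Name": "amount", "Type": "double"}]
v2 = [
    {"Name": "user_id", "Type": "string"},
    {"Name": "amount", "Type": "string"},
    {"Name": "currency", "Type": "string"},
]
drift = diff_columns(v1, v2)
```

A non-empty "removed" or "retyped" list is usually the signal worth alerting on, since those changes are the ones that tend to break downstream queries.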
| Item | Cost | Notes |
|---|---|---|
| First 1 million objects stored | Free | Objects = tables + partitions + databases |
| Above 1M objects | $1 per 100,000 objects/month | Most accounts never hit this |
| First 1 million requests/month | Free | Requests = Athena queries, crawler reads, etc. |
| Above 1M requests | $1 per 1 million requests | Very cheap; effectively free for most accounts |
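The pricing table above can be turned into a quick estimate. This is a sketch only: it assumes usage above the free tier is charged pro rata, and the function name is made up for illustration.

```python
FREE_OBJECTS = 1_000_000    # objects stored free of charge
FREE_REQUESTS = 1_000_000   # requests per month free of charge


def catalog_monthly_cost(objects_stored, requests):
    """Estimate monthly Data Catalog cost in USD from the table's rates."""
    extra_objects = max(0, objects_stored - FREE_OBJECTS)
    extra_requests = max(0, requests - FREE_REQUESTS)
    storage_cost = extra_objects / 100_000 * 1.00      # $1 per 100,000 objects
    request_cost = extra_requests / 1_000_000 * 1.00   # $1 per 1 million requests
    return storage_cost + request_cost


small_account = catalog_monthly_cost(objects_stored=50_000, requests=200_000)
large_account = catalog_monthly_cost(objects_stored=1_300_000, requests=3_000_000)
```

Under these assumptions the small account pays nothing, and the large one pays $3 for storage plus $2 for requests.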
The Glue Data Catalog is effectively free for the vast majority of AWS accounts. Catalog cost is rarely a concern, even with thousands of tables and partitions. The real cost of Glue comes from ETL jobs (DPU-hours), not the catalog.
- "Hive-compatible metastore" = Glue Data Catalog
- "Where does Athena get table definitions?" = Glue Data Catalog
- "One catalog shared across multiple services" → Athena, EMR, Redshift Spectrum all use the same catalog
- "New partitions not visible in Athena" → run MSCK REPAIR TABLE or re-run crawler
- Catalog does NOT store the actual data, only metadata (S3 paths, columns, types)
The Glue Data Catalog is a managed Hive Metastore: it stores table definitions (columns, types, S3 paths, partitions) and is shared across Athena, EMR, Redshift Spectrum, and more. It is schema-on-read, versions schemas automatically, and is effectively free for most accounts. Without the catalog, analytics tools wouldn't know what data exists or how to parse it.
Glue Crawlers: Auto-Discover Schemas
Glue Crawlers are automated jobs that scan your S3 data, infer schemas, detect partitions, and register or update tables in the Glue Catalog, without you writing any DDL. Point a crawler at an S3 path, run it, and your data becomes queryable in Athena as soon as the crawl completes.
File Format
- CSV, TSV, JSON
- Parquet, ORC, Avro
- GZIP, BZIP2, Snappy compressed
- Sets the correct SerDe
Schema (Columns)
- Column names from headers or structure
- Data types (string, int, double, etc.)
- Nested structures in JSON
- Array and map types
Partitions
- Hive-style: key=value/ folders
- Registers each partition in catalog
- Partition keys become queryable columns
- New partitions discovered on re-crawl
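The Hive-style convention is simple enough to illustrate in a few lines. The sketch below is not the crawler's actual implementation; it only shows how key=value path segments map to partition columns, using a hypothetical object key.

```python
def parse_hive_partition(key):
    """Extract key=value partition segments from an S3 object key, in path order."""
    parts = {}
    for segment in key.split("/"):
        name, sep, value = segment.partition("=")
        if sep and name and value:   # keep only well-formed key=value segments
            parts[name] = value
    return parts


partition = parse_hive_partition("logs/year=2026/month=05/day=07/part-0000.parquet")
```

Segments without an equals sign (the "logs" prefix and the file name) are ignored, which is why folders that do not follow key=value naming yield no partition columns here.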
| Setting | What It Does | Best Practice |
|---|---|---|
| Data source | S3 path, JDBC connection, DynamoDB, or Catalog table | Use specific prefixes, not entire bucket roots |
| IAM Role | Permissions to read S3 and write to Catalog | Least-privilege: only the paths crawler needs |
| Schedule | On-demand, hourly, daily, or cron expression | Daily for growing datasets; on-demand for one-offs |
| Database target | Which Glue database to create tables in | One database per data domain |
| Schema change policy | Update table / add columns / ignore changes | "Update in place" unless strict governance needed |
| Table grouping | Create one table per S3 prefix, or group similar files | Group by logical dataset |
| Classifiers | Custom parsers for non-standard formats | Only needed for unusual file structures |
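The settings in the table map fairly directly onto the arguments of boto3's glue.create_crawler. The following sketch only assembles the arguments; the helper name, role ARN, bucket, and exclusion patterns are placeholders, and the grouping policy shown is one possible choice rather than a recommendation from the table.

```python
import json


def crawler_kwargs(name, role_arn, database, s3_path, exclusions=(), schedule=None):
    """Assemble keyword arguments for glue.create_crawler (hypothetical helper)."""
    kwargs = {
        "Name": name,
        "Role": role_arn,                          # IAM role
        "DatabaseName": database,                  # database target
        "Targets": {                               # data source: a specific prefix
            "S3Targets": [{"Path": s3_path, "Exclusions": list(exclusions)}]
        },
        "SchemaChangePolicy": {                    # schema change policy
            "UpdateBehavior": "UPDATE_IN_DATABASE",
            "DeleteBehavior": "LOG",
        },
        "Configuration": json.dumps({              # table grouping
            "Version": 1.0,
            "Grouping": {"TableGroupingPolicy": "CombineCompatibleSchemas"},
        }),
    }
    if schedule:
        kwargs["Schedule"] = schedule              # e.g. a cron expression
    return kwargs


daily_crawler = crawler_kwargs(
    name="events-daily",
    role_arn="arn:aws:iam::123456789012:role/example-glue-crawler",  # placeholder
    database="events_db",
    s3_path="s3://example-bucket/raw/events/",
    exclusions=["**/_temporary/**", "**/staging/**"],
    schedule="cron(0 2 * * ? *)",                  # daily at 02:00 UTC
)
# With AWS credentials configured: boto3.client("glue").create_crawler(**daily_crawler)
```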
| Approach | When to Use | Pros | Cons |
|---|---|---|---|
| Crawler | Exploratory; data shape is unknown or evolving | Auto-detects everything; finds new partitions | May misclassify; takes 1–5 min; costs DPU time |
| Manual DDL (CREATE TABLE) | Schema is well-known and stable | Instant; zero cost; precise control | Must maintain manually; won't detect new partitions |
| MSCK REPAIR TABLE | Schema known but new partitions arrive | Fast; adds new Hive-style partitions | Only works for Hive-style naming |
| Glue API / SDK | CI/CD pipeline automation | Programmatic control; integrates with IaC | More code to maintain |
Problem: Crawler Creates Too Many Tables
Crawler treats each S3 prefix as a separate table. Fix: Use table grouping behavior or point crawler at a more specific prefix. Set exclusion patterns for temp/staging folders.
Problem: Wrong Data Types Inferred
Crawler sees "12345" and infers INT, but it's a ZIP code (should be STRING). Fix: Use a custom classifier, or manually correct the table after crawling, or create the table with DDL.
Problem: New Partitions Not Appearing
Data arrives in new date partitions but Athena doesn't see them. Fix: Re-run crawler on schedule, or use MSCK REPAIR TABLE, or use ALTER TABLE ADD PARTITION in Athena.
Problem: Crawler Costs Adding Up
Crawling large datasets with many files takes DPU time. Fix: Use incremental crawling (only scan new partitions), schedule crawls during off-peak, or switch to manual partition management for stable schemas.
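For the "new partitions not appearing" problem, the ALTER TABLE ADD PARTITION route can be scripted. The sketch below only builds the statement text for Athena; the table name and bucket are hypothetical, and partition values are not escaped, so it assumes trusted input.

```python
def add_partition_ddl(table, partition, table_location):
    """Build an Athena ALTER TABLE ... ADD PARTITION statement (values not escaped)."""
    spec = ", ".join(f"{key}='{value}'" for key, value in partition.items())
    path = "/".join(f"{key}={value}" for key, value in partition.items())
    location = f"{table_location.rstrip('/')}/{path}/"
    return (
        f"ALTER TABLE {table} ADD IF NOT EXISTS "
        f"PARTITION ({spec}) LOCATION '{location}'"
    )


ddl = add_partition_ddl(
    table="logs_db.cloudtrail_events",
    partition={"year": "2026", "month": "05", "day": "07"},
    table_location="s3://example-bucket/cloudtrail/",   # hypothetical bucket
)
```

IF NOT EXISTS keeps the statement safe to re-run when the same partition is submitted twice.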
| Component | Cost | Notes |
|---|---|---|
| Crawler runtime | $0.44 per DPU-hour | Billed with a 10-minute minimum per run, even though small datasets typically crawl in 1–5 min |
| Minimum charge | ~$0.07 per DPU per run (10 min minimum) | Running a crawler even briefly costs at least this |
| S3 request costs | S3 LIST + GET charges | Usually negligible |
A daily crawler on a moderately-sized dataset costs ~$2–4/month. For datasets with stable schemas but new partitions, consider using MSCK REPAIR TABLE (free) or the Glue BatchCreatePartition API instead of re-crawling to save cost.
- "Automatically discover schemas in S3" → Glue Crawler
- "New partitions not visible" → re-run crawler OR MSCK REPAIR TABLE
- "Detect new data and update catalog" → schedule crawler (hourly/daily)
- Crawlers infer format, schema, and partitions, all three at once
- Crawlers can scan S3, JDBC (RDS/Redshift), and DynamoDB
- Crawler output goes into the Glue Data Catalog, not into S3 or another store
- "Crawler creating wrong tables" → adjust grouping behavior or use exclusion patterns
Glue Crawlers automatically scan S3, detect file formats, infer column schemas, discover partitions, and register everything in the Glue Data Catalog. They eliminate the need to manually write DDL: just point, run, and your data is queryable. For stable schemas, manual DDL or MSCK REPAIR TABLE is cheaper. For evolving or unknown data shapes, crawlers are the fastest path from raw files to queryable tables.
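For the stable-schema case, partitions can also be registered through the BatchCreatePartition API mentioned earlier instead of re-crawling. The sketch below assumes a boto3 Glue client is passed in as glue, that each partition reuses the table's StorageDescriptor with only the Location changed, and that batches are capped at 100 entries (my understanding of the API limit). The helper names are hypothetical.

```python
import copy


def partition_inputs(table_sd, table_location, keys, partitions):
    """Build PartitionInput dicts by copying the table's StorageDescriptor."""
    inputs = []
    for values in partitions:
        sd = copy.deepcopy(table_sd)
        suffix = "/".join(f"{k}={v}" for k, v in zip(keys, values))
        sd["Location"] = f"{table_location.rstrip('/')}/{suffix}/"
        inputs.append({"Values": list(values), "StorageDescriptor": sd})
    return inputs


def register_partitions(glue, database, table, inputs, batch_size=100):
    """Send the inputs to glue.batch_create_partition in batches."""
    for start in range(0, len(inputs), batch_size):
        glue.batch_create_partition(
            DatabaseName=database,
            TableName=table,
            PartitionInputList=inputs[start:start + batch_size],
        )


new_partitions = partition_inputs(
    table_sd={"Columns": [{"Name": "user_id", "Type": "string"}], "Location": ""},
    table_location="s3://example-bucket/curated/events",   # hypothetical bucket
    keys=["year", "month", "day"],
    partitions=[("2026", "05", "07"), ("2026", "05", "08")],
)
# With AWS credentials configured:
#   register_partitions(boto3.client("glue"), "events_db", "events", new_partitions)
```

In practice the table_sd argument would come from the existing table definition (for example via glue.get_table), so that each partition inherits the table's format and SerDe settings.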
Glue ETL Jobs: Serverless Data Transformation
Glue ETL Jobs are serverless Apache Spark or Python jobs that extract data from sources (S3, RDS, JDBC), transform it (convert formats, clean, partition, deduplicate), and load it back to S3 or other targets. You write the logic; AWS manages the compute infrastructure, scaling, and cluster lifecycle.
| Job Type | Engine | Best For | DPU Default | Languages |
|---|---|---|---|---|
| Spark ETL | Apache Spark (distributed) | Large-scale transforms, TBs of data, partitioning | 10 DPUs | Python (PySpark), Scala |
| Spark Streaming | Spark Structured Streaming | Near-real-time micro-batch from Kinesis/Kafka | 10 DPUs | Python, Scala |
| Python Shell | Single-node Python | Small transforms, API calls, light orchestration | 0.0625 or 1 DPU | Python only |
| Ray | Ray distributed framework | ML data prep, distributed Python workloads | Variable | Python |
- Spark ETL: default choice for any serious data transformation (CSV→Parquet, joins, aggregations)
- Python Shell: simple tasks such as small file moves, API calls, <1 GB data, notification scripts
- Spark Streaming: continuous processing from Kinesis with micro-batch windows
- Ray: ML preprocessing, distributed pandas, when Spark isn't needed
Glue ETL extends Spark with DynamicFrames, a more flexible alternative to Spark DataFrames that handles messy, schema-inconsistent data gracefully:
Spark DataFrame
- Strict schema: all rows must match
- Fails on schema inconsistencies
- Standard PySpark / Spark SQL
- Familiar if you already know Spark
Glue DynamicFrame
- Self-describing: each row carries its own schema
- Handles mixed types (same column: int + string)
- Built-in ResolveChoice to fix type conflicts
- Convert to/from DataFrame freely
- Reads directly from Glue Catalog
| Operation | What It Does | Use Case |
|---|---|---|
| ApplyMapping | Rename/retype columns, drop unwanted fields | Standardize column names across sources |
| ResolveChoice | Fix columns with multiple data types | Mixed int/string fields from JSON crawl |
| Filter | Keep only rows matching a condition | Remove test/null records |
| Join | Merge two DynamicFrames on a key | Enrich events with user dimension table |
| DropNullFields | Remove columns that are entirely null | Clean sparse datasets |
| Relationalize | Flatten nested JSON into relational tables | Convert API responses for analytics |
| write_dynamic_frame | Write output to S3 in target format | Write Parquet with partitioning |
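A few of the operations in the table can be combined into a minimal job script. This is a sketch under several assumptions: the database, table, column names, and output path are hypothetical, and the awsglue and pyspark modules are only available inside the Glue runtime, so they are imported inside the function.

```python
import sys

# (source column, source type, target column, target type) tuples for ApplyMapping.
# Columns not listed here are dropped from the output.
MAPPINGS = [
    ("userId", "string", "user_id", "string"),
    ("amt", "double", "amount", "double"),
    ("year", "string", "year", "string"),
    ("month", "string", "month", "string"),
    ("day", "string", "day", "string"),
]
PARTITION_KEYS = ["year", "month", "day"]


def run_job():
    """Read a catalog table, fix types, rename columns, write partitioned Parquet."""
    from awsglue.context import GlueContext
    from awsglue.job import Job
    from awsglue.transforms import ApplyMapping, ResolveChoice
    from awsglue.utils import getResolvedOptions
    from pyspark.context import SparkContext

    args = getResolvedOptions(sys.argv, ["JOB_NAME"])
    glue_context = GlueContext(SparkContext.getOrCreate())
    job = Job(glue_context)
    job.init(args["JOB_NAME"], args)

    raw = glue_context.create_dynamic_frame.from_catalog(
        database="events_db", table_name="raw_events", transformation_ctx="raw"
    )
    # Resolve a column that arrived as mixed int/string by casting it to double.
    typed = ResolveChoice.apply(frame=raw, specs=[("amt", "cast:double")])
    mapped = ApplyMapping.apply(frame=typed, mappings=MAPPINGS)

    glue_context.write_dynamic_frame.from_options(
        frame=mapped,
        connection_type="s3",
        connection_options={
            "path": "s3://example-bucket/curated/events/",   # hypothetical output path
            "partitionKeys": PARTITION_KEYS,
        },
        format="parquet",
        transformation_ctx="sink",
    )
    job.commit()


# Inside a Glue job, call run_job() at the end of the script.
```

The transformation_ctx values and the job.init/job.commit pair are what allow job bookmarks to track progress if bookmarks are enabled for the job.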
Job Bookmarks let a Glue job remember where it left off, so the next run only processes new data:
Without Bookmarks
- Every run processes ALL source data from scratch
- Wastes DPU time re-processing old files
- May create duplicate output records
- Cost grows linearly with historical data size
With Bookmarks Enabled
- Glue tracks which files/partitions were already processed
- Next run reads ONLY new files since last bookmark
- Dramatic cost reduction for daily/hourly ETL
- Enable via job parameter: --job-bookmark-option = job-bookmark-enable
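As an illustration of the bookmark parameter, the sketch below starts a run with bookmarks enabled through boto3's glue.start_job_run. The function name, job name, worker settings, and timeout are assumptions chosen for a small incremental job; the client is passed in so the function can be exercised without AWS access.

```python
def start_incremental_run(glue, job_name, workers=2, timeout_minutes=30):
    """Start a Glue job run with job bookmarks enabled and return its run ID."""
    response = glue.start_job_run(
        JobName=job_name,
        Arguments={"--job-bookmark-option": "job-bookmark-enable"},
        WorkerType="G.1X",
        NumberOfWorkers=workers,
        Timeout=timeout_minutes,   # minutes; guards against runaway runs
    )
    return response["JobRunId"]


# With AWS credentials configured:
#   import boto3
#   run_id = start_incremental_run(boto3.client("glue"), "example-nightly-etl")
```

Bookmarks only take effect if the job script itself calls job.init and job.commit and passes a transformation_ctx to its sources, so the parameter alone is not enough.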
| Mechanism | How It Works | Use Case |
|---|---|---|
| Schedule trigger | Cron expression (e.g. daily at 2am) | Nightly ETL batch |
| On-demand trigger | Manual start or API call | One-off backfill or testing |
| Event trigger | Run when another job/crawler completes | Chain: Crawler → ETL → Second ETL |
| Glue Workflow | DAG of jobs + crawlers with dependencies | Multi-step pipeline: crawl → transform → validate |
| Step Functions | External orchestrator calling Glue via API | Complex logic, human approval, cross-service |
| EventBridge | Trigger job on S3 event or schedule | Process file on arrival in S3 |
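The "crawler finishes, then run ETL" chain from the table can be expressed as a conditional trigger. The sketch below assembles arguments in the shape boto3's glue.create_trigger accepts; the trigger, workflow, crawler, and job names are hypothetical, and it assumes the workflow already exists.

```python
def crawler_then_job_trigger(name, workflow, crawler_name, job_name):
    """Arguments for a trigger that starts a job once a crawler has succeeded."""
    return {
        "Name": name,
        "WorkflowName": workflow,
        "Type": "CONDITIONAL",
        "Predicate": {
            "Conditions": [
                {
                    "LogicalOperator": "EQUALS",
                    "CrawlerName": crawler_name,
                    "CrawlState": "SUCCEEDED",
                }
            ]
        },
        "Actions": [{"JobName": job_name}],
    }


trigger = crawler_then_job_trigger(
    name="after-raw-crawl",
    workflow="events-pipeline",
    crawler_name="events-daily",
    job_name="events-to-parquet",
)
# With AWS credentials configured: boto3.client("glue").create_trigger(**trigger)
```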
Glue Studio is a visual drag-and-drop interface that generates Glue ETL code automatically. You connect source → transforms → target visually, and Glue generates the PySpark script behind the scenes. Useful for teams that prefer visual tools or rapid prototyping.
- "Serverless ETL to convert CSV to Parquet" → Glue ETL Job (Spark)
- "Process only new files since last run" → enable Job Bookmarks
- "DPU" = Data Processing Unit, the unit of Glue compute billing
- "DynamicFrame" = Glue's extension of Spark DataFrame (handles schema inconsistencies)
- "Chain jobs: crawler finishes → run ETL" → Glue Workflow or event trigger
- "Small script, <1GB data, API calls" → Python Shell job (0.0625 DPU, cheapest)
- "Large-scale data transformation" → Spark ETL job (10+ DPUs)
Glue ETL Jobs are serverless Spark (or Python Shell) jobs that Extract, Transform, and Load data. Use Spark ETL for large-scale transforms (CSV→Parquet, joins, partitioning), Python Shell for tiny tasks. DynamicFrames handle messy data gracefully. Job Bookmarks enable incremental processing. Orchestrate with triggers, workflows, or Step Functions. You pay only for the DPU time each run uses, metered per second, with no idle clusters.
Architecture Patterns
Glue rarely works alone. It is a middleware component that connects data sources to analytics tools. This chapter covers the four most common production patterns where Glue plays a central role.
The most common Glue pattern. Raw data arrives in S3, Glue converts it to an optimised format, and analytics tools query the curated layer.
The Flow
- DMS replicates RDS/on-prem DB → S3 (CSV/Parquet)
- Glue Crawler discovers the new tables in S3
- Glue ETL joins/transforms for analytics use
- Athena + QuickSight provide the query layer
Why This Pattern
- Offload heavy analytics from production database
- Query historical data without impacting OLTP performance
- Combine data from multiple RDS instances
- Retain full history (RDS may only keep recent data)
Use Glue as a governance tool: crawlers scan all S3 data, the Catalog becomes the single inventory of what data exists, and Lake Formation applies security on top:
Discover
Crawlers scan all S3 prefixes and register every dataset in the Catalog, producing an automatic data inventory.
Catalogue
Catalog becomes the authoritative list of all data assets: searchable, versioned, with schema history.
Govern
Lake Formation applies column/row-level access on catalog tables, giving centralised security for the data lake.
Process data the moment it arrives in S3, with no polling and no schedules. An S3 event triggers the Glue job immediately:
| Component | Role |
|---|---|
| S3 Event Notification | Fires when a new file is uploaded to a prefix |
| EventBridge | Routes the S3 event to the correct target |
| Glue Workflow / Step Functions | Starts the ETL job with the new file path as input |
| Glue ETL Job | Processes only the new file (with bookmarks or explicit path) |
| Glue Catalog | Updated with new partitions from the processed output |
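One way to wire the components above together is a small Lambda function as the EventBridge target, which is a substitution on my part for the Workflow / Step Functions row in the table. The sketch assumes the S3 "Object Created" event shape that EventBridge delivers (bucket name and object key under detail); the job name, the GLUE_JOB_NAME environment variable, and the --input_path argument are hypothetical.

```python
import os


def input_path_from_event(event):
    """Build the s3:// path of the new object from an EventBridge S3 event."""
    detail = event["detail"]
    return f"s3://{detail['bucket']['name']}/{detail['object']['key']}"


def handler(event, context, glue=None):
    """Lambda entry point: start the ETL job with the new file path as input."""
    if glue is None:
        import boto3   # imported lazily so the module loads without boto3 installed
        glue = boto3.client("glue")
    run = glue.start_job_run(
        JobName=os.environ.get("GLUE_JOB_NAME", "example-etl-job"),
        Arguments={"--input_path": input_path_from_event(event)},
    )
    return run["JobRunId"]
```

Inside the Glue script, the path would then be read with getResolvedOptions(sys.argv, ["input_path"]) and used as the explicit source location instead of relying on bookmarks.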
| Service | Best For | When NOT to Use |
|---|---|---|
| Glue ETL | S3-centric batch ETL, format conversion, catalog integration | Real-time (<1s latency), very complex Spark tuning needed |
| EMR | Custom Spark/Hadoop/Presto with full cluster control | Simple transforms (Glue is easier); short-lived jobs |
| Lambda | Small files (<100MB), <15 min runtime, event-driven micro-transforms | TBs of data, complex joins, long-running |
| Step Functions + Lambda | Orchestration with conditional logic, human approval | Heavy data processing (orchestrate Glue instead) |
| Kinesis Analytics | Real-time stream SQL (sub-second) | Batch processing, historical data |
- "Convert CSV to Parquet in S3 serverlessly" → Glue ETL (not EMR, not Lambda for large data)
- "Process new S3 files on arrival" → S3 Event → EventBridge → Glue ETL (event-driven ETL)
- "Offload analytics from production RDS" → DMS → S3 → Glue → Athena
- "Full Spark cluster control needed" → EMR (not Glue, which is fully managed)
- "Small file <100MB transform" → Lambda (cheaper than spinning up Glue DPUs)
- "Data lake governance + discovery" → Glue Crawlers + Catalog + Lake Formation
Glue fits into four main architecture patterns: (1) Data Lake ETL, the canonical raw→curated pipeline; (2) Database migration, replicating RDS to S3 for analytics; (3) Schema discovery + governance with Crawlers + Catalog + Lake Formation; (4) Event-driven ETL, processing files on arrival. Choose Glue over EMR when you want serverless simplicity, over Lambda when data exceeds 100MB, and over Kinesis when batch is acceptable.
Cost Optimization & Best Practices
Glue costs come from two sources: ETL Jobs (DPU-hours, the expensive part) and Crawlers (DPU-hours at lower scale). The Data Catalog itself is effectively free. Understanding DPU billing and optimization techniques is critical to controlling Glue spend.
| Component | Cost | Minimum | Notes |
|---|---|---|---|
| Spark ETL Job | $0.44 per DPU-hour | 1 min on Glue 2.0+ (10 min on Glue 0.9/1.0), 2 DPUs | Default: 10 DPUs. Scale down to 2 for small jobs |
| Python Shell Job | $0.44 per DPU-hour | 1 min, 0.0625 DPU | Tiny jobs: ~$0.0005 per run. Extreme savings |
| Crawler | $0.44 per DPU-hour | 10 min | ~$0.07 minimum per DPU per crawl run |
| Data Catalog storage | Free (first 1M objects) | n/a | $1 per 100K above 1M |
| Data Catalog requests | Free (first 1M/month) | n/a | $1 per 1M requests above free tier |
| Glue DataBrew | $0.48 per node-hour | n/a | Interactive sessions billed separately |
| Glue Interactive Sessions | $0.44 per DPU-hour | n/a | Notebook-style development environment |
Leaving DPUs at the default (10) for small jobs that only need 2 DPUs wastes 80% of your ETL budget. A 10-DPU job running 10 minutes costs $0.73. The same job on 2 DPUs costs $0.15. Always right-size DPU allocation based on data volume.
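The arithmetic behind those figures is DPUs × hours × rate. The sketch below reproduces it; it ignores minimum billing durations, and the function name is illustrative.

```python
DPU_HOUR_RATE = 0.44   # USD per DPU-hour, as quoted in the pricing table


def job_cost(dpus, minutes, rate=DPU_HOUR_RATE):
    """Cost in USD of one run: DPUs x hours x hourly rate (no minimums applied)."""
    return dpus * (minutes / 60) * rate


default_cost = job_cost(dpus=10, minutes=10)      # about $0.73
right_sized_cost = job_cost(dpus=2, minutes=10)   # about $0.15
saving = 1 - right_sized_cost / default_cost      # about 0.8, i.e. 80%
```

The comparison assumes the job takes the same 10 minutes on 2 DPUs; if the smaller allocation runs longer, the saving shrinks accordingly.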
| # | Optimization | Impact | How |
|---|---|---|---|
| 1 | Right-size DPUs | 50–80% savings | Monitor "Max needed DPUs" in CloudWatch; reduce to observed peak |
| 2 | Enable Job Bookmarks | Dramatic for incremental jobs | Process only new data each run instead of full reprocess |
| 3 | Use Python Shell for small tasks | Up to 160× cheaper than a 10-DPU Spark job | 0.0625 DPU vs 10 DPU for sub-1GB, simple Python transforms |
| 4 | Replace crawlers with MSCK REPAIR | $0 for partition updates | If schema is stable, don't re-crawl; just register new partitions |
| 5 | Set job timeout | Cost safety net | Prevent runaway jobs from consuming DPUs for hours |
| 6 | Auto-scaling (Glue 3.0+) | Avoid over-provisioning | Glue dynamically adds/removes workers based on workload |
| 7 | Flex execution | ~34% cheaper | Non-urgent jobs use preemptible capacity at a discount |
ETL Job Best Practices
- Right-size DPUs: start at 2, increase only if job is slow
- Enable bookmarks for all incremental jobs
- Write output as Parquet with Snappy compression
- Partition output by date (most common query filter)
- Set job timeout to prevent runaway execution
- Monitor with CloudWatch: DPU utilisation, run duration
- Use a recent Glue engine version (4.0 or later) for best performance
Catalog & Crawler Best Practices
- One Glue database per data domain (logs, billing, events)
- Use Hive-style partition naming for auto-discovery
- Schedule crawlers only if schema actually evolves
- For stable schemas, use manual DDL + MSCK REPAIR
- Enable schema versioning for audit trail
- Use Lake Formation for column-level security
- Tag tables for governance and cost allocation
Mistake: 10 DPUs for Small Jobs
Default allocation (10 DPUs) processes TBs. For GBs of data, 2–3 DPUs is enough. Over-provisioning means paying up to 5× more than needed for every single run.
Mistake: No Job Bookmarks
Without bookmarks, a daily job re-processes ALL historical data every run. Cost grows linearly with time. Enable bookmarks and process only delta.
Mistake: Crawling Stable Schemas Daily
If your data format hasn't changed in months, running a crawler daily wastes $2–4/month per dataset and takes time. Use MSCK REPAIR or API to add partitions instead.
Mistake: Writing Output as CSV
Glue ETL's purpose is to produce analytics-ready data. Writing output as CSV defeats the purpose; always output Parquet or ORC with partitioning for downstream analytics.
- "Reduce Glue ETL cost" → reduce DPUs, enable bookmarks, use Flex execution
- "$0.44 per DPU-hour" → this is the Glue ETL pricing model
- "Process only new data each run" → Job Bookmarks
- "Cheapest Glue job for small tasks" → Python Shell (0.0625 DPU)
- "Glue Data Catalog cost" → effectively free (first 1M objects + requests)
- "Glue auto-scaling" → available in Glue 3.0+ (dynamically adjusts workers)
Glue costs come from DPU-hours, spent on ETL jobs and crawlers. The Catalog is effectively free. Optimize by right-sizing DPUs (start at 2, not 10), enabling job bookmarks (incremental), choosing Python Shell for small tasks, and replacing daily crawlers with MSCK REPAIR for stable schemas. Flex execution saves about 34% for non-urgent jobs. Always output Parquet with partitioning.
Glue DataBrew: Visual Data Preparation
AWS Glue DataBrew is a visual data preparation tool for analysts and data scientists who need to clean and normalise data without writing code. It provides 250+ built-in transforms (trim, filter, pivot, fill nulls, deduplicate) in a spreadsheet-like interface, then runs the transforms at scale as jobs.
| Aspect | Glue DataBrew | Glue ETL (Spark) |
|---|---|---|
| Interface | Visual / no-code, spreadsheet-like | Code (PySpark / Scala / visual Glue Studio) |
| Users | Data analysts, data scientists, non-engineers | Data engineers, developers |
| Complexity | Simple transforms: clean, filter, format | Complex: joins, aggregations, custom logic |
| Scale | Moderate datasets | Petabyte-scale distributed processing |
| Pricing | $0.48 per node-hour for jobs; interactive sessions billed separately | $0.44 per DPU-hour |
| Output | Cleaned S3 data (CSV, JSON, Parquet) | Transformed S3 data + catalog updates |
| Profiling | Built-in data profiling (statistics, distributions) | Not built-in (must code yourself) |
Data Profiling
- Auto-generate statistics for every column
- Detect missing values, outliers, distributions
- Data quality scores per column
- Preview results before running
250+ Transforms
- String: trim, pad, case, regex extract
- Numeric: round, normalize, bin
- Date: parse, format, extract parts
- Structure: pivot, unpivot, flatten, split
- Quality: deduplicate, fill nulls, validate
Recipes
- Save transform steps as reusable "recipes"
- Apply same recipe to new datasets
- Version recipes for audit trail
- Schedule recipe runs as jobs
How DataBrew Works
- 1. Create a Project: connect to S3/Glue Catalog source
- 2. Interactive Session: explore data, apply transforms visually
- 3. Build a Recipe: ordered list of transform steps
- 4. Run as Job: execute the recipe at scale on full dataset
- 5. Output: cleaned data written to S3 in target format
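The recipe idea (an ordered, reusable list of transform steps) can be pictured with a plain-Python analogy. This is not the DataBrew API or its recipe format; the step functions and sample rows are invented purely to show how an ordered recipe is applied to a dataset.

```python
def trim_strings(rows):
    """Strip surrounding whitespace from every string value."""
    return [{k: v.strip() if isinstance(v, str) else v for k, v in r.items()} for r in rows]


def fill_nulls(rows, column="amount", default=0.0):
    """Replace missing values in one column with a default."""
    return [{**r, column: default if r.get(column) is None else r[column]} for r in rows]


def deduplicate(rows):
    """Drop rows that are exact duplicates of an earlier row."""
    seen, unique = set(), []
    for r in rows:
        key = tuple(sorted(r.items()))
        if key not in seen:
            seen.add(key)
            unique.append(r)
    return unique


RECIPE = [trim_strings, fill_nulls, deduplicate]   # order matters


def run_recipe(rows, recipe=RECIPE):
    """Apply each step in order, feeding the output of one into the next."""
    for step in recipe:
        rows = step(rows)
    return rows


sample = [
    {"user_id": " u1 ", "amount": None},
    {"user_id": "u1", "amount": 0.0},
    {"user_id": "u2", "amount": 4.5},
]
cleaned = run_recipe(sample)
```

The first two rows only become duplicates after trimming and null-filling, which is why deduplication sits last in the recipe.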
Common Use Cases
- Clean messy CSV exports before loading to data lake
- Standardise date/phone/address formats
- Remove PII columns before sharing datasets
- Profile new data sources for quality assessment
- Non-technical teams prepare data for ML
- "Visual data preparation without code" → Glue DataBrew
- "Data profiling and quality assessment" → Glue DataBrew (built-in profiling)
- "Non-technical users clean and prepare data" → DataBrew (not Glue ETL)
- "250+ built-in transforms" → DataBrew
- DataBrew is NOT for complex joins, aggregations, or petabyte-scale; use Glue ETL for that
- DataBrew recipes = reusable, versioned sets of transform steps
Glue DataBrew is a visual, no-code data preparation tool for cleaning and normalising datasets. Use it for non-technical users, data profiling, and simple transforms (trim, filter, deduplicate, format). Use Glue ETL (Spark) for complex joins, aggregations, and petabyte-scale processing. DataBrew recipes are reusable, versioned, and can be scheduled as production jobs.
- What: Serverless data integration, made up of a shared Data Catalog (metadata) + ETL Jobs (transform) + Crawlers (auto-discover) + DataBrew (visual prep).
- Data Catalog: The Hive-compatible metastore shared by Athena, EMR, Redshift Spectrum. Stores database → table → column → partition hierarchy. Effectively free.
- Crawlers: Scan S3/JDBC, infer format + schema + partitions, register in Catalog. Point and click to make raw data queryable.
- ETL Jobs: Serverless Spark (or Python Shell) that extracts from S3/RDS, transforms, and loads as Parquet + partitions. Uses DynamicFrames for schema flexibility. Job Bookmarks for incremental.
- Cost: $0.44/DPU-hour. Right-size DPUs (2 not 10), enable bookmarks, use Python Shell for small tasks, replace crawlers with MSCK REPAIR for stable schemas.
- DataBrew: Visual/no-code data prep with 250+ transforms and built-in profiling. For analysts, not engineers.
- Key Integration: Glue is the metadata backbone. Athena reads Glue Catalog to know what tables exist; without the Catalog, Athena has no tables to query.
AWS Glue is the invisible backbone that turns S3 from a storage bucket into a queryable data lake. The Data Catalog tells every analytics service what data exists and how to read it. Crawlers auto-populate the catalog. ETL Jobs transform raw data into optimised formats. Together, they eliminate the infrastructure tax of building data pipelines: no servers to manage, no clusters to provision, no metadata to maintain manually.