Amazon Redshift β
Cloud Data Warehouse
A fully managed, petabyte-scale columnar data warehouse optimised for complex analytical queries. Redshift delivers sub-second performance on billions of rows β powering enterprise BI dashboards, data science workloads, and business-critical reporting at scale.
What is Amazon Redshift?
Amazon Redshift is a fully managed, columnar data warehouse designed for Online Analytical Processing (OLAP). It stores structured data in a compressed columnar format, distributes it across multiple nodes, and executes massively parallel queries β returning results in seconds even over billions of rows and petabytes of data.
| Aspect | Data Warehouse (Redshift) | Data Lake (S3 + Athena) |
|---|---|---|
| Data model | Structured, schema-on-write (predefined tables) | Any format, schema-on-read |
| Query speed | Sub-second to seconds (optimised) | Seconds to minutes (scan-based) |
| Concurrency | High (hundreds of simultaneous dashboard users) | Lower (20-25 concurrent queries default) |
| Data loading | ETL required β data loaded into warehouse | Data stays in S3, no loading |
| Best for | BI dashboards, repeatable reports, complex joins | Ad-hoc exploration, infrequent queries, raw data |
| Cost model | Per-node-hour (always running) or serverless RPU | Per-query / per-TB scanned |
| Infrastructure | Managed cluster of compute nodes | Serverless (no infrastructure) |
If S3 + Athena is a giant library where you search through shelves on demand, Redshift is a research institute that has already indexed, compressed, and cross-referenced every book. When an analyst asks a question, the answer comes back in milliseconds β because the data is pre-organised for fast retrieval. The cost is maintaining the institute (running nodes), but the speed is unmatched.
Problems Redshift Solves
- Athena too slow for complex repeated BI queries
- Need sub-second response for 100+ dashboard users
- Complex multi-table joins across TBs of data
- Predictable, consistent query performance required
- Traditional on-prem warehouse too expensive to scale
What Redshift Provides
- Columnar storage β only reads needed columns
- Massively Parallel Processing (MPP) across nodes
- Automatic compression (up to 10Γ storage savings)
- Result caching β repeated queries return instantly
- Scales from 160GB to petabytes with node additions
Athena β Ad-Hoc
Serverless. Pay per query. Best for exploration, infrequent queries, raw data in S3. No infrastructure.
Redshift β Enterprise BI
Managed warehouse. Sub-second queries. Best for dashboards, complex joins, high concurrency, repeatable reports.
OpenSearch β Search
Full-text search + log analytics. Best for observability, log exploration, Kibana dashboards.
| Concept | What It Is | Why It Matters |
|---|---|---|
| Columnar Storage | Data stored by column, not by row | Reads only queried columns β 10Γ less I/O than row-based |
| MPP (Massively Parallel) | Query distributed across all compute nodes simultaneously | Linear performance scaling with nodes added |
| Leader Node | Receives SQL, plans execution, returns results | No data storage; coordinates the cluster |
| Compute Nodes | Store data + execute query fragments (slices) | Each node has CPU, RAM, SSD; data is partitioned across them |
| Slices | Subdivisions of a compute node (virtual processors) | Each slice processes a portion of data in parallel |
| Distribution Style | How rows are distributed across nodes (KEY, ALL, EVEN) | Bad distribution = data shuffling = slow joins |
| Sort Key | Column(s) that determine physical row order on disk | Zone maps allow skipping irrelevant blocks (huge speedup) |
| Result Cache | Repeated identical queries return cached results | Dashboard refresh with zero compute cost |
| Aspect | Redshift Provisioned | Redshift Serverless |
|---|---|---|
| Infrastructure | You choose node type + count | AWS manages capacity automatically |
| Scaling | Manual resize or elastic resize | Auto-scales based on workload |
| Cost | Per-node-hour (always running) | Per RPU-hour (only when queries run) |
| Idle cost | Full cost even when idle | $0 when no queries running |
| Best for | Predictable workloads, maximum control, cost-optimised steady state | Variable workloads, getting started, dev/test |
| Administration | Manage cluster, vacuuming, nodes | Zero admin β fully managed |
- "Fast analytical queries on structured data" or "data warehouse" β Redshift
- "Sub-second dashboard queries" or "high concurrency BI" β Redshift (not Athena)
- "Columnar storage" β Redshift (or Parquet in S3 for Athena)
- "Massively Parallel Processing" β Redshift architecture
- "Redshift vs Athena" β Redshift = managed warehouse (fast, always-on). Athena = serverless, pay-per-query, ad-hoc
- "Query S3 data from Redshift without loading" β Redshift Spectrum
- "Serverless data warehouse" β Redshift Serverless (not Athena β Athena = query engine, not DW)
- "Distribution key" and "sort key" = critical Redshift performance tuning concepts
Amazon Redshift is a managed columnar data warehouse for fast analytical queries at scale. It uses MPP across compute nodes for sub-second performance on billions of rows. Choose Redshift over Athena when you need consistent sub-second latency, high concurrency for dashboards, or complex multi-table joins. Choose Redshift Serverless for variable workloads with zero idle cost, or Provisioned for predictable workloads at optimal cost.
Architecture & Performance Tuning
Redshift performance depends on three design choices: distribution style (how data is spread across nodes), sort keys (how data is ordered on disk), and compression (how columns are encoded). Master these and your queries run 10β100Γ faster.
| Node Type | Storage | Best For | Example |
|---|---|---|---|
| RA3 (Recommended) | Managed storage (S3-backed, unlimited) | Most workloads β separates compute from storage | ra3.xlplus, ra3.4xlarge, ra3.16xlarge |
| DC2 (Dense Compute) | Local SSD (fixed per node) | Small datasets (<1TB) needing fastest I/O | dc2.large, dc2.8xlarge |
| DS2 (Dense Storage) | Local HDD (large capacity) | Legacy β use RA3 instead for new clusters | ds2.xlarge, ds2.8xlarge |
RA3 nodes decouple compute from storage β data lives in S3 (managed storage) with hot data cached locally. You can resize compute independently of data size. Always choose RA3 for new clusters unless you have <1TB and need lowest latency (DC2).
| Style | How It Works | Best For | Gotcha |
|---|---|---|---|
| KEY | Rows with same key value go to same node | Large fact tables joined on that key (e.g. customer_id) | Skewed key = hot node |
| ALL | Full copy on every node | Small dimension tables (<few million rows) joined by all nodes | Write amplification; only for small tables |
| EVEN | Round-robin distribution (default) | Tables not frequently joined; staging tables | Joins require data shuffling between nodes |
| AUTO | Redshift chooses (ALL for small, EVEN for large) | When unsure β let Redshift optimise | May not pick KEY when you need it |
Set the distribution key on the column you most frequently JOIN on. If orders and customers join on customer_id, distribute both tables on customer_id. This co-locates matching rows on the same node β eliminating expensive network shuffles.
| Sort Key Type | How It Works | Best For |
|---|---|---|
| Compound | Multi-column sort (col1, then col2, then col3) | Queries that filter on the prefix columns (like a composite index) |
| Interleaved | Equal weight to each sort column | Ad-hoc queries filtering on any subset of columns |
| AUTO | Redshift maintains sort automatically | Default β good enough for most workloads |
Sort keys enable zone maps β metadata that tracks min/max values per 1MB block. When filtering, Redshift skips entire blocks that can't match, dramatically reducing I/O.
Redshift automatically applies column-level compression encodings when data is loaded via COPY. This typically reduces storage by 3β10Γ and improves query speed (less I/O).
Common Encodings
- AZ64 β AWS proprietary, best for numeric/date
- LZO β general-purpose for large VARCHAR
- ZSTD β excellent compression ratio, newer
- RunLength β sorted columns with repeated values
- RAW β no compression (sort key columns)
Best Practices
- Use
COPYcommand β auto-selects optimal encoding - Run
ANALYZE COMPRESSIONon existing tables - Sort key column = RAW encoding (Redshift enforces this)
- Don't manually specify unless you have a reason
- AZ64 is default for numeric types since 2020
When query load exceeds cluster capacity, Redshift can automatically spin up additional clusters to handle the overflow β and spin them back down when demand decreases.
- Enabled per WLM (Workload Management) queue
- Additional clusters handle read queries only
- You get 1 hour/day of free concurrency scaling per cluster
- Beyond free tier: billed per-second at on-demand node rate
- Ideal for: morning dashboard rush, month-end reporting spikes
AQUA pushes filtering and aggregation down to the storage layer (RA3 managed storage). Instead of moving all data to compute nodes for processing, AQUA applies predicates at the storage level β reducing data movement by up to 10Γ. Available automatically on RA3 nodes.
- "Optimise join performance" β set distribution key on the join column (co-locate data)
- "Speed up range queries on date" β set sort key on the date column (zone map skipping)
- "Reduce storage" β compression encodings (automatic with COPY). 3β10Γ savings.
- "Handle query spikes without resizing" β Concurrency Scaling
- "RA3 vs DC2" β RA3 = compute+storage separation (recommended). DC2 = fastest local SSD (<1TB)
- "Data shuffling" β bad distribution style forces network transfer between nodes during joins
- "Separate compute from storage" β RA3 nodes (data in managed S3-backed storage)
Redshift performance depends on distribution keys (co-locate joined data on same node), sort keys (enable zone map block skipping), and compression (reduce I/O). Choose RA3 nodes to separate compute from storage. Use concurrency scaling for query spikes. The single biggest performance win: distribute your large fact table on the column you JOIN on most frequently.
Loading Data into Redshift
Data must be loaded into Redshift before it can be queried at full speed (unlike Athena which reads S3 directly). The COPY command is the primary mechanism β parallel, compressed, and optimised for bulk loading from S3.
| Method | How | Speed | Best For |
|---|---|---|---|
| COPY from S3 | Parallel load from S3 files into Redshift | Fastest (parallel across slices) | Bulk loads, ETL pipelines, primary method |
| COPY from DynamoDB | Direct parallel copy from DynamoDB table | Fast | Analytics on DynamoDB data |
| Kinesis Firehose | Stream delivery (Firehose β S3 β COPY into Redshift) | Near-real-time | Streaming data into warehouse |
| INSERT | Single-row SQL inserts | Very slow | Avoid for bulk loads β only for one-off corrections |
| AWS DMS | Continuous replication from RDS/on-prem | Ongoing CDC | Keep Redshift in sync with OLTP databases |
| Glue ETL | Transform + load via Spark jobs | Variable | Complex transforms before loading |
Do This
- Split input files into multiple chunks (β number of slices)
- Use compressed files (GZIP, LZO, ZSTD)
- Use columnar formats (Parquet, ORC) when possible
- Use the same region as your Redshift cluster (no cross-region)
- Use a manifest file for precise file list control
- COPY auto-applies optimal compression encoding
Avoid This
- Single large file (can't parallelise across slices)
- INSERT statements for bulk data (1000Γ slower)
- Cross-region COPY (adds latency + data transfer cost)
- Uncompressed files (wastes I/O bandwidth)
- Millions of tiny files (overhead per file)
- COPY without specifying format/delimiter correctly
For maximum COPY parallelism, split your data into N files where N = number of slices in your cluster (or a multiple). Each slice loads one file simultaneously. A cluster with 4 nodes Γ 8 slices = 32 slices, so split into 32 or 64 files for optimal loading speed.
The UNLOAD command does the reverse of COPY β exports query results from Redshift back to S3 as Parquet, CSV, or JSON files. Used for:
- Sharing processed data with other systems
- Feeding Athena or EMR from Redshift-processed results
- Archiving old data to S3 before deleting from cluster
- Creating snapshots of specific query results
- "Fastest way to load data into Redshift" β COPY from S3 (parallel, compressed, split files)
- "INSERT vs COPY" β COPY is orders of magnitude faster for bulk. INSERT = single-row, very slow.
- "Split input files" β match number of slices for parallel loading
- "Stream data into Redshift" β Kinesis Firehose (lands in S3, then COPY)
- "Export Redshift data to S3" β UNLOAD command
- "Continuous sync from RDS" β AWS DMS with CDC to Redshift
- "COPY supports Parquet" β yes, and it's the best format for loading (columnar + compressed)
Always load data with COPY from S3 β never INSERT for bulk. Split files into N chunks matching your slice count for maximum parallelism. Use compressed columnar formats (Parquet preferred). Stage data in the same region as your cluster. For streaming ingestion, use Firehose β S3 β COPY. For exports, use UNLOAD to write results back to S3 for Athena or sharing.
Redshift Spectrum β Query S3 Without Loading
Redshift Spectrum lets you run Redshift SQL queries directly against data stored in S3 β without loading it into the cluster. It extends your Redshift warehouse to include the entire data lake, combining fast local queries on loaded hot data with on-demand access to cold data in S3.
| Approach | When to Use | Pros | Cons |
|---|---|---|---|
| Load into Redshift (COPY) | Hot data queried frequently, sub-second needed | Fastest queries; full indexing, sort keys, caching | Storage cost; must load/maintain |
| Spectrum (external tables) | Cold/historical data; join S3 data with local tables | No loading; query PBs of S3 data; extend warehouse | Slower than local; billed per TB scanned |
| Athena (standalone) | Ad-hoc queries without Redshift cluster; serverless | No cluster needed; pay per query only | No joins with Redshift local data; lower concurrency |
Spectrum uses external tables defined in the Glue Data Catalog (or a Hive metastore). The key steps:
- Create external schema β points Redshift at a Glue Catalog database
- External tables β already exist in Glue (from crawlers or manual DDL)
- Query β
SELECT * FROM external_schema.table WHERE ... - Join β combine local Redshift tables with external S3 tables in one query
Hot/Cold Split
- Recent data (last 90 days) loaded into Redshift local
- Historical data (years) stays in S3 as Parquet
- Queries join both seamlessly via UNION ALL or JOIN
- Saves massive cluster storage cost
Data Lake Extension
- Redshift has curated dimension tables locally
- Spectrum queries raw fact tables in S3 data lake
- Join enriched local dims with massive S3 facts
- Best of both: warehouse speed + lake scale
- Use Parquet or ORC β columnar formats let Spectrum skip irrelevant columns
- Partition data β date-based partitions let Spectrum skip entire folders
- Push predicates β filter in WHERE clause (Spectrum filters at the S3 layer)
- Avoid SELECT * β only select needed columns to reduce data scanned
- Large files (128MBβ512MB) β avoid millions of tiny files
- "Query S3 data from Redshift without loading" β Redshift Spectrum
- "Join local Redshift tables with S3 data" β Spectrum external tables
- "Extend data warehouse to petabytes without adding nodes" β Spectrum
- "Hot/cold data architecture" β recent in Redshift, historical in S3 via Spectrum
- "Spectrum uses Glue Catalog" β same catalog as Athena and EMR (shared metadata)
- "Spectrum cost" β $5 per TB of S3 data scanned (same as Athena pricing)
- "Spectrum vs Athena" β Spectrum = query S3 from Redshift (can join local). Athena = standalone serverless.
Redshift Spectrum extends your warehouse to the entire S3 data lake without loading data. Use it for hot/cold splits (recent data local, historical in S3), joining dimension tables with massive S3 fact tables, or querying petabytes without adding cluster nodes. It shares the Glue Catalog with Athena and costs $5/TB scanned. Use Parquet + partitioning for best Spectrum performance.
Redshift Serverless
Redshift Serverless removes all cluster management. You get the same Redshift engine, same SQL, same performance β but AWS manages capacity, scaling, and infrastructure. You pay only when queries run (measured in RPU-hours). Zero cost when idle.
| Concern | Provisioned Cluster | Redshift Serverless |
|---|---|---|
| Capacity | You choose node type + count | You set base RPU (32β512); auto-scales |
| Idle cost | Full price 24/7 even if no queries | $0 when idle (after cooldown) |
| Scaling | Manual resize (minutes of downtime) | Scales up/down automatically per workload |
| Maintenance | Vacuum, analyze, WLM tuning, patches | All automatic β zero ops |
| Spectrum | Yes | Yes (same external tables) |
| Data sharing | Yes (producer/consumer) | Yes |
| Pricing unit | Per-node-hour | Per RPU-hour ($0.375 per RPU-hour) |
| Best for | Steady, predictable workloads at lowest cost | Variable, bursty, or getting-started workloads |
RPU is the compute unit for Serverless. You set a base capacity (minimum 32 RPU) and Serverless auto-scales from there:
Base Capacity
- Minimum: 32 RPU (~equivalent to 4 RA3.xlplus nodes)
- Maximum: 512 RPU
- Higher base = faster queries but more cost
- Auto-scales above base during peak demand
Pricing
- $0.375 per RPU-hour (compute only)
- Billed per-second with 60-second minimum per query
- Storage: $0.024/GB-month (same as RA3 managed)
- Idle = $0 compute (storage still billed)
- Cooldown: ~5 minutes before fully idle
Choose Serverless When
- Workload is unpredictable or bursty
- Getting started / proof of concept
- Dev/test environments (idle most of the day)
- You want zero administration
- Occasional heavy queries but mostly idle
- Don't want to manage WLM, vacuum, resize
Choose Provisioned When
- Queries run 18+ hours/day (cheaper at steady state)
- Need reserved instances (1yr/3yr) for cost savings
- Want full control over node types and cluster config
- Need deterministic pricing (budget predictability)
- Complex WLM queue tuning required
- Very high concurrency requiring concurrency scaling config
| Feature | Available in Serverless? | Notes |
|---|---|---|
| Spectrum (external S3 queries) | β Yes | Same as provisioned |
| Data Sharing | β Yes | Cross-account, cross-region |
| Federated Query | β Yes | Query RDS/Aurora from Redshift |
| ML (CREATE MODEL) | β Yes | Built-in ML with SageMaker |
| Materialized Views | β Yes | Auto-refresh supported |
| Snapshots | β Yes | Recovery points |
| VPC | β Yes | Runs in your VPC |
| Reserved Instances | β No | Only provisioned supports RI pricing |
| Node type selection | β No | AWS manages infrastructure |
- "Serverless data warehouse, zero idle cost" β Redshift Serverless
- "RPU" = Redshift Processing Unit β compute billing unit for Serverless
- "Variable analytics workload, pay only when querying" β Redshift Serverless
- "Same Redshift SQL and features, no cluster management" β Serverless
- "Redshift Serverless vs Athena" β Both are pay-per-use, but Redshift Serverless = full warehouse (joins, caching, concurrency). Athena = simpler query engine on S3.
- "Need Reserved Instances for cost" β must use Provisioned (Serverless doesn't support RI)
- "Auto-scales compute for data warehouse" β Redshift Serverless
Redshift Serverless gives you the full Redshift engine with zero ops β auto-scaling, no idle cost, no cluster management. Pay $0.375/RPU-hour only when queries run. Choose it for variable workloads, dev/test, or getting started. Choose Provisioned for steady 18+hr/day workloads where Reserved Instances make it cheaper. Both support Spectrum, Data Sharing, and the full Redshift feature set.
Cost & Best Practices
Redshift costs depend on whether you use Provisioned (node-hours) or Serverless (RPU-hours), plus storage and optional features. Understanding the cost model is critical β Redshift can be very cost-efficient or very expensive depending on configuration choices.
| Component | Provisioned | Serverless |
|---|---|---|
| Compute | $0.25β$13.04/node-hour (depends on node type) | $0.375/RPU-hour (billed per-second) |
| Storage (RA3) | $0.024/GB-month (managed storage) | $0.024/GB-month (same) |
| Spectrum | $5/TB scanned from S3 | $5/TB scanned from S3 |
| Concurrency Scaling | 1 hr/day free; then on-demand node rate | Included (auto-scales) |
| Backup storage | Free up to cluster size; $0.024/GB above | Free up to storage used; $0.024/GB above |
| Reserved Instances | Up to 75% savings with 1yr/3yr commitment | Not available |
| Data Transfer | Standard AWS data transfer rates | Standard AWS data transfer rates |
| # | Optimization | Impact | How |
|---|---|---|---|
| 1 | Reserved Instances | Up to 75% savings | Commit 1yr or 3yr for steady workloads (provisioned only) |
| 2 | Right-size cluster | 30β50% savings | Monitor CPU/disk utilisation; resize down if under-utilised |
| 3 | Pause idle clusters | $0 compute when paused | Schedule pause/resume for dev/test (provisioned) |
| 4 | Hot/cold with Spectrum | Massive storage savings | Keep recent data local; move old data to S3, query via Spectrum |
| 5 | Compression | 3β10Γ storage reduction | Use COPY (auto-compresses); run ANALYZE COMPRESSION |
| 6 | Serverless for variable loads | Zero idle cost | Switch dev/test or bursty workloads to Serverless |
| 7 | Use result cache | Free repeated queries | Identical queries hit cache (enabled by default) |
Performance Practices
- Set distribution key on most common join column
- Set sort key on most common filter column (usually date)
- Use COPY for all bulk loads (never INSERT)
- Split load files = number of slices
- Run VACUUM regularly (or use auto-vacuum)
- Run ANALYZE after large loads (statistics update)
- Use Parquet for Spectrum queries
- Leverage result caching for dashboard queries
Operational Practices
- Enable automatic snapshots (point-in-time recovery)
- Cross-region snapshot copy for DR
- Use WLM queues to isolate ETL from dashboards
- Monitor with CloudWatch: CPU, disk, query latency
- Enable audit logging for compliance
- Use VPC + security groups (no public access)
- Encrypt at rest (KMS) and in transit (SSL)
- Use IAM roles for COPY/UNLOAD (not credentials)
Using INSERT Instead of COPY
Row-by-row INSERTs are 100β1000Γ slower than COPY for bulk loading. Always stage data in S3 and use COPY. INSERT is only for single-row corrections.
Wrong Distribution Style
EVEN distribution on a large fact table that's frequently joined causes massive data shuffling. Identify your primary join column and use KEY distribution on both tables.
No Sort Key on Date Columns
Most analytics queries filter by date. Without a sort key on the date column, every query scans all blocks. Add a sort key and Redshift skips irrelevant blocks via zone maps.
Over-Provisioned Cluster Running 24/7
A 16-node cluster for a workload that queries 4 hours/day wastes 83% of spend. Either right-size + pause, or switch to Serverless for variable loads.
| If You Need⦠| Use⦠| Why |
|---|---|---|
| Ad-hoc queries on S3, no infrastructure | Athena | Serverless, pay per query, zero ops |
| Sub-second BI dashboards, high concurrency | Redshift (Provisioned or Serverless) | MPP warehouse, result caching, optimised joins |
| Full-text search and log analytics | OpenSearch | Inverted index, Kibana dashboards |
| Real-time streaming analytics | Kinesis + Flink | Sub-second on streaming data |
| ETL / data transformation | Glue ETL | Serverless Spark, format conversion |
| Query S3 from warehouse context | Redshift Spectrum | JOIN S3 data with local warehouse tables |
| Transactional database (OLTP) | RDS / Aurora | Row-based, optimised for writes |
- What: Managed columnar data warehouse β MPP, sub-second queries, petabyte-scale. Built for OLAP / BI / analytics.
- Architecture: Leader Node (coordinates) + Compute Nodes (store data in slices, execute in parallel). RA3 = recommended (compute/storage separation).
- Performance: Distribution keys (co-locate joins), Sort keys (zone map skipping), Compression (3β10Γ storage reduction), Result cache (free repeated queries).
- Loading: COPY from S3 is the only correct bulk method. Split files = slices. Use Parquet + GZIP. Never INSERT for bulk.
- Spectrum: Query S3 without loading. External tables via Glue Catalog. Hot/cold split pattern. $5/TB scanned.
- Serverless: Same engine, zero ops, pay per RPU-hour, $0 idle. Best for variable/bursty workloads.
- Cost: Provisioned = node-hours (RI for 75% savings). Serverless = RPU-hours ($0.375). Spectrum = $5/TB. Optimise: right-size, RI, pause, hot/cold.
- vs Athena: Redshift = fast, always-ready, high concurrency, complex joins. Athena = serverless, ad-hoc, simple, no infrastructure.
Redshift is what you reach for when Athena isn't fast enough. It's the enterprise-grade data warehouse β columnar, parallel, sub-second β designed for BI dashboards with hundreds of concurrent users querying billions of rows. Spectrum extends it to the data lake. Serverless removes the ops tax. Together with Athena (ad-hoc), Kinesis (streaming), and Glue (ETL), Redshift completes the AWS analytics stack as the high-performance query engine for structured data at scale.