Amazon Redshift
LearningTree Β· AWS Β· Analytics

Amazon Redshift β€”
Cloud Data Warehouse

A fully managed, petabyte-scale columnar data warehouse optimised for complex analytical queries. Redshift delivers sub-second performance on billions of rows β€” powering enterprise BI dashboards, data science workloads, and business-critical reporting at scale.

01
Chapter One Β· Analytics

What is Amazon Redshift?

Amazon Redshift is a fully managed, columnar data warehouse designed for Online Analytical Processing (OLAP). It stores structured data in a compressed columnar format, distributes it across multiple nodes, and executes massively parallel queries β€” returning results in seconds even over billions of rows and petabytes of data.

Data Warehouse vs Data Lake β€” The Key Distinction Introductory
AspectData Warehouse (Redshift)Data Lake (S3 + Athena)
Data modelStructured, schema-on-write (predefined tables)Any format, schema-on-read
Query speedSub-second to seconds (optimised)Seconds to minutes (scan-based)
ConcurrencyHigh (hundreds of simultaneous dashboard users)Lower (20-25 concurrent queries default)
Data loadingETL required β€” data loaded into warehouseData stays in S3, no loading
Best forBI dashboards, repeatable reports, complex joinsAd-hoc exploration, infrequent queries, raw data
Cost modelPer-node-hour (always running) or serverless RPUPer-query / per-TB scanned
InfrastructureManaged cluster of compute nodesServerless (no infrastructure)
🧠 Mental Model β€” The High-Performance Research Library

If S3 + Athena is a giant library where you search through shelves on demand, Redshift is a research institute that has already indexed, compressed, and cross-referenced every book. When an analyst asks a question, the answer comes back in milliseconds β€” because the data is pre-organised for fast retrieval. The cost is maintaining the institute (running nodes), but the speed is unmatched.

Why Redshift Exists β€” What It Solves Introductory
❌

Problems Redshift Solves

  • Athena too slow for complex repeated BI queries
  • Need sub-second response for 100+ dashboard users
  • Complex multi-table joins across TBs of data
  • Predictable, consistent query performance required
  • Traditional on-prem warehouse too expensive to scale
βœ…

What Redshift Provides

  • Columnar storage β€” only reads needed columns
  • Massively Parallel Processing (MPP) across nodes
  • Automatic compression (up to 10Γ— storage savings)
  • Result caching β€” repeated queries return instantly
  • Scales from 160GB to petabytes with node additions
Redshift in the AWS Analytics Stack Core
πŸ”

Athena β€” Ad-Hoc

Serverless. Pay per query. Best for exploration, infrequent queries, raw data in S3. No infrastructure.

🏒

Redshift β€” Enterprise BI

Managed warehouse. Sub-second queries. Best for dashboards, complex joins, high concurrency, repeatable reports.

πŸ”Ž

OpenSearch β€” Search

Full-text search + log analytics. Best for observability, log exploration, Kibana dashboards.

Architecture Overview Core
Amazon Redshift β€” Cluster Architecture
BI CLIENTS QuickSight Tableau / JDBC SQL LEADER NODE Receives queries Creates execution plan Aggregates results No data storage COMPUTE NODE 1 Slices: CPU + RAM + local SSD Columnar blocks COMPUTE NODE 2 Slices: CPU + RAM + local SSD Columnar blocks COMPUTE NODE N Slices: CPU + RAM + local SSD Columnar blocks S3 Auto backups + Spectrum data Leader receives SQL β†’ distributes to Compute Nodes (MPP) β†’ aggregates β†’ returns results
Leader Node coordinates queries; Compute Nodes store data in columnar slices and execute in parallel (MPP)
Key Redshift Concepts Core
ConceptWhat It IsWhy It Matters
Columnar StorageData stored by column, not by rowReads only queried columns β€” 10Γ— less I/O than row-based
MPP (Massively Parallel)Query distributed across all compute nodes simultaneouslyLinear performance scaling with nodes added
Leader NodeReceives SQL, plans execution, returns resultsNo data storage; coordinates the cluster
Compute NodesStore data + execute query fragments (slices)Each node has CPU, RAM, SSD; data is partitioned across them
SlicesSubdivisions of a compute node (virtual processors)Each slice processes a portion of data in parallel
Distribution StyleHow rows are distributed across nodes (KEY, ALL, EVEN)Bad distribution = data shuffling = slow joins
Sort KeyColumn(s) that determine physical row order on diskZone maps allow skipping irrelevant blocks (huge speedup)
Result CacheRepeated identical queries return cached resultsDashboard refresh with zero compute cost
Redshift Provisioned vs Serverless Core
AspectRedshift ProvisionedRedshift Serverless
InfrastructureYou choose node type + countAWS manages capacity automatically
ScalingManual resize or elastic resizeAuto-scales based on workload
CostPer-node-hour (always running)Per RPU-hour (only when queries run)
Idle costFull cost even when idle$0 when no queries running
Best forPredictable workloads, maximum control, cost-optimised steady stateVariable workloads, getting started, dev/test
AdministrationManage cluster, vacuuming, nodesZero admin β€” fully managed
🎯 Exam Insight
  • "Fast analytical queries on structured data" or "data warehouse" β†’ Redshift
  • "Sub-second dashboard queries" or "high concurrency BI" β†’ Redshift (not Athena)
  • "Columnar storage" β†’ Redshift (or Parquet in S3 for Athena)
  • "Massively Parallel Processing" β†’ Redshift architecture
  • "Redshift vs Athena" β†’ Redshift = managed warehouse (fast, always-on). Athena = serverless, pay-per-query, ad-hoc
  • "Query S3 data from Redshift without loading" β†’ Redshift Spectrum
  • "Serverless data warehouse" β†’ Redshift Serverless (not Athena β€” Athena = query engine, not DW)
  • "Distribution key" and "sort key" = critical Redshift performance tuning concepts
Chapter 01 β€” Key Takeaway

Amazon Redshift is a managed columnar data warehouse for fast analytical queries at scale. It uses MPP across compute nodes for sub-second performance on billions of rows. Choose Redshift over Athena when you need consistent sub-second latency, high concurrency for dashboards, or complex multi-table joins. Choose Redshift Serverless for variable workloads with zero idle cost, or Provisioned for predictable workloads at optimal cost.

02
Chapter Two Β· Analytics

Architecture & Performance Tuning

Redshift performance depends on three design choices: distribution style (how data is spread across nodes), sort keys (how data is ordered on disk), and compression (how columns are encoded). Master these and your queries run 10–100Γ— faster.

Node Types Core
Node TypeStorageBest ForExample
RA3 (Recommended)Managed storage (S3-backed, unlimited)Most workloads β€” separates compute from storagera3.xlplus, ra3.4xlarge, ra3.16xlarge
DC2 (Dense Compute)Local SSD (fixed per node)Small datasets (<1TB) needing fastest I/Odc2.large, dc2.8xlarge
DS2 (Dense Storage)Local HDD (large capacity)Legacy β€” use RA3 instead for new clustersds2.xlarge, ds2.8xlarge
πŸ’‘ RA3 is the Default Choice

RA3 nodes decouple compute from storage β€” data lives in S3 (managed storage) with hot data cached locally. You can resize compute independently of data size. Always choose RA3 for new clusters unless you have <1TB and need lowest latency (DC2).

Distribution Styles β€” How Data Spreads Across Nodes Core
StyleHow It WorksBest ForGotcha
KEYRows with same key value go to same nodeLarge fact tables joined on that key (e.g. customer_id)Skewed key = hot node
ALLFull copy on every nodeSmall dimension tables (<few million rows) joined by all nodesWrite amplification; only for small tables
EVENRound-robin distribution (default)Tables not frequently joined; staging tablesJoins require data shuffling between nodes
AUTORedshift chooses (ALL for small, EVEN for large)When unsure β€” let Redshift optimiseMay not pick KEY when you need it
🧠 Distribution Key Rule

Set the distribution key on the column you most frequently JOIN on. If orders and customers join on customer_id, distribute both tables on customer_id. This co-locates matching rows on the same node β€” eliminating expensive network shuffles.

Sort Keys β€” Physical Row Ordering Core
Sort Key TypeHow It WorksBest For
CompoundMulti-column sort (col1, then col2, then col3)Queries that filter on the prefix columns (like a composite index)
InterleavedEqual weight to each sort columnAd-hoc queries filtering on any subset of columns
AUTORedshift maintains sort automaticallyDefault β€” good enough for most workloads

Sort keys enable zone maps β€” metadata that tracks min/max values per 1MB block. When filtering, Redshift skips entire blocks that can't match, dramatically reducing I/O.

Compression (Encoding) Core

Redshift automatically applies column-level compression encodings when data is loaded via COPY. This typically reduces storage by 3–10Γ— and improves query speed (less I/O).

πŸ—œοΈ

Common Encodings

  • AZ64 β€” AWS proprietary, best for numeric/date
  • LZO β€” general-purpose for large VARCHAR
  • ZSTD β€” excellent compression ratio, newer
  • RunLength β€” sorted columns with repeated values
  • RAW β€” no compression (sort key columns)
πŸ’‘

Best Practices

  • Use COPY command β€” auto-selects optimal encoding
  • Run ANALYZE COMPRESSION on existing tables
  • Sort key column = RAW encoding (Redshift enforces this)
  • Don't manually specify unless you have a reason
  • AZ64 is default for numeric types since 2020
Concurrency Scaling Deep

When query load exceeds cluster capacity, Redshift can automatically spin up additional clusters to handle the overflow β€” and spin them back down when demand decreases.

  • Enabled per WLM (Workload Management) queue
  • Additional clusters handle read queries only
  • You get 1 hour/day of free concurrency scaling per cluster
  • Beyond free tier: billed per-second at on-demand node rate
  • Ideal for: morning dashboard rush, month-end reporting spikes
AQUA (Advanced Query Accelerator) Deep

AQUA pushes filtering and aggregation down to the storage layer (RA3 managed storage). Instead of moving all data to compute nodes for processing, AQUA applies predicates at the storage level β€” reducing data movement by up to 10Γ—. Available automatically on RA3 nodes.

🎯 Exam Insight
  • "Optimise join performance" β†’ set distribution key on the join column (co-locate data)
  • "Speed up range queries on date" β†’ set sort key on the date column (zone map skipping)
  • "Reduce storage" β†’ compression encodings (automatic with COPY). 3–10Γ— savings.
  • "Handle query spikes without resizing" β†’ Concurrency Scaling
  • "RA3 vs DC2" β†’ RA3 = compute+storage separation (recommended). DC2 = fastest local SSD (<1TB)
  • "Data shuffling" β†’ bad distribution style forces network transfer between nodes during joins
  • "Separate compute from storage" β†’ RA3 nodes (data in managed S3-backed storage)
Chapter 02 β€” Key Takeaway

Redshift performance depends on distribution keys (co-locate joined data on same node), sort keys (enable zone map block skipping), and compression (reduce I/O). Choose RA3 nodes to separate compute from storage. Use concurrency scaling for query spikes. The single biggest performance win: distribute your large fact table on the column you JOIN on most frequently.

03
Chapter Three Β· Analytics

Loading Data into Redshift

Data must be loaded into Redshift before it can be queried at full speed (unlike Athena which reads S3 directly). The COPY command is the primary mechanism β€” parallel, compressed, and optimised for bulk loading from S3.

Loading Methods Compared Core
MethodHowSpeedBest For
COPY from S3Parallel load from S3 files into RedshiftFastest (parallel across slices)Bulk loads, ETL pipelines, primary method
COPY from DynamoDBDirect parallel copy from DynamoDB tableFastAnalytics on DynamoDB data
Kinesis FirehoseStream delivery (Firehose β†’ S3 β†’ COPY into Redshift)Near-real-timeStreaming data into warehouse
INSERTSingle-row SQL insertsVery slowAvoid for bulk loads β€” only for one-off corrections
AWS DMSContinuous replication from RDS/on-premOngoing CDCKeep Redshift in sync with OLTP databases
Glue ETLTransform + load via Spark jobsVariableComplex transforms before loading
COPY Command Best Practices Core
βœ…

Do This

  • Split input files into multiple chunks (β‰ˆ number of slices)
  • Use compressed files (GZIP, LZO, ZSTD)
  • Use columnar formats (Parquet, ORC) when possible
  • Use the same region as your Redshift cluster (no cross-region)
  • Use a manifest file for precise file list control
  • COPY auto-applies optimal compression encoding
❌

Avoid This

  • Single large file (can't parallelise across slices)
  • INSERT statements for bulk data (1000Γ— slower)
  • Cross-region COPY (adds latency + data transfer cost)
  • Uncompressed files (wastes I/O bandwidth)
  • Millions of tiny files (overhead per file)
  • COPY without specifying format/delimiter correctly
πŸ’‘ The Magic Number: Files = Slices

For maximum COPY parallelism, split your data into N files where N = number of slices in your cluster (or a multiple). Each slice loads one file simultaneously. A cluster with 4 nodes Γ— 8 slices = 32 slices, so split into 32 or 64 files for optimal loading speed.

UNLOAD β€” Exporting Data from Redshift Core

The UNLOAD command does the reverse of COPY β€” exports query results from Redshift back to S3 as Parquet, CSV, or JSON files. Used for:

  • Sharing processed data with other systems
  • Feeding Athena or EMR from Redshift-processed results
  • Archiving old data to S3 before deleting from cluster
  • Creating snapshots of specific query results
Data Loading Architecture Core
Common Data Loading Patterns into Redshift
RDS / On-Prem Kinesis / Streams Applications S3 STAGING Parquet / GZIP Split files COPY REDSHIFT Parallel load across all slices UNLOAD S3 OUTPUT Results / Archive β†’ Athena / EMR Sources β†’ S3 (staging) β†’ COPY (parallel) β†’ Redshift β†’ UNLOAD β†’ S3 (output)
🎯 Exam Insight
  • "Fastest way to load data into Redshift" β†’ COPY from S3 (parallel, compressed, split files)
  • "INSERT vs COPY" β†’ COPY is orders of magnitude faster for bulk. INSERT = single-row, very slow.
  • "Split input files" β†’ match number of slices for parallel loading
  • "Stream data into Redshift" β†’ Kinesis Firehose (lands in S3, then COPY)
  • "Export Redshift data to S3" β†’ UNLOAD command
  • "Continuous sync from RDS" β†’ AWS DMS with CDC to Redshift
  • "COPY supports Parquet" β†’ yes, and it's the best format for loading (columnar + compressed)
Chapter 03 β€” Key Takeaway

Always load data with COPY from S3 β€” never INSERT for bulk. Split files into N chunks matching your slice count for maximum parallelism. Use compressed columnar formats (Parquet preferred). Stage data in the same region as your cluster. For streaming ingestion, use Firehose β†’ S3 β†’ COPY. For exports, use UNLOAD to write results back to S3 for Athena or sharing.

04
Chapter Four Β· Analytics

Redshift Spectrum β€” Query S3 Without Loading

Redshift Spectrum lets you run Redshift SQL queries directly against data stored in S3 β€” without loading it into the cluster. It extends your Redshift warehouse to include the entire data lake, combining fast local queries on loaded hot data with on-demand access to cold data in S3.

How Spectrum Works Core
Redshift Spectrum β€” Query Architecture
BI CLIENT SQL query REDSHIFT Leader + Compute Local data (hot) Joins local + S3 external SPECTRUM LAYER Thousands of nodes Filter, aggregate Only return relevant rows scan S3 DATA LAKE Parquet / CSV / ORC Partitioned Cold / archive data GLUE CATALOG (schema)
Redshift sends external table queries to Spectrum's distributed layer, which scans S3 and returns only matching rows
Spectrum vs Athena vs Loading β€” When to Use Each Core
ApproachWhen to UseProsCons
Load into Redshift (COPY)Hot data queried frequently, sub-second neededFastest queries; full indexing, sort keys, cachingStorage cost; must load/maintain
Spectrum (external tables)Cold/historical data; join S3 data with local tablesNo loading; query PBs of S3 data; extend warehouseSlower than local; billed per TB scanned
Athena (standalone)Ad-hoc queries without Redshift cluster; serverlessNo cluster needed; pay per query onlyNo joins with Redshift local data; lower concurrency
Setting Up Spectrum β€” External Schema + Tables Core

Spectrum uses external tables defined in the Glue Data Catalog (or a Hive metastore). The key steps:

  1. Create external schema β€” points Redshift at a Glue Catalog database
  2. External tables β€” already exist in Glue (from crawlers or manual DDL)
  3. Query β€” SELECT * FROM external_schema.table WHERE ...
  4. Join β€” combine local Redshift tables with external S3 tables in one query
Common Spectrum Patterns Core
πŸ”₯

Hot/Cold Split

  • Recent data (last 90 days) loaded into Redshift local
  • Historical data (years) stays in S3 as Parquet
  • Queries join both seamlessly via UNION ALL or JOIN
  • Saves massive cluster storage cost
πŸ”—

Data Lake Extension

  • Redshift has curated dimension tables locally
  • Spectrum queries raw fact tables in S3 data lake
  • Join enriched local dims with massive S3 facts
  • Best of both: warehouse speed + lake scale
Spectrum Performance Tips Deep
  • Use Parquet or ORC β€” columnar formats let Spectrum skip irrelevant columns
  • Partition data β€” date-based partitions let Spectrum skip entire folders
  • Push predicates β€” filter in WHERE clause (Spectrum filters at the S3 layer)
  • Avoid SELECT * β€” only select needed columns to reduce data scanned
  • Large files (128MB–512MB) β€” avoid millions of tiny files
🎯 Exam Insight
  • "Query S3 data from Redshift without loading" β†’ Redshift Spectrum
  • "Join local Redshift tables with S3 data" β†’ Spectrum external tables
  • "Extend data warehouse to petabytes without adding nodes" β†’ Spectrum
  • "Hot/cold data architecture" β†’ recent in Redshift, historical in S3 via Spectrum
  • "Spectrum uses Glue Catalog" β†’ same catalog as Athena and EMR (shared metadata)
  • "Spectrum cost" β†’ $5 per TB of S3 data scanned (same as Athena pricing)
  • "Spectrum vs Athena" β†’ Spectrum = query S3 from Redshift (can join local). Athena = standalone serverless.
Chapter 04 β€” Key Takeaway

Redshift Spectrum extends your warehouse to the entire S3 data lake without loading data. Use it for hot/cold splits (recent data local, historical in S3), joining dimension tables with massive S3 fact tables, or querying petabytes without adding cluster nodes. It shares the Glue Catalog with Athena and costs $5/TB scanned. Use Parquet + partitioning for best Spectrum performance.

05
Chapter Five Β· Analytics

Redshift Serverless

Redshift Serverless removes all cluster management. You get the same Redshift engine, same SQL, same performance β€” but AWS manages capacity, scaling, and infrastructure. You pay only when queries run (measured in RPU-hours). Zero cost when idle.

How Serverless Differs from Provisioned Core
ConcernProvisioned ClusterRedshift Serverless
CapacityYou choose node type + countYou set base RPU (32–512); auto-scales
Idle costFull price 24/7 even if no queries$0 when idle (after cooldown)
ScalingManual resize (minutes of downtime)Scales up/down automatically per workload
MaintenanceVacuum, analyze, WLM tuning, patchesAll automatic β€” zero ops
SpectrumYesYes (same external tables)
Data sharingYes (producer/consumer)Yes
Pricing unitPer-node-hourPer RPU-hour ($0.375 per RPU-hour)
Best forSteady, predictable workloads at lowest costVariable, bursty, or getting-started workloads
RPU β€” Redshift Processing Units Core

RPU is the compute unit for Serverless. You set a base capacity (minimum 32 RPU) and Serverless auto-scales from there:

πŸ“Š

Base Capacity

  • Minimum: 32 RPU (~equivalent to 4 RA3.xlplus nodes)
  • Maximum: 512 RPU
  • Higher base = faster queries but more cost
  • Auto-scales above base during peak demand
πŸ’°

Pricing

  • $0.375 per RPU-hour (compute only)
  • Billed per-second with 60-second minimum per query
  • Storage: $0.024/GB-month (same as RA3 managed)
  • Idle = $0 compute (storage still billed)
  • Cooldown: ~5 minutes before fully idle
When to Choose Serverless vs Provisioned Core
βœ…

Choose Serverless When

  • Workload is unpredictable or bursty
  • Getting started / proof of concept
  • Dev/test environments (idle most of the day)
  • You want zero administration
  • Occasional heavy queries but mostly idle
  • Don't want to manage WLM, vacuum, resize
🏒

Choose Provisioned When

  • Queries run 18+ hours/day (cheaper at steady state)
  • Need reserved instances (1yr/3yr) for cost savings
  • Want full control over node types and cluster config
  • Need deterministic pricing (budget predictability)
  • Complex WLM queue tuning required
  • Very high concurrency requiring concurrency scaling config
Serverless Features Core
FeatureAvailable in Serverless?Notes
Spectrum (external S3 queries)βœ… YesSame as provisioned
Data Sharingβœ… YesCross-account, cross-region
Federated Queryβœ… YesQuery RDS/Aurora from Redshift
ML (CREATE MODEL)βœ… YesBuilt-in ML with SageMaker
Materialized Viewsβœ… YesAuto-refresh supported
Snapshotsβœ… YesRecovery points
VPCβœ… YesRuns in your VPC
Reserved Instances❌ NoOnly provisioned supports RI pricing
Node type selection❌ NoAWS manages infrastructure
🎯 Exam Insight
  • "Serverless data warehouse, zero idle cost" β†’ Redshift Serverless
  • "RPU" = Redshift Processing Unit β€” compute billing unit for Serverless
  • "Variable analytics workload, pay only when querying" β†’ Redshift Serverless
  • "Same Redshift SQL and features, no cluster management" β†’ Serverless
  • "Redshift Serverless vs Athena" β†’ Both are pay-per-use, but Redshift Serverless = full warehouse (joins, caching, concurrency). Athena = simpler query engine on S3.
  • "Need Reserved Instances for cost" β†’ must use Provisioned (Serverless doesn't support RI)
  • "Auto-scales compute for data warehouse" β†’ Redshift Serverless
Chapter 05 β€” Key Takeaway

Redshift Serverless gives you the full Redshift engine with zero ops β€” auto-scaling, no idle cost, no cluster management. Pay $0.375/RPU-hour only when queries run. Choose it for variable workloads, dev/test, or getting started. Choose Provisioned for steady 18+hr/day workloads where Reserved Instances make it cheaper. Both support Spectrum, Data Sharing, and the full Redshift feature set.

06
Chapter Six Β· Analytics

Cost & Best Practices

Redshift costs depend on whether you use Provisioned (node-hours) or Serverless (RPU-hours), plus storage and optional features. Understanding the cost model is critical β€” Redshift can be very cost-efficient or very expensive depending on configuration choices.

Complete Pricing Model Core
ComponentProvisionedServerless
Compute$0.25–$13.04/node-hour (depends on node type)$0.375/RPU-hour (billed per-second)
Storage (RA3)$0.024/GB-month (managed storage)$0.024/GB-month (same)
Spectrum$5/TB scanned from S3$5/TB scanned from S3
Concurrency Scaling1 hr/day free; then on-demand node rateIncluded (auto-scales)
Backup storageFree up to cluster size; $0.024/GB aboveFree up to storage used; $0.024/GB above
Reserved InstancesUp to 75% savings with 1yr/3yr commitmentNot available
Data TransferStandard AWS data transfer ratesStandard AWS data transfer rates
Cost Optimization Checklist Core
#OptimizationImpactHow
1Reserved InstancesUp to 75% savingsCommit 1yr or 3yr for steady workloads (provisioned only)
2Right-size cluster30–50% savingsMonitor CPU/disk utilisation; resize down if under-utilised
3Pause idle clusters$0 compute when pausedSchedule pause/resume for dev/test (provisioned)
4Hot/cold with SpectrumMassive storage savingsKeep recent data local; move old data to S3, query via Spectrum
5Compression3–10Γ— storage reductionUse COPY (auto-compresses); run ANALYZE COMPRESSION
6Serverless for variable loadsZero idle costSwitch dev/test or bursty workloads to Serverless
7Use result cacheFree repeated queriesIdentical queries hit cache (enabled by default)
Production Best Practices Core
βœ…

Performance Practices

  • Set distribution key on most common join column
  • Set sort key on most common filter column (usually date)
  • Use COPY for all bulk loads (never INSERT)
  • Split load files = number of slices
  • Run VACUUM regularly (or use auto-vacuum)
  • Run ANALYZE after large loads (statistics update)
  • Use Parquet for Spectrum queries
  • Leverage result caching for dashboard queries
πŸ—οΈ

Operational Practices

  • Enable automatic snapshots (point-in-time recovery)
  • Cross-region snapshot copy for DR
  • Use WLM queues to isolate ETL from dashboards
  • Monitor with CloudWatch: CPU, disk, query latency
  • Enable audit logging for compliance
  • Use VPC + security groups (no public access)
  • Encrypt at rest (KMS) and in transit (SSL)
  • Use IAM roles for COPY/UNLOAD (not credentials)
Common Mistakes Core
❌

Using INSERT Instead of COPY

Row-by-row INSERTs are 100–1000Γ— slower than COPY for bulk loading. Always stage data in S3 and use COPY. INSERT is only for single-row corrections.

❌

Wrong Distribution Style

EVEN distribution on a large fact table that's frequently joined causes massive data shuffling. Identify your primary join column and use KEY distribution on both tables.

❌

No Sort Key on Date Columns

Most analytics queries filter by date. Without a sort key on the date column, every query scans all blocks. Add a sort key and Redshift skips irrelevant blocks via zone maps.

❌

Over-Provisioned Cluster Running 24/7

A 16-node cluster for a workload that queries 4 hours/day wastes 83% of spend. Either right-size + pause, or switch to Serverless for variable loads.

Redshift vs Everything β€” Decision Matrix Core
If You Need…Use…Why
Ad-hoc queries on S3, no infrastructureAthenaServerless, pay per query, zero ops
Sub-second BI dashboards, high concurrencyRedshift (Provisioned or Serverless)MPP warehouse, result caching, optimised joins
Full-text search and log analyticsOpenSearchInverted index, Kibana dashboards
Real-time streaming analyticsKinesis + FlinkSub-second on streaming data
ETL / data transformationGlue ETLServerless Spark, format conversion
Query S3 from warehouse contextRedshift SpectrumJOIN S3 data with local warehouse tables
Transactional database (OLTP)RDS / AuroraRow-based, optimised for writes
Full Page Summary β€” Amazon Redshift Core
πŸ“‹ Amazon Redshift β€” Complete Recap
  • What: Managed columnar data warehouse β€” MPP, sub-second queries, petabyte-scale. Built for OLAP / BI / analytics.
  • Architecture: Leader Node (coordinates) + Compute Nodes (store data in slices, execute in parallel). RA3 = recommended (compute/storage separation).
  • Performance: Distribution keys (co-locate joins), Sort keys (zone map skipping), Compression (3–10Γ— storage reduction), Result cache (free repeated queries).
  • Loading: COPY from S3 is the only correct bulk method. Split files = slices. Use Parquet + GZIP. Never INSERT for bulk.
  • Spectrum: Query S3 without loading. External tables via Glue Catalog. Hot/cold split pattern. $5/TB scanned.
  • Serverless: Same engine, zero ops, pay per RPU-hour, $0 idle. Best for variable/bursty workloads.
  • Cost: Provisioned = node-hours (RI for 75% savings). Serverless = RPU-hours ($0.375). Spectrum = $5/TB. Optimise: right-size, RI, pause, hot/cold.
  • vs Athena: Redshift = fast, always-ready, high concurrency, complex joins. Athena = serverless, ad-hoc, simple, no infrastructure.
πŸ‘‰ Final Takeaway β€” Redshift in One Thought

Redshift is what you reach for when Athena isn't fast enough. It's the enterprise-grade data warehouse β€” columnar, parallel, sub-second β€” designed for BI dashboards with hundreds of concurrent users querying billions of rows. Spectrum extends it to the data lake. Serverless removes the ops tax. Together with Athena (ad-hoc), Kinesis (streaming), and Glue (ETL), Redshift completes the AWS analytics stack as the high-performance query engine for structured data at scale.