SQL · Advanced

Subqueries & CTEs

A subquery is a query nested inside another query. A CTE (Common Table Expression) is a named subquery defined with WITH that you can reference like a table. Together they let you break complex logic into readable, composable steps — without creating temporary tables or views.

01 Database · 02 WHERE · 03 Sort · 04 Aggregate · 05 Joins · 06 Write · 07 Schema · 08 Indexes · 09 CTEs · 10 Window · 11 Advanced · 12 Production · 13 PL/SQL

Section One · Queries inside queries

Subqueries

A subquery is a SELECT wrapped in parentheses and used inside another statement. It can appear in three places: the SELECT list (scalar), the WHERE clause (filter), or the FROM clause (derived table). The key distinction is whether the subquery references the outer query — correlated vs uncorrelated.

Scalar subqueries in `SELECT`

A scalar subquery returns exactly one value (one row, one column). It runs once per outer row — useful for injecting a computed value alongside each row.

-- Each order with its total revenue alongside
SELECT o.id,
       o.placed_at,
       (SELECT SUM(oi.quantity * oi.unit_price)
        FROM   order_items oi
        WHERE  oi.order_id = o.id) AS order_total
FROM   orders o
WHERE  o.status != 'cancelled';

id   | placed_at           | order_total
-----+---------------------+------------
1001 | 2025-02-10 09:00:00 |       53.99
1002 | 2025-02-11 11:30:00 |       42.00
1003 | 2025-02-14 16:45:00 |       76.99

Gotcha: if the scalar subquery returns more than one row, the engine errors. Always ensure it returns exactly one value (use an aggregate or LIMIT 1).

Subqueries in `WHERE` — `IN` and `EXISTS`

The two most common WHERE-subquery patterns:

-- IN: "orders placed by users who joined in January"
SELECT *
FROM   orders
WHERE  user_id IN (
  SELECT id FROM users WHERE created_at >= '2025-01-01'
                         AND created_at <  '2025-02-01'
);

-- EXISTS: "users who have at least one paid order"
SELECT u.full_name
FROM   users u
WHERE  EXISTS (
  SELECT 1 FROM orders o
  WHERE  o.user_id = u.id AND o.status = 'paid'
);

IN (subquery)

When to use

The subquery returns a list of values. The outer query checks membership. Best when the inner list is small or can use an index.

EXISTS (subquery)

When to use

You only need to know if at least one row exists. The engine can stop searching after the first match. Often faster on large datasets with an index on the correlation column.

Correlated vs uncorrelated

An uncorrelated subquery runs once, independently of the outer query. The result is computed, then the outer query uses it. A correlated subquery references a column from the outer query — conceptually it re-runs for every outer row (though the optimiser often rewrites it).

-- Uncorrelated: inner query runs ONCE, returns a list
SELECT * FROM orders
WHERE user_id IN (SELECT id FROM users WHERE full_name LIKE 'A%');
-- The engine evaluates the inner SELECT, gets 1, then filters orders by that list.

-- Correlated: inner query references outer row (o.id)
SELECT o.id,
       (SELECT COUNT(*) FROM order_items oi WHERE oi.order_id = o.id) AS item_count
FROM   orders o;
-- Conceptually re-evaluates for each order (optimiser may hash-join instead).

Performance note: correlated subqueries look expensive (re-run per row) but modern query planners often transform them into joins internally. If EXPLAIN ANALYZE shows a nested loop with a seq scan inside, consider rewriting as a JOIN explicitly — sometimes the planner needs the hint.

The `NOT IN` NULL trap

Chapter 02 warned about this. NOT IN (subquery) breaks silently if the subquery returns any NULL value — the entire expression evaluates to NULL (unknown), and zero rows pass the filter.

-- BROKEN if any user_id in the subquery is NULL:
SELECT * FROM orders WHERE user_id NOT IN (SELECT manager_id FROM employees);
-- If manager_id has a NULL, this returns ZERO rows regardless of real data.

-- SAFE: use NOT EXISTS instead
SELECT * FROM orders o
WHERE  NOT EXISTS (
  SELECT 1 FROM employees e WHERE e.manager_id = o.user_id
);

Section Two · Named subqueries with WITH

Common Table Expressions (CTEs)

A CTE is a named, temporary result set defined with the WITH clause. It exists only for the duration of the statement that follows it. Think of it as giving a name to a subquery so you can reference it multiple times or compose complex logic in readable steps.

-- Basic CTE
WITH paid_orders AS (
  SELECT id, user_id, placed_at
  FROM   orders
  WHERE  status = 'paid'
)
SELECT u.full_name, po.placed_at
FROM   paid_orders po
JOIN   users u ON u.id = po.user_id;

Chaining multiple CTEs

You can define multiple CTEs separated by commas. Each one can reference CTEs defined above it — building a pipeline of transformations.

WITH order_totals AS (
  -- Step 1: compute each order's total
  SELECT order_id,
         SUM(quantity * unit_price) AS total
  FROM   order_items
  GROUP BY order_id
),
user_spending AS (
  -- Step 2: aggregate per user
  SELECT o.user_id,
         COUNT(*)       AS order_count,
         SUM(ot.total)  AS lifetime_spend
  FROM   orders o
  JOIN   order_totals ot ON ot.order_id = o.id
  WHERE  o.status != 'cancelled'
  GROUP BY o.user_id
)
-- Step 3: final result with user names
SELECT u.full_name,
       us.order_count,
       us.lifetime_spend
FROM   user_spending us
JOIN   users u ON u.id = us.user_id
ORDER BY us.lifetime_spend DESC;

full_name      | order_count | lifetime_spend
---------------+-------------+---------------
Ada Lovelace   |           2 |         130.98
Linus Torvalds |           1 |          42.00

Readability rule: if a query has more than two levels of nested subqueries, refactor into chained CTEs. Each CTE gets a descriptive name and can be tested independently by commenting out the final SELECT and replacing it with SELECT * FROM cte_name.

CTE vs subquery — performance implications

PostgreSQL (before v12)

CTEs were optimisation fences

The planner materialised the CTE and couldn’t push predicates into it. A CTE could be slower than an equivalent subquery. Fixed in v12+ with automatic inlining.

PostgreSQL 12+ / MySQL 8+

CTEs are inlined by default

The planner treats a non-recursive CTE like a view — it can push predicates down, eliminate unused columns, and merge joins. Performance is the same as a subquery.

When to force materialisation

`MATERIALIZED` hint

If the CTE is referenced multiple times and is expensive to compute, add AS MATERIALIZED. The result is computed once and reused. PostgreSQL 12+ only.

SQLite

Always materialises

SQLite stores each CTE result in a temp table. For small results this is fine; for large intermediate sets, consider rewriting as a subquery or a temp table.

-- PostgreSQL 12+: force materialisation (compute once, reuse twice)
WITH expensive_calc AS MATERIALIZED (
  SELECT product_id, SUM(quantity) AS total_sold
  FROM   order_items
  GROUP BY product_id
)
SELECT * FROM expensive_calc WHERE total_sold > 5
UNION ALL
SELECT * FROM expensive_calc WHERE total_sold = 1;

Section Three · Walking hierarchies

Recursive CTEs

A recursive CTE is a CTE that references itself. It has two parts: a base case (the seed rows) and a recursive step (joins back to the CTE to discover more rows). The engine iterates until the recursive step produces no new rows.

Recursive CTE anatomy

Traversing a hierarchy (org chart)

-- Using the employees table from Chapter 05 §4
WITH RECURSIVE org_tree AS (
  -- Base case: the root (CEO, no manager)
  SELECT id, name, manager_id, 0 AS depth
  FROM   employees
  WHERE  manager_id IS NULL

  UNION ALL

  -- Recursive step: find direct reports of the previous level
  SELECT e.id, e.name, e.manager_id, ot.depth + 1
  FROM   employees e
  JOIN   org_tree ot ON ot.id = e.manager_id
)
SELECT depth, name, manager_id
FROM   org_tree
ORDER BY depth, name;

depth | name   | manager_id
------+--------+-----------
    0 | Ada    | NULL
    1 | Grace  | 1
    1 | Linus  | 1
    2 | Donald | 2

Category tree (another common use case)

-- Hypothetical categories table with parent_id
CREATE TABLE categories (
  id        INTEGER PRIMARY KEY,
  name      TEXT NOT NULL,
  parent_id INTEGER REFERENCES categories(id)
);
INSERT INTO categories VALUES
  (1, 'Electronics', NULL),
  (2, 'Computers',   1),
  (3, 'Laptops',     2),
  (4, 'Desktops',    2),
  (5, 'Phones',      1);

-- All descendants of "Electronics"
WITH RECURSIVE descendants AS (
  SELECT id, name, 0 AS depth FROM categories WHERE id = 1
  UNION ALL
  SELECT c.id, c.name, d.depth + 1
  FROM   categories c
  JOIN   descendants d ON d.id = c.parent_id
)
SELECT * FROM descendants ORDER BY depth, name;

Avoiding infinite loops

If data has a cycle (A → B → C → A), the recursion never terminates. Two defences:

-- Defence 1: LIMIT the recursion depth
WITH RECURSIVE tree AS (
  SELECT id, name, manager_id, 0 AS depth FROM employees WHERE manager_id IS NULL
  UNION ALL
  SELECT e.id, e.name, e.manager_id, t.depth + 1
  FROM   employees e JOIN tree t ON t.id = e.manager_id
  WHERE  t.depth < 20   -- hard stop after 20 levels
)
SELECT * FROM tree;

-- Defence 2 (PostgreSQL 14+): CYCLE detection
WITH RECURSIVE tree AS (
  SELECT id, name, manager_id, ARRAY[id] AS path
  FROM   employees WHERE manager_id IS NULL
  UNION ALL
  SELECT e.id, e.name, e.manager_id, t.path || e.id
  FROM   employees e JOIN tree t ON t.id = e.manager_id
  WHERE  e.id != ALL(t.path)  -- skip if already visited
)
CYCLE id SET is_cycle USING path  -- PostgreSQL 14+ syntax
SELECT * FROM tree WHERE NOT is_cycle;

Path tracking

ARRAY-based cycle detection

Accumulate visited IDs in an array. Skip rows whose ID is already in the path. Works on PostgreSQL 9.1+.

CYCLE clause

PostgreSQL 14+ built-in

CYCLE id SET is_cycle USING path lets the engine detect and mark cycles automatically. Cleaner syntax.

MySQL 8+

max_execution_time or depth limit

MySQL doesn’t have CYCLE syntax. Use a depth counter with a WHERE guard, or set cte_max_recursion_depth (default 1000).

Warning

Cycles mean bad data

A cycle in a hierarchy (A manages B who manages A) is almost always a data bug. Fix the data, don’t just cap recursion.

Mental shortcut for the whole chapter:

Subqueries: scalar (one value in SELECT), filter (IN / EXISTS in WHERE), or derived table (in FROM).
Correlated = references outer row, conceptually re-runs per row. Uncorrelated = runs once.
NOT IN breaks on NULLs. Use NOT EXISTS instead.
CTEs (WITH) = named subqueries. Chain them for readability. In PG 12+ / MySQL 8+, performance = same as inline subquery.
Recursive CTEs: base case UNION ALL recursive step. Use for hierarchies (org charts, category trees, graphs).
Always guard against infinite loops: depth limit or cycle detection.

← Chapter 08 — Indexes & Performance · → Continue to Chapter 10 — Window Functions →

LearningTree

Subqueries & CTEs | SQL Tutorial 09

Subqueries & CTEs

Subqueries

Scalar subqueries in `SELECT`

Subqueries in `WHERE` — `IN` and `EXISTS`

When to use

When to use

Correlated vs uncorrelated

The `NOT IN` NULL trap

Common Table Expressions (CTEs)

Chaining multiple CTEs

CTE vs subquery — performance implications

CTEs were optimisation fences

CTEs are inlined by default

`MATERIALIZED` hint

Always materialises

Recursive CTEs

Traversing a hierarchy (org chart)

Category tree (another common use case)

Avoiding infinite loops

ARRAY-based cycle detection

PostgreSQL 14+ built-in

max_execution_time or depth limit

Cycles mean bad data

Subqueries

Scalar subqueries in SELECT

Subqueries in WHERE — IN and EXISTS

When to use

When to use

Correlated vs uncorrelated

The NOT IN NULL trap

Common Table Expressions (CTEs)

Chaining multiple CTEs

CTE vs subquery — performance implications

CTEs were optimisation fences

CTEs are inlined by default

MATERIALIZED hint

Always materialises

Recursive CTEs

Traversing a hierarchy (org chart)

Category tree (another common use case)

Avoiding infinite loops

ARRAY-based cycle detection

PostgreSQL 14+ built-in

max_execution_time or depth limit

Cycles mean bad data

Scalar subqueries in `SELECT`

Subqueries in `WHERE` — `IN` and `EXISTS`

The `NOT IN` NULL trap

`MATERIALIZED` hint