Subqueries & CTEs
A subquery is a query nested inside another query. A CTE (Common Table Expression) is a named subquery defined with WITH that you can reference like a table. Together they let you break complex logic into readable, composable steps — without creating temporary tables or views.
Subqueries
A subquery is a SELECT wrapped in parentheses and used inside another statement. It can appear in three places: the SELECT list (scalar), the WHERE clause (filter), or the FROM clause (derived table). The key distinction is whether the subquery references the outer query — correlated vs uncorrelated.
Scalar subqueries in SELECT
A scalar subquery returns exactly one value (one row, one column). It runs once per outer row — useful for injecting a computed value alongside each row.
-- Each order with its total revenue alongside
SELECT o.id,
o.placed_at,
(SELECT SUM(oi.quantity * oi.unit_price)
FROM order_items oi
WHERE oi.order_id = o.id) AS order_total
FROM orders o
WHERE o.status != 'cancelled'; id | placed_at | order_total
-----+---------------------+------------
1001 | 2025-02-10 09:00:00 | 53.99
1002 | 2025-02-11 11:30:00 | 42.00
1003 | 2025-02-14 16:45:00 | 76.99
Subqueries in WHERE — IN and EXISTS
The two most common WHERE-subquery patterns:
-- IN: "orders placed by users who joined in January"
SELECT *
FROM orders
WHERE user_id IN (
SELECT id FROM users WHERE created_at >= '2025-01-01'
AND created_at < '2025-02-01'
);
-- EXISTS: "users who have at least one paid order"
SELECT u.full_name
FROM users u
WHERE EXISTS (
SELECT 1 FROM orders o
WHERE o.user_id = u.id AND o.status = 'paid'
); When to use
The subquery returns a list of values. The outer query checks membership. Best when the inner list is small or can use an index.
When to use
You only need to know if at least one row exists. The engine can stop searching after the first match. Often faster on large datasets with an index on the correlation column.
Correlated vs uncorrelated
An uncorrelated subquery runs once, independently of the outer query. The result is computed, then the outer query uses it. A correlated subquery references a column from the outer query — conceptually it re-runs for every outer row (though the optimiser often rewrites it).
-- Uncorrelated: inner query runs ONCE, returns a list
SELECT * FROM orders
WHERE user_id IN (SELECT id FROM users WHERE full_name LIKE 'A%');
-- The engine evaluates the inner SELECT, gets 1, then filters orders by that list.
-- Correlated: inner query references outer row (o.id)
SELECT o.id,
(SELECT COUNT(*) FROM order_items oi WHERE oi.order_id = o.id) AS item_count
FROM orders o;
-- Conceptually re-evaluates for each order (optimiser may hash-join instead). EXPLAIN ANALYZE shows a nested loop with a seq scan inside, consider rewriting as a JOIN explicitly — sometimes the planner needs the hint.
The NOT IN NULL trap
Chapter 02 warned about this. NOT IN (subquery) breaks silently if the subquery returns any NULL value — the entire expression evaluates to NULL (unknown), and zero rows pass the filter.
-- BROKEN if any user_id in the subquery is NULL:
SELECT * FROM orders WHERE user_id NOT IN (SELECT manager_id FROM employees);
-- If manager_id has a NULL, this returns ZERO rows regardless of real data.
-- SAFE: use NOT EXISTS instead
SELECT * FROM orders o
WHERE NOT EXISTS (
SELECT 1 FROM employees e WHERE e.manager_id = o.user_id
); Common Table Expressions (CTEs)
A CTE is a named, temporary result set defined with the WITH clause. It exists only for the duration of the statement that follows it. Think of it as giving a name to a subquery so you can reference it multiple times or compose complex logic in readable steps.
-- Basic CTE
WITH paid_orders AS (
SELECT id, user_id, placed_at
FROM orders
WHERE status = 'paid'
)
SELECT u.full_name, po.placed_at
FROM paid_orders po
JOIN users u ON u.id = po.user_id; Chaining multiple CTEs
You can define multiple CTEs separated by commas. Each one can reference CTEs defined above it — building a pipeline of transformations.
WITH order_totals AS (
-- Step 1: compute each order's total
SELECT order_id,
SUM(quantity * unit_price) AS total
FROM order_items
GROUP BY order_id
),
user_spending AS (
-- Step 2: aggregate per user
SELECT o.user_id,
COUNT(*) AS order_count,
SUM(ot.total) AS lifetime_spend
FROM orders o
JOIN order_totals ot ON ot.order_id = o.id
WHERE o.status != 'cancelled'
GROUP BY o.user_id
)
-- Step 3: final result with user names
SELECT u.full_name,
us.order_count,
us.lifetime_spend
FROM user_spending us
JOIN users u ON u.id = us.user_id
ORDER BY us.lifetime_spend DESC; full_name | order_count | lifetime_spend
---------------+-------------+---------------
Ada Lovelace | 2 | 130.98
Linus Torvalds | 1 | 42.00
SELECT * FROM cte_name.
CTE vs subquery — performance implications
CTEs were optimisation fences
The planner materialised the CTE and couldn’t push predicates into it. A CTE could be slower than an equivalent subquery. Fixed in v12+ with automatic inlining.
CTEs are inlined by default
The planner treats a non-recursive CTE like a view — it can push predicates down, eliminate unused columns, and merge joins. Performance is the same as a subquery.
MATERIALIZED hint
If the CTE is referenced multiple times and is expensive to compute, add AS MATERIALIZED. The result is computed once and reused. PostgreSQL 12+ only.
Always materialises
SQLite stores each CTE result in a temp table. For small results this is fine; for large intermediate sets, consider rewriting as a subquery or a temp table.
-- PostgreSQL 12+: force materialisation (compute once, reuse twice)
WITH expensive_calc AS MATERIALIZED (
SELECT product_id, SUM(quantity) AS total_sold
FROM order_items
GROUP BY product_id
)
SELECT * FROM expensive_calc WHERE total_sold > 5
UNION ALL
SELECT * FROM expensive_calc WHERE total_sold = 1; Recursive CTEs
A recursive CTE is a CTE that references itself. It has two parts: a base case (the seed rows) and a recursive step (joins back to the CTE to discover more rows). The engine iterates until the recursive step produces no new rows.
Traversing a hierarchy (org chart)
-- Using the employees table from Chapter 05 §4
WITH RECURSIVE org_tree AS (
-- Base case: the root (CEO, no manager)
SELECT id, name, manager_id, 0 AS depth
FROM employees
WHERE manager_id IS NULL
UNION ALL
-- Recursive step: find direct reports of the previous level
SELECT e.id, e.name, e.manager_id, ot.depth + 1
FROM employees e
JOIN org_tree ot ON ot.id = e.manager_id
)
SELECT depth, name, manager_id
FROM org_tree
ORDER BY depth, name; depth | name | manager_id
------+--------+-----------
0 | Ada | NULL
1 | Grace | 1
1 | Linus | 1
2 | Donald | 2
Category tree (another common use case)
-- Hypothetical categories table with parent_id
CREATE TABLE categories (
id INTEGER PRIMARY KEY,
name TEXT NOT NULL,
parent_id INTEGER REFERENCES categories(id)
);
INSERT INTO categories VALUES
(1, 'Electronics', NULL),
(2, 'Computers', 1),
(3, 'Laptops', 2),
(4, 'Desktops', 2),
(5, 'Phones', 1);
-- All descendants of "Electronics"
WITH RECURSIVE descendants AS (
SELECT id, name, 0 AS depth FROM categories WHERE id = 1
UNION ALL
SELECT c.id, c.name, d.depth + 1
FROM categories c
JOIN descendants d ON d.id = c.parent_id
)
SELECT * FROM descendants ORDER BY depth, name; Avoiding infinite loops
If data has a cycle (A → B → C → A), the recursion never terminates. Two defences:
-- Defence 1: LIMIT the recursion depth
WITH RECURSIVE tree AS (
SELECT id, name, manager_id, 0 AS depth FROM employees WHERE manager_id IS NULL
UNION ALL
SELECT e.id, e.name, e.manager_id, t.depth + 1
FROM employees e JOIN tree t ON t.id = e.manager_id
WHERE t.depth < 20 -- hard stop after 20 levels
)
SELECT * FROM tree;
-- Defence 2 (PostgreSQL 14+): CYCLE detection
WITH RECURSIVE tree AS (
SELECT id, name, manager_id, ARRAY[id] AS path
FROM employees WHERE manager_id IS NULL
UNION ALL
SELECT e.id, e.name, e.manager_id, t.path || e.id
FROM employees e JOIN tree t ON t.id = e.manager_id
WHERE e.id != ALL(t.path) -- skip if already visited
)
CYCLE id SET is_cycle USING path -- PostgreSQL 14+ syntax
SELECT * FROM tree WHERE NOT is_cycle; ARRAY-based cycle detection
Accumulate visited IDs in an array. Skip rows whose ID is already in the path. Works on PostgreSQL 9.1+.
PostgreSQL 14+ built-in
CYCLE id SET is_cycle USING path lets the engine detect and mark cycles automatically. Cleaner syntax.
max_execution_time or depth limit
MySQL doesn’t have CYCLE syntax. Use a depth counter with a WHERE guard, or set cte_max_recursion_depth (default 1000).
Cycles mean bad data
A cycle in a hierarchy (A manages B who manages A) is almost always a data bug. Fix the data, don’t just cap recursion.
- Subqueries: scalar (one value in SELECT), filter (
IN/EXISTSin WHERE), or derived table (in FROM). - Correlated = references outer row, conceptually re-runs per row. Uncorrelated = runs once.
NOT INbreaks on NULLs. UseNOT EXISTSinstead.- CTEs (
WITH) = named subqueries. Chain them for readability. In PG 12+ / MySQL 8+, performance = same as inline subquery. - Recursive CTEs: base case
UNION ALLrecursive step. Use for hierarchies (org charts, category trees, graphs). - Always guard against infinite loops: depth limit or cycle detection.
← Chapter 08 — Indexes & Performance · → Continue to Chapter 10 — Window Functions →