Aggregation, GROUP BY & Windows
An aggregate is any function that collapses many rows into a single value β a count, a sum, an average. GROUP BY repeats that collapse per bucket. HAVING filters the resulting buckets. Window functions do the math row by row without collapsing. This chapter is the leap from "list the rows" to "describe the data."
Aggregate Functions
Up to now every query has returned the same number of rows it scanned (or fewer, after WHERE / LIMIT). Aggregates do something different: they reduce a whole group of rows down to one. Without GROUP BY (next section) that "group" is the entire result set, so every aggregate query so far in this chapter returns exactly one row.
-- The whole table is one big group
SELECT COUNT(*) AS total_orders FROM orders; total_orders
------------
4
The five core aggregates
-- COUNT β how many rows?
SELECT COUNT(*) FROM orders; -- 4
-- SUM β total of a numeric column
SELECT SUM(quantity * unit_price) FROM order_items; -- 173.47
-- AVG β arithmetic mean of a numeric column
SELECT AVG(price) FROM products; -- 28.83
-- MIN / MAX β smallest / largest value (works on numbers, dates, strings)
SELECT MIN(price), MAX(price) FROM products; -- 9.50, 42.00
SELECT MIN(placed_at), MAX(placed_at) FROM orders; -- earliest / latest date COUNT(*) vs COUNT(col) β the NULL difference
This is the most-asked aggregate question. The two forms count different things:
COUNT(*)counts rows β whether any column is NULL or not.COUNT(col)counts rows where that column is not NULL.COUNT(DISTINCT col)counts the number of distinct non-NULL values.
users tablecreated_at is NULL β and every aggregate silently ignores NULL inputs.
created_at", not "rows in the table".COUNT(col) when you meant COUNT(*). If col has NULLs your row count silently drops, and "users with at least one order" becomes "users with at least one order whose shipped_at is filled in" β a different question with a similar-looking answer.
All aggregates skip NULL (not just COUNT)
The rule generalises: every aggregate ignores NULLs in its input. SUM, AVG, MIN, MAX all drop NULL rows before doing the math. This matters most for AVG β averaging "missing" as zero is almost never what you want, and SQL deliberately makes the safer choice.
-- Suppose products had a NULL price for "TBD" items:
-- id | name | price
-- 1 | Mug | 9.50
-- 2 | Sticker | 3.00
-- 3 | Future Book | NULL
SELECT COUNT(*), COUNT(price), SUM(price), AVG(price), MIN(price), MAX(price)
FROM products;
-- COUNT(*) = 3 COUNT(price) = 2
-- SUM = 12.50 -- NULL contributed 0
-- AVG = 6.25 -- 12.50 / 2, not / 3 β NULL row excluded
-- MIN/MAX = 3.00 / 9.50 -- NULL ignored COALESCE β substitute a value for NULL
This is where COALESCE finally earns a section, exactly where it's most useful: cleaning up aggregates. COALESCE(a, b, c, β¦) returns the first non-NULL argument. Two common patterns:
-- β Treat a missing price as 0 BEFORE aggregating
-- AVG now divides by 3, not 2 β careful, this changes the answer
SELECT AVG(COALESCE(price, 0)) FROM products; -- 4.17, not 6.25
-- β‘ Replace a NULL aggregate result with a friendlier value
-- SUM on an empty table returns NULL; we want 0:
SELECT COALESCE(SUM(price), 0) AS total_revenue
FROM products
WHERE category = 'never-existed'; -- empty result β SUM = NULL β COALESCE = 0 AVG(COALESCE(price, 0)) averages "price or zero", which is a different statistic from "average price among items that have one". Decide what you actually want before reaching for COALESCE.
Dialect note β
NULLIF: the inverse helper, NULLIF(a, b), returns NULL when a = b (else a). Useful to avoid divide-by-zero: x / NULLIF(y, 0) returns NULL instead of erroring.
Aggregates combined with WHERE
WHERE runs before the aggregate (remember the execution-order diagram from Chapter 01). So a filter naturally shrinks the input the aggregate sees.
-- Average price of books only
SELECT AVG(price) FROM products WHERE category = 'books';
-- Number of paid orders
SELECT COUNT(*) FROM orders WHERE status = 'paid';
-- Total revenue from non-cancelled orders
SELECT SUM(quantity * unit_price) AS revenue_usd
FROM order_items oi
JOIN orders o ON o.id = oi.order_id -- Chapter 05
WHERE o.status != 'cancelled'; GROUP BY
An aggregate without GROUP BY reduces the whole result set to one row. GROUP BY says: partition the rows into buckets first, then aggregate inside each bucket separately. One output row per bucket.
SELECT category,
COUNT(*) AS product_count,
AVG(price) AS avg_price,
MIN(price) AS cheapest,
MAX(price) AS priciest
FROM products
GROUP BY category; category | product_count | avg_price | cheapest | priciest
---------+---------------+-----------+----------+---------
books | 2 | 38.50 | 34.99 | 42.00
merch | 1 | 9.50 | 9.50 | 9.50
Multi-column GROUP BY
List more than one column and the buckets become "every distinct combination" of those columns. The result has one row per (colβ, colβ, β¦) tuple.
-- One row per (status, day)
SELECT status,
DATE(placed_at) AS day,
COUNT(*) AS orders,
SUM(quantity * unit_price) AS revenue
FROM orders o
JOIN order_items oi ON oi.order_id = o.id
GROUP BY status, DATE(placed_at)
ORDER BY day, status; The "what can I put in SELECT?" rule
Once a query has GROUP BY, every column in the SELECT list must be one of:
- A column listed in the
GROUP BYclause, or - Wrapped in an aggregate function (
SUM,COUNT, etc.), or - A constant or a function of (1) and (2).
The reason is simple: the result has one row per bucket, so any column that isn't a grouping key or an aggregate has multiple possible values per row and no rule for which to pick.
Strict β error
The query above raises ERROR: column "products.name" must appear in the GROUP BY clause or be used in an aggregate function. The right behaviour.
Permissive by default
Until you enable ONLY_FULL_GROUP_BY (default since 5.7), MySQL silently picks some value from the bucket β undefined which one, undefined run to run. A real bug source.
Permissive
Picks an arbitrary value from each bucket, no warning. Useful for ad-hoc work, dangerous for reports.
Functional dependency
You may also select columns that are functionally dependent on the GROUP BY keys β e.g. selecting users.full_name when grouped by users.id. PostgreSQL implements this.
HAVING vs WHERE
WHERE filters individual rows before they enter a bucket. HAVING filters whole buckets after the aggregate has run. They look similar, but they run at different points in the pipeline, and they answer different questions.
Side-by-side example
"Show me categories whose average price is above $20, considering only products priced over $5." Two filters, two clauses β and they're not interchangeable.
SELECT category,
AVG(price) AS avg_price,
COUNT(*) AS items
FROM products
WHERE price > 5 -- β row-level: drop the $3 sticker before bucketing
GROUP BY category
HAVING AVG(price) > 20 -- β‘ bucket-level: only keep categories whose avg > 20
ORDER BY avg_price DESC; category | avg_price | items
---------+-----------+------
books | 38.50 | 2
Use whenβ¦
The condition only needs values from a single row β status = 'paid', price > 5, placed_at >= '2025-01-01'. WHERE is also cheaper: it shrinks the data before grouping.
Use whenβ¦
The condition involves an aggregate β SUM(quantity) > 10, COUNT(*) >= 2, AVG(price) > 20. HAVING is the only clause where aggregate functions are legal.
Two failures the diagram explains
-- β β Aggregate in WHERE β illegal
SELECT category, AVG(price)
FROM products
WHERE AVG(price) > 20 -- ERROR: aggregate functions are not allowed in WHERE
GROUP BY category;
-- why: WHERE runs at step 2; buckets don't exist until step 3, so AVG has nothing to operate on yet.
-- β β‘ Bare-column filter in HAVING β works but pointless
SELECT category, AVG(price)
FROM products
GROUP BY category
HAVING category = 'books'; -- "works" but throws away whole buckets after building them
-- Equivalent and faster:
SELECT category, AVG(price)
FROM products
WHERE category = 'books' -- skip non-book rows before any aggregation happens
GROUP BY category; WHERE. Reserve HAVING for conditions that genuinely involve SUM, COUNT, etc. The two queries return the same rows, but the WHERE version does less work.
Window Functions β Aggregates Without Collapsing
GROUP BY collapses each bucket into one row. That's powerful, but it means you lose the individual rows. Window functions let you compute the same per-bucket numbers and keep every row. Every row carries the aggregate alongside its own values β handy for running totals, rankings, and "how do I compare to my group average?"
GROUP BY collapses Β· window functions preserveThe OVER (β¦) clause
Any aggregate becomes a window function when followed by OVER (β¦). The parenthesised expression is the window definition: which rows count as "this row's group" and (sometimes) what order they're in. Two clauses inside OVER matter at this introductory level:
PARTITION BY colβ split rows into independent windows (analogous to GROUP BY's buckets).ORDER BY colβ order the rows within each window. Required for ranking and running-total functions.
-- Per-row category average (the bottom panel of the diagram above)
SELECT category,
price,
AVG(price) OVER (PARTITION BY category) AS avg_in_cat,
price - AVG(price) OVER (PARTITION BY category) AS delta_from_cat_avg
FROM products;
-- Running total of order revenue, in date order
SELECT id,
placed_at,
(SELECT SUM(quantity * unit_price) FROM order_items WHERE order_id = o.id) AS order_total,
SUM(
(SELECT SUM(quantity * unit_price) FROM order_items WHERE order_id = o.id)
) OVER (ORDER BY placed_at) AS running_total
FROM orders o
WHERE status != 'cancelled'
ORDER BY placed_at; Ranking functions β ROW_NUMBER, RANK, DENSE_RANK
Three different answers to "which place did this row come in?" β and the difference only matters when there are ties. The diagram below uses the same input and ranks it three ways.
ORDER BY price DESCSELECT name,
price,
ROW_NUMBER() OVER (ORDER BY price DESC) AS rn,
RANK() OVER (ORDER BY price DESC) AS rk,
DENSE_RANK() OVER (ORDER BY price DESC) AS drk
FROM products;
-- Top-1 per category, breaking ties by name
SELECT category, name, price
FROM (
SELECT category, name, price,
ROW_NUMBER() OVER (PARTITION BY category
ORDER BY price DESC, name) AS rn
FROM products
) ranked
WHERE rn = 1; Running totals with SUM OVER
An aggregate combined with OVER (ORDER BY β¦) becomes a running aggregate β at each row, it sums everything from the start of the window up to and including that row. This is the bread-and-butter pattern for cumulative-revenue charts.
-- Per-day revenue, plus a running cumulative total
WITH daily AS (
SELECT DATE(o.placed_at) AS day,
SUM(oi.quantity * oi.unit_price) AS revenue
FROM orders o
JOIN order_items oi ON oi.order_id = o.id
WHERE o.status != 'cancelled'
GROUP BY DATE(o.placed_at)
)
SELECT day,
revenue,
SUM(revenue) OVER (ORDER BY day) AS cumulative_revenue
FROM daily
ORDER BY day; - PostgreSQL: always supported.
- MySQL: added in 8.0 (April 2018). Older versions need workarounds (variables, self-joins).
- SQLite: added in 3.25 (Sept 2018).
- SQL Server / Oracle: supported for a long time.
ROWS BETWEEN), LAG/LEAD, percentile functions, and named windows.
- Aggregates collapse many rows to one.
COUNT(*)counts rows;COUNT(col)counts non-NULL values. - Every aggregate silently ignores NULL. Use
COALESCEto substitute, but know it changes the answer. GROUP BY= one output row per distinct combination of grouping keys. Every SELECT column must be a key or an aggregate.WHEREfilters rows before grouping;HAVINGfilters buckets after. If a predicate doesn't use an aggregate, put it inWHERE.- Window functions = aggregates that don't collapse.
OVER (PARTITION BY β¦)defines the per-row group; addORDER BYfor ranking and running totals. ROW_NUMBERnever ties;RANKties and skips;DENSE_RANKties and continues.
β Chapter 03 β Sorting & Limiting Β· π Continue to Chapter 05 β Joining Tables β