SQL Β· Intermediate

Aggregation, GROUP BY & Windows

An aggregate is any function that collapses many rows into a single value β€” a count, a sum, an average. GROUP BY repeats that collapse per bucket. HAVING filters the resulting buckets. Window functions do the math row by row without collapsing. This chapter is the leap from "list the rows" to "describe the data."

01
Section One Β· COUNT, SUM, AVG, MIN, MAX

Aggregate Functions

many rows aggregate COUNT Β· SUM AVG Β· MIN Β· MAX one number "How many?" Β· "What's the total?" Β· "What's the average?"

Up to now every query has returned the same number of rows it scanned (or fewer, after WHERE / LIMIT). Aggregates do something different: they reduce a whole group of rows down to one. Without GROUP BY (next section) that "group" is the entire result set, so every aggregate query so far in this chapter returns exactly one row.

-- The whole table is one big group
SELECT COUNT(*) AS total_orders FROM orders;
total_orders
------------
4

The five core aggregates

-- COUNT β€” how many rows?
SELECT COUNT(*) FROM orders;                       -- 4

-- SUM β€” total of a numeric column
SELECT SUM(quantity * unit_price) FROM order_items; -- 173.47

-- AVG β€” arithmetic mean of a numeric column
SELECT AVG(price) FROM products;                   -- 28.83

-- MIN / MAX β€” smallest / largest value (works on numbers, dates, strings)
SELECT MIN(price), MAX(price) FROM products;       -- 9.50, 42.00
SELECT MIN(placed_at), MAX(placed_at) FROM orders; -- earliest / latest date

COUNT(*) vs COUNT(col) β€” the NULL difference

This is the most-asked aggregate question. The two forms count different things:

  • COUNT(*) counts rows β€” whether any column is NULL or not.
  • COUNT(col) counts rows where that column is not NULL.
  • COUNT(DISTINCT col) counts the number of distinct non-NULL values.
Three flavours of COUNT on the users table
users (source) id full_name created_at 1 Ada 2025-01-04 2 Linus 2025-01-09 3 Grace NULL function result what it counted COUNT(*) 3 every row, NULL or not COUNT(created_at) 2 rows where created_at IS NOT NULL COUNT(DISTINCT full_name) 3 distinct non-NULL values in the column why COUNT(created_at) returned 2, not 3 Grace's created_at is NULL β€” and every aggregate silently ignores NULL inputs. Her row is dropped before counting, so the count is "rows with a non-NULL created_at", not "rows in the table".
Common mistake: writing COUNT(col) when you meant COUNT(*). If col has NULLs your row count silently drops, and "users with at least one order" becomes "users with at least one order whose shipped_at is filled in" β€” a different question with a similar-looking answer.

All aggregates skip NULL (not just COUNT)

The rule generalises: every aggregate ignores NULLs in its input. SUM, AVG, MIN, MAX all drop NULL rows before doing the math. This matters most for AVG β€” averaging "missing" as zero is almost never what you want, and SQL deliberately makes the safer choice.

-- Suppose products had a NULL price for "TBD" items:
--   id | name        | price
--    1 | Mug         |  9.50
--    2 | Sticker     |  3.00
--    3 | Future Book |  NULL
SELECT COUNT(*), COUNT(price), SUM(price), AVG(price), MIN(price), MAX(price)
FROM   products;
-- COUNT(*) = 3   COUNT(price) = 2
-- SUM      = 12.50            -- NULL contributed 0
-- AVG      = 6.25             -- 12.50 / 2, not / 3 β€” NULL row excluded
-- MIN/MAX  = 3.00 / 9.50      -- NULL ignored

COALESCE β€” substitute a value for NULL

This is where COALESCE finally earns a section, exactly where it's most useful: cleaning up aggregates. COALESCE(a, b, c, …) returns the first non-NULL argument. Two common patterns:

-- β‘  Treat a missing price as 0 BEFORE aggregating
--    AVG now divides by 3, not 2 β€” careful, this changes the answer
SELECT AVG(COALESCE(price, 0)) FROM products;   -- 4.17, not 6.25

-- β‘‘ Replace a NULL aggregate result with a friendlier value
--    SUM on an empty table returns NULL; we want 0:
SELECT COALESCE(SUM(price), 0) AS total_revenue
FROM   products
WHERE  category = 'never-existed';   -- empty result β†’ SUM = NULL β†’ COALESCE = 0
Subtle gotcha: pattern β‘  changes the meaning of the aggregate. AVG(COALESCE(price, 0)) averages "price or zero", which is a different statistic from "average price among items that have one". Decide what you actually want before reaching for COALESCE.

Dialect note β€” NULLIF: the inverse helper, NULLIF(a, b), returns NULL when a = b (else a). Useful to avoid divide-by-zero: x / NULLIF(y, 0) returns NULL instead of erroring.

Aggregates combined with WHERE

WHERE runs before the aggregate (remember the execution-order diagram from Chapter 01). So a filter naturally shrinks the input the aggregate sees.

-- Average price of books only
SELECT AVG(price) FROM products WHERE category = 'books';

-- Number of paid orders
SELECT COUNT(*) FROM orders WHERE status = 'paid';

-- Total revenue from non-cancelled orders
SELECT SUM(quantity * unit_price) AS revenue_usd
FROM   order_items oi
JOIN   orders o ON o.id = oi.order_id   -- Chapter 05
WHERE  o.status != 'cancelled';
02
Section Two Β· Aggregating per bucket

GROUP BY

An aggregate without GROUP BY reduces the whole result set to one row. GROUP BY says: partition the rows into buckets first, then aggregate inside each bucket separately. One output row per bucket.

Mental model: bucket the rows, then aggregate each bucket
products (6 rows) books Β· SQL Cookbook Β· 34.99 books Β· Designing Data Β· 42.00 books Β· Refactoring Β· 28.00 merch Β· Mug Β· 9.50 merch Β· T-Shirt Β· 18.00 merch Β· Sticker Β· 3.00 GROUP BY category bucket: books SQL Cookbook Β· 34.99 Designing Data Β· 42.00 Refactoring Β· 28.00 3 rows in this bucket bucket: merch Mug Β· 9.50 T-Shirt Β· 18.00 Sticker Β· 3.00 3 rows in this bucket AVG(price) one row per bucket category avg books 34.99 merch 10.16 2 buckets β†’ 2 result rows. Each bucket's rows collapse into a single aggregate value. rows in = 6 Β· buckets = 2 Β· rows out = 2
SELECT category,
       COUNT(*)   AS product_count,
       AVG(price) AS avg_price,
       MIN(price) AS cheapest,
       MAX(price) AS priciest
FROM   products
GROUP BY category;
category | product_count | avg_price | cheapest | priciest
---------+---------------+-----------+----------+---------
books    |             2 |     38.50 |    34.99 |    42.00
merch    |             1 |      9.50 |     9.50 |     9.50

Multi-column GROUP BY

List more than one column and the buckets become "every distinct combination" of those columns. The result has one row per (col₁, colβ‚‚, …) tuple.

-- One row per (status, day)
SELECT status,
       DATE(placed_at) AS day,
       COUNT(*)        AS orders,
       SUM(quantity * unit_price) AS revenue
FROM   orders o
JOIN   order_items oi ON oi.order_id = o.id
GROUP BY status, DATE(placed_at)
ORDER BY day, status;

The "what can I put in SELECT?" rule

Once a query has GROUP BY, every column in the SELECT list must be one of:

  1. A column listed in the GROUP BY clause, or
  2. Wrapped in an aggregate function (SUM, COUNT, etc.), or
  3. A constant or a function of (1) and (2).

The reason is simple: the result has one row per bucket, so any column that isn't a grouping key or an aggregate has multiple possible values per row and no rule for which to pick.

Why "bare" columns break a GROUP BY query
⚠ the bad query SELECT category, name, ← which name? Each bucket has many. AVG(price) FROM products GROUP BY category; each bucket must collapse to ONE row β€” but name still has many values bucket: category = 'books' name = "SQL Cookbook" name = "Designing Data" name = "Refactoring" AVG(price) = 35.00 βœ“ exactly one value name = ??? β†’ no rule to pick one of three PostgreSQL errors here; MySQL silently picks an arbitrary value. bucket: category = 'merch' name = "Mug" name = "T-Shirt" name = "Sticker" AVG(price) = 10.16 βœ“ exactly one value name = ??? β†’ no rule to pick one of three same problem, different bucket β€” and that's the rule.
PostgreSQL

Strict β€” error

The query above raises ERROR: column "products.name" must appear in the GROUP BY clause or be used in an aggregate function. The right behaviour.

MySQL

Permissive by default

Until you enable ONLY_FULL_GROUP_BY (default since 5.7), MySQL silently picks some value from the bucket β€” undefined which one, undefined run to run. A real bug source.

SQLite

Permissive

Picks an arbitrary value from each bucket, no warning. Useful for ad-hoc work, dangerous for reports.

Standard SQL

Functional dependency

You may also select columns that are functionally dependent on the GROUP BY keys β€” e.g. selecting users.full_name when grouped by users.id. PostgreSQL implements this.

Common mistake: developing on MySQL, where the bad query "works", then shipping to PostgreSQL where it errors. Always write GROUP BY queries as if PostgreSQL was the runtime β€” list every non-aggregated column.
03
Section Three Β· Filtering groups

HAVING vs WHERE

WHERE filters individual rows before they enter a bucket. HAVING filters whole buckets after the aggregate has run. They look similar, but they run at different points in the pipeline, and they answer different questions.

The pipeline β€” where each clause runs (and why aliases don't reach back)
1. FROM load rows 2. WHERE drop ROWS 3. GROUP BY make buckets 4. HAVING drop BUCKETS 5. SELECT project + alias 6. ORDER BY / LIMIT "row-level test" cheap: shrinks the set the aggregate has to scan cannot use SUM/COUNT here (buckets don't exist yet) "bucket-level test" runs AFTER aggregates; predicate can reference SUM/COUNT/AVG cannot filter individual rows here (the rows are already gone) Use WHERE for "which rows go into the buckets"; use HAVING for "which buckets survive after aggregation".

Side-by-side example

"Show me categories whose average price is above $20, considering only products priced over $5." Two filters, two clauses β€” and they're not interchangeable.

SELECT category,
       AVG(price) AS avg_price,
       COUNT(*)   AS items
FROM   products
WHERE  price > 5            -- β‘  row-level: drop the $3 sticker before bucketing
GROUP BY category
HAVING AVG(price) > 20      -- β‘‘ bucket-level: only keep categories whose avg > 20
ORDER BY avg_price DESC;
category | avg_price | items
---------+-----------+------
books    |     38.50 |     2
WHERE

Use when…

The condition only needs values from a single row β€” status = 'paid', price > 5, placed_at >= '2025-01-01'. WHERE is also cheaper: it shrinks the data before grouping.

HAVING

Use when…

The condition involves an aggregate β€” SUM(quantity) > 10, COUNT(*) >= 2, AVG(price) > 20. HAVING is the only clause where aggregate functions are legal.

Two failures the diagram explains

-- ⚠ β‘  Aggregate in WHERE β€” illegal
SELECT category, AVG(price)
FROM   products
WHERE  AVG(price) > 20         -- ERROR: aggregate functions are not allowed in WHERE
GROUP BY category;
-- why: WHERE runs at step 2; buckets don't exist until step 3, so AVG has nothing to operate on yet.

-- ⚠ β‘‘ Bare-column filter in HAVING β€” works but pointless
SELECT category, AVG(price)
FROM   products
GROUP BY category
HAVING category = 'books';     -- "works" but throws away whole buckets after building them
-- Equivalent and faster:
SELECT category, AVG(price)
FROM   products
WHERE  category = 'books'      -- skip non-book rows before any aggregation happens
GROUP BY category;
Rule of thumb: if a predicate could be expressed without an aggregate, put it in WHERE. Reserve HAVING for conditions that genuinely involve SUM, COUNT, etc. The two queries return the same rows, but the WHERE version does less work.
04
Section Four Β· A First Taste

Window Functions β€” Aggregates Without Collapsing

GROUP BY collapses each bucket into one row. That's powerful, but it means you lose the individual rows. Window functions let you compute the same per-bucket numbers and keep every row. Every row carries the aggregate alongside its own values β€” handy for running totals, rankings, and "how do I compare to my group average?"

GROUP BY collapses Β· window functions preserve
products (3 books + 2 merch) category price books 28 books 35 books 42 merch 10 merch 18 GROUP BY category β†’ 2 rows (collapsed) category AVG(price) books 35.00 merch 14.00 ↑ lost individual prices AVG(price) OVER (PARTITION BY category) β†’ 5 rows (preserved) category price avg_in_cat books 28 35.00 books 35 35.00 books 42 35.00 merch 10 14.00 merch 18 14.00

The OVER (…) clause

Any aggregate becomes a window function when followed by OVER (…). The parenthesised expression is the window definition: which rows count as "this row's group" and (sometimes) what order they're in. Two clauses inside OVER matter at this introductory level:

  • PARTITION BY col β€” split rows into independent windows (analogous to GROUP BY's buckets).
  • ORDER BY col β€” order the rows within each window. Required for ranking and running-total functions.
-- Per-row category average (the bottom panel of the diagram above)
SELECT category,
       price,
       AVG(price)   OVER (PARTITION BY category) AS avg_in_cat,
       price - AVG(price) OVER (PARTITION BY category) AS delta_from_cat_avg
FROM   products;

-- Running total of order revenue, in date order
SELECT id,
       placed_at,
       (SELECT SUM(quantity * unit_price) FROM order_items WHERE order_id = o.id) AS order_total,
       SUM(
         (SELECT SUM(quantity * unit_price) FROM order_items WHERE order_id = o.id)
       ) OVER (ORDER BY placed_at) AS running_total
FROM   orders o
WHERE  status != 'cancelled'
ORDER BY placed_at;

Ranking functions β€” ROW_NUMBER, RANK, DENSE_RANK

Three different answers to "which place did this row come in?" β€” and the difference only matters when there are ties. The diagram below uses the same input and ranks it three ways.

Three ranking functions on the same tied data β€” ORDER BY price DESC
product price ROW_NUMBER RANK DENSE_RANK Designing Data 42 1 1 1 SQL Cookbook 35 2 2 2 Refactoring 35 3 2 2 T-Shirt 18 4 4 3 Mug 10 5 5 4 never ties β€” breaks ties arbitrarily ties share rank; SKIPS the next number ties share rank; NEXT number unchanged Olympic medals = RANK (1, 2, 2, 4); classroom honour-roll = DENSE_RANK (1, 2, 2, 3).
SELECT name,
       price,
       ROW_NUMBER() OVER (ORDER BY price DESC) AS rn,
       RANK()       OVER (ORDER BY price DESC) AS rk,
       DENSE_RANK() OVER (ORDER BY price DESC) AS drk
FROM   products;

-- Top-1 per category, breaking ties by name
SELECT category, name, price
FROM (
  SELECT category, name, price,
         ROW_NUMBER() OVER (PARTITION BY category
                            ORDER BY price DESC, name) AS rn
  FROM   products
) ranked
WHERE  rn = 1;

Running totals with SUM OVER

An aggregate combined with OVER (ORDER BY …) becomes a running aggregate β€” at each row, it sums everything from the start of the window up to and including that row. This is the bread-and-butter pattern for cumulative-revenue charts.

-- Per-day revenue, plus a running cumulative total
WITH daily AS (
  SELECT DATE(o.placed_at)               AS day,
         SUM(oi.quantity * oi.unit_price) AS revenue
  FROM   orders o
  JOIN   order_items oi ON oi.order_id = o.id
  WHERE  o.status != 'cancelled'
  GROUP BY DATE(o.placed_at)
)
SELECT day,
       revenue,
       SUM(revenue) OVER (ORDER BY day) AS cumulative_revenue
FROM   daily
ORDER BY day;
Dialect note β€” window function availability:
  • PostgreSQL: always supported.
  • MySQL: added in 8.0 (April 2018). Older versions need workarounds (variables, self-joins).
  • SQLite: added in 3.25 (Sept 2018).
  • SQL Server / Oracle: supported for a long time.
Chapter 10 covers frame clauses (ROWS BETWEEN), LAG/LEAD, percentile functions, and named windows.
Mental shortcut for the whole chapter:
  • Aggregates collapse many rows to one. COUNT(*) counts rows; COUNT(col) counts non-NULL values.
  • Every aggregate silently ignores NULL. Use COALESCE to substitute, but know it changes the answer.
  • GROUP BY = one output row per distinct combination of grouping keys. Every SELECT column must be a key or an aggregate.
  • WHERE filters rows before grouping; HAVING filters buckets after. If a predicate doesn't use an aggregate, put it in WHERE.
  • Window functions = aggregates that don't collapse. OVER (PARTITION BY …) defines the per-row group; add ORDER BY for ranking and running totals.
  • ROW_NUMBER never ties; RANK ties and skips; DENSE_RANK ties and continues.

← Chapter 03 β€” Sorting & Limiting  Β·  πŸ‘‰ Continue to Chapter 05 β€” Joining Tables β†’