SQL · Expert

Window Functions Deep Dive

Chapter 04 introduced window functions as “aggregates without collapsing.” This chapter goes all the way: the full anatomy of OVER(), frame clauses that control which rows the function sees, all the ranking functions, value-access functions like LAG/LEAD, and named windows for DRY queries.

01
Section One · The full OVER clause

Window Function Anatomy

SUM(revenue) OVER ( PARTITION BY category ORDER BY placed_at ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW )

Every window function has the form function() OVER (window_definition). The window definition has three optional parts:

  1. PARTITION BY col — splits rows into independent groups (like GROUP BY, but without collapsing). If omitted, the entire result set is one partition.
  2. ORDER BY col — defines the logical order of rows within each partition. Required for ranking and running aggregates.
  3. Frame clause — defines which subset of rows within the partition the function operates on relative to the current row. The default depends on whether ORDER BY is present.

Frame clauses: ROWS BETWEEN vs RANGE BETWEEN

The frame defines a sliding window of rows the aggregate sees for each current row. Two modes:

ROWS BETWEEN

Physical offsets

Counts actual rows. ROWS BETWEEN 2 PRECEDING AND CURRENT ROW = the current row plus the 2 rows before it (exactly 3 rows). Deterministic even with ties.

RANGE BETWEEN

Logical (value) offsets

Includes all rows whose ORDER BY value falls within the range. RANGE BETWEEN INTERVAL '7 days' PRECEDING AND CURRENT ROW = all rows within the last 7 days. Includes ties automatically.

The default frame (the counterintuitive bit)

When you write ORDER BY inside OVER but no explicit frame clause, the default is:

  • With ORDER BY: RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW — a running total from the start up to (and including) all rows tied with the current row.
  • Without ORDER BY: RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING — the entire partition.
Gotcha: the default frame with ORDER BY uses RANGE, which includes ties. If two rows share the same placed_at value, both see the same running total. Use ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW if you want strictly one-row-at-a-time accumulation.
-- Running total (default frame = RANGE, includes ties)
SELECT placed_at,
       SUM(total) OVER (ORDER BY placed_at) AS running_total_range
FROM   daily_revenue;

-- Strictly row-by-row running total (ROWS frame)
SELECT placed_at,
       SUM(total) OVER (ORDER BY placed_at
                        ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
       AS running_total_rows
FROM   daily_revenue;

-- 7-day moving average (RANGE with interval)
SELECT placed_at,
       AVG(total) OVER (ORDER BY placed_at
                        RANGE BETWEEN INTERVAL '6 days' PRECEDING
                                  AND CURRENT ROW)
       AS moving_avg_7d
FROM   daily_revenue;

Named windows with WINDOW ... AS

If multiple columns share the same window definition, name it once and reuse it — DRY and less error-prone.

SELECT id,
       placed_at,
       SUM(total)   OVER w AS running_total,
       AVG(total)   OVER w AS running_avg,
       COUNT(*)     OVER w AS running_count
FROM   daily_revenue
WINDOW w AS (ORDER BY placed_at ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
ORDER BY placed_at;
Dialect note: WINDOW clause is supported in PostgreSQL, MySQL 8+, and SQLite 3.28+. SQL Server does not support it — repeat the OVER inline.
02
Section Two · Numbering and percentiles

Ranking Functions

Chapter 04 introduced ROW_NUMBER, RANK, and DENSE_RANK with a tie-handling comparison. This section adds NTILE, PERCENT_RANK, and CUME_DIST — the percentile-bucket functions.

Quick recap: the three basic rankings

SELECT name, price,
       ROW_NUMBER() OVER (ORDER BY price DESC) AS rn,    -- never ties: 1,2,3,4,5
       RANK()       OVER (ORDER BY price DESC) AS rk,    -- ties share, skip: 1,2,2,4,5
       DENSE_RANK() OVER (ORDER BY price DESC) AS drk    -- ties share, no skip: 1,2,2,3,4
FROM   products;

NTILE(n) — divide into equal-sized buckets

Splits the partition into n roughly equal groups and assigns a bucket number 1..n. Useful for percentile analysis.

-- Split orders into 4 quartiles by revenue
WITH order_rev AS (
  SELECT order_id, SUM(quantity * unit_price) AS total
  FROM   order_items GROUP BY order_id
)
SELECT order_id, total,
       NTILE(4) OVER (ORDER BY total) AS quartile
FROM   order_rev;
order_id | total | quartile
---------+-------+---------
1004     |  9.50 |        1
1002     | 42.00 |        2
1001     | 53.99 |        3
1003     | 76.99 |        4
Uneven splits: if 10 rows are split into 4 buckets, you get groups of 3, 3, 2, 2 (not 2.5). The first buckets get the extra rows. This is by design — NTILE is approximate.

PERCENT_RANK and CUME_DIST

Both return a value between 0 and 1 representing a row’s position in the sorted partition:

PERCENT_RANK()

(rank - 1) / (total - 1)

The relative rank as a fraction. First row = 0, last row = 1. Useful for “this product is more expensive than X% of all products.”

CUME_DIST()

rows ≤ current / total

Cumulative distribution. The fraction of rows that have a value ≤ the current row’s value. Always > 0, last row = 1.

SELECT name, price,
       PERCENT_RANK() OVER (ORDER BY price) AS pct_rank,
       CUME_DIST()    OVER (ORDER BY price) AS cume_dist
FROM   products
ORDER BY price;
name           | price | pct_rank | cume_dist
---------------+-------+----------+----------
Database Mug   |  9.50 |     0.00 |      0.33
SQL Cookbook    | 34.99 |     0.50 |      0.67
Designing Data | 42.00 |     1.00 |      1.00

Top-N per group (the killer pattern)

Chapter 04 previewed this. Here’s the full production pattern: rank within each group, then filter to keep only the top N.

-- Top 2 most expensive products per category
WITH ranked AS (
  SELECT category, name, price,
         ROW_NUMBER() OVER (PARTITION BY category
                            ORDER BY price DESC) AS rn
  FROM   products
)
SELECT category, name, price
FROM   ranked
WHERE  rn <= 2;
Why ROW_NUMBER and not RANK? If you want exactly N rows per group regardless of ties, use ROW_NUMBER (it breaks ties arbitrarily). If you want to include all tied rows at the boundary, use RANK or DENSE_RANK — you may get more than N rows.
03
Section Three · Comparing adjacent rows

Value Functions

Value functions access a specific row in the window without collapsing. They answer questions like “what was the previous row’s value?” or “what was the first value in this partition?”

LAG / LEAD — peek at adjacent rows

LAG(col, offset, default) looks back N rows. LEAD looks forward. Default offset is 1; default value is NULL.

-- Compare each order's total to the previous order's total
WITH order_totals AS (
  SELECT o.id, o.placed_at,
         SUM(oi.quantity * oi.unit_price) AS total
  FROM   orders o
  JOIN   order_items oi ON oi.order_id = o.id
  WHERE  o.status != 'cancelled'
  GROUP BY o.id, o.placed_at
)
SELECT id,
       placed_at,
       total,
       LAG(total) OVER (ORDER BY placed_at)  AS prev_total,
       total - LAG(total) OVER (ORDER BY placed_at) AS change
FROM   order_totals
ORDER BY placed_at;
id   | placed_at           | total | prev_total | change
-----+---------------------+-------+------------+-------
1001 | 2025-02-10 09:00:00 | 53.99 | NULL       | NULL
1002 | 2025-02-11 11:30:00 | 42.00 | 53.99      | -11.99
1003 | 2025-02-14 16:45:00 | 76.99 | 42.00      |  34.99

The first row has LAG = NULL because there’s no previous row. You can supply a default: LAG(total, 1, 0) would return 0 instead of NULL.

FIRST_VALUE / LAST_VALUE

Return the first or last value in the current window frame.

-- Each product vs the cheapest in its category
SELECT category, name, price,
       FIRST_VALUE(name) OVER (PARTITION BY category ORDER BY price)
         AS cheapest_in_cat,
       price - FIRST_VALUE(price) OVER (PARTITION BY category ORDER BY price)
         AS premium_over_cheapest
FROM   products;

The LAST_VALUE frame gotcha

This is the most common window-function bug. LAST_VALUE with the default frame (RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) returns the current row’s value — because the frame ends at the current row! To get the true last value in the partition, extend the frame explicitly:

-- BROKEN: LAST_VALUE with default frame = current row's value (useless)
SELECT name, price,
       LAST_VALUE(name) OVER (PARTITION BY category ORDER BY price)
       AS "wrong_most_expensive"
FROM   products;
-- Every row says its OWN name because the frame ends at the current row.

-- FIXED: extend frame to end of partition
SELECT name, price,
       LAST_VALUE(name) OVER (
         PARTITION BY category ORDER BY price
         ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
       ) AS most_expensive_in_cat
FROM   products;
Why LAST_VALUE returns the wrong answer without an explicit frame
✗ default frame (ends at current row) name price LAST_VALUE Mug 9.50 Mug [] Cookbook 34.99 Cookbook [] Designing 42.00 Designing [] frame ends at current row each time → LAST_VALUE = current row = useless! default: RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW ✓ explicit frame (full partition) name price LAST_VALUE Mug 9.50 Designing Cookbook 34.99 Designing Designing 42.00 Designing frame covers entire partition → LAST_VALUE = last row in partition ✓ ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING

NTH_VALUE

Access the Nth row in the frame. Same frame gotcha applies — always specify an explicit frame if you want to scan the whole partition.

-- Second cheapest product per category
SELECT category, name, price,
       NTH_VALUE(name, 2) OVER (
         PARTITION BY category ORDER BY price
         ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
       ) AS second_cheapest
FROM   products;
Mental shortcut for the whole chapter:
  • OVER() = function + partition + order + frame. All optional, but order is required for ranking/running aggregates.
  • Default frame with ORDER BY = RANGE to CURRENT ROW (includes ties). Use ROWS for strict row-by-row logic.
  • ROWS = physical offset (count rows). RANGE = logical offset (value-based, includes ties).
  • Named windows (WINDOW w AS (...)) eliminate repetition across multiple columns.
  • Ranking: ROW_NUMBER (no ties), RANK (skip), DENSE_RANK (no skip), NTILE (buckets), PERCENT_RANK / CUME_DIST (0..1 position).
  • Top-N per group = ROW_NUMBER + PARTITION BY + filter in outer query.
  • LAG/LEAD: compare to previous/next row. FIRST_VALUE: start of partition. LAST_VALUE: always specify full frame or you get the current row.

Chapter 09 — Subqueries & CTEs  ·  → Continue to Chapter 11 — Advanced SQL Patterns →