Window functions are one of the most powerful tools in a data engineer’s SQL toolkit. Unlike aggregate functions that collapse rows into a single result, window functions perform calculations across a set of related rows while keeping every row in the output. This makes them essential for ranking, running totals, comparing rows to previous rows, and detecting change patterns in your pipeline data.
This tutorial covers the six window functions you will use most often in data engineering work, with real examples using a consistent dataset.
The Dataset
We will use a sales pipeline dataset throughout this tutorial.
sql
CREATE TABLE sales (
sale_id INT,
rep_id INT,
region VARCHAR(20),
product VARCHAR(30),
amount DECIMAL(10,2),
sale_date DATE
);
INSERT INTO sales VALUES
(1, 101, 'North', 'Laptop', 1200.00, '2024-01-05'),
(2, 102, 'North', 'Phone', 800.00, '2024-01-07'),
(3, 101, 'North', 'Tablet', 400.00, '2024-01-10'),
(4, 103, 'South', 'Laptop', 1200.00, '2024-01-12'),
(5, 104, 'South', 'Monitor', 350.00, '2024-01-15'),
(6, 102, 'North', 'Laptop', 1200.00, '2024-02-01'),
(7, 103, 'South', 'Phone', 800.00, '2024-02-03'),
(8, 101, 'North', 'Monitor', 350.00, '2024-02-08'),
(9, 104, 'South', 'Tablet', 400.00, '2024-02-10'),
(10, 102, 'North', 'Phone', 800.00, '2024-02-14');
1. ROW_NUMBER() — Assigning a Unique Rank Per Group
ROW_NUMBER() assigns a sequential number to each row within a partition. No two rows get the same number, even if they have the same value.
sql
SELECT
sale_id,
rep_id,
region,
amount,
ROW_NUMBER() OVER (
PARTITION BY region
ORDER BY amount DESC
) AS rank_in_region
FROM sales;
Output:
| sale_id | rep_id | region | amount | rank_in_region |
|---|---|---|---|---|
| 1 | 101 | North | 1200.00 | 1 |
| 6 | 102 | North | 1200.00 | 2 |
| 2 | 102 | North | 800.00 | 3 |
| 10 | 102 | North | 800.00 | 4 |
| 3 | 101 | North | 400.00 | 5 |
| 8 | 101 | North | 350.00 | 6 |
| 4 | 103 | South | 1200.00 | 1 |
| 7 | 103 | South | 800.00 | 2 |
| 9 | 104 | South | 400.00 | 3 |
| 5 | 104 | South | 350.00 | 4 |
Notice that two North sales both have amount 1200.00, but they get different row numbers (1 and 2). This is the key difference from RANK().
Pipeline use case: Use ROW_NUMBER() to keep only the latest record per entity — the deduplication technique from Tutorial 1. Also used to get the top N rows per group before loading to a report table.
sql
-- Get the single highest sale per region for a dashboard table
WITH ranked AS (
SELECT *,
ROW_NUMBER() OVER (PARTITION BY region ORDER BY amount DESC) AS rn
FROM sales
)
SELECT region, rep_id, amount
FROM ranked
WHERE rn = 1;
2. RANK() and DENSE_RANK() — Ranking With Ties
RANK() assigns the same number to tied rows, but then skips numbers. DENSE_RANK() assigns the same number to tied rows but does not skip.
sql
SELECT
rep_id,
SUM(amount) AS total_sales,
RANK() OVER (ORDER BY SUM(amount) DESC) AS rank_with_gaps,
DENSE_RANK() OVER (ORDER BY SUM(amount) DESC) AS rank_no_gaps
FROM sales
GROUP BY rep_id;
Output:
| rep_id | total_sales | rank_with_gaps | rank_no_gaps |
|---|---|---|---|
| 102 | 2800.00 | 1 | 1 |
| 101 | 1950.00 | 2 | 2 |
| 103 | 2000.00 | 3 | 3 |
| 104 | 750.00 | 4 | 4 |
Pipeline use case: Use DENSE_RANK() when you need to label tiers (Tier 1, Tier 2, Tier 3) without gaps. Use RANK() when the exact position matters, such as leaderboards where “there is no 2nd place if two people tied for 1st.”
3. LAG() and LEAD() — Comparing to Previous or Next Rows
LAG() lets you access a value from a previous row. LEAD() accesses a value from a future row. Both are critical for detecting changes and trends in time-series pipeline data.
sql
SELECT
sale_id,
rep_id,
amount,
sale_date,
LAG(amount) OVER (PARTITION BY rep_id ORDER BY sale_date) AS prev_sale_amount,
amount - LAG(amount) OVER (PARTITION BY rep_id ORDER BY sale_date) AS change_vs_prev
FROM sales
ORDER BY rep_id, sale_date;
Output:
| sale_id | rep_id | amount | sale_date | prev_sale_amount | change_vs_prev |
|---|---|---|---|---|---|
| 1 | 101 | 1200.00 | 2024-01-05 | NULL | NULL |
| 3 | 101 | 400.00 | 2024-01-10 | 1200.00 | -800.00 |
| 8 | 101 | 350.00 | 2024-02-08 | 400.00 | -50.00 |
| 2 | 102 | 800.00 | 2024-01-07 | NULL | NULL |
| 6 | 102 | 1200.00 | 2024-02-01 | 800.00 | +400.00 |
| 10 | 102 | 800.00 | 2024-02-14 | 1200.00 | -400.00 |
The first row per rep always returns NULL for prev_sale_amount because there is no previous row. Handle this in your pipeline with COALESCE(LAG(amount) OVER (...), 0) if you need a numeric default.
Pipeline use case: Detecting sudden spikes or drops in metrics. In a CDC pipeline, LAG() is used to compare old vs new values to identify what actually changed. In a financial pipeline, it powers month-over-month change calculations.
4. SUM() OVER — Running Totals
Using SUM() as a window function with an ORDER BY clause creates a cumulative running total — without collapsing rows like a regular GROUP BY.
sql
SELECT
sale_date,
rep_id,
amount,
SUM(amount) OVER (
PARTITION BY rep_id
ORDER BY sale_date
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
) AS running_total
FROM sales
ORDER BY rep_id, sale_date;
Output:
| sale_date | rep_id | amount | running_total |
|---|---|---|---|
| 2024-01-05 | 101 | 1200.00 | 1200.00 |
| 2024-01-10 | 101 | 400.00 | 1600.00 |
| 2024-02-08 | 101 | 350.00 | 1950.00 |
| 2024-01-07 | 102 | 800.00 | 800.00 |
| 2024-02-01 | 102 | 1200.00 | 2000.00 |
| 2024-02-14 | 102 | 800.00 | 2800.00 |
The ROWS BETWEEN clause explained:
UNBOUNDED PRECEDING— start from the first row in the partiti