Window Functions Every Data Engineer Must Know (With Pipeline Examples)

Window functions are one of the most powerful tools in a data engineer’s SQL toolkit. Unlike aggregate functions that collapse rows into a single result, window functions perform calculations across a set of related rows while keeping every row in the output. This makes them essential for ranking, running totals, comparing rows to previous rows, and detecting change patterns in your pipeline data.

This tutorial covers the six window functions you will use most often in data engineering work, with real examples using a consistent dataset.

The Dataset

We will use a sales pipeline dataset throughout this tutorial.

sql

CREATE TABLE sales (
    sale_id      INT,
    rep_id       INT,
    region       VARCHAR(20),
    product      VARCHAR(30),
    amount       DECIMAL(10,2),
    sale_date    DATE
);

INSERT INTO sales VALUES
(1,  101, 'North', 'Laptop',   1200.00, '2024-01-05'),
(2,  102, 'North', 'Phone',     800.00, '2024-01-07'),
(3,  101, 'North', 'Tablet',    400.00, '2024-01-10'),
(4,  103, 'South', 'Laptop',   1200.00, '2024-01-12'),
(5,  104, 'South', 'Monitor',   350.00, '2024-01-15'),
(6,  102, 'North', 'Laptop',   1200.00, '2024-02-01'),
(7,  103, 'South', 'Phone',     800.00, '2024-02-03'),
(8,  101, 'North', 'Monitor',   350.00, '2024-02-08'),
(9,  104, 'South', 'Tablet',    400.00, '2024-02-10'),
(10, 102, 'North', 'Phone',     800.00, '2024-02-14');

1. ROW_NUMBER() — Assigning a Unique Rank Per Group

ROW_NUMBER() assigns a sequential number to each row within a partition. No two rows get the same number, even if they have the same value.

sql

SELECT
    sale_id,
    rep_id,
    region,
    amount,
    ROW_NUMBER() OVER (
        PARTITION BY region
        ORDER BY amount DESC
    ) AS rank_in_region
FROM sales;

Output:

sale_id	rep_id	region	amount	rank_in_region
1	101	North	1200.00	1
6	102	North	1200.00	2
2	102	North	800.00	3
10	102	North	800.00	4
3	101	North	400.00	5
8	101	North	350.00	6
4	103	South	1200.00	1
7	103	South	800.00	2
9	104	South	400.00	3
5	104	South	350.00	4

Notice that two North sales both have amount 1200.00, but they get different row numbers (1 and 2). This is the key difference from RANK().

Pipeline use case: Use ROW_NUMBER() to keep only the latest record per entity — the deduplication technique from Tutorial 1. Also used to get the top N rows per group before loading to a report table.

sql

-- Get the single highest sale per region for a dashboard table
WITH ranked AS (
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY region ORDER BY amount DESC) AS rn
    FROM sales
)
SELECT region, rep_id, amount
FROM ranked
WHERE rn = 1;

2. RANK() and DENSE_RANK() — Ranking With Ties

RANK() assigns the same number to tied rows, but then skips numbers. DENSE_RANK() assigns the same number to tied rows but does not skip.

sql

SELECT
    rep_id,
    SUM(amount) AS total_sales,
    RANK()       OVER (ORDER BY SUM(amount) DESC) AS rank_with_gaps,
    DENSE_RANK() OVER (ORDER BY SUM(amount) DESC) AS rank_no_gaps
FROM sales
GROUP BY rep_id;

Output:

rep_id	total_sales	rank_with_gaps	rank_no_gaps
102	2800.00	1	1
101	1950.00	2	2
103	2000.00	3	3
104	750.00	4	4

Pipeline use case: Use DENSE_RANK() when you need to label tiers (Tier 1, Tier 2, Tier 3) without gaps. Use RANK() when the exact position matters, such as leaderboards where “there is no 2nd place if two people tied for 1st.”

3. LAG() and LEAD() — Comparing to Previous or Next Rows

LAG() lets you access a value from a previous row. LEAD() accesses a value from a future row. Both are critical for detecting changes and trends in time-series pipeline data.

sql

SELECT
    sale_id,
    rep_id,
    amount,
    sale_date,
    LAG(amount)  OVER (PARTITION BY rep_id ORDER BY sale_date) AS prev_sale_amount,
    amount - LAG(amount) OVER (PARTITION BY rep_id ORDER BY sale_date) AS change_vs_prev
FROM sales
ORDER BY rep_id, sale_date;

Output:

sale_id	rep_id	amount	sale_date	prev_sale_amount	change_vs_prev
1	101	1200.00	2024-01-05	NULL	NULL
3	101	400.00	2024-01-10	1200.00	-800.00
8	101	350.00	2024-02-08	400.00	-50.00
2	102	800.00	2024-01-07	NULL	NULL
6	102	1200.00	2024-02-01	800.00	+400.00
10	102	800.00	2024-02-14	1200.00	-400.00

The first row per rep always returns NULL for prev_sale_amount because there is no previous row. Handle this in your pipeline with COALESCE(LAG(amount) OVER (...), 0) if you need a numeric default.

Pipeline use case: Detecting sudden spikes or drops in metrics. In a CDC pipeline, LAG() is used to compare old vs new values to identify what actually changed. In a financial pipeline, it powers month-over-month change calculations.

4. SUM() OVER — Running Totals

Using SUM() as a window function with an ORDER BY clause creates a cumulative running total — without collapsing rows like a regular GROUP BY.

sql

SELECT
    sale_date,
    rep_id,
    amount,
    SUM(amount) OVER (
        PARTITION BY rep_id
        ORDER BY sale_date
        ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
    ) AS running_total
FROM sales
ORDER BY rep_id, sale_date;

Output:

sale_date	rep_id	amount	running_total
2024-01-05	101	1200.00	1200.00
2024-01-10	101	400.00	1600.00
2024-02-08	101	350.00	1950.00
2024-01-07	102	800.00	800.00
2024-02-01	102	1200.00	2000.00
2024-02-14	102	800.00	2800.00

The ROWS BETWEEN clause explained:

UNBOUNDED PRECEDING — start from the first row in the partiti

The Dataset

1. ROW_NUMBER() — Assigning a Unique Rank Per Group

2. RANK() and DENSE_RANK() — Ranking With Ties

3. LAG() and LEAD() — Comparing to Previous or Next Rows

4. SUM() OVER — Running Totals

Leave a Comment Cancel reply