Window Functions Every Data Engineer Must Know (With Pipeline Examples)

Window functions are one of the most powerful tools in a data engineer’s SQL toolkit. Unlike aggregate functions that collapse rows into a single result, window functions perform calculations across a set of related rows while keeping every row in the output. This makes them essential for ranking, running totals, comparing rows to previous rows, and detecting change patterns in your pipeline data.

This tutorial covers the six window functions you will use most often in data engineering work, with real examples using a consistent dataset.

The Dataset

We will use a sales pipeline dataset throughout this tutorial.

sql

CREATE TABLE sales (
    sale_id      INT,
    rep_id       INT,
    region       VARCHAR(20),
    product      VARCHAR(30),
    amount       DECIMAL(10,2),
    sale_date    DATE
);

INSERT INTO sales VALUES
(1,  101, 'North', 'Laptop',   1200.00, '2024-01-05'),
(2,  102, 'North', 'Phone',     800.00, '2024-01-07'),
(3,  101, 'North', 'Tablet',    400.00, '2024-01-10'),
(4,  103, 'South', 'Laptop',   1200.00, '2024-01-12'),
(5,  104, 'South', 'Monitor',   350.00, '2024-01-15'),
(6,  102, 'North', 'Laptop',   1200.00, '2024-02-01'),
(7,  103, 'South', 'Phone',     800.00, '2024-02-03'),
(8,  101, 'North', 'Monitor',   350.00, '2024-02-08'),
(9,  104, 'South', 'Tablet',    400.00, '2024-02-10'),
(10, 102, 'North', 'Phone',     800.00, '2024-02-14');

1. ROW_NUMBER() — Assigning a Unique Rank Per Group

ROW_NUMBER() assigns a sequential number to each row within a partition. No two rows get the same number, even if they have the same value.

sql

SELECT
    sale_id,
    rep_id,
    region,
    amount,
    ROW_NUMBER() OVER (
        PARTITION BY region
        ORDER BY amount DESC
    ) AS rank_in_region
FROM sales;

Output:

sale_idrep_idregionamountrank_in_region
1101North1200.001
6102North1200.002
2102North800.003
10102North800.004
3101North400.005
8101North350.006
4103South1200.001
7103South800.002
9104South400.003
5104South350.004

Notice that two North sales both have amount 1200.00, but they get different row numbers (1 and 2). This is the key difference from RANK().

Pipeline use case: Use ROW_NUMBER() to keep only the latest record per entity — the deduplication technique from Tutorial 1. Also used to get the top N rows per group before loading to a report table.

sql

-- Get the single highest sale per region for a dashboard table
WITH ranked AS (
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY region ORDER BY amount DESC) AS rn
    FROM sales
)
SELECT region, rep_id, amount
FROM ranked
WHERE rn = 1;

2. RANK() and DENSE_RANK() — Ranking With Ties

RANK() assigns the same number to tied rows, but then skips numbers. DENSE_RANK() assigns the same number to tied rows but does not skip.

sql

SELECT
    rep_id,
    SUM(amount) AS total_sales,
    RANK()       OVER (ORDER BY SUM(amount) DESC) AS rank_with_gaps,
    DENSE_RANK() OVER (ORDER BY SUM(amount) DESC) AS rank_no_gaps
FROM sales
GROUP BY rep_id;

Output:

rep_idtotal_salesrank_with_gapsrank_no_gaps
1022800.0011
1011950.0022
1032000.0033
104750.0044

Pipeline use case: Use DENSE_RANK() when you need to label tiers (Tier 1, Tier 2, Tier 3) without gaps. Use RANK() when the exact position matters, such as leaderboards where “there is no 2nd place if two people tied for 1st.”

3. LAG() and LEAD() — Comparing to Previous or Next Rows

LAG() lets you access a value from a previous row. LEAD() accesses a value from a future row. Both are critical for detecting changes and trends in time-series pipeline data.

sql

SELECT
    sale_id,
    rep_id,
    amount,
    sale_date,
    LAG(amount)  OVER (PARTITION BY rep_id ORDER BY sale_date) AS prev_sale_amount,
    amount - LAG(amount) OVER (PARTITION BY rep_id ORDER BY sale_date) AS change_vs_prev
FROM sales
ORDER BY rep_id, sale_date;

Output:

sale_idrep_idamountsale_dateprev_sale_amountchange_vs_prev
11011200.002024-01-05NULLNULL
3101400.002024-01-101200.00-800.00
8101350.002024-02-08400.00-50.00
2102800.002024-01-07NULLNULL
61021200.002024-02-01800.00+400.00
10102800.002024-02-141200.00-400.00

The first row per rep always returns NULL for prev_sale_amount because there is no previous row. Handle this in your pipeline with COALESCE(LAG(amount) OVER (...), 0) if you need a numeric default.

Pipeline use case: Detecting sudden spikes or drops in metrics. In a CDC pipeline, LAG() is used to compare old vs new values to identify what actually changed. In a financial pipeline, it powers month-over-month change calculations.

4. SUM() OVER — Running Totals

Using SUM() as a window function with an ORDER BY clause creates a cumulative running total — without collapsing rows like a regular GROUP BY.

sql

SELECT
    sale_date,
    rep_id,
    amount,
    SUM(amount) OVER (
        PARTITION BY rep_id
        ORDER BY sale_date
        ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
    ) AS running_total
FROM sales
ORDER BY rep_id, sale_date;

Output:

sale_daterep_idamountrunning_total
2024-01-051011200.001200.00
2024-01-10101400.001600.00
2024-02-08101350.001950.00
2024-01-07102800.00800.00
2024-02-011021200.002000.00
2024-02-14102800.002800.00

The ROWS BETWEEN clause explained:

  • UNBOUNDED PRECEDING — start from the first row in the partiti

Leave a Comment