Finding and Removing Duplicates in SQL Before Loading to a Data Warehouse

When you are building a data pipeline, duplicate records are one of the most common and damaging data quality problems. A single duplicate row can inflate revenue numbers, double-count customers, or break a UNIQUE constraint on your warehouse table and halt your entire pipeline load.

This tutorial walks through three practical SQL methods to detect and remove duplicates before your data lands in a warehouse. Every method includes full working code and real output so you can apply it directly in your pipeline.

The Dataset

We will use an orders table that simulates raw data arriving from an API or event stream. Notice that order_id 1001 appears twice — a classic duplicate from a retry or double-event scenario.

sql

CREATE TABLE raw_orders (
    order_id     INT,
    customer_id  INT,
    product      VARCHAR(50),
    amount       DECIMAL(10,2),
    order_date   DATE,
    loaded_at    TIMESTAMP
);

INSERT INTO raw_orders VALUES
(1001, 201, 'Laptop',   1200.00, '2024-03-01', '2024-03-01 08:00:00'),
(1002, 202, 'Phone',     800.00, '2024-03-01', '2024-03-01 08:01:00'),
(1001, 201, 'Laptop',   1200.00, '2024-03-01', '2024-03-01 08:05:00'),  -- duplicate
(1003, 203, 'Tablet',    400.00, '2024-03-02', '2024-03-02 09:00:00'),
(1004, 201, 'Monitor',   350.00, '2024-03-02', '2024-03-02 09:30:00'),
(1002, 202, 'Phone',     800.00, '2024-03-01', '2024-03-01 08:07:00');  -- duplicate

Raw data — 6 rows, but 2 are duplicates:

order_idcustomer_idproductamountorder_dateloaded_at
1001201Laptop1200.002024-03-0108:00:00
1002202Phone800.002024-03-0108:01:00
1001201Laptop1200.002024-03-0108:05:00
1003203Tablet400.002024-03-0209:00:00
1004201Monitor350.002024-03-0209:30:00
1002202Phone800.002024-03-0108:07:00

Method 1 — COUNT + GROUP BY (Detection Only)

Use this first to understand how many duplicates you have before deciding how to remove them.

sql

SELECT
    order_id,
    COUNT(*) AS occurrence_count
FROM raw_orders
GROUP BY order_id
HAVING COUNT(*) > 1;

Output:

order_idoccurrence_count
10012
10022

This tells you order_id 1001 and 1002 each appear twice. You now know the scope of the problem before writing any removal logic.

Pipeline use case: Run this as a data quality check at the start of your pipeline. If occurrence_count returns any rows, raise an alert or log a warning before proceeding. Many teams put this into a dbt test or an Airflow data quality task.

Limitation: This only detects duplicates — it does not tell you which row to keep.

Method 2 — ROW_NUMBER() Window Function (Detection + Deduplication)

This is the most reliable method for deduplication in production pipelines. It assigns a row number within each group of duplicates, so you can cleanly keep row 1 and discard the rest.

sql

WITH ranked_orders AS (
    SELECT
        *,
        ROW_NUMBER() OVER (
            PARTITION BY order_id
            ORDER BY loaded_at ASC   -- keep the earliest-loaded row
        ) AS row_num
    FROM raw_orders
)
SELECT
    order_id,
    customer_id,
    product,
    amount,
    order_date,
    loaded_at
FROM ranked_orders
WHERE row_num = 1;

Output — 4 clean rows, duplicates removed:

order_idcustomer_idproductamountorder_dateloaded_at
1001201Laptop1200.002024-03-0108:00:00
1002202Phone800.002024-03-0108:01:00
1003203Tablet400.002024-03-0209:00:00
1004201Monitor350.002024-03-0209:30:00

Why ORDER BY loaded_at ASC? In most pipeline scenarios you want the first-arrived record — the original event — not the retry. If your use case requires keeping the most recent update, switch to DESC.

Pipeline use case: Wrap your staging SELECT in this CTE before inserting into your warehouse table. This works identically in PostgreSQL, Snowflake, BigQuery, Redshift, and Spark SQL.

sql

-- Insert deduplicated rows into warehouse
INSERT INTO warehouse.orders
WITH ranked_orders AS (
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY order_id ORDER BY loaded_at ASC) AS row_num
    FROM raw_orders
)
SELECT order_id, customer_id, product, amount, order_date, loaded_at
FROM ranked_orders
WHERE row_num = 1;

Method 3 — QUALIFY (Spark SQL, Snowflake, BigQuery)

If you are working in Databricks, Snowflake, or BigQuery, QUALIFY gives you a cleaner one-step syntax without needing a CTE wrapper.

sql

-- Databricks / Snowflake / BigQuery only
SELECT
    order_id,
    customer_id,
    product,
    amount,
    order_date,
    loaded_at
FROM raw_orders
QUALIFY ROW_NUMBER() OVER (
    PARTITION BY order_id
    ORDER BY loaded_at ASC
) = 1;

Output: Same 4 clean rows as Method 2.

QUALIFY filters on window function results inline — no subquery or CTE needed. This makes it very readable in dbt models and Databricks notebooks.

Note: QUALIFY is not available in PostgreSQL or MySQL. Use the CTE approach from Method 2 in those databases.

Handling Multi-Column Duplicates

Sometimes order_id is not unique on its own — the duplicate is defined by a combination of columns. For example, maybe the same customer ordered the same product on the same date twice.

sql

WITH ranked_orders AS (
    SELECT
        *,
        ROW_NUMBER() OVER (
            PARTITION BY customer_id, product, order_date   -- duplicate defined by these 3 columns
            ORDER BY loaded_at ASC
        ) AS row_num
    FROM raw_orders
)
SELECT order_id, customer_id, product, amount, order_date
FROM ranked_orders
WHERE row_num = 1;

Adjust the PARTITION BY columns to match your business definition of a duplicate.

Common Mistakes

Mistake 1 — Using DISTINCT when columns differ slightly

sql

-- WRONG: loaded_at differs between duplicates, so DISTINCT keeps both rows
SELECT DISTINCT * FROM raw_orders;

DISTINCT only removes rows where every single column is identical. If your duplicates have different timestamps (as is common in event streams), DISTINCT will not remove them.

Mistake 2 — Deleting from the source table directly

sql

-- RISKY in production: deleting from raw/staging tables is destructive
DELETE FROM raw_orders
WHERE order_id IN (SELECT order_id FROM raw_orders GROUP BY order_id HAVING COUNT(*) > 1);

This deletes all copies of a duplicate, including the one you want to keep. Always use a SELECT-based deduplication into a clean table rather than destructive deletes on your staging layer.

Mistake 3 — Wrong PARTITION BY columns

If you PARTITION BY too many columns you may miss genuine duplicates. If you PARTITION BY too few you may incorrectly deduplicate rows that are legitimately different. Define what makes a row a duplicate in business terms before writing the query.

Quick Reference

MethodDatabase SupportBest For
COUNT + GROUP BYAll databasesDetecting and counting duplicates
ROW_NUMBER() CTEAll databasesProduction deduplication, universal
QUALIFYSnowflake, BigQuery, DatabricksCleaner syntax in modern warehouses

What to Learn Next

Once your data is deduplicated, the next challenge is loading it efficiently using upsert patterns — inserting new rows and updating changed ones without full table rewrites. That is covered in the MERGE statement tutorial in this series.

Leave a Comment