When you are building a data pipeline, duplicate records are one of the most common and damaging data quality problems. A single duplicate row can inflate revenue numbers, double-count customers, or break a UNIQUE constraint on your warehouse table and halt your entire pipeline load.
This tutorial walks through three practical SQL methods to detect and remove duplicates before your data lands in a warehouse. Every method includes full working code and real output so you can apply it directly in your pipeline.
The Dataset
We will use an orders table that simulates raw data arriving from an API or event stream. Notice that order_id 1001 appears twice — a classic duplicate from a retry or double-event scenario.
sql
CREATE TABLE raw_orders (
order_id INT,
customer_id INT,
product VARCHAR(50),
amount DECIMAL(10,2),
order_date DATE,
loaded_at TIMESTAMP
);
INSERT INTO raw_orders VALUES
(1001, 201, 'Laptop', 1200.00, '2024-03-01', '2024-03-01 08:00:00'),
(1002, 202, 'Phone', 800.00, '2024-03-01', '2024-03-01 08:01:00'),
(1001, 201, 'Laptop', 1200.00, '2024-03-01', '2024-03-01 08:05:00'), -- duplicate
(1003, 203, 'Tablet', 400.00, '2024-03-02', '2024-03-02 09:00:00'),
(1004, 201, 'Monitor', 350.00, '2024-03-02', '2024-03-02 09:30:00'),
(1002, 202, 'Phone', 800.00, '2024-03-01', '2024-03-01 08:07:00'); -- duplicate
Raw data — 6 rows, but 2 are duplicates:
| order_id | customer_id | product | amount | order_date | loaded_at |
|---|---|---|---|---|---|
| 1001 | 201 | Laptop | 1200.00 | 2024-03-01 | 08:00:00 |
| 1002 | 202 | Phone | 800.00 | 2024-03-01 | 08:01:00 |
| 1001 | 201 | Laptop | 1200.00 | 2024-03-01 | 08:05:00 |
| 1003 | 203 | Tablet | 400.00 | 2024-03-02 | 09:00:00 |
| 1004 | 201 | Monitor | 350.00 | 2024-03-02 | 09:30:00 |
| 1002 | 202 | Phone | 800.00 | 2024-03-01 | 08:07:00 |
Method 1 — COUNT + GROUP BY (Detection Only)
Use this first to understand how many duplicates you have before deciding how to remove them.
sql
SELECT
order_id,
COUNT(*) AS occurrence_count
FROM raw_orders
GROUP BY order_id
HAVING COUNT(*) > 1;
Output:
| order_id | occurrence_count |
|---|---|
| 1001 | 2 |
| 1002 | 2 |
This tells you order_id 1001 and 1002 each appear twice. You now know the scope of the problem before writing any removal logic.
Pipeline use case: Run this as a data quality check at the start of your pipeline. If occurrence_count returns any rows, raise an alert or log a warning before proceeding. Many teams put this into a dbt test or an Airflow data quality task.
Limitation: This only detects duplicates — it does not tell you which row to keep.
Method 2 — ROW_NUMBER() Window Function (Detection + Deduplication)
This is the most reliable method for deduplication in production pipelines. It assigns a row number within each group of duplicates, so you can cleanly keep row 1 and discard the rest.
sql
WITH ranked_orders AS (
SELECT
*,
ROW_NUMBER() OVER (
PARTITION BY order_id
ORDER BY loaded_at ASC -- keep the earliest-loaded row
) AS row_num
FROM raw_orders
)
SELECT
order_id,
customer_id,
product,
amount,
order_date,
loaded_at
FROM ranked_orders
WHERE row_num = 1;
Output — 4 clean rows, duplicates removed:
| order_id | customer_id | product | amount | order_date | loaded_at |
|---|---|---|---|---|---|
| 1001 | 201 | Laptop | 1200.00 | 2024-03-01 | 08:00:00 |
| 1002 | 202 | Phone | 800.00 | 2024-03-01 | 08:01:00 |
| 1003 | 203 | Tablet | 400.00 | 2024-03-02 | 09:00:00 |
| 1004 | 201 | Monitor | 350.00 | 2024-03-02 | 09:30:00 |
Why ORDER BY loaded_at ASC? In most pipeline scenarios you want the first-arrived record — the original event — not the retry. If your use case requires keeping the most recent update, switch to DESC.
Pipeline use case: Wrap your staging SELECT in this CTE before inserting into your warehouse table. This works identically in PostgreSQL, Snowflake, BigQuery, Redshift, and Spark SQL.
sql
-- Insert deduplicated rows into warehouse
INSERT INTO warehouse.orders
WITH ranked_orders AS (
SELECT *,
ROW_NUMBER() OVER (PARTITION BY order_id ORDER BY loaded_at ASC) AS row_num
FROM raw_orders
)
SELECT order_id, customer_id, product, amount, order_date, loaded_at
FROM ranked_orders
WHERE row_num = 1;
Method 3 — QUALIFY (Spark SQL, Snowflake, BigQuery)
If you are working in Databricks, Snowflake, or BigQuery, QUALIFY gives you a cleaner one-step syntax without needing a CTE wrapper.
sql
-- Databricks / Snowflake / BigQuery only
SELECT
order_id,
customer_id,
product,
amount,
order_date,
loaded_at
FROM raw_orders
QUALIFY ROW_NUMBER() OVER (
PARTITION BY order_id
ORDER BY loaded_at ASC
) = 1;
Output: Same 4 clean rows as Method 2.
QUALIFY filters on window function results inline — no subquery or CTE needed. This makes it very readable in dbt models and Databricks notebooks.
Note: QUALIFY is not available in PostgreSQL or MySQL. Use the CTE approach from Method 2 in those databases.
Handling Multi-Column Duplicates
Sometimes order_id is not unique on its own — the duplicate is defined by a combination of columns. For example, maybe the same customer ordered the same product on the same date twice.
sql
WITH ranked_orders AS (
SELECT
*,
ROW_NUMBER() OVER (
PARTITION BY customer_id, product, order_date -- duplicate defined by these 3 columns
ORDER BY loaded_at ASC
) AS row_num
FROM raw_orders
)
SELECT order_id, customer_id, product, amount, order_date
FROM ranked_orders
WHERE row_num = 1;
Adjust the PARTITION BY columns to match your business definition of a duplicate.
Common Mistakes
Mistake 1 — Using DISTINCT when columns differ slightly
sql
-- WRONG: loaded_at differs between duplicates, so DISTINCT keeps both rows
SELECT DISTINCT * FROM raw_orders;
DISTINCT only removes rows where every single column is identical. If your duplicates have different timestamps (as is common in event streams), DISTINCT will not remove them.
Mistake 2 — Deleting from the source table directly
sql
-- RISKY in production: deleting from raw/staging tables is destructive
DELETE FROM raw_orders
WHERE order_id IN (SELECT order_id FROM raw_orders GROUP BY order_id HAVING COUNT(*) > 1);
This deletes all copies of a duplicate, including the one you want to keep. Always use a SELECT-based deduplication into a clean table rather than destructive deletes on your staging layer.
Mistake 3 — Wrong PARTITION BY columns
If you PARTITION BY too many columns you may miss genuine duplicates. If you PARTITION BY too few you may incorrectly deduplicate rows that are legitimately different. Define what makes a row a duplicate in business terms before writing the query.
Quick Reference
| Method | Database Support | Best For |
|---|---|---|
| COUNT + GROUP BY | All databases | Detecting and counting duplicates |
| ROW_NUMBER() CTE | All databases | Production deduplication, universal |
| QUALIFY | Snowflake, BigQuery, Databricks | Cleaner syntax in modern warehouses |
What to Learn Next
Once your data is deduplicated, the next challenge is loading it efficiently using upsert patterns — inserting new rows and updating changed ones without full table rewrites. That is covered in the MERGE statement tutorial in this series.