One of the most common operations in a data pipeline is the upsert — insert new records, update existing ones, and optionally delete removed ones, all in a single atomic operation. The SQL MERGE statement handles all three in one query. Understanding it is essential for anyone building incremental load pipelines in Snowflake, BigQuery, Databricks Delta Lake, or PostgreSQL.
This tutorial covers the full MERGE syntax, practical upsert patterns, and how to handle slowly changing dimensions and soft deletes.
The Dataset
We have a warehouse dimension table that already holds customer data, and an incoming staging table with changes from the source system.
-- Target: existing warehouse table
CREATE TABLE warehouse.customers (
customer_id INT PRIMARY KEY,
name VARCHAR(100),
email VARCHAR(100),
city VARCHAR(50),
status VARCHAR(20),
updated_at TIMESTAMP
);
INSERT INTO warehouse.customers VALUES
(101, 'Alice Smith', 'alice@example.com', 'Dubai', 'active', '2024-01-01 00:00:00'),
(102, 'Bob Jones', 'bob@example.com', 'Abu Dhabi', 'active', '2024-01-01 00:00:00'),
(103, 'Carol White', 'carol@example.com', 'Sharjah', 'active', '2024-01-01 00:00:00');
-- Source: incoming changes from API or CDC feed
CREATE TABLE staging.customers_delta (
customer_id INT,
name VARCHAR(100),
email VARCHAR(100),
city VARCHAR(50),
status VARCHAR(20),
updated_at TIMESTAMP
);
INSERT INTO staging.customers_delta VALUES
(101, 'Alice Smith', 'alice.new@example.com', 'Dubai', 'active', '2024-03-15 10:00:00'), -- email changed
(103, 'Carol White', 'carol@example.com', 'Sharjah', 'inactive', '2024-03-15 10:00:00'), -- status changed
(104, 'David Brown', 'david@example.com', 'Ajman', 'active', '2024-03-15 10:00:00'); -- new customer
Current warehouse state (before MERGE):
| customer_id | name | city | status | |
|---|---|---|---|---|
| 101 | Alice Smith | alice@example.com | Dubai | active |
| 102 | Bob Jones | bob@example.com | Abu Dhabi | active |
| 103 | Carol White | carol@example.com | Sharjah | active |
Basic MERGE — Insert and Update
MERGE INTO warehouse.customers AS target
USING staging.customers_delta AS source
ON target.customer_id = source.customer_id
WHEN MATCHED THEN
UPDATE SET
target.name = source.name,
target.email = source.email,
target.city = source.city,
target.status = source.status,
target.updated_at = source.updated_at
WHEN NOT MATCHED THEN
INSERT (customer_id, name, email, city, status, updated_at)
VALUES (source.customer_id, source.name, source.email,
source.city, source.status, source.updated_at);
Warehouse state after MERGE:
| customer_id | name | city | status | updated_at | |
|---|---|---|---|---|---|
| 101 | Alice Smith | alice.new@example.com | Dubai | active | 2024-03-15 |
| 102 | Bob Jones | bob@example.com | Abu Dhabi | active | 2024-01-01 |
| 103 | Carol White | carol@example.com | Sharjah | inactive | 2024-03-15 |
| 104 | David Brown | david@example.com | Ajman | active | 2024-03-15 |
What happened:
- Customer 101: email was updated
- Customer 102: no change in staging, so it was not touched
- Customer 103: status was updated to inactive
- Customer 104: new row was inserted
Adding a DELETE Clause — Hard Deletes
If your source system deletes a record and you want to remove it from the warehouse too, add a WHEN NOT MATCHED BY SOURCE clause.
-- First, add a "deleted" customer to the staging table for demo
INSERT INTO staging.customers_delta VALUES
(102, 'Bob Jones', 'bob@example.com', 'Abu Dhabi', 'deleted', '2024-03-15 11:00:00');
MERGE INTO warehouse.customers AS target
USING staging.customers_delta AS source
ON target.customer_id = source.customer_id
WHEN MATCHED AND source.status != 'deleted' THEN
UPDATE SET
target.name = source.name,
target.email = source.email,
target.city = source.city,
target.status = source.status,
target.updated_at = source.updated_at
WHEN MATCHED AND source.status = 'deleted' THEN
DELETE
WHEN NOT MATCHED THEN
INSERT (customer_id, name, email, city, status, updated_at)
VALUES (source.customer_id, source.name, source.email,
source.city, source.status, source.updated_at);
Customer 102 is now deleted from the warehouse.
Important: Hard deletes are destructive and irreversible. In most warehouse designs, soft deletes (marking a row as deleted rather than removing it) are safer and more auditable.
Soft Delete Pattern — Safer for Warehouses
Instead of deleting rows, mark them with a deleted flag and timestamp. This preserves history and allows you to audit what was removed and when.
MERGE INTO warehouse.customers AS target
USING staging.customers_delta AS source
ON target.customer_id = source.customer_id
WHEN MATCHED AND source.status = 'deleted' THEN
UPDATE SET
target.status = 'deleted',
target.updated_at = source.updated_at
WHEN MATCHED AND source.status != 'deleted' THEN
UPDATE SET
target.name = source.name,
target.email = source.email,
target.city = source.city,
target.status = source.status,
target.updated_at = source.updated_at
WHEN NOT MATCHED THEN
INSERT (customer_id, name, email, city, status, updated_at)
VALUES (source.customer_id, source.name, source.email,
source.city, source.status, source.updated_at);
Your downstream queries then filter with WHERE status != 'deleted' to get active records.
Conditional Update — Only Update When Something Actually Changed
Updating rows that have not actually changed wastes compute and inflates your updated_at timestamp unnecessarily. Add a condition to WHEN MATCHED to only fire the update when values differ.
MERGE INTO warehouse.customers AS target
USING staging.customers_delta AS source
ON target.customer_id = source.customer_id
WHEN MATCHED AND (
target.email != source.email OR
target.city != source.city OR
target.status != source.status
) THEN
UPDATE SET
target.email = source.email,
target.city = source.city,
target.status = source.status,
target.updated_at = source.updated_at
WHEN NOT MATCHED THEN
INSERT (customer_id, name, email, city, status, updated_at)
VALUES (source.customer_id, source.name, source.email,
source.city, source.status, source.updated_at);
Customer 102 now remains completely untouched because none of its columns changed in the source. This is especially important in large Snowflake or BigQuery tables where unnecessary writes are billed.
MERGE in Databricks Delta Lake
Delta Lake’s MERGE syntax is identical to standard SQL but has some important performance considerations.
# PySpark Delta Lake MERGE
from delta.tables import DeltaTable
target = DeltaTable.forName(spark, "warehouse.customers")
target.alias("target").merge(
source=spark.table("staging.customers_delta").alias("source"),
condition="target.customer_id = source.customer_id"
).whenMatchedUpdate(
condition="target.email != source.email OR target.status != source.status",
set={
"email": "source.email",
"status": "source.status",
"updated_at": "source.updated_at"
}
).whenNotMatchedInsert(
values={
"customer_id": "source.customer_id",
"name": "source.name",
"email": "source.email",
"city": "source.city",
"status": "source.status",
"updated_at": "source.updated_at"
}
).execute()
Or in Spark SQL:
MERGE INTO warehouse.customers AS target
USING staging.customers_delta AS source
ON target.customer_id = source.customer_id
WHEN MATCHED AND (target.email != source.email OR target.status != source.status) THEN
UPDATE SET target.email = source.email, target.status = source.status,
target.updated_at = source.updated_at
WHEN NOT MATCHED THEN
INSERT *;
INSERT * is a Delta Lake shorthand that inserts all source columns — useful when the source and target have the same schema.
Delta Lake performance tip: Partition your target table on a date column before running large MERGE operations. Delta will only scan the relevant partitions instead of the entire table.
Common Mistakes
Mistake 1 — Duplicate rows in the source table
-- If staging.customers_delta has two rows with customer_id = 101,
-- the MERGE will fail with a non-deterministic error in most databases.
-- Always deduplicate your source before merging:
MERGE INTO warehouse.customers AS target
USING (
SELECT *,
ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY updated_at DESC) AS rn
FROM staging.customers_delta
QUALIFY rn = 1 -- or use a CTE for PostgreSQL
) AS source
ON target.customer_id = source.customer_id
...
Mistake 2 — Joining on a nullable column
If customer_id can be NULL in your staging table, ON target.customer_id = source.customer_id will never match NULL to NULL (because NULL != NULL in SQL). Use ON target.customer_id IS NOT DISTINCT FROM source.customer_id in PostgreSQL, or filter NULLs out of your source first.
Mistake 3 — MERGE without a transaction wrapper
In databases that support it, wrap MERGE in an explicit transaction so a partial failure does not leave your target table in an inconsistent state.
Quick Reference — MERGE Clause Behaviour
| Clause | Condition | Action |
|---|---|---|
| WHEN MATCHED | Row exists in both target and source | UPDATE or DELETE |
| WHEN NOT MATCHED | Row in source but not in target | INSERT |
| WHEN NOT MATCHED BY SOURCE | Row in target but not in source | UPDATE or DELETE (SQL Server, BigQuery) |
What to Learn Next
MERGE is the foundation of Slowly Changing Dimension (SCD) Type 2 pipelines — where you need to preserve the full history of changes rather than just the current state. The SCD tutorial in this series builds directly on the MERGE pattern you learned here.