SQL MERGE Statement Explained — Upsert Patterns for Data Warehouses and Delta Lake

One of the most common operations in a data pipeline is the upsert — insert new records, update existing ones, and optionally delete removed ones, all in a single atomic operation. The SQL MERGE statement handles all three in one query. Understanding it is essential for anyone building incremental load pipelines in Snowflake, BigQuery, Databricks Delta Lake, or PostgreSQL.

This tutorial covers the full MERGE syntax, practical upsert patterns, and how to handle slowly changing dimensions and soft deletes.

The Dataset

We have a warehouse dimension table that already holds customer data, and an incoming staging table with changes from the source system.

-- Target: existing warehouse table
CREATE TABLE warehouse.customers (
    customer_id   INT PRIMARY KEY,
    name          VARCHAR(100),
    email         VARCHAR(100),
    city          VARCHAR(50),
    status        VARCHAR(20),
    updated_at    TIMESTAMP
);

INSERT INTO warehouse.customers VALUES
(101, 'Alice Smith',  'alice@example.com',  'Dubai',     'active',   '2024-01-01 00:00:00'),
(102, 'Bob Jones',    'bob@example.com',    'Abu Dhabi', 'active',   '2024-01-01 00:00:00'),
(103, 'Carol White',  'carol@example.com',  'Sharjah',   'active',   '2024-01-01 00:00:00');

-- Source: incoming changes from API or CDC feed
CREATE TABLE staging.customers_delta (
    customer_id   INT,
    name          VARCHAR(100),
    email         VARCHAR(100),
    city          VARCHAR(50),
    status        VARCHAR(20),
    updated_at    TIMESTAMP
);

INSERT INTO staging.customers_delta VALUES
(101, 'Alice Smith',   'alice.new@example.com', 'Dubai',     'active',    '2024-03-15 10:00:00'),  -- email changed
(103, 'Carol White',   'carol@example.com',     'Sharjah',   'inactive',  '2024-03-15 10:00:00'),  -- status changed
(104, 'David Brown',   'david@example.com',     'Ajman',     'active',    '2024-03-15 10:00:00');  -- new customer

Current warehouse state (before MERGE):

customer_idnameemailcitystatus
101Alice Smithalice@example.comDubaiactive
102Bob Jonesbob@example.comAbu Dhabiactive
103Carol Whitecarol@example.comSharjahactive

Basic MERGE — Insert and Update

MERGE INTO warehouse.customers AS target
USING staging.customers_delta AS source
    ON target.customer_id = source.customer_id

WHEN MATCHED THEN
    UPDATE SET
        target.name       = source.name,
        target.email      = source.email,
        target.city       = source.city,
        target.status     = source.status,
        target.updated_at = source.updated_at

WHEN NOT MATCHED THEN
    INSERT (customer_id, name, email, city, status, updated_at)
    VALUES (source.customer_id, source.name, source.email,
            source.city, source.status, source.updated_at);

Warehouse state after MERGE:

customer_idnameemailcitystatusupdated_at
101Alice Smithalice.new@example.comDubaiactive2024-03-15
102Bob Jonesbob@example.comAbu Dhabiactive2024-01-01
103Carol Whitecarol@example.comSharjahinactive2024-03-15
104David Browndavid@example.comAjmanactive2024-03-15

What happened:

  • Customer 101: email was updated
  • Customer 102: no change in staging, so it was not touched
  • Customer 103: status was updated to inactive
  • Customer 104: new row was inserted

Adding a DELETE Clause — Hard Deletes

If your source system deletes a record and you want to remove it from the warehouse too, add a WHEN NOT MATCHED BY SOURCE clause.

-- First, add a "deleted" customer to the staging table for demo
INSERT INTO staging.customers_delta VALUES
(102, 'Bob Jones', 'bob@example.com', 'Abu Dhabi', 'deleted', '2024-03-15 11:00:00');

MERGE INTO warehouse.customers AS target
USING staging.customers_delta AS source
    ON target.customer_id = source.customer_id

WHEN MATCHED AND source.status != 'deleted' THEN
    UPDATE SET
        target.name       = source.name,
        target.email      = source.email,
        target.city       = source.city,
        target.status     = source.status,
        target.updated_at = source.updated_at

WHEN MATCHED AND source.status = 'deleted' THEN
    DELETE

WHEN NOT MATCHED THEN
    INSERT (customer_id, name, email, city, status, updated_at)
    VALUES (source.customer_id, source.name, source.email,
            source.city, source.status, source.updated_at);

Customer 102 is now deleted from the warehouse.

Important: Hard deletes are destructive and irreversible. In most warehouse designs, soft deletes (marking a row as deleted rather than removing it) are safer and more auditable.

Soft Delete Pattern — Safer for Warehouses

Instead of deleting rows, mark them with a deleted flag and timestamp. This preserves history and allows you to audit what was removed and when.

MERGE INTO warehouse.customers AS target
USING staging.customers_delta AS source
    ON target.customer_id = source.customer_id

WHEN MATCHED AND source.status = 'deleted' THEN
    UPDATE SET
        target.status     = 'deleted',
        target.updated_at = source.updated_at

WHEN MATCHED AND source.status != 'deleted' THEN
    UPDATE SET
        target.name       = source.name,
        target.email      = source.email,
        target.city       = source.city,
        target.status     = source.status,
        target.updated_at = source.updated_at

WHEN NOT MATCHED THEN
    INSERT (customer_id, name, email, city, status, updated_at)
    VALUES (source.customer_id, source.name, source.email,
            source.city, source.status, source.updated_at);

Your downstream queries then filter with WHERE status != 'deleted' to get active records.

Conditional Update — Only Update When Something Actually Changed

Updating rows that have not actually changed wastes compute and inflates your updated_at timestamp unnecessarily. Add a condition to WHEN MATCHED to only fire the update when values differ.

MERGE INTO warehouse.customers AS target
USING staging.customers_delta AS source
    ON target.customer_id = source.customer_id

WHEN MATCHED AND (
    target.email  != source.email  OR
    target.city   != source.city   OR
    target.status != source.status
) THEN
    UPDATE SET
        target.email      = source.email,
        target.city       = source.city,
        target.status     = source.status,
        target.updated_at = source.updated_at

WHEN NOT MATCHED THEN
    INSERT (customer_id, name, email, city, status, updated_at)
    VALUES (source.customer_id, source.name, source.email,
            source.city, source.status, source.updated_at);

Customer 102 now remains completely untouched because none of its columns changed in the source. This is especially important in large Snowflake or BigQuery tables where unnecessary writes are billed.

MERGE in Databricks Delta Lake

Delta Lake’s MERGE syntax is identical to standard SQL but has some important performance considerations.

# PySpark Delta Lake MERGE
from delta.tables import DeltaTable

target = DeltaTable.forName(spark, "warehouse.customers")

target.alias("target").merge(
    source=spark.table("staging.customers_delta").alias("source"),
    condition="target.customer_id = source.customer_id"
).whenMatchedUpdate(
    condition="target.email != source.email OR target.status != source.status",
    set={
        "email":      "source.email",
        "status":     "source.status",
        "updated_at": "source.updated_at"
    }
).whenNotMatchedInsert(
    values={
        "customer_id": "source.customer_id",
        "name":        "source.name",
        "email":       "source.email",
        "city":        "source.city",
        "status":      "source.status",
        "updated_at":  "source.updated_at"
    }
).execute()

Or in Spark SQL:

MERGE INTO warehouse.customers AS target
USING staging.customers_delta AS source
    ON target.customer_id = source.customer_id
WHEN MATCHED AND (target.email != source.email OR target.status != source.status) THEN
    UPDATE SET target.email = source.email, target.status = source.status,
               target.updated_at = source.updated_at
WHEN NOT MATCHED THEN
    INSERT *;

INSERT * is a Delta Lake shorthand that inserts all source columns — useful when the source and target have the same schema.

Delta Lake performance tip: Partition your target table on a date column before running large MERGE operations. Delta will only scan the relevant partitions instead of the entire table.

Common Mistakes

Mistake 1 — Duplicate rows in the source table

-- If staging.customers_delta has two rows with customer_id = 101,
-- the MERGE will fail with a non-deterministic error in most databases.
-- Always deduplicate your source before merging:

MERGE INTO warehouse.customers AS target
USING (
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY updated_at DESC) AS rn
    FROM staging.customers_delta
    QUALIFY rn = 1   -- or use a CTE for PostgreSQL
) AS source
    ON target.customer_id = source.customer_id
...

Mistake 2 — Joining on a nullable column

If customer_id can be NULL in your staging table, ON target.customer_id = source.customer_id will never match NULL to NULL (because NULL != NULL in SQL). Use ON target.customer_id IS NOT DISTINCT FROM source.customer_id in PostgreSQL, or filter NULLs out of your source first.

Mistake 3 — MERGE without a transaction wrapper

In databases that support it, wrap MERGE in an explicit transaction so a partial failure does not leave your target table in an inconsistent state.

Quick Reference — MERGE Clause Behaviour

ClauseConditionAction
WHEN MATCHEDRow exists in both target and sourceUPDATE or DELETE
WHEN NOT MATCHEDRow in source but not in targetINSERT
WHEN NOT MATCHED BY SOURCERow in target but not in sourceUPDATE or DELETE (SQL Server, BigQuery)

What to Learn Next

MERGE is the foundation of Slowly Changing Dimension (SCD) Type 2 pipelines — where you need to preserve the full history of changes rather than just the current state. The SCD tutorial in this series builds directly on the MERGE pattern you learned here.

Leave a Comment