SQL for Slowly Changing Dimensions — SCD Type 1, 2, and 3 With Full Working Code

Slowly Changing Dimensions (SCD) solve one of the most fundamental questions in data warehousing: when a customer changes their address, do you overwrite the old address or keep it? The answer depends on whether your business needs to report on current state only, or historical state at any point in time.

This tutorial walks through SCD Type 1, 2, and 3 with complete SQL — including the MERGE statements needed to implement each type in a production pipeline.

The Dataset

A customer dimension table is the classic SCD example. Customers change their city, email, or subscription tier over time.

CREATE TABLE dim_customers (
    customer_id      INT,
    name             VARCHAR(100),
    email            VARCHAR(100),
    city             VARCHAR(50),
    subscription     VARCHAR(20),
    -- SCD Type 2 columns (added later)
    surrogate_key    BIGINT,
    valid_from       DATE,
    valid_to         DATE,
    is_current       BOOLEAN
);

-- Initial state of the dimension
INSERT INTO dim_customers (customer_id, name, email, city, subscription) VALUES
(101, 'Alice Smith', 'alice@example.com', 'Dubai',     'Basic'),
(102, 'Bob Jones',   'bob@example.com',   'Abu Dhabi', 'Pro'),
(103, 'Carol White', 'carol@example.com', 'Sharjah',   'Basic');

Incoming change from the source system:

CREATE TABLE staging.customer_changes (
    customer_id  INT,
    name         VARCHAR(100),
    email        VARCHAR(100),
    city         VARCHAR(50),
    subscription VARCHAR(20),
    changed_at   DATE
);

INSERT INTO staging.customer_changes VALUES
(101, 'Alice Smith',  'alice.new@example.com', 'Dubai',     'Pro',   '2024-03-15'),  -- email + subscription changed
(102, 'Bob Jones',    'bob@example.com',       'Dubai',     'Pro',   '2024-03-15'),  -- city changed
(104, 'David Brown',  'david@example.com',     'Ajman',     'Basic', '2024-03-15');  -- new customer

SCD Type 1 — Overwrite (No History)

Type 1 is the simplest approach: when a value changes, overwrite it. No history is kept. The dimension always reflects the current state.

Use when: History does not matter for your reporting. For example, fixing a typo in a customer’s name, or updating a product’s current price.

MERGE INTO dim_customers AS target
USING staging.customer_changes AS source
    ON target.customer_id = source.customer_id

WHEN MATCHED THEN
    UPDATE SET
        target.name         = source.name,
        target.email        = source.email,
        target.city         = source.city,
        target.subscription = source.subscription

WHEN NOT MATCHED THEN
    INSERT (customer_id, name, email, city, subscription)
    VALUES (source.customer_id, source.name, source.email,
            source.city, source.subscription);

Before:

customer_idnameemailcitysubscription
101Alice Smithalice@example.comDubaiBasic
102Bob Jonesbob@example.comAbu DhabiPro

After Type 1 MERGE:

customer_idnameemailcitysubscription
101Alice Smithalice.new@example.comDubaiPro
102Bob Jonesbob@example.comDubaiPro
104David Browndavid@example.comAjmanBasic

Alice’s old email (alice@example.com) and old subscription (Basic) are gone forever. Bob’s old city (Abu Dhabi) is gone. If a sales report asks “how many customers were in Abu Dhabi last quarter?”, Type 1 cannot answer that — the data no longer exists.

SCD Type 2 — Full History (Most Common in Warehouses)

Type 2 keeps every version of a record. Instead of overwriting, you expire the old row and insert a new one. Each row has a valid_from, valid_to, and is_current flag.

Use when: You need to report on historical state. “What was the customer’s subscription tier when they made this purchase?” Type 2 can answer this. Type 1 cannot.

Setting up Type 2 columns

-- Add Type 2 columns to your dimension table
ALTER TABLE dim_customers ADD COLUMN surrogate_key BIGINT;
ALTER TABLE dim_customers ADD COLUMN valid_from    DATE;
ALTER TABLE dim_customers ADD COLUMN valid_to      DATE;
ALTER TABLE dim_customers ADD COLUMN is_current    BOOLEAN DEFAULT TRUE;

-- Set initial values
UPDATE dim_customers SET
    surrogate_key = customer_id * 100 + 1,
    valid_from    = '2024-01-01',
    valid_to      = '9999-12-31',
    is_current    = TRUE;

Type 2 MERGE — Expire old row, insert new row

Standard SQL MERGE cannot insert and update in the same matched clause, so Type 2 typically requires two steps.

Step 1 — Expire changed rows:

UPDATE dim_customers AS target
SET
    valid_to   = source.changed_at - INTERVAL '1 day',
    is_current = FALSE
FROM staging.customer_changes AS source
WHERE target.customer_id = source.customer_id
  AND target.is_current = TRUE
  AND (
      target.email        != source.email        OR
      target.city         != source.city         OR
      target.subscription != source.subscription
  );

Step 2 — Insert new current rows:

INSERT INTO dim_customers (
    customer_id, name, email, city, subscription,
    surrogate_key, valid_from, valid_to, is_current
)
SELECT
    source.customer_id,
    source.name,
    source.email,
    source.city,
    source.subscription,
    -- Generate new surrogate key
    source.customer_id * 100 + (
        SELECT COUNT(*) + 1
        FROM dim_customers d
        WHERE d.customer_id = source.customer_id
    ),
    source.changed_at,
    '9999-12-31',
    TRUE
FROM staging.customer_changes source
WHERE EXISTS (
    SELECT 1 FROM dim_customers target
    WHERE target.customer_id = source.customer_id
      AND target.is_current = FALSE   -- only if we just expired a row
      AND target.valid_to = source.changed_at - INTERVAL '1 day'
)
OR NOT EXISTS (
    SELECT 1 FROM dim_customers WHERE customer_id = source.customer_id
);

Dimension table after Type 2 processing:

customer_idnameemailcitysubscriptionvalid_fromvalid_tois_current
101Alice Smithalice@example.comDubaiBasic2024-01-012024-03-14FALSE
101Alice Smithalice.new@example.comDubaiPro2024-03-159999-12-31TRUE
102Bob Jonesbob@example.comAbu DhabiPro2024-01-012024-03-14FALSE
102Bob Jonesbob@example.comDubaiPro2024-03-159999-12-31TRUE
103Carol Whitecarol@example.comSharjahBasic2024-01-019999-12-31TRUE
104David Browndavid@example.comAjmanBasic2024-03-159999-12-31TRUE

Now you can answer historical questions:

-- What subscription did Alice have on March 10th, 2024?
SELECT subscription
FROM dim_customers
WHERE customer_id = 101
  AND '2024-03-10' BETWEEN valid_from AND valid_to;
-- Returns: 'Basic'

-- What subscription does Alice have today?
SELECT subscription
FROM dim_customers
WHERE customer_id = 101
  AND is_current = TRUE;
-- Returns: 'Pro'

Joining a fact table to a Type 2 dimension

This is the key join pattern in a Type 2 dimensional model. You join the fact’s event date to the dimension’s validity range.

SELECT
    f.order_id,
    f.order_date,
    f.amount,
    c.subscription   AS subscription_at_time_of_order,
    c.city           AS city_at_time_of_order
FROM fact_orders f
JOIN dim_customers c
    ON f.customer_id = c.customer_id
   AND f.order_date BETWEEN c.valid_from AND c.valid_to;

This returns Alice’s subscription as Basic for orders placed before March 15th, and Pro for orders placed after — even though the current dimension row says Pro.

SCD Type 3 — Keep One Previous Version Only

Type 3 stores the previous value alongside the current value in the same row. It is a middle ground — you keep one historical version, but not full history.

Use when: You only care about “what it was before the most recent change,” not the full history. Common for slowly changing attributes like region or tier, where one-level-back comparison is sufficient.

ALTER TABLE dim_customers ADD COLUMN prev_city         VARCHAR(50);
ALTER TABLE dim_customers ADD COLUMN prev_subscription VARCHAR(20);
ALTER TABLE dim_customers ADD COLUMN changed_at        DATE;

MERGE INTO dim_customers AS target
USING staging.customer_changes AS source
    ON target.customer_id = source.customer_id

WHEN MATCHED THEN
    UPDATE SET
        target.prev_city         = target.city,           -- save current as previous
        target.prev_subscription = target.subscription,   -- save current as previous
        target.city              = source.city,           -- overwrite with new
        target.subscription      = source.subscription,   -- overwrite with new
        target.email             = source.email,
        target.changed_at        = source.changed_at

WHEN NOT MATCHED THEN
    INSERT (customer_id, name, email, city, subscription, changed_at)
    VALUES (source.customer_id, source.name, source.email,
            source.city, source.subscription, source.changed_at);

After Type 3 processing:

customer_idnamecityprev_citysubscriptionprev_subscription
101Alice SmithDubaiDubaiProBasic
102Bob JonesDubaiAbu DhabiProPro

You can now answer “where was Bob before his most recent city change?” but you cannot answer where he was two changes ago — that history is overwritten.

Which SCD Type to Use — Decision Guide

QuestionType 1Type 2Type 3
Do you need historical reporting?NoYesLimited (1 version back)
Can you answer “what was the value on date X?”NoYesOnly for the most recent change
Storage costLowHighMedium
Query complexitySimpleComplex (range joins)Medium
Common use casesError corrections, current-state reportingSubscription history, address historyComparing current vs previous tier

SCD Type 2 in Databricks Delta Lake

Delta Lake simplifies Type 2 because Delta itself maintains a transaction log. You can use table_changes() to read the before/after state of updates and construct your history table automatically.

-- Read all updates to source.customers since version 10
SELECT
    customer_id,
    email,
    city,
    subscription,
    _change_type,
    _commit_timestamp::DATE AS change_date
FROM table_changes('source.customers', 10)
WHERE _change_type IN ('update_postimage', 'insert')
ORDER BY customer_id, _commit_timestamp;

This gives you the complete change history from the Delta log, which you then load into your Type 2 dimension using the MERGE pattern above.

Common Mistakes

Mistake 1 — Using customer_id as the join key in a Type 2 dimension

A customer with two history rows has the same customer_id for both. Join your fact table to the dimension using the validity date range, not just customer_id.

Mistake 2 — Forgetting to filter is_current = TRUE in general queries

Without this filter, a query on the dimension returns all historical rows for each customer and inflates your counts dramatically.

Mistake 3 — Not handling the gap between valid_to and valid_from

When you expire a row on March 15th (valid_to = 2024-03-14) and insert a new row starting March 15th (valid_from = 2024-03-15), there is no gap. But if you expire on the 15th and insert on the 16th, March 15th has no matching row. Be precise with your date arithmetic.

What to Learn Next

Type 2 dimensions generate many rows over time and can grow very large. The next tutorial covers SQL data quality checks — the validation queries you run before loading data into your Silver and Gold layers to prevent bad data from corrupting your dimension tables.

Leave a Comment