Slowly Changing Dimensions (SCD) solve one of the most fundamental questions in data warehousing: when a customer changes their address, do you overwrite the old address or keep it? The answer depends on whether your business needs to report on current state only, or historical state at any point in time.
This tutorial walks through SCD Type 1, 2, and 3 with complete SQL — including the MERGE statements needed to implement each type in a production pipeline.
The Dataset
A customer dimension table is the classic SCD example. Customers change their city, email, or subscription tier over time.
CREATE TABLE dim_customers (
customer_id INT,
name VARCHAR(100),
email VARCHAR(100),
city VARCHAR(50),
subscription VARCHAR(20),
-- SCD Type 2 columns (added later)
surrogate_key BIGINT,
valid_from DATE,
valid_to DATE,
is_current BOOLEAN
);
-- Initial state of the dimension
INSERT INTO dim_customers (customer_id, name, email, city, subscription) VALUES
(101, 'Alice Smith', 'alice@example.com', 'Dubai', 'Basic'),
(102, 'Bob Jones', 'bob@example.com', 'Abu Dhabi', 'Pro'),
(103, 'Carol White', 'carol@example.com', 'Sharjah', 'Basic');
Incoming change from the source system:
CREATE TABLE staging.customer_changes (
customer_id INT,
name VARCHAR(100),
email VARCHAR(100),
city VARCHAR(50),
subscription VARCHAR(20),
changed_at DATE
);
INSERT INTO staging.customer_changes VALUES
(101, 'Alice Smith', 'alice.new@example.com', 'Dubai', 'Pro', '2024-03-15'), -- email + subscription changed
(102, 'Bob Jones', 'bob@example.com', 'Dubai', 'Pro', '2024-03-15'), -- city changed
(104, 'David Brown', 'david@example.com', 'Ajman', 'Basic', '2024-03-15'); -- new customer
SCD Type 1 — Overwrite (No History)
Type 1 is the simplest approach: when a value changes, overwrite it. No history is kept. The dimension always reflects the current state.
Use when: History does not matter for your reporting. For example, fixing a typo in a customer’s name, or updating a product’s current price.
MERGE INTO dim_customers AS target
USING staging.customer_changes AS source
ON target.customer_id = source.customer_id
WHEN MATCHED THEN
UPDATE SET
target.name = source.name,
target.email = source.email,
target.city = source.city,
target.subscription = source.subscription
WHEN NOT MATCHED THEN
INSERT (customer_id, name, email, city, subscription)
VALUES (source.customer_id, source.name, source.email,
source.city, source.subscription);
Before:
| customer_id | name | city | subscription | |
|---|---|---|---|---|
| 101 | Alice Smith | alice@example.com | Dubai | Basic |
| 102 | Bob Jones | bob@example.com | Abu Dhabi | Pro |
After Type 1 MERGE:
| customer_id | name | city | subscription | |
|---|---|---|---|---|
| 101 | Alice Smith | alice.new@example.com | Dubai | Pro |
| 102 | Bob Jones | bob@example.com | Dubai | Pro |
| 104 | David Brown | david@example.com | Ajman | Basic |
Alice’s old email (alice@example.com) and old subscription (Basic) are gone forever. Bob’s old city (Abu Dhabi) is gone. If a sales report asks “how many customers were in Abu Dhabi last quarter?”, Type 1 cannot answer that — the data no longer exists.
SCD Type 2 — Full History (Most Common in Warehouses)
Type 2 keeps every version of a record. Instead of overwriting, you expire the old row and insert a new one. Each row has a valid_from, valid_to, and is_current flag.
Use when: You need to report on historical state. “What was the customer’s subscription tier when they made this purchase?” Type 2 can answer this. Type 1 cannot.
Setting up Type 2 columns
-- Add Type 2 columns to your dimension table
ALTER TABLE dim_customers ADD COLUMN surrogate_key BIGINT;
ALTER TABLE dim_customers ADD COLUMN valid_from DATE;
ALTER TABLE dim_customers ADD COLUMN valid_to DATE;
ALTER TABLE dim_customers ADD COLUMN is_current BOOLEAN DEFAULT TRUE;
-- Set initial values
UPDATE dim_customers SET
surrogate_key = customer_id * 100 + 1,
valid_from = '2024-01-01',
valid_to = '9999-12-31',
is_current = TRUE;
Type 2 MERGE — Expire old row, insert new row
Standard SQL MERGE cannot insert and update in the same matched clause, so Type 2 typically requires two steps.
Step 1 — Expire changed rows:
UPDATE dim_customers AS target
SET
valid_to = source.changed_at - INTERVAL '1 day',
is_current = FALSE
FROM staging.customer_changes AS source
WHERE target.customer_id = source.customer_id
AND target.is_current = TRUE
AND (
target.email != source.email OR
target.city != source.city OR
target.subscription != source.subscription
);
Step 2 — Insert new current rows:
INSERT INTO dim_customers (
customer_id, name, email, city, subscription,
surrogate_key, valid_from, valid_to, is_current
)
SELECT
source.customer_id,
source.name,
source.email,
source.city,
source.subscription,
-- Generate new surrogate key
source.customer_id * 100 + (
SELECT COUNT(*) + 1
FROM dim_customers d
WHERE d.customer_id = source.customer_id
),
source.changed_at,
'9999-12-31',
TRUE
FROM staging.customer_changes source
WHERE EXISTS (
SELECT 1 FROM dim_customers target
WHERE target.customer_id = source.customer_id
AND target.is_current = FALSE -- only if we just expired a row
AND target.valid_to = source.changed_at - INTERVAL '1 day'
)
OR NOT EXISTS (
SELECT 1 FROM dim_customers WHERE customer_id = source.customer_id
);
Dimension table after Type 2 processing:
| customer_id | name | city | subscription | valid_from | valid_to | is_current | |
|---|---|---|---|---|---|---|---|
| 101 | Alice Smith | alice@example.com | Dubai | Basic | 2024-01-01 | 2024-03-14 | FALSE |
| 101 | Alice Smith | alice.new@example.com | Dubai | Pro | 2024-03-15 | 9999-12-31 | TRUE |
| 102 | Bob Jones | bob@example.com | Abu Dhabi | Pro | 2024-01-01 | 2024-03-14 | FALSE |
| 102 | Bob Jones | bob@example.com | Dubai | Pro | 2024-03-15 | 9999-12-31 | TRUE |
| 103 | Carol White | carol@example.com | Sharjah | Basic | 2024-01-01 | 9999-12-31 | TRUE |
| 104 | David Brown | david@example.com | Ajman | Basic | 2024-03-15 | 9999-12-31 | TRUE |
Now you can answer historical questions:
-- What subscription did Alice have on March 10th, 2024?
SELECT subscription
FROM dim_customers
WHERE customer_id = 101
AND '2024-03-10' BETWEEN valid_from AND valid_to;
-- Returns: 'Basic'
-- What subscription does Alice have today?
SELECT subscription
FROM dim_customers
WHERE customer_id = 101
AND is_current = TRUE;
-- Returns: 'Pro'
Joining a fact table to a Type 2 dimension
This is the key join pattern in a Type 2 dimensional model. You join the fact’s event date to the dimension’s validity range.
SELECT
f.order_id,
f.order_date,
f.amount,
c.subscription AS subscription_at_time_of_order,
c.city AS city_at_time_of_order
FROM fact_orders f
JOIN dim_customers c
ON f.customer_id = c.customer_id
AND f.order_date BETWEEN c.valid_from AND c.valid_to;
This returns Alice’s subscription as Basic for orders placed before March 15th, and Pro for orders placed after — even though the current dimension row says Pro.
SCD Type 3 — Keep One Previous Version Only
Type 3 stores the previous value alongside the current value in the same row. It is a middle ground — you keep one historical version, but not full history.
Use when: You only care about “what it was before the most recent change,” not the full history. Common for slowly changing attributes like region or tier, where one-level-back comparison is sufficient.
ALTER TABLE dim_customers ADD COLUMN prev_city VARCHAR(50);
ALTER TABLE dim_customers ADD COLUMN prev_subscription VARCHAR(20);
ALTER TABLE dim_customers ADD COLUMN changed_at DATE;
MERGE INTO dim_customers AS target
USING staging.customer_changes AS source
ON target.customer_id = source.customer_id
WHEN MATCHED THEN
UPDATE SET
target.prev_city = target.city, -- save current as previous
target.prev_subscription = target.subscription, -- save current as previous
target.city = source.city, -- overwrite with new
target.subscription = source.subscription, -- overwrite with new
target.email = source.email,
target.changed_at = source.changed_at
WHEN NOT MATCHED THEN
INSERT (customer_id, name, email, city, subscription, changed_at)
VALUES (source.customer_id, source.name, source.email,
source.city, source.subscription, source.changed_at);
After Type 3 processing:
| customer_id | name | city | prev_city | subscription | prev_subscription |
|---|---|---|---|---|---|
| 101 | Alice Smith | Dubai | Dubai | Pro | Basic |
| 102 | Bob Jones | Dubai | Abu Dhabi | Pro | Pro |
You can now answer “where was Bob before his most recent city change?” but you cannot answer where he was two changes ago — that history is overwritten.
Which SCD Type to Use — Decision Guide
| Question | Type 1 | Type 2 | Type 3 |
|---|---|---|---|
| Do you need historical reporting? | No | Yes | Limited (1 version back) |
| Can you answer “what was the value on date X?” | No | Yes | Only for the most recent change |
| Storage cost | Low | High | Medium |
| Query complexity | Simple | Complex (range joins) | Medium |
| Common use cases | Error corrections, current-state reporting | Subscription history, address history | Comparing current vs previous tier |
SCD Type 2 in Databricks Delta Lake
Delta Lake simplifies Type 2 because Delta itself maintains a transaction log. You can use table_changes() to read the before/after state of updates and construct your history table automatically.
-- Read all updates to source.customers since version 10
SELECT
customer_id,
email,
city,
subscription,
_change_type,
_commit_timestamp::DATE AS change_date
FROM table_changes('source.customers', 10)
WHERE _change_type IN ('update_postimage', 'insert')
ORDER BY customer_id, _commit_timestamp;
This gives you the complete change history from the Delta log, which you then load into your Type 2 dimension using the MERGE pattern above.
Common Mistakes
Mistake 1 — Using customer_id as the join key in a Type 2 dimension
A customer with two history rows has the same customer_id for both. Join your fact table to the dimension using the validity date range, not just customer_id.
Mistake 2 — Forgetting to filter is_current = TRUE in general queries
Without this filter, a query on the dimension returns all historical rows for each customer and inflates your counts dramatically.
Mistake 3 — Not handling the gap between valid_to and valid_from
When you expire a row on March 15th (valid_to = 2024-03-14) and insert a new row starting March 15th (valid_from = 2024-03-15), there is no gap. But if you expire on the 15th and insert on the 16th, March 15th has no matching row. Be precise with your date arithmetic.
What to Learn Next
Type 2 dimensions generate many rows over time and can grow very large. The next tutorial covers SQL data quality checks — the validation queries you run before loading data into your Silver and Gold layers to prevent bad data from corrupting your dimension tables.