The Medallion architecture (Bronze → Silver → Gold) is the most widely adopted pattern for organizing data in a modern lakehouse. Each layer has a specific job, and the SQL you write for each layer is fundamentally different. Bronze SQL is about raw ingestion and auditability. Silver SQL is about cleaning, conforming, and enriching. Gold SQL is about aggregation and business-ready output.
This tutorial walks through building each layer with real SQL, using a consistent e-commerce dataset from raw API events all the way to a finished dashboard table.
The Scenario
You receive raw JSON-like event data from an e-commerce API. The data lands as flat rows in a Bronze staging table. Your job is to build the Silver and Gold layer transformations using SQL.
Bronze Layer — Raw Ingestion, No Transformation
The Bronze layer stores exactly what arrived from the source. No filtering, no cleaning, no business logic. If the source sends bad data, it lands in Bronze — you fix it in Silver. This gives you a full audit trail and the ability to reprocess from scratch.
-- Bronze table: raw as-received from the API
CREATE TABLE bronze.raw_events (
raw_id BIGINT GENERATED ALWAYS AS IDENTITY,
event_id VARCHAR(50), -- from source, may be null or duplicate
user_id VARCHAR(50), -- source sends as string
event_type VARCHAR(50),
product_id VARCHAR(50), -- source sends as string
quantity VARCHAR(10), -- source sends everything as string
unit_price VARCHAR(20),
currency VARCHAR(10),
event_timestamp VARCHAR(50), -- raw timestamp string, timezone may vary
source_country VARCHAR(10),
raw_payload VARCHAR(5000), -- full raw JSON for debugging
pipeline_batch VARCHAR(50), -- which pipeline run loaded this
ingested_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
Bronze SQL — ingest raw events with no transformation:
-- Bronze load: insert everything as-received, track the batch
INSERT INTO bronze.raw_events (
event_id, user_id, event_type, product_id, quantity,
unit_price, currency, event_timestamp, source_country,
raw_payload, pipeline_batch
)
SELECT
source_event_id,
source_user_id,
source_event_type,
source_product_id,
source_quantity,
source_unit_price,
source_currency,
source_timestamp,
source_country,
full_json_payload,
'2024-03-01-batch-001'
FROM staging.api_events_incoming;
What Bronze SQL should never do:
- Cast data types (keep everything as VARCHAR if source is inconsistent)
- Filter rows (even bad rows belong in Bronze)
- Apply business rules
- Join to dimension tables
Bronze is your insurance policy. You can always reprocess Silver and Gold from Bronze if something goes wrong downstream.
Silver Layer — Clean, Conform, Enrich
Silver transforms the raw Bronze data into a clean, typed, deduplicated, and enriched dataset. This is where most of your pipeline logic lives.
-- Silver table: clean, typed, conformed
CREATE TABLE silver.events (
event_id BIGINT,
user_id INT,
event_type VARCHAR(30),
product_id INT,
quantity INT,
unit_price DECIMAL(10,2),
revenue DECIMAL(12,2), -- derived: quantity * unit_price
currency VARCHAR(3),
revenue_usd DECIMAL(12,2), -- normalized to USD
event_ts TIMESTAMP, -- UTC normalized
event_date DATE,
event_hour INT,
source_country VARCHAR(50),
is_valid BOOLEAN, -- passed all quality checks
rejection_reason VARCHAR(200), -- null if valid
bronze_id BIGINT, -- traceability back to Bronze
processed_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
Silver SQL — the full transformation from Bronze:
INSERT INTO silver.events (
event_id, user_id, event_type, product_id,
quantity, unit_price, revenue, currency, revenue_usd,
event_ts, event_date, event_hour,
source_country, is_valid, rejection_reason, bronze_id
)
WITH
-- Step 1: Cast and clean types
typed_events AS (
SELECT
raw_id AS bronze_id,
TRY_CAST(event_id AS BIGINT) AS event_id,
TRY_CAST(user_id AS INT) AS user_id,
LOWER(TRIM(event_type)) AS event_type,
TRY_CAST(product_id AS INT) AS product_id,
TRY_CAST(quantity AS INT) AS quantity,
TRY_CAST(unit_price AS DECIMAL(10,2)) AS unit_price,
UPPER(TRIM(currency)) AS currency,
TRY_CAST(event_timestamp AS TIMESTAMP) AS event_ts_raw,
source_country
FROM bronze.raw_events
WHERE pipeline_batch = '2024-03-01-batch-001' -- process one batch at a time
),
-- Step 2: Deduplicate on event_id (keep the first-arrived)
deduped AS (
SELECT *,
ROW_NUMBER() OVER (
PARTITION BY event_id
ORDER BY bronze_id ASC
) AS rn
FROM typed_events
WHERE event_id IS NOT NULL -- nulls handled as invalid below
),
-- Step 3: Normalize timestamps to UTC
utc_normalized AS (
SELECT
*,
event_ts_raw AT TIME ZONE 'UTC' AS event_ts,
DATE(event_ts_raw AT TIME ZONE 'UTC') AS event_date,
EXTRACT(HOUR FROM event_ts_raw AT TIME ZONE 'UTC') AS event_hour
FROM deduped
WHERE rn = 1 -- deduplicated rows only
),
-- Step 4: Apply currency conversion to USD
usd_converted AS (
SELECT
u.*,
quantity * unit_price AS revenue,
quantity * unit_price * COALESCE(fx.rate_to_usd, 1.0) AS revenue_usd
FROM utc_normalized u
LEFT JOIN reference.fx_rates fx
ON u.currency = fx.currency
AND u.event_date = fx.rate_date
),
-- Step 5: Apply data quality rules and flag invalid rows
quality_checked AS (
SELECT
*,
CASE
WHEN event_id IS NULL THEN FALSE
WHEN user_id IS NULL THEN FALSE
WHEN unit_price IS NULL OR unit_price <= 0 THEN FALSE
WHEN quantity IS NULL OR quantity <= 0 THEN FALSE
WHEN event_ts IS NULL THEN FALSE
WHEN event_type NOT IN ('purchase','view','add_to_cart','refund')
THEN FALSE
ELSE TRUE
END AS is_valid,
CASE
WHEN event_id IS NULL THEN 'null event_id'
WHEN user_id IS NULL THEN 'null user_id'
WHEN unit_price IS NULL OR unit_price <= 0 THEN 'invalid unit_price'
WHEN quantity IS NULL OR quantity <= 0 THEN 'invalid quantity'
WHEN event_ts IS NULL THEN 'invalid timestamp'
WHEN event_type NOT IN ('purchase','view','add_to_cart','refund')
THEN 'unknown event_type: ' || event_type
ELSE NULL
END AS rejection_reason
FROM usd_converted
)
SELECT
event_id, user_id, event_type, product_id,
quantity, unit_price, revenue, currency, revenue_usd,
event_ts, event_date, event_hour,
source_country, is_valid, rejection_reason, bronze_id
FROM quality_checked;
This single CTE chain moves data through every transformation step in a readable, debuggable pipeline. Each step builds on the last — you can run any individual CTE to inspect intermediate results.
What Silver SQL should always include:
- Type casting with null-safe functions (
TRY_CASTin Databricks/SQL Server,SAFE_CASTin BigQuery) - Deduplication
- Timestamp normalization to UTC
- Data quality flags (is_valid, rejection_reason)
- Traceability back to Bronze (bronze_id)
Gold Layer — Aggregated, Business-Ready Output
Gold tables are the final output — pre-aggregated, business-named, and ready for dashboards, reports, and ML features. Gold SQL should be simple and focused. The complex transformation logic belongs in Silver.
Gold Table 1 — Daily Revenue by Country and Category
CREATE TABLE gold.daily_revenue (
revenue_date DATE,
country VARCHAR(50),
category VARCHAR(30),
total_revenue DECIMAL(14,2),
total_revenue_usd DECIMAL(14,2),
order_count INT,
unique_users INT,
avg_order_value DECIMAL(10,2),
refreshed_at TIMESTAMP
);
-- Gold load: aggregate from Silver (valid events only)
INSERT INTO gold.daily_revenue (
revenue_date, country, category, total_revenue,
total_revenue_usd, order_count, unique_users, avg_order_value, refreshed_at
)
SELECT
e.event_date AS revenue_date,
e.source_country AS country,
p.category,
SUM(e.revenue) AS total_revenue,
SUM(e.revenue_usd) AS total_revenue_usd,
COUNT(*) AS order_count,
COUNT(DISTINCT e.user_id) AS unique_users,
ROUND(AVG(e.revenue_usd), 2) AS avg_order_value,
CURRENT_TIMESTAMP AS refreshed_at
FROM silver.events e
JOIN dim_products p ON e.product_id = p.product_id
WHERE e.is_valid = TRUE
AND e.event_type = 'purchase'
AND e.event_date = '2024-03-01' -- incremental: only today's partition
GROUP BY e.event_date, e.source_country, p.category;
Gold Table 2 — User Purchase Funnel (View → Add to Cart → Purchase)
CREATE TABLE gold.purchase_funnel (
funnel_date DATE,
country VARCHAR(50),
views INT,
add_to_carts INT,
purchases INT,
view_to_cart_rate DECIMAL(5,2),
cart_to_purchase_rate DECIMAL(5,2),
overall_conversion DECIMAL(5,2)
);
INSERT INTO gold.purchase_funnel (
funnel_date, country, views, add_to_carts, purchases,
view_to_cart_rate, cart_to_purchase_rate, overall_conversion
)
SELECT
event_date AS funnel_date,
source_country AS country,
SUM(CASE WHEN event_type = 'view' THEN 1 ELSE 0 END) AS views,
SUM(CASE WHEN event_type = 'add_to_cart' THEN 1 ELSE 0 END) AS add_to_carts,
SUM(CASE WHEN event_type = 'purchase' THEN 1 ELSE 0 END) AS purchases,
ROUND(
SUM(CASE WHEN event_type = 'add_to_cart' THEN 1 ELSE 0 END) * 100.0
/ NULLIF(SUM(CASE WHEN event_type = 'view' THEN 1 ELSE 0 END), 0),
2
) AS view_to_cart_rate,
ROUND(
SUM(CASE WHEN event_type = 'purchase' THEN 1 ELSE 0 END) * 100.0
/ NULLIF(SUM(CASE WHEN event_type = 'add_to_cart' THEN 1 ELSE 0 END), 0),
2
) AS cart_to_purchase_rate,
ROUND(
SUM(CASE WHEN event_type = 'purchase' THEN 1 ELSE 0 END) * 100.0
/ NULLIF(SUM(CASE WHEN event_type = 'view' THEN 1 ELSE 0 END), 0),
2
) AS overall_conversion
FROM silver.events
WHERE is_valid = TRUE
AND event_date = '2024-03-01'
GROUP BY event_date, source_country;
Gold Table 3 — Customer Lifetime Value (CLV) Feature Table
CREATE TABLE gold.customer_clv (
user_id INT,
first_purchase_date DATE,
last_purchase_date DATE,
total_orders INT,
total_revenue_usd DECIMAL(14,2),
avg_order_value DECIMAL(10,2),
days_since_last_order INT,
clv_segment VARCHAR(20), -- High / Medium / Low
refreshed_at TIMESTAMP
);
INSERT INTO gold.customer_clv (
user_id, first_purchase_date, last_purchase_date,
total_orders, total_revenue_usd, avg_order_value,
days_since_last_order, clv_segment, refreshed_at
)
WITH customer_stats AS (
SELECT
user_id,
MIN(event_date) AS first_purchase_date,
MAX(event_date) AS last_purchase_date,
COUNT(*) AS total_orders,
SUM(revenue_usd) AS total_revenue_usd,
AVG(revenue_usd) AS avg_order_value,
CURRENT_DATE - MAX(event_date) AS days_since_last_order
FROM silver.events
WHERE is_valid = TRUE
AND event_type = 'purchase'
GROUP BY user_id
),
clv_with_segment AS (
SELECT
*,
CASE
WHEN total_revenue_usd >= 5000 THEN 'High'
WHEN total_revenue_usd >= 1000 THEN 'Medium'
ELSE 'Low'
END AS clv_segment
FROM customer_stats
)
SELECT *, CURRENT_TIMESTAMP AS refreshed_at
FROM clv_with_segment;
Layer Responsibilities Summary
| Layer | SQL Pattern | Transformation Level |
|---|---|---|
| Bronze | INSERT AS-IS | None — raw only |
| Silver | CTEs with TRY_CAST, dedup, DQ flags | Full cleaning, typing, enrichment |
| Gold | GROUP BY aggregations, CASE WHEN pivots | Aggregation and business metrics only |
Common Mistakes
Mistake 1 — Putting business logic in Bronze
Bronze SQL should never contain WHERE status = 'active' or any filter that discards rows. Every row from the source belongs in Bronze — even the bad ones. Filtering belongs in Silver.
Mistake 2 — Loading invalid Silver rows into Gold
Always filter WHERE is_valid = TRUE in your Gold SQL. Invalid rows flagged in Silver should never reach Gold tables that power dashboards.
Mistake 3 — Full-replacing Gold tables on every run
Most Gold tables should be incrementally updated — update today’s partition only. Full table rewrites are slow and disrupt dashboard queries that are running against the table.
-- WRONG: delete everything and reload
DELETE FROM gold.daily_revenue;
INSERT INTO gold.daily_revenue SELECT ...;
-- CORRECT: delete only today's partition, then reload it
DELETE FROM gold.daily_revenue WHERE revenue_date = CURRENT_DATE;
INSERT INTO gold.daily_revenue SELECT ... WHERE event_date = CURRENT_DATE;
What to Learn Next
The Medallion architecture organizes your data. The next challenge is making sure it stays correct as your pipeline runs continuously — the next tutorial covers SQL patterns for monitoring pipeline health: detecting late-arriving data, stale partitions, and anomalous row counts automatically.