SQL for the Medallion Architecture — Building Bronze, Silver, and Gold Layer Queries

The Medallion architecture (Bronze → Silver → Gold) is the most widely adopted pattern for organizing data in a modern lakehouse. Each layer has a specific job, and the SQL you write for each layer is fundamentally different. Bronze SQL is about raw ingestion and auditability. Silver SQL is about cleaning, conforming, and enriching. Gold SQL is about aggregation and business-ready output.

This tutorial walks through building each layer with real SQL, using a consistent e-commerce dataset from raw API events all the way to a finished dashboard table.


The Scenario

You receive raw JSON-like event data from an e-commerce API. The data lands as flat rows in a Bronze staging table. Your job is to build the Silver and Gold layer transformations using SQL.


Bronze Layer — Raw Ingestion, No Transformation

The Bronze layer stores exactly what arrived from the source. No filtering, no cleaning, no business logic. If the source sends bad data, it lands in Bronze — you fix it in Silver. This gives you a full audit trail and the ability to reprocess from scratch.

-- Bronze table: raw as-received from the API
CREATE TABLE bronze.raw_events (
    raw_id          BIGINT GENERATED ALWAYS AS IDENTITY,
    event_id        VARCHAR(50),    -- from source, may be null or duplicate
    user_id         VARCHAR(50),    -- source sends as string
    event_type      VARCHAR(50),
    product_id      VARCHAR(50),    -- source sends as string
    quantity        VARCHAR(10),    -- source sends everything as string
    unit_price      VARCHAR(20),
    currency        VARCHAR(10),
    event_timestamp VARCHAR(50),    -- raw timestamp string, timezone may vary
    source_country  VARCHAR(10),
    raw_payload     VARCHAR(5000),  -- full raw JSON for debugging
    pipeline_batch  VARCHAR(50),    -- which pipeline run loaded this
    ingested_at     TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

Bronze SQL — ingest raw events with no transformation:

-- Bronze load: insert everything as-received, track the batch
INSERT INTO bronze.raw_events (
    event_id, user_id, event_type, product_id, quantity,
    unit_price, currency, event_timestamp, source_country,
    raw_payload, pipeline_batch
)
SELECT
    source_event_id,
    source_user_id,
    source_event_type,
    source_product_id,
    source_quantity,
    source_unit_price,
    source_currency,
    source_timestamp,
    source_country,
    full_json_payload,
    '2024-03-01-batch-001'
FROM staging.api_events_incoming;

What Bronze SQL should never do:

  • Cast data types (keep everything as VARCHAR if source is inconsistent)
  • Filter rows (even bad rows belong in Bronze)
  • Apply business rules
  • Join to dimension tables

Bronze is your insurance policy. You can always reprocess Silver and Gold from Bronze if something goes wrong downstream.


Silver Layer — Clean, Conform, Enrich

Silver transforms the raw Bronze data into a clean, typed, deduplicated, and enriched dataset. This is where most of your pipeline logic lives.

-- Silver table: clean, typed, conformed
CREATE TABLE silver.events (
    event_id        BIGINT,
    user_id         INT,
    event_type      VARCHAR(30),
    product_id      INT,
    quantity        INT,
    unit_price      DECIMAL(10,2),
    revenue         DECIMAL(12,2),    -- derived: quantity * unit_price
    currency        VARCHAR(3),
    revenue_usd     DECIMAL(12,2),    -- normalized to USD
    event_ts        TIMESTAMP,        -- UTC normalized
    event_date      DATE,
    event_hour      INT,
    source_country  VARCHAR(50),
    is_valid        BOOLEAN,          -- passed all quality checks
    rejection_reason VARCHAR(200),    -- null if valid
    bronze_id       BIGINT,           -- traceability back to Bronze
    processed_at    TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

Silver SQL — the full transformation from Bronze:

INSERT INTO silver.events (
    event_id, user_id, event_type, product_id,
    quantity, unit_price, revenue, currency, revenue_usd,
    event_ts, event_date, event_hour,
    source_country, is_valid, rejection_reason, bronze_id
)
WITH

-- Step 1: Cast and clean types
typed_events AS (
    SELECT
        raw_id                                              AS bronze_id,
        TRY_CAST(event_id AS BIGINT)                        AS event_id,
        TRY_CAST(user_id AS INT)                            AS user_id,
        LOWER(TRIM(event_type))                             AS event_type,
        TRY_CAST(product_id AS INT)                         AS product_id,
        TRY_CAST(quantity AS INT)                           AS quantity,
        TRY_CAST(unit_price AS DECIMAL(10,2))               AS unit_price,
        UPPER(TRIM(currency))                               AS currency,
        TRY_CAST(event_timestamp AS TIMESTAMP)              AS event_ts_raw,
        source_country
    FROM bronze.raw_events
    WHERE pipeline_batch = '2024-03-01-batch-001'   -- process one batch at a time
),

-- Step 2: Deduplicate on event_id (keep the first-arrived)
deduped AS (
    SELECT *,
           ROW_NUMBER() OVER (
               PARTITION BY event_id
               ORDER BY bronze_id ASC
           ) AS rn
    FROM typed_events
    WHERE event_id IS NOT NULL   -- nulls handled as invalid below
),

-- Step 3: Normalize timestamps to UTC
utc_normalized AS (
    SELECT
        *,
        event_ts_raw AT TIME ZONE 'UTC'           AS event_ts,
        DATE(event_ts_raw AT TIME ZONE 'UTC')     AS event_date,
        EXTRACT(HOUR FROM event_ts_raw AT TIME ZONE 'UTC') AS event_hour
    FROM deduped
    WHERE rn = 1   -- deduplicated rows only
),

-- Step 4: Apply currency conversion to USD
usd_converted AS (
    SELECT
        u.*,
        quantity * unit_price                           AS revenue,
        quantity * unit_price * COALESCE(fx.rate_to_usd, 1.0) AS revenue_usd
    FROM utc_normalized u
    LEFT JOIN reference.fx_rates fx
        ON u.currency = fx.currency
        AND u.event_date = fx.rate_date
),

-- Step 5: Apply data quality rules and flag invalid rows
quality_checked AS (
    SELECT
        *,
        CASE
            WHEN event_id    IS NULL                           THEN FALSE
            WHEN user_id     IS NULL                           THEN FALSE
            WHEN unit_price  IS NULL OR unit_price <= 0        THEN FALSE
            WHEN quantity    IS NULL OR quantity <= 0           THEN FALSE
            WHEN event_ts    IS NULL                           THEN FALSE
            WHEN event_type NOT IN ('purchase','view','add_to_cart','refund')
                                                               THEN FALSE
            ELSE TRUE
        END AS is_valid,
        CASE
            WHEN event_id    IS NULL                           THEN 'null event_id'
            WHEN user_id     IS NULL                           THEN 'null user_id'
            WHEN unit_price  IS NULL OR unit_price <= 0        THEN 'invalid unit_price'
            WHEN quantity    IS NULL OR quantity <= 0           THEN 'invalid quantity'
            WHEN event_ts    IS NULL                           THEN 'invalid timestamp'
            WHEN event_type NOT IN ('purchase','view','add_to_cart','refund')
                                                               THEN 'unknown event_type: ' || event_type
            ELSE NULL
        END AS rejection_reason
    FROM usd_converted
)

SELECT
    event_id, user_id, event_type, product_id,
    quantity, unit_price, revenue, currency, revenue_usd,
    event_ts, event_date, event_hour,
    source_country, is_valid, rejection_reason, bronze_id
FROM quality_checked;

This single CTE chain moves data through every transformation step in a readable, debuggable pipeline. Each step builds on the last — you can run any individual CTE to inspect intermediate results.

What Silver SQL should always include:

  • Type casting with null-safe functions (TRY_CAST in Databricks/SQL Server, SAFE_CAST in BigQuery)
  • Deduplication
  • Timestamp normalization to UTC
  • Data quality flags (is_valid, rejection_reason)
  • Traceability back to Bronze (bronze_id)

Gold Layer — Aggregated, Business-Ready Output

Gold tables are the final output — pre-aggregated, business-named, and ready for dashboards, reports, and ML features. Gold SQL should be simple and focused. The complex transformation logic belongs in Silver.

Gold Table 1 — Daily Revenue by Country and Category

CREATE TABLE gold.daily_revenue (
    revenue_date   DATE,
    country        VARCHAR(50),
    category       VARCHAR(30),
    total_revenue  DECIMAL(14,2),
    total_revenue_usd DECIMAL(14,2),
    order_count    INT,
    unique_users   INT,
    avg_order_value DECIMAL(10,2),
    refreshed_at   TIMESTAMP
);

-- Gold load: aggregate from Silver (valid events only)
INSERT INTO gold.daily_revenue (
    revenue_date, country, category, total_revenue,
    total_revenue_usd, order_count, unique_users, avg_order_value, refreshed_at
)
SELECT
    e.event_date                   AS revenue_date,
    e.source_country               AS country,
    p.category,
    SUM(e.revenue)                 AS total_revenue,
    SUM(e.revenue_usd)             AS total_revenue_usd,
    COUNT(*)                       AS order_count,
    COUNT(DISTINCT e.user_id)      AS unique_users,
    ROUND(AVG(e.revenue_usd), 2)   AS avg_order_value,
    CURRENT_TIMESTAMP              AS refreshed_at
FROM silver.events e
JOIN dim_products p ON e.product_id = p.product_id
WHERE e.is_valid = TRUE
  AND e.event_type = 'purchase'
  AND e.event_date = '2024-03-01'   -- incremental: only today's partition
GROUP BY e.event_date, e.source_country, p.category;

Gold Table 2 — User Purchase Funnel (View → Add to Cart → Purchase)

CREATE TABLE gold.purchase_funnel (
    funnel_date    DATE,
    country        VARCHAR(50),
    views          INT,
    add_to_carts   INT,
    purchases      INT,
    view_to_cart_rate    DECIMAL(5,2),
    cart_to_purchase_rate DECIMAL(5,2),
    overall_conversion   DECIMAL(5,2)
);

INSERT INTO gold.purchase_funnel (
    funnel_date, country, views, add_to_carts, purchases,
    view_to_cart_rate, cart_to_purchase_rate, overall_conversion
)
SELECT
    event_date AS funnel_date,
    source_country AS country,
    SUM(CASE WHEN event_type = 'view'        THEN 1 ELSE 0 END) AS views,
    SUM(CASE WHEN event_type = 'add_to_cart' THEN 1 ELSE 0 END) AS add_to_carts,
    SUM(CASE WHEN event_type = 'purchase'    THEN 1 ELSE 0 END) AS purchases,
    ROUND(
        SUM(CASE WHEN event_type = 'add_to_cart' THEN 1 ELSE 0 END) * 100.0
        / NULLIF(SUM(CASE WHEN event_type = 'view' THEN 1 ELSE 0 END), 0),
        2
    ) AS view_to_cart_rate,
    ROUND(
        SUM(CASE WHEN event_type = 'purchase' THEN 1 ELSE 0 END) * 100.0
        / NULLIF(SUM(CASE WHEN event_type = 'add_to_cart' THEN 1 ELSE 0 END), 0),
        2
    ) AS cart_to_purchase_rate,
    ROUND(
        SUM(CASE WHEN event_type = 'purchase' THEN 1 ELSE 0 END) * 100.0
        / NULLIF(SUM(CASE WHEN event_type = 'view' THEN 1 ELSE 0 END), 0),
        2
    ) AS overall_conversion
FROM silver.events
WHERE is_valid = TRUE
  AND event_date = '2024-03-01'
GROUP BY event_date, source_country;

Gold Table 3 — Customer Lifetime Value (CLV) Feature Table

CREATE TABLE gold.customer_clv (
    user_id              INT,
    first_purchase_date  DATE,
    last_purchase_date   DATE,
    total_orders         INT,
    total_revenue_usd    DECIMAL(14,2),
    avg_order_value      DECIMAL(10,2),
    days_since_last_order INT,
    clv_segment          VARCHAR(20),   -- High / Medium / Low
    refreshed_at         TIMESTAMP
);

INSERT INTO gold.customer_clv (
    user_id, first_purchase_date, last_purchase_date,
    total_orders, total_revenue_usd, avg_order_value,
    days_since_last_order, clv_segment, refreshed_at
)
WITH customer_stats AS (
    SELECT
        user_id,
        MIN(event_date)                AS first_purchase_date,
        MAX(event_date)                AS last_purchase_date,
        COUNT(*)                       AS total_orders,
        SUM(revenue_usd)               AS total_revenue_usd,
        AVG(revenue_usd)               AS avg_order_value,
        CURRENT_DATE - MAX(event_date) AS days_since_last_order
    FROM silver.events
    WHERE is_valid = TRUE
      AND event_type = 'purchase'
    GROUP BY user_id
),
clv_with_segment AS (
    SELECT
        *,
        CASE
            WHEN total_revenue_usd >= 5000  THEN 'High'
            WHEN total_revenue_usd >= 1000  THEN 'Medium'
            ELSE                                 'Low'
        END AS clv_segment
    FROM customer_stats
)
SELECT *, CURRENT_TIMESTAMP AS refreshed_at
FROM clv_with_segment;

Layer Responsibilities Summary

LayerSQL PatternTransformation Level
BronzeINSERT AS-ISNone — raw only
SilverCTEs with TRY_CAST, dedup, DQ flagsFull cleaning, typing, enrichment
GoldGROUP BY aggregations, CASE WHEN pivotsAggregation and business metrics only

Common Mistakes

Mistake 1 — Putting business logic in Bronze

Bronze SQL should never contain WHERE status = 'active' or any filter that discards rows. Every row from the source belongs in Bronze — even the bad ones. Filtering belongs in Silver.

Mistake 2 — Loading invalid Silver rows into Gold

Always filter WHERE is_valid = TRUE in your Gold SQL. Invalid rows flagged in Silver should never reach Gold tables that power dashboards.

Mistake 3 — Full-replacing Gold tables on every run

Most Gold tables should be incrementally updated — update today’s partition only. Full table rewrites are slow and disrupt dashboard queries that are running against the table.

-- WRONG: delete everything and reload
DELETE FROM gold.daily_revenue;
INSERT INTO gold.daily_revenue SELECT ...;

-- CORRECT: delete only today's partition, then reload it
DELETE FROM gold.daily_revenue WHERE revenue_date = CURRENT_DATE;
INSERT INTO gold.daily_revenue SELECT ... WHERE event_date = CURRENT_DATE;

What to Learn Next

The Medallion architecture organizes your data. The next challenge is making sure it stays correct as your pipeline runs continuously — the next tutorial covers SQL patterns for monitoring pipeline health: detecting late-arriving data, stale partitions, and anomalous row counts automatically.

Leave a Comment