SQL for Data Validation Between Source and Target Tables — Reconciliation Queries Every Data Engineer Needs

After every pipeline run, you need to verify that what landed in your warehouse actually matches what was in the source. Row counts alone are not enough — a pipeline can load the correct number of rows but with wrong values, missing columns, or shifted amounts. Reconciliation SQL catches these problems before your data consumers do.

This tutorial covers the full set of SQL queries data engineers use to validate pipeline output — from basic counts to field-level comparison and statistical reconciliation.


The Setup

-- Source system table (read-only reference)
CREATE TABLE source.orders (
    order_id    INT,
    customer_id INT,
    amount      DECIMAL(12,2),
    status      VARCHAR(20),
    order_date  DATE
);

-- Warehouse target table (what the pipeline loaded)
CREATE TABLE warehouse.orders (
    order_id    INT,
    customer_id INT,
    amount      DECIMAL(12,2),
    status      VARCHAR(20),
    order_date  DATE,
    loaded_at   TIMESTAMP
);

-- Sample source data
INSERT INTO source.orders VALUES
(1001, 201, 1200.00, 'completed', '2024-03-01'),
(1002, 202,  800.00, 'completed', '2024-03-01'),
(1003, 203,  400.00, 'returned',  '2024-03-02'),
(1004, 201,  350.00, 'completed', '2024-03-02'),
(1005, 204, 1500.00, 'pending',   '2024-03-03');

-- Warehouse loaded with a problem: order 1003 has wrong amount, order 1005 is missing
INSERT INTO warehouse.orders (order_id, customer_id, amount, status, order_date, loaded_at) VALUES
(1001, 201, 1200.00, 'completed', '2024-03-01', CURRENT_TIMESTAMP),
(1002, 202,  800.00, 'completed', '2024-03-01', CURRENT_TIMESTAMP),
(1003, 203,  450.00, 'returned',  '2024-03-02', CURRENT_TIMESTAMP),  -- wrong amount: should be 400
(1004, 201,  350.00, 'completed', '2024-03-02', CURRENT_TIMESTAMP);
-- order 1005 is missing entirely

Check 1 — Row Count Reconciliation

The first and fastest check. If counts do not match, the pipeline missed rows or loaded extras.

SELECT
    'source'    AS table_name,
    COUNT(*)    AS row_count,
    MIN(order_date) AS earliest_date,
    MAX(order_date) AS latest_date
FROM source.orders

UNION ALL

SELECT
    'warehouse' AS table_name,
    COUNT(*)    AS row_count,
    MIN(order_date) AS earliest_date,
    MAX(order_date) AS latest_date
FROM warehouse.orders;

Output:

table_namerow_countearliest_datelatest_date
source52024-03-012024-03-03
warehouse42024-03-012024-03-02

Immediately visible: the warehouse has 4 rows, source has 5. And the latest date in the warehouse is March 2nd, not March 3rd — the March 3rd order is missing.


Check 2 — Find Rows in Source That Are Missing From Warehouse

SELECT
    s.order_id,
    s.customer_id,
    s.amount,
    s.status,
    s.order_date
FROM source.orders s
LEFT JOIN warehouse.orders w ON s.order_id = w.order_id
WHERE w.order_id IS NULL;

Output:

order_idcustomer_idamountstatusorder_date
10052041500.00pending2024-03-03

Order 1005 is in source but missing from the warehouse. This is a pipeline gap.


Check 3 — Find Rows in Warehouse That Are Not in Source

This catches phantom rows — data that was loaded but should not exist, perhaps from a bad pipeline run or a failed delete sync.

SELECT
    w.order_id,
    w.customer_id,
    w.amount,
    w.loaded_at
FROM warehouse.orders w
LEFT JOIN source.orders s ON w.order_id = s.order_id
WHERE s.order_id IS NULL;

Output: No rows — good, the warehouse has no phantom rows in this case. In production, any rows returned here are red flags.


Check 4 — Field-Level Comparison (Find Value Mismatches)

This is the most important reconciliation query and the most commonly skipped. Row counts can match perfectly while values are wrong.

SELECT
    s.order_id,
    'amount'              AS field_name,
    s.amount::VARCHAR     AS source_value,
    w.amount::VARCHAR     AS warehouse_value
FROM source.orders s
JOIN warehouse.orders w ON s.order_id = w.order_id
WHERE s.amount != w.amount

UNION ALL

SELECT
    s.order_id,
    'status'              AS field_name,
    s.status              AS source_value,
    w.status              AS warehouse_value
FROM source.orders s
JOIN warehouse.orders w ON s.order_id = w.order_id
WHERE s.status != w.status

UNION ALL

SELECT
    s.order_id,
    'customer_id'         AS field_name,
    s.customer_id::VARCHAR AS source_value,
    w.customer_id::VARCHAR AS warehouse_value
FROM source.orders s
JOIN warehouse.orders w ON s.order_id = w.order_id
WHERE s.customer_id != w.customer_id

ORDER BY order_id, field_name;

Output:

order_idfield_namesource_valuewarehouse_value
1003amount400.00450.00

Order 1003 has the wrong amount in the warehouse — 450.00 loaded instead of 400.00. Without field-level reconciliation, this would have been invisible because row counts matched (4 matched rows between source and warehouse minus the missing row 1005).


Check 5 — Aggregate Reconciliation (Sum, Average, Count)

For financial pipelines, the total amounts must match exactly. Even one incorrect value of 50 cents will break your reconciliation.

SELECT
    'source'    AS source,
    COUNT(*)    AS row_count,
    SUM(amount) AS total_amount,
    AVG(amount) AS avg_amount,
    MIN(amount) AS min_amount,
    MAX(amount) AS max_amount
FROM source.orders

UNION ALL

SELECT
    'warehouse' AS source,
    COUNT(*)    AS row_count,
    SUM(amount) AS total_amount,
    AVG(amount) AS avg_amount,
    MIN(amount) AS min_amount,
    MAX(amount) AS max_amount
FROM warehouse.orders;

Output:

sourcerow_counttotal_amountavg_amountmin_amountmax_amount
source54250.00850.00350.001500.00
warehouse43800.00950.00350.001200.00

Two differences visible:

  • Total amount is off by 450 (source 4250 vs warehouse 3800): 1500 from missing order 1005, minus 50 from the wrong amount on order 1003
  • Average is different, max differs (1500 is missing from warehouse)

Check 6 — Partition-Level Reconciliation

For large tables that load incrementally, reconcile by partition (date or month) to quickly identify which time period has the gap.

SELECT
    order_date,
    COUNT(*)    AS source_count,
    SUM(amount) AS source_total
FROM source.orders
GROUP BY order_date

EXCEPT

SELECT
    order_date,
    COUNT(*)    AS warehouse_count,
    SUM(amount) AS warehouse_total
FROM warehouse.orders
GROUP BY order_date

ORDER BY order_date;

Output:

order_datesource_countsource_total
2024-03-022750.00
2024-03-0311500.00

Both March 2nd and March 3rd show differences. March 3rd is missing entirely. March 2nd has the right count but wrong total (due to the amount error on order 1003).


Check 7 — Hash-Based Row Comparison

For exact byte-level comparison of rows (not just specific fields), hashing each row gives a fingerprint you can compare between source and warehouse.

-- PostgreSQL: use MD5 to hash all columns into a single comparable value
SELECT
    order_id,
    MD5(
        CONCAT(
            COALESCE(order_id::VARCHAR, ''),
            COALESCE(customer_id::VARCHAR, ''),
            COALESCE(amount::VARCHAR, ''),
            COALESCE(status, ''),
            COALESCE(order_date::VARCHAR, '')
        )
    ) AS row_hash
FROM source.orders

EXCEPT

SELECT
    order_id,
    MD5(
        CONCAT(
            COALESCE(order_id::VARCHAR, ''),
            COALESCE(customer_id::VARCHAR, ''),
            COALESCE(amount::VARCHAR, ''),
            COALESCE(status, ''),
            COALESCE(order_date::VARCHAR, '')
        )
    ) AS row_hash
FROM warehouse.orders;

Output:

order_idrow_hash
1003a3f8c…
10059b2d1…

Any row where the hash differs between source and warehouse appears here. Order 1003 (wrong amount) and order 1005 (missing) are both caught. This is the fastest way to check large tables — compare one hash column instead of many individual fields.


Building a Reusable Reconciliation Report

Combine all checks into a single report that runs after every pipeline load:

WITH
source_stats AS (
    SELECT COUNT(*) AS cnt, SUM(amount) AS total FROM source.orders
),
warehouse_stats AS (
    SELECT COUNT(*) AS cnt, SUM(amount) AS total FROM warehouse.orders
),
missing_rows AS (
    SELECT COUNT(*) AS cnt
    FROM source.orders s
    LEFT JOIN warehouse.orders w ON s.order_id = w.order_id
    WHERE w.order_id IS NULL
),
phantom_rows AS (
    SELECT COUNT(*) AS cnt
    FROM warehouse.orders w
    LEFT JOIN source.orders s ON w.order_id = s.order_id
    WHERE s.order_id IS NULL
),
mismatched_values AS (
    SELECT COUNT(*) AS cnt
    FROM source.orders s
    JOIN warehouse.orders w ON s.order_id = w.order_id
    WHERE s.amount != w.amount OR s.status != w.status
)
SELECT
    s.cnt         AS source_rows,
    w.cnt         AS warehouse_rows,
    s.cnt - w.cnt AS row_difference,
    s.total       AS source_total,
    w.total       AS warehouse_total,
    s.total - w.total AS amount_difference,
    m.cnt         AS missing_rows,
    p.cnt         AS phantom_rows,
    mv.cnt        AS mismatched_rows,
    CASE
        WHEN s.cnt = w.cnt AND s.total = w.total AND m.cnt = 0 AND p.cnt = 0 AND mv.cnt = 0
        THEN 'PASS'
        ELSE 'FAIL'
    END AS reconciliation_status
FROM source_stats s, warehouse_stats w, missing_rows m, phantom_rows p, mismatched_values mv;

Output:

source_rowswarehouse_rowsrow_differencesource_totalwarehouse_totalamount_differencemissing_rowsphantom_rowsmismatched_rowsreconciliation_status
5414250.003800.00450.00101FAIL

This single query gives your pipeline monitoring system everything it needs to know in one pass. Store the output in a pipeline.reconciliation_log table for historical tracking.


Automating in Your Pipeline (Airflow Example Pattern)

# In Airflow, run reconciliation as a sensor after the load task
def run_reconciliation(**kwargs):
    result = run_sql("""
        SELECT reconciliation_status, missing_rows, mismatched_rows
        FROM reconciliation_report
    """)
    if result['reconciliation_status'] != 'PASS':
        raise ValueError(
            f"Reconciliation FAILED: {result['missing_rows']} missing rows, "
            f"{result['mismatched_rows']} mismatched values"
        )

reconcile_task = PythonOperator(
    task_id='reconcile_orders',
    python_callable=run_reconciliation,
    dag=dag
)

load_task >> reconcile_task  # reconciliation runs after load, blocks downstream

Common Mistakes

Mistake 1 — Only checking row counts

Row counts matching means nothing if values are wrong. Always run field-level comparison on at least your critical numeric columns (amounts, quantities, prices).

Mistake 2 — Not reconciling by partition

Checking total counts and totals across all time may hide a gap in a specific month. Always reconcile at the partition level (by date, month, or region).

Mistake 3 — Skipping reconciliation on incremental loads

Incremental pipelines are more error-prone than full loads — late-arriving data, watermark drift, and MERGE bugs all create gaps. Reconcile every incremental run, not just full reloads.


What to Learn Next

Reconciliation catches pipeline errors after they happen. The next tutorial covers proactive data quality — writing SQL checks that validate your data before the load, so problems are caught at the source rather than discovered in the warehouse.

Leave a Comment