After every pipeline run, you need to verify that what landed in your warehouse actually matches what was in the source. Row counts alone are not enough — a pipeline can load the correct number of rows but with wrong values, missing columns, or shifted amounts. Reconciliation SQL catches these problems before your data consumers do.
This tutorial covers the full set of SQL queries data engineers use to validate pipeline output — from basic counts to field-level comparison and statistical reconciliation.
The Setup
-- Source system table (read-only reference)
CREATE TABLE source.orders (
order_id INT,
customer_id INT,
amount DECIMAL(12,2),
status VARCHAR(20),
order_date DATE
);
-- Warehouse target table (what the pipeline loaded)
CREATE TABLE warehouse.orders (
order_id INT,
customer_id INT,
amount DECIMAL(12,2),
status VARCHAR(20),
order_date DATE,
loaded_at TIMESTAMP
);
-- Sample source data
INSERT INTO source.orders VALUES
(1001, 201, 1200.00, 'completed', '2024-03-01'),
(1002, 202, 800.00, 'completed', '2024-03-01'),
(1003, 203, 400.00, 'returned', '2024-03-02'),
(1004, 201, 350.00, 'completed', '2024-03-02'),
(1005, 204, 1500.00, 'pending', '2024-03-03');
-- Warehouse loaded with a problem: order 1003 has wrong amount, order 1005 is missing
INSERT INTO warehouse.orders (order_id, customer_id, amount, status, order_date, loaded_at) VALUES
(1001, 201, 1200.00, 'completed', '2024-03-01', CURRENT_TIMESTAMP),
(1002, 202, 800.00, 'completed', '2024-03-01', CURRENT_TIMESTAMP),
(1003, 203, 450.00, 'returned', '2024-03-02', CURRENT_TIMESTAMP), -- wrong amount: should be 400
(1004, 201, 350.00, 'completed', '2024-03-02', CURRENT_TIMESTAMP);
-- order 1005 is missing entirely
Check 1 — Row Count Reconciliation
The first and fastest check. If counts do not match, the pipeline missed rows or loaded extras.
SELECT
'source' AS table_name,
COUNT(*) AS row_count,
MIN(order_date) AS earliest_date,
MAX(order_date) AS latest_date
FROM source.orders
UNION ALL
SELECT
'warehouse' AS table_name,
COUNT(*) AS row_count,
MIN(order_date) AS earliest_date,
MAX(order_date) AS latest_date
FROM warehouse.orders;
Output:
| table_name | row_count | earliest_date | latest_date |
|---|---|---|---|
| source | 5 | 2024-03-01 | 2024-03-03 |
| warehouse | 4 | 2024-03-01 | 2024-03-02 |
Immediately visible: the warehouse has 4 rows, source has 5. And the latest date in the warehouse is March 2nd, not March 3rd — the March 3rd order is missing.
Check 2 — Find Rows in Source That Are Missing From Warehouse
SELECT
s.order_id,
s.customer_id,
s.amount,
s.status,
s.order_date
FROM source.orders s
LEFT JOIN warehouse.orders w ON s.order_id = w.order_id
WHERE w.order_id IS NULL;
Output:
| order_id | customer_id | amount | status | order_date |
|---|---|---|---|---|
| 1005 | 204 | 1500.00 | pending | 2024-03-03 |
Order 1005 is in source but missing from the warehouse. This is a pipeline gap.
Check 3 — Find Rows in Warehouse That Are Not in Source
This catches phantom rows — data that was loaded but should not exist, perhaps from a bad pipeline run or a failed delete sync.
SELECT
w.order_id,
w.customer_id,
w.amount,
w.loaded_at
FROM warehouse.orders w
LEFT JOIN source.orders s ON w.order_id = s.order_id
WHERE s.order_id IS NULL;
Output: No rows — good, the warehouse has no phantom rows in this case. In production, any rows returned here are red flags.
Check 4 — Field-Level Comparison (Find Value Mismatches)
This is the most important reconciliation query and the most commonly skipped. Row counts can match perfectly while values are wrong.
SELECT
s.order_id,
'amount' AS field_name,
s.amount::VARCHAR AS source_value,
w.amount::VARCHAR AS warehouse_value
FROM source.orders s
JOIN warehouse.orders w ON s.order_id = w.order_id
WHERE s.amount != w.amount
UNION ALL
SELECT
s.order_id,
'status' AS field_name,
s.status AS source_value,
w.status AS warehouse_value
FROM source.orders s
JOIN warehouse.orders w ON s.order_id = w.order_id
WHERE s.status != w.status
UNION ALL
SELECT
s.order_id,
'customer_id' AS field_name,
s.customer_id::VARCHAR AS source_value,
w.customer_id::VARCHAR AS warehouse_value
FROM source.orders s
JOIN warehouse.orders w ON s.order_id = w.order_id
WHERE s.customer_id != w.customer_id
ORDER BY order_id, field_name;
Output:
| order_id | field_name | source_value | warehouse_value |
|---|---|---|---|
| 1003 | amount | 400.00 | 450.00 |
Order 1003 has the wrong amount in the warehouse — 450.00 loaded instead of 400.00. Without field-level reconciliation, this would have been invisible because row counts matched (4 matched rows between source and warehouse minus the missing row 1005).
Check 5 — Aggregate Reconciliation (Sum, Average, Count)
For financial pipelines, the total amounts must match exactly. Even one incorrect value of 50 cents will break your reconciliation.
SELECT
'source' AS source,
COUNT(*) AS row_count,
SUM(amount) AS total_amount,
AVG(amount) AS avg_amount,
MIN(amount) AS min_amount,
MAX(amount) AS max_amount
FROM source.orders
UNION ALL
SELECT
'warehouse' AS source,
COUNT(*) AS row_count,
SUM(amount) AS total_amount,
AVG(amount) AS avg_amount,
MIN(amount) AS min_amount,
MAX(amount) AS max_amount
FROM warehouse.orders;
Output:
| source | row_count | total_amount | avg_amount | min_amount | max_amount |
|---|---|---|---|---|---|
| source | 5 | 4250.00 | 850.00 | 350.00 | 1500.00 |
| warehouse | 4 | 3800.00 | 950.00 | 350.00 | 1200.00 |
Two differences visible:
- Total amount is off by 450 (source 4250 vs warehouse 3800): 1500 from missing order 1005, minus 50 from the wrong amount on order 1003
- Average is different, max differs (1500 is missing from warehouse)
Check 6 — Partition-Level Reconciliation
For large tables that load incrementally, reconcile by partition (date or month) to quickly identify which time period has the gap.
SELECT
order_date,
COUNT(*) AS source_count,
SUM(amount) AS source_total
FROM source.orders
GROUP BY order_date
EXCEPT
SELECT
order_date,
COUNT(*) AS warehouse_count,
SUM(amount) AS warehouse_total
FROM warehouse.orders
GROUP BY order_date
ORDER BY order_date;
Output:
| order_date | source_count | source_total |
|---|---|---|
| 2024-03-02 | 2 | 750.00 |
| 2024-03-03 | 1 | 1500.00 |
Both March 2nd and March 3rd show differences. March 3rd is missing entirely. March 2nd has the right count but wrong total (due to the amount error on order 1003).
Check 7 — Hash-Based Row Comparison
For exact byte-level comparison of rows (not just specific fields), hashing each row gives a fingerprint you can compare between source and warehouse.
-- PostgreSQL: use MD5 to hash all columns into a single comparable value
SELECT
order_id,
MD5(
CONCAT(
COALESCE(order_id::VARCHAR, ''),
COALESCE(customer_id::VARCHAR, ''),
COALESCE(amount::VARCHAR, ''),
COALESCE(status, ''),
COALESCE(order_date::VARCHAR, '')
)
) AS row_hash
FROM source.orders
EXCEPT
SELECT
order_id,
MD5(
CONCAT(
COALESCE(order_id::VARCHAR, ''),
COALESCE(customer_id::VARCHAR, ''),
COALESCE(amount::VARCHAR, ''),
COALESCE(status, ''),
COALESCE(order_date::VARCHAR, '')
)
) AS row_hash
FROM warehouse.orders;
Output:
| order_id | row_hash |
|---|---|
| 1003 | a3f8c… |
| 1005 | 9b2d1… |
Any row where the hash differs between source and warehouse appears here. Order 1003 (wrong amount) and order 1005 (missing) are both caught. This is the fastest way to check large tables — compare one hash column instead of many individual fields.
Building a Reusable Reconciliation Report
Combine all checks into a single report that runs after every pipeline load:
WITH
source_stats AS (
SELECT COUNT(*) AS cnt, SUM(amount) AS total FROM source.orders
),
warehouse_stats AS (
SELECT COUNT(*) AS cnt, SUM(amount) AS total FROM warehouse.orders
),
missing_rows AS (
SELECT COUNT(*) AS cnt
FROM source.orders s
LEFT JOIN warehouse.orders w ON s.order_id = w.order_id
WHERE w.order_id IS NULL
),
phantom_rows AS (
SELECT COUNT(*) AS cnt
FROM warehouse.orders w
LEFT JOIN source.orders s ON w.order_id = s.order_id
WHERE s.order_id IS NULL
),
mismatched_values AS (
SELECT COUNT(*) AS cnt
FROM source.orders s
JOIN warehouse.orders w ON s.order_id = w.order_id
WHERE s.amount != w.amount OR s.status != w.status
)
SELECT
s.cnt AS source_rows,
w.cnt AS warehouse_rows,
s.cnt - w.cnt AS row_difference,
s.total AS source_total,
w.total AS warehouse_total,
s.total - w.total AS amount_difference,
m.cnt AS missing_rows,
p.cnt AS phantom_rows,
mv.cnt AS mismatched_rows,
CASE
WHEN s.cnt = w.cnt AND s.total = w.total AND m.cnt = 0 AND p.cnt = 0 AND mv.cnt = 0
THEN 'PASS'
ELSE 'FAIL'
END AS reconciliation_status
FROM source_stats s, warehouse_stats w, missing_rows m, phantom_rows p, mismatched_values mv;
Output:
| source_rows | warehouse_rows | row_difference | source_total | warehouse_total | amount_difference | missing_rows | phantom_rows | mismatched_rows | reconciliation_status |
|---|---|---|---|---|---|---|---|---|---|
| 5 | 4 | 1 | 4250.00 | 3800.00 | 450.00 | 1 | 0 | 1 | FAIL |
This single query gives your pipeline monitoring system everything it needs to know in one pass. Store the output in a pipeline.reconciliation_log table for historical tracking.
Automating in Your Pipeline (Airflow Example Pattern)
# In Airflow, run reconciliation as a sensor after the load task
def run_reconciliation(**kwargs):
result = run_sql("""
SELECT reconciliation_status, missing_rows, mismatched_rows
FROM reconciliation_report
""")
if result['reconciliation_status'] != 'PASS':
raise ValueError(
f"Reconciliation FAILED: {result['missing_rows']} missing rows, "
f"{result['mismatched_rows']} mismatched values"
)
reconcile_task = PythonOperator(
task_id='reconcile_orders',
python_callable=run_reconciliation,
dag=dag
)
load_task >> reconcile_task # reconciliation runs after load, blocks downstream
Common Mistakes
Mistake 1 — Only checking row counts
Row counts matching means nothing if values are wrong. Always run field-level comparison on at least your critical numeric columns (amounts, quantities, prices).
Mistake 2 — Not reconciling by partition
Checking total counts and totals across all time may hide a gap in a specific month. Always reconcile at the partition level (by date, month, or region).
Mistake 3 — Skipping reconciliation on incremental loads
Incremental pipelines are more error-prone than full loads — late-arriving data, watermark drift, and MERGE bugs all create gaps. Reconcile every incremental run, not just full reloads.
What to Learn Next
Reconciliation catches pipeline errors after they happen. The next tutorial covers proactive data quality — writing SQL checks that validate your data before the load, so problems are caught at the source rather than discovered in the warehouse.