SQL for Pipeline Monitoring — Detecting Stale Data, Late Arrivals, and Anomalous Loads

A data pipeline is not done when the first load succeeds. In production, pipelines fail silently all the time — a partition stops updating, row counts drop unexpectedly, timestamps stop moving forward. Without active monitoring, these problems go unnoticed until a dashboard shows wrong numbers and someone in finance calls.

This tutorial covers the SQL queries that data engineers use to monitor pipeline health continuously — detecting problems automatically before users notice them.

The Monitoring Dataset

-- Pipeline run metadata — your pipeline writes this after every successful run
CREATE TABLE pipeline.run_log (
    run_id        BIGINT GENERATED ALWAYS AS IDENTITY,
    pipeline_name VARCHAR(100),
    target_table  VARCHAR(100),
    batch_date    DATE,
    rows_loaded   BIGINT,
    bytes_loaded  BIGINT,
    start_time    TIMESTAMP,
    end_time      TIMESTAMP,
    status        VARCHAR(20),    -- 'success', 'failed', 'partial'
    error_message VARCHAR(500)
);

-- Simulate 14 days of run history
INSERT INTO pipeline.run_log (pipeline_name, target_table, batch_date, rows_loaded, bytes_loaded, start_time, end_time, status) VALUES
('orders_pipeline', 'silver.orders', '2024-03-01', 45230, 15200000, '2024-03-01 02:00:00', '2024-03-01 02:14:00', 'success'),
('orders_pipeline', 'silver.orders', '2024-03-02', 43100, 14800000, '2024-03-02 02:00:00', '2024-03-02 02:13:00', 'success'),
('orders_pipeline', 'silver.orders', '2024-03-03', 44800, 15100000, '2024-03-03 02:00:00', '2024-03-03 02:15:00', 'success'),
('orders_pipeline', 'silver.orders', '2024-03-04', 250,   85000,    '2024-03-04 02:00:00', '2024-03-04 02:01:00', 'success'),  -- suspiciously low
('orders_pipeline', 'silver.orders', '2024-03-05', 44200, 14900000, '2024-03-05 02:00:00', '2024-03-05 02:14:00', 'success'),
('orders_pipeline', 'silver.orders', '2024-03-06', 46100, 15500000, '2024-03-06 02:00:00', '2024-03-06 02:16:00', 'success'),
('orders_pipeline', 'silver.orders', '2024-03-07', 38900, 13100000, '2024-03-07 02:00:00', '2024-03-07 02:12:00', 'success'),
('orders_pipeline', 'silver.orders', '2024-03-08', 45500, 15300000, '2024-03-08 02:00:00', '2024-03-08 02:14:00', 'success');
-- No run on 2024-03-09 or beyond -- pipeline stopped running

Check 1 — Detect Missing Pipeline Runs (Stale Partitions)

The most critical check: is the pipeline still running every day? A gap means data consumers are looking at stale information.

WITH date_spine AS (
    SELECT CURRENT_DATE - (n.n || ' days')::INTERVAL AS expected_date
    FROM GENERATE_SERIES(0, 13) AS n(n)
),
run_status AS (
    SELECT
        batch_date,
        status,
        rows_loaded
    FROM pipeline.run_log
    WHERE pipeline_name = 'orders_pipeline'
      AND status = 'success'
)
SELECT
    ds.expected_date,
    CASE
        WHEN rs.batch_date IS NOT NULL THEN 'OK'
        ELSE 'MISSING'
    END AS run_status,
    rs.rows_loaded
FROM date_spine ds
LEFT JOIN run_status rs ON ds.expected_date = rs.batch_date
ORDER BY ds.expected_date DESC;

Output:

expected_date	run_status	rows_loaded
2024-03-14	MISSING	NULL
2024-03-13	MISSING	NULL
2024-03-12	MISSING	NULL
2024-03-11	MISSING	NULL
2024-03-10	MISSING	NULL
2024-03-09	MISSING	NULL
2024-03-08	OK	45500
2024-03-07	OK	38900
…	…	…

6 consecutive days of MISSING runs. Your pipeline stopped on March 9th.

Alert query — fire when last successful run was more than 1 day ago:

SELECT
    pipeline_name,
    MAX(batch_date) AS last_successful_run,
    CURRENT_DATE - MAX(batch_date) AS days_since_last_run
FROM pipeline.run_log
WHERE status = 'success'
GROUP BY pipeline_name
HAVING CURRENT_DATE - MAX(batch_date) > 1
ORDER BY days_since_last_run DESC;

Check 2 — Anomalous Row Counts (Statistical Detection)

March 4th loaded only 250 rows when the pipeline normally loads ~44,000. This is a silent failure — the pipeline reported success but the data was clearly wrong.

WITH stats AS (
    SELECT
        pipeline_name,
        AVG(rows_loaded)    AS avg_rows,
        STDDEV(rows_loaded) AS stddev_rows,
        MIN(rows_loaded)    AS min_rows,
        MAX(rows_loaded)    AS max_rows
    FROM pipeline.run_log
    WHERE status = 'success'
      AND batch_date >= CURRENT_DATE - 30   -- use last 30 days as baseline
    GROUP BY pipeline_name
)
SELECT
    r.pipeline_name,
    r.batch_date,
    r.rows_loaded,
    ROUND(s.avg_rows)    AS typical_rows,
    ROUND(s.stddev_rows) AS stddev_rows,
    CASE
        WHEN r.rows_loaded < s.avg_rows - 3 * s.stddev_rows THEN 'ANOMALY: too few rows'
        WHEN r.rows_loaded > s.avg_rows + 3 * s.stddev_rows THEN 'ANOMALY: too many rows'
        ELSE 'OK'
    END AS anomaly_check
FROM pipeline.run_log r
JOIN stats s ON r.pipeline_name = s.pipeline_name
WHERE r.batch_date >= CURRENT_DATE - 7
ORDER BY r.batch_date DESC;

Output:

pipeline_name	batch_date	rows_loaded	typical_rows	anomaly_check
orders_pipeline	2024-03-08	45500	43985	OK
orders_pipeline	2024-03-07	38900	43985	OK
orders_pipeline	2024-03-06	46100	43985	OK
orders_pipeline	2024-03-05	44200	43985	OK
orders_pipeline	2024-03-04	250	43985	ANOMALY: too few rows

March 4th flagged automatically. 3 standard deviations is the standard threshold for anomaly detection — adjust based on how noisy your pipeline is.

Check 3 — Detect Stale Data in the Target Table

Even when the pipeline runs, the data inside the target table might not be fresh. This checks the actual max timestamp in the Silver table.

SELECT
    'silver.orders'                           AS table_name,
    MAX(updated_at)                           AS latest_record_ts,
    CURRENT_TIMESTAMP                         AS check_time,
    CURRENT_TIMESTAMP - MAX(updated_at)       AS data_age,
    CASE
        WHEN CURRENT_TIMESTAMP - MAX(updated_at) > INTERVAL '25 hours'
        THEN 'STALE: data older than 25 hours'
        ELSE 'FRESH'
    END AS freshness_status
FROM silver.orders;

Output when pipeline is failing:

table_name	latest_record_ts	data_age	freshness_status
silver.orders	2024-03-08 02:14:00	6 days 03:15:00	STALE: data older than 25 hours

Run this check across all your key tables:

-- Multi-table freshness check
SELECT 'silver.orders'   AS tbl, MAX(updated_at) AS latest FROM silver.orders   UNION ALL
SELECT 'silver.events'   AS tbl, MAX(event_ts)   AS latest FROM silver.events   UNION ALL
SELECT 'silver.customers'AS tbl, MAX(updated_at) AS latest FROM silver.customers
ORDER BY latest ASC;   -- oldest table at the top = first problem to investigate

Check 4 — Pipeline Duration Trend (Detecting Slowdowns)

A pipeline that runs in 14 minutes today but takes 90 minutes next month needs attention. Catching the trend early prevents SLA breaches.

SELECT
    batch_date,
    rows_loaded,
    EXTRACT(EPOCH FROM (end_time - start_time)) / 60 AS duration_minutes,
    rows_loaded / NULLIF(EXTRACT(EPOCH FROM (end_time - start_time)), 0) AS rows_per_second,
    AVG(EXTRACT(EPOCH FROM (end_time - start_time)) / 60)
        OVER (ORDER BY batch_date ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) AS moving_avg_duration_7d
FROM pipeline.run_log
WHERE pipeline_name = 'orders_pipeline'
  AND status = 'success'
ORDER BY batch_date;

Output:

batch_date	rows_loaded	duration_minutes	rows_per_second	moving_avg_duration_7d
2024-03-01	45230	14.0	53.8	14.0
2024-03-02	43100	13.0	55.3	13.5
2024-03-03	44800	15.0	49.8	14.0
2024-03-04	250	1.0	4.2	10.8
2024-03-05	44200	14.0	52.6	11.4
2024-03-06	46100	16.0	48.0	12.2
2024-03-07	38900	12.0	54.0	12.1
2024-03-08	45500	14.0	54.2	12.5

March 4th shows 4.2 rows/second vs the usual 50+ — another signal that the run was anomalous, even though it completed without error.

Check 5 — Partition Completeness by Hour

For pipelines that load hourly partitions, verify that every expected hour arrived.

WITH expected_hours AS (
    SELECT
        DATE_TRUNC('hour', CURRENT_TIMESTAMP - (n || ' hours')::INTERVAL) AS expected_hour
    FROM GENERATE_SERIES(1, 24) AS n
)
SELECT
    eh.expected_hour,
    COUNT(o.order_id) AS row_count,
    CASE WHEN COUNT(o.order_id) = 0 THEN 'MISSING' ELSE 'OK' END AS status
FROM expected_hours eh
LEFT JOIN silver.orders o
    ON DATE_TRUNC('hour', o.loaded_at) = eh.expected_hour
GROUP BY eh.expected_hour
ORDER BY eh.expected_hour DESC;

This shows which hourly partitions have data and which are empty — useful for near-real-time pipelines where a 2-hour gap needs immediate investigation.

Check 6 — Late-Arriving Data Detection

Some rows arrive with event timestamps from previous days — common in mobile apps that batch events, or systems with upload delays. This check shows how much data is arriving late.

SELECT
    loaded_date,
    event_date,
    loaded_date - event_date AS days_late,
    COUNT(*) AS row_count,
    ROUND(COUNT(*) * 100.0 / SUM(COUNT(*)) OVER (PARTITION BY loaded_date), 2) AS pct_of_daily_load
FROM (
    SELECT
        DATE(loaded_at) AS loaded_date,
        event_date
    FROM silver.events
    WHERE loaded_at >= CURRENT_DATE - 7
) sub
GROUP BY loaded_date, event_date
ORDER BY loaded_date DESC, days_late;

Output:

loaded_date	event_date	days_late	row_count	pct_of_daily_load
2024-03-08	2024-03-08	0	41200	90.5%
2024-03-08	2024-03-07	1	3100	6.8%
2024-03-08	2024-03-06	2	900	2.0%
2024-03-08	2024-03-05	3	300	0.7%

90.5% of data arrives same-day. 6.8% arrives 1 day late — this informs your lookback buffer decision. If 3% arrives 3 days late, set your watermark lookback to at least 4 days.

Building a Monitoring Dashboard Table

Combine all checks into a daily monitoring table that your alerting system reads:

INSERT INTO pipeline.monitoring_report (
    check_date, pipeline_name, check_name, status, detail, created_at
)

-- Check 1: Missing run
SELECT
    CURRENT_DATE,
    'orders_pipeline',
    'last_run_freshness',
    CASE WHEN CURRENT_DATE - MAX(batch_date) > 1 THEN 'FAIL' ELSE 'PASS' END,
    'Last run: ' || MAX(batch_date)::VARCHAR || ', days ago: ' || (CURRENT_DATE - MAX(batch_date))::VARCHAR,
    CURRENT_TIMESTAMP
FROM pipeline.run_log
WHERE pipeline_name = 'orders_pipeline' AND status = 'success'

UNION ALL

-- Check 2: Row count anomaly (latest run vs 7-day average)
SELECT
    CURRENT_DATE,
    'orders_pipeline',
    'row_count_anomaly',
    CASE
        WHEN ABS(latest.rows_loaded - hist.avg_rows) > 3 * hist.stddev_rows
        THEN 'FAIL'
        ELSE 'PASS'
    END,
    'Latest: ' || latest.rows_loaded::VARCHAR || ', Avg: ' || ROUND(hist.avg_rows)::VARCHAR,
    CURRENT_TIMESTAMP
FROM (
    SELECT rows_loaded FROM pipeline.run_log
    WHERE pipeline_name = 'orders_pipeline' AND status = 'success'
    ORDER BY batch_date DESC LIMIT 1
) latest
CROSS JOIN (
    SELECT AVG(rows_loaded) AS avg_rows, STDDEV(rows_loaded) AS stddev_rows
    FROM pipeline.run_log
    WHERE pipeline_name = 'orders_pipeline' AND status = 'success'
      AND batch_date >= CURRENT_DATE - 30
) hist

UNION ALL

-- Check 3: Data freshness
SELECT
    CURRENT_DATE,
    'orders_pipeline',
    'data_freshness',
    CASE
        WHEN CURRENT_TIMESTAMP - MAX(updated_at) > INTERVAL '25 hours' THEN 'FAIL'
        ELSE 'PASS'
    END,
    'Latest record: ' || MAX(updated_at)::VARCHAR,
    CURRENT_TIMESTAMP
FROM silver.orders;

Query this table from Grafana, Apache Superset, or any alerting tool to get a real-time view of pipeline health.

Common Mistakes

Mistake 1 — Monitoring the run log but not the target table

A pipeline can log “success” but load zero rows or wrong data. Always check the actual target table, not just the run log.

Mistake 2 — Fixed thresholds instead of dynamic baselines

WHERE rows_loaded < 40000 breaks when seasonality causes legitimate dips (weekends, holidays). Use statistical thresholds (mean ± 3 std) instead.

Mistake 3 — Not storing monitoring history

Without history, you cannot answer “how long has this been broken?” or “what was the trend before it broke?” Always INSERT monitoring results into a log table rather than SELECT-only queries.

What to Learn Next

Monitoring tells you when something is wrong. The next tutorial covers SQL patterns for debugging broken pipelines — how to trace a specific row from Gold back through Silver to Bronze, and how to use SQL to reproduce and explain unexpected results.

The Monitoring Dataset

Check 1 — Detect Missing Pipeline Runs (Stale Partitions)

Check 2 — Anomalous Row Counts (Statistical Detection)

Check 3 — Detect Stale Data in the Target Table

Check 4 — Pipeline Duration Trend (Detecting Slowdowns)

Check 5 — Partition Completeness by Hour

Check 6 — Late-Arriving Data Detection

Building a Monitoring Dashboard Table

Common Mistakes

What to Learn Next

Leave a Comment Cancel reply