Common SQL Mistakes That Break Data Pipelines — NULL Traps, Type Casting, and Timezone Issues

The most dangerous bugs in a data pipeline are not the ones that throw errors — those are easy to find and fix. The dangerous bugs are the ones that silently produce wrong results: queries that run successfully but return incorrect data that flows into your warehouse, corrupts reports, and goes undetected for weeks.

This tutorial covers the SQL mistakes that show up most often in production data pipelines, with examples of exactly what goes wrong and how to fix each one.


The Dataset

CREATE TABLE pipeline_events (
    event_id     INT,
    user_id      INT,
    event_type   VARCHAR(30),
    amount       DECIMAL(10,2),
    discount     DECIMAL(10,2),
    event_ts     TIMESTAMP WITH TIME ZONE,
    category_id  VARCHAR(10),
    quantity     INT
);

INSERT INTO pipeline_events VALUES
(1,  101, 'purchase',  500.00,  50.00, '2024-03-01 08:00:00+00', '10',  2),
(2,  102, 'purchase',  300.00,  NULL,  '2024-03-01 14:00:00+05', '20',  1),
(3,  103, 'refund',   -100.00,   0.00, '2024-03-01 23:30:00+00', '10',  NULL),
(4,  101, 'purchase',  750.00,  75.00, '2024-03-02 01:00:00+00', '10',  3),
(5,  104, 'purchase',  200.00,  NULL,  '2024-03-02 09:00:00-05', '30',  1),
(6,  102, 'purchase',  450.00,  45.00, '2024-03-02 16:00:00+00', '20',  NULL);

Mistake 1 — NULL Arithmetic Silently Returns NULL

This is the most common silent data error in SQL. Any arithmetic operation involving NULL returns NULL — not zero, not an error, just NULL.

-- WRONG: discount is NULL for some rows
SELECT
    event_id,
    amount,
    discount,
    amount - discount AS net_amount   -- returns NULL when discount is NULL
FROM pipeline_events;

Output:

event_idamountdiscountnet_amount
1500.0050.00450.00
2300.00NULLNULL
3-100.000.00-100.00
4750.0075.00675.00
5200.00NULLNULL
6450.0045.00405.00

Events 2 and 5 have NULL net_amount. If this column feeds a revenue dashboard, those two events disappear from your totals entirely — silent data loss.

-- CORRECT: treat NULL discount as 0
SELECT
    event_id,
    amount,
    discount,
    amount - COALESCE(discount, 0) AS net_amount
FROM pipeline_events;

Output:

event_idamountdiscountnet_amount
1500.0050.00450.00
2300.00NULL300.00
4750.0075.00675.00
5200.00NULL200.00

Rule: Always apply COALESCE() or NULLIF() when doing arithmetic on columns that can contain NULL.


Mistake 2 — NULL in WHERE Clauses Silently Excludes Rows

Comparing anything to NULL with = always returns NULL (not TRUE or FALSE), which means the row is excluded from results.

-- WRONG: trying to find rows where discount is NOT 50
SELECT event_id, discount
FROM pipeline_events
WHERE discount != 50;

Output:

event_iddiscount
30.00
645.00

Events 2 and 5 (where discount IS NULL) are silently missing. NULL != 50 evaluates to NULL, which is treated as FALSE.

-- CORRECT: explicitly include NULLs
SELECT event_id, discount
FROM pipeline_events
WHERE discount != 50
   OR discount IS NULL;

Output:

event_iddiscount
2NULL
30.00
5NULL
645.00

Rule: Whenever you filter on a nullable column, ask yourself whether NULL rows should be included. If yes, add OR column IS NULL.


Mistake 3 — NULL in COUNT Gives Wrong Row Counts

COUNT(column) counts non-NULL values only. COUNT(*) counts all rows including NULLs. This distinction causes subtle miscounts in pipeline aggregations.

SELECT
    COUNT(*)        AS total_rows,
    COUNT(discount) AS rows_with_discount,   -- excludes NULLs
    COUNT(quantity) AS rows_with_quantity    -- excludes NULLs
FROM pipeline_events;

Output:

total_rowsrows_with_discountrows_with_quantity
644

If you expected COUNT(discount) to return 6 and it returns 4, your completeness metric will report incorrect data coverage. Be explicit about which count you need and why.


Mistake 4 — Implicit Type Casting Breaks Joins and Filters

When you join or filter columns of different data types, the database silently casts one type to another. This can prevent index usage, cause unexpected matches, or filter out rows you intended to include.

In our dataset, category_id is stored as VARCHAR('10') but in your warehouse dimension it might be INT(10).

-- WARNING: implicit cast may work but disables index on category_id
SELECT *
FROM pipeline_events
WHERE category_id = 10;   -- integer literal, column is VARCHAR

Most databases cast 10 to '10' and the query works — but the execution plan does a full table scan instead of using an index, making this extremely slow at scale.

-- CORRECT: match the data type explicitly
SELECT *
FROM pipeline_events
WHERE category_id = '10';

Worse case — joins between INT and VARCHAR lose rows:

-- If dim_categories.category_id is INT and pipeline_events.category_id is VARCHAR,
-- rows like '10 ' (with trailing space) will not match 10 (integer)
SELECT p.*, d.category_name
FROM pipeline_events p
JOIN dim_categories d ON p.category_id = d.category_id;  -- implicit cast, silent mismatches

Fix: Enforce consistent data types in your schema. If they must differ, cast explicitly in the JOIN:

JOIN dim_categories d ON CAST(p.category_id AS INT) = d.category_id

Mistake 5 — Timezone Errors That Shift Your Data by Hours

This is one of the most damaging pipeline bugs because it is almost invisible. Your data looks correct, but all your timestamps are off by 5 or 13 hours depending on server timezone settings.

Our dataset has events from different timezones: +00, +05, -05.

-- WRONG: treating all timestamps as if they are in the same timezone
SELECT
    DATE(event_ts) AS event_date,
    COUNT(*) AS event_count
FROM pipeline_events
GROUP BY DATE(event_ts);

Output (timezone-naive — wrong):

event_dateevent_count
2024-03-013
2024-03-023
-- CORRECT: convert all timestamps to UTC before extracting the date
SELECT
    DATE(event_ts AT TIME ZONE 'UTC') AS event_date_utc,
    COUNT(*) AS event_count
FROM pipeline_events
GROUP BY DATE(event_ts AT TIME ZONE 'UTC');

Output (timezone-correct — UTC):

event_date_utcevent_count
2024-03-014
2024-03-022

Event 4 (timestamp 2024-03-02 01:00:00+00) correctly falls on March 2nd UTC. Event 5 (timestamp 2024-03-02 09:00:00-05) is actually 2024-03-02 14:00:00 UTC — still March 2nd. But event 2 (timestamp 2024-03-01 14:00:00+05) is 2024-03-01 09:00:00 UTC — still March 1st. Without UTC conversion, offsets can shift events across day boundaries.

Rule: Always store and process timestamps in UTC. Convert to local timezone only at the reporting layer.


Mistake 6 — BETWEEN Is Inclusive on Both Ends

Most data engineers know BETWEEN is inclusive, but it trips people up on timestamp ranges.

-- WRONG: this includes events at exactly 2024-03-02 00:00:00 in BOTH ranges
SELECT * FROM pipeline_events WHERE event_ts BETWEEN '2024-03-01' AND '2024-03-02';
-- AND
SELECT * FROM pipeline_events WHERE event_ts BETWEEN '2024-03-02' AND '2024-03-03';
-- Event at exactly midnight 2024-03-02 appears in BOTH queries
-- CORRECT: use >= and < for timestamp ranges to avoid boundary overlap
SELECT * FROM pipeline_events
WHERE event_ts >= '2024-03-01 00:00:00'
  AND event_ts <  '2024-03-02 00:00:00';

This pattern — >= lower bound, < upper bound — is the standard for all date range partitioning in pipelines.


Mistake 7 — String Aggregation Order Is Not Guaranteed

When building comma-separated lists or concatenated strings in SQL (for logging, tagging, or metadata), most databases do not guarantee the order of aggregation without an explicit ORDER BY.

-- WRONG: order of event types is non-deterministic across runs
SELECT
    user_id,
    STRING_AGG(event_type, ',') AS event_sequence
FROM pipeline_events
GROUP BY user_id;

Output (may vary between runs):

user_idevent_sequence
101purchase,purchase
102purchase,purchase

For user activity sequences used as ML features, order matters. purchase,refund is a different pattern from refund,purchase.

-- CORRECT: specify ORDER BY inside STRING_AGG
SELECT
    user_id,
    STRING_AGG(event_type, ',' ORDER BY event_ts) AS event_sequence
FROM pipeline_events
GROUP BY user_id;

Mistake 8 — Using NOT IN With a Subquery That Can Return NULL

NOT IN with a subquery silently returns zero rows if the subquery returns any NULL value.

-- Suppose we want events NOT made by users in a blocklist
CREATE TABLE blocked_users (user_id INT);
INSERT INTO blocked_users VALUES (104), (NULL);   -- one NULL slips in

-- WRONG: returns zero rows when blocked_users contains any NULL
SELECT * FROM pipeline_events
WHERE user_id NOT IN (SELECT user_id FROM blocked_users);

Output: 0 rows (even though only user 104 should be excluded).

NOT IN (104, NULL) evaluates as user_id != 104 AND user_id != NULL. Since user_id != NULL is always NULL (not TRUE), the entire condition is NULL and every row is excluded.

-- CORRECT: use NOT EXISTS instead
SELECT * FROM pipeline_events p
WHERE NOT EXISTS (
    SELECT 1 FROM blocked_users b
    WHERE b.user_id = p.user_id
);

-- OR: filter NULLs from the subquery
SELECT * FROM pipeline_events
WHERE user_id NOT IN (
    SELECT user_id FROM blocked_users WHERE user_id IS NOT NULL
);

Rule: Never use NOT IN with a subquery unless you are 100% certain the subquery will never return NULL. NOT EXISTS is always the safer choice.


Mistake 9 — Dividing Integers Returns an Integer

In most SQL databases, dividing two integers performs integer division — it truncates the decimal portion silently.

-- WRONG: integer division truncates result
SELECT
    event_id,
    quantity,
    amount,
    amount / quantity AS unit_price   -- if quantity is INT, result may truncate
FROM pipeline_events
WHERE quantity IS NOT NULL;

In databases that enforce integer division, 500 / 2 = 250 (correct here), but 300 / 1 = 300. The danger is less obvious: 750 / 3 = 250 but what if you had 751 / 3? In integer division, 751 / 3 = 250, not 250.33.

-- CORRECT: cast to DECIMAL before dividing
SELECT
    event_id,
    quantity,
    amount,
    amount / CAST(quantity AS DECIMAL(10,4)) AS unit_price
FROM pipeline_events
WHERE quantity IS NOT NULL;

Mistake 10 — GROUP BY Without All Non-Aggregated Columns

Most databases will error on this, but some (MySQL, SQLite) allow it with unpredictable results — picking an arbitrary value for columns not in the GROUP BY.

-- WRONG in MySQL: user_id is not in GROUP BY, database picks arbitrary value
SELECT user_id, event_type, SUM(amount) AS total
FROM pipeline_events
GROUP BY user_id;   -- event_type is not aggregated and not in GROUP BY

In MySQL this returns a result with a random event_type per user. In PostgreSQL and most other databases this throws an error — which is actually more helpful because at least you know there is a problem.

-- CORRECT: include all non-aggregated columns in GROUP BY
SELECT user_id, event_type, SUM(amount) AS total
FROM pipeline_events
GROUP BY user_id, event_type;

-- OR: aggregate event_type explicitly if you only want one value
SELECT user_id, MAX(event_type) AS latest_event_type, SUM(amount) AS total
FROM pipeline_events
GROUP BY user_id;

Quick Reference — Mistake Checklist

Before deploying any pipeline SQL, run through this list:

#CheckWhat to look for
1NULL arithmeticAny +, -, *, / on nullable columns — wrap with COALESCE
2NULL in WHEREFilters on nullable columns — add OR col IS NULL if needed
3COUNT vs COUNT(*)Using COUNT(col) when you want all rows
4Implicit castingJOIN or WHERE between different data types
5TimezonesDATE() or GROUP BY on timestamps without UTC conversion
6BETWEEN on timestampsUse >= and < instead
7STRING_AGG orderAdd ORDER BY inside STRING_AGG when sequence matters
8NOT IN with NULLsUse NOT EXISTS instead of NOT IN for subqueries
9Integer divisionCast to DECIMAL before dividing
10GROUP BY completenessAll non-aggregated columns must appear in GROUP BY

What to Learn Next

Now that you know what can go wrong, the next tutorial covers SQL query optimization — how to read execution plans, use EXPLAIN ANALYZE, and rewrite slow queries that are bottlenecking your pipelines.

Leave a Comment