The most dangerous bugs in a data pipeline are not the ones that throw errors — those are easy to find and fix. The dangerous bugs are the ones that silently produce wrong results: queries that run successfully but return incorrect data that flows into your warehouse, corrupts reports, and goes undetected for weeks.
This tutorial covers the SQL mistakes that show up most often in production data pipelines, with examples of exactly what goes wrong and how to fix each one.
The Dataset
CREATE TABLE pipeline_events (
event_id INT,
user_id INT,
event_type VARCHAR(30),
amount DECIMAL(10,2),
discount DECIMAL(10,2),
event_ts TIMESTAMP WITH TIME ZONE,
category_id VARCHAR(10),
quantity INT
);
INSERT INTO pipeline_events VALUES
(1, 101, 'purchase', 500.00, 50.00, '2024-03-01 08:00:00+00', '10', 2),
(2, 102, 'purchase', 300.00, NULL, '2024-03-01 14:00:00+05', '20', 1),
(3, 103, 'refund', -100.00, 0.00, '2024-03-01 23:30:00+00', '10', NULL),
(4, 101, 'purchase', 750.00, 75.00, '2024-03-02 01:00:00+00', '10', 3),
(5, 104, 'purchase', 200.00, NULL, '2024-03-02 09:00:00-05', '30', 1),
(6, 102, 'purchase', 450.00, 45.00, '2024-03-02 16:00:00+00', '20', NULL);
Mistake 1 — NULL Arithmetic Silently Returns NULL
This is the most common silent data error in SQL. Any arithmetic operation involving NULL returns NULL — not zero, not an error, just NULL.
-- WRONG: discount is NULL for some rows
SELECT
event_id,
amount,
discount,
amount - discount AS net_amount -- returns NULL when discount is NULL
FROM pipeline_events;
Output:
| event_id | amount | discount | net_amount |
|---|---|---|---|
| 1 | 500.00 | 50.00 | 450.00 |
| 2 | 300.00 | NULL | NULL |
| 3 | -100.00 | 0.00 | -100.00 |
| 4 | 750.00 | 75.00 | 675.00 |
| 5 | 200.00 | NULL | NULL |
| 6 | 450.00 | 45.00 | 405.00 |
Events 2 and 5 have NULL net_amount. If this column feeds a revenue dashboard, those two events disappear from your totals entirely — silent data loss.
-- CORRECT: treat NULL discount as 0
SELECT
event_id,
amount,
discount,
amount - COALESCE(discount, 0) AS net_amount
FROM pipeline_events;
Output:
| event_id | amount | discount | net_amount |
|---|---|---|---|
| 1 | 500.00 | 50.00 | 450.00 |
| 2 | 300.00 | NULL | 300.00 |
| 4 | 750.00 | 75.00 | 675.00 |
| 5 | 200.00 | NULL | 200.00 |
Rule: Always apply COALESCE() or NULLIF() when doing arithmetic on columns that can contain NULL.
Mistake 2 — NULL in WHERE Clauses Silently Excludes Rows
Comparing anything to NULL with = always returns NULL (not TRUE or FALSE), which means the row is excluded from results.
-- WRONG: trying to find rows where discount is NOT 50
SELECT event_id, discount
FROM pipeline_events
WHERE discount != 50;
Output:
| event_id | discount |
|---|---|
| 3 | 0.00 |
| 6 | 45.00 |
Events 2 and 5 (where discount IS NULL) are silently missing. NULL != 50 evaluates to NULL, which is treated as FALSE.
-- CORRECT: explicitly include NULLs
SELECT event_id, discount
FROM pipeline_events
WHERE discount != 50
OR discount IS NULL;
Output:
| event_id | discount |
|---|---|
| 2 | NULL |
| 3 | 0.00 |
| 5 | NULL |
| 6 | 45.00 |
Rule: Whenever you filter on a nullable column, ask yourself whether NULL rows should be included. If yes, add OR column IS NULL.
Mistake 3 — NULL in COUNT Gives Wrong Row Counts
COUNT(column) counts non-NULL values only. COUNT(*) counts all rows including NULLs. This distinction causes subtle miscounts in pipeline aggregations.
SELECT
COUNT(*) AS total_rows,
COUNT(discount) AS rows_with_discount, -- excludes NULLs
COUNT(quantity) AS rows_with_quantity -- excludes NULLs
FROM pipeline_events;
Output:
| total_rows | rows_with_discount | rows_with_quantity |
|---|---|---|
| 6 | 4 | 4 |
If you expected COUNT(discount) to return 6 and it returns 4, your completeness metric will report incorrect data coverage. Be explicit about which count you need and why.
Mistake 4 — Implicit Type Casting Breaks Joins and Filters
When you join or filter columns of different data types, the database silently casts one type to another. This can prevent index usage, cause unexpected matches, or filter out rows you intended to include.
In our dataset, category_id is stored as VARCHAR('10') but in your warehouse dimension it might be INT(10).
-- WARNING: implicit cast may work but disables index on category_id
SELECT *
FROM pipeline_events
WHERE category_id = 10; -- integer literal, column is VARCHAR
Most databases cast 10 to '10' and the query works — but the execution plan does a full table scan instead of using an index, making this extremely slow at scale.
-- CORRECT: match the data type explicitly
SELECT *
FROM pipeline_events
WHERE category_id = '10';
Worse case — joins between INT and VARCHAR lose rows:
-- If dim_categories.category_id is INT and pipeline_events.category_id is VARCHAR,
-- rows like '10 ' (with trailing space) will not match 10 (integer)
SELECT p.*, d.category_name
FROM pipeline_events p
JOIN dim_categories d ON p.category_id = d.category_id; -- implicit cast, silent mismatches
Fix: Enforce consistent data types in your schema. If they must differ, cast explicitly in the JOIN:
JOIN dim_categories d ON CAST(p.category_id AS INT) = d.category_id
Mistake 5 — Timezone Errors That Shift Your Data by Hours
This is one of the most damaging pipeline bugs because it is almost invisible. Your data looks correct, but all your timestamps are off by 5 or 13 hours depending on server timezone settings.
Our dataset has events from different timezones: +00, +05, -05.
-- WRONG: treating all timestamps as if they are in the same timezone
SELECT
DATE(event_ts) AS event_date,
COUNT(*) AS event_count
FROM pipeline_events
GROUP BY DATE(event_ts);
Output (timezone-naive — wrong):
| event_date | event_count |
|---|---|
| 2024-03-01 | 3 |
| 2024-03-02 | 3 |
-- CORRECT: convert all timestamps to UTC before extracting the date
SELECT
DATE(event_ts AT TIME ZONE 'UTC') AS event_date_utc,
COUNT(*) AS event_count
FROM pipeline_events
GROUP BY DATE(event_ts AT TIME ZONE 'UTC');
Output (timezone-correct — UTC):
| event_date_utc | event_count |
|---|---|
| 2024-03-01 | 4 |
| 2024-03-02 | 2 |
Event 4 (timestamp 2024-03-02 01:00:00+00) correctly falls on March 2nd UTC. Event 5 (timestamp 2024-03-02 09:00:00-05) is actually 2024-03-02 14:00:00 UTC — still March 2nd. But event 2 (timestamp 2024-03-01 14:00:00+05) is 2024-03-01 09:00:00 UTC — still March 1st. Without UTC conversion, offsets can shift events across day boundaries.
Rule: Always store and process timestamps in UTC. Convert to local timezone only at the reporting layer.
Mistake 6 — BETWEEN Is Inclusive on Both Ends
Most data engineers know BETWEEN is inclusive, but it trips people up on timestamp ranges.
-- WRONG: this includes events at exactly 2024-03-02 00:00:00 in BOTH ranges
SELECT * FROM pipeline_events WHERE event_ts BETWEEN '2024-03-01' AND '2024-03-02';
-- AND
SELECT * FROM pipeline_events WHERE event_ts BETWEEN '2024-03-02' AND '2024-03-03';
-- Event at exactly midnight 2024-03-02 appears in BOTH queries
-- CORRECT: use >= and < for timestamp ranges to avoid boundary overlap
SELECT * FROM pipeline_events
WHERE event_ts >= '2024-03-01 00:00:00'
AND event_ts < '2024-03-02 00:00:00';
This pattern — >= lower bound, < upper bound — is the standard for all date range partitioning in pipelines.
Mistake 7 — String Aggregation Order Is Not Guaranteed
When building comma-separated lists or concatenated strings in SQL (for logging, tagging, or metadata), most databases do not guarantee the order of aggregation without an explicit ORDER BY.
-- WRONG: order of event types is non-deterministic across runs
SELECT
user_id,
STRING_AGG(event_type, ',') AS event_sequence
FROM pipeline_events
GROUP BY user_id;
Output (may vary between runs):
| user_id | event_sequence |
|---|---|
| 101 | purchase,purchase |
| 102 | purchase,purchase |
For user activity sequences used as ML features, order matters. purchase,refund is a different pattern from refund,purchase.
-- CORRECT: specify ORDER BY inside STRING_AGG
SELECT
user_id,
STRING_AGG(event_type, ',' ORDER BY event_ts) AS event_sequence
FROM pipeline_events
GROUP BY user_id;
Mistake 8 — Using NOT IN With a Subquery That Can Return NULL
NOT IN with a subquery silently returns zero rows if the subquery returns any NULL value.
-- Suppose we want events NOT made by users in a blocklist
CREATE TABLE blocked_users (user_id INT);
INSERT INTO blocked_users VALUES (104), (NULL); -- one NULL slips in
-- WRONG: returns zero rows when blocked_users contains any NULL
SELECT * FROM pipeline_events
WHERE user_id NOT IN (SELECT user_id FROM blocked_users);
Output: 0 rows (even though only user 104 should be excluded).
NOT IN (104, NULL) evaluates as user_id != 104 AND user_id != NULL. Since user_id != NULL is always NULL (not TRUE), the entire condition is NULL and every row is excluded.
-- CORRECT: use NOT EXISTS instead
SELECT * FROM pipeline_events p
WHERE NOT EXISTS (
SELECT 1 FROM blocked_users b
WHERE b.user_id = p.user_id
);
-- OR: filter NULLs from the subquery
SELECT * FROM pipeline_events
WHERE user_id NOT IN (
SELECT user_id FROM blocked_users WHERE user_id IS NOT NULL
);
Rule: Never use NOT IN with a subquery unless you are 100% certain the subquery will never return NULL. NOT EXISTS is always the safer choice.
Mistake 9 — Dividing Integers Returns an Integer
In most SQL databases, dividing two integers performs integer division — it truncates the decimal portion silently.
-- WRONG: integer division truncates result
SELECT
event_id,
quantity,
amount,
amount / quantity AS unit_price -- if quantity is INT, result may truncate
FROM pipeline_events
WHERE quantity IS NOT NULL;
In databases that enforce integer division, 500 / 2 = 250 (correct here), but 300 / 1 = 300. The danger is less obvious: 750 / 3 = 250 but what if you had 751 / 3? In integer division, 751 / 3 = 250, not 250.33.
-- CORRECT: cast to DECIMAL before dividing
SELECT
event_id,
quantity,
amount,
amount / CAST(quantity AS DECIMAL(10,4)) AS unit_price
FROM pipeline_events
WHERE quantity IS NOT NULL;
Mistake 10 — GROUP BY Without All Non-Aggregated Columns
Most databases will error on this, but some (MySQL, SQLite) allow it with unpredictable results — picking an arbitrary value for columns not in the GROUP BY.
-- WRONG in MySQL: user_id is not in GROUP BY, database picks arbitrary value
SELECT user_id, event_type, SUM(amount) AS total
FROM pipeline_events
GROUP BY user_id; -- event_type is not aggregated and not in GROUP BY
In MySQL this returns a result with a random event_type per user. In PostgreSQL and most other databases this throws an error — which is actually more helpful because at least you know there is a problem.
-- CORRECT: include all non-aggregated columns in GROUP BY
SELECT user_id, event_type, SUM(amount) AS total
FROM pipeline_events
GROUP BY user_id, event_type;
-- OR: aggregate event_type explicitly if you only want one value
SELECT user_id, MAX(event_type) AS latest_event_type, SUM(amount) AS total
FROM pipeline_events
GROUP BY user_id;
Quick Reference — Mistake Checklist
Before deploying any pipeline SQL, run through this list:
| # | Check | What to look for |
|---|---|---|
| 1 | NULL arithmetic | Any +, -, *, / on nullable columns — wrap with COALESCE |
| 2 | NULL in WHERE | Filters on nullable columns — add OR col IS NULL if needed |
| 3 | COUNT vs COUNT(*) | Using COUNT(col) when you want all rows |
| 4 | Implicit casting | JOIN or WHERE between different data types |
| 5 | Timezones | DATE() or GROUP BY on timestamps without UTC conversion |
| 6 | BETWEEN on timestamps | Use >= and < instead |
| 7 | STRING_AGG order | Add ORDER BY inside STRING_AGG when sequence matters |
| 8 | NOT IN with NULLs | Use NOT EXISTS instead of NOT IN for subqueries |
| 9 | Integer division | Cast to DECIMAL before dividing |
| 10 | GROUP BY completeness | All non-aggregated columns must appear in GROUP BY |
What to Learn Next
Now that you know what can go wrong, the next tutorial covers SQL query optimization — how to read execution plans, use EXPLAIN ANALYZE, and rewrite slow queries that are bottlenecking your pipelines.