A query that takes 30 seconds in development becomes a 45-minute pipeline blocker in production when it runs against 200 million rows. Query optimization is not optional for data engineers — it is what separates a pipeline that scales from one that needs to be rebuilt six months later.
This tutorial covers how to read execution plans, identify the bottlenecks, and apply the right fix for each one.
The Dataset
CREATE TABLE orders (
order_id BIGINT,
customer_id INT,
product_id INT,
amount DECIMAL(12,2),
status VARCHAR(20),
region VARCHAR(30),
order_date DATE
);
CREATE TABLE customers (
customer_id INT,
name VARCHAR(100),
segment VARCHAR(20),
country VARCHAR(50)
);
CREATE TABLE products (
product_id INT,
product_name VARCHAR(100),
category VARCHAR(30),
cost DECIMAL(10,2)
);
-- Simulate a large orders table
-- In production this would have 100M+ rows
-- Add an index to demonstrate the difference
CREATE INDEX idx_orders_customer ON orders(customer_id);
CREATE INDEX idx_orders_date ON orders(order_date);
CREATE INDEX idx_orders_status ON orders(status);
Step 1 — How to Read an Execution Plan
In PostgreSQL, prefix any query with EXPLAIN ANALYZE to see the actual execution plan with real timing.
EXPLAIN ANALYZE
SELECT
c.segment,
SUM(o.amount) AS total_revenue
FROM orders o
JOIN customers c ON o.customer_id = c.customer_id
WHERE o.order_date >= '2024-01-01'
AND o.status = 'completed'
GROUP BY c.segment;
Example output:
HashAggregate (cost=84321.50..84321.75 rows=25 width=48) (actual time=1823.4..1823.5 rows=4 loops=1)
-> Hash Join (cost=1540.00..83215.50 rows=212800 width=40) (actual time=25.3..1790.2 rows=218430 loops=1)
Hash Cond: (o.customer_id = c.customer_id)
-> Index Scan using idx_orders_date on orders (cost=0.43..71450.00 rows=212800 width=20) (actual time=0.1..950.3 rows=218430 loops=1)
Filter: (status = 'completed')
Rows Removed by Filter: 51200
-> Hash (cost=890.00..890.00 rows=52000 width=28) (actual time=22.1..22.1 rows=52000 loops=1)
-> Seq Scan on customers (cost=0.00..890.00 rows=52000 width=28) (actual time=0.1..10.4 rows=52000 loops=1)
Planning Time: 2.3 ms
Execution Time: 1823.8 ms
How to read this:
- Seq Scan — reads every row in the table. This is expensive on large tables. Look for this as your first optimization target.
- Index Scan — uses an index to find rows. Fast.
- Hash Join — builds a hash table from the smaller table, then probes it for each row of the larger table. Usually efficient.
- cost=X..Y — estimated cost: X = startup cost, Y = total cost. Higher Y = more expensive.
- actual time=X..Y — real execution time in milliseconds. This is what matters.
- rows=N — actual rows processed at this step.
- Rows Removed by Filter: N — how many rows passed the scan but were then discarded by a filter. High numbers here suggest a missing index.
Problem 1 — Seq Scan on a Large Table (Missing Index)
The most common performance problem. The database reads every single row to find the ones you want.
-- This query does a full table scan on orders (no index on status alone)
EXPLAIN ANALYZE
SELECT COUNT(*) FROM orders WHERE status = 'completed';
Bad plan output:
Seq Scan on orders (cost=0.00..285000.00 rows=3200000 width=8)
Filter: (status = 'completed')
Rows Removed by Filter: 1800000
Execution Time: 4821 ms
The database scanned all 5 million rows and discarded 1.8 million of them.
Fix: Add a composite index
-- For status + order_date (both commonly filtered together)
CREATE INDEX idx_orders_status_date ON orders(status, order_date);
After index, the plan becomes:
Index Scan using idx_orders_status_date on orders
Index Cond: (status = 'completed')
Execution Time: 312 ms
From 4,821ms to 312ms — 15x faster.
When a composite index helps: Put the highest-selectivity column (the one that filters the most rows) first in the index. For WHERE status = 'completed' AND order_date >= '2024-01-01', both columns filter data, so the composite index works well.
Problem 2 — Inefficient JOIN Order
The database’s query optimizer usually figures out the best join order, but sometimes you need to help it — especially when statistics are stale.
-- SLOW: joining large table to large table first, then filtering
SELECT o.order_id, c.name, p.product_name
FROM orders o
JOIN products p ON o.product_id = p.product_id
JOIN customers c ON o.customer_id = c.customer_id
WHERE c.country = 'UAE' -- this filter reduces customers dramatically
AND p.category = 'Hardware';
Fix: filter before joining using a CTE or subquery
-- FAST: filter both dimension tables first, then join to the large fact table
WITH uae_customers AS (
SELECT customer_id, name FROM customers WHERE country = 'UAE'
),
hardware_products AS (
SELECT product_id, product_name FROM products WHERE category = 'Hardware'
)
SELECT o.order_id, c.name, p.product_name
FROM orders o
JOIN uae_customers c ON o.customer_id = c.customer_id
JOIN hardware_products p ON o.product_id = p.product_id;
By filtering customers and products first, the JOIN to orders only needs to match a small subset of IDs instead of scanning the full dimension tables on every row.
Problem 3 — Using a Function on an Indexed Column Kills the Index
Wrapping an indexed column in a function prevents the database from using the index.
-- SLOW: EXTRACT() wraps order_date, index on order_date cannot be used
SELECT COUNT(*)
FROM orders
WHERE EXTRACT(YEAR FROM order_date) = 2024;
EXPLAIN output:
Seq Scan on orders
Filter: (EXTRACT(year FROM order_date) = 2024)
Execution Time: 5103 ms
-- FAST: use a range condition directly on the indexed column
SELECT COUNT(*)
FROM orders
WHERE order_date >= '2024-01-01'
AND order_date < '2025-01-01';
EXPLAIN output:
Index Scan using idx_orders_date on orders
Index Cond: (order_date >= '2024-01-01' AND order_date < '2025-01-01')
Execution Time: 287 ms
Same rule applies to any function: UPPER(email) = 'X' — slow. email = LOWER('X') — fast (function applied to the constant, not the column).
Problem 4 — SELECT * Fetches Unnecessary Columns
In columnar storage databases (Snowflake, BigQuery, Redshift, Databricks Delta), SELECT * reads every column from storage even if you only need two or three. At scale this wastes significant I/O and compute.
-- SLOW in columnar storage: reads all columns from disk
SELECT *
FROM orders
WHERE order_date >= '2024-01-01';
-- FAST: reads only the columns you need
SELECT order_id, customer_id, amount, order_date
FROM orders
WHERE order_date >= '2024-01-01';
In Snowflake and BigQuery, you are billed by bytes scanned. SELECT * on a 50-column wide table that has billions of rows can cost 50x more than selecting only the 3 columns you actually use.
Rule: Never use SELECT * in production pipeline SQL. Always name your columns explicitly.
Problem 5 — Correlated Subqueries Execute Once Per Row
A correlated subquery references the outer query and re-executes for every row. On 10 million rows this means 10 million separate subquery executions.
-- SLOW: correlated subquery runs once per order row
SELECT
order_id,
amount,
(SELECT AVG(amount)
FROM orders o2
WHERE o2.customer_id = o1.customer_id) AS customer_avg
FROM orders o1;
Fix: replace with a window function or CTE
-- FAST: window function computes the average in a single pass
SELECT
order_id,
amount,
AVG(amount) OVER (PARTITION BY customer_id) AS customer_avg
FROM orders;
Or using a CTE (better when you need to join rather than just reference):
WITH customer_avgs AS (
SELECT customer_id, AVG(amount) AS avg_amount
FROM orders
GROUP BY customer_id
)
SELECT o.order_id, o.amount, ca.avg_amount AS customer_avg
FROM orders o
JOIN customer_avgs ca ON o.customer_id = ca.customer_id;
Both alternatives run the aggregation once instead of once per row — typically 100x to 1000x faster.
Problem 6 — DISTINCT Instead of Proper JOIN or Aggregation
DISTINCT is sometimes used as a fix for accidental row multiplication from bad joins. It works but hides the real problem and performs poorly.
-- SLOW and misleading: DISTINCT used to fix a bad join
SELECT DISTINCT o.order_id, o.amount
FROM orders o
JOIN order_items oi ON o.order_id = oi.order_id;
-- order_items has multiple rows per order, causing duplicates
-- DISTINCT removes them but scans and sorts all the extra rows first
Fix: identify why the join is multiplying rows and fix the join
-- FAST: pre-aggregate order_items before joining
WITH item_totals AS (
SELECT order_id, COUNT(*) AS item_count, SUM(quantity) AS total_qty
FROM order_items
GROUP BY order_id
)
SELECT o.order_id, o.amount, it.item_count, it.total_qty
FROM orders o
JOIN item_totals it ON o.order_id = it.order_id;
Problem 7 — Partitioned Tables Not Being Pruned
In partitioned tables (very common in Databricks, BigQuery, and Redshift), queries that do not filter on the partition column scan every partition — effectively doing a full table scan.
-- SLOW: no filter on the partition column (order_date)
-- Databricks will scan ALL partitions
SELECT SUM(amount) FROM orders WHERE customer_id = 12345;
-- FAST: include the partition column in the filter
-- Databricks only reads the partitions that match
SELECT SUM(amount) FROM orders
WHERE customer_id = 12345
AND order_date >= '2024-01-01'
AND order_date < '2024-04-01';
In Databricks, you can verify partition pruning with:
EXPLAIN
SELECT SUM(amount) FROM orders WHERE order_date >= '2024-01-01';
-- Look for "PartitionFilters" in the plan output
If PartitionFilters is empty, partition pruning is not happening.
Optimization Checklist
Run through this before pushing any pipeline query to production:
| Check | Command | What to Look For |
|---|---|---|
| Check for Seq Scans | EXPLAIN ANALYZE | Seq Scan on large tables |
| Check filter selectivity | EXPLAIN ANALYZE | Rows Removed by Filter > 50% |
| Check for functions on columns | Review SQL | UPPER(), EXTRACT(), DATE() on indexed columns |
| Check for SELECT * | Review SQL | Replace with explicit column list |
| Check for correlated subqueries | Review SQL | Subquery references outer query alias |
| Check partition pruning | EXPLAIN | PartitionFilters present in plan |
| Update statistics | ANALYZE tablename | After large data loads |
Quick Reference — Slow Pattern vs Fast Pattern
| Slow | Fast | Reason |
|---|---|---|
WHERE YEAR(date) = 2024 | WHERE date >= '2024-01-01' AND date < '2025-01-01' | Function prevents index use |
| Correlated subquery | Window function or JOIN to aggregated CTE | Subquery runs once per row |
SELECT * | SELECT col1, col2, col3 | Columnar scan reads only needed columns |
NOT IN (subquery) | NOT EXISTS (subquery) | NOT IN breaks on NULLs; NOT EXISTS is safe |
DISTINCT to fix duplicates | Fix the JOIN | DISTINCT sorts all rows; fixing join prevents duplicates |
| No partition filter | Add partition column to WHERE | Partition pruning reads only relevant files |
What to Learn Next
Query optimization is the foundation of pipeline performance. The next tutorial applies these techniques to the most performance-critical operation in any warehouse pipeline — the SQL MERGE statement at scale — covering how to optimize MERGE on billion-row Delta Lake tables.