SQL Query Optimization for Data Engineers — Reading EXPLAIN ANALYZE and Fixing Slow Queries

A query that takes 30 seconds in development becomes a 45-minute pipeline blocker in production when it runs against 200 million rows. Query optimization is not optional for data engineers — it is what separates a pipeline that scales from one that needs to be rebuilt six months later.

This tutorial covers how to read execution plans, identify the bottlenecks, and apply the right fix for each one.

The Dataset

CREATE TABLE orders (
    order_id    BIGINT,
    customer_id INT,
    product_id  INT,
    amount      DECIMAL(12,2),
    status      VARCHAR(20),
    region      VARCHAR(30),
    order_date  DATE
);

CREATE TABLE customers (
    customer_id INT,
    name        VARCHAR(100),
    segment     VARCHAR(20),
    country     VARCHAR(50)
);

CREATE TABLE products (
    product_id   INT,
    product_name VARCHAR(100),
    category     VARCHAR(30),
    cost         DECIMAL(10,2)
);

-- Simulate a large orders table
-- In production this would have 100M+ rows
-- Add an index to demonstrate the difference
CREATE INDEX idx_orders_customer ON orders(customer_id);
CREATE INDEX idx_orders_date     ON orders(order_date);
CREATE INDEX idx_orders_status   ON orders(status);

Step 1 — How to Read an Execution Plan

In PostgreSQL, prefix any query with EXPLAIN ANALYZE to see the actual execution plan with real timing.

EXPLAIN ANALYZE
SELECT
    c.segment,
    SUM(o.amount) AS total_revenue
FROM orders o
JOIN customers c ON o.customer_id = c.customer_id
WHERE o.order_date >= '2024-01-01'
  AND o.status = 'completed'
GROUP BY c.segment;

Example output:

HashAggregate (cost=84321.50..84321.75 rows=25 width=48) (actual time=1823.4..1823.5 rows=4 loops=1)
  ->  Hash Join (cost=1540.00..83215.50 rows=212800 width=40) (actual time=25.3..1790.2 rows=218430 loops=1)
        Hash Cond: (o.customer_id = c.customer_id)
        ->  Index Scan using idx_orders_date on orders (cost=0.43..71450.00 rows=212800 width=20) (actual time=0.1..950.3 rows=218430 loops=1)
              Filter: (status = 'completed')
              Rows Removed by Filter: 51200
        ->  Hash (cost=890.00..890.00 rows=52000 width=28) (actual time=22.1..22.1 rows=52000 loops=1)
              ->  Seq Scan on customers (cost=0.00..890.00 rows=52000 width=28) (actual time=0.1..10.4 rows=52000 loops=1)
Planning Time: 2.3 ms
Execution Time: 1823.8 ms

How to read this:

Seq Scan — reads every row in the table. This is expensive on large tables. Look for this as your first optimization target.
Index Scan — uses an index to find rows. Fast.
Hash Join — builds a hash table from the smaller table, then probes it for each row of the larger table. Usually efficient.
cost=X..Y — estimated cost: X = startup cost, Y = total cost. Higher Y = more expensive.
actual time=X..Y — real execution time in milliseconds. This is what matters.
rows=N — actual rows processed at this step.
Rows Removed by Filter: N — how many rows passed the scan but were then discarded by a filter. High numbers here suggest a missing index.

Problem 1 — Seq Scan on a Large Table (Missing Index)

The most common performance problem. The database reads every single row to find the ones you want.

-- This query does a full table scan on orders (no index on status alone)
EXPLAIN ANALYZE
SELECT COUNT(*) FROM orders WHERE status = 'completed';

Bad plan output:

Seq Scan on orders (cost=0.00..285000.00 rows=3200000 width=8)
  Filter: (status = 'completed')
  Rows Removed by Filter: 1800000
Execution Time: 4821 ms

The database scanned all 5 million rows and discarded 1.8 million of them.

Fix: Add a composite index

-- For status + order_date (both commonly filtered together)
CREATE INDEX idx_orders_status_date ON orders(status, order_date);

After index, the plan becomes:

Index Scan using idx_orders_status_date on orders
  Index Cond: (status = 'completed')
Execution Time: 312 ms

From 4,821ms to 312ms — 15x faster.

When a composite index helps: Put the highest-selectivity column (the one that filters the most rows) first in the index. For WHERE status = 'completed' AND order_date >= '2024-01-01', both columns filter data, so the composite index works well.

Problem 2 — Inefficient JOIN Order

The database’s query optimizer usually figures out the best join order, but sometimes you need to help it — especially when statistics are stale.

-- SLOW: joining large table to large table first, then filtering
SELECT o.order_id, c.name, p.product_name
FROM orders o
JOIN products p  ON o.product_id = p.product_id
JOIN customers c ON o.customer_id = c.customer_id
WHERE c.country = 'UAE'   -- this filter reduces customers dramatically
  AND p.category = 'Hardware';

Fix: filter before joining using a CTE or subquery

-- FAST: filter both dimension tables first, then join to the large fact table
WITH uae_customers AS (
    SELECT customer_id, name FROM customers WHERE country = 'UAE'
),
hardware_products AS (
    SELECT product_id, product_name FROM products WHERE category = 'Hardware'
)
SELECT o.order_id, c.name, p.product_name
FROM orders o
JOIN uae_customers    c ON o.customer_id = c.customer_id
JOIN hardware_products p ON o.product_id  = p.product_id;

By filtering customers and products first, the JOIN to orders only needs to match a small subset of IDs instead of scanning the full dimension tables on every row.

Problem 3 — Using a Function on an Indexed Column Kills the Index

Wrapping an indexed column in a function prevents the database from using the index.

-- SLOW: EXTRACT() wraps order_date, index on order_date cannot be used
SELECT COUNT(*)
FROM orders
WHERE EXTRACT(YEAR FROM order_date) = 2024;

EXPLAIN output:

Seq Scan on orders
  Filter: (EXTRACT(year FROM order_date) = 2024)
Execution Time: 5103 ms

-- FAST: use a range condition directly on the indexed column
SELECT COUNT(*)
FROM orders
WHERE order_date >= '2024-01-01'
  AND order_date <  '2025-01-01';

EXPLAIN output:

Index Scan using idx_orders_date on orders
  Index Cond: (order_date >= '2024-01-01' AND order_date < '2025-01-01')
Execution Time: 287 ms

Same rule applies to any function: UPPER(email) = 'X' — slow. email = LOWER('X') — fast (function applied to the constant, not the column).

Problem 4 — SELECT * Fetches Unnecessary Columns

In columnar storage databases (Snowflake, BigQuery, Redshift, Databricks Delta), SELECT * reads every column from storage even if you only need two or three. At scale this wastes significant I/O and compute.

-- SLOW in columnar storage: reads all columns from disk
SELECT *
FROM orders
WHERE order_date >= '2024-01-01';

-- FAST: reads only the columns you need
SELECT order_id, customer_id, amount, order_date
FROM orders
WHERE order_date >= '2024-01-01';

In Snowflake and BigQuery, you are billed by bytes scanned. SELECT * on a 50-column wide table that has billions of rows can cost 50x more than selecting only the 3 columns you actually use.

Rule: Never use SELECT * in production pipeline SQL. Always name your columns explicitly.

Problem 5 — Correlated Subqueries Execute Once Per Row

A correlated subquery references the outer query and re-executes for every row. On 10 million rows this means 10 million separate subquery executions.

-- SLOW: correlated subquery runs once per order row
SELECT
    order_id,
    amount,
    (SELECT AVG(amount)
     FROM orders o2
     WHERE o2.customer_id = o1.customer_id) AS customer_avg
FROM orders o1;

Fix: replace with a window function or CTE

-- FAST: window function computes the average in a single pass
SELECT
    order_id,
    amount,
    AVG(amount) OVER (PARTITION BY customer_id) AS customer_avg
FROM orders;

Or using a CTE (better when you need to join rather than just reference):

WITH customer_avgs AS (
    SELECT customer_id, AVG(amount) AS avg_amount
    FROM orders
    GROUP BY customer_id
)
SELECT o.order_id, o.amount, ca.avg_amount AS customer_avg
FROM orders o
JOIN customer_avgs ca ON o.customer_id = ca.customer_id;

Both alternatives run the aggregation once instead of once per row — typically 100x to 1000x faster.

Problem 6 — DISTINCT Instead of Proper JOIN or Aggregation

DISTINCT is sometimes used as a fix for accidental row multiplication from bad joins. It works but hides the real problem and performs poorly.

-- SLOW and misleading: DISTINCT used to fix a bad join
SELECT DISTINCT o.order_id, o.amount
FROM orders o
JOIN order_items oi ON o.order_id = oi.order_id;
-- order_items has multiple rows per order, causing duplicates
-- DISTINCT removes them but scans and sorts all the extra rows first

Fix: identify why the join is multiplying rows and fix the join

-- FAST: pre-aggregate order_items before joining
WITH item_totals AS (
    SELECT order_id, COUNT(*) AS item_count, SUM(quantity) AS total_qty
    FROM order_items
    GROUP BY order_id
)
SELECT o.order_id, o.amount, it.item_count, it.total_qty
FROM orders o
JOIN item_totals it ON o.order_id = it.order_id;

Problem 7 — Partitioned Tables Not Being Pruned

In partitioned tables (very common in Databricks, BigQuery, and Redshift), queries that do not filter on the partition column scan every partition — effectively doing a full table scan.

-- SLOW: no filter on the partition column (order_date)
-- Databricks will scan ALL partitions
SELECT SUM(amount) FROM orders WHERE customer_id = 12345;

-- FAST: include the partition column in the filter
-- Databricks only reads the partitions that match
SELECT SUM(amount) FROM orders
WHERE customer_id = 12345
  AND order_date >= '2024-01-01'
  AND order_date <  '2024-04-01';

In Databricks, you can verify partition pruning with:

EXPLAIN
SELECT SUM(amount) FROM orders WHERE order_date >= '2024-01-01';
-- Look for "PartitionFilters" in the plan output

If PartitionFilters is empty, partition pruning is not happening.

Optimization Checklist

Run through this before pushing any pipeline query to production:

Check	Command	What to Look For
Check for Seq Scans	EXPLAIN ANALYZE	Seq Scan on large tables
Check filter selectivity	EXPLAIN ANALYZE	Rows Removed by Filter > 50%
Check for functions on columns	Review SQL	UPPER(), EXTRACT(), DATE() on indexed columns
Check for SELECT *	Review SQL	Replace with explicit column list
Check for correlated subqueries	Review SQL	Subquery references outer query alias
Check partition pruning	EXPLAIN	PartitionFilters present in plan
Update statistics	ANALYZE tablename	After large data loads

Quick Reference — Slow Pattern vs Fast Pattern

Slow	Fast	Reason
`WHERE YEAR(date) = 2024`	`WHERE date >= '2024-01-01' AND date < '2025-01-01'`	Function prevents index use
Correlated subquery	Window function or JOIN to aggregated CTE	Subquery runs once per row
`SELECT *`	`SELECT col1, col2, col3`	Columnar scan reads only needed columns
`NOT IN (subquery)`	`NOT EXISTS (subquery)`	NOT IN breaks on NULLs; NOT EXISTS is safe
`DISTINCT` to fix duplicates	Fix the JOIN	DISTINCT sorts all rows; fixing join prevents duplicates
No partition filter	Add partition column to WHERE	Partition pruning reads only relevant files

What to Learn Next

Query optimization is the foundation of pipeline performance. The next tutorial applies these techniques to the most performance-critical operation in any warehouse pipeline — the SQL MERGE statement at scale — covering how to optimize MERGE on billion-row Delta Lake tables.