SQL Indexing Deep Dive for Data Engineers — When to Add Them, When to Avoid Them

Indexes are the most impactful single change you can make to a slow SQL query. A query that takes 45 seconds on an unindexed column can drop to under 100ms with the right index. But indexes are not free — they slow down writes, consume storage, and can actually hurt performance when added incorrectly.

This tutorial covers the types of indexes that matter most in data engineering, how to verify they are being used, and when not to add them.

The Dataset

CREATE TABLE events (
    event_id    BIGINT,
    user_id     INT,
    session_id  VARCHAR(36),
    event_type  VARCHAR(30),
    page        VARCHAR(100),
    country     VARCHAR(50),
    device      VARCHAR(20),
    amount      DECIMAL(10,2),
    event_ts    TIMESTAMP,
    event_date  DATE
);

-- 10 million rows simulated
-- For demo: small sample that illustrates index behavior
INSERT INTO events VALUES
(1,  101, 'sess-aaa', 'purchase', '/checkout',  'UAE', 'mobile', 250.00, '2024-03-01 09:00:00', '2024-03-01'),
(2,  102, 'sess-bbb', 'view',     '/product/1', 'UK',  'desktop', NULL,  '2024-03-01 09:05:00', '2024-03-01'),
(3,  101, 'sess-aaa', 'view',     '/product/2', 'UAE', 'mobile',  NULL,  '2024-03-01 09:10:00', '2024-03-01'),
(4,  103, 'sess-ccc', 'purchase', '/checkout',  'USA', 'desktop', 800.00,'2024-03-02 10:00:00', '2024-03-02'),
(5,  102, 'sess-ddd', 'purchase', '/checkout',  'UK',  'desktop', 120.00,'2024-03-02 11:00:00', '2024-03-02');

How Indexes Work — The Core Concept

Without an index, finding rows where user_id = 101 requires scanning every row in the table from top to bottom. This is a Sequential Scan (Seq Scan).

An index on user_id builds a sorted B-tree structure separately from the table that maps each user_id value to the physical row location. Finding user_id = 101 now means: look up 101 in the B-tree (fast, like binary search), get the row locations, jump directly to those rows in the table.

For a table with 10 million rows, a Seq Scan reads ~10 million rows. An Index Scan reads ~log2(10,000,000) ≈ 23 comparisons to find the entry, then jumps to the exact rows.

Index Type 1 — Single-Column B-tree Index

The default index type. Works for equality (=), range (>, <, BETWEEN), and sorting (ORDER BY).

-- Create a single-column index on user_id
CREATE INDEX idx_events_user_id ON events(user_id);

-- Verify it is used by the query planner
EXPLAIN ANALYZE
SELECT * FROM events WHERE user_id = 101;

Without index:

Seq Scan on events  (cost=0.00..285000.00 rows=3 width=120)
  Filter: (user_id = 101)
  Rows Removed by Filter: 9999997
Execution Time: 4820 ms

With index:

Index Scan using idx_events_user_id on events  (cost=0.43..12.50 rows=3 width=120)
  Index Cond: (user_id = 101)
Execution Time: 0.08 ms

60,000x faster.

Index Type 2 — Composite Index

A composite index covers multiple columns. The column order matters — the index is most useful when your query filters on the leftmost columns first.

-- Common pattern: queries filter on event_date AND event_type together
CREATE INDEX idx_events_date_type ON events(event_date, event_type);

This index helps these queries:

-- Uses the index: filters on both columns
SELECT * FROM events WHERE event_date = '2024-03-01' AND event_type = 'purchase';

-- Uses the index: filters on leftmost column only
SELECT * FROM events WHERE event_date = '2024-03-01';

This index does NOT help this query:

-- Does NOT use the index: skips the leftmost column
SELECT * FROM events WHERE event_type = 'purchase';

The leading column rule: A composite index on (A, B, C) can be used by queries filtering on A, A+B, or A+B+C. It cannot be used by queries filtering on B alone, C alone, or B+C. Always put the most-filtered column first.

Choosing column order in practice:

Put the column with the highest selectivity (most distinct values) first unless you always filter on a specific column first. For a pipeline that always filters by event_date first and then optionally by event_type, put event_date first regardless of selectivity.

Index Type 3 — Partial Index

A partial index only covers a subset of rows — defined by a WHERE condition. This makes the index much smaller and faster when your queries consistently filter on a specific subset.

-- Most queries on this table only care about 'purchase' events
-- A partial index only indexes purchase rows
CREATE INDEX idx_events_purchase_user ON events(user_id, event_date)
WHERE event_type = 'purchase';

Query that benefits:

EXPLAIN ANALYZE
SELECT user_id, amount, event_date
FROM events
WHERE event_type = 'purchase'
  AND user_id = 101;

The partial index is smaller than a full index (only purchase rows are indexed, not view/click/etc.), so it fits better in memory and is faster to scan.

When to use partial indexes:

Soft-delete tables: WHERE is_deleted = FALSE (only index active rows)
Status-filtered queries: WHERE status = 'active'
Recent data: WHERE event_date > '2024-01-01'

Index Type 4 — Covering Index (INCLUDE columns)

A covering index includes all the columns a query needs, so the database never has to look up the actual table row. The entire query is served from the index alone.

-- Query: filter on user_id, return amount and event_date
-- Without INCLUDE: index finds the row location, then goes to the table for amount and event_date
-- With INCLUDE: index already has amount and event_date — no table lookup needed

CREATE INDEX idx_events_user_covering ON events(user_id)
INCLUDE (amount, event_date);

EXPLAIN output with covering index:

Index Only Scan using idx_events_user_covering on events
  Index Cond: (user_id = 101)
  Heap Fetches: 0   -- 0 means the table was never touched
Execution Time: 0.04 ms

Index Only Scan with Heap Fetches: 0 means the query was served entirely from the index — maximum performance.

Note: INCLUDE is available in PostgreSQL 11+, SQL Server, and Databricks. Snowflake and BigQuery handle this differently through clustering keys.

When Indexes Do NOT Help (And Can Hurt)

Understanding when NOT to add an index is as important as knowing when to add one.

1. Low-cardinality columns

A column with only a few distinct values (like device = mobile/desktop/tablet) gives the index almost no benefit. If 40% of the table has device = 'mobile', the database will skip the index and do a Seq Scan anyway — because fetching 40% of the table via random index lookups is slower than scanning it sequentially.

-- This index is often useless — too few distinct values
CREATE INDEX idx_events_device ON events(device);  -- BAD

Rule of thumb: If a column has fewer than 20-30 distinct values and any single value covers more than 5% of the table, a B-tree index will rarely be used.

2. Tables that are frequently bulk-loaded

Every INSERT, UPDATE, and DELETE must also update every index on the table. In a pipeline that bulk-loads millions of rows per run, indexes dramatically slow down the load.

Pattern used in production data warehouses:

-- Step 1: Drop indexes before bulk load
DROP INDEX IF EXISTS idx_events_user_id;
DROP INDEX IF EXISTS idx_events_date_type;

-- Step 2: Bulk load data (fast, no index maintenance overhead)
INSERT INTO events SELECT * FROM staging.events_incoming;

-- Step 3: Rebuild indexes after load
CREATE INDEX idx_events_user_id  ON events(user_id);
CREATE INDEX idx_events_date_type ON events(event_date, event_type);

This pattern is standard in PostgreSQL-based pipelines. Dropping and rebuilding indexes is faster than maintaining them row-by-row during a large insert.

3. Very small tables

For tables with fewer than a few thousand rows, the database will almost always prefer a Seq Scan. The overhead of an index lookup adds latency for small tables where scanning the whole thing is nearly instant.

Databricks / Delta Lake — Z-Ordering Instead of Indexes

Delta Lake does not use traditional B-tree indexes. Instead, it uses Z-Ordering and file-level statistics to skip entire Parquet files when they do not contain the data you need.

-- Z-Order by the columns you most commonly filter on
OPTIMIZE events
ZORDER BY (event_date, user_id);

After Z-Ordering, Databricks stores rows with similar event_date and user_id values in the same Parquet files. When a query filters on those columns, the Delta transaction log’s min/max statistics tell Spark which files can be skipped entirely without reading them.

-- Verify file skipping is happening
EXPLAIN EXTENDED
SELECT * FROM events WHERE event_date = '2024-03-01' AND user_id = 101;
-- Look for "PartitionFilters" and "PushedFilters" in the output

Important: Z-Order does not help with full table scans. It only helps when your query has a filter on the Z-Ordered columns. Also, Z-Order is most effective when each file contains many rows — small Delta tables with few files see little benefit.

Finding Unused and Duplicate Indexes

In PostgreSQL, you can query the system catalog to find indexes that the planner never uses — wasting storage and slowing writes with no benefit.

-- Find indexes with zero scans since last statistics reset
SELECT
    schemaname,
    tablename,
    indexname,
    idx_scan    AS times_used,
    pg_size_pretty(pg_relation_size(indexrelid)) AS index_size
FROM pg_stat_user_indexes
WHERE idx_scan = 0
ORDER BY pg_relation_size(indexrelid) DESC;

Any index with times_used = 0 is a candidate for removal — unless it was recently created or the table sees very low query volume.

Index Maintenance — ANALYZE and VACUUM

Indexes become less effective over time as data changes. Stale statistics cause the query planner to make bad decisions.

-- Update statistics so the planner knows current data distribution
ANALYZE events;

-- Reclaim space from deleted rows and update statistics
VACUUM ANALYZE events;

-- In Databricks: update Delta statistics
ANALYZE TABLE events COMPUTE STATISTICS FOR ALL COLUMNS;

Run ANALYZE after every large bulk load. Many pipelines include this as the final step after loading data.

Quick Reference — Index Decision Guide

Situation	Recommendation
Column used in WHERE/JOIN frequently, high cardinality	Add B-tree index
Multiple columns always filtered together	Add composite index (most selective column first)
Query always filters on a specific value	Add partial index with WHERE clause
Query only needs 2-3 columns	Add covering index with INCLUDE
Column has < 10 distinct values	Skip the index
Bulk loading millions of rows	Drop indexes, load, rebuild
Databricks Delta Lake	Use Z-ORDER BY instead of indexes
Table < 10,000 rows	Skip the index

What to Learn Next

Now that you understand how to make individual queries fast, the next tutorial covers how to structure the full pipeline SQL — specifically how to build reusable SQL patterns for the Medallion architecture (Bronze, Silver, Gold layers) using CTEs and views.