Advanced SQL Joins for Data Engineers — ANTI JOIN, SELF JOIN, LATERAL, and Non-Equi Joins

Most data engineers are comfortable with INNER JOIN and LEFT JOIN. But production pipelines regularly require more advanced join patterns — finding records that have no match, comparing rows within the same table, or joining on a range condition rather than an equality. Getting these wrong produces silent data errors or queries that never finish on large tables.

This tutorial covers the six join patterns that separate intermediate SQL users from data engineers who can handle any pipeline requirement.

The Dataset

CREATE TABLE employees (
    emp_id      INT,
    name        VARCHAR(50),
    dept_id     INT,
    salary      DECIMAL(10,2),
    manager_id  INT,
    hire_date   DATE
);

CREATE TABLE departments (
    dept_id     INT,
    dept_name   VARCHAR(50),
    budget      DECIMAL(12,2)
);

CREATE TABLE salary_bands (
    band_name   VARCHAR(20),
    min_salary  DECIMAL(10,2),
    max_salary  DECIMAL(10,2)
);

INSERT INTO departments VALUES
(10, 'Engineering', 500000),
(20, 'Sales',       300000),
(30, 'Finance',     200000),
(40, 'HR',           80000);  -- no employees

INSERT INTO employees VALUES
(1,  'Alice',   10, 120000, NULL, '2020-01-15'),
(2,  'Bob',     10,  95000,    1, '2021-03-10'),
(3,  'Carol',   20,  85000, NULL, '2019-06-01'),
(4,  'David',   20,  72000,    3, '2022-09-20'),
(5,  'Eve',     10, 110000,    1, '2020-11-01'),
(6,  'Frank',   30,  90000, NULL, '2018-04-15'),
(7,  'Grace',   10,  88000,    1, '2023-02-01'),
-- emp 8 has a dept_id that doesn't exist (orphaned record)
(8,  'Henry',   99,  78000,    1, '2023-05-01');

INSERT INTO salary_bands VALUES
('Junior',    0,      80000),
('Mid',   80001,     110000),
('Senior',110001,    150000),
('Staff', 150001,    999999);

Pattern 1 — ANTI JOIN (Find Records With No Match)

An anti join returns rows from the left table that have no matching row in the right table. This is the correct way to find “things that are missing from another table” — a core data pipeline operation.

Find departments with no employees:

-- Method 1: LEFT JOIN + IS NULL (most widely supported)
SELECT d.dept_id, d.dept_name
FROM departments d
LEFT JOIN employees e ON d.dept_id = e.dept_id
WHERE e.emp_id IS NULL;

Output:

dept_id	dept_name
40	HR

-- Method 2: NOT EXISTS (cleaner intent, same performance)
SELECT dept_id, dept_name
FROM departments d
WHERE NOT EXISTS (
    SELECT 1 FROM employees e WHERE e.dept_id = d.dept_id
);

-- Method 3: EXCEPT (set-based approach)
SELECT dept_id FROM departments
EXCEPT
SELECT dept_id FROM employees;

All three return the same result. LEFT JOIN + IS NULL is most universal. NOT EXISTS is safest when the right side might have NULLs in the join key (see Tutorial 9). EXCEPT is cleanest for simple key comparisons.

Pipeline use case: Finding orphaned foreign keys, detecting rows that failed to load to the target table, finding dimension records with no corresponding fact data.

Pattern 2 — SELF JOIN (Compare Rows Within the Same Table)

A self join joins a table to itself. This is necessary when the relationship you need is between rows in the same table — such as employees and their managers, or orders and previous orders.

Find each employee and their manager’s name:

SELECT
    e.emp_id,
    e.name          AS employee_name,
    e.salary        AS employee_salary,
    m.name          AS manager_name,
    m.salary        AS manager_salary,
    m.salary - e.salary AS salary_gap
FROM employees e
LEFT JOIN employees m ON e.manager_id = m.emp_id
ORDER BY e.dept_id, e.emp_id;

Output:

emp_id	employee_name	employee_salary	manager_name	manager_salary	salary_gap
1	Alice	120000	NULL	NULL	NULL
2	Bob	95000	Alice	120000	25000
5	Eve	110000	Alice	120000	10000
7	Grace	88000	Alice	120000	32000
3	Carol	85000	NULL	NULL	NULL
4	David	72000	Carol	85000	13000
6	Frank	90000	NULL	NULL	NULL
8	Henry	78000	Alice	120000	42000

Alice and Carol have no manager (LEFT JOIN keeps them with NULLs). Henry reports to Alice even though Henry is in a non-existent department.

Find all pairs of employees in the same department:

SELECT
    e1.name AS employee_1,
    e2.name AS employee_2,
    d.dept_name
FROM employees e1
JOIN employees e2
    ON e1.dept_id = e2.dept_id
    AND e1.emp_id < e2.emp_id    -- < prevents duplicate pairs and self-matches
JOIN departments d ON e1.dept_id = d.dept_id
ORDER BY d.dept_name, e1.name;

Output:

employee_1	employee_2	dept_name
Alice	Bob	Engineering
Alice	Eve	Engineering
Alice	Grace	Engineering
Bob	Eve	Engineering
Bob	Grace	Engineering
Eve	Grace	Engineering
Carol	David	Sales

e1.emp_id < e2.emp_id is the key — it prevents (Alice, Bob) and (Bob, Alice) both appearing, and prevents (Alice, Alice) self-pairs.

Pattern 3 — Non-Equi Join (Join on a Range Condition)

Standard joins use = as the join condition. Non-equi joins use <, >, BETWEEN, or other comparisons. This is essential for range-based lookups like salary bands, date validity windows, and tiered pricing.

Assign every employee to their salary band:

SELECT
    e.name,
    e.salary,
    sb.band_name
FROM employees e
JOIN salary_bands sb
    ON e.salary BETWEEN sb.min_salary AND sb.max_salary;

Output:

name	salary	band_name
Alice	120000	Senior
Bob	95000	Mid
Carol	85000	Mid
David	72000	Junior
Eve	110000	Mid
Frank	90000	Mid
Grace	88000	Mid
Henry	78000	Junior

No subquery needed — the range join assigns each salary directly to its band. This is the same pattern used in SCD Type 2 joins (joining on a date validity range) and incremental load watermark joins.

Find employees earning more than their department average:

SELECT
    e.name,
    e.salary,
    dept_avg.avg_salary,
    e.salary - dept_avg.avg_salary AS above_average_by
FROM employees e
JOIN (
    SELECT dept_id, AVG(salary) AS avg_salary
    FROM employees
    GROUP BY dept_id
) dept_avg ON e.dept_id = dept_avg.dept_id
                AND e.salary > dept_avg.avg_salary;   -- non-equi condition

Output:

name	salary	avg_salary	above_average_by
Alice	120000	103250	16750
Eve	110000	103250	6750
Carol	85000	78500	6500
Frank	90000	90000	0

Pattern 4 — LATERAL JOIN (Correlated Subquery as a Join)

LATERAL (PostgreSQL, Databricks) or CROSS APPLY (SQL Server) allows you to reference columns from a previous table in a subquery on the right side of a join. This is more powerful than a correlated subquery because the lateral subquery can return multiple rows per left-side row.

Get the top 2 highest-paid employees per department:

SELECT d.dept_name, top_emps.name, top_emps.salary, top_emps.rank
FROM departments d
JOIN LATERAL (
    SELECT e.name, e.salary,
           ROW_NUMBER() OVER (ORDER BY e.salary DESC) AS rank
    FROM employees e
    WHERE e.dept_id = d.dept_id   -- reference outer table here
    ORDER BY e.salary DESC
    LIMIT 2
) top_emps ON TRUE
WHERE d.dept_id != 40   -- exclude HR (no employees)
ORDER BY d.dept_name, top_emps.rank;

Output:

dept_name	name	salary	rank
Engineering	Alice	120000	1
Engineering	Eve	110000	2
Finance	Frank	90000	1
Sales	Carol	85000	1
Sales	David	72000	2

ON TRUE means “always join” — the LATERAL produces the rows, you do not need an additional matching condition.

Pipeline use case: Getting the N most recent events per user, the N largest orders per customer, or the closest matching reference record by date — without a window function followed by a filter.

Pattern 5 — CROSS JOIN (Cartesian Product for Data Generation)

A CROSS JOIN returns every combination of rows from both tables. Normally this is a mistake (a million-row table CROSS JOINed with another million-row table = 1 trillion rows). But intentional CROSS JOINs are essential for generating date spines, test data, and grid tables.

Generate a complete employee × department grid (for presence/absence analysis):

SELECT
    e.name        AS employee,
    d.dept_name   AS department,
    CASE WHEN e.dept_id = d.dept_id THEN 'YES' ELSE 'no' END AS assigned
FROM employees e
CROSS JOIN departments d
WHERE e.emp_id <= 4   -- limit output for readability
ORDER BY e.name, d.dept_name;

Output:

employee	department	assigned
Alice	Engineering	YES
Alice	Finance	no
Alice	HR	no
Alice	Sales	no
Bob	Engineering	YES
Bob	Finance	no
…	…	…

Pipeline use case — date spine generation:

-- Generate one row per day for the last 30 days
SELECT
    CURRENT_DATE - (n.n || ' days')::INTERVAL AS date
FROM GENERATE_SERIES(0, 29) n(n);

-- Or with CROSS JOIN to create a date × category grid
SELECT
    dates.d AS metric_date,
    categories.category
FROM
    (SELECT CURRENT_DATE - (n || ' days')::INTERVAL::DATE AS d FROM GENERATE_SERIES(0, 29) n) dates
    CROSS JOIN
    (SELECT DISTINCT category FROM products) categories
ORDER BY metric_date, category;

Pattern 6 — FULL OUTER JOIN (Keep All Rows From Both Tables)

A FULL OUTER JOIN returns all rows from both the left and right tables, with NULLs where there is no match on either side. This is less common than LEFT JOIN but critical for comparing two datasets where either side can have records the other does not.

Compare revenue in two different periods:

WITH q1_revenue AS (
    SELECT dept_id, SUM(salary) AS q1_total FROM employees GROUP BY dept_id
),
q2_revenue AS (
    SELECT dept_id, SUM(salary) * 1.1 AS q2_total FROM employees GROUP BY dept_id
)
SELECT
    COALESCE(q1.dept_id, q2.dept_id)    AS dept_id,
    q1.q1_total,
    q2.q2_total,
    COALESCE(q2.q2_total, 0) - COALESCE(q1.q1_total, 0) AS change,
    CASE
        WHEN q1.dept_id IS NULL THEN 'New in Q2'
        WHEN q2.dept_id IS NULL THEN 'Dropped in Q2'
        ELSE 'Existing'
    END AS status
FROM q1_revenue q1
FULL OUTER JOIN q2_revenue q2 ON q1.dept_id = q2.dept_id;

Pipeline use case: Source-to-target reconciliation where you need to see both missing rows (in target but not source) and phantom rows (in source but not target) in a single query — exactly the pattern from Tutorial 11.

Common Mistakes

Mistake 1 — Using LEFT JOIN + IS NULL instead of NOT EXISTS when the join key can be NULL

If e.dept_id is NULL, LEFT JOIN ... WHERE e.emp_id IS NULL still works — but only because the join fails for NULL dept_id in a different way. Use NOT EXISTS when your join key might contain NULLs to avoid this ambiguity.

Mistake 2 — CROSS JOIN on large tables without a WHERE clause

A CROSS JOIN without WHERE or a limiting condition is always intentional. If you see one in existing code without a comment explaining why, it is almost certainly a bug.

Mistake 3 — Forgetting e1.emp_id < e2.emp_id in SELF JOINs

Without this condition, self joins produce symmetric duplicates (Alice-Bob and Bob-Alice) plus self-pairs (Alice-Alice). Always add a strict inequality condition unless you specifically need both directions.

Quick Reference

Join Type	Returns	Use Case
INNER JOIN	Rows matching in both tables	Standard dimension lookup
LEFT JOIN	All left rows, matched right rows	Optional dimension, keep orphans
ANTI JOIN	Left rows with NO match in right	Find gaps, orphaned records
SELF JOIN	Rows joined to other rows in same table	Hierarchy traversal, pair comparison
Non-Equi JOIN	Rows matching on range condition	Salary bands, SCD Type 2, date ranges
LATERAL JOIN	One correlated subquery result per left row	Top-N per group, nearest match
CROSS JOIN	Every combination of rows	Date spines, grid generation
FULL OUTER JOIN	All rows from both tables	Dataset comparison, reconciliation

What to Learn Next

Advanced joins handle the structural relationships in your data. The next tutorial covers SQL for handling JSON and semi-structured data — parsing nested fields, exploding arrays, and flattening nested objects in Databricks, Snowflake, and PostgreSQL.

The Dataset

Pattern 1 — ANTI JOIN (Find Records With No Match)

Pattern 2 — SELF JOIN (Compare Rows Within the Same Table)

Pattern 3 — Non-Equi Join (Join on a Range Condition)

Pattern 4 — LATERAL JOIN (Correlated Subquery as a Join)

Pattern 5 — CROSS JOIN (Cartesian Product for Data Generation)

Pattern 6 — FULL OUTER JOIN (Keep All Rows From Both Tables)

Common Mistakes

Quick Reference

What to Learn Next

Leave a Comment Cancel reply