Advanced SQL Joins for Data Engineers — ANTI JOIN, SELF JOIN, LATERAL, and Non-Equi Joins

Most data engineers are comfortable with INNER JOIN and LEFT JOIN. But production pipelines regularly require more advanced join patterns — finding records that have no match, comparing rows within the same table, or joining on a range condition rather than an equality. Getting these wrong produces silent data errors or queries that never finish on large tables.

This tutorial covers the six join patterns that separate intermediate SQL users from data engineers who can handle any pipeline requirement.


The Dataset

CREATE TABLE employees (
    emp_id      INT,
    name        VARCHAR(50),
    dept_id     INT,
    salary      DECIMAL(10,2),
    manager_id  INT,
    hire_date   DATE
);

CREATE TABLE departments (
    dept_id     INT,
    dept_name   VARCHAR(50),
    budget      DECIMAL(12,2)
);

CREATE TABLE salary_bands (
    band_name   VARCHAR(20),
    min_salary  DECIMAL(10,2),
    max_salary  DECIMAL(10,2)
);

INSERT INTO departments VALUES
(10, 'Engineering', 500000),
(20, 'Sales',       300000),
(30, 'Finance',     200000),
(40, 'HR',           80000);  -- no employees

INSERT INTO employees VALUES
(1,  'Alice',   10, 120000, NULL, '2020-01-15'),
(2,  'Bob',     10,  95000,    1, '2021-03-10'),
(3,  'Carol',   20,  85000, NULL, '2019-06-01'),
(4,  'David',   20,  72000,    3, '2022-09-20'),
(5,  'Eve',     10, 110000,    1, '2020-11-01'),
(6,  'Frank',   30,  90000, NULL, '2018-04-15'),
(7,  'Grace',   10,  88000,    1, '2023-02-01'),
-- emp 8 has a dept_id that doesn't exist (orphaned record)
(8,  'Henry',   99,  78000,    1, '2023-05-01');

INSERT INTO salary_bands VALUES
('Junior',    0,      80000),
('Mid',   80001,     110000),
('Senior',110001,    150000),
('Staff', 150001,    999999);

Pattern 1 — ANTI JOIN (Find Records With No Match)

An anti join returns rows from the left table that have no matching row in the right table. This is the correct way to find “things that are missing from another table” — a core data pipeline operation.

Find departments with no employees:

-- Method 1: LEFT JOIN + IS NULL (most widely supported)
SELECT d.dept_id, d.dept_name
FROM departments d
LEFT JOIN employees e ON d.dept_id = e.dept_id
WHERE e.emp_id IS NULL;

Output:

dept_iddept_name
40HR
-- Method 2: NOT EXISTS (cleaner intent, same performance)
SELECT dept_id, dept_name
FROM departments d
WHERE NOT EXISTS (
    SELECT 1 FROM employees e WHERE e.dept_id = d.dept_id
);
-- Method 3: EXCEPT (set-based approach)
SELECT dept_id FROM departments
EXCEPT
SELECT dept_id FROM employees;

All three return the same result. LEFT JOIN + IS NULL is most universal. NOT EXISTS is safest when the right side might have NULLs in the join key (see Tutorial 9). EXCEPT is cleanest for simple key comparisons.

Pipeline use case: Finding orphaned foreign keys, detecting rows that failed to load to the target table, finding dimension records with no corresponding fact data.


Pattern 2 — SELF JOIN (Compare Rows Within the Same Table)

A self join joins a table to itself. This is necessary when the relationship you need is between rows in the same table — such as employees and their managers, or orders and previous orders.

Find each employee and their manager’s name:

SELECT
    e.emp_id,
    e.name          AS employee_name,
    e.salary        AS employee_salary,
    m.name          AS manager_name,
    m.salary        AS manager_salary,
    m.salary - e.salary AS salary_gap
FROM employees e
LEFT JOIN employees m ON e.manager_id = m.emp_id
ORDER BY e.dept_id, e.emp_id;

Output:

emp_idemployee_nameemployee_salarymanager_namemanager_salarysalary_gap
1Alice120000NULLNULLNULL
2Bob95000Alice12000025000
5Eve110000Alice12000010000
7Grace88000Alice12000032000
3Carol85000NULLNULLNULL
4David72000Carol8500013000
6Frank90000NULLNULLNULL
8Henry78000Alice12000042000

Alice and Carol have no manager (LEFT JOIN keeps them with NULLs). Henry reports to Alice even though Henry is in a non-existent department.

Find all pairs of employees in the same department:

SELECT
    e1.name AS employee_1,
    e2.name AS employee_2,
    d.dept_name
FROM employees e1
JOIN employees e2
    ON e1.dept_id = e2.dept_id
    AND e1.emp_id < e2.emp_id    -- < prevents duplicate pairs and self-matches
JOIN departments d ON e1.dept_id = d.dept_id
ORDER BY d.dept_name, e1.name;

Output:

employee_1employee_2dept_name
AliceBobEngineering
AliceEveEngineering
AliceGraceEngineering
BobEveEngineering
BobGraceEngineering
EveGraceEngineering
CarolDavidSales

e1.emp_id < e2.emp_id is the key — it prevents (Alice, Bob) and (Bob, Alice) both appearing, and prevents (Alice, Alice) self-pairs.


Pattern 3 — Non-Equi Join (Join on a Range Condition)

Standard joins use = as the join condition. Non-equi joins use <, >, BETWEEN, or other comparisons. This is essential for range-based lookups like salary bands, date validity windows, and tiered pricing.

Assign every employee to their salary band:

SELECT
    e.name,
    e.salary,
    sb.band_name
FROM employees e
JOIN salary_bands sb
    ON e.salary BETWEEN sb.min_salary AND sb.max_salary;

Output:

namesalaryband_name
Alice120000Senior
Bob95000Mid
Carol85000Mid
David72000Junior
Eve110000Mid
Frank90000Mid
Grace88000Mid
Henry78000Junior

No subquery needed — the range join assigns each salary directly to its band. This is the same pattern used in SCD Type 2 joins (joining on a date validity range) and incremental load watermark joins.

Find employees earning more than their department average:

SELECT
    e.name,
    e.salary,
    dept_avg.avg_salary,
    e.salary - dept_avg.avg_salary AS above_average_by
FROM employees e
JOIN (
    SELECT dept_id, AVG(salary) AS avg_salary
    FROM employees
    GROUP BY dept_id
) dept_avg ON e.dept_id = dept_avg.dept_id
                AND e.salary > dept_avg.avg_salary;   -- non-equi condition

Output:

namesalaryavg_salaryabove_average_by
Alice12000010325016750
Eve1100001032506750
Carol85000785006500
Frank90000900000

Pattern 4 — LATERAL JOIN (Correlated Subquery as a Join)

LATERAL (PostgreSQL, Databricks) or CROSS APPLY (SQL Server) allows you to reference columns from a previous table in a subquery on the right side of a join. This is more powerful than a correlated subquery because the lateral subquery can return multiple rows per left-side row.

Get the top 2 highest-paid employees per department:

SELECT d.dept_name, top_emps.name, top_emps.salary, top_emps.rank
FROM departments d
JOIN LATERAL (
    SELECT e.name, e.salary,
           ROW_NUMBER() OVER (ORDER BY e.salary DESC) AS rank
    FROM employees e
    WHERE e.dept_id = d.dept_id   -- reference outer table here
    ORDER BY e.salary DESC
    LIMIT 2
) top_emps ON TRUE
WHERE d.dept_id != 40   -- exclude HR (no employees)
ORDER BY d.dept_name, top_emps.rank;

Output:

dept_namenamesalaryrank
EngineeringAlice1200001
EngineeringEve1100002
FinanceFrank900001
SalesCarol850001
SalesDavid720002

ON TRUE means “always join” — the LATERAL produces the rows, you do not need an additional matching condition.

Pipeline use case: Getting the N most recent events per user, the N largest orders per customer, or the closest matching reference record by date — without a window function followed by a filter.


Pattern 5 — CROSS JOIN (Cartesian Product for Data Generation)

A CROSS JOIN returns every combination of rows from both tables. Normally this is a mistake (a million-row table CROSS JOINed with another million-row table = 1 trillion rows). But intentional CROSS JOINs are essential for generating date spines, test data, and grid tables.

Generate a complete employee × department grid (for presence/absence analysis):

SELECT
    e.name        AS employee,
    d.dept_name   AS department,
    CASE WHEN e.dept_id = d.dept_id THEN 'YES' ELSE 'no' END AS assigned
FROM employees e
CROSS JOIN departments d
WHERE e.emp_id <= 4   -- limit output for readability
ORDER BY e.name, d.dept_name;

Output:

employeedepartmentassigned
AliceEngineeringYES
AliceFinanceno
AliceHRno
AliceSalesno
BobEngineeringYES
BobFinanceno

Pipeline use case — date spine generation:

-- Generate one row per day for the last 30 days
SELECT
    CURRENT_DATE - (n.n || ' days')::INTERVAL AS date
FROM GENERATE_SERIES(0, 29) n(n);

-- Or with CROSS JOIN to create a date × category grid
SELECT
    dates.d AS metric_date,
    categories.category
FROM
    (SELECT CURRENT_DATE - (n || ' days')::INTERVAL::DATE AS d FROM GENERATE_SERIES(0, 29) n) dates
    CROSS JOIN
    (SELECT DISTINCT category FROM products) categories
ORDER BY metric_date, category;

Pattern 6 — FULL OUTER JOIN (Keep All Rows From Both Tables)

A FULL OUTER JOIN returns all rows from both the left and right tables, with NULLs where there is no match on either side. This is less common than LEFT JOIN but critical for comparing two datasets where either side can have records the other does not.

Compare revenue in two different periods:

WITH q1_revenue AS (
    SELECT dept_id, SUM(salary) AS q1_total FROM employees GROUP BY dept_id
),
q2_revenue AS (
    SELECT dept_id, SUM(salary) * 1.1 AS q2_total FROM employees GROUP BY dept_id
)
SELECT
    COALESCE(q1.dept_id, q2.dept_id)    AS dept_id,
    q1.q1_total,
    q2.q2_total,
    COALESCE(q2.q2_total, 0) - COALESCE(q1.q1_total, 0) AS change,
    CASE
        WHEN q1.dept_id IS NULL THEN 'New in Q2'
        WHEN q2.dept_id IS NULL THEN 'Dropped in Q2'
        ELSE 'Existing'
    END AS status
FROM q1_revenue q1
FULL OUTER JOIN q2_revenue q2 ON q1.dept_id = q2.dept_id;

Pipeline use case: Source-to-target reconciliation where you need to see both missing rows (in target but not source) and phantom rows (in source but not target) in a single query — exactly the pattern from Tutorial 11.


Common Mistakes

Mistake 1 — Using LEFT JOIN + IS NULL instead of NOT EXISTS when the join key can be NULL

If e.dept_id is NULL, LEFT JOIN ... WHERE e.emp_id IS NULL still works — but only because the join fails for NULL dept_id in a different way. Use NOT EXISTS when your join key might contain NULLs to avoid this ambiguity.

Mistake 2 — CROSS JOIN on large tables without a WHERE clause

A CROSS JOIN without WHERE or a limiting condition is always intentional. If you see one in existing code without a comment explaining why, it is almost certainly a bug.

Mistake 3 — Forgetting e1.emp_id < e2.emp_id in SELF JOINs

Without this condition, self joins produce symmetric duplicates (Alice-Bob and Bob-Alice) plus self-pairs (Alice-Alice). Always add a strict inequality condition unless you specifically need both directions.


Quick Reference

Join TypeReturnsUse Case
INNER JOINRows matching in both tablesStandard dimension lookup
LEFT JOINAll left rows, matched right rowsOptional dimension, keep orphans
ANTI JOINLeft rows with NO match in rightFind gaps, orphaned records
SELF JOINRows joined to other rows in same tableHierarchy traversal, pair comparison
Non-Equi JOINRows matching on range conditionSalary bands, SCD Type 2, date ranges
LATERAL JOINOne correlated subquery result per left rowTop-N per group, nearest match
CROSS JOINEvery combination of rowsDate spines, grid generation
FULL OUTER JOINAll rows from both tablesDataset comparison, reconciliation

What to Learn Next

Advanced joins handle the structural relationships in your data. The next tutorial covers SQL for handling JSON and semi-structured data — parsing nested fields, exploding arrays, and flattening nested objects in Databricks, Snowflake, and PostgreSQL.

Leave a Comment