Most data engineers are comfortable with INNER JOIN and LEFT JOIN. But production pipelines regularly require more advanced join patterns — finding records that have no match, comparing rows within the same table, or joining on a range condition rather than an equality. Getting these wrong produces silent data errors or queries that never finish on large tables.
This tutorial covers the six join patterns that separate intermediate SQL users from data engineers who can handle any pipeline requirement.
The Dataset
CREATE TABLE employees (
emp_id INT,
name VARCHAR(50),
dept_id INT,
salary DECIMAL(10,2),
manager_id INT,
hire_date DATE
);
CREATE TABLE departments (
dept_id INT,
dept_name VARCHAR(50),
budget DECIMAL(12,2)
);
CREATE TABLE salary_bands (
band_name VARCHAR(20),
min_salary DECIMAL(10,2),
max_salary DECIMAL(10,2)
);
INSERT INTO departments VALUES
(10, 'Engineering', 500000),
(20, 'Sales', 300000),
(30, 'Finance', 200000),
(40, 'HR', 80000); -- no employees
INSERT INTO employees VALUES
(1, 'Alice', 10, 120000, NULL, '2020-01-15'),
(2, 'Bob', 10, 95000, 1, '2021-03-10'),
(3, 'Carol', 20, 85000, NULL, '2019-06-01'),
(4, 'David', 20, 72000, 3, '2022-09-20'),
(5, 'Eve', 10, 110000, 1, '2020-11-01'),
(6, 'Frank', 30, 90000, NULL, '2018-04-15'),
(7, 'Grace', 10, 88000, 1, '2023-02-01'),
-- emp 8 has a dept_id that doesn't exist (orphaned record)
(8, 'Henry', 99, 78000, 1, '2023-05-01');
INSERT INTO salary_bands VALUES
('Junior', 0, 80000),
('Mid', 80001, 110000),
('Senior',110001, 150000),
('Staff', 150001, 999999);
Pattern 1 — ANTI JOIN (Find Records With No Match)
An anti join returns rows from the left table that have no matching row in the right table. This is the correct way to find “things that are missing from another table” — a core data pipeline operation.
Find departments with no employees:
-- Method 1: LEFT JOIN + IS NULL (most widely supported)
SELECT d.dept_id, d.dept_name
FROM departments d
LEFT JOIN employees e ON d.dept_id = e.dept_id
WHERE e.emp_id IS NULL;
Output:
| dept_id | dept_name |
|---|---|
| 40 | HR |
-- Method 2: NOT EXISTS (cleaner intent, same performance)
SELECT dept_id, dept_name
FROM departments d
WHERE NOT EXISTS (
SELECT 1 FROM employees e WHERE e.dept_id = d.dept_id
);
-- Method 3: EXCEPT (set-based approach)
SELECT dept_id FROM departments
EXCEPT
SELECT dept_id FROM employees;
All three return the same result. LEFT JOIN + IS NULL is most universal. NOT EXISTS is safest when the right side might have NULLs in the join key (see Tutorial 9). EXCEPT is cleanest for simple key comparisons.
Pipeline use case: Finding orphaned foreign keys, detecting rows that failed to load to the target table, finding dimension records with no corresponding fact data.
Pattern 2 — SELF JOIN (Compare Rows Within the Same Table)
A self join joins a table to itself. This is necessary when the relationship you need is between rows in the same table — such as employees and their managers, or orders and previous orders.
Find each employee and their manager’s name:
SELECT
e.emp_id,
e.name AS employee_name,
e.salary AS employee_salary,
m.name AS manager_name,
m.salary AS manager_salary,
m.salary - e.salary AS salary_gap
FROM employees e
LEFT JOIN employees m ON e.manager_id = m.emp_id
ORDER BY e.dept_id, e.emp_id;
Output:
| emp_id | employee_name | employee_salary | manager_name | manager_salary | salary_gap |
|---|---|---|---|---|---|
| 1 | Alice | 120000 | NULL | NULL | NULL |
| 2 | Bob | 95000 | Alice | 120000 | 25000 |
| 5 | Eve | 110000 | Alice | 120000 | 10000 |
| 7 | Grace | 88000 | Alice | 120000 | 32000 |
| 3 | Carol | 85000 | NULL | NULL | NULL |
| 4 | David | 72000 | Carol | 85000 | 13000 |
| 6 | Frank | 90000 | NULL | NULL | NULL |
| 8 | Henry | 78000 | Alice | 120000 | 42000 |
Alice and Carol have no manager (LEFT JOIN keeps them with NULLs). Henry reports to Alice even though Henry is in a non-existent department.
Find all pairs of employees in the same department:
SELECT
e1.name AS employee_1,
e2.name AS employee_2,
d.dept_name
FROM employees e1
JOIN employees e2
ON e1.dept_id = e2.dept_id
AND e1.emp_id < e2.emp_id -- < prevents duplicate pairs and self-matches
JOIN departments d ON e1.dept_id = d.dept_id
ORDER BY d.dept_name, e1.name;
Output:
| employee_1 | employee_2 | dept_name |
|---|---|---|
| Alice | Bob | Engineering |
| Alice | Eve | Engineering |
| Alice | Grace | Engineering |
| Bob | Eve | Engineering |
| Bob | Grace | Engineering |
| Eve | Grace | Engineering |
| Carol | David | Sales |
e1.emp_id < e2.emp_id is the key — it prevents (Alice, Bob) and (Bob, Alice) both appearing, and prevents (Alice, Alice) self-pairs.
Pattern 3 — Non-Equi Join (Join on a Range Condition)
Standard joins use = as the join condition. Non-equi joins use <, >, BETWEEN, or other comparisons. This is essential for range-based lookups like salary bands, date validity windows, and tiered pricing.
Assign every employee to their salary band:
SELECT
e.name,
e.salary,
sb.band_name
FROM employees e
JOIN salary_bands sb
ON e.salary BETWEEN sb.min_salary AND sb.max_salary;
Output:
| name | salary | band_name |
|---|---|---|
| Alice | 120000 | Senior |
| Bob | 95000 | Mid |
| Carol | 85000 | Mid |
| David | 72000 | Junior |
| Eve | 110000 | Mid |
| Frank | 90000 | Mid |
| Grace | 88000 | Mid |
| Henry | 78000 | Junior |
No subquery needed — the range join assigns each salary directly to its band. This is the same pattern used in SCD Type 2 joins (joining on a date validity range) and incremental load watermark joins.
Find employees earning more than their department average:
SELECT
e.name,
e.salary,
dept_avg.avg_salary,
e.salary - dept_avg.avg_salary AS above_average_by
FROM employees e
JOIN (
SELECT dept_id, AVG(salary) AS avg_salary
FROM employees
GROUP BY dept_id
) dept_avg ON e.dept_id = dept_avg.dept_id
AND e.salary > dept_avg.avg_salary; -- non-equi condition
Output:
| name | salary | avg_salary | above_average_by |
|---|---|---|---|
| Alice | 120000 | 103250 | 16750 |
| Eve | 110000 | 103250 | 6750 |
| Carol | 85000 | 78500 | 6500 |
| Frank | 90000 | 90000 | 0 |
Pattern 4 — LATERAL JOIN (Correlated Subquery as a Join)
LATERAL (PostgreSQL, Databricks) or CROSS APPLY (SQL Server) allows you to reference columns from a previous table in a subquery on the right side of a join. This is more powerful than a correlated subquery because the lateral subquery can return multiple rows per left-side row.
Get the top 2 highest-paid employees per department:
SELECT d.dept_name, top_emps.name, top_emps.salary, top_emps.rank
FROM departments d
JOIN LATERAL (
SELECT e.name, e.salary,
ROW_NUMBER() OVER (ORDER BY e.salary DESC) AS rank
FROM employees e
WHERE e.dept_id = d.dept_id -- reference outer table here
ORDER BY e.salary DESC
LIMIT 2
) top_emps ON TRUE
WHERE d.dept_id != 40 -- exclude HR (no employees)
ORDER BY d.dept_name, top_emps.rank;
Output:
| dept_name | name | salary | rank |
|---|---|---|---|
| Engineering | Alice | 120000 | 1 |
| Engineering | Eve | 110000 | 2 |
| Finance | Frank | 90000 | 1 |
| Sales | Carol | 85000 | 1 |
| Sales | David | 72000 | 2 |
ON TRUE means “always join” — the LATERAL produces the rows, you do not need an additional matching condition.
Pipeline use case: Getting the N most recent events per user, the N largest orders per customer, or the closest matching reference record by date — without a window function followed by a filter.
Pattern 5 — CROSS JOIN (Cartesian Product for Data Generation)
A CROSS JOIN returns every combination of rows from both tables. Normally this is a mistake (a million-row table CROSS JOINed with another million-row table = 1 trillion rows). But intentional CROSS JOINs are essential for generating date spines, test data, and grid tables.
Generate a complete employee × department grid (for presence/absence analysis):
SELECT
e.name AS employee,
d.dept_name AS department,
CASE WHEN e.dept_id = d.dept_id THEN 'YES' ELSE 'no' END AS assigned
FROM employees e
CROSS JOIN departments d
WHERE e.emp_id <= 4 -- limit output for readability
ORDER BY e.name, d.dept_name;
Output:
| employee | department | assigned |
|---|---|---|
| Alice | Engineering | YES |
| Alice | Finance | no |
| Alice | HR | no |
| Alice | Sales | no |
| Bob | Engineering | YES |
| Bob | Finance | no |
| … | … | … |
Pipeline use case — date spine generation:
-- Generate one row per day for the last 30 days
SELECT
CURRENT_DATE - (n.n || ' days')::INTERVAL AS date
FROM GENERATE_SERIES(0, 29) n(n);
-- Or with CROSS JOIN to create a date × category grid
SELECT
dates.d AS metric_date,
categories.category
FROM
(SELECT CURRENT_DATE - (n || ' days')::INTERVAL::DATE AS d FROM GENERATE_SERIES(0, 29) n) dates
CROSS JOIN
(SELECT DISTINCT category FROM products) categories
ORDER BY metric_date, category;
Pattern 6 — FULL OUTER JOIN (Keep All Rows From Both Tables)
A FULL OUTER JOIN returns all rows from both the left and right tables, with NULLs where there is no match on either side. This is less common than LEFT JOIN but critical for comparing two datasets where either side can have records the other does not.
Compare revenue in two different periods:
WITH q1_revenue AS (
SELECT dept_id, SUM(salary) AS q1_total FROM employees GROUP BY dept_id
),
q2_revenue AS (
SELECT dept_id, SUM(salary) * 1.1 AS q2_total FROM employees GROUP BY dept_id
)
SELECT
COALESCE(q1.dept_id, q2.dept_id) AS dept_id,
q1.q1_total,
q2.q2_total,
COALESCE(q2.q2_total, 0) - COALESCE(q1.q1_total, 0) AS change,
CASE
WHEN q1.dept_id IS NULL THEN 'New in Q2'
WHEN q2.dept_id IS NULL THEN 'Dropped in Q2'
ELSE 'Existing'
END AS status
FROM q1_revenue q1
FULL OUTER JOIN q2_revenue q2 ON q1.dept_id = q2.dept_id;
Pipeline use case: Source-to-target reconciliation where you need to see both missing rows (in target but not source) and phantom rows (in source but not target) in a single query — exactly the pattern from Tutorial 11.
Common Mistakes
Mistake 1 — Using LEFT JOIN + IS NULL instead of NOT EXISTS when the join key can be NULL
If e.dept_id is NULL, LEFT JOIN ... WHERE e.emp_id IS NULL still works — but only because the join fails for NULL dept_id in a different way. Use NOT EXISTS when your join key might contain NULLs to avoid this ambiguity.
Mistake 2 — CROSS JOIN on large tables without a WHERE clause
A CROSS JOIN without WHERE or a limiting condition is always intentional. If you see one in existing code without a comment explaining why, it is almost certainly a bug.
Mistake 3 — Forgetting e1.emp_id < e2.emp_id in SELF JOINs
Without this condition, self joins produce symmetric duplicates (Alice-Bob and Bob-Alice) plus self-pairs (Alice-Alice). Always add a strict inequality condition unless you specifically need both directions.
Quick Reference
| Join Type | Returns | Use Case |
|---|---|---|
| INNER JOIN | Rows matching in both tables | Standard dimension lookup |
| LEFT JOIN | All left rows, matched right rows | Optional dimension, keep orphans |
| ANTI JOIN | Left rows with NO match in right | Find gaps, orphaned records |
| SELF JOIN | Rows joined to other rows in same table | Hierarchy traversal, pair comparison |
| Non-Equi JOIN | Rows matching on range condition | Salary bands, SCD Type 2, date ranges |
| LATERAL JOIN | One correlated subquery result per left row | Top-N per group, nearest match |
| CROSS JOIN | Every combination of rows | Date spines, grid generation |
| FULL OUTER JOIN | All rows from both tables | Dataset comparison, reconciliation |
What to Learn Next
Advanced joins handle the structural relationships in your data. The next tutorial covers SQL for handling JSON and semi-structured data — parsing nested fields, exploding arrays, and flattening nested objects in Databricks, Snowflake, and PostgreSQL.