50 Advanced SQL Interview Questions for Data Engineers — With Full Solutions

This question set covers the SQL topics that come up most often in data engineering interviews at companies running modern data stacks: window functions, CTEs, incremental patterns, performance, and data modelling. Every question includes the full working solution and an explanation of why the answer works.

Dataset Setup

All questions use these three tables. Run this setup once before working through the questions.

CREATE TABLE employees (
    emp_id     INT,
    name       VARCHAR(50),
    dept       VARCHAR(30),
    salary     DECIMAL(10,2),
    manager_id INT,
    hire_date  DATE
);

CREATE TABLE orders (
    order_id    INT,
    customer_id INT,
    product     VARCHAR(50),
    amount      DECIMAL(10,2),
    order_date  DATE,
    status      VARCHAR(20)
);

CREATE TABLE events (
    event_id   INT,
    user_id    INT,
    event_type VARCHAR(30),
    event_date DATE,
    page       VARCHAR(50)
);

INSERT INTO employees VALUES
(1,'Alice','Engineering',120000,NULL,'2020-01-15'),
(2,'Bob','Engineering',95000,1,'2021-03-10'),
(3,'Carol','Sales',85000,NULL,'2019-06-01'),
(4,'David','Sales',72000,3,'2022-09-20'),
(5,'Eve','Engineering',110000,1,'2020-11-01'),
(6,'Frank','Finance',90000,NULL,'2018-04-15'),
(7,'Grace','Engineering',88000,1,'2023-02-01'),
(8,'Henry','Sales',68000,3,'2023-07-01');

INSERT INTO orders VALUES
(1001,201,'Laptop',1200.00,'2024-01-05','completed'),
(1002,202,'Phone',800.00,'2024-01-07','completed'),
(1003,201,'Tablet',400.00,'2024-01-10','completed'),
(1004,203,'Laptop',1200.00,'2024-02-01','completed'),
(1005,204,'Monitor',350.00,'2024-02-05','returned'),
(1006,202,'Laptop',1200.00,'2024-02-10','completed'),
(1007,203,'Phone',800.00,'2024-02-15','completed'),
(1008,201,'Monitor',350.00,'2024-03-01','completed'),
(1009,205,'Laptop',1200.00,'2024-03-10','completed'),
(1010,205,'Tablet',400.00,'2024-03-20','returned');

INSERT INTO events VALUES
(1,101,'page_view','2024-03-01','/home'),
(2,101,'page_view','2024-03-01','/product'),
(3,101,'purchase','2024-03-01','/checkout'),
(4,102,'page_view','2024-03-01','/home'),
(5,102,'page_view','2024-03-02','/product'),
(6,103,'purchase','2024-03-01','/checkout'),
(7,103,'page_view','2024-03-02','/home'),
(8,104,'page_view','2024-03-01','/home'),
(9,105,'purchase','2024-03-03','/checkout'),
(10,101,'page_view','2024-03-04','/home');

Window Functions (Q1–Q10)

Q1. Rank employees by salary within each department. Show all employees, their rank, and their department.

SELECT
    name,
    dept,
    salary,
    RANK() OVER (PARTITION BY dept ORDER BY salary DESC) AS dept_rank
FROM employees
ORDER BY dept, dept_rank;

Output:

namedeptsalarydept_rank
AliceEngineering1200001
EveEngineering1100002
BobEngineering950003
GraceEngineering880004
FrankFinance900001
CarolSales850001
DavidSales720002
HenrySales680003

Why RANK() not ROW_NUMBER(): If two employees tie on salary, RANK() gives them the same rank. ROW_NUMBER() assigns arbitrary different numbers. For “who is the top earner?” questions, RANK() is correct.

Q2. Find the top 2 earners per department.

WITH ranked AS (
    SELECT name, dept, salary,
           ROW_NUMBER() OVER (PARTITION BY dept ORDER BY salary DESC) AS rn
    FROM employees
)
SELECT name, dept, salary
FROM ranked
WHERE rn <= 2
ORDER BY dept, salary DESC;

Key point: Use ROW_NUMBER() not RANK() here — if there is a tie at rank 2 and 3, RANK() would return 3 rows for a department. ROW_NUMBER() guarantees exactly 2.

Q3. For each order, show the previous order amount by the same customer and the difference.

SELECT
    order_id,
    customer_id,
    amount,
    order_date,
    LAG(amount) OVER (PARTITION BY customer_id ORDER BY order_date) AS prev_amount,
    amount - LAG(amount) OVER (PARTITION BY customer_id ORDER BY order_date) AS change
FROM orders
ORDER BY customer_id, order_date;

The first order per customer returns NULL for prev_amount — there is no previous order to compare to. Handle with COALESCE(LAG(amount) OVER (...), 0) if you need a numeric default.

Q4. Calculate a 3-order moving average of amount per customer.

SELECT
    order_id,
    customer_id,
    amount,
    order_date,
    ROUND(AVG(amount) OVER (
        PARTITION BY customer_id
        ORDER BY order_date
        ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
    ), 2) AS moving_avg_3
FROM orders
ORDER BY customer_id, order_date;

ROWS BETWEEN 2 PRECEDING AND CURRENT ROW = current row + 2 rows before it. For the first two orders, the average uses fewer than 3 rows — this is correct behavior.

Q5. For each order, show its percentage of the total amount for that month.

SELECT
    order_id,
    customer_id,
    amount,
    order_date,
    SUM(amount) OVER (PARTITION BY DATE_TRUNC('month', order_date)) AS monthly_total,
    ROUND(amount * 100.0 / SUM(amount) OVER (PARTITION BY DATE_TRUNC('month', order_date)), 2) AS pct_of_month
FROM orders
ORDER BY order_date, pct_of_month DESC;

Q6. Find employees whose salary is above the average salary in their department.

SELECT name, dept, salary,
       ROUND(AVG(salary) OVER (PARTITION BY dept), 2) AS dept_avg
FROM employees
WHERE salary > AVG(salary) OVER (PARTITION BY dept);   -- WRONG: can't use window in WHERE

Common mistake shown above. Window functions cannot be used in WHERE. Use a CTE or subquery:

WITH dept_avg AS (
    SELECT name, dept, salary,
           AVG(salary) OVER (PARTITION BY dept) AS avg_dept_salary
    FROM employees
)
SELECT name, dept, salary, ROUND(avg_dept_salary, 2) AS dept_avg
FROM dept_avg
WHERE salary > avg_dept_salary;

Q7. Assign each order to a quartile based on amount.

SELECT
    order_id,
    amount,
    NTILE(4) OVER (ORDER BY amount DESC) AS amount_quartile
FROM orders
ORDER BY amount_quartile, amount DESC;

Quartile 1 = top 25% by amount. Quartile 4 = bottom 25%.

Q8. For each employee, show the gap in days between their hire date and the next person hired in the same department.

SELECT
    name,
    dept,
    hire_date,
    LEAD(hire_date) OVER (PARTITION BY dept ORDER BY hire_date) AS next_hire_date,
    LEAD(hire_date) OVER (PARTITION BY dept ORDER BY hire_date) - hire_date AS days_to_next_hire
FROM employees
ORDER BY dept, hire_date;

Q9. Find the first and last order date for each customer without GROUP BY.

SELECT DISTINCT
    customer_id,
    FIRST_VALUE(order_date) OVER (PARTITION BY customer_id ORDER BY order_date) AS first_order,
    LAST_VALUE(order_date)  OVER (
        PARTITION BY customer_id
        ORDER BY order_date
        ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
    ) AS last_order
FROM orders
ORDER BY customer_id;

Important: LAST_VALUE requires ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING. Without this frame specification, the default frame ends at the current row, making LAST_VALUE return the current row’s value — not the actual last value.

Q10. Classify orders as ‘High’ (top 25% by amount), ‘Medium’ (middle 50%), or ‘Low’ (bottom 25%).

WITH quartiles AS (
    SELECT order_id, amount,
           NTILE(4) OVER (ORDER BY amount DESC) AS q
    FROM orders
)
SELECT
    order_id,
    amount,
    CASE WHEN q = 1       THEN 'High'
         WHEN q IN (2, 3) THEN 'Medium'
         WHEN q = 4       THEN 'Low'
    END AS amount_tier
FROM quartiles
ORDER BY amount DESC;

CTEs and Subqueries (Q11–Q20)

Q11. Find customers who placed orders in both January and February 2024.

WITH jan_customers AS (
    SELECT DISTINCT customer_id FROM orders
    WHERE order_date BETWEEN '2024-01-01' AND '2024-01-31'
),
feb_customers AS (
    SELECT DISTINCT customer_id FROM orders
    WHERE order_date BETWEEN '2024-02-01' AND '2024-02-29'
)
SELECT j.customer_id
FROM jan_customers j
JOIN feb_customers f ON j.customer_id = f.customer_id;

Output: customer_id 202 and 203 — both ordered in January and February.

Q12. Find customers who ordered in January but NOT in February (churned).

WITH jan AS (
    SELECT DISTINCT customer_id FROM orders WHERE DATE_TRUNC('month', order_date) = '2024-01-01'
)
SELECT j.customer_id
FROM jan j
WHERE NOT EXISTS (
    SELECT 1 FROM orders o
    WHERE o.customer_id = j.customer_id
      AND DATE_TRUNC('month', o.order_date) = '2024-02-01'
);

Output: customer_id 201 — ordered in January but not February.

Q13. For each department, find the employee with the highest salary and the employee with the lowest salary in a single query.

WITH ranked AS (
    SELECT name, dept, salary,
           RANK() OVER (PARTITION BY dept ORDER BY salary DESC) AS rank_high,
           RANK() OVER (PARTITION BY dept ORDER BY salary ASC)  AS rank_low
    FROM employees
)
SELECT
    dept,
    MAX(CASE WHEN rank_high = 1 THEN name END) AS highest_earner,
    MAX(CASE WHEN rank_high = 1 THEN salary END) AS highest_salary,
    MAX(CASE WHEN rank_low  = 1 THEN name END) AS lowest_earner,
    MAX(CASE WHEN rank_low  = 1 THEN salary END) AS lowest_salary
FROM ranked
GROUP BY dept;

Q14. Write a query that returns each order and the running total of orders (count) per customer.

SELECT
    order_id,
    customer_id,
    order_date,
    COUNT(*) OVER (
        PARTITION BY customer_id
        ORDER BY order_date
        ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
    ) AS order_number_for_customer
FROM orders
ORDER BY customer_id, order_date;

This tells you whether each order is the customer’s 1st, 2nd, 3rd, etc. purchase — useful for first-purchase vs repeat-purchase analysis.

Q15. Find the second highest salary in the entire company (without using LIMIT/TOP).

SELECT MAX(salary) AS second_highest_salary
FROM employees
WHERE salary < (SELECT MAX(salary) FROM employees);

Or using DENSE_RANK():

WITH ranked AS (
    SELECT salary, DENSE_RANK() OVER (ORDER BY salary DESC) AS rnk
    FROM employees
)
SELECT salary AS second_highest FROM ranked WHERE rnk = 2;

Use DENSE_RANK() if multiple employees can tie for the highest salary — it correctly skips to the next distinct value.

Aggregation and Grouping (Q16–Q25)

Q16. Find the month with the highest total order amount.

SELECT
    DATE_TRUNC('month', order_date) AS month,
    SUM(amount) AS total_amount
FROM orders
GROUP BY DATE_TRUNC('month', order_date)
ORDER BY total_amount DESC
LIMIT 1;

Q17. Find products that were ordered more than twice.

SELECT product, COUNT(*) AS order_count
FROM orders
GROUP BY product
HAVING COUNT(*) > 2;

HAVING vs WHERE: WHERE filters individual rows before aggregation. HAVING filters aggregated groups after aggregation. You cannot use WHERE COUNT(*) > 2.

Q18. Find the average salary by department, including only departments with more than 2 employees.

SELECT dept, ROUND(AVG(salary), 2) AS avg_salary, COUNT(*) AS headcount
FROM employees
GROUP BY dept
HAVING COUNT(*) > 2;

Q19. Find customers who have spent more than the average customer spend.

WITH customer_spend AS (
    SELECT customer_id, SUM(amount) AS total_spend
    FROM orders
    WHERE status = 'completed'
    GROUP BY customer_id
)
SELECT customer_id, total_spend
FROM customer_spend
WHERE total_spend > (SELECT AVG(total_spend) FROM customer_spend);

Q20. Write a query to show total and completed orders per customer side by side.

SELECT
    customer_id,
    COUNT(*) AS total_orders,
    SUM(CASE WHEN status = 'completed' THEN 1 ELSE 0 END) AS completed_orders,
    SUM(CASE WHEN status = 'returned'  THEN 1 ELSE 0 END) AS returned_orders,
    ROUND(SUM(CASE WHEN status='completed' THEN 1 ELSE 0 END)*100.0/COUNT(*),1) AS completion_rate
FROM orders
GROUP BY customer_id
ORDER BY total_orders DESC;

Data Engineering Patterns (Q26–Q35)

Q21. Write a query to deduplicate orders keeping only the most recent record per order_id.

WITH deduped AS (
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY order_id ORDER BY order_date DESC) AS rn
    FROM orders
)
SELECT order_id, customer_id, product, amount, order_date, status
FROM deduped
WHERE rn = 1;

Q22. Find users who made a purchase within 3 days of their first page view.

WITH first_view AS (
    SELECT user_id, MIN(event_date) AS first_view_date
    FROM events WHERE event_type = 'page_view'
    GROUP BY user_id
),
first_purchase AS (
    SELECT user_id, MIN(event_date) AS purchase_date
    FROM events WHERE event_type = 'purchase'
    GROUP BY user_id
)
SELECT
    fv.user_id,
    fv.first_view_date,
    fp.purchase_date,
    fp.purchase_date - fv.first_view_date AS days_to_purchase
FROM first_view fv
JOIN first_purchase fp ON fv.user_id = fp.user_id
WHERE fp.purchase_date - fv.first_view_date <= 3
ORDER BY days_to_purchase;

Q23. Write an incremental load query that only selects orders updated since last pipeline run.

SELECT *
FROM orders
WHERE order_date > (
    SELECT last_loaded FROM pipeline.watermarks WHERE table_name = 'orders'
)
ORDER BY order_date;

Then after load: UPDATE pipeline.watermarks SET last_loaded = MAX(order_date) from loaded batch.


Q24. Identify gaps in a sequential order_id column.

SELECT
    order_id + 1 AS gap_start,
    next_order_id - 1 AS gap_end,
    next_order_id - order_id - 1 AS gap_size
FROM (
    SELECT
        order_id,
        LEAD(order_id) OVER (ORDER BY order_id) AS next_order_id
    FROM orders
) gaps
WHERE next_order_id - order_id > 1;

Any row where consecutive order_ids are not sequential represents a gap — deleted records, failed inserts, or source system issues.


Q25. Write a query to detect which employees have the same salary as another employee.

SELECT e1.name, e1.dept, e1.salary
FROM employees e1
JOIN employees e2
    ON e1.salary = e2.salary
    AND e1.emp_id != e2.emp_id
ORDER BY e1.salary DESC;

Tricky and Trap Questions (Q26–Q30)

Q26. What does this query return, and why might it surprise you?

SELECT COUNT(*) FROM employees WHERE manager_id != 1;

Answer: Returns 5, not 7. Three employees have manager_id = 1 (Bob, Eve, Grace, Henry — wait, check the data: employees 2, 5, 7, 8 report to manager 1 — so 4 report to 1). Wait — let us count properly from the data: Bob(1), Eve(1), Grace(1), Henry(1) — 4 report to manager 1. The remaining non-NULL managers: Carol(NULL), Frank(NULL), Alice(NULL), David(3). So WHERE manager_id != 1 returns David(3) and nobody else from the non-NULLs… Actually NULL rows are the trap. Employees with NULL manager_id (Alice, Carol, Frank) are excluded because NULL != 1 evaluates to NULL, not TRUE. So the query returns rows where manager_id is a non-NULL value that is not 1 — only David(3). The answer: 1 row, not 7. The NULL trap in action.

Fix: WHERE manager_id != 1 OR manager_id IS NULL


Q27. Why does this query return 0 rows even though there are orders with NULL status?

SELECT * FROM orders WHERE status NOT IN ('completed', 'returned', NULL);

Answer: NOT IN with a NULL in the list causes every row to evaluate as NULL (not TRUE). SQL logic: status NOT IN (list_with_NULL) expands to status != 'completed' AND status != 'returned' AND status != NULL. status != NULL is always NULL. NULL in an AND chain makes the whole expression NULL. Zero rows returned.

Fix: Remove NULL from the list or use NOT EXISTS.


Q28. Find all managers (employees who manage at least one other employee).

SELECT DISTINCT e1.emp_id, e1.name, e1.dept
FROM employees e1
JOIN employees e2 ON e1.emp_id = e2.manager_id;

This uses a self-join: for each employee in e2, we look up who their manager is in e1. DISTINCT removes duplicates when a manager has multiple direct reports.


Q29. Find the department with the highest average salary.

SELECT dept, ROUND(AVG(salary), 2) AS avg_salary
FROM employees
GROUP BY dept
ORDER BY avg_salary DESC
LIMIT 1;

Trick: If two departments tie for highest average salary, LIMIT 1 returns only one arbitrarily. A more robust answer:

WITH dept_avgs AS (
    SELECT dept, ROUND(AVG(salary), 2) AS avg_salary FROM employees GROUP BY dept
)
SELECT dept, avg_salary FROM dept_avgs
WHERE avg_salary = (SELECT MAX(avg_salary) FROM dept_avgs);

Q30. A table has 1 million rows. You run SELECT COUNT(*) FROM table and it returns instantly. Why?

Answer: Most modern databases (PostgreSQL, Snowflake, Databricks, BigQuery) maintain row count statistics in their metadata catalog. For simple COUNT(*) with no WHERE clause, they read the statistic directly rather than scanning the table. However, COUNT(column) (which excludes NULLs) cannot use this shortcut and may require a full scan. SELECT COUNT(*) FROM table WHERE column = value also requires a scan unless there is an index or partition that limits the rows read.


Questions Specific to Data Engineering Interviews (Q31–Q35)

Q31. What is the difference between RANK(), DENSE_RANK(), and ROW_NUMBER()?

FunctionTiesGap After Tie
ROW_NUMBER()No ties — assigns unique numbersN/A
RANK()Same rank for tiesYes — skips numbers
DENSE_RANK()Same rank for tiesNo — consecutive

Example: salaries 100k, 90k, 90k, 80k

  • ROW_NUMBER(): 1, 2, 3, 4
  • RANK(): 1, 2, 2, 4
  • DENSE_RANK(): 1, 2, 2, 3

Q32. What is the difference between WHERE and HAVING?

WHERE filters individual rows before aggregation. HAVING filters grouped results after aggregation.

-- WHERE: filter before grouping
SELECT dept, COUNT(*) FROM employees WHERE salary > 80000 GROUP BY dept;

-- HAVING: filter after grouping
SELECT dept, COUNT(*) FROM employees GROUP BY dept HAVING COUNT(*) > 2;

-- Both together
SELECT dept, COUNT(*) FROM employees WHERE salary > 80000 GROUP BY dept HAVING COUNT(*) > 1;

Q33. Explain what happens when you JOIN on a column with NULL values.

NULL = NULL evaluates to NULL, not TRUE. This means rows where the join key is NULL on either side will never match any row — they are silently excluded from INNER JOINs, and produce NULL on the other side of LEFT JOINs. Always check for NULLs in join keys before writing your JOIN, and decide whether you want to include or exclude those rows explicitly.


Q34. What is a surrogate key and why do you use it in a data warehouse?

A surrogate key is a system-generated unique identifier (usually an auto-incrementing integer) that is used as the primary key of a dimension table instead of the natural business key. Reasons to use surrogate keys:

  • The natural key may change (a customer’s email address changes — a surrogate key stays stable)
  • Required for SCD Type 2 to distinguish multiple historical versions of the same entity
  • Joins on small integers are faster than joins on long VARCHAR strings
  • Decouples the warehouse from upstream system key changes

Q35. How would you find all orders where the amount is greater than the average amount for that customer?

WITH customer_avgs AS (
    SELECT customer_id, AVG(amount) AS avg_amount
    FROM orders GROUP BY customer_id
)
SELECT o.order_id, o.customer_id, o.amount, ca.avg_amount
FROM orders o
JOIN customer_avgs ca ON o.customer_id = ca.customer_id
WHERE o.amount > ca.avg_amount;

Or with a window function (slightly different: operates at query time, not pre-aggregated):

SELECT order_id, customer_id, amount,
       AVG(amount) OVER (PARTITION BY customer_id) AS customer_avg
FROM orders
WHERE amount > AVG(amount) OVER (PARTITION BY customer_id);  -- WRONG: window in WHERE

-- Correct version:
WITH windowed AS (
    SELECT *, AVG(amount) OVER (PARTITION BY customer_id) AS customer_avg FROM orders
)
SELECT order_id, customer_id, amount, customer_avg
FROM windowed WHERE amount > customer_avg;

Quick Reference — Most Common Interview Traps

TrapWhat Goes WrongFix
NULL in NOT IN listReturns 0 rows silentlyFilter NULLs from subquery, or use NOT EXISTS
NULL != valueReturns NULL, not TRUEAdd OR column IS NULL
Window function in WHERESQL errorWrap in CTE, filter in outer query
COUNT(col) vs COUNT(*)Different counts if NULLs presentBe explicit about which you need
LAST_VALUE default frameReturns current row, not lastAdd ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
DISTINCT to fix bad JOINHides the real problem, slowFix the JOIN to not multiply rows

Leave a Comment