Cohort analysis is one of the most valuable analytical patterns in any business — it answers “are customers we acquired last quarter sticking around better than customers from two years ago?” Without cohort analysis, retention metrics are meaningless averages that hide whether your business is improving or decaying.
This tutorial builds a complete cohort analysis pipeline in SQL, from raw order data to a retention matrix ready for a dashboard.
The Dataset
CREATE TABLE orders (
order_id INT,
customer_id INT,
amount DECIMAL(10,2),
order_date DATE
);
INSERT INTO orders VALUES
-- Customer 101: first order Jan, orders in Feb and Apr
(1001, 101, 120.00, '2024-01-05'),
(1002, 101, 85.00, '2024-02-12'),
(1003, 101, 200.00, '2024-04-03'),
-- Customer 102: first order Jan, only one order
(1004, 102, 95.00, '2024-01-18'),
-- Customer 103: first order Feb, orders in Feb and Mar
(1005, 103, 150.00, '2024-02-07'),
(1006, 103, 60.00, '2024-03-15'),
-- Customer 104: first order Feb, orders in Feb, Mar, Apr
(1007, 104, 300.00, '2024-02-20'),
(1008, 104, 210.00, '2024-03-22'),
(1009, 104, 175.00, '2024-04-18'),
-- Customer 105: first order Mar, only one order
(1010, 105, 80.00, '2024-03-10'),
-- Customer 106: first order Mar, orders in Mar and May
(1011, 106, 140.00, '2024-03-25'),
(1012, 106, 90.00, '2024-05-10'),
-- Customer 107: first order Apr, orders in Apr and May
(1013, 107, 250.00, '2024-04-08'),
(1014, 107, 180.00, '2024-05-20');
Step 1 — Assign Each Customer to a Cohort (First Order Month)
WITH first_orders AS (
SELECT
customer_id,
MIN(order_date) AS first_order_date,
DATE_TRUNC('month', MIN(order_date))::DATE AS cohort_month
FROM orders
GROUP BY customer_id
)
SELECT * FROM first_orders ORDER BY customer_id;
Output:
| customer_id | first_order_date | cohort_month |
|---|---|---|
| 101 | 2024-01-05 | 2024-01-01 |
| 102 | 2024-01-18 | 2024-01-01 |
| 103 | 2024-02-07 | 2024-02-01 |
| 104 | 2024-02-20 | 2024-02-01 |
| 105 | 2024-03-10 | 2024-03-01 |
| 106 | 2024-03-25 | 2024-03-01 |
| 107 | 2024-04-08 | 2024-04-01 |
January cohort: customers 101 and 102. February cohort: 103 and 104. And so on.
Step 2 — Calculate Month Index for Each Order
The month index tells you how many months after the first order a customer made each subsequent purchase. Month 0 = the acquisition month, Month 1 = one month later, etc.
WITH first_orders AS (
SELECT
customer_id,
DATE_TRUNC('month', MIN(order_date))::DATE AS cohort_month
FROM orders
GROUP BY customer_id
),
order_months AS (
SELECT
o.customer_id,
fo.cohort_month,
DATE_TRUNC('month', o.order_date)::DATE AS order_month,
-- Calculate how many months since first order
(EXTRACT(YEAR FROM o.order_date) - EXTRACT(YEAR FROM fo.cohort_month)) * 12
+ (EXTRACT(MONTH FROM o.order_date) - EXTRACT(MONTH FROM fo.cohort_month))
AS month_index
FROM orders o
JOIN first_orders fo ON o.customer_id = fo.customer_id
)
SELECT * FROM order_months ORDER BY customer_id, month_index;
Output:
| customer_id | cohort_month | order_month | month_index |
|---|---|---|---|
| 101 | 2024-01-01 | 2024-01-01 | 0 |
| 101 | 2024-01-01 | 2024-02-01 | 1 |
| 101 | 2024-01-01 | 2024-04-01 | 3 |
| 102 | 2024-01-01 | 2024-01-01 | 0 |
| 103 | 2024-02-01 | 2024-02-01 | 0 |
| 103 | 2024-02-01 | 2024-03-01 | 1 |
| 104 | 2024-02-01 | 2024-02-01 | 0 |
| 104 | 2024-02-01 | 2024-03-01 | 1 |
| 104 | 2024-02-01 | 2024-04-01 | 2 |
Customer 101 ordered in Month 0 (Jan), Month 1 (Feb), and Month 3 (Apr). No order in Month 2 (Mar) — they churned temporarily.
Step 3 — Build the Retention Matrix
Count unique customers retained in each cohort at each month index.
WITH first_orders AS (
SELECT
customer_id,
DATE_TRUNC('month', MIN(order_date))::DATE AS cohort_month
FROM orders
GROUP BY customer_id
),
order_months AS (
SELECT
o.customer_id,
fo.cohort_month,
(EXTRACT(YEAR FROM o.order_date) - EXTRACT(YEAR FROM fo.cohort_month)) * 12
+ (EXTRACT(MONTH FROM o.order_date) - EXTRACT(MONTH FROM fo.cohort_month))
AS month_index
FROM orders o
JOIN first_orders fo ON o.customer_id = fo.customer_id
),
cohort_sizes AS (
SELECT cohort_month, COUNT(DISTINCT customer_id) AS cohort_size
FROM first_orders
GROUP BY cohort_month
),
retention_raw AS (
SELECT
cohort_month,
month_index,
COUNT(DISTINCT customer_id) AS active_customers
FROM order_months
GROUP BY cohort_month, month_index
)
SELECT
r.cohort_month,
cs.cohort_size,
r.month_index,
r.active_customers,
ROUND(r.active_customers * 100.0 / cs.cohort_size, 1) AS retention_pct
FROM retention_raw r
JOIN cohort_sizes cs ON r.cohort_month = cs.cohort_month
ORDER BY r.cohort_month, r.month_index;
Output:
| cohort_month | cohort_size | month_index | active_customers | retention_pct |
|---|---|---|---|---|
| 2024-01-01 | 2 | 0 | 2 | 100.0% |
| 2024-01-01 | 2 | 1 | 1 | 50.0% |
| 2024-01-01 | 2 | 3 | 1 | 50.0% |
| 2024-02-01 | 2 | 0 | 2 | 100.0% |
| 2024-02-01 | 2 | 1 | 2 | 100.0% |
| 2024-02-01 | 2 | 2 | 1 | 50.0% |
| 2024-03-01 | 2 | 0 | 2 | 100.0% |
| 2024-03-01 | 2 | 2 | 1 | 50.0% |
| 2024-04-01 | 1 | 0 | 1 | 100.0% |
| 2024-04-01 | 1 | 1 | 1 | 100.0% |
January cohort lost 50% by Month 1. February cohort retained 100% through Month 1 before dropping. This is the insight cohort analysis provides.
Step 4 — Pivot Into a Retention Grid
WITH first_orders AS (
SELECT customer_id,
DATE_TRUNC('month', MIN(order_date))::DATE AS cohort_month
FROM orders GROUP BY customer_id
),
order_months AS (
SELECT o.customer_id, fo.cohort_month,
(EXTRACT(YEAR FROM o.order_date) - EXTRACT(YEAR FROM fo.cohort_month)) * 12
+ (EXTRACT(MONTH FROM o.order_date) - EXTRACT(MONTH FROM fo.cohort_month)) AS month_index
FROM orders o JOIN first_orders fo ON o.customer_id = fo.customer_id
),
cohort_sizes AS (
SELECT cohort_month, COUNT(DISTINCT customer_id) AS cohort_size FROM first_orders GROUP BY cohort_month
),
retention_raw AS (
SELECT cohort_month, month_index, COUNT(DISTINCT customer_id) AS retained
FROM order_months GROUP BY cohort_month, month_index
)
SELECT
r.cohort_month,
cs.cohort_size AS "Size",
MAX(CASE WHEN month_index = 0 THEN ROUND(retained*100.0/cs.cohort_size,0) END) || '%' AS "M0",
MAX(CASE WHEN month_index = 1 THEN ROUND(retained*100.0/cs.cohort_size,0) END) || '%' AS "M1",
MAX(CASE WHEN month_index = 2 THEN ROUND(retained*100.0/cs.cohort_size,0) END) || '%' AS "M2",
MAX(CASE WHEN month_index = 3 THEN ROUND(retained*100.0/cs.cohort_size,0) END) || '%' AS "M3",
MAX(CASE WHEN month_index = 4 THEN ROUND(retained*100.0/cs.cohort_size,0) END) || '%' AS "M4"
FROM retention_raw r
JOIN cohort_sizes cs ON r.cohort_month = cs.cohort_month
GROUP BY r.cohort_month, cs.cohort_size
ORDER BY r.cohort_month;
Output — the classic retention grid:
| cohort_month | Size | M0 | M1 | M2 | M3 | M4 |
|---|---|---|---|---|---|---|
| 2024-01-01 | 2 | 100% | 50% | NULL | 50% | NULL |
| 2024-02-01 | 2 | 100% | 100% | 50% | NULL | NULL |
| 2024-03-01 | 2 | 100% | NULL | 50% | NULL | NULL |
| 2024-04-01 | 1 | 100% | 100% | NULL | NULL | NULL |
This is the exact format used in every SaaS retention dashboard. NULL means no customers from that cohort ordered in that month.
Step 5 — Revenue Cohort Analysis (LTV by Cohort)
Retention tells you if customers came back. Revenue cohort analysis tells you how much they spent.
WITH first_orders AS (
SELECT customer_id,
DATE_TRUNC('month', MIN(order_date))::DATE AS cohort_month
FROM orders GROUP BY customer_id
),
order_months AS (
SELECT
o.customer_id,
o.amount,
fo.cohort_month,
(EXTRACT(YEAR FROM o.order_date) - EXTRACT(YEAR FROM fo.cohort_month)) * 12
+ (EXTRACT(MONTH FROM o.order_date) - EXTRACT(MONTH FROM fo.cohort_month)) AS month_index
FROM orders o JOIN first_orders fo ON o.customer_id = fo.customer_id
),
cohort_sizes AS (
SELECT cohort_month, COUNT(DISTINCT customer_id) AS cohort_size FROM first_orders GROUP BY cohort_month
)
SELECT
om.cohort_month,
cs.cohort_size,
om.month_index,
ROUND(SUM(om.amount), 2) AS cohort_revenue,
ROUND(SUM(om.amount) / cs.cohort_size, 2) AS revenue_per_customer,
ROUND(SUM(SUM(om.amount)) OVER (
PARTITION BY om.cohort_month
ORDER BY om.month_index
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
) / cs.cohort_size, 2) AS cumulative_ltv
FROM order_months om
JOIN cohort_sizes cs ON om.cohort_month = cs.cohort_month
GROUP BY om.cohort_month, om.month_index, cs.cohort_size
ORDER BY om.cohort_month, om.month_index;
Output:
| cohort_month | cohort_size | month_index | cohort_revenue | revenue_per_customer | cumulative_ltv |
|---|---|---|---|---|---|
| 2024-01-01 | 2 | 0 | 215.00 | 107.50 | 107.50 |
| 2024-01-01 | 2 | 1 | 85.00 | 42.50 | 150.00 |
| 2024-01-01 | 2 | 3 | 200.00 | 100.00 | 250.00 |
| 2024-02-01 | 2 | 0 | 450.00 | 225.00 | 225.00 |
| 2024-02-01 | 2 | 1 | 270.00 | 135.00 | 360.00 |
| 2024-02-01 | 2 | 2 | 175.00 | 87.50 | 447.50 |
February cohort has significantly higher LTV than January — even with the same size. This is the kind of insight that drives acquisition strategy decisions.
Common Mistakes
Mistake 1 — Using calendar months instead of month index
If you group by the actual order month (Jan, Feb, Mar) instead of the month index (0, 1, 2), all cohorts collapse into the same columns and you cannot compare retention across cohorts.
Mistake 2 — Counting orders instead of customers
A customer who ordered 3 times in Month 1 should count as 1 retained customer, not 3. Always use COUNT(DISTINCT customer_id) for retention metrics.
Mistake 3 — Including trial or test accounts in cohorts
If your data has internal test users or free trial accounts, filter them out before building cohorts. They inflate Month 0 size and distort all downstream percentages.
What to Learn Next
Cohort analysis shows how existing customers behave over time. The next tutorial covers funnel analysis — measuring how potential customers progress through acquisition steps, and where they drop off.