What is join fan-out and how does it cause duplicate rows or inflated aggregates?
Fan-out occurs when a join key is not unique on one side, causing each row on the unique side to match multiple rows on the non-unique side and multiply the result set. This silently inflates SUM, COUNT, and AVG unless the duplicate rows are handled before or after the join.
How to think about it
This is one of the most common silent data correctness bugs in analytics SQL. The query runs, returns a number, and nobody questions it until a BI report is off by 3x. The interviewer wants to see you recognize the grain mismatch — not just fix it after the fact.
Start by asking: “Is the join key unique on both sides?”
How fan-out happens
Suppose orders has one row per order and order_items has multiple rows per order.
-- orders: 1 row per order_id
-- order_items: N rows per order_id
SELECT o.order_id, o.customer_id, oi.product_id, o.order_total
FROM orders o
JOIN order_items oi ON o.order_id = oi.order_id;
If an order has 4 items, order_total from orders appears 4 times in the result. Any SUM over that column now counts it 4x.
The inflated aggregate trap
-- WRONG: SUM(o.order_total) is inflated by the number of line items per order
SELECT customer_id, SUM(o.order_total) AS total_spend
FROM orders o
JOIN order_items oi ON o.order_id = oi.order_id
GROUP BY customer_id;
Correct patterns
Option 1 — aggregate first, then join
WITH item_totals AS (
SELECT order_id, SUM(line_amount) AS items_total
FROM order_items
GROUP BY order_id
)
SELECT o.customer_id, SUM(o.order_total) AS total_spend
FROM orders o
JOIN item_totals it ON o.order_id = it.order_id
GROUP BY o.customer_id;
Option 2 — use COUNT(DISTINCT) as a smell detector
-- If COUNT(*) != COUNT(DISTINCT order_id) after the join, you have fan-out
SELECT COUNT(*) AS rows, COUNT(DISTINCT o.order_id) AS distinct_orders
FROM orders o
JOIN order_items oi ON o.order_id = oi.order_id;
Many-to-many fan-out
When both sides are non-unique on the join key, the result is a full Cartesian product of the matching subsets — 5 rows on the left times 4 rows on the right gives 20 result rows for that key. This is exponential, not linear, and can silently turn a 10 M-row table into a multi-billion-row result.
See the fan-out happen — then fix it
The playground below lets you run the broken query and the corrected version side by side. Watch order_total get counted multiple times in the bad version.
What to say in an interview
- Always check grain before joining: “Is this key unique on each side?”
- Know the cardinality of every join — make it a habit, not an afterthought.
- Use
COUNT(*) vs COUNT(DISTINCT key)to diagnose fan-out in an existing query. - The fix is almost always: aggregate the many-side first in a CTE, then join at the correct grain.