SQL · Hard · Asked at Meta · Asked at Amazon · Asked at Stripe

How do you safely join two tables in a many-to-many relationship without creating a row explosion?

For: Data Analyst, Data Engineer, Data Scientist

The short answer

A many-to-many join returns, for each key value, the Cartesian product of the matching rows on each side, so row counts multiply: m matching rows on one side and n on the other yield m × n rows. The correct approach is to pre-aggregate at least one side to a unique grain before joining, or to use a bridge/junction table that resolves the relationship into two one-to-many joins.

How to think about it

A many-to-many join is among the most dangerous joins in analytics SQL. A direct join between two tables that are both non-unique on the join key silently explodes the row count, and the result looks plausible until you reconcile totals against the source system. The interviewer wants to hear you recognise the grain problem before you propose a fix, so lead with the question that prevents it: “is the join key unique on each side?”

Why it explodes

If user_tags has 3 rows for user 1 and user_events has 2, joining them on user_id yields 3 × 2 = 6 rows for that user — the Cartesian product of the two matching subsets. Every aggregate computed over that result is now inflated.
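
The arithmetic can be checked without running the risky join at all. The sketch below counts rows per key on each side and multiplies them to predict how many rows a direct join would return. Only the table names and the 3-tag, 2-event shape for user 1 come from the text; the sample rows, the event_id column, and the aliases t, e, n_tags, n_events and joined_rows are assumptions for illustration, and the script is meant for a scratch database.

```sql
-- Hypothetical sample data: user 1 has 3 tags and 2 events.
CREATE TABLE user_tags   (user_id INT, tag VARCHAR(20));
CREATE TABLE user_events (user_id INT, event_id INT);  -- event_id is an assumed column

INSERT INTO user_tags   VALUES (1, 'data'), (1, 'python'), (1, 'sql');
INSERT INTO user_events VALUES (1, 101), (1, 102);

-- Count rows per key on each side, then multiply: the product is the number
-- of rows a direct join on user_id would return for that key.
WITH t AS (
  SELECT user_id, COUNT(*) AS n_tags
  FROM user_tags
  GROUP BY user_id
),
e AS (
  SELECT user_id, COUNT(*) AS n_events
  FROM user_events
  GROUP BY user_id
)
SELECT user_id, n_tags, n_events, n_tags * n_events AS joined_rows
FROM t
JOIN e USING (user_id)
ORDER BY user_id;
-- For the sample data: user_id 1, n_tags 3, n_events 2, joined_rows 6
```

Because both CTEs are already collapsed to one row per user, this query is itself a safe one-to-one join.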

A worked example — the explosion, then the fix

User 1 has 3 tags and 2 events; user 2 has 2 tags and 3 events. A direct join followed by COUNT(*) reports 6 events for each user, because every genuine event is counted once per tag:

-- BROKEN: 3 tags x 2 events = 6 rows for user 1; the count is the product, not the truth
SELECT user_id, COUNT(*) AS inflated_event_count
FROM user_tags
JOIN user_events USING (user_id)
GROUP BY user_id;

user_id	inflated_event_count
1	6
2	6

User 1 actually has 2 events and user 2 has 3, so both reported counts are wrong, each inflated to 6. Collapse one side to a unique grain first, and the numbers come out right:

WITH ec AS (                         -- one row per user, no fan-out
  SELECT user_id, COUNT(*) AS event_count
  FROM user_events
  GROUP BY user_id
)
SELECT ut.user_id, ut.tag, ec.event_count
FROM user_tags ut
JOIN ec USING (user_id)
ORDER BY ut.user_id, ut.tag;

user_id	tag	event_count
1	data	2
1	python	2
1	sql	2
2	ml	3
2	stats	3

Now event_count is the real per-user figure (2 and 3), repeated once per tag because the tag is the grain you asked for — and it’s correct repetition, not inflation.
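
One way to confirm the difference is to reconcile both versions against the source table. The following sketch compares the true event total with the row count of the direct join and with the sum of the pre-aggregated counts. It assumes sample rows shaped like the worked example, a hypothetical event_id column, and a dialect that allows SELECT without FROM (PostgreSQL, SQLite and MySQL do); the output column names are also illustrative.

```sql
-- Hypothetical sample data matching the worked example.
CREATE TABLE user_tags   (user_id INT, tag VARCHAR(20));
CREATE TABLE user_events (user_id INT, event_id INT);  -- event_id is an assumed column

INSERT INTO user_tags VALUES
  (1, 'data'), (1, 'python'), (1, 'sql'),
  (2, 'ml'), (2, 'stats');
INSERT INTO user_events VALUES
  (1, 101), (1, 102),
  (2, 201), (2, 202), (2, 203);

SELECT
  -- Ground truth: rows in the source table.
  (SELECT COUNT(*) FROM user_events) AS source_events,
  -- Direct many-to-many join: each event repeated once per tag.
  (SELECT COUNT(*)
     FROM user_tags
     JOIN user_events USING (user_id)) AS direct_join_rows,
  -- Pre-aggregated side: one row per user, summed back up.
  (SELECT SUM(event_count)
     FROM (SELECT user_id, COUNT(*) AS event_count
           FROM user_events
           GROUP BY user_id) ec) AS preaggregated_events;
-- For the sample data: source_events 5, direct_join_rows 12, preaggregated_events 5
```

The pre-aggregated total matches the source, while the direct join reports 12 rows for 5 real events.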

The other two patterns

Bridge / junction table. The canonical relational answer resolves M:N into two 1:N joins: students JOIN enrollments JOIN courses. Each hop is one-to-many, so no explosion.
ARRAY_AGG(DISTINCT ...) to flatten tags alongside a metric. This de-duplicates the output, but the engine still materialises the exploded row set internally, so on large tables treat it as a last resort rather than a fix.
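
A minimal sketch of the bridge pattern, using the students, enrollments and courses tables named above. The column names (student_id, course_id, name, title) and the sample rows are assumptions for illustration. The composite primary key on enrollments guarantees one row per student and course pair, which is what keeps each hop one-to-many.

```sql
CREATE TABLE students (student_id INT PRIMARY KEY, name  VARCHAR(50));
CREATE TABLE courses  (course_id  INT PRIMARY KEY, title VARCHAR(50));

-- Bridge table: one row per (student, course) pair, enforced by the key.
CREATE TABLE enrollments (
  student_id INT NOT NULL REFERENCES students (student_id),
  course_id  INT NOT NULL REFERENCES courses (course_id),
  PRIMARY KEY (student_id, course_id)
);

INSERT INTO students    VALUES (1, 'Ada'), (2, 'Linus');
INSERT INTO courses     VALUES (10, 'SQL'), (20, 'Statistics');
INSERT INTO enrollments VALUES (1, 10), (1, 20), (2, 10);

-- Each join matches the bridge's foreign key to a unique key on the other
-- table, so the result has exactly one row per enrollment.
SELECT s.name, c.title
FROM students s
JOIN enrollments e ON e.student_id = s.student_id
JOIN courses c     ON c.course_id  = e.course_id
ORDER BY s.name, c.title;
-- Returns 3 rows, one per enrollment:
--   Ada, SQL / Ada, Statistics / Linus, SQL
```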

To catch the trap in code review, run a grain check before merging any new join:

SELECT user_id FROM user_tags   GROUP BY user_id HAVING COUNT(*) > 1 LIMIT 1;
SELECT user_id FROM user_events GROUP BY user_id HAVING COUNT(*) > 1 LIMIT 1;

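If both queries return a row, the key is non-unique on both sides and you must pre-aggregate or bridge. If at most one returns a row, the join is one-to-many at worst, and rows fan out only to the grain of the non-unique side.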
