How do you safely join two tables in a many-to-many relationship without creating a row explosion?
Many-to-many joins produce a Cartesian product of each matching subset, multiplying row counts exponentially. The correct approach is to pre-aggregate at least one side to a unique grain before joining, or to use a bridge/junction table that resolves the relationship into two one-to-many joins.
How to think about it
Many-to-many joins are the most dangerous pattern in analytics SQL. A direct join between two large non-unique tables silently explodes row counts — the result looks plausible until you compare totals to source system numbers.
The interviewer wants to hear you recognize the grain problem before proposing a solution. Lead with: “I always check whether the join key is unique on each side before writing the JOIN.”
Why it explodes
Suppose user_tags maps users to tags (one user can have many tags) and user_events records user actions (one user can have many events). Joining directly on user_id:
user_tags rows for user 42: 5
user_events rows for user 42: 200
Result rows for user 42: 5 x 200 = 1,000
With 1 M users, this is catastrophic.
Pattern 1: pre-aggregate one side first
-- Aggregate events to user grain BEFORE joining tags
WITH user_event_counts AS (
SELECT user_id, COUNT(*) AS event_count
FROM user_events
GROUP BY user_id
)
SELECT ut.user_id, ut.tag, uec.event_count
FROM user_tags ut
JOIN user_event_counts uec ON ut.user_id = uec.user_id;
Now user_event_counts has one row per user — no fan-out.
Pattern 2: bridge/junction table
The canonical relational answer is a junction table that breaks the M:N into two 1:N relationships:
-- students and courses: many-to-many resolved through enrollments
SELECT s.name, c.title
FROM students s
JOIN enrollments e ON s.id = e.student_id
JOIN courses c ON e.course_id = c.id;
enrollments is the bridge. Each join is one-to-many, and no row explosion occurs.
Pattern 3: DISTINCT or ARRAY_AGG to flatten
When you need tags alongside a user metric, aggregate the tags:
SELECT e.user_id,
COUNT(*) AS event_count,
ARRAY_AGG(DISTINCT t.tag) AS tags
FROM user_events e
JOIN user_tags t ON e.user_id = t.user_id
GROUP BY e.user_id;
This de-duplicates at the output level rather than preventing fan-out at the join level — it still materialises the exploded set internally, so it is less efficient than pre-aggregation for large tables.
See the explosion — then fix it
Run the broken query below. User 1 has 3 tags and 2 events → 6 result rows. The event_count will be double what it should be. Then try the corrected pre-aggregation approach.
How to detect a many-to-many join in a code review
-- Quick grain check: if any user_id appears more than once on either side, you have M:N
SELECT user_id, COUNT(*) FROM user_tags GROUP BY user_id HAVING COUNT(*) > 1 LIMIT 1;
SELECT user_id, COUNT(*) FROM user_events GROUP BY user_id HAVING COUNT(*) > 1 LIMIT 1;
Run these two checks before merging any new join into a production pipeline. If either returns rows, you must pre-aggregate before joining.
The key insight
The safest mental model: every join key must be unique on at least one side. If it isn’t, you need a bridge table (Pattern 2) or pre-aggregation (Pattern 1). Pattern 3 — DISTINCT or ARRAY_AGG — still materialises the full Cartesian product internally, so it is a last resort, not a fix.