When should you use EXISTS, IN, or a JOIN for a semi-join, and what are the NULL-safety differences?
EXISTS short-circuits as soon as one match is found and is NULL-safe; IN loads the full subquery result and returns no rows when the list contains a NULL; a JOIN can multiply rows if the right side has duplicates. For large datasets, EXISTS or a deduplicated JOIN is generally safest.
How to think about it
All three can answer the question “does a matching row exist?” — but they differ in how they handle NULLs and duplicates. The interviewer is almost certainly going to push on the NULL edge case, so make sure you cover it.
Building up each approach
Start with a concrete scenario: find all orders placed by VIP customers.
IN — loads the subquery into a set, then checks membership for each outer row.
EXISTS — for each outer row, runs the subquery and stops at the first match (short-circuits). It never cares how many matches there are.
JOIN with DISTINCT — joins the tables and then deduplicates to avoid row multiplication if a customer appears multiple times in the VIP table.
The NULL trap with NOT IN
This is where most people stumble. If the subquery returns even one NULL value, NOT IN returns zero rows — silently and without any error. Here is why: SQL uses three-valued logic. x NOT IN (1, 2, NULL) evaluates as x <> 1 AND x <> 2 AND x <> NULL. That last comparison is UNKNOWN, and TRUE AND UNKNOWN = UNKNOWN, so no row passes.
Choosing between them
- Small, stable subquery result with no NULLs —
INis fine and reads cleanly. - Large subquery or early termination matters — prefer
EXISTS. - Need columns from both sides — use
JOIN, but addDISTINCTor aGROUP BYif the right side is not unique on the join key. - Negative filter (
NOT IN/NOT EXISTS) — always useNOT EXISTS. Never useNOT INagainst a column that could contain NULLs.