SQL · Medium · Asked at Google · Asked at Amazon

What is a semi-join and how does it differ from an INNER JOIN in terms of output and performance?

For: Data Analyst, Data Engineer, Data Scientist

The short answer

A semi-join returns each row from the left table at most once when at least one match exists on the right, without returning any columns from the right table. An INNER JOIN can duplicate left rows when the right side has multiple matches. In SQL, semi-joins are written with EXISTS or IN subqueries.

How to think about it

“Semi-join” is a relational-algebra term that SQL expresses through EXISTS and IN. The interviewer is checking: when you only need to know whether a match exists, do you reach for EXISTS/IN, or do you write an INNER JOIN and patch the duplicates with DISTINCT? Naming the concept — and the early-termination reason it matters — signals you understand the relational model, not just the syntax.

A worked example — the duplication, then the fix

An INNER JOIN returns one row per match, so a customer with two paid orders shows up twice:

SELECT c.id, c.name
FROM customers c
JOIN orders o ON c.id = o.customer_id
WHERE o.status = 'paid';

id	name
1	Aarav
1	Aarav
2	Bea

Aarav is doubled because he has two paid orders — and if you only wanted the list of customers who’ve paid, that’s a correctness bug. The semi-join via EXISTS returns each left row at most once, regardless of how many matches exist on the right:

SELECT c.id, c.name
FROM customers c
WHERE EXISTS (
  SELECT 1 FROM orders o
  WHERE o.customer_id = c.id AND o.status = 'paid'
);

id	name
1	Aarav
2	Bea

One Aarav, one Bea. Chen, a third customer, is absent because his only order is pending. No DISTINCT is needed, because EXISTS never multiplies the left row in the first place.
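To reproduce both results, sample data along the following lines would work. The source does not give the schema, so the column types and order ids here are assumptions chosen only to match the outputs shown.

```sql
-- Hypothetical schema and rows, chosen to reproduce the outputs above.
CREATE TABLE customers (
  id   INTEGER PRIMARY KEY,
  name VARCHAR(50) NOT NULL
);

CREATE TABLE orders (
  id          INTEGER PRIMARY KEY,
  customer_id INTEGER NOT NULL REFERENCES customers (id),
  status      VARCHAR(20) NOT NULL
);

INSERT INTO customers (id, name) VALUES
  (1, 'Aarav'),
  (2, 'Bea'),
  (3, 'Chen');

INSERT INTO orders (id, customer_id, status) VALUES
  (101, 1, 'paid'),     -- Aarav's first paid order
  (102, 1, 'paid'),     -- Aarav's second paid order (source of the duplicate)
  (103, 2, 'paid'),     -- Bea's only order
  (104, 3, 'pending');  -- Chen has no paid order
```

With these rows the INNER JOIN query returns three rows and the EXISTS query returns two.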

IN, NULLs, and performance

IN (SELECT customer_id FROM orders WHERE status = 'paid') gives the same result here. The NULL trap appears when you negate: x NOT IN (1, NULL) is never TRUE (it is FALSE when x = 1 and UNKNOWN otherwise), so a single NULL in the subquery column makes NOT IN return no rows. Plain IN still returns the rows that match, but NOT EXISTS is the safer habit when the subquery column is nullable. On performance, modern optimisers generally recognise both EXISTS and IN and turn them into semi-join plans that can stop probing the right table for a given left row as soon as one match is found, a real saving when each left row has many matches. The INNER JOIN + DISTINCT alternative returns the same rows when the selected columns uniquely identify a customer, but unless the optimiser rewrites it, it builds the full duplicated result before deduplicating.
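The NULL trap is easiest to see with NOT IN. The sketch below uses hypothetical inline data in which one paid order has a NULL customer_id, and asks for customers with no paid order in two ways. The CTE-with-VALUES syntax is the form PostgreSQL and SQLite accept; other engines may need small changes.

```sql
-- Inline sample data (hypothetical); the last order has a NULL customer_id.
WITH customers(id, name) AS (
  VALUES (1, 'Aarav'), (2, 'Bea'), (3, 'Chen')
),
orders(customer_id, status) AS (
  VALUES (1, 'paid'), (1, 'paid'), (2, 'paid'), (3, 'pending'), (NULL, 'paid')
)
-- NOT IN: returns no rows at all.
-- For Chen, 3 NOT IN (1, 1, 2, NULL) is UNKNOWN, so the row is filtered out.
SELECT c.id, c.name
FROM customers c
WHERE c.id NOT IN (
  SELECT o.customer_id FROM orders o WHERE o.status = 'paid'
);

-- Same data, same question, written with NOT EXISTS.
WITH customers(id, name) AS (
  VALUES (1, 'Aarav'), (2, 'Bea'), (3, 'Chen')
),
orders(customer_id, status) AS (
  VALUES (1, 'paid'), (1, 'paid'), (2, 'paid'), (3, 'pending'), (NULL, 'paid')
)
-- NOT EXISTS: returns (3, 'Chen').
-- The NULL row never satisfies o.customer_id = c.id, so it is simply ignored.
SELECT c.id, c.name
FROM customers c
WHERE NOT EXISTS (
  SELECT 1 FROM orders o
  WHERE o.customer_id = c.id AND o.status = 'paid'
);
```

Filtering the subquery with WHERE customer_id IS NOT NULL would also make NOT IN behave, but NOT EXISTS avoids having to remember that.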

Need	Pattern
Columns from both tables	`INNER JOIN`
Existence check, one row per left row	`EXISTS` / `IN` (semi-join)
Non-existence check	`NOT EXISTS` / `LEFT JOIN … IS NULL`
Aggregate from the right side	`JOIN` + `GROUP BY`, or a pre-aggregated CTE
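
The last two rows of the table can be sketched as follows, again with hypothetical inline data (PostgreSQL/SQLite syntax) matching the worked example.

```sql
-- Non-existence check via LEFT JOIN … IS NULL: customers with no paid order.
WITH customers(id, name) AS (
  VALUES (1, 'Aarav'), (2, 'Bea'), (3, 'Chen')
),
orders(customer_id, status) AS (
  VALUES (1, 'paid'), (1, 'paid'), (2, 'paid'), (3, 'pending')
)
-- The status filter sits in ON, not WHERE, so unmatched customers survive
-- the join with NULLs on the right side. Returns (3, 'Chen').
SELECT c.id, c.name
FROM customers c
LEFT JOIN orders o
  ON o.customer_id = c.id AND o.status = 'paid'
WHERE o.customer_id IS NULL;

-- Aggregate from the right side: paid-order count per customer.
WITH customers(id, name) AS (
  VALUES (1, 'Aarav'), (2, 'Bea'), (3, 'Chen')
),
orders(customer_id, status) AS (
  VALUES (1, 'paid'), (1, 'paid'), (2, 'paid'), (3, 'pending')
)
-- Here the row multiplication is wanted: each match feeds the COUNT.
-- Returns (1, 'Aarav', 2) and (2, 'Bea', 1).
SELECT c.id, c.name, COUNT(*) AS paid_orders
FROM customers c
JOIN orders o ON o.customer_id = c.id
WHERE o.status = 'paid'
GROUP BY c.id, c.name
ORDER BY c.id;
```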
