datarekha
SQL Medium Asked at GoogleAsked at Amazon

What is a semi-join and how does it differ from an INNER JOIN in terms of output and performance?

The short answer

A semi-join returns each row from the left table at most once when at least one match exists on the right, without returning any columns from the right table. An INNER JOIN can duplicate left rows when the right side has multiple matches. In SQL, semi-joins are written with EXISTS or IN subqueries.

How to think about it

“Semi-join” is a relational-algebra concept that SQL expresses through EXISTS and IN subqueries. Knowing the term — and the reason it matters — signals a level of rigour that interviewers notice.

What is really being tested

The interviewer wants to know: when you only need to check whether a match exists, do you reach for EXISTS/IN, or do you blindly write an INNER JOIN and add DISTINCT to patch it? The semi-join approach is not just idiomatic — it can stop scanning the right table as soon as one match is found.

The row-duplication problem with INNER JOIN

-- INNER JOIN: customer 42 appears 3 times if they have 3 paid orders
SELECT c.customer_id, c.name
FROM customers c
JOIN orders o ON c.id = o.customer_id
WHERE o.status = 'paid';

If a customer has three paid orders, they appear three times. You need DISTINCT to fix that — but that forces the engine to materialise all duplicates before deduplicating.

The semi-join approach: one row per customer, always

-- EXISTS: each customer appears exactly once
SELECT c.customer_id, c.name
FROM customers c
WHERE EXISTS (
  SELECT 1 FROM orders o
  WHERE o.customer_id = c.id
    AND o.status = 'paid'
);

Run the playground below and compare — notice how the customer with two orders only appears once with EXISTS:

IN is equivalent (with a NULL caveat)

SELECT customer_id, name
FROM customers
WHERE id IN (
  SELECT customer_id FROM orders WHERE status = 'paid'
);

IN and EXISTS produce the same result here. The difference: if orders.customer_id can be NULL, IN with a NULL in the subquery returns no rows at all (because x IN (1, NULL) evaluates to UNKNOWN, not TRUE). EXISTS is safer when NULLs are possible.

Performance characteristics

Modern optimisers (PostgreSQL, SQL Server, BigQuery) recognise EXISTS/IN patterns and convert them to hash or merge semi-join plans internally. These plans can short-circuit — stop scanning the right table the moment one match is found — which is a meaningful saving when matches are rare.

The INNER JOIN + DISTINCT pattern achieves the same logical result but forces the engine to build the full join result before deduplicating:

-- Works but less efficient: join first, then deduplicate
SELECT DISTINCT c.customer_id, c.name
FROM customers c
JOIN orders o ON c.id = o.customer_id
WHERE o.status = 'paid';

When to use each

NeedPattern
Columns from both tablesINNER JOIN
Existence check only, one row per left rowEXISTS / IN (semi-join)
Non-existence checkNOT EXISTS / LEFT JOIN … IS NULL
Aggregate from right sideJOIN with GROUP BY or pre-aggregate CTE

Naming “semi-join” explicitly signals to an interviewer that you understand the relational model, not just SQL syntax. Pair it with the early-termination performance note and you have a complete answer.

Learn it properly INNER JOIN

Keep practising

All SQL questions

Explore further

Skip to content