NULL is not a value, and that wrecks your WHERE clause

A data analyst I worked with once spent two days debugging a churn report. The numbers were off by roughly 12 percent. The query looked correct. The joins were right. The date filters were right. The problem was a single WHERE status != 'churned' on a column that was NULL for about 12 percent of users — users who had signed up but never completed onboarding. The filter did not include them. It did not exclude them. It silently dropped them, the way a sieve drops water but keeps the rocks, except nobody told her the sieve was there.

That is what NULL does when you do not know what NULL is.

NULL is not a value. It is a confession of ignorance.

Every other thing you store in a database is a value. The integer 0 is a value. The empty string '' is a value. The boolean false is a value. NULL is none of these. NULL is the database saying: I do not know what this is. Maybe the data never arrived. Maybe it was deliberately omitted. Maybe it is inapplicable — like the maiden name of a person who was never married. The database does not distinguish between these cases. It stores a single marker: unknown.

This matters enormously because of what happens when you do arithmetic or comparison with the unknown.

What is 5 + NULL? Unknown. If you do not know what the second operand is, you cannot know what the sum is.

What is NULL = NULL? Also unknown. You have two unknowns. They might be equal, or they might not be. You cannot tell. This is not a philosophical stance — it is the formal definition in the SQL standard, and every major database engine (Postgres, MySQL, BigQuery, Snowflake, DuckDB) implements it exactly this way.

The consequence is that SQL operates on three-valued logic: a predicate evaluates to TRUE, FALSE, or UNKNOWN. A WHERE clause keeps only the rows where the predicate is TRUE. Rows where the predicate is FALSE are filtered out, as expected. Rows where the predicate is UNKNOWN are also filtered out — silently, with no error, no warning, no indication in the result set that anything was dropped.

That is the sieve.

Three-valued logic: equality comparisons with NULL always yield UNKNOWN, which WHERE silently discards.

`WHERE x = NULL` never works. Ever.

The correct syntax is WHERE x IS NULL. The distinction is not stylistic. WHERE x = NULL produces UNKNOWN for every row — because comparing anything to the unknown is unknown — so no rows pass the filter. The query does not error. It returns zero rows, and if you are not watching for this, you conclude the column has no NULLs.

IS NULL and IS NOT NULL are special predicates that operate outside three-valued logic. They are explicitly designed to answer the question “does this cell hold the unknown marker?” and they return ordinary TRUE or FALSE. They are the only safe way to test for NULL.

The same logic applies to negation. WHERE x != 'churned' does not match rows where x is NULL. UNKNOWN negated is still UNKNOWN. If you want “rows that are either not churned or have no status at all,” you write WHERE x != 'churned' OR x IS NULL. This is not a workaround. It is the correct statement of the intent.

The NOT IN trap that silences entire queries

This is the behavior that surprises senior engineers who have been writing SQL for years.

Suppose you have a users table and a banned_users table. You want all users who are not banned.

SELECT id FROM users
WHERE id NOT IN (SELECT user_id FROM banned_users);

This looks correct. It is correct — unless banned_users.user_id contains even a single NULL. If it does, the query returns zero rows. Not fewer rows. Zero.

Here is why. NOT IN is syntactic sugar for a chain of inequality checks: id != val1 AND id != val2 AND ... AND id != valN. If any valN is NULL, then id != NULL evaluates to UNKNOWN. TRUE AND UNKNOWN is UNKNOWN. So the entire chain collapses to UNKNOWN for every row. The WHERE clause drops every row.

The fix is one of two things. Either filter NULLs out of the subquery explicitly:

WHERE id NOT IN (SELECT user_id FROM banned_users WHERE user_id IS NOT NULL)

Or, better in most cases, rewrite as NOT EXISTS:

SELECT id FROM users u
WHERE NOT EXISTS (SELECT 1 FROM banned_users b WHERE b.user_id = u.id);

NOT EXISTS uses a correlated subquery (one that references the outer query’s current row) and handles NULLs correctly because it tests for the existence of a matching row, not equality of values. It has no equivalent NULL propagation problem.

This is one of those places where SQL’s three-valued logic is genuinely surprising even to practitioners who know it abstractly. Knowing the rule and recognizing its manifestation in a real query are different skills.

COUNT treats NULL differently depending on what you count

COUNT(*) counts rows. COUNT(col) counts non-NULL values in that column. These are not equivalent, and the difference is meaningful.

If a table has 1,000 rows and the email column is NULL for 80 of them, then COUNT(*) returns 1,000 and COUNT(email) returns 920. Neither number is wrong. They answer different questions: how many rows exist versus how many rows have a known email address.

The problem arises when analysts conflate them. A dashboard metric that uses COUNT(email) to report “active users” quietly excludes 8 percent of users every time, with no indication in the query or the output that anything is being excluded.

The same applies to SUM, AVG, MIN, MAX. All of them ignore NULLs. AVG(col) divides the sum of non-NULL values by the count of non-NULL values — not by the total row count. If you are computing an average and NULL means “the event did not happen” (say, days-to-first-purchase for users who never bought), then AVG will silently give you the average only among purchasers. Whether that is the right number depends on the question. The trap is not knowing the choice is being made.

COALESCE(col, 0) — which returns the first non-NULL argument — lets you substitute a default before aggregating. COALESCE(revenue, 0) treats missing revenue as zero. Whether zero is the right substitute for unknown depends entirely on what NULL means in context.

A WHERE clause on a NULLable column silently drops all NULL rows — no error, no count, no indication.

Joins and NULL: the LEFT JOIN tax

When you LEFT JOIN (a join that keeps all rows from the left table regardless of whether a match exists on the right), unmatched rows produce NULL in every column from the right table. This is by design and it is useful. But it means that any subsequent WHERE filter applied to a right-table column quietly converts your LEFT JOIN into an INNER JOIN (which keeps only matched rows).

SELECT u.id, o.total
FROM users u
LEFT JOIN orders o ON u.id = o.user_id
WHERE o.total > 100;

The LEFT JOIN was meant to include users with no orders. But WHERE o.total > 100 evaluates to UNKNOWN when o.total is NULL (because the user has no orders), so those rows vanish. The result is identical to an INNER JOIN filtered to orders over 100.

The intent and the behavior diverge silently. The fix is to put the filter in the JOIN condition, not in WHERE:

LEFT JOIN orders o ON u.id = o.user_id AND o.total > 100

Now the join keeps all users, attaches order data only when both the ID matches and the total exceeds 100, and leaves NULLs for users with no qualifying orders. Different semantics. Often the right ones.

NULL-safe equality: when you need `IS DISTINCT FROM`

Sometimes you genuinely want NULL to equal NULL. Configuration diffing is a common case: you have a current config table and a previous config table, and you want to find columns that changed. If both old and new values are NULL, the column did not change. But old_val = new_val is UNKNOWN when both are NULL, so the diffing query marks every NULL-to-NULL transition as a change.

The SQL standard provides IS DISTINCT FROM for this. a IS DISTINCT FROM b returns FALSE when both are NULL (they are not distinct), TRUE when one is NULL and the other is not, and the normal comparison result otherwise. It is NULL-safe. It returns only TRUE or FALSE, never UNKNOWN.

Not every database implemented it identically or early. Postgres has had it for years. BigQuery supports it. MySQL added it in 8.0 as <=> (the spaceship operator), which is NULL-safe equality rather than NULL-safe inequality. DuckDB supports both IS DISTINCT FROM and IS NOT DISTINCT FROM. The portability story is good now, but worth checking for older MySQL.

What this changes about how you write queries

The practical posture is: treat any nullable column as a three-state variable, not a two-state one. Before filtering, ask what NULLs in this column mean. Before aggregating, ask whether NULLs should be zero, excluded, or counted separately. Before using NOT IN, check whether the subquery can contain NULLs.

None of this requires you to purge NULLs from your schema. NULLs are legitimate. They encode real-world ignorance that two-valued types cannot express. A table with a cancelled_at timestamp is cleaner than a is_cancelled boolean plus a cancelled_at that is coerced to a dummy date when the order is still active. The NULL-bearing schema is more honest.

But honesty requires you to read the contract. The contract is: I stored unknown here, and your query will encounter that unknown at runtime, and if you compare it with =, you will get UNKNOWN back, and your WHERE clause will quietly discard the row.

The analyst who spent two days debugging that churn report eventually found the issue. She added OR status IS NULL to the filter. The numbers matched. Twelve percent of users reappeared. Two days to learn that NULL is not a value — it is an epistemological category that SQL takes more seriously than most practitioners do.

It is worth knowing before the dashboard ships.