What is the difference between COUNT(*), COUNT(column), and COUNT(DISTINCT column)?
COUNT(*) counts every row including those with NULLs. COUNT(column) counts only rows where that column is non-NULL. COUNT(DISTINCT column) counts unique non-NULL values in the column.
How to think about it
This looks like a basic question but it’s a reliable filter for candidates who understand NULLs. The interviewer wants to see whether you can reason through a concrete example with NULLs and duplicates — not just recite three definitions.
The setup: what each form actually counts
Think of a users table where some users haven’t provided an email, and some share an email (family accounts, test data, etc.).
COUNT(*)— counts every row in the table, period. Doesn’t care about NULLs in any column.COUNT(email)— counts only rows whereemailis not NULL. Good for data quality checks.COUNT(DISTINCT email)— counts unique, non-NULL email values. Good for cardinality: “how many distinct users?”
Try it with real data
When to reach for each
COUNT(*)— total row count; pagination totals; cardinality checks.COUNT(col)— data completeness: “how many rows actually have this field filled in?”COUNT(DISTINCT col)— dimension cardinality: “how many unique users clicked?” This is the standard funnel analysis form.
The approximate version for large tables
COUNT(DISTINCT col) requires the engine to track every unique value it sees — expensive at scale. BigQuery and Redshift offer approximate variants:
-- BigQuery
APPROX_COUNT_DISTINCT(user_id)
-- Redshift
HLL_COUNT.MERGE(HLL_COUNT.INIT(user_id))
These trade a small error margin (typically under 2%) for dramatically better performance on billions of rows.