Indexes: the data structure your WHERE clause is begging for

A team at a payments company ran a query every night to pull all transactions over $10,000 for compliance reporting. On Monday it took 12 milliseconds. By Thursday it was 4 seconds. By the following month, 38 seconds. No schema change, no new code, no infrastructure incident. Just more rows.

The query had always done a full-table scan (reading every single row in the table, start to finish). Early on, that table had 200,000 rows. Twelve months later it had 80 million. The scan cost grew linearly with the data. Nobody noticed until the nightly job started timing out.

One CREATE INDEX later: 11 milliseconds, flat, regardless of table size.

This is not a war story about bad engineers. It is a story about what a database actually does when you give it a WHERE clause and no index to guide it.

The honest cost of a full-table scan

When you write WHERE amount > 10000, the database needs to find matching rows. If it has no structural hint about where those rows live, it does the only rational thing: it reads every row and checks the condition. Every. Single. One. This is a full-table scan, and its time complexity is O(n) — linear in the number of rows.

For 10,000 rows that is fine. For 10 million rows it is painful. For 100 million rows it will eat your query budget before breakfast.

The database is not being lazy or stupid. It is making the only move available to it given the information at hand. If every row might satisfy the condition, the engine must inspect every row. An index is how you give it better information.

What an index actually is

An index is a separate, sorted data structure — usually a B-tree (a balanced tree where every leaf is equidistant from the root) — that the database maintains alongside your table. It stores the indexed column values in sorted order, each pointing back to the physical row in the main table.

The key word is sorted. Sorting is what makes binary search possible. In a sorted structure of n entries, you can locate any value in O(log n) steps by repeatedly halving the search space. A B-tree with 80 million rows has a height of roughly 27 levels (log base 2 of 80 million). To find all rows matching amount > 10000, the database descends 27 levels to the first matching leaf, then reads forward along the leaf level — touching only matching rows, never scanning irrelevant ones.

That is the entire secret. Not magic, not caching, not clever hardware. Just sorting.

Left: a full-table scan touches every row to find two matches. Right: a B-tree index descends three levels and lands exactly on the matching leaves — the rest of the table is invisible to the query.

Why writes cost more with an index

Every time you insert a row, the database must also update the index. The B-tree must stay sorted and balanced. For a single index on a low-traffic table, this overhead is negligible. For a table with eight indexes receiving ten thousand inserts per second, the write amplification is real and measurable.

This is why the reflex to “index everything” is wrong. Indexes are a deliberate trade: you pay higher write costs and consume extra storage in exchange for faster reads. A write-heavy table with rarely-queried columns is a bad candidate for indexing. An analytics table that gets bulk-loaded daily and then queried ten thousand times is a great candidate.

The decision is always: what is the read-to-write ratio for this column, and how selective is the data?

Selectivity: the metric that determines whether an index is worth it

Selectivity (the fraction of rows a condition eliminates) is the single most important factor in index usefulness. A highly selective condition — user_id = 'abc123' returning one row from a million — benefits enormously from an index. A low-selectivity condition — status = 'active' returning 80 percent of rows — may not.

When the optimizer estimates that it would need to fetch 40 percent of the table via the index, it does the math and often concludes that a full scan is cheaper. Random I/O through the index (one page fetch per matching row, scattered across disk) can be slower than sequential I/O through the whole table, especially on spinning disks. The query planner has access to statistics about your data and will sometimes ignore your index, not out of stubbornness, but because the math said no.

This catches people off guard. You built the index. The query planner did not use it. The fix is usually data architecture: add a more discriminating column, use a partial index, or reconsider the query shape.

Three ways your index gets silently bypassed

Applying a function to the indexed column. If your index is on created_at and you write WHERE YEAR(created_at) = 2025, the optimizer cannot use the index. The function transforms the column value before comparison, so the sorted structure is useless. The fix is to rewrite the condition as a range: WHERE created_at >= '2025-01-01' AND created_at < '2026-01-01'. Same semantics, index-friendly shape.

Leading-wildcard LIKE patterns. WHERE name LIKE '%smith' cannot use a standard B-tree index because the unknown prefix means the database cannot locate a starting point in the sorted structure. It must scan everything. WHERE name LIKE 'smith%' can use an index because the prefix is fixed and the tree can seek to it. If you genuinely need arbitrary substring search, full-text indexes or dedicated search infrastructure (Elasticsearch, Postgres tsvector) are the right tools.

Low selectivity on a non-partial index. As discussed above, if the optimizer estimates it will touch more than a threshold fraction of the table (typically somewhere between 10 and 40 percent, varying by engine and storage), it will abandon the index entirely. A partial index — one that only indexes rows matching a specific condition — can rescue this case. An index on status = 'pending' that only contains the 2 percent of rows actually in that state is far more selective than a full-column index on status.

JOIN columns are WHERE clauses in disguise

A JOIN condition — ON orders.user_id = users.id — is logically identical to a WHERE clause at query time. The optimizer needs to match rows from two tables, and if neither side has an index on the join column, it either scans both tables (nested loop, O(n times m)), builds an in-memory hash table, or performs a sort-merge join. All of these are more expensive than a B-tree lookup.

Foreign keys, by definition, should almost always be indexed. Primary keys always are (the engine enforces uniqueness via an index). The gap is the foreign key column on the referencing table. Many ORMs create foreign key constraints without creating indexes. This is a common silent performance killer: your schema is correct, the data is consistent, but every join on that column is a scan.

ORDER BY and the index as a free sort

A B-tree index is already sorted. If your query ends with ORDER BY amount DESC and you have an index on amount, the database can simply traverse the index in reverse order and collect rows without an explicit sort step. Sorts are O(n log n); traversing an already-sorted structure is O(k) where k is the number of rows returned.

Composite indexes — indexes on multiple columns — extend this. An index on (user_id, created_at) satisfies a query that filters on user_id and sorts by created_at in a single pass. The catch: the index is only usable for the leftmost prefix of its columns. An index on (a, b, c) helps queries filtering on a, on (a, b), or on (a, b, c). It does not help a query filtering only on b or only on c. This is called the leftmost prefix rule, and forgetting it is how developers end up confused about why their composite index went unused.

Reading the query planner instead of guessing

Every serious database engine exposes a query planner. EXPLAIN in Postgres and MySQL, EXPLAIN ANALYZE for actual runtime statistics. The output shows whether the engine chose an index scan, a sequential scan, a hash join, or a sort. If you are unsure whether your index is working, stop guessing and run EXPLAIN.

The output is dense, but the key signals are simple: look for Seq Scan (sequential — no index) versus Index Scan or Bitmap Index Scan (index used), and look for the row count estimates. If the planner’s estimated row count is wildly wrong relative to the actual count, the table statistics are stale and need ANALYZE or its equivalent.

Index design without reading the planner is like tuning a car engine by listening to the exhaust. You might get lucky. You will mostly be guessing.

The intuition to carry forward

An index does one thing: it replaces a linear search with a logarithmic one. Every benefit flows from that — faster WHERE clauses, faster JOINs, free sorting. Every cost flows from it too — write overhead, storage, optimizer complexity.

The engineers who use indexes well do not reach for them automatically. They think about the query shape first: what condition filters most aggressively, where the reads are, what the write rate looks like, whether the data is selective enough to reward the trade. Then they verify with the query planner.

The payments company’s 4-second query was not a bug. It was a linear scan on a table that had grown large enough to make the linearity matter. The index did not change the data. It changed the search strategy — and that is the whole story.