EXPLAIN: reading the plan your database hands you

A team I know spent two weeks tuning a Python pipeline that was slow. They rewrote the aggregation logic, batched the API calls, switched serialization formats. The pipeline was still slow. When someone finally ran EXPLAIN ANALYZE on the main query driving it, the first line read: Seq Scan on events (cost=0.00..4182934.00 rows=89234871 width=48). The database was reading 89 million rows from disk, sequentially, every time the job ran. One index, added in thirty seconds, dropped the query from 47 seconds to 340 milliseconds.

Two weeks of the wrong work, because nobody looked at the plan.

What a query plan actually is

When you submit a SQL query, the database does not execute it directly. It compiles it. The SQL string is parsed into a parse tree, then a planner (sometimes called the optimizer) rewrites that tree into a physical execution plan — a concrete sequence of operations that says exactly how data will be moved, filtered, joined, and sorted. EXPLAIN shows you that plan. EXPLAIN ANALYZE runs the query for real and overlays actual timing and row counts on top of the planner’s estimates.

The plan is a tree of nodes. Each node is one operation. The leaves touch storage. Inner nodes combine what the leaves return. The root node produces the final result set. Execution flows from leaves up to the root, which means you read the plan from the bottom up — the bottom is where the work begins.

Every node carries two numbers that matter: cost and rows. Cost is a unit-less number the planner uses internally to pick between plan alternatives — higher means the planner thinks this path is more expensive. Rows is the planner’s estimate of how many rows will emerge from that node. When the estimate is wildly off from the actual (which EXPLAIN ANALYZE shows you), the planner made decisions based on bad information, and that is almost always a sign that table statistics are stale or that you have a data distribution the planner did not anticipate.

The three red flags

There are dozens of node types, but three patterns cause the majority of slow queries in practice.

Sequential scan on a large table. A sequential scan (Seq Scan in Postgres) means the database is reading every page of the table from disk, in order, regardless of how many rows you actually need. For a table with a few thousand rows, this is fine — it is often faster than an index lookup because the overhead of navigating a B-tree index is not worth it at small scale. For a table with tens of millions of rows, a sequential scan is almost always a bug. The fix is an index on the column you are filtering or joining on. The nuance is that an index only helps if the query is selective — if your WHERE clause matches 60% of the table, the planner will ignore the index and scan anyway, correctly, because random I/O to fetch 60% of pages is slower than sequential I/O over all of them.

Nested loop join over large inputs. A nested loop join works by taking each row from the outer side and scanning the inner side for matches. Its cost is outer_rows * inner_rows. When both sides are small — say, a few hundred rows each — this is cheap and fast. When the outer side has 100,000 rows and the inner side has 50,000, the database is doing up to 5 billion comparisons. You will see this manifest as a node with an enormous estimated or actual total cost deep inside the join subtree. The fix depends on why the loop appeared: often it means a missing index on the join column of the inner relation, which would let the planner switch to an index nested loop (lookup instead of scan) or to a hash join, which builds a hash table from one side and probes it with the other at O(n + m) cost rather than O(n * m).

Sort spilling to disk. When a sort node cannot fit its working data into work_mem (Postgres’s per-query sort memory budget, default 4 MB), it writes temporary files to disk and performs an external merge sort. The plan will say Sort Method: external merge Disk: 48218kB — that number is the disk used. A sort that spills is not necessarily wrong, but if it spills gigabytes it is a strong signal either that work_mem is too low for this workload or that you are sorting far more rows than you need to, typically because a filter that could have eliminated rows earlier in the plan was applied too late.

A query plan tree. Execution flows from leaves (bottom) to root (top). The Seq Scan on 89 million rows is the true culprit — everything upstream inherits its cost.

The number the planner got wrong

The most underappreciated skill in reading a plan is comparing estimated rows to actual rows. EXPLAIN ANALYZE prints both. When the planner estimates 400 rows and the query actually returns 4,000,000, the planner built the entire downstream plan around a fantasy. It probably chose a nested loop because it thought the result set was tiny; a hash join would have been better.

Why does the planner get estimates wrong? Because its statistics are stale or coarse. Postgres collects histogram statistics on each column via ANALYZE (which autovacuum runs periodically), but if data distribution changes faster than autovacuum runs — say, you ingested 200 million rows overnight — the statistics describe the old table. The fix is ANALYZE table_name, which takes seconds and refreshes the planner’s world model. For tables with highly skewed distributions or columns that are always queried together, you can also create extended statistics (CREATE STATISTICS) to give the planner a more accurate multivariate model.

A rule of thumb: if the estimated rows and actual rows differ by more than a factor of 10, look at your statistics before anything else.

Joins are decisions, not facts

Most practitioners think of a JOIN as a declaration — “combine these two tables on this key.” The database treats it as an optimization problem with three possible algorithms: nested loop, hash join, and merge join.

A nested loop (described above) is chosen when one side is small and the other has an index on the join key. It can be very fast in that case. A hash join builds a hash table from the smaller relation, then probes it with each row from the larger one — it is the right choice when both sides are large and neither has an index. A merge join sorts both sides by the join key and then walks them together in lockstep — it is efficient when both inputs are already sorted or can be sorted cheaply, and it avoids the memory cost of a hash table.

The planner picks among these based on its row estimates. When it picks wrong, it is almost always because the estimates are off. And when the estimates are off, the dashed red path in the plan tree above — Sort feeding a Nested Loop feeding a Seq Scan — is the shape of suffering you will see.

What to actually do with the output

The workflow is: run EXPLAIN (ANALYZE, BUFFERS) in Postgres (the BUFFERS flag adds I/O statistics that are very useful). Read from the bottom up. Find the highest-cost node. Ask: is the cost here a surprise, or is it doing proportional work? If it is a Seq Scan on a large table, check whether an index exists on the filter column — if not, create one and re-run the plan. If it is a Nested Loop, check whether the inner side has an index on the join key. If the planner estimates are way off actual, run ANALYZE on the table. If a sort spills, either filter more rows before the sort (restructure the query so WHERE clauses run earlier) or increase work_mem for this session with SET work_mem = '256MB' before the query.

Most slow queries fail at exactly one of these. They rarely fail at all of them simultaneously. The plan tells you which one.

Indexes are not always the answer

The standard reflex — “query is slow, add an index” — is wrong about a third of the time. If your WHERE clause has low selectivity (the condition matches most of the table), the planner will ignore the index and scan anyway. If the table is tiny, the overhead of traversing the index is more expensive than the scan. If the query returns many columns from many rows, a heap fetch per index entry costs more than sequential reads. The plan tells you whether the planner chose your index or bypassed it; if it bypassed it, either the index is irrelevant to this query or the planner has correctly determined it would be slower.

Adding a partial index — one that indexes only the rows matching a common filter — can change the calculus entirely. A partial index on events WHERE status = 'pending' is dramatically smaller than an index on all of events, and the planner can use it precisely when it matters. Similarly, a covering index (one that includes all the columns the query touches in its index leaves) eliminates the heap fetch entirely, reducing a query that touches storage twice per row to one that touches storage once.

The intuition to carry forward

A database query planner is a cost-based optimizer that builds a plan tree using statistics about your data. It is not magic, and it is not always right. EXPLAIN makes the plan readable. EXPLAIN ANALYZE makes the plan honest — it shows what the planner believed versus what actually happened.

The practitioners who are fast at query optimization are not faster because they know more syntax. They are faster because they look at the plan first, every time, before forming any hypothesis. The plan tells you where the work is being done. Everything else is guessing.

Read the plan. Start at the bottom. Find the node that accounts for most of the cost. There is almost always one. Fix that one thing. Re-run. Repeat.

The database hands you the evidence. Most people never look at it.