Walk me through how you would systematically diagnose a slow SQL query in production.
Start by capturing the query plan with EXPLAIN ANALYZE to find the most expensive node, then check whether estimates match actuals to spot stale statistics, look for missing indexes on filter and join columns, verify the predicates are sargable, and finally check for resource contention — locks, buffer eviction, or an overwhelmed disk.
How to think about it
Diagnosing a slow query is a structured process, not guesswork. The biggest mistake candidates make is jumping straight to “add an index” — which may be exactly the wrong answer if the problem is stale statistics, a non-sargable predicate, or lock contention. Cover these layers in order.
The diagnostic process at a glance
Step 1: capture the actual plan
EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT) <your slow query>;
Never use EXPLAIN alone — estimated plans can show a perfect index scan while the real execution does a ten-second sequential scan. ANALYZE runs the query and attaches actual row counts and timing to every node.
Step 2: find the costliest node
Read the plan bottom-up. The innermost node with the highest actual time is your target.
| Node | What it usually means |
|---|---|
Seq Scan on a large table | Missing or unused index |
Hash Join with many Batches | Insufficient work_mem; spilling to disk |
Sort with external merge | Same — sort spilling to disk |
Nested Loop with large outer set | Row estimate far too low; wrong join algorithm |
Index Scan with high Heap Fetches | Non-covering index; many random I/O round-trips |
Step 3: check estimate accuracy
A large gap between estimated rows and actual rows means the planner chose the wrong algorithm. Fix stale statistics first, then re-examine the plan.
-- Refresh table statistics
ANALYZE orders;
-- For correlated columns (e.g. status + created_at are not independent):
CREATE STATISTICS orders_status_ts ON (status, created_at) FROM orders;
ANALYZE orders;
Step 4: audit predicates for sargability
“Sargable” means the predicate can use an index. Wrapping a column in a function silently disables index use.
-- These silently block index use:
WHERE YEAR(created_at) = 2024
WHERE amount::TEXT LIKE '100%'
WHERE COALESCE(region, 'US') = 'US'
-- Sargable rewrites:
WHERE created_at >= '2024-01-01' AND created_at < '2025-01-01'
WHERE amount >= 100 AND amount < 1000
WHERE (region = 'US' OR region IS NULL)
Step 5: check for lock contention
A perfectly indexed query can still queue for seconds if another transaction holds a row-level lock on the rows it needs.
-- PostgreSQL: find blocking sessions
SELECT pid, query, wait_event_type, wait_event
FROM pg_stat_activity
WHERE wait_event_type = 'Lock';
Step 6: inspect I/O and memory
-- Cache hit ratio — should be > 99% for OLTP
SELECT blks_hit,
blks_read,
round(blks_hit * 100.0 / (blks_hit + blks_read), 2) AS cache_hit_ratio
FROM pg_stat_database
WHERE datname = current_database();
A low hit ratio means data is being read from disk on most queries — consider raising shared_buffers or investigating the working set size.
Checklist summary
EXPLAIN ANALYZE— find the expensive node.- Estimates vs. actuals — fix stale statistics if divergent.
- Sargability — remove functions from filter columns.
- Index coverage — add or extend indexes for filter and join columns.
- Join order — verify the planner chose the right algorithm given actual row counts.
- Lock contention — check
pg_stat_activityfor blocking sessions. - Resource limits — check
work_memif sort/hash batches spill to disk.