SQL Hard Asked at AmazonAsked at MetaAsked at StripeAsked at UberAsked at Shopify

Walk me through how you would systematically diagnose a slow SQL query in production.

The short answer

Start by capturing the query plan with EXPLAIN ANALYZE to find the most expensive node, then check whether estimates match actuals to spot stale statistics, look for missing indexes on filter and join columns, verify the predicates are sargable, and finally check for resource contention — locks, buffer eviction, or an overwhelmed disk.

How to think about it

Diagnosing a slow query is a structured process, not guesswork. The biggest mistake candidates make is jumping straight to “add an index” — which may be exactly the wrong answer if the problem is stale statistics, a non-sargable predicate, or lock contention. Cover these layers in order.

The diagnostic process at a glance

Diagnosis pipeline

1. Capture the real plan   EXPLAIN ANALYZE (never skip ANALYZE)
        ↓
2. Find the expensive node  highest actual time, bottom-up
        ↓
3. Check estimate accuracy  estimated rows vs actual rows
        ↓
4. Audit predicate sargability  functions on filter columns?
        ↓
5. Index coverage          filter and join columns indexed?
        ↓
6. Lock contention         pg_stat_activity, blocking sessions
        ↓
7. Resource limits         work_mem, shared_buffers, cache hit ratio

Step 1: capture the actual plan

EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT) <your slow query>;

Never use EXPLAIN alone — estimated plans can show a perfect index scan while the real execution does a ten-second sequential scan. ANALYZE runs the query and attaches actual row counts and timing to every node.

Step 2: find the costliest node

Read the plan bottom-up. The innermost node with the highest actual time is your target.

Node	What it usually means
`Seq Scan` on a large table	Missing or unused index
`Hash Join` with many `Batches`	Insufficient `work_mem`; spilling to disk
`Sort` with `external merge`	Same — sort spilling to disk
`Nested Loop` with large outer set	Row estimate far too low; wrong join algorithm
`Index Scan` with high `Heap Fetches`	Non-covering index; many random I/O round-trips

Step 3: check estimate accuracy

A large gap between estimated rows and actual rows means the planner chose the wrong algorithm. Fix stale statistics first, then re-examine the plan.

-- Refresh table statistics
ANALYZE orders;

-- For correlated columns (e.g. status + created_at are not independent):
CREATE STATISTICS orders_status_ts ON (status, created_at) FROM orders;
ANALYZE orders;

Step 4: audit predicates for sargability

“Sargable” means the predicate can use an index. Wrapping a column in a function silently disables index use.

-- These silently block index use:
WHERE YEAR(created_at) = 2024
WHERE amount::TEXT LIKE '100%'
WHERE COALESCE(region, 'US') = 'US'

-- Sargable rewrites:
WHERE created_at >= '2024-01-01' AND created_at < '2025-01-01'
WHERE amount >= 100 AND amount < 1000
WHERE (region = 'US' OR region IS NULL)

Step 5: check for lock contention

A perfectly indexed query can still queue for seconds if another transaction holds a row-level lock on the rows it needs.

-- PostgreSQL: find blocking sessions
SELECT pid, query, wait_event_type, wait_event
FROM pg_stat_activity
WHERE wait_event_type = 'Lock';

Step 6: inspect I/O and memory

-- Cache hit ratio — should be > 99% for OLTP
SELECT blks_hit,
       blks_read,
       round(blks_hit * 100.0 / (blks_hit + blks_read), 2) AS cache_hit_ratio
FROM pg_stat_database
WHERE datname = current_database();

A low hit ratio means data is being read from disk on most queries — consider raising shared_buffers or investigating the working set size.

Checklist summary

EXPLAIN ANALYZE — find the expensive node.
Estimates vs. actuals — fix stale statistics if divergent.
Sargability — remove functions from filter columns.
Index coverage — add or extend indexes for filter and join columns.
Join order — verify the planner chose the right algorithm given actual row counts.
Lock contention — check pg_stat_activity for blocking sessions.
Resource limits — check work_mem if sort/hash batches spill to disk.