How do you use EXPLAIN / EXPLAIN ANALYZE to diagnose a slow query?

EXPLAIN shows the optimizer's chosen plan with estimated rows and costs before execution. EXPLAIN ANALYZE runs the query and overlays actual row counts and timing, letting you spot where estimates diverge from reality — bad statistics, missing indexes, or the wrong join algorithm — and target the correct fix.

Walk me through how you would systematically diagnose a slow SQL query in production.

Start by capturing the query plan with EXPLAIN ANALYZE to find the most expensive node, then check whether estimates match actuals to spot stale statistics, look for missing indexes on filter and join columns, verify the predicates are sargable, and finally check for resource contention — locks, buffer eviction, or an overwhelmed disk.

How can aggregating after a JOIN produce inflated (double-counted) totals, and how do you fix it?

A JOIN that fans out rows — one-to-many or many-to-many — causes the same source row to appear multiple times in the joined result set. Aggregating on that inflated set multiplies values, giving totals larger than the true sum.

How do common SQL operations map to pandas, and when should you use SQL instead of pandas?

Every core SQL clause — SELECT, WHERE, GROUP BY, HAVING, JOIN, ORDER BY, LIMIT — has a direct pandas equivalent, but SQL executes inside a database engine with optimized query planning and disk-backed storage, while pandas requires all data to fit in RAM. Use SQL for large persistent datasets and pandas for in-memory transformation, feature engineering, and integration with the Python ML ecosystem.

SQL: Computing Results by Hand — GATE DA

The last lesson named SQL as the declarative idea made practical — so here it is, doing real work. And GATE’s question about it is almost always the same five words: “How many rows does this query return?” Not “write a query,” not “design a schema” — just stare at a small SQL snippet and a tiny table and tell us the count.

The trick is to read SQL like a tiny pipeline: tables in, filter, group, project, rows out. Six clauses do almost all the work, and once you know the order in which they fire, counting becomes mechanical.

Reading SQL as a pipeline

A query is six steps. They run in this order — not the order you write them:

Logical order: tables → rows → groups → columns. SELECT names appear last but you write them first.

SELECT picks which columns come out (and runs aggregates).
FROM picks the tables. Multiple tables separated by commas form a cross product — every row of A paired with every row of B.
WHERE filters individual rows.
JOIN … ON combines two tables on a matching condition — same as cross product plus a WHERE, but cleaner.
GROUP BY buckets rows that share a value, and aggregates collapse each bucket to one row.
HAVING filters those grouped buckets (use it instead of WHERE when the test involves an aggregate).
DISTINCT removes duplicate output rows at the end.

The cross-product mental model

The most common GATE pattern uses the old “comma join” — FROM Raider, Team. That is a cross product: if Raider has 6 rows and Team has 6 rows, the intermediate result is 36 rows. The WHERE clause then prunes it.

FROM A, B fires first, exploding rows. WHERE then prunes — most rows usually die there.

Worked example — GATE DA 2024 Q21

This is the actual exam question, stated nearly verbatim:

Schema: Raider(ID, City, ...) and Team(ID, RaidPoints, ...). How many rows does the following SQL query return?
SELECT *
FROM   Raider, Team
WHERE  Raider.ID = Team.ID
   AND City = 'Jaipur'
   AND RaidPoints > 200;

The answer is 3. Let us see why with a concrete tiny dataset — six raiders and the same six teams (one per raider, joined on ID):

Raider.ID	City	Team.ID	RaidPoints
1	Jaipur	1	250
2	Jaipur	2	210
3	Mumbai	3	300
4	Jaipur	4	180
5	Jaipur	5	220
6	Patna	6	400

Walk the WHERE filter row by row, after the ID match collapses 36 cross-product rows down to 6:

ID 1: City = Jaipur, RaidPoints = 250 > 200. Keep.
ID 2: Jaipur, 210 > 200. Keep.
ID 3: Mumbai — wrong city. Drop.
ID 4: Jaipur, 180 — not above 200. Drop.
ID 5: Jaipur, 220 > 200. Keep.
ID 6: Patna — wrong city. Drop.

Three rows survive, and here is exactly what the query returns:

Raider.ID | City   | Team.ID | RaidPoints
--------- | ------ | ------- | ----------
1         | Jaipur | 1       | 250
2         | Jaipur | 2       | 210
5         | Jaipur | 5       | 220
                  (3 rows)

Two quick variations confirm where the work happens. Remove AND RaidPoints > 200 and you get 4 rows (every Jaipur raider). Drop the ID-equality condition and you get 24 (4 Jaipur raiders crossed with 6 teams, filtered only by city). The WHERE clause is doing all the pruning.

Aggregates and GROUP BY

COUNT, SUM, AVG, MIN, MAX collapse many rows into one. Add GROUP BY and you get one collapsed row per group instead of one for the whole table.

SELECT   City, COUNT(*) AS raiders
FROM     Raider
GROUP BY City;

Output rows = number of distinct cities = 3 (Jaipur, Mumbai, Patna) — not 6.

HAVING filters those groups after aggregation:

SELECT   City, COUNT(*) AS raiders
FROM     Raider
GROUP BY City
HAVING   COUNT(*) >= 2;

Now only Jaipur (4 raiders) survives — 1 row.

How GATE asks this

A NAT: a tiny schema (two or three tables, a handful of rows each), a query that uses some mix of cross-product join, WHERE, GROUP BY, HAVING, and maybe DISTINCT — and you must type the exact row count. GATE DA 2024 Q21 was exactly this pattern with two tables. The way to beat the question every time: walk the six-clause pipeline above with a pencil, one clause at a time, and write down the row count after each step.

A question to carry forward

You can now query any schema and count its rows by hand. But notice we have never asked whether the schema was worth querying. Imagine cramming student, course, and instructor all into one wide table: the instructor’s name would repeat in every single enrolment row — and the day one of those copies is updated and the others are not, the database starts contradicting itself. That rot is a property of the design, invisible to any query. Here is the thread onward: what is the formal rule that captures “this column’s value is fixed by that one” — student_id determines name, course_id determines title — and how does that rule let you diagnose a bad design, and find a table’s keys, before a single bad update corrupts your data?

In one breath

GATE’s SQL question is almost always “how many rows?” — read the query as a pipeline and count after each clause.
Logical order (not written order): FROM → WHERE → GROUP BY → HAVING → SELECT → ORDER BY.
FROM A, B is a cross product (|A|·|B| rows); a missing join condition is the classic count-balloon trap. WHERE then prunes.
Aggregates (COUNT/SUM/AVG/MIN/MAX) collapse rows; GROUP BY → one row per group; HAVING filters groups (WHERE cannot test an aggregate).
GATE DA 2024 Q21: Raider, Team joined on ID, then City='Jaipur' and RaidPoints>200 → 3 rows.

Practice

Quick check

0/6

Q1Recall: which statements about SQL evaluation are TRUE? (select all that apply)select all that apply

Q2Trace: table T(x) has rows {1, 1, 2, 2, 3, NULL}. How many rows does `SELECT DISTINCT x FROM T` return? (SQL treats NULL as a distinct group of its own.) (integer)numerical answer — type a number

Q3Apply: given Raider(ID, City) with rows (1,'Jaipur'),(2,'Jaipur'),(3,'Mumbai'),(4,'Jaipur'),(5,'Jaipur'),(6,'Patna') and Team(ID, RaidPoints) with rows (1,250),(2,210),(3,300),(4,180),(5,220),(6,400), how many rows does `SELECT * FROM Raider, Team WHERE Raider.ID = Team.ID AND City = 'Jaipur' AND RaidPoints > 200` return? (integer)numerical answer — type a number

Q4Apply: same Raider/Team tables. How many rows does `SELECT * FROM Raider, Team WHERE City = 'Jaipur'` return? (Note: no join condition.) (integer)numerical answer — type a number

Q5Apply: same Raider table. How many rows does `SELECT City, COUNT(*) FROM Raider GROUP BY City HAVING COUNT(*) >= 2` return? (integer)numerical answer — type a number

Q6Create: given the Raider/Team tables, how many rows does `SELECT DISTINCT City FROM Raider, Team WHERE Raider.ID = Team.ID AND RaidPoints > 200` return? (integer)numerical answer — type a number

SQL: Computing Results by Hand

What you'll learn

Before you start

Reading SQL as a pipeline

The cross-product mental model

Worked example — GATE DA 2024 Q21

Aggregates and GROUP BY

How GATE asks this

A question to carry forward

In one breath

Practice

Quick check

Sign in to track your progress

Practice this in an interview

Related lessons

Explore further