How do common SQL operations map to pandas, and when should you use SQL instead of pandas?
Every core SQL clause — SELECT, WHERE, GROUP BY, HAVING, JOIN, ORDER BY, LIMIT — has a direct pandas equivalent, but SQL executes inside a database engine with optimized query planning and disk-backed storage, while pandas requires all data to fit in RAM. Use SQL for large persistent datasets and pandas for in-memory transformation, feature engineering, and integration with the Python ML ecosystem.
How to think about it
What the interviewer is testing
This is a translation question with a decision layer on top. They want to know: (1) can you work fluently in both worlds, and (2) do you understand why you’d pick one over the other, rather than just defaulting to whichever you learned first. The honest answer is that in most data stacks you use both — SQL to retrieve and aggregate, pandas to transform and model.
The translation reference
Every core SQL clause maps cleanly to a pandas method:
import pandas as pd
# SELECT col1, col2 FROM df
df[["col1", "col2"]]
# SELECT DISTINCT region FROM df
df["region"].unique()
df[["region"]].drop_duplicates()
# WHERE amount > 100
df[df["amount"] > 100]
df.query("amount > 100") # cleaner for complex conditions
# GROUP BY + aggregate
df.groupby("region").agg(total=("amount", "sum"), n=("amount", "count"))
# HAVING (post-aggregation filter) — filter after groupby
agg = df.groupby("region").agg(total=("amount", "sum"))
agg[agg["total"] > 1000]
# ORDER BY amount DESC LIMIT 10
df.nlargest(10, "amount")
# LEFT JOIN
orders.merge(customers, on="customer_id", how="left")
# WINDOW FUNCTION: ROW_NUMBER() OVER (PARTITION BY region ORDER BY amount)
df["row_num"] = df.groupby("region")["amount"].rank(method="first")
The trickiest translation is HAVING — there is no single method for it. You run groupby().agg() first, then filter the resulting DataFrame. Think of it as a two-step operation.
Interactive demo: GROUP BY + HAVING in pandas
Decision guide: SQL vs pandas
| Factor | Use SQL | Use pandas |
|---|---|---|
| Data size | Larger than RAM | Fits in RAM |
| Data location | Database / data warehouse | File / in-memory |
| Team | Mixed SQL/non-Python | Python-first |
| Output | Report / dashboard | ML features / arrays |
| Iteration speed | Set-and-forget query | Exploratory, lots of changes |
The DuckDB sweet spot
If you need SQL syntax and the pandas ecosystem, DuckDB can query a pandas DataFrame directly:
import duckdb
result = duckdb.sql("""
SELECT region, SUM(amount) AS total
FROM orders
GROUP BY region
HAVING SUM(amount) > 700
""").df() # returns a pandas DataFrame
This is often faster than pandas for aggregation-heavy work because DuckDB has a vectorized query engine under the hood.