datarekha
Pandas & Data Wrangling Easy Asked at AmazonAsked at GoogleAsked at MetaAsked at Microsoft

How do common SQL operations map to pandas, and when should you use SQL instead of pandas?

The short answer

Every core SQL clause — SELECT, WHERE, GROUP BY, HAVING, JOIN, ORDER BY, LIMIT — has a direct pandas equivalent, but SQL executes inside a database engine with optimized query planning and disk-backed storage, while pandas requires all data to fit in RAM. Use SQL for large persistent datasets and pandas for in-memory transformation, feature engineering, and integration with the Python ML ecosystem.

How to think about it

What the interviewer is testing

This is a translation question with a decision layer on top. They want to know: (1) can you work fluently in both worlds, and (2) do you understand why you’d pick one over the other, rather than just defaulting to whichever you learned first. The honest answer is that in most data stacks you use both — SQL to retrieve and aggregate, pandas to transform and model.

The translation reference

Every core SQL clause maps cleanly to a pandas method:

import pandas as pd

# SELECT col1, col2 FROM df
df[["col1", "col2"]]

# SELECT DISTINCT region FROM df
df["region"].unique()
df[["region"]].drop_duplicates()

# WHERE amount > 100
df[df["amount"] > 100]
df.query("amount > 100")       # cleaner for complex conditions

# GROUP BY + aggregate
df.groupby("region").agg(total=("amount", "sum"), n=("amount", "count"))

# HAVING (post-aggregation filter) — filter after groupby
agg = df.groupby("region").agg(total=("amount", "sum"))
agg[agg["total"] > 1000]

# ORDER BY amount DESC LIMIT 10
df.nlargest(10, "amount")

# LEFT JOIN
orders.merge(customers, on="customer_id", how="left")

# WINDOW FUNCTION: ROW_NUMBER() OVER (PARTITION BY region ORDER BY amount)
df["row_num"] = df.groupby("region")["amount"].rank(method="first")

The trickiest translation is HAVING — there is no single method for it. You run groupby().agg() first, then filter the resulting DataFrame. Think of it as a two-step operation.

Interactive demo: GROUP BY + HAVING in pandas

Decision guide: SQL vs pandas

FactorUse SQLUse pandas
Data sizeLarger than RAMFits in RAM
Data locationDatabase / data warehouseFile / in-memory
TeamMixed SQL/non-PythonPython-first
OutputReport / dashboardML features / arrays
Iteration speedSet-and-forget queryExploratory, lots of changes

The DuckDB sweet spot

If you need SQL syntax and the pandas ecosystem, DuckDB can query a pandas DataFrame directly:

import duckdb

result = duckdb.sql("""
    SELECT region, SUM(amount) AS total
    FROM orders
    GROUP BY region
    HAVING SUM(amount) > 700
""").df()   # returns a pandas DataFrame

This is often faster than pandas for aggregation-heavy work because DuckDB has a vectorized query engine under the hood.

Learn it properly Merge & join

Keep practising

All Pandas & Data Wrangling questions

Explore further

Skip to content