Pandas & Data Wrangling | Easy | Asked at Amazon, Google, Meta, Microsoft

How do common SQL operations map to pandas, and when should you use SQL instead of pandas?

For: Data Analyst, Data Scientist, Data Engineer

The short answer

Every core SQL clause (SELECT, WHERE, GROUP BY, HAVING, JOIN, ORDER BY, LIMIT) has a pandas equivalent, with HAVING taking two steps instead of one. The difference is where the work runs: SQL executes inside a database engine with optimized query planning and disk-backed storage, while pandas expects the data it operates on to fit in RAM. Use SQL for large persistent datasets and pandas for in-memory transformation, feature engineering, and integration with the Python ML ecosystem.

How to think about it

This is a translation question with a decision layer bolted on top. The interviewer wants two things: that you move fluently between both worlds, and that you know why you would reach for one over the other instead of defaulting to whichever you learned first. The honest answer is that real stacks use both — SQL to retrieve and pre-aggregate close to the data, pandas to transform, engineer features, and feed the Python ML ecosystem.

Start with the mapping, because most of it is one-to-one. SELECT col1, col2 is df[["col1", "col2"]]. WHERE amount > 100 is df[df["amount"] > 100], or df.query("amount > 100") when the condition gets busy. GROUP BY with aggregates is df.groupby("region").agg(...). ORDER BY amount DESC LIMIT 10 collapses to df.nlargest(10, "amount"). A LEFT JOIN is orders.merge(customers, on="customer_id", how="left"). The one clause with no single method is HAVING — a filter that runs after aggregation. In pandas you do it in two steps: aggregate first, then apply a boolean mask to the result.
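To make the mapping concrete, here is a minimal sketch that puts each SQL statement next to its pandas counterpart. The orders and customers frames, their columns, and their values are made up for illustration and do not come from any real dataset.

```python
import pandas as pd

# Hypothetical sample data, for illustration only.
orders = pd.DataFrame({
    "order_id": [1, 2, 3, 4, 5],
    "customer_id": [10, 11, 10, 12, 13],
    "region": ["East", "West", "East", "North", "West"],
    "amount": [250, 80, 120, 400, 95],
})
customers = pd.DataFrame({
    "customer_id": [10, 11, 12],
    "name": ["Ana", "Ben", "Cho"],
})

# SELECT order_id, amount FROM orders
selected = orders[["order_id", "amount"]]

# SELECT * FROM orders WHERE amount > 100
filtered = orders[orders["amount"] > 100]
filtered_q = orders.query("amount > 100")  # same rows, string syntax

# SELECT region, SUM(amount) AS total, COUNT(*) AS n FROM orders GROUP BY region
by_region = orders.groupby("region", as_index=False).agg(
    total=("amount", "sum"),
    n=("amount", "size"),
)

# SELECT * FROM orders ORDER BY amount DESC LIMIT 2
top2 = orders.nlargest(2, "amount")

# SELECT * FROM orders o LEFT JOIN customers c ON o.customer_id = c.customer_id
joined = orders.merge(customers, on="customer_id", how="left")
```

As in SQL, the left join keeps every order: customer 13 has no match in customers, so its name comes back as NaN rather than the row being dropped.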

One query, both engines

The cleanest way to prove the mapping is to run the same query in real SQL and in pandas and watch the answers line up. Here a six-row table goes through GROUP BY region + HAVING SUM(amount) > 700 in SQLite, then the identical logic in pandas:

import pandas as pd
import sqlite3

orders = pd.DataFrame({
    "region": ["East", "East", "West", "West", "North", "North"],
    "amount": [400, 350, 200, 180, 600, 550],
})

con = sqlite3.connect(":memory:")
orders.to_sql("orders", con, index=False)

sql_result = pd.read_sql_query("""
    SELECT region, SUM(amount) AS total
    FROM orders
    GROUP BY region
    HAVING SUM(amount) > 700
    ORDER BY region
""", con)
print("SQL (sqlite3):")
print(sql_result)
print()

# pandas: groupby/agg, then a boolean mask for the HAVING step
agg = orders.groupby("region", as_index=False).agg(total=("amount", "sum"))
pandas_result = agg[agg["total"] > 700].sort_values("region").reset_index(drop=True)
print("pandas:")
print(pandas_result)
con.close()

SQL (sqlite3):
  region  total
0   East    750
1  North   1150

pandas:
  region  total
0   East    750
1  North   1150

Identical rows from two engines. SQL fuses the aggregate and the HAVING filter into one declarative statement the database planner optimizes for you; pandas splits it into groupby().agg() and then agg[agg["total"] > 700]. West (380) drops out of both because it never clears 700. The mental shift to internalize: SQL’s HAVING is one clause, but in pandas it is two operations, aggregate and then filter the aggregate, even when you chain them into a single expression.
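Two variations are worth having ready, sketched here on the same six-row table. The first chains the aggregate and the filter into one expression with query(); the second uses groupby().filter(), which looks like HAVING but returns something different. The variable names are illustrative.

```python
import pandas as pd

orders = pd.DataFrame({
    "region": ["East", "East", "West", "West", "North", "North"],
    "amount": [400, 350, 200, 180, 600, 550],
})

# Aggregate, then filter, written as one chain. It is still two steps.
having_chain = (
    orders.groupby("region", as_index=False)
    .agg(total=("amount", "sum"))
    .query("total > 700")
    .reset_index(drop=True)
)

# groupby().filter() applies the group-level test but returns the
# original rows of the surviving groups, not the aggregated totals.
surviving_rows = orders.groupby("region").filter(
    lambda g: g["amount"].sum() > 700
)
```

Reach for the chain when you want the totals, as HAVING gives you. Reach for filter() when you want the underlying rows of the groups that pass, which in SQL would take a subquery or a join back to the aggregate.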

So which do you pick? Lean on SQL when the data is larger than RAM, already lives in a warehouse, the audience includes people who read SQL but not Python, or the output is a scheduled report you set and forget. Lean on pandas when the data fits in memory, you are iterating hard, and the output is ML features or NumPy arrays headed into a model. And if you want SQL syntax over in-memory frames, DuckDB can query a pandas DataFrame in place by its variable name, as in duckdb.sql("SELECT region, SUM(amount) ... FROM orders GROUP BY region").df(), and it often beats pandas on aggregation-heavy work thanks to its multi-threaded, vectorized columnar engine.
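In practice the two are usually combined rather than chosen between. The sketch below shows one way that split might look, with an in-memory SQLite database standing in for a warehouse: SQL aggregates next to the data, and pandas turns the small result into features. The table, column names, and values are hypothetical.

```python
import sqlite3

import pandas as pd

# sqlite3 stands in for a warehouse here; the table is hypothetical.
con = sqlite3.connect(":memory:")
pd.DataFrame({
    "customer_id": [1, 1, 2, 2, 2, 3],
    "amount": [400, 350, 200, 180, 50, 600],
}).to_sql("orders", con, index=False)

# Step 1 (SQL): aggregate close to the data so only one row
# per customer crosses into Python.
per_customer = pd.read_sql_query(
    """
    SELECT customer_id,
           COUNT(*)    AS n_orders,
           SUM(amount) AS total_spend
    FROM orders
    GROUP BY customer_id
    ORDER BY customer_id
    """,
    con,
)
con.close()

# Step 2 (pandas): derive features from the small aggregated frame
# and hand them off as a NumPy array.
per_customer["avg_order"] = (
    per_customer["total_spend"] / per_customer["n_orders"]
)
features = per_customer[["n_orders", "total_spend", "avg_order"]].to_numpy()
```

The point of the design is that the expensive part, scanning every order, happens in the engine, while pandas only ever sees one row per customer.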
