datarekha
Pandas & Data Wrangling Hard Asked at MetaAsked at UberAsked at DatabricksAsked at Snowflake

Why is pandas slow, and what are the main strategies to speed it up?

The short answer

pandas is slow primarily because Python loops bypass NumPy's vectorized C kernels, object-dtype columns prevent SIMD optimizations, and keeping entire datasets in memory limits scalability. The fixes are vectorization, categorical encoding, eval/query for large frames, chunking for out-of-core data, and switching to Polars or DuckDB for compute-heavy pipelines.

How to think about it

How to frame this answer

The honest answer has two layers. First, there’s a correctness layer: most slowness comes from code that’s pandas-shaped but still doing Python-level work (loops, iterrows, apply with a lambda that calls Python functions). Second, there’s a scalability layer: even well-written pandas is single-threaded and fully in-memory. Knowing both tells the interviewer you understand not just the quick fix but also the architecture.

The root cause: Python overhead on every row

NumPy’s speed comes from processing entire arrays in C without touching Python between elements. When you loop over rows — whether with a for loop, iterrows(), or a slow apply — you pay Python’s interpreter overhead for every single cell. On a million-row DataFrame, that’s a million Python interpreter round-trips.

Strategy 1: Vectorize — eliminate all Python loops

Strategy 2: categorical dtype for low-cardinality strings

String comparisons on object dtype columns are slow because each element is an independent Python string object. Converting to category dtype replaces the strings with small integer codes — comparisons become integer comparisons, groupby builds its hash table over codes instead of strings.

# object dtype: comparisons and groupby hash every Python string
df["city"] = df["city"].astype("category")
# Now groupby, value_counts, and boolean masks are all faster
# Memory drops from ~50 bytes/value to ~1-2 bytes/value for low cardinality

Strategy 3: eval and query for large arithmetic and filter expressions

pd.eval uses numexpr under the hood to evaluate arithmetic across columns without creating intermediate arrays. For a DataFrame with millions of rows, a + b + c creates two temporary full-size arrays; pd.eval("a + b + c") avoids both.

# Avoids intermediate arrays for large frames
result = pd.eval("df['a'] * df['b'] + df['c']")

# query avoids Python overhead in boolean indexing
df.query("price > 100 and region == 'East'")

Strategy 4: Correct dtypes from the start

Reading with explicit dtypes avoids the object→category conversion later and means pandas never allocates the large initial representation:

df = pd.read_csv(
    "data.csv",
    dtype={"region": "category", "id": "int32", "score": "float32"},
)

Strategy 5: Chunking for out-of-core data

When the file is too large to fit in RAM, process it in pieces:

totals = []
for chunk in pd.read_csv("big.csv", chunksize=100_000):
    totals.append(chunk.groupby("region")["sales"].sum())
pd.concat(totals).groupby(level=0).sum()

Strategy 6: Escape to Polars or DuckDB for compute-heavy pipelines

When the bottleneck is the pandas engine itself (single-threaded, fully in-memory), the right move is to swap tools. Polars uses all CPU cores automatically; DuckDB has a vectorized query engine and can operate out-of-core.

Learn it properly Memory optimization

Keep practising

All Pandas & Data Wrangling questions

Explore further

Skip to content