Why is pandas slow, and what are the main strategies to speed it up?

pandas is slow primarily because Python-level loops bypass NumPy's vectorized C kernels, object-dtype columns hold pointers to individual Python objects rather than contiguous typed arrays (which rules out SIMD optimizations), and keeping entire datasets in memory limits scalability. The main fixes are vectorization, categorical encoding for low-cardinality strings, eval/query for large frames, chunking for out-of-core data, and switching to Polars or DuckDB for compute-heavy pipelines.
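As a rough illustration of the first two fixes, the sketch below compares a row-by-row loop with a vectorized multiplication and measures the memory effect of categorical encoding. The column names and the random data are assumptions made up for the example, and the exact byte counts will vary with the pandas version.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 10,000 rows with two numeric columns and one
# low-cardinality string column (names invented for this sketch).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "price": rng.uniform(1.0, 100.0, size=10_000),
    "qty": rng.integers(1, 10, size=10_000),
    "region": rng.choice(["north", "south", "east", "west"], size=10_000),
})

# Slow pattern: a Python-level loop that touches one row at a time.
loop_total = [row.price * row.qty for row in df.itertuples(index=False)]

# Vectorized pattern: one multiplication over whole columns.
vec_total = df["price"] * df["qty"]

# Categorical encoding: each distinct string is stored once,
# and the column itself becomes small integer codes.
string_bytes = df["region"].memory_usage(deep=True)
category_bytes = df["region"].astype("category").memory_usage(deep=True)

print(f"strings: {string_bytes:,} bytes   category: {category_bytes:,} bytes")
```

Both versions of the total produce the same values; only the vectorized one hands the whole column to compiled code in a single call.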

When should you use Spark instead of pandas, and what are the key trade-offs?

pandas operates in-memory on a single machine, making it fast and simple for datasets under a few gigabytes. Spark distributes computation across a cluster, handles terabyte-scale data, and integrates with cloud storage — but adds significant overhead for small data. The crossover point is roughly when your data no longer fits in RAM or when processing time on a single machine becomes unacceptable.

How do common SQL operations map to pandas, and when should you use SQL instead of pandas?

Every core SQL clause — SELECT, WHERE, GROUP BY, HAVING, JOIN, ORDER BY, LIMIT — has a direct pandas equivalent, but SQL executes inside a database engine with optimized query planning and disk-backed storage, while pandas requires all data to fit in RAM. Use SQL for large persistent datasets and pandas for in-memory transformation, feature engineering, and integration with the Python ML ecosystem.
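To make the mapping concrete, here is a minimal sketch that runs one query twice: once as SQL in an in-memory SQLite database, once as a pandas method chain. The orders table and its values are invented for the example, and SQLite is used only because it ships with Python.

```python
import sqlite3

import pandas as pd

# Hypothetical orders table, invented for illustration.
orders = pd.DataFrame({
    "customer": ["ann", "bob", "ann", "cat", "bob", "ann"],
    "amount": [20.0, 5.0, 35.0, 50.0, 7.0, 10.0],
})

# SQL version, executed by SQLite.
conn = sqlite3.connect(":memory:")
orders.to_sql("orders", conn, index=False)
sql = """
    SELECT customer, SUM(amount) AS total
    FROM orders
    WHERE amount >= 7
    GROUP BY customer
    HAVING SUM(amount) > 10
    ORDER BY total DESC
    LIMIT 2
"""
sql_result = pd.read_sql_query(sql, conn)
conn.close()

# pandas version: one step per clause.
pandas_result = (
    orders[orders["amount"] >= 7]                     # WHERE
    .groupby("customer", as_index=False)["amount"]    # GROUP BY
    .sum()                                            # SUM(amount)
    .rename(columns={"amount": "total"})              # AS total
    .query("total > 10")                              # HAVING
    .sort_values("total", ascending=False)            # ORDER BY ... DESC
    .head(2)                                          # LIMIT 2
    .reset_index(drop=True)
)

print(sql_result)
print(pandas_result)
```

The two results hold the same rows. The difference is where the work happens: SQLite plans and executes the query inside the engine, while pandas materializes each intermediate frame in memory.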

When should you use apply, map, or applymap versus vectorized pandas operations, and what are the performance implications?

Vectorized pandas and NumPy operations run over entire arrays in compiled code and should be your first choice whenever one exists. apply calls a Python function once per row or column, map transforms a single Series element by element, and applymap (renamed DataFrame.map in pandas 2.1) applies a function to every scalar. All three run at Python-loop speed, which on large frames is often orders of magnitude slower than a vectorized equivalent.
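The sketch below puts each slow form next to a vectorized rewrite that gives the same result. The tiny frame and the "big"/"small" labels are assumptions for illustration; on four rows the speed difference is invisible, so the point is only the equivalence.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3, 4], "b": [10, 20, 30, 40]})

# apply with axis=1 calls the lambda once per row.
via_apply = df.apply(lambda row: row["a"] + row["b"], axis=1)
# Vectorized equivalent: a single column-wise addition.
via_vector = df["a"] + df["b"]

# map transforms one Series element by element.
via_map = df["a"].map(lambda x: x * 2)
# Vectorized equivalent: scalar broadcast.
via_vector_double = df["a"] * 2

# Conditional logic is a common reason to reach for apply;
# np.where expresses the same branch over the whole column.
via_apply_label = df["a"].apply(lambda x: "big" if x > 2 else "small")
via_where = pd.Series(np.where(df["a"] > 2, "big", "small"), index=df.index)
```

When no vectorized form exists (for example, a call into an arbitrary Python library per value), apply or map is still a reasonable fallback.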

When O(n²) Kills Your DataFrame — DSA

You have a working pipeline. It passes the tests, produces the right output, and runs fine on ten thousand rows. You ship it, it meets two million rows in production, and now it takes forty minutes instead of one.

Nothing is broken. The complexity is. This lesson is about the gap between correct and scalable — the O(n²) traps that turn up constantly in data work, and the O(n + m) patterns that erase them.

The trap, and why it scales so badly

Here is a join that looks perfectly reasonable — for every order, find the matching customer:

for order in orders:              # n iterations
    for customer in customers:    # up to m comparisons per order
        if order["customer_id"] == customer["id"]:
            order["name"] = customer["name"]
            break

This is O(n × m). With 100,000 orders and 50,000 customers, that is up to five billion comparisons. The same shape hides in a boolean filter inside a loop (df[df.tag == t] re-scans every row each pass), in an apply that does a per-row list scan, and in pd.concat inside a loop — where each concat copies the whole growing frame, so k chunks cost 1 + 2 + … + k, a quadratic total.
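The arithmetic behind both claims can be checked with a few lines of plain Python. This is a simplified cost model, not a measurement: it assumes every concat copies each row of the result exactly once and ignores allocation overhead. The function names are invented for the sketch.

```python
def rows_copied_concat_in_loop(num_chunks: int, chunk_rows: int) -> int:
    """Rows copied when the result is re-concatenated after every chunk."""
    copied = 0
    result_rows = 0
    for _ in range(num_chunks):
        result_rows += chunk_rows   # the new frame holds every row so far
        copied += result_rows       # and building it copies all of them
    return copied


def rows_copied_concat_once(num_chunks: int, chunk_rows: int) -> int:
    """Rows copied when chunks are collected and concatenated one time."""
    return num_chunks * chunk_rows


# Worst case for the nested join above: every order scans every customer.
worst_case_comparisons = 100_000 * 50_000

for k in [10, 100, 1_000]:
    in_loop = rows_copied_concat_in_loop(k, chunk_rows=1)
    once = rows_copied_concat_once(k, chunk_rows=1)
    print(f"k={k:>5}: in-loop {in_loop:>8,} rows copied   once {once:>6,}")
```

With one-row chunks, 1,000 chunks cost 500,500 row copies in the loop against 1,000 for a single concat, which is the 1 + 2 + … + k sum written out.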

The grid of comparisons fills in as n²; the index is built once and probed in one step per row.

The fix: build an index once, query in O(1)

The lesson from hash tables carries straight over. Instead of scanning, index once and look up directly:

def nested_join(left, right):
    """Count comparisons made by a scan-per-row join."""
    comps = 0
    for lkey in left:
        for rkey in right:        # rescan right for every left row
            comps += 1
            if lkey == rkey:
                break             # stop at the first match
    return comps

def hash_join(left, right):
    """Count steps made by a build-once, probe-per-row join."""
    index = set(right)            # build once: m steps
    comps = len(right)
    for lkey in left:
        comps += 1                # one O(1) average-case lookup per row
        _ = lkey in index
    return comps

for n in [200, 500, 1000, 2000]:
    data = list(range(n))
    print(f"n={n:>4}: nested {nested_join(data, data):>10,} comparisons   hash {hash_join(data, data):>6,}")

n= 200: nested     20,100 comparisons   hash    400
n= 500: nested    125,250 comparisons   hash  1,000
n=1000: nested    500,500 comparisons   hash  2,000
n=2000: nested  2,001,000 comparisons   hash  4,000

Each time n doubles, the nested count roughly quadruples while the hash count merely doubles. At n = 2000 the gap is already about 500×; at a million rows it is the difference between hours and seconds.

In pandas, the fixes have names you already use: df.merge builds the hash index for you, so an equality join costs roughly O(n + m); df["code"].map(some_dict) is the dict lookup; set(...) membership replaces x in big_list; and collecting chunks in a list, then calling pd.concat once at the end, replaces the quadratic concat-in-a-loop.
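Applied to the orders and customers join from earlier, the same idea might look like the sketch below: first with a plain dict, then with the pandas calls named above. The sample records and the order_id field are invented for illustration.

```python
import pandas as pd

# Hypothetical sample records, invented for illustration.
orders = [
    {"order_id": 1, "customer_id": 20},
    {"order_id": 2, "customer_id": 10},
    {"order_id": 3, "customer_id": 20},
]
customers = [
    {"id": 10, "name": "Ada"},
    {"id": 20, "name": "Lin"},
]

# Plain Python: build the index once (m steps),
# then do one dict lookup per order (n steps).
name_by_id = {c["id"]: c["name"] for c in customers}
joined = [{**o, "name": name_by_id.get(o["customer_id"])} for o in orders]

# pandas: merge handles the indexing and probing internally.
orders_df = pd.DataFrame(orders)
customers_df = pd.DataFrame(customers)
merged = orders_df.merge(
    customers_df, left_on="customer_id", right_on="id", how="left"
)

# Series.map with a dict is the single-column form of the same lookup.
orders_df["name"] = orders_df["customer_id"].map(name_by_id)

# Collect chunks in a list and concatenate once, not inside the loop.
chunks = [orders_df.iloc[[i]] for i in range(len(orders_df))]
combined = pd.concat(chunks, ignore_index=True)
```

The left merge keeps every order even when no customer matches, which mirrors the dict version's use of .get returning None for a missing key.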

Practice

Quick check

Q1. For each of 500,000 transactions you run df[df['user_id'] == row['user_id']] inside a Python loop. What is the complexity?

Q2. Which pandas pattern is O(n + m), not O(n × m)?

Q3. You grow a result with result = pd.concat([result, chunk]) inside a 10,000-chunk loop. What is the problem?
