Why is pandas slow, and what are the main strategies to speed it up?

pandas is slow primarily because Python loops bypass NumPy's vectorized C kernels, object-dtype columns prevent SIMD optimizations, and keeping entire datasets in memory limits scalability. The fixes are vectorization, categorical encoding, eval/query for large frames, chunking for out-of-core data, and switching to Polars or DuckDB for compute-heavy pipelines.

Given a new data problem, how do you decide whether to use a list, dict, or set?

Choose a list when order matters and you need indexed access or duplicates. Choose a dict when you need to map keys to values and look up by key in O(1). Choose a set when you need uniqueness, fast membership testing, or set-algebra operations. Getting this choice wrong usually means either incorrect results (keeping duplicates when you needed uniqueness) or avoidable O(n) lookups.
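A minimal sketch of the three choices side by side. The data and the names `events`, `price_by_sku`, and `seen_users` are made up for illustration:

```python
# Hypothetical sample data, for illustration only.
events = ["view", "click", "view", "buy"]   # list: order and duplicates matter
price_by_sku = {"A1": 9.99, "B2": 24.50}    # dict: map a key to a value
seen_users = {"ana", "raj", "li"}           # set: uniqueness and membership

second_event = events[1]            # indexed access by position
price = price_by_sku["B2"]          # lookup by key, O(1) on average
is_known = "raj" in seen_users      # membership test, O(1) on average
unique_events = set(events)         # duplicates collapse when converted to a set
```

The same membership test against `events` would still work, but it would scan the list one element at a time.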

What set operations does Python support, and where are they practically useful in data work?

Python sets support union, intersection, difference, and symmetric difference as both operators and methods. Their average cost ranges from about O(min(m, n)) for intersection to O(m + n) for union. They are useful for deduplication, membership testing in large collections, and computing overlaps between datasets, all of which would be expensive with lists.
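As a small illustration, here are the four operations applied to two made-up sets of row IDs (`train_ids` and `test_ids` are hypothetical names), the kind of overlap check you might run to look for leakage between splits:

```python
# Hypothetical ID sets, for illustration only.
train_ids = {101, 102, 103, 104}
test_ids = {103, 104, 105}

either = train_ids | test_ids        # union: IDs in at least one split
leaked = train_ids & test_ids        # intersection: IDs in both splits
train_only = train_ids - test_ids    # difference: IDs only in train
not_shared = train_ids ^ test_ids    # symmetric difference: IDs in exactly one

# The method forms accept any iterable, not just another set.
also_leaked = train_ids.intersection([103, 104, 105])
```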

What is the difference between DDP and FSDP for distributed training?

Distributed Data Parallel (DDP) replicates the full model on every GPU and synchronizes gradients each step, which is simple but requires the whole model to fit on one GPU. Fully Sharded Data Parallel (FSDP) shards parameters, gradients, and optimizer states across GPUs and gathers them on demand, drastically cutting per-GPU memory so you can train much larger models at the cost of extra communication.

Why DSA for Data Science

What you'll learn

Why the structure you store data in — not the hardware or the language — usually decides how fast your code runs

How checking membership in a list grows with the data, while a set stays flat

Why joins, deduplication, feature encoding, and nearest-neighbour search are all the same 'find it in a collection' problem

How to ask the one question — does the work double or quadruple? — before you write a loop

Suppose you write a small function to remove duplicate IDs from a dataset. You test it on a few hundred rows, the output is exactly right, and you move on.

Then you run it on a million rows. You go and make a coffee. You come back, and it is still running. The code was never wrong — every test passed. What was wrong was the way it stored the data while it worked.

That single idea is what this whole series is about. Data structures and algorithms — DSA — is not whiteboard trivia for interviews. It is the difference between a pipeline that finishes in three seconds and the same correct pipeline that finishes in three hours.

Two ways to remove duplicates

Let us make the difference visible on something small. Suppose we want to keep only the first appearance of each ID in this list:

[4, 7, 4, 2, 7]

One natural way is to keep a second list of IDs we have already seen, and for each new ID, check whether it is in that list before adding it.

def dedup_list(items):
    seen = []
    result = []
    for x in items:
        if x not in seen:      # linear scan: checks "seen" one element at a time
            seen.append(x)
            result.append(x)
    return result

Let us walk it. The phrase x not in seen is the part to watch, because Python checks it by reading seen one element at a time:

4  →  "is 4 in []?"        0 comparisons   →  add 4
7  →  "is 7 in [4]?"       1 comparison    →  add 7
4  →  "is 4 in [4,7]?"     1 comparison    →  already seen
2  →  "is 2 in [4,7]?"     2 comparisons   →  add 2
7  →  "is 7 in [4,7,2]?"   2 comparisons   →  already seen

That is six comparisons for five items — nothing alarming. But notice why it climbs: every time seen grows, the next check has more to scan. The work each step does is tied to how much we have already stored.
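If you would rather count than trust the walk-through, here is a rough sketch that writes the scan out by hand and tallies each check. The helper `count_comparisons` is a hypothetical name, and the explicit inner loop only approximates what the `in` operator does internally for a list:

```python
def count_comparisons(items):
    """Mimic the list-based dedup, counting the checks a linear scan makes."""
    seen = []
    comparisons = 0
    for x in items:
        found = False
        for y in seen:          # hand-written stand-in for `x in seen`
            comparisons += 1
            if y == x:
                found = True
                break           # the scan stops at the first match
        if not found:
            seen.append(x)
    return comparisons

total = count_comparisons([4, 7, 4, 2, 7])
print(total)  # 6
```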

Now the second way. Instead of a list, we keep the seen IDs in a set, which stores each value by a computed address (its hash) so it can answer “is x here?” by jumping straight to one spot — no scanning.

def dedup_set(items):
    seen = set()
    result = []
    for x in items:
        if x not in seen:      # a direct hash lookup, no scan
            seen.add(x)
            result.append(x)
    return result

Both functions return the exact same answer, [4, 7, 2]. On five items the set version does five quick lookups instead of six comparisons — barely a difference. The difference only wakes up when the data grows.

Where the gap actually bites

Picture the same two functions on 20,000 IDs. The list version’s check keeps getting slower as seen fills up, so the total work climbs toward the square of the input — on the order of a hundred million comparisons. The set version does one flat lookup per ID — about 20,000 of them, full stop.

The only real change between the two functions is [] becoming set() (with append becoming add to match), but one scans and one jumps.

A hundred million versus twenty thousand. Scale it once more to a million rows in production and the list version simply does not finish, while the set version is done before the next step of the pipeline notices. The whole difference between shipping and stalling came down to one small swap: [] for set().
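The arithmetic behind those numbers can be sketched directly. This assumes the worst case, where every ID is distinct, so the k-th ID is compared against all k IDs stored before it. The function names are illustrative:

```python
def worst_case_list_checks(n):
    """Comparisons made by the list version when all n IDs are distinct."""
    # 0 + 1 + 2 + ... + (n - 1) = n * (n - 1) / 2
    return n * (n - 1) // 2

def set_lookups(n):
    """The set version does one lookup per ID."""
    return n

print(worst_case_list_checks(20_000))     # 199990000
print(set_lookups(20_000))                # 20000
print(worst_case_list_checks(1_000_000))  # 499999500000
```

With many repeated IDs the list version does less work than this, since `seen` stays shorter and scans stop at the first match, but the growth pattern is the same.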

You do not need to memorise the notation yet — that is the next lesson. For now, hold one question in your head: when the input doubles, does the work merely double, or does it explode?
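One way to put that question to your own machine is to time both versions while doubling the input. This is a rough sketch, not a careful benchmark: the sizes are arbitrary, the IDs are all distinct (the worst case for the list version), and the timings will vary from run to run.

```python
import time

def dedup_list(items):
    seen = []
    result = []
    for x in items:
        if x not in seen:      # linear scan of a growing list
            seen.append(x)
            result.append(x)
    return result

def dedup_set(items):
    seen = set()
    result = []
    for x in items:
        if x not in seen:      # hash lookup
            seen.add(x)
            result.append(x)
    return result

def time_once(func, items):
    """Return the wall-clock seconds for a single call."""
    start = time.perf_counter()
    func(items)
    return time.perf_counter() - start

for n in (2_000, 4_000, 8_000):
    ids = list(range(n))       # all distinct: worst case for the list version
    t_list = time_once(dedup_list, ids)
    t_set = time_once(dedup_set, ids)
    print(f"n={n:>5}  list: {t_list:.4f}s  set: {t_set:.4f}s")
```

If the list column roughly quadruples each time n doubles while the set column roughly doubles, you are looking at the explosion the question is asking about.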

The same problem, wearing many costumes

Once you see this shape, you start spotting it everywhere in data work. It is always the same question — find something in a collection — and the structure you choose decides the cost.

Deduplication. This is the example above. When pandas runs df.drop_duplicates(), it leans on hashing for exactly this reason.
Joins. Merging two tables on a key builds a hash table on one side so each match is a direct lookup. A nested-loop join over two tables of a million rows each would take on the order of a trillion comparisons; the hash join is roughly linear in the total number of rows.
Feature encoding. Mapping each category to a number is a lookup. Stored in a list and scanned, every row pays a cost that grows with the number of categories, so the total is rows times categories; stored in a dict, each lookup is flat.
Nearest-neighbour search. Finding the closest vectors in a vector database, the heart of retrieval for RAG systems, uses index structures (HNSW, IVF) to avoid comparing against every vector. At a billion embeddings, that choice is the difference between a working product and none.
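To make the join case concrete, here is a minimal sketch of a hash join over plain lists of dicts. It illustrates the build-then-probe idea only and is not how pandas implements merge. The function `hash_join` and the sample rows are hypothetical:

```python
def hash_join(left_rows, right_rows, key):
    """Inner-join two lists of dicts on `key`, using a dict as the hash table."""
    # Build phase: index the right side by key in a single pass.
    index = {}
    for row in right_rows:
        index.setdefault(row[key], []).append(row)

    # Probe phase: one dict lookup per left row instead of a scan.
    joined = []
    for row in left_rows:
        for match in index.get(row[key], []):
            joined.append({**row, **match})
    return joined

# Made-up sample data.
orders = [
    {"order_id": 1, "sku": "A1"},
    {"order_id": 2, "sku": "B2"},
    {"order_id": 3, "sku": "Z9"},   # no matching product
]
products = [
    {"sku": "A1", "price": 9.99},
    {"sku": "B2", "price": 24.50},
]

result = hash_join(orders, products, "sku")
```

Order 3 drops out because its key is missing from the index, which is the inner-join behaviour. A nested-loop version would instead compare every order against every product.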

The pattern repeats: find something in a collection, many times over. Multiply the cost of one lookup by how often you do it, and you have the speed of the whole pipeline.
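Feature encoding shows that multiplication in miniature. In this sketch (the category names and the variable `code_of` are made up), both versions give the same codes, but the first pays for a list scan on every row and the second pays for one dict lookup:

```python
categories = ["red", "green", "blue"]
column = ["blue", "red", "blue", "green"]

# List scan: each .index() call walks the list, so cost is rows x categories.
encoded_slow = [categories.index(value) for value in column]

# Dict lookup: build the mapping once, then each row is one hash lookup.
code_of = {name: i for i, name in enumerate(categories)}
encoded_fast = [code_of[value] for value in column]

print(encoded_fast)  # [2, 0, 2, 1]
```

With three categories the gap is invisible; with tens of thousands of categories and millions of rows, the per-row scan dominates the run time.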

What this series is, and is not

You will not be asked to reverse a linked list from memory. The goal is a sharper instinct for the structures you already reach for — lists, dicts, sets, sorted arrays, queues — and for when to swap one for another. Each lesson takes one structure or technique and shows it changing something real in a data-science setting.

By the end, “a quadratic step on a real dataframe is the difference between shipping and not” should feel less like a warning and more like something you already knew.

Practice

Quick check

Q1. You have 500,000 product IDs and, for each of 500,000 orders, you must check whether its product exists. Which structure makes each membership check a flat, one-step lookup?

Q2. A loop runs over n rows and, for each row, scans a list of up to n already-seen values. As n grows, how does the total work grow?
