Why DSA for Data Science
The same correct code can take 3 hours or 3 seconds — the difference is the data structure you chose, not the language.
What you'll learn
- Why data structure choice — not hardware or language — is the dominant factor in code speed
- How list membership (O(n)) vs set membership (O(1)) produces a measurable gap you can see and time yourself
- Why joins, deduplication, feature engineering, and nearest-neighbour retrieval are DSA choices in disguise
- How to think in complexity before you write a loop, so your data code scales to real datasets
Before you start
Imagine you write a perfectly correct deduplication function. It passes every test. Then you run it on a million-row dataset and go make coffee — and it still hasn’t finished when you return. The code was right. The data structure was wrong.
That’s the entire premise of this series. DSA (Data Structures and Algorithms) is not whiteboard trivia for software engineering interviews. It is the vocabulary for asking: “will this code finish in 3 seconds or 3 hours?”
The gap you can see with your own eyes
The best way to make this real is to measure it. The example below deduplicates a list of IDs by checking membership — once using a list, once using a set. Both produce identical output. The runtime is not identical.
Run it. You will see the set version is tens to hundreds of times faster on this modest input. Now scale that to a million rows in production — the list version times out; the set version finishes before the next line of a pipeline even notices.
The only difference between the two functions is a single character: [] vs set().
Why the gap exists: O(n) vs O(1)
When you write x not in seen and seen is a list, Python has to check every
element one by one. With 10,000 items already in seen, that’s up to 10,000 comparisons
per lookup. In the full loop over 20,000 items, that’s up to 200,000,000 comparisons —
roughly O(n²) total work.
A set stores values by their hash — a fixed-cost arithmetic function that maps x
to a bucket index directly, without scanning. Looking up x is a direct hash
computation and one or two comparisons regardless of how large the set grows.
That is O(1) per lookup, O(n) total.
The notation O(...) just captures the relationship between input size and work done.
You do not need to memorise it — you need to feel it. When your input doubles, does the
work double (O(n), manageable) or quadruple (O(n²), dangerous)?
This is everywhere in data science
DSA choices are hidden inside every data workflow. They just wear different names.
Deduplication. Exactly the example above. df.drop_duplicates() in pandas uses
a hash-based approach internally. If you ever implement your own in pure Python, you
now know why a set is the right tool.
Joins. When pandas merges two dataframes, it builds a hash table on the join key so it can look up matches in O(1) instead of scanning one frame for every row of the other. A nested-loop join on two million-row tables would be O(n²) — roughly four trillion comparisons. The hash join is O(n). That is not a micro-optimisation; it is the difference between a query that runs and one that doesn’t.
Feature engineering. Encoding a categorical column means mapping each value to an integer. If you store the mapping in a list and scan it for each row, you are back to O(n²). A dict makes it O(n).
Nearest-neighbour retrieval. Similarity search in vector databases (the backbone of RAG systems) is an approximate nearest-neighbour problem. The naive approach scans every vector for every query — O(n) per query. Specialised index structures (HNSW, IVF) reduce that to O(log n). At a billion embeddings, that gap is existential.
The pattern repeats: the problem is always “find something in a collection.” The DSA choice determines whether “find” costs O(1), O(log n), or O(n). Multiply that by how many times you call it, and you have the performance profile of your pipeline.
This series is not about whiteboard problems
You will not be asked to invert a binary tree. The goal is sharper intuition about the structures you already reach for every day — lists, dicts, sets, sorted arrays, queues — and when to swap one for another. Each lesson introduces one structure or technique through a concrete data-science scenario where it changes something real.
By the end, “O(n²) on a real dataframe is the difference between shipping and not” will feel less like a warning and more like obvious hindsight.
Quick check
Quick check
Practice this in an interview
All questionspandas is slow primarily because Python loops bypass NumPy's vectorized C kernels, object-dtype columns prevent SIMD optimizations, and keeping entire datasets in memory limits scalability. The fixes are vectorization, categorical encoding, eval/query for large frames, chunking for out-of-core data, and switching to Polars or DuckDB for compute-heavy pipelines.
Choose a list when order matters and you need indexed access or duplicates. Choose a dict when you need to map keys to values and look up by key in O(1). Choose a set when you need uniqueness, fast membership testing, or set-algebra operations. Getting this choice wrong usually means either incorrect results (keeping duplicates when you needed uniqueness) or avoidable O(n) lookups.
Python sets support union, intersection, difference, and symmetric difference as both operators and methods, all running in O(min(m,n)) to O(m+n) time. They are useful for deduplication, membership testing in large collections, and computing overlaps between datasets — operations that would be expensive with lists.
pandas operates in-memory on a single machine, making it fast and simple for datasets under a few gigabytes. Spark distributes computation across a cluster, handles terabyte-scale data, and integrates with cloud storage — but adds significant overhead for small data. The crossover point is roughly when your data no longer fits in RAM or when processing time on a single machine becomes unacceptable.