What set operations does Python support, and where are they practically useful in data work?

Python sets support union, intersection, difference, and symmetric difference as both operators (|, &, -, ^) and methods. Average costs range from O(min(m, n)) for intersection to O(m + n) for union, where m and n are the sizes of the two sets. They are useful for deduplication, membership testing in large collections, and computing overlaps between datasets, all of which would be expensive with lists.
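As a minimal sketch of the four operations, assuming two made-up batches of customer IDs (the names `yesterday` and `today` are illustrative only):

```python
# Two hypothetical batches of customer IDs (illustrative data only).
yesterday = {101, 102, 103, 104}
today = {103, 104, 105}

both_days = yesterday & today    # intersection, same as yesterday.intersection(today)
either_day = yesterday | today   # union, same as yesterday.union(today)
churned = yesterday - today      # difference: seen yesterday, missing today
changed = yesterday ^ today      # symmetric difference: in exactly one batch

# Sets are unordered, so sort before printing for a stable display.
print(sorted(both_days))
print(sorted(either_day))
print(sorted(churned))
print(sorted(changed))
```

The operator forms require both operands to be sets; the method forms accept any iterable as the argument.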

How do you reverse a list and remove duplicates in Python, and what are the performance implications of each approach?

Reversing a list is O(n) whether you use slice notation or list.reverse(). Deduplication is O(n) with a set conversion but O(n²) if you check membership against a list. Understanding when order must be preserved changes which tool to reach for.
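A short sketch of the options, using a small made-up list; `dict.fromkeys` is one common way to deduplicate while keeping first-seen order on Python 3.7 and later:

```python
items = [3, 1, 3, 2, 1]

# Reversing: both forms are O(n).
reversed_copy = items[::-1]      # builds a new list, original untouched
in_place = list(items)
in_place.reverse()               # reverses in place and returns None

# Deduplicating: both forms are O(n) on average.
unordered_unique = set(items)                 # order is not guaranteed
ordered_unique = list(dict.fromkeys(items))   # keeps first-seen order

print(reversed_copy)
print(ordered_unique)
```

If the original order matters downstream, the `dict.fromkeys` form is the safer choice; a plain `set` is enough when only membership matters.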

How do list, dict, and set comprehensions work in Python, and when should you avoid them?

Comprehensions are syntactic sugar for building a new collection by iterating over an iterable and optionally filtering elements. In CPython they are usually somewhat faster than the equivalent for-loop, mainly because they skip the repeated lookup and call of a method such as list.append and add each item with a dedicated bytecode instruction. Avoid them when the expression is too complex to read at a glance; a plain loop with descriptive variable names is preferable.
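For illustration, here are the three comprehension forms next to the loop that the list version replaces. The word list is made up for the example:

```python
words = ["apple", "kiwi", "banana", "fig"]

# List comprehension with a filter.
long_words = [w.upper() for w in words if len(w) > 4]

# Dict comprehension: word -> length.
lengths = {w: len(w) for w in words}

# Set comprehension: the distinct lengths only.
distinct_lengths = {len(w) for w in words}

# The plain loop that the list comprehension above is shorthand for.
long_words_loop = []
for w in words:
    if len(w) > 4:
        long_words_loop.append(w.upper())

print(long_words)
print(lengths)
```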

What is the difference between CPU-bound and I/O-bound work, and how does the choice affect concurrency strategy in Python?

CPU-bound work keeps the processor busy the whole time — matrix multiplication, compression, parsing. I/O-bound work spends most of its time waiting for a slow external resource — network, disk, database. The distinction directly determines which concurrency primitive to reach for: multiprocessing for CPU-bound (bypasses the GIL), threading or asyncio for I/O-bound (GIL released during waits).
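A rough sketch of the I/O-bound case, assuming a hypothetical `fake_fetch` function that only sleeps to stand in for a network call. Threads can overlap the waits because the GIL is released while a thread sleeps:

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_fetch(i):
    """Hypothetical stand-in for a network call: it only sleeps, then returns."""
    time.sleep(0.05)
    return i * 2


start = time.perf_counter()
with ThreadPoolExecutor(max_workers=4) as pool:
    # pool.map returns results in the same order as the inputs.
    results = list(pool.map(fake_fetch, range(4)))
elapsed = time.perf_counter() - start

print(results)
print(f"elapsed: {elapsed:.2f}s")  # varies by machine
```

For CPU-bound work the same pattern is usually written with `concurrent.futures.ProcessPoolExecutor` instead, so that each worker runs in its own process with its own GIL.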

Python Built-ins & Their Cost — DSA

What you'll learn

The real cost of the list, dict, set, and deque operations you reach for every day

Why a membership test or an insert at the front of a list quietly turns a program quadratic

When to swap a list for a set, a dict, or a deque — and exactly what you gain

How to see the gap yourself by counting the work, not timing it

Every time you write x in my_list, Python reads the list from the front until it either finds x or runs out. Every time you write my_list.insert(0, x), Python shifts the whole list over to make room. On a short list you never notice. Put either one inside a loop over a large dataset and the program slides from fast to unusable, with no error to point at.

The fix is almost always a single data-structure swap. The trick is knowing which swap, and that comes from knowing the cost of the operations you use without thinking.

The costs worth memorising

Here are the operations that come up constantly. “Avg” means the average, amortised case — hashing has a rare worst case of O(n) when every key collides, but you do not meet it in practice.

Structure	Operation	Cost
`list`	`lst[i]` — read by position	O(1)
`list`	`lst.append(x)` — add at the end	O(1) amortised
`list`	`lst.pop()` — remove the last	O(1)
`list`	`lst.pop(0)` — remove the first	O(n)
`list`	`lst.insert(0, x)` — add at the front	O(n)
`list`	`x in lst` — membership test	O(n)
`dict`	`d[k]`, `d[k] = v`, `del d[k]`	O(1) avg
`dict`	`k in d` — membership test	O(1) avg
`set`	`x in s` — membership test	O(1) avg
`set`	`s.add(x)`, `s.discard(x)`	O(1) avg
`deque`	`dq.append(x)`, `dq.pop()`	O(1)
`deque`	`dq.appendleft(x)`, `dq.popleft()`	O(1)

One thing leaps out of that table. The list is cheap at its tail and expensive everywhere else. That single asymmetry is behind most Python performance bugs, so it is worth seeing why it is true.

Why a list is slow at the front

A Python list keeps its elements in one continuous block of memory, side by side, in order. That is what makes reading by position instant — Python knows exactly where slot i sits. But it also means there is no spare room at the front.

So when you ask to insert at position 0, Python must first move every existing element one slot to the right to open up a gap, and only then drop the new value in.

A list with a million items means a million shifts — every single time you insert at the front. pop(0) is the mirror image: remove the first, then everything shifts left.
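The shifting can be modelled by hand. This is a teaching sketch only, not how CPython implements `insert` internally; `insert_front_counting` is a hypothetical helper that moves the elements one at a time and counts the moves:

```python
def insert_front_counting(cells, value):
    """Insert value at index 0 by hand and return how many elements moved."""
    cells.append(None)                      # open one spare slot at the tail
    moves = 0
    for i in range(len(cells) - 1, 0, -1):  # walk from the tail towards the front
        cells[i] = cells[i - 1]             # slide one element a slot to the right
        moves += 1
    cells[0] = value                        # the gap at the front is now free
    return moves


cells = ["a", "b", "c", "d"]
moves = insert_front_counting(cells, "z")
print(cells, moves)
```

Four existing elements mean four moves; the count always equals the length of the list before the insert.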

append and pop() at the tail escape all of this, because nothing has to move — the action is at the open end. The list does occasionally outgrow its block and copy itself into a bigger one, but it grows in large jumps, so spreading that rare cost across all the cheap appends leaves the average at O(1). That averaged-over-many-operations figure is what amortised O(1) means.
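One way to glimpse the "large jumps" is to watch the list's reported size while appending. This sketch assumes CPython, where `sys.getsizeof` reflects the allocated block; the exact number of resizes is an implementation detail and may differ between versions:

```python
import sys

lst = []
resizes = 0
last_size = sys.getsizeof(lst)

for i in range(10_000):
    lst.append(i)
    size = sys.getsizeof(lst)
    if size != last_size:       # the block was reallocated to a bigger one
        resizes += 1
        last_size = size

print("appends:", len(lst))
print("resizes:", resizes)
```

The resize count should come out far smaller than the number of appends, which is what keeps the average cost per append flat.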

Count the work, don’t time it

The cleanest way to feel the gap is not to time it — timing changes with your machine — but to count the steps, which is exact. Let us count the comparisons a list membership test really does, and compare it with a set, on 50,000 items where the value we want sits right at the end:

```python
N = 50_000
data_list = list(range(N))
data_set  = set(range(N))
needle = N - 1                  # worst case: the very last element

list_comparisons = 0
for _ in range(500):            # repeat the lookup 500 times
    for x in data_list:
        list_comparisons += 1
        if x == needle:
            break

set_lookups = 500              # a set jumps straight there, one step each

print("list comparisons:", list_comparisons)
print("set lookups      :", set_lookups)
```

This prints:

```
list comparisons: 25000000
set lookups      : 500
```

Twenty-five million against five hundred — for the same answer. The list had to walk all 50,000 elements on every one of the 500 lookups; the set walked to the answer in one step each time. Now the same idea for inserting at the front, counting the element-shifts a list is forced to make:

```python
ITERS = 5_000

# list.insert(0, x) shifts every element already present, every time.
shifts = sum(range(ITERS))     # 0 + 1 + 2 + ... + 4999

print("list element-shifts:", shifts)
print("deque operations   :", ITERS)
```

This prints:

```
list element-shifts: 12497500
deque operations   : 5000
```

Building a 5,000-item collection from the front costs a list over twelve million element-moves; a deque, which has open ends on both sides, does it in 5,000 flat steps.
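The membership count earlier in this section modelled the set lookup as one step. As a hedged sketch, Python itself can do the counting if the elements are wrapped in a hypothetical `CountedInt` class that tallies every `==` call. The figures in the note below assume CPython's list and set implementations:

```python
class CountedInt:
    """Hypothetical wrapper that counts how often Python calls == on it."""
    eq_calls = 0

    def __init__(self, value):
        self.value = value

    def __eq__(self, other):
        CountedInt.eq_calls += 1
        return self.value == other.value

    def __hash__(self):
        return hash(self.value)


N = 1_000
as_list = [CountedInt(i) for i in range(N)]
as_set = set(as_list)
needle = CountedInt(N - 1)      # equal to the last element, but a separate object

CountedInt.eq_calls = 0
found_in_list = needle in as_list
list_eq_calls = CountedInt.eq_calls

CountedInt.eq_calls = 0
found_in_set = needle in as_set
set_eq_calls = CountedInt.eq_calls

print("list == calls:", list_eq_calls)
print("set  == calls:", set_eq_calls)
```

On CPython this should report 1,000 equality calls for the list and 1 for the set: the set hashes the needle, lands on the matching slot, and confirms with a single comparison.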

What the table tells you to reach for

These costs are not trivia to recite — they are the reason behind the standard advice you have probably already heard.

Reach for a set for membership and deduplication. Turning a list into a set costs O(n) once; after that, every “have I seen this ID?” is a flat lookup. If your question is “is this already here?”, a set is the answer.
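A minimal sketch of the "have I seen this ID?" pattern, with made-up event IDs:

```python
event_ids = ["a1", "b2", "a1", "c3", "b2", "d4"]

seen = set()
first_seen = []     # IDs in the order they first appeared
duplicates = []     # repeat sightings

for eid in event_ids:
    if eid in seen:             # O(1) on average, however large seen grows
        duplicates.append(eid)
    else:
        seen.add(eid)
        first_seen.append(eid)

print(first_seen)
print(duplicates)
```

Swapping `seen` for a list would leave the output unchanged but make each `in` check O(n), and the whole loop O(n²).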

Reach for a dict for lookups and joins. Building a {id: record} dict from a list is O(n), and every lookup by key is then flat. Joining two datasets by key means building that dict on one side first — which is exactly how hash joins work inside SQL engines.
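The join described above can be sketched as follows. The record shapes and field names (`id`, `customer_id`, and so on) are invented for the example:

```python
customers = [
    {"id": 1, "name": "Ana"},
    {"id": 2, "name": "Ben"},
]
orders = [
    {"order_id": 10, "customer_id": 2, "total": 30.0},
    {"order_id": 11, "customer_id": 1, "total": 12.5},
    {"order_id": 12, "customer_id": 3, "total": 8.0},   # no matching customer
]

# Build side: one O(n) pass to index customers by key.
by_id = {c["id"]: c for c in customers}

# Probe side: one O(1)-average lookup per order.
joined = []
for o in orders:
    customer = by_id.get(o["customer_id"])
    if customer is not None:            # inner join: drop unmatched orders
        joined.append((o["order_id"], customer["name"], o["total"]))

print(joined)
```

Total work is O(n + m) for n customers and m orders, against O(n × m) for a nested loop over both lists.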

Reach for a deque for queues and sliding windows. A deque is built for O(1) work at both ends. Any time you take from the left and add on the right, it keeps every step flat, where a list doing the same job would pay O(n) per step and O(n²) overall.
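A small sliding-window sketch using `collections.deque`; the helper name `moving_sums` is made up for the example:

```python
from collections import deque


def moving_sums(values, width):
    """Return the sum of each full window of `width` consecutive values."""
    window = deque()
    total = 0
    sums = []
    for v in values:
        window.append(v)                # add on the right: O(1)
        total += v
        if len(window) > width:
            total -= window.popleft()   # drop from the left: O(1)
        if len(window) == width:
            sums.append(total)
    return sums


sums = moving_sums([1, 2, 3, 4, 5], 3)
print(sums)
```

With a list, the `popleft()` step would become `pop(0)` and cost O(n) on every iteration.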

Practice

Quick check


Q1. You have a list of 200,000 transaction IDs and, for each of 200,000 incoming events, you check `event_id in transaction_list`. As the data grows, how does the total work grow?

Q2. Which operation on a Python list is the cheap, O(1)-amortised one?

Q3. You need a queue that constantly adds at one end and removes from the other. Which keeps both operations flat (O(1))?
