Hash Tables & Dicts

What you'll learn

How a hash function maps a key to an array index so lookup is O(1) on average

Why collisions are unavoidable, and how chaining resolves them

What load factor is, and why an occasional resize keeps insertion amortised O(1)

Why keys must be immutable, and when the worst case O(n) appears

A Python dict never searches for your value. It never loops. It computes the address of the value and goes straight there.

That single move — turning a key into an index — is the hash table, and it sits underneath almost every O(1) lookup you have ever taken for granted. Let us see how a key becomes an address.

Computing an address, not searching for one

Imagine a row of, say, eight empty slots, and you want to store the value 98 under the key "alice". Instead of hunting for a free slot, you feed "alice" to a hash function, which returns some large number; you take that number modulo 8 to land on a slot — say slot 3 — and drop 98 there. Later, to look "alice" up, you hash the key again, get the same slot 3, and read it. Two steps, whether the table holds ten entries or ten million.

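table = [None] * 8                # a row of eight empty slots
index = hash("alice") % 8         # within one run, the same key always lands on the same slot
table[index] = 98
print(table[hash("alice") % 8])   # look up: hash the key again and read that slot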

That is the heart of it: not a search, not a loop, but a computed address — O(1) on average. (The “on average” hides a wrinkle we will get to.)

“alice” and “carol” happen to hash to the same slot (a collision), so slot 3 keeps a short chain of both.

Collisions are unavoidable

A hash function turns any key into an integer; Python’s built-in hash() does it for any hashable object, which includes immutable built-ins such as strings, numbers, and tuples of hashable items. But two different keys can land on the same slot, which is called a collision, and you cannot escape them. With more keys than slots, the pigeonhole principle forces sharing. Even with room to spare, collisions arrive sooner than intuition says: by the same maths as the birthday paradox, a table of 365 slots is more likely than not to have seen a collision after only 23 keys, not 183.
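The birthday figure can be checked directly. This is a minimal sketch, assuming every key lands on a uniformly random slot; the function name collision_probability is chosen here for illustration only.

```python
def collision_probability(num_keys, num_slots):
    """Chance that at least two of num_keys uniformly placed keys share a slot."""
    p_all_distinct = 1.0
    for i in range(num_keys):
        # The next key must avoid the i slots that are already occupied.
        p_all_distinct *= (num_slots - i) / num_slots
    return 1.0 - p_all_distinct


for n in (10, 22, 23, 50):
    print(n, round(collision_probability(n, 365), 3))
```

Under that uniformity assumption, 23 is the first key count at which the probability passes one half.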

The common cure is chaining: each slot holds a little list of all the entries that landed there. To insert, you append to that slot’s list; to look up, you hash to the slot and scan its short list for an exact key match. As long as the lists stay short, lookup stays close to O(1).
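Chaining can be sketched in a few lines. The example below is illustrative only: it uses small integer keys because CPython hashes a small non-negative integer to itself, which makes the collision predictable, and it skips duplicate-key handling for brevity. The names NUM_SLOTS, insert and lookup are assumptions made for this sketch.

```python
NUM_SLOTS = 8
slots = [[] for _ in range(NUM_SLOTS)]   # one (initially empty) chain per slot


def insert(key, value):
    # Append to the chain of whichever slot the key hashes to.
    slots[hash(key) % NUM_SLOTS].append((key, value))


def lookup(key):
    # Scan only this slot's chain, looking for an exact key match.
    for k, v in slots[hash(key) % NUM_SLOTS]:
        if k == key:
            return v
    raise KeyError(key)


insert(3, "first")
insert(11, "second")   # 11 % 8 == 3, so this collides with key 3
print(slots[3])        # both entries share one chain
print(lookup(11))
```

The lookup never touches the other seven slots; its cost is the length of one chain.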

Load factor, and the occasional resize

How short the chains stay depends on the load factor — the number of items divided by the number of slots. Low load factor means sparse slots and rare collisions; as it climbs, chains lengthen and lookups slow. So when the load factor crosses a threshold (Python’s dict uses about two-thirds), the table resizes: it allocates a bigger array and rehashes every existing key into it, because the slot count changed and so every index must be recomputed.
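To see why a resize forces every index to be recomputed, here is a small sketch with integer keys (chosen because CPython hashes small non-negative integers to themselves). The helper name layout is hypothetical.

```python
keys = [3, 11, 19, 27]


def layout(capacity):
    """Map each key to its slot in a table with the given number of slots."""
    return {k: hash(k) % capacity for k in keys}


for capacity in (8, 16):
    load_factor = len(keys) / capacity
    print(f"capacity={capacity}, load factor={load_factor}, slots={layout(capacity)}")
```

With 8 slots all four keys share slot 3; with 16 slots they split between slots 3 and 11, so entries cannot simply be copied across at their old indices.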

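Rehashing is O(n), but it happens only when the table doubles, which is roughly log₂ n times over n insertions. Because each resize handles about twice as many items as the one before, the total rehashing work is a geometric sum (n + n/2 + n/4 + …) that stays below 2n. Spread that cost across all the cheap inserts between resizes and the average insert is O(1). It is the same amortised argument as a growing list.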
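The amortised claim can be checked by counting. The sketch below does not model CPython's real growth policy; it assumes a table that starts with 4 slots and doubles whenever the load factor exceeds 0.75, and the name simulate_inserts is made up for this illustration.

```python
def simulate_inserts(n, capacity=4, max_load=0.75):
    """Count resizes and rehashed items over n insertions into a doubling table."""
    size = 0
    resizes = 0
    moved = 0   # total items rehashed across all resizes
    for _ in range(n):
        size += 1
        if size / capacity > max_load:
            moved += size        # every stored item is rehashed
            capacity *= 2
            resizes += 1
    return resizes, moved, capacity


n = 1_000_000
resizes, moved, capacity = simulate_inserts(n)
print(f"resizes={resizes}, items rehashed={moved}, per insert={moved / n:.2f}")
```

Under these assumptions a million inserts trigger only a couple of dozen resizes, and the rehashing work averages out to fewer than two moves per insert.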

Here is a tiny chaining hash map, built from scratch, so the mechanics are in plain view:

class HashMap:
    def __init__(self, capacity=4):
        self.capacity = capacity
        self.buckets = [[] for _ in range(capacity)]
        self.size = 0

    def put(self, key, value):
        bucket = self.buckets[hash(key) % self.capacity]
        for i, (k, _) in enumerate(bucket):
            if k == key:
                bucket[i] = (key, value)      # update existing key
                return
        bucket.append((key, value))           # new key
        self.size += 1
        if self.size / self.capacity > 0.75:  # too full — grow and rehash
            self._resize()

    def get(self, key, default=None):
        for k, v in self.buckets[hash(key) % self.capacity]:
            if k == key:
                return v
        return default

    def _resize(self):
        old = self.buckets
        self.capacity *= 2
        self.buckets = [[] for _ in range(self.capacity)]
        self.size = 0
        for bucket in old:
            for k, v in bucket:
                self.put(k, v)                # rehash into the bigger table

m = HashMap(capacity=4)
for i, word in enumerate(["apple", "banana", "cherry", "date", "elderberry", "fig"]):
    m.put(word, i * 10)
    print(f"put {word:11} → size={m.size}, capacity={m.capacity}")

print(m.get("cherry"))
print(m.get("missing", "NOT FOUND"))

put apple       → size=1, capacity=4
put banana      → size=2, capacity=4
put cherry      → size=3, capacity=4
put date        → size=4, capacity=8
put elderberry  → size=5, capacity=8
put fig         → size=6, capacity=8

20
NOT FOUND

Watch the fourth insert: adding date makes 4 items in 4 slots, a load factor of 1.0, which is past the 0.75 threshold, so the table grows from 4 slots to 8 and rehashes everything. After that, there is room again and inserts go quietly back to cheap.

The worst case — and why data work leans on this anyway

The O(1) promise assumes the hash spreads keys evenly. If every key lands in one slot, whether through a broken hash function or a deliberate flood of colliding keys, the table degenerates into a single list and lookup becomes O(n). (This is why Python randomises the hashes of strings and bytes with a per-process seed: it defends web servers from attackers crafting mass collisions.) With built-in types you essentially never see it; with a custom __hash__ that returns a constant, you will.
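The degenerate case is easy to provoke on purpose. In this sketch, BadKey is a hypothetical class whose hash is constant, and the counter records how many equality comparisons a single dict lookup needs; the exact count is a CPython implementation detail, so treat the number as indicative.

```python
class BadKey:
    """Hypothetical key type with a constant hash, so every instance collides."""
    eq_calls = 0

    def __init__(self, n):
        self.n = n

    def __hash__(self):
        return 42

    def __eq__(self, other):
        BadKey.eq_calls += 1
        return isinstance(other, BadKey) and self.n == other.n


N = 1000
d = {BadKey(i): i for i in range(N)}   # building this is already quadratic

BadKey.eq_calls = 0
value = d[BadKey(N - 1)]               # look up with a fresh but equal key
print(f"value={value}, comparisons for one lookup={BadKey.eq_calls}")
```

With a well-spread hash the same lookup would need only a handful of comparisons; here the count grows with the number of stored keys.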

Practice

Quick check


Q1. You insert 1,000,000 items into a Python dict one at a time. Roughly how many of those insertions trigger a full rehash of the existing contents?

Q2. A custom class defines __hash__ to always return 42. You store 10,000 of these objects in a dict. What is the lookup cost?

Q3. Why can't a Python list be used as a dictionary key?

Questions about this lesson

How does a hash table achieve O(1) lookup?

It applies a hash function to a key to compute an array index, so it jumps straight to where a value is stored instead of scanning. Average lookups, inserts, and deletes are constant time when the hash spreads keys evenly.

What is a hash collision and how is it handled?

A collision is when two keys hash to the same slot. Tables resolve it by chaining (a list at each slot) or open addressing (probing for the next free slot). Too many collisions degrade performance toward O(n).
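For contrast with chaining, open addressing can be sketched as follows. This is a deliberately minimal illustration of linear probing, assuming a fixed capacity with no resizing and no deletion; the class name ProbingTable and its methods are hypothetical.

```python
class ProbingTable:
    """Fixed-capacity open-addressing table using linear probing."""

    _EMPTY = object()   # sentinel marking a slot that has never been used

    def __init__(self, capacity=8):
        self.capacity = capacity
        self.slots = [self._EMPTY] * capacity

    def _probe(self, key):
        # Start at the hashed slot, then step to the next slot, wrapping around.
        start = hash(key) % self.capacity
        for step in range(self.capacity):
            yield (start + step) % self.capacity

    def put(self, key, value):
        for i in self._probe(key):
            entry = self.slots[i]
            if entry is self._EMPTY or entry[0] == key:
                self.slots[i] = (key, value)
                return i            # report where the entry ended up
        raise RuntimeError("table is full")

    def get(self, key, default=None):
        for i in self._probe(key):
            entry = self.slots[i]
            if entry is self._EMPTY:
                return default      # an empty slot ends the search
            if entry[0] == key:
                return entry[1]
        return default


t = ProbingTable(capacity=8)
print(t.put(3, "a"))    # lands on its hashed slot
print(t.put(11, "b"))   # 11 % 8 == 3 is taken, so it moves to the next free slot
print(t.get(11))
```

Each slot holds at most one entry, so a collision costs extra probes along the array instead of a longer list.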

When is a hash table the wrong choice?

When you need sorted order or range queries (use a tree), when keys aren't hashable, or when worst-case guarantees matter — its O(1) is an average, and a bad hash or adversarial input can push it to O(n).
