datarekha

Hash Tables & Dicts

The most important data structure for data work — how a hash function turns a key into an array index, why lookup is O(1) on average, and what happens when the math gets messy.

9 min read Intermediate Data Structures & Algorithms Lesson 14 of 32

What you'll learn

  • How a hash function maps a key to an array index so lookup is O(1) average
  • Why collisions happen and how chaining and open addressing resolve them
  • What load factor is and why a resize + rehash gives amortized O(1) insertion
  • Why keys must be hashable and immutable, and when worst-case O(n) appears

Before you start

A Python dict does not search. It does not loop. It computes the address of your value and goes there directly. That single idea — turning a key into an index — is the hash table, and it is behind almost every O(1) lookup you have ever taken for granted.

The intuition

Imagine you have a giant array of 1 000 empty slots. You want to store the value 98 under the key "alice". Instead of scanning the array for a free spot, you run a hash function on "alice", get a number — say, 437 — and put 98 at index 437. Later, to look up "alice", you hash the key again, get 437 again, and read slot 437. Two steps, regardless of whether the table has ten entries or ten million.

# Conceptually:
index = hash("alice") % len(table)
table[index] = 98

That is O(1) on average — not a search, not a loop, but a computed address. (Worst-case is O(n), as you will see in the Collisions section.)

Hashing

A hash function maps an arbitrary key to an integer. Python’s built-in hash() does this for any hashable object:

>>> hash("alice")
-3713081631934410656
>>> hash("alice") % 8   # bucket index for a table of size 8
0
>>> hash(42)
42
>>> hash((1, 2, 3))
2528502973977326415

A good hash function distributes keys uniformly across buckets, is fast to compute, and is deterministic — the same key always produces the same index. Python randomizes the hash seed per process (hash randomization, enabled since Python 3.3) so values differ between runs, but within a run the same key always hashes the same.

Keys must be hashable and immutable. Lists and dicts cannot be keys because they can change after insertion; if the key mutated the stored index would be wrong. Strings, ints, floats, and tuples (of hashable types) are all hashable. If you define a custom class, __hash__ and __eq__ must be consistent: if a == b then hash(a) == hash(b).

Collisions

Two different keys can hash to the same index — this is a collision, and it is unavoidable. By the pigeonhole principle, if you have more keys than slots, some must share. Even with fewer keys, the birthday paradox (the statistical result that in a room of 23 people, two share a birthday with 50% probability) means collisions appear earlier than you expect: in a table of 365 slots, you get a collision after roughly 23 insertions, not 183.

There are two classic strategies:

Chaining

Each bucket holds a linked list (or small array) of all entries that hash to that index. On insert, append to the list. On lookup, hash to the bucket, then scan the list for an exact key match.

bucket[3] → ["alice": 98] → ["carol": 12]   # both hash to 3

Average-case lookup: O(1 + load factor). This is because the expected number of entries in any one bucket equals the load factor — so a scan of that bucket costs proportional to it. If the load factor is bounded (say, below 1.0), the list lengths stay short.

Open addressing (linear probing)

All entries live in the array itself. On collision, probe forward (index + 1, index + 2, …) until an empty slot appears. On lookup, probe the same path until you find the key or hit an empty slot.

slot[3] = "alice"   # hashed to 3
slot[4] = "carol"   # also hashed to 3; found slot 4 empty

Open addressing is cache-friendlier than chaining (entries are contiguous) but degrades faster as the table fills because probe paths lengthen.

Python’s dict uses open addressing with a more sophisticated probing scheme (pseudo-random probing) that reduces clustering. CPython 3.6+ also preserves insertion order as a separate compact array alongside the hash table.

Load factor and resize

The load factor is simply:

load_factor = number_of_items / number_of_buckets

When load factor is low, buckets are sparse and collisions are rare. As it rises, collisions multiply and average lookup time degrades. Python’s dict triggers a resize when load factor exceeds roughly 2/3 (about 0.67). (The demo implementation below uses 0.75 — a slightly higher threshold common in many textbook implementations; the principle is the same.)

A resize doubles the number of buckets, then rehashes every existing key into the new table — because the modulus (n) changed, every index must be recomputed. Rehashing is O(n), but it happens so rarely (roughly every time the size doubles) that the amortized cost per insertion stays O(1). This is the same argument as dynamic array (list) append.

# Amortized analysis sketch:
# n insertions into a hash table that starts at size 1 and doubles:
# resize costs: 1 + 2 + 4 + 8 + ... + n ≈ 2n
# total cost over n insertions: O(n)
# per-insertion amortized cost: O(1)

Worst case: O(n)

The average-case O(1) assumes the hash function distributes keys well. If all keys hash to the same bucket — due to a pathological hash function or a deliberate collision attack — lookup degrades to O(n): you scan a list of length n. This is why Python randomizes hash seeds by default (protecting web servers from hash-flooding DoS attacks). For built-in types you rarely see the worst case. For custom __hash__ implementations that always return a constant, you will.

The tiny chaining hash map

Implement put and get in pure Python — no built-in dict allowed for the core logic — to see the mechanics clearly:

Why this matters for data work

Quick check

Quick check

0/3
Q1You insert 1 000 000 items into a Python dict one at a time. How many of those insertions trigger a full rehash of the existing contents?
Q2A colleague defines a custom class with __hash__ that always returns 42. They store 10 000 objects in a dict. What is the lookup complexity?
Q3Why can't a Python list be used as a dictionary key?
FAQCommon questions

Questions about this lesson

How does a hash table achieve O(1) lookup?

It applies a hash function to a key to compute an array index, so it jumps straight to where a value is stored instead of scanning. Average lookups, inserts, and deletes are constant time when the hash spreads keys evenly.

What is a hash collision and how is it handled?

A collision is when two keys hash to the same slot. Tables resolve it by chaining (a list at each slot) or open addressing (probing for the next free slot). Too many collisions degrade performance toward O(n).

When is a hash table the wrong choice?

When you need sorted order or range queries (use a tree), when keys aren't hashable, or when worst-case guarantees matter — its O(1) is an average, and a bad hash or adversarial input can push it to O(n).

Practice this in an interview

All questions

Sign in to track your progress

Completed lessons, your XP, level, and streak save to your account — it's free and takes a few seconds.

Explore further

Related lessons

Skip to content