Hash tables: O(1) until they don't
A hash table promises constant-time lookup in the average case — but that asterisk hides a story about collisions, load factors, and adversarial keys that every practitioner should understand.
A Python dict with ten million entries still returns a value in roughly the same time as a dict with ten entries. Not approximately the same — provably, structurally the same, on average, by design. That is a remarkable promise, and it is not free. Behind it sits one of the most elegant data structures in computer science: the hash table. Understanding it fully — not just “it is fast” but why it is fast, and exactly when it stops being fast — is the difference between using a tool and understanding a tool.
The key insight: turn a key into an index
An array gives you O(1) access by position. If I want element 7, I go to memory address base + 7 * element_size. No traversal, no comparison — just arithmetic. The problem is that real keys are not integers from 0 to n. They are strings, timestamps, UUIDs, tuples.
A hash function bridges this gap. It takes any key and maps it to an integer. You then take that integer modulo the array size, and that becomes your slot index. The whole computation is O(1): you do a fixed number of arithmetic steps regardless of the size of the table or the number of entries.
That is the entire trick. Everything else in a hash table is engineering around what goes wrong.
What goes wrong: collisions
Two different keys can hash to the same index. This is not a bug — it is a mathematical inevitability. A hash table with 1,000 slots and any reasonable hash function will start producing collisions before you have inserted 40 keys. This is the birthday paradox (the counterintuitive probability result that says in a group of 23 people, two share a birthday with probability greater than 50%) applied to arrays.
There are two classical ways to handle collisions.
Chaining stores a linked list (or small array) at each slot. When two keys hash to the same slot, they share it as a list. Lookup walks the list until it finds the right key. In the best case — a good hash function, few collisions — that list has length 1, and lookup is O(1). In the worst case — all keys hash to the same slot — lookup is O(n) because you walk every element.
Open addressing keeps the array flat. When a slot is taken, it probes nearby slots according to a rule — linear probing (try slot + 1, + 2, + 3…), quadratic probing, or double hashing (a second hash function determines the step size). This has better cache locality than chaining because all data lives in one contiguous block of memory. CPython’s dict implementation uses a variant of open addressing with pseudo-random probing, which is why it performs well in practice.
Both approaches degrade as the table fills up. The more crowded the table, the longer the probe sequence or the longer the chain — and the further you drift from O(1).
The load factor: the number that controls everything
The load factor is the ratio of stored entries to available slots: n / m, where n is entries and m is total slots. When this ratio is 0.1, collisions are rare and lookup is nearly always one probe. When it climbs to 0.9, nearly every slot is occupied, probe sequences grow long, and you are doing a lot of extra work.
Every well-engineered hash table picks a maximum load factor and resizes before crossing it. CPython’s dict resizes when load factor exceeds two-thirds. Java’s HashMap defaults to 0.75. Resizing doubles the array (or grows by some fixed ratio), rehashes every existing entry into the new table, and resets the load factor to roughly half its trigger threshold.
Resize is expensive in isolation — it is O(n), touching every entry. But amortized over all insertions, each entry gets rehashed only O(log n) times across its lifetime in the table, which is why the amortized cost of insertion remains O(1). This is the same accounting trick that makes dynamic arrays (Python lists, Java ArrayLists) O(1) amortized despite occasional O(n) copies.
The implication for practitioners: if you pre-size a hash table when you know the entry count in advance, you pay for exactly one allocation and zero resizes. Python offers no direct API for this, but Java’s HashMap(initialCapacity) and Go’s make(map[K]V, hint) accept it. In a tight loop inserting millions of entries, this matters.
Why the hash function is the whole game
A collision is not inherently catastrophic. What is catastrophic is a hash function that consistently sends many keys to the same slot. A perfect hash function distributes keys uniformly across all slots — every slot has the same expected load, every probe sequence has the same expected length, and O(1) holds comfortably.
A bad hash function breaks this. The classic failure is using the key’s first byte or a simple sum of characters. These tend to cluster: English words front-loaded with “the,” “str,” or “get” hash to the same small set of buckets. The table works correctly but performance degrades toward O(n).
The more insidious failure is adversarial inputs. In 2003, security researchers showed that a web server using a naive hash function for query-parameter parsing could be forced into O(n) lookups by an attacker who crafted URL parameters that all hash to the same bucket — a HashDoS attack. With enough collisions, a single HTTP request could pin a CPU for seconds. This led to hash randomization: most modern runtimes (Python since 3.3, Perl since 5.18, Ruby since 1.9) add a random per-process seed to string hashing. Two runs of the same Python program will hash the string “hello” to different integers, so an attacker who probes your hash function offline cannot predict the colliding keys to send you in production.
Python’s PYTHONHASHSEED environment variable controls this. Setting it to a fixed integer (useful in reproducible tests) re-enables the vulnerability — fine for tests, catastrophic in a server that processes external input.
The distribution shape your hash function hides
Here is a useful mental model. Imagine you have a hash table with 128 slots and insert 64 keys (load factor 0.5). With a perfect hash function, each slot holds about 0.5 entries on average. A lookup probes exactly one slot most of the time.
Now imagine a biased hash function where 90% of keys land in slot 0. Slot 0 has 57 entries; the other 127 slots share 7 entries. Every lookup for any of those 57 keys is a linear scan of 57 nodes. The average lookup is suddenly not O(1) — it is O(n/2) for most keys.
The load factor does not capture this imbalance. It is a global ratio. A per-slot load histogram would reveal it, but no standard library exposes one. This is why hash function quality matters independently of load factor management: you need both uniform distribution and a managed ratio.
What Python does, specifically
CPython’s dict is an open-addressed hash table with a minimum size of 8 slots, a two-thirds load factor threshold, and a perturbation-based probing scheme that uses the full hash value (not just the low-order bits) to determine probe order. This prevents the clustering that naive linear probing causes on power-of-two-sized tables.
String hashing since Python 3.4 uses SipHash (a cryptographically strong pseudorandom function) seeded with a per-process random value. SipHash is fast enough that its cost is dominated by the key comparison, not the hash computation. It makes HashDoS infeasible for arbitrary string keys.
Integer hashing is an identity function for small integers: hash(n) == n for small n. This sounds risky — does it create clustering? In practice no, because integers used as dict keys tend to be sparse. But if you build a table keyed on consecutive integers starting at 0, the probing pattern degenerates on a power-of-two table; CPython handles this with its perturbation scheme. The probing sequence visits all slots before cycling, so you never get stuck even in the worst sequential-key scenario.
Sets are implemented identically to dicts, minus the value storage. The membership test x in some_set and x in some_dict have the same underlying cost.
The asterisk on O(1)
“O(1)” in the context of hash tables always means average case under a good hash function and managed load factor. The full picture:
| Scenario | Lookup complexity |
|---|---|
| Average case, good hash | O(1) |
| Load factor near 1, good hash | O(1) amortized but high constant |
| All keys collide (bad hash or adversarial) | O(n) |
| After resize (single operation) | O(n) |
| After resize (amortized over all insertions) | O(1) |
The worst-case theoretical complexity of a hash table lookup is O(n). This is not a theoretical footnote — it is a real attack surface that took the web ecosystem a decade to properly harden against.
The reason we call hash tables O(1) in practice is not that the worst case is impossible but that a well-engineered implementation makes the worst case require deliberate adversarial effort or genuinely pathological data. Python, Java, Ruby, Go, and Rust all make that engineering decision on your behalf. Understanding that the guarantee is probabilistic, not absolute, is what separates a practitioner from someone who just knows the textbook answer.
When to reach past a hash table
A hash table gives you fast lookup by exact key. It gives you nothing else. Iteration order, in CPython since 3.7, is insertion order — but that is an implementation commitment, not a consequence of the underlying structure. Range queries (“all keys between Monday and Friday”), prefix searches (“all strings starting with ‘get_’”), and ordered scans are not natural operations on a hash table. A balanced binary search tree (like a sorted dict or a database B-tree index) supports O(log n) range queries at the cost of O(log n) point lookup. A trie supports prefix queries at O(k) where k is key length.
The choice of data structure is always about which operations you need fast. For point lookup with arbitrary keys and no ordering requirements, a hash table is almost certainly optimal. For anything that requires ordering, it is the wrong tool regardless of how fast its lookup is.
The number worth remembering
A hash table lookup on 100 million entries takes roughly the same wall-clock time as on 100 entries — measured in nanoseconds, not milliseconds. That is what O(1) buys you. A properly maintained Python dict with one million string keys does a lookup in about 50 nanoseconds on modern hardware: two hash computations, one or two slot probes, one string comparison.
That number is the payoff for everything described above — the hash function, the load factor management, the resize strategy, the hash randomization. Every part of the machinery is in service of holding that constant flat against a growing n. When any part fails — a bad hash, a full table, adversarial keys — the number climbs. Knowing why each part exists is what lets you notice when it is failing and know which knob to turn.
The O(1) is real. So is the asterisk.