Hash Tables Deep Dive
Hash tables are one of the most important data structures in software engineering. They power Python dicts, JavaScript objects, Redis, database indexes, and deduplication pipelines. Understanding how they work internally lets you reason about their performance characteristics and failure modes.
What Is a Hash Table?
A hash table maps keys to values using a hash function to compute an array index from the key. The key property is O(1) average case for insert, lookup, and delete — independent of the number of stored entries.
Internally, a hash table is an array of slots (also called buckets). To store a key-value pair:
- Compute
index = hash(key) % array_size - Store the value at
array[index]
To look up a key: compute the same index and return array[index]. One array access — O(1).
Hash Functions
A good hash function has four properties:
- Deterministic: Same key always produces the same hash.
- Uniform distribution: Distributes keys evenly across all slots to minimize collisions.
- Fast to compute: Should be O(1) or O(k) where k is the key length.
- Avalanche effect: Small change in input produces a very different hash.
# djb2 — a classic simple hash function for strings
def djb2(key: str) -> int:
h = 5381
for char in key:
h = ((h << 5) + h) + ord(char) # h * 33 + char
h &= 0xFFFFFFFF # keep 32-bit
return h
# Usage: slot = djb2("hello") % table_size
print(djb2("hello")) # 210700827
print(djb2("hello!")) # 2211506502 — very different from "hello"
hash("abc") returns a different value each process run. For security-sensitive applications, use a cryptographic hash (SHA-256) instead.
Collision Resolution
A collision occurs when two different keys hash to the same array slot. Because there are more possible keys than slots, collisions are inevitable. The two main strategies are chaining and open addressing.
Chaining
Each slot contains a linked list (or dynamic array) of key-value pairs. All keys that hash to the same slot are stored in that slot's list. Lookup traverses the list comparing keys.
Open Addressing
All keys are stored directly in the array — no linked lists. When a collision occurs, the algorithm probes the array for the next empty slot.
- Linear probing: Try
index+1,index+2, ... until an empty slot is found. - Quadratic probing: Try
index+1²,index+2²,index+3²... Reduces clustering. - Double hashing: Use a second hash function for the step size. Best distribution but more computation.
dict uses open addressing; Java's HashMap uses chaining.
Load Factor & Resizing
The load factor α = n/k, where n is the number of stored entries and k is the number of slots. As α approaches 1, collision probability increases and performance degrades.
- Open addressing: Resize (double) when α > 0.75. Above this, clustering causes dramatic slowdowns.
- Chaining: Can tolerate α > 1 (multiple items per slot), but performance degrades linearly.
Resizing allocates a new array of double the size and rehashes all entries (their slot positions change with the new array size). Rehashing is O(n), but because it happens exponentially rarely, the amortized cost per insertion is O(1).
Time Complexity
| Operation | Average Case | Worst Case |
|---|---|---|
| Insert | O(1) amortized | O(n) — all keys collide |
| Lookup | O(1) | O(n) — all keys in same chain |
| Delete | O(1) | O(n) |
| Resize/rehash | O(n) — rare | O(n) |
The worst case O(n) occurs when all keys hash to the same slot — a degenerate hash function or a hash collision attack. Modern hash tables use hash randomization (Python) or tree-ify long chains (Java 8+) to defend against this.
Language Implementations
- Python
dict— Open addressing with pseudo-random probing. Guarantees insertion-order (since Python 3.7). Per-process hash seed (PYTHONHASHSEED) for security. - JavaScript
Map— Ordered key-value map. Keys can be any type.Objectonly allows string/symbol keys and has prototype chain overhead. - Java
HashMap— Chaining with tree-ification: chains longer than 8 entries are converted to red-black trees (O(log n) worst case). Not thread-safe; useConcurrentHashMapfor concurrent access.
Real-World Uses
- Database indexes — Hash indexes provide O(1) equality lookups. B-tree indexes support range queries; hash indexes don't.
- Caches — Redis is essentially a distributed hash table. Memcached too. O(1) get/set is what makes them fast.
- Deduplication — Streaming pipelines check a hash set to detect duplicate events. O(1) per event.
- Symbol tables — Compilers store variable names and their types/locations in a hash table for O(1) lookup during parsing.
- Sets — Python
set, JavaHashSet— hash tables without values. O(1) membership test:x in my_set.