DSA

Hash Tables Deep Dive

● Intermediate ⏱ 30 min read DSA

Hash tables are one of the most important data structures in software engineering. They power Python dicts, JavaScript objects, Redis, database indexes, and deduplication pipelines. Understanding how they work internally lets you reason about their performance characteristics and failure modes.

What Is a Hash Table?

A hash table maps keys to values using a hash function to compute an array index from the key. The key property is O(1) average case for insert, lookup, and delete — independent of the number of stored entries.

Internally, a hash table is an array of slots (also called buckets). To store a key-value pair:

Compute index = hash(key) % array_size
Store the value at array[index]

To look up a key: compute the same index and return array[index]. One array access — O(1).

Hash Functions

A good hash function has four properties:

Deterministic: Same key always produces the same hash.
Uniform distribution: Distributes keys evenly across all slots to minimize collisions.
Fast to compute: Should be O(1) or O(k) where k is the key length.
Avalanche effect: Small change in input produces a very different hash.

python

# djb2 — a classic simple hash function for strings
def djb2(key: str) -> int:
    h = 5381
    for char in key:
        h = ((h << 5) + h) + ord(char)  # h * 33 + char
        h &= 0xFFFFFFFF  # keep 32-bit
    return h

# Usage: slot = djb2("hello") % table_size
print(djb2("hello"))   # 210700827
print(djb2("hello!"))  # 2211506502 — very different from "hello"

💡

Python's Hash Randomization Python 3.3+ randomizes string hashes with a per-process seed (PYTHONHASHSEED) to prevent hash collision attacks. This means hash("abc") returns a different value each process run. For security-sensitive applications, use a cryptographic hash (SHA-256) instead.

Collision Resolution

A collision occurs when two different keys hash to the same array slot. Because there are more possible keys than slots, collisions are inevitable. The two main strategies are chaining and open addressing.

Chaining

Each slot contains a linked list (or dynamic array) of key-value pairs. All keys that hash to the same slot are stored in that slot's list. Lookup traverses the list comparing keys.

Open Addressing

All keys are stored directly in the array — no linked lists. When a collision occurs, the algorithm probes the array for the next empty slot.

Linear probing: Try index+1, index+2, ... until an empty slot is found.
Quadratic probing: Try index+1², index+2², index+3² ... Reduces clustering.
Double hashing: Use a second hash function for the step size. Best distribution but more computation.

✅

Chaining vs Open Addressing Chaining is simpler and handles high load factors better. Open addressing has better cache performance (all data is in one array) and uses less memory per entry at low load factors. Python's dict uses open addressing; Java's HashMap uses chaining.

Load Factor & Resizing

The load factor α = n/k, where n is the number of stored entries and k is the number of slots. As α approaches 1, collision probability increases and performance degrades.

Open addressing: Resize (double) when α > 0.75. Above this, clustering causes dramatic slowdowns.
Chaining: Can tolerate α > 1 (multiple items per slot), but performance degrades linearly.

Resizing allocates a new array of double the size and rehashes all entries (their slot positions change with the new array size). Rehashing is O(n), but because it happens exponentially rarely, the amortized cost per insertion is O(1).

Time Complexity

Operation	Average Case	Worst Case
Insert	O(1) amortized	O(n) — all keys collide
Lookup	O(1)	O(n) — all keys in same chain
Delete	O(1)	O(n)
Resize/rehash	O(n) — rare	O(n)

The worst case O(n) occurs when all keys hash to the same slot — a degenerate hash function or a hash collision attack. Modern hash tables use hash randomization (Python) or tree-ify long chains (Java 8+) to defend against this.

Language Implementations

Python dict — Open addressing with pseudo-random probing. Guarantees insertion-order (since Python 3.7). Per-process hash seed (PYTHONHASHSEED) for security.
JavaScript Map — Ordered key-value map. Keys can be any type. Object only allows string/symbol keys and has prototype chain overhead.
Java HashMap — Chaining with tree-ification: chains longer than 8 entries are converted to red-black trees (O(log n) worst case). Not thread-safe; use ConcurrentHashMap for concurrent access.

Real-World Uses

Database indexes — Hash indexes provide O(1) equality lookups. B-tree indexes support range queries; hash indexes don't.
Caches — Redis is essentially a distributed hash table. Memcached too. O(1) get/set is what makes them fast.
Deduplication — Streaming pipelines check a hash set to detect duplicate events. O(1) per event.
Symbol tables — Compilers store variable names and their types/locations in a hash table for O(1) lookup during parsing.
Sets — Python set, Java HashSet — hash tables without values. O(1) membership test: x in my_set.

← Previous

Big-O Complexity Guide

All Guides →