
Hashing

The previous page introduced the concept of keyed map and set structures. This page goes into much more detail about hash tables.

Hashing is the process of taking arbitrarily typed or sized data and creating fixed-width representation values called hashes, hash codes, hash values, or digests. This text prefers hash values and digests. Hashing is typically a one-way process, meaning that you cannot easily reconstruct the original data from the hash value. Sufficiently complex hashing algorithms have many common, real-world applications in data integrity, security, and cryptography. Though not strictly necessary, hashing is most useful when the algorithm produces the same hash value for all equivalent inputs. For example, if a hashing algorithm always produces the same value for equivalent inputs you can compare large data sets efficiently or ensure that a password is valid without storing it directly.
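To make the equal-inputs property concrete, the following sketch stores only a digest and later checks a candidate password against it. This is illustrative only — real systems use dedicated, salted password-hashing functions (e.g., bcrypt or Argon2), not a general-purpose hasher — and the `digest_of` helper is a name introduced here for the example:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Hashes any hashable value to a 64-bit digest.
fn digest_of<T: Hash>(value: &T) -> u64 {
    let mut hasher = DefaultHasher::new();
    value.hash(&mut hasher);
    hasher.finish()
}

fn main() {
    // Store only the digest, never the password itself.
    let stored_digest = digest_of(&"hunter2");

    // Later, validate a login attempt by comparing digests.
    assert_eq!(digest_of(&"hunter2"), stored_digest); // equivalent input, same digest
    assert_ne!(digest_of(&"hunter3"), stored_digest); // different input, different digest (with overwhelming probability)
}
```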

Hash Tables

Hash tables get their name from a computational technique called hashing. In computing, hashing represents a way to transform arbitrary data into a fixed-width value called a hash value, hash digest, or simply a digest. Hash tables, also called associative arrays, are themselves generally represented as indexed list-backed structures that rely on a three-part process that a) computes a hash for the key, b) converts that hash value into a number within the legal range of index values [0, n - 1] where n is the capacity of the table, and c) handles the case where two keys hash to the same list index. As a result, hash tables can be seen (reductively) as fancy algorithmic logic on top of an array-based list structure.

To provide a high-level example using the person objects in the previous section, you might start by hashing a key such as a person’s name to some 64-bit number. In this example, “Bobson” might hash to 5267201001224790492. From there, you could run a simple compression as hash % list_size. If your hash table contains 10 slots, the compression calculation is 5267201001224790492 % 10 = 2, so you’d insert Bobson at index 2.

+-------+-------------------------------------------+
| Index | Value                                     |
|-------+-------------------------------------------|
| 0     |                                           |
| 1     |                                           |
| 2     | {name: Bobson, age: 36, food: hamburgers} |
| 3     |                                           |
| 4     |                                           |
| 5     |                                           |
| 6     |                                           |
| 7     |                                           |
| 8     |                                           |
| 9     |                                           |
+-------+-------------------------------------------+

Collisions happen when two keys get mapped to the same index, and can be addressed with several popular collision handling schemes. For example, using the same basic hash function, “Dichael” hashes to 792525162937935370 and “Brain” hashes to 11695810160735559810. These are unique values, but running the simple compression calculation hash % 10 for each hash value produces a collision at index [0].
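Using the digests from the examples above, the division compression and the resulting collision can be checked directly:

```rust
fn main() {
    // Hash values from the examples in the text
    let bobson: u64 = 5267201001224790492;
    let dichael: u64 = 792525162937935370;
    let brain: u64 = 11695810160735559810;

    // Simple division compression into a 10-slot table
    assert_eq!(bobson % 10, 2);  // "Bobson" lands at index 2
    assert_eq!(dichael % 10, 0); // "Dichael" lands at index 0
    assert_eq!(brain % 10, 0);   // "Brain" also lands at index 0 -- a collision
}
```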

The next three sections take a deeper dive into hashing, compression, and collision handling techniques used in hash map design. Ultimately, the best hash maps balance computational efficiency with practical reliability.

Hashing

Hashing is a widely used computational technique for identifying and comparing data. It is commonly employed to transform object data into minimal, fixed-size representations for fast comparison, to generate cryptographic fingerprints, or to distribute work. In hash tables, hash functions are typically simple and fast, while being robust enough to minimize collisions and distribute values uniformly.

One simple way to illustrate hashing is with a cyclic bit shift approach. Consider the ASCII/UTF-8-encoded string Peter. The string translates to a byte sequence of 01010000 01100101 01110100 01100101 01110010 in memory. Cyclic bit shifting resembles a queue operation where you take a bit from one end and move it to the other end sequentially. In a left shift, you’d remove the most significant (leftmost) bit, shift all the remaining bits left one position, and place the removed bit at the least significant (rightmost) position. For mathematical reasons that are too complex to cover here, it is common to produce simple hashes by shifting bits five places in 32-bit hash codes, that is, performing five single-bit shifts. The result is what appears to be a random value. Consider the following visualization function for cyclically shifting 8-bit values by 5 bits:

pub fn bit_shift_visualization(value: &str) {
    for mut v in value.bytes() {
        print!("{:08b} ({}) -> ", v, v);
        v = (v << 5) | (v >> 3);
        println!("{:08b} ({v})", v);
    }
}

If you input Peter, the function performs a left bit shift on each character in the sequence:

01010000 (80) -> 00001010 (10)
01100101 (101) -> 10101100 (172)
01110100 (116) -> 10001110 (142)
01100101 (101) -> 10101100 (172)
01110010 (114) -> 01001110 (78)

To calculate a hash code with this approach you might first initialize a single 32-bit hash code value to zero, and for each 8-bit character in the string, add its bits to the base hash code value and perform a cyclic bit shift. The result is a hash digest as a 32-bit integer value. The following example uses the integer method wrapping_add() for safety because the shift may produce a value close to u32::MAX, which could cause an overflow when adding the next character in the sequence.

#[allow(clippy::manual_rotate)]
pub fn simple_hash_function(key: &str) -> u32 {
    let mut hash: u32 = 0;
    for v in key.bytes() {
        print!("{v:08b} -> ");
        hash = hash.wrapping_add(v as u32);
        hash = (hash << 5) | (hash >> 27);
        println!("{hash:032b}");
    }
    hash
}

The example function produces the following output for the phrase, “Simple hash digest”:

01010011 -> 00000000000000000000101001100000
01101001 -> 00000000000000010101100100100000
01101101 -> 00000000001010110011000110100000
01110000 -> 00000101011001100100001000000000
01101100 -> 10101100110010000100110110000000
01100101 -> 10011001000010011011110010110101
00100000 -> 00100001001101111001101010110011
01101000 -> 00100110111100110110001101100100
01100001 -> 11011110011011000111100010100100
01110011 -> 11001101100011110010001011111011
01101000 -> 10110001111001000110110001111001
00100000 -> 00111100100011011001001100110110
01100100 -> 10010001101100100111001101000111
01101001 -> 00110110010011100111011000010010
01100111 -> 11001001110011101100111100100110
01100101 -> 00111001110110011111000101111001
01110011 -> 00111011001111100011110110000111
01110100 -> 01100111110001111011111101100111

The final (output) value 01100111110001111011111101100111 is equivalent to the integer value 1741143911.

assert_eq!(simple_hash_function("Simple hash digest"), 1741143911);

Cyclic bit-shifting is a simple and convenient way to illustrate the concept of hashing, but it offers no protection against modern hash-flooding denial-of-service (HashDoS) attacks. Rust’s default hasher uses a keyed hash algorithm derived from SipHash, which incorporates a per-process random seed and ARX mixing rounds to mitigate HashDoS attacks. While not intended as a general-purpose cryptographic hash, SipHash provides a strong balance between performance and collision resistance for in-memory hash table use.

The following function uses Rust’s built-in DefaultHasher to compute a 64-bit keyed hash suitable for internal data structure usage, but not necessarily for persistent or cryptographic purposes.

use std::collections::hash_map::DefaultHasher;
use std::fmt::Debug;
use std::hash::{Hash, Hasher};

pub fn hasher<T: Hash + Debug + ?Sized>(key: &T) -> u64 {
    let mut hasher = DefaultHasher::new();
    key.hash(&mut hasher); // from Hash::hash()
    hasher.finish() // from Hasher::finish()
}
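A quick usage sketch of this pattern: within a single process, equal keys always produce equal digests, and the generic bound accepts any type implementing Hash. The function below mirrors the one above so the example is self-contained:

```rust
use std::collections::hash_map::DefaultHasher;
use std::fmt::Debug;
use std::hash::{Hash, Hasher};

// Mirrors the hasher() function above so this example stands alone.
pub fn hasher<T: Hash + Debug + ?Sized>(key: &T) -> u64 {
    let mut h = DefaultHasher::new();
    key.hash(&mut h);
    h.finish()
}

fn main() {
    // Equal keys produce equal digests within a process...
    assert_eq!(hasher("Peter"), hasher("Peter"));
    assert_eq!(hasher(&42_u32), hasher(&42_u32));
    // ...and the bound works for any Hash type, including tuples.
    assert_eq!(hasher(&("Peter", 42)), hasher(&("Peter", 42)));
}
```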

Compression

The compression operation’s primary purpose within a hash table’s hashing function is to take a fixed-width hash digest and ensure that it fits into the legal set of underlying storage indexes. For example, the last section defined functions that produce 32- and 64-bit hash digests. Naturally you wouldn’t initialize an array of 18.45 quintillion or even 4.29 billion indexes to accommodate the unsigned 64- and 32-bit digest ranges respectively. Instead you use a compression operation to reduce the digest to some number in the legal range [0, n - 1] where n is the capacity of the hash table’s backing structure. Good compression keeps the probability of two keys colliding near 1/n, where n is the underlying hash table’s capacity.

  • The Division Method: This simple method maps i mod n where i is the integer hash value and n is the capacity of the underlying storage array. You can reduce the number of collisions by setting n to some prime number, but patterns still emerge from this methodology that produce collisions.

  • The MAD Method: The multiply, add, and divide (MAD) method is c(h(k)) = ((a * h(k) + b) mod p) mod n where n is the capacity of the underlying storage array, p is a prime number >n, and a (“scale”) and b (“shift”) are random integers from the range [1, p - 1] and [0, p - 1] respectively. This method drastically reduces patterns and compression collisions.

The division method is as simple as using the modulo operator on two usize values hash % len. Implementing the MAD compression algorithm is a little more involved because you need to compute a prime number and seed the algorithm with random numbers.
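As a minimal sketch of the division method (the `division_compress` name and the prime capacity 13 are choices made for this illustration):

```rust
/// Division-method compression: maps a digest into [0, capacity - 1].
fn division_compress(digest: u64, capacity: usize) -> usize {
    (digest % capacity as u64) as usize
}

fn main() {
    let capacity = 13; // a prime capacity reduces (but doesn't eliminate) patterns
    let index = division_compress(5267201001224790492, capacity);
    // The result is always a legal index for the backing structure.
    assert!(index < capacity);
    // Compression is deterministic: equal digests map to equal indexes.
    assert_eq!(index, division_compress(5267201001224790492, capacity));
}
```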

use rand::Rng;

/// Efficient primality test using the 6k ± 1 rule for numbers >3 by trial.
fn is_prime(n: usize) -> bool {
    // Only positive numbers >1 need apply
    if n < 2 {
        return false;
    }
    // The lil guys get some love
    if n == 2 || n == 3 {
        return true;
    }
    // Skips evens and numbers divisible by 3
    if n % 2 == 0 || n % 3 == 0 {
        return false;
    }
    // 6k ± 1 rule check
    let mut i = 5;
    while i * i <= n {
        if n % i == 0 || n % (i + 2) == 0 {
            return false;
        }
        i += 6;
    }
    true
}

/// Finds the next prime greater than `n` by testing odd candidates only
pub fn next_prime(n: usize) -> usize {
    if n < 2 {
        return 2;
    }
    if n == 2 {
        return 3;
    }
    let mut candidate = if n % 2 == 0 { n + 1 } else { n + 2 }; // Ensure candidate is odd
    while !is_prime(candidate) {
        candidate += 2; // Skip even numbers
    }
    candidate
}

/// Illustrates MAD compression as `[(ai + b) mod p] mod N`
/// Relies on `is_prime` and `next_prime` functions
/// `c(h(k)) = ((a * h(k) + b) mod p) mod N`
pub fn mad_compression(key: usize, len: usize) -> usize {
    // Finds a prime much larger than len to ensure even spread
    let p = next_prime(len.pow(3));
    let mut rng = rand::rng(); // Thread-local RNG
    let a = rng.random_range(1..=p - 1);
    let b = rng.random_range(0..=p - 1);
    // Use overflow-safe functions
    (key.wrapping_mul(a))
        .wrapping_add(b)
        .wrapping_rem(p)
        .wrapping_rem(len)
}

Collision Handling Schemes

A collision occurs when a hash (or combined hash-and-compression) function h() produces the same value for two different keys k1 and k2, that is, h(k1) == h(k2). This section describes several ways to handle collisions specifically for map implementations that use hash tables where the backing structure is some indexable list A with a capacity of N and size n. The size and capacity of the map provide an important metric for the behavior of the map. The ratio λ = n/N is called the load factor of the map. For operational (temporal) efficiency reasons discussed later, load factors should ideally be < 1, with some algorithmic sweet spots managing load factors no greater than .5. That is, ideally you’ve built a base structure with a larger capacity than the number of insertions, but because hash tables contain logic for handling collisions it is common to use the map’s load factor to determine when to grow the backing list.

Separate Chaining

The simplest way to handle collisions is for each index, called a bucket, in the backing list A to contain a secondary unordered list of entries that is no longer than the greatest number of collisions in the map. In a properly distributed hash table, each bucket contains on average n/N elements, but individual bucket sizes may vary, with the worst case being O(n) if all elements hash to the same bucket. The use of secondary lists for each index is called separate chaining. Insert, get, and search operations in each bucket are expected O(1 + λ), or simply O(λ). It is possible for λ to exceed 1, but temporal efficiency for operations decreases significantly at higher load factors. That means structures that use separate chaining are incentivized to accept higher spatial overhead and hashing schemes that reduce the structure’s load factor, in order to keep bucket operations close to O(1) rather than degrading toward plain O(n) list scans. Experiments with real-world data indicate that you should maintain λ < .9, and Java’s implementation keeps λ < .75.
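A minimal separate-chaining sketch, using Vec-of-Vec buckets and a deliberately simple byte-sum hash (illustrative only — the `ChainedTable` type and its toy hash are assumptions of this example, not the text’s earlier functions):

```rust
/// A toy separate-chaining table: each bucket is an unordered Vec of entries.
struct ChainedTable {
    buckets: Vec<Vec<(String, u32)>>,
}

impl ChainedTable {
    fn new(capacity: usize) -> Self {
        ChainedTable { buckets: vec![Vec::new(); capacity] }
    }

    // Deliberately simple byte-sum hash plus division compression.
    fn bucket_for(&self, key: &str) -> usize {
        let hash: usize = key.bytes().map(|b| b as usize).sum();
        hash % self.buckets.len()
    }

    fn insert(&mut self, key: &str, value: u32) {
        let b = self.bucket_for(key);
        // Update in place if the key already exists in the bucket's chain.
        if let Some(entry) = self.buckets[b].iter_mut().find(|e| e.0 == key) {
            entry.1 = value;
        } else {
            self.buckets[b].push((key.to_string(), value));
        }
    }

    fn get(&self, key: &str) -> Option<u32> {
        let b = self.bucket_for(key);
        self.buckets[b].iter().find(|e| e.0 == key).map(|e| e.1)
    }
}

fn main() {
    let mut table = ChainedTable::new(10);
    table.insert("Bobson", 36);
    table.insert("Dichael", 42);
    assert_eq!(table.get("Bobson"), Some(36));
    assert_eq!(table.get("Dichael"), Some(42));
    assert_eq!(table.get("Brain"), None);
}
```

Colliding keys simply coexist in the same bucket’s chain, which is why λ can exceed 1 in chained tables.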

Open Addressing

This section covers several approaches that use a single data structure to implement hash tables, collectively known as open addressing schemes. Instead of each index in array A[] containing a separate list as in chaining approaches, open addressing flattens the structure and follows some logic to find the next open index in which to place the entry. Finding available indexes is called probing. Probing is implemented with wraparound logic such that if you reach the end of the backing structure’s indexes the probe wraps around to the beginning of the structure. You need to apply the same probing logic to inserts as you do to get, search, and remove operations to ensure that you’re addressing the correct set of collisions. Depending on the probing logic this can create clustering of data, which degrades the asymptotic behavior of operations similar to chaining implementations with long secondary lists. Open addressing schemes require a load factor of at most 1 because there are no secondary storage structures. However, to reap the beneficial temporal characteristics of hash tables it is recommended to maintain λ < .5 in implementations.

  • Linear probing simply tries the next index as A[(h(k) + 1) mod N]. The modulo operator provides circular probing logic. That is to say that if you have a structure A[] where N == 13 and you have a collision with h(k) == 12, the probe wraps the structure’s indexes circularly such that the next check is A[0]. If that index is not empty the algorithm simply moves to the next until it finds an open index. Linear probing has a tendency to create clusters of entries, which can slow search and put operations considerably as the load factor increases. This becomes evident as clusters start to overlap and you suddenly must search through a much larger number of entries to find the desired entry. Linear probing also complicates removals since the search/find operations rely on contiguous ranges of indexes to define a group of collisions. Instead of emptying out an index you need to use some sort of placeholder, called a defunct or sentinel entry, to maintain contiguous groupings of collisions for search/get operations. Linear probing can result in spatial limitations for maps with many inserts and deletes as each index can only hold at most one entry during its lifetime.

  • Quadratic probing is similar to linear probing, but each jump is A[(h(k) + f(i)) mod N] where i >= 0 and f(i) = i^2. This helps alleviate the clustering of linear probing, but still produces something called secondary clustering. If (and only if) N is prime and the load factor is less than 50% quadratic probing is guaranteed to find an empty slot.

  • Double hashing expands the linear probing algorithmic logic A[(h(k) + f(i)) mod N] where i >= 0 but now f(i) = h’(k) where h’() is some secondary hashing function that cannot evaluate to zero. One common choice is h’(k) = q - (k mod q) for some prime number q < N. Similar to other hash table implementations N should be prime.

  • Pseudorandom addressing is another way to avoid clustering, where the iterations are controlled by some pseudorandom number generator. This provides repeatable but apparently arbitrary sequences of probing steps that depend on the bits of the original hash code.
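The linear probing scheme described above can be sketched as follows. This is a toy with a byte-sum hash and no removal/tombstone handling (and it assumes the table never fills completely); the `ProbingTable` name is introduced for this example:

```rust
/// A toy linear-probing table backed by a single Vec of optional entries.
struct ProbingTable {
    slots: Vec<Option<(String, u32)>>,
}

impl ProbingTable {
    fn new(capacity: usize) -> Self {
        ProbingTable { slots: vec![None; capacity] }
    }

    // Deliberately simple byte-sum hash plus division compression.
    fn home_slot(&self, key: &str) -> usize {
        let hash: usize = key.bytes().map(|b| b as usize).sum();
        hash % self.slots.len()
    }

    fn insert(&mut self, key: &str, value: u32) {
        let n = self.slots.len();
        let mut i = self.home_slot(key);
        // Probe forward circularly until finding the key or an empty slot.
        loop {
            match &self.slots[i] {
                Some((k, _)) if k != key => i = (i + 1) % n, // occupied by another key
                _ => break, // empty slot, or same key (update in place)
            }
        }
        self.slots[i] = Some((key.to_string(), value));
    }

    fn get(&self, key: &str) -> Option<u32> {
        let n = self.slots.len();
        let mut i = self.home_slot(key);
        for _ in 0..n {
            match &self.slots[i] {
                Some((k, v)) if k == key => return Some(*v),
                None => return None, // an empty slot ends the probe sequence
                _ => i = (i + 1) % n, // keep probing through the cluster
            }
        }
        None
    }
}

fn main() {
    let mut table = ProbingTable::new(7);
    table.insert("ab", 1); // "ab" and "ba" share a byte-sum hash: forced collision
    table.insert("ba", 2); // probes circularly to the next open slot
    assert_eq!(table.get("ab"), Some(1));
    assert_eq!(table.get("ba"), Some(2));
    assert_eq!(table.get("zz"), None);
}
```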

Growing the Structure

When your load factor becomes too high you may want to grow the backing structure of your hash table. Simply copying the structure to a new memory location will not alleviate the operational characteristics of clustered data from the previous storage location. You do not necessarily need to re-hash each key, but you do need to run each hash through another round of compression (and probing for open addressing schemes) to ensure even distribution in the new backing structure. Just like with dynamic arrays it is common to double the base structure when you reach the maximum allowable load factor.

fn grow(&mut self) {
    // Create a new base vector with with_capacity() and resize_with() to
    // ensure all indexes exist; otherwise you could write to an index that
    // doesn't exist, causing a panic.
    // NOTE: Vec::resize_with() may result in a "hidden allocation" despite a
    // description that suggests the function resizes "in place"; it
    // initializes new_base with all None values.
    let new_capacity = hash_lib::next_prime(self.data.len() * 2);
    let mut new_base: Vec<Option<Entry<K, V>>> = Vec::with_capacity(new_capacity);
    new_base.resize_with(new_capacity, || None);
    let mut new_ctrl: Vec<u8> = Vec::with_capacity(new_capacity);
    new_ctrl.resize_with(new_capacity, || 0x00);
    println!("Growing from {} to {}", self.data.len(), new_capacity);
    // Reset the instance values for MAD compression
    // Finds a prime > len, starting much larger to ensure even spread
    self.prime = hash_lib::next_prime(new_capacity * 2);
    let mut rng = rand::rng(); // Thread-local RNG
    self.scale = rng.random_range(1..=self.prime - 1);
    self.shift = rng.random_range(0..=self.prime - 1);
    // Move entries from self.data into new_base
    // For each Some entry in the old table, calculate the new location
    // via hash/compression and insert the entry into new_base[location]
    for e in &mut self.data {
        if let Some(v) = e.take() {
            // MAD compression algorithm
            let bucket: usize = (hash_lib::hash(&v.key))
                .wrapping_mul(self.scale)
                .wrapping_add(self.shift)
                % self.prime
                % new_capacity;
            // Handle potential collisions with quadratic probing from the
            // home bucket: (bucket + i^2) mod N for i = 1, 2, 3, ...
            let mut location = bucket;
            let mut i: usize = 1;
            while new_base[location].is_some() {
                location = (bucket + i.pow(2)) % new_capacity;
                i += 1;
            }
            new_base[location] = Some(v);
            new_ctrl[location] = 0x01;
        }
    }
    // Update the struct instance to the new collections
    self.data = new_base;
    self.ctrl = new_ctrl;
}

Removing Elements

Removing elements in a separate chaining approach is easy since each bucket is a simple list; when buckets are linked lists, entries can be removed without much effort. Removing elements in an open addressing approach is a little more complicated. The probing logic of open addressing schemes relies on every element within the initial collision grouping, so you cannot simply remove values. You also want to avoid O(n) shift operations for entry removals. Instead you need to mark the index as “defunct,” which then acts as a kind of sentinel for subsequent find, insert, and remove operations.

This text’s example implementation uses an array mask to provide metadata about each entry. The concept is handled much more elegantly in places like the C++ or Rust standard library, but the idea is basically the same. Create a ctrl list of byte values that contains predetermined markers for empty, occupied, and defunct slots. This example implementation uses 0x00, 0x01, and 0x80 respectively.

pub struct ProbingHashTable<K, V> {
    data: Vec<Option<Entry<K, V>>>,
    ctrl: Vec<u8>,
    prime: usize,
    scale: usize,
    shift: usize,
    size: usize,
    entries: usize,
}

impl<K, V> ProbingHashTable<K, V>
where
    K: Clone + Debug + Hash + PartialEq,
    V: Clone + PartialEq,
{
    // ...
    pub fn remove(&mut self, key: K) -> Option<Entry<K, V>> {
        let hash = Self::hash(&key);
        let bucket = self.compress(hash);
        let location = self.find_index(bucket, &key);
        let mut entry = None;
        if location >= 0 {
            // TODO: Figure out how to remove clone()
            // Cannot use take() because that leaves the index
            // as None which breaks probing logic
            entry = self.data[location as usize].clone();
            self.ctrl[location as usize] = 0x80; // Mark with tombstone
            self.entries -= 1; // Only decrement when an entry was actually removed
        }
        entry
    }
    // ...
}

Examples

One practical use case for maps is statistical analysis of a collection. For example, if you have a text where each word is a collection of characters (outside of the typical punctuation) you can use a map to determine frequency of occurrence. This is more efficiently done with maps rather than a list of key-value pairs because each lookup and update runs in O(1) on average instead of the O(n) scan a list requires.
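A minimal word-frequency sketch using the standard library’s HashMap (the punctuation handling is kept deliberately simple for illustration):

```rust
use std::collections::HashMap;

/// Counts word occurrences in a text, ignoring case and basic punctuation.
fn word_frequency(text: &str) -> HashMap<String, usize> {
    let mut counts = HashMap::new();
    for word in text.split_whitespace() {
        // Strip non-alphanumeric characters and normalize case.
        let cleaned: String = word
            .chars()
            .filter(|c| c.is_alphanumeric())
            .collect::<String>()
            .to_lowercase();
        if !cleaned.is_empty() {
            // entry() returns the existing slot or inserts 0, then we increment.
            *counts.entry(cleaned).or_insert(0) += 1;
        }
    }
    counts
}

fn main() {
    let counts = word_frequency("the quick brown fox jumps over the lazy dog. The end.");
    assert_eq!(counts.get("the"), Some(&3));
    assert_eq!(counts.get("fox"), Some(&1));
    assert_eq!(counts.get("cat"), None);
}
```

Each word requires only an average O(1) hash-and-lookup, so counting runs in roughly linear time over the input.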

Comparison

So which structure should you choose to implement your map? Well, like most things around here, it depends.

Hashing implementations do not allow for sorted collections. Separate chaining maintains an edge in temporal efficiency at high load factors, but open addressing is more spatially efficient and ideal for hash table implementations with low load factors. Open addressing also gains a practical edge, despite identical asymptotic behavior, from cache locality and the lack of pointer jumping to secondary lists at randomized heap locations.

+----------------------+----------------------+----------------------+------------------------------------------------------------------+
| Structure            | Search               | Insert/Update/Delete | Notes                                                            |
|----------------------+----------------------+----------------------+------------------------------------------------------------------|
| Hash table           | O(1) avg, O(n) worst | O(1) avg, O(n) worst | Not spatially efficient; unsorted, so poor for processing sorted |
|                      |                      |                      | entries or range queries; may require periodic O(n) resizing     |
| Sorted (array) list  | O(log n)             | O(n)                 | Simple, cache-friendly, computationally simpler than hashing;    |
|                      |                      |                      | may require periodic O(n) resizing operations                    |
| Skip list            | O(log n)             | O(log n)             | Handles sorted collections; good for concurrent operations; does |
|                      |                      |                      | not utilize cache locality due to pointer jumps                  |
| BST (AVL, red-black) | O(log n)             | O(log n)             | Guaranteed O(log n) operations; not great for concurrent         |
|                      |                      |                      | operations due to rotations                                      |
+----------------------+----------------------+----------------------+------------------------------------------------------------------+