Associative Structures
Maps, sometimes called dictionaries, are abstract data structures that store arbitrarily typed key-value pairs called entries. The advantage of a map over a simple list of entries is the efficient lookup functionality that maps provide. In a map, you use an entry’s key to look up its associated data.
Maps can be implemented as either unsorted or sorted (by key) variants. Maps use keying schemes to define things like equivalence and ordering of keys, and can be adapted for an incredibly wide range of use cases. This text covers several of the most popular map implementations, using hash tables for unsorted maps and search trees for sorted maps. Each approach carries specific, sometimes mutually exclusive characteristics that provide differing asymptotic bounds.
Generally, the goal of unsorted maps with hashing is to provide fast O(1) lookups for arbitrarily-typed keys. In order to do this, hash tables necessarily scatter the entries for good distribution within a backing structure that is necessarily larger than the number of entries. This keying scheme results in what appears to be randomly distributed entries, and makes performing ranged operations impossible. The result is that hash maps are relatively spatially inefficient, but provide optimal lookup bounds. Sorted maps backed by performant search trees require slightly longer O(log(n)) lookups, but provide spatial efficiency, sorted ordering, and the ability to identify ranges of values or elements close to a specified key.
To illustrate the nature of key-value associations and access, consider the following pseudocode set of person objects that include name, age, and favorite food. These objects are used as example data throughout the rest of the page.
{name: Brain, age: 37, food: tapioca}
{name: Dichael, age: 22, food: ice cream}
{name: Bobson, age: 36, food: hamburgers}
If you want to know how old Bobson is, you could put all the objects into a list, and scan the names until you find the person you’re looking for. From there you can easily access any of the other fields for that person object, but getting there is expensive, necessarily requiring O(n) time.
To reduce lookup times you could use classical arrays and the idea of direct addressing via indexes in a lookup table. In a direct indexing scheme the keys represent an index in the array. In this way each of the array’s integer index values can act directly as a key to the value stored at that index. The only problem is that now you have to remember the position of the object you want to access, which carries no logical association with the data you’re accessing.
+-------+-------------------------------------------+
| Index | Person object                             |
|-------+-------------------------------------------|
| 1     | {name: Brain, age: 37, food: tapioca}     |
| 2     | {name: Dichael, age: 22, food: ice cream} |
| 3     | {name: Bobson, age: 36, food: hamburgers} |
+-------+-------------------------------------------+
Maps/dictionaries allow you to directly associate an arbitrarily-typed key with an arbitrarily-complex value for O(1) access to the value without relying on a secondary association. Continuing the person object example, you could use the person’s name as the key and set the value to an object containing all the personal details without having to store any extra positional information about where the data is located. This key-value association is the superpower of maps/dictionaries.
+---------+-----------------------------+
| Key     | Value                       |
|---------+-----------------------------|
| Brain   | {age: 37, food: tapioca}    |
| Dichael | {age: 22, food: ice cream}  |
| Bobson  | {age: 36, food: hamburgers} |
+---------+-----------------------------+
Sorted variants allow you to order keys in some defined way depending on the implementation. In this example, the map is sorted lexicographically by key.
+---------+-----------------------------+
| Key     | Value                       |
|---------+-----------------------------|
| Bobson  | {age: 36, food: hamburgers} |
| Brain   | {age: 37, food: tapioca}    |
| Dichael | {age: 22, food: ice cream}  |
+---------+-----------------------------+
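As a quick illustration of the key-value association above, the following sketch uses Rust's standard library maps rather than any of the implementations discussed later on this page: HashMap is the unsorted variant and BTreeMap is the sorted one, and the value is simplified to an (age, food) tuple.

use std::collections::{BTreeMap, HashMap};

fn main() {
    let mut people: HashMap<&str, (u32, &str)> = HashMap::new();
    people.insert("Brain", (37, "tapioca"));
    people.insert("Dichael", (22, "ice cream"));
    people.insert("Bobson", (36, "hamburgers"));

    // O(1) expected lookup by key, with no positional bookkeeping required
    assert_eq!(people.get("Bobson"), Some(&(36, "hamburgers")));

    // The sorted variant keeps keys in lexicographic order
    let sorted: BTreeMap<_, _> = people.into_iter().collect();
    let keys: Vec<&str> = sorted.keys().copied().collect();
    assert_eq!(keys, vec!["Bobson", "Brain", "Dichael"]);
}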
The Map ADT
At its core the map ADT looks very similar to any other collection ADT. It includes functions to insert, remove, and iterate over members of the data structure in various ways.
- size(): Returns the number of entries in the map
- is_empty(): Returns a Boolean that indicates whether the map is empty
- get(k): Returns the value v associated with key k
- put(k, v): Adds entry (k, v), overwriting any value v associated with an existing key k; returns the old value
- remove(k): Removes the entry (k, v) associated with key k
- key_set(): Returns an iterable collection of all keys in the map
- values(): Returns an iterable collection of all values in the map, including repeats for multiple key-value associations
- entry_set(): Returns an iterable collection of all (k, v) entries in the map
- contains(k): Returns a Boolean that indicates whether the map contains key k; used to disambiguate the presence of a key with a null/None value
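For reference, the operations above might be sketched as a Rust trait like the following. The trait and method signatures are illustrative assumptions, not the implementation used later in this text.

// A hedged sketch of the map ADT as a Rust trait
pub trait Map<K, V> {
    fn size(&self) -> usize;
    fn is_empty(&self) -> bool {
        self.size() == 0
    }
    fn get(&self, k: &K) -> Option<&V>;
    // Inserts (k, v), returning the previous value for k if one existed
    fn put(&mut self, k: K, v: V) -> Option<V>;
    fn remove(&mut self, k: &K) -> Option<V>;
    fn contains(&self, k: &K) -> bool;
    fn key_set(&self) -> Vec<&K>;
    fn values(&self) -> Vec<&V>;
    fn entry_set(&self) -> Vec<(&K, &V)>;
}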
Hash Tables
One of the most popular real-world map designs uses a two-part algorithm for insertion consisting of hashing and hash compression to transform an arbitrarily typed key into a deterministic locator within the hash table’s backing structure. Hash tables, also called associative arrays, are inherently unsorted outside of potentially tracking insertion order. The reasons will become more obvious through further exploration of the hash-based approach. Due to fast O(1) lookups, hash tables are still probably the most popular way to implement maps.
In a hash table it is common to start with some set of positions, commonly realized as indexes of an array-like structure. The insert logic requires that you start with a hashable key to produce a fixed-width hash digest with a hashing function. The hashing function is ideally simple and easy to compute, but must also be robust enough to produce sufficiently unique hash digests. With the hashed key you then run a compression function that takes the digest and converts it to some number in [0, N - 1], where N is the capacity of the table.
Following the original example, you’d start with the hashable value “Bobson”. A simple hash function might hash “Bobson” to the 64-bit number 5267201001224790492. From there, you could run a simple compression that computes the index value as hash % list_size, and insert the information at the computed index. Hopefully this clearly illustrates why the backing structure must necessarily be larger than the number of entries, and why the entries within the map appear at seemingly random indexes.
+-------+-------------------------------------------+
| Index | Value                                     |
|-------+-------------------------------------------|
| 0     |                                           |
| 1     |                                           |
| 2     | {name: Bobson, age: 36, food: hamburgers} |
| 3     |                                           |
| 4     |                                           |
| 5     |                                           |
| 6     |                                           |
| 7     |                                           |
| 8     |                                           |
| 9     |                                           |
+-------+-------------------------------------------+
Keen observers may have already noticed that two different hash digests may result in the same index for simple compression algorithms. Collisions happen when two keys get mapped to the same index, and can be addressed with several popular collision handling schemes. For example, using the same basic hash function, “Dichael” hashes to 792525162937935370 and “Brain” hashes to 11695810160735559810. These are unique values, but running a simple compression algorithm such as hash % list_size produces a collision at index [0]. Ultimately, the best hash maps balance computational efficiency with practical reliability.
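A quick check of the collision described above, using the quoted digests as literals and assuming a backing list of capacity 10, as in the table above:

// Compress the digests quoted above with hash % list_size for a
// capacity-10 backing list (the capacity is assumed from the table above)
let list_size: u64 = 10;
assert_eq!(5267201001224790492u64 % list_size, 2);  // "Bobson" lands at index [2]
assert_eq!(792525162937935370u64 % list_size, 0);   // "Dichael" lands at index [0]
assert_eq!(11695810160735559810u64 % list_size, 0); // "Brain" collides at index [0]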
The next three sections take a deeper dive into hashing, compression, and collision handling of a hash map implementation.
Map Hashing Function
The hash function can return a value that does not have to fit within the capacity of the map and can even be negative. The hash function should avoid collisions as much as possible given your design criteria. There are several “obvious” ways to produce hash codes, like taking the literal value of a key’s bit representation or computing polynomial products of a key’s bits, but a particularly robust approach is a cyclic bit shift (bitwise rotation) that produces seemingly random (but deterministic) values.
To illustrate the cyclic bit shift approach, consider the ASCII/UTF-8-encoded string Peter. The string translates to the byte sequence 01010000 01100101 01110100 01100101 01110010 in memory. To cyclically shift a byte’s bits left by one place, you remove the most significant (left-most) bit and place it in the least significant (right-most) bit position. For mathematical reasons that are too complex to cover here, it is common to shift bits by 5 places in 32-bit hash codes. To shift left by 5 bits you repeat the process of moving the most significant bit into the least significant position five times sequentially. The result is what appears to be a random value. Consider the following visualization function for cyclically shifting 8-bit values by 5 bits:
pub fn bit_shift_visualization(value: &str) {
    for mut v in value.bytes() {
        print!("{:08b} ({}) -> ", v, v);
        // Cyclic left shift (rotation) of an 8-bit value by 5 places
        v = (v << 5) | (v >> 3);
        println!("{:08b} ({v})", v);
    }
}
If you input Peter you get the following shifts:

01010000 (80)  -> 00001010 (10)
01100101 (101) -> 10101100 (172)
01110100 (116) -> 10001110 (142)
01100101 (101) -> 10101100 (172)
01110010 (114) -> 01001110 (78)
To calculate a hash code with this approach you might first initialize a single 32-bit hash code value to zero and loop through each character in the string. Each loop iteration might perform a cyclic bit shift on the hash code, then add the 8-bit value of the character to it. The result is a well-scattered hash code as a 32-bit integer value. Depending on the length of the input this approach could easily overflow the 32-bit hash code, so you might add arithmetic integer overflow wrapping logic to keep the computer happy. This example illustrates a hash function that accounts for such overflow potential:
pub fn hash_function(key: &str) -> u32 {
    let mut h: u32 = 0;
    for v in key.bytes() {
        print!("{:08b} -> ", v);
        // Cyclic left shift (rotation) of the running 32-bit hash by 5 places
        h = (h << 5) | (h >> 27);
        // Wrapping add prevents a panic on integer overflow in debug builds
        h = h.wrapping_add(v as u32);
        println!("{:032b}", h);
    }
    h
}
The print statements illustrate that the input Peter does not require the 32-bit hash code itself to wrap, but larger strings might cause overflow.

01010000 -> 00000000000000000000000001010000
01100101 -> 00000000000000000000101001100101
01110100 -> 00000000000000010100110100010100
01100101 -> 00000000001010011010001011100101
01110010 -> 00000101001101000101110100010010
The end result is that the function hashes Peter to the binary value 101001101000101110100010010, which is 87317778 in decimal:

let v = hash_function("Peter");
assert_eq!(v, 87317778);
For a more generic implementation you can also use Rust’s built-in hashing functionality. This example illustrates taking a reference to a type T and using Rust’s default hasher to return a 64-bit digest. The T to be hashed must implement the Hash and Debug traits, and need not be Sized (hence the ?Sized bound).
use std::collections::hash_map::DefaultHasher;
use std::fmt::Debug;
use std::hash::{Hash, Hasher};

pub fn hasher<T: Hash + Debug + ?Sized>(key: &T) -> u64 {
    let mut hasher = DefaultHasher::new();
    key.hash(&mut hasher); // from Hash::hash()
    let digest = hasher.finish(); // from Hasher::finish()
    digest
}
Compression
The compression function’s primary purpose is to take a hash digest and ensure that it fits into the legal set of underlying storage indexes. For example, the last section defined a toy hash function that produced 32-bit hash digests. Naturally you wouldn’t initialize an array of 4.29 billion indexes to accommodate the unsigned 32-bit digest range. Instead you use a compression function to reduce the number to some [0, N - 1] range of values. This section covers two popular approaches to compression functions. Good compression results in a collision probability of around 1/N.
- The Division Method: This simple method maps a hash i to i mod N, where i is the integer hash value and N is the capacity of the underlying storage array. You can reduce the number of collisions by setting N to some prime number, but patterns still emerge from this simple methodology that produce collisions.
- The MAD Method: The multiply, add, and divide (MAD) method computes c(h(k)) = ((a * h(k) + b) mod p) mod N, where N is the capacity of the underlying storage array, p is a prime number > N, and a (“scale”) and b (“shift”) are random integers from the ranges [1, p - 1] and [0, p - 1] respectively. This method drastically reduces patterns and compression collisions.
The division method is super simple to implement. This example mixes u64 and usize for workflow convenience and specificity: the Vec::len() function returns usize, but the actual hash code is u64.
pub fn compression_0(key: u64, len: usize) -> u64 {
    key % len as u64
}
Implementing the MAD compression algorithm is a little more involved because you need to compute a prime number and seed the algorithm with random numbers. Implementation details are available in this project’s GitHub repo.
/** Implements MAD compression as `[(ai + b) mod p] mod N`
    Relies on `is_prime` and `next_prime` functions not detailed here */
pub fn compression(key: u64, len: usize) -> u64 {
    // Finds a prime much larger than len to seed the algorithm
    let p = next_prime(len.pow(3)) as u64;

    let mut rng = rand::rng(); // Thread-local RNG
    let a = rng.random_range(1..=p - 1);
    let b = rng.random_range(0..=p - 1);

    // Computing the raw expression may cause overflow:
    // (((key * a) + b) % p) % len

    // Apply wrapping * and + to prevent overflow
    let wrapped_value = (key.wrapping_mul(a))
        .wrapping_add(b)
        .wrapping_rem(p)
        .wrapping_rem(len as u64);

    wrapped_value
}
Collision Handling Schemes
Collisions occur when a hash (or combined hash and compression) function h() produces the same code for two different keys k1 and k2, that is, h(k1) == h(k2). This section describes several different ways to handle collisions specifically for map implementations that use hash tables where the backing structure is some indexable list A with a capacity of N and size n. The size and capacity of the map provide an important metric for the behavior of the map. The ratio λ = n/N is called the load factor of the map. For operational (temporal) efficiency reasons discussed later, load factors should ideally be < 1, with some algorithmic sweet spots managing load factors no greater than .5. That is, ideally you’ve built a base structure with a larger capacity than the number of insertions, but because hash tables contain logic for handling collisions it is common to use the map’s load factor to determine when to grow the backing list.
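A minimal sketch of how a load factor check might gate growth of the backing list; the helper names and the 0.5 threshold are assumptions for illustration:

// λ = n/N, where n is the number of entries and N is the capacity
fn load_factor(entries: usize, capacity: usize) -> f64 {
    entries as f64 / capacity as f64
}

// Returns true if inserting one more entry would push λ past the threshold
fn should_grow(entries: usize, capacity: usize) -> bool {
    load_factor(entries + 1, capacity) > 0.5
}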
Separate Chaining
The simplest way to handle collisions is for each index, called a bucket, in the backing list A to contain a secondary unordered list of entries. In a properly distributed hash table, each bucket contains on average n/N elements, but individual bucket sizes may vary, with the worst case being O(n) if all elements hash to the same bucket. The use of secondary lists for each index is called separate chaining. Insert, get, and search operations in each bucket run in O(1 + λ), or simply O(λ), expected time. It is possible for λ to reach higher than 1, but temporal efficiency for operations decreases significantly at higher load factors. That means structures that use separate chaining are incentivized to accept higher spatial overhead and use hashing schemes that keep the load factor low, so that operations stay close to O(1) rather than degrading toward the O(n) scan of a plain list. Experiments with real-world data indicate that you should maintain λ < .9, and Java’s implementation keeps λ < .75.
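The following is a hedged sketch of the separate chaining idea, where each bucket is a Vec of (key, value) pairs. The struct and method names are assumptions, and the compression is the simple division method applied to Rust's DefaultHasher output.

use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

pub struct ChainedMap<K, V> {
    buckets: Vec<Vec<(K, V)>>, // one chain per bucket
    size: usize,
}

impl<K: Hash + PartialEq, V> ChainedMap<K, V> {
    pub fn new(capacity: usize) -> Self {
        let mut buckets = Vec::with_capacity(capacity);
        buckets.resize_with(capacity, Vec::new);
        Self { buckets, size: 0 }
    }

    // Hash the key with std's DefaultHasher, then compress with division
    fn bucket_index(&self, key: &K) -> usize {
        let mut h = DefaultHasher::new();
        key.hash(&mut h);
        (h.finish() % self.buckets.len() as u64) as usize
    }

    // Expected O(1 + λ): hash, compress, then scan only one bucket's chain
    pub fn put(&mut self, key: K, value: V) -> Option<V> {
        let i = self.bucket_index(&key);
        for entry in &mut self.buckets[i] {
            if entry.0 == key {
                return Some(std::mem::replace(&mut entry.1, value));
            }
        }
        self.buckets[i].push((key, value));
        self.size += 1;
        None
    }

    pub fn get(&self, key: &K) -> Option<&V> {
        let i = self.bucket_index(key);
        self.buckets[i].iter().find(|(k, _)| k == key).map(|(_, v)| v)
    }

    // λ = n / N
    pub fn load_factor(&self) -> f64 {
        self.size as f64 / self.buckets.len() as f64
    }
}

fn main() {
    let mut m = ChainedMap::new(8);
    m.put("Bobson", 36);
    m.put("Brain", 37);
    assert_eq!(m.get(&"Bobson"), Some(&36));
}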
Open Addressing
This section covers several approaches that use a single data structure to implement hash tables, collectively known as open addressing schemes. Instead of each index in array A[] containing a separate list like in chaining approaches, open addressing flattens the structure and simply follows some logic to find the next open index to stick the entry. Finding available indexes is called probing. Probing is implemented with some wraparound logic such that if you get to the end of the backing structure’s indexes the probe wraps around to the beginning of the structure. You need to apply the same probing logic to inserts as you do with get, search, and remove operations to ensure that you’re addressing the correct set of collisions. Depending on the probing logic this can create clustering of data, which degrades the asymptotic behavior of operations similarly to chaining implementations with long secondary lists. Open addressing schemes require a load factor of at most 1 because there are no secondary storage structures. However, to reap the beneficial temporal characteristics of hash tables it is recommended to maintain λ < .5 in implementations.
- Linear probing simply tries the next index as A[(h(k) + 1) mod N]. The modulo operator provides circular probing logic. That is to say that if you have a structure A[] where N == 13 and you have a collision with h(k) == 12, the probe wraps the structure’s indexes circularly such that the next check is A[0]. If that index is not empty the algorithm simply moves to the next until it finds an open index (see the probing sketch after this list). Linear probing has a tendency to create clusters of entries which can slow search and put operations considerably as the load factor increases. This is evident as clusters start to overlap and all of a sudden you must search through a much larger number of entries to find the desired entry. Linear probing also complicates removals since the search/find operations rely on contiguous ranges of indexes to define a group of collisions. Instead of emptying out an index you need to use some sort of placeholder, called a defunct or sentinel entry, to maintain contiguous groupings of collisions for search/get operations. Linear probing can result in spatial limitations for maps with many inserts and deletes as each index can only hold at most one entry during its lifetime.
- Quadratic probing is similar to linear probing, but each jump is A[(h(k) + f(i)) mod N] where i >= 0 and f(i) = i^2. This helps alleviate the clustering of linear probing, but still produces something called secondary clustering. If (and only if) N is prime and the load factor is less than 50%, quadratic probing is guaranteed to find an empty slot.
- Double hashing expands the linear probing algorithmic logic A[(h(k) + f(i)) mod N] where i >= 0, but now f(i) = i * h’(k), where h’() is some secondary hashing function that cannot evaluate to zero. One common choice is h’(k) = q - (k mod q) for some prime number q < N. Similar to other hash table implementations, N should be prime.
- Pseudorandom addressing is another way to avoid clustering, where the iterations are controlled by some pseudorandom number generator. This provides repeatable but apparently arbitrary sequences of probing steps that depend on the bits of the original hash code.
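The probe sequences described above can be sketched as small helper functions. The names are assumptions, and the double-hash variant applies h'() to the already-computed hash for illustration.

// Probe position for step i, starting from a home slot of hash in a table of capacity n
fn linear_probe(hash: usize, i: usize, n: usize) -> usize {
    (hash + i) % n
}

fn quadratic_probe(hash: usize, i: usize, n: usize) -> usize {
    (hash + i * i) % n
}

fn double_hash_probe(hash: usize, i: usize, n: usize, q: usize) -> usize {
    // h'(k) = q - (k mod q) for some prime q < n; never evaluates to zero
    let h2 = q - (hash % q);
    (hash + i * h2) % n
}

fn main() {
    // Probing from a home slot of 12 in a capacity-13 table, as in the example above
    let (hash, n, q) = (12, 13, 7);
    for i in 0..4 {
        println!(
            "i={i}: linear={} quadratic={} double={}",
            linear_probe(hash, i, n),
            quadratic_probe(hash, i, n),
            double_hash_probe(hash, i, n, q)
        );
    }
}

Starting from the home slot 12, the linear sequence wraps circularly to 0, 1, 2, ..., matching the wraparound example above.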
Growing the Structure
When your load factor becomes too high you may want to grow the backing structure of your hash table. Simply copying the structure to a new memory location will not fix the clustering carried over from the previous storage. You do not necessarily need to re-hash each key, but you do need to run each hash through another round of compression (and probing for open addressing schemes) to ensure even distribution in the new backing structure. Just like with dynamic arrays, it is common to double the capacity of the base structure when you reach the maximum allowable load factor.
fn grow(&mut self) {
    // Create a new base vector with with_capacity() and resize_with() to ensure
    // all indexes exist, otherwise you could push to an index that doesn't
    // exist causing a panic
    // NOTE: Vec::resize_with() may result in "hidden allocation" despite a description
    // that indicates that the function resizes "in place"; initializes
    // new_base with all None values
    let new_capacity = hash_lib::next_prime(self.data.len() * 2);
    let mut new_base: Vec<Option<Entry<K, V>>> = Vec::with_capacity(new_capacity);
    new_base.resize_with(new_capacity, || None);
    let mut new_ctrl: Vec<u8> = Vec::with_capacity(new_capacity);
    new_ctrl.resize_with(new_capacity, || 0x00);

    println!("Growing from {} to {}", self.data.len(), new_capacity);

    // Reset the instance values for MAD compression
    // Finds a prime > len, starting much larger to ensure even spread
    self.prime = hash_lib::next_prime(new_capacity * 2);
    let mut rng = rand::rng(); // Thread-local RNG
    self.scale = rng.random_range(1..=self.prime - 1);
    self.shift = rng.random_range(0..=self.prime - 1);

    // Move entries from self.data into new_base
    // For each Some entry in old table, calculate new location
    // via hash/compression and insert the entry into new_base[location]
    for e in &mut self.data {
        if let Some(v) = e.take() {

            // MAD compression algorithm
            let mut location: usize = ((hash_lib::hash(&v.key))
                .wrapping_mul(self.scale as usize))
                .wrapping_add(self.shift)
                % (self.prime)
                % (new_capacity);

            // Handle potential collisions for new vec insert
            let mut i: usize = 0;
            while new_base[location].is_some() {
                location = (location + i.pow(2)) % new_capacity;
                i += 1;
            }
            new_base[location] = Some(v);
            new_ctrl[location] = 0x01;
        }
    }

    // Update the struct instance to the new collections
    self.data = new_base;
    self.ctrl = new_ctrl;
}
Removing Elements
Removing elements in a separate chaining approach is easy: each bucket is a simple list, and entries can be removed from a linked list without much effort. Removing elements in an open addressing approach is a little more complicated. The probing logic of open addressing schemes relies on the contiguous run of entries created during collision handling, so you cannot simply empty an index. You also want to avoid O(n) shift operations for entry removals. Instead you need to mark the index as “defunct”, which then acts as a kind of sentinel for subsequent find, insert, and remove operations.
This text’s example implementation uses an array mask to provide metadata about each entry. The idea is much more elegantly handled in places like the C++ or Rust standard library, but the idea is basically the same: create a ctrl list of byte values that contain predetermined markers for empty, occupied, and defunct slots. This example implementation uses 0x00, 0x01, and 0x80 respectively.
pub struct ProbingHashTable<K, V> {
    data: Vec<Option<Entry<K, V>>>,
    ctrl: Vec<u8>,
    prime: usize,
    scale: usize,
    shift: usize,
    size: usize,
    entries: usize,
}

impl<K, V> ProbingHashTable<K, V>
where
    K: Clone + Debug + Hash + PartialEq,
    V: Clone + PartialEq,
{

    ...

    pub fn remove(&mut self, key: K) -> Option<Entry<K, V>> {
        let hash = Self::hash(&key);
        let bucket = self.compress(hash);
        let location = self.find_index(bucket, &key);

        let mut entry = None;
        if location >= 0 {
            //TODO Figure out how to remove clone()
            // Cannot use take() because that leaves the index
            // as None which breaks probing logic
            entry = self.data[location as usize].clone();
            //entry = self.data[location as usize].take();
            self.ctrl[location as usize] = 0x80; // Mark with tombstone
            // Only decrement the count when an entry was actually removed
            self.entries -= 1;
        }
        entry
    }

    ...

}
Sorted Map ADT
The sorted map variant extends the map with the following operations:
- first_entry(): Returns the entry with the smallest key, or null/None if the map is empty
- last_entry(): Returns the entry with the largest key, or null/None if the map is empty
- ceiling(k): Returns the smallest key that is >= k, or null/None if no such key exists
- floor(k): Returns the largest key that is <= k, or null/None if no such key exists
- upper(k): Returns the smallest key that is strictly > k, or null/None if no such key exists
- lower(k): Returns the largest key that is strictly < k, or null/None if no such key exists
- sub_map(k1, k2): Returns an iterable range of all entries with keys >= k1 and < k2
Sorted maps are good for, you guessed it, search operations. One example is using timestamps as keys and performing operations on elements before, after, or within a range of timestamps. Another convenient example is lexicographically sorted tuple key-value pairs such as timestamped flight records. Sorted maps are also useful in maintaining maxima sets. Take for example a set of entries where the keys are “cost” and the values are “performance”. You can write a function like best(cost) that returns the best performance available at or below a given cost.
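A hedged sketch of the best(cost) idea using std's BTreeMap as the sorted map. This assumes dominated entries (higher cost with equal or lower performance) are pruned on insertion, so the floor entry for a given cost is the best available; the function and sample values are illustrative.

use std::collections::BTreeMap;

// Largest key <= cost (a floor-style query), found in O(log n)
fn best(map: &BTreeMap<u32, u32>, cost: u32) -> Option<(&u32, &u32)> {
    map.range(..=cost).next_back()
}

fn main() {
    let mut m = BTreeMap::new();
    m.insert(100, 250); // cost -> performance
    m.insert(300, 600);
    m.insert(500, 710);
    assert_eq!(best(&m, 350), Some((&300, &600)));
    assert_eq!(best(&m, 50), None);
}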
When implementing a sorted map you can use a vanilla list structure and just call it a map because who cares. Resize, add, and remove operations require O(n) time. The first/last operations require O(1), and get/ceiling/floor/higher/lower operations take O(log n) when using a binary search algorithm. The submap operation requires O(s + log n) where s represents the number of items included. On the other hand, you can implement sorted maps with skip lists that result in O(log n) queries and O((1 + r) log n) adds/updates where r is the number of removed entries.
Skip Lists
You can make a sorted map with a regular old index-based list, but it still may require O(n) time for update operations due to shifts, and linked lists don’t allow for fast searches. To get around this you can use either a binary search tree (BST) or a skip list. Both BSTs and skip lists provide O(log n) expected (on average) time bounds for search, insert, update, and delete operations. The choice between a skip list and a BST comes down to two trade-offs: BSTs have better cache locality, but they are difficult to implement for concurrent operations. If you need a safe, concurrent sorted map, a skip list is usually the easier structure to reach for.
Skip lists are a collection of lists S0 to Sh. List S0 contains all entries in the skip list with keys sorted by some logical methodology as well as two sentinel entries representing keys lower and higher than all possible keys. Visually you can imagine the sentinel entries as -∞ and +∞. The magic of a skip list relies on the fact that each subsequent list contains some random number of elements of the base list S0, also between two sentinel entries -∞ and +∞. Each subsequent list contains roughly half the number of entries as the feeder list, with the specific entries (indexes) chosen at random. If list S0 for map M contains 10 entries (plus two sentinel entries), list S1 might contain 5 entries, S2 might contain 3 entries, S3 might contain 1 entry, and S4 would contain just the sentinel entries.
To visualize skip list operations you can start by layering each list S0 through Sh as sorted lists from left to right with S0 on the bottom under S1 through Sh at the top where h represents the height of the skip list. The result is a kind of grid where each horizontal row is a list Si that represents a level and each column of randomly chosen entries forms a tower.
The heart of the skip list is the skip_search(k) operation, which finds the position of the largest key that is less than or equal to key k. The skip search operation starts at the “top left” position of the imaginary grid of layered lists S0 up to Sh, which is the sentinel node of Sh represented by key -∞. From there it alternates two steps. If there is no lower list to drop into, the search terminates and returns the current position. Otherwise, the operation drops down a level within the same tower, to Sh-1, and then scans forward to the right-most position p on that level where key(p) <= k. The operation repeats these steps until it reaches S0 and settles on the position p of the largest key that is <= k.
Algorithm:
skip_search(k):
    p = head of highest list
    while lower_list(p) != null       // Drop to lower list
        p = lower_list(p)
        while k >= key(next(p)) do    // Scan forward
            p = next(p)
    return p
Create a skip list
Algorithm:
insert():
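The standard approach finds the insertion position with skip_search, inserts the entry into the bottom list S0, and then repeatedly flips a coin to decide how many levels the new entry's tower should span. The following pseudocode is a hedged sketch in the style of skip_search above; helper names such as insert_after_above, above, prev, add_empty_level, and coin_flip are assumptions rather than part of this text's implementation.

insert(k, v):
    p = skip_search(k)                       // Position of the largest key <= k in S0
    q = insert_after_above(p, null, (k, v))  // Insert the new entry into S0
    i = 0
    while coin_flip() == heads do            // Grow the tower with probability 1/2
        i = i + 1
        if i >= h then                       // Add a new top level of sentinels if needed
            h = h + 1
            add_empty_level()
        while above(p) == null do            // Scan back to a position with a taller tower
            p = prev(p)
        p = above(p)                         // Climb one level
        q = insert_after_above(p, q, (k, v)) // Link the new tower entry above q
    n = n + 1                                // Update the entry count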
Examples
One practical use case for maps is statistical analysis of a collection. For example, if you have a text where each word is a collection of characters (outside of the typical punctuation), you can use a map to determine each word’s frequency of occurrence. This is more efficiently done with a map rather than a list of key-value pairs because each word’s counter can be found and updated in O(1) expected time instead of scanning the entire list for every word.
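A minimal word-frequency sketch using std's HashMap; the punctuation handling is deliberately simplified.

use std::collections::HashMap;

fn word_frequency(text: &str) -> HashMap<String, usize> {
    let mut counts = HashMap::new();
    for word in text.split_whitespace() {
        // Strip punctuation and normalize case before counting
        let cleaned: String = word
            .chars()
            .filter(|c| c.is_alphanumeric())
            .collect::<String>()
            .to_lowercase();
        if !cleaned.is_empty() {
            // O(1) expected lookup and update per word
            *counts.entry(cleaned).or_insert(0) += 1;
        }
    }
    counts
}

fn main() {
    let counts = word_frequency("the cat sat on the mat");
    assert_eq!(counts["the"], 2);
    assert_eq!(counts["mat"], 1);
}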
Sorted maps

A related example is tuple associations: (key) : (value) records such as the timestamped flight data mentioned earlier, keyed by date. Because the same date key can appear more than once (see 27031977 below), this is naturally represented as a sorted multimap:

13101972 : (SUMU, SCEL, 40, 45, 29, 16)
27031977 : (KHAM, GCLP, 234, 248, 248, 0)
27031977 : (KLAX, GCLP, 380, 396, 335, 61)
12081985 : (RJTT, RJOO, 509, 524, 520, 4)
21121988 : (EGLL, KJFK, 243, 259, 259, 0)
Comparison
So which structure should you choose to implement your map? Well, like most things around here, it depends.
Hashing implementations do not allow for sorted collections. Separate chaining maintains an edge in temporal efficiency at high load factors, but open addressing is more spatially efficient and ideal for hash table implementations with low load factors. Open addressing also has a practical edge even where the asymptotic bounds are the same, thanks to cache locality and the lack of pointer jumping to secondary lists at randomized heap locations.
Structure | Search | Insert/Update/Delete | Notes |
---|---|---|---|
Hash table | O(1) avg, O(n) worst | O(1) avg, O(n) worst | Not spatially efficient; unsorted, so not good for processing sorted entries or range queries; may require periodic O(n) resizing |
Sorted (array) list | O(log n) | O(n) | Simple, cache-friendly, computationally simpler than hashing; may require periodic O(n) resizing operations |
Skip list | O(log n) | O(log n) | Handles sorted collections, good for concurrent operations, does not utilize cache locality due to pointer jumps |
BST (AVL, red-black) | O(log n) | O(log n) | Guaranteed O(log n) operations, not great for concurrent operations due to rotations |
Sets
A set is an unordered collection of unique elements (no duplicates) that supports efficient membership checks (and frequently some way to iterate through the members). Sets typically consist of single members, e.g. not key:value pairs.
Generally speaking, you can implement set, multiset, and multimap structures with the same basic underlying mechanisms used for implementing maps, though sets generally operate on single members instead of entries. Indeed the map, multimap, set, and multiset are all implemented here with the same underlying hash-based logic with open addressing. Sorted variants can be implemented with more appropriate backing structures. For example, the sorted map and sorted set can be implemented with either a self-balancing tree or, commonly, a skip list for situations that require concurrent operations.
The ADTs have slightly different operations, based on their usage.
The Set ADT
The set type contains the following fundamental operations:
- add(e): Adds the element e to the set
- remove(e): Removes the element e from the set
- contains(e): Returns a Boolean indicating whether the set contains element e
Additionally, you may consider using the fundamental methods to implement the following operations for common use cases:
- iterator(): One of the most common uses of a set is to iterate through its elements
- union(T) -> Iter: A union operation that returns an iterator over all unique borrowed values of both self and T
- extend(T): A union operation that mutates S such that S contains all unique values of S + T
- intersection(T) -> Iter: An intersection operation that returns an iterator over all borrowed values in both self and T
- retain_all(T) (intersect(T)): An intersection operation that mutates S such that S only keeps elements that are also in T
- difference(T) -> Self: A subtraction operation that returns a new set with only elements that are not in T
- subtract(T): A subtraction operation that mutates S such that S only contains elements that are not in T
- retain(|predicate|): A filter operation that mutates S to only contain values that adhere to the given predicate (lambda)
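Rust's std::collections::HashSet already exposes most of these operations directly; a brief sketch:

use std::collections::HashSet;

fn main() {
    let s: HashSet<i32> = [1, 2, 3, 4].into_iter().collect();
    let t: HashSet<i32> = [3, 4, 5].into_iter().collect();

    // union, intersection, and difference return iterators over borrowed values
    let union: HashSet<i32> = s.union(&t).copied().collect();
    let intersection: HashSet<i32> = s.intersection(&t).copied().collect();
    let difference: HashSet<i32> = s.difference(&t).copied().collect();

    assert_eq!(union.len(), 5);
    assert_eq!(intersection, HashSet::from([3, 4]));
    assert_eq!(difference, HashSet::from([1, 2]));

    // retain() is the in-place filter operation described above
    let mut s = s;
    s.retain(|&x| x % 2 == 0);
    assert_eq!(s, HashSet::from([2, 4]));
}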
The Sorted Set ADT
- first(): Returns the smallest element of sorted set S.
- last(): Returns the largest element of set S.
- ceiling(e): Returns the smallest element >= element e.
- floor(e): Returns the largest element <= element e.
- lower(e): Returns the largest element strictly < element e.
- higher(e): Returns the smallest element strictly > element e.
- sub_set(e1, e2): Returns an iterator containing elements >= element e1 and strictly < element e2.
- poll_first() / pop_first(): Returns (and removes) the smallest element in set S.
- poll_last() / pop_last(): Returns (and removes) the largest element in set S.
In Rust you may want to replace ceiling, floor, lower, higher, and sub_set with a single range(R: RangeBounds) operation. Using RangeBounds is likely more performant than implementing a predicate with a closure as it allows the collection to use its backing structure to directly identify the lower bound, then iterate until it reaches the upper bound. The bound checks are monomorphized and often zero-cost after inlining.
For example, consider the following range operation implemented with a closure:
impl<T: Ord> SortedSet<T> {
    fn range_where<F>(&self, pred: F) -> impl Iterator<Item = &T>
    where
        F: Fn(&T) -> bool,
    {
        self.data.iter().filter(move |x| pred(x))
    }
}
...
for x in s.range_where(|&x| x >= 3 && x < 8) {
    ...
}
Using the closure necessarily checks every element in the collection, resulting in O(n) running time, even if you implement an early exit at the upper bound. Using RangeBounds allows you to use the underlying structure’s fast location to identify elements, typically in O(log(n)) time.
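A hedged sketch of the RangeBounds-based alternative over the same sorted Vec backing structure; the SortedSet layout is assumed here, and partition_point performs the O(log n) lower-bound search.

use std::ops::{Bound, RangeBounds};

pub struct SortedSet<T> {
    data: Vec<T>, // kept sorted at all times
}

impl<T: Ord> SortedSet<T> {
    pub fn range<R: RangeBounds<T>>(&self, range: R) -> impl Iterator<Item = &T> {
        // Binary search for the lower bound in O(log n)
        let start = match range.start_bound() {
            Bound::Included(b) => self.data.partition_point(|x| x < b),
            Bound::Excluded(b) => self.data.partition_point(|x| x <= b),
            Bound::Unbounded => 0,
        };
        // Iterate from the lower bound and stop once the upper bound is exceeded
        self.data[start..]
            .iter()
            .take_while(move |x| match range.end_bound() {
                Bound::Included(b) => *x <= b,
                Bound::Excluded(b) => *x < b,
                Bound::Unbounded => true,
            })
    }
}

fn main() {
    let s = SortedSet { data: vec![1, 3, 5, 7, 9, 11] };
    let in_range: Vec<&i32> = s.range(3..8).collect();
    assert_eq!(in_range, vec![&3, &5, &7]);
}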
Multimaps & Multisets
A multimap is like a map, except that a single key can be associated with multiple different values. Multisets, also called bags, are sets that allow duplicates. The biggest takeaway here is how you define sameness/otherness. For example, are you counting the number of references to a single object, or are you counting multiple separate instances of bitwise equivalence? In multisets it is probably most common to count occurrences and thus implement these structures with Entry<K, N> types where K is the element and N is the number of occurrences. Multimaps are similar, except N is typically a list of values associated with key K. This might look more like Entry<K, Vec<V>>.
// Base multimap struct
pub struct Entries<K, V> {
    key: K,
    vec: Vec<V>,
}
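For everyday use, a multimap can also be sketched on top of std's HashMap with the entry API, using the dates and airport codes from the notes above; the field layout is an assumption for illustration.

use std::collections::HashMap;

fn main() {
    // Key: date, value: a list of (origin, destination) tuples
    let mut flights: HashMap<u32, Vec<(&str, &str)>> = HashMap::new();
    flights.entry(27031977).or_default().push(("KHAM", "GCLP"));
    flights.entry(27031977).or_default().push(("KLAX", "GCLP"));
    flights.entry(13101972).or_default().push(("SUMU", "SCEL"));

    // A single key is associated with multiple values
    assert_eq!(flights[&27031977].len(), 2);
}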
Multiset ADT
In addition to all the regular set operations, multisets contain the following specialized operations:
- count(k): Returns the number of times key k appears in the set.
- remove(k, n): Removes n occurrences of key k from the set.
- size(): Returns the total number of elements in the set, including duplicates.
Multimap ADT
In addition to all the regular map operations, multimaps contain the following specialized operations:
- remove_all(k): Removes all entries for key k, returning an iterator over the removed key:value pairs.
- key_set(): Returns an iterator over the non-duplicative keys in the map.
- values(k): Returns an iterator over all the values for key k.