Sorting, Searching, & Selection

The previous page introduced the mathematical foundations of comparative ordering and equivalence through the concept of binary relations. It also provided some examples for implementing ordering on custom types. This section introduces practical applications for comparative ordering including elementary sort, search, and selection algorithms as applied to list-based sequences of comparable values.

Sorting

Sorting can be defined as the logical arrangement of elements in a collection into a sequence that adheres to a given comparison relation. In this way, sorted arrangements exist independently of the collection’s underlying memory layout (structural ordering). Rather, the sorted arrangement is realized by a logical traversal order. In basic cases, sorting algorithms achieve sorted traversal order by structural reordering. This is common in sorting algorithms on (contiguous) list structures via memory swaps. For example, Rust’s sort operation on slices physically rearranges elements in the list. This allows the slice’s traversal methods (such as iter()) to yeild sorted arrangements without a custom implementation.

It is also common for data structures to maintain a sorted traversal order without structural reordering. In these cases, the traversal order is dictated by an additional logical layer of links, indices, or abstract positions within memory that are used to produce the sorted arrangement. Data structures that maintain a sorted traversal order are called sorted data structures. The term does not dictate how the sorted traversal order is maintained, rather, the important part is that the structure maintains a sorted invariant regardless of arbitrary insertion and removal operations. In all cases, the collection is considered sorted if the sequence produced by iteration satisfies the comparison relation.

There is a natural progression from elementary list-based algorithms, used as a traditional pedagogical starting point, to more advanced and practical data structures used in real-world applications. Sorted structures dont just provide a sorted traversal invariant; as this section illustrates, sorted data provides a platform to efficiently support searching for elements in the collection, identifying entries based on relative ordering such as identifying the first, last, and Kth values, as well as queries based on ranges of values.

The Hierarchical Structures chapter covers sorted structures in much greater depth and includes material on heaps, priority queues, skip lists, and search trees. The Hierarchical Structures section also covers combining structures to produce even more useful data structures, such as adaptable priority queues and sorted maps/sets.

Comparison-based sorting requires that the comparison relation define either a total preorder (<=p) or a total order (<=), such that every pair of elements is comparable. In this way, the ordering is a property of a relation defined on a set of elements, whereas sorting is the process of arranging elements according to that relation. The previous page covers the mathematics of comparative ordering in more detail for readers who are not familiar with the subject.

Readers who have spent some time programming have likely have heard of many different sorting algorithms, but seen only a few in practice. It is common for a programming language’s list type to have a single sort() method which works exceptionally well for the vast majority of use cases. This is because most languages implement advanced hybrid sorting algorithms to take advantage of different properties of each and minimize downsides. For example, the insertion sort algorithm is one of the simplest sorting algorithms, and remains one of the fastest algorithms for small sets, but in practice ends up performing sluggishly for large sets. Other algorithms represent the opposite; performing exceptionally well on large sets, but requiring unnecessary overhead for small sets. Additionally, some algorithms only really perform well when the set is randomized, and start to falter when attempting to sort a semi-sorted collection.

In comparison-based sorting, all algorithms require Ω(n * log(n)) comparisons in the worst case. As a result, this page only covers some elementary sorting algorithms, and leaves more advanced, hybrid sorting algorithms up to the reader to explore in further depth. These mathematical bounds hold true for both sorting a list and maintaining a sorted structure. For example, inserting a comparable element into a self-balancing search tree requires O(log(n)) time, so constructing such a tree from n elements yields a de-facto O(n * log(n)) sorting method. Later pages in this module cover more advanced algorithmic material including non-comparison-based sorting such as radix sort which runs in roughly linear time (O(d * n)) where d is the number of digits in each element.

Insertion & Selection Sort

The two algorithms listed here are similar enough to lump together. Both sorting algorithms separate an unsorted base collection into sorted and unsorted logical segments during operation. With each pass, the sorted segment increases by one, and the unsorted segment decreases by one. For the first pass, the 0th element can be considered trivially sorted. Both algorithms can be implemented in place, or with a secondary structure of n elements.

As a potentially reductive generalization, think of insertion sort as finding the right spot for the next element, and selection sort as finding the right element for the next spot.

Insertion Sort

The insertion sort algorithm proceeds linearly through the unsorted collection, figuring out where in the sorted segment to insert the unsorted element. The insertion then results in O(n) shifts to the right. This requires n number of O(n) scans of the sorted segment, and n number of O(n) shift operations, resulting in a total O(n^2) running time and O(n^2) number of writes. The name comes from the way the algorithm inserts elements into a sorted segment.

pub fn insertion_sort<T: Ord>(list: &mut [T]) {
    for i in 1..list.len() {
        let mut j = i;
        while j > 0 && list[j] < list[j - 1] {
            list.swap(j, j - 1);
            j -= 1;
        }
    }
}

Selection Sort

Selection sort proceeds linearly through the base collection, figuring out the next non-increasing/decreasing element within the unsorted section to append to the sorted segment. This requires n number of O(n) scans of the unsorted segment, and n number of O(1) swap operations resulting in O(n^2) running time and O(n) number of writes (swaps).

Selection sort represents a simple and effective sorting algorithm, but only for small sets because of its quadratic temporal tendancy.

pub fn selection_sort<T: Ord>(list: &mut [T]) {
    let len = list.len();

    for i in 0..len {
        let mut min_idx = i;

        for j in (i + 1)..len {
            if list[j] < list[min_idx] {
                min_idx = j;
            }
        }

        if min_idx != i {
            list.swap(i, min_idx);
        }
    }
}

Merge Sort

Coming soon!

Quick Sort

pub fn quick_sort<T: Ord>(list: &mut [T]) {
    if list.len() <= 1 {
        return;
    }

    let pivot_index = partition(list);

    let (left, right) = list.split_at_mut(pivot_index);
    quick_sort(left);
    quick_sort(&mut right[1..]); // skip pivot
}

fn partition<T: Ord>(list: &mut [T]) -> usize {
    let len = list.len();
    let pivot_index = len - 1;

    let mut store = 0;

    for i in 0..pivot_index {
        if list[i] < list[pivot_index] {
            list.swap(i, store);
            store += 1;
        }
    }

    list.swap(store, pivot_index);
    store
}

Sorting Analysis

It should be noted that the underlying data structure can alter the operational profile of these algorithms. For example, implementing insertion sort on a linked list rather than an array changes the insertion cost from shifting elements (O(n) writes) to pointer updates (O(1) writes), though the search cost remains the same.

By their descriptions, selection sort may appear more efficient as it performs significantly fewer writes (O(n) vs O(n^2)). This is true and makes it useful when writing to memory is expensive. However, selection sort is generally unstable, meaning that it may reorder unique elements of equal value within the base collection, whereas insertion sort is generally stable. Furthermore, insertion sort is adaptive; it becomes significantly faster (O(n)) on nearly-sorted data, whereas selection sort is always (O(n^2)).

Searching

The concept of searching can be defined here as determining whether a given element or key exists in a collection. The next chapter on associative structures goes into more detail about key-value pairs and provides a way to achieve O(1) searches with map structures. Searches in unsorted arrangements necessarily require O(n) time, and can be improved to O(log(n)) time for sorted sequences.

In this module, searching refers to matching an element by value or predicate within a collection, and selection provides a way to find elements via rank-based comparisons, such as the kth-smallest (or largest) element in a collection. These represent fundamental data structure operations and differ significantly from web-based search engines like Google or DuckDuckGo. Large-scale web search systems operate in a different problem domain known as information retrieval, which is substantially more complex and extends beyond the scope of basic searching and selection as presented in this DSA module.

Binary Search

The most popular and practical search algorithm is called binary search. The idea of binary search within a sequence structure is simple, but requires that the sequence optimally provide random access such as array indexing, and also requires that the sequence be sorted. Within a sorted list, you start at the half-way point and compare the value against the search key. If the value is the same, you end the algorithm. If the value is smaller or larger than the value half-way through the sequence, you’ve already eliminated roughly half of all necessary comparisons. The search algorithm continues halving the remaining candidate values until it either finds the key or determins that the key is not in the sequence.

Selection

Selection, often referred to as order statistics, concerns rank-based queries over a collection. These ranks are typically expressed as the kth element, such as the kth smallest or largest value in the collection. Selection can be performed on both unsorted and sorted collections, but have different complexity characteristics. In unsorted collections, comparison-based algorithms such as randomized quickselect can identify the kth element in expected O(n) time where n is the number of elements in the collection. In contrast, sorted collections can support more efficient order statistic queries. If the underlying structure provides O(1) random access, such as with an indexed array, the kth element can be retrieved in O(1) time because the index value encodes the element’s rank out of the box. Sorted structures also make range queries more efficient and direct. For example, retrieving the elements ranked 3rd through 6th can be done by simple index-based access in a sorted list, or by sequential traversal in structures that maintain a sorted invariant. Because sorted arrangements provide such a large advantage to selection, it is common to use sorted structures where frequent selection is required. The next section introduces this module’s first sorted structure, the skip list, which is a probabilistic data structure that maintains sorted order and supports efficient search queries in O(log(n)) time and range queries in O(log(n) + k) time where k is the size of the range.

Quickselect

Used to find the kth smallest/largest element in an unsorted array in O(n) time
Used for efficient partial order statistics
By comparison, kth values in sorted sets are O(1)

The Skip List ADT

The sort, search, and selection algorithms presented on this page are typically introduced over simple list structures. While this provides a clean pedagogical foundation, it can feel abstract and relatively dry. To provide a more practical context, this module introduces the skip list.

First described by William Pugh in 1990, skip lists are relatively modern compared to many classical data structures. Skip lists can be used anywhere a sorted list invariant is required. They are interesting for two primary reasons; their conceptual simplicity and their probabilistic structuring. Skip lists are not exactly simple, per se, but become so when compared with the structures they compete with, namely self-balancing search trees (discussed later in the Hierarchical Structures chapter). Skip lists rely on random number generation to form probabilistic data structuring. This means that they cannot put a hard upper performance bound. Skip lists boast O(log(n)) expected insert and removal operations, though the chances of reverting to significantly worse time bounds is statistically improbable with an adequate random number generator.

In practice, skip lists are less commonly used than self-balancing trees in standard libraries, largely due to the latter’s deterministic worst-case guarantees and long-standing adoption. However, skip lists are particularly well-suited to concurrent and lock-free implementations, where their structure allows for simpler algorithms than comparable tree-based approaches. As a result, they are used in specialized systems such as concurrent maps and database indexing structures, making them a valuable case study for data structures in Rust.

Traditionally described through pointer indirection, these structures make them challenging to implement safely in Rust due to ownership and aliasing pitfalls. As a result, this material takes a safe Vec-backed approach, substituting positional handles for Vec indexes.

Because all comparison-based sorting algorithms require Ω(n * log(n)) comparisons in the worst case, sorting a collection “from scratch” is inherently expensive. As a result, the choice between periodically sorting a list and maintaining a sorted data structure depends on factors such as input size, the degree to which the elements are already sorted, memory usage, access patterns, and whether the design requires maintaining a sorted invariant.

The skip list ADT may support the following (non-exhaustive) operations:

length() Returns the number of elements in the list
is_empty() Returns whether the list is empty
contains(e) Returns true if element e exists in the list
get(e) Returns a reference to element e, if it exists
insert(e) Inserts element e into the list
remove(e) Removes element e from the list, if it exists
min() Returns the smallest element in the list
max() Returns the largest element in the list
predecessor(e) Returns the largest element less than element e in the list
successor(e) Returns the smallest element greater than element e
find_kth(k) Returns the k-th smallest or largest element (if an indexable variant is supported)
rank(e) Returns the number of elements less than e
iter() Returns an iterator over elements in sorted order
range(lo, hi) Returns an iterator over elements within a range

TODO:

Probabalistic vs deterministic sorted structures
Design patterns??? (towers, layers, etc.)
Easy kth element method and predecessor/successor search
Range query method with index semantics