Trees
The first module on sequences introduces recursive type representations with `Node` structures that contain values and links to other `Node` structures. In a linked list, however, you can only ever move forwards or backwards linearly. As you can imagine, you might encounter a need for more than one path from a given node, such as a fork in the road. Indeed, you may require any number of links that allow you to traverse a deeply interconnected web of nodes. Such interconnected structures are known in computer science as graphs. The simplest definition of a graph is a structure that contains nodes (vertices) connected by links (edges). Even simple linked lists technically fit the definition of a graph structure! As it turns out, graphs are incredibly important in all sorts of data structuring and algorithmic problem solving in computer science. Some have even claimed that the most interesting problems in computer science are graph problems. While that sentiment is deeply subjective, there is no arguing that graph theory in computer science represents an incredibly large surface area with all sorts of practical applications. This module provides a taste of graph theory by covering a relatively simple subset of graphs called trees.
Trees are non-linear, hierarchical data structures. It may be helpful to invoke the imagery of a family tree, from which computational trees borrow terminology. A computational tree can be thought of as a series of linked nodes, aka vertices. The first node of the tree is called the root. Root nodes link to zero or more child nodes. All child nodes have exactly one parent node. Each “generation” is referred to as a level. The root is level 0, the children of the root node are level 1, and so on. You can navigate trees using node pairs called edges, commonly expressed as pointers or references, and often referred to in implementations as links. Combining one or more edges forms path sequences. All parent nodes in a direct path sequence from any child node all the way up to the root are considered that child's ancestor nodes. A node's parent can be further defined as its direct ancestor, and a node's “grandparent” or “great-grandparent” nodes are defined as its indirect ancestor nodes. All directly recursive child nodes below any given node are referred to as that node's sub-tree. The entire sub-tree of a node constitutes that node's descendants. Multiple child nodes under the same parent node are called siblings. Nodes with children are called internal nodes, while nodes with no children are known as external nodes. External nodes are also referred to as leaves. A node's depth is its number of ancestors; put another way, the depth of a node is the number of nodes on the path from that position to the tree's root (not counting the position itself). A tree's height is the maximum of all possible depth values in the tree (or sub-tree); put another way, the height of a tree is the longest path from the tree's root to its furthest leaf. A tree is considered ordered if there is a meaningful linear order among the children of each node.
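To ground the terminology, here is a minimal sketch of how the node-and-link vocabulary might map onto a Rust type. The `Node` struct and field names are illustrative assumptions, not a canonical implementation:

```rust
// A minimal n-ary tree node; field names are illustrative.
struct Node<T> {
    value: T,               // the node's payload
    children: Vec<Node<T>>, // links (edges) to child nodes; empty for a leaf
}

fn main() {
    // A root (level 0) with two children (level 1); both children are leaves.
    let root = Node {
        value: "root",
        children: vec![
            Node { value: "left child", children: vec![] },
            Node { value: "right child", children: vec![] },
        ],
    };
    // Each child has a depth of 1, so the height of this tree is 1.
    assert_eq!(root.children.len(), 2);
}
```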
One super common example of trees in computing is how many major operating systems implement file systems with tree or tree-like structures. This is so common that most major modern operating systems ship with a `tree` (or `tree`-like) program for visualizing directory structures in the terminal. Indeed, it can be helpful to visualize all sorts of information with trees.
For example, the headings in an HTML or Markdown document can be parsed into hierarchical tree structures for visualization. This website is written in Markdown, which uses heading levels marked by `#` characters. This allows you to describe the contents of the website in a tree-like structure by heading value. Consider the following representation of this site's introduction page. You can clearly see that the site has five top-level sections that each have zero or more sub-topics.
```
../tech-docs/src/content/docs/cs/dsa/dsa-intro.md

📄 dsa-intro.md
├── Foundational Topics
│   └── Recursion & Iteration
├── Algorithmic Complexity
│   ├── Empirical testing
│   ├── Asymptotic analysis
│   ├── Common growth functions
│   └── Asymptotic notation
├── Proofs
│   ├── Counterexample
│   ├── Contra attacks
│   └── Induction
├── Abstract Data Types
└── Basic Algorithms
```
Module Contents
The goal of this module on hierarchies is to cover the following structures:
- Intro: Binary vs General/Multiway Tree ADTs
- Priority Trees: Heaps, Cartesian Trees, Treaps, & Pairing Heaps
- Binary Search Trees: AVL & Red-Black Trees
- Frequency Trees: Splay Trees
- Multiway Search Trees: (a,b) Trees, (2,4) Trees, & B-Trees/B+ Trees
- Text Processing: Tries/Radix Trees/Patricia Trees, Suffix Trees, & Suffix Arrays
- Range & Order Statistics: Segment Trees, Fenwick Trees, & Order Statistic Trees
- Spatial Trees: k-d Trees, Quadtrees, & Octrees
The Tree ADT
Like most data structures, real-world implementations and use cases often include specialized design criteria and may involve combining more fundamental structures and algorithms to build out functionality. The functions presented here mostly provide information about the tree instantiation. The list contains several classifications of functions for the ADT, and these classifications can be helpful in grounding your design. For example, ancestor functions deal specifically with a given position's ancestor nodes. Query functions return information about the instantiation. Functions like `depth()` and `height()` are considered derived functions because they rely on other functionality, namely the traversal algorithms defined for a particular tree sub-type. This section does not yet include functions for creating, adding, or deleting nodes from the tree. These instantiation and mutator functions relate closely to specific implementations and tree sub-types. The tree ADT contains an assortment of different function classifications. There are conflicting opinions about the nature of these classifications, but in general the abstract type might contain some combination of the following functions:
- `get(p)`: Returns the data at position `p`
- `root()`: Returns the position of the tree's root, usually an instance value
- `parent(p)`: Returns the position of the parent of node `p`
- `children(p)`: Returns an iterable collection containing the immediate child nodes of node `p`
- `size()`: Returns the total number of nodes in the tree
- `is_empty()`: Returns `true` if there are no nodes in the tree
- `is_root(p)`: Returns `true` if the position is the tree's root
- `num_children(p)`: Returns the number of immediate children at position `p` (0-2 for a binary tree node)
- `is_internal(p)`: Returns `true` if the node at position `p` has children
- `is_external(p)` / `is_leaf(p)`: Returns `true` if the node at position `p` has no children
- `depth(p)`: A derived function that returns the number of nodes from the input position `p` to the root
- `height(p)`: A derived function that returns the height of the tallest branch under position `p`
Additionally, trees should be iterable and able to return all positions within the tree. You have quite a bit of flexibility here. In fact, there is so much flexibility that this text has an entire section dedicated to tree traversal.
To handle all of this flexibility and hierarchical typing, Java designs commonly use the factory method pattern, which defines abstract methods in super-classes or interfaces that allow implementing classes to provide specific details for specialized instantiation. The idea is that you can design a hierarchical type with a series of increasingly specialized behaviors through interfaces or abstract classes. By contrast, Rust does not support inheritance, so design hierarchies tend to be much flatter and make use of traits (similar to interfaces) to define shared behavior. Traits must be implemented for each instantiated struct or enum instead of inherited from interfaces or abstract types.
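As an illustration of that flatter style, here is a sketch of how some of the query functions from the ADT list might appear as a Rust trait. The `Position` associated type is an assumption standing in for whatever handle (index, ID, or reference) an implementation uses:

```rust
// A sketch of the tree ADT as a Rust trait.
trait Tree<T> {
    // A position handle; might be an index, an ID, or a reference in practice.
    type Position;

    fn get(&self, p: &Self::Position) -> Option<&T>;
    fn root(&self) -> Option<Self::Position>;
    fn parent(&self, p: &Self::Position) -> Option<Self::Position>;
    fn children(&self, p: &Self::Position) -> Vec<Self::Position>;
    fn size(&self) -> usize;

    // Default methods capture the functions that derive from the ones above.
    fn is_empty(&self) -> bool {
        self.size() == 0
    }
    fn num_children(&self, p: &Self::Position) -> usize {
        self.children(p).len()
    }
    fn is_internal(&self, p: &Self::Position) -> bool {
        self.num_children(p) > 0
    }
    fn is_leaf(&self, p: &Self::Position) -> bool {
        self.num_children(p) == 0
    }
}
```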
Common sub-types include n-ary and binary trees. The n-ary tree is a tree in which a node can contain any number (n) of children, whereas a binary tree node can only contain two children, as left and right child nodes. Implementations require specific ways to create/instantiate trees with some kind of `new()` function as well as ways to add children. N-ary trees might include a generic `add_child(n, p)` function where `n` is the node to add and `p` is the position of the node's parent. Binary trees often include `add_left(n, p)` and `add_right(n, p)` functions instead.
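As a sketch, an n-ary tree that stores its nodes in a backing `Vec` and hands out indices as positions might expose `new()` and `add_child` like this (the arena-style layout and all names are assumptions for illustration):

```rust
// Arena-style n-ary tree: positions are indices into `nodes`.
struct NaryTree<T> {
    nodes: Vec<NaryNode<T>>,
}

struct NaryNode<T> {
    value: T,
    children: Vec<usize>, // positions of this node's children
}

impl<T> NaryTree<T> {
    // Create a tree containing only a root node at position 0.
    fn new(root_value: T) -> Self {
        NaryTree {
            nodes: vec![NaryNode { value: root_value, children: vec![] }],
        }
    }

    // Add `value` as a new child of the node at position `p`,
    // returning the new node's position.
    fn add_child(&mut self, value: T, p: usize) -> usize {
        let child = self.nodes.len();
        self.nodes.push(NaryNode { value, children: vec![] });
        self.nodes[p].children.push(child);
        child
    }
}

fn main() {
    let mut tree = NaryTree::new("root");
    let a = tree.add_child("a", 0);
    tree.add_child("b", 0);
    tree.add_child("a1", a); // a grandchild under "a"
}
```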
The Binary Tree ADT
Binary trees are a sub-type of trees where each node has at most two children: a left and a right child. In traversal, left children precede right children. The result is that each node roots at most a left sub-tree and a right sub-tree. Binary trees are proper (or full) if each node has either zero or two children. Binary trees have at most `2^d` nodes at a given level, where `d` is the level's depth. Binary trees containing nodes with a single child are improper, sometimes also referred to as incomplete. Proper binary trees are sometimes called decision trees because they can model a series of yes or no questions as you progress from the root to the leaves.
Binary tree ADTs commonly support the following ancestor and mutator functions:
- `left(p)`: Returns the position of the left child of `p`
- `right(p)`: Returns the position of the right child of `p`
- `sibling(p)`: Returns the position of the sibling of `p`
- `add_left(p)`: Adds a left child to the node at position `p`
- `add_right(p)`: Adds a right child to the node at position `p`
- `set(e, p)`: Sets the payload value to `e` for the node at position `p`
- `attach(p1, p2)`: Attaches sub-tree `p2` to position `p1`
- `remove(p)`: Removes the node at position `p`, moving its children (sub-tree) to `p`'s parent
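For a concrete feel, here is a minimal link-based sketch of a binary node with `add_left`/`add_right` in Rust. The `Box`-based ownership model is one choice among several, and the names are assumptions mirroring the ADT:

```rust
// Link-based binary tree node: each child link is either empty (None)
// or an owned pointer to another node.
struct BinaryNode<T> {
    value: T,
    left: Option<Box<BinaryNode<T>>>,
    right: Option<Box<BinaryNode<T>>>,
}

impl<T> BinaryNode<T> {
    fn new(value: T) -> Self {
        BinaryNode { value, left: None, right: None }
    }

    // Attach a new left child, replacing any existing left sub-tree.
    fn add_left(&mut self, value: T) {
        self.left = Some(Box::new(BinaryNode::new(value)));
    }

    // Attach a new right child, replacing any existing right sub-tree.
    fn add_right(&mut self, value: T) {
        self.right = Some(Box::new(BinaryNode::new(value)));
    }
}

fn main() {
    let mut root = BinaryNode::new(1);
    root.add_left(2);
    root.add_right(3);
}
```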
Binary trees are some of the simplest trees to conceptualize and implement, but just like with linked lists, real-world vanilla use cases turn out to be rare. More than one academic resource suggests implementing an (infix) arithmetic expression evaluator using binary trees. On closer consideration you'll learn that it's actually easier and more efficient to convert infix arithmetic expressions with binary operators to Reverse Polish Notation (also known as postfix notation) using the Shunting Yard algorithm before evaluating the expression directly with lists. As a result, this section leaves binary tree implementation examples to the next section on Search Trees. Indeed, binary trees are much more common among the various search tree variants, as well as a popular choice in heap implementations.
Tree Traversal
Before talking about ways to concretely instantiate, construct, or mutate trees, it is important to discuss tree navigation. You may have noticed that the text has not provided any methods to instantiate or mutate the tree so far. This is because tree traversal is heavily dependent on the tree's structure and what you need it to do. This section discusses two basic forms of traversal: depth-first and breadth-first. In general, the idea behind this classification is that an algorithm can either progress from a particular position all the way to the leaves before completing a level's siblings, or handle each sibling first.
You have some options when walking a tree depending on what you want to accomplish. For example, you might prioritize visiting each leaf in a particular order, or you might process nodes level by level to leverage information encoded in the tree’s structure. This choice gives rise to the first major categorization of tree traversal: depth-first or breadth-first traversal. Even within these fundamental traversal classifications there are still more discrete and specialized approaches. The following sections attempt to describe several common approaches, provide a simple use case, and finally compare approaches to help guide your decision making process.
Depth-First Traversal
The defining characteristic of a depth-first traversal is to pick a branch and go all the way to its leaves before selecting the next branch. It is probably more common to associate depth-first traversals with graphs than with trees, but the defining characteristics are basically the same.
Consider a binary tree structure with one root and two leaves.
```
     root
    /    \
 left    right
```
To visit each leaf in the tree you could visit each node with any one of n! ordered processes; for these three nodes that makes 3! = 6:
```
[root, left, right]   [left, root, right]   [left, right, root]
[root, right, left]   [right, root, left]   [right, left, root]
```
That's too many options regardless of the tree's size. Right away you may notice that the first group and the second group are effectively the same regarding ordered traversal. The major difference is whether you proceed from left-to-right or right-to-left. Algorithmically there is no real difference between them, so this text focuses on the first group because reading left-to-right is more common in English-speaking communities. Each of the three ordered processes in the first group represents a specific kind of depth-first traversal.
Preorder
A preorder traversal is a kind of depth-first traversal that visits a position `p` (the root or a sub-tree's root) first, followed by its children. It is suitable for tasks like creating copies of the tree, or any task that needs to act on the root before its sub-trees. This diagram illustrates the order in which a preorder traversal visits each node.
```
        1
      /   \
     2     5
    / \   / \
   3   4 6   7
```
Algorithm:
```
preorder(p):
    "visit" p
    for each child c in children(p) do
        preorder(c)
```
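A sketch of the same algorithm in Rust over a simple n-ary node (names are illustrative):

```rust
struct Node<T> {
    value: T,
    children: Vec<Node<T>>,
}

// Preorder: "visit" the node first, then recurse into each child.
fn preorder<T: std::fmt::Display>(node: &Node<T>) {
    println!("{}", node.value); // the "visit" action
    for child in &node.children {
        preorder(child);
    }
}

fn main() {
    // The seven-node tree from the diagram above.
    let tree = Node { value: 1, children: vec![
        Node { value: 2, children: vec![
            Node { value: 3, children: vec![] },
            Node { value: 4, children: vec![] },
        ]},
        Node { value: 5, children: vec![
            Node { value: 6, children: vec![] },
            Node { value: 7, children: vec![] },
        ]},
    ]};
    preorder(&tree); // prints 1 through 7 in preorder
}
```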
Postorder
The postorder traversal is a kind of depth-first traversal that visits the children of the given root first, followed by the root `p` itself. It is ideal for tasks like deleting a tree, or for problems where you need to solve sub-problems before dealing with the parent problem. This diagram illustrates the order in which a postorder traversal visits each node.
```
        7
      /   \
     3     6
    / \   / \
   1   2 4   5
```
Algorithm:
```
postorder(p):
    for each child c in children(p) do
        postorder(c)
    "visit" p
```
The disk usage algorithm in this project’s associated GitHub repo represents a postorder traversal of a directory tree.
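In that spirit, here is a hedged sketch of a postorder computation in Rust: a directory's total size is only known after its children's totals are computed. The `Dir` type and all names are illustrative, not the repo's actual code:

```rust
struct Dir {
    name: String,
    size: u64,          // size of this directory's own entries
    children: Vec<Dir>, // sub-directories
}

// Postorder: total up each child sub-tree first, then "visit" the parent
// by combining its own size with its children's totals.
fn total_size(dir: &Dir) -> u64 {
    let children_total: u64 = dir.children.iter().map(total_size).sum();
    let total = dir.size + children_total;
    println!("{}: {} bytes", dir.name, total); // visit after the children
    total
}

fn main() {
    let root = Dir {
        name: "root".into(),
        size: 10,
        children: vec![
            Dir { name: "a".into(), size: 5, children: vec![] },
            Dir { name: "b".into(), size: 7, children: vec![] },
        ],
    };
    total_size(&root); // prints a, then b, then root
}
```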
Inorder
This kind of depth-first traversal only applies to binary trees. An inorder traversal visits the left sub-tree of position `p` (as the sub-tree's root), then `p`, and then the right sub-tree of `p`. The output of an inorder traversal orders all elements in a tree from left-to-right (or right-to-left if you want, I'm not your dad). This diagram illustrates the order in which an inorder traversal visits each node.
```
        4
      /   \
     2     6
    / \   / \
   1   3 5   7
```
The inorder traversal is well suited for processing binary expression trees. For example, if you construct a (binary) expression tree exclusively from binary operators, such as the basic arithmetic operators (+, -, *, /), inorder traversal is a natural choice for constructing flattened infix expressions. In such a tree the internal nodes are operators and the leaves are operands. Consider the binary tree:
```
        *
      /   \
     +     -
    / \   / \
   3   1 4   2
```
Using inorder traversal this results in the infix expression `3 + 1 * 4 - 2`. Note that the flattened string loses the grouping the tree encodes; emitting parentheses around each sub-tree recovers the intended expression, `(3 + 1) * (4 - 2)`. The creation and evaluation of such an expression (tree) is better suited to a stack, however. Naturally, you'll also require logic to handle proper arithmetic operator precedence when converting from or evaluating an infix representation, as the tree inherently enforces precedence through its structure.
Algorithm:
```
inorder(p):
    if p has a left child lc then
        inorder(lc)
    "visit" p
    if p has a right child rc then
        inorder(rc)
```
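A sketch of inorder traversal in Rust over a link-based binary node (the same illustrative shape used earlier):

```rust
struct BinaryNode<T> {
    value: T,
    left: Option<Box<BinaryNode<T>>>,
    right: Option<Box<BinaryNode<T>>>,
}

// Inorder: left sub-tree, then the node itself, then the right sub-tree.
fn inorder<T: std::fmt::Display>(node: &BinaryNode<T>) {
    if let Some(lc) = &node.left {
        inorder(lc);
    }
    println!("{}", node.value); // the "visit" action
    if let Some(rc) = &node.right {
        inorder(rc);
    }
}

fn main() {
    // A three-node tree; inorder prints 1, 2, 3.
    let tree = BinaryNode {
        value: 2,
        left: Some(Box::new(BinaryNode { value: 1, left: None, right: None })),
        right: Some(Box::new(BinaryNode { value: 3, left: None, right: None })),
    };
    inorder(&tree);
}
```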
Depth-First Traversal Summary
If you revisit the ordered processes from the example at the top of this section, you can now apply a label to each one.
```
[root, left, right]   // Preorder
[left, root, right]   // Inorder (binary trees only)
[left, right, root]   // Postorder
```
Breadth-First Traversal
The defining characteristic of the breadth-first approach is to visit all nodes at the same level of a tree before traversing their sub-trees. This approach typically relies on a queue implementation to store positions at a given level. This diagram illustrates the order in which the traversal visits each node.
```
        1
      /   \
     2     3
    / \   / \
   4   5 6   7
```
Algorithm:
```
breadth_first():
    initialize queue Q to contain root()
    while Q not empty do
        p = Q.dequeue()
        perform the "visit" action for position p
        for each child c in children(p) do
            Q.enqueue(c)
```
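A sketch of the same algorithm in Rust, using the standard library's `VecDeque` as the queue (node shape is illustrative):

```rust
use std::collections::VecDeque;

struct Node<T> {
    value: T,
    children: Vec<Node<T>>,
}

// Breadth-first: visit nodes level by level, queueing children as you go.
fn breadth_first<T: std::fmt::Display>(root: &Node<T>) {
    let mut queue = VecDeque::new();
    queue.push_back(root);
    while let Some(node) = queue.pop_front() {
        println!("{}", node.value); // the "visit" action
        for child in &node.children {
            queue.push_back(child);
        }
    }
}

fn main() {
    // The seven-node tree from the diagram above.
    let tree = Node { value: 1, children: vec![
        Node { value: 2, children: vec![
            Node { value: 4, children: vec![] },
            Node { value: 5, children: vec![] },
        ]},
        Node { value: 3, children: vec![
            Node { value: 6, children: vec![] },
            Node { value: 7, children: vec![] },
        ]},
    ]};
    breadth_first(&tree); // prints 1 through 7 in level order
}
```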
Structural Implementations
It is common to implement trees with a combination of linked and indexed structures. For the basic tree structure, a linked approach is a natural choice due to its flexibility and efficiency in dynamic scenarios like insertion and deletion. Array-based structures can be used for the base tree structure, but they often suffer from spatial inefficiency (due to pre-allocation or re-sizing) and prohibitive temporal costs (like shifting elements). Specific use cases, such as proper binary trees and heaps (a specialized binary tree), leverage array-based implementations effectively, providing spatial efficiency and maintaining O(log n) temporal complexity for key operations.
On top of a link-based structure, it's common to use indexed structures for managing child nodes in n-ary trees. Indexed structures make it easier to track variable numbers of child nodes during traversal operations.
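As an example of the array-based case mentioned above, a complete binary tree (the layout binary heaps rely on) can live in a flat array with no stored links at all; positions are computed with index arithmetic. A sketch, assuming the root sits at index 0:

```rust
// Index arithmetic for a complete binary tree stored in a flat array.
fn left(i: usize) -> usize { 2 * i + 1 }
fn right(i: usize) -> usize { 2 * i + 2 }
fn parent(i: usize) -> Option<usize> {
    if i == 0 { None } else { Some((i - 1) / 2) }
}

fn main() {
    // Level-order layout of a seven-node tree: root, then level 1, then level 2.
    let tree = [1, 2, 3, 4, 5, 6, 7];
    assert_eq!(tree[left(0)], 2);   // the root's left child
    assert_eq!(tree[right(0)], 3);  // the root's right child
    assert_eq!(parent(4), Some(1)); // node 5's parent is node 2
}
```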