Trees
The first module on sequences introduces recursive type representations with `Node` structures that contain values and links to other `Node` structures. In a linked list, however, you can only ever move forwards or backwards linearly. As you can imagine, you might encounter a need for more than one path from a given node, such as a fork in the road. Indeed, you may require any number of links that allow you to traverse a deeply interconnected web of nodes. Such interconnected structures are known in computer science as graphs. The simplest definition of a graph is a structure that contains nodes (vertices) connected by links (edges). Even simple linked lists technically fit the definition of a graph structure! As it turns out, graphs are incredibly important in all sorts of data structuring and algorithmic problem solving in computer science. Some have even claimed that the most interesting problems in computer science are graph problems. While that sentiment is deeply subjective, there is no arguing that graph theory in computer science represents an incredibly large surface area with all sorts of practical applications. This module provides a taste of graph theory by covering a relatively simple subset of graphs called trees.
Trees are non-linear, hierarchical data structures. It may be helpful to invoke the imagery of a family tree, from which computational trees borrow terminology. A computational tree can be thought of as a series of linked nodes, aka vertices. The first node of the tree is called the root. Root nodes link to zero or more child nodes. All child nodes have exactly one parent node. Each “generation” is referred to as a level. The root is level 0, the children of the root node are level 1, and so on. You can navigate trees using node pairs called edges, commonly expressed as pointers or references, and often referred to in implementations as links. Combining one or more edges forms path sequences. All parent nodes in a direct path sequence from any child node all the way up to the root are considered that child's ancestor nodes. A node's parent can be further defined as its direct ancestor, and a node's “grandparent” or “great-grandparent” nodes are defined as its indirect ancestor nodes. All directly recursive child nodes below any given node are referred to as that node's sub-tree. The entire sub-tree of a node constitutes that node's descendants. Multiple child nodes under the same parent node are called siblings. Nodes with children are called internal nodes, while nodes with no children are known as external nodes. External nodes are also referred to as leaves. A node's depth is its number of ancestors; put another way, the depth of a node is the number of nodes on the path from that position to the tree's root (not counting the position itself). A tree's height is the maximum of all possible depth values in the tree (or sub-tree); put another way, the height of a tree is the longest path from the tree's root to its furthest leaf. A tree is considered ordered if there is a meaningful linear order among the children of each node.
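To ground the terminology, here is a minimal sketch of how the node-and-link vocabulary might map onto a Rust type. The `Node` struct and field names are illustrative assumptions, not a canonical implementation:

```rust
// A minimal n-ary tree node; field names are illustrative.
struct Node<T> {
    value: T,               // the node's payload
    children: Vec<Node<T>>, // links (edges) to child nodes; empty for a leaf
}

fn main() {
    // A root (level 0) with two children (level 1); both children are leaves.
    let root = Node {
        value: "root",
        children: vec![
            Node { value: "left child", children: vec![] },
            Node { value: "right child", children: vec![] },
        ],
    };
    // Each child has a depth of 1, so the height of this tree is 1.
    assert_eq!(root.children.len(), 2);
}
```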
One super common example of trees in computing is how many major operating systems implement file systems with tree or tree-like structures. This is so common that most major modern operating systems ship with a `tree` (or `tree`-like) program for visualizing directory structures in the terminal. Indeed, it can be helpful to visualize all sorts of information with trees.
For example, the headings in an HTML or Markdown document can be parsed into hierarchical tree structures for visualization. This website is written in Markdown, which uses heading levels marked by `#` characters. This allows you to describe the contents of the website in a tree-like structure by heading value. Consider the following representation of this site's introduction page. You can clearly see that the site has five top-level sections that each have zero or more sub-topics.
```
../tech-docs/src/content/docs/cs/dsa/dsa-intro.md

📄 dsa-intro.md
├── Foundational Topics
│   └── Recursion & Iteration
├── Algorithmic Complexity
│   ├── Empirical testing
│   ├── Asymptotic analysis
│   ├── Common growth functions
│   └── Asymptotic notation
├── Proofs
│   ├── Counterexample
│   ├── Contra attacks
│   └── Induction
├── Abstract Data Types
└── Basic Algorithms
```
Module Contents
The goal of this module on hierarchies is to cover the following structures:
- Intro: Binary vs General/Multiway Tree ADTs
- Priority Trees: Heaps, Cartesian Trees, Treaps, & Pairing Heaps
- Binary Search Trees: AVL & Red-Black Trees
- Frequency Trees: Splay Trees
- Multiway Search Trees: (a,b) Trees, (2,4) Trees, & B-Trees/B+ Trees
- Text Processing: Tries/Radix Trees/Patricia Trees, Suffix Trees, & Suffix Arrays
- Range & Order Statistics: Segment Trees, Fenwick Trees, & Order Statistic Trees
- Spatial Trees: k-d Trees, Quadtrees, & Octrees
The Tree ADT
Like most data structures, real-world implementations and use cases often include specialized design criteria and may involve combining more fundamental structures and algorithms to build out functionality. The functions presented here mostly provide information about the tree instantiation. The list contains several classifications of functions for the ADT, and these classifications can be helpful in grounding your design. For example, ancestor functions deal specifically with a given position's ancestor nodes. Query functions return information about the instantiation. Functions like `depth()` and `height()` are considered derived functions because they rely on other functionality, namely the traversal algorithms defined for a particular tree sub-type. This section does not yet include functions for creating, adding, or deleting nodes from the tree. These instantiation and mutator functions relate closely to specific implementations and tree sub-types. The tree ADT contains an assortment of different function classifications. There are conflicting opinions about the nature of these classifications, but in general the abstract type might contain some combination of the following functions:
- `get(p)`: Returns the data at position `p`
- `root()`: Returns the position of the tree's root, usually an instance value
- `parent(p)`: Returns the position of the parent of node `p`
- `children(p)`: Returns an iterable collection containing the immediate child nodes of node `p`
- `size()`: Returns the total number of nodes in the tree
- `is_empty()`: Returns `true` if there are no nodes in the tree
- `is_root(p)`: Returns `true` if the position is the tree's root
- `num_children(p)`: Returns the number of immediate children at position `p` (0-2 for a binary tree node)
- `is_internal(p)`: Returns `true` if the node at position `p` has children
- `is_external(p)` / `is_leaf(p)`: Returns `true` if the node at position `p` has no children
- `depth(p)`: A derived function that returns the number of nodes from the input position `p` to the root
- `height(p)`: A derived function that returns the height of the tallest branch under position `p`
Additionally, trees should be iterable and able to return all positions within the tree. You have quite a bit of flexibility here. In fact, there is so much flexibility that this text has an entire section dedicated to tree traversal.
To handle all of this flexibility and hierarchical typing, Java designs commonly use the factory method pattern, which defines abstract methods in super-classes or interfaces that allow implementing classes to provide specific details for specialized instantiation. The idea is that you can design a hierarchical type with a series of increasingly specialized behaviors through interfaces or abstract classes. By contrast, Rust does not support inheritance, so design hierarchies tend to be much flatter and make use of traits (similar to interfaces) to define shared behavior. Traits must be implemented for each instantiated struct or enum instead of inherited from interfaces or abstract types.
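As an illustration of that flatter style, here is a sketch of how some of the query functions from the ADT list might appear as a Rust trait. The `Position` associated type is an assumption standing in for whatever handle (index, ID, or reference) an implementation uses:

```rust
// A sketch of the tree ADT as a Rust trait.
trait Tree<T> {
    // A position handle; might be an index, an ID, or a reference in practice.
    type Position;

    fn get(&self, p: &Self::Position) -> Option<&T>;
    fn root(&self) -> Option<Self::Position>;
    fn parent(&self, p: &Self::Position) -> Option<Self::Position>;
    fn children(&self, p: &Self::Position) -> Vec<Self::Position>;
    fn size(&self) -> usize;

    // Default methods capture the functions that derive from the ones above.
    fn is_empty(&self) -> bool {
        self.size() == 0
    }
    fn num_children(&self, p: &Self::Position) -> usize {
        self.children(p).len()
    }
    fn is_internal(&self, p: &Self::Position) -> bool {
        self.num_children(p) > 0
    }
    fn is_leaf(&self, p: &Self::Position) -> bool {
        self.num_children(p) == 0
    }
}
```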
Common sub-types include n-ary and binary trees. The n-ary tree is a tree in which a node can contain any number (n) of children, whereas a binary tree node can only contain two children, as left and right child nodes. Implementations require specific ways to create/instantiate trees with some kind of `new()` function as well as ways to add children. N-ary trees might include a generic `add_child(n, p)` function where `n` is the node to add and `p` is the position of the node's parent. Binary trees often include `add_left(n, p)` and `add_right(n, p)` functions instead.
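As a sketch, an n-ary tree that stores its nodes in a backing `Vec` and hands out indices as positions might expose `new()` and `add_child` like this (the arena-style layout and all names are assumptions for illustration):

```rust
// Arena-style n-ary tree: positions are indices into `nodes`.
struct NaryTree<T> {
    nodes: Vec<NaryNode<T>>,
}

struct NaryNode<T> {
    value: T,
    children: Vec<usize>, // positions of this node's children
}

impl<T> NaryTree<T> {
    // Create a tree containing only a root node at position 0.
    fn new(root_value: T) -> Self {
        NaryTree {
            nodes: vec![NaryNode { value: root_value, children: vec![] }],
        }
    }

    // Add `value` as a new child of the node at position `p`,
    // returning the new node's position.
    fn add_child(&mut self, value: T, p: usize) -> usize {
        let child = self.nodes.len();
        self.nodes.push(NaryNode { value, children: vec![] });
        self.nodes[p].children.push(child);
        child
    }
}

fn main() {
    let mut tree = NaryTree::new("root");
    let a = tree.add_child("a", 0);
    tree.add_child("b", 0);
    tree.add_child("a1", a); // a grandchild under "a"
}
```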
The Binary Tree ADT
Binary trees are a sub-type of trees where each node has at most two children: a left and a right child. In traversal, left children precede right children. The result is that each node roots at most a left sub-tree and a right sub-tree. Binary trees are proper (or full) if each node has either zero or two children. Binary trees have at most `2^d` nodes at a given level, where `d` is the level's depth. Binary trees containing nodes with a single child are improper, sometimes also referred to as incomplete. Proper binary trees are sometimes called decision trees because they can model a series of yes or no questions as you progress from the root to the leaves.
Binary tree ADTs commonly support the following ancestor and mutator functions:
- `left(p)`: Returns the position of the left child of `p`
- `right(p)`: Returns the position of the right child of `p`
- `sibling(p)`: Returns the position of the sibling of `p`
- `add_left(p)`: Adds a left child to the node at position `p`
- `add_right(p)`: Adds a right child to the node at position `p`
- `set(e, p)`: Sets the payload value to `e` for the node at position `p`
- `attach(p1, p2)`: Attaches sub-tree `p2` to position `p1`
- `remove(p)`: Removes the node at position `p`, moving its children (sub-tree) to `p`'s parent
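For a concrete feel, here is a minimal link-based sketch of a binary node with `add_left`/`add_right` in Rust. The `Box`-based ownership model is one choice among several, and the names are assumptions mirroring the ADT:

```rust
// Link-based binary tree node: each child link is either empty (None)
// or an owned pointer to another node.
struct BinaryNode<T> {
    value: T,
    left: Option<Box<BinaryNode<T>>>,
    right: Option<Box<BinaryNode<T>>>,
}

impl<T> BinaryNode<T> {
    fn new(value: T) -> Self {
        BinaryNode { value, left: None, right: None }
    }

    // Attach a new left child, replacing any existing left sub-tree.
    fn add_left(&mut self, value: T) {
        self.left = Some(Box::new(BinaryNode::new(value)));
    }

    // Attach a new right child, replacing any existing right sub-tree.
    fn add_right(&mut self, value: T) {
        self.right = Some(Box::new(BinaryNode::new(value)));
    }
}

fn main() {
    let mut root = BinaryNode::new(1);
    root.add_left(2);
    root.add_right(3);
}
```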
Binary trees are some of the simplest trees to conceptualize and implement, but just like with linked lists, real-world vanilla use cases turn out to be rare. More than one academic resource suggests implementing an (infix) arithmetic expression evaluator using binary trees. On closer consideration you'll learn that it's actually easier and more efficient to convert infix arithmetic expressions with binary operators to Reverse Polish Notation (also known as postfix notation) using the Shunting Yard algorithm before evaluating the expression directly with lists. As a result, this section leaves binary tree implementation examples to the next section on Search Trees. Indeed, binary trees are much more common among the various search tree variants, as well as a popular choice in heap implementations.
Tree Traversal
Before talking about ways to concretely instantiate, construct, or mutate trees, it is important to discuss tree navigation. You may have noticed that the text has not provided any methods to instantiate or mutate the tree so far. This is because tree traversal is heavily dependent on the tree's structure and what you need it to do. This section discusses two basic forms of traversal: depth-first and breadth-first. In general, the idea behind this classification is that an algorithm can either progress from a particular position all the way to the leaves before completing a level's siblings, or handle each sibling first.
You have some options when walking a tree depending on what you want to accomplish. For example, you might prioritize visiting each leaf in a particular order, or you might process nodes level by level to leverage information encoded in the tree’s structure. This choice gives rise to the first major categorization of tree traversal: depth-first or breadth-first traversal. Even within these fundamental traversal classifications there are still more discrete and specialized approaches. The following sections attempt to describe several common approaches, provide a simple use case, and finally compare approaches to help guide your decision making process.
Depth-First Traversal
The defining characteristic of a depth-first traversal is to pick a branch and go all the way to its leaves before selecting the next branch. It is probably more common to associate depth-first traversals with graphs than with trees, but the defining characteristics are basically the same.
Consider a binary tree structure with one root and two leaves.
```
     root
    /    \
 left    right
```
To visit each leaf in the tree you could visit each node with any one of n! ordered processes; for these three nodes that makes 3! = 6:
```
[root, left, right]   [left, root, right]   [left, right, root]
[root, right, left]   [right, root, left]   [right, left, root]
```
That's too many options regardless of the tree's size. Right away you may notice that the first group and the second group are effectively the same regarding ordered traversal. The major difference is whether you proceed from left-to-right or right-to-left. Algorithmically there is no real difference between them, so this text focuses on the first group because reading left-to-right is more common in English-speaking communities. Each of the three ordered processes in the first group represents a specific kind of depth-first traversal.
Preorder
A preorder traversal is a kind of depth-first traversal that visits a position `p` (the root or a sub-tree's root) first, followed by its children. It is suitable for tasks like creating copies of the tree, or any task that needs to act on the root before its sub-trees. This diagram illustrates the order in which a preorder traversal visits each node.
```
        1
      /   \
     2     5
    / \   / \
   3   4 6   7
```
Algorithm:
```
preorder(p):
    "visit" p
    for each child c in children(p) do
        preorder(c)
```
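A sketch of the same algorithm in Rust over a simple n-ary node (names are illustrative):

```rust
struct Node<T> {
    value: T,
    children: Vec<Node<T>>,
}

// Preorder: "visit" the node first, then recurse into each child.
fn preorder<T: std::fmt::Display>(node: &Node<T>) {
    println!("{}", node.value); // the "visit" action
    for child in &node.children {
        preorder(child);
    }
}

fn main() {
    // The seven-node tree from the diagram above.
    let tree = Node { value: 1, children: vec![
        Node { value: 2, children: vec![
            Node { value: 3, children: vec![] },
            Node { value: 4, children: vec![] },
        ]},
        Node { value: 5, children: vec![
            Node { value: 6, children: vec![] },
            Node { value: 7, children: vec![] },
        ]},
    ]};
    preorder(&tree); // prints 1 through 7 in preorder
}
```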
Postorder
The postorder traversal is a kind of depth-first traversal that visits the children of the given root first, followed by the root `p` itself. It is ideal for tasks like deleting a tree, or for problems where you need to solve sub-problems before dealing with the parent problem. This diagram illustrates the order in which a postorder traversal visits each node.
```
        7
      /   \
     3     6
    / \   / \
   1   2 4   5
```
Algorithm:
```
postorder(p):
    for each child c in children(p) do
        postorder(c)
    "visit" p
```
The disk usage algorithm in this project’s associated GitHub repo represents a postorder traversal of a directory tree.
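In that spirit, here is a hedged sketch of a postorder computation in Rust: a directory's total size is only known after its children's totals are computed. The `Dir` type and all names are illustrative, not the repo's actual code:

```rust
struct Dir {
    name: String,
    size: u64,          // size of this directory's own entries
    children: Vec<Dir>, // sub-directories
}

// Postorder: total up each child sub-tree first, then "visit" the parent
// by combining its own size with its children's totals.
fn total_size(dir: &Dir) -> u64 {
    let children_total: u64 = dir.children.iter().map(total_size).sum();
    let total = dir.size + children_total;
    println!("{}: {} bytes", dir.name, total); // visit after the children
    total
}

fn main() {
    let root = Dir {
        name: "root".into(),
        size: 10,
        children: vec![
            Dir { name: "a".into(), size: 5, children: vec![] },
            Dir { name: "b".into(), size: 7, children: vec![] },
        ],
    };
    total_size(&root); // prints a, then b, then root
}
```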
Inorder
This kind of depth-first traversal only applies to binary trees. An inorder traversal visits the left sub-tree of position `p` (as the sub-tree's root), then `p`, and then the right sub-tree of `p`. The output of an inorder traversal orders all elements in a tree from left-to-right (or right-to-left if you want, I'm not your dad). This diagram illustrates the order in which an inorder traversal visits each node.
```
        4
      /   \
     2     6
    / \   / \
   1   3 5   7
```
The inorder traversal is well suited for processing binary expression trees. For example, if you construct a (binary) expression tree exclusively from binary operators, such as the basic arithmetic operators (+, -, *, /), inorder traversal is a natural choice for constructing flattened infix expressions. In such a tree the internal nodes are operators and the leaves are operands. Consider the binary tree:
```
        *
      /   \
     +     -
    / \   / \
   3   1 4   2
```
Using inorder traversal this results in the infix expression `3 + 1 * 4 - 2`. Note that the flattened string loses the grouping the tree encodes; emitting parentheses around each sub-tree recovers the intended expression, `(3 + 1) * (4 - 2)`. The creation and evaluation of such an expression (tree) is better suited to a stack, however. Naturally, you'll also require logic to handle proper arithmetic operator precedence when converting from or evaluating an infix representation, as the tree inherently enforces precedence through its structure.
Algorithm:
```
inorder(p):
    if p has a left child lc then
        inorder(lc)
    "visit" p
    if p has a right child rc then
        inorder(rc)
```
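A sketch of inorder traversal in Rust over a link-based binary node (the same illustrative shape used earlier):

```rust
struct BinaryNode<T> {
    value: T,
    left: Option<Box<BinaryNode<T>>>,
    right: Option<Box<BinaryNode<T>>>,
}

// Inorder: left sub-tree, then the node itself, then the right sub-tree.
fn inorder<T: std::fmt::Display>(node: &BinaryNode<T>) {
    if let Some(lc) = &node.left {
        inorder(lc);
    }
    println!("{}", node.value); // the "visit" action
    if let Some(rc) = &node.right {
        inorder(rc);
    }
}

fn main() {
    // A three-node tree; inorder prints 1, 2, 3.
    let tree = BinaryNode {
        value: 2,
        left: Some(Box::new(BinaryNode { value: 1, left: None, right: None })),
        right: Some(Box::new(BinaryNode { value: 3, left: None, right: None })),
    };
    inorder(&tree);
}
```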
Depth-First Traversal Summary
If you revisit the ordered processes from the example at the top of this section, you can now apply a label to each one.
```
[root, left, right]   // Preorder
[left, root, right]   // Inorder (binary trees only)
[left, right, root]   // Postorder
```
Breadth-First Traversal
The defining characteristic of the breadth-first approach is to visit all nodes at the same level of a tree before traversing their sub-trees. This approach typically relies on a queue implementation to store positions at a given level. This diagram illustrates the order in which the traversal visits each node.
```
        1
      /   \
     2     3
    / \   / \
   4   5 6   7
```
Algorithm:
```
breadth_first():
    initialize queue Q to contain root()
    while Q not empty do
        p = Q.dequeue()
        perform the "visit" action for position p
        for each child c in children(p) do
            Q.enqueue(c)
```
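A sketch of the same algorithm in Rust, using the standard library's `VecDeque` as the queue (node shape is illustrative):

```rust
use std::collections::VecDeque;

struct Node<T> {
    value: T,
    children: Vec<Node<T>>,
}

// Breadth-first: visit nodes level by level, queueing children as you go.
fn breadth_first<T: std::fmt::Display>(root: &Node<T>) {
    let mut queue = VecDeque::new();
    queue.push_back(root);
    while let Some(node) = queue.pop_front() {
        println!("{}", node.value); // the "visit" action
        for child in &node.children {
            queue.push_back(child);
        }
    }
}

fn main() {
    // The seven-node tree from the diagram above.
    let tree = Node { value: 1, children: vec![
        Node { value: 2, children: vec![
            Node { value: 4, children: vec![] },
            Node { value: 5, children: vec![] },
        ]},
        Node { value: 3, children: vec![
            Node { value: 6, children: vec![] },
            Node { value: 7, children: vec![] },
        ]},
    ]};
    breadth_first(&tree); // prints 1 through 7 in level order
}
```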
Structural Implementations
It is common to implement trees with a combination of linked and indexed structures. For the basic tree structure, a linked approach is a natural choice due to its flexibility and efficiency in dynamic scenarios like insertion and deletion. Array-based structures can be used for the base tree structure, but they often suffer from spatial inefficiency (due to pre-allocation or re-sizing) and prohibitive temporal costs (like shifting elements). Specific use cases, such as proper binary trees and heaps (a specialized binary tree), leverage array-based implementations effectively, providing spatial efficiency and maintaining O(log n) temporal complexity for key operations.
On top of a link-based structure, it's common to use indexed structures for managing child nodes in n-ary trees. Indexed structures make it easier to track variable numbers of child nodes during traversal operations.
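As an example of the array-based case mentioned above, a complete binary tree (the layout binary heaps rely on) can live in a flat array with no stored links at all; positions are computed with index arithmetic. A sketch, assuming the root sits at index 0:

```rust
// Index arithmetic for a complete binary tree stored in a flat array.
fn left(i: usize) -> usize { 2 * i + 1 }
fn right(i: usize) -> usize { 2 * i + 2 }
fn parent(i: usize) -> Option<usize> {
    if i == 0 { None } else { Some((i - 1) / 2) }
}

fn main() {
    // Level-order layout of a seven-node tree: root, then level 1, then level 2.
    let tree = [1, 2, 3, 4, 5, 6, 7];
    assert_eq!(tree[left(0)], 2);   // the root's left child
    assert_eq!(tree[right(0)], 3);  // the root's right child
    assert_eq!(parent(4), Some(1)); // node 5's parent is node 2
}
```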