DATA ANALYSIS  

Data Structures: Trees - Theory

📥
You can download the original Jupyter Notebook for this lesson: trees_theory.ipynb

1. What is a Tree?

A tree is a hierarchical, non-linear data structure. Unlike arrays or linked lists, which store data in a sequential (linear) manner, trees represent relationships between elements using a parent-child hierarchy.

DOM Tree Example

Dynamic Data Structure

A tree is a dynamic data structure, meaning it can grow or shrink in size during execution by allocating or deallocating memory as needed. This is in contrast to static data structures, such as a fixed-size array in C or Java, where the memory allocation is predetermined and cannot be changed without significant overhead or creating a new structure.

Key Terminology

  • Root: The top-most node of the tree (the ancestor of all other nodes).
  • Node: A basic unit containing data and references to child nodes.
  • Edge: The connection between two nodes.
  • Leaf: A node with no children.
  • Parent/Child: A node directly above/below another node.
  • Height of a Tree: The length of the longest path from the root to a leaf (number of edges).
  • Depth of a Node: The number of edges from the root to that specific node.
  • Degree: The number of children a node has.
  • Subtree: A portion of a tree that can be viewed as a complete tree itself.

Representations

Trees can be represented in various ways. Below are common methods: graph, table, nested sets.

Tree Representations

2. Operations and Complexity

We use trees because the complexity of their most important operations is often significantly better than other data structures (like O(n) for searching in a simple list).

Most Important Operations

  • Insertion: Adding a new element while maintaining the tree structure.
  • Deletion: Removing a node and reconfiguring the tree.
  • Search: Finding a specific value within the tree.
  • Traversal (Retrieval): Visiting all nodes in a specific order:
  • Inorder: Left -> Root -> Right (common for BSTs to get sorted data).
  • Preorder: Root -> Left -> Right (common for copying trees).
  • Postorder: Left -> Right -> Root (common for deleting trees or evaluating expressions).

3. When do Developers use Trees in Practice?

Trees are everywhere in software engineering, even if we do not always build them from scratch:

Hierarchical Data

Representing file systems, organization charts, or category taxonomies.

Web Development

The DOM (Document Object Model) in HTML is a tree. Python libraries like BeautifulSoup or lxml represent web pages as trees for scraping.

Machine Learning

Decision Trees and Random Forests (e.g., in scikit-learn) are fundamental for classification and regression.

Databases

Indexes are often stored as B-Trees or LSM Trees to allow for fast searching (O(log n)). (Developers often do not build these large trees manually if the data is already in a database, as the database engine handles the tree structures internally.)

Compilers/Parsers

Programming languages use Abstract Syntax Trees (AST) to analyze and execute code.

Autocomplete and Predictive Text (Tries)

Suggestions in Google or text messages are generated using a Trie (prefix tree). Each node is a letter; following the path as you type allows the app to instantly find words sharing that prefix.

Comment Sections

Threaded conversations (like Reddit) use trees to manage "replies to replies."

Nested Category Menus

E-commerce sites organize products into trees (Electronics -> Computers -> Laptops).

Portfolio Breakdown

Wealth management apps split your total portfolio (Root) into Asset Classes, then Sectors, and finally individual Tickers.

Professional Python Packages for Tree Management

While implementing your own Node class is educational, professional developers often use robust libraries:

anytree

Flexible library for generic trees with visualization support.

treelib

Efficient library for tree management and console display.

bigtree

Integrates trees with Python dictionaries and Pandas DataFrames.

scikit-learn

Specifically for Machine Learning decision trees.

lxml / BeautifulSoup

For working with XML/HTML trees.

BTrees

A library providing persistent, disk-based B-Tree structures (often used with ZODB).

bplusj

Or custom implementations for B+ trees in high-performance data storage.

5. Classification of Trees

Trees are categorized based on their structural properties and intended use. Note that some trees can belong to multiple categories (e.g., an AVL tree is both a Binary Tree and a Balanced Search Tree).

Main Categories

1. General Trees: Nodes can have any number of children.

2. Binary Trees: Each node has at most two children.

3. Search Trees: Specifically organized to allow for fast searching (e.g., BST).

4. Balanced Trees: Automatically rebalance themselves to maintain O(log n) height. They use rotations to ensure no branch becomes too long.

5. Multi-way Trees: Nodes can have more than two children (e.g., B-Trees).

6. Heaps: Specialized trees used for priority queues where the root is always the max or min element.

6. Detailed Tree Types and Complexity

Binary Search Tree (BST)

  • Description: Left child is smaller than parent; right child is larger.
  • Use Case: Simple data sorting and retrieval.
OperationAverage ComplexityWorst Case
SearchO(log n)O(n)
InsertionO(log n)O(n)
DeletionO(log n)O(n)

AVL Tree

  • Description: A strictly balanced BST. The height difference between subtrees is at most 1.
  • Why O(log n)?: Since the tree is always balanced, the height is logarithmic. If you have $N$ elements, the number of levels is approximately $\log_2(N+1)$. For example, with 1023 elements, you only need about 10 levels.
  • Library: pyavl or anytree (with custom logic).
OperationComplexity
SearchO(log n)
InsertionO(log n)
DeletionO(log n)

Red-Black Tree

  • Description: A balanced tree that uses a color-coding system (red/black) to ensure the tree remains roughly balanced.
  • Use Case: Used in Java TreeMap, C++ STL Map, and the Linux kernel scheduler.
  • Library: bintrees package.

B-Trees and B+ Trees

  • Description: Multi-way search trees designed for storage systems. B+ trees store all data in leaf nodes for faster sequential access.
  • Database Usage: Used in PostgreSQL and MySQL for indexes. While B-Trees handle ranges and equality, databases may use Hash Indexes specifically for string columns when only exact equality searches are needed.
  • Library: BTrees or python-bplus-tree.
OperationComplexity
SearchO(log n)
InsertionO(log n)
DeletionO(log n)

Optimal and Priority Search Trees

  • Optimal Search Trees: Built based on the frequency of access to each key to minimize total search time.
  • Priority Search Trees: Combine properties of heaps and search trees to allow for range queries on one dimension and priority on another.

7. Decision Trees in Machine Learning

Decision Trees are predictive models that map observations about an item to conclusions about the item target value.

  • Characteristics: They use a flowchart-like structure where each internal node represents a "test" on an attribute.
  • Usage: Classification (is this email spam?) and Regression (what is the predicted house price?).
  • Python Implementation: The most common way to use them is via scikit-learn using DecisionTreeClassifier or DecisionTreeRegressor.

8. SortedContainers: An Alternative to Trees

In Python, the sortedcontainers library is often preferred over manually implementing tree structures.

  • What it is: A pure-Python implementation of sorted list, sorted dict, and sorted set types.
  • Complexity: Most operations are $O(\log n)$ or better (using a mix of lists and sub-lists).
  • Advantages: Extremely fast in Python because it leverages Python list optimizations. It is often faster than balanced trees implemented in pure Python.

9. Trees vs Graphs

While trees are a type of graph, there are critical differences:

  • Cycle: A tree is an acyclic graph (no loops). Graphs can have cycles.
  • Connectivity: A tree must have exactly one path between any two nodes. Graphs can have multiple paths.
  • Roots: Trees always have a designated root; graphs are generally unrooted.
  • Hierarchy: Trees are hierarchical; graphs can be any shape (e.g., social networks).

Contact details:

+48 790-430-860

analysislessons@gmail.com