Decision Trees From Root to Leaves — Decision Trees for Machine Learning Explained

Published

May 25, 2022

Reading time

3 min read

What kind of beast was Aristotle? The philosopher's follower Porphyry, who lived in Syria during the third century, came up with a logical way to answer the question. He organized Aristotle’s proposed “categories of being” from general to specific and assigned Aristotle himself to each category in turn: Aristotle’s substance occupied space rather than being conceptual or spiritual; his body was animate not inanimate; his mind was rational not irrational. Thus his classification was human. Medieval teachers of logic drew the sequence as a vertical flowchart: An early decision tree.

The digital difference: Fast forward to 1963, when University of Michigan sociologist John Sonquist and economist James Morgan, dividing survey respondents into groups, first implemented decision trees in a computer. Such work became commonplace with the advent of software that automates training the algorithm, now available in a variety of machine learning libraries including scikit-learn. The code took a quartet of statisticians at Stanford and University of California Berkeley 10 years to develop. Today, coding a decision tree from scratch is a homework assignment in Machine Learning 101.

Roots in the sky: A decision tree can perform classification or regression. It grows downward, from root to canopy, in a hierarchy of decisions that sort input examples into two (or more) groups. Consider the task of Johann Blumenbach, the German physician and anthropologist who first distinguished monkeys from apes (setting aside humans) circa 1776, before which they had been categorized together. The classification depends on various criteria such as presence or absence of a tail, narrow or broad chest, upright versus crouched posture, and lesser or greater intelligence. A decision tree trained to label such animals would consider each criterion one by one, ultimately separating the two groups.

The tree starts with a root node that can be viewed as containing all examples in a dataset of creatures — chimpanzees, gorillas, and orangutans as well as capuchins, baboons, and marmosets. The root presents a choice between examples that exhibit a particular feature or not, leading to two child nodes that contain examples with and without that feature. Each child poses yet another choice that leads to two more children, and so on. The process ends with any number of leaf nodes, each of which, mosty or wholly, contains examples of one class.
To grow, the tree must find the root decision. To choose, it considers all features and their values — posterior appendage, barrel chest, and so on — and chooses the one that maximizes the purity of the split. (Optimal purity is defined as 100 percent of examples of one class going to a particular child node and none going to the other node.) Splits are rarely 100 percent pure after just one decision and may never get there, so the process continues, producing level after level of child nodes, until purity doesn’t rise much by considering further features. At this point, the tree is fully trained.
At inference, a fresh example traverses the tree, which evaluates a different decision at each level from top to bottom. The example takes the label of the data contained by the leaf node it lands in.

Top 10 hit: Given Blumenbach’s conclusion (later overturned by Charles Darwin) that humans are distinguished from apes by a broad pelvis, hands, and close-set teeth, what if we wanted to extend the decision tree to classify not just apes and monkeys but humans as well? Australian computer scientist John Ross Quinlan made this possible in 1986 with ID3, which extended decision trees to support nonbinary outcomes. In 2008, a further refinement called C4.5 capped a list of Top 10 Algorithms in Data Mining curated by the IEEE International Conference on Data Mining. In a world of rampant innovation, that’s staying power.

Raking leaves: Decision trees do have some drawbacks. They can easily overfit the data by growing so many levels that leaf nodes include as few as one example. Worse, they’re prone to the butterfly effect: Change one example, and the tree that grows could look dramatically different.

Into the woods: Turning this trait into an advantage, American statistician Leo Breiman and New Zealander statistician Adele Cutler in 2001 developed the random forest, an ensemble of decision trees, each of which processes a different, overlapping selection of examples and then votes on a final decision. Random forest and its cousin XGBoost are less prone to overfitting, which helps make them among the most popular machine learning algorithms. It’s like having Aristotle, Porphyry, Blumenbach, Darwin, Jane Goodall, Dian Fossey, and 1,000 other zoologists in the room together, all making sure your classifications are the best they can be.

Subscribe to The Batch