# Understanding Decision Trees Behind the Scene

**Background**

Decision tree classifier is a supervised learning algorithm that generates a decision tree as a predictive model which maps new observations to their expected labels. If the labels are discrete (ordinal or nominal) values, the decision tree is called a classification tree (the focus of this article), and if the labels are continuous values, the decision tree is called a regression tree. The structure of a decision tree is depicted as follows:

Decision trees grow incrementally downward. The entirety of the training data is split up into smaller and smaller subsets. A subset is known as a node. The line connecting two nodes is called an edge (or a branch) which specifies a feature condition that splits data into subsequent nodes. The root node starts at the topmost position, and it has no incoming edge and zero or more outgoing edges. An internal node has just one incoming edge and two or more outgoing edges. Non-terminal nodes (such as root and internal nodes) contain feature conditions that separate data by different feature values. Leaf (or terminal) nodes have just one incoming edge and no outgoing edges. Each leaf node is assigned with a class label that all its instances are members of.

The benefits of using decision tree classifiers include:

- Implicit feature selection – The first few nodes at the top of the tree are essentially the most important features, thus feature selection is carried out implicitly as part of the decision tree construction.
- Less data preparation – Data transformation such as normalization and re-scaling, unlike in some other machine learning algorithms, are generally not required as they will not affect the outcomes of the tree construction.
- Modeling nonlinear relationships – There is no linear assumption of the data in decision tree which makes it a suitable learning algorithm for non-linear data.
- Ease in interpretation – Decision trees can be graphically depicted and it is straightforward to interpret and explain since it is analogous to how people actually make decisions.

**Building a Decision Tree**

One of the core algorithms for growing decision trees is called ID3, by J. R. Quinlan. The ID3 uses a top-down, greedy search over the space of possible edges with no backtracking. A greedy algorithm is an algorithm that adheres to the problem solving heuristic of making the locally optimal choice at each stage in hope of finding a global optimum. Achieving a global optimum is not guaranteed in this algorithm, nonetheless a greedy heuristic should yield local optimums that approximate a globally optimal solution in a reasonable time. No backtracking means that the greedy search makes the best decision in each step and never looks back to reconsider earlier choices, even though there may be a more optimized path as a whole.

**Greedy Algorithm**

Greedy algorithm selects the “best choice” at each individual step hoping that the overall result would have reached or at least be close to the overall best decision sequence.

*Scenario #1:*

For instance, we set an example objective of using the least number of coins to add up to a certain amount. The available coin types are X: {$1, $3, $4} and the amount to be added up to is y: $10.

Step 1: Since $10 is larger than every single element of X, the greedy algorithm will choose the highest possible values from X: $4 as its first step.

Step 2: The remaining amount is $10-$4 = $6 which is larger than all the elements in X. Thus, $4 is chosen again.

Step 3: The remaining $2 will be matched by $1 coin which is the largest possible option.

Step 4: The remaining $1 will be matched by $1 coin. Iteration ends.

Thus, it takes 4 coins ($4+$4+$1+$1) to add up to $10. While greedy algorithm finds one of the best solutions in scenario #1, it does not always find the best solution necessarily. The following example will demonstrate why.

*Scenario #2:*

Once again, we set the objective of using the least number of coins to add up to a certain amount. The available coin types are X: {$1, $3, $4} but the amount to be added up to is y: $6.

Step 1: The largest value in X ($4) will be chosen.

Step 2: The remaining $2 will be matched by $1 coin.

Step 3: The remaining $1 will be matched by $1 coin. Iteration ends.

Thus, it takes 3 coins ($4+$1+$1) to add up to $6 with a greedy approach. In this scenario, while using up three coins is not the worst solution, it is not the most optimal solution (which is only using two $3 coins) either. The key benefit of the greedy algorithm is that it is much computationally less intensive than having to exhaustively searching the best solution from the entire dataset. The drawback of this benefit is that it may not find the most optimal solution.

**Growing a Decision Tree**

The core idea of ID3 algorithm is to construct the decision tree by employing a top-down, greedy search through the given dataset. In order to identify an optimal way to classify a learning task, the ID3 will find a feature condition that best splits the data into subsequent tree nodes in each iteration. How good a split is depends on how “pure” (or homogeneous) the resulting nodes are with respect to the new distribution of classes. The goal is to split the sample into nodes that each contains as many instances of the same class as possible. The quantification of purity (or impurity) is typically done by measures such as entropy and Gini index.

**Entropy and Gini Index**

Entropy measures the level of impurity of data in a group of instances. Within a given node s with c number of classes, the fraction of instances belonging to class i is denoted as p(i|S). The entropy of this node will be:

Entropy ranges from 0 to 1. The lower the entropy, the lower the impurity (and higher the purity), indicating that the instances in the node mostly belonging to the same class. The higher the entropy, the higher the impurity (or lower the purity), indicating that the instances in the node have a good mix of more than one classes.

Another quantification of the impurity of data is called the Gini index, given the following equation:

The interpretation of the Gini index value is similar to entropy, as they represent essential the same concept but in a different scale, which is illustrated below:

**Information Gain**

In order to determine which feature condition to use to split data, an objective quantification called information gain is used. Information gain is the difference in entropy between the node before (parent) and the nodes after (children) splitting:

where I_{(parent)} is the impurity measure of a given parent node, N is the total number of instances at the parent node, k is the number of child nodes, and N(v_{j}) is the number of instances in child node v_{j}. I_{(vj)} is the impurity measure of a child node v_{j}. Essentially, the information gain is equal to the impurity at the parent node minus the weighted average of impurity of the child nodes after splitting. The feature condition that gives the largest reduction in impurity (a.k.a. the largest information gain) is seen as the best feature condition to use to split the data at that node since it is the most capable to group data instances of similar classes into smaller subsets compared to other feature conditions.

**ID3 Algorithm**

The ID3 algorithm performs the calculation of Δ_{info} on each feature value and picks the feature that gives the highest information (greedy approach) gain for each node generation. Its underlying objectives are to create as small a decision tree as possible (with smallest number of nodes and depth of the final tree) such that instances can be identified into the correct classes after only a few decision tree splitting.

The if-then rules for executing the ID3 algorithm are as follows:

- Step 1: Let ID3(Leaning sets, Feature sets, Feature values) be ID3(S, F, V). Create root node, then add learning set S into root node as its subset.
- Step 2: If the entropy of a subset S from a feature value V is 0, then all the instances belong to the same class. A leaf node is returned.
- Step 3: If none of the subsets S based on different feature values V has an entropy of 0, then compute Δ
_{info }on each feature left (that has not been used in splitting before). Find feature F that results with the maximum Δ_{info}. Create child nodes based on this feature and add to the root node. - Step 4: For each subsequent child node, apply ID3(S, F, V) recursively until reaching a node that
- has 0 entropy,
- all feature values are the same, or
- results a leaf node.

**Overfitting and Stopping Rules**

If a decision tree is allowed to fully grow without any additional stopping rules, it may reach 100% accuracy in the training example but suffer in generalization capability. This is an example of overfitting due to the innate ability that decision tree can keep splitting data even if a node has a single instance only. In this case, the decision tree has too closely memorize the noise from the training data, rendering it highly inaccurate to predict new, unseen instances. There are generally two approaches including the pre-pruning and post-pruning of the decision tree to avoid this overfitting problem.

**Pre-pruning**

Pre-pruning stops the building of the decision tree before it reaches its fully-grown state. Thus, instead of using the typical stopping rules as in step 4 in the ID3 iterations above, more restrictive stopping conditions could be applied:

- Stop if number of instances is fewer than a user-specified threshold
- Stop if depth of a tree has reached a user-specified threshold
- Stop if class distribution of instances are independent of the available features
- Stop if expanding the current node does not improve impurity measures

**Post-pruning**

Post-pruning, in contrast, removes parts of the decision tree after it is allowed to have fully grown. Some believe it is more preferable than pre-pruning since it is often difficult and non-systematic to choose stopping threshold a-priori. Post-pruning “trims” the nodes of the tree from bottom-up. A portion called sub-tree is trimmed out and replaced by a leaf node. The class label of this new leaf node is determined from the majority class of instances derived from the sub-tree. An example of post-pruning is holding out a random data subset (validation set) from the original training data. This allows the evaluation of the effect before and after the pruning. The validation set allows the estimation of errors between the entire subtree and the replaced leaf node. The node pruning aims at finding the highest reduction in error, and it repeats until error is no longer reduced.

**Source material**

## Leave a Reply

You must be logged in to post a comment.