The Math Behind Decision Trees

EliteAI

Part 1: A Statistical Guide to Machine Learning

A decision tree is a non-parametric, supervised learning algorithm used for classification and regression tasks. A decision tree classifier is essentially a nested if-else classifier: it works through recursive partitioning of the training set in order to obtain subsets that are as pure as possible with respect to a given target class. Each node of the tree is associated with a particular set of records 𝑇 that is split by a specific test on a feature.
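To make the nested if-else idea concrete, here is a minimal sketch of what a small learned tree looks like once written out as code; the features, thresholds, and labels below are made up purely for illustration:

```python
# A trained decision tree is just a set of nested if-else tests on feature values.
# The features, thresholds, and labels here are hypothetical, purely for illustration.
def classify(petal_length: float, sepal_length: float) -> str:
    if petal_length >= 4.0:            # test at the root node
        if sepal_length >= 1.0:        # test at an internal node
            return "x"                 # leaf: class label
        return "o"
    return "o"

print(classify(petal_length=4.5, sepal_length=1.2))   # -> x
```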

There are various algorithms used to generate a decision tree from data; some of them are: CART (Classification and Regression Trees), C4.5, CHAID, and ID3.

x1: petal length, x2: sepal length, ‘o’: Rose, ‘x’: Sunflower

How do we build a decision tree to classify a new data point as “x” or “o”?

Each node in the tree specifies a test on an attribute, and each branch descending from that node corresponds to one of the possible values of that attribute. Each leaf represents the class label assigned to the instances that reach it.

The initial set was the entire data set, which we then split into subsets according to the rules. For each subset we performed additional splitting until we were able to correctly classify every data point. But how do we split a dataset into subsets? This depends on the target variable…

Let’s construct a decision tree using “information gain” as the splitting criterion:

  • Steps in the ID3 algorithm:

To construct a decision tree from this data, we first have to convert the continuous data into categorical data.

You could categorize the data using scikit-learn, but to keep things simple let’s use a basic comparison-based method.

Say we split column C into two categories, ≥ 4.0 and < 4.0, and column D into ≥ 1.0 and < 1.0. Note that column E is the target class.
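A minimal sketch of that comparison-based binarization using pandas; the toy values below are hypothetical, and only the 4.0 and 1.0 thresholds come from the setup above:

```python
import numpy as np
import pandas as pd

# Hypothetical toy values; C and D are continuous features, E is the target class.
df = pd.DataFrame({
    "C": [3.2, 4.5, 5.1, 2.8, 4.0],
    "D": [0.4, 1.3, 0.9, 1.1, 1.0],
    "E": ["no", "yes", "yes", "no", "yes"],
})

# Comparison-based binarization: split C at 4.0 and D at 1.0.
df["C_cat"] = np.where(df["C"] >= 4.0, ">=4.0", "<4.0")
df["D_cat"] = np.where(df["D"] >= 1.0, ">=1.0", "<1.0")
print(df[["C_cat", "D_cat", "E"]])
```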

  1. It begins with the original set S as the root node.
  2. On each iteration, the algorithm iterates through the unused attributes of the set S and calculates the entropy (H) and information gain (IG) of each attribute.

Let’s calculate the entropy (H) and information gain (IG). First, the entropy of the target variable E:
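For reference, if a set S contains classes with proportions p_i, its entropy is defined as:

H(S) = -\sum_i p_i \log_2 p_i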

Next, the information gain of attributes C and D:
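In general, the information gain from splitting S on an attribute A with values v (producing subsets S_v) is:

IG(S, A) = H(S) - \sum_{v} \frac{|S_v|}{|S|} H(S_v)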

3. It then selects the attribute with the smallest entropy (equivalently, the largest information gain).

IG(Target, C) = 0.283

IG(Target, D) = 0.3789

Since the IG of D is larger, D will be used to split the dataset.
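Here is a minimal sketch of how entropy and information gain could be computed in code. The small dataset below is hypothetical, so the printed values will not reproduce 0.283 and 0.3789:

```python
from collections import Counter
from math import log2

def entropy(labels):
    """H(S): sum of -p_i * log2(p_i) over the class proportions p_i."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(attribute, labels):
    """IG(S, A): H(S) minus the weighted entropy of the subsets created by A."""
    n = len(labels)
    remainder = 0.0
    for value in set(attribute):
        subset = [lab for a, lab in zip(attribute, labels) if a == value]
        remainder += (len(subset) / n) * entropy(subset)
    return entropy(labels) - remainder

# Hypothetical binarized columns; E is the target class.
C = [">=4.0", "<4.0", ">=4.0", "<4.0", ">=4.0"]
D = [">=1.0", ">=1.0", "<1.0", "<1.0", ">=1.0"]
E = ["yes",   "no",    "yes",  "no",   "yes"]

print(information_gain(C, E), information_gain(D, E))
```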

4. The set S is then split on the selected attribute to produce subsets of the data.

5. The algorithm continues to recurse on each subset, considering only attributes never selected before.

Thus, since the only attribute not yet selected is C, the algorithm will recurse on C.
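Putting steps 1–5 together, a compact (and deliberately simplified) recursive ID3 sketch might look like this; the records at the bottom are hypothetical and only mirror the C / D / E setup used above:

```python
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def id3(rows, target, attributes):
    """rows: list of dicts, target: name of the class column, attributes: not-yet-used columns."""
    labels = [r[target] for r in rows]
    if len(set(labels)) == 1 or not attributes:       # node is pure, or nothing left to split on
        return Counter(labels).most_common(1)[0][0]   # leaf: (majority) class label

    def remainder(attr):                              # weighted entropy after splitting on attr
        total = 0.0
        for v in {r[attr] for r in rows}:
            sub = [r for r in rows if r[attr] == v]
            total += (len(sub) / len(rows)) * entropy([r[target] for r in sub])
        return total

    best = min(attributes, key=remainder)             # smallest remainder = largest information gain
    return {best: {
        v: id3([r for r in rows if r[best] == v], target,
               [a for a in attributes if a != best])
        for v in {r[best] for r in rows}
    }}

# Hypothetical binarized records mirroring the C / D / E setup above.
data = [
    {"C": ">=4.0", "D": ">=1.0", "E": "yes"},
    {"C": "<4.0",  "D": ">=1.0", "E": "no"},
    {"C": ">=4.0", "D": "<1.0",  "E": "yes"},
    {"C": "<4.0",  "D": "<1.0",  "E": "no"},
]
print(id3(data, "E", ["C", "D"]))                     # e.g. {'C': {'>=4.0': 'yes', '<4.0': 'no'}}
```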

Some concepts !!

  • Stopping criteria (see the scikit-learn sketch after this list):
  1. Number of cases in the node is less than some pre-specified limit.
  2. Purity of the node is more than some pre-specified limit.
  3. Depth of the node is more than some pre-specified limit.
  4. Predictor values for all records are identical, in which case no rule can be generated to split them.
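In scikit-learn's DecisionTreeClassifier, stopping criteria of this kind are exposed as constructor parameters (criterion 4 is handled internally); a minimal sketch with arbitrary parameter values:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

clf = DecisionTreeClassifier(
    criterion="entropy",          # entropy-based splits, i.e. information gain
    min_samples_split=10,         # criterion 1: do not split nodes with too few cases
    min_impurity_decrease=0.01,   # related to criterion 2: require a minimum impurity decrease per split
    max_depth=4,                  # criterion 3: cap the depth of the tree
    random_state=0,
)
clf.fit(X, y)
print(clf.get_depth(), clf.get_n_leaves())
```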
  • Overfitting:

A decision tree trained on any training data runs the risk of overfitting it. Eventually each leaf comes to represent a very specific combination of attribute values seen in the training data, and the tree will consequently fail to classify attribute-value combinations that were not seen during training. To prevent this, we prune the decision tree using techniques such as pre-pruning and post-pruning. The simplest technique is to prune away the portions of the tree that provide the least information gain, as in the steps below (a code sketch follows the list).

  1. Count the total number of leaves in the tree.
  2. While the number of leaves in the tree exceeds the desired number:
  • Find the twig with the least information gain.
  • Remove all child nodes of the twig.
  • Relabel the twig as a leaf.
  • Update the leaf count.
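
A rough sketch of that pruning loop, assuming a hypothetical Node structure in which each internal node stores the information gain of its split:

```python
class Node:
    """Hypothetical tree node; 'gain' is the information gain of the split made at this node."""
    def __init__(self, label=None, gain=0.0, children=None):
        self.label = label              # class label, used once the node acts as a leaf
        self.gain = gain
        self.children = children or []

    def is_leaf(self):
        return not self.children

def leaves(node):
    return [node] if node.is_leaf() else [leaf for c in node.children for leaf in leaves(c)]

def twigs(node):
    """A twig is an internal node whose children are all leaves."""
    found = [node] if node.children and all(c.is_leaf() for c in node.children) else []
    for c in node.children:
        found += twigs(c)
    return found

def prune(root, max_leaves):
    # Count the leaves and keep pruning while there are too many of them.
    while len(leaves(root)) > max_leaves:
        twig = min(twigs(root), key=lambda t: t.gain)   # twig with the least information gain
        twig.children = []                              # remove its children: the twig is now a leaf
        # (in practice you would also set twig.label to the majority class of its records)
    return root
```

Note that scikit-learn's built-in post-pruning uses cost-complexity pruning via the ccp_alpha parameter rather than this leaf-count loop, but the idea of trading leaves for simplicity is the same.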

Thanks for reading. If you have any feedback, please feel free to reach out by commenting on this post.

Check out our website! https://eliteaihub.com/

We keep adding blogs, tools, and videos to help you understand the math behind ML and AI!

