# How the Decision Tree Algorithm Works

A decision tree is an algorithm for solving classification and regression problems. It takes the form of a tree-shaped diagram in which each branch represents a possible course of action, decision, or outcome. A decision tree used to solve a classification problem is called a classification tree, and one used to solve a regression problem is called a regression tree. Decision tree algorithms are used in applications such as weather prediction, medical diagnosis, and robot control.

The advantages of decision trees are that they are simple to understand, visualize, and interpret; little effort is required to prepare the dataset; they can handle both numerical and categorical data; and non-linear relationships in the data do not affect their performance. The main disadvantage of decision trees is overfitting, which occurs when the algorithm captures noise in the dataset. Another disadvantage is high variance: a small variation in the data can produce a very different tree, making the model unstable. Related to this, a highly complex decision tree has low bias; it fits the training data almost perfectly, which makes it generalize poorly to new data.

## Entropy and Information Gain in Decision Trees

To use decision trees, a few terms need to be understood. Foremost is entropy, the measure of uncertainty, unpredictability, or randomness in a dataset. An example of a high-entropy dataset is a group of animals with many differing features. Consider 4 different animals, each with different features (color, height, etc.). Such a dataset has high entropy because we cannot simply decide which animal is which by looking at the group. The other key term is information gain, the measure of the decrease in entropy. In a decision tree, after a decision is made and the dataset is split, the entropy decreases from the higher node to the lower node, and there is information gain. If E1 is the entropy at the higher node and E2 is the entropy at the lower node, then the difference E1 - E2 is called the information gain. The decision tree algorithm works by calculating entropy and information gain at each split until the ends of the tree, the leaf nodes, are reached.
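These two definitions can be sketched in a few lines of Python (a minimal illustration, not library code; the animal names are taken from the example that follows):

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy in bits of a list of class labels."""
    total = len(labels)
    counts = Counter(labels)
    # Sum -p * log2(p) over the relative frequency p of each class.
    return sum(-(n / total) * log2(n / total) for n in counts.values())

# A mixed group of animals is hard to predict: high entropy.
mixed = ["elephant", "giraffe", "lion", "monkey"]
print(entropy(mixed))  # 2.0 -> four equally likely classes

# A group of a single kind of animal is fully predictable: zero entropy.
pure = ["giraffe", "giraffe", "giraffe"]
print(entropy(pure))   # 0.0
```

Information gain is then simply the entropy before a split minus the entropy after it.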

## How the Decision Tree Works

The first node is called the root node. The leaf nodes, the end points of the decision tree, are the decisions or classifications. Consider a dataset consisting of a group of animals with color and height as their features: 2 elephants (height=9, color=gray), 3 giraffes (height=10, color=yellow), 2 lions (height=5, color=yellow), and 1 monkey (height=3, color=brown). This is the initial dataset, or training dataset. After the decision tree is trained, we can input an unknown animal and classify it into one of the decisions.
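For reference, the training set above can be written down in code as a small table (a hypothetical representation; the feature names come from the text):

```python
# Each row: (species, height, color), matching the counts in the text.
animals = [
    ("elephant", 9, "gray"),
    ("elephant", 9, "gray"),
    ("giraffe", 10, "yellow"),
    ("giraffe", 10, "yellow"),
    ("giraffe", 10, "yellow"),
    ("lion", 5, "yellow"),
    ("lion", 5, "yellow"),
    ("monkey", 3, "brown"),
]
print(len(animals))  # 8 animals in total
```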

The first task is to find the first condition to split the data from the root node. The condition should be such that highest information gain is achieved.

The entropy is given by,

H = -∑ pi log2(pi)

where pi is the probability of occurrence of class i; the base-2 logarithm gives entropy in bits. We start at the root node.

In our case, the entropy is calculated as below.

Total animals = 2 + 3 + 2 + 1 = 8

Next, we can calculate the probability of each animal in the dataset:

Probability of elephants(p1) = 2/8 = 0.25

Probability of giraffes(p2) = 3/8 = 0.375

Probability of lions(p3) = 2/8 = 0.25

Probability of monkey(p4) = 1/8 = 0.125

To calculate the entropy, we can use the formula:

Entropy = H = - (p1 * log2(p1) + p2 * log2(p2) + ... + pn * log2(pn))

where p1, p2, ..., pn are the probabilities of each class.

Plugging in the values, we get:

H = - (0.25 * log2(0.25) + 0.375 * log2(0.375) + 0.25 * log2(0.25) + 0.125 * log2(0.125))

= - (0.25 * -2 + 0.375 * -1.415 + 0.25 * -2 + 0.125 * -3) = - (-0.5 - 0.531 - 0.5 - 0.375) = 1.906

Therefore, the entropy at the root node of the given dataset is approximately 1.906.
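This arithmetic is easy to check in code. A minimal sketch in Python, using exact (unrounded) logarithms:

```python
from math import log2

# Class probabilities at the root node: elephants, giraffes, lions, monkey.
probs = [2/8, 3/8, 2/8, 1/8]

# H = -sum(p_i * log2(p_i)) over all classes.
H_root = -sum(p * log2(p) for p in probs)
print(round(H_root, 3))  # 1.906
```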

Using the color feature (yellow versus not yellow) we can make the first split. This gives two nodes: node A with the giraffes and lions, and node B with the elephants and monkey.

Node A: 3 giraffe(height=10,color=yellow), 2 lions(height=5,color=yellow)

The entropy at node A is calculated as follows.

Total animals = 3 + 2 = 5

Next, we can calculate the probability of each animal in the subset:

Probability of giraffes = 3/5 = 0.6

Probability of lions = 2/5 = 0.4

To calculate the entropy, we can use the formula:

Entropy = Ha = - (p1 * log2(p1) + p2 * log2(p2))

where p1 and p2 are the probabilities of each class.

Plugging in the values, we get:

Ha = - (0.6 * log2(0.6) + 0.4 * log2(0.4)) = - (0.6 * -0.737 + 0.4 * -1.322) = - (-0.442 - 0.529) = 0.971

Therefore, the entropy of the subset of 3 giraffes and 2 lions is approximately 0.971.

We see that the entropy has decreased from the root entropy of 1.906 to the node A entropy of 0.971.

The information gain is IGa = H - Ha = 1.906 - 0.971 = 0.935.

Node B: 2 elephants(height=9,color=gray), 1 monkey(height=3,color=brown)

Similarly, the entropy of node B and the corresponding information gain are:

Hb = 0.918

and IGb = 0.99
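Both child-node entropies and their gains can be verified the same way. Here information gain is computed per node as the drop from the root entropy, following the text; exact logarithms give 0.987 for node B, which rounds to the 0.99 above:

```python
from math import log2

def entropy(counts):
    """Entropy in bits, given the class counts at a node."""
    total = sum(counts)
    return -sum((n / total) * log2(n / total) for n in counts if n > 0)

H = entropy([2, 3, 2, 1])   # root: all 8 animals
Ha = entropy([3, 2])        # node A: 3 giraffes, 2 lions
Hb = entropy([2, 1])        # node B: 2 elephants, 1 monkey

print(round(Ha, 3), round(H - Ha, 3))  # 0.971 0.935
print(round(Hb, 3), round(H - Hb, 3))  # 0.918 0.987
```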

Next we can split using height as the basis, with the condition: is the height greater than or equal to 9? This splits both node A and node B, giving nodes A1 and A2 from node A and nodes B1 and B2 from node B.

Node A1: 3 giraffes (height=10, color=yellow)

Entropy and information gain are:

Ha1 = 0

IGa1 = Ha - Ha1 = 0.971 - 0 = 0.971

Node A2: 2 lions (height=5, color=yellow)

Ha2 = 0

IGa2 = Ha - Ha2 = 0.971 - 0 = 0.971

Node B1: 2 elephants (height=9, color=gray)

Hb1 = 0

IGb1 = Hb - Hb1 = 0.918 - 0 = 0.918

Node B2: 1 monkey (height=3, color=brown)

Hb2 = 0

IGb2 = Hb - Hb2 = 0.918 - 0 = 0.918

Every leaf node now contains a single class, so its entropy is 0 and the tree is complete.
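Putting the splits together, the finished tree can be sketched as two nested rules: first on color, then on height. This is a minimal hand-coded classifier, not a library implementation; the thresholds come from the splits above:

```python
def classify(height, color):
    """Walk the tree built above: split on color, then on height >= 9."""
    if color == "yellow":                                  # node A
        return "giraffe" if height >= 9 else "lion"        # leaf A1 / A2
    else:                                                  # node B
        return "elephant" if height >= 9 else "monkey"     # leaf B1 / B2

# Classifying previously unseen animals:
print(classify(10, "yellow"))  # giraffe
print(classify(4, "brown"))    # monkey
```

Because every leaf is pure, this tree classifies all eight training animals correctly; a new animal simply follows the same branches.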