## Decision Tree

### Machine Learning vs. Classical Models

• Classical models: we state a statistical hypothesis and check its assumptions; if the hypothesis tests or assumptions fail, we reject the model.
• Machine learning: we run the algorithm directly on the data, letting the data speak for itself instead of steering it with initial hypotheses/assumptions. Examples: decision trees, random forests.

### Supervised and Unsupervised Learning

• What is supervised learning?
• What is unsupervised learning?

### Types of Decision Tree

• 1) Categorical Variable Decision Tree: a tree whose target variable is categorical.
• 2) Continuous Variable Decision Tree: a tree whose target variable is continuous.
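These two types correspond to scikit-learn's `DecisionTreeClassifier` and `DecisionTreeRegressor` (scikit-learn assumed available; the four-point dataset is made up for illustration):

```python
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

X = [[0], [1], [2], [3]]

# Categorical target -> classification tree
clf = DecisionTreeClassifier(random_state=0).fit(X, ["no", "no", "yes", "yes"])
pred_class = clf.predict([[2.5]])[0]  # "yes": a class label

# Continuous target -> regression tree
reg = DecisionTreeRegressor(random_state=0).fit(X, [1.0, 1.5, 3.0, 3.5])
pred_value = reg.predict([[2.5]])[0]  # 3.0: a numeric value
```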
### Attribute Selection Measures

• 1) Information Gain (Entropy and Gain)
• 2) Gini Index
• 3) Chi-square
• 4) Variance Reduction
### Information Gain

• Information gain I(p, n) = -p/(p+n) * log2(p/(p+n)) - n/(p+n) * log2(n/(p+n))
• Entropy E(A) = ∑ ((pi + ni) / (p+n)) * I(pi, ni), summed over the values of attribute A
• Gain(A) = I(p, n) - E(A)
### Dataset

• Number of Yes = p = 9
• Number of No = n = 5
• I(p, n) = -p/(p+n) * log2(p/(p+n)) - n/(p+n) * log2(n/(p+n))
• = -9/14 * log2(9/14) - 5/14 * log2(5/14)
• = -9/14 * log2(0.643) - 5/14 * log2(0.357)
• = -9/14 * (-0.637) - 5/14 * (-1.485)
• I(9, 5) = 0.940
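The I(9, 5) arithmetic above is easy to verify in Python (standard library only; the `if c` guard treats 0 * log2(0) as 0):

```python
from math import log2

def info(p, n):
    """I(p, n) = -p/(p+n)*log2(p/(p+n)) - n/(p+n)*log2(n/(p+n))."""
    total = p + n
    return sum(-c / total * log2(c / total) for c in (p, n) if c)

print(round(info(9, 5), 3))  # 0.94
```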

### Entropy

• Information value for the Outlook = Sunny branch (2 Yes, 3 No):
• I(2, 3) = -2/5 * log2(2/5) - 3/5 * log2(3/5) = 0.970
• Rain (3 Yes, 2 No) likewise gives 0.970; Overcast (4 Yes, 0 No) is pure, so I(4, 0) = 0.
• Entropy(Outlook) = 5/14 * 0.970 + 4/14 * 0 + 5/14 * 0.970 = 0.692
• Gain(Outlook) = 0.940 - 0.692 = 0.248
### Similarly, calculate for the other attributes

• Gain(Outlook) = 0.248
• Gain(Temperature) = 0.029
• Gain(Humidity) = 0.151
• Gain(Wind) = 0.048
• Outlook has the highest gain, so it becomes the root node.
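The Outlook numbers above can be checked in a few lines of Python; the per-value (Yes, No) counts come from the classic 14-row play-tennis data, and the small differences from the figures above come from keeping full precision in the logs rather than rounding intermediate values:

```python
from math import log2

def info(p, n):
    """I(p, n): information value of a node with p positives and n negatives."""
    total = p + n
    return sum(-c / total * log2(c / total) for c in (p, n) if c)

# (Yes, No) counts for each Outlook value in the play-tennis data
outlook = {"sunny": (2, 3), "overcast": (4, 0), "rain": (3, 2)}

total = sum(p + n for p, n in outlook.values())  # 14 rows
entropy = sum((p + n) / total * info(p, n) for p, n in outlook.values())
gain = info(9, 5) - entropy

print(round(entropy, 3), round(gain, 3))  # 0.694 0.247 at full precision
```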
### Branching from the root

• With respect to the Sunny branch, compute the information value, entropy and gain for Temperature, Humidity and Wind.

• Do the same for the Overcast and Rain branches.
### Gini Index

• It works with a categorical target variable ("Success" or "Failure").
• It performs only binary splits.
• In this p^2 + q^2 form, the higher the Gini value, the higher the homogeneity (purity) of the node.
• CART (Classification and Regression Trees) uses the Gini method to create binary splits.
### Steps to calculate the Gini Index

• 1) Calculate Gini for each sub-node as the sum of the squared class probabilities for success and failure: p^2 + q^2.
• 2) Calculate Gini for the split as the weighted Gini score of each node of that split.
• Case study: a class of 30 students from Class IX and Class X; 10 are female and 20 are male. The target is whether a student will continue to higher studies.

### Split on Gender:

• Gini for sub-node Female = (0.2)*(0.2) + (0.8)*(0.8) = 0.68
• Gini for sub-node Male = (0.65)*(0.65) + (0.35)*(0.35) = 0.55
• Weighted Gini for the Gender split = (10/30)*0.68 + (20/30)*0.55 = 0.59

### Similarly, for the split on Class:

• Gini for sub-node Class IX = (0.43)*(0.43) + (0.57)*(0.57) = 0.51
• Gini for sub-node Class X = (0.56)*(0.56) + (0.44)*(0.44) = 0.51
• Weighted Gini for the Class split = (14/30)*0.51 + (16/30)*0.51 = 0.51
• The Gini score for the split on Gender (0.59) is higher than for the split on Class (0.51), so the node split takes place on Gender.
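Both weighted Gini scores can be reproduced with plain Python, using the counts from the case study (2 of 10 females and 13 of 20 males continue; 6 of 14 in Class IX and 9 of 16 in Class X):

```python
def gini(p, q):
    """Gini score for a node, in the p^2 + q^2 (purity) form used above."""
    return p * p + q * q

# Split on Gender
female = gini(2 / 10, 8 / 10)   # 0.68
male = gini(13 / 20, 7 / 20)    # 0.545, rounded to 0.55 above
gender_split = 10 / 30 * female + 20 / 30 * male

# Split on Class
class_ix = gini(6 / 14, 8 / 14)
class_x = gini(9 / 16, 7 / 16)
class_split = 14 / 30 * class_ix + 16 / 30 * class_x

print(round(gender_split, 2), round(class_split, 2))  # 0.59 0.51
```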

### Chi-square

• An algorithm to find the statistical significance of the differences between sub-nodes and the parent node.
• It is the sum of squares of the standardized differences between the observed and expected frequencies of the target variable.
• It works with a categorical target variable ("Success" or "Failure").
• It can perform two or more splits.
• The higher the Chi-square value, the higher the statistical significance of the differences between sub-node and parent node.
• The Chi-square of each node is calculated as:
• Chi-square = sqrt((Actual - Expected)^2 / Expected)
• It generates a tree called CHAID (Chi-square Automatic Interaction Detector).
### Steps to calculate Chi-square

• Calculate the expected value for "Continue" in each node.
• Calculate the expected value for "Not Continue" in each node.
• Calculate Chi-square = sqrt((Actual - Expected)^2 / Expected) for each cell and sum over the cells.
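As a worked example, the steps above can be applied to the Gender split of the same 30-student case study (assuming, as in the variance example later in these notes, that 15 of the 30 students continue overall, so each node's expected "Continue" rate is 0.5):

```python
from math import sqrt

def cell_chi(actual, expected):
    """Per-cell value from the notes: sqrt((Actual - Expected)^2 / Expected)."""
    return sqrt((actual - expected) ** 2 / expected)

# Gender split: (node size, actual number who continue)
nodes = {"Female": (10, 2), "Male": (20, 13)}

total_chi = 0.0
for size, continue_actual in nodes.values():
    expected = size * 0.5                                    # expected Continue count
    total_chi += cell_chi(continue_actual, expected)         # Continue cell
    total_chi += cell_chi(size - continue_actual, expected)  # Not-Continue cell

print(round(total_chi, 2))  # 4.58 for the Gender split
```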

### Variance Reduction Technique:

• Reduction in variance is used for continuous target variables, i.e. regression problems. It uses the standard variance formula to choose the best split: the split with the lower weighted variance is selected.

### Steps to calculate variance

• Calculate variance for each node.
• Calculate variance for each split as weighted average of each node variance.
• Example:

• Assign the numerical value 1 for "continue" and 0 for "not continue" with higher studies. The following steps identify the right split:

### Calculation of variance and variance reduction

• Parent node (15 ones and 15 zeros): Mean = (15*1 + 15*0)/30 = 0.5
• Variance= (15*(1-0.5)^2+15*(0-0.5)^2) / 30 = 0.25
• Mean of Female node = (2*1+8*0)/10=0.2 and Variance = (2*(1-0.2)^2+8*(0-0.2)^2) / 10 = 0.16
• Mean of Male Node = (13*1+7*0)/20=0.65 and Variance = (13*(1-0.65)^2+7*(0-0.65)^2) / 20 = 0.23
• Variance for Split Gender = Weighted Variance of Sub-nodes = (10/30)*0.16 + (20/30) *0.23 = 0.21
• Mean of Class IX node =  (6*1+8*0)/14=0.43 and Variance = (6*(1-0.43)^2+8*(0-0.43)^2) / 14= 0.24
• Mean of Class X node =  (9*1+7*0)/16=0.56 and Variance = (9*(1-0.56)^2+7*(0-0.56)^2) / 16 = 0.25
• Variance for Split Class = (14/30)*0.24 + (16/30)*0.25 = 0.25
• The Gender split has the lower variance (0.21 < 0.25), so the node split takes place on Gender.
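The whole calculation fits in a few lines of Python:

```python
def node_variance(ones, zeros):
    """Variance of a node whose target is coded 1 (continue) / 0 (not)."""
    n = ones + zeros
    mean = ones / n
    return (ones * (1 - mean) ** 2 + zeros * (0 - mean) ** 2) / n

# Gender split: Female 2 continue / 8 not, Male 13 continue / 7 not
gender_var = 10 / 30 * node_variance(2, 8) + 20 / 30 * node_variance(13, 7)

# Class split: Class IX 6 / 8, Class X 9 / 7
class_var = 14 / 30 * node_variance(6, 8) + 16 / 30 * node_variance(9, 7)

print(gender_var < class_var)  # True: the Gender split has lower variance
```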
### Key parameters of tree modelling and how to avoid overfitting

• 1) Setting constraints on tree size
• 2) Tree pruning

### Setting constraints on tree size

• Maximum depth: caps how many levels the tree may grow.
• Maximum number of terminal nodes: caps the number of leaves.
• Maximum features to consider for a split: limits the candidate features examined at each node.
• Minimum samples for a node split: prevents the tree from learning relationships specific to a very small sample (overfitting).
• Minimum samples for a terminal node: keeps leaves from being built on a handful of points.
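In scikit-learn (assumed available), each of these constraints is a constructor parameter of `DecisionTreeClassifier`; a sketch on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, random_state=0)

tree = DecisionTreeClassifier(
    max_depth=4,           # maximum depth
    max_leaf_nodes=10,     # maximum number of terminal nodes
    max_features=5,        # maximum features considered per split
    min_samples_split=20,  # minimum samples a node needs before splitting
    min_samples_leaf=5,    # minimum samples in a terminal node
    random_state=0,
).fit(X, y)

print(tree.get_depth() <= 4, tree.get_n_leaves() <= 10)  # True True
```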
### Tree Pruning

• Many branches reflect anomalies in the training data caused by noise and outliers.
• First grow the decision tree to a large depth.
• Then start at the bottom and remove leaves that give negative returns (i.e. hurt held-out performance) compared with keeping them.
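scikit-learn implements this bottom-up idea as minimal cost-complexity pruning via the `ccp_alpha` parameter; a sketch on noisy synthetic data (the specific alpha value is an arbitrary choice for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# flip_y injects label noise, which an unpruned tree will memorize
X, y = make_classification(n_samples=300, flip_y=0.2, random_state=0)

full = DecisionTreeClassifier(random_state=0).fit(X, y)
pruned = DecisionTreeClassifier(ccp_alpha=0.01, random_state=0).fit(X, y)

# Pruning collapses leaves whose improvement doesn't justify their complexity
print(pruned.get_n_leaves() < full.get_n_leaves())  # True
```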

### Advantages of Decision Trees

• Simple to understand
• Useful in data exploration
• Less data cleaning required
• Data type is not a constraint
• Non-parametric method

### Disadvantages of Decision Trees

• Overfitting.
• Not ideal for continuous variables: a decision tree loses information when it bins a continuous variable into categories.
### What are ensemble methods in tree-based modelling?

• Ensemble methods combine a group of predictive models to achieve better accuracy and model stability.
• They balance bias and variance.
### Bagging

• A technique to reduce the variance of our predictions by combining the results of multiple classifiers built on different sub-samples of the same data set.
• Steps
1. Create Multiple Datasets
2. Build Multiple Classifiers
3. Combine Classifiers
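The three steps map onto scikit-learn's `BaggingClassifier`; a minimal sketch on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, random_state=0)

# bootstrap=True resamples the data set (step 1), one tree is built per
# sample (step 2), and predict() combines them by majority vote (step 3)
bag = BaggingClassifier(
    DecisionTreeClassifier(),
    n_estimators=25,
    bootstrap=True,
    random_state=0,
).fit(X, y)

print(len(bag.estimators_))  # 25 independently trained trees
```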

### Random Forest

• We grow multiple trees, as opposed to the single tree of the CART model.
• Each tree gives a classification, and we say the tree "votes" for that class. The forest chooses the class with the most votes over all the trees in the forest; for regression, it takes the average of the outputs of the different trees.
### Advantages of Random Forest

• It solves both classification and regression problems.
• It can handle large data sets with high dimensionality.
• It can identify significant variables through variable importance.
• It has methods for balancing errors in data sets where classes are imbalanced.
• It can also be used for estimating missing data.
• Out-of-bag samples give an error estimate without a separate test set.
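A sketch of two of these features in scikit-learn (variable importance and the out-of-bag score), on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, n_informative=5, random_state=0)

forest = RandomForestClassifier(
    n_estimators=100,
    oob_score=True,  # score each tree on the samples it never saw
    random_state=0,
).fit(X, y)

print(round(forest.oob_score_, 2))           # out-of-bag accuracy estimate
print(forest.feature_importances_.argmax())  # index of the top variable
```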

### Disadvantages of Random Forest

• It may overfit, particularly with noisy data sets.
• You have very little control over what the model does.
### Boosting

• Boosting refers to a family of algorithms that convert weak learners into strong learners.
• Each tree complements the previous ones.

Keep track of the errors of the previous trees.

Key parameters (as named in R's gbm package):

• n.trees – the number of iterations, i.e. how many trees are grown.
• interaction.depth – the complexity of each tree, i.e. the total number of splits it performs (starting from a single node).
• shrinkage – the learning rate (called learning_rate in Python libraries).
• n.minobsinnode – the minimum number of training samples required in a node to perform a split.
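A rough scikit-learn equivalent, with each gbm parameter mapped to its counterpart (the mapping of interaction.depth to max_depth is approximate, since the two packages limit tree complexity differently):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=300, random_state=0)

gbm = GradientBoostingClassifier(
    n_estimators=100,     # n.trees
    max_depth=3,          # roughly interaction.depth
    learning_rate=0.1,    # shrinkage
    min_samples_leaf=10,  # close to n.minobsinnode
    random_state=0,
).fit(X, y)

print(len(gbm.estimators_))  # 100 boosting stages, each correcting the last
```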
• ### Difference between Bagging and Boosting

• In boosting, each model is built on top of the previous ones, whereas in bagging each model is built independently.
• The final boosting ensemble uses a weighted majority vote, while bagging uses a simple majority vote.
• Bagging is a method of reducing variance, while boosting can reduce both the variance and the bias of the base classifier.
### Quiz

• What is supervised learning?
• What is unsupervised learning?
• What are the types of decision tree?
• What are the types of attribute selection measures?