Decision tree models are simple but powerful analytical models used in a variety of analytic solutions: segmentation, regression, and classification. Fig. 1 illustrates the nomenclature of a simple decision tree.
Decision trees have quite a few advantages over other modeling algorithms:
We covered an introduction to decision trees in an earlier post; this post goes into detail on the mechanics behind the construction of a decision tree.
A typical decision tree (Fig. 1) starts with a root node containing the whole of the data/observations, or the population of interest (viz. the training population). At each depth level, the tree splits into 2 or more branches. Since a 3-branch split can always be represented by two 2-branch splits (Fig. 2), we can treat all decision trees as splitting into only 2 branches at each level without loss of generality. Such trees are called binary decision trees.
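The equivalence between one 3-branch split and two 2-branch splits can be checked with a small sketch. The thresholds `a` and `b` and the branch labels "low"/"mid"/"high" are hypothetical, chosen only for illustration; they do not come from Fig. 2.

```python
import bisect

def three_way(x, a, b):
    """One 3-branch split: pick a branch directly from the two cut points."""
    # bisect_right counts how many cut points are <= x (0, 1 or 2)
    return ("low", "mid", "high")[bisect.bisect_right([a, b], x)]

def two_binary(x, a, b):
    """The same partition expressed as two nested 2-branch splits."""
    if x < a:            # first binary split
        return "low"
    if x < b:            # second binary split, applied to the remaining branch
        return "mid"
    return "high"

# Both versions put every value in the same branch
print(all(three_way(x, 10, 20) == two_binary(x, 10, 20) for x in range(0, 31)))
```

The binary version is one level deeper, but it produces the same groups, which is why restricting attention to binary trees loses nothing.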
For each split, there are two decisions to be made: which variable to split on, and at what value. This is decided by computing the impurity of the tree and selecting a split that decreases the overall impurity. If the target variable of interest (the criterion for segmentation, classification, or regression) is a categorical class (e.g. fraud or not-fraud), then three common methods exist for computing the “impurity” of each leaf node of the tree:
1. Misclassification rate → one minus the proportion of the majority class: $1 - \max_j p_j$
2. Information Value or Entropy → negative of the sum, over all classes, of the product of the proportion of each class with the natural log of that proportion: $-\sum_j p_j \ln p_j$
3. Gini Index → one minus the sum of squares of the proportions of all classes: $1 - \sum_j p_j^2$
where $p_j$ is the proportion of observations belonging to category (class) j at each node. The impurity of the tree is the sum of the impurities of all leaf nodes, each weighted by the proportion of overall observations falling into that leaf node. Fig. 3 shows how impurity varies as the proportions of class 1 and class 2 vary in a two-class problem. As the shares of the two classes approach equality (the midpoint, where the share of the first class is 50% and hence the share of the second is also 100%-50%=50%), impurity reaches its highest value, as one would expect. When a node has only observations belonging to a single class, impurity is at its lowest value of 0, since no further segmentation is required. While the different methods of impurity computation give different numeric values, their trends are similar, and that is what matters in deciding when and what to split.
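The three measures and the weighted tree impurity described above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation; the function names are my own, and the convention that a class with zero proportion contributes nothing to entropy is an assumption stated in the code.

```python
import math

def misclassification(props):
    """1 - max_j p_j: share of observations outside the majority class."""
    return 1.0 - max(props)

def entropy(props):
    """-sum_j p_j ln p_j, assuming a class with p_j = 0 contributes 0."""
    return sum(-p * math.log(p) for p in props if p > 0)

def gini(props):
    """1 - sum_j p_j^2."""
    return 1.0 - sum(p * p for p in props)

def tree_impurity(leaves, measure):
    """leaves: list of (share_of_observations, class_proportions) pairs."""
    return sum(share * measure(props) for share, props in leaves)

# Two-class problem: impurity as the share of class 1 moves from 0 to 1
for p in (0.0, 0.25, 0.5, 0.75, 1.0):
    props = [p, 1.0 - p]
    print(p, round(misclassification(props), 3),
          round(entropy(props), 3), round(gini(props), 3))
```

All three measures are 0 for a pure node and peak at the 50/50 point, which is the trend that Fig. 3 describes.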
Lastly, consider an example of how we may go about selecting the split point based on impurity computation. Fig. 4 shows an example where two possible splits exist; which one is better? If we use the Gini Index to compute impurity, then at level 0 there is only one leaf, and its Gini Index is 0.5, which is also the impurity of the whole tree.
After the first split, the left split (first option) has two leaves with impurity values of 0.46 and 0.32, which hold 69% and 31% of total observations respectively. Weighting each leaf's impurity by its share of observations gives the total impurity of the tree as 0.69×0.46 + 0.31×0.32 ≈ 0.32 + 0.10 = 0.42. The right split (second option) gives a total impurity of 0.20 + 0.00 = 0.20. Since impurity is lower with the right split than with the left split, the right split is preferred. You can try the Misclassification rate and Entropy metrics and see if your conclusion changes.
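The arithmetic of this comparison can be reproduced as follows. The sketch uses only the numbers quoted in the text; for the right split the text gives just the two weighted contributions (0.20 and 0.00), so those are used directly rather than reconstructed from Fig. 4.

```python
root_impurity = 0.5  # Gini Index of the single leaf at level 0

# Left split: (share of observations, Gini Index) for each leaf
left_leaves = [(0.69, 0.46), (0.31, 0.32)]
left_total = sum(share * g for share, g in left_leaves)

# Right split: weighted leaf contributions as quoted in the text
right_total = 0.20 + 0.00

print(f"left split:  impurity {left_total:.2f}, decrease {root_impurity - left_total:.2f}")
print(f"right split: impurity {right_total:.2f}, decrease {root_impurity - right_total:.2f}")
print("preferred:", "right" if right_total < left_total else "left")
```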
This is a very simple example. In practice, the algorithm iterates over all variables and all possible split points, repeats this exercise for each, and then selects the variable and split point that reduce impurity by the maximum amount. By the way, how do you think we can calculate impurity if our target variable is continuous?
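A brute-force version of that search might look like the sketch below. It assumes numeric variables, uses the Gini Index, and tries the midpoints between consecutive distinct values as candidate split points; these choices, the function names, and the toy data are illustrative assumptions, and real implementations are considerably more efficient.

```python
from collections import Counter

def gini(labels):
    """Gini Index of a node, computed from the class labels it contains."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def best_split(rows, labels):
    """Return (impurity_decrease, variable_index, threshold) for the best
    binary split, or None if no variable has two distinct values."""
    n = len(labels)
    parent = gini(labels)
    best = None
    for j in range(len(rows[0])):                      # every variable
        values = sorted(set(row[j] for row in rows))
        for lo, hi in zip(values, values[1:]):         # every candidate point
            threshold = (lo + hi) / 2
            left = [y for row, y in zip(rows, labels) if row[j] <= threshold]
            right = [y for row, y in zip(rows, labels) if row[j] > threshold]
            # Impurity after the split: leaves weighted by share of observations
            weighted = len(left) / n * gini(left) + len(right) / n * gini(right)
            decrease = parent - weighted
            if best is None or decrease > best[0]:
                best = (decrease, j, threshold)
    return best

# Hypothetical toy data: two variables, two classes
rows = [(1, 10), (2, 20), (3, 10), (4, 20)]
labels = ["a", "a", "b", "b"]
print(best_split(rows, labels))
```

On this toy data the first variable separates the classes perfectly at 2.5, while the second variable gives no decrease at all, so the search picks the first.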
In the next post, we will cover questions about when to stop splitting and growing your trees, how to use a tree to make predictions, and how to handle the disadvantages of decision trees.
* Decision boundaries are boundaries in high-dimensional space that separate observations into different groups. For segmentation, for example, the groups can be High/Medium/Low value shoppers. For classification problems, the groups can be customers who default and customers who don’t. And so on.
** An example of variable interaction: a linear regression has a single weight for a single variable; say, for a 1 square foot increase in floor area, the price of a flat goes up by 3000/-. While this is true for flats in a single building, it is not so true for flats in different localities. This case cannot be handled by linear models (viz. linear or logistic regression).
2017 © Edupristine. ALL Rights Reserved.