course bg
EduPristine>Blog>Decision Trees – Why Are They Useful and How To Create Them

Decision Trees – Why Are They Useful and How To Create Them

June 1, 2015

Decision Tree models are simple but powerful analytical models used in variety of analytic solutions: segmentation, regression, and classification. Fig. 1 illustrates nomenclature of a simple decision tree.

Figure 1

nomenclature of simple decision tree

Decision trees have quite a few advantages over other modeling algorithms:

  1. Really easy to understand and visualize – You don’t need to be analytic modeler or data scientist to understand what decision trees are trying to say. Even your grandmother can also follow tree structure as they provide clearly labeled path from original query to final resolution. This helps when explaining and getting buy-in from clients who are usually non-analytics professional from marketing, sales, or executive functions.
  2. Easy to implement – Since they can be easily represented by series of IF-THEN-ELSE statements which are basic building block of any tool or scripting language, they are really quick and easy to implement. Thus client’s operation and IT department doesn’t need complex mathematical models or specialized data processing tools to get the analytic solution running live.
  3. Handle almost any type of data – Continuous, cardinal or ordinal data all can be easily handled by tree, and pre-processing, scaling, or normalization is required.
  4. Require almost no data preparation – Decision trees don’t require complex missing value imputation, blank value handling, variable binning, and dummy value creation. They don’t need input variables to uncorrelated either, nor do they expect prior selection of features before model building.
  5. Can represent complex decision boundaries* – Because of their handling of multiple variables at various depth levels, they represent models far more complex than what simple logistic regression can do. Since each branch of tree is essentially independent of other branch, a decision tree can be thought of collection of separate models each working within sub-set of training observations. This provides them ability to handle interaction** between variables without having to explicitly design and create interaction variables.
  6. Robust with respect to outliers – Since decision trees essentially segment training population into multiple sub-segments, they are not influenced by few outlier cases.
  7. Efficient model scoring – Because of simple implementation and uniquely defined single path from root node to leaf node, model scoring is very efficient with decision trees. And while model development is usually offline, scoring usually happens in real-time and time-saving there really add value.

We have covered introduction of decision trees in earlier post, and this post will go into detail on mechanics behind construction of decision tree.

A typical decision (Fig. 1) tree starts with a root note containing whole of data/observations or population of interest (viz. training population). At each depth level, tree splits into 2 or more branches. Since a 3-branch split can always be represented by two 2-branch splits (Fig. 2), we can treat all decision trees splitting into only 2 branches at each level without loss of generality. Such trees are called binary decision trees.

Figure 2

binary decision trees

For each splitting, there are two decisions to be made: which variable is to be split, and [2] at what value. This is decided by computing impurity of tree, and selecting a split which will decrease overall impurity. If target variable of interest – which is criteria for segmentation, classification, or regression – is categorical class (viz. Fraud or not-fraud) then three common methods exists for computing “impurity” of each leaf node of the tree:

1. Misclassification rate ’ one minus sum of proportion of all classes, except majority class

2. Information Value or Entropy ’ sum of product of proportion of each class with natural log of that proportion, over all classes

3. Gini Index ’ one minus sum of square of proportion of all classes

where pj is proportion of observations belonging category (class) j at each node. Impurity of the tree is sum of impurity of all leaf nodes, weighted by proportion of overall observations falling into that leaf node. Fig. 3 shows how impurity varies as proportion of class 1 and class 2 varies in two-class problem. As the share of both class approach equal – mid point where share of first class is 50% and hence share of second is also 100%-50%=50% – impurity reaches highest value, as one would expect. When node has only observations belonging to single class, then impurity is lowest at 0 since there is no further segmentation required. While different methods of impurity computation give different numeric values, their trend is similar, and which is what matters in deciding when/what to split.

Figure 3

variation in impurity in decision trees

Lastly, consider an example how we may go about selecting point of splitting based on impurity computation. Fig. 4 shows example where two possible splits may exist – which one is better? If we use Gini Index for computing impurity then at level=0, there is only one leaf, and its Gini Index is 0.5, which is also impurity of whole tree.

At level 2, left split (first option) has two leaves with impurity values of 0.46 and 0.32 respectively, which handle 69% and 31% of total observations respectively. This gives total impurity of tree as 0.32+0.10=0.42. Right split (second option) gives total impurity as 0.20+0.00=0.20. Since impurity is lower with right split tree compared to left tree, right split is preferred. You can try with Misclassification rate and Entropy metrics and see if your conclusion changes.

Figure 4

variation in impurity in decision trees

This is very simple example. In practice, algorithm will iterate over all variables at all possible split points and do this exercise, and then select the variable and the split point which reduces impurity by maximum amount. By the way, how do you think we can calculate impurity if our target variable is continuous?

Innext post, we will cover questions about when to stop splitting and growing your trees, how to use tree to make predictions, and how to handle disadvantages of decision trees.

* Decision boundaries are boundaries in high dimension space which separate observations into different groups. For segmentation, for example, groups can be High/Medium/Low value shoppers. For classification problems, for example, groups can be customer who default and who don’t. And so on.

** An example of variable interaction: A linear regression has single weight for single variable, say for 1 square foot increase in floor area, price of flat goes up by 3000/-. While this is true for flats in single building, this is not so true for flats in different localities. This case cannot be handled by linear models (viz. linear or logistic regression)

About Author

author pic EduPristine

Trusted by Fortune 500 Companies and 10,000 Students from 40+ countries across the globe, it is one of the leading International Training providers for Finance Certifications like FRM®, CFA®, PRM®, Business Analytics, HR Analytics, Financial Modeling, and Operational Risk Modeling. EduPristine has conducted more than 500,000 man-hours of quality training in finance.


Interested in this topic?

Our counsellors will get in touch with you with more information about this topic.

`````````````````````````````````````````````````` Post ID = 76524