In previous posts, we discussed the construction and scoring of Decision Tree models. We also noted that despite their many advantages, they are prone to overfitting and sampling errors. In this post in the series, we will talk about one of the state-of-the-art methods of overcoming some of the limitations of Decision Trees.
But first, a small probability puzzle.
Let’s say that you are a police inspector trying to solve a murder mystery. You have a crude lie-detector which, as is the case with most lie detectors, is not fool-proof. In fact, it’s barely better than random, with an accuracy of 65%. Fig. 1 shows how likely you are to make the right decision if you rely on this tool. You want a better tool, but you cannot afford one.
Thankfully, a newfangled business-analyst friend of yours suggests that you test the suspect repeatedly, and then make a decision based on the majority of results. So if the lie-detector indicates that the suspect is lying in two or more out of three tests, then you can have more confidence in the result. This sounds doable, and you happily run the tests and almost feel you’ve solved the case, but then your analyst friend calls and tells you that he made a mistake! (Can you figure out what?)
He says that his conclusion is only valid if each test of the lie-detector is independent of the others, which may not be the case since it is the same equipment. You feel dejected, but realize that you can easily borrow other lie-detectors from nearby police stations. While each is as good (or bad) as yours, at least they will give independent assessments. You ask your friend why combining many bad tools should make a good tool, and how good? (What do you think?)
Your friend quickly runs some numbers and presents you with Fig. 2, showing the accuracy rate if you use 3 tests on 3 lie-detectors.
While the gain seems marginal, from 65% to about 72%, you quickly realize the possibilities!
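The 72% figure is easy to verify: with independent tests, each correct with probability 0.65, the majority vote is right whenever more than half of the tests are right. A quick sketch (assuming independence, as the analyst insisted):

```python
from math import comb

def majority_accuracy(p, n):
    """Probability that a majority of n independent tests
    (each correct with probability p) gives the right answer."""
    # Sum binomial probabilities over all outcomes where more
    # than half of the n tests are correct.
    return sum(comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(n // 2 + 1, n + 1))

print(majority_accuracy(0.65, 1))    # single detector: 0.65
print(majority_accuracy(0.65, 3))    # three detectors: ~0.718
print(majority_accuracy(0.65, 101))  # a whole "forest": above 0.99
```

The gain per extra detector is small at first, but it compounds: with enough independent 65% detectors, the majority vote becomes almost certain.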
That, coming back to Decision Trees, is in a nutshell a forest of Decision Trees. These types of models are called ensemble models or bagging models because they combine many not-so-good models to make one very good model. A random forest model is a combination of hundreds of Decision Trees, each imperfect in its own way, probably overfitted, probably prone to sampling error, and yet collectively improving overall accuracy significantly. While a forest is made of Decision Trees, you can see that the concept can apply to any other type of model too. You may have an ensemble of linear regression models, or neural network models, or what not.
Since we’d prefer each model to be different from, and independent of, the other models in the ensemble, we can achieve this in two ways:
1. Different training data for each model - In practice, the whole development data is first split into train and test sets, and then multiple small random samples are drawn from within the train set. One of the common ways is to sample N observations from a train set of N observations with replacement, which effectively picks only about 63% of the unique observations, with some observations appearing more than once. This is called bootstrap sampling, and it is the “b” in bagging (bootstrap aggregating); observations drawn more than once effectively get extra weight. If N is sufficiently large, a smaller sub-sample can instead be selected without replacement, so the bootstrap itself is not crucial to ensemble models.
2. Different feature set for each model - Apart from sampling observations, another way to introduce variety among the models is to limit the set of feature variables available to each model. Only a small random sample of all variables is made available per model.
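Both ideas can be sketched in a few lines of plain Python (the feature names here are illustrative): sampling N out of N with replacement leaves roughly 63% of the observations unique, and each tree also sees only a random square-root-sized subset of the features.

```python
import math
import random

random.seed(42)

N = 10_000                                 # observations in the train set
features = [f"f{i}" for i in range(25)]    # 25 hypothetical feature names

# 1. Bootstrap sample: N draws with replacement.
sample = [random.randrange(N) for _ in range(N)]
unique_frac = len(set(sample)) / N
print(f"unique observations: {unique_frac:.1%}")   # ~63% (i.e. 1 - 1/e)

# 2. Feature subsample: square-root rule of thumb.
k = round(math.sqrt(len(features)))
tree_features = random.sample(features, k)         # 5 of the 25 features
print(f"features for this tree: {tree_features}")
```

In a real forest, each tree would get its own fresh bootstrap sample and feature subset.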
Since we rely on the majority vote of the ensemble of trees to provide a robust modeling output, we don’t have to worry about the overfitting of each individual model. In fact, overfitting is a desired property of models in the ensemble, since the belief is that the combination will take care of generalization. This has been proven mathematically as well (by Leo Breiman), though the proof is beyond the scope of this blog post.
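The intuition behind “the combination takes care of generalization” is variance reduction: each overfitted model makes large but largely independent errors, and averaging shrinks the spread. A toy sketch with synthetic numbers (not a real forest, just the statistical effect):

```python
import random
import statistics

random.seed(7)
TRUTH = 10.0

def overfit_model():
    # Each "model" is unbiased but very noisy, like a deep,
    # fully grown tree fitted to a small bootstrap sample.
    return TRUTH + random.gauss(0, 3.0)

def ensemble(n_models):
    # Average the predictions of n independent noisy models.
    return statistics.mean(overfit_model() for _ in range(n_models))

singles = [overfit_model() for _ in range(2000)]
ensembles = [ensemble(100) for _ in range(2000)]

print(f"single-model spread:   {statistics.stdev(singles):.2f}")    # ~3.0
print(f"100-model ensemble:    {statistics.stdev(ensembles):.2f}")  # ~0.3
```

With 100 independent models, the spread of the averaged prediction drops by roughly a factor of 10 (the square root of 100), while the bias stays unchanged.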
While the more trees in the forest, the better the model will be, we need to allow for practical run-time and scoring-time constraints as well. The following decisions have to be made when building an ensemble of Decision Trees:
1. Number of trees to build - depending on the accuracy desired and on hardware and runtime constraints, 100 to 1,000 is a typical number
2. When to stop growing each individual tree - while overfitting is not a concern, a stopping criterion is still required; a minimum number of observations in a leaf is usually a good one
3. Number of observations to sample for each tree - the default is the same as the number of observations in the train data
4. Whether to sample observations for each tree with or without replacement - the usual practice is with replacement
5. Number of features to sample for each tree - the rule of thumb is about the square root of the total number of features in the train data, but there is no theoretical best
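If you happen to use scikit-learn (not assumed by this post, so treat this as one possible mapping), each of the five decisions above corresponds to a constructor parameter of its random forest implementation. A configuration sketch:

```python
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(
    n_estimators=500,        # 1. number of trees (100-1,000 typical)
    min_samples_leaf=5,      # 2. stop splitting below this leaf size
    max_samples=None,        # 3. observations per tree (None = train-set size)
    bootstrap=True,          # 4. sample with replacement
    max_features="sqrt",     # 5. square-root-of-features rule of thumb
    random_state=42,         # for reproducibility
)
# model.fit(X_train, y_train)  # X_train, y_train are your own data
```

Other libraries expose the same knobs under different names, but the five decisions themselves are common to virtually every random forest implementation.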
Ensemble models are really good at handling many of the disadvantages of Decision Trees, while still retaining many of the advantages. However, nothing in life is free! Ensemble models lose out on key benefits of Decision Trees: interpretability and ease of implementation. Since the outcome is a vote among hundreds of trees, the clear IF-THEN-ELSE intuition is lost. It’s not even easy to determine which of the variables have high predictive power.
In the next and last post in this series, we will talk about a different kind of ensemble model. We will also briefly touch upon using Decision Trees for regression problems, which we’ve skipped so far.