June 17, 2015
Decision Tree models are simple and powerful analytical models used in variety of analytic solutions: segmentation, regression, and classification. In previous three posts in this series, we talked about advantages of Decision Tree models , method of construction and scoring of Decision Tree models , some of the limitations and ways to overcome them. One of the advanced methods to manage problem of sampling and overfitting errors in Decision Trees is ensemble models, which we discussed in previous post. In this last post in the series, we will discuss a different variant of ensemble models. Lastly, we will also talk about creating and scoring trees for regression problem.
Gradient Boosting models are another variant of ensemble models, different from Random Forest we discussed previously. In Random Forest (RF) models, goal is to build many-many overfitted models each on subset of training data, and combine their individual prediction to make final prediction. In Gradient Boosting models, goal is to build series of many-many underfitted models, each bettering errors of previous model, and cumulative prediction is used to make final prediction.
Fig. 1 illustrates the method schematically. Let’s say, you have complex decision boundary which is combination of two linear boundaries, in two-dimensional data space. That is your original training data, shown by red-dots. Blue lines are true model which we don’t know but are trying to discover. First, we sample the training data and build a simple model, say a linear regression model. Second figure shows sampled observations and corresponding linear model. Next we score full training data on this model, and compute error of prediction. Since this is simple model on sampled data, we expect high error between true target value and predicted target value. Next, we revise our training data to have residual as target for next time around. And then we construct another simple model on again sampled training data. And so on iteratively.
Prediction in Gradient Boosting models are made not be combined voting of all the trees, but cumulative predictions of all the trees. Each tree predicts a part of target, just like it was modeled to do so, during model development.
Few nuance before we close this topic:
1. Why shallow tree? Because each tree is building on residual errors of previous, underfitting (shallow) is desirable; else there will not much error to begin with.
2. How to define shallow? Very few leaf nodes, about 3-8 is generally used, depending on the context.
3. Why sampling of observations? Because we want variety in trees, letting each tree work with different dataset. Otherwise, successive shallow trees will not capture anything new than what previous shallow tree did, since both would have been working with same data. Hence, we don’t want them working on exactly same data.
4. How to sample? Usually 40%-50% of original number of records without replacement is considered thumb rule.
As always, last caveat: Above description of Gradient Boosting is directionally right, but not technically right. Residual (aka Gradient) computation at each round is not simple difference, but complex formulation using partial derivatives. Scoring is not simple sum of predictions of all trees but weighted function of learning rate parameter. For those interested, this paper (pdf) by Jerome H. Friedman will help.
Regression trees (where target variable is continuous) are similar to classification trees (where target variable is categorical) and collectively they are called Classification and Regression Trees (CART). Accordingly impurity definitions of Gini Index, Misclassification Errors, and Entropy of node aren’t applicable. Impurity in regression trees is computed as sum-of-squared-errors (SSE) for each observation against mean of average target for corresponding node.
Consider a toy example of dataset with five observations, with continuous target values of 1, 2, 4, 7 and 9. Fig. 2 and Fig. 3 show two possible splits. Sum-of-squared-errors is computed in both cases, and as lower SSE is desirable, or higher drop in SSE is desirable, Fig. 2 is better split. In practice, model will iterate over all predictor variables and all possible split points to identify the tree-split leading to lowest SSE. Note that there is no data type restriction on predictor variables in CART, and they can be both categorical and continuous.
Mathematically, SSE computation can represented as following equation, for each node i:
Now to scoring a Regression Tree: As evident from SSE computation, predicted value at each leaf node is mean target value, and hence prediction for unseen observation is simply mean target value of leaf node it falls into. In more complex cases, each leaf node may accommodate a separate model by itself, and predicted value may be outcome of that model.
In this four part series, we reviewed Machine Learning models called Decision Trees, and their extension Ensemble Models. Decision Trees are really simple and intuitive models capable of modeling complex predictor, however their limitation arises from their propensity to overfit. If sampling and overfitting are appropriately handled, they can be really useful tool for prediction and segmentation model. In fact, if you review solutions at Machine Learning competition site kaggle.com, you will almost always find winning solution involving complex ensemble of models!