Decision Tree models are powerful analytical models which are really easy to understand, visualize, implement, score; while at the same time requiring little data pre-processing. They are also adept at handling variable interaction and model convoluted decision boundary by piece-wise approximations. One would wonder why decision trees aren’t as common as, say, logistics regression. This relates to their method of development. In previous post we talked about how to grow the decision tree by selecting, at each level of depth, which variable to split, and at what split level. (By the way, go through the previous post, before continuing, if you have not already done so, so that you may follow the discussion here.) Computation of impurity of tree ensures that it is always advisable to split the node until all leaf nodes at pure node (of only one class if target variable is categorical) or single observation node (if target variable is continuous). As you can note, this looks like overfitting, which is one of cardinal sins in analytics and machine learning. Apart from overfitting, Decision Trees also suffer from following disadvantages:

1.  Tree structure prone to sampling – While Decision Trees are generally robust to outliers, due to their tendency to overfit, they are prone to sampling errors. If sampled training data is somewhat different than evaluation or scoring data, then Decision Trees tend not to produce great results.

2.  Tree splitting is locally greedy – At each level, tree looks for binary split such that impurity of tree is reduced by maximum amount. This is a greedy algorithm and achieves local optima. It may be possible, for example, to achieve less than maximum drop in impurity at current level, so as to achieve lowest possible impurity of final tree, but tree splitting algorithm cannot see far beyond the current level. This means that Decision Tree built is typically locally optimal and not globally optimal or best.

3.  Optimal decision tree is NP-complete problem – Because of number of feature variables, potential number of split points, and large depth of tree, total number of trees from same input dataset is unimaginably humongous. Thus, not only tree splitting is not global, computation of globally optimal tree is also practically impossible.

In this post will go about how to overcome some of these disadvantages in development of Decision Trees.

To avoid overfitting, Decision Trees are almost always stopped before they reach depth such that each leaf node only contains observations of one class or only one observation point. There are various approaches which can decide when to stop growing the tree.

1.  When the leaf node is pure node – If a leaf node happens to be pure node at any stage, then no further downstream tree is grown from that node. Tree can continue to be grown from other leaf nodes.

2.  When the leaf node has very few observations left – This ensures that we terminate the tree when reliability of further splitting the node becomes suspect due to small sample size. Central Limit Theorem tells us that when observations are mutually independent, then about 30 observations constitute large sample. This can become rough guide, though usually, this user input parameter should be higher than 30, say 50 or 100 or more, because we typically work with multi-dimensional observations and observations could be correlated.

3.  When decrease in impurity of tree is very small – This user input parameter leads to termination of tree when impurity drops by very small amount, say, 0.001 or lesser.

4.  When sufficient number of leaves are created – One method of culminating growth of tree is to achieve desired number of leaves – an user input parameter – and then stop. Personally, I find this to be not so good criteria simply because growth of tree is unbalanced and some branch would have nodes of very few observations while others of very large, when stopping condition is met. Also, while it is possible to decide what is small sample size or what is small change in impurity, it’s not usually possible to know what is reasonable number of leaves for given data and business context.

5.  When cross-validation impurity starts to increase – This is one of complex method, but likely to be more robust as it doesn’t required any assumption on user input. Training data is split into train and cross-validation data, in say 70%-30% proportion. Tree is grown on train data by computing impurity of tree and splitting the tree wherever decrease in impurity is observed. Similar tree is replicated on cross-validation data. Since we are growing tree on train data, its impurity will always decrease, by very definition of process. However, at some point, impurity of cross-validation tree will increase for same split. This is point where we can stop growing the tree since divergence in error (impurity) signals start of overfitting.

In next post, we will cover how to handle some of other disadvantages of Decision Tree. Before that, just a short note how to score a new observation given that a Decision Tree is already available.

There are two kinds of predictions possible for classification problem (where target is categorical class):

1.  Point Prediction – Where prediction is class of new observation

New observation belongs to majority class of training observations at the leaf node at which new observation falls into. This minimizes misclassification error of prediction.

2.  Probabilistic Prediction – Where prediction is probability of new observation belonging to each class*

Probability of new observation belonging to a class is equal to proportion (percent) of training observations of that class at the leaf node at which new observation falls into

Let’s say a terminal node into which our scoring observation falls into has 200 training observations of Class A, 250 of Class B, and 50 of Class C. Then, because Class B is majority (has maximum observations) in this node, point prediction of new observation will be Class B i.e. our model will predict Class B for that new observation. On the other hand, model will probabilistically predict that new observation belongs to Class A with 200/(200+250+50)=0.40 probability, belongs to Class B with 0.50 probability, and to Class C with 0.10. Depending on business application, one or other kind of prediction may be more suitable.

*For two-class problem (binary classification), this is commonly used “score” which is also output of logistic regression model