Multiple linear regression does not always work as well as we would like it to. We will run into problems if the underlying model is not linear, or if we have heteroscedasticity, clustering, or outliers.

In this blog, we will discuss techniques for addressing the first of the two most common problems that can skew the results of multiple linear regression even when the model is linear and homoscedastic, with no clustering or outliers. These problems are:

1. Multicollinearity

When some of the predictor variables are too closely correlated with one another, it becomes very difficult to sort out their separate effects. In this blog, we will examine how to identify and address multicollinearity.

2. Model specification issues

Because the various x variables interact with one another, regression results can change significantly based on which variables are included in the model. We must be careful in choosing which predictor variables to include in order to get the most useful results for analysis and prediction. In the next blog, we will examine techniques for assessing which variables should or should not be included in the model.

When Does Multicollinearity Occur?

Multicollinearity occurs when two or more of the predictor (x) variables are correlated with each other. When predictors move together, it is difficult for multiple regression to distinguish between their individual effects. This may affect your estimated coefficients in several ways.

1. High Standard Errors

The standard errors for the coefficients will be inflated, which may result in higher p-values for the hypothesis test of significance on the individual coefficients.

In extreme examples, the model may be highly significant overall, but none of the individual coefficients will pass the test of significance.
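We can see this inflation directly with a small simulation. The sketch below is illustrative (the synthetic data, seed, and 0.05 noise level are my own choices, not from the blog): it fits ordinary least squares by hand with numpy and compares the coefficient standard errors for an uncorrelated pair of predictors against a nearly collinear pair. It also computes the variance inflation factor (VIF), a standard diagnostic for identifying multicollinearity.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

def ols_standard_errors(X, y):
    """Fit OLS by hand and return the coefficient standard errors."""
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ X.T @ y
    resid = y - X @ beta
    sigma2 = resid @ resid / (n - X.shape[1])   # residual variance estimate
    return np.sqrt(sigma2 * np.diag(XtX_inv))

x1 = rng.normal(size=n)
x2_indep = rng.normal(size=n)                   # unrelated to x1
x2_coll = x1 + 0.05 * rng.normal(size=n)        # nearly collinear with x1

noise = rng.normal(size=n)
se_indep = ols_standard_errors(np.column_stack([x1, x2_indep]), 2*x1 + x2_indep + noise)
se_coll = ols_standard_errors(np.column_stack([x1, x2_coll]), 2*x1 + x2_coll + noise)

# Variance inflation factor: 1 / (1 - R^2) from regressing one predictor on the other
r = np.corrcoef(x1, x2_coll)[0, 1]
vif = 1 / (1 - r**2)

print("SEs with uncorrelated predictors:", se_indep)
print("SEs with collinear predictors:   ", se_coll)
print("VIF of the collinear pair:       ", vif)
```

A common rule of thumb flags a VIF above 5 or 10 as a sign of problematic multicollinearity; in this nearly collinear example the VIF is in the hundreds, and the standard errors are inflated by roughly the square root of the VIF.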

2. Incorrect Signs

One or more of the coefficient estimates may have a sign that is inconsistent with intuition. For example, a model may indicate that the less satisfied customers are with service levels, the more satisfied they are overall. This does not make logical sense and is a clue that satisfaction with customer service levels is likely correlated with another variable that is skewing its estimated impact.
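A hypothetical version of that customer satisfaction scenario can be simulated in a few lines. All the variable names and numbers below are invented for illustration: suppose service satisfaction genuinely helps overall satisfaction, but it is positively correlated with a price-concern variable that hurts it. Regressing on service alone makes its estimated coefficient come out negative.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 5000

# Hypothetical survey data: better service tends to come with higher prices
service = rng.normal(size=n)                        # satisfaction with service
price_concern = 0.8 * service + rng.normal(size=n)  # correlated with service

# True model: service helps overall satisfaction, price concerns hurt it
overall = 1.0 * service - 2.0 * price_concern + rng.normal(size=n)

def ols(X, y):
    return np.linalg.lstsq(X, y, rcond=None)[0]

# Leaving out the correlated price variable flips service's estimated sign
b_wrong = ols(service.reshape(-1, 1), overall)[0]
b_right = ols(np.column_stack([service, price_concern]), overall)[0]
print(f"service coefficient, price omitted:  {b_wrong:.2f}")  # negative
print(f"service coefficient, price included: {b_right:.2f}")  # near the true +1.0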

3. Instability

When predictor variables are correlated, the estimated coefficients can change wildly as variables are added to or dropped from the model. This is because multiple linear regression estimates the impact of a given predictor variable while controlling for (or holding constant) the other predictor variables in the model. In a model in which the various predictor variables are related to one another, this results in a very different interpretation than if you were looking at the impact of any variable alone.

This is shown by the omitted variable bias theorem. If a true model includes two related predictor variables (x1 and x2) and one of those variables (x2) is left out, the coefficient estimate of the remaining variable (x1) will be biased. This bias occurs because the model compensates for the missing x2 variable by either overestimating or underestimating the effect of x1. The direction of the bias will depend on the sign of the correlation between x1 and x2 and the sign of β2 (the regression coefficient of x2).
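The direction of that bias can be checked numerically. This is a minimal sketch under assumed parameters (true coefficient of x1 is 2, the x1–x2 slope is ±0.6, and β2 is ±1; none of these come from the blog): omitting x2 shifts the estimated x1 coefficient by β2 times the slope of x2 on x1, so the bias is upward when the correlation and β2 have the same sign, and downward when they differ.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 5000

def x1_coef_omitting_x2(corr_sign, beta2):
    """Fit y on x1 alone when the true model also contains a correlated x2."""
    x1 = rng.normal(size=n)
    x2 = corr_sign * 0.6 * x1 + rng.normal(size=n)  # x2 related to x1
    y = 2.0 * x1 + beta2 * x2 + rng.normal(size=n)  # true x1 coefficient is 2
    return np.linalg.lstsq(x1.reshape(-1, 1), y, rcond=None)[0][0]

# Bias = beta2 * (slope of x2 on x1); its sign is sign(corr) * sign(beta2)
b_pos = x1_coef_omitting_x2(+1, +1.0)   # ~2.6: biased upward
b_neg = x1_coef_omitting_x2(-1, +1.0)   # ~1.4: biased downward
b_negbeta = x1_coef_omitting_x2(+1, -1.0)  # ~1.4: biased downward
print(b_pos, b_neg, b_negbeta)
```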

While multicollinearity affects the stability and accuracy of coefficient estimates, it does not affect the results when the model is used for prediction purposes alone.

Let’s look at an example. Suppose we have three observations in which x2 = 2x1 (for example, x1 = 1, 2, 3 and x2 = 2, 4, 6), so the two variables are perfectly correlated. Suppose that the true regression model is y’ = 2x1 + x2. However, there are many other models that could fit this same data set exactly, such as:

• y’ = 4x1 (with no effect of x2)

• y’ = 2x2 (with no effect of x1)

• y’ = 6x1 – x2 (where x1 has a positive effect and x2 has a negative effect)

Each of these models gives the correct prediction of y for the observed data, but the interpretation of the individual coefficients is completely different.

What happens if we try to extrapolate beyond our data set by predicting y if both x1 and x2 equal one? The correct answer according to the true regression model is that y equals three, but the other models give estimates of four, two and five.
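The whole example fits in a few lines of numpy (the data values x1 = 1, 2, 3 are my illustrative choice; the blog only specifies that x2 = 2x1). Every candidate model reproduces y exactly on the observed data, yet the four models disagree once we extrapolate to a point where x2 = 2x1 no longer holds.

```python
import numpy as np

# Three observations with x2 = 2*x1 (perfect correlation)
x1 = np.array([1.0, 2.0, 3.0])
x2 = 2 * x1
y = 2 * x1 + x2                       # true model y' = 2*x1 + x2 -> [4, 8, 12]

# Coefficient pairs (b1, b2) for the four models discussed above
models = {
    "y' = 2*x1 + x2": (2, 1),
    "y' = 4*x1": (4, 0),
    "y' = 2*x2": (0, 2),
    "y' = 6*x1 - x2": (6, -1),
}

extrapolations = {}
for name, (b1, b2) in models.items():
    in_range = b1 * x1 + b2 * x2               # identical to y for every model
    extrapolations[name] = b1 * 1 + b2 * 1     # x1 = x2 = 1 violates x2 = 2*x1
    print(name, "fits the data:", np.allclose(in_range, y),
          "| predicts", extrapolations[name], "at x1 = x2 = 1")
```

All four models fit the observed data perfectly, but at x1 = x2 = 1 they predict 3, 4, 2, and 5 respectively, matching the numbers above.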

This extreme example shows that even with perfectly correlated variables, a model can still be used for prediction purposes as long as you do not extrapolate beyond the established range of data. However, clients are often interested not only in prediction, but also in interpretation and examination of the effects of individual variables.

When deciding which variables to include, we also need to consider the business significance of each variable, over and above any multicollinearity concerns.

This blog has given an overview of multicollinearity. As a best practice, present clients with a model that is not overly complicated by correlated predictor variables.