In my previous blog, “How to deal with Multicollinearity”, I discussed the theory: what multicollinearity is and what problems it causes in a statistical model. Multicollinearity is a problem because it can increase the variance of the regression coefficients, making them unstable and difficult to interpret. You cannot isolate the effect of one independent variable on the dependent variable when it is collinear with another independent variable. Hence, we should remove one of the collinear independent variables. This makes it easier to attribute the variation in the dependent variable to each remaining independent variable.
I received comments on the previous blog asking me to add practical steps and examples, so in this blog I will walk through steps, methods and examples.
When running a regression, it is always prudent to check the data for the possible impact of multicollinearity before drawing conclusions from it. There are several methods for doing this, which we will examine in the context of the following example.
A fitness goods manufacturer has created a new product and has done a market test of it in four select markets. For each store in which it was introduced, its sales were monitored over a six-month period. Several potential predictor variables for sales were identified, tested and measured including price, advertising, location in the store and total store volume.
Step 1: Review scatterplot and correlation matrices.
In the last blog, I mentioned that a scatterplot matrix can show the types of relationships between the x variables. If one of the individual scatterplots in the matrix shows a linear relationship between variables, this is an indication that those variables are exhibiting multicollinearity.
While a scatterplot matrix is a good visual approach, a more precise approach is to compute a correlation matrix. In the correlation matrix produced from the fitness goods manufacturer's data, we can see a fairly strong negative correlation (-0.74) between advertising and store volume. This is a strong sign of multicollinearity.
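To make the correlation check concrete, here is a minimal sketch in Python (used here instead of SAS purely for illustration). The numbers are made up and are not the manufacturer's data, the column names are my own, and the 0.7 cut-off is an assumed rule of thumb rather than a fixed standard.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def flag_correlated_pairs(columns, threshold=0.7):
    """Return (name_a, name_b, r) for every pair with |r| >= threshold."""
    flagged = []
    for a, b in combinations(columns, 2):
        r = pearson(columns[a], columns[b])
        if abs(r) >= threshold:
            flagged.append((a, b, round(r, 2)))
    return flagged


# Hypothetical predictor values, invented for this illustration only.
data = {
    "price": [5, 3, 6, 4, 6, 4],
    "advertising": [1, 2, 3, 4, 5, 6],
    "volume": [12, 10, 11, 7, 8, 5],
}

flagged = flag_correlated_pairs(data)
for a, b, r in flagged:
    print(f"{a} vs {b}: r = {r}")
```

With this toy data only the advertising and volume pair is flagged, which mirrors the pattern described above: a strong negative correlation between two predictors is a warning sign worth investigating further.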
Step 2: Look for incorrect coefficient signs.
Incorrect signs can indicate multicollinearity. The regression output for the four variables in the fitness goods manufacturer's data is below. In this case, the signs are what we would expect them to be: sales go up as price goes down, and as advertising and overall store volume go up. This test does not indicate multicollinearity in this case.
Step 3: Look for instability of the coefficients.
To test for instability of the coefficients, we can run the regression on different combinations of the variables and see how much the estimates change. The output on the left is with all four variables; the one on the right omits volume.
As you can see, the coefficient for advertising changes substantially between these two models. The estimated impact of advertising is 20.5 when volume is not included, but 131.3 when controlling for volume. This indicates multicollinearity between volume and advertising, which results in a downward bias in the advertising coefficient in the second model (the one that omits volume).
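The same instability check can be sketched in Python. This is only an illustration under stated assumptions: the data is synthetic (sales is constructed as exactly 10 + 5 × advertising + 3 × volume, with advertising and volume negatively correlated), and `ols` is a small helper I wrote for this post, not a library function.

```python
def ols(columns, y):
    """Least-squares coefficients [intercept, b1, b2, ...] via the normal equations."""
    rows = [[1.0, *vals] for vals in zip(*columns)]  # prepend intercept column
    k = len(rows[0])
    # Augmented matrix [X'X | X'y]
    a = [
        [sum(r[i] * r[j] for r in rows) for j in range(k)]
        + [sum(r[i] * t for r, t in zip(rows, y))]
        for i in range(k)
    ]
    # Gauss-Jordan elimination with partial pivoting
    for c in range(k):
        p = max(range(c, k), key=lambda idx: abs(a[idx][c]))
        a[c], a[p] = a[p], a[c]
        for r in range(k):
            if r != c:
                f = a[r][c] / a[c][c]
                a[r] = [v - f * w for v, w in zip(a[r], a[c])]
    return [a[i][k] / a[i][i] for i in range(k)]


# Synthetic data, invented for illustration only.
advertising = [1, 2, 3, 4, 5, 6]
volume = [12, 10, 11, 7, 8, 5]
sales = [10 + 5 * ad + 3 * vol for ad, vol in zip(advertising, volume)]

full = ols([advertising, volume], sales)   # controls for volume
reduced = ols([advertising], sales)        # omits volume

print("advertising coefficient, full model:   ", round(full[1], 3))
print("advertising coefficient, reduced model:", round(reduced[1], 3))
```

Because volume is negatively correlated with advertising and has a positive effect on sales, leaving it out pulls the advertising coefficient down, which is the same direction of change seen in the two regression outputs above.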
Step 4: Review the Variance Inflation Factor.
A measure that is commonly available in software to help diagnose multicollinearity is the variance inflation factor (VIF).
The variance inflation factor (VIF) measures how much the variance of an estimated regression coefficient is inflated compared to when the predictor variables are not linearly related.
Use the following guidelines to interpret the VIF:
VIF value: Status of predictors
VIF = 1: Not correlated
1 < VIF < 5: Moderately correlated
VIF > 5 to 10: Highly correlated
The VIF for a variable is based on how much of its variation is explained by the other predictor variables. To find this, run a regression using one of the correlated x variables as the dependent variable and the other x variables as predictors. In our example, if we ran a regression of volume on price, advertising and location, we would get an R-squared of 0.584. This shows that 58% of the variation in volume can be explained by the other variables.
The VIF is calculated as one divided by the tolerance, which is defined as one minus R-squared. In this case, the VIF for volume would be 1/(1-0.584), which is approximately 2.4. A VIF of one for a variable indicates no multicollinearity for that variable. As these values become larger, they indicate increased multicollinearity.
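The arithmetic above is easy to check in a few lines of Python. This is just a sketch of the formula; the function names are my own, and the only input is the R-squared of 0.584 quoted above.

```python
def tolerance(r_squared):
    """Tolerance = 1 - R-squared of the auxiliary regression."""
    if not 0.0 <= r_squared < 1.0:
        raise ValueError("R-squared must be in [0, 1) for a finite VIF")
    return 1.0 - r_squared


def vif(r_squared):
    """Variance inflation factor = 1 / tolerance."""
    return 1.0 / tolerance(r_squared)


r2_volume = 0.584  # R-squared of volume regressed on the other predictors
print("tolerance:", round(tolerance(r2_volume), 3))
print("VIF:", round(vif(r2_volume), 2))
```

An R-squared of zero gives a VIF of exactly one, and the VIF grows without bound as R-squared approaches one, which is why larger values indicate stronger multicollinearity.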
Rather than calculating this manually, you can ask for a report of VIFs from statistical software. In SAS, when we run PROC REG, we add ‘/ vif tol’ to the MODEL statement:
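/***------ syntax start------- ***/
proc reg data=test_database;
   /* variable names below are assumed for illustration */
   model sales = price advertising location volume / vif tol;
run;
/***------ syntax end------- ***/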
For this example, the output shows multicollinearity with volume and ads, but not with price and location.
In my next blog, I will discuss the different situations in which multicollinearity occurs and how to address it, including how to choose the best variables to remove in order to reduce multicollinearity.