In previous blogs, we covered the basics of multicollinearity and how to detect it. In this blog, I walk through four practical, real-life examples of multicollinearity and explain how you can address each one.
Firstly, if there is multicollinearity in a data set, we need to understand why. Having a solid understanding of the data and the logical relationships between the variables is the first step in understanding the effect of multicollinearity on our results and thereby determining how it should be addressed.
You may find that your model contains a predictor variable that has a direct causal relationship with another predictor variable.
For example, you may be looking at contributions to town charity organizations using a model that includes the population of the town and the total gross income of the town. You identify that these variables are highly correlated because the population of the town is a direct contributor to the total gross income of the town.
In a case like this, you should restructure your model to avoid regressing on two variables that are causally related. You could do this by either omitting one of these variables or by combining them into a single ratio variable such as per capita income.
You may find that your model contains two predictor variables that are manifestations of a common, underlying latent variable or construct. This is often referred to as the halo effect.
For example, you may be looking at customer loyalty to a shop using a model that includes several different measures of satisfaction. You identify that two of these measures (satisfaction with quality of product and satisfaction with the network) are highly correlated, and determine that this is because customers don’t tend to describe their satisfaction in that way. Rather, both measures are really a reflection of the same overall satisfaction.
In this case, you could simply use overall satisfaction as a predictor variable instead of the separate measures of satisfaction.
You may find that the multicollinearity is a function of the design of the experiment.
For example, in the cloth manufacturer case, we saw that advertising and volume were correlated predictor variables, resulting in major swings in the estimated impact of advertising depending on whether volume was included in the model. On further examination, you may discover that the cloth manufacturer inadvertently introduced multicollinearity between volume and advertising as part of the experimental design by assigning a high ad budget to cities with smaller stores and a low ad budget to cities with larger stores.
If you were able to re-do the market test, you could address this issue by restructuring the experiment to ensure a good mix of high ad/low volume, high ad/high volume, low ad/high volume and low ad/low volume stores. This would allow you to eliminate the multicollinearity in the data set.
It is often not feasible, though, to re-do an experiment. This is why it is important to analyze the design of a controlled experiment very carefully before beginning, so that you can avoid accidentally causing such problems. If you have found multicollinearity as a result of the experimental design and you cannot re-do the experiment, you can address it by including controls. In the case of the cloth manufacturer, it will be important to include volume in the model as a control in order to get a more accurate estimate of the impact of advertising. Other ways of addressing multicollinearity in cases like this include shrinkage estimators such as principal components regression or partial least squares regression.
Sometimes, you will find that multicollinearity is inevitable.
In the real world, you are often not working with completely controlled experiments where you can ensure that there is no relationship between your predictor variables through your experimental design. Predictor variables may be closely related to one another, but may not have a direct causal relationship with each other or with a latent variable. In this case, you cannot just remove or replace one of the variables.
The Radison Medical case is an example of this situation: both ads and reps, though correlated, are important predictor variables and should be included in the model. It would not be appropriate to analyze either one without controlling for the other. For example, adding hundreds of new ads without also increasing the number of reps will not have the same effect on sales as increasing both would. In cases such as this, you need to recognize the multicollinearity, accept it as part of your model, and ensure that your analysis and recommendations take the relationship between the variables into account.
I hope this blog has given you useful examples of how to deal with multicollinearity. If you have any questions or doubts, feel free to mention them in the comments box below and I shall get back to you as soon as I can.