June 24, 2013
Doing Regression Analysis has always been a tough task as there are many complicated terms involved in it like Heteroskedasticity, Multicollinearity, Adjusted R2 and many more.
The lack of practical knowledge on Regression Analysis or reluctance on taking training makes a job really tough.
The task of doing Variable selection and creating customized bands is tough because firs you need understand and then come to a conclusion. I want to help you guys so that you don’t end up landing in the same situation like me.
In Regression there involves Selecting Variable which will prove relation with the dependent variable. With the Insurance data assigned to me previously I had the age, age group, Vehicle age, Vehicle age group, years of driving experience and so on.
In Regression Analysis, selecting right variables is an important point as the final decision is taken depending on this relation. The first task I had with these Independent variables was to check for losses (dependent variable) due to these variables. Multicollinearity means the correlation between independent variables, the lower this correlation the better it is for us.
Also Banding of variables is also an important criterion we have to look in Regression. This involves selecting the Age or Age band or Vehicle age or Vehicle age band. Once we are done with checking for Multicollinearity and banding of variables, a statistical significance of variables is tested.
With the data in hand, I first started taking out the correlation between age and years of driving experience. Getting Correlation with so much data in hand wasn’t an easy task, but it got simplified with the help of MS Excel. I used the function CORREL and selected the column of age and Years of driving experience and got the result within a span of few seconds.
Now the result I got showed a high correlation between the two variables, thus failing the multicollineary test. As I knew I can use either of the variables but the question was which variable I should choose. I searched on the internet and looked in some books and came out with an answer that variable with a better fit should be used.
How to select a variable that best fits the model can be done on the basis of running a regression between dependent and independent variables, that is capped losses and age after that capped losses and years of driving experience. This is by fitting two models independently. I used excel and came to know the result. Let me show you how.
Step1: Go to Data -> Data Analysis from the ribbon menu.
Step2: Select the Regression function.
Step 3: Put the respective inputs, in this case capped losses and age.
Step 4: In the new sheet we have the result of the Regression
Similarly with the same steps, capped losses and Years of driving experience Regression were run.
The final decision was based on R2 which shows that how much an independent variable explains the dependent variable. Since age had a R2, it was selected over years of driving experience.
Now once I checked for Multicollinearity and which variables to be used, I had to check whether to use the age band or age. To do this again we have to run a Regression using age and age band. The problem arises as running the regression will be in Excel and Excel doesn’t give results for non-numerical values.
For this the age band has to be converted to average age so that it becomes numerical variable for a categorical variable.
Since for each policy there should be Average Age, we use V-look up function
Now with the Age and Average age and which of the variable is to be selected out of it, I ran a Regression in the similar way as I did for Years of driving experience and Average age.
Now that I had the result and a comparison after running the regression, I took the average age or to be clearer the age band over age for the simple reason being the R-square. Thus in selecting the variables I would use age band over age for the age factor.
In the same way I performed the regression for Vehicle age and age band. I got the similar result of Average Vehicle age or Age band over Vehicle age.
As I discussed before, excel doesnâ€™t take any other values apart from Numerical it was necessary to convert Binary variables present in the data to Numerical one.
The Binary variables such as Gender, fuel type etc needs to be converted. This can be done by giving values to the variables like in the case of Married status.
Finally after selecting the variables based on R2, taking out the band and then creating dummy variables we have the data on which we can run the Multivariate Regression.
Snapshot of the final data on which we will run the Multivariate Regression
In this way after hours of search and learning on different Excel functions I was able to create the data needed to run the Regression Analysis.
For indepth study of Analytics , have a look at our Business Analytics Program