What is Heteroskedasticity?
Heteroskedasticity occurs when the variance of the disturbance is not constant, which is often a problem encountered in cross sectional data. It does not affect the parameter estimates: the coefficients should be unbiased, but it does, however, bias the variance of the estimated parameters.

How do you check for Heteroskedasticity?
The assumption of Linear Regression is that the variance should be constant between the dependent and the independent variable. I tried checking for the variance using frequency distribution between age band and capped losses, by creating a loss band using the rounddown function in excel
Heteroskedasticity

If you also wish to learn about using R to understand heteroskedasticity and to fix it subsequently, then you can refer to another blog that is on
Now creating a frequency distribution can be done using the Pivot table. We will follow the same rule and create it for age band and capped losses band.
So by taking the data in a new sheet, my initial need was to check for losses in various age groups.

When fetching data from the pivot table, I ended up getting the variables reflected in different boxes.

Now, as we want to check for loss range in various age groups, we re-arrange the data so that we can plot a frequency distribution.

So with the help of report filter we can select individual age group bands:

Now with the help of report filter we can check the number of policies lying in the loss band.

With this, plotting the frequency distribution becomes an easy task and can be plotted for any age group band using the filter option.

The same way, we can plot for a particular number of vehicles by fixing it in the report filter option in the pivot table.
So why did we calculate the frequency distribution for different groups of the same variable? The reason is simple: to visually check whether the distribution for all the groups is same or different by which we can make a check on heteroskedasticity. I will come to this again, but let me cover an important concept to get a clear check on heteroskedasticity.
Now trying to extract the actual numbers to check for heteroskedasticity, we require the variance of the variable for each age group.
Calculating variance is an easy task using excel (can also be done using R). But what is to be done when we have to calculate variance using different age groups? The process is simple. We use the subtotal function in excel.


So by giving a filter option to the data by using the short cut key Ctrl+Shift+L we can select the variables for which we are looking out for variance.

Concentrating on the variable of age group we can check on the variance of the band of the same variable and conclude on heteroskedasticity.

Similarly the same way we can create the frequency distribution and variance by using the filter option. It may look like a lengthy process, but it is not really when you work with excel.
How to Fix Heteroskedasticity
Since I was dealing with multivariate data where I had many independent variables, fixing heteroskedasticity for an individual variable wouldn’t solve the problem. You instead need to immunize all the data against Heteroskedasticity.

The next step is to find the standard deviation of capped losses for different segments. We have to make use of the formula Variance (X) = Mean (X^2) – {Mean(X)^2}.
We create a column of count and then a pivot table which gives the unique values of segment variables through which we can continue using the formula.

Since we have to fit the formula, we require the values in the form of Mean(Square of capped losses), Mean(capped losses), Square of (Mean (Capped Losses)), Variance and standard deviation of Mean(Square of capped losses).
I have given a screen shot of how the values were derived.

The standard deviation of capped losses can be taken out by using VLOOKUP function:

We finally calculate the Standardized Capped Losses as Capped Losses/ Segment Std Dev. This becomes the new response variable.

To check for Heteroskedasticity we calculate the variance of Standardized Capped Losses.
Comments