Share with your network!

What is Heteroskedasticity?

 

Heteroskedasticity occurs when the variance of the disturbance is not constant, which is often a problem encountered in cross sectional data. It does not affect the parameter estimates: the coefficients should be unbiased, but it does, however, bias the variance of the estimated parameters.

Heteroskedascity by plotting the residual against the predicted y

How do you check for Heteroskedasticity?

 

The assumption of Linear Regression is that the variance should be constant between the dependent and the independent variable. I tried checking for the variance using frequency distribution between age band and capped losses, by creating a loss band using the rounddown function in excel

 

Heteroskedasticity

Heteroskedasticity example

If you also wish to learn about using R to understand heteroskedasticity and to fix it subsequently, then you can refer to another blog that is on

Now creating a frequency distribution can be done using the Pivot table. We will follow the same rule and create it for age band and capped losses band.

 

So by taking the data in a new sheet, my initial need was to check for losses in various age groups.

Frequency distribution using pivot table

When fetching data from the pivot table, I ended up getting the variables reflected in different boxes.

Filters in pivot table

Now, as we want to check for loss range in various age groups, we re-arrange the data so that we can plot a frequency distribution.

How to use filters in pivot table

So with the help of report filter we can select individual age group bands:

Using filters in pivot table

Now with the help of report filter we can check the number of policies lying in the loss band.

Result after using filters in pivot table

With this, plotting the frequency distribution becomes an easy task and can be plotted for any age group band using the filter option.

How to plot frequency distribution

The same way, we can plot for a particular number of vehicles by fixing it in the report filter option in the pivot table.

 

So why did we calculate the frequency distribution for different groups of the same variable? The reason is simple: to visually check whether the distribution for all the groups is same or different by which we can make a check on heteroskedasticity. I will come to this again, but let me cover an important concept to get a clear check on heteroskedasticity.

 

Now trying to extract the actual numbers to check for heteroskedasticity, we require the variance of the variable for each age group.

 

Calculating variance is an easy task using excel (can also be done using R). But what is to be done when we have to calculate variance using different age groups? The process is simple. We use the subtotal function in excel.

how to use subtotal function on excel
Calculate variance using subtotal function on excel

So by giving a filter option to the data by using the short cut key Ctrl+Shift+L we can select the variables for which we are looking out for variance.

Using filters in excel

Concentrating on the variable of age group we can check on the variance of the band of the same variable and conclude on heteroskedasticity.

Mean, Variance and Frequency distribution

Similarly the same way we can create the frequency distribution and variance by using the filter option. It may look like a lengthy process, but it is not really when you work with excel.

 

How to Fix Heteroskedasticity

 

Since I was dealing with multivariate data where I had many independent variables, fixing heteroskedasticity for an individual variable wouldn’t solve the problem. You instead need to immunize all the data against Heteroskedasticity.

How to fix heteroskedasticity

The next step is to find the standard deviation of capped losses for different segments. We have to make use of the formula Variance (X) = Mean (X^2) – {Mean(X)^2}.

 

We create a column of count and then a pivot table which gives the unique values of segment variables through which we can continue using the formula.

Finding Values

Since we have to fit the formula, we require the values in the form of Mean(Square of capped losses), Mean(capped losses), Square of (Mean (Capped Losses)), Variance and standard deviation of Mean(Square of capped losses).

 

I have given a screen shot of how the values were derived.

Standard deviation: Finding values

The standard deviation of capped losses can be taken out by using VLOOKUP function:

Standard deviation using Vlookup

We finally calculate the Standardized Capped Losses as Capped Losses/ Segment Std Dev. This becomes the new response variable.

Final results for heterskedasticity

To check for Heteroskedasticity we calculate the variance of Standardized Capped Losses.


Learn: Multivariate Linear Regression

Understanding Heteroskedasticity