I remember those early days of my career when I was a part of Risk Analytics team at a bank. My Boss had given me an assignment where I had to work on the data from the Banks General Insurance unit. I went clueless seeing that huge data.
The data was related to Auto Insurance and it was not about identifying and defining variable types, rather it was all about creating a data dictionary and doing a Bivariate Regression Analysis by a way of multivariate linear regression. The basic objective was to identify and quantify losses in the insurance portfolio.
Although I did have theoretical understanding of regression, yet it took me 2 sleepless nights applying it practically.
Task 1 of Multivariate Linear Regression: Get the Data Dictionary
The moment I received the data from my Boss along with the scary deadlines, my initial job was creating a data dictionary.
A data dictionary contains the descriptions of data variables in a data model. The IBM Dictionary of Computing defines Data Dictionary as a "centralized repository of information about data such as meaning, relationships to other data, origin, usage, and format.
I noted down each variable from the data and gave description to each of them.
Step 1 of Data Dictionary:
Find unique values for each individual variable from the data. In the above table, since policy number itself is a unique identifier I was left with the task of next eight variables. There was indeed a lot of data.But as we know Excel can handle however large the data is, whether we talk about 3000 rows to 15000 odd rows or increasing columns.
I found that the unique values was done by going through the Ribbon menu and then selecting data and using the Advanced filter option
Step 2 of Data Dictionary:
Select the data range you want to extract unique values from and then select the unique values option.
Step 3 of Data Dictionary:
Select the cells where the unique values are to pasted and sort them. After doing this for each you finally create the data dictionary having all the unique values for each variable.
Task 2 of Multivariate Linear Regression: Doing a Loss Analysis
Step 1 of Loss Analysis:
Since the task was identifying and quantifying the factors for losses (the dependent variable) a Distribution Analysis was done. Initially I started by creating a scatter plot for the loss data to check for the outliers and created a frequency distribution.
Step 2 of Loss Analysis:
Re-create a scatter plot and frequency distribution: Scatter plot clearly indicates a lot of outliers in the original data which creates problem in forming the relationship between dependent and Independent variable, I analyzed that I should give a cap to the loss data and then re-create a scatter plot and frequency distribution.
Task 3 of Multivariate Linear Regression: Doing Bivariate Profiling
Bivariate analysis is the simultaneous analysis of two variables (attributes). It explores the concept of relationship between two variables, whether there exists an association and the strength of this association, or whether there are differences between two variables and the significance of these differences.
For the Insurance portfolio the losses had to be analyzed using different factors and look for the relationship so that a conclusion can be formed for each factor.
Step 1 of Bivariate Profiling:
I started by doing Loss profiling using the age factor. With the help of pivot table in excel I was able to perform this task efficiently. The first step was selecting the data and using the shortcut key Alt D then P use Next function and then select a new worksheet after that finishes.
At every age factor there is an associated sum of losses, capped losses to avoid outliers and the number of policies.
Step 2 of Bivariate Profiling:
Select the fields of age, losses, count and capped losses. In the row labels select the age since we are concerned with the associated losses at each age.
Step 3 of Bivariate Profiling:
From this we calculate average losses, average capped losses and percentage of observations. The percentage of observations is calculated as:
Step 4 of Bivariate Profiling:
Once we have calculated the above numbers we can plot the data to analyze the relationship between losses, age and percentage of observations.
This way I carried forward the Bivariate Analysis under Multivariate Linear Regression for other variables too and made my boss happy for the second time in the week. Thus in this way I successfully learned to represent data and create an easily readable chart.
Feel free to ask any doubt in the comments section below