I still remember that early phase of career when i was a part of risk analytics team at a bank, and my boss gave me a task which I had no clue about. He gave me a list of customers from the bank database and told me to extract the information about the customers with credit card. Further he asked me to present whatever inferences I could draw from the data to him! The bank wanted to identify the risky customers.
Now as you can well imagine, I was completely stumped. All I had was a huge amount of data and no idea as to how to deal with it.
It took me a lot of effort and quite a many sleepless nights to finish that task, and now I want to help you avoid the pain when undertaking a similar task at work.
Identifying the Variables
I sat down and broke my task into parts. My first task was to identify the information relating to my task that I needed from the huge database. Data is nothing but a combination of "variables" which actually contain the values.
So my task was to identify the variables that gave information about customers with credit card. That part was easy. By simply looking at the data I was easily able to identify the variables which I needed such as Name of Customer, Customer ID, Number of credit cards etc.
Identifying the Values Stored
The next step was to identify what values could be stored in each variable and also to choose the type of variable required to store those values. Now this was a complicated task!
Could you think of how you would approach to this problem? No?
I began by trying to list out all the unique values stored under each data.
Huge amount of data on Credit Card customers
Name and Customer ID
The Name of customer and Customer Id are the main identifiers. They can take any value. However, we can have two customers with the same name. Thus we need a Customer Id to identify each one distinctly. These two variables will not have any set of unique values
The Other Variables
As you can see from the data above, I was dealing with 3000 rows and 8 columns! It would have been impossible for me to go through each cell of data and identify the unique values for the remaining variables!
Here Excel came to my rescue. I used a cool tool called Advanced Filters to identify all the unique values in each column.
Let me show you.
Step1: Go to Data -> Advanced Filters from the ribbon menu
Step 2: Select the data range you want to extract unique values from and Select â€œUnique Valuesâ€ option
Step 3: Select the cell where you want to paste these unique values
Paste unique values and sort them
Thus, in this manner you can extract all the unique values. It will be helpful to sort these values as well.
I did this for all the variables and filled in my table for the type of value stored.
Value stored in each variable
Identifying the Variable Type
Now that I had found the values stored in each variable, it was easier to define what kind of variable was needed to store those values. But before we identify the type of variable, we will need a little information about the different types of variables.
I have summarized the same for you below.
Types of variables
Also learn to Handle Different Kinds of Data Variables
Now that I knew about the data variables types, and also the value stored in each variable, identifying the variable type was as easy as putting 2 and 2 together.
Thus in this way I successfully learned to represent data is a nice and convenient manner!
While I was ecstatic about having managed so much, I still needed to summarize the data and present my findings to my boss!
After doing some research on the internet I learned that there are many different ways to summarize data, but the most common and useful methods are
- Frequency distribution
- Grouped frequency distribution
- Cumulative frequency distribution
Each of these methods is used for different types of variables. For eg.to summarize a discrete variable like number of credit cards, we can use a frequency distribution.
Frequency distribution graph
For the variable, Annual Salary, we cannot use such a technique to summarize the data. Annual salary is a continuous variable. There are too many unique values to create a frequency distribution. However, we can divide the salary into ranges and present the distribution among those ranges. This is called Grouped Frequency distribution
Grouped Frequency Distribution
In this manner, I created a comprehensive presentation and kept my boss happy!
Also learn to Handle Different Kinds of Data Variables from our blog on Multivariate Linear Regression