In this article we will learn to create a scatter plot matrix for a chosen dataset. A scatter plot matrix is a great way to roughly determine whether there is a linear correlation between multiple variables. Although a scatter plot matrix is not available in Tableau as a one-click visualization under Show Me, it can be created quite easily.
A scatter plot needs two measures, so to build a matrix of scatter plots we must choose a dataset that has at least 3 measures. For this exercise we will use the Auto MPG Data Set from the University of California, Irvine website, which hosts many publicly available datasets for machine learning purposes.
The data for our exercise is available here (free of unknown values). The headers are missing from the dataset, so they have to be added manually when converting it into a CSV or Excel file. The headers for the data can be sourced from here.
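If you would rather script the conversion than do it by hand, the following is a minimal Python sketch. It assumes the raw file is whitespace-separated with the car name in double quotes, and that the nine column names below match the dataset's attribute list; both the function name `raw_to_csv` and the sample row are illustrative, so check them against your downloaded file.

```python
import csv
import io
import shlex

# Assumed column names, taken to match the dataset's attribute list.
HEADERS = ["mpg", "cylinders", "displacement", "horsepower", "weight",
           "acceleration", "model year", "origin", "car name"]


def raw_to_csv(raw_text):
    """Convert header-less, whitespace-separated rows into CSV text with a header row."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(HEADERS)
    for line in raw_text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        # shlex.split keeps the double-quoted car name together as one field
        fields = shlex.split(line)
        if len(fields) != len(HEADERS):
            raise ValueError(f"unexpected number of fields: {line!r}")
        writer.writerow(fields)
    return out.getvalue()


# Assumed sample row in the raw layout; verify against the actual file.
sample = '18.0   8   307.0   130.0   3504.   12.0   70  1\t"chevrolet chevelle malibu"\n'
print(raw_to_csv(sample))
```

Writing the returned text to a file named auto-mpg.csv gives a file that Tableau can connect to directly.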
Let us have a look at the dimensions and measures that need to be understood in order to create a scatter plot matrix from this dataset.
Though Origin and Cylinders appear numeric, a close examination of the actual data records shows that they are categorical in nature. Cylinders takes values from 3 to 8, whereas Origin takes values from 1 to 3. Origin is the place of manufacture of the car (Europe, Asia or North America); it has been converted into numeric form, perhaps for regression purposes.
Hence we will make sure to convert Origin and Cylinders into dimensions after loading them into Tableau. Let us begin.
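The reasoning above (a numeric column with only a handful of distinct whole-number values is probably a category) can be sketched as a small heuristic. This is only an illustration: the function name `looks_categorical`, the threshold of 10 and the sample values are assumptions, not something Tableau does for you.

```python
def looks_categorical(values, max_distinct=10):
    """Heuristic: few distinct whole-number values suggest a categorical field."""
    distinct = set(values)
    all_whole = all(float(v).is_integer() for v in distinct)
    return all_whole and len(distinct) <= max_distinct


# Illustrative values only, chosen within the ranges described above.
cylinders = [8, 8, 4, 6, 4, 3, 5, 4]
origin = [1, 1, 3, 2, 1, 3, 2, 1]
weight = [3504.0, 3693.0, 2372.0, 2833.0, 2130.0, 2330.0,
          2950.0, 1835.0, 2672.0, 2430.0, 2375.0]

for name, column in [("cylinders", cylinders), ("origin", origin), ("weight", weight)]:
    print(name, looks_categorical(column))
```

A heuristic like this only flags candidates; whether a field is really categorical still depends on what the numbers mean.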
Data Exploration & Visualization
Step 1 – Connect to the data.
I have my data stored in an Excel file named auto-mpg, as shown below. There should be 398 records in the dataset. Tableau's Data Interpreter indicates that the data doesn't look good, but there don't seem to be any issues with it, so you can choose to ignore the warning.
Step 2 – Go to Sheet 1 and analyse/review the loaded data.
As shown below, Tableau should detect the following dimensions and measures upon loading Sheet 1.
Step 3 – Convert Origin and Cylinders to Dimension
As shown below, right-click on Cylinders and convert it into a dimension. Similarly, convert Origin into a dimension as well.
Once Cylinders and Origin are converted, the Dimensions and Measures panes should look as shown below.
Step 4 – Create Matrix of Measures
Start double-clicking on the measures one after the other. After you have double-clicked on the first two measures, you should see a single scatter plot as shown below.
On double-clicking the third measure, you should see the following scatter plot matrix.
Likewise, once you have double-clicked on all 5 measures, you should see the scatter plot matrix below. The basic skeleton of our scatter plot matrix is now in place, but we have to perform a few more steps to turn it into a really useful visualization.
Step 5 – Change aggregation of measures from SUM to AVG
We change the aggregation of measures from SUM to AVG because there can be multiple records for the same car (one per model year), so summing the measures would not make sense.
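To see why SUM misleads here, consider a small Python sketch with made-up records (the car names and mpg values are purely illustrative, not taken from the dataset). A car that appears under three model years gets its mpg tripled by SUM, while AVG stays on the same scale as a car that appears once.

```python
from collections import defaultdict
from statistics import mean

# Made-up records for illustration: (car name, model year, mpg).
records = [
    ("car a", 70, 20.0),
    ("car a", 72, 22.0),
    ("car a", 74, 24.0),
    ("car b", 71, 30.0),
]

# Collect the mpg values of each car across its model years.
mpg_by_car = defaultdict(list)
for name, _year, mpg in records:
    mpg_by_car[name].append(mpg)

sum_mpg = {name: sum(vals) for name, vals in mpg_by_car.items()}
avg_mpg = {name: mean(vals) for name, vals in mpg_by_car.items()}

print("SUM:", sum_mpg)
print("AVG:", avg_mpg)
```

With SUM, "car a" would appear more than twice as economical as "car b" simply because it has more rows; AVG removes that artifact.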
As shown below, right-click on a measure in the Rows or Columns shelf and choose Avg under the Measure option.
Once you have changed the aggregation for all measures from SUM to AVG, the Columns and Rows shelves should look as shown below. Notice that we still don't have the data plotted into individual scatter plots in the matrix. Do you know why?
Step 6 – Put Car name into Detail card
Remember, to create a scatter plot you must choose the granularity of the data by putting a dimension onto the Detail card. Without one, each measure is aggregated over the whole dataset and every cell of the matrix shows a single mark. Since we are analyzing the characteristics of different cars, such as cylinders, acceleration and miles per gallon, we will put Car name onto the Detail card so that each car becomes its own mark and we can analyze the correlation between the attributes in our dataset.
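The effect of the Detail card can be imitated in a few lines of Python. This is a rough sketch with made-up rows, and the helper `marks` is hypothetical: it averages two measures either over everything (no detail dimension) or per value of a chosen key.

```python
from statistics import mean

# Made-up rows for illustration: (car name, weight, mpg).
rows = [
    ("car a", 3500, 18.0),
    ("car a", 3400, 19.0),
    ("car b", 2200, 30.0),
    ("car c", 2800, 24.0),
]


def marks(rows, detail=None):
    """Return one (avg weight, avg mpg) mark per detail value, or one overall mark."""
    groups = {}
    for row in rows:
        key = detail(row) if detail else "all"
        groups.setdefault(key, []).append(row)
    return {
        key: (mean(r[1] for r in group), mean(r[2] for r in group))
        for key, group in groups.items()
    }


print(len(marks(rows)))                           # no detail dimension
print(len(marks(rows, detail=lambda r: r[0])))    # car name on detail
```

Without a detail key everything collapses into one averaged point; grouping by car name yields one point per car, which is what makes a scatter plot possible.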
Notice that we are now very close to our final target. We can see the correlation between any pair of measures in the matrix. The scatter plots are symmetrical across the diagonal running from top-left to bottom-right, and the plots on the diagonal itself are not meaningful, because plotting a measure against itself always produces a perfect linear correlation. We can therefore pay attention to either the triangle above the diagonal or the one below it. Since we have 5 measures, there are 10 scatter plots [N * (N-1)/2, here N = 5] that contribute to meaningful analysis.
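The count of 10 can be checked by enumerating the unordered pairs directly. In the sketch below the five measure names are an assumption about which fields remain as measures after Step 3; substitute the names from your own Measures pane.

```python
from itertools import combinations

# Assumed measure names; adjust to match your Measures pane.
measures = ["mpg", "displacement", "horsepower", "weight", "acceleration"]

n = len(measures)
pairs = list(combinations(measures, 2))  # unordered pairs, no self-pairs

print("cells in the full matrix:", n * n)
print("cells off the diagonal:", n * n - n)
print("distinct pairs:", len(pairs), "=", n * (n - 1) // 2)
for x, y in pairs:
    print(f"{x} vs {y}")
```

Each distinct pair appears twice in the matrix (once on each side of the diagonal), which is why only half of the off-diagonal cells need to be read.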
We will make a few more tweaks to the visualization before beginning the analysis.
Step 7 – Change the size of the marks
Decrease the size of the marks so that overlapping data points become easier to tell apart, as shown below.
Step 8 – Put Cylinders on Colour card
Put Cylinders on the Colour card to further augment the analysis by segmenting the cars by number of cylinders, as shown below.
Step 9 – Add Filters
Add filters to slice and dice the data in various ways. Configure Cylinders, Model Year and Origin as filters and show them as quick filters. Feel free to play around with different filter values and observe how the visualization updates; some selections may reveal interesting results.
Step 10 – Analysis
Now that we have successfully created the scatter plot matrix for our data, it is time for some interesting analysis. After all, what is the point of creating a visualization if it doesn't help us understand the data or reveal some interesting insights?
What is the relationship between mpg and weight?
As the weight of the car increases, the miles per gallon decrease, as shown below.
What is the correlation between horsepower and mileage?
As can be seen below, the more horsepower a car has, the lower its mileage.
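A scatter plot shows the direction of a relationship visually; if you want a number to go with it, the Pearson correlation coefficient is one common choice. The following is a minimal sketch using made-up weight and mpg values (not figures from the dataset), with `pearson_r` written out by hand so that each step is visible.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of products of deviations, and the root sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    dev_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    dev_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    if dev_x == 0 or dev_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (dev_x * dev_y)


# Made-up values for illustration only.
weight = [1800, 2200, 2600, 3000, 3400, 3800]
mpg = [36.0, 31.0, 27.5, 22.0, 19.0, 15.0]

print(round(pearson_r(weight, mpg), 3))
```

A value close to -1 corresponds to the downward-sloping cloud seen in the plots above, and a value close to +1 to an upward-sloping one.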
Likewise, the other 8 pairs of measures can be analyzed for correlation using the single scatter plot matrix created in this exercise. Happy analysis and visualization.
In summary, scatter plot matrices are good for spotting rough linear correlations in data that contains continuous variables. They are not so good for looking at discrete variables.
That is it for this time; stay tuned for more learning with Tableau.
Visit the official Tableau website to find more details about Tableau, its product offerings and its features.