The beeswarm plot is a one-dimensional scatter plot like "stripchart", but with closely-packed, non-overlapping points.
Essentially, a beeswarm plot is used to visualize distributions, much like a stripchart, histogram or box-and-whisker plot. The difference from these traditional chart types is that a beeswarm plots the data along a single axis and then offsets the points in the other direction to show volume or counts.
This is the first post in a series on data analysis and visualization using R. The series assumes that the reader has R installed and is comfortable programming in it.
If you don’t have R installed yet, please proceed ahead and install it first. It is free, runs on the major platforms, and is typically a straightforward installation.
The data has been sourced from howstat.com and formatted for consumption by R. This preparation is the first important, and often time-consuming, step before data analysis, visualization and exploration can happen. We have batting data for One Day International (ODI) matches played between 1971 and 2011, with close to 60,000 data points. The table below gives a quick overview of the important dimensions and measures present in the dataset.
|Dimensions|Measures|
|---|---|
|Player name|Score Rate (runs per 100 balls faced)|
|Opponent country (Versus)| |
Data Exploration & Visualization
Step 1 – Open the R console (GUI) or RStudio.
Step 2 – Install the beeswarm package.
Before we begin, we have to install the beeswarm package. Run the command below to install it.
install.packages("beeswarm") and then select an appropriate CRAN mirror.
Step 3 – Load the package
Once the beeswarm package is installed successfully, load it with the library command: library(beeswarm)
Step 4 – Load the data
Load the ODI batting data, which has been prepared in CSV format, into a data frame (for example with read.csv).
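A minimal sketch of this step, under stated assumptions: the variable name odi_batting_data and the column names (Player, Country, Versus, Runs, ScoreRate) are hypothetical, and the rows are made-up values rather than real scores. A tiny stand-in file is written to a temporary location so that the snippet runs on its own.

```r
# Write a tiny stand-in CSV so the example is self-contained.
# Column names and values are assumptions; the real file may differ.
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "Player,Country,Versus,Runs,ScoreRate",
  "Sachin R Tendulkar,India,Australia,117,100.9",
  "Player X,Australia,India,42,78.5"
), csv_path)

# Read the CSV into a data frame; in practice, pass the path to your own file.
odi_batting_data <- read.csv(csv_path, stringsAsFactors = FALSE)
```

With the real data, csv_path would simply be the location of the prepared CSV file on your machine.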
Step 5 – Explore the data
Explore the dataset using the head command. You can also check how many rows and columns the dataset has with the nrow and ncol commands.
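As a rough illustration, here are the three commands applied to a small made-up data frame that stands in for the full dataset. The variable name odi_batting_data and the values are assumptions.

```r
# Small made-up data frame standing in for the full ODI batting data.
odi_batting_data <- data.frame(
  Player  = c("Sachin R Tendulkar", "Sourav C Ganguly", "Rahul Dravid", "Player X"),
  Country = c("India", "India", "India", "Australia"),
  Runs    = c(117, 45, 60, 42),
  stringsAsFactors = FALSE
)

head(odi_batting_data, 3)  # first 3 rows (head() shows 6 rows by default)
nrow(odi_batting_data)     # number of rows
ncol(odi_batting_data)     # number of columns
```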
Step 6 – Extract data for Indian players
One can extract the data for Indian players by filtering the Country column on India. One can then verify that the resulting data frame has fewer records than the data frame holding the complete dataset.
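A minimal sketch of the filter, again on made-up rows. The name odi_batting_data for the complete dataset is an assumption; india_odi_batting_data is the name used later in this post.

```r
# Made-up stand-in for the complete dataset.
odi_batting_data <- data.frame(
  Player  = c("Sachin R Tendulkar", "Sourav C Ganguly", "Player X", "Player Y"),
  Country = c("India", "India", "Australia", "England"),
  Runs    = c(117, 45, 42, 8),
  stringsAsFactors = FALSE
)

# Keep only the rows where Country is India (all columns are retained).
india_odi_batting_data <- odi_batting_data[odi_batting_data$Country == "India", ]

nrow(odi_batting_data)        # rows in the complete data
nrow(india_odi_batting_data)  # fewer rows after filtering
```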
Step 7 – Aggregate runs scored by all Indian Batsmen
We create a new data frame and use the aggregate function to find the sum of runs scored by each player. The data covers 185 unique India batsmen.
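One way to write such an aggregation is with the formula interface of aggregate, sketched here on made-up innings. The name player_runs for the new data frame is hypothetical.

```r
# Made-up innings: one row per innings played.
india_odi_batting_data <- data.frame(
  Player = c("Sachin R Tendulkar", "Sourav C Ganguly", "Sachin R Tendulkar",
             "Rahul Dravid", "Sourav C Ganguly"),
  Runs   = c(100, 50, 20, 30, 10),
  stringsAsFactors = FALSE
)

# Sum Runs within each Player; the result has one row per player.
player_runs <- aggregate(Runs ~ Player, data = india_odi_batting_data, FUN = sum)

nrow(player_runs)  # number of unique players
```

On the real data, nrow(player_runs) is where the count of unique batsmen would come from.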
Step 8 – Find Top 3 batsmen by runs that have played for India
Now let us sort the new data frame by Runs in descending order to find the top 3 India batsmen by runs.
Even without sorting the data we know that Sachin, Sourav and Rahul will be those three players, but let us still go through the programmatic way to confirm our belief.
Notice the negative sign before the Runs column inside the order function, which indicates that we want to sort in decreasing order.
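The sorting idiom described above can be sketched as follows. The data frame player_runs and its totals are made up for illustration.

```r
# Made-up totals per player.
player_runs <- data.frame(
  Player = c("Rahul Dravid", "Sachin R Tendulkar", "Sourav C Ganguly", "Player Y"),
  Runs   = c(30, 120, 60, 5),
  stringsAsFactors = FALSE
)

# order() returns row indices that sort ascending; negating Runs reverses
# the ranking, so the highest totals come first.
sorted_runs <- player_runs[order(-player_runs$Runs), ]

# Equivalent for a single sort key: order(player_runs$Runs, decreasing = TRUE)
top3 <- head(sorted_runs, 3)
top3
```

The negation trick only works for numeric columns; for character columns, the decreasing = TRUE form is the one to use.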
Step 9 – Subset the data for the top 3 batsmen
Now that we have narrowed down our data points, let us plot a basic version of the beeswarm plot.
Before that, we subset the data further to keep only the data points for Sachin, Sourav and Rahul, using the command below. Subsetting can be done in various ways.
sachin_sourav_rahul=india_odi_batting_data[which(india_odi_batting_data$Player=="Sachin R Tendulkar" | india_odi_batting_data$Player=="Sourav C Ganguly" | india_odi_batting_data$Player=="Rahul Dravid"),]
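As an illustration of the other ways to subset, here are two alternatives to the which-based command above, shown on made-up rows: the %in% operator and the subset function. The vector top3 and the names by_in and by_subset are hypothetical.

```r
# Made-up stand-in for the India data.
india_odi_batting_data <- data.frame(
  Player = c("Sachin R Tendulkar", "Sourav C Ganguly", "Player Z", "Rahul Dravid"),
  Runs   = c(117, 45, 12, 60),
  stringsAsFactors = FALSE
)

top3 <- c("Sachin R Tendulkar", "Sourav C Ganguly", "Rahul Dravid")

# Option 1: logical indexing with %in% instead of chained == and |.
by_in <- india_odi_batting_data[india_odi_batting_data$Player %in% top3, ]

# Option 2: subset(), which evaluates the condition inside the data frame.
by_subset <- subset(india_odi_batting_data, Player %in% top3)
```

Both options keep the same rows as the longer command; %in% scales better when the list of players grows.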
Step 10 – Plot beeswarm plot for Top 3 batsmen
Here is the default beeswarm plot command and its output.
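A minimal sketch of a default call, assuming the runs are passed as a plain numeric vector so that all scores land in a single swarm. The scores here are randomly generated stand-ins, not the real data.

```r
library(beeswarm)

# Randomly generated stand-in scores (not real data): 30 innings per player.
set.seed(1)
sachin_sourav_rahul <- data.frame(
  Player = rep(c("Sachin R Tendulkar", "Sourav C Ganguly", "Rahul Dravid"), each = 30),
  Runs   = round(rexp(90, rate = 1 / 35)),
  stringsAsFactors = FALSE
)

# Default beeswarm: one swarm containing every score.
# The plotting information is returned invisibly as a data frame.
swarm <- beeswarm(sachin_sourav_rahul$Runs)
```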
As you can see, the plot looks like a pyramid: a large share of the individual scores by these three gentlemen fall between 0 and 40, the swarm narrows towards a peak around the 125-run mark, and only a few marks lie between 140 and 150.
Now let us draw a separate beeswarm for each of the players.
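One way to get a swarm per player is the formula interface of beeswarm, sketched here on randomly generated stand-in scores (not the real data).

```r
library(beeswarm)

# Randomly generated stand-in scores (not real data): 30 innings per player.
set.seed(1)
sachin_sourav_rahul <- data.frame(
  Player = rep(c("Sachin R Tendulkar", "Sourav C Ganguly", "Rahul Dravid"), each = 30),
  Runs   = round(rexp(90, rate = 1 / 35)),
  stringsAsFactors = FALSE
)

# Runs ~ Player: Runs on the value axis, one swarm for each Player.
swarm <- beeswarm(Runs ~ Player, data = sachin_sourav_rahul)
```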
As can be seen from the diagram, all three batsmen have a lot of scores in the range of 0 to 30.
Sachin has a string of scores between 80 and 100. Sachin has also scored more centuries than Sourav and Rahul, hence the larger number of data points beyond the 100-run mark.
Now let us make this visualization more attractive, step by step.
The beeswarm function has a lot of options, so one can play around according to one's visual preferences. Here I will take you through a few of them.
One can execute ?beeswarm on the console to get the help page for the command.
Let us start by giving three different colours to the data points, one per player, with col=1:3.
Next, let us change the symbol used to plot the points with the pch argument. Do try different options for pch.
One can change the size of the marks by altering the cex parameter.
Now let us keep the swarms of neighbouring players from running into each other by setting the corral argument to "gutter", which collects runaway points along the boundary between groups. One has to explore the possible values of these arguments to find what best suits one's requirements.
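The four arguments discussed above can be combined in a single call. This is a sketch on randomly generated stand-in scores (not the real data); the particular values mirror the ones used in the final command of this post.

```r
library(beeswarm)

# Randomly generated stand-in scores (not real data): 30 innings per player.
set.seed(1)
sachin_sourav_rahul <- data.frame(
  Player = rep(c("Sachin R Tendulkar", "Sourav C Ganguly", "Rahul Dravid"), each = 30),
  Runs   = round(rexp(90, rate = 1 / 35)),
  stringsAsFactors = FALSE
)

swarm <- beeswarm(Runs ~ Player, data = sachin_sourav_rahul,
                  col = 1:3,          # one colour per player
                  pch = 19,           # solid circles
                  cex = 0.75,         # slightly smaller points
                  corral = "gutter")  # runaway points collect at group boundaries
```

Adding the arguments one at a time and re-running the call is a quick way to see what each of them changes.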
Now let us add an x-axis label, a y-axis label and a title to the chart.
beeswarm(Runs ~ Player, data=sachin_sourav_rahul, method="swarm", col=1:3, pch=19, cex=0.75, corral="gutter", xlab="India Batsman", ylab="Runs", main="India Top 3 ODI Batsmen Beeswarm Plot")
That is it for this time; stay tuned for more learning with R.