July 21, 2015
k-Means clustering (aka segmentation) is one of the most common Machine Learning methods out there, dwarfed perhaps only by Linear Regression in its popularity. While basic k-Means algorithm is very simple to understand and implement, therein lay many a nuances missing which out can be dangerous. A good analyst doesn’t just know his/her techniques perfunctorily but knows ins-and-outs to understand implication and interpret the results in variety of scenarios. In three-part series starting with this post, we will explore finer aspects related to k-Means clustering.
This post assumes prior knowledge of k-Means algorithm. If you aren’t familiar then go through wiki-article or any standard text-book to understand and then come back here for deep-dive.
One primary and significant drawback of k-Means is that it requires prior knowledge or assumption about number of clusters. How can we know beforehand what’s right number of clusters for given data and business problem? This post will present variety of useful approaches. In this discussion it will be useful to consider a working example.
Fig. 1 presents a scatter plot of 1000 observations dummy-data. We can see, visually, that there are 3 clusters here. In real life, of course, observations have more than two dimensions and cannot be visualized easily. Further, cluster separation isn’t likely to be so obvious.
Figure 1: Example Cluster Data (with True Clustering)
Implicit objective function in k-Means measures sum of distances of observations from their cluster centroids, called Within-Cluster-Sum-of-Squares (WCSS). This is computed as
where Yi is centroid for observation Xi. By definition, this is geared towards maximizing number of clusters, and in limiting case each data point becomes its own cluster centroid. This is, naturally, neither practical nor desirable. Fig. 2 plots WCSS for k=1.20 and we can see that it continuously drops, indicating more clusters the better!
Figure 2: WCSS decreasing with k
Given that k-Means has no in-built preference for right number of clusters, following are some of the common ways k can be selected:
where a is average distance to all other observations within same cluster as that of observation i while b is minimum of average distance to all other observations from all other clusters. Silhouette coefficient of clustering result is average of si for all observations i. This metric is between +1 representing best clustering and -1 representing worst clustering. While WCSS is comparable for same data for different k, its number is not comparable across different clustering solutions on different data, and hence doesn’t have absolute threshold. On the other hand, Silhouette Coefficient has fixed range and hence can be used overall metric comparing quality of clustering irrespective of data or number of clusters.
Figure 3: Silhouette Coeff changing with k
Plotting SC against K we see highest coefficient of 0.63 with 3 clusters and second highest of 0.60 with 2 clusters. For higher number of clusters, SC sharply drops and stays low. This is because further fragmenting a given cluster makes both a and b closer to each other.
In next post, we will cover data pre-processing required for correct implementation of k-Means.
For those interested and familiar with R, above results can be replicated using following functions:
result < - kmeans(dataframe, centers=k) – this runs k-Means clustering on dataframe of observations
result$tot.withinss – this returns WCSS
sc < - silhouette(result$cluster, daisy(dataframe)) – this compute silhouette coefficient for k-Means result, and is part of MASS package
sum(summary(sc)$clus.avg.widths*summary(sc)$clus.sizes)/sum(summary(sc)$clus.sizes) – this computes overall SC