k-Means clustering (aka segmentation) is one of the most common Machine Learning methods out there, dwarfed perhaps only by Linear Regression in its popularity. While basic k-Means algorithm is very simple to understand and implement, therein lay many a nuances missing which out can be dangerous. A good analyst doesn't just know his/her techniques perfunctorily but knows ins-and-outs to understand implication and interpret the results in variety of scenarios. In three-part series starting with this post, we will explore finer aspects related to k-Means clustering.
This post assumes prior knowledge of k-Means algorithm. If you aren’t familiar then go through wiki-article or any standard text-book to understand and then come back here for deep-dive.
One primary and significant drawback of k-Means is that it requires prior knowledge or assumption about number of clusters. How can we know beforehand what’s right number of clusters for given data and business problem? This post will present variety of useful approaches. In this discussion it will be useful to consider a working example.
Fig. 1 presents a scatter plot of 1000 observations dummy-data. We can see, visually, that there are 3 clusters here. In real life, of course, observations have more than two dimensions and cannot be visualized easily. Further, cluster separation isn’t likely to be so obvious.
Figure 1: Example Cluster Data (with True Clustering)
Implicit objective function in k-Means measures sum of distances of observations from their cluster centroids, called Within-Cluster-Sum-of-Squares (WCSS). This is computed as
where Yi is centroid for observation Xi. By definition, this is geared towards maximizing number of clusters, and in limiting case each data point becomes its own cluster centroid. This is, naturally, neither practical nor desirable. Fig. 2 plots WCSS for k=1.20 and we can see that it continuously drops, indicating more clusters the better!
Figure 2: WCSS decreasing with k
Given that k-Means has no in-built preference for right number of clusters, following are some of the common ways k can be selected:
where a is average distance to all other observations within same cluster as that of observation i while b is minimum of average distance to all other observations from all other clusters. Silhouette coefficient of clustering result is average of si for all observations i. This metric is between +1 representing best clustering and -1 representing worst clustering. While WCSS is comparable for same data for different k, its number is not comparable across different clustering solutions on different data, and hence doesn’t have absolute threshold. On the other hand, Silhouette Coefficient has fixed range and hence can be used overall metric comparing quality of clustering irrespective of data or number of clusters.
Figure 3: Silhouette Coeff changing with k
Plotting SC against K we see highest coefficient of 0.63 with 3 clusters and second highest of 0.60 with 2 clusters. For higher number of clusters, SC sharply drops and stays low. This is because further fragmenting a given cluster makes both a and b closer to each other.
In next post, we will cover data pre-processing required for correct implementation of k-Means.
For those interested and familiar with R, above results can be replicated using following functions:
result < - kmeans(dataframe, centers=k) – this runs k-Means clustering on dataframe of observations
result$tot.withinss – this returns WCSS
sc < - silhouette(result$cluster, daisy(dataframe)) – this compute silhouette coefficient for k-Means result, and is part of MASS package
sum(summary(sc)$clus.avg.widths*summary(sc)$clus.sizes)/sum(summary(sc)$clus.sizes) – this computes overall SC
Global Association of Risk Professionals, Inc. (GARP®) does not endorse, promote, review or warrant the accuracy of the products or services offered by EduPristine for FRM® related information, nor does it endorse any pass rates claimed by the provider. Further, GARP® is not responsible for any fees or costs paid by the user to EduPristine nor is GARP® responsible for any fees or costs of any person or entity providing any services to EduPristine Study Program. FRM®, GARP® and Global Association of Risk Professionals®, are trademarks owned by the Global Association of Risk Professionals, Inc
CFA Institute does not endorse, promote, or warrant the accuracy or quality of the products or services offered by EduPristine. CFA Institute, CFA®, Claritas® and Chartered Financial Analyst® are trademarks owned by CFA Institute.
Utmost care has been taken to ensure that there is no copyright violation or infringement in any of our content. Still, in case you feel that there is any copyright violation of any kind please send a mail to email@example.com and we will rectify it.
2015 © Edupristine. ALL Rights Reserved.