August 12, 2015
k-Means clustering is one of the most common segmentation method. In previous two posts we talked about different ways number of clusters i.e. k can be identified and how to pre-process data before we run k-Means algorithm. Reader is requested to go through them before continuing the discussion here. In this blog post, we will delve deeper into ‘Means’ part of k-Means.
Means in k-Means refers to arithmetic mean (average), plural because there are as many means as there are number of clusters, and is referred as such because centroid of a cluster is computed as mean of all the observations belonging to that cluster. While iterations modify allocation of points in and out of clusters, cluster centroid is repeatedly adjusted to be arithmetic center of all observations. For multi-dimensional observations, ith dimension of centroid is mean of ith dimension of all the observations belonging to that cluster. This has interesting consequences.
Unlike Hierarchical Clustering or Density-Based Spatial Clustering of Application with Noise (DBSCAN) – two other common clustering methods – k-Means produces cluster centers which aren’t necessarily part of input data. In fact, centroids are hypothetical points which don’t really correspond to any of the observations in that cluster. Is this important?
For numerical multi-dimensional data, hypothetical centroid can still be assumed to represent a point in multi-dimensional hyper-space. But for non-numerical or part-numerical data, where distance notion is other than Euclidian distance or Manhattan distance , hypothetical centroid may even represent absurdity. Let’s think beyond the Means!
Depending on business context, data and problem at hand, following measures of central tendency can work better than average.
K-Modes uses Mode of each dimension as centroid – Mode refers to the value which is most frequent in the dataset, breaking the ties randomly – and is useful when categorical data is involved. Recall that k-Means necessarily only handles numerical data and categorical data must be transformed into binary variables before running k-Means clustering. However, mean of binary transformed categorical variable isn’t particularly intuitive or even representative of true data in many cases. Consider set of multiple-choice questions and task of finding out how many different ways (clusters) they have been answered. For such discrete data k-Modes is much better choice. Mode of a dimension will necessarily be a valid choice.
While mean weighs each observation equally, mode weighs highest frequency observation in all. So if there were only two values of a particular dimension in 51% and 49% proportion, mean will give value about half-way while mode will simply give value with 51% proportion. This forces clustering to make strong choices rather than dilly-dallying between all the options. Result is that clusters from k-Modes are more compact than those from k-Means. Read this amazing story of how k-Modes clustering is applied in real life to find a most compatible life-partner on a dating website.
k-Median, as the name implies, uses median as measure of centroid computation. Median is defined as value that’s exactly halfway between all sorted values. Median need not necessarily belong to the dataset, however. For instance, median of 1, 3, 4, and 5 is 3.5. Using median to compute centroid is equivalent to objective function using Manhattan distance rather than Euclidian distance. Square error function with Euclidian distance makes clusters spherical – all data points somewhat closer to centroid, but linear error function with Manhattan distance makes cluster more compact – with most points much closer to centroid and few farther. This is difficult to describe in words but if you compute average distance of an observation from centroid for each dimension, then k-Means will typically result in each dimensional difference smaller but still non-zero, while k-Medians will result in more of dimensional differences zero and some high and non-zero. For those who are familiar, this is similar to L1-norm and L2-norm in regularization of logistic regression equation for robust coefficient estimation. Fig. 1 illustrates 25 data points and their both types of clustering with k=3. Below plots are results of single iteration. Value of k is selected randomly.
Figure 1: k-Means vs k-Medians
k-Medoids has restriction that that centroid must be one of the observations, and not some hypothetical central point. Accordingly, this method has different convergence approach than other variants we discussed here. Firstly, solution is initialized randomly by selecting k points from observations, just like k-Means. Thereafter, instead of centroid computation, model greedily swaps one of the selected observations (current centroid) with non-selected observations such as to reduce objective function (say, Within-Cluster Sum-of-Squares). Algorithm converges when no further swapping will reduce objective and finally selected k points becomes final cluster centroids.
For those interested and familiar with R, above results can be replicated using following functions:
df <- achieve[,c(2,4)]
res2 <- kcca(df, 3, family=kccaFamily("kmedians"))
df2 <- df
df2$C <- res2@cluster
ggplot(df2, aes(x=df2$arith4, y=df2$arith6, fill=df2$C)) + geom_point(aes(color=df2$C)) + labs(title='k-Median', x='', y='')
In three part series on k-Means clustering algorithm we covered nuances and details of algorithm, common pitfalls to be avoided for right results, and different variations which can be applied in different data context. Various simulated data and explanations were provided, including R-tip, for reader to replicate the results. Hope you had fun!