This is second post in three-part series on deep-dive into k-Means clustering. While k-Means is simple and popular clustering solution, analyst must not be deceived by the simplicity and lose sight of nuances of implementation. In previous blog post, we discussed various approaches to selecting number of clusters for k-Means clustering. This post will discuss aspects of data pre-processing before running the k-Means algorithm.
This post assumes prior knowledge of k-Means algorithm. If you aren’t familiar then go through wiki-article or any standard text-book to understand and then come back here for deep-dive.
It’s useful to follow an example to demonstrate some of the points. Fig. 1 presents a scatter plot of 1000 observations dummy-data. We can see, visually, that there are 3 clusters here.
Figure 1: Example Data (with True Clusters)
Prepping the Data
- Missing value Handling – k-Means clustering just cannot deal with missing values. Any observation even with one missing dimension must be specially handled. If there are only few observations with missing values then these observations can be excluded from clustering. However, this must have equivalent rule during scoring about how to deal with missing values. Since in practice one cannot just refuse to exclude missing observations from segmentation, often better practice is to impute missing observations. There are various methods available for missing value imputation which can be a blog post in itself, but care must be taken to ensure that missing imputation doesn’t distort distance calculation implicit in k-Means algorithm. For example, replacing missing age with -1 or missing income with 999999 can be misleading!
- Categorical variables – k-Means uses distance computation at core of its algorithm, and hence cannot handle categorical variable directly. There are two ways this can be handled:
- If categorical variables are ordinal variables*, then they can be replaced with arithmetic sequence of appropriate difference. Say, small/large can be replaced by 5/10. Because of unit normalization, difference doesn’t really matter.
- If categorical variables are cardinal variables**, then they must be translated into as many binary numerical variables as number of categorical classes. Say, one cardinal variable with three classes (viz. car/bus/bike) can be converted into three binary variables.
- Data normalization – Distance computation in k-Means weights each dimension equally and hence care must be taken to ensure that unit of dimension shouldn’t distort relative near-ness of observations. Fig. 2 shows how relative distances can alter when units are changed for three 2-d observations. Common method is to unit-normalize each dimension individually. Even for simple dummy data from Fig. 1, clustering results may differ with and without unit normalization, as show in Fig. 3 below where one observation is classified differently.
Figure 2: Problem of Units
Figure 3: Clustering without (top) and with (bottom) Unit Normalization
- Curse of dimensionality – As explained in previous blog post, any more than few tens of dimensions mean that distance interpretation isn’t obvious and must be guarded against. Appropriate dimensionality reduction techniques and distance measure must be used.
- Random initialization – k-Means clustering is prone to initial seeding i.e. random initialization of centroids which is required to kick-off iterative clustering process. Bad initialization may end up getting bad clusters. Fig. 4 shows one example where for if initial seed happens to be one each in red, blue and green regions then we get true clustering (left image), but we may get wrong clustering (middle, right images) at other initial location.
For our example dataset, if all at least two of initial centroids chosen happened to be in bottom cluster then resulting solution will be as in Fig. 5 which is very far from true solution!
There is no built-in mechanism to correct for initial wrong starting point. Hence care must be taken to ensure that hundreds of k-Means are runs with different initial seeds and segmentation chosen with best for given . One may use k-Means++ for selecting good starting points but there is no substitute for multiple starting points though it’s costly since multiple iterations need to be run.
Figure 4: Effect of Random Seeding
Image courtesy – ‘Machine Learning’ course by Andrew Ng on Coursera, chapter ‘Clustering’.
Figure 5: Bad 3 Clusters due to Initial Bad Starting Points
In next post, we will cover other algorithms similar to k-Means which can be better in different circumstances.
*They have inherent order, for example, low/medium/high (low < medium < high) or school/college/university (school < college < university).
**There don’t have inherent order, for example, red/blue/green or owner/renter.