This is the second post in a three-part deep-dive series on k-Means clustering. While k-Means is a simple and popular clustering solution, the analyst must not be deceived by its simplicity and lose sight of the nuances of implementation. In the previous blog post, we discussed various approaches to selecting the number of clusters for k-Means clustering. This post discusses aspects of data pre-processing before running the k-Means algorithm.
This post assumes prior knowledge of the k-Means algorithm. If you aren’t familiar with it, go through the wiki article or any standard textbook first, and then come back here for the deep dive.
It’s useful to follow an example to demonstrate some of the points. Fig. 1 presents a scatter plot of a dummy dataset of 1000 observations. We can see, visually, that there are 3 clusters here.
For our example dataset, if at least two of the initial centroids happen to be chosen in the bottom cluster, then the resulting solution will be as in Fig. 5, which is very far from the true solution!
There is no built-in mechanism to correct for a wrong starting point. Hence care must be taken to run k-Means hundreds of times with different initial seeds and to choose the best segmentation for the given number of clusters. One may use k-Means++ to select good starting points, but there is no substitute for multiple starting points, even though it is costly because many runs are needed.
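To make the point about starting points concrete, here is a minimal sketch in plain Python. Everything in it is an assumption made for illustration: the synthetic data (three well-separated blobs, not the dataset behind the figures), the helper names `lloyd`, `kmeanspp_seeds` and `best_of_restarts`, the use of the within-cluster sum of squared errors (SSE) as the criterion for the "best" run, and the choice of 20 restarts to keep the example quick. It runs k-Means once from a deliberately bad start (two centroids in the bottom cluster) and then compares that with the best of several k-Means++ seeded runs.

```python
import random

def sq_dist(p, q):
    """Squared Euclidean distance between two 2-D points."""
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2

def lloyd(points, centroids, max_iter=100):
    """Plain k-Means (Lloyd's algorithm) from the given starting centroids.
    Returns (final_centroids, sse)."""
    centroids = list(centroids)
    for _ in range(max_iter):
        # Assignment step: group each point with its nearest centroid.
        groups = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: sq_dist(p, centroids[i]))
            groups[nearest].append(p)
        # Update step: move each centroid to the mean of its group.
        # An empty group keeps its old centroid.
        new_centroids = [
            (sum(p[0] for p in g) / len(g), sum(p[1] for p in g) / len(g))
            if g else c
            for g, c in zip(groups, centroids)
        ]
        if new_centroids == centroids:  # nothing moved: converged
            break
        centroids = new_centroids
    sse = sum(min(sq_dist(p, c) for c in centroids) for p in points)
    return centroids, sse

def kmeanspp_seeds(points, k, rng):
    """k-Means++ seeding: each new seed is drawn with probability
    proportional to its squared distance from the nearest existing seed."""
    seeds = [rng.choice(points)]
    while len(seeds) < k:
        weights = [min(sq_dist(p, s) for s in seeds) for p in points]
        seeds.append(rng.choices(points, weights=weights, k=1)[0])
    return seeds

def best_of_restarts(points, k, n_starts=20, seed=0):
    """Run k-Means from several k-Means++ starts and keep the lowest-SSE run."""
    rng = random.Random(seed)
    runs = [lloyd(points, kmeanspp_seeds(points, k, rng))
            for _ in range(n_starts)]
    return min(runs, key=lambda run: run[1])

def make_blobs(centers, n_per_cluster=100, sd=1.0, seed=42):
    """Synthetic stand-in data: one Gaussian blob per center."""
    rng = random.Random(seed)
    return [(rng.gauss(cx, sd), rng.gauss(cy, sd))
            for cx, cy in centers for _ in range(n_per_cluster)]

# One bottom cluster and two upper clusters (hypothetical layout).
points = make_blobs([(0.0, 0.0), (-5.0, 10.0), (5.0, 10.0)])

# Bad start: two centroids inside the bottom cluster, one between the upper two.
bad_start = [(-1.0, 0.0), (1.0, 0.0), (0.0, 10.0)]
_, sse_bad = lloyd(points, bad_start)

_, sse_best = best_of_restarts(points, k=3)

print(f"SSE from the bad start: {sse_bad:.1f}")
print(f"Best SSE over restarts: {sse_best:.1f}")
```

In this sketch the bad start never recovers: the two centroids in the bottom cluster split it between them, while the third is left to cover both upper clusters, so its SSE stays well above the best restart. On real data you would typically use far more restarts, as suggested above, and a library implementation rather than this hand-rolled loop.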
Image courtesy – ‘Machine Learning’ course by Andrew Ng on Coursera, chapter ‘Clustering’.
In the next post, we will cover other algorithms similar to k-Means that can perform better in different circumstances.
*They have an inherent order, for example, low/medium/high (low < medium < high) or school/college/university (school < college < university).
2015 © Edupristine. ALL Rights Reserved.