Clustering is a canonical example of un-supervised machine learning methods. Un-supervised, as in, true clusters (segments) don’t exist or aren’t known in advance. Hence method tries to separate observations in different groups without any way to verify if model has done good job or not. There are various ways we can try to measure performance of un-supervised clustering: Within-Cluster-Sum-of-Squares is one, Silhouette Coefficient is another. We talked about both in our previous blog post on selecting right number of clusters for k-Means clustering algorithm.
If we had labels i.e. true cluster number for all training observations, we would have preferred a supervised method. If there were only 2 classes, then binary classification models like Logistic Regression,Scorecard, or Support Vector Machines would’ve suited. If there were more than 2 classes then multi-class models like one-versus-all Logistic Regression, Discrete Choice models, or Neural Network Models would have been better design.
Real life isn’t black and white though. We may sometimes have labels for only handful of observations while we still need to cluster all the observations. This is example of semi-supervised clustering. While we can ignore labels and treat all data as un-supervised, it’s not wise to not make use of the information.We may also build a supervised model only on labelled data and ignore unlabelled data but there may be problems of very few modeling observations, and bias in modeling data (perhaps, because data which got labelled isn’t same as data which didn’t get label).
In this post we will talk about how to undertake semi-supervised clustering.
Why semi-supervised clustering?
In real world, there is more unlabelled data then labelled data, but there is still some labelled data. We want to use all available information for most robust and best performing model. Semi-supervised clustering helps us there.
Generating labels may not be easy or cheap, and hence due to limited resources, we may have labels for only few observations. For example, investigating fraud is expensive so we may know about confirmed fraud or confirmed non-fraud only for limited insurance claims. But not knowing doesn’t mean that those claims cannot be fraud.
Some examples of semi-supervised clustering can be news category classification, as you seen on Google News. There may be some information about a news item being related to ‘politics’ or ‘sports’ but nobody can sift through hundreds of thousands of items every day to create fully labelled data. Similarly, image recognition uses the similar method as you may experience now on Google Photos.
How of semi-supervised clustering?
First thing to note is that labelled information for clusters can come in two forms:
- Known labels for some observations- This is how it sounds. For instance, out of 10,000 observations, you may have labels for 1,000 observations which fall into 3 classes. We know that there are at least 3 clusters, but there could be more.
- Pair-wise constraints for some pair of observations- This implies that we don’t know true clusters of observations, but we know for sure that some pairs must fall into same segment and some must not fall into same segment. These are called must-link and cannot-link constraints, respectively. In practice, this is more common way supervised information is available for clustering data. For instance, I may not be able to differentiate eagle from vulture in photo categorization, but I know that they are different from each other and different from parrot.
Astute reader will note that information on known labels can easily be converted into series of must-link and cannot-link pair-wise constraints without must efforts. Of course, doing otherwise is non-trivial. Hence, for purposes to analytic approach, we can assume that supervised information is available as set of must-link and cannot-link pairs of observations. Now, how do we use this information?
Depending on clustering approach, you can have one of three ways:
- Modify clustering objective function to include reward for must-link pairs and penalty for cannot-link pairs of observations
- Modify pair-wise distance metric to have small (zero) distance for must-link pairs and large (infinite!) distance for cannot-link pairs
- Modify internal search and optimization process to include supervised information
k-Means semi-supervised Clustering
We’ve been talking about k-Means clustering, pre-processing of its data and measuring means for last 3 blog posts. It is also one of the most common and recognized clustering algorithm. How can we improve k-Means algorithm to make use of partial label information?
Mechanics of k-Means do not make search process explicit, but we can use first two approaches. Recall that k-Means uses Within-Cluster-Sum-of-Squares as implicit objective function:
where Yi is centroid for observation Xi. This can be altered as:
where first term is reward for each i,j pair of observations which Must-Link (ML) and are in same cluster, and second term is penalty for each i,j pair of observations which Cannot-Link (CL) and are in same cluster.
So dear reader, which of three approaches will suit semi-supervised clustering of agglomerative Hierarchical Clustering method?