Machine Learning methods can be classified into two broad categories: supervised and un-supervised. Supervised learning learns from labelled set of observations, where observations are known to belong to certain classes (for classification problems) or have certain values (regression problem). Un-supervised learning learns from unlabelled set of observations, where nothing else is known apart from observations themselves.
For supervised learning methods, we essentially say that “look at this ‘true’ data and tell me know to know ‘truth’ of unseen data”. For un-supervised learning methods, this is equivalent to “look at this data, and tell me something interesting I don’t know.” While clear dichotomy is useful, in this post we will talk about interesting variants where defining target can become itself a very interesting task!
This is middle ground between supervised and un-supervised data, where ‘true’ labels exist only for some of the observations but not all. Ignoring information at hand is injustice to quality of analytic models, but using this information can make problem unusual. In the world where data generation is easy – think of internet – and labeling is expensive, many problems fall in semi-supervised domain.
One approach could be to not include label information for modeling but only for model validation and performance comparison. For instance, we can segment data into k clusters using un-supervised clustering and then verify competence of our model by comparing predicted cluster to actual cluster. This may help us decide better among multiple clustering solutions.
Another approach could be to use this information for modeling. We discussed about semi-supervised clustering in previous blog post.
Discrete versus Continuous
While many situations obviously fall into classification or regression categories – where ‘true’ value is discrete class or a continuous value – there are instances where target transformation makes sense.
Consider customer level revenue as function of customer’s demographic and past transaction behaviour. This is common enough problem in retail, bank, insurance and telecom industries. Revenue, often of last N months, is obviously continuous number. A (linear, perhaps) regression may fit here, you say! But there are more complex and sophisticated methods available for classification problems, like Neural Networks, which can potentially do better job but won’t apply to continuous target. Depending on business context, you may not need to predict actual revenue but only if revenue will be high, medium or low. If so, you can review distribution of revenue for all customers and define boundaries for high-medium and medium-low, and voila! You have converted a continuous target problem into three-class discrete classification problem.
If your continuous target is a percentage – say fraction of students passing the exam by schools – and you want to predict percent as well for new/unseen school, then you have another option apart from regression model. You may parcel the data! Parceling converts continuous fractions into binary classes by replicating the observations. For an observation with target value of 40%, you replicate the observation 100 times and for 40 of them assign class of 1 and for remaining 60 assign class of 60. Given observation is same and feature set is same, model will try to differentiate 1 from 0 and conclude that this kind of observation is 40% likely to be 1 and 60% likely to be 0. You can do same for all observations. (Yes, this will increase size of training data manifold.) Most common classification models anyway produce probability of class=1 as outcome, and that’s result you want anyway!
Could there be case for converting discrete classes into continuous value? Mostly no, but there are examples. When you are trying to predict someone’s age in years – which is essentially a discrete integer from 0-100 (or so) – you can treat age as continuous target. Similarly, if you have granular enough income categories, say, 0-50k, 50k-100k, 100k-150k, etc. then you may benefit from treating income as continuous variable rather than solving 20-class problem.
Case of No Class
All data is either labelled or not labelled. When data is labelled (assume, binary), it either belongs to one class or another. But in practice, data may not belong to any class.
For instance, among bunch of insurance claims, you may know for sure certain claims to be fraud, and others not to be fraud, but you may not know about many which were never investigated. Consider, among credit card applicants who are accepted or and who were rejected as part of application scoring model in banking. But there are lot of customers who didn’t apply at all, and you don’t know whether they would have been accepted or not. Often, we cannot deal with data with no label, and we must exclude them from our development population. However, we must keep in mind if this induces bias in modeling. In insurance example above, perhaps claims which were investigated were suspicious to begin with (even those found not-fraud), or in banking example, customers who didn’t apply weren’t solicited by sales force which excluded a demographic category altogether.
If you do target transformation as described in previous section, you may create a no-class data yourself. Suppose you define revenue of over 5000/- per year a high and less than 5000/- per year a low to convert continuous target into binary target. This will, however, be a bad design, because there is arbitrary cut-off at 5000. A customer with revenue of 4999/- isn’t much different from that with revenue of 5001/- yet you put them worlds apart. Your definition will imply that revenue of 5001/- is more similar to revenue of 10000/- than of 4999/-. You see the problem? You can do so, technically, but resulting model will not be good and robust since you are asking it to learn differences from similar customers.
What’s usually a good practice is to include a buffer/no-class zone. So you may define revenue of 6000/- or more as high and 4000/- or less as low, and ignore the observations with revenue between 4000/- and 6000/-. While you lose some data, model will do much better job since what you define to be different is really different.
While lots of focus is deservedly on data preparation, feature generation, and machine learning method, defining right target can also be useful to quality of overall analytic outcome.