Readers are advised to go through k-Nearest Neighbours (k-NN) as algorithm for the better understanding of this blog
k-NN really refers to finding nearest k observations (i.e. neighbours) for each observation (data point) in training set and doing something with those k observations. There are lots of details even in this seemly simple design.
To define near requires defining sense of distance. This is not so easy for categorical dimensions of data, and hence those problems where data instance has anything other than numerical dimensions are not suitable for k-NN. For numeric dimensions, Euclidian is most common distance form, but others like Hamming distance are also possible. In general, L-norm distance can be defined as
where L=1 is known as Manhattan distance, and L=2 is known as Euclidian distance.
Thus finding k-Nearest Neighbours requires computing distance of each observation with each other observation and selecting top k shortest distance.
Since algorithm requires user input in terms of k, selecting that is critical to success of algorithm. Usually, measuring performance of algorithm on cross-validation data can give some indication of suitable k. However, even that requires some k values to test for. In principle, k can be as small as one and as large as number of all training observations, in few thousands to few millions. If running algorithm is time-consuming, testing for so many k’s can be daunting itself. In practice, 3-10 is reasonable number to test for, since benefit of k-NN comes from being locally optimal, going for too high a k defeats the purpose.
Pop Quiz – Everything else being equal, would you prefer odd number or even number for k?
Are all neighbours equal?
k-NN works on principle that other observations closer to observation of interest are most likely to be similar to observation of interest. Hence, if we have k neighbours selected, it may not make sense to treat them equally. Doing so is equivalent of putting hard sharp cut-off at kth neighbour. It is intuitive to think that closer the neighbour is, more likely it will be similar to our observation point. Hence in weighted k-NN contribution of each neighbour is weighted by function inverse of distance of that neighbour. Commonly used functions are 1/d^2 or e^(-d), is distance between observations.
If you do weigh observations by inverse distance function, you may be tempted to think: why limit to k at all! And you are right. One may consider all training observations, and weigh contributions by inverse distance function, in essence giving almost negligible weight to farther points, but not formally drawing sharp boundary, and still retaining ability to capture values from other points.
Too many computations!
At this point, one realizes that even for simple case of k-neighbours, one has to compute full matrix of distance from each point to each other point. That is a lot of calculations. There are two solutions for this problem:
Reduce number of points to compute distance to – Key to this is to realize that all k neighbours will contribute to decisions together. We can easily drop those points from training data whose neighbours have opposing contributions since contribution of that observation will almost always be vetoed by other observations. In fact, each observation has an area of influence, called Voronoi cell , and we can drop those points whose removal does not change Voronoi regions of each target class (for classification problems).
Reduce distance computation efforts – Key to this is to realize that while we are interested in finding k closest points, we don’t actually need actual distance measures. We can break down distance computation in two parts: cheaper, faster calculation for all points, and then precise calculation for shortlisted points. We can select 5k-10k points, for example, from imprecise calculation and then select k from among those. Another method to reduce efforts for distance computation involves clustering observations, and then doing precise calculation only for points in same cluster as the observation of interest.
Curse of Dimensionality
Since finding right set of neighbours is crucial for k-NN algorithms, reader may recall the curse of dimensionality inherent in computing distance in high dimensions. This is a problem which requires careful design and dimensionality reduction.
What to do with those k Observations?
Having selected k observations now is time to use them. There are many applications of this method. They can be used for classification by majority voting (or distance weighted voting), or used for regression as demonstrated in instance based learning . Outlier detection algorithms and collaborative filtering algorithms too work on k-NN design. Since k-NN combines advantage of lazy learning and ability to capture complex decision boundaries, it plays role in many machine learning solutions – Your imagination is only limitation.
(Answer to Pop Quiz – Odd is preferable, to avoid the tie in voting, depending on your use case.)