January 9, 2016
Anyone cursorily familiar with analytic techniques would have noticed plenty of algorithms relying on distances among data points for their application. Each observation, or data instance, is usually represented as multi-dimensional vector, and input to algorithm requires distances between each pair of such observations.
Distance computation method depends on type of data – numerical, categorical, or mixed. Some of the algorithms apply to only one class of observations, while others work on multiple. In this post, we will discuss distance measures which work on numerical data. There are perhaps more ways distance can be measured in multi-dimensional hyperspace than those can be covered in single blog post, and one can always invent newer ways, but we look into some of the common distance metrics and their relative merits.
For purpose of rest of the blog post, we imply
to refer to two observations or data vectors.
Before we review different distance metrics, we need to prepare the data:
Transformation to numeric vector
For mixed observation, which contains both numerical and categorical dimensions, first step is to actually transform categorical dimension into numerical dimension(s). A categorical dimension with three potential values can be turned into two or three numerical dimensions with binary values. Since this categorical variable necessarily takes one of three values, one of three numerical dimensions will be perfectly correlated with other two. This may or may not be okay depending on your application.
If observation is purely categorical, such as text string (varying length sentences) or genome sequence (fixed length sequences), then some special distance metric can directly be applied without transforming data into numeric format. We will discuss these algorithms in next post.
Depending on your use case, you may want to normalize each dimension on same scale, so that distance along any one dimension doesn’t unduly influence overall distance between observations. The same thing was discussed in k-Means algorithm . There are two kinds of normalization possible:
Range normalization (rescaling) normalizes data to be in 0-1 range, by subtracting minimum value from each dimension and then dividing by the range of values in that dimension.
First problem with range normalization is that an unseen value may be normalized beyond 0-1 range. Though, this is generally not a concern for most distance metrics, but if algorithm cannot handle negative values then this can be problem. Second problem is that this is highly dependent on outliers. If one observation has very extreme (high or low) value for a dimension, normalized value for that dimension for other observations will be huddled together and lose their discriminative powers.
Standard normalization (z-scaling) normalizes dimension to have 0 mean and 1 standard deviation, by subtracting mean from that dimension of each observation and then dividing by standard deviation of value of that dimension across all observations.
This generally keeps data in -5 to +5 range, roughly, and avoids influence of extreme value.
We have simulated z-scaling of two observations. Simulated, because we really need many more than two observations to compute mean and standard deviation of each dimension, and we have assumed both of these numbers for each dimension here.
Euclidean distance – aka “as the crow flies” distance – is shortest distance in multi-dimensional hyperspace between two points. You are familiar with this in 2D plane or 3D space (this is a line), but similar concept extends to higher dimensions. Euclidean distance between vectors in n-dimensional space is computed as
For transformed data vector examples, this is
This is most common metric and often very suitable for most applications. A variant of this is squared-Euclidean distance, which is just sum of squared differences.
Manhattan distance – named because of East-West-North-South grid like structure of streets of Manhattan in New York – is distance between two points when traversing parallel to the axes.
This is computed as
This may be useful in some application where distance is used in real, physical sense rather than machine learning sense of “dissimilarity.” For instance, if you need to calculate distance taken by fire-truck to reach a point then using this is more practical.
Canberra distance is weighted variant of Manhattan distance, and is computed as
L-norm distance is extension of above two – or you can say that above two are specific cases of L-norm distance – and is defined as
where L is a positive integer. I’ve not come across any cases where I needed to use this, but this is still good to know possibility. For instance 3-norm distance will be
Do note that L should generally be even integer since we don’t want positive or negative distance contributions cancelling out.
Minkowski distance is generalization of L-norm distance, where L could take any value from 0 to including fractional values. Minkowski distance of order p is defined as
Cosine distance is measure of angle between two vectors, each representing two observations, and formed by joining data point to origin. Cosine distance ranges from 0 (exactly same) to 1 (no connection), and is computed as
While this is more common distance measure when working with categorical data, this can also be defined for numerical vector. For our numeric vectors, this will be
You knew this was coming, don’t you? If analytics was just bunch of mathematical formulas, we won’t need smart folks like you to do it.
First thing to note is that distances computed by different metrics are different. You may be tempted to think that Cosine distance of 1.3 is smallest and hence indicates vectors are closest but this is not right way to interpret. Distances across different methods cannot be compared, and only distances between different pairs of observations under same method can be compared. Distances have relative meaning and no absolute meaning by themselves.
This leads to next question of how to select right distance metric. Unfortunately, there is no true answer. Depending on type of data, context, business problem, application, and model training method, different metric give different results. You will have to use judgment, make assumptions, or test model performance to decide on right metric.
Second caveat is my often repeated one about curse of dimensionality . In higher dimensions, distances don’t behave the way we intuitively think they do, and analyst must be extremely cautious when using any metric.
Third caveat is about relationship between distances among three observations. Some metrics support triangle inequality and while others don’t. Triangle inequality implies that it is always shortest to go from point i to point j directly, rather than via any intermediate point k. Mathematically,
Depending on your application, this may or may not be required property of distance metric.
Oh, one more thing, “distance” is opposite of “similarity.” Higher the distance, lower the similarity, and vice-versa. Clustering algorithms work on distances, and recommendation algorithms work on similarity, but essentially they are talking about same thing.
So, how can you transform distance number into similarity number?