January 21, 2016
In this post, we will talk about measuring distance for categorical observations. Categorical dimensions can always be translated into numeric dimensions, and numeric distance metrics continue to be meaningful. However, for purely categorical observations there are some special metrics which can be used.
There are two cases for purely categorical data: where number of dimensions is not constant across observations, and where they are. Example of former is text documents where number of words is number of dimensions in each document. Finding distances among documents is one the most common tasks in text analytics and Natural Language Processing. Example of later is from bio-informatics where gene-sequence is constant length categorical sequence of genotypes.
For purpose of demonstrations in this post, we shall use following three sentences as three observations among which we want to compute distances. Metrics discussed shall apply equally well to cases where number of dimensions are constant.
Let xiw refer to count of word w in sentence i. Using “bag of word” approach (That is, we ignore order of words in the sentence) we can transform these sentences into following vectors:
|“I think, therefore I am”||2||1||1||1||0||0||0||0||0|
|“Can you think?”||0||1||0||0||1||1||0||0||0|
|“I don’t think, therefore I don’t know who I am”||3||1||1||1||0||0||2||1||1|
Cosine similarity is measure of number of common words in two observations, scaled by length of sentences. Cosine distance is computed as
Cosine distance between sentence 1 and sentence 2 is computed as…
Number of common words: 1 (“think”)
Length of sentence 1: 4 (“I” repeated twice)
Length of sentence 2: 3
Thus sentence 1 and 3 are closest, but sentence 1 is closer to sentence 2 than sentence 3.
Tanimato coefficient extends idea of Cosine distance and changes the normalization figure in denominator. Tanimato distance is computed as
Tanimato distance between sentence 1 and 2 is
Jaccard distance is one minus Jaccard similarity, which is number of common words in two sentences divided by total number of unique words. This is particularly popular distance metric for text comparison, because it is fairly fast to compute as we are essentially counting number of common and unique words without complex mathematical computation.
Sorensen–Dice coefficient is variation of Jaccard’s and computed as
Hamming distance is most commonly used for equal length documents, and is equal to number of places changes are required to convert one document into another. This is computationally expensive metric as it also takes into account order of words. Sentence 1 can be converted into sentence 2 in following series of operations –
|–||I think therefore I am||0|
|Drop first “I”||think therefore I am||1|
|Drop “therefore”||think I am||2|
|Drop second “I”||think am||3|
|Insert “can”||Can think||5|
|Insert “you”||Can you think||6|
Thus, hamming distance d12 = 6, and d13 = 4 (insert “don’t”, “don’t”, “know”, “who”) and d23 = 11 (drop “can”, “you”; insert “I”, “don’t”, “therefore”, “I”, “don’t”, “know”, “who”, “I”, “am”).
String Edit distance is generalization of Hamming distance and is O(n2)computational operation. While Hamming distance permits only ‘insert’ and ‘drop’ operation, String Edit distance permits third operation of ‘replace’ which is equivalent to one ‘insert’ and ‘drop’ combined. Further, Hamming distance computation weighs each operation (insert or drop) equally, String Edit distance can weight insert, replace, and delete separately, in counting number of operations. However, commonly applied variant works with equal weight for all three actions. String Edit distance between Sentence 1 and sentence 2 is 5 as shown in table below.
|–||I think therefore I am||0|
|Replace first “I” with “you”||you think therefore I am||1|
|Drop “therefore”||you think I am||2|
|Drop second “I”||you think am||3|
|Drop “am”||you think||4|
|Insert “can”||Can you think||5|