A similarity measure quantifies how much alike two data objects are. In a data mining context, a similarity measure is usually expressed as a distance whose dimensions represent the features of the objects. A small distance indicates a high degree of similarity, while a large distance indicates a low degree of similarity.

Similarity is subjective and highly dependent on the domain and application. For example, two fruits may be considered similar because of their color, size, or taste. Care should be taken when computing distances across unrelated dimensions/features: the values of each feature must be normalized, or one feature could end up dominating the distance calculation. Similarity is typically measured in the range [0, 1].
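Before applying any of the distance measures below, features are usually rescaled to a common range. A minimal sketch of min-max normalization (the helper name and the sample data are illustrative, not from the original post):

def min_max_normalize(rows):
    # Scale each feature (column) to [0, 1] so that no single feature
    # dominates the distance calculation.
    cols = list(zip(*rows))
    mins = [min(c) for c in cols]
    spans = [(max(c) - lo) or 1.0 for c, lo in zip(cols, mins)]  # avoid division by zero
    return [[(v - lo) / span for v, lo, span in zip(row, mins, spans)] for row in rows]

print(min_max_normalize([[180.0, 75000.0], [165.0, 48000.0]]))  # e.g. height (cm), income (USD)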

Euclidean distance

Euclidean distance is the most commonly used distance measure. In most cases, when people talk about distance, they mean Euclidean distance; it is also known simply as "distance". When the data is dense or continuous, it is usually the preferred proximity measure.

Applications: **where the data is continuous or numerical.** It is also famously known as the L2 norm. For two vectors x and y in an n-dimensional space, the formula is simply the square root of the sum of the squared differences:

$$ d \left( \left[ x _ { 1 } , x _ { 2 } , \ldots , x _ { n } \right] , \left[ y _ { 1 } , y _ { 2 } , \ldots , y _ { n } \right] \right) = \sqrt { \sum _ { i = 1 } ^ { n } \left( x _ { i } - y _ { i } \right) ^ { 2 } } $$

from math import sqrt

def euclidean_distance(x, y):
    # Square root of the sum of squared differences over paired coordinates.
    return sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

print(euclidean_distance([0, 3, 4, 5], [7, 6, 3, -1]))

Scope of use: a general-purpose metric that works comparatively well on continuous, dense vectors. When using this metric, it is best to normalize the data first; one reason is that the calculation involves squaring, and normalization keeps the resulting error from growing too large.

Manhattan distance

The Manhattan distance metric is also known as Manhattan length, rectilinear distance, L1 distance or L1 norm, city block distance, Minkowski's L1 distance, or the taxicab metric.

Applications: **can be used when the dimensions are continuous or numeric.** This distance measure is very similar to the Euclidean distance, but it sums the absolute differences over every dimension rather than the squared differences.

$$ d = \sum _ { i = 1 } ^ { n } \left| x _ { i } - y _ { i } \right| $$

def manhattan_distance(x, y):
    # Sum of absolute coordinate-wise differences.
    return sum(abs(a - b) for a, b in zip(x, y))

print(manhattan_distance([10, 20, 10], [10, 20, 20]))

Scope of use: it can handle outliers, has a feature-selection effect, and may admit multiple optimal solutions (whereas Euclidean distance has a single optimal solution).
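The outlier claim can be seen with the euclidean_distance and manhattan_distance functions defined above (a quick illustration, not from the original post): the same total deviation yields a much larger Euclidean distance when it is concentrated in a single outlier coordinate, while the Manhattan distance treats both cases the same.

a = [1, 1, 1, 1]
outlier = [1, 1, 1, 101]   # all of the deviation sits in one coordinate
even = [26, 26, 26, 26]    # the same total deviation, spread evenly

print(manhattan_distance(a, outlier), euclidean_distance(a, outlier))  # 100 100.0
print(manhattan_distance(a, even), euclidean_distance(a, even))        # 100 50.0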

Minkowski distance

Minkowski distance is the generalized distance metric. When p = 1 and p = 2, it reduces to the Manhattan distance and the Euclidean distance, respectively.

$$ \left( \sum _ { i = 1 } ^ { n } \left| x _ { i } - y _ { i } \right| ^ { p } \right) ^ { 1 / p } $$

from decimal import Decimal

def nth_root(value, n_root):
    # p-th root computed via Decimal for extra precision, rounded to 3 places.
    root_value = 1 / float(n_root)
    return round(Decimal(value) ** Decimal(root_value), 3)

def minkowski_distance(x, y, p_value):
    # Sum |a - b|^p over paired coordinates, then take the p-th root.
    return nth_root(sum(pow(abs(a - b), p_value) for a, b in zip(x, y)), p_value)

print(minkowski_distance([0, 3, 4, 5], [7, 6, 3, -1], 3))

Cosine similarity

Cosine similarity is particularly useful in positive spaces, where the outcome is neatly bounded in [0, 1]. One reason for its popularity is that it is very efficient to evaluate, especially for sparse vectors.

Applications: **used for document similarity, product recommendations, and information retrieval, and it works efficiently with high-dimensional sparse data.** It measures the angle between two data points in the vector space. Given two vectors A and B, the cosine similarity is the dot product of A and B divided by the product of their Euclidean norms.

$$ \text { similarity } = \cos ( \theta ) = \frac { \mathbf { A } \cdot \mathbf { B } } { | \mathbf { A } | | \mathbf { B } | } = \frac { \sum _ { i = 1 } ^ { n } A _ { i } B _ { i } } { \sqrt { \sum _ { i = 1 } ^ { n } A _ { i } ^ { 2 } } \sqrt { \sum _ { i = 1 } ^ { n } B _ { i } ^ { 2 } } } $$

from math import sqrt

def square_rooted(x):
    # Euclidean (L2) norm of a vector, rounded to 3 decimal places.
    return round(sqrt(sum(a * a for a in x)), 3)

def cosine_similarity(x, y):
    # Dot product divided by the product of the two vector norms.
    numerator = sum(a * b for a, b in zip(x, y))
    denominator = square_rooted(x) * square_rooted(y)
    return round(numerator / float(denominator), 3)

print(cosine_similarity([3, 45, 7, 2], [2, 54, 13, 15]))

Cosine similarity vs Euclidean distance

First, in terms of the formula, cosine similarity considers the angle rather than the magnitude, which removes the interference of document length.

Mathematically, it measures the cosine of the angle between two vectors projected in a multi-dimensional space. Cosine similarity is advantageous because two similar documents can still have a high cosine similarity even when they are far apart by Euclidean distance (due to differences in document size).

If the vector representation is generated with tf or tf-idf, cosine similarity is a better choice than Euclidean distance; if the vectors come from word2vec, Euclidean distance may be the better choice. Cosine is mostly used on very sparse, discrete domains such as text, where most dimensions are 0 and do not matter at all.

If you care about magnitude, use Euclidean distance. Euclidean is commonly used on dense, continuous variables, where every dimension matters.

An important point here: cosine is essentially the same as Euclidean on normalized data. In very high dimensions, both measures break down; this is the curse of dimensionality.
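As a quick check of this point (an illustrative sketch, not from the original code): normalize two vectors to unit length and compare. For unit vectors, the squared Euclidean distance equals 2 · (1 − cosine similarity), so ranking by either measure gives the same ordering on normalized data.

from math import sqrt

def l2_normalize(v):
    # Scale a vector to unit L2 norm.
    norm = sqrt(sum(a * a for a in v))
    return [a / norm for a in v]

a = l2_normalize([3, 45, 7, 2])
b = l2_normalize([2, 54, 13, 15])

cos = sum(x * y for x, y in zip(a, b))
sq_euclidean = sum((x - y) ** 2 for x, y in zip(a, b))

print(round(sq_euclidean, 6), round(2 * (1 - cos), 6))  # the two values agree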

Scope of use: the sparse, discrete nature of text makes cosine similarity widely used in NLP, while Euclidean distance is more widely used on images (dense, continuous values).

Jaccard similarity

Jaccard similarity measures how close two sets are: it is simply the ratio of the size of the intersection to the size of the union. **Can be used when the data types are categorical.** Example: products purchased/viewed by customers. It is typically used in product recommendation and in clustering customers based on purchase/engagement patterns. Note that the Jaccard distance is a dissimilarity metric, while the Jaccard coefficient J(A, B) is a similarity metric.

$$ d _ { J } ( A , B ) = 1 - J ( A , B ) = \frac { | A \cup B | - | A \cap B | } { | A \cup B | } $$

def jaccard_similarity(x, y):
    # Size of the intersection divided by size of the union of the two sets.
    intersection_cardinality = len(set(x) & set(y))
    union_cardinality = len(set(x) | set(y))
    return intersection_cardinality / float(union_cardinality)

print(jaccard_similarity([0, 1, 2, 5, 6], [0, 2, 3, 5, 7, 9]))

Scope of use: computations over sets.

Edit Distance

Edit distance is used when comparing strings. Ideal use cases are automatic spell checking, metadata correction, etc. The distance between two strings is smaller when fewer corrections (insertions, deletions, or substitutions) are needed to make them match exactly.

A common dynamic programming (DP) solution:

class Solution(object):
    def minDistance(self, word1, word2):
        """
        :type word1: str
        :type word2: str
        :rtype: int
        """

        m = len(word1)
        n = len(word2)
        # table[i][j] = edit distance between word1[:i] and word2[:j]
        table = [[0] * (n + 1) for _ in range(m + 1)]

        # Base cases: converting a prefix to/from the empty string.
        for i in range(m + 1):
            table[i][0] = i
        for j in range(n + 1):
            table[0][j] = j

        for i in range(1, m + 1):
            for j in range(1, n + 1):
                if word1[i - 1] == word2[j - 1]:
                    # Characters match: no extra edit needed.
                    table[i][j] = table[i - 1][j - 1]
                else:
                    # 1 + min(delete, insert, substitute).
                    table[i][j] = 1 + min(table[i - 1][j], table[i][j - 1], table[i - 1][j - 1])
        return table[-1][-1]
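
A quick usage check of the class above (the example strings are illustrative, not from the original solution):

# Converting "horse" to "ros" needs 3 edits: substitute 'h' -> 'r', delete 'r', delete 'e'.
print(Solution().minDistance("horse", "ros"))  # 3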

Scope of use: distances between strings.

Kullback–Leibler divergence

Kullback–Leibler divergence (KL divergence), also called relative entropy, is a measure of how one probability distribution differs from another.

It can be written as:

$$ D_{\mathrm{KL}}(P \| Q)=-\sum_{x \in \mathcal{X}} P(x) \log \left(\frac{Q(x)}{P(x)}\right) $$

It can also be written as:

$$ D_{\mathrm{KL}}(P \| Q)=\sum_{x \in \mathcal{X}} P(x) \log \left(\frac{P(x)}{Q(x)}\right) $$

The KL divergence is the difference between the cross-entropy and the entropy. The derivation is as follows: $$ \begin{split} D_{\mathrm{KL}}(p \| q) &= H(p, q) - H(p) \\
&= -\int p(x) \log q(x)\, dx - \left(-\int p(x) \log p(x)\, dx\right) \\
&= -\int p(x) \log \frac{q(x)}{p(x)}\, dx = \int p(x) \log \frac{p(x)}{q(x)}\, dx \end{split} $$

Properties:

  • It is not symmetric: swapping P and Q generally changes the value.
  • When the two distributions have disjoint supports, the divergence tends to infinity, so it cannot reflect how far apart the distributions actually are.
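
A minimal sketch of the discrete KL divergence (illustrative, not from the original post), which also shows the asymmetry noted above:

from math import log

def kl_divergence(p, q):
    # D_KL(P || Q) = sum_x P(x) * log(P(x) / Q(x)); assumes p and q are
    # aligned probability vectors, and terms with p(x) = 0 contribute 0.
    return sum(pi * log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.1, 0.4, 0.5]
q = [0.8, 0.15, 0.05]

print(round(kl_divergence(p, q), 4))  # differs from the value below
print(round(kl_divergence(q, p), 4))  # KL divergence is not symmetric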

Review summary

  1. Euclidean distance: the most commonly used; well suited to computation on continuous, dense vectors.
  2. Absolute-value (Manhattan) distance: handles outliers well and has a feature-selection effect (remember L1?).
  3. Cosine similarity: suited to sparse, high-dimensional data such as document similarity, product recommendation, and information retrieval. Cosine vs. Euclidean distance: the former considers the angle rather than the magnitude, so it removes the influence of document length; the former is widely used in NLP, while the latter is widely used on images (dense, continuous values).
  4. Jaccard similarity: intersection over union.
  5. Edit distance: the edit distance between strings.
  6. KL divergence: a measure of the difference between two distributions.