Cluster algorithm

主要介绍K-means，EM算法和KNN 算法。

K-means 算法

K-means 是一种聚类算法，属于无监督学习算法，先说一下什么是聚类：聚类分析是在数据中发现数据对象之间的关系，将数据进行分组，组内的相似性越大，组间的差别越大，则聚类效果越好。

K-Means算法思想：对给定的样本集，事先确定聚类簇数K （超参数），让簇内的样本尽可能紧密分布在一起，使簇间的距离尽可能大。该算法试图使集群数据分为n组独立数据样本，使n组集群间的方差相等，数学描述为最小化惯性或集群内的平方和。K-Means作为无监督的聚类算法，实现较简单，聚类效果好，因此被广泛使用。

算法步骤

创建k个点作为k个簇的起始质心（经常随机选择）。
分别计算剩下的元素到k个簇中心的相异度（距离），将这些元素分别划归到相异度最低的簇。
根据聚类结果，重新计算k个簇各自的中心，计算方法是取簇中所有元素各自维度的算术平均值。
将D中全部元素按照新的中心重新聚类。
重复第4步，直到聚类结果不再变化。
最后，输出聚类结果。

K-Means算法优缺点

优点

原理易懂、易于实现；
当簇间的区别较明显时，聚类效果较好；
Trains quickly

缺点

当样本集规模大时，收敛速度会变慢；
对噪声或者离群点比较敏感
k的取值十分关键，对不同数据集，k选择没有参考性，需要大量实验
初始质心的选择，不同的质心选择可能得到完全不同的结果，因为k-means 可以保证的是局部最优解，而不是全局最优解。针对该缺陷，可以使用 kmeans++ 算法解决。k-means++ 算法选择seeds的基本思路是：初始化聚类中心之间的相互距离要尽可能的远。

时间复杂度 $O(mnkt)$，其中$t$ 为迭代刺手， $k$ 为簇的数目， $m$ 为数据量, $n$ 为数据的维度。对于大数据来说，$t$，$k$和$n$ 都是可以看做常数，那么时间复杂度近似就是 $O(n)$，所以该算法是相当高效的。

空间复杂度 $O(m+k)n$ 其中 $k$ 为簇的数目， $m$为数据量， $n$ 为数据的维度。同理对于空间复杂度也可以这样分析。最后的结论可以简化为：时空复杂度是$O(n)$

Hard clustering: Even though some colors near the boundary are pretty similar, but they are grouped into different clusters. No red/green/blue can cross the line.
Means/latent variables: the labels are added manually based on the means (the cross). K-means will end up with bunches of means and assignments, but how to interpret the means are up to people.

You may notice the issues that k-means have:

Initialization matters: the same k may end up with different outcomes, because initialization matters. It’s also important to break the tie.
Unstable under this uniform-ish circumstance. There is no clear structure to divide them into groups.

Choosing K

The algorithm explained above finds clusters for the number k that we chose. So, how do we decide on that number?

尝试法：计算每个点到最近的簇的距离的总和，如果增加 k 导致的总和下降不明显，那么就接近临界点了。

To find the best k we need to measure the quality of the clusters. The most traditional and straightforward method is to start with a random k, create centroids, and run the algorithm as we explained above. A sum is given based on the distances between each point and its closest centroid. As an increase in clusters correlates with smaller groupings and distances, this sum will always decrease when k increases; as an extreme example, if we choose a k value that is equal to the number of data points that we have, the sum will be zero.

The goal with this process is to find the point at which increasing k will cause a very small decrease in the error sum, while decreasing k will sharply increase the error sum. This sweet spot is called the “elbow point.” In the image below, it is clear that the “elbow” point is at k-3.

使用场景：一般在数据分析的前期使用，选择适当的 $K$，分析不同聚类数据下的特点。

k-means 是以欧式距离作为相似度测量，要求某一初始化聚类中心分类效果最好。算法采用误差平方和作为准侧函数，优化的是该函数。

欧式距离和曼哈顿距离（L1距离）都可以表示最近邻居之间的距离，如何进行选择？前者适合在归一化无量纲的情况下进行使用，两者不具有相互替代性。

代码实现

自己实现了一个 k-means 计算过程；然后使用 sklearn 中实现好的框架。没有比这个写得更好的了。代码链接

代码摘要

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30


## Initialisation

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

df = pd.DataFrame({
    'x': [12, 20, 28, 18, 29, 33, 24, 45, 45, 52, 51, 52, 55, 53, 55, 61, 64, 69, 72],
    'y': [39, 36, 30, 52, 54, 46, 55, 59, 63, 70, 66, 63, 58, 23, 14, 8, 19, 7, 24]
})

np.random.seed(200)
k = 3
# centroids[i] = [x, y]
# 学习开源代码，在于向别人学习，看人家是如何写代码的，怎么写得简单高效，漂亮
centroids = {
    i+1: [np.random.randint(0, 80), np.random.randint(0, 80)]
    for i in range(k)
}
    
fig = plt.figure(figsize=(5, 5))
plt.scatter(df['x'], df['y'], color='k')
colmap = {1: 'r', 2: 'g', 3: 'b'}
for i in centroids.keys():
    plt.scatter(*centroids[i], color=colmap[i])
plt.xlim(0, 80)
plt.ylim(0, 80)
plt.show()

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17


def assignment(df, centroids):
    for i in centroids.keys():
        # sqrt((x1 - x2)^2 - (y1 - y2)^2)
        df['distance_from_{}'.format(i)] = (
            np.sqrt(
                (df['x'] - centroids[i][0]) ** 2
                + (df['y'] - centroids[i][1]) ** 2
            )
        )
    centroid_distance_cols = ['distance_from_{}'.format(i) for i in centroids.keys()]
    df['closest'] = df.loc[:, centroid_distance_cols].idxmin(axis=1)#得到的是id 标识
    df['closest'] = df['closest'].map(lambda x: int(x.lstrip('distance_from_'))) # 这个转换成数字
    df['color'] = df['closest'].map(lambda x: colmap[x]) # 注意lambda 函数 x 是输入参数，: 之后是处理的函数
    return df

df = assignment(df, centroids)
print(df.head())

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26


## Update Stage
import copy
old_centroids = copy.deepcopy(centroids)
def update(k):
    for i in centroids.keys():
        centroids[i][0] = np.mean(df[df['closest'] == i]['x'])
        centroids[i][1] = np.mean(df[df['closest'] == i]['y'])
    return k

centroids = update(centroids)

# 这个可视化的箭头也是做得相当的棒
fig = plt.figure(figsize=(5, 5))
ax = plt.axes()
plt.scatter(df['x'], df['y'], color=df['color'], alpha=0.5, edgecolor='k')
for i in centroids.keys():
    plt.scatter(*centroids[i], color=colmap[i])
plt.xlim(0, 80)
plt.ylim(0, 80)
for i in old_centroids.keys():
    old_x = old_centroids[i][0]
    old_y = old_centroids[i][1]
    dx = (centroids[i][0] - old_centroids[i][0]) * 0.75
    dy = (centroids[i][1] - old_centroids[i][1]) * 0.75
    ax.arrow(old_x, old_y, dx, dy, head_width=2, head_length=3, fc=colmap[i], ec=colmap[i])
plt.show()

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16


# Continue until all assigned categories don't change any more
while True:
    closest_centroids = df['closest'].copy(deep=True)
    centroids = update(centroids)
    df = assignment(df, centroids)
    # 不同的语言之间是相同的，要么是值比较，那么是引用比较，显然这里是值比较
    if closest_centroids.equals(df['closest']):
        break

fig = plt.figure(figsize=(5, 5))
plt.scatter(df['x'], df['y'], color=df['color'], alpha=0.5, edgecolor='k')
for i in centroids.keys():
    plt.scatter(*centroids[i], color=colmap[i]) # 如意可以传入一个list，表示两个参数
plt.xlim(0, 80)
plt.ylim(0, 80)
plt.show()

EM 算法

先验概率和后验概率

事件发生前的预判概率。可以是基于历史数据的统计，可以由背景常识得出，也可以是人的主观观点给出。一般都是单独事件概率，如 $P(x) $, $P(y)$。事件发生后求的反向条件概率；或者说，基于先验概率求得的反向条件概率。条件概率：一个事件发生后另一个事件发生的概率。

先验概率是指根据以往经验和分析得到的概率,它往往作为"由因求果"问题中的"因"出现，比如全概率公式。

后验概率是指依据得到"结果"信息所计算出的最有可能是那种事件发生,如贝叶斯公式中的,是"执果寻因"问题中的"因"，比如贝叶斯公式。

极大似然估计

极大似然估计是用来求解概率分布中参数值。为什么要求解参数值？因为大多数概率分布都是由关键的参数值，得到分布就可以计算概率值，那么进行分类和回归就没有什么问题。常见的概率分布比如二项分布是由 $p $控制，正太分布是由 $\mu$ 和$\sigma$控制。极大似然估计理论认为，概率越大，发生的可能性就越大。

该算法背景：

有一个独立同分布的样本集 D，且样本从数据分布 $p(x | \theta)$ 中抽取
想要计算 $\theta$

解决思路：

假设 D 中所有的样本都是独立从 $ p(x | \theta)$ 中抽取，那么：

$$ P({X=x^{1}, X=x^{2}, \cdots, X=x^{n}})=\prod_{i=1}^{n} p(x^{i} | {\theta}) $$

记等号后面的式子为似然函数

因为乘积不方便处理，所以上面式子中左右两边求对数

$$ \ln l(\boldsymbol{\theta})=\ln \prod_{i=1}^{n} p\left(x^{i} | \boldsymbol{\theta}\right)=\sum_{i=1}^{n} \ln p\left(x^{i} | \boldsymbol{\theta}\right) $$

根据算法的思想，我们的目标就是最大化 $L(\theta)$，最大化的 $\theta$ 就是我们要求的。如何最大化 $L(\theta)$？一般来说，如果如果只有一个参数，比如二项分布，那么求导令导数为0 就可以求$\theta$。如果有多个参数，比如正太分布，那么就需要求解偏导，令偏导数为0.

极大似然估计的局限性

需要事先假定数据分布
假定的数据分布和真实的数据分布不一致的时候，容易出现较大的误差

介绍完了最大似然估计，那么后面是真正 EM （Expectation Maximization Algorithm）算法。

EM算法解决这个的思路是使用启发式的迭代方法，既然我们无法直接求出模型分布参数，那么我们可以先猜想隐含数据（EM算法的E步），接着基于观察数据和猜测的隐含数据一起来极大化对数似然，求解我们的模型参数（EM算法的M步)。由于我们之前的隐藏数据是猜测的，所以此时得到的模型参数一般还不是我们想要的结果。不过没关系，我们基于当前得到的模型参数，继续猜测隐含数据（EM算法的E步），然后继续极大化对数似然，求解我们的模型参数（EM算法的M步)。以此类推，不断的迭代下去，直到模型分布参数基本无变化，算法收敛，找到合适的模型参数。从上面的描述可以看出，EM算法是迭代求解最大值的算法，同时算法在每一次迭代时分为两步，E步和M步。一轮轮迭代更新隐含数据和模型分布参数，直到收敛，即得到我们需要的模型参数。一个最直观了解EM算法思路的是K-Means算法，见之前写的K-Means聚类算法原理。在K-Means聚类时，每个聚类簇的质心是隐含数据。我们会假设$𝐾$个初始化质心，即EM算法的E步；然后计算得到每个样本最近的质心，并把样本聚类到最近的这个质心，即EM算法的M步。重复这个E步和M步，直到质心不再变化为止，这样就完成了K-Means聚类。

EM算法能保证收敛吗？有充分的理论说明，这个是能够保证收敛的。
EM算法如果收敛，那么能保证收敛到全局最大值吗？
不能保证，这个取决于初始化。初始值不同，那么最后的结果不同。所以这个是给定初始值，经过循环迭代，最终逼近真实值。

如果要讲解 EM 算法，可以认为 K-means 是其一个特例，那么进行理解。

PROPERTIES

Soft clustering: can cluster data with no clear group structure; easy to interpret responsibility as probability or componence percentage.
Convergence: non-decreasing log likelihood indicates that with more iterations, EM is guaranteed to get a better result. I.e., it is guaranteed to converge to one of local optima.
More computation & risks: It requires moooore computation than k-means, moooore iterations to converge, but still possible to get stuck in local optima. When that happens, there is something you can do:
Cry
Random restart

Relationship btwn k-means & EM

There is a close similarity between k-means algorithm and EM algorithm for GMM. The first way to understand is from the two-stage update process. Both of the algorithms share an expectation stage and a maximization stage. The second way is we can derive the k-means as a particular limit EM for GMM. The key is to make the soft assignment to be a hard one. We can simply use a \arg \max to force the probabilities to be binary, or consider a GMM setting with covariance matrix to be $\epsilon$ I where $\lim\epsilon \rightarrow 0$.

KNN

K最近邻(k-Nearest Neighbor，KNN) 是有监督分类学习，根据K 个最近邻的类别信息，通过投票的方式决定刚进来的数据点的类别。和KNN 容易相混淆的是K-means算法，具体可以参考上面的描述。KNN 中的K 表示K个最近邻是有投票权的，根据K 个最近邻的投票然后决定新加入的点类别信息。

In short, the algorithms are trying to accomplish different goals. K-nearest neighbor is a subset of supervised learning classification (or regression) algorithms (it takes a bunch of labeled points and uses them to learn how to label other points). It is supervised because you are trying to classify a point based on the known classification of other points. In contrast, K-means is a subset of unsupervised learning clustering algorithms (it takes a bunch of unlabeled points and tries to group them into clusters). It is unsupervised because the points have no external classification.

The $ k $ in each case mean different things. In K-NN, the $ k $ represents the number of neighbors who have a vote in determining a new player’s position. The $ k $ in K-means, determine the number of clusters we want to end up.

In a K-NN algorithm, a test sample is given as the class of majority of its nearest neighbours. For example, if we have three classes and the goal is to find a class label for the unknown example $ x_j $ then, by using the Euclidean distance and a value of $ k=5 $ neighbors, the unknown sample is classified to the category of the most voted neighbors.

How it works? Step 1: Determine the value for K Step 2: Calculate the distances between the new input (test data) and all the training data. The most commonly used metrics for calculating distance are Euclidean, Manhattan and Minkowski Step 3: Sort the distance and determine k nearest neighbors based on minimum distance values Step 4: Analyze the category of those neighbors and assign the category for the test data based on majority vote Step 5: Return the predicted class

The situation with K-means is that, given some data you will cluster them in k-groups or clusters. The initial step of the algorithm is to randomly spawn $ k $ centroids (centers). At every iteration the center of each cluster is moved slightly to minimize the objective function. The algorithm will terminate if the iterations are maximized or if the centroids stop to move.

The objective function of K-means is $ J = \sum_{j=1}^{k}\sum_{i=1}^{n}\left |x_i^{j}-c_j \right |^{2} $

How it works? Step 1: Determine K value by Elbow method and specify the number of clusters K Step 2: Randomly assign each data point to a cluster Step 3: Determine the cluster centroid coordinates Step 4: Determine the distances of each data point to the centroids and re-assign each point to the closest cluster centroid based upon minimum distance Step 5: Calculate cluster centroids again Step 6: Repeat steps 4 and 5 until we reach global optima where no improvements are possible and no switching of data points from one cluster to other.

　上图a表达了初始的数据集，假设k=2。在图b中，我们随机选择了两个k类所对应的类别质心，即图中的红色质心和蓝色质心，然后分别求样本中所有点到这两个质心的距离，并标记每个样本的类别为和该样本距离最小的质心的类别，如图c所示，经过计算样本和红色质心和蓝色质心的距离，我们得到了所有样本点的第一轮迭代后的类别。此时我们对我们当前标记为红色和蓝色的点分别求其新的质心，如图4所示，新的红色质心和蓝色质心的位置已经发生了变动。图e和图f重复了我们在图c和图d的过程，即将所有点的类别标记为距离最近的质心的类别并求新的质心。最终我们得到的两个类别如图f。

聚类算法是最常见的无监督学习算法，而K-Means算法是最常见的聚类算法

EM算法是一个迭代式的算法，分为两个步骤：E步和M步；EM算法也可以视为将目标函数作坐标上升的过程，每个步骤都是固定一个参数并估计另一个参数

EM算法和K-Means算法的迭代过程比较类似，不同的是K-Means算法中每次对参数的更新是硬猜测，而EM中每次对参数的更新是软猜测；相同的是，两个算法都可能得到局部最优解，采用不同的初始参数迭代会有利于得到全局最优解

Nearest Neighbors

(摘录于官网网站)

无监督学习

sklearn.neighbors provides functionality for unsupervised and supervised neighbors-based learning methods. Unsupervised nearest neighbors is the foundation of many other learning methods, notably manifold learning and spectral clustering. Supervised neighbors-based learning comes in two flavors: classification for data with discrete labels, and regression for data with continuous labels.

Nearest Neighbors Classification

Neighbors-based classification is a type of instance-based learning or non-generalizing learning: it does not attempt to construct a general internal model, but simply stores instances of the training data. Classification is computed from a simple majority vote of the nearest neighbors of each point: a query point is assigned the data class which has the most representatives within the nearest neighbors of the point.

Nearest Neighbors Regression

Neighbors-based regression can be used in cases where the data labels are continuous rather than discrete variables. The label assigned to a query point is computed based on the mean of the labels of its nearest neighbors.

Nearest Neighbor Algorithms

Fast computation of nearest neighbors is an active area of research in machine learning. The most naive neighbor search implementation involves the brute-force computation of distances between all pairs of points in the dataset: for N samples in D dimensions, this approach scales as O[DN2]. Efficient brute-force neighbors searches can be very competitive for small data samples. However, as the number of samples N grows, the brute-force approach quickly becomes infeasible. In the classes within sklearn.neighbors, brute-force neighbors searches are specified using the keyword algorithm = 'brute', and are computed using the routines available in sklearn.metrics.pairwise.

有三种算法：

Brute Force

K-D Tree

Ball Tree

Neighborhood Components Analysis

Combined with a nearest neighbors classifier (KNeighborsClassifier), NCA is attractive for classification because it can naturally handle multi-class problems without any increase in the model size, and does not introduce additional parameters that require fine-tuning by the user.

Dimensionality reduction （在项目中使用的就是该技术）

NCA can be used to perform supervised dimensionality reduction. The input data are projected onto a linear subspace consisting of the directions which minimize the NCA objective.

（可以进一步比较一下 NCA、PCA 和t-sne 的区别和联系）

参考资料：

Nearest Neighbors

文章目录

K-means 算法

EM 算法

KNN

Nearest Neighbors