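Classification and clustering are different tasks: classification is supervised learning, while clustering is unsupervised learning. Clustering uses an algorithm to partition the samples into n groups. Most of these algorithms rely on computing the Euclidean distance between samples.

Overview
With the outputs unknown, clustering looks for intrinsic relationships among the samples based only on the known inputs, and partitions the input samples into groups accordingly.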
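Quantifying similarity
Euclidean distance

For two points $P(x_1,y_1)$ and $Q(x_2,y_2)$ in two dimensions:

$|PQ|=\sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2}$

For two points $P(x_1,y_1,z_1)$ and $Q(x_2,y_2,z_2)$ in three dimensions:

$|PQ|=\sqrt{(x_1-x_2)^2+(y_1-y_2)^2+(z_1-z_2)^2}$

The same pattern carries over to higher dimensions, $P(x_1,y_1,z_1,…)$ and $Q(x_2,y_2,z_2,…)$.

Example samples: Zhang San (1.7, 60), Li Si (1.75, 200), Wang Wu (2.5, 65), Zhao Liu (1.72, 61).

The smaller the Euclidean distance between two N-dimensional samples, the more similar they are, and vice versa. The similarity of two samples is therefore measured by the square root of the sum of the squared differences between their corresponding feature values, that is, by their Euclidean distance.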
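To make the distance measure concrete, the following minimal sketch computes the Euclidean distance from Zhang San to each of the other three samples listed above. The text does not say what the two features are; the comments assume they are height and weight, and the helper name `euclidean` is chosen here for illustration only.

```python
import numpy as np

# Names and values come from the text; the meaning of the two features
# (presumably height and weight) is an assumption.
names = ["Zhang San", "Li Si", "Wang Wu", "Zhao Liu"]
samples = np.array([
    [1.70, 60.0],
    [1.75, 200.0],
    [2.50, 65.0],
    [1.72, 61.0],
])


def euclidean(p, q):
    """Square root of the sum of squared feature differences."""
    return np.sqrt(np.sum((p - q) ** 2))


# Distance from Zhang San (index 0) to each of the other samples
distances = {
    name: euclidean(samples[0], sample)
    for name, sample in zip(names[1:], samples[1:])
}
closest = min(distances, key=distances.get)

for name, d in distances.items():
    print(f"{name}: {d:.4f}")
print("Most similar to Zhang San:", closest)
```

In this data the second feature has a much larger numeric range than the first, so it dominates the distance. In practice features are often rescaled before clustering for that reason.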
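K-means clustering
Step 1: Randomly select k samples as the initial centers of the k clusters. Compute the Euclidean distance from every sample to each cluster center, and assign each sample to the cluster whose center is nearest.
Step 2: Using the partition obtained in step 1, compute the geometric center (the mean) of each cluster and take it as the new cluster center. Repeat step 1 until the newly computed geometric centers coincide, or nearly coincide, with the current cluster centers.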
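The two steps can be sketched directly in NumPy. This is a simplified illustration rather than scikit-learn's implementation; the function name `kmeans`, its parameters `max_iter`, `tol` and `seed`, and the synthetic three-blob data are all assumptions made for the example.

```python
import numpy as np


def kmeans(x, k, max_iter=100, tol=1e-6, seed=0):
    """Plain K-means following the two steps described above."""
    rng = np.random.default_rng(seed)
    # Randomly pick k distinct samples as the initial centers
    centers = x[rng.choice(len(x), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 1: assign every sample to its nearest center
        dists = np.linalg.norm(x[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 2: move each center to the mean of its members
        # (an empty cluster keeps its previous center)
        new_centers = np.array([
            x[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        moved = np.linalg.norm(new_centers - centers)
        centers = new_centers
        # Stop once the centers (nearly) coincide with the previous ones
        if moved < tol:
            break
    return centers, labels


# Synthetic data: three Gaussian blobs (hypothetical, for illustration)
rng = np.random.default_rng(42)
x = np.vstack([
    rng.normal(loc=(0, 0), scale=0.5, size=(50, 2)),
    rng.normal(loc=(5, 5), scale=0.5, size=(50, 2)),
    rng.normal(loc=(0, 5), scale=0.5, size=(50, 2)),
])
centers, labels = kmeans(x, k=3)
print(centers)
```

Because the initial centers are random, a different `seed` can lead to a different final partition.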
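Notes:
The number of clusters k must be known in advance. An evaluation metric can be used to choose the best k.
The initial choice of cluster centers affects the final partition. Prefer initial centers that are far apart from one another.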
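One possible way to act on both notes is sketched below: candidate values of k are compared by their silhouette score (the metric covered later in these notes), and `init='k-means++'` asks scikit-learn to spread the initial centers apart. The synthetic data from `make_blobs`, the candidate range 2 to 7, and the fixed `random_state` are assumptions for the example, not part of the original notes.

```python
import sklearn.cluster as sc
import sklearn.metrics as sm
from sklearn.datasets import make_blobs

# Hypothetical data: four well-separated blobs
true_centers = [(0, 0), (10, 0), (0, 10), (10, 10)]
x, _ = make_blobs(n_samples=400, centers=true_centers,
                  cluster_std=0.5, random_state=0)

scores = {}
for k in range(2, 8):
    # k-means++ favors initial centers that are far from each other;
    # n_init=10 reruns the algorithm and keeps the best result
    model = sc.KMeans(n_clusters=k, init='k-means++',
                      n_init=10, random_state=0)
    labels = model.fit_predict(x)
    scores[k] = sm.silhouette_score(x, labels, metric='euclidean')

best_k = max(scores, key=scores.get)
for k, s in scores.items():
    print(k, round(s, 3))
print('best k:', best_k)
```

On real data the best-scoring k is only a suggestion; it is worth checking the resulting clusters against domain knowledge.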
K-means API:
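```python
import sklearn.cluster as sc

# Build a K-means model; n_clusters is the number of clusters k
model = sc.KMeans(n_clusters=4)
# Fit on the sample matrix x, shape (n_samples, n_features)
model.fit(x)
# Coordinates of the cluster centers
centers = model.cluster_centers_
```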
Example: load multiple3.txt and cluster the samples with K-means.
```python
import numpy as np
import sklearn.cluster as sc
import matplotlib.pyplot as mp

# Load comma-separated samples, one sample per line
x = []
with open('../data/multiple3.txt') as f:
    for line in f.readlines():
        data = [float(substr) for substr in line.split(',')]
        x.append(data)
x = np.array(x)

# K-means clustering
model = sc.KMeans(n_clusters=4)
model.fit(x)
centers = model.cluster_centers_

# Predict a label for every grid point to draw the cluster regions
l, r, h = x[:, 0].min() - 1, x[:, 0].max() + 1, 0.005
b, t, v = x[:, 1].min() - 1, x[:, 1].max() + 1, 0.005
grid_x = np.meshgrid(np.arange(l, r, h), np.arange(b, t, v))
flat_x = np.c_[grid_x[0].ravel(), grid_x[1].ravel()]
flat_y = model.predict(flat_x)
grid_y = flat_y.reshape(grid_x[0].shape)

mp.figure('K-Means', facecolor='lightgray')
mp.title('K-Means', fontsize=20)
mp.xlabel('x', fontsize=14)
mp.ylabel('y', fontsize=14)
mp.tick_params(labelsize=10)
mp.pcolormesh(grid_x[0], grid_x[1], grid_y, cmap='gray')
mp.scatter(x[:, 0], x[:, 1], c=model.labels_, cmap='brg', s=80)
# Mark the cluster centers
mp.scatter(centers[:, 0], centers[:, 1], marker='+', c='gold',
           s=1000, linewidth=1)
mp.show()
```
Image preprocessing: color quantization
```python
import numpy as np
from PIL import Image
import sklearn.cluster as sc
import matplotlib.pyplot as mp

# scipy.misc.imread was removed in SciPy 1.2, so the image is read with
# Pillow instead; convert('L') yields an 8-bit grayscale image
image = np.array(Image.open('../data/lily.jpg').convert('L'))
# One sample per pixel, with the gray level as its only feature
x = image.reshape(-1, 1)


def quantize(n_colors):
    """Replace every pixel with the center of its K-means cluster."""
    model = sc.KMeans(n_clusters=n_colors)
    model.fit(x)
    y = model.labels_
    centers = model.cluster_centers_.squeeze()
    return centers[y].reshape(image.shape)


image4 = quantize(4)
image3 = quantize(3)
image2 = quantize(2)

mp.figure('Image Quantization', facecolor='lightgray')
mp.subplot(221)
mp.title('Original', fontsize=16)
mp.axis('off')
mp.imshow(image, cmap='gray')
mp.subplot(222)
mp.title('4 Colors', fontsize=16)
mp.axis('off')
mp.imshow(image4, cmap='gray')
mp.subplot(223)
mp.title('3 Colors', fontsize=16)
mp.axis('off')
mp.imshow(image3, cmap='gray')
mp.subplot(224)
mp.title('2 Colors', fontsize=16)
mp.axis('off')
mp.imshow(image2, cmap='gray')
mp.tight_layout()
mp.show()
```
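Mean shift clustering
The samples in each cluster are treated as random draws from some probability distribution. A probability model is fitted to the statistical histogram of the known samples, and the peaks of the probability density are taken as the cluster centers. Each sample is then assigned to the cluster center nearest to it, which completes the partition.
1) The number of clusters does not have to be given in advance.
2) The samples are expected to follow some probability distribution that arises from the business domain.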
```python
import numpy as np
import sklearn.cluster as sc
import matplotlib.pyplot as mp

# Load comma-separated samples, one sample per line
x = []
with open('../data/multiple3.txt') as f:
    for line in f.readlines():
        data = [float(substr) for substr in line.split(',')]
        x.append(data)
x = np.array(x)

# Estimate the kernel bandwidth from the samples
bw = sc.estimate_bandwidth(x, n_samples=len(x), quantile=0.1)
# Mean shift clustering
model = sc.MeanShift(bandwidth=bw, bin_seeding=True)
model.fit(x)
centers = model.cluster_centers_

# Predict a label for every grid point to draw the cluster regions
l, r, h = x[:, 0].min() - 1, x[:, 0].max() + 1, 0.005
b, t, v = x[:, 1].min() - 1, x[:, 1].max() + 1, 0.005
grid_x = np.meshgrid(np.arange(l, r, h), np.arange(b, t, v))
flat_x = np.c_[grid_x[0].ravel(), grid_x[1].ravel()]
flat_y = model.predict(flat_x)
grid_y = flat_y.reshape(grid_x[0].shape)

mp.figure('Mean Shift', facecolor='lightgray')
mp.title('Mean Shift', fontsize=20)
mp.xlabel('x', fontsize=14)
mp.ylabel('y', fontsize=14)
mp.tick_params(labelsize=10)
mp.pcolormesh(grid_x[0], grid_x[1], grid_y, cmap='gray')
mp.scatter(x[:, 0], x[:, 1], c=model.labels_, cmap='brg', s=80)
# Mark the cluster centers
mp.scatter(centers[:, 0], centers[:, 1], marker='+', c='gold',
           s=1000, linewidth=1)
mp.show()
```
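To illustrate how a density peak can be located, here is a simplified sketch of the mean shift iteration with a flat (uniform) kernel: a point is repeatedly moved to the mean of the samples inside a window around it until it stops moving. The function name `shift_to_mode`, the bandwidth of 1.5 and the synthetic data are assumptions for this example; scikit-learn's `MeanShift` does more than this, such as choosing seeds and merging nearby peaks.

```python
import numpy as np


def shift_to_mode(start, x, bandwidth, max_iter=100, tol=1e-6):
    """Move a point toward a density peak using a flat kernel."""
    point = np.asarray(start, dtype=float)
    for _ in range(max_iter):
        # Samples inside the window of radius `bandwidth` around the point
        in_window = np.linalg.norm(x - point, axis=1) <= bandwidth
        if not in_window.any():
            break
        # Shift the point to the mean of the samples in the window
        new_point = x[in_window].mean(axis=0)
        converged = np.linalg.norm(new_point - point) < tol
        point = new_point
        if converged:
            break
    return point


# Hypothetical data: two Gaussian blobs around (0, 0) and (10, 10)
rng = np.random.default_rng(0)
x = np.vstack([
    rng.normal(loc=(0, 0), scale=0.3, size=(100, 2)),
    rng.normal(loc=(10, 10), scale=0.3, size=(100, 2)),
])

# Start from one sample of each blob and climb to the local peak
mode_a = shift_to_mode(x[0], x, bandwidth=1.5)
mode_b = shift_to_mode(x[-1], x, bandwidth=1.5)
print(mode_a, mode_b)
```

Samples whose shifted positions end up at the same peak belong to the same cluster, which is why the number of clusters does not have to be specified.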
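Agglomerative hierarchical clustering
Start by treating every sample as a cluster of its own and count the clusters. If the count exceeds the required number of clusters, start from each sample and connect it to the sample nearest to it by Euclidean distance, which enlarges the clusters while reducing their number. Repeat this process until the total number of clusters meets the requirement.
1) There are no cluster centers, so the method suits samples without a pronounced central tendency.
2) No initial cluster centers have to be given.
3) When choosing which samples to merge, the linkage can give priority either to distance or to connectivity.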
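The merging procedure can be sketched naively as follows. This version uses single linkage (the distance between two clusters is the distance between their closest members), which seems closest to the "connect the nearest sample" description, but that choice is an assumption; scikit-learn's `AgglomerativeClustering` uses Ward linkage by default. The function name `agglomerate` and the six sample points are made up for illustration, and the cubic running time makes it suitable only for tiny inputs.

```python
import numpy as np


def agglomerate(x, n_clusters):
    """Naive single-linkage agglomeration, for illustration only."""
    # Every sample starts as its own cluster
    clusters = [[i] for i in range(len(x))]
    dist = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=2)
    while len(clusters) > n_clusters:
        # Find the pair of clusters whose closest members are nearest
        best = (np.inf, 0, 1)
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = dist[np.ix_(clusters[i], clusters[j])].min()
                if d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        # Merge cluster j into cluster i, reducing the count by one
        clusters[i].extend(clusters[j])
        del clusters[j]
    labels = np.empty(len(x), dtype=int)
    for label, members in enumerate(clusters):
        labels[members] = label
    return labels


# Two small, well-separated groups of points (hypothetical data)
x = np.array([[0.0, 0.0], [0.2, 0.1], [0.1, 0.3],
              [5.0, 5.0], [5.2, 5.1], [5.1, 4.8]])
labels = agglomerate(x, n_clusters=2)
print(labels)
```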
```python
import numpy as np
import sklearn.cluster as sc
import matplotlib.pyplot as mp

# Load comma-separated samples, one sample per line
x = []
with open('../data/multiple3.txt') as f:
    for line in f.readlines():
        data = [float(substr) for substr in line.split(',')]
        x.append(data)
x = np.array(x)

# Agglomerative hierarchical clustering
model = sc.AgglomerativeClustering(n_clusters=4)
model.fit(x)

mp.figure('Agglomerative', facecolor='lightgray')
mp.title('Agglomerative', fontsize=20)
mp.xlabel('x', fontsize=14)
mp.ylabel('y', fontsize=14)
mp.tick_params(labelsize=10)
mp.scatter(x[:, 0], x[:, 1], c=model.labels_, cmap='brg', s=80)
mp.show()
```
```python
import numpy as np
import sklearn.cluster as sc
import sklearn.neighbors as sn
import matplotlib.pyplot as mp

# Generate noisy samples along a spiral
n_samples = 500
t = 2.5 * np.pi * (1 + 2 * np.random.rand(n_samples, 1))
x = 0.05 * t * np.cos(t)
y = 0.05 * t * np.sin(t)
n = 0.05 * np.random.rand(n_samples, 2)
x = np.hstack((x, y)) + n

# Distance-first agglomeration (no connectivity constraint)
model = sc.AgglomerativeClustering(linkage='average', n_clusters=3)
y1 = model.fit_predict(x)

# Connectivity-first agglomeration, constrained by a
# 10-nearest-neighbor graph
nb = sn.kneighbors_graph(x, 10, include_self=False)
model = sc.AgglomerativeClustering(linkage='average', n_clusters=3,
                                   connectivity=nb)
y2 = model.fit_predict(x)

mp.figure('Nonconnectivity', facecolor='lightgray')
mp.title('Nonconnectivity', fontsize=20)
mp.xlabel('x', fontsize=14)
mp.ylabel('y', fontsize=14)
mp.tick_params(labelsize=10)
mp.grid(linestyle=':')
mp.scatter(x[:, 0], x[:, 1], c=y1, cmap='brg', s=80, alpha=0.5)

mp.figure('Connectivity', facecolor='lightgray')
mp.title('Connectivity', fontsize=20)
mp.xlabel('x', fontsize=14)
mp.ylabel('y', fontsize=14)
mp.tick_params(labelsize=10)
mp.grid(linestyle=':')
mp.scatter(x[:, 0], x[:, 1], c=y2, cmap='brg', s=80, alpha=0.5)
mp.show()
```
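Evaluating a clustering
A good clustering is dense within each cluster and sparse between clusters. For each sample, compute its internal distance a and its external distance b; the silhouette coefficient of that sample is s=(b-a)/max(a, b). Averaging the silhouette coefficients of all samples gives the silhouette coefficient of the whole sample space, S=ave(s). The value lies between -1 and 1, and values close to 1 indicate compact, well-separated clusters.
Internal distance a: the mean Euclidean distance from a sample to the other samples in its own cluster.
External distance b: the mean Euclidean distance from a sample to all samples in the nearest other cluster.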
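As a check on the definition, the sketch below computes the silhouette coefficient by hand from a, b and s, and compares it with `sklearn.metrics.silhouette_score`. The helper name `silhouette_manual` and the synthetic three-blob data are assumptions for the example.

```python
import numpy as np
import sklearn.cluster as sc
import sklearn.metrics as sm


def silhouette_manual(x, labels):
    """Mean of s = (b - a) / max(a, b) over all samples."""
    dist = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=2)
    unique = np.unique(labels)
    scores = np.zeros(len(x))
    for i in range(len(x)):
        same = labels == labels[i]
        n_same = same.sum()
        if n_same == 1:
            # A cluster with a single sample keeps a score of 0
            continue
        # a: mean distance to the OTHER samples of the same cluster
        # (the self-distance is 0, so only the divisor changes)
        a = dist[i, same].sum() / (n_same - 1)
        # b: smallest mean distance to the samples of any other cluster
        b = min(dist[i, labels == other].mean()
                for other in unique if other != labels[i])
        scores[i] = (b - a) / max(a, b)
    return scores.mean()


# Hypothetical data: three Gaussian blobs
rng = np.random.default_rng(1)
x = np.vstack([
    rng.normal(loc=(0, 0), scale=0.5, size=(30, 2)),
    rng.normal(loc=(4, 4), scale=0.5, size=(30, 2)),
    rng.normal(loc=(0, 4), scale=0.5, size=(30, 2)),
])
labels = sc.KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(x)

manual = silhouette_manual(x, labels)
library = sm.silhouette_score(x, labels, metric='euclidean')
print(manual, library)
```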
```python
import numpy as np
import sklearn.cluster as sc
import sklearn.metrics as sm
import matplotlib.pyplot as mp

# Load comma-separated samples, one sample per line
x = []
with open('../data/multiple3.txt') as f:
    for line in f.readlines():
        data = [float(substr) for substr in line.split(',')]
        x.append(data)
x = np.array(x)

# K-means clustering
model = sc.KMeans(n_clusters=4)
model.fit(x)
centers = model.cluster_centers_

# Silhouette coefficient of the resulting partition
s = sm.silhouette_score(x, model.labels_, sample_size=len(x),
                        metric='euclidean')
print(s)

# Predict a label for every grid point to draw the cluster regions
l, r, h = x[:, 0].min() - 1, x[:, 0].max() + 1, 0.005
b, t, v = x[:, 1].min() - 1, x[:, 1].max() + 1, 0.005
grid_x = np.meshgrid(np.arange(l, r, h), np.arange(b, t, v))
flat_x = np.c_[grid_x[0].ravel(), grid_x[1].ravel()]
flat_y = model.predict(flat_x)
grid_y = flat_y.reshape(grid_x[0].shape)

mp.figure('K-Means', facecolor='lightgray')
mp.title('K-Means', fontsize=20)
mp.xlabel('x', fontsize=14)
mp.ylabel('y', fontsize=14)
mp.tick_params(labelsize=10)
mp.pcolormesh(grid_x[0], grid_x[1], grid_y, cmap='gray')
mp.scatter(x[:, 0], x[:, 1], c=model.labels_, cmap='brg', s=80)
# Mark the cluster centers
mp.scatter(centers[:, 0], centers[:, 1], marker='+', c='gold',
           s=1000, linewidth=1)
mp.show()
```
Density-based clustering with noise (DBSCAN)

```python
import numpy as np
import sklearn.cluster as sc
import sklearn.metrics as sm
import matplotlib.pyplot as mp

# Load comma-separated samples, one sample per line
x = []
with open('../data/perf.txt', 'r') as f:
    for line in f.readlines():
        data = [float(substr) for substr in line.split(',')]
        x.append(data)
x = np.array(x)

# Try several neighborhood radii and score each model
epsilons, scores, models = np.linspace(0.3, 1.2, 10), [], []
for epsilon in epsilons:
    model = sc.DBSCAN(eps=epsilon, min_samples=5)
    model.fit(x)
    score = sm.silhouette_score(x, model.labels_, sample_size=len(x),
                                metric='euclidean')
    scores.append(score)
    models.append(model)
scores = np.array(scores)

# Keep the model with the highest silhouette coefficient
best_index = scores.argmax()
best_epsilon = epsilons[best_index]
print(best_epsilon)
best_score = scores[best_index]
print(best_score)
best_model = models[best_index]
pred_y = best_model.labels_

# Core samples, offset samples (label -1) and periphery samples
core_mask = np.zeros(len(x), dtype=bool)
core_mask[best_model.core_sample_indices_] = True
offset_mask = pred_y == -1
periphery_mask = ~(core_mask | offset_mask)

mp.figure('DBSCAN', facecolor='lightgray')
mp.title('DBSCAN', fontsize=20)
mp.xlabel('x', fontsize=14)
mp.ylabel('y', fontsize=14)
mp.tick_params(labelsize=10)
mp.grid(linestyle=':')
# One color per label; label -1 indexes the last color
labels = set(pred_y)
cs = mp.get_cmap('brg', len(labels))(range(len(labels)))
mp.scatter(x[core_mask][:, 0], x[core_mask][:, 1],
           c=cs[pred_y[core_mask]], s=80, label='Core')
mp.scatter(x[periphery_mask][:, 0], x[periphery_mask][:, 1],
           edgecolor=cs[pred_y[periphery_mask]], facecolor='none',
           s=80, label='Periphery')
mp.scatter(x[offset_mask][:, 0], x[offset_mask][:, 1],
           c=cs[pred_y[offset_mask]], marker='x', s=80, label='Offset')
mp.legend()
mp.show()
```
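The three masks used in the plot above (core, periphery, offset) are easier to follow on a tiny hand-made data set. In the sketch below, the ten points and the values `eps=0.5` and `min_samples=4` are assumptions chosen so that each kind of sample appears; they are unrelated to perf.txt.

```python
import numpy as np
import sklearn.cluster as sc

x = np.array([
    # Dense group A
    [0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1],
    # Within eps of two samples of group A, but with too few neighbors
    # of its own to count as a core sample
    [0.5, 0.05],
    # Dense group B
    [5.0, 5.0], [5.1, 5.0], [5.0, 5.1], [5.1, 5.1],
    # Isolated sample
    [10.0, 0.0],
])

model = sc.DBSCAN(eps=0.5, min_samples=4).fit(x)
labels = model.labels_

# Core samples have at least min_samples samples (themselves included)
# within distance eps
core_mask = np.zeros(len(x), dtype=bool)
core_mask[model.core_sample_indices_] = True
# Offset (noise) samples are labelled -1
offset_mask = labels == -1
# Periphery samples belong to a cluster without being core samples
periphery_mask = ~(core_mask | offset_mask)

print('labels:   ', labels)
print('core:     ', np.flatnonzero(core_mask))
print('periphery:', np.flatnonzero(periphery_mask))
print('offset:   ', np.flatnonzero(offset_mask))
```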
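Nearest neighbors
Code: knnc.py, knnr.py

Regression: linear, ridge, polynomial, decision tree, SVM, KNN. Metric: R2 score.
Classification: logistic, naive Bayes, decision tree, SVM, KNN. Metric: F1 score.
Clustering: K-means, mean shift, agglomerative hierarchical, DBSCAN. Metric: silhouette coefficient.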