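9.7 The Mean-Shift Clustering Algorithm
Mean-shift clustering does not require the number of clusters to be known in advance, and it places no restriction on the shape of the clusters. In scikit-learn it is provided by the MeanShift class in sklearn.cluster, whose main parameters are:
MeanShift(bandwidth=None, seeds=None, bin_seeding=False, min_bin_freq=1, cluster_all=True)
For a detailed description of each parameter, see the official scikit-learn documentation.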
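To make the claim about the cluster count concrete, here is a minimal sketch on synthetic data. The blob centers, cluster_std, and the fixed bandwidth=3.0 are assumptions chosen for this illustration and are not part of the case study that follows. No cluster count is passed to MeanShift; only the bandwidth is.

```python
import numpy as np
from sklearn.cluster import MeanShift
from sklearn.datasets import make_blobs

# Synthetic data (assumed for illustration): three well-separated 2-D blobs
centers = [[0, 0], [10, 10], [-10, 10]]
X, _ = make_blobs(n_samples=150, centers=centers, cluster_std=0.5, random_state=0)

# The number of clusters is never specified; only the kernel bandwidth is
ms = MeanShift(bandwidth=3.0, bin_seeding=True)
ms.fit(X)

n_found = len(ms.cluster_centers_)
print("clusters found:", n_found)
print(np.round(ms.cluster_centers_, 1))
```

The bandwidth plays the role that the cluster count plays in k-means: a smaller bandwidth tends to produce more, tighter clusters, and a larger one tends to merge them.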
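【Medical Case 9-2】A breast tissue electrical impedance spectroscopy dataset contains 106 cases (source: the open UCI repository). Each case has 9 features, listed in Table 9-1; the diagnosis falls into 6 classes, listed in Table 9-2; the layout of the data is shown in Table 9-3. Apply the mean-shift clustering algorithm to the data, and display the clustering results together with the diagnoses.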
Table 9-1 Features of the breast tissue electrical impedance spectra
| Feature | Description |
| --- | --- |
| I0 | Impedivity (ohm) at zero frequency |
| PA500 | phase angle at 500 kHz |
| HFS | high-frequency slope of phase angle |
| DA | impedance distance between spectral ends |
| AREA | area under spectrum |
| A/DA | area normalized by DA |
| MAX IP | maximum of the spectrum |
| DR | distance between I0 and real part of the maximum frequency point |
| P | length of the spectral curve |
Table 9-2 Diagnostic classes of the breast tissue electrical impedance spectra
| CLASS | DESCRIPTION | NUMBER |
| --- | --- | --- |
| Car | Carcinoma | 21 |
| Fad | Fibro-adenoma | 15 |
| Mas | Mastopathy | 18 |
| Gla | Glandular | 16 |
| Con | Connective | 14 |
| Adi | Adipose | 22 |
Table 9-3 Breast tissue electrical impedance spectroscopy data
| Case | Class | I0 | PA500 | HFS | DA | Area | A/DA | Max IP | DR | P |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | car | 524.794 | 0.187 | 0.032 | 228.800 | 6843.598 | 29.911 | 60.205 | 220.737 | 556.828 |
| 2 | car | 330.000 | 0.227 | 0.265 | 121.154 | 3163.239 | 26.109 | 69.717 | 99.085 | 400.226 |
| 3 | car | 551.879 | 0.232 | 0.064 | 264.805 | 11888.392 | 44.895 | 77.793 | 253.785 | 656.769 |
| 4 | car | 380.000 | 0.241 | 0.286 | 137.640 | 5402.171 | 39.249 | 88.758 | 105.199 | 493.702 |
| 5 | car | 362.831 | 0.201 | 0.244 | 124.913 | 3290.462 | 26.342 | 69.389 | 103.867 | 424.797 |
| 6 | car | 389.873 | 0.150 | 0.098 | 118.626 | 2475.557 | 20.869 | 49.757 | 107.686 | 429.386 |
| 7 | car | 290.455 | 0.144 | 0.053 | 74.635 | 1189.545 | 15.938 | 35.703 | 65.541 | 330.267 |
| 8 | car | 275.677 | 0.154 | 0.188 | 91.528 | 1756.235 | 19.188 | 39.305 | 82.659 | 331.588 |
| 9 | car | 470.000 | 0.213 | 0.225 | 184.590 | 8185.361 | 44.343 | 84.482 | 164.123 | 603.316 |
| ... | | | | | ... | | | | | ... |
| 105 | car | 485.669 | 0.230 | 0.134 | 253.894 | 8135.968 | 32.045 | 64.855 | 245.471 | 541.364 |
| 106 | car | 390.000 | 0.358 | 0.204 | 245.686 | 10055.837 | 40.930 | 70.325 | 236.490 | 477.548 |
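【Analysis】
(1) Load the breast tissue electrical impedance data from the current directory, then drop the diagnosis column (Class) and the case-number column (Case), leaving only the 9 features in x.
(2) Run mean-shift clustering on the data. The bandwidth is first estimated from the data and then passed to MeanShift:
bandwidth = estimate_bandwidth(x, quantile=0.14, n_samples=60)
ms = MeanShift(bandwidth=bandwidth, bin_seeding=True)
ms = ms.fit(x)
(3) Display the clustering results: the cluster label of each sample (ms.labels_) and the cluster centers (ms.cluster_centers_).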
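The quantile argument in step (2) largely determines the bandwidth, and through it the number of clusters. The sketch below shows how the estimated bandwidth changes with quantile. It uses random numbers as a stand-in for the feature matrix (an assumption made so that the snippet runs without the CSV file), so the printed values say nothing about the breast tissue data itself.

```python
import numpy as np
from sklearn.cluster import estimate_bandwidth

# Synthetic stand-in for the feature matrix (assumption: 106 rows, 9 columns)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(106, 9))

# A larger quantile considers more neighbors per point,
# which yields a larger estimated bandwidth
bandwidths = {q: estimate_bandwidth(X_demo, quantile=q) for q in (0.1, 0.14, 0.3, 0.5)}
for q, bw in bandwidths.items():
    print(f"quantile={q:.2f}  bandwidth={bw:.3f}")
```

In practice one would try a few quantile values on the real data and check how many clusters each bandwidth produces.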
【Mean-shift clustering code example】
import numpy as np
import pandas as pd
from sklearn.cluster import MeanShift, estimate_bandwidth

# Load the breast tissue electrical impedance dataset
data = pd.read_csv('BreastTissue.csv')
# Keep only the feature columns (drop Class and Case)
x_columns0 = [c for c in data.columns if c not in ['Class', 'Case']]
x = data[x_columns0]
# Estimate the kernel bandwidth from 60 sampled rows
bandwidth = estimate_bandwidth(x, quantile=0.14, n_samples=60)
ms = MeanShift(bandwidth=bandwidth, bin_seeding=True)
ms = ms.fit(x)
# Assign each sample to its nearest cluster center
y_pre = ms.predict(x)
# Work on a copy so that the feature table x is not modified
df = x.copy()
df['meanshift'] = y_pre
df['realClass'] = data['Class']
print(df)
n_clusters_ = len(np.unique(y_pre))
print("number of estimated clusters : %d" % n_clusters_)
cluster_centers = ms.cluster_centers_
print("All cluster centers")
print(cluster_centers)
【Output】
I0 PA500 HFS ... P meanshift realClass
0 524.794072 0.187448 0.032114 ... 556.828334 1 car
1 330.000000 0.226893 0.265290 ... 400.225776 0 car
2 551.879287 0.232478 0.063530 ... 656.769449 1 car
3 380.000000 0.240855 0.286234 ... 493.701813 1 car
4 362.831266 0.200713 0.244346 ... 424.796503 0 car
.. ... ... ... ... ... ... ...
101 2000.000000 0.106989 0.105418 ... 2088.648870 2 adi
102 2600.000000 0.200538 0.208043 ... 2664.583623 5 adi
103 1600.000000 0.071908 -0.066323 ... 1475.371534 1 adi
104 2300.000000 0.045029 0.136834 ... 2480.592151 1 adi
105 2600.000000 0.069988 0.048869 ... 2545.419744 2 adi
[106 rows x 11 columns]
number of estimated clusters : 6
All cluster centers
[[3.94141736e+02 1.11309837e-01 8.89510899e-02 8.70812062e+01
1.05372613e+03 1.14313989e+01 3.44067700e+01 7.64714506e+01
4.03084832e+02]
[9.40643121e+02 1.78656266e-01 1.37139246e-01 3.14979700e+02
9.06462231e+03 3.29234879e+01 9.68289413e+01 2.94815322e+02
9.43997135e+02]
[2.27500000e+03 9.56876762e-02 1.88146493e-01 5.78072660e+02
3.83861391e+04 6.83336379e+01 2.63915192e+02 4.64193399e+02
2.38582287e+03]
[2.33992007e+03 7.38274275e-02 3.12413937e-01 4.46271422e+02
2.65638401e+04 6.05583408e+01 3.12822113e+02 2.98809193e+02
2.57205604e+03]
[2.80000000e+03 8.30776720e-02 1.84306769e-01 5.83259257e+02
3.13886529e+04 5.38159532e+01 2.98582977e+02 5.01038494e+02
2.89658248e+03]
[2.60000000e+03 2.00538331e-01 2.08043247e-01 1.06344143e+03
1.74480476e+05 1.64071543e+02 4.18687286e+02 9.77552367e+02
2.66458362e+03]]
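Judging from the output, the clusters deviate considerably from the diagnoses, and the result is poor: in the rows shown, for example, the carcinoma (car) cases are split between clusters 0 and 1, and the adipose (adi) cases fall into clusters 1, 2, and 5. For data such as this breast tissue set, which already carries diagnostic labels, a classification algorithm is the more appropriate choice; mean-shift is used here only to demonstrate how the algorithm is applied.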
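One way to put a number on this deviation is to cross-tabulate the cluster labels against the diagnoses and compute the adjusted Rand index, which is 1.0 for identical partitions and close to 0 for random agreement. The following is a sketch only: the helper name compare_with_diagnosis is hypothetical, and the toy labels are invented for illustration and are not the case-study results.

```python
import pandas as pd
from sklearn.metrics import adjusted_rand_score

def compare_with_diagnosis(cluster_labels, true_classes):
    """Return a contingency table and the adjusted Rand index (hypothetical helper)."""
    table = pd.crosstab(pd.Series(list(true_classes), name="realClass"),
                        pd.Series(list(cluster_labels), name="meanshift"))
    ari = adjusted_rand_score(list(true_classes), list(cluster_labels))
    return table, ari

# Toy labels for illustration only (not the case-study results)
toy_true = ["car", "car", "car", "adi", "adi", "adi"]
toy_pred = [0, 1, 0, 2, 2, 1]
table, ari = compare_with_diagnosis(toy_pred, toy_true)
print(table)
print("adjusted Rand index: %.3f" % ari)
```

With the script above, the same helper could be called as compare_with_diagnosis(y_pre, data['Class']) to see how each diagnostic class is distributed over the six clusters.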
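【Homework】
(1) Use a decision tree to classify the labeled breast tissue electrical impedance data. Write the program, and submit the source code and the output.
(2) Use a neural network to classify the labeled breast tissue electrical impedance data. Write the program, and submit the source code and the output.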