Fundamentals of Smart Medicine Language

9.7 Mean-Shift Clustering Algorithm

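The mean-shift clustering algorithm does not need the number of clusters to be known in advance, and it places no restriction on cluster shape. In scikit-learn, the MeanShift class constructor has the following form:

MeanShift(bandwidth=None, seeds=None, bin_seeding=False, min_bin_freq=1, cluster_all=True)

See the official scikit-learn documentation for a detailed description of each parameter.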
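To make the idea concrete, here is a minimal pure-Python sketch of mean-shift with a flat kernel. It illustrates the principle under simplifying assumptions and is not scikit-learn's implementation: the helper names (`shift_point`, `mean_shift`), the toy points, and the merge threshold of half the bandwidth are all assumptions made for this example. Each point is moved repeatedly to the mean of the data points within `bandwidth` of it until it stops moving, and points that converge to nearly the same location are merged into one cluster.

```python
import math

def shift_point(point, data, bandwidth):
    """Return the mean of all data points within `bandwidth` of `point` (flat kernel)."""
    neighbors = [p for p in data if math.dist(point, p) <= bandwidth]
    dim = len(point)
    return tuple(sum(p[i] for p in neighbors) / len(neighbors) for i in range(dim))

def mean_shift(data, bandwidth, tol=1e-6, max_iter=100):
    """Toy mean-shift: returns (cluster_centers, labels)."""
    modes = []
    for point in data:
        current = tuple(point)
        for _ in range(max_iter):
            shifted = shift_point(current, data, bandwidth)
            moved = math.dist(shifted, current)
            current = shifted
            if moved < tol:  # the point has stopped moving
                break
        modes.append(current)

    # Merge modes that converged to (almost) the same location.
    centers, labels = [], []
    for mode in modes:
        for idx, center in enumerate(centers):
            if math.dist(mode, center) < bandwidth / 2:  # assumed merge threshold
                labels.append(idx)
                break
        else:
            centers.append(mode)
            labels.append(len(centers) - 1)
    return centers, labels

# Two well-separated toy groups; the number of clusters is never passed in.
points = [(1.0, 1.0), (1.2, 0.9), (0.9, 1.1),
          (5.0, 5.0), (5.1, 4.8), (4.9, 5.2)]
centers, labels = mean_shift(points, bandwidth=1.5)
print(len(centers), labels)  # 2 [0, 0, 0, 1, 1, 1]
```

Calling `mean_shift(points, bandwidth=10)` on the same points merges everything into a single cluster, which shows that the bandwidth, not a preset cluster count, determines how many clusters emerge.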

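【Medical Case 9-2】A breast tissue electrical impedance spectroscopy dataset contains 106 cases (from the open UCI repository). Each case has 9 parameters (features), listed in Table 9-1. The diagnosis falls into 6 classes, listed in Table 9-2, and the layout of the data is shown in Table 9-3. Apply mean-shift clustering to the data and display the clustering results alongside the diagnoses.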

Table 9-1 Breast electrical impedance spectroscopy parameters

Feature   Description
I0        Impedivity (ohm) at zero frequency
PA500     Phase angle at 500 kHz
HFS       High-frequency slope of phase angle
DA        Impedance distance between spectral ends
AREA      Area under spectrum
A/DA      Area normalized by DA
MAX IP    Maximum of the spectrum
DR        Distance between I0 and real part of the maximum frequency point
P         Length of the spectral curve

Table 9-2 Breast electrical impedance spectroscopy diagnostic classes

Class   Description     Number
Car     Carcinoma       21
Fad     Fibro-adenoma   15
Mas     Mastopathy      18
Gla     Glandular       16
Con     Connective      14
Adi     Adipose         22

 

Table 9-3 Breast electrical impedance spectroscopy data

Case  Class  I0       PA500  HFS    DA       Area       A/DA    Max IP  DR       P
1     car    524.794  0.187  0.032  228.800  6843.598   29.911  60.205  220.737  556.828
2     car    330.000  0.227  0.265  121.154  3163.239   26.109  69.717  99.085   400.226
3     car    551.879  0.232  0.064  264.805  11888.392  44.895  77.793  253.785  656.769
4     car    380.000  0.241  0.286  137.640  5402.171   39.249  88.758  105.199  493.702
5     car    362.831  0.201  0.244  124.913  3290.462   26.342  69.389  103.867  424.797
6     car    389.873  0.150  0.098  118.626  2475.557   20.869  49.757  107.686  429.386
7     car    290.455  0.144  0.053  74.635   1189.545   15.938  35.703  65.541   330.267
8     car    275.677  0.154  0.188  91.528   1756.235   19.188  39.305  82.659   331.588
9     car    470.000  0.213  0.225  184.590  8185.361   44.343  84.482  164.123  603.316
...
105   car    485.669  0.230  0.134  253.894  8135.968   32.045  64.855  245.471  541.364
106   car    390.000  0.358  0.204  245.686  10055.837  40.930  70.325  236.490  477.548

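【Analysis】

1) Load the breast tissue impedance data from the current directory, then drop the diagnosis column and the case-number column.

2) Run mean-shift clustering on the data:

bandwidth = estimate_bandwidth(x, quantile=0.14, n_samples=60)
ms = MeanShift(bandwidth=bandwidth, bin_seeding=True)
ms = ms.fit(x)

3) Display the clustering results: ms.labels_ and ms.cluster_centers_.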
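The `quantile` argument in step 2 controls how local the estimated bandwidth is. As a rough sketch of the idea, assume the estimate is the average distance from each sample to its k-th nearest neighbor, with k = int(n × quantile) and the sample itself counted as the first neighbor. The helper `sketch_bandwidth` and the toy points are hypothetical, and this is not scikit-learn's source code.

```python
import math

def sketch_bandwidth(data, quantile):
    """Average distance from each sample to its k-th nearest neighbor.

    Assumption: k = max(1, int(n * quantile)), and each sample counts as
    its own first neighbor (distance 0).
    """
    n = len(data)
    k = max(1, int(n * quantile))
    total = 0.0
    for p in data:
        dists = sorted(math.dist(p, q) for q in data)  # dists[0] is 0.0 (p itself)
        total += dists[k - 1]
    return total / n

# Five one-dimensional toy samples: four close together and one outlier.
points = [(0.0,), (1.0,), (2.0,), (3.0,), (10.0,)]
small = sketch_bandwidth(points, quantile=0.4)  # k = 2
large = sketch_bandwidth(points, quantile=0.8)  # k = 4
print(small, large)  # 2.2 3.8
```

In this sketch a larger `quantile` yields a larger bandwidth, which in turn tends to produce fewer and broader clusters.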

【Mean-shift clustering code example】

import numpy as np
import pandas as pd
from sklearn.cluster import MeanShift, estimate_bandwidth

# Read the breast tissue impedance dataset from the current directory
data = pd.read_csv('BreastTissue.csv')

# Keep only the feature columns: drop the Class (diagnosis) and Case (case number) fields
x_columns0 = [c for c in data.columns if c not in ['Class', 'Case']]
x = data[x_columns0]

# Estimate the kernel bandwidth from the data, then fit mean-shift
bandwidth = estimate_bandwidth(x, quantile=0.14, n_samples=60)
ms = MeanShift(bandwidth=bandwidth, bin_seeding=True)
ms = ms.fit(x)
y_pre = ms.predict(x)  # cluster label assigned to each case

# Put the cluster labels next to the true diagnoses for comparison
df = x.copy()
df['meanshift'] = y_pre
df['realClass'] = data['Class']
print(df)

n_clusters_ = len(np.unique(y_pre))
print("number of estimated clusters : %d" % n_clusters_)

cluster_centers = ms.cluster_centers_
print("All cluster centers")
print(cluster_centers)

【Output】

              I0     PA500       HFS  ...            P  meanshift realClass
0     524.794072  0.187448  0.032114  ...   556.828334          1       car
1     330.000000  0.226893  0.265290  ...   400.225776          0       car
2     551.879287  0.232478  0.063530  ...   656.769449          1       car
3     380.000000  0.240855  0.286234  ...   493.701813          1       car
4     362.831266  0.200713  0.244346  ...   424.796503          0       car
..           ...       ...       ...  ...          ...        ...       ...
101  2000.000000  0.106989  0.105418  ...  2088.648870          2       adi
102  2600.000000  0.200538  0.208043  ...  2664.583623          5       adi
103  1600.000000  0.071908 -0.066323  ...  1475.371534          1       adi
104  2300.000000  0.045029  0.136834  ...  2480.592151          1       adi
105  2600.000000  0.069988  0.048869  ...  2545.419744          2       adi

[106 rows x 11 columns]
number of estimated clusters : 6
All cluster centers
[[3.94141736e+02 1.11309837e-01 8.89510899e-02 8.70812062e+01
  1.05372613e+03 1.14313989e+01 3.44067700e+01 7.64714506e+01
  4.03084832e+02]
 [9.40643121e+02 1.78656266e-01 1.37139246e-01 3.14979700e+02
  9.06462231e+03 3.29234879e+01 9.68289413e+01 2.94815322e+02
  9.43997135e+02]
 [2.27500000e+03 9.56876762e-02 1.88146493e-01 5.78072660e+02
  3.83861391e+04 6.83336379e+01 2.63915192e+02 4.64193399e+02
  2.38582287e+03]
 [2.33992007e+03 7.38274275e-02 3.12413937e-01 4.46271422e+02
  2.65638401e+04 6.05583408e+01 3.12822113e+02 2.98809193e+02
  2.57205604e+03]
 [2.80000000e+03 8.30776720e-02 1.84306769e-01 5.83259257e+02
  3.13886529e+04 5.38159532e+01 2.98582977e+02 5.01038494e+02
  2.89658248e+03]
 [2.60000000e+03 2.00538331e-01 2.08043247e-01 1.06344143e+03
  1.74480476e+05 1.64071543e+02 4.18687286e+02 9.77552367e+02
  2.66458362e+03]]
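The predicted labels can be related back to the printed centers: `ms.predict` assigns each sample to the cluster center nearest to it in Euclidean distance. The plain-Python sketch below, offered as an illustration only, recomputes that assignment for cases 1 to 5 of Table 9-3 using the six centers printed above. The helper name `nearest_center` is an assumption, and the case values are the three-decimal roundings from the table.

```python
import math

# The six cluster centers printed in the output (9 features each).
centers = [
    [3.94141736e+02, 1.11309837e-01, 8.89510899e-02, 8.70812062e+01, 1.05372613e+03,
     1.14313989e+01, 3.44067700e+01, 7.64714506e+01, 4.03084832e+02],
    [9.40643121e+02, 1.78656266e-01, 1.37139246e-01, 3.14979700e+02, 9.06462231e+03,
     3.29234879e+01, 9.68289413e+01, 2.94815322e+02, 9.43997135e+02],
    [2.27500000e+03, 9.56876762e-02, 1.88146493e-01, 5.78072660e+02, 3.83861391e+04,
     6.83336379e+01, 2.63915192e+02, 4.64193399e+02, 2.38582287e+03],
    [2.33992007e+03, 7.38274275e-02, 3.12413937e-01, 4.46271422e+02, 2.65638401e+04,
     6.05583408e+01, 3.12822113e+02, 2.98809193e+02, 2.57205604e+03],
    [2.80000000e+03, 8.30776720e-02, 1.84306769e-01, 5.83259257e+02, 3.13886529e+04,
     5.38159532e+01, 2.98582977e+02, 5.01038494e+02, 2.89658248e+03],
    [2.60000000e+03, 2.00538331e-01, 2.08043247e-01, 1.06344143e+03, 1.74480476e+05,
     1.64071543e+02, 4.18687286e+02, 9.77552367e+02, 2.66458362e+03],
]

# Cases 1-5 from Table 9-3 (I0, PA500, HFS, DA, Area, A/DA, Max IP, DR, P).
cases = [
    [524.794, 0.187, 0.032, 228.800, 6843.598, 29.911, 60.205, 220.737, 556.828],
    [330.000, 0.227, 0.265, 121.154, 3163.239, 26.109, 69.717, 99.085, 400.226],
    [551.879, 0.232, 0.064, 264.805, 11888.392, 44.895, 77.793, 253.785, 656.769],
    [380.000, 0.241, 0.286, 137.640, 5402.171, 39.249, 88.758, 105.199, 493.702],
    [362.831, 0.201, 0.244, 124.913, 3290.462, 26.342, 69.389, 103.867, 424.797],
]

def nearest_center(sample, centers):
    """Index of the center with the smallest Euclidean distance to `sample`."""
    return min(range(len(centers)), key=lambda i: math.dist(sample, centers[i]))

labels = [nearest_center(s, centers) for s in cases]
print(labels)  # [1, 0, 1, 1, 0]
```

The result agrees with the meanshift column for rows 0 to 4 of the printed DataFrame. In these five cases the Area feature contributes most of each distance, because its values are orders of magnitude larger than those of PA500 or HFS.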



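The clustering results deviate substantially from the diagnoses, so the outcome is poor. For labeled data such as this breast tissue impedance dataset, a classification algorithm is the more suitable choice. Mean-shift is used here only to demonstrate the workflow.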
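One way to make the mismatch visible is to cross-tabulate cluster labels against diagnoses. The plain-Python sketch below does this for the ten rows displayed in the printed DataFrame (rows 0 to 4 and 101 to 105), so it illustrates the method rather than summarizing all 106 cases. The variable names are assumptions.

```python
from collections import Counter

# (meanshift label, realClass) for the ten rows displayed in the printed DataFrame.
shown = [(1, "car"), (0, "car"), (1, "car"), (1, "car"), (0, "car"),
         (2, "adi"), (5, "adi"), (1, "adi"), (1, "adi"), (2, "adi")]

table = Counter(shown)  # (cluster, class) -> number of rows
clusters = sorted({c for c, _ in shown})
classes = sorted({k for _, k in shown})

# Print a small contingency table: one row per cluster, one column per class.
print("cluster " + " ".join(f"{k:>4}" for k in classes))
for c in clusters:
    print(f"{c:>7} " + " ".join(f"{table[(c, k)]:>4}" for k in classes))

# Clusters that contain adipose rows.
adi_clusters = sorted({c for c, k in shown if k == "adi"})
print("clusters containing adi rows:", adi_clusters)  # [1, 2, 5]
```

Even in this small sample, cluster 1 mixes carcinoma and adipose cases, and the five adipose rows are spread over three clusters. With the full DataFrame from the example, `pd.crosstab(df['realClass'], df['meanshift'])` produces the same kind of table for all 106 cases.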

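【Homework】

1) Use a decision tree to classify the labeled breast tissue impedance data. Write the program and submit the source code and the run output.

2) Use a neural network to classify the labeled breast tissue impedance data. Write the program and submit the source code and the run output.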

 
