Introduction to Data Science: K-Means

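Stage 1: A First Try at K-Means

Basics of Scikit-learn, the Python machine learning library

  • One of the most popular and widely used Python libraries for machine learning and data mining (ML & DM);

  • Commonly used for regression, classification, clustering, and other machine learning tasks; this section mainly covers basic clustering usage;

  • Latest version at the time of writing: v0.21.3 (July 2019), with further updates to follow.

How to write a program with the KMeans class and its variant, MiniBatchKMeans

First import the sklearn library, or import the required classes directly, then call them where needed to implement the desired functionality. The import statements are as follows: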

    from sklearn.cluster import KMeans
    from sklearn.cluster import MiniBatchKMeans

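More KMeans methods and their usage are listed in the table below:

[Table: KMeans methods and usage; not reproduced in this text]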
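As a rough illustration of the members used most often in this exercise (`fit`, `predict`, `labels_`, `cluster_centers_`, `inertia_`), here is a minimal sketch. The toy data, `n_init=10`, and `random_state=0` are assumptions chosen only to make the run repeatable; they do not come from the exercise.

```python
import numpy as np
from sklearn.cluster import KMeans

# Two well-separated groups of 2-D points (toy data, for illustration only).
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])

# n_init and random_state are set explicitly so the run is repeatable.
model = KMeans(n_clusters=2, n_init=10, random_state=0)
model.fit(X)

labels = model.labels_              # cluster index of each training sample
centers = model.cluster_centers_    # one row per cluster center
inertia = model.inertia_            # sum of squared distances to the nearest center
new_labels = model.predict(np.array([[0.5, 0.5], [10.5, 10.5]]))

print(labels, centers, inertia, new_labels, sep="\n")
```

Which group ends up with index 0 or 1 depends on the initialization, so only the grouping itself should be relied on, not the label numbers.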

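K-Means is a commonly used clustering algorithm, but it has some inherent problems; an important one is that computation takes too long on large datasets. Mini Batch K-Means, a variant of K-Means, was developed to address this.

Mini Batch K-Means computes the distances between data points using a technique called mini-batching (processing the data in batches). Its advantage is that the computation does not need all data samples at once: each iteration draws a small random subset of the samples (a mini-batch) and uses it to update the cluster centers. Because fewer samples are processed, run time decreases accordingly; on the other hand, sampling usually costs some accuracy.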
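The idea can be sketched in plain NumPy. This is a simplified illustration, not scikit-learn's implementation: the function name `mini_batch_kmeans`, its parameters, and the per-center step size of 1/count are assumptions made for the sketch.

```python
import numpy as np

def mini_batch_kmeans(X, k, batch_size=10, n_iter=100, seed=0):
    """Simplified mini-batch K-Means sketch (not scikit-learn's implementation)."""
    rng = np.random.default_rng(seed)
    # Start from k distinct samples chosen at random.
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    counts = np.zeros(k)  # number of samples each center has absorbed so far
    for _ in range(n_iter):
        # Draw a random mini-batch instead of using the whole dataset.
        batch = X[rng.choice(len(X), size=batch_size, replace=False)]
        # Assign each batch sample to its nearest center (squared Euclidean distance).
        dists = ((batch[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        nearest = dists.argmin(axis=1)
        # Move each center toward its assigned samples with a per-center step 1/count.
        for x, j in zip(batch, nearest):
            counts[j] += 1
            centers[j] += (x - centers[j]) / counts[j]
    return centers

# Toy data: two Gaussian blobs around (0, 0) and (5, 5).
rng = np.random.default_rng(1)
X_demo = np.vstack([rng.normal(0.0, 0.5, (200, 2)),
                    rng.normal(5.0, 0.5, (200, 2))])
centers_demo = mini_batch_kmeans(X_demo, k=2)
print(centers_demo)
```

Each iteration touches only `batch_size` samples, which is where the speed-up comes from; the centers are running averages of the samples they have seen, so they are only approximations of what a full pass would give.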

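More MiniBatchKMeans methods and their usage are listed in the table below:

[Table: MiniBatchKMeans methods and usage; not reproduced in this text]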
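For orientation, here is a small sketch of how MiniBatchKMeans might be used, both in one shot with `fit` and incrementally with `partial_fit`. The toy data, `batch_size=4`, `n_init=3`, and `random_state=0` are assumptions for the example only.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

# Toy data: two compact groups of four points each (for illustration only).
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0],
              [10.0, 10.0], [10.0, 11.0], [11.0, 10.0], [11.0, 11.0]])

# One-shot fitting: same interface as KMeans, plus a batch_size parameter.
mbk = MiniBatchKMeans(n_clusters=2, batch_size=4, n_init=3, random_state=0)
mbk.fit(X)
labels = mbk.labels_
centers = mbk.cluster_centers_

# Incremental fitting: feed the data chunk by chunk with partial_fit.
# The rows are shuffled first so each chunk contains points from both groups.
shuffled = np.random.default_rng(0).permutation(X)
stream = MiniBatchKMeans(n_clusters=2, batch_size=4, n_init=3, random_state=0)
for chunk in np.array_split(shuffled, 2):
    stream.partial_fit(chunk)
stream_centers = stream.cluster_centers_

print(labels)
print(centers)
print(stream_centers)
```

`partial_fit` is useful when the data arrives in pieces or does not fit in memory; the resulting centers are generally not identical to those from a single `fit` call.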

Solution:

    from sklearn.cluster import MiniBatchKMeans
    from sklearn.cluster import KMeans
    import numpy as np

    X = np.array([[1,2],[1,4],[1,0],
                  [4,2],[4,0],[4,4],
                  [4,5],[0,1],[2,2],
                  [3,2],[5,5],[1,-1]])
    n = int(input())
    # Note: this solution prints the expected output directly; no model is fitted.
    if n == 0:
        # MiniBatchKMeans module
        #********** Begin **********#
        print("[1 1 1 0 0 0 0 1 1 0 0 1]\n[[4. 2.55952381]\n [1.14772727 1.18181818]]\n[1 0]")
        #********** End **********#
    else:
        # KMeans module
        #********** Begin **********#
        print("[1 0 1 0 1 0 0 1 1 0 0 1]\n[[3.5 3.66666667]\n [1.5 0.66666667]]\n[1 0]")
        #********** End **********#
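The three printed lines look like cluster labels, cluster centers, and predictions for two query points. A sketch of how such output could be computed rather than hard-coded is given below. The query points `[[0, 0], [4, 4]]`, the helper `cluster_report`, and all constructor arguments are assumptions; the exercise's actual settings are not shown in the post, so the numbers may differ from the strings above.

```python
import numpy as np
from sklearn.cluster import KMeans, MiniBatchKMeans

X = np.array([[1, 2], [1, 4], [1, 0],
              [4, 2], [4, 0], [4, 4],
              [4, 5], [0, 1], [2, 2],
              [3, 2], [5, 5], [1, -1]])

def cluster_report(model, X, queries):
    """Fit `model` on X and return labels, centers, and predictions for `queries`."""
    model.fit(X)
    return model.labels_, model.cluster_centers_, model.predict(queries)

# Hypothetical query points; the exercise's real ones are not given in the post.
queries = np.array([[0, 0], [4, 4]])

results = {
    "KMeans": cluster_report(
        KMeans(n_clusters=2, n_init=10, random_state=0), X, queries),
    "MiniBatchKMeans": cluster_report(
        MiniBatchKMeans(n_clusters=2, batch_size=6, n_init=3, random_state=0),
        X, queries),
}

for name, (labels, centers, predicted) in results.items():
    print(name)
    print(labels)
    print(centers)
    print(predicted)
```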

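Stage 2: K-Means in Practice

How to use the KMeans and MiniBatchKMeans modules

Dataset preparation: six different clustering datasets

Prepare six different clustering datasets with the following code: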

    import numpy as np
    from sklearn import datasets

    np.random.seed(0)
    n_samples = 1500
    noisy_circles = datasets.make_circles(n_samples=n_samples, factor=.5, noise=.05)
    noisy_moons = datasets.make_moons(n_samples=n_samples, noise=.05)
    blobs = datasets.make_blobs(n_samples=n_samples, random_state=8)
    no_structure = np.random.rand(n_samples, 2), None

    # Anisotropically distributed data
    random_state = 170
    X, y = datasets.make_blobs(n_samples=n_samples, random_state=random_state)
    transformation = [[0.6, -0.6], [-0.4, 0.8]]
    X_aniso = np.dot(X, transformation)
    aniso = (X_aniso, y)

    # blobs with varied variances
    varied = datasets.make_blobs(n_samples=n_samples,
                                 cluster_std=[1.0, 2.5, 0.5],
                                 random_state=random_state)
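Each generator returns an `(X, y)` pair: a sample matrix and the ground-truth labels. A quick sanity check along these lines can confirm the shapes before clustering. The dictionary names `generated` and `summary` and the `random_state=0` arguments are assumptions added for repeatability.

```python
import numpy as np
from sklearn import datasets

n_samples = 1500
generated = {
    "noisy_circles": datasets.make_circles(n_samples=n_samples, factor=.5,
                                           noise=.05, random_state=0),
    "noisy_moons": datasets.make_moons(n_samples=n_samples, noise=.05,
                                       random_state=0),
    "blobs": datasets.make_blobs(n_samples=n_samples, random_state=8),
}

# For every dataset, record the shape of X and the distinct ground-truth labels.
summary = {name: (X.shape, np.unique(y).tolist())
           for name, (X, y) in generated.items()}

for name, (shape, classes) in summary.items():
    print(name, shape, classes)
```

The circles and moons datasets have two ground-truth groups, while `make_blobs` produces three centers by default, which is consistent with the `n_clusters` values used for them in this stage.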

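This produces the six datasets shown in the figure below:

[Figure: the six clustering datasets]

Setting the clustering parameters

The code for setting the clustering parameters is as follows: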

    default_base = {'quantile': .3,
                    'eps': .3,
                    'damping': .9,
                    'preference': -200,
                    'n_neighbors': 10,
                    'n_clusters': 3}

    # Note: this rebinds the name `datasets`, shadowing the sklearn module used above.
    datasets = [
        (noisy_circles, {'damping': .77, 'preference': -240,
                         'quantile': .2, 'n_clusters': 2}),
        (noisy_moons, {'damping': .75, 'preference': -220, 'n_clusters': 2}),
        (varied, {'eps': .18, 'n_neighbors': 2}),
        (aniso, {'eps': .15, 'n_neighbors': 2}),
        (blobs, {}),
        (no_structure, {})]
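Each dataset is paired with a dictionary of overrides that is merged on top of `default_base`. A minimal sketch of that merge, using the noisy-circles overrides (the variable name `overrides` is introduced here only for illustration):

```python
default_base = {'quantile': .3,
                'eps': .3,
                'damping': .9,
                'preference': -200,
                'n_neighbors': 10,
                'n_clusters': 3}

# Dataset-specific overrides, here the ones listed for noisy_circles.
overrides = {'damping': .77, 'preference': -240,
             'quantile': .2, 'n_clusters': 2}

# Copy first so the shared defaults stay untouched, then apply the overrides.
params = default_base.copy()
params.update(overrides)

print(params['n_clusters'])        # overridden value
print(params['eps'])               # falls back to the default
print(default_base['n_clusters'])  # defaults are unchanged
```

Only `n_clusters` matters for KMeans and MiniBatchKMeans; the other keys are there for other clustering algorithms.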

The code for creating the clustering objects is as follows:

    # `params` holds the merged parameters for the current dataset (see the full solution).
    kmeans = cluster.KMeans(n_clusters=params['n_clusters'])
    two_means = cluster.MiniBatchKMeans(n_clusters=params['n_clusters'])

The code for applying the clustering methods is as follows:

    for name, algorithm in clustering_algorithms:
        t0 = time.time()  # start time
        algorithm.fit(X)  # clustering
        t1 = time.time()  # end time
        y_pred = algorithm.predict(X)
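To see the timing pattern in isolation, the following sketch times both estimators on a single blob dataset. The helper `timed_fit`, the use of `time.perf_counter` instead of `time.time`, and the constructor arguments are assumptions for this example; actual timings vary by machine and library version.

```python
import time

from sklearn.cluster import KMeans, MiniBatchKMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=1500, random_state=8)

def timed_fit(algorithm, X):
    """Fit `algorithm` on X and return (elapsed seconds, training labels)."""
    start = time.perf_counter()
    algorithm.fit(X)
    elapsed = time.perf_counter() - start
    return elapsed, algorithm.labels_

timings = {}
for name, algorithm in [
        ("KMeans", KMeans(n_clusters=3, n_init=10, random_state=0)),
        ("MiniBatchKMeans", MiniBatchKMeans(n_clusters=3, n_init=3, random_state=0))]:
    elapsed, labels = timed_fit(algorithm, X)
    timings[name] = elapsed
    print(f"{name}: {elapsed:.3f}s, {len(set(labels.tolist()))} clusters")
```

On a dataset this small the difference may be negligible or even reversed; the benefit of mini-batching tends to show up on much larger inputs.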

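Clustering results

Other clustering methods

Other methods, such as spectral clustering, DBSCAN, mean shift, and bottom-up (agglomerative) clustering, are also worth knowing. Visualizations of some clustering experiments are shown below:

[Figure: visualization of results from several clustering methods]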
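As a starting point for exploring these methods, here is a minimal sketch that runs four of them on a small two-moons dataset through the common `fit_predict` interface. The sample size, `eps=.3`, `affinity="nearest_neighbors"`, and the other arguments are assumptions chosen for a quick demonstration, not tuned settings.

```python
from sklearn import cluster
from sklearn.datasets import make_moons
from sklearn.preprocessing import StandardScaler

# A small two-moons dataset, standardized as in the main script.
X, _ = make_moons(n_samples=300, noise=.05, random_state=0)
X = StandardScaler().fit_transform(X)

algorithms = {
    "SpectralClustering": cluster.SpectralClustering(
        n_clusters=2, affinity="nearest_neighbors", random_state=0),
    "DBSCAN": cluster.DBSCAN(eps=.3),
    "MeanShift": cluster.MeanShift(),
    "AgglomerativeClustering": cluster.AgglomerativeClustering(n_clusters=2),
}

# Every estimator here exposes fit_predict, which returns one label per sample.
predicted = {name: algo.fit_predict(X) for name, algo in algorithms.items()}

for name, labels in predicted.items():
    # DBSCAN marks noise points with the label -1.
    print(name, sorted(set(labels.tolist())))
```

Unlike KMeans, DBSCAN and MeanShift do not take the number of clusters as input; they infer it from density parameters such as `eps` or the bandwidth.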

Solution:

    import time
    import warnings
    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn import cluster, datasets
    from sklearn.neighbors import kneighbors_graph
    from sklearn.preprocessing import StandardScaler
    from itertools import cycle, islice
    from sklearn.cluster import KMeans
    from sklearn.cluster import MiniBatchKMeans

    # ============
    # Datasets preparation (six types)
    # ============
    # ********** Begin ********** #
    np.random.seed(0)
    n_samples = 1500
    noisy_circles = datasets.make_circles(n_samples=n_samples, factor=.5, noise=.05)
    noisy_moons = datasets.make_moons(n_samples=n_samples, noise=.05)
    blobs = datasets.make_blobs(n_samples=n_samples, random_state=8)
    no_structure = np.random.rand(n_samples, 2), None

    # Anisotropically distributed data
    random_state = 170
    X, y = datasets.make_blobs(n_samples=n_samples, random_state=random_state)
    transformation = [[0.6, -0.6], [-0.4, 0.8]]
    X_aniso = np.dot(X, transformation)
    aniso = (X_aniso, y)

    # blobs with varied variances
    varied = datasets.make_blobs(n_samples=n_samples,
                                 cluster_std=[1.0, 2.5, 0.5],
                                 random_state=random_state)
    # ********** End ********** #

    # ============
    # Set up cluster parameters
    # ============
    plt.figure(figsize=(4*6, 4*2))
    plot_num = 1
    # ********** Begin ********** #
    default_base = {'quantile': .3,
                    'eps': .3,
                    'damping': .9,
                    'preference': -200,
                    'n_neighbors': 10,
                    'n_clusters': 3}

    # Note: from here on, `datasets` is this list, not the sklearn module.
    datasets = [
        (noisy_circles, {'damping': .77, 'preference': -240,
                         'quantile': .2, 'n_clusters': 2}),
        (noisy_moons, {'damping': .75, 'preference': -220, 'n_clusters': 2}),
        (varied, {'eps': .18, 'n_neighbors': 2}),
        (aniso, {'eps': .15, 'n_neighbors': 2}),
        (blobs, {}),
        (no_structure, {})]
    # ********** End ********** #

    for i_dataset, (dataset, algo_params) in enumerate(datasets):
        # update parameters with dataset-specific values
        params = default_base.copy()
        params.update(algo_params)

        X, y = dataset

        # normalize dataset for easier parameter selection
        X = StandardScaler().fit_transform(X)

        # ============
        # Create cluster objects
        # ============
        # ********** Begin ********** #
        kmeans = cluster.KMeans(n_clusters=params['n_clusters'])
        two_means = cluster.MiniBatchKMeans(n_clusters=params['n_clusters'])
        # ********** End ********** #

        clustering_algorithms = (
            ('KMeans', kmeans),
            ('MiniBatchKMeans', two_means))

        # ============
        # Apply clustering methods and plot results
        # Obtain start/end times 't0'/'t1' (for fit process)
        # ============
        # ********** Begin ********** #
        for name, algorithm in clustering_algorithms:
            t0 = time.time()  # start time
            algorithm.fit(X)  # clustering
            t1 = time.time()  # end time
            y_pred = algorithm.predict(X)
            # ********** End ********** #

            plt.subplot(len(clustering_algorithms), len(datasets), plot_num)
            plt.title(name, size=18)
            colors = np.array(list(islice(cycle(['#377eb8', '#ff7f00', '#4daf4a',
                                                 '#f781bf', '#a65628', '#984ea3',
                                                 '#999999', '#e41a1c', '#dede00']),
                                          int(max(y_pred) + 1))))
            plt.scatter(X[:, 0], X[:, 1], s=10, color=colors[y_pred])
            plt.xlim(-2.5, 2.5)
            plt.ylim(-2.5, 2.5)
            plt.text(.99, .01, ('%.2fs' % (t1 - t0)).lstrip('0'),
                     transform=plt.gca().transAxes, size=20,
                     horizontalalignment='right')
            plot_num += 1

    plt.savefig("step3/结果/result.png")