Module: K-Means Clustering

This article describes how to use the K-Means Clustering module in Azure Machine Learning designer (preview) to create an untrained K-means clustering model.

K-means is one of the simplest and best-known unsupervised learning algorithms. You can use the algorithm for a variety of machine learning tasks, such as:

  • Detecting abnormal data.
  • Clustering text documents.
  • Analyzing datasets before you use other classification or regression methods.

To create a clustering model, you:

  • Add this module to your pipeline.
  • Connect a dataset.
  • Set parameters, such as the number of clusters you expect and the distance metric to use in creating the clusters.

After you've configured the module hyperparameters, you connect the untrained model to the Train Clustering Model module. Because the K-means algorithm is an unsupervised learning method, a label column is optional.

  • If your data includes a label, you can use the label values to guide selection of the clusters and optimize the model.

  • If your data has no label, the algorithm creates clusters representing possible categories, based solely on the data.

Understand K-means clustering

In general, clustering uses iterative techniques to group cases in a dataset into clusters that possess similar characteristics. These groupings are useful for exploring data, identifying anomalies in the data, and eventually for making predictions. Clustering models can also help you identify relationships in a dataset that you might not logically derive by browsing or simple observation. For these reasons, clustering is often used in the early phases of machine learning tasks, to explore the data and discover unexpected correlations.

When you configure a clustering model by using the K-means method, you must specify a target number k that indicates the number of centroids you want in the model. A centroid is a point that's representative of its cluster. The K-means algorithm assigns each incoming data point to one of the clusters by minimizing the within-cluster sum of squares.
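As a point of reference, the within-cluster sum of squares is usually written as follows. The symbols are assumptions introduced here for illustration: C_j is the set of points assigned to cluster j, \mu_j is that cluster's centroid, and x_i is a data point.

```latex
J = \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2,
\qquad
\mu_j = \frac{1}{\lvert C_j \rvert} \sum_{x_i \in C_j} x_i
```

A smaller J means that points lie closer, on average, to the centroid of the cluster they belong to.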

When it processes the training data, the K-means algorithm begins with an initial set of randomly chosen centroids. The centroids serve as starting points for the clusters, and the algorithm applies Lloyd's algorithm to iteratively refine their locations. The K-means algorithm stops building and refining clusters when it meets one or more of these conditions:

  • The centroids stabilize, meaning that the cluster assignments for individual points no longer change and the algorithm has converged on a solution.

  • The algorithm has completed the specified number of iterations.
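The iteration and its two stopping conditions can be sketched in plain Python. This is a minimal illustration of Lloyd's algorithm, not the module's actual implementation; the function name lloyd_kmeans, its parameters, and the sample points are all hypothetical.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def lloyd_kmeans(points, k, max_iterations=100, seed=0):
    """Hypothetical helper: cluster `points` into k groups with Lloyd's algorithm."""
    rng = random.Random(seed)
    # Start from k randomly chosen data points as the initial centroids.
    centroids = [list(p) for p in rng.sample(points, k)]
    assignments = None
    for iteration in range(1, max_iterations + 1):
        # Assignment step: each point joins the cluster with the nearest centroid.
        new_assignments = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Stopping condition 1: assignments no longer change (converged).
        if new_assignments == assignments:
            return centroids, assignments, iteration
        assignments = new_assignments
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # An empty cluster keeps its previous centroid.
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    # Stopping condition 2: the iteration budget is exhausted.
    return centroids, assignments, max_iterations


points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.8, 8.1)]
centroids, assignments, iterations_used = lloyd_kmeans(points, k=2)
print(assignments, iterations_used)
```

The function returns as soon as one pass leaves every assignment unchanged; otherwise it stops after max_iterations passes.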

After you've completed the training phase, you use the Assign Data to Clusters module to assign new cases to one of the clusters that you found by using the K-means algorithm. Cluster assignment is performed by computing the distance between the new case and the centroid of each cluster. Each new case is assigned to the cluster with the nearest centroid.
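The nearest-centroid rule is simple enough to sketch directly. The example below assumes Euclidean distance and uses made-up centroids; assign_to_cluster is a hypothetical helper, not part of the Assign Data to Clusters module.

```python
import math


def assign_to_cluster(new_case, centroids):
    """Return the index of the centroid closest to `new_case` (Euclidean distance)."""
    distances = [math.dist(new_case, centroid) for centroid in centroids]
    return distances.index(min(distances))


# Hypothetical centroids, as if produced by an earlier training run.
centroids = [(1.0, 1.0), (8.0, 8.0)]
new_cases = [(0.5, 1.5), (7.0, 9.0), (4.0, 4.0)]

labels = [assign_to_cluster(case, centroids) for case in new_cases]
print(labels)  # [0, 1, 0]
```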

Configure the K-Means Clustering module

  1. Add the K-Means Clustering module to your pipeline.

  2. To specify how you want the model to be trained, select the Create trainer mode option.

    • Single Parameter: If you know the exact parameters you want to use in the clustering model, you can provide a specific set of values as arguments.

  3. For Number of centroids, type the number of clusters you want the algorithm to begin with.

    The model isn't guaranteed to produce exactly this number of clusters. The algorithm starts with this number of data points and iterates to find the optimal configuration. For implementation details, you can refer to the source code of sklearn.

  4. Use the Initialization property to specify the algorithm that defines the initial cluster configuration.

    • First N: Some initial number of data points are chosen from the dataset and used as the initial means.

      This method is also called the Forgy method.

    • Random: The algorithm randomly places a data point in a cluster and then computes the initial mean to be the centroid of the cluster's randomly assigned points.

      This method is also called the random partition method.

    • K-Means++: This is the default method for initializing clusters.

      The K-means++ algorithm was proposed in 2007 by David Arthur and Sergei Vassilvitskii to avoid poor clustering by the standard K-means algorithm. K-means++ improves upon standard K-means by using a different method for choosing the initial cluster centers.

  5. For Random number seed, optionally type a value to use as the seed for the cluster initialization. This value can have a significant effect on cluster selection.

  6. For Metric, choose the function to use for measuring the distance between cluster vectors, or between new data points and the randomly chosen centroid. Azure Machine Learning supports the following cluster distance metrics:

    • Euclidean: The Euclidean distance is commonly used as a measure of cluster scatter for K-means clustering. This metric is preferred because it minimizes the mean distance between points and the centroids.

  7. For Iterations, type the number of times the algorithm should iterate over the training data before it finalizes the selection of centroids.

    You can adjust this parameter to balance accuracy against training time.

  8. For Assign label mode, choose an option that specifies how a label column, if it's present in the dataset, should be handled.

    Because K-means clustering is an unsupervised machine learning method, labels are optional. However, if your dataset already has a label column, you can use those values to guide the selection of the clusters, or you can specify that the values be ignored.

    • Ignore label column: The values in the label column are ignored and are not used in building the model.

    • Fill missing values: The label column values are used as features to help build the clusters. If any rows are missing a label, the value is imputed by using other features.

    • Overwrite from closest to center: The label column values are replaced with predicted label values, using the label of the point that is closest to the current centroid.

  9. Select the Normalize features option if you want to normalize features before training.

    If you apply normalization, the data points are normalized to [0,1] by MinMaxNormalizer before training.

  10. Train the model.

    • If you set Create trainer mode to Single Parameter, add a tagged dataset and train the model by using the Train Clustering Model module.
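Two of the options configured above, Normalize features and K-Means++ initialization, can be sketched in plain Python. This is an illustration under stated assumptions, not the module's code: min_max_normalize and kmeans_plus_plus_seeds are hypothetical helpers, mapping a constant column to 0.0 is an assumption, and the seeding follows the textbook K-means++ description (each further center is drawn with probability proportional to its squared distance from the nearest center already chosen).

```python
import random


def min_max_normalize(rows):
    """Rescale every column of `rows` to the range [0, 1]."""
    columns = list(zip(*rows))
    lows = [min(column) for column in columns]
    highs = [max(column) for column in columns]
    return [
        [
            # Assumption: a constant column (high == low) is mapped to 0.0.
            (value - low) / (high - low) if high > low else 0.0
            for value, low, high in zip(row, lows, highs)
        ]
        for row in rows
    ]


def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans_plus_plus_seeds(points, k, seed=0):
    """Pick k initial centers; assumes at least k distinct points."""
    rng = random.Random(seed)
    # The first center is chosen uniformly at random.
    centers = [rng.choice(points)]
    while len(centers) < k:
        # Weight each point by its squared distance to the nearest chosen center,
        # so far-away points are more likely to become the next center.
        weights = [min(squared_distance(p, c) for c in centers) for p in points]
        centers.append(rng.choices(points, weights=weights, k=1)[0])
    return centers


rows = [[1.0, 200.0], [2.0, 400.0], [3.0, 300.0], [1.5, 250.0]]
scaled = min_max_normalize(rows)
print(scaled)  # [[0.0, 0.0], [0.5, 1.0], [1.0, 0.5], [0.25, 0.25]]

seeds = kmeans_plus_plus_seeds(scaled, k=2, seed=42)
print(seeds)
```

Without normalization, the second column (values in the hundreds) would dominate every distance computation, which is the usual reason to rescale features before clustering.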

Results

After you've finished configuring and training the model, you have a model that you can use to generate scores. However, there are multiple ways to train the model, and multiple ways to view and use the results:

Capture a snapshot of the model in your workspace

If you used the Train Clustering Model module:

  1. Select the Train Clustering Model module and open the right panel.

  2. Select the Outputs tab. Select the Register dataset icon to save a copy of the trained model.

The saved model represents the training data at the time you saved the model. If you later update the training data used in the pipeline, the saved model isn't updated.

See the clustering result dataset

If you used the Train Clustering Model module:

  1. Right-click the Train Clustering Model module.

  2. Select Visualize.

Tips for generating the best clustering model

The seeding process that's used during clustering can significantly affect the model. Seeding means the initial placement of points into potential centroids.

For example, if the dataset contains many outliers, and an outlier is chosen to seed the clusters, no other data points would fit well with that cluster, and the cluster could be a singleton. That is, it might have only one point.

You can avoid this problem in a couple of ways:

  • Change the number of centroids and try multiple seed values.

  • Create multiple models, varying the metric or iterating more.

In general, with clustering models, it's possible that any given configuration will result in a locally optimized set of clusters. In other words, the set of clusters that's returned by the model suits only the current data points and isn't generalizable to other data. If you use a different initial configuration, the K-means method might find a different, superior configuration.
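One way to act on this advice is to train several models that differ only in their seed and keep the one with the lowest within-cluster sum of squares. The sketch below is a plain-Python illustration with made-up data that includes one outlier; run_kmeans is a hypothetical helper and does not reflect how the designer modules are implemented.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def run_kmeans(points, k, seed, max_iterations=50):
    """Hypothetical helper: one K-means run; returns (centroids, wcss)."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(max_iterations):
        # Group points by their nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            clusters[nearest].append(p)
        # Recompute centroids; an empty cluster keeps its old centroid.
        updated = [
            [sum(dim) / len(members) for dim in zip(*members)] if members else centroids[c]
            for c, members in enumerate(clusters)
        ]
        if updated == centroids:  # Converged: centroids stopped moving.
            break
        centroids = updated
    # Within-cluster sum of squares for the final centroids.
    wcss = sum(min(squared_distance(p, c) for c in centroids) for p in points)
    return centroids, wcss


points = [
    (1.0, 1.0), (1.5, 2.0), (1.0, 0.5),
    (8.0, 8.0), (9.0, 9.5), (8.5, 9.0),
    (1.0, 9.0), (0.5, 8.5), (1.5, 9.5),
    (20.0, 20.0),  # an outlier that can capture a centroid on its own
]

# Train one model per seed and keep the run with the lowest WCSS.
runs = {seed: run_kmeans(points, k=3, seed=seed) for seed in range(10)}
best_seed = min(runs, key=lambda s: runs[s][1])
best_centroids, best_wcss = runs[best_seed]
print(best_seed, round(best_wcss, 3))
```

Comparing runs by WCSS is only meaningful when the number of centroids is held fixed, because adding centroids lowers WCSS regardless of cluster quality.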

Next steps

See the set of modules available to Azure Machine Learning.