SMOTE

This article describes how to use the SMOTE module in Azure Machine Learning designer to increase the number of underrepresented cases in a dataset that's used for machine learning. SMOTE is a better way of increasing the number of rare cases than simply duplicating existing cases.

You connect the SMOTE module to a dataset that's imbalanced. There are many reasons why a dataset might be imbalanced. For example, the category you're targeting might be rare in the population, or the data might be difficult to collect. Typically, you use SMOTE when the class that you want to analyze is underrepresented.

The module returns a dataset that contains the original samples. It also returns a number of synthetic minority samples, depending on the percentage that you specify.

More about SMOTE

Synthetic Minority Oversampling Technique (SMOTE) is a statistical technique for increasing the number of cases in your dataset in a balanced way. The module works by generating new instances from existing minority cases that you supply as input. This implementation of SMOTE does not change the number of majority cases.

The new instances are not just copies of existing minority cases. Instead, the algorithm takes samples of the feature space for each target class and its nearest neighbors. The algorithm then generates new examples that combine features of the target case with features of its neighbors. This approach increases the features available to each class and makes the samples more general.
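The combination step described above can be sketched in a few lines of Python. This is a minimal illustration of the interpolation idea from the published SMOTE technique, not the module's actual implementation: the function name `smote_sketch`, the use of plain Euclidean distance, and the parameter names are all assumptions made for this example.

```python
import math
import random


def smote_sketch(minority, n_new, k=1, seed=None):
    """Generate n_new synthetic rows from a list of numeric minority rows.

    Hypothetical helper for illustration only.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority cases to interpolate")
    rng = random.Random(seed)  # a fixed seed makes the output repeatable
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority cases by Euclidean distance to the base case
        # (an assumed metric; the module may weight features differently).
        others = [row for j, row in enumerate(minority) if j != i]
        others.sort(key=lambda row: math.dist(base, row))
        neighbor = rng.choice(others[:k])
        # Place the new case at a random point on the segment base -> neighbor.
        gap = rng.random()
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbor)])
    return synthetic


new_rows = smote_sketch([[0.0, 0.0], [1.0, 1.0], [0.0, 1.0]], n_new=5, k=2, seed=0)
print(len(new_rows))
```

Each synthetic row lies between an existing minority case and one of its nearest minority neighbors, which is why the new rows are similar to the originals without being copies of them.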

SMOTE takes the entire dataset as an input, but it increases the percentage of only the minority cases. For example, suppose you have an imbalanced dataset where just 1 percent of the cases have the target value A (the minority class), and 99 percent of the cases have the value B. To increase the percentage of minority cases to twice the previous percentage, you would enter 200 for SMOTE percentage in the module's properties.

Examples

We recommend that you try using SMOTE with a small dataset to see how it works. The following example uses the Blood Donation dataset available in Azure Machine Learning designer.

If you add the dataset to a pipeline and select Visualize on the dataset's output, you can see that of the 748 rows or cases in the dataset, 570 cases (76 percent) are of Class 0, and 178 cases (24 percent) are of Class 1. Although this result isn't severely imbalanced, Class 1 represents the people who donated blood, so these rows contain the feature space that you want to model.

To increase the number of cases, you can set the value of SMOTE percentage by using multiples of 100, as follows:

                                         Class 0       Class 1       Total
Original dataset                         570 (76%)     178 (24%)     748
(equivalent to SMOTE percentage = 0)
SMOTE percentage = 100                   570 (62%)     356 (38%)     926
SMOTE percentage = 200                   570 (52%)     534 (48%)     1,104
SMOTE percentage = 300                   570 (44%)     712 (56%)     1,282
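The row counts in this example can be reproduced with simple arithmetic. The sketch below assumes, based only on the figures above, that a SMOTE percentage of N adds N percent of the original minority count as synthetic cases and leaves the majority count unchanged. The function name `smote_counts` is hypothetical.

```python
def smote_counts(majority, minority, smote_percentage):
    """Return (majority, minority, total) counts after oversampling.

    Assumes the additive reading of SMOTE percentage that matches the
    Blood Donation figures; hypothetical helper for illustration.
    """
    added = minority * smote_percentage // 100  # synthetic minority cases
    new_minority = minority + added
    return majority, new_minority, majority + new_minority


for pct in (0, 100, 200, 300):
    maj, mino, total = smote_counts(570, 178, pct)
    share = round(100 * mino / total)
    print(f"SMOTE percentage = {pct}: class 0 = {maj}, class 1 = {mino} ({share}%), total = {total}")
```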

Warning

Increasing the number of cases by using SMOTE is not guaranteed to produce more accurate models. Try running pipelines with different percentages, different feature sets, and different numbers of nearest neighbors to see how adding cases influences your model.

How to configure SMOTE

  1. Add the SMOTE module to your pipeline. You can find the module under Data Transformation modules, in the Manipulation category.

  2. Connect the dataset that you want to boost. If you want to specify the feature space for building the new cases, either by using only specific columns or by excluding some, use the Select Columns in Dataset module. You can then isolate the columns that you want to use before using SMOTE.

    Otherwise, creation of new cases through SMOTE is based on all the columns that you provide as inputs. At least one of the feature columns must be numeric.

  3. Ensure that the column that contains the label, or target class, is selected. SMOTE accepts only binary labels.

  4. The SMOTE module automatically identifies the minority class in the label column, and then gets all examples for the minority class. No column can contain NaN values.

  5. In the SMOTE percentage option, enter a whole number that indicates the target percentage of minority cases in the output dataset. For example:

    • You enter 0. The SMOTE module returns exactly the same dataset that you provided as input. It adds no new minority cases, so the class proportions are unchanged.

    • You enter 100. The SMOTE module generates new minority cases. It adds the same number of minority cases that were in the original dataset. Because SMOTE does not increase the number of majority cases, the proportion of cases of each class changes.

    • You enter 200. The module doubles the percentage of minority cases compared to the original dataset. This does not result in having twice as many minority cases as before. Rather, the size of the dataset is increased in such a way that the number of majority cases stays the same. The number of minority cases is increased until it matches the desired percentage value. In the Blood Donation example, this setting raises the minority count from 178 to 534, which takes the minority share from 24 percent to 48 percent of the rows.

    Note

    Use only multiples of 100 for the SMOTE percentage.

  6. Use the Number of nearest neighbors option to determine the size of the feature space that the SMOTE algorithm uses in building new cases. A nearest neighbor is a row of data (a case) that's similar to a target case. The distance between any two cases is measured by combining the weighted vectors of all features.

    • By increasing the number of nearest neighbors, you get features from more cases.
    • By keeping the number of nearest neighbors low, you use features that are more like those in the original sample.
  7. Enter a value in the Random seed box if you want to ensure the same results over runs of the same pipeline, with the same data. Otherwise, the module generates a random seed based on processor clock values when the pipeline is deployed. A generated seed can cause slightly different results from run to run.

  8. Submit the pipeline.

    The output of the module is a dataset that contains the original rows plus a number of added rows with minority cases.
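The input requirements listed in the steps above (a binary label column, no NaN values, and at least one numeric feature column) can be checked before the data reaches the module. The following is a minimal sketch in plain Python, assuming the dataset is held as a list of dictionaries. The helper `check_smote_input` is hypothetical and is not part of Azure Machine Learning.

```python
import math
from numbers import Number


def check_smote_input(rows, label_column):
    """Return a list of problems found; an empty list means all checks passed.

    Hypothetical pre-check mirroring the requirements stated in the steps.
    """
    if not rows:
        raise ValueError("dataset is empty")
    problems = []
    # SMOTE accepts only binary labels.
    labels = {row[label_column] for row in rows}
    if len(labels) != 2:
        problems.append(f"label column has {len(labels)} distinct values, expected 2")
    # No column can contain NaN values.
    if any(isinstance(v, float) and math.isnan(v) for row in rows for v in row.values()):
        problems.append("dataset contains NaN values")
    # At least one feature column must be numeric (bool is not counted).
    feature_columns = [c for c in rows[0] if c != label_column]
    numeric = [
        c for c in feature_columns
        if all(isinstance(r[c], Number) and not isinstance(r[c], bool) for r in rows)
    ]
    if not numeric:
        problems.append("no numeric feature column")
    return problems


sample = [
    {"age": 30, "city": "A", "donated": 0},
    {"age": 41, "city": "B", "donated": 1},
]
print(check_smote_input(sample, "donated"))
```

Returning a list of problems rather than raising on the first one lets you fix every issue in a single pass before you submit the pipeline.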

Technical notes

  • When you're publishing a model that uses the SMOTE module, remove SMOTE from the predictive pipeline before it's published as a web service. The reason is that SMOTE is intended for improving a model during training, not for scoring. You might get an error if a published predictive pipeline contains the SMOTE module.

  • You can often get better results if you clean missing values or apply other transformations to fix data before you apply SMOTE.

  • Some researchers have investigated whether SMOTE is effective on high-dimensional or sparse data, such as data used in text classification or genomics datasets. This paper has a good summary of the effects and of the theoretical validity of applying SMOTE in such cases: Blagus and Lusa: SMOTE for high-dimensional class-imbalanced data.

  • If SMOTE is not effective in your dataset, other approaches that you might consider include:

    • Methods for oversampling the minority cases or undersampling the majority cases.
    • Ensemble techniques that help the learner directly by using clustering, bagging, or adaptive boosting.
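As an illustration of the undersampling alternative mentioned above, the following sketch randomly drops majority rows until both classes have the same count. It is one simple variant among many, written in plain Python under the assumption that the dataset is a list of dictionaries with a binary label. The function `undersample_majority` is hypothetical.

```python
import random


def undersample_majority(rows, label_column, seed=None):
    """Keep all minority rows and an equal-sized random subset of majority rows.

    Hypothetical helper for illustration only.
    """
    rng = random.Random(seed)  # a fixed seed makes the selection repeatable
    by_class = {}
    for row in rows:
        by_class.setdefault(row[label_column], []).append(row)
    if len(by_class) != 2:
        raise ValueError("expected exactly two classes")
    # Order the two labels by class size: smaller class first.
    minority_label, majority_label = sorted(by_class, key=lambda c: len(by_class[c]))
    minority = by_class[minority_label]
    kept = rng.sample(by_class[majority_label], len(minority))
    return minority + kept


data = [{"x": i, "y": 0} for i in range(10)] + [{"x": i, "y": 1} for i in range(3)]
balanced = undersample_majority(data, "y", seed=1)
print(len(balanced))
```

Unlike SMOTE, this approach discards data, so it tends to suit cases where majority rows are plentiful.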

Next steps

See the set of modules available to Azure Machine Learning.