Group Data into Bins module

This article describes how to use the Group Data into Bins module in Azure Machine Learning designer to group numbers or change the distribution of continuous data.

The Group Data into Bins module supports multiple options for binning data. You can customize how the bin edges are set and how values are apportioned into the bins. For example, you can:

  • Manually type a series of values to serve as the bin boundaries.
  • Assign values to bins by using quantiles, or percentile ranks.
  • Force an even distribution of values into the bins.

More about binning and grouping

Binning or grouping data (sometimes called quantization) is an important tool in preparing numerical data for machine learning. It's useful in scenarios like these:

  • A column of continuous numbers has too many unique values to model effectively. So you automatically or manually assign the values to groups, to create a smaller set of discrete ranges.

  • You want to replace a column of numbers with categorical values that represent specific ranges.

    For example, for user demographics, you might want to group values in an age column by specifying custom ranges, such as 1-15, 16-22, 23-30, and so forth.

  • A dataset has a few extreme values, all well outside the expected range, and these values have an outsized influence on the trained model. To mitigate the bias in the model, you might transform the data to a uniform distribution by using the quantiles method.

    With this method, the Group Data into Bins module determines the ideal bin locations and bin widths to ensure that approximately the same number of samples fall into each bin. Then, depending on the normalization method you choose, the values in the bins are either transformed to percentiles or mapped to a bin number.
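
The equal-count idea behind the quantiles method can be sketched in a few lines of plain Python. This is only an illustration, not the module's actual implementation: the helper names `quantile_edges` and `bin_index` are hypothetical, and the rule that a value equal to an edge falls into the lower bin is an assumption chosen to match the custom-edge example later in this article.

```python
from bisect import bisect_left


def quantile_edges(values, n_bins):
    """Return n_bins - 1 cut points so that each bin gets roughly equal counts."""
    ordered = sorted(values)
    n = len(ordered)
    # Cut point k is the last sorted value that belongs to the first k bins.
    return [ordered[(k * n) // n_bins - 1] for k in range(1, n_bins)]


def bin_index(value, edges):
    """Return a 1-based bin index; a value equal to an edge goes to the lower bin."""
    return bisect_left(edges, value) + 1


# Hypothetical sample data with one extreme value (500).
ages = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 500]
edges = quantile_edges(ages, 4)
indices = [bin_index(a, edges) for a in ages]
counts = [indices.count(b) for b in range(1, 5)]

print(edges)    # cut points between the four bins
print(counts)   # number of samples per bin
```

In this sample, the extreme value 500 simply lands in the last bin alongside 10 and 11, so it no longer dominates the scale of the column. With heavily tied data the counts can only be approximately equal.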

Examples of binning

The following diagram shows the distribution of numeric values before and after binning with the quantiles method. Notice that compared to the raw data at left, the data has been binned and transformed to a unit-normal scale.

You can find an example in the result of this pipeline run.

Because there are so many ways to group data, all customizable, we recommend that you experiment with different methods and values.

How to configure Group Data into Bins

  1. Add the Group Data into Bins module to your pipeline in the designer. You can find this module in the Data Transformation category.

  2. Connect the dataset that has numerical data to bin. Quantization can be applied only to columns that contain numeric data.

    If the dataset contains non-numeric columns, use the Select Columns in Dataset module to select a subset of columns to work with.

  3. Specify the binning mode. The binning mode determines other parameters, so be sure to select the Binning mode option first. The following types of binning are supported:

    • Quantiles: The quantile method assigns values to bins based on percentile ranks. This method is also known as equal height binning.

    • Equal Width: With this option, you must specify the total number of bins. The values from the data column are placed in the bins such that each bin has the same interval between starting and ending values. As a result, some bins might have more values if data is clumped around a certain point.

    • Custom Edges: You can specify the values that begin each bin. The edge value is always the lower boundary of the bin.

      For example, assume you want to group values into two bins. One will have values greater than 0, and one will have values less than or equal to 0. In this case, for bin edges, you enter 0 in Comma-separated list of bin edges. The output of the module will be 1 and 2, indicating the bin index for each row value. Note that the comma-separated value list must be in ascending order, such as 1, 3, 5, 7.

  4. If you're using the Quantiles or Equal Width binning mode, use the Number of bins option to specify how many bins, or quantiles, you want to create.

  5. For Columns to bin, use the column selector to choose the columns that have the values you want to bin. Columns must be a numeric data type.

    The same binning rule is applied to all applicable columns that you choose. If you need to bin some columns by using a different method, use a separate instance of the Group Data into Bins module for each set of columns.

    Warning

    If you choose a column that's not an allowed type, a runtime error is generated. The module returns an error as soon as it finds any column of a disallowed type. The error does not list all invalid columns, so if you get an error, review all selected columns.

  6. For Output mode, indicate how you want to output the quantized values:

    • Append: Creates a new column with the binned values, and appends that to the input table.

    • Inplace: Replaces the original values with the new values in the dataset.

    • ResultOnly: Returns just the result columns.

  7. If you select the Quantiles binning mode, use the Quantile normalization option to determine how values are normalized before sorting into quantiles. Note that normalizing values transforms the values but doesn't affect the final number of bins.

    The following normalization types are supported:

    • Percent: Values are normalized within the range [0,100].

    • PQuantile: Values are normalized within the range [0,1].

    • QuantileIndex: Values are normalized within the range [1,number of bins].

  8. If you choose the Custom Edges option, enter a comma-separated list of numbers to use as bin edges in the Comma-separated list of bin edges text box.

    The values mark the points that divide the bins. For example, if you enter one bin edge value, two bins will be generated. If you enter two bin edge values, three bins will be generated.

    The values must be sorted in the order that the bins are created, from lowest to highest.

  9. Select the Tag columns as categorical option to indicate that the quantized columns should be handled as categorical variables.

  10. Submit the pipeline.
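
To make the Equal Width and Custom Edges modes from the steps above more concrete, here is a minimal sketch in plain Python. It is not the designer's code: `equal_width_edges` and `assign_bins` are hypothetical helpers, and the handling of a value that equals an edge (it goes to the lower bin) is an assumption that follows the "greater than 0" versus "less than or equal to 0" example in step 3.

```python
from bisect import bisect_left


def equal_width_edges(values, n_bins):
    """Split the range [min, max] into n_bins intervals of identical width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    return [lo + k * width for k in range(1, n_bins)]


def assign_bins(values, edges):
    """Map each value to a 1-based bin index; n edges produce n + 1 bins."""
    if list(edges) != sorted(edges):
        raise ValueError("bin edges must be in ascending order")
    return [bisect_left(edges, v) + 1 for v in values]


# Custom Edges: a single edge at 0 produces two bins.
custom_result = assign_bins([-5, 0, 0.5, 12], [0])

# Equal Width: data clumped near 0 fills the first bin unevenly.
data = [0, 1, 2, 3, 4, 10]
width_edges = equal_width_edges(data, 2)
width_result = assign_bins(data, width_edges)

print(custom_result)
print(width_edges, width_result)
```

The equal-width run shows the clumping effect described in step 3: five of the six values land in the first bin because the single large value stretches the range.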

Results

The Group Data into Bins module returns a dataset in which each element has been binned according to the specified mode.

It also returns a binning transformation. That function can be passed to the Apply Transformation module to bin new samples of data by using the same binning mode and parameters.

Tip

If you use binning on your training data, you must use the same binning method on data that you use for testing and prediction. You must also use the same bin locations and bin widths.

To ensure that data is always transformed by using the same binning method, we recommend that you save useful data transformations. Then apply them to other datasets by using the Apply Transformation module.
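
The reason for reusing a saved transformation can be illustrated outside the designer with a small sketch. The `BinningTransform` class below is hypothetical and is not the format that the module or Apply Transformation uses; it only shows the principle that bin edges are computed once from training data, stored, and then applied unchanged to new data.

```python
import json
from bisect import bisect_left


class BinningTransform:
    """Hypothetical container for bin edges learned from training data."""

    def __init__(self, edges):
        self.edges = list(edges)

    @classmethod
    def fit_equal_width(cls, values, n_bins):
        # Learn the edges from the training values only.
        lo, hi = min(values), max(values)
        width = (hi - lo) / n_bins
        return cls(lo + k * width for k in range(1, n_bins))

    def apply(self, values):
        # Reuse the stored edges; never recompute them from new data.
        return [bisect_left(self.edges, v) + 1 for v in values]

    def to_json(self):
        return json.dumps({"edges": self.edges})

    @classmethod
    def from_json(cls, text):
        return cls(json.loads(text)["edges"])


train = [0, 2, 4, 6, 8, 10]
transform = BinningTransform.fit_equal_width(train, 5)

saved = transform.to_json()                   # persist the transformation
restored = BinningTransform.from_json(saved)  # reload it later
new_bins = restored.apply([1, 9, 25])         # same edges, new samples

print(restored.edges, new_bins)
```

Note that the value 25 lies outside the training range but still maps to the last bin, because the edges are fixed at fit time rather than recalculated from the new samples.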

Next steps

See the set of modules available to Azure Machine Learning.