Clip Values

This article describes a module of Azure Machine Learning designer (preview).

Use the Clip Values module to identify data values that are above or below a specified threshold and, optionally, replace them with a mean, a constant, or another substitute value.

You connect the module to a dataset that has the numbers you want to clip, choose the columns to work with, and then set a threshold or range of values and a replacement method. The module can output either just the results, or the changed values appended to the original dataset.

How to configure Clip Values

Before you begin, identify the columns you want to clip and the method to use. We recommend that you test any clipping method on a small subset of data first.

The module applies the same criteria and replacement method to all columns that you include in the selection. Therefore, be sure to exclude columns that you don't want to change.

If you need to apply different clipping methods or different criteria to some columns, you must use a new instance of Clip Values for each set of similar columns.

  1. Add the Clip Values module to your pipeline and connect it to the dataset you want to modify. You can find this module under Data Transformation, in the Scale and Reduce category.

  2. In List of columns, use the Column Selector to choose the columns to which Clip Values will be applied.

  3. For Set of thresholds, choose one of the following options from the dropdown list. These options determine how you set the upper and lower boundaries that separate acceptable values from values that must be clipped.

    • ClipPeaks: When you clip values by peaks, you specify only an upper boundary. Values greater than that boundary value are replaced.

    • ClipSubpeaks: When you clip values by subpeaks, you specify only a lower boundary. Values that are less than that boundary value are replaced.

    • ClipPeaksAndSubpeaks: When you clip values by peaks and subpeaks, you can specify both the upper and lower boundaries. Values that are outside that range are replaced. Values that match the boundary values are not changed.

  4. Depending on your selection in the preceding step, you can set the following threshold values:

    • Lower threshold: Displayed only if you choose ClipSubPeaks
    • Upper threshold: Displayed only if you choose ClipPeaks
    • Threshold: Displayed only if you choose ClipPeaksAndSubPeaks

    For each threshold type, choose either Constant or Percentile.

  5. If you select Constant, type the maximum or minimum value in the text box. For example, assume that you know the value 999 was used as a placeholder value. You could choose Constant for the upper threshold, and type 999 in Constant value for upper threshold.

  6. If you choose Percentile, you constrain the column values to a percentile range.

    For example, assume you want to keep only the values in the 10-80 percentile range, and replace all others. You would choose Percentile, and then type 10 for Percentile value for lower threshold, and type 80 for Percentile value for upper threshold.

    See the section on percentiles for some examples of how to use percentile ranges.

  7. Define a substitute value.

    Numbers that exactly match the boundaries you specified are considered to be inside the allowed range of values, and thus are not replaced. All numbers that fall outside the specified range are replaced with the substitute value.

    • Substitute value for peaks: Defines the value to substitute for all column values that are greater than the specified threshold.
    • Substitute value for subpeaks: Defines the value to substitute for all column values that are less than the specified threshold.
    • If you use the ClipPeaksAndSubpeaks option, you can specify separate replacement values for the upper and lower clipped values.

    The following replacement values are supported:

    • Threshold: Replaces clipped values with the specified threshold value.

    • Mean: Replaces clipped values with the mean of the column values. The mean is computed before values are clipped.

    • Median: Replaces clipped values with the median of the column values. The median is computed before values are clipped.

    • Missing: Replaces clipped values with the missing (empty) value.

  8. Add indicator columns: Select this option if you want to generate a new column that tells you whether the specified clipping operation was applied to the data in that row. This option is useful when you are testing a new set of clipping and substitution values.

  9. Overwrite flag: Indicate how you want the new values to be generated. By default, Clip Values constructs a new column with the peak values clipped to the desired threshold. New values overwrite the original column.

    To keep the original column and add a new column with the clipped values, deselect this option.

  10. Submit the pipeline.

    Right-click the Clip Values module and select Visualize. Alternatively, select the module, switch to the Outputs tab in the right panel, and click the histogram icon under Port outputs. Review the values to make sure the clipping operation met your expectations.
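The clipping rules described in the steps above can be sketched in plain Python. This is only an illustration of the stated behavior with constant thresholds, not the module's implementation; the function name `clip_column` and its parameters are hypothetical and chosen for this example.

```python
from statistics import mean, median

def clip_column(values, lower=None, upper=None,
                peak_substitute="threshold", subpeak_substitute="threshold"):
    """Clip one numeric column; return (clipped_values, indicators).

    Hypothetical helper: lower/upper are constant thresholds (None disables
    that side). Each substitute mode is one of "threshold", "mean",
    "median", or "missing" (represented here as None).
    """
    # Mean and median come from the original values, before any clipping.
    stats = {"mean": mean(values), "median": median(values), "missing": None}

    def substitute(mode, threshold):
        return threshold if mode == "threshold" else stats[mode]

    clipped, indicators = [], []
    for v in values:
        if upper is not None and v > upper:      # peak: strictly above the boundary
            clipped.append(substitute(peak_substitute, upper))
            indicators.append(True)
        elif lower is not None and v < lower:    # subpeak: strictly below the boundary
            clipped.append(substitute(subpeak_substitute, lower))
            indicators.append(True)
        else:                                    # boundary values are left unchanged
            clipped.append(v)
            indicators.append(False)
    return clipped, indicators

data = [1, 5, 10, 999]
values, flags = clip_column(data, lower=5, upper=10)
print(values)  # [5, 5, 10, 10]
print(flags)   # [True, False, False, True]
```

The second list plays the role of the indicator column. Returning a new list instead of modifying `data` corresponds to deselecting the Overwrite flag, so the original column stays available for comparison.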

Examples for clipping using percentiles

To understand how clipping by percentiles works, consider a dataset with 10 rows, which have one instance each of the values 1-10.

  • If you are using percentile as the upper threshold, at the value for the 90th percentile, 90 percent of all values in the dataset must be less than that value.

  • If you are using percentile as the lower threshold, at the value for the 10th percentile, 10 percent of all values in the dataset must be less than that value.
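The property stated in these two bullets can be checked with a few lines of Python. This is a minimal sketch of the definition only; `fraction_below` is a hypothetical helper, and the text does not say how the module interpolates between data points, so the thresholds the module actually computes may differ.

```python
def fraction_below(values, x):
    """Fraction of values that are strictly less than x."""
    return sum(1 for v in values if v < x) / len(values)

values = list(range(1, 11))  # the ten values 1-10, one row each

print(fraction_below(values, 10))  # 0.9
print(fraction_below(values, 2))   # 0.1
```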

  1. For Set of thresholds, choose ClipPeaksAndSubPeaks.

  2. For Upper threshold, choose Percentile, and for Percentile number, type 90.

  3. For Upper substitute value, choose Missing Value.

  4. For Lower threshold, choose Percentile, and for Percentile number, type 10.

  5. For Lower substitute value, choose Missing Value.

  6. Deselect the Overwrite flag option, and select the Add indicator column option.

Now try the same pipeline using 60 as the upper percentile threshold and 30 as the lower percentile threshold, and use the threshold value as the replacement value. The following table compares these two results:

  1. Replace with missing; Upper threshold = 90; Lower threshold = 20

  2. Replace with threshold; Upper percentile = 60; Lower percentile = 40

Original data    Replace with missing    Replace with threshold
1                TRUE                    4, TRUE
2                TRUE                    4, TRUE
3                3, FALSE                4, TRUE
4                4, FALSE                4, TRUE
5                5, FALSE                5, FALSE
6                6, FALSE                6, FALSE
7                7, FALSE                7, TRUE
8                8, FALSE                7, TRUE
9                9, FALSE                7, TRUE
10               TRUE                    7, TRUE

Next steps

See the set of modules available to Azure Machine Learning.