Filter Based Feature Selection

This article describes how to use the Filter Based Feature Selection module in Azure Machine Learning designer (preview). This module helps you identify the columns in your input dataset that have the greatest predictive power.

In general, feature selection refers to the process of applying statistical tests to inputs, given a specified output. The goal is to determine which columns are more predictive of the output. The Filter Based Feature Selection module provides multiple feature selection algorithms to choose from. The module includes correlation methods such as Pearson correlation and chi-squared values.

When you use the Filter Based Feature Selection module, you provide a dataset and identify the column that contains the label or dependent variable. You then specify a single method to use in measuring feature importance.

The module outputs a dataset that contains the best feature columns, as ranked by predictive power. It also outputs the names of the features and their scores from the selected metric.

What filter-based feature selection is

This module for feature selection is called "filter-based" because you use the selected metric to find irrelevant attributes. You then filter out redundant columns from your model. You choose a single statistical measure that suits your data, and the module calculates a score for each feature column. The columns are returned ranked by their feature scores.

By choosing the right features, you can potentially improve the accuracy and efficiency of classification.

You typically use only the columns with the best scores to build your predictive model. Columns with poor feature selection scores can be left in the dataset and ignored when you build a model.

How to choose a feature selection metric

The Filter Based Feature Selection module provides a variety of metrics for assessing the information value in each column. This section provides a general description of each metric and how it's applied. You can find additional requirements for using each metric in the technical notes and in the instructions for configuring the module.

  • Pearson correlation

    Pearson's correlation statistic, or Pearson's correlation coefficient, is also known in statistical models as the r value. For any two variables, it returns a value that indicates the strength of the correlation.

    Pearson's correlation coefficient is computed by taking the covariance of two variables and dividing by the product of their standard deviations. Changes of scale in the two variables don't affect the coefficient.

  • Chi squared

    The two-way chi-squared test is a statistical method that measures how close expected values are to actual results. The method assumes that variables are random and drawn from an adequate sample of independent variables. The resulting chi-squared statistic indicates how far results are from the expected (random) result.
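
As a minimal sketch of the two metrics described above, and not the module's actual implementation, the following pure-Python functions compute Pearson's r as covariance divided by the product of standard deviations, and the chi-squared statistic from observed and expected cell counts. The function names pearson_r and chi_squared are hypothetical, and returning 0.0 for a constant column is an assumption made here for convenience.

```python
import math
from collections import Counter


def pearson_r(xs, ys):
    """Pearson's r: covariance of xs and ys over the product of their standard deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # The 1/n factors in the covariance and the standard deviations cancel,
    # so unnormalized sums are used throughout.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # assumption: a constant column carries no information
    return cov / math.sqrt(var_x * var_y)


def chi_squared(feature, label):
    """Two-way chi-squared statistic for two categorical columns."""
    n = len(feature)
    observed = Counter(zip(feature, label))  # actual count per (feature, label) cell
    feature_totals = Counter(feature)
    label_totals = Counter(label)
    stat = 0.0
    for f_value, f_count in feature_totals.items():
        for l_value, l_count in label_totals.items():
            # Count expected in this cell if the two columns were independent.
            expected = f_count * l_count / n
            stat += (observed[(f_value, l_value)] - expected) ** 2 / expected
    return stat


xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 6.0, 8.0, 10.0]
r_original = pearson_r(xs, ys)
# Rescaling and shifting one variable by a positive factor leaves r unchanged.
r_rescaled = pearson_r([10.0 * x + 3.0 for x in xs], ys)

dependent = chi_squared(["a", "a", "b", "b"], ["x", "x", "y", "y"])
independent = chi_squared(["a", "a", "b", "b"], ["x", "y", "x", "y"])
```

A larger chi-squared value means the observed counts are further from what independence would predict, so the feature is more informative about the label.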

Tip

If you need an option that isn't offered here, such as a custom feature selection method, use the Execute R Script module.

How to configure Filter Based Feature Selection

You choose a standard statistical metric. The module computes the correlation between a pair of columns: the label column and a feature column.

  1. Add the Filter Based Feature Selection module to your pipeline. You can find it in the Feature Selection category in the designer.

  2. Connect an input dataset that contains at least two columns that are potential features.

    To ensure that a column is analyzed and a feature score is generated, use the Edit Metadata module to set the IsFeature attribute.

    Important

    Ensure that the columns that you're providing as input are potential features. For example, a column that contains a single value has no information value.

    If you know that some columns would make bad features, you can remove them from the column selection. You can also use the Edit Metadata module to flag them as Categorical.

  3. For Feature scoring method, choose one of the following established statistical methods to use in calculating scores.

    Method               Requirements
    Pearson correlation  Label can be text or numeric. Features must be numeric.
    Chi squared          Labels and features can be text or numeric. Use this method for computing feature importance for two categorical columns.

    Tip

    If you change the selected metric, all other selections are reset. So be sure to set this option first.

  4. Select the Operate on feature columns only option to generate a score only for columns that were previously marked as features.

    If you clear this option, the module creates a score for any column that otherwise meets the criteria, up to the number of columns specified in Number of desired features.

  5. For Target column, select Launch column selector to choose the label column either by name or by its index. (Indexes are one-based.)

    A label column is required for all methods that involve statistical correlation. The module returns a design-time error if you choose no label column or multiple label columns.

  6. For Number of desired features, enter the number of feature columns that you want returned as a result:

    • The minimum number of features that you can specify is one, but we recommend that you increase this value.

    • If the specified number of desired features is greater than the number of columns in the dataset, then all features are returned. Even features with zero scores are returned.

    • If you specify fewer result columns than there are feature columns, the features are ranked by descending score. Only the top features are returned.

  7. Run the pipeline, or select the Filter Based Feature Selection module and then select Run selected.

Results

After processing is complete:

  • To see a complete list of the feature columns that were analyzed, and their scores, right-click the module. Select Features, and then select Visualize.

  • To view the dataset that's generated based on your feature selection criteria, right-click the module. Select Dataset, and then select Visualize.

If the dataset contains fewer columns than you expected, check the module settings. Also check the data types of the columns provided as input. For example, if you set Number of desired features to 1, the output dataset contains just two columns: the label column, and the most highly ranked feature column.
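
The relationship between Number of desired features and the columns in the output dataset can be illustrated with a short sketch. This is not the module's code: the function name select_top_features, the example column names, and the example scores are all hypothetical, and the sketch assumes that a higher score always means a better feature.

```python
def select_top_features(scores, label, k):
    """Return the label column plus the k highest-scoring feature names, best first.

    scores maps each feature column name to its score (assumed: higher is better).
    """
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    # Slicing past the end of the list is safe, so a k larger than the number
    # of features returns every feature, including those with a zero score.
    return [label] + [name for name, _ in ranked[:k]]


# Hypothetical scores for three feature columns.
scores = {"age": 0.12, "income": 0.87, "zip_code": 0.0}

one_feature = select_top_features(scores, "target", 1)
all_features = select_top_features(scores, "target", 10)
```

With k set to 1, the result holds two columns: the label and the single best feature. With k larger than the number of features, every feature is returned in descending score order.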

Technical notes

Implementation details

If you use Pearson correlation on a numeric feature and a categorical label, the feature score is calculated as follows:

  1. For each level in the categorical column, compute the conditional mean of the numeric column.

  2. Correlate the column of conditional means with the numeric column.
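
The two steps above can be sketched in plain Python as follows. This is an illustration under stated assumptions rather than the module's implementation: the function names are hypothetical, and returning 0.0 when either column is constant is a choice made for this sketch.

```python
import math
from collections import defaultdict


def pearson_r(xs, ys):
    """Pearson's r: covariance over the product of standard deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # assumption: a constant column gets a zero score
    return cov / math.sqrt(var_x * var_y)


def pearson_with_categorical_label(feature, label):
    """Score a numeric feature against a categorical label."""
    # Step 1: conditional mean of the numeric column for each label level.
    groups = defaultdict(list)
    for value, level in zip(feature, label):
        groups[level].append(value)
    level_mean = {level: sum(vals) / len(vals) for level, vals in groups.items()}

    # Step 2: replace each label with its level's mean, then correlate that
    # column of conditional means with the original numeric column.
    conditional_means = [level_mean[level] for level in label]
    return pearson_r(conditional_means, feature)


feature = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
label = ["low", "low", "low", "high", "high", "high"]
score = pearson_with_categorical_label(feature, label)

# A feature whose mean is the same in every level gets a zero score here.
flat_score = pearson_with_categorical_label([1.0, 2.0, 1.0, 2.0], ["a", "a", "b", "b"])
```

In the first example, the feature values separate cleanly by label level, so the score is close to 1. In the second, both levels have the same mean, so the column of conditional means is constant and the sketch returns 0.0.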

Requirements

  • A feature selection score can't be generated for any column that's designated as a Label or Score column.

  • If you try to use a scoring method with a column of a data type that the method doesn't support, the module either raises an error or assigns a zero score to the column.

  • If a column contains logical (true/false) values, they're processed as True = 1 and False = 0.

  • A column can't be a feature if it has been designated as a Label or a Score.

How missing values are handled

  • You can't specify as a target (label) column any column that has all missing values.

  • If a column contains missing values, the module ignores them when it's computing the score for the column.

  • If a column designated as a feature column has all missing values, the module assigns a zero score.
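
The rules above can be pictured with a small sketch. It assumes that "ignoring" a missing value means dropping the rows where the feature or label is missing before scoring, which is an interpretation and not something the module documentation confirms. Missing values are represented as None, and the function names are hypothetical. The conversion with float() also treats logical values as True = 1 and False = 0.

```python
import math


def pearson_r(xs, ys):
    """Pearson's r: covariance over the product of standard deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # assumption: a constant column gets a zero score
    return cov / math.sqrt(var_x * var_y)


def score_feature(feature, label):
    """Score one feature column, skipping rows with a missing (None) value."""
    # float() maps True to 1.0 and False to 0.0.
    pairs = [
        (float(x), float(y))
        for x, y in zip(feature, label)
        if x is not None and y is not None
    ]
    if not pairs:
        return 0.0  # a feature column with all values missing scores zero
    xs, ys = zip(*pairs)
    return pearson_r(xs, ys)


all_missing = score_feature([None, None, None], [1.0, 2.0, 3.0])
partly_missing = score_feature([1.0, None, 3.0, 4.0], [2.0, 5.0, 6.0, 8.0])
logical = score_feature([True, False, True, False], [1.0, 0.0, 1.0, 0.0])
```

In the second call, the row with the missing feature value is dropped, and the remaining three rows are perfectly correlated.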

Next steps

See the set of modules available to Azure Machine Learning.