# Filter Based Feature Selection

## How to choose a feature selection metric

The Filter-Based Feature Selection module provides a variety of metrics for assessing the information value in each column. This section gives a general description of each metric and how it's applied. Additional requirements for using each metric are listed in the technical notes and in the instructions for configuring the module.

• Pearson correlation

Pearson's correlation statistic, or Pearson's correlation coefficient, is also known in statistical models as the `r` value. For any two variables, it returns a value that indicates the strength of the correlation.

Pearson's correlation coefficient is computed by taking the covariance of two variables and dividing by the product of their standard deviations. Changes of scale in the two variables don't affect the coefficient.
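As a rough illustration of the definition above, the following sketch computes `r` in plain Python and checks that rescaling one variable leaves it unchanged. The function name `pearson_r` and the sample data are hypothetical and are not part of the module.

```python
import math

def pearson_r(xs, ys):
    """Covariance of xs and ys divided by the product of their standard deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / n)
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / n)
    return cov / (std_x * std_y)

# Hypothetical sample data, for illustration only.
height_cm = [150.0, 160.0, 170.0, 180.0, 190.0]
weight_kg = [52.0, 58.0, 69.0, 77.0, 90.0]

r = pearson_r(height_cm, weight_kg)
# Converting centimeters to meters is a (positive) change of scale.
r_rescaled = pearson_r([h / 100 for h in height_cm], weight_kg)
print(round(r, 4), round(r_rescaled, 4))
```

Note that the invariance shown here holds for positive rescaling; multiplying one variable by a negative factor flips the sign of `r`.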

• Chi squared

The two-way chi-squared test is a statistical method that measures how close expected values are to actual results. The method assumes that variables are random and drawn from an adequate sample of independent variables. The resulting chi-squared statistic indicates how far the actual results are from the expected (random) result.
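To make the idea concrete, here is a minimal sketch of the standard chi-squared statistic for two categorical columns, where the expected count of each cell is the count implied by independence. It illustrates the general technique and is not necessarily how the module computes its score; the function name and data are hypothetical.

```python
from collections import Counter

def chi_squared(xs, ys):
    """Chi-squared statistic for the contingency table of two categorical columns."""
    n = len(xs)
    observed = Counter(zip(xs, ys))  # cell counts; missing cells count as 0
    row_totals = Counter(xs)
    col_totals = Counter(ys)
    stat = 0.0
    for x_level, row_total in row_totals.items():
        for y_level, col_total in col_totals.items():
            # Count expected in this cell if the two columns were independent.
            expected = row_total * col_total / n
            stat += (observed[(x_level, y_level)] - expected) ** 2 / expected
    return stat

# Hypothetical sample data, for illustration only.
color = ["red", "red", "blue", "blue", "red", "blue", "red", "blue"]
label = ["yes", "yes", "no", "no", "yes", "no", "no", "yes"]
print(chi_squared(color, label))
```

A statistic of zero means the observed counts match the independence expectation exactly; larger values indicate a stronger departure from it.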

## How to configure Filter-Based Feature Selection

1. Add the Filter-Based Feature Selection module to your pipeline. You can find it in the Feature Selection category in the designer.

2. Connect an input dataset that contains at least two columns that are potential features.

To ensure that a column is analyzed and a feature score is generated, use the Edit Metadata module to set the IsFeature attribute.

Important

Ensure that the columns that you're providing as input are potential features. For example, a column that contains a single value has no information value.

If you know that some columns would make bad features, you can remove them from the column selection. You can also use the Edit Metadata module to flag them as Categorical.

3. For Feature scoring method, choose one of the following established statistical methods to use in calculating scores.

| Method | Requirements |
| --- | --- |
| Pearson correlation | Label can be text or numeric. Features must be numeric. |
| Chi squared | Labels and features can be text or numeric. Use this method for computing feature importance for two categorical columns. |

Tip

If you change the selected metric, all other selections are reset, so be sure to set this option first.

4. Select the Operate on feature columns only option to generate a score only for columns that were previously marked as features.

If you clear this option, the module creates a score for any column that otherwise meets the criteria, up to the number of columns specified in Number of desired features.

5. For Target column, select Launch column selector to choose the label column either by name or by its index. (Indexes are one-based.)
A label column is required for all methods that involve statistical correlation. The module returns a design-time error if you choose no label column or multiple label columns.

6. For Number of desired features, enter the number of feature columns that you want returned as a result:

• The minimum number of features that you can specify is one, but we recommend that you increase this value.

• If the specified number of desired features is greater than the number of columns in the dataset, all features are returned, even those with zero scores.

• If you specify fewer result columns than there are feature columns, the features are ranked by descending score and only the top features are returned.

7. Submit the pipeline, or select the Filter Based Feature Selection module and then select Run selected.
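The selection rules in step 6 can be sketched as a rank-and-truncate operation. This is a simplified illustration under the assumption that each feature already has a score; `select_top_features` and the score values are hypothetical names and data, not the module's API.

```python
def select_top_features(scores, num_desired):
    """Rank features by descending score and keep at most num_desired of them."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    # Slicing past the end of the list simply returns every feature.
    return [name for name, _ in ranked[:num_desired]]

# Hypothetical feature scores, for illustration only.
scores = {"age": 0.42, "income": 0.77, "zip_code": 0.0, "tenure": 0.15}

top_two = select_top_features(scores, 2)
everything = select_top_features(scores, 10)
print(top_two)
print(everything)
```

Asking for more features than exist returns all of them, including the zero-score column, which mirrors the behavior described in step 6.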

## Results

• To see a complete list of the analyzed feature columns and their scores, right-click the module and select Visualize.

• To view the dataset based on your feature selection criteria, right-click the module and select Visualize.

## Technical notes

### Implementation details

1. For each level in the categorical column, compute the conditional mean of the numeric column.

2. Correlate the column of conditional means with the numeric column.
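The two steps above can be sketched as follows, assuming they describe how a numeric column is scored against a categorical column using Pearson correlation. The helper names and data are hypothetical, and the module's actual implementation may differ.

```python
import math
from collections import defaultdict

def pearson_r(xs, ys):
    """Pearson correlation; the 1/n factors cancel, so raw sums are used."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def score_numeric_vs_categorical(numeric, categorical):
    # Step 1: conditional mean of the numeric column for each categorical level.
    groups = defaultdict(list)
    for value, level in zip(numeric, categorical):
        groups[level].append(value)
    level_means = {level: sum(vals) / len(vals) for level, vals in groups.items()}
    # Step 2: replace each level with its conditional mean, then correlate
    # that column with the original numeric column.
    conditional_means = [level_means[level] for level in categorical]
    return pearson_r(conditional_means, numeric)

# Hypothetical sample data, for illustration only.
numeric = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
categorical = ["low", "low", "low", "high", "high", "high"]
score = score_numeric_vs_categorical(numeric, categorical)
print(round(score, 4))
```

Replacing each level with its conditional mean turns the categorical column into a numeric one, so the ordinary correlation coefficient can be applied without imposing an arbitrary ordering on the levels.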

### Requirements

• A feature selection score can't be generated for any column that's designated as a Label or Score column.

• If you try to use a scoring method with a column of a data type that the method doesn't support, the module either raises an error or assigns a zero score to the column.

• If a column contains logical (true/false) values, they're processed as `True = 1` and `False = 0`.

• A column can't be a feature if it has been designated as a Label or a Score.

### How missing values are handled

• A column in which every value is missing can't be specified as the target (label) column.

• If a column contains missing values, the module ignores them when it's computing the score for the column.

• If a column designated as a feature column has all missing values, the module assigns a zero score.
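The last two rules can be approximated with the sketch below. It assumes that missing values are represented as `None`, that "ignoring" a missing value means dropping the row for that feature only, and that Pearson correlation is the scoring method; all three are assumptions made for illustration, and `score_feature` is a hypothetical name.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation; returns 0.0 when either column is constant (assumption)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    denom = math.sqrt(sum((x - mean_x) ** 2 for x in xs) *
                      sum((y - mean_y) ** 2 for y in ys))
    return cov / denom if denom else 0.0

def score_feature(feature, label):
    # Ignore rows where the feature value is missing (None is an assumed encoding).
    pairs = [(f, l) for f, l in zip(feature, label) if f is not None]
    if not pairs:
        return 0.0  # a feature column with all values missing gets a zero score
    xs, ys = zip(*pairs)
    return pearson_r(xs, ys)

# Hypothetical sample data, for illustration only.
label = [2.0, 5.0, 6.0, 8.0]
partly_missing = [1.0, None, 3.0, 4.0]
all_missing = [None, None, None, None]

print(score_feature(partly_missing, label))
print(score_feature(all_missing, label))
```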