Feature selection in the Team Data Science Process (TDSP)

This article explains the purposes of feature selection and provides examples of its role in the data enhancement process of machine learning. These examples are drawn from Azure Machine Learning Studio.

The engineering and selection of features is one part of the Team Data Science Process (TDSP) outlined in the article "What is the Team Data Science Process?" Feature engineering and selection are parts of the Develop features step of the TDSP.

  • Feature engineering: This process attempts to create additional relevant features from the existing raw features in the data, and to increase the predictive power of the learning algorithm.
  • Feature selection: This process selects the key subset of original data features in an attempt to reduce the dimensionality of the training problem.

Normally, feature engineering is applied first to generate additional features, and then the feature selection step is performed to eliminate irrelevant, redundant, or highly correlated features.

Filter features from your data - feature selection

Feature selection may be used for classification or regression tasks. The goal is to select a subset of the features from the original dataset that reduces its dimensions by using a minimal set of features to represent the maximum amount of variance in the data. This subset of features is used to train the model. Feature selection serves two main purposes.

  • First, feature selection often increases classification accuracy by eliminating irrelevant, redundant, or highly correlated features.
  • Second, it decreases the number of features, which makes the model training process more efficient. Efficiency is important for learners that are expensive to train, such as support vector machines.

Although feature selection does seek to reduce the number of features in the dataset used to train the model, it is not referred to by the term "dimensionality reduction". Feature selection methods extract a subset of the original features in the data without changing them. Dimensionality reduction methods employ engineered features that transform the original features and thus modify them. Examples of dimensionality reduction methods include principal component analysis, canonical correlation analysis, and singular value decomposition.
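The distinction can be illustrated with a minimal Python sketch. It is not taken from Azure Machine Learning Studio: the function names and the toy data are hypothetical, and the reduction step is a closed-form first principal component that works only for two-feature data. Selection returns original values untouched, while the projection produces new values that are combinations of both columns.

```python
import math


def select_columns(rows, keep):
    """Feature selection: keep a subset of the original columns, values unchanged."""
    return [[row[i] for i in keep] for row in rows]


def first_principal_component(rows):
    """Dimensionality reduction for two-feature data (illustrative only).

    Projects each centered row onto the direction of maximum variance,
    using the closed-form principal axis of a 2x2 covariance matrix.
    """
    n = len(rows)
    mx = sum(r[0] for r in rows) / n
    my = sum(r[1] for r in rows) / n
    sxx = sum((r[0] - mx) ** 2 for r in rows) / (n - 1)
    syy = sum((r[1] - my) ** 2 for r in rows) / (n - 1)
    sxy = sum((r[0] - mx) * (r[1] - my) for r in rows) / (n - 1)
    # Angle of the major axis of the covariance ellipse.
    theta = 0.5 * math.atan2(2 * sxy, sxx - syy)
    ux, uy = math.cos(theta), math.sin(theta)
    return [(r[0] - mx) * ux + (r[1] - my) * uy for r in rows]


# Hypothetical toy data: two strongly related features.
rows = [[1.0, 2.0], [2.0, 4.1], [3.0, 5.9], [4.0, 8.0]]

selected = select_columns(rows, [0])         # original values of column 0
projected = first_principal_component(rows)  # new values mixing both columns
```

Both results have one dimension per row, but only `selected` can be traced back to a single original column.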

Among others, one widely applied category of feature selection methods in a supervised context is called "filter-based feature selection". By evaluating the correlation between each feature and the target attribute, these methods apply a statistical measure to assign a score to each feature. The features are then ranked by the score, which may be used to help set the threshold for keeping or eliminating a specific feature. Examples of the statistical measures used in these methods include Pearson correlation, mutual information, and the chi-squared test.
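The score, rank, and threshold steps can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the Studio module's implementation: it uses mutual information for discrete features as the scoring measure, and the feature names, data, and threshold value are all hypothetical.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Mutual information (in nats) between two discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(
        (c / n) * math.log(c * n / (px[x] * py[y]))
        for (x, y), c in pxy.items()
    )


def rank_features(columns, target):
    """Score every feature against the target and sort by descending score."""
    scores = {name: mutual_information(vals, target) for name, vals in columns.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


def filter_by_threshold(ranked, threshold):
    """Keep the names of features whose score reaches the threshold."""
    return [name for name, score in ranked if score >= threshold]


# Hypothetical toy data: one informative, one noisy, one constant feature.
target = [0, 0, 1, 1, 0, 1]
columns = {
    "informative": [0, 0, 1, 1, 0, 1],
    "noisy": [0, 0, 1, 1, 1, 0],
    "constant": [0, 0, 0, 0, 0, 0],
}

ranked = rank_features(columns, target)
kept = filter_by_threshold(ranked, threshold=0.1)  # threshold chosen for illustration
```

Swapping `mutual_information` for another scoring function changes the measure without changing the ranking and thresholding logic.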

In Azure Machine Learning Studio, there are modules provided for feature selection. As shown in the following figure, these modules include Filter-Based Feature Selection and Fisher Linear Discriminant Analysis.


Consider, for example, the use of the Filter-Based Feature Selection module. For convenience, continue using the text mining example. Assume that you want to build a regression model after a set of 256 features is created through the Feature Hashing module, and that the response variable is "Col1", which contains book review ratings ranging from 1 to 5. Set "Feature scoring method" to "Pearson Correlation", "Target column" to "Col1", and "Number of desired features" to 50. The Filter-Based Feature Selection module then produces a dataset containing 50 features together with the target attribute "Col1". The following figure shows the flow of this experiment and the input parameters:


The following figure shows the resulting datasets:


Each feature is scored based on the Pearson correlation between itself and the target attribute "Col1". The features with the top scores are kept.
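For reference, the standard sample Pearson correlation between feature j and the target can be written as follows. The notation is introduced here for illustration and does not come from the module documentation: x_{ij} is the value of feature j in row i, y_i is the value of "Col1" in row i, and bars denote column means over the n rows.

```latex
r_j = \frac{\sum_{i=1}^{n} (x_{ij} - \bar{x}_j)(y_i - \bar{y})}
           {\sqrt{\sum_{i=1}^{n} (x_{ij} - \bar{x}_j)^2}\,
            \sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}
```

The value lies between -1 and 1. Whether the module ranks by the signed value or by its magnitude is not stated in this article.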

The corresponding scores of the selected features are shown in the following figure:


By applying this Filter-Based Feature Selection module, 50 out of 256 features are selected because they are the features most correlated with the target variable "Col1", based on the scoring method "Pearson Correlation".
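The same 256-to-50 selection can be approximated outside Studio with a short Python sketch. Several things here are assumptions rather than facts about the module: the data is synthetic (random columns stand in for the hashed text features), ranking uses the absolute value of the correlation, and `pearson` and `select_top_k` are hypothetical helper names.

```python
import math
import random


def pearson(xs, ys):
    """Sample Pearson correlation; returns 0.0 for a constant column (assumption)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        return 0.0
    return cov / (sx * sy)


def select_top_k(features, target, k):
    """Return the indices of the k columns with the largest |r|, best first."""
    scores = [abs(pearson(col, target)) for col in features]
    order = sorted(range(len(features)), key=lambda j: scores[j], reverse=True)
    return order[:k], scores


random.seed(0)
n_rows, n_features, k = 200, 256, 50

# Synthetic stand-in for "Col1": ratings from 1 to 5.
target = [random.randint(1, 5) for _ in range(n_rows)]

# Synthetic stand-in for the 256 hashed features: mostly pure noise.
features = [[random.gauss(0, 1) for _ in range(n_rows)] for _ in range(n_features)]
# Make the first 10 columns depend on the target so there is signal to find.
for j in range(10):
    features[j] = [t + random.gauss(0, 1) for t in target]

selected, scores = select_top_k(features, target, k)
reduced = [features[j] for j in selected]  # 50 original columns, values unchanged
```

The ten columns built from the target rank first, and the remaining forty slots are filled by whichever noise columns happen to correlate most with it, which is a reminder that a fixed feature count can admit weak features.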


Engineered and selected features increase the efficiency of the training process, which attempts to extract the key information contained in the data. They also improve the power of these models to classify the input data accurately and to predict outcomes of interest more robustly. Feature engineering and selection can also combine to make the learning more computationally tractable. They do so by enhancing and then reducing the number of features needed to calibrate or train a model. Mathematically speaking, the features selected to train the model are a minimal set of independent variables that explain the patterns in the data and then predict outcomes successfully.

It is not always necessary to perform feature engineering or feature selection. Whether it is needed depends on the data collected, the algorithm selected, and the objective of the experiment.