Feature engineering in data science

In this article, you learn about feature engineering and its role in enhancing data in machine learning. Learn from illustrative examples drawn from Azure Machine Learning Studio (classic) experiments.

  • Feature engineering: The process of creating new features from raw data to increase the predictive power of the learning algorithm. Engineered features should capture additional information that is not easily apparent in the original feature set.
  • Feature selection: The process of selecting the key subset of features to reduce the dimensionality of the training problem.

Normally, feature engineering is applied first to generate additional features, and then feature selection is done to eliminate irrelevant, redundant, or highly correlated features.
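As a minimal illustration of that second step (a sketch using pandas, not part of any Studio (classic) experiment; the column names and threshold are illustrative), one common way to eliminate highly correlated features is to drop one member of each feature pair whose absolute correlation exceeds a threshold:

```python
import pandas as pd

# Hypothetical feature matrix: one engineered feature nearly duplicates another.
df = pd.DataFrame({
    "temp": [0.2, 0.4, 0.6, 0.8],
    "temp_feels_like": [0.21, 0.41, 0.59, 0.82],  # highly correlated with temp
    "humidity": [0.9, 0.5, 0.3, 0.7],
})

# Drop one feature from every pair whose absolute correlation exceeds 0.95.
corr = df.corr().abs()
to_drop = [
    col for i, col in enumerate(corr.columns)
    if (corr.iloc[i, :i] > 0.95).any()  # compare only against earlier columns
]
reduced = df.drop(columns=to_drop)
```

Here `temp_feels_like` is dropped because it is almost perfectly correlated with `temp`, while `humidity` survives; the threshold is a judgment call that depends on the data and the model.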

Feature engineering and selection are part of the modeling stage of the Team Data Science Process (TDSP). To learn more about the TDSP and the data science lifecycle, see What is the TDSP?

What is feature engineering?

Training data consists of a matrix composed of rows and columns. Each row in the matrix is an observation or record. The columns of each row are the features that describe each record. The features specified in the experimental design should characterize the patterns in the data.

Although many of the raw data fields can be used directly to train a model, it's often necessary to create additional (engineered) features for an enhanced training dataset.

Engineered features that enhance training provide information that better differentiates the patterns in the data. But this process is something of an art. Sound and productive decisions often require domain expertise.

Example 1: Add temporal features for a regression model

Let's use the experiment Demand forecasting of bikes rentals in Azure Machine Learning Studio (classic) to demonstrate how to engineer features for a regression task. The objective of this experiment is to predict the demand for bike rentals within a specific month/day/hour.

Bike rental dataset

The Bike Rental UCI dataset is based on real data from a bike share company based in the United States. It represents the number of bike rentals within a specific hour of a day for the years 2011 and 2012. It contains 17,379 rows and 17 columns.

The raw feature set contains weather conditions (temperature/humidity/wind speed) and the type of the day (holiday/weekday). The field to predict is the count, which represents the bike rentals within a specific hour. Count ranges from 1 to 977.

Create a feature engineering experiment

With the goal of constructing effective features in the training data, four regression models are built using the same algorithm but with four different training datasets. The four datasets represent the same raw input data, but with an increasing number of feature sets. These features are grouped into four categories:

  1. A = weather + holiday + weekday + weekend features for the predicted day
  2. B = number of bikes that were rented in each of the previous 12 hours
  3. C = number of bikes that were rented in each of the previous 12 days at the same hour
  4. D = number of bikes that were rented in each of the previous 12 weeks at the same hour and the same day

Besides feature set A, which already exists in the original raw data, the other three sets of features are created through the feature engineering process. Feature set B captures recent demand for the bikes. Feature set C captures the demand for bikes at a particular hour. Feature set D captures demand for bikes at a particular hour on a particular day of the week. The four training datasets include feature sets A, A+B, A+B+C, and A+B+C+D, respectively.
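The lag-based feature sets can also be sketched outside of Studio (classic). The following hypothetical pandas example (the column name and index are illustrative, not the dataset's actual schema) builds feature set B as 12 hourly lag features; the commented lines indicate how C and D extend the same idea:

```python
import pandas as pd

# Hypothetical hourly rental counts (the real dataset has 17,379 hourly rows).
hours = pd.date_range("2011-01-01", periods=24, freq="h")
df = pd.DataFrame({"cnt": range(100, 124)}, index=hours)

# Feature set B: rentals in each of the previous 12 hours (lag features).
for lag in range(1, 13):
    df[f"cnt_lag_{lag}h"] = df["cnt"].shift(lag)

# Feature set C would use daily lags at the same hour: shift(24 * d), d = 1..12.
# Feature set D would use weekly lags at the same hour and weekday:
# shift(24 * 7 * w), w = 1..12.
```

The first rows of each lag column are empty until enough history has accumulated; in practice those rows are dropped or imputed before training.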

Feature engineering using Studio (classic)

In the Studio (classic) experiment, these four training datasets are formed via four branches from the preprocessed input dataset. Except for the leftmost branch, each of these branches contains an Execute R Script module, in which the derived features (feature sets B, C, and D) are constructed and appended to the imported dataset.

The following figure shows the R script used to create feature set B in the second branch from the left.

Create features

Results

A comparison of the performance results of the four models is summarized in the following table:

Results comparison

The best results are shown by features A+B+C. The error rate decreases when additional feature sets are included in the training data. It verifies the presumption that feature sets B and C provide additional relevant information for the regression task. But adding feature set D does not seem to provide any further reduction in the error rate.

Example 2: Create features for text mining

Feature engineering is widely applied in tasks related to text mining, such as document classification and sentiment analysis. Since individual pieces of raw text usually serve as the input data, the feature engineering process is needed to create features involving word/phrase frequencies.

Feature hashing

To achieve this task, a technique called feature hashing is applied to efficiently turn arbitrary text features into indices. Instead of associating each text feature (a word or phrase) with a particular index, this method applies a hash function to the features and uses their hash values as indices directly.
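A minimal sketch of the idea in Python (the hash function and the 8-bit size are illustrative choices, not the Studio module's actual internals):

```python
import hashlib

def hash_feature_index(token: str, bits: int = 8) -> int:
    """Map an arbitrary word/phrase to one of 2**bits feature indices."""
    digest = hashlib.md5(token.encode("utf-8")).hexdigest()
    return int(digest, 16) % (2 ** bits)

# Each token maps straight to an index; no dictionary of words is stored.
tokens = ["great", "plot", "great plot", "boring"]
indices = [hash_feature_index(t) for t in tokens]
```

Because the mapping is deterministic, the same word or phrase always lands on the same index, and the total number of features is fixed up front regardless of vocabulary size. Distinct tokens can collide on one index; a larger bit size makes collisions less likely.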

In Studio (classic), there is a Feature Hashing module that creates word/phrase features conveniently. The following figure shows an example of using this module. The input dataset contains two columns: the book rating, ranging from 1 to 5, and the actual review content. The goal of this module is to retrieve a set of new features that show the occurrence frequency of the corresponding word(s)/phrase(s) within the particular book review. To use this module, complete the following steps:

  • First, select the column that contains the input text ("Col2" in this example).
  • Second, set the "Hashing bitsize" to 8, which means 2^8 = 256 features will be created. The words/phrases in all the text will be hashed to 256 indices. The "Hashing bitsize" parameter ranges from 1 to 31. The words/phrases are less likely to be hashed into the same index if the parameter is set to a larger number.
  • Third, set the "N-grams" parameter to 2. This value gets the occurrence frequency of unigrams (a feature for every single word) and bigrams (a feature for every pair of adjacent words) from the input text. The "N-grams" parameter ranges from 0 to 10, which indicates the maximum number of sequential words to be included in a feature.
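Outside of Studio (classic), roughly the same configuration can be sketched with scikit-learn's HashingVectorizer (this is an analogous tool, not the Studio module itself; the review texts are made up):

```python
from sklearn.feature_extraction.text import HashingVectorizer

# 2**8 = 256 output features (the "Hashing bitsize" of 8);
# ngram_range=(1, 2) yields unigrams and bigrams (the "N-grams" value of 2).
vectorizer = HashingVectorizer(
    n_features=2 ** 8,
    ngram_range=(1, 2),
    alternate_sign=False,  # keep raw occurrence counts non-negative
    norm=None,             # no normalization, so values are frequencies
)

reviews = ["an excellent book", "a boring book"]
features = vectorizer.transform(reviews)  # sparse matrix of shape (2, 256)
```

Each row holds the occurrence frequencies of that review's unigrams and bigrams at their hashed indices, which is the same kind of output the Feature Hashing module produces.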

Feature Hashing module

The following figure shows what these new features look like.

Feature Hashing example

Conclusion

Engineered and selected features increase the efficiency of the training process, which attempts to extract the key information contained in the data. They also improve the power of these models to classify the input data accurately and to predict outcomes of interest more robustly.

Feature engineering and selection can also be combined to make the learning more computationally tractable. They do so by enhancing and then reducing the number of features needed to calibrate or train a model. Mathematically, the selected features are a minimal set of independent variables that explain the patterns in the data and successfully predict outcomes.

It's not always necessary to perform feature engineering or feature selection. Whether it's needed depends on the data, the algorithm selected, and the objective of the experiment.

Next steps

To create features for data in specific environments, see the following articles: