# Feature engineering in data science

• Feature engineering: the process of creating new features from raw data to increase the predictive power of the learning algorithm. Engineered features should capture additional information that is not readily apparent in the original feature set.
• Feature selection: the process of selecting a key subset of features to reduce the dimensionality of the training problem.

## Example 1: Add temporal features for a regression model

### Create a feature engineering experiment

1. A = weather, holiday, weekday, and weekend features for the predicted day
2. B = number of bikes rented in each of the previous 12 hours
3. C = number of bikes rented in each of the previous 12 days at the same hour
4. D = number of bikes rented in each of the previous 12 weeks at the same hour and on the same day
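
Feature sets B, C, and D are all lagged copies of the rental count, differing only in the step between lags (1 hour, 24 hours, and 24 × 7 hours). The following is a minimal sketch of that idea in plain Python, not the experiment's actual implementation. It assumes the counts are stored as one value per consecutive hour with no gaps; the function name `lag_features` and the feature names are hypothetical.

```python
def lag_features(counts, t, n_lags=12):
    """Build feature sets B, C, and D for the hour at index t.

    counts: list of hourly rental counts, one entry per consecutive hour
            (assumption: no missing hours).
    Returns a dict of lagged values; a value is None when there is not
    enough history to compute that lag.
    """
    def value_at(i):
        return counts[i] if i >= 0 else None

    features = {}
    for k in range(1, n_lags + 1):
        features[f"B_hour_lag_{k}"] = value_at(t - k)           # previous k-th hour
        features[f"C_day_lag_{k}"] = value_at(t - 24 * k)       # same hour, k days ago
        features[f"D_week_lag_{k}"] = value_at(t - 24 * 7 * k)  # same hour and day, k weeks ago
    return features


# Synthetic data: the count at hour i is simply i, so each lag is easy to check.
counts = list(range(24 * 7 * 13))
row = lag_features(counts, t=24 * 7 * 12)
```

Set D needs 12 full weeks of history before every lag is defined, so earlier rows would either be dropped or have their missing lags imputed before training.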

## Example 2: Create features for text mining

### Feature hashing

• First, select the column that contains the input text ("Col2" in this example).
• Second, set "Hashing bitsize" to 8, which means that 2^8 = 256 features will be created. Every word or phrase in the text is hashed to one of 256 indices. The parameter "Hashing bitsize" ranges from 1 to 31. A larger bitsize makes it less likely that different words or phrases are hashed to the same index.
• Third, set the parameter "N-grams" to 2. With this value, the module counts the occurrence frequency of unigrams (a feature for every single word) and bigrams (a feature for every pair of adjacent words) in the input text. The parameter "N-grams" ranges from 0 to 10 and specifies the maximum number of sequential words to include in a feature.
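
To make the two parameters concrete, here is a small sketch of the hashing trick in plain Python. It is an illustration under stated assumptions, not the module's implementation: the hash function the module uses is not stated here, so MD5 stands in for it, and the function name `hash_ngrams`, its parameters, and the simple `\w+` tokenizer are all hypothetical.

```python
import hashlib
import re


def hash_ngrams(text, bitsize=8, max_n=2):
    """Map the n-grams of `text` to a count vector of length 2**bitsize.

    bitsize plays the role of "Hashing bitsize" and max_n the role of
    "N-grams". MD5 is an assumed stand-in for the real hash function.
    """
    n_buckets = 2 ** bitsize
    vector = [0] * n_buckets
    tokens = re.findall(r"\w+", text.lower())
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            gram = " ".join(tokens[i:i + n])
            digest = hashlib.md5(gram.encode("utf-8")).digest()
            index = int.from_bytes(digest[:4], "big") % n_buckets
            vector[index] += 1  # colliding n-grams share one counter
    return vector


# Four unigrams plus three bigrams are spread over 256 buckets.
vec = hash_ngrams("the bike was great")
```

Because the vector length is fixed by the bitsize, the feature count does not grow with the vocabulary; the cost is that distinct n-grams can collide, which happens more often at small bitsizes.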