Featurization in automated machine learning

In this guide, you learn how featurization is applied in automated machine learning experiments and how to configure and customize it.

Feature engineering is the process of using domain knowledge of the data to create features that help machine learning (ML) algorithms learn better. In Azure Machine Learning, data-scaling and normalization techniques are applied to make feature engineering easier. Collectively, these techniques and this feature engineering are called featurization in automated machine learning, or AutoML, experiments.

Prerequisites

This article assumes that you already know how to configure an AutoML experiment. For information about configuration, see the following articles:

Configure featurization

In every automated machine learning experiment, automatic scaling and normalization techniques are applied to your data by default. These techniques are types of featurization that help certain algorithms that are sensitive to features on different scales. However, you can also enable additional featurization, such as missing-values imputation, encoding, and transforms.

Note

Automated machine learning featurization steps (such as feature normalization, handling missing data, or converting text to numeric) become part of the underlying model. When you use the model for predictions, the same featurization steps that were applied during training are applied to your input data automatically.
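The idea in the note above can be illustrated with a plain scikit-learn Pipeline. This is an illustrative sketch, not AutoML's internal implementation: preprocessing fitted during training is stored with the estimator and reapplied automatically at prediction time.

```python
# Illustrative sketch (not AutoML internals): an sklearn Pipeline bundles a
# featurization step with the estimator, so predict() reuses the same steps.
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Toy training data: two numeric features on very different scales.
X_train = np.array([[1.0, 200.0], [2.0, 180.0], [3.0, 240.0], [4.0, 260.0]])
y_train = np.array([0, 0, 1, 1])

model = Pipeline([
    ("scaler", StandardScaler()),       # featurization step stored in the model
    ("classifier", LogisticRegression()),
])
model.fit(X_train, y_train)

# At prediction time, the scaler fitted during training is applied automatically.
predictions = model.predict(np.array([[2.5, 210.0]]))
```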

For experiments that you configure with the Python SDK, you can enable or disable the featurization setting and further specify the featurization steps to be used for your experiment. If you're using the Azure Machine Learning studio, see the steps to enable featurization.

The following table shows the accepted settings for featurization in the AutoMLConfig class:

Featurization configuration    Description
"featurization": 'auto'    Specifies that, as part of preprocessing, data guardrails and featurization steps are done automatically. This setting is the default.
"featurization": 'off'    Specifies that featurization steps are not done automatically.
"featurization": 'FeaturizationConfig'    Specifies that customized featurization steps are used. Learn how to customize featurization.
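As a sketch, the three settings above map onto AutoMLConfig like this. The task, training_data, and label_column_name values are placeholder assumptions, not from the original text:

```python
from azureml.train.automl import AutoMLConfig

# Default: automatic data guardrails and featurization.
automl_config = AutoMLConfig(task='classification',          # assumed task
                             training_data=train_data,        # assumed dataset
                             label_column_name='label',       # assumed column
                             featurization='auto')

# Disable automatic featurization entirely:
#   featurization='off'
# Use customized featurization steps:
#   featurization=<a FeaturizationConfig object>
```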

Automatic featurization

The following table summarizes techniques that are automatically applied to your data. These techniques are applied for experiments that are configured by using the SDK or the studio. To disable this behavior, set "featurization": 'off' in your AutoMLConfig object.

Note

If you plan to export your AutoML-created models to an ONNX model, only the featurization options indicated with an asterisk ("*") are supported in the ONNX format. Learn more about converting models to ONNX.

Featurization steps    Description
Drop high-cardinality or no-variance features*    Drop these features from training and validation sets. Applies to features with all values missing, with the same value across all rows, or with high cardinality (for example, hashes, IDs, or GUIDs).
Impute missing values*    For numeric features, impute with the average of values in the column. For categorical features, impute with the most frequent value.
Generate additional features*    For DateTime features: year, month, day, day of week, day of year, quarter, week of the year, hour, minute, second. For forecasting tasks, these additional DateTime features are created: ISO year, half-year, calendar month as string, week, day of week as string, day of quarter, day of year, AM/PM (0 if the hour is before noon, 1 otherwise), AM/PM as string, and hour of day (12-hour basis). For text features: term frequency based on unigrams, bigrams, and trigrams. Learn more about how this is done with BERT.
Transform and encode*    Transform numeric features that have few unique values into categorical features. One-hot encoding is used for low-cardinality categorical features; one-hot-hash encoding is used for high-cardinality categorical features.
Word embeddings    A text featurizer converts vectors of text tokens into sentence vectors by using a pretrained model. Each word's embedding vector in a document is aggregated with the rest to produce a document feature vector.
Target encodings    For categorical features, this step maps each category to an averaged target value for regression problems, and to the class probability for each class for classification problems. Frequency-based weighting and k-fold cross-validation are applied to reduce overfitting of the mapping and noise caused by sparse data categories.
Text target encoding    For text input, a stacked linear model with bag-of-words is used to generate the probability of each class.
Weight of Evidence (WoE)    Calculates WoE as a measure of correlation of categorical columns to the target column. WoE is calculated as the log of the ratio of in-class vs. out-of-class probabilities. This step produces one numeric feature column per class and removes the need to explicitly impute missing values and treat outliers.
Cluster Distance    Trains a k-means clustering model on all numeric columns. Produces k new features (one new numeric feature per cluster) that contain the distance of each sample to the centroid of each cluster.
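As an illustration of the cluster-distance technique in the last row, here is a sketch with scikit-learn, whose KMeans.transform returns each sample's distance to every cluster centroid. This shows the general idea, not AutoML's internal featurizer:

```python
# Sketch of the "Cluster Distance" featurizer idea: k-means on numeric columns,
# then one new feature per cluster holding the distance to that centroid.
import numpy as np
from sklearn.cluster import KMeans

# Toy data with two obvious clusters.
X = np.array([[1.0, 1.0], [1.2, 0.8], [8.0, 8.0], [8.3, 7.9]])
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# Shape: (n_samples, k) -- k new numeric features per sample.
cluster_distance_features = kmeans.transform(X)
```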

Data guardrails

Data guardrails help you identify potential issues with your data (for example, missing values or class imbalance). They also help you take corrective actions for improved results.

Data guardrails are applied:

  • For SDK experiments: When the parameters "featurization": 'auto' or validation=auto are specified in your AutoMLConfig object.
  • For studio experiments: When automatic featurization is enabled.

You can review the data guardrails for your experiment:

  • By setting show_output=True when you submit an experiment by using the SDK.

  • In the studio, on the Data guardrails tab of your automated ML run.
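For the SDK route, a minimal sketch, assuming an existing Experiment object named experiment and an AutoMLConfig named automl_config:

```python
# show_output=True streams progress, including data-guardrail results,
# to the console while the run executes.
remote_run = experiment.submit(automl_config, show_output=True)
remote_run.wait_for_completion()
```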

Data guardrail states

Data guardrails display one of three states:

State    Description
Passed    No data problems were detected, and no action is required by you.
Done    Changes were applied to your data. We encourage you to review the corrective actions that AutoML took, to ensure that the changes align with the expected results.
Alerted    A data issue was detected but couldn't be remedied. We encourage you to revise the data and fix the issue.

Supported data guardrails

The following list describes the data guardrails that are currently supported and the associated statuses that you might see when you submit your experiment:

Missing feature values imputation
  Passed: No missing feature values were detected in your training data. Learn more about missing-value imputation.
  Done: Missing feature values were detected in your training data and were imputed.

High cardinality feature handling
  Passed: Your inputs were analyzed, and no high-cardinality features were detected.
  Done: High-cardinality features were detected in your inputs and were handled.

Validation split handling
  Done: The validation configuration was set to 'auto' and the training data contained fewer than 20,000 rows. Each iteration of the trained model was validated by using cross-validation. Learn more about validation data.
  Done: The validation configuration was set to 'auto' and the training data contained more than 20,000 rows. The input data was split into a training dataset and a validation dataset for validation of the model.

Class balancing detection
  Passed: Your inputs were analyzed, and all classes are balanced in your training data. A dataset is considered balanced if each class has good representation in the dataset, as measured by number and ratio of samples.
  Alerted: Imbalanced classes were detected in your inputs. To fix model bias, fix the balancing problem. Learn more about imbalanced data.
  Done: Imbalanced classes were detected in your inputs, and the sweeping logic determined to apply balancing.

Memory issues detection
  Passed: The selected values (horizon, lag, rolling window) were analyzed, and no potential out-of-memory issues were detected. Learn more about time-series forecasting configurations.
  Done: The selected values (horizon, lag, rolling window) were analyzed and could potentially cause your experiment to run out of memory. The lag or rolling-window configurations were turned off.

Frequency detection
  Passed: The time series was analyzed, and all data points are aligned with the detected frequency.
  Done: The time series was analyzed, and data points that don't align with the detected frequency were detected and removed from the dataset. Learn more about data preparation for time-series forecasting.

Customize featurization

You can customize your featurization settings to ensure that the data and features used to train your ML model result in relevant predictions.

To customize featurization, specify "featurization": FeaturizationConfig in your AutoMLConfig object. If you're using the Azure Machine Learning studio for your experiment, see the how-to article. To customize featurization for forecasting task types, refer to the forecasting how-to.

Supported customizations include:

Customization    Definition
Column purpose update    Override the autodetected feature type for the specified column.
Transformer parameter update    Update the parameters for the specified transformer. Currently supports Imputer (mean, most frequent, and median) and HashOneHotEncoder.
Drop columns    Specifies columns to drop from being featurized.
Block transformers    Specifies transformers to be blocked from being used in the featurization process.

Create the FeaturizationConfig object by using API calls:

from azureml.automl.core.featurization import FeaturizationConfig

featurization_config = FeaturizationConfig()
featurization_config.blocked_transformers = ['LabelEncoder']
featurization_config.drop_columns = ['aspiration', 'stroke']
featurization_config.add_column_purpose('engine-size', 'Numeric')
featurization_config.add_column_purpose('body-style', 'CategoricalHash')
# The default Imputer strategy is mean; override the strategy for three columns.
featurization_config.add_transformer_params('Imputer', ['engine-size'], {"strategy": "median"})
featurization_config.add_transformer_params('Imputer', ['city-mpg'], {"strategy": "median"})
featurization_config.add_transformer_params('Imputer', ['bore'], {"strategy": "most_frequent"})
featurization_config.add_transformer_params('HashOneHotEncoder', [], {"number_of_bits": 3})
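The customized configuration is then passed to AutoMLConfig. This is a sketch; the task, training_data, and label_column_name values are assumed placeholders:

```python
from azureml.train.automl import AutoMLConfig

automl_config = AutoMLConfig(task='regression',              # assumed task
                             training_data=train_data,        # assumed dataset
                             label_column_name='price',       # assumed label column
                             featurization=featurization_config)
```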

Featurization transparency

Every AutoML model has featurization automatically applied. Featurization includes automated feature engineering (when "featurization": 'auto') and scaling and normalization, which then impact the selected algorithm and its hyperparameter values. AutoML supports different methods to ensure you have visibility into what was applied to your model.

Consider this forecasting example:

  • There are four input features: A (Numeric), B (Numeric), C (Numeric), D (DateTime).
  • Numeric feature C is dropped because it is an ID column with all unique values.
  • Numeric features A and B have missing values and are therefore imputed with the mean.
  • DateTime feature D is featurized into 11 different engineered features.

To get this information, use the fitted_model output from your automated ML experiment run.

automl_config = AutoMLConfig(…)
automl_run = experiment.submit(automl_config …)
best_run, fitted_model = automl_run.get_output()

Automated feature engineering

get_engineered_feature_names() returns a list of engineered feature names.

Note

Use 'timeseriestransformer' for task='forecasting'; otherwise, use 'datatransformer' for 'regression' or 'classification' tasks.

fitted_model.named_steps['timeseriestransformer'].get_engineered_feature_names()

This list includes all engineered feature names.

['A', 'B', 'A_WASNULL', 'B_WASNULL', 'year', 'half', 'quarter', 'month', 'day', 'hour', 'am_pm', 'hour12', 'wday', 'qday', 'week']

get_featurization_summary() gets a featurization summary of all the input features.

fitted_model.named_steps['timeseriestransformer'].get_featurization_summary()

Output

[{'RawFeatureName': 'A',
  'TypeDetected': 'Numeric',
  'Dropped': 'No',
  'EngineeredFeatureCount': 2,
  'Tranformations': ['MeanImputer', 'ImputationMarker']},
 {'RawFeatureName': 'B',
  'TypeDetected': 'Numeric',
  'Dropped': 'No',
  'EngineeredFeatureCount': 2,
  'Tranformations': ['MeanImputer', 'ImputationMarker']},
 {'RawFeatureName': 'C',
  'TypeDetected': 'Numeric',
  'Dropped': 'Yes',
  'EngineeredFeatureCount': 0,
  'Tranformations': []},
 {'RawFeatureName': 'D',
  'TypeDetected': 'DateTime',
  'Dropped': 'No',
  'EngineeredFeatureCount': 11,
  'Tranformations': ['DateTime','DateTime','DateTime','DateTime','DateTime','DateTime','DateTime','DateTime','DateTime','DateTime','DateTime']}]
Output    Definition
RawFeatureName    Input feature/column name from the dataset provided.
TypeDetected    Detected datatype of the input feature.
Dropped    Indicates whether the input feature was dropped or used.
EngineeredFeatureCount    Number of features generated through automated feature engineering transforms.
Transformations    List of transformations applied to input features to generate engineered features.

Scaling and normalization

To understand the scaling/normalization and the selected algorithm with its hyperparameter values, use fitted_model.steps.

The following sample output is from running fitted_model.steps for a chosen run:

[('RobustScaler', 
  RobustScaler(copy=True, 
  quantile_range=[10, 90], 
  with_centering=True, 
  with_scaling=True)), 

  ('LogisticRegression', 
  LogisticRegression(C=0.18420699693267145, class_weight='balanced', 
  dual=False, 
  fit_intercept=True, 
  intercept_scaling=1, 
  max_iter=100, 
  multi_class='multinomial', 
  n_jobs=1, penalty='l2', 
  random_state=None, 
  solver='newton-cg', 
  tol=0.0001, 
  verbose=0, 
  warm_start=False))]

To get more details, use this helper function:

from pprint import pprint

def print_model(model, prefix=""):
    for step in model.steps:
        print(prefix + step[0])
        if hasattr(step[1], 'estimators') and hasattr(step[1], 'weights'):
            pprint({'estimators': list(
                e[0] for e in step[1].estimators), 'weights': step[1].weights})
            print()
            for estimator in step[1].estimators:
                print_model(estimator[1], estimator[0] + ' - ')
        else:
            pprint(step[1].get_params())
            print()

print_model(fitted_model)

This helper function returns the following output for a particular run that uses LogisticRegression with RobustScaler as the algorithm.

RobustScaler
{'copy': True,
'quantile_range': [10, 90],
'with_centering': True,
'with_scaling': True}

LogisticRegression
{'C': 0.18420699693267145,
'class_weight': 'balanced',
'dual': False,
'fit_intercept': True,
'intercept_scaling': 1,
'max_iter': 100,
'multi_class': 'multinomial',
'n_jobs': 1,
'penalty': 'l2',
'random_state': None,
'solver': 'newton-cg',
'tol': 0.0001,
'verbose': 0,
'warm_start': False}

Predict class probability

Models produced by automated ML all have wrapper objects that mirror functionality from their open-source origin class. Most classification model wrapper objects returned by automated ML implement the predict_proba() function, which accepts an array-like or sparse matrix data sample of your features (X values) and returns an n-dimensional array of each sample and its respective class probabilities.

Assuming you have retrieved the best run and fitted model by using the same calls as above, you can call predict_proba() directly from the fitted model, supplying an X_test sample in the appropriate format depending on the model type.

best_run, fitted_model = automl_run.get_output()
class_prob = fitted_model.predict_proba(X_test)

If the underlying model does not support the predict_proba() function or the format is incorrect, a model-class-specific exception is thrown. See the RandomForestClassifier and XGBoost reference docs for examples of how this function is implemented for different model types.
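As an offline illustration of the predict_proba() contract described above, here is a sketch using scikit-learn's RandomForestClassifier directly, rather than an AutoML wrapper object:

```python
# predict_proba returns one row per sample and one column per class,
# with each row's probabilities summing to 1.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Toy binary-classification data.
X_train = np.array([[0.0], [0.1], [0.9], [1.0]])
y_train = np.array([0, 0, 1, 1])
clf = RandomForestClassifier(n_estimators=10, random_state=0).fit(X_train, y_train)

class_prob = clf.predict_proba(np.array([[0.05], [0.95]]))
```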

BERT integration

BERT is used in the featurization layer of AutoML. In this layer, if a column contains free text or other types of data, like timestamps or simple numbers, then featurization is applied accordingly.

For BERT, the model is fine-tuned and trained by using the user-provided labels. From here, document embeddings are output as features alongside others, like timestamp-based features and day of week.

BERT steps

To invoke BERT, you have to set enable_dnn: True in your automl_settings and use a GPU compute (for example, vm_size = "STANDARD_NC6" or a higher GPU). If a CPU compute is used, then instead of BERT, AutoML enables the BiLSTM DNN featurizer.

AutoML takes the following steps for BERT.

  1. Preprocessing and tokenization of all text columns. For example, the "StringCast" transformer can be found in the final model's featurization summary. An example of how to produce the model's featurization summary can be found in this notebook.

  2. Concatenate all text columns into a single text column, hence the StringConcatTransformer in the final model.

    Our implementation of BERT limits the total text length of a training sample to 128 tokens. That means that all text columns, when concatenated, should ideally be at most 128 tokens long. If multiple columns are present, each column should be pruned so that this condition is satisfied. Otherwise, for concatenated columns longer than 128 tokens, BERT's tokenizer layer truncates the input to 128 tokens.

  3. As part of feature sweeping, AutoML compares BERT against the baseline (bag-of-words features) on a sample of the data. This comparison determines whether BERT would give accuracy improvements. If BERT performs better than the baseline, AutoML then uses BERT for text featurization of the whole data. In that case, you will see the PretrainedTextDNNTransformer in the final model.
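The 128-token limit described in step 2 can be sketched in plain Python. This uses whitespace tokens as a rough stand-in for BERT's WordPiece tokenizer (which usually produces more tokens per word), so it illustrates only the concatenate-then-truncate behavior:

```python
# Rough sketch of the concatenate-and-truncate behavior, with whitespace
# splitting standing in for the real WordPiece tokenizer.
MAX_TOKENS = 128

def concat_and_truncate(columns, max_tokens=MAX_TOKENS):
    """Concatenate text columns into one text, then keep at most max_tokens tokens."""
    tokens = " ".join(columns).split()
    return tokens[:max_tokens]

# Hypothetical row with a short column and a long free-text column (202 tokens total).
row = ["short title", "a much longer free-text description " * 40]
tokens = concat_and_truncate(row)
```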

BERT generally runs longer than other featurizers. For better performance, we recommend using "STANDARD_NC24r" or "STANDARD_NC24rs_V3" for their RDMA capabilities.

AutoML distributes BERT training across multiple nodes if they are available (up to a maximum of eight nodes). You can enable this in your AutoMLConfig object by setting the max_concurrent_iterations parameter to a value higher than 1.
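A sketch of the two settings mentioned above; the value 4 is an arbitrary example, not a recommendation from the original text:

```python
# Assumed settings per the text above: enable_dnn turns on the BERT featurizer
# (on GPU compute), and max_concurrent_iterations > 1 lets AutoML distribute
# BERT training across multiple nodes.
automl_settings = {
    "enable_dnn": True,
    "max_concurrent_iterations": 4,  # any value > 1; arbitrary example
}
```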

Supported languages

AutoML currently supports around 100 languages, and depending on the dataset's language, AutoML chooses the appropriate BERT model. For German data, we use the German BERT model. For English, we use the English BERT model. For all other languages, we use the multilingual BERT model.

In the following code, the German BERT model is triggered, because the dataset language is specified as deu, the three-letter code for German according to the ISO classification:

from azureml.automl.core.featurization import FeaturizationConfig

featurization_config = FeaturizationConfig(dataset_language='deu')

automl_settings = {
    "experiment_timeout_minutes": 120,
    "primary_metric": 'accuracy',
    # All other settings you want to use
    "featurization": featurization_config,
    "enable_dnn": True,  # This enables the BERT DNN featurizer
    "enable_voting_ensemble": False,
    "enable_stack_ensemble": False
}

Next steps