Data featurization in automated machine learning

Learn about the data featurization settings in Azure Machine Learning, and how to customize those features for automated machine learning experiments.

Feature engineering and featurization

Training data consists of rows and columns. Each row is an observation or record, and the columns of each row are the features that describe each record. Typically, the features that best characterize the patterns in the data are selected to create predictive models.

Although many of the raw data fields can be used directly to train a model, it's often necessary to create additional (engineered) features that provide information that better differentiates patterns in the data. This process is called feature engineering: domain knowledge of the data is used to create features that, in turn, help machine learning algorithms to learn better.

In Azure Machine Learning, data-scaling and normalization techniques are applied to make feature engineering easier. Collectively, these techniques and this feature engineering are called featurization in automated ML experiments.


This article assumes that you already know how to configure an automated ML experiment. For information about configuration, see the following articles:

Configure featurization

In every automated machine learning experiment, automatic scaling and normalization techniques are applied to your data by default. These techniques are types of featurization that help certain algorithms that are sensitive to features on different scales. You can enable more featurization, such as missing-value imputation, encoding, and transforms.
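As a rough illustration of what such scaling does, here is a minimal z-score standardization in plain Python. This is only a sketch of the idea; AutoML applies its own internal scalers, which differ in detail:

```python
from statistics import mean, pstdev

def standardize(values):
    """Rescale a numeric column to zero mean and unit variance (z-score).
    Illustrative only; AutoML's actual scalers differ in detail."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

scaled = standardize([1.0, 2.0, 3.0])
```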


Automated machine learning featurization steps (such as feature normalization, handling missing data, or converting text to numeric) become part of the underlying model. When you use the model for predictions, the same featurization steps that were applied during training are applied to your input data automatically.

For experiments that you configure with the Python SDK, you can enable or disable the featurization setting and further specify the featurization steps to be used for your experiment. If you're using the Azure Machine Learning studio, see the steps to enable featurization.

The following table shows the accepted settings for featurization in the AutoMLConfig class:

Featurization configuration    Description
"featurization": 'auto'    Specifies that, as part of preprocessing, data guardrails and featurization steps are to be done automatically. This setting is the default.
"featurization": 'off'    Specifies that featurization steps are not to be done automatically.
"featurization": 'FeaturizationConfig'    Specifies that customized featurization steps are to be used. Learn how to customize featurization.

Automatic featurization

The following table summarizes techniques that are automatically applied to your data. These techniques are applied for experiments that are configured by using the SDK or the studio. To disable this behavior, set "featurization": 'off' in your AutoMLConfig object.


If you plan to export your AutoML-created models to an ONNX model, only the featurization options indicated with an asterisk ("*") are supported in the ONNX format. Learn more about converting models to ONNX.

Featurization steps    Description
Drop high cardinality or no variance features*    Drop these features from training and validation sets. Applies to features with all values missing, with the same value across all rows, or with high cardinality (for example, hashes, IDs, or GUIDs).
Impute missing values*    For numeric features, impute with the average of values in the column.

For categorical features, impute with the most frequent value.
Generate more features*    For DateTime features: Year, Month, Day, Day of week, Day of year, Quarter, Week of the year, Hour, Minute, Second.

For forecasting tasks, these additional DateTime features are created: ISO year, Half (half-year), Calendar month as string, Week, Day of week as string, Day of quarter, Day of year, AM/PM (0 if hour is before noon (12 pm), 1 otherwise), AM/PM as string, Hour of day (12-hr basis).

For Text features: Term frequency based on unigrams, bigrams, and trigrams. Learn more about how this is done with BERT.
Transform and encode*    Transform numeric features that have few unique values into categorical features.

One-hot encoding is used for low-cardinality categorical features. One-hot-hash encoding is used for high-cardinality categorical features.
Word embeddings    A text featurizer converts vectors of text tokens into sentence vectors by using a pre-trained model. Each word's embedding vector in a document is aggregated with the rest to produce a document feature vector.
Cluster Distance    Trains a k-means clustering model on all numeric columns. Produces k new features (one new numeric feature per cluster) that contain the distance of each sample to the centroid of each cluster.
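The imputation and encoding rows in the table can be sketched in plain Python. AutoML's real transformers are implemented differently (and handle edge cases this sketch ignores), so treat this only as an illustration of the logic:

```python
from statistics import mean
from collections import Counter

def impute_numeric(column):
    """Fill missing numeric entries (None) with the column mean."""
    observed = [v for v in column if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in column]

def impute_categorical(column):
    """Fill missing categorical entries (None) with the most frequent value."""
    observed = [v for v in column if v is not None]
    fill = Counter(observed).most_common(1)[0][0]
    return [fill if v is None else v for v in column]

def one_hot(column):
    """One-hot encode a low-cardinality categorical column."""
    categories = sorted(set(column))
    return [[1 if v == c else 0 for c in categories] for v in column]
```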

Data guardrails

Data guardrails help you identify potential issues with your data (for example, missing values or class imbalance). They also help you take corrective actions for improved results.

Data guardrails are applied:

  • For SDK experiments: When the parameters "featurization": 'auto' or validation=auto are specified in your AutoMLConfig object.
  • For studio experiments: When automatic featurization is enabled.

You can review the data guardrails for your experiment:

  • By setting show_output=True when you submit an experiment by using the SDK.

  • In the studio, on the Data guardrails tab of your automated ML run.

Data guardrail states

Data guardrails display one of three states:

State    Description
Passed    No data problems were detected, and no action is required by you.
Done    Changes were applied to your data. We encourage you to review the corrective actions that AutoML took, to ensure that the changes align with the expected results.
Alerted    A data issue was detected but couldn't be remedied. We encourage you to revise the data and fix the issue.

Supported data guardrails

The following table describes the data guardrails that are currently supported and the associated statuses that you might see when you submit your experiment:

Guardrail    Status    Condition for trigger
Missing feature values imputation
    Passed    No missing feature values were detected in your training data. Learn more about missing-value imputation.
    Done    Missing feature values were detected in your training data and were imputed.
High cardinality feature handling
    Passed    Your inputs were analyzed, and no high-cardinality features were detected.
    Done    High-cardinality features were detected in your inputs and were handled.
Validation split handling
    Done    The validation configuration was set to 'auto' and the training data contained fewer than 20,000 rows. Each iteration of the trained model was validated by using cross-validation. Learn more about validation data.
    Done    The validation configuration was set to 'auto', and the training data contained more than 20,000 rows. The input data has been split into a training dataset and a validation dataset for validation of the model.
Class balancing detection
    Passed    Your inputs were analyzed, and all classes are balanced in your training data. A dataset is considered to be balanced if each class has good representation in the dataset, as measured by the number and ratio of samples.
    Alerted    Imbalanced classes were detected in your inputs. To fix model bias, fix the balancing problem. Learn more about imbalanced data.
    Done    Imbalanced classes were detected in your inputs, and the sweeping logic has determined to apply balancing.
Memory issues detection
    Passed    The selected values (horizon, lag, rolling window) were analyzed, and no potential out-of-memory issues were detected. Learn more about time-series forecasting configurations.
    Done    The selected values (horizon, lag, rolling window) were analyzed and would potentially cause your experiment to run out of memory. The lag or rolling-window configurations have been turned off.
Frequency detection
    Passed    The time series was analyzed, and all data points are aligned with the detected frequency.
    Done    The time series was analyzed, and data points that don't align with the detected frequency were detected. These data points were removed from the dataset. Learn more about data preparation for time-series forecasting.
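AutoML's detection logic for these guardrails is internal, but a toy version of the class-balance check described above might look like the following. The 20% threshold and the min/max-ratio criterion are purely illustrative assumptions, not AutoML's actual rule:

```python
from collections import Counter

def is_balanced(labels, min_ratio=0.2):
    """Return True if the rarest class has at least min_ratio times as many
    samples as the most common class. Both the ratio criterion and the 0.2
    threshold are made up for illustration; AutoML uses its own criteria."""
    counts = Counter(labels).values()
    return min(counts) / max(counts) >= min_ratio
```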

Customize featurization

You can customize your featurization settings to ensure that the data and features that are used to train your ML model result in relevant predictions.

To customize featurizations, specify "featurization": FeaturizationConfig in your AutoMLConfig object. If you're using the Azure Machine Learning studio for your experiment, see the how-to article. To customize featurization for forecasting task types, refer to the forecasting how-to.

Supported customizations include:

Customization    Definition
Column purpose update    Override the autodetected feature type for the specified column.
Transformer parameter update    Update the parameters for the specified transformer. Currently supports Imputer (mean, most frequent, and median) and HashOneHotEncoder.
Drop columns    Specifies columns to drop from being featurized.
Block transformers    Specifies transformers to block (exclude) from the featurization process.


The drop columns functionality is deprecated as of SDK version 1.19. Drop columns from your dataset as part of data cleansing, before consuming it in your automated ML experiment.
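In line with this guidance, drop unwanted columns during data cleansing before the data reaches AutoML. A plain-Python sketch with hypothetical records (with a pandas DataFrame you would use `df.drop(columns=[...])` instead):

```python
# Hypothetical raw records; column names are illustrative only.
rows = [
    {'aspiration': 'std', 'stroke': 2.68, 'bore': 3.47},
    {'aspiration': 'turbo', 'stroke': 3.40, 'bore': 2.91},
]
to_drop = {'aspiration', 'stroke'}  # columns you no longer want featurized
cleaned = [{k: v for k, v in row.items() if k not in to_drop} for row in rows]
```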

Create the FeaturizationConfig object by using API calls:

from azureml.automl.core.featurization import FeaturizationConfig

featurization_config = FeaturizationConfig()
featurization_config.blocked_transformers = ['LabelEncoder']
featurization_config.drop_columns = ['aspiration', 'stroke']
featurization_config.add_column_purpose('engine-size', 'Numeric')
featurization_config.add_column_purpose('body-style', 'CategoricalHash')
# The default imputation strategy is mean; override it for three columns.
featurization_config.add_transformer_params('Imputer', ['engine-size'], {"strategy": "median"})
featurization_config.add_transformer_params('Imputer', ['city-mpg'], {"strategy": "median"})
featurization_config.add_transformer_params('Imputer', ['bore'], {"strategy": "most_frequent"})
featurization_config.add_transformer_params('HashOneHotEncoder', [], {"number_of_bits": 3})

Featurization transparency

Every AutoML model has featurization automatically applied. Featurization includes automated feature engineering (when "featurization": 'auto') and scaling and normalization, which then impact the selected algorithm and its hyperparameter values. AutoML supports different methods to ensure you have visibility into what was applied to your model.

Consider this forecasting example:

  • There are four input features: A (Numeric), B (Numeric), C (Numeric), D (DateTime).
  • Numeric feature C is dropped because it is an ID column with all unique values.
  • Numeric features A and B have missing values and are therefore imputed with the mean.
  • DateTime feature D is featurized into 11 different engineered features.

To get this information, use the fitted_model output from your automated ML experiment run.

automl_config = AutoMLConfig(…)
automl_run = experiment.submit(automl_config …)
best_run, fitted_model = automl_run.get_output()

Automated feature engineering

get_engineered_feature_names() returns a list of engineered feature names.


Use 'timeseriestransformer' for task='forecasting'; otherwise, use 'datatransformer' for a 'regression' or 'classification' task.

fitted_model.named_steps['timeseriestransformer'].get_engineered_feature_names()

This list includes all engineered feature names.

['A', 'B', 'A_WASNULL', 'B_WASNULL', 'year', 'half', 'quarter', 'month', 'day', 'hour', 'am_pm', 'hour12', 'wday', 'qday', 'week']
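The datetime-derived names in this list can be approximated with the standard library. This sketch is illustrative only and is not AutoML's actual implementation; the exact definitions (for example, week numbering) may differ:

```python
from datetime import datetime

def datetime_features(ts):
    """Expand one timestamp into features named like those listed above.
    Illustrative definitions; AutoML's real transformer may differ."""
    quarter = (ts.month - 1) // 3 + 1
    quarter_start = datetime(ts.year, 3 * (quarter - 1) + 1, 1)
    return {
        'year': ts.year,
        'half': 1 if ts.month <= 6 else 2,
        'quarter': quarter,
        'month': ts.month,
        'day': ts.day,
        'hour': ts.hour,
        'am_pm': 0 if ts.hour < 12 else 1,
        'hour12': ts.hour % 12 or 12,
        'wday': ts.weekday(),
        'qday': (ts - quarter_start).days + 1,
        'week': ts.isocalendar()[1],
    }

features = datetime_features(datetime(2021, 5, 4, 15, 30))
```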

get_featurization_summary() gets a featurization summary of all the input features.



[{'RawFeatureName': 'A',
  'TypeDetected': 'Numeric',
  'Dropped': 'No',
  'EngineeredFeatureCount': 2,
  'Transformations': ['MeanImputer', 'ImputationMarker']},
 {'RawFeatureName': 'B',
  'TypeDetected': 'Numeric',
  'Dropped': 'No',
  'EngineeredFeatureCount': 2,
  'Transformations': ['MeanImputer', 'ImputationMarker']},
 {'RawFeatureName': 'C',
  'TypeDetected': 'Numeric',
  'Dropped': 'Yes',
  'EngineeredFeatureCount': 0,
  'Transformations': []},
 {'RawFeatureName': 'D',
  'TypeDetected': 'DateTime',
  'Dropped': 'No',
  'EngineeredFeatureCount': 11,
  'Transformations': ['DateTime','DateTime','DateTime','DateTime','DateTime','DateTime','DateTime','DateTime','DateTime','DateTime','DateTime']}]
Output    Definition
RawFeatureName    Input feature/column name from the dataset provided.
TypeDetected    Detected datatype of the input feature.
Dropped    Indicates if the input feature was dropped or used.
EngineeredFeatureCount    Number of features generated through automated feature engineering transforms.
Transformations    List of transformations applied to input features to generate engineered features.
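Because the summary is a plain list of dictionaries, you can inspect it with ordinary Python. A small sketch, assuming the shape of the sample output shown above (in practice the list comes from get_featurization_summary()):

```python
# A stand-in for the summary returned by get_featurization_summary().
summary = [
    {'RawFeatureName': 'A', 'TypeDetected': 'Numeric', 'Dropped': 'No',
     'EngineeredFeatureCount': 2, 'Transformations': ['MeanImputer', 'ImputationMarker']},
    {'RawFeatureName': 'C', 'TypeDetected': 'Numeric', 'Dropped': 'Yes',
     'EngineeredFeatureCount': 0, 'Transformations': []},
]

def dropped_features(summary):
    """Names of raw features that AutoML dropped during featurization."""
    return [f['RawFeatureName'] for f in summary if f['Dropped'] == 'Yes']

def total_engineered(summary):
    """Total count of engineered features produced from the raw inputs."""
    return sum(f['EngineeredFeatureCount'] for f in summary)
```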

Scaling and normalization

To understand the scaling/normalization and the selected algorithm with its hyperparameter values, use fitted_model.steps.

The following sample output (abbreviated) is from running fitted_model.steps for a chosen run:

  ('RobustScaler', RobustScaler(copy=True, quantile_range=[10, 90], with_centering=True, with_scaling=True)),
  ('LogisticRegression', LogisticRegression(C=0.18420699693267145, class_weight='balanced', n_jobs=1, penalty='l2', ...))

To get more details, use this helper function:

from pprint import pprint

def print_model(model, prefix=""):
    for step in model.steps:
        print(prefix + step[0])
        if hasattr(step[1], 'estimators') and hasattr(step[1], 'weights'):
            pprint({'estimators': list(
                e[0] for e in step[1].estimators), 'weights': step[1].weights})
            for estimator in step[1].estimators:
                print_model(estimator[1], estimator[0] + ' - ')
        else:
            pprint(step[1].get_params())
            print()

print_model(fitted_model)

This helper function returns the following output for a particular run that uses LogisticRegression with RobustScaler as the specific algorithm.

{'copy': True,
'quantile_range': [10, 90],
'with_centering': True,
'with_scaling': True}

{'C': 0.18420699693267145,
'class_weight': 'balanced',
'dual': False,
'fit_intercept': True,
'intercept_scaling': 1,
'max_iter': 100,
'multi_class': 'multinomial',
'n_jobs': 1,
'penalty': 'l2',
'random_state': None,
'solver': 'newton-cg',
'tol': 0.0001,
'verbose': 0,
'warm_start': False}

Predict class probability

Models produced by using automated ML all have wrapper objects that mirror functionality from their open-source origin class. Most classification model wrapper objects returned by automated ML implement the predict_proba() function, which accepts an array-like or sparse matrix data sample of your features (X values) and returns an n-dimensional array of each sample and its respective class probabilities.

Assuming you have retrieved the best run and fitted model by using the same calls from above, you can call predict_proba() directly from the fitted model, supplying an X_test sample in the appropriate format depending on the model type.

best_run, fitted_model = automl_run.get_output()
class_prob = fitted_model.predict_proba(X_test)
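The returned array has one row per sample and one column per class, with each row summing to 1. A sketch with hypothetical probability values (your actual values come from the fitted model):

```python
# Hypothetical predict_proba() output for three samples of a binary problem.
class_prob = [
    [0.9, 0.1],
    [0.2, 0.8],
    [0.6, 0.4],
]

# Recover hard class labels by taking the most probable class per row.
predicted = [max(range(len(row)), key=row.__getitem__) for row in class_prob]
```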

If the underlying model does not support the predict_proba() function or the format is incorrect, a model-class-specific exception is thrown. See the RandomForestClassifier and XGBoost reference docs for examples of how this function is implemented for different model types.

BERT integration in automated ML

BERT is used in the featurization layer of AutoML. In this layer, if a column contains free text or other types of data like timestamps or simple numbers, then featurization is applied accordingly.

For BERT, the model is fine-tuned and trained by utilizing the user-provided labels. From here, document embeddings are output as features alongside others, like timestamp-based features and day of week.

Steps to invoke BERT

To invoke BERT, set enable_dnn: True in your automl_settings and use a GPU compute (vm_size = "STANDARD_NC6" or a higher GPU). If a CPU compute is used, then instead of BERT, AutoML enables the BiLSTM DNN featurizer.

AutoML takes the following steps for BERT.

  1. Preprocessing and tokenization of all text columns. For example, the "StringCast" transformer can be found in the final model's featurization summary. An example of how to produce the model's featurization summary can be found in this notebook.

  2. Concatenation of all text columns into a single text column, hence the StringConcatTransformer in the final model.

     Our implementation of BERT limits the total text length of a training sample to 128 tokens. That means that all text columns, when concatenated, should ideally be at most 128 tokens long. If multiple columns are present, each column should be pruned so that this condition is satisfied. Otherwise, for concatenated columns longer than 128 tokens, BERT's tokenizer layer truncates the input to 128 tokens.

  3. As part of feature sweeping, AutoML compares BERT against the baseline (bag-of-words features) on a sample of the data. This comparison determines whether BERT would give accuracy improvements. If BERT performs better than the baseline, AutoML then uses BERT for text featurization of the whole data. In that case, you will see the PretrainedTextDNNTransformer in the final model.
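The concatenation-and-truncation behavior in step 2 can be approximated as follows. Note that BERT uses subword tokenization, so the whitespace splitting here is only a stand-in for the real tokenizer:

```python
MAX_TOKENS = 128  # the BERT input limit described above

def concat_and_truncate(text_columns, max_tokens=MAX_TOKENS):
    """Join text columns into one string and keep only the first max_tokens
    whitespace tokens (a rough approximation of BERT's truncation)."""
    tokens = " ".join(text_columns).split()
    return " ".join(tokens[:max_tokens])
```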

BERT generally runs longer than other featurizers. For better performance, we recommend using "STANDARD_NC24r" or "STANDARD_NC24rs_V3" for their RDMA capabilities.

AutoML will distribute BERT training across multiple nodes if they are available (up to a maximum of eight nodes). This can be done in your AutoMLConfig object by setting the max_concurrent_iterations parameter to a value higher than 1.

Supported languages for BERT in AutoML

AutoML currently supports around 100 languages, and depending on the dataset's language, AutoML chooses the appropriate BERT model. For German data, we use the German BERT model. For English, we use the English BERT model. For all other languages, we use the multilingual BERT model.

In the following code, the German BERT model is triggered, because the dataset language is specified as deu, the three-letter language code for German according to the ISO classification:

from azureml.automl.core.featurization import FeaturizationConfig

featurization_config = FeaturizationConfig(dataset_language='deu')

automl_settings = {
    "experiment_timeout_minutes": 120,
    "primary_metric": 'accuracy',
    # All other settings you want to use
    "featurization": featurization_config,
    "enable_dnn": True,  # This enables the BERT DNN featurizer
    "enable_voting_ensemble": False,
    "enable_stack_ensemble": False
}
Next steps