APPLIES TO:
Azure CLI ml extension v2 (current)
Python SDK azure-ai-ml v2 (current)
In this article, you learn how to train natural language processing (NLP) models in Azure Machine Learning by using automated machine learning (AutoML). You can create NLP models by using AutoML via the Azure Machine Learning CLI v2 or the Azure Machine Learning Python SDK v2.
NLP in AutoML lets machine-learning professionals and data scientists use their own text data and build custom models for multiclass text classification, multilabel text classification, and named entity recognition (NER). You can seamlessly integrate with Azure Machine Learning data labeling to label your text data, or use existing labeled data.
AutoML provides the option to use distributed training on multi-GPU compute clusters for faster model training. The resulting model can be operationalized at scale using Azure Machine Learning MLOps capabilities.
Prerequisites
Azure subscription. If you don't have an Azure subscription, sign up to try the trial subscription today.
An Azure Machine Learning workspace with a GPU training compute. To create the workspace, see Create workspace resources. For more information about Azure-provided GPU instances, see GPU optimized virtual machine sizes.
Note
Some NLP use cases, such as non-English datasets and longer range documents, require using multilingual models or models with longer max sequence length. These scenarios might require higher GPU memory for model training to succeed, such as the NCv3 or NDv2 series.
Some familiarity with setting up AutoML experiments. For more information about the AutoML experiment design pattern, see Set up AutoML training for tabular data.
The Azure Machine Learning CLI v2 installed. For guidance to update and install the latest version, see Install and set up the CLI (v2).
Select your NLP task
Decide what NLP task you want to accomplish. AutoML supports the following deep neural network NLP tasks:
| Task | AutoML job syntax | Description |
|---|---|---|
| Multiclass text classification | CLI v2: text_classification; SDK v2: text_classification() | There are multiple possible classes and each sample can be classified as exactly one class. The task is to predict the correct class for each sample. For example, classify a movie script as Comedy or Romantic. |
| Multilabel text classification | CLI v2: text_classification_multilabel; SDK v2: text_classification_multilabel() | There are multiple possible classes and each sample can be assigned any number of classes. The task is to predict all the classes for each sample. For example, classify a movie script as Comedy, Romantic, or both Comedy and Romantic. |
| Named Entity Recognition (NER) | CLI v2: text_ner; SDK v2: text_ner() | There are multiple possible tags for tokens in sequences. The task is to predict the tags for all the tokens for each sequence. For example, extract domain-specific entities from unstructured text, such as contracts or financial documents. |
Thresholding
Thresholding is a multilabel text classification feature that determines the threshold at which the predicted probabilities produce a positive label. Lower values allow more labels, which is better when users care about recall, but can lead to more false positives. Higher values allow fewer labels, which is better when users care about precision, but can lead to more false negatives.
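The effect of the threshold can be sketched in a few lines of Python. This is an illustration only, not the AutoML implementation: the function name `labels_above_threshold`, the sample probabilities, and the choice of `>=` for the comparison are all assumptions.

```python
def labels_above_threshold(probabilities, threshold):
    """Return the labels whose predicted probability meets the threshold."""
    return [label for label, p in probabilities.items() if p >= threshold]


# Hypothetical predicted probabilities for a single sample.
probs = {"basketball": 0.82, "football": 0.41, "hockey": 0.07}

# A lower threshold keeps more labels (favors recall).
print(labels_above_threshold(probs, threshold=0.3))  # ['basketball', 'football']

# A higher threshold keeps fewer labels (favors precision).
print(labels_above_threshold(probs, threshold=0.6))  # ['basketball']
```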
Prepare data
For AutoML NLP experiments, you can provide your data in .csv format for multiclass and multilabel classification tasks. For NER tasks, provide two-column .txt files that use a space as the separator and adhere to the CoNLL format. The following sections provide details about the data format accepted for each task.
Multiclass
For multiclass classification, the dataset can contain several text columns and exactly one label column. The following example has only one text column.
text,labels
"I love watching Shanghai Bulls games.","NBA"
"Tom Brady is a great player.","NFL"
"There is a game between Yankees and Orioles tonight","MLB"
"Stephen Curry made the highest number of three-pointers","NBA"
Multilabel
For multilabel classification, the dataset columns can be the same as multiclass, but there are special format requirements for data in the label column. The following table shows the two accepted formats with examples.
| Label column format options | Multiple labels | One label | No labels |
|---|---|---|---|
| Plain text | "label1, label2, label3" | "label1" | "" |
| Python list with quotes | "['label1','label2','label3']" | "['label1']" | "[]" |
Important
Different parsers read labels for these formats. For the plain text format, all nonalphanumeric characters except for _ are read as label separators. For example, the label "cs.AI" is read as "cs" and "AI". With the Python list format, the label is "['cs.AI']", which is read as "cs.AI".
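The difference between the two formats can be approximated with the following Python sketch. It isn't the parser AutoML uses. The function names are hypothetical, and the sketch assumes that the plain text rule maps to the regular expression `\W+` and that the list format is a valid Python literal.

```python
import ast
import re


def parse_plain_text_labels(cell):
    """Split on every run of characters that isn't a letter, digit, or underscore."""
    return [token for token in re.split(r"\W+", cell) if token]


def parse_python_list_labels(cell):
    """Read the cell as a Python list literal, so punctuation inside a label survives."""
    value = ast.literal_eval(cell)
    if not isinstance(value, list):
        raise ValueError(f"expected a list literal, got {cell!r}")
    return value


print(parse_plain_text_labels("cs.AI"))       # ['cs', 'AI']
print(parse_python_list_labels("['cs.AI']"))  # ['cs.AI']
```

If your labels contain punctuation such as periods or hyphens, the Python list format is the safer choice.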
The following example shows multilabel data in plain text format:
text,labels
"I love watching Shanghai Bulls games.","basketball"
"The four most popular leagues are NFL, MLB, NBA and NHL","football,baseball,basketball,hockey"
"I like drinking beer.",""
The following example shows multilabel data in Python list with quotes format.
text,labels
"I love watching Shanghai Bulls games.","['basketball']"
"The four most popular leagues are NFL, MLB, NBA and NHL","['football','baseball','basketball','hockey']"
"I like drinking beer.","[]"
Named entity recognition (NER)
Unlike multiclass or multilabel classification, which take .csv format datasets, NER requires CoNLL format. The data file must contain exactly two columns, and the token and the label in each row are separated by a single space.
For example,
Hudson B-loc
Square I-loc
is O
a O
famous O
place O
in O
New B-loc
York I-loc
City I-loc

Stephen B-per
Curry I-per
got O
three O
championship O
rings O
Data validation
Before it trains models, AutoML applies data validation checks on the input data to ensure that the data can be preprocessed correctly. If any of these checks fail, the run fails with the relevant error message. The following factors are required to pass data validation checks for each task.
Note
Some data validation checks are applicable to both the training and the validation set and others apply only to the training set. If the test dataset doesn't pass data validation, AutoML can't capture it, and model inference failure or a decline in model performance are possible.
| Task | Data validation check |
|---|---|
| All tasks | At least 50 training samples are required. |
| Multiclass and multilabel | The training data and validation data must have: - The same set of columns. - The same column order from left to right. - The same data type for columns with the same name. - At least two unique labels. - Unique column names within each dataset. For example, the training set can't have multiple columns named Age. |
| Multiclass only | None. |
| Multilabel only | - The label column format must be in the accepted format. - At least one sample should have 0 or 2+ labels; otherwise it should be a multiclass task. - All labels should be in str or int format, with no overlaps. - You can't have both label 1 and label '1'. |
| NER only | - The file can't start with an empty line. - Each line must be an empty line, or follow the format {token} {label} where there's exactly one space between the token and the label and no white space after the label. - All labels must start with I-, B-, or be exactly O, and are case-sensitive. - There must be exactly one empty line between two samples, and exactly one empty line at the end of the file. |
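To catch NER formatting problems before you submit a job, you can run a local check along the lines of the following Python sketch. It mirrors the NER rules in the preceding table but isn't the validation code that AutoML runs. The function name `validate_conll` and the error messages are hypothetical.

```python
def validate_conll(text):
    """Return a list of problems found in CoNLL-style NER text; empty if none."""
    lines = text.splitlines()
    if not lines:
        return ["file is empty"]

    errors = []
    if lines[0] == "":
        errors.append("line 1: file starts with an empty line")
    if lines[-1] != "":
        errors.append("file must end with exactly one empty line")

    for number, line in enumerate(lines, start=1):
        if line == "":
            # Two empty lines in a row break the "exactly one" rule.
            if number > 1 and lines[number - 2] == "":
                errors.append(f"line {number}: more than one consecutive empty line")
            continue

        parts = line.split(" ")
        if len(parts) != 2 or not parts[0] or not parts[1]:
            errors.append(f"line {number}: expected '{{token}} {{label}}' with exactly one space")
            continue

        token, label = parts
        if any(ch.isspace() for ch in token + label):
            errors.append(f"line {number}: unexpected white space")
        elif label != "O" and not label.startswith(("I-", "B-")):
            errors.append(f"line {number}: label {label!r} must start with I-, B-, or be exactly O")

    return errors


sample = "Hudson B-loc\nSquare I-loc\nis O\n\nStephen B-per\nCurry I-per\n\n"
print(validate_conll(sample))  # []
```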
Configure the experiment
AutoML NLP capability is triggered through task-specific automl type jobs, the same workflow used for submitting classification, regression, and forecasting AutoML tasks. You set parameters such as experiment_name, compute_name, and data inputs the same as for those experiments. However, there are the following key differences:
- You can ignore primary_metric, because it's only for reporting. AutoML trains only one model per run for NLP, and there's no model selection.
- The label_column_name parameter is required only for multiclass and multilabel text classification tasks.
- If more than 10% of the samples in your dataset contain more than 128 tokens, it's considered long range.
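To get a rough idea of whether your dataset counts as long range, you can estimate the share of long samples locally. The following Python sketch is an approximation: it assumes white space tokenization, whereas the token counts that AutoML uses depend on the model's tokenizer and are typically higher. The function name `is_long_range` is hypothetical.

```python
def is_long_range(texts, max_tokens=128, share=0.10, tokenize=str.split):
    """Return True if more than `share` of the texts exceed `max_tokens` tokens."""
    if not texts:
        return False
    long_count = sum(1 for text in texts if len(tokenize(text)) > max_tokens)
    return long_count > share * len(texts)


# Hypothetical dataset: 2 of 10 samples have 200 white-space-separated tokens.
texts = ["word " * 200] * 2 + ["a short sample"] * 8
print(is_long_range(texts))  # True
```

Pass a different `tokenize` callable if you have access to the tokenizer of the model you plan to use.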
For CLI v2 AutoML jobs, you configure your experiment in a YAML file.
Language settings
As part of the NLP functionality, AutoML supports 104 languages by using language-specific and multilingual pretrained text Deep Neural Network (DNN) models, such as the Bidirectional Encoder Representations from Transformers (BERT) family of models. Language selection defaults to English.
The following table summarizes which model is applied based on task type and language. For more information, see supported languages and their codes.
| Task | Syntax for dataset_language | Text model algorithm |
|---|---|---|
| Multilabel text classification | "eng", "deu", "mul" | English BERT uncased, German BERT, Multilingual BERT. For all other languages, AutoML applies multilingual BERT. |
| Multiclass text classification | "eng", "deu", "mul" | English BERT cased, Multilingual BERT. For all other languages, AutoML applies multilingual BERT. |
| Named entity recognition (NER) | "eng", "deu", "mul" | English BERT cased, German BERT, Multilingual BERT. For all other languages, AutoML applies multilingual BERT. |
BERT is also used in the featurization process of AutoML experiment training. For more information, see BERT integration and featurization in AutoML (SDK v1).
You can specify your dataset language in the featurization section of your configuration YAML file.
featurization:
  dataset_language: "eng"
Submit the AutoML job
To submit your AutoML job, run the following CLI v2 command. Replace the placeholders with your YAML filename and path, workspace name, resource group, and subscription ID.
az ml job create --file ./<YAML filename> --workspace-name <machine-learning-workspace> --resource-group <resource-group> --subscription <subscription ID>
You can also run your NLP experiments with distributed training on an Azure Machine Learning compute cluster.
Code examples
For more examples, see the sample YAML files for each NLP task at https://github.com/Azure/azureml-examples/cli/jobs/automl-standalone-jobs.
Model sweeping and hyperparameter tuning (preview)
Important
This feature is currently in public preview. This preview version is provided without a service-level agreement, and we don't recommend it for production workloads. Certain features might not be supported or might have constrained capabilities.
For more information, see Supplemental Terms of Use for Azure Previews.
AutoML NLP lets you provide a list of models and combinations of hyperparameters via the hyperparameter search space in the config. HyperDrive generates several child runs. Each child run is a fine-tuning run for a given NLP model and a set of hyperparameter values chosen from the provided search space.
Supported model algorithms
The following list shows all the pretrained text DNN models available in AutoML NLP for fine-tuning:
- bert-base-cased
- bert-large-uncased
- bert-base-multilingual-cased
- bert-base-german-cased
- bert-large-cased
- distilbert-base-cased
- distilbert-base-uncased
- roberta-base
- roberta-large
- distilroberta-base
- xlm-roberta-base
- xlm-roberta-large
- xlnet-base-cased
- xlnet-large-cased
The large models have more parameters than their base counterparts and are typically more performant, but they take up more GPU memory and more time for training. Their SKU requirements are therefore more stringent: use NDv2-series VMs for the best results.
Supported hyperparameters
The following table describes the hyperparameters that AutoML NLP supports.
| Parameter name | Description | Syntax |
|---|---|---|
| gradient_accumulation_steps | The number of backward operations whose gradients are to be summed up before performing one step of gradient descent by calling the optimizer's step function. An effective batch size is gradient_accumulation_steps times larger than the maximum size that fits the GPU. | Must be a positive integer. |
| learning_rate | Initial learning rate. | Must be a float in the range [0, 1]. |
| learning_rate_scheduler | Type of learning rate scheduler. | Must choose from linear, cosine, cosine_with_restarts, polynomial, constant, constant_with_warmup. |
| model_name | Name of one of the supported models. | Must choose from bert_base_cased, bert_base_uncased, bert_base_multilingual_cased, bert_base_german_cased, bert_large_cased, bert_large_uncased, distilbert_base_cased, distilbert_base_uncased, roberta_base, roberta_large, distilroberta_base, xlm_roberta_base, xlm_roberta_large, xlnet_base_cased, xlnet_large_cased. |
| number_of_epochs | Number of training epochs. | Must be a positive integer. |
| training_batch_size | Training batch size. | Must be a positive integer. |
| validation_batch_size | Validation batch size. | Must be a positive integer. |
| warmup_ratio | Ratio of total training steps used for a linear warmup from 0 to learning_rate. | Must be a float in the range [0, 1]. |
| weight_decay | Value of weight decay when optimizer is sgd, adam, or adamw. | Must be a float in the range [0, 1]. |
All discrete hyperparameters only allow choice distributions, such as the integer-typed training_batch_size and the string-typed model_name hyperparameters. All continuous hyperparameters like learning_rate support all distributions.
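The arithmetic behind gradient_accumulation_steps and warmup_ratio can be sketched in Python as follows. This is an illustration of the definitions in the table, not the training code AutoML runs. The function names are hypothetical, and truncating the warmup step count with `int()` is an assumption. After the warmup, the sketch holds the rate constant, which corresponds to the constant_with_warmup scheduler; other schedulers change the rate after that point.

```python
def effective_batch_size(training_batch_size, gradient_accumulation_steps):
    """Gradients from several batches are summed before one optimizer step."""
    return training_batch_size * gradient_accumulation_steps


def warmup_learning_rate(step, total_steps, learning_rate, warmup_ratio):
    """Learning rate at `step`: linear ramp from 0, then constant."""
    warmup_steps = int(total_steps * warmup_ratio)  # assumption: round down
    if warmup_steps == 0 or step >= warmup_steps:
        return learning_rate
    return learning_rate * step / warmup_steps


# A batch of 32 that fits on the GPU, accumulated over 4 steps.
print(effective_batch_size(32, 4))  # 128

# With 1,000 total steps and warmup_ratio 0.1, the ramp covers the first 100 steps.
for step in (0, 50, 100):
    print(step, warmup_learning_rate(step, 1000, 5e-5, 0.1))
```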
Configure sweep settings
You can configure all the sweep-related parameters. Multiple model subspaces can be constructed with hyperparameters conditional to the respective model, as shown in each hyperparameter tuning example.
The same discrete and continuous distribution options that are available for general HyperDrive jobs are supported. For all nine options, see Hyperparameter tuning a model.
limits:
  timeout_minutes: 120
  max_trials: 4
  max_concurrent_trials: 2
sweep:
  sampling_algorithm: grid
  early_termination:
    type: bandit
    evaluation_interval: 10
    slack_factor: 0.2
search_space:
  - model_name:
      type: choice
      values: [bert_base_cased, roberta_base]
    number_of_epochs:
      type: choice
      values: [3, 4]
  - model_name:
      type: choice
      values: [distilbert_base_cased]
    learning_rate:
      type: uniform
      min_value: 0.000005
      max_value: 0.00005
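To see what "hyperparameters conditional to the respective model" means in practice, the following Python sketch draws trial configurations from the same search space. It's a simplified illustration, not how HyperDrive samples: it uses random sampling rather than the grid algorithm in the YAML, and it assumes each subspace is picked with equal probability. The names `SEARCH_SPACE` and `sample_trial` are hypothetical.

```python
import random

# The search space from the YAML example, written as Python data.
SEARCH_SPACE = [
    {
        "model_name": {"type": "choice", "values": ["bert_base_cased", "roberta_base"]},
        "number_of_epochs": {"type": "choice", "values": [3, 4]},
    },
    {
        "model_name": {"type": "choice", "values": ["distilbert_base_cased"]},
        "learning_rate": {"type": "uniform", "min_value": 0.000005, "max_value": 0.00005},
    },
]


def sample_trial(search_space, rng):
    """Pick one subspace, then draw a value for each hyperparameter in it."""
    subspace = rng.choice(search_space)  # assumption: subspaces are equally likely
    trial = {}
    for name, spec in subspace.items():
        if spec["type"] == "choice":
            trial[name] = rng.choice(spec["values"])
        elif spec["type"] == "uniform":
            trial[name] = rng.uniform(spec["min_value"], spec["max_value"])
        else:
            raise ValueError(f"unsupported distribution: {spec['type']}")
    return trial


rng = random.Random(0)
for _ in range(4):
    print(sample_trial(SEARCH_SPACE, rng))
```

Each trial contains only the hyperparameters of the subspace it was drawn from, so a distilbert_base_cased trial never carries a swept number_of_epochs, and a bert_base_cased or roberta_base trial never carries a swept learning_rate.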
Sampling methods for the sweep
When you sweep hyperparameters, you need to specify the sampling method to use for sweeping over the defined parameter space. The following sampling methods are supported with the sampling_algorithm parameter:
| Sampling type | AutoML job syntax |
|---|---|
| Random | random |
| Grid | grid |
| Bayesian | bayesian |
Experiment budget
You can optionally specify the experiment budget for your AutoML NLP training job using the timeout_minutes parameter in the limits. This parameter defines the amount of time in minutes before the experiment terminates. If no timeout is specified, the default experiment timeout is seven days and the maximum is 60 days.
AutoML NLP also supports trial_timeout_minutes, the maximum amount of time in minutes an individual trial can run before being terminated, and max_nodes, the maximum number of nodes from the backing compute cluster to use for the job. These parameters are also set in the limits section.
limits:
  timeout_minutes: 60
  trial_timeout_minutes: 20
  max_nodes: 2
Early termination policies
You can automatically end poorly performing runs with an early termination policy. Early termination improves computational efficiency, saving compute resources that would otherwise be spent on less promising configurations.
AutoML NLP supports early termination policies using the early_termination parameter. If no termination policy is specified, all configurations are run to completion. For more information, see Specify early termination policy.
Resources for the sweep
You can control the resources spent on your hyperparameter sweep by specifying the max_trials and the max_concurrent_trials for the sweep.
| Parameter | Detail |
|---|---|
| max_trials | Maximum number of configurations to sweep. Must be an integer between 1 and 1000. When exploring just the default hyperparameters for a given model algorithm, set this parameter to 1. The default value is 1. |
| max_concurrent_trials | Maximum number of runs that can run concurrently. If specified, must be an integer between 1 and 100. The default value is 1. NOTE: - The number of concurrent runs is gated on the resources available in the specified compute target. Ensure that the compute target has the available resources for the desired concurrency. - The max_concurrent_trials value is capped at max_trials internally. For example, if you set max_concurrent_trials=4, max_trials=2, values are internally updated as max_concurrent_trials=2, max_trials=2. |
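The capping rule can be expressed as a short Python sketch. The function name `resolve_trial_limits` is hypothetical, and the sketch only restates the ranges and defaults from the table. It isn't the service's own validation code.

```python
def resolve_trial_limits(max_trials=1, max_concurrent_trials=1):
    """Validate both limits and cap the concurrency at max_trials."""
    if not 1 <= max_trials <= 1000:
        raise ValueError("max_trials must be an integer between 1 and 1000")
    if not 1 <= max_concurrent_trials <= 100:
        raise ValueError("max_concurrent_trials must be an integer between 1 and 100")
    return max_trials, min(max_concurrent_trials, max_trials)


# max_concurrent_trials=4 with max_trials=2 is reduced to 2 concurrent trials.
print(resolve_trial_limits(max_trials=2, max_concurrent_trials=4))  # (2, 2)
```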
The following example shows how to configure the sweep related parameters.
sweep:
  limits:
    max_trials: 10
    max_concurrent_trials: 2
  sampling_algorithm: random
  early_termination:
    type: bandit
    evaluation_interval: 2
    slack_factor: 0.2
    delay_evaluation: 6
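The bandit settings in this example can be read as the following decision rule, sketched in Python. This article doesn't spell the rule out, so treat the sketch as an assumption based on the general Azure Machine Learning bandit policy: a trial is stopped when its best metric falls below the best trial's metric divided by (1 + slack_factor), and the check runs at every multiple of evaluation_interval that is at or after delay_evaluation. The function name `should_terminate` is hypothetical, and the sketch assumes that the metric is being maximized.

```python
def should_terminate(trial_best, overall_best, interval,
                     slack_factor=0.2, evaluation_interval=2, delay_evaluation=6):
    """Bandit-style check for a metric where higher is better."""
    if interval < delay_evaluation:
        return False  # policy isn't applied yet
    if interval % evaluation_interval != 0:
        return False  # policy is only applied at multiples of the interval
    return trial_best < overall_best / (1 + slack_factor)


# The best trial scores 0.80, so the cutoff is 0.80 / 1.2, or about 0.667.
print(should_terminate(0.60, 0.80, interval=6))  # True
print(should_terminate(0.70, 0.80, interval=6))  # False
print(should_terminate(0.60, 0.80, interval=4))  # False, still within the delay
```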
Known issues
Certain datasets produce low scores, even zero, regardless of the NLP task. High loss values accompany these scores, implying that the neural network failed to converge. Such cases are uncommon but possible, and can happen more frequently on certain GPU series.
The best way to handle these cases is to use hyperparameter tuning and provide a wider range of values, especially for hyperparameters like learning rates. Until hyperparameter tuning capability is available in production, use the NC6 or ND6 compute clusters if you experience these issues. These clusters typically have fairly stable training outcomes.