Configure training, validation, cross-validation and test data in automated machine learning
APPLIES TO: Python SDK azureml v1
In this article, you learn the different options for configuring training data and validation data splits, along with cross-validation settings, for your automated machine learning (automated ML) experiments.
In Azure Machine Learning, when you use automated ML to build multiple ML models, each child run needs to validate the related model by calculating the quality metrics for that model, such as accuracy or AUC weighted. These metrics are calculated by comparing the predictions made with each model with real labels from past observations in the validation data. Learn more about how metrics are calculated based on validation type.
Automated ML experiments perform model validation automatically. The following sections describe how you can further customize validation settings with the Azure Machine Learning Python SDK.
For a low-code or no-code experience, see Create your automated machine learning experiments in Azure Machine Learning studio.
Prerequisites
For this article, you need:
An Azure Machine Learning workspace. To create the workspace, see Create workspace resources.
Familiarity with setting up an automated machine learning experiment with the Azure Machine Learning SDK. Follow the tutorial or how-to to see the fundamental automated machine learning experiment design patterns.
An understanding of train/validation data splits and cross-validation as machine learning concepts. For a high-level explanation,
Important
The Python commands in this article require the latest azureml-train-automl package version.
- Install the latest azureml-train-automl package to your local environment.
- For details on the latest azureml-train-automl package, see the release notes.
Default data splits and cross-validation in machine learning
Use the AutoMLConfig object to define your experiment and training settings. In the following code snippet, notice that only the required parameters are defined; that is, the n_cross_validations and validation_data parameters are not included.
Note
The default data splits and cross-validation are not supported in forecasting scenarios.
from azureml.core import Dataset
from azureml.train.automl import AutoMLConfig

# aml_remote_compute is assumed to be an existing compute target in your workspace.
data = "https://automlsamplenotebookdata.blob.core.chinacloudapi.cn/automl-sample-notebook-data/creditcard.csv"
dataset = Dataset.Tabular.from_delimited_files(data)

automl_config = AutoMLConfig(compute_target = aml_remote_compute,
                             task = 'classification',
                             primary_metric = 'AUC_weighted',
                             training_data = dataset,
                             label_column_name = 'Class'
                            )
If you do not explicitly specify either a validation_data or n_cross_validations parameter, automated ML applies default techniques depending on the number of rows in the single dataset provided as training_data.
Training data size | Validation technique |
---|---|
Larger than 20,000 rows | Train/validation data split is applied. The default is to take 10% of the initial training data set as the validation set. In turn, that validation set is used for metrics calculation. |
Smaller than 20,000 rows | Cross-validation approach is applied. The default number of folds depends on the number of rows. If the dataset has fewer than 1,000 rows, 10 folds are used. If the dataset has between 1,000 and 20,000 rows, three folds are used. |
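The defaults in the table can be restated as a small helper. This is only an illustrative sketch: default_validation_technique is a hypothetical name, not part of the SDK, and treating exactly 20,000 rows as cross-validation is an assumption, because the table does not state that boundary.

```python
def default_validation_technique(n_rows: int) -> str:
    """Describe the default technique from the table for a given row count.

    Hypothetical helper for illustration only; automated ML makes this
    choice internally.
    """
    if n_rows > 20_000:
        # Larger datasets: hold out 10% of the training data for validation.
        return "train/validation split (10% validation)"
    if n_rows < 1_000:
        # Small datasets: 10-fold cross-validation.
        return "cross-validation (10 folds)"
    # 1,000 to 20,000 rows (upper boundary is an assumption): three folds.
    return "cross-validation (3 folds)"


for rows in (500, 5_000, 50_000):
    print(rows, "->", default_validation_technique(rows))
```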
Provide validation data
In this case, you can either start with a single data file and split it into training and validation data sets, or you can provide a separate data file for the validation set. Either way, the validation_data parameter in your AutoMLConfig object specifies which data to use as your validation set. This parameter only accepts data sets in the form of an Azure Machine Learning dataset or pandas dataframe.
Note
The validation_data parameter requires the training_data and label_column_name parameters to be set as well. You can only set one validation parameter; that is, you can specify either validation_data or n_cross_validations, not both.
The following code example explicitly defines which portion of the provided data in dataset to use for training and validation.
data = "https://automlsamplenotebookdata.blob.core.chinacloudapi.cn/automl-sample-notebook-data/creditcard.csv"
dataset = Dataset.Tabular.from_delimited_files(data)

# Use 80% of the rows for training and the remaining 20% for validation.
training_data, validation_data = dataset.random_split(percentage=0.8, seed=1)

automl_config = AutoMLConfig(compute_target = aml_remote_compute,
                             task = 'classification',
                             primary_metric = 'AUC_weighted',
                             training_data = training_data,
                             validation_data = validation_data,
                             label_column_name = 'Class'
                            )
Provide validation set size
In this case, only a single dataset is provided for the experiment. That is, the validation_data parameter is not specified, and the provided dataset is assigned to the training_data parameter.
In your AutoMLConfig object, you can set the validation_size parameter to hold out a portion of the training data for validation. Automated ML then splits the validation set from the initial training_data provided. The value must be between 0.0 and 1.0, exclusive (for example, 0.2 means 20% of the data is held out for validation).
Note
The validation_size parameter is not supported in forecasting scenarios.
See the following code example:
data = "https://automlsamplenotebookdata.blob.core.chinacloudapi.cn/automl-sample-notebook-data/creditcard.csv"
dataset = Dataset.Tabular.from_delimited_files(data)

automl_config = AutoMLConfig(compute_target = aml_remote_compute,
                             task = 'classification',
                             primary_metric = 'AUC_weighted',
                             training_data = dataset,
                             validation_size = 0.2,
                             label_column_name = 'Class'
                            )
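To make the effect of validation_size concrete, the following is a minimal conceptual sketch of a random holdout split using only the Python standard library. The holdout_split function and its seed argument are hypothetical; the article does not describe how automated ML implements the split internally.

```python
import random


def holdout_split(rows, validation_size, seed=0):
    """Randomly hold out a fraction of rows for validation (illustrative only)."""
    if not 0.0 < validation_size < 1.0:
        raise ValueError("validation_size must be between 0.0 and 1.0, exclusive")
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_validation = round(len(shuffled) * validation_size)
    # The first n_validation shuffled rows are held out; the rest are for training.
    return shuffled[n_validation:], shuffled[:n_validation]


train, validation = holdout_split(range(100), validation_size=0.2)
print(len(train), len(validation))  # 80 20
```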
K-fold cross-validation
To perform k-fold cross-validation, include the n_cross_validations parameter and set it to a value. This parameter sets how many cross validations to perform, based on the same number of folds.
Note
The n_cross_validations parameter is not supported in classification scenarios that use deep neural networks.
For forecasting scenarios, see how cross validation is applied in Set up AutoML to train a time-series forecasting model.
In the following code, five folds for cross-validation are defined. As a result, five different trainings run; each training uses 4/5 of the data, and each validation uses 1/5 of the data with a different holdout fold each time.
As a result, metrics are calculated with the average of the five validation metrics.
data = "https://automlsamplenotebookdata.blob.core.chinacloudapi.cn/automl-sample-notebook-data/creditcard.csv"
dataset = Dataset.Tabular.from_delimited_files(data)

automl_config = AutoMLConfig(compute_target = aml_remote_compute,
                             task = 'classification',
                             primary_metric = 'AUC_weighted',
                             training_data = dataset,
                             n_cross_validations = 5,
                             label_column_name = 'Class'
                            )
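The fold arithmetic described for k-fold cross-validation can be sketched in plain Python. The k_fold_indices function below is a hypothetical illustration of the general technique, not the SDK's implementation; in particular, it does not shuffle the rows, which real implementations commonly do first.

```python
def k_fold_indices(n_rows, n_folds):
    """Yield (train_indices, validation_indices) for each of n_folds folds."""
    indices = list(range(n_rows))
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_rows // n_folds + (1 if i < n_rows % n_folds else 0)
                  for i in range(n_folds)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size


# With 10 rows and 5 folds, each training set has 8 rows (4/5 of the data)
# and each validation fold has 2 rows (1/5 of the data).
for train, validation in k_fold_indices(n_rows=10, n_folds=5):
    print(len(train), "training rows; validation rows:", validation)
```

Every row appears in exactly one validation fold, which is why metrics averaged over the folds reflect all samples in the training data.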
Monte Carlo cross-validation
To perform Monte Carlo cross validation, include both the validation_size and n_cross_validations parameters in your AutoMLConfig object.
For Monte Carlo cross validation, automated ML sets aside the portion of the training data specified by the validation_size parameter for validation, and then assigns the rest of the data for training. This process is then repeated the number of times specified by the n_cross_validations parameter, which generates new training and validation splits, at random, each time.
Note
The Monte Carlo cross-validation is not supported in forecasting scenarios.
The following code defines seven folds for cross-validation and specifies that 20% of the training data is used for validation. As a result, seven different trainings run; each training uses 80% of the data, and each validation uses 20% of the data with a different random holdout set each time.
data = "https://automlsamplenotebookdata.blob.core.chinacloudapi.cn/automl-sample-notebook-data/creditcard.csv"
dataset = Dataset.Tabular.from_delimited_files(data)

automl_config = AutoMLConfig(compute_target = aml_remote_compute,
                             task = 'classification',
                             primary_metric = 'AUC_weighted',
                             training_data = dataset,
                             n_cross_validations = 7,
                             validation_size = 0.2,
                             label_column_name = 'Class'
                            )
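The repeated random splitting behind Monte Carlo cross-validation can be sketched as follows. This is a conceptual illustration using only the standard library; monte_carlo_splits and its seed argument are hypothetical names, and the SDK's internal sampling may differ.

```python
import random


def monte_carlo_splits(n_rows, n_cross_validations, validation_size, seed=0):
    """Return a list of independent random (train, validation) index splits."""
    rng = random.Random(seed)
    n_validation = round(n_rows * validation_size)
    splits = []
    for _ in range(n_cross_validations):
        # Draw a fresh random ordering of all row indices for every repetition.
        shuffled = rng.sample(range(n_rows), n_rows)
        train = sorted(shuffled[n_validation:])
        validation = sorted(shuffled[:n_validation])
        splits.append((train, validation))
    return splits


splits = monte_carlo_splits(n_rows=100, n_cross_validations=7, validation_size=0.2)
for train, validation in splits:
    print(len(train), len(validation))  # 80 20 on every repetition
```

Unlike k-fold cross-validation, the validation sets drawn this way can overlap between repetitions, and some rows might never be selected for validation.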
Specify custom cross-validation data folds
You can also provide your own cross-validation (CV) data folds. This is considered a more advanced scenario because you specify which columns to split and use for validation. Include custom CV split columns in your training data, and specify which columns to use by listing their names in the cv_split_column_names parameter. Each column represents one cross-validation split and is filled with the integer values 1 or 0, where 1 indicates that the row should be used for training and 0 indicates that the row should be used for validation.
Note
The cv_split_column_names parameter is not supported in forecasting scenarios.
The following code snippet contains bank marketing data with two CV split columns 'cv1' and 'cv2'.
data = "https://automlsamplenotebookdata.blob.core.chinacloudapi.cn/automl-sample-notebook-data/bankmarketing_with_cv.csv"
dataset = Dataset.Tabular.from_delimited_files(data)

automl_config = AutoMLConfig(compute_target = aml_remote_compute,
                             task = 'classification',
                             primary_metric = 'AUC_weighted',
                             training_data = dataset,
                             label_column_name = 'y',
                             cv_split_column_names = ['cv1', 'cv2']
                            )
Note
To use cv_split_column_names with training_data and label_column_name, upgrade to Azure Machine Learning Python SDK version 1.6.0 or later. For earlier SDK versions, use cv_splits_indices instead, but note that it works with X and y dataset input only.
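The 1/0 convention for custom CV split columns can be illustrated with a few toy rows. The data below is made up for this sketch and is not taken from the bank marketing dataset; split_by_cv_column is a hypothetical helper, not an SDK function.

```python
# Made-up rows: 'cv1' and 'cv2' each define one cross-validation split.
rows = [
    {"age": 30, "y": "no",  "cv1": 1, "cv2": 0},
    {"age": 45, "y": "yes", "cv1": 1, "cv2": 0},
    {"age": 52, "y": "no",  "cv1": 1, "cv2": 1},
    {"age": 27, "y": "yes", "cv1": 0, "cv2": 1},
    {"age": 61, "y": "no",  "cv1": 0, "cv2": 1},
]


def split_by_cv_column(rows, cv_column):
    """Rows marked 1 go to training; rows marked 0 go to validation."""
    train = [row for row in rows if row[cv_column] == 1]
    validation = [row for row in rows if row[cv_column] == 0]
    return train, validation


for column in ("cv1", "cv2"):
    train, validation = split_by_cv_column(rows, column)
    print(column, "->", len(train), "training rows,", len(validation), "validation rows")
```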
Metric calculation for cross validation in machine learning
When either k-fold or Monte Carlo cross validation is used, metrics are computed on each validation fold and then aggregated. The aggregation operation is an average for scalar metrics and a sum for charts. Metrics computed during cross validation are based on all folds and therefore all samples from the training set. Learn more about metrics in automated machine learning.
When either a custom validation set or an automatically selected validation set is used, model evaluation metrics are computed from only that validation set, not the training data.
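The aggregation described for cross-validation (an average for scalar metrics, a sum for charts) can be sketched as follows. The per-fold values are made up, and using a confusion matrix as the example of a chart-style metric is an assumption for illustration; the helper names are hypothetical.

```python
from statistics import mean

# Made-up per-fold results for a three-fold run.
fold_accuracy = [0.90, 0.80, 0.85]
fold_confusion_matrices = [
    [[40, 10], [5, 45]],
    [[38, 12], [8, 42]],
    [[41, 9], [6, 44]],
]


def aggregate_scalar(values):
    """Scalar metrics are averaged across folds."""
    return mean(values)


def aggregate_chart(matrices):
    """Chart-style metrics are summed cell by cell across folds."""
    return [[sum(cells) for cells in zip(*rows)] for rows in zip(*matrices)]


print("mean accuracy:", aggregate_scalar(fold_accuracy))
print("summed confusion matrix:", aggregate_chart(fold_confusion_matrices))
```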
Provide test data (preview)
Important
This feature is currently in public preview. This preview version is provided without a service-level agreement, and it's not recommended for production workloads. Certain features might not be supported or might have constrained capabilities. For more information, see Supplemental Terms of Use for Azure Previews.
You can also provide test data to evaluate the recommended model that automated ML generates for you upon completion of the experiment. When you provide test data, it's considered separate from the training and validation data, so that it doesn't bias the results of the test run of the recommended model. Learn more about training, validation and test data in automated ML.
Warning
This feature is not available for the following automated ML scenarios
Test datasets must be in the form of an Azure Machine Learning TabularDataset. You can specify a test dataset with the test_data and test_size parameters in your AutoMLConfig object. These parameters are mutually exclusive and cannot be specified at the same time, or together with cv_split_column_names or cv_splits_indices.
With the test_data parameter, specify an existing dataset to pass into your AutoMLConfig object.
automl_config = AutoMLConfig(task='forecasting',
                             ...
                             # Provide an existing test dataset
                             test_data=test_dataset,
                             ...
                             forecasting_parameters=forecasting_parameters)
To use a train/test split instead of providing test data directly, use the test_size parameter when creating the AutoMLConfig. This parameter must be a floating point value between 0.0 and 1.0, exclusive, and specifies the percentage of the training dataset that should be used for the test dataset.
automl_config = AutoMLConfig(task = 'regression',
                             ...
                             # Specify train/test split
                             training_data=training_data,
                             test_size=0.2)
Note
For regression tasks, random sampling is used.
For classification tasks, stratified sampling is used, but random sampling is used as a fall back when stratified sampling is not feasible.
Forecasting does not currently support specifying a test dataset using a train/test split with the test_size parameter.
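To illustrate the difference between random and stratified sampling mentioned in the note, here is a minimal sketch of a stratified split in plain Python. It is a conceptual example only: stratified_test_split is a hypothetical function, and the article does not describe the SDK's actual stratification or fallback logic.

```python
import random
from collections import defaultdict


def stratified_test_split(labels, test_size, seed=0):
    """Return (train_indices, test_indices) that keep class proportions similar."""
    rng = random.Random(seed)
    indices_by_class = defaultdict(list)
    for index, label in enumerate(labels):
        indices_by_class[label].append(index)
    train, test = [], []
    for indices in indices_by_class.values():
        # Sample the test portion separately within each class.
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return sorted(train), sorted(test)


labels = ["a"] * 80 + ["b"] * 20  # made-up, imbalanced labels
train, test = stratified_test_split(labels, test_size=0.2)
print(sum(labels[i] == "a" for i in test), sum(labels[i] == "b" for i in test))  # 16 4
```

A purely random split of the same data would give 20 test rows on average with the same class ratio, but any single draw could over- or under-represent the smaller class.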
Passing the test_data or test_size parameter into the AutoMLConfig automatically triggers a remote test run upon completion of your experiment. This test run uses the provided test data to evaluate the best model that automated ML recommends. Learn more about how to get the predictions from the test run.