Configure training, validation, cross-validation and test data in automated machine learning

APPLIES TO: Python SDK azureml v1

In this article, you learn about the different options for configuring training data and validation data splits, along with cross-validation settings, for your automated machine learning (automated ML) experiments.

In Azure Machine Learning, when you use automated ML to build multiple ML models, each child run needs to validate the related model by calculating the quality metrics for that model, such as accuracy or AUC weighted. These metrics are calculated by comparing the predictions each model makes against the real labels from past observations in the validation data. Learn more about how metrics are calculated based on validation type.
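For intuition only, the following sketch shows how a validation metric such as weighted AUC could be computed by hand by comparing a model's predicted probabilities against the real labels of a validation set. It uses scikit-learn, which is not part of the automated ML configuration in this article, and the model and data names are hypothetical placeholders.

# Illustration only: computing a validation metric (weighted AUC) manually with scikit-learn.
# `model`, `X_validation`, and `y_validation` are hypothetical placeholders.
from sklearn.metrics import roc_auc_score

# Predicted probability of the positive class for each validation row
predicted_probabilities = model.predict_proba(X_validation)[:, 1]

# Compare predictions against the real labels from the validation data
auc_weighted = roc_auc_score(y_validation, predicted_probabilities, average="weighted")
print(f"AUC weighted: {auc_weighted:.4f}")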

Automated ML experiments perform model validation automatically. The following sections describe how you can further customize validation settings with the Azure Machine Learning Python SDK.

For a low-code or no-code experience, see Create your automated machine learning experiments in Azure Machine Learning studio.

Prerequisites

For this article, you need:

Important

The Python commands in this article require the latest azureml-train-automl package version.

Default data splits and cross-validation in machine learning

Use the AutoMLConfig object to define your experiment and training settings. In the following code snippet, notice that only the required parameters are defined; the n_cross_validations and validation_data parameters are not included.

Note

The default data splits and cross-validation are not supported in forecasting scenarios.

data = "https://automlsamplenotebookdata.blob.core.chinacloudapi.cn/automl-sample-notebook-data/creditcard.csv"

dataset = Dataset.Tabular.from_delimited_files(data)

automl_config = AutoMLConfig(compute_target = aml_remote_compute,
                             task = 'classification',
                             primary_metric = 'AUC_weighted',
                             training_data = dataset,
                             label_column_name = 'Class'
                            )

If you do not explicitly specify either a validation_data or n_cross_validations parameter, automated ML applies default techniques depending on the number of rows provided in the single dataset training_data.

| Training data size | Validation technique |
|---|---|
| Larger than 20,000 rows | A train/validation data split is applied. The default is to take 10% of the initial training data set as the validation set. In turn, that validation set is used for metrics calculation. |
| Smaller than 20,000 rows | A cross-validation approach is applied. The default number of folds depends on the number of rows. If the dataset has fewer than 1,000 rows, 10 folds are used. If the dataset has between 1,000 and 20,000 rows, three folds are used. |
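You don't write any code for these defaults; automated ML applies them internally. Purely as an illustration of the first row in the table, the following hedged sketch shows what the default 10% holdout would amount to if you performed the split yourself with the TabularDataset random_split method (the 0.9 percentage and the seed value are illustrative assumptions):

# Illustration only: automated ML applies the default split for you.
# For datasets larger than 20,000 rows, the default keeps ~90% for training
# and holds out ~10% for validation and metrics calculation.
training_data, validation_data = dataset.random_split(percentage=0.9, seed=1)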

Provide validation data

In this case, you can either start with a single data file and split it into training and validation data sets, or you can provide a separate data file for the validation set. Either way, the validation_data parameter in your AutoMLConfig object assigns which data to use as your validation set. This parameter only accepts data sets in the form of an Azure Machine Learning dataset or pandas dataframe.

Note

The validation_data parameter requires the training_data and label_column_name parameters to be set as well. You can only set one validation parameter; that is, you can specify either validation_data or n_cross_validations, not both.

The following code example explicitly defines which portion of the provided data in dataset to use for training and validation.

data = "https://automlsamplenotebookdata.blob.core.chinacloudapi.cn/automl-sample-notebook-data/creditcard.csv"

dataset = Dataset.Tabular.from_delimited_files(data)

training_data, validation_data = dataset.random_split(percentage=0.8, seed=1)

automl_config = AutoMLConfig(compute_target = aml_remote_compute,
                             task = 'classification',
                             primary_metric = 'AUC_weighted',
                             training_data = training_data,
                             validation_data = validation_data,
                             label_column_name = 'Class'
                            )

Provide validation set size

In this case, only a single dataset is provided for the experiment. That is, the validation_data parameter is not specified, and the provided dataset is assigned to the training_data parameter.

In your AutoMLConfig object, you can set the validation_size parameter to hold out a portion of the training data for validation. This means that the validation set will be split by automated ML from the initial training_data provided. This value should be between 0.0 and 1.0 non-inclusive (for example, 0.2 means 20% of the data is held out for validation data).

Note

The validation_size parameter is not supported in forecasting scenarios.

See the following code example:

data = "https://automlsamplenotebookdata.blob.core.chinacloudapi.cn/automl-sample-notebook-data/creditcard.csv"

dataset = Dataset.Tabular.from_delimited_files(data)

automl_config = AutoMLConfig(compute_target = aml_remote_compute,
                             task = 'classification',
                             primary_metric = 'AUC_weighted',
                             training_data = dataset,
                             validation_size = 0.2,
                             label_column_name = 'Class'
                            )

K-fold cross-validation

To perform k-fold cross-validation, include the n_cross_validations parameter and set it to a value. This parameter sets how many cross validations to perform, based on the same number of folds.

Note

The n_cross_validations parameter is not supported in classification scenarios that use deep neural networks. For forecasting scenarios, see how cross validation is applied in Set up AutoML to train a time-series forecasting model.

The following code defines five folds for cross-validation. Hence, five different trainings are run; each training uses 4/5 of the data, and each validation uses the remaining 1/5 of the data with a different holdout fold each time.

As a result, metrics are calculated with the average of the five validation metrics.

data = "https://automlsamplenotebookdata.blob.core.chinacloudapi.cn/automl-sample-notebook-data/creditcard.csv"

dataset = Dataset.Tabular.from_delimited_files(data)

automl_config = AutoMLConfig(compute_target = aml_remote_compute,
                             task = 'classification',
                             primary_metric = 'AUC_weighted',
                             training_data = dataset,
                             n_cross_validations = 5,
                             label_column_name = 'Class'
                            )

Monte Carlo cross-validation

To perform Monte Carlo cross validation, include both the validation_size and n_cross_validations parameters in your AutoMLConfig object.

For Monte Carlo cross validation, automated ML sets aside the portion of the training data specified by the validation_size parameter for validation, and then assigns the rest of the data for training. This process is repeated the number of times specified in the n_cross_validations parameter, generating new training and validation splits at random each time.

Note

Monte Carlo cross-validation is not supported in forecasting scenarios.

The following code defines seven folds for cross-validation and specifies that 20% of the training data should be used for validation. Hence, seven different trainings are run; each training uses 80% of the data, and each validation uses 20% of the data with a different holdout fold each time.

data = "https://automlsamplenotebookdata.blob.core.chinacloudapi.cn/automl-sample-notebook-data/creditcard.csv"

dataset = Dataset.Tabular.from_delimited_files(data)

automl_config = AutoMLConfig(compute_target = aml_remote_compute,
                             task = 'classification',
                             primary_metric = 'AUC_weighted',
                             training_data = dataset,
                             n_cross_validations = 7,
                             validation_size = 0.2,
                             label_column_name = 'Class'
                            )

Specify custom cross-validation data folds

You can also provide your own cross-validation (CV) data folds. This is considered a more advanced scenario because you specify which columns to split and use for validation. Include custom CV split columns in your training data, and specify which columns by populating the column names in the cv_split_column_names parameter. Each column represents one cross-validation split, and is filled with integer values 1 or 0, where 1 indicates the row should be used for training and 0 indicates the row should be used for validation.
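As an illustration, the following sketch shows one way such split columns could be constructed with pandas before the data is registered. The file names and the 80/20 random assignment are illustrative assumptions, not requirements; the column names cv1 and cv2 match the example later in this section.

import numpy as np
import pandas as pd

# Hypothetical local copy of the training data; replace with your own file
df = pd.read_csv("bankmarketing.csv")

rng = np.random.default_rng(seed=1)

# Each CV split column marks roughly 80% of rows with 1 (use for training)
# and the remaining ~20% with 0 (use for validation), independently per split
df["cv1"] = (rng.random(len(df)) < 0.8).astype(int)
df["cv2"] = (rng.random(len(df)) < 0.8).astype(int)

df.to_csv("bankmarketing_with_cv.csv", index=False)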

Note

The cv_split_column_names parameter is not supported in forecasting scenarios.

The following code snippet contains bank marketing data with two CV split columns 'cv1' and 'cv2'.

data = "https://automlsamplenotebookdata.blob.core.chinacloudapi.cn/automl-sample-notebook-data/bankmarketing_with_cv.csv"

dataset = Dataset.Tabular.from_delimited_files(data)

automl_config = AutoMLConfig(compute_target = aml_remote_compute,
                             task = 'classification',
                             primary_metric = 'AUC_weighted',
                             training_data = dataset,
                             label_column_name = 'y',
                             cv_split_column_names = ['cv1', 'cv2']
                            )

Note

To use cv_split_column_names with training_data and label_column_name, upgrade your Azure Machine Learning Python SDK to version 1.6.0 or later. For previous SDK versions, refer to using cv_splits_indices, but note that it is used with X and y dataset input only.
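As a rough sketch of that older approach, and only under the assumption that cv_splits_indices accepts a list of [training indices, validation indices] array pairs alongside X and y inputs (check the SDK reference for your installed version before relying on this), the configuration could look like the following. X_train and y_train are hypothetical placeholders.

import numpy as np
from azureml.train.automl import AutoMLConfig

# Hypothetical example: two custom folds defined by row indices.
# Each entry pairs the training indices with the validation indices for one fold.
cv_splits_indices = [
    [np.array([0, 1, 2, 3, 4, 5]), np.array([6, 7])],
    [np.array([2, 3, 4, 5, 6, 7]), np.array([0, 1])],
]

automl_config = AutoMLConfig(task = 'classification',
                             primary_metric = 'AUC_weighted',
                             X = X_train,   # feature matrix (older X/y input style)
                             y = y_train,   # label array
                             cv_splits_indices = cv_splits_indices)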

Metric calculation for cross validation in machine learning

When either k-fold or Monte Carlo cross validation is used, metrics are computed on each validation fold and then aggregated. The aggregation operation is an average for scalar metrics and a sum for charts. Metrics computed during cross validation are based on all folds and therefore all samples from the training set. Learn more about metrics in automated machine learning.
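As a simple illustration of the scalar-metric aggregation (not SDK code), averaging hypothetical per-fold values of a metric such as AUC_weighted gives the reported cross-validation value:

# Illustration only: hypothetical per-fold AUC_weighted values from a 5-fold run
fold_auc_values = [0.97, 0.95, 0.96, 0.98, 0.94]

# Scalar metrics are aggregated as the average across folds
cross_validation_auc = sum(fold_auc_values) / len(fold_auc_values)
print(cross_validation_auc)  # 0.96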

When either a custom validation set or an automatically selected validation set is used, model evaluation metrics are computed from only that validation set, not the training data.

Provide test data (preview)

Important

This feature is currently in public preview. This preview version is provided without a service-level agreement, and it's not recommended for production workloads. Certain features might not be supported or might have constrained capabilities. For more information, see Supplemental Terms of Use for Azure Previews.

You can also provide test data to evaluate the recommended model that automated ML generates for you upon completion of the experiment. When you provide test data, it's considered separate from training and validation, so that it doesn't bias the results of the test run of the recommended model. Learn more about training, validation and test data in automated ML.

Test datasets must be in the form of an Azure Machine Learning TabularDataset. You can specify a test dataset with the test_data and test_size parameters in your AutoMLConfig object. These parameters are mutually exclusive and cannot be specified at the same time or with cv_split_column_names or cv_splits_indices.

With the test_data parameter, specify an existing dataset to pass into your AutoMLConfig object.

automl_config = AutoMLConfig(task='forecasting',
                             ...
                             # Provide an existing test dataset
                             test_data=test_dataset,
                             ...
                             forecasting_parameters=forecasting_parameters)

To use a train/test split instead of providing test data directly, use the test_size parameter when creating the AutoMLConfig. This parameter must be a floating point value between 0.0 and 1.0 exclusive, and specifies the percentage of the training dataset that should be used for the test dataset.

automl_config = AutoMLConfig(task = 'regression',
                             ...
                             # Specify train/test split
                             training_data=training_data,
                             test_size=0.2)

Note

For regression tasks, random sampling is used.
For classification tasks, stratified sampling is used, but random sampling is used as a fallback when stratified sampling is not feasible.
Forecasting does not currently support specifying a test dataset using a train/test split with the test_size parameter.

Passing the test_data or test_size parameter into the AutoMLConfig object automatically triggers a remote test run upon completion of your experiment. This test run uses the provided test data to evaluate the best model that automated ML recommends. Learn more about how to get the predictions from the test run.

Next steps