Configure data splits and cross-validation in automated machine learning

In this article, you learn the different options for configuring training/validation data splits and cross-validation for your automated machine learning (AutoML) experiments.

In Azure Machine Learning, when you use AutoML to build multiple ML models, each child run must validate its model by calculating quality metrics for that model, such as accuracy or weighted AUC. These metrics are calculated by comparing each model's predictions with the real labels from past observations in the validation data.
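As a minimal illustration of comparing predictions with real labels, the sketch below computes accuracy in plain Python. The `accuracy` function and the sample values are hypothetical and are not part of the Azure Machine Learning SDK; AutoML computes its metrics internally.

```python
def accuracy(predictions, labels):
    """Return the fraction of predictions that match the observed labels."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if not labels:
        raise ValueError("at least one labeled row is required")
    matches = sum(1 for p, y in zip(predictions, labels) if p == y)
    return matches / len(labels)


# Hypothetical validation labels and the predictions of one model.
validation_labels = [0, 0, 1, 1, 0]
model_predictions = [0, 1, 1, 1, 0]

score = accuracy(model_predictions, validation_labels)
print(score)
```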

AutoML experiments perform model validation automatically. The following sections describe how you can further customize validation settings with the Azure Machine Learning Python SDK.

For a low-code or no-code experience, see Create your automated machine learning experiments in Azure Machine Learning studio.

Note

The studio currently supports training/validation data splits and cross-validation options, but it does not support specifying individual data files for your validation set.

Prerequisites

For this article, you need:

Default data splits and cross-validation

Use the AutoMLConfig object to define your experiment and training settings. In the following code snippet, notice that only the required parameters are defined; that is, the n_cross_validations and validation_data parameters are not included.

from azureml.core import Dataset
from azureml.train.automl import AutoMLConfig

data = "https://automlsamplenotebookdata.blob.core.windows.net/automl-sample-notebook-data/creditcard.csv"

dataset = Dataset.Tabular.from_delimited_files(data)

# aml_remote_compute is assumed to be a compute target that you created earlier.
automl_config = AutoMLConfig(compute_target = aml_remote_compute,
                             task = 'classification',
                             primary_metric = 'AUC_weighted',
                             training_data = dataset,
                             label_column_name = 'Class'
                            )

If you do not explicitly specify either a validation_data or an n_cross_validations parameter, AutoML applies default techniques depending on the number of rows in the single dataset provided as training_data:

| Training data size | Validation technique |
|---|---|
| Larger than 20,000 rows | A train/validation data split is applied. By default, 10% of the initial training data set is taken as the validation set. That validation set is then used for metrics calculation. |
| Smaller than 20,000 rows | Cross-validation is applied. The default number of folds depends on the number of rows. If the dataset has fewer than 1,000 rows, 10 folds are used. If it has between 1,000 and 20,000 rows, three folds are used. |
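The default rules above can be sketched as a small decision function. This is an illustration only, not SDK code: `default_validation_strategy` is a hypothetical name, and because the article does not say how a dataset of exactly 1,000 or exactly 20,000 rows is handled, the sketch assumes both fall into the three-fold case.

```python
def default_validation_strategy(n_rows):
    """Return (technique, value) following the default rules described above.

    Assumption: exactly 1,000 and exactly 20,000 rows are treated as part of
    the "between 1,000 and 20,000" case; the article leaves this unspecified.
    """
    if n_rows < 0:
        raise ValueError("n_rows must be non-negative")
    if n_rows > 20_000:
        # Hold out 10% of the training data as the validation set.
        return ("train_validation_split", 0.10)
    if n_rows < 1_000:
        return ("cross_validation", 10)
    return ("cross_validation", 3)


for size in (500, 5_000, 250_000):
    print(size, default_validation_strategy(size))
```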

Provide validation data

In this case, you can either start with a single data file and split it into training and validation sets, or provide a separate data file for the validation set. Either way, the validation_data parameter in your AutoMLConfig object assigns which data to use as your validation set. This parameter only accepts data sets in the form of an Azure Machine Learning dataset or a pandas dataframe.

The following code example explicitly defines which portion of the provided data in dataset to use for training and validation.

data = "https://automlsamplenotebookdata.blob.core.windows.net/automl-sample-notebook-data/creditcard.csv"

dataset = Dataset.Tabular.from_delimited_files(data)

training_data, validation_data = dataset.random_split(percentage=0.8, seed=1)

automl_config = AutoMLConfig(compute_target = aml_remote_compute,
                             task = 'classification',
                             primary_metric = 'AUC_weighted',
                             training_data = training_data,
                             validation_data = validation_data,
                             label_column_name = 'Class'
                            )
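To make the idea of a seeded 80/20 split concrete, here is a standard-library sketch. It is an assumption-laden illustration: the local `random_split` helper is hypothetical, shuffles an in-memory list, and cuts it at an exact row count, so it does not reproduce how the SDK's `dataset.random_split` samples rows.

```python
import random


def random_split(rows, percentage, seed):
    """Shuffle rows with a fixed seed and split them at the given fraction."""
    if not 0.0 < percentage < 1.0:
        raise ValueError("percentage must be between 0.0 and 1.0, exclusive")
    rng = random.Random(seed)  # a fixed seed makes the split reproducible
    shuffled = list(rows)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * percentage)
    return shuffled[:cut], shuffled[cut:]


rows = list(range(10))  # hypothetical row identifiers
training_rows, validation_rows = random_split(rows, percentage=0.8, seed=1)
print(len(training_rows), len(validation_rows))
```

Because the seed is fixed, calling the helper again with the same arguments returns the same two lists, and every row lands in exactly one of them.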

Provide validation set size

In this case, only a single dataset is provided for the experiment. That is, the validation_data parameter is not specified, and the provided dataset is assigned to the training_data parameter. In your AutoMLConfig object, you can set the validation_size parameter to hold out a portion of the training data for validation. This means that AutoML splits the validation set from the initial training_data provided. This value should be between 0.0 and 1.0, exclusive (for example, 0.2 means 20% of the data is held out as validation data).

See the following code example:

data = "https://automlsamplenotebookdata.blob.core.windows.net/automl-sample-notebook-data/creditcard.csv"

dataset = Dataset.Tabular.from_delimited_files(data)

automl_config = AutoMLConfig(compute_target = aml_remote_compute,
                             task = 'classification',
                             primary_metric = 'AUC_weighted',
                             training_data = dataset,
                             validation_size = 0.2,
                             label_column_name = 'Class'
                            )
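The effect of validation_size on row counts can be sketched as follows. The `holdout_counts` helper is hypothetical, and rounding to the nearest whole row is an assumption; the article does not state how AutoML rounds.

```python
def holdout_counts(n_rows, validation_size):
    """Return (training_rows, validation_rows) for a given holdout fraction."""
    if not 0.0 < validation_size < 1.0:
        raise ValueError("validation_size must be between 0.0 and 1.0, exclusive")
    # Assumption: round to the nearest whole row.
    n_validation = round(n_rows * validation_size)
    return n_rows - n_validation, n_validation


print(holdout_counts(1000, 0.2))
```

With 1,000 rows and validation_size = 0.2, the sketch keeps 800 rows for training and holds out 200 for validation.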

Set the number of cross-validations

To perform cross-validation, include the n_cross_validations parameter and set it to a value. This parameter sets how many cross-validations to perform, based on the same number of folds.

In the following code, five folds for cross-validation are defined. This results in five different trainings: each training uses 4/5 of the data, and each validation uses the remaining 1/5, with a different holdout fold each time.

As a result, metrics are calculated as the average of the five validation metrics.

data = "https://automlsamplenotebookdata.blob.core.windows.net/automl-sample-notebook-data/creditcard.csv"

dataset = Dataset.Tabular.from_delimited_files(data)

automl_config = AutoMLConfig(compute_target = aml_remote_compute,
                             task = 'classification',
                             primary_metric = 'AUC_weighted',
                             training_data = dataset,
                             n_cross_validations = 5,
                             label_column_name = 'Class'
                            )
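The five-fold scheme described above can be sketched in plain Python. This is an illustration, not AutoML's implementation: `k_fold_indices` is a hypothetical helper, the interleaved assignment of rows to folds is an assumption, and the per-fold metric values are made up for the example.

```python
def k_fold_indices(n_rows, n_folds):
    """Yield (training_indices, validation_indices) once per fold."""
    if not 2 <= n_folds <= n_rows:
        raise ValueError("n_folds must be between 2 and n_rows")
    # Assumption: rows are assigned to folds in an interleaved pattern.
    folds = [list(range(start, n_rows, n_folds)) for start in range(n_folds)]
    for held_out in range(n_folds):
        validation = folds[held_out]
        training = sorted(
            index
            for fold_number, fold in enumerate(folds)
            if fold_number != held_out
            for index in fold
        )
        yield training, validation


splits = list(k_fold_indices(n_rows=10, n_folds=5))
for training, validation in splits:
    print(len(training), len(validation), validation)

# Hypothetical per-fold validation metrics; the reported value is their mean.
fold_metrics = [0.80, 0.82, 0.78, 0.81, 0.79]
average_metric = sum(fold_metrics) / len(fold_metrics)
```

Each of the five iterations trains on 8 of the 10 rows and validates on the other 2, and every row is held out exactly once.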

Specify custom cross-validation data folds

You can also provide your own cross-validation (CV) data folds. This is considered a more advanced scenario because you specify which columns to split and use for validation. Include custom CV split columns in your training data, and specify which columns by populating the column names in the cv_split_column_names parameter. Each column represents one cross-validation split and is filled with the integer values 1 or 0, where 1 indicates the row should be used for training and 0 indicates the row should be used for validation.

The following code snippet contains bank marketing data with two CV split columns, 'cv1' and 'cv2'.

data = "https://automlsamplenotebookdata.blob.core.windows.net/automl-sample-notebook-data/bankmarketing_with_cv.csv"

dataset = Dataset.Tabular.from_delimited_files(data)

automl_config = AutoMLConfig(compute_target = aml_remote_compute,
                             task = 'classification',
                             primary_metric = 'AUC_weighted',
                             training_data = dataset,
                             label_column_name = 'y',
                             cv_split_column_names = ['cv1', 'cv2']
                            )
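To show how 1/0 split columns translate into training and validation rows, the sketch below interprets them for a tiny in-memory table. The `splits_from_columns` helper and the four sample rows are hypothetical; only the column names 'cv1', 'cv2', and 'y' come from the example above.

```python
def splits_from_columns(rows, cv_split_column_names):
    """Return one (training_indices, validation_indices) pair per split column."""
    splits = []
    for name in cv_split_column_names:
        training, validation = [], []
        for index, row in enumerate(rows):
            flag = row[name]
            if flag == 1:
                training.append(index)    # 1: use the row for training
            elif flag == 0:
                validation.append(index)  # 0: use the row for validation
            else:
                raise ValueError(f"column {name!r} must contain only 1 or 0")
        splits.append((training, validation))
    return splits


# Hypothetical rows; each CV column marks every row as training (1) or validation (0).
rows = [
    {"y": "no", "cv1": 1, "cv2": 0},
    {"y": "yes", "cv1": 1, "cv2": 1},
    {"y": "no", "cv1": 0, "cv2": 1},
    {"y": "no", "cv1": 0, "cv2": 1},
]

splits = splits_from_columns(rows, ["cv1", "cv2"])
print(splits)
```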

Note

To use cv_split_column_names with training_data and label_column_name, upgrade your Azure Machine Learning Python SDK to version 1.6.0 or later. For previous SDK versions, refer to cv_splits_indices instead, but note that it is used with X and y dataset input only.

Next steps