Cross Validate Model

This article describes how to use the Cross Validate Model module in Azure Machine Learning designer (preview). Cross-validation is a technique often used in machine learning to assess both the variability of a dataset and the reliability of any model trained on that data.

The Cross Validate Model module takes as input a labeled dataset, together with an untrained classification or regression model. It divides the dataset into some number of subsets (folds), builds one model per fold (training on the other folds and testing on the held-out fold), and then returns a set of accuracy statistics for each fold. By comparing the accuracy statistics across all the folds, you can judge the quality of the dataset and see whether the model is susceptible to variations in the data.

Cross Validate Model also returns predicted results and probabilities for the dataset, so that you can assess the reliability of the predictions.

How cross-validation works

  1. Cross-validation randomly divides training data into folds.

    The algorithm defaults to 10 folds if you have not previously partitioned the dataset. To divide the dataset into a different number of folds, you can use the Partition and Sample module and indicate how many folds to use.

  2. The module sets aside the data in fold 1 to use for validation. (This is sometimes called the holdout fold.) The module uses the remaining folds to train a model. It then repeats the process with each of the other folds as the holdout fold.

    For example, if you create five folds, the module generates five models during cross-validation. The module trains each model by using four-fifths of the data. It tests each model on the remaining one-fifth.

  3. During testing of the model for each fold, the module evaluates multiple accuracy statistics. Which statistics the module uses depends on the type of model that you're evaluating. Different statistics are used to evaluate classification models versus regression models.

  4. When the building and evaluation process is complete for all folds, Cross Validate Model generates a set of performance metrics and scored results for all the data. Review these metrics to see whether any single fold has unusually high or low accuracy.
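
The steps above can be sketched in a few lines of plain Python. This is only an illustration of the general k-fold procedure, not the module's actual implementation: the function names `assign_folds` and `cross_validate`, the majority-class "model", and the sample labels are all assumptions made for the example.

```python
import random
from collections import Counter

def assign_folds(n_rows, n_folds=10, seed=0):
    """Randomly assign each row a zero-based fold number (hypothetical helper)."""
    rng = random.Random(seed)
    order = list(range(n_rows))
    rng.shuffle(order)
    folds = [0] * n_rows
    for position, row in enumerate(order):
        folds[row] = position % n_folds  # keeps fold sizes roughly equal
    return folds

def cross_validate(labels, n_folds=5, seed=0):
    """Train a majority-class stand-in model on n_folds - 1 folds, test on the held-out fold."""
    folds = assign_folds(len(labels), n_folds, seed)
    per_fold_accuracy = []
    for held_out in range(n_folds):
        train = [y for y, f in zip(labels, folds) if f != held_out]
        test = [y for y, f in zip(labels, folds) if f == held_out]
        majority = Counter(train).most_common(1)[0][0]  # the "trained model"
        correct = sum(1 for y in test if y == majority)
        per_fold_accuracy.append(correct / len(test))
    return folds, per_fold_accuracy

# Illustrative labels only: 70 positive rows and 30 negative rows.
labels = [1] * 70 + [0] * 30
folds, accuracies = cross_validate(labels, n_folds=5, seed=42)
for fold_number, accuracy in enumerate(accuracies):
    print(f"fold {fold_number}: accuracy = {accuracy:.2f}")
```

With five folds, each of the five models is trained on four-fifths of the rows and tested on the remaining one-fifth, so every row is scored exactly once.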

Advantages of cross-validation

A different and common way of evaluating a model is to divide the data into a training set and a test set by using Split Data, train the model on the training set, and then validate it on the test set. But cross-validation offers some advantages:

  • Cross-validation uses more test data.

    Cross-validation measures the performance of the model with the specified parameters in a bigger data space. That is, cross-validation uses the entire training dataset for both training and evaluation, instead of a portion. In contrast, if you validate a model by using data generated from a random split, typically you evaluate the model on only 30 percent or less of the available data.

    However, because cross-validation trains and validates the model multiple times over a larger dataset, it's much more computationally intensive. It takes much longer than validating on a random split.

  • Cross-validation evaluates both the dataset and the model.

    Cross-validation doesn't simply measure the accuracy of a model. It also gives you some idea of how representative the dataset is and how sensitive the model might be to variations in the data.
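
One way to read sensitivity to the data from cross-validation output is to look at how much a metric varies from fold to fold. The sketch below is an assumption-laden illustration, not something the module does for you: the helper `summarize_folds`, the 1.5 standard deviation threshold, and the per-fold accuracies are all made up for the example.

```python
from statistics import mean, pstdev

def summarize_folds(per_fold_scores, threshold=1.5):
    """Summarize per-fold scores and flag folds far from the mean.

    threshold is an illustrative choice: a fold is flagged when its score
    differs from the mean by more than threshold standard deviations.
    """
    average = mean(per_fold_scores)
    spread = pstdev(per_fold_scores)
    flagged = [
        fold
        for fold, score in enumerate(per_fold_scores)
        if spread > 0 and abs(score - average) > threshold * spread
    ]
    return {"mean": average, "std": spread, "flagged_folds": flagged}

# Made-up per-fold accuracies, for illustration only.
example_scores = [0.90, 0.90, 0.90, 0.90, 0.50]
summary = summarize_folds(example_scores)
print(summary)
```

A small spread suggests the model behaves similarly on every part of the data. A flagged fold is a hint that either the rows in that fold are unrepresentative or the model is sensitive to which rows it was trained on.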

How to use Cross Validate Model

Cross-validation can take a long time to run if your dataset is large. So, you might use Cross Validate Model in the initial phase of building and testing your model. In that phase, you can evaluate how good the model parameters are (assuming that computation time is tolerable). You can then train and evaluate your model by using the established parameters with the Train Model and Evaluate Model modules.

In this scenario, you both train and test the model by using Cross Validate Model.

  1. Add the Cross Validate Model module to your pipeline. You can find it in Azure Machine Learning designer, in the Model Scoring & Evaluation category.

  2. Connect the output of any classification or regression model.

    For example, if you're using Two Class Boosted Decision Tree for classification, configure the model with the parameters that you want. Then, drag a connector from the Untrained model port of the classifier to the matching port of Cross Validate Model.

    Tip

    You don't have to train the model, because Cross Validate Model automatically trains the model as part of evaluation.

  3. On the Dataset port of Cross Validate Model, connect any labeled training dataset.

  4. In the right panel of Cross Validate Model, click Edit column. Select the single column that contains the class label or the value to predict.

  5. Set a value for the Random seed parameter if you want to repeat the results of cross-validation across successive runs on the same data.

  6. Submit the pipeline.

  7. See the Results section for a description of the reports.
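
The effect of the Random seed parameter in step 5 can be illustrated outside the designer. The sketch below assumes a seeded shuffle decides which fold each row lands in. That is a common approach, but it is an assumption here rather than a description of the module's internals, and `shuffled_fold_assignment` is a hypothetical helper.

```python
import random

def shuffled_fold_assignment(n_rows, n_folds, seed):
    """Assign rows to zero-based folds using a seeded shuffle (hypothetical helper)."""
    rng = random.Random(seed)  # the fixed seed makes the shuffle repeatable
    order = list(range(n_rows))
    rng.shuffle(order)
    assignment = [0] * n_rows
    for position, row in enumerate(order):
        assignment[row] = position % n_folds
    return assignment

first = shuffled_fold_assignment(20, 5, seed=123)
second = shuffled_fold_assignment(20, 5, seed=123)
print(first == second)  # the same seed on the same data gives the same folds
```

Because the fold assignment is identical across runs, any difference in per-fold metrics between two runs comes from a change in the model or data rather than from a different split.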

Results

After all iterations are complete, Cross Validate Model creates scores for the entire dataset. It also creates performance metrics that you can use to assess the quality of the model.

Scored results

The first output of the module provides the source data for each row, together with some predicted values and related probabilities.

To view the results, in the pipeline, right-click the Cross Validate Model module. Select Visualize Scored results.

New column name       Description
Scored Labels         Added at the end of the dataset. Contains the predicted value for each row.
Scored Probabilities  Added at the end of the dataset. Indicates the estimated probability of the value in Scored Labels.
Fold Number           The zero-based index of the fold that each row of data was assigned to during cross-validation.

Evaluation results

The second report is grouped by folds. Remember that during execution, Cross Validate Model randomly splits the training data into n folds (by default, 10). In each iteration over the dataset, Cross Validate Model uses one fold as a validation dataset. It uses the remaining n-1 folds to train a model. Each of the n models is tested against the data in its own held-out fold.

In this report, the folds are listed by index value, in ascending order. To sort by any other column, you can save the results as a dataset.

To view the results, in the pipeline, right-click the Cross Validate Model module. Select Visualize Evaluation results by fold.

Column name                 Description
Fold number                 An identifier for each fold. If you created five folds, there would be five subsets of data, numbered 0 to 4.
Number of examples in fold  The number of rows assigned to each fold. They should be roughly equal.

The module also includes the following metrics for each fold, depending on the type of model that you're evaluating:

  • Classification models: Precision, recall, F-score, AUC, accuracy

  • Regression models: Mean absolute error, root mean squared error, relative absolute error, relative squared error, and coefficient of determination
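
For reference, most of the metrics listed above can be computed from a fold's actual and predicted values as in the following sketch. It uses the common textbook definitions, which are assumed (not confirmed by this article) to match what the module reports. AUC is omitted because it needs scored probabilities rather than labels. The function names and sample values are hypothetical.

```python
import math

def classification_metrics(actual, predicted, positive=1):
    """Precision, recall, F-score (F1), and accuracy for binary labels."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == positive and p == positive)
    fp = sum(1 for a, p in pairs if a != positive and p == positive)
    fn = sum(1 for a, p in pairs if a == positive and p != positive)
    tn = sum(1 for a, p in pairs if a != positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f_score = 2 * precision * recall / total if total else 0.0
    accuracy = (tp + tn) / len(pairs)
    return {"precision": precision, "recall": recall,
            "f_score": f_score, "accuracy": accuracy}

def regression_metrics(actual, predicted):
    """MAE, RMSE, relative absolute/squared error, and R^2.

    Assumes the actual values are not all identical, so the
    deviations from the mean are non-zero.
    """
    n = len(actual)
    mean_actual = sum(actual) / n
    abs_err = sum(abs(a - p) for a, p in zip(actual, predicted))
    sq_err = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    abs_dev = sum(abs(a - mean_actual) for a in actual)
    sq_dev = sum((a - mean_actual) ** 2 for a in actual)
    return {
        "mean_absolute_error": abs_err / n,
        "root_mean_squared_error": math.sqrt(sq_err / n),
        "relative_absolute_error": abs_err / abs_dev,
        "relative_squared_error": sq_err / sq_dev,
        "coefficient_of_determination": 1 - sq_err / sq_dev,
    }

# Illustrative values only.
clf = classification_metrics([1, 1, 0, 0], [1, 0, 0, 0])
reg = regression_metrics([1, 2, 3, 4], [1, 2, 3, 6])
print(clf)
print(reg)
```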

Technical notes

  • It's a best practice to normalize datasets before you use them for cross-validation.

  • Cross Validate Model is much more computationally intensive and takes longer to complete than validating the model on a randomly divided dataset. The reason is that Cross Validate Model trains and validates the model multiple times.

  • There's no need to split the dataset into training and testing sets when you use cross-validation to measure the accuracy of the model.

Next steps

See the set of modules available to Azure Machine Learning.