Choose parameters to optimize your algorithms in Azure Machine Learning Studio (classic)

APPLIES TO: Machine Learning Studio (classic). Does not apply to Azure Machine Learning.

This topic describes how to choose the right hyperparameter set for an algorithm in Azure Machine Learning Studio (classic). Most machine learning algorithms have parameters that must be set. When you train a model, you need to provide values for those parameters. The efficacy of the trained model depends on the parameter values that you choose. The process of finding the optimal set of parameters is known as model selection.

There are various ways to do model selection. In machine learning, cross-validation is one of the most widely used methods for model selection, and it is the default model selection mechanism in Azure Machine Learning Studio (classic). Because Azure Machine Learning Studio (classic) supports both R and Python, you can always implement your own model selection mechanism in either language.

There are four steps in the process of finding the best parameter set:

  1. Define the parameter space: For the algorithm, first decide the exact parameter values you want to consider.
  2. Define the cross-validation settings: Decide how to choose cross-validation folds for the dataset.
  3. Define the metric: Decide what metric to use for determining the best set of parameters, such as accuracy, root mean squared error, precision, recall, or F-score.
  4. Train, evaluate, and compare: For each unique combination of the parameter values, cross-validation is carried out and scored with the metric you define. After evaluation and comparison, you can choose the best-performing model.
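The four steps above can be sketched in plain Python. This is a minimal illustration, not how Studio (classic) implements the sweep: the toy k-nearest-neighbors learner, the synthetic data, and the names `knn_predict`, `cross_val_accuracy`, and `grid_search` are all assumptions made for the example.

```python
import itertools
import random

def knn_predict(train, x, k):
    """Majority vote over the k nearest training points (1-D feature, 0/1 label)."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    votes = sum(label for _, label in nearest)
    return 1 if votes * 2 > len(nearest) else 0

def cross_val_accuracy(data, folds, k):
    """Step 4: mean accuracy over folds; folds[i] is the fold number of data[i]."""
    scores = []
    for f in sorted(set(folds)):
        train = [d for d, g in zip(data, folds) if g != f]
        test = [d for d, g in zip(data, folds) if g == f]
        correct = sum(knn_predict(train, x, k) == y for x, y in test)
        scores.append(correct / len(test))
    return sum(scores) / len(scores)

def grid_search(data, folds, param_space):
    """Score every combination in param_space and return the best one."""
    names = list(param_space)
    results = []
    for combo in itertools.product(*(param_space[n] for n in names)):
        params = dict(zip(names, combo))
        results.append((cross_val_accuracy(data, folds, **params), params))
    best_score, best_params = max(results, key=lambda r: r[0])
    return best_score, best_params, results

# Hypothetical data: label is 1 when the feature exceeds 0.5.
rng = random.Random(0)
data = [(x, 1 if x > 0.5 else 0) for x in (rng.random() for _ in range(60))]

param_space = {"k": [1, 3, 5, 7]}        # step 1: parameter space
folds = [i % 5 for i in range(len(data))]  # step 2: five folds
# Step 3: the metric is accuracy, computed inside cross_val_accuracy.
best_score, best_params, results = grid_search(data, folds, param_space)
print(best_params, round(best_score, 3))
```

Each entry in `results` pairs a score with the parameter combination that produced it, so the whole sweep can be compared rather than only the winner.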

The following image illustrates how this can be achieved in Azure Machine Learning Studio (classic).

[Image: Finding the best parameter set]

Define the parameter space

You can define the parameter set at the model initialization step. The parameter pane of every machine learning algorithm has two trainer modes: Single Parameter and Parameter Range. Choose Parameter Range mode. In this mode, you can enter multiple values for each parameter as comma-separated values in the text box.

[Image: Two-class boosted decision tree, Single Parameter mode]

Alternatively, with Use Range Builder you can define the maximum and minimum points of the grid and the total number of points to be generated. By default, the parameter values are generated on a linear scale. If Log Scale is checked, the values are generated on a log scale instead (that is, the ratio of adjacent points is constant rather than their difference). For integer parameters, you can define a range by using a hyphen. For example, “1-10” means that all integers between 1 and 10 (both inclusive) form the parameter set. A mixed mode is also supported. For example, the parameter set “1-10, 20, 50” would include the integers 1 through 10, 20, and 50.
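As a rough illustration of the two ways of specifying values, the following Python sketch expands a mixed specification and generates linear or log-scale grids. The function names `parse_values` and `build_range` are hypothetical, and the sketch assumes that both endpoints are included in the generated grid and that range values are non-negative integers; it does not reproduce Studio's own parser.

```python
def parse_values(spec):
    """Expand a spec such as "1-10, 20, 50" into a sorted list of integers.

    Assumption: values are non-negative, so a hyphen always marks a range.
    """
    values = set()
    for token in spec.split(","):
        token = token.strip()
        if not token:
            continue
        if "-" in token:
            low, high = (int(part) for part in token.split("-"))
            values.update(range(low, high + 1))  # both ends inclusive
        else:
            values.add(int(token))
    return sorted(values)

def build_range(minimum, maximum, count, log_scale=False):
    """Generate `count` points from minimum to maximum, endpoints included."""
    if count < 2:
        raise ValueError("count must be at least 2")
    if log_scale:
        # Constant ratio between adjacent points (minimum must be positive).
        ratio = (maximum / minimum) ** (1 / (count - 1))
        return [minimum * ratio ** i for i in range(count)]
    # Constant difference between adjacent points.
    step = (maximum - minimum) / (count - 1)
    return [minimum + step * i for i in range(count)]

print(parse_values("1-10, 20, 50"))
print(build_range(0, 1, 5))
print(build_range(1, 1000, 4, log_scale=True))
```

On the log scale, four points between 1 and 1000 come out near 1, 10, 100, and 1000, which is why this scale suits parameters such as learning rates that span several orders of magnitude.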

[Image: Two-class boosted decision tree, Parameter Range mode]

Define cross-validation folds

The Partition and Sample module can be used to randomly assign folds to the data. In the following sample configuration for the module, we define five folds and randomly assign a fold number to each sample instance.
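Conceptually, random fold assignment amounts to giving every row a fold number. The Python sketch below shows one way to do that with evenly sized folds; the function name `assign_folds` and the balancing strategy are assumptions for illustration, not a description of what the Partition and Sample module does internally.

```python
import random

def assign_folds(n_rows, n_folds=5, seed=None):
    """Return a list of fold numbers (0 .. n_folds-1), one per row.

    Fold sizes differ by at most one row; the order is shuffled so that
    membership is random.
    """
    rng = random.Random(seed)
    folds = [i % n_folds for i in range(n_rows)]
    rng.shuffle(folds)
    return folds

fold_numbers = assign_folds(23, n_folds=5, seed=42)
print(fold_numbers)
```

Passing a fixed `seed` makes the assignment reproducible, which helps when comparing sweeps run at different times.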

[Image: Partition and Sample]

Define the metric

The Tune Model Hyperparameters module provides support for empirically choosing the best set of parameters for a given algorithm and dataset. In addition to other information about training the model, the Properties pane of this module includes the metric for determining the best parameter set. It has two separate drop-down list boxes, one for classification algorithms and one for regression algorithms. If the algorithm under consideration is a classification algorithm, the regression metric is ignored, and vice versa. In this specific example, the metric is Accuracy.
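For reference, the classification and regression metrics named in this topic can be computed as follows. This is a plain-Python sketch using the standard definitions; the function names are hypothetical, and returning 0.0 when a denominator is zero is a convention chosen for the example.

```python
import math

def classification_metrics(actual, predicted, positive=1):
    """Accuracy, precision, recall, and F-score (F1) for binary labels."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == positive and p == positive for a, p in pairs)
    fp = sum(a != positive and p == positive for a, p in pairs)
    fn = sum(a == positive and p != positive for a, p in pairs)
    accuracy = sum(a == p for a, p in pairs) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f_score = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f_score": f_score}

def rmse(actual, predicted):
    """Root mean squared error for regression outputs."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))

metrics = classification_metrics([1, 1, 1, 0, 0, 0, 1, 0],
                                 [1, 0, 1, 0, 1, 0, 1, 0])
print(metrics)
print(rmse([0, 0], [1, 1]))
```

Accuracy, precision, recall, and F-score are better when higher, whereas RMSE is better when lower, so the comparison direction depends on the metric selected.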

[Image: Sweep parameters]

Train, evaluate, and compare

The same Tune Model Hyperparameters module trains all the models that correspond to the parameter set, evaluates various metrics, and then outputs the best trained model based on the metric you choose. This module has two mandatory inputs:

  • The untrained learner
  • The dataset

The module also has an optional dataset input. Connect the dataset with fold information to the mandatory dataset input. If the dataset is not assigned any fold information, then a 10-fold cross-validation is automatically executed by default. If the fold assignment is not done and a validation dataset is provided at the optional dataset port, then a train-test mode is chosen and the first dataset is used to train the model for each parameter combination.

[Image: Boosted decision tree classifier]

The model is then evaluated on the validation dataset. The left output port of the module shows different metrics as functions of parameter values. The right output port gives the trained model that performs best according to the chosen metric (Accuracy in this case).
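The way the evaluation mode depends on the inputs can be summarized as a small decision function. This Python sketch only restates the rules described above; the function name and return strings are hypothetical, and the case where both fold information and a validation dataset are supplied is not covered by this topic, so giving the folds precedence is an assumption.

```python
def choose_evaluation_mode(has_fold_assignment, has_validation_dataset):
    """Return the evaluation mode implied by the connected inputs."""
    if has_fold_assignment:
        # Assumption: assigned folds take precedence over a validation set.
        return "cross-validation on assigned folds"
    if has_validation_dataset:
        # Train on the first dataset, evaluate on the validation dataset.
        return "train-test"
    # No folds and no validation dataset: default behavior.
    return "10-fold cross-validation"

for folds, validation in [(True, False), (False, True), (False, False)]:
    print(folds, validation, "->", choose_evaluation_mode(folds, validation))
```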

[Image: Validation dataset]

You can see the exact parameters chosen by visualizing the right output port. After you save this model as a trained model, you can use it to score a test set or deploy it in an operationalized web service.