Tune hyperparameters for your model with Azure Machine Learning

Automate efficient hyperparameter tuning by using the Azure Machine Learning HyperDrive package. Learn how to complete the steps required to tune hyperparameters with the Azure Machine Learning SDK:

  1. Define the parameter search space
  2. Specify a primary metric to optimize
  3. Specify an early termination policy for low-performing runs
  4. Allocate resources
  5. Launch an experiment with the defined configuration
  6. Visualize the training runs
  7. Select the best configuration for your model

What are hyperparameters?

Hyperparameters are adjustable parameters that let you control the model training process. For example, with neural networks, you decide the number of hidden layers and the number of nodes in each layer. Model performance depends heavily on hyperparameters.

Hyperparameter tuning is the process of finding the configuration of hyperparameters that results in the best performance. The process is typically computationally expensive and manual.

Azure Machine Learning lets you automate hyperparameter tuning and run experiments in parallel to efficiently optimize hyperparameters.

Define the search space

Tune hyperparameters by exploring the range of values defined for each hyperparameter.

Hyperparameters can be discrete or continuous, and have a distribution of values described by a parameter expression.

Discrete hyperparameters

Discrete hyperparameters are specified as a choice among discrete values. choice can be:

  • one or more comma-separated values
  • a range object
  • any arbitrary list object
    {
        "batch_size": choice(16, 32, 64, 128),
        "number_of_hidden_layers": choice(range(1, 5))
    }

In this case, batch_size takes one of the values [16, 32, 64, 128] and number_of_hidden_layers takes one of the values [1, 2, 3, 4].

The following advanced discrete hyperparameters can also be specified using a distribution:

  • quniform(low, high, q) - Returns a value like round(uniform(low, high) / q) * q
  • qloguniform(low, high, q) - Returns a value like round(exp(uniform(low, high)) / q) * q
  • qnormal(mu, sigma, q) - Returns a value like round(normal(mu, sigma) / q) * q
  • qlognormal(mu, sigma, q) - Returns a value like round(exp(normal(mu, sigma)) / q) * q
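
To build intuition for how the q-distributions snap draws to discrete steps, here is a minimal plain-Python sketch of quniform (an illustration following the formula above, not the HyperDrive implementation):

```python
import random

def quniform(low, high, q):
    """Sketch of quniform: draw uniformly in [low, high], then round to a multiple of q."""
    return round(random.uniform(low, high) / q) * q

# Every sampled value is a multiple of q; the q* variants make otherwise
# continuous distributions usable for discrete hyperparameters.
samples = [quniform(1, 100, 10) for _ in range(1000)]
```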

Continuous hyperparameters

Continuous hyperparameters are specified as a distribution over a continuous range of values:

  • uniform(low, high) - Returns a value uniformly distributed between low and high
  • loguniform(low, high) - Returns a value drawn according to exp(uniform(low, high)) so that the logarithm of the return value is uniformly distributed
  • normal(mu, sigma) - Returns a real value that's normally distributed with mean mu and standard deviation sigma
  • lognormal(mu, sigma) - Returns a value drawn according to exp(normal(mu, sigma)) so that the logarithm of the return value is normally distributed
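
Note that for loguniform the bounds apply to the logarithm of the value, not the value itself. A plain-Python sketch of that definition (an illustration, not the HyperDrive internals):

```python
import math
import random

def loguniform(low, high):
    """Sketch of loguniform: low and high bound the *logarithm* of the return value."""
    return math.exp(random.uniform(low, high))

# With low=-3 and high=0, values fall in [exp(-3), exp(0)] ~= [0.05, 1.0],
# and their logarithms are uniformly distributed over [-3, 0].
samples = [loguniform(-3, 0) for _ in range(1000)]
```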

An example of a parameter space definition:

    {    
        "learning_rate": normal(10, 3),
        "keep_probability": uniform(0.05, 0.1)
    }

This code defines a search space with two parameters, learning_rate and keep_probability. learning_rate has a normal distribution with a mean value of 10 and a standard deviation of 3. keep_probability has a uniform distribution with a minimum value of 0.05 and a maximum value of 0.1.

Sampling the hyperparameter space

Specify the parameter sampling method to use over the hyperparameter space. Azure Machine Learning supports the following methods:

  • Random sampling
  • Grid sampling
  • Bayesian sampling

Random sampling

Random sampling supports discrete and continuous hyperparameters. It supports early termination of low-performance runs. Some users do an initial search with random sampling and then refine the search space to improve results.

In random sampling, hyperparameter values are randomly selected from the defined search space.

from azureml.train.hyperdrive import RandomParameterSampling
from azureml.train.hyperdrive import normal, uniform, choice
param_sampling = RandomParameterSampling( {
        "learning_rate": normal(10, 3),
        "keep_probability": uniform(0.05, 0.1),
        "batch_size": choice(16, 32, 64, 128)
    }
)

Grid sampling

Grid sampling supports discrete hyperparameters. Use grid sampling if you can budget to exhaustively search over the search space. It supports early termination of low-performance runs.

Grid sampling performs a simple grid search over all possible values. It can only be used with choice hyperparameters. For example, the following space has six samples:

from azureml.train.hyperdrive import GridParameterSampling
from azureml.train.hyperdrive import choice
param_sampling = GridParameterSampling( {
        "num_hidden_layers": choice(1, 2, 3),
        "batch_size": choice(16, 32)
    }
)
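
Grid sampling enumerates the Cartesian product of the choice values, which is why the space above yields six samples. A quick plain-Python sketch (not the HyperDrive internals):

```python
from itertools import product

num_hidden_layers = [1, 2, 3]
batch_size = [16, 32]

# Grid sampling tries every combination of the choice values: 3 * 2 = 6.
grid = [
    {"num_hidden_layers": n, "batch_size": b}
    for n, b in product(num_hidden_layers, batch_size)
]
print(len(grid))  # 6
```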

Bayesian sampling

Bayesian sampling is based on the Bayesian optimization algorithm. It picks samples based on how previous samples performed, so that new samples improve the primary metric.

Bayesian sampling is recommended if you have enough budget to explore the hyperparameter space. For best results, we recommend a maximum number of runs greater than or equal to 20 times the number of hyperparameters being tuned.

The number of concurrent runs has an impact on the effectiveness of the tuning process. A smaller number of concurrent runs may lead to better sampling convergence, since the smaller degree of parallelism increases the number of runs that benefit from previously completed runs.

Bayesian sampling only supports choice, uniform, and quniform distributions over the search space.

from azureml.train.hyperdrive import BayesianParameterSampling
from azureml.train.hyperdrive import uniform, choice
param_sampling = BayesianParameterSampling( {
        "learning_rate": uniform(0.05, 0.1),
        "batch_size": choice(16, 32, 64, 128)
    }
)

Specify primary metric

Specify the primary metric you want hyperparameter tuning to optimize. Each training run is evaluated for the primary metric. The early termination policy uses the primary metric to identify low-performance runs.

Specify the following attributes for your primary metric:

  • primary_metric_name: The name of the primary metric needs to exactly match the name of the metric logged by the training script
  • primary_metric_goal: It can be either PrimaryMetricGoal.MAXIMIZE or PrimaryMetricGoal.MINIMIZE and determines whether the primary metric will be maximized or minimized when evaluating the runs.
primary_metric_name="accuracy",
primary_metric_goal=PrimaryMetricGoal.MAXIMIZE

This sample maximizes "accuracy".

Log metrics for hyperparameter tuning

The training script for your model must log the primary metric during model training so that HyperDrive can access it for hyperparameter tuning.

Log the primary metric in your training script with the following sample snippet:

from azureml.core.run import Run
run_logger = Run.get_context()
run_logger.log("accuracy", float(val_accuracy))

The training script calculates val_accuracy and logs it as the primary metric "accuracy". Each time the metric is logged, it's received by the hyperparameter tuning service. It's up to you to determine the frequency of reporting.

For more information on logging values in model training runs, see Enable logging in Azure ML training runs.

Specify early termination policy

Automatically terminate poorly performing runs with an early termination policy. Early termination improves computational efficiency.

You can configure the following parameters that control when a policy is applied:

  • evaluation_interval: the frequency of applying the policy. Each time the training script logs the primary metric counts as one interval. An evaluation_interval of 1 will apply the policy every time the training script reports the primary metric. An evaluation_interval of 2 will apply the policy every other time. If not specified, evaluation_interval is set to 1 by default.
  • delay_evaluation: delays the first policy evaluation for a specified number of intervals. This is an optional parameter that avoids premature termination of training runs by allowing all configurations to run for a minimum number of intervals. If specified, the policy applies every multiple of evaluation_interval that is greater than or equal to delay_evaluation.

Azure Machine Learning supports the following early termination policies:

Bandit policy

Bandit policy is based on slack factor/slack amount and evaluation interval. Bandit terminates runs where the primary metric is not within the specified slack factor/slack amount compared to the best performing run.

Note

Bayesian sampling does not support early termination. When using Bayesian sampling, set early_termination_policy = None.

Specify the following configuration parameters:

  • slack_factor or slack_amount: the slack allowed with respect to the best performing training run. slack_factor specifies the allowable slack as a ratio. slack_amount specifies the allowable slack as an absolute amount, instead of a ratio.

    For example, consider a Bandit policy applied at interval 10. Assume that the best performing run at interval 10 reported a primary metric of 0.8, with a goal to maximize the primary metric. If the policy specifies a slack_factor of 0.2, any training runs whose best metric at interval 10 is less than 0.66 (0.8/(1+slack_factor)) will be terminated.

  • evaluation_interval: (optional) the frequency for applying the policy

  • delay_evaluation: (optional) delays the first policy evaluation for a specified number of intervals

from azureml.train.hyperdrive import BanditPolicy
early_termination_policy = BanditPolicy(slack_factor = 0.1, evaluation_interval=1, delay_evaluation=5)

In this example, the early termination policy is applied at every interval when metrics are reported, starting at evaluation interval 5. Any run whose best metric is less than 1/(1+0.1), or approximately 91%, of the best performing run's will be terminated.
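
The termination threshold follows directly from the slack arithmetic described above. A minimal sketch for the maximization case with slack_factor (an illustration, not the HyperDrive implementation):

```python
def bandit_threshold(best_metric, slack_factor):
    """A run is terminated when its best metric falls below this value
    (maximization goal, slack_factor variant)."""
    return best_metric / (1 + slack_factor)

# With slack_factor=0.1 and a best metric of 0.8, runs below ~91% of the
# best run's metric are terminated.
print(round(bandit_threshold(0.8, 0.1), 4))  # 0.7273
```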

Median stopping policy

Median stopping is an early termination policy based on running averages of primary metrics reported by the runs. This policy computes running averages across all training runs and terminates runs with primary metric values worse than the median of the averages.

This policy takes the following configuration parameters:

  • evaluation_interval: the frequency for applying the policy (optional parameter).
  • delay_evaluation: delays the first policy evaluation for a specified number of intervals (optional parameter).
from azureml.train.hyperdrive import MedianStoppingPolicy
early_termination_policy = MedianStoppingPolicy(evaluation_interval=1, delay_evaluation=5)

In this example, the early termination policy is applied at every interval starting at evaluation interval 5. A run is terminated at interval 5 if its best primary metric is worse than the median of the running averages over intervals 1:5 across all training runs.
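
The stopping decision can be sketched in a few lines of plain Python (an illustration of the rule for a maximized metric, not the HyperDrive implementation):

```python
from statistics import median

def should_stop(candidate_best_metric, all_running_avgs):
    """Sketch of median stopping (maximization goal): terminate a run whose
    best primary metric is worse than the median of the running averages
    across all training runs."""
    return candidate_best_metric < median(all_running_avgs)

# Hypothetical running averages of the primary metric for five runs:
running_avgs = [0.61, 0.72, 0.78, 0.80, 0.84]
print(should_stop(0.61, running_avgs))  # True: 0.61 is below the median, 0.78
```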

Truncation selection policy

Truncation selection cancels a percentage of the lowest performing runs at each evaluation interval. Runs are compared using the primary metric.

This policy takes the following configuration parameters:

  • truncation_percentage: the percentage of lowest performing runs to terminate at each evaluation interval. An integer value between 1 and 99.
  • evaluation_interval: (optional) the frequency for applying the policy
  • delay_evaluation: (optional) delays the first policy evaluation for a specified number of intervals
from azureml.train.hyperdrive import TruncationSelectionPolicy
early_termination_policy = TruncationSelectionPolicy(evaluation_interval=1, truncation_percentage=20, delay_evaluation=5)

In this example, the early termination policy is applied at every interval starting at evaluation interval 5. A run terminates at interval 5 if its performance at interval 5 is in the lowest 20% of performance of all runs at interval 5.
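
The selection step amounts to ranking runs by the primary metric and cancelling the bottom slice. A hedged sketch for a maximized metric (an illustration, not the HyperDrive internals):

```python
def runs_to_terminate(metrics_by_run, truncation_percentage):
    """Sketch of truncation selection (maximization goal): cancel the lowest
    truncation_percentage of runs at an evaluation interval."""
    n_cancel = len(metrics_by_run) * truncation_percentage // 100
    ranked = sorted(metrics_by_run, key=metrics_by_run.get)  # worst first
    return ranked[:n_cancel]

# Hypothetical primary-metric values at interval 5:
metrics = {"run_1": 0.52, "run_2": 0.71, "run_3": 0.64, "run_4": 0.78, "run_5": 0.69}
print(runs_to_terminate(metrics, 20))  # ['run_1'] - the lowest 20% of 5 runs
```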

No termination policy (default)

If no policy is specified, the hyperparameter tuning service will let all training runs execute to completion.

policy=None

Picking an early termination policy

  • For a conservative policy that provides savings without terminating promising jobs, consider a Median Stopping Policy with evaluation_interval 1 and delay_evaluation 5. These are conservative settings that can provide approximately 25%-35% savings with no loss on the primary metric (based on our evaluation data).
  • For more aggressive savings, use a Bandit Policy with a smaller allowable slack or a Truncation Selection Policy with a larger truncation percentage.

Allocate resources

Control your resource budget by specifying the maximum number of training runs.

  • max_total_runs: Maximum number of training runs. Must be an integer between 1 and 1000.
  • max_duration_minutes: (optional) Maximum duration, in minutes, of the hyperparameter tuning experiment. Runs after this duration are canceled.

Note

If both max_total_runs and max_duration_minutes are specified, the hyperparameter tuning experiment terminates when the first of these two thresholds is reached.

Additionally, specify the maximum number of training runs to run concurrently during your hyperparameter tuning search.

  • max_concurrent_runs: (optional) Maximum number of runs that can run concurrently. If not specified, all runs launch in parallel. If specified, must be an integer between 1 and 100.

Note

The number of concurrent runs is gated on the resources available in the specified compute target. Ensure that the compute target has the available resources for the desired concurrency.

max_total_runs=20,
max_concurrent_runs=4

This code configures the hyperparameter tuning experiment to use a maximum of 20 total runs, running four configurations at a time.

Configure experiment

To configure your hyperparameter tuning experiment, provide the following:

  • The defined hyperparameter search space
  • Your early termination policy
  • The primary metric
  • Resource allocation settings
  • ScriptRunConfig src

The ScriptRunConfig is the training script that will run with the sampled hyperparameters. It defines the resources per job (single-node or multi-node) and the compute target to use.

Note

The compute target specified in src must have enough resources to satisfy your concurrency level. For more information on ScriptRunConfig, see Configure training runs.

Configure your hyperparameter tuning experiment:

from azureml.train.hyperdrive import HyperDriveConfig
hd_config = HyperDriveConfig(run_config=src,
                             hyperparameter_sampling=param_sampling,
                             policy=early_termination_policy,
                             primary_metric_name="accuracy",
                             primary_metric_goal=PrimaryMetricGoal.MAXIMIZE,
                             max_total_runs=100,
                             max_concurrent_runs=4)

Submit experiment

After you define your hyperparameter tuning configuration, submit the experiment:

from azureml.core.experiment import Experiment
experiment = Experiment(workspace, experiment_name)
hyperdrive_run = experiment.submit(hd_config)

Warm start your hyperparameter tuning experiment (optional)

Finding the best hyperparameter values for your model can be an iterative process. You can reuse knowledge from the five previous runs to accelerate hyperparameter tuning.

Warm starting is handled differently depending on the sampling method:

  • Bayesian sampling: Trials from the previous run are used as prior knowledge to pick new samples, and to improve the primary metric.
  • Random sampling or grid sampling: Early termination uses knowledge from previous runs to determine poorly performing runs.

Specify the list of parent runs you want to warm start from.

from azureml.train.hyperdrive import HyperDriveRun

warmstart_parent_1 = HyperDriveRun(experiment, "warmstart_parent_run_ID_1")
warmstart_parent_2 = HyperDriveRun(experiment, "warmstart_parent_run_ID_2")
warmstart_parents_to_resume_from = [warmstart_parent_1, warmstart_parent_2]

If a hyperparameter tuning experiment is canceled, you can resume training runs from the last checkpoint. However, your training script must handle checkpoint logic.

The training run must use the same hyperparameter configuration and mount the outputs folders. The training script must accept the resume-from argument, which contains the checkpoint or model files from which to resume the training run. You can resume individual training runs using the following snippet:

from azureml.core.run import Run

resume_child_run_1 = Run(experiment, "resume_child_run_ID_1")
resume_child_run_2 = Run(experiment, "resume_child_run_ID_2")
child_runs_to_resume = [resume_child_run_1, resume_child_run_2]
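
On the training-script side, the resume-from argument might be parsed along these lines. This is a hedged sketch: the checkpoint file name and the extra --learning-rate argument are hypothetical, and the restore logic depends on your framework.

```python
import argparse
import os

def build_parser():
    parser = argparse.ArgumentParser()
    # Hypothetical arguments: --resume-from receives the folder containing the
    # checkpoint or model files from which training should resume.
    parser.add_argument("--resume-from", type=str, default=None)
    parser.add_argument("--learning-rate", type=float, default=0.01)
    return parser

args = build_parser().parse_args([])  # in a real script: parse_args()
if args.resume_from:
    # Hypothetical checkpoint file name inside the resume-from folder.
    checkpoint_path = os.path.join(args.resume_from, "checkpoint.pt")
    # ... restore model/optimizer state from checkpoint_path before training
```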

You can configure your hyperparameter tuning experiment to warm start from a previous experiment, or resume individual training runs, using the optional parameters resume_from and resume_child_runs in the config:

from azureml.train.hyperdrive import HyperDriveConfig

hd_config = HyperDriveConfig(run_config=src,
                             hyperparameter_sampling=param_sampling,
                             policy=early_termination_policy,
                             resume_from=warmstart_parents_to_resume_from,
                             resume_child_runs=child_runs_to_resume,
                             primary_metric_name="accuracy",
                             primary_metric_goal=PrimaryMetricGoal.MAXIMIZE,
                             max_total_runs=100,
                             max_concurrent_runs=4)

Visualize experiment

Use the Notebook widget to visualize the progress of your training runs. The following snippet visualizes all your hyperparameter tuning runs in one place in a Jupyter notebook:

from azureml.widgets import RunDetails
RunDetails(hyperdrive_run).show()

This code displays a table with details about the training runs for each of the hyperparameter configurations.

Hyperparameter tuning table

You can also visualize the performance of each of the runs as training progresses.

Hyperparameter tuning plot

You can visually identify the correlation between performance and values of individual hyperparameters by using a Parallel Coordinates Plot.

Hyperparameter tuning parallel coordinates plot

You can also visualize all of your hyperparameter tuning runs in the Azure web portal. For more information on how to view an experiment in the portal, see how to track experiments.

Find the best model

Once all of the hyperparameter tuning runs have completed, identify the best performing configuration and hyperparameter values:

best_run = hyperdrive_run.get_best_run_by_primary_metric()
best_run_metrics = best_run.get_metrics()
parameter_values = best_run.get_details()['runDefinition']['Arguments']

print('Best Run Id: ', best_run.id)
print('\n Accuracy:', best_run_metrics['accuracy'])
print('\n learning rate:',parameter_values[3])
print('\n keep probability:',parameter_values[5])
print('\n batch size:',parameter_values[7])
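
The Arguments list alternates flag and value, which is why the snippet above indexes positions 3, 5, and 7. A small helper makes the lookup independent of position (the argument list below is a hypothetical example of that shape, not output from a real run):

```python
def arg_value(arguments, flag):
    """Return the value that follows a command-line flag in a flat argument list."""
    return arguments[arguments.index(flag) + 1]

# Hypothetical shape of best_run.get_details()['runDefinition']['Arguments']:
arguments = ["--data-folder", "data", "--learning-rate", "0.02",
             "--keep-probability", "0.08", "--batch-size", "64"]
print(arg_value(arguments, "--learning-rate"))  # 0.02
```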

Sample notebook

Refer to the train-hyperparameter-* notebooks in this folder:

Learn how to run notebooks by following the article Use Jupyter notebooks to explore this service.

Next steps