Tune hyperparameters for your model with Azure Machine Learning

APPLIES TO: Basic edition, Enterprise edition (Upgrade to Enterprise edition)

Efficiently tune hyperparameters for your model using Azure Machine Learning. Hyperparameter tuning includes the following steps:

  • Define the parameter search space
  • Specify a primary metric to optimize
  • Specify early termination criteria for poorly performing runs
  • Allocate resources for hyperparameter tuning
  • Launch an experiment with the above configuration
  • Visualize the training runs
  • Select the best performing configuration for your model

What are hyperparameters?

Hyperparameters are adjustable parameters you choose for training a model, and they govern the training process itself. For example, to train a deep neural network, you decide the number of hidden layers in the network and the number of nodes in each layer prior to training the model. These values usually stay constant during the training process.

In deep learning and machine learning scenarios, model performance depends heavily on the hyperparameter values selected. The goal of hyperparameter exploration is to search across various hyperparameter configurations to find a configuration that results in the best performance. Typically, the hyperparameter exploration process is painstakingly manual, given that the search space is vast and evaluation of each configuration can be expensive.

Azure Machine Learning allows you to automate hyperparameter exploration in an efficient manner, saving you significant time and resources. You specify the range of hyperparameter values and a maximum number of training runs. The system then automatically launches multiple simultaneous runs with different parameter configurations and finds the configuration that results in the best performance, measured by the metric you choose. Poorly performing training runs are automatically terminated early, reducing waste of compute resources. These resources are instead used to explore other hyperparameter configurations.

Define search space

Automatically tune hyperparameters by exploring the range of values defined for each hyperparameter.

Types of hyperparameters

Each hyperparameter can be either discrete or continuous, and has a distribution of values described by a parameter expression.

Discrete hyperparameters

Discrete hyperparameters are specified as a choice among discrete values. choice can be:

  • one or more comma-separated values
  • a range object
  • any arbitrary list object
    {
        "batch_size": choice(16, 32, 64, 128),
        "number_of_hidden_layers": choice(range(1,5))
    }

In this case, batch_size takes on one of the values [16, 32, 64, 128] and number_of_hidden_layers takes on one of the values [1, 2, 3, 4].

Advanced discrete hyperparameters can also be specified using a distribution. The following distributions are supported:

  • quniform(low, high, q) - Returns a value like round(uniform(low, high) / q) * q
  • qloguniform(low, high, q) - Returns a value like round(exp(uniform(low, high)) / q) * q
  • qnormal(mu, sigma, q) - Returns a value like round(normal(mu, sigma) / q) * q
  • qlognormal(mu, sigma, q) - Returns a value like round(exp(normal(mu, sigma)) / q) * q
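To make the q* formulas concrete, here is a small sketch in plain Python. The quniform helper below is an illustrative approximation of the expression above, not the HyperDrive implementation:

```python
import random

def quniform(low, high, q):
    # round(uniform(low, high) / q) * q: uniform values snapped to a grid of step q
    return round(random.uniform(low, high) / q) * q

random.seed(0)
samples = [quniform(1, 100, 10) for _ in range(5)]
# Every sample lands on a multiple of q
assert all(s % 10 == 0 for s in samples)
```

The same pattern applies to the other q* distributions: draw from the base distribution, divide by q, round, and multiply by q again.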

Continuous hyperparameters

Continuous hyperparameters are specified as a distribution over a continuous range of values. Supported distributions include:

  • uniform(low, high) - Returns a value uniformly distributed between low and high
  • loguniform(low, high) - Returns a value drawn according to exp(uniform(low, high)) so that the logarithm of the return value is uniformly distributed
  • normal(mu, sigma) - Returns a real value that's normally distributed with mean mu and standard deviation sigma
  • lognormal(mu, sigma) - Returns a value drawn according to exp(normal(mu, sigma)) so that the logarithm of the return value is normally distributed
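The effect of loguniform can likewise be sketched in plain Python. Note that, per the formula above, low and high are bounds in log space; this is an illustrative approximation, not library code:

```python
import math
import random

def loguniform(low, high):
    # exp(uniform(low, high)): the log of the returned value is uniform in [low, high]
    return math.exp(random.uniform(low, high))

random.seed(0)
# Bounds given in log space: samples span roughly 1e-4 to 1e-1
samples = [loguniform(math.log(1e-4), math.log(1e-1)) for _ in range(1000)]
assert all(9e-5 <= s <= 1.1e-1 for s in samples)
```

This kind of distribution is useful for parameters like learning rates, where plausible values span several orders of magnitude.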

An example of a parameter space definition:

    {    
        "learning_rate": normal(10, 3),
        "keep_probability": uniform(0.05, 0.1)
    }

This code defines a search space with two parameters, learning_rate and keep_probability. learning_rate has a normal distribution with a mean of 10 and a standard deviation of 3. keep_probability has a uniform distribution with a minimum value of 0.05 and a maximum value of 0.1.

Sampling the hyperparameter space

You can also specify the parameter sampling method to use over the hyperparameter space. Azure Machine Learning supports random sampling, grid sampling, and Bayesian sampling.

Picking a sampling method

  • Grid sampling can be used if your hyperparameter space can be defined as a choice among discrete values and if you have sufficient budget to exhaustively search all values in the defined search space. Additionally, you can use automated early termination of poorly performing runs, which reduces waste of resources.
  • Random sampling allows the hyperparameter space to include both discrete and continuous hyperparameters. In practice it produces good results most of the time and also allows the use of automated early termination of poorly performing runs. Some users perform an initial search using random sampling and then iteratively refine the search space to improve results.
  • Bayesian sampling leverages knowledge of previous samples when choosing hyperparameter values, effectively trying to improve the reported primary metric. Bayesian sampling is recommended when you have sufficient budget to explore the hyperparameter space; for best results, we recommend a maximum number of runs greater than or equal to 20 times the number of hyperparameters being tuned. Note that Bayesian sampling does not currently support any early termination policy.

Random sampling

In random sampling, hyperparameter values are randomly selected from the defined search space. Random sampling allows the search space to include both discrete and continuous hyperparameters.

from azureml.train.hyperdrive import RandomParameterSampling
from azureml.train.hyperdrive import normal, uniform, choice
param_sampling = RandomParameterSampling( {
        "learning_rate": normal(10, 3),
        "keep_probability": uniform(0.05, 0.1),
        "batch_size": choice(16, 32, 64, 128)
    }
)

Grid sampling

Grid sampling performs a simple grid search over all feasible values in the defined search space. It can only be used with hyperparameters specified using choice. For example, the following space has a total of six samples:

from azureml.train.hyperdrive import GridParameterSampling
from azureml.train.hyperdrive import choice
param_sampling = GridParameterSampling( {
        "num_hidden_layers": choice(1, 2, 3),
        "batch_size": choice(16, 32)
    }
)
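To see why this space yields six samples, you can enumerate the Cartesian product of the choice values yourself. This is an illustrative sketch of what grid search covers, not how HyperDrive generates runs internally:

```python
from itertools import product

num_hidden_layers = [1, 2, 3]
batch_size = [16, 32]

# Grid search evaluates every combination of the discrete choices
grid = [dict(num_hidden_layers=n, batch_size=b)
        for n, b in product(num_hidden_layers, batch_size)]

assert len(grid) == 6  # 3 x 2 configurations
```

The total number of samples is the product of the sizes of the choice lists, which is why grid sampling only makes sense when that product fits your budget.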

Bayesian sampling

Bayesian sampling is based on the Bayesian optimization algorithm and makes intelligent choices about which hyperparameter values to sample next. It picks samples based on how previous samples performed, such that new samples improve the reported primary metric.

When you use Bayesian sampling, the number of concurrent runs has an impact on the effectiveness of the tuning process. Typically, a smaller number of concurrent runs leads to better sampling convergence, since the smaller degree of parallelism increases the number of runs that benefit from previously completed runs.

Bayesian sampling only supports choice, uniform, and quniform distributions over the search space.

from azureml.train.hyperdrive import BayesianParameterSampling
from azureml.train.hyperdrive import uniform, choice
param_sampling = BayesianParameterSampling( {
        "learning_rate": uniform(0.05, 0.1),
        "batch_size": choice(16, 32, 64, 128)
    }
)

Note

Bayesian sampling does not support any early termination policy (see Specify an early termination policy). When using Bayesian parameter sampling, set early_termination_policy = None, or leave off the early_termination_policy parameter.

Specify primary metric

Specify the primary metric you want the hyperparameter tuning experiment to optimize. Each training run is evaluated on the primary metric. Poorly performing runs (where the primary metric does not meet the criteria set by the early termination policy) will be terminated. In addition to the primary metric name, you also specify the goal of the optimization: whether to maximize or minimize the primary metric.

  • primary_metric_name: The name of the primary metric to optimize. The name needs to exactly match the name of the metric logged by the training script. See Log metrics for hyperparameter tuning.
  • primary_metric_goal: Either PrimaryMetricGoal.MAXIMIZE or PrimaryMetricGoal.MINIMIZE; determines whether the primary metric is maximized or minimized when evaluating the runs.
primary_metric_name="accuracy",
primary_metric_goal=PrimaryMetricGoal.MAXIMIZE

Optimize the runs to maximize "accuracy". Make sure to log this value in your training script.

Log metrics for hyperparameter tuning

The training script for your model must log the relevant metrics during model training. When you configure the hyperparameter tuning, you specify the primary metric to use for evaluating run performance. (See Specify a primary metric to optimize.) In your training script, you must log this metric so it is available to the hyperparameter tuning process.

Log this metric in your training script with the following sample snippet:

from azureml.core.run import Run
run_logger = Run.get_context()
run_logger.log("accuracy", float(val_accuracy))

The training script calculates val_accuracy and logs it as "accuracy", which is used as the primary metric. Each time the metric is logged, it is received by the hyperparameter tuning service. It is up to the model developer to determine how frequently to report this metric.

Specify early termination policy

Terminate poorly performing runs automatically with an early termination policy. Termination reduces waste of resources and instead uses these resources to explore other parameter configurations.

When using an early termination policy, you can configure the following parameters that control when the policy is applied:

  • evaluation_interval: the frequency for applying the policy. Each time the training script logs the primary metric counts as one interval. Thus an evaluation_interval of 1 applies the policy every time the training script reports the primary metric, and an evaluation_interval of 2 applies it every other time. If not specified, evaluation_interval is set to 1 by default.
  • delay_evaluation: delays the first policy evaluation for a specified number of intervals. This optional parameter lets all configurations run for an initial minimum number of intervals, avoiding premature termination of training runs. If specified, the policy applies at every multiple of evaluation_interval that is greater than or equal to delay_evaluation.
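The interaction of these two parameters can be sketched as a small helper that lists the intervals at which a policy would be evaluated. This is an illustrative approximation of the behavior described above, not HyperDrive code:

```python
def policy_intervals(evaluation_interval, delay_evaluation, total_intervals):
    # The policy fires at every multiple of evaluation_interval
    # that is greater than or equal to delay_evaluation
    return [i for i in range(1, total_intervals + 1)
            if i % evaluation_interval == 0 and i >= delay_evaluation]

# evaluation_interval=1, delay_evaluation=5: fires at every interval from 5 on
assert policy_intervals(1, 5, 8) == [5, 6, 7, 8]
# evaluation_interval=2, no delay: fires at every other interval
assert policy_intervals(2, 0, 8) == [2, 4, 6, 8]
```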

Azure Machine Learning supports the following early termination policies.

Bandit policy

Bandit is a termination policy based on slack factor/slack amount and evaluation interval. The policy terminates any run whose primary metric is not within the specified slack factor/slack amount of the best performing training run. It takes the following configuration parameters:

  • slack_factor or slack_amount: the slack allowed with respect to the best performing training run. slack_factor specifies the allowable slack as a ratio. slack_amount specifies the allowable slack as an absolute amount, instead of a ratio.

    For example, consider a Bandit policy applied at interval 10. Assume that the best performing run at interval 10 reported a primary metric of 0.8, with a goal to maximize the primary metric. If the policy was specified with a slack_factor of 0.2, any training run whose best metric at interval 10 is less than 0.66 (0.8/(1+slack_factor)) will be terminated. If instead the policy was specified with a slack_amount of 0.2, any training run whose best metric at interval 10 is less than 0.6 (0.8 - slack_amount) will be terminated.

  • evaluation_interval: the frequency for applying the policy (optional parameter).

  • delay_evaluation: delays the first policy evaluation for a specified number of intervals (optional parameter).

from azureml.train.hyperdrive import BanditPolicy
early_termination_policy = BanditPolicy(slack_factor = 0.1, evaluation_interval=1, delay_evaluation=5)

In this example, the early termination policy is applied at every interval when metrics are reported, starting at evaluation interval 5. Any run whose best metric is less than 1/(1+0.1), or about 91%, of the best performing run's metric will be terminated.
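The thresholds from the Bandit examples can be reproduced with a short sketch (illustrative only; assumes a maximization goal):

```python
def bandit_threshold(best_metric, slack_factor=None, slack_amount=None):
    # For a maximization goal: runs whose best metric falls below the
    # returned threshold are terminated
    if slack_factor is not None:
        return best_metric / (1 + slack_factor)
    return best_metric - slack_amount

# Best run reported 0.8 at the evaluated interval
assert abs(bandit_threshold(0.8, slack_factor=0.2) - 0.8 / 1.2) < 1e-12
assert abs(bandit_threshold(0.8, slack_amount=0.2) - 0.6) < 1e-9
```

Note how slack_factor scales with the best metric while slack_amount is a fixed offset, which is why the two settings produce different thresholds (about 0.67 versus 0.6) for the same slack value.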

Median stopping policy

Median stopping is an early termination policy based on running averages of primary metrics reported by the runs. This policy computes running averages across all training runs and terminates runs whose performance is worse than the median of the running averages. It takes the following configuration parameters:

  • evaluation_interval: the frequency for applying the policy (optional parameter).
  • delay_evaluation: delays the first policy evaluation for a specified number of intervals (optional parameter).
from azureml.train.hyperdrive import MedianStoppingPolicy
early_termination_policy = MedianStoppingPolicy(evaluation_interval=1, delay_evaluation=5)

In this example, the early termination policy is applied at every interval starting at evaluation interval 5. A run is terminated at interval 5 if its best primary metric is worse than the median of the running averages over intervals 1:5 across all training runs.
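The decision rule can be sketched as follows. This is an illustrative approximation for a maximization goal, not the service's implementation:

```python
from statistics import mean, median

def should_stop(candidate_metrics, all_runs_metrics):
    """Illustrative median-stopping check (maximization goal).

    candidate_metrics: the candidate run's primary-metric reports so far.
    all_runs_metrics: metric reports from all training runs over the
    same intervals.
    """
    # Terminate if the candidate's best metric so far is worse than the
    # median of the runs' running averages
    best_so_far = max(candidate_metrics)
    running_averages = [mean(m) for m in all_runs_metrics]
    return best_so_far < median(running_averages)

runs = [[0.5, 0.6, 0.7], [0.4, 0.5, 0.6], [0.6, 0.7, 0.8]]
assert should_stop([0.1, 0.2, 0.3], runs) is True   # clearly below the median
assert should_stop([0.5, 0.9], runs) is False       # best metric beats the median
```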

Truncation selection policy

Truncation selection cancels a given percentage of the lowest performing runs at each evaluation interval. Runs are compared based on their performance on the primary metric, and the lowest X% are terminated. It takes the following configuration parameters:

  • truncation_percentage: the percentage of lowest performing runs to terminate at each evaluation interval. Specify an integer value between 1 and 99.
  • evaluation_interval: the frequency for applying the policy (optional parameter).
  • delay_evaluation: delays the first policy evaluation for a specified number of intervals (optional parameter).
from azureml.train.hyperdrive import TruncationSelectionPolicy
early_termination_policy = TruncationSelectionPolicy(evaluation_interval=1, truncation_percentage=20, delay_evaluation=5)

In this example, the early termination policy is applied at every interval starting at evaluation interval 5. A run is terminated at interval 5 if its performance at interval 5 is in the lowest 20% of performance of all runs at interval 5.
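The selection step can be sketched like this (illustrative only, assuming a maximization goal and the hypothetical run ids shown):

```python
def runs_to_terminate(metrics_by_run, truncation_percentage):
    """Illustrative truncation-selection check (maximization goal).

    metrics_by_run: mapping of run id -> primary metric at this interval.
    Returns the ids of the lowest-performing truncation_percentage of runs.
    """
    n_terminate = int(len(metrics_by_run) * truncation_percentage / 100)
    ranked = sorted(metrics_by_run, key=metrics_by_run.get)  # worst first
    return set(ranked[:n_terminate])

metrics = {"run_1": 0.9, "run_2": 0.3, "run_3": 0.7, "run_4": 0.5, "run_5": 0.8}
# With 20% truncation, the single worst of five runs is terminated
assert runs_to_terminate(metrics, 20) == {"run_2"}
```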

No termination policy

If you want all training runs to run to completion, set the policy to None. This has the effect of not applying any early termination policy.

policy=None

Default policy

If no policy is specified, the hyperparameter tuning service lets all training runs execute to completion.

Picking an early termination policy

  • If you are looking for a conservative policy that provides savings without terminating promising jobs, you can use a Median Stopping Policy with evaluation_interval 1 and delay_evaluation 5. These are conservative settings that can provide approximately 25%-35% savings with no loss on the primary metric (based on our evaluation data).
  • If you are looking for more aggressive savings from early termination, you can use either a Bandit Policy with a stricter (smaller) allowable slack or a Truncation Selection Policy with a larger truncation percentage.

Allocate resources

Control your resource budget for your hyperparameter tuning experiment by specifying the maximum total number of training runs. Optionally, specify the maximum duration for your hyperparameter tuning experiment.

  • max_total_runs: Maximum total number of training runs that will be created. This is an upper bound; there may be fewer runs, for instance, if the hyperparameter space is finite and has fewer samples. Must be a number between 1 and 1000.
  • max_duration_minutes: Maximum duration in minutes of the hyperparameter tuning experiment. This parameter is optional; if present, any runs that would be running after this duration are automatically canceled.

Note

If both max_total_runs and max_duration_minutes are specified, the hyperparameter tuning experiment terminates when the first of these two thresholds is reached.

Additionally, specify the maximum number of training runs to run concurrently during your hyperparameter tuning search.

  • max_concurrent_runs: Maximum number of runs to run concurrently at any given moment. If not specified, all max_total_runs will be launched in parallel. If specified, must be a number between 1 and 100.

Note

The number of concurrent runs is gated on the resources available in the specified compute target. Hence, you need to ensure that the compute target has the available resources for the desired concurrency.

Allocate resources for hyperparameter tuning:

max_total_runs=20,
max_concurrent_runs=4

This code configures the hyperparameter tuning experiment to use a maximum of 20 total runs, running four configurations at a time.

Configure experiment

Configure your hyperparameter tuning experiment using the hyperparameter search space, early termination policy, primary metric, and resource allocation defined in the sections above. Additionally, provide an estimator that will be called with the sampled hyperparameters. The estimator describes the training script you run, the resources per job (single or multi-GPU), and the compute target to use. Since concurrency for your hyperparameter tuning experiment is gated on the resources available, ensure that the compute target specified in the estimator has sufficient resources for your desired concurrency. (For more information on estimators, see how to train models.)

Configure your hyperparameter tuning experiment:

from azureml.train.hyperdrive import HyperDriveConfig
hyperdrive_run_config = HyperDriveConfig(estimator=estimator,
                          hyperparameter_sampling=param_sampling, 
                          policy=early_termination_policy,
                          primary_metric_name="accuracy", 
                          primary_metric_goal=PrimaryMetricGoal.MAXIMIZE,
                          max_total_runs=100,
                          max_concurrent_runs=4)

Submit experiment

Once you have defined your hyperparameter tuning configuration, submit an experiment:

from azureml.core.experiment import Experiment
experiment = Experiment(workspace, experiment_name)
hyperdrive_run = experiment.submit(hyperdrive_run_config)

experiment_name is the name you assign to your hyperparameter tuning experiment, and workspace is the workspace in which you want to create the experiment. (For more information on experiments, see How does Azure Machine Learning work?)

Warm start your hyperparameter tuning experiment (optional)

Often, finding the best hyperparameter values for your model can be an iterative process, needing multiple tuning runs that learn from previous hyperparameter tuning runs. Reusing knowledge from these previous runs accelerates the hyperparameter tuning process, reducing the cost of tuning the model and potentially improving the primary metric of the resulting model. When warm starting a hyperparameter tuning experiment with Bayesian sampling, trials from the previous run are used as prior knowledge to intelligently pick new samples that improve the primary metric. Additionally, when using random or grid sampling, any early termination decisions leverage metrics from the previous runs to identify poorly performing training runs.

Azure Machine Learning allows you to warm start your hyperparameter tuning run by leveraging knowledge from up to 5 previously completed or canceled hyperparameter tuning parent runs. You can specify the list of parent runs you want to warm start from using this snippet:

from azureml.train.hyperdrive import HyperDriveRun

warmstart_parent_1 = HyperDriveRun(experiment, "warmstart_parent_run_ID_1")
warmstart_parent_2 = HyperDriveRun(experiment, "warmstart_parent_run_ID_2")
warmstart_parents_to_resume_from = [warmstart_parent_1, warmstart_parent_2]

Additionally, there may be occasions when individual training runs of a hyperparameter tuning experiment are canceled due to budget constraints, or fail for other reasons. It is possible to resume such individual training runs from the last checkpoint (assuming your training script handles checkpoints). Resuming an individual training run uses the same hyperparameter configuration and mounts the outputs folder used for that run. The training script should accept the resume-from argument, which contains the checkpoint or model files from which to resume the training run. You can resume individual training runs using the following snippet:

from azureml.core.run import Run

resume_child_run_1 = Run(experiment, "resume_child_run_ID_1")
resume_child_run_2 = Run(experiment, "resume_child_run_ID_2")
child_runs_to_resume = [resume_child_run_1, resume_child_run_2]

You can configure your hyperparameter tuning experiment to warm start from a previous experiment, or to resume individual training runs, using the optional parameters resume_from and resume_child_runs in the config:

from azureml.train.hyperdrive import HyperDriveConfig

hyperdrive_run_config = HyperDriveConfig(estimator=estimator,
                          hyperparameter_sampling=param_sampling, 
                          policy=early_termination_policy,
                          resume_from=warmstart_parents_to_resume_from, 
                          resume_child_runs=child_runs_to_resume,
                          primary_metric_name="accuracy", 
                          primary_metric_goal=PrimaryMetricGoal.MAXIMIZE,
                          max_total_runs=100,
                          max_concurrent_runs=4)

Visualize experiment

The Azure Machine Learning SDK provides a Notebook widget that visualizes the progress of your training runs. The following snippet visualizes all your hyperparameter tuning runs in one place in a Jupyter notebook:

from azureml.widgets import RunDetails
RunDetails(hyperdrive_run).show()

This code displays a table with details about the training runs for each of the hyperparameter configurations.

(Hyperparameter tuning table)

You can also visualize the performance of each of the runs as training progresses.

(Hyperparameter tuning plot)

Additionally, you can use a Parallel Coordinates Plot to visually identify the correlation between performance and the values of individual hyperparameters.

(Hyperparameter tuning parallel coordinates plot)

You can also visualize all your hyperparameter tuning runs in the Azure web portal. For more information on how to view an experiment in the web portal, see how to track experiments.

Find the best model

Once all of the hyperparameter tuning runs have completed, identify the best performing configuration and the corresponding hyperparameter values:

best_run = hyperdrive_run.get_best_run_by_primary_metric()
best_run_metrics = best_run.get_metrics()
parameter_values = best_run.get_details()['runDefinition']['Arguments']

print('Best Run Id: ', best_run.id)
print('\n Accuracy:', best_run_metrics['accuracy'])
print('\n learning rate:',parameter_values[3])
print('\n keep probability:',parameter_values[5])
print('\n batch size:',parameter_values[7])
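The indices 3, 5, and 7 above depend on the order of the script's command-line arguments, which alternate flag and value. A more robust, illustrative way to recover the values is to pair flags with the values that follow them; the flag names below are hypothetical examples, not values from the run above:

```python
# Hypothetical flat argument list, in the shape returned in run details:
arguments = ["--data-folder", "/tmp/data", "--learning-rate", "0.01",
             "--keep-probability", "0.07", "--batch-size", "64"]

# Pair each "--flag" with the value that follows it
params = {arguments[i].lstrip("-"): arguments[i + 1]
          for i in range(0, len(arguments), 2)}

assert params["learning-rate"] == "0.01"
assert params["batch-size"] == "64"
```

This avoids hard-coding positional indices, which break as soon as the training script adds or reorders an argument.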

Sample notebook

Refer to the train-hyperparameter-* notebooks in this folder:

Learn how to run notebooks by following the article Use Jupyter notebooks to explore this service.

Next steps