APPLIES TO:
Azure CLI ml extension v2 (current)
Python SDK azure-ai-ml v2 (current)
In this article, you learn how to automate efficient hyperparameter tuning with the Azure Machine Learning SDK v2 and CLI v2 using the SweepJob class.
- Define the parameter search space
- Choose a sampling algorithm
- Set the optimization objective
- Configure an early termination policy
- Set sweep job limits
- Submit the experiment
- Visualize training jobs
- Select the best configuration
What is hyperparameter tuning?
Hyperparameters are adjustable settings that control model training. For neural networks, for example, you choose the number of hidden layers and the number of nodes per layer. Model performance depends heavily on these values.
Hyperparameter tuning (or hyperparameter optimization) is the process of finding the hyperparameter configuration that yields the best performance. This process is often computationally expensive and manual.
Azure Machine Learning lets you automate hyperparameter tuning and run experiments in parallel to efficiently optimize hyperparameters.
Define the search space
Tune hyperparameters by exploring the range of values defined for each hyperparameter.
Hyperparameters can be discrete or continuous, and can have a value distribution expressed with a parameter expression.
Discrete hyperparameters
Discrete hyperparameters are specified as a Choice among discrete values. Choice can be:
- one or more comma-separated values
- a range object
- any arbitrary list object
from azure.ai.ml.sweep import Choice

command_job_for_sweep = command_job(
    batch_size=Choice(values=[16, 32, 64, 128]),
    number_of_hidden_layers=Choice(values=range(1, 5)),
)
In this case, batch_size takes one of [16, 32, 64, 128] and number_of_hidden_layers takes one of [1, 2, 3, 4].
The following advanced discrete hyperparameters can also be specified using a distribution:
- QUniform(min_value, max_value, q): Returns a value like round(Uniform(min_value, max_value) / q) * q
- QLogUniform(min_value, max_value, q): Returns a value like round(exp(Uniform(min_value, max_value)) / q) * q
- QNormal(mu, sigma, q): Returns a value like round(Normal(mu, sigma) / q) * q
- QLogNormal(mu, sigma, q): Returns a value like round(exp(Normal(mu, sigma)) / q) * q
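The quantization formulas above can be illustrated with a minimal standard-library sketch. It doesn't use the Azure Machine Learning SDK; the helper names quniform and qloguniform are hypothetical and only mirror the formulas listed for QUniform and QLogUniform.

```python
import math
import random


def quniform(rng, min_value, max_value, q):
    """Hypothetical helper: round(Uniform(min_value, max_value) / q) * q."""
    return round(rng.uniform(min_value, max_value) / q) * q


def qloguniform(rng, min_value, max_value, q):
    """Hypothetical helper: round(exp(Uniform(min_value, max_value)) / q) * q."""
    return round(math.exp(rng.uniform(min_value, max_value)) / q) * q


rng = random.Random(0)

# Every sampled batch size is a multiple of q=16 between 16 and 128.
batch_sizes = [quniform(rng, 16, 128, 16) for _ in range(5)]

# The bounds are given in log space, so values land on multiples of 5 up to 100.
hidden_units = [qloguniform(rng, 0.0, math.log(100), 5) for _ in range(5)]

print(batch_sizes)
print(hidden_units)
```

Because the result is always a multiple of q, these distributions behave like discrete hyperparameters even though they're built on continuous ones.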
Continuous hyperparameters
Continuous hyperparameters are specified as a distribution over a continuous range of values:
- Uniform(min_value, max_value): Returns a value uniformly distributed between min_value and max_value
- LogUniform(min_value, max_value): Returns a value drawn according to exp(Uniform(min_value, max_value)) so that the logarithm of the return value is uniformly distributed
- Normal(mu, sigma): Returns a real value that's normally distributed with mean mu and standard deviation sigma
- LogNormal(mu, sigma): Returns a value drawn according to exp(Normal(mu, sigma)) so that the logarithm of the return value is normally distributed
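As a rough illustration of the LogUniform definition above, the following standard-library sketch draws exp(Uniform(min_value, max_value)). The loguniform helper is hypothetical and isn't part of the SDK. Note that, read literally, the formula takes min_value and max_value in log space, so a range of 1e-4 to 1e-1 is passed as log(1e-4) and log(1e-1).

```python
import math
import random


def loguniform(rng, min_value, max_value):
    """Hypothetical helper: exp(Uniform(min_value, max_value))."""
    return math.exp(rng.uniform(min_value, max_value))


rng = random.Random(0)
low, high = 1e-4, 1e-1

# Bounds are passed in log space, following the formula in the text.
samples = [loguniform(rng, math.log(low), math.log(high)) for _ in range(1000)]

# Count samples per decade: a log-uniform draw spreads roughly evenly
# across [1e-4, 1e-3), [1e-3, 1e-2), and [1e-2, 1e-1].
decade_counts = [0, 0, 0]
for s in samples:
    exponent = math.log10(s)                       # between -4 and -1
    index = min(2, max(0, int(math.floor(exponent)) + 4))
    decade_counts[index] += 1

print(decade_counts)
```

This is why a log-uniform distribution is a common choice for parameters such as learning rates, where the order of magnitude matters more than the absolute value.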
An example of a parameter space definition:
from azure.ai.ml.sweep import Normal, Uniform

command_job_for_sweep = command_job(
    learning_rate=Normal(mu=10, sigma=3),
    keep_probability=Uniform(min_value=0.05, max_value=0.1),
)
This code defines a search space with two parameters: learning_rate and keep_probability. learning_rate has a normal distribution with a mean of 10 and a standard deviation of 3. keep_probability has a uniform distribution with a minimum value of 0.05 and a maximum value of 0.1.
For the CLI, use the sweep job YAML schema to define the search space:
search_space:
    conv_size:
        type: choice
        values: [2, 5, 7]
    dropout_rate:
        type: uniform
        min_value: 0.1
        max_value: 0.2
Sampling the hyperparameter space
Specify the sampling method for the hyperparameter space. Azure Machine Learning supports:
- Random sampling
- Grid sampling
- Bayesian sampling
Random sampling
Random sampling supports discrete and continuous hyperparameters, and supports early termination of low-performing jobs. Many users start with random sampling to identify promising regions, then refine.
In random sampling, values are drawn uniformly (or via the specified random rule) from the defined search space. After creating your command job, use sweep to define the sampling algorithm.
from azure.ai.ml.entities import CommandJob
from azure.ai.ml.sweep import Choice, RandomSamplingAlgorithm, SweepJob, SweepJobLimits

# cpu_cluster and job_env are assumed to be defined earlier in your script
command_job = CommandJob(
    inputs=dict(kernel="linear", penalty=1.0),
    compute=cpu_cluster,
    environment=f"{job_env.name}:{job_env.version}",
    code="./scripts",
    command="python scripts/train.py --kernel $kernel --penalty $penalty",
    experiment_name="sklearn-iris-flowers",
)

# Wrap the command job in a sweep that uses seeded random sampling
sweep = SweepJob(
    sampling_algorithm=RandomSamplingAlgorithm(seed=999, rule="sobol", logbase="e"),
    trial=command_job,
    search_space={"ss": Choice(type="choice", values=[{"space1": True}, {"space2": True}])},
    inputs={"input1": {"file": "top_level.csv", "mode": "ro_mount"}},  # type:ignore
    compute="top_level",
    limits=SweepJobLimits(trial_timeout=600),
)
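To make the idea of random sampling concrete, here's a minimal sketch that draws configurations from a search space using only the Python standard library. It's an illustration of the concept, not how the service is implemented: the tuple-based search space format and the sample_configuration helper are assumptions made for this example. The parameter names and ranges reuse the conv_size and dropout_rate space from the YAML example earlier in this article.

```python
import random

# Hypothetical in-memory description of a search space:
# ("choice", values) or ("uniform", min_value, max_value).
search_space = {
    "conv_size": ("choice", [2, 5, 7]),
    "dropout_rate": ("uniform", 0.1, 0.2),
}


def sample_configuration(rng, space):
    """Draw one value per hyperparameter, independently of earlier trials."""
    config = {}
    for name, spec in space.items():
        kind = spec[0]
        if kind == "choice":
            config[name] = rng.choice(spec[1])
        elif kind == "uniform":
            config[name] = rng.uniform(spec[1], spec[2])
        else:
            raise ValueError(f"Unsupported distribution: {kind}")
    return config


# A fixed seed makes the sequence of sampled configurations reproducible.
rng = random.Random(999)
trials = [sample_configuration(rng, search_space) for _ in range(4)]

for trial in trials:
    print(trial)
```

Because each draw is independent of previous results, random sampling parallelizes well, which is one reason it's a common starting point.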
Sobol
Sobol is a quasi-random sequence that improves space-filling and reproducibility. Provide a seed and set rule="sobol" on RandomSamplingAlgorithm.
from azure.ai.ml.sweep import RandomSamplingAlgorithm

sweep_job = command_job_for_sweep.sweep(
    compute="cpu-cluster",
    sampling_algorithm=RandomSamplingAlgorithm(seed=123, rule="sobol"),
    ...
)
References: RandomSamplingAlgorithm
Grid sampling
Grid sampling supports discrete hyperparameters. Use grid sampling if you can budget to exhaustively search over the search space. It also supports early termination of low-performing jobs.
Grid sampling does a simple grid search over all possible values. Grid sampling can only be used with choice hyperparameters. For example, the following space has six samples:
from azure.ai.ml.sweep import Choice

command_job_for_sweep = command_job(
    batch_size=Choice(values=[16, 32]),
    number_of_hidden_layers=Choice(values=[1, 2, 3]),
)

sweep_job = command_job_for_sweep.sweep(
    compute="cpu-cluster",
    sampling_algorithm="grid",
    ...
)
References: Choice
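To see why the grid example has six samples, you can enumerate the Cartesian product of the two choice lists. This is a plain-Python sketch of what an exhaustive grid means, not a call into the SDK; the grid_space dictionary is just an illustrative stand-in for the two Choice parameters.

```python
from itertools import product

# Same values as the Choice parameters in the grid sampling example.
grid_space = {
    "batch_size": [16, 32],
    "number_of_hidden_layers": [1, 2, 3],
}

# One configuration per combination: 2 batch sizes x 3 layer counts.
names = list(grid_space)
grid = [dict(zip(names, combo)) for combo in product(*grid_space.values())]

for config in grid:
    print(config)
print(len(grid))
```

The number of grid points is the product of the number of values per hyperparameter, so it grows quickly as you add hyperparameters or values.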
Bayesian sampling
Bayesian sampling (Bayesian optimization) selects new samples based on prior results to improve the primary metric efficiently.
Bayesian sampling is recommended if you have enough budget to explore the hyperparameter space. For best results, we recommend a maximum number of jobs greater than or equal to 20 times the number of hyperparameters being tuned.
The number of concurrent jobs has an impact on the effectiveness of the tuning process. A smaller number of concurrent jobs may lead to better sampling convergence, since the smaller degree of parallelism increases the number of jobs that benefit from previously completed jobs.
Bayesian sampling supports choice, uniform, and quniform distributions.
from azure.ai.ml.sweep import Uniform, Choice

command_job_for_sweep = command_job(
    learning_rate=Uniform(min_value=0.05, max_value=0.1),
    batch_size=Choice(values=[16, 32, 64, 128]),
)

sweep_job = command_job_for_sweep.sweep(
    compute="cpu-cluster",
    sampling_algorithm="bayesian",
    ...
)
Specify the objective of the sweep
Define the objective of your sweep job by specifying the primary metric and goal you want hyperparameter tuning to optimize. Each training job is evaluated for the primary metric. The early termination policy uses the primary metric to identify low-performance jobs.
- primary_metric: The name of the primary metric needs to exactly match the name of the metric logged by the training script.
- goal: It can be either Maximize or Minimize and determines whether the primary metric will be maximized or minimized when evaluating the jobs.
from azure.ai.ml.sweep import Uniform, Choice

command_job_for_sweep = command_job(
    learning_rate=Uniform(min_value=0.05, max_value=0.1),
    batch_size=Choice(values=[16, 32, 64, 128]),
)

sweep_job = command_job_for_sweep.sweep(
    compute="cpu-cluster",
    sampling_algorithm="bayesian",
    primary_metric="accuracy",
    goal="Maximize",
)
This sample maximizes "accuracy".
Log metrics for hyperparameter tuning
Your training script must log the primary metric with the exact name expected by the sweep job.
Log the primary metric in your training script with the following sample snippet:
import mlflow
mlflow.log_metric("accuracy", float(val_accuracy))
References: mlflow.log_metric
The training script calculates val_accuracy and logs it as the primary metric "accuracy". Each time the metric is logged, it's received by the hyperparameter tuning service. It's up to you to determine the frequency of reporting.
For more information on logging values for training jobs, see Enable logging in Azure Machine Learning training jobs.
Specify early termination policy
End poorly performing jobs early to improve efficiency.
You can configure the following parameters that control when a policy is applied:
- evaluation_interval: the frequency of applying the policy. Each time the training script logs the primary metric counts as one interval. An evaluation_interval of 1 applies the policy every time the training script reports the primary metric. An evaluation_interval of 2 applies the policy every other time. If not specified, evaluation_interval is set to 0 by default.
- delay_evaluation: delays the first policy evaluation for a specified number of intervals. This optional parameter avoids premature termination of training jobs by allowing all configurations to run for a minimum number of intervals. If specified, the policy applies every multiple of evaluation_interval that is greater than or equal to delay_evaluation. If not specified, delay_evaluation is set to 0 by default.
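The interaction between evaluation_interval and delay_evaluation can be sketched in a few lines. The policy_intervals helper below is hypothetical and simply encodes the rule stated above: the policy applies at every multiple of evaluation_interval that is greater than or equal to delay_evaluation. It assumes an evaluation_interval of at least 1 and doesn't model the default of 0.

```python
def policy_intervals(evaluation_interval, delay_evaluation, last_interval):
    """List the intervals (1-based) at which a termination policy would apply.

    Hypothetical helper for illustration; assumes evaluation_interval >= 1.
    """
    if evaluation_interval < 1:
        raise ValueError("this sketch assumes evaluation_interval >= 1")
    return [
        interval
        for interval in range(evaluation_interval, last_interval + 1, evaluation_interval)
        if interval >= delay_evaluation
    ]


# Apply at every report, but not before the fifth one.
every_report = policy_intervals(evaluation_interval=1, delay_evaluation=5, last_interval=8)

# Apply at every other report; the first eligible multiple of 2 that is >= 5 is 6.
every_other = policy_intervals(evaluation_interval=2, delay_evaluation=5, last_interval=12)

print(every_report)
print(every_other)
```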
Azure Machine Learning supports the following early termination policies:
Bandit policy
Bandit policy is based on a slack factor or slack amount and an evaluation interval. It ends a job when its primary metric falls outside the allowed slack from the best-performing job.
Specify the following configuration parameters:
- slack_factor or slack_amount: Allowed difference from the best job. slack_factor is a ratio; slack_amount is an absolute value.
  For example, consider a Bandit policy applied at interval 10. Assume that the best performing job at interval 10 reported a primary metric of 0.8 with a goal to maximize the primary metric. If the policy specifies a slack_factor of 0.2, any training job whose best metric at interval 10 is less than approximately 0.67 (0.8/(1+slack_factor)) is terminated.
- evaluation_interval: (optional) the frequency for applying the policy
- delay_evaluation: (optional) delays the first policy evaluation for a specified number of intervals
from azure.ai.ml.sweep import BanditPolicy
sweep_job.early_termination = BanditPolicy(slack_factor = 0.1, delay_evaluation = 5, evaluation_interval = 1)
References: BanditPolicy
In this example, the early termination policy is applied at every interval when metrics are reported, starting at evaluation interval 5. Any job whose best metric is less than 1/(1+0.1), or about 91%, of the best-performing job's metric is terminated.
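The Bandit cutoff arithmetic can be checked with a short sketch. The bandit_threshold helper is hypothetical and covers only a Maximize goal. The slack_factor branch follows the formula in the text; the slack_amount branch (best metric minus slack_amount) is an assumption based on the description of slack_amount as an absolute allowed difference.

```python
def bandit_threshold(best_metric, slack_factor=None, slack_amount=None):
    """Cutoff below which a job would be terminated, for a Maximize goal.

    Hypothetical helper for illustration only.
    """
    if slack_factor is not None:
        # Ratio-based slack, as in the text: best / (1 + slack_factor).
        return best_metric / (1 + slack_factor)
    if slack_amount is not None:
        # Assumption: absolute slack is subtracted from the best metric.
        return best_metric - slack_amount
    raise ValueError("provide slack_factor or slack_amount")


# Best job reports 0.8 and slack_factor is 0.2: cutoff is 0.8 / 1.2.
cutoff_example = bandit_threshold(0.8, slack_factor=0.2)

# With slack_factor 0.1, the cutoff is 1 / 1.1 of the best metric.
cutoff_ratio = bandit_threshold(1.0, slack_factor=0.1)

print(round(cutoff_example, 3))
print(round(cutoff_ratio, 3))
```

A smaller slack_factor moves the cutoff closer to the best metric, which makes the policy more aggressive.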
Median stopping policy
Median stopping is an early termination policy based on running averages of primary metrics reported by the jobs. This policy computes running averages across all training jobs and stops jobs whose primary metric value is worse than the median of the averages.
This policy takes the following configuration parameters:
- evaluation_interval: (optional) the frequency for applying the policy
- delay_evaluation: (optional) delays the first policy evaluation for a specified number of intervals
from azure.ai.ml.sweep import MedianStoppingPolicy
sweep_job.early_termination = MedianStoppingPolicy(delay_evaluation = 5, evaluation_interval = 1)
References: MedianStoppingPolicy
In this example, the early termination policy is applied at every interval starting at evaluation interval 5. A job is stopped at interval 5 if its best primary metric is worse than the median of the running averages over intervals 1:5 across all training jobs.
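The median stopping rule described above can be sketched with the standard library. This is a simplified reading of the rule for a Maximize goal, not the service's implementation: the median_stopping helper and the sample metric histories are made up for illustration.

```python
from statistics import mean, median


def median_stopping(metrics_by_job, interval):
    """Return the jobs to stop at the given interval, for a Maximize goal.

    Hypothetical helper: a job is stopped when its best metric so far is
    worse than the median of the running averages across all jobs.
    """
    running_averages = {
        job: mean(values[:interval]) for job, values in metrics_by_job.items()
    }
    cutoff = median(running_averages.values())
    return sorted(
        job
        for job, values in metrics_by_job.items()
        if max(values[:interval]) < cutoff
    )


# Made-up accuracy histories for three jobs over five intervals.
history = {
    "job_a": [0.5, 0.6, 0.7, 0.8, 0.9],   # running average 0.70
    "job_b": [0.4, 0.5, 0.5, 0.6, 0.6],   # running average 0.52 (the median)
    "job_c": [0.2, 0.2, 0.3, 0.3, 0.3],   # best value 0.3 is below the median
}

stopped = median_stopping(history, interval=5)
print(stopped)
```

Here only job_c is stopped: job_b's best value (0.6) is still above the median running average, so it keeps running.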
Truncation selection policy
Truncation selection cancels a percentage of lowest performing jobs at each evaluation interval. Jobs are compared using the primary metric.
This policy takes the following configuration parameters:
- truncation_percentage: the percentage of lowest performing jobs to terminate at each evaluation interval. An integer value between 1 and 99.
- evaluation_interval: (optional) the frequency for applying the policy
- delay_evaluation: (optional) delays the first policy evaluation for a specified number of intervals
- exclude_finished_jobs: specifies whether to exclude finished jobs when applying the policy
from azure.ai.ml.sweep import TruncationSelectionPolicy
sweep_job.early_termination = TruncationSelectionPolicy(evaluation_interval=1, truncation_percentage=20, delay_evaluation=5, exclude_finished_jobs=True)
References: TruncationSelectionPolicy
In this example, the early termination policy is applied at every interval starting at evaluation interval 5. A job terminates at interval 5 if its performance at interval 5 is in the lowest 20% of performance of all jobs at interval 5. Finished jobs are excluded when the policy is applied.
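As a rough sketch of truncation selection for a Maximize goal, the following code ranks jobs by their metric at one interval and picks the lowest 20%. The truncation_selection helper and the sample metrics are hypothetical. Rounding the number of cancelled jobs down is an assumption, and exclude_finished_jobs isn't modeled.

```python
import math


def truncation_selection(metric_by_job, truncation_percentage):
    """Return the lowest-performing jobs to cancel, for a Maximize goal.

    Hypothetical helper; rounds the number of cancelled jobs down.
    """
    n_cancel = math.floor(len(metric_by_job) * truncation_percentage / 100)
    # Sort ascending so the worst-performing jobs come first.
    ranked = sorted(metric_by_job, key=metric_by_job.get)
    return ranked[:n_cancel]


# Made-up metrics for ten jobs at interval 5: 0.50, 0.55, ..., 0.95.
metrics_at_interval_5 = {f"job_{i}": 0.50 + 0.05 * i for i in range(10)}

cancelled = truncation_selection(metrics_at_interval_5, truncation_percentage=20)
print(cancelled)
```

With ten jobs and a truncation percentage of 20, the two jobs with the lowest metric are selected for cancellation.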
No termination policy (default)
If no policy is specified, the hyperparameter tuning service lets all training jobs execute to completion.
sweep_job.early_termination = None
References: SweepJob
Picking an early termination policy
- For a conservative policy that provides savings without terminating promising jobs, consider a Median Stopping Policy with evaluation_interval 1 and delay_evaluation 5. These are conservative settings that can provide approximately 25%-35% savings with no loss on primary metric (based on our evaluation data).
- For more aggressive savings, use Bandit Policy with a smaller allowable slack or Truncation Selection Policy with a larger truncation percentage.
Set limits for your sweep job
Control your resource budget by setting limits for your sweep job.
- max_total_trials: Maximum number of trial jobs. Must be an integer between 1 and 1000.
- max_concurrent_trials: (optional) Maximum number of trial jobs that can run concurrently. If not specified, max_total_trials number of jobs launch in parallel. If specified, must be an integer between 1 and 1000.
- timeout: Maximum time in seconds the entire sweep job is allowed to run. Once this limit is reached, the system cancels the sweep job, including all its trials.
- trial_timeout: Maximum time in seconds each trial job is allowed to run. Once this limit is reached, the system cancels the trial.
Note
If both max_total_trials and timeout are specified, the hyperparameter tuning experiment terminates when the first of these two thresholds is reached.
Note
The number of concurrent trial jobs is gated on the resources available in the specified compute target. Ensure that the compute target has the available resources for the desired concurrency.
sweep_job.set_limits(max_total_trials=20, max_concurrent_trials=4, timeout=1200)
References: SweepJob.set_limits
This code configures the hyperparameter tuning experiment to use a maximum of 20 total trial jobs, running four trial jobs at a time with a timeout of 1,200 seconds for the entire sweep job.
Configure hyperparameter tuning experiment
To configure your hyperparameter tuning experiment, provide the following:
- The defined hyperparameter search space
- Your sampling algorithm
- Your early termination policy
- Your objective
- Resource limits
- CommandJob or CommandComponent
- SweepJob
SweepJob can run a hyperparameter sweep on the Command or Command Component.
Note
The compute target used in sweep_job must have enough resources to satisfy your concurrency level. For more information on compute targets, see Compute targets.
Configure your hyperparameter tuning experiment:
from azure.ai.ml import MLClient
from azure.ai.ml import command, Input
from azure.ai.ml.sweep import Choice, Uniform, MedianStoppingPolicy
from azure.identity import DefaultAzureCredential
# Create your base command job
command_job = command(
    code="./src",
    command="python main.py --iris-csv ${{inputs.iris_csv}} --learning-rate ${{inputs.learning_rate}} --boosting ${{inputs.boosting}}",
    environment="AzureML-lightgbm-3.2-ubuntu18.04-py37-cpu@latest",
    inputs={
        "iris_csv": Input(
            type="uri_file",
            path="https://azuremlexamples.blob.core.chinacloudapi.cn/datasets/iris.csv",
        ),
        "learning_rate": 0.9,
        "boosting": "gbdt",
    },
    compute="cpu-cluster",
)
# Override your inputs with parameter expressions
command_job_for_sweep = command_job(
    learning_rate=Uniform(min_value=0.01, max_value=0.9),
    boosting=Choice(values=["gbdt", "dart"]),
)
# Call sweep() on your command job to sweep over your parameter expressions
sweep_job = command_job_for_sweep.sweep(
    compute="cpu-cluster",
    sampling_algorithm="random",
    primary_metric="test-multi_logloss",
    goal="Minimize",
)
# Specify your experiment details
sweep_job.display_name = "lightgbm-iris-sweep-example"
sweep_job.experiment_name = "lightgbm-iris-sweep-example"
sweep_job.description = "Run a hyperparameter sweep job for LightGBM on Iris dataset."
# Define the limits for this sweep
sweep_job.set_limits(max_total_trials=20, max_concurrent_trials=10, timeout=7200)
# Set early stopping on this one
sweep_job.early_termination = MedianStoppingPolicy(
    delay_evaluation=5, evaluation_interval=2
)
The command_job is invoked as a function so you can apply parameter expressions. The sweep function is configured with trial, sampling algorithm, objective, limits, and compute. The snippet comes from the sample notebook Run hyperparameter sweep on a Command or CommandComponent. In this sample, learning_rate and boosting are tuned. Early stopping is driven by a MedianStoppingPolicy, which stops a job whose primary metric is worse than the median of running averages across all jobs (see MedianStoppingPolicy reference).
To see how the parameter values are received, parsed, and passed to the training script to be tuned, refer to this code sample.
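As a minimal sketch of the receiving side, a training script can read the swept values with argparse. This isn't the referenced code sample: the parse_args function and its defaults are assumptions made for illustration, and only the flag names (--iris-csv, --learning-rate, --boosting) come from the command string shown earlier.

```python
import argparse


def parse_args(argv=None):
    """Parse the flags that the sweep passes on the command line."""
    parser = argparse.ArgumentParser(description="Hypothetical main.py entry point")
    parser.add_argument("--iris-csv", type=str, help="Path to the input CSV file")
    parser.add_argument("--learning-rate", type=float, default=0.9)
    parser.add_argument("--boosting", type=str, default="gbdt")
    return parser.parse_args(argv)


# Simulate one trial: the sweep substitutes sampled values into the command.
args = parse_args(
    ["--iris-csv", "iris.csv", "--learning-rate", "0.05", "--boosting", "dart"]
)

# argparse converts dashes to underscores in attribute names.
params = {"learning_rate": args.learning_rate, "boosting": args.boosting}
print(params)
```

Each trial runs the same script with different values, so the script itself doesn't need to know that it's part of a sweep.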
Important
Every hyperparameter sweep job restarts the training from scratch, including rebuilding the model and all the data loaders. You can minimize this cost by using an Azure Machine Learning pipeline or manual process to do as much data preparation as possible prior to your training jobs.
Submit hyperparameter tuning experiment
After you define your hyperparameter tuning configuration, submit the job:
# submit the sweep
returned_sweep_job = ml_client.create_or_update(sweep_job)
# get a URL for the status of the job
returned_sweep_job.services["Studio"].endpoint
Visualize hyperparameter tuning jobs
Visualize hyperparameter tuning jobs in Azure Machine Learning studio. For details, see View job records in the studio.
Metrics chart: This visualization tracks the metrics logged for each hyperdrive child job over the duration of hyperparameter tuning. Each line represents a child job, and each point measures the primary metric value at that iteration of runtime.
Parallel Coordinates Chart: This visualization shows the correlation between primary metric performance and individual hyperparameter values. The chart is interactive via movement of axes (select and drag by the axis label), and by highlighting values across a single axis (select and drag vertically along a single axis to highlight a range of desired values). The parallel coordinates chart includes an axis on the rightmost portion of the chart that plots the best metric value corresponding to the hyperparameters set for that job instance. This axis is provided in order to project the chart gradient legend onto the data in a more readable fashion.
2-Dimensional Scatter Chart: This visualization shows the correlation between any two individual hyperparameters along with their associated primary metric value.
3-Dimensional Scatter Chart: This visualization is the same as 2D but allows for three hyperparameter dimensions of correlation with the primary metric value. You can also select and drag to reorient the chart to view different correlations in 3D space.
Find the best trial job
After all tuning jobs complete, retrieve the best trial outputs:
# Download best trial model output
ml_client.jobs.download(returned_sweep_job.name, output_name="model")
You can use the CLI to download all default and named outputs of the best trial job and logs of the sweep job.
az ml job download --name <sweep-job> --all
Optionally, download only the best trial output:
az ml job download --name <sweep-job> --output-name model