Configure and submit training runs

In this article, you learn how to configure and submit Azure Machine Learning runs to train your models.

When training, it is common to start on your local computer and later scale out to a cloud-based cluster. With Azure Machine Learning, you can run your script on various compute targets without changing your training script.

All you need to do is define the environment for each compute target within a script run configuration. Then, when you want to run your training experiment on a different compute target, specify the run configuration for that compute.

Prerequisites

What's a script run configuration?

A ScriptRunConfig is used to configure the information necessary for submitting a training run as part of an experiment.

You submit your training experiment with a ScriptRunConfig object. This object includes:

  • source_directory: The source directory that contains your training script
  • script: The training script to run
  • compute_target: The compute target to run on
  • environment: The environment to use when running the script
  • Some additional configurable options (see the reference documentation for more information)

Train your model

The code pattern to submit a training run is the same for all types of compute targets:

  1. Create an experiment to run
  2. Create an environment where the script will run
  3. Create a ScriptRunConfig, which specifies the compute target and environment
  4. Submit the run
  5. Wait for the run to complete

Or you can:

Create an experiment

Create an experiment in your workspace.

from azureml.core import Experiment, Workspace

# Load the workspace from a local config.json file
ws = Workspace.from_config()

experiment_name = 'my_experiment'
experiment = Experiment(workspace=ws, name=experiment_name)

Select a compute target

Select the compute target on which your training script will run. If no compute target is specified in the ScriptRunConfig, or if compute_target='local', Azure ML will execute your script locally.

The example code in this article assumes that you have already created a compute target my_compute_target from the "Prerequisites" section.

Create an environment

Azure Machine Learning environments are an encapsulation of the environment where your machine learning training happens. They specify the Python packages, Docker image, environment variables, and software settings around your training and scoring scripts. They also specify runtimes (Python, Spark, or Docker).

You can either define your own environment or use an Azure ML curated environment. Curated environments are predefined environments that are available in your workspace by default. These environments are backed by cached Docker images, which reduces the run preparation cost. See Azure Machine Learning Curated Environments for the full list of available curated environments.

For a remote compute target, you can use one of these popular curated environments to start with:

from azureml.core import Workspace, Environment

ws = Workspace.from_config()
myenv = Environment.get(workspace=ws, name="AzureML-Minimal")

For more information and details about environments, see Create & use software environments in Azure Machine Learning.

Local compute target

If your compute target is your local machine, you are responsible for ensuring that all the necessary packages are available in the Python environment where the script runs. Set python.user_managed_dependencies to use your current Python environment (or the Python on the path you specify).

from azureml.core import Environment

myenv = Environment("user-managed-env")
myenv.python.user_managed_dependencies = True

# You can choose a specific Python environment by pointing to a Python path 
# myenv.python.interpreter_path = '/home/johndoe/miniconda3/envs/myenv/bin/python'

Create the script run configuration

Now that you have a compute target (my_compute_target) and environment (myenv), create a script run configuration that runs your training script (train.py) located in your project_folder directory:

from azureml.core import ScriptRunConfig

src = ScriptRunConfig(source_directory=project_folder,
                      script='train.py',
                      compute_target=my_compute_target,
                      environment=myenv)

If you do not specify an environment, a default environment will be created for you.

If you have command-line arguments you want to pass to your training script, you can specify them via the arguments parameter of the ScriptRunConfig constructor, for example arguments=['--arg1', arg1_val, '--arg2', arg2_val].
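As a rough sketch of how such arguments reach the script, the snippet below builds an arguments list and then parses it the way a train.py might with argparse. The flag names --learning-rate and --epochs, their defaults, and the helpers build_arguments and parse_training_args are hypothetical and chosen only for illustration.

```python
import argparse


def build_arguments(params):
    """Flatten a dict into the ['--flag', value, ...] list form used for `arguments`."""
    arguments = []
    for name, value in params.items():
        arguments.extend([f"--{name}", value])
    return arguments


def parse_training_args(argv):
    """Parse command-line arguments as a training script might (hypothetical flags)."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--learning-rate", type=float, default=0.01)
    parser.add_argument("--epochs", type=int, default=1)
    return parser.parse_args(argv)


arguments = build_arguments({"learning-rate": 0.001, "epochs": 5})
# A script receives its arguments as strings, so convert before parsing here.
args = parse_training_args([str(a) for a in arguments])
print(args.learning_rate, args.epochs)
```

In this sketch the arguments list is what you would pass to the ScriptRunConfig constructor, and the parser is what would live inside the training script.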

If you want to override the default maximum time allowed for the run, you can do so via the max_run_duration_seconds parameter. The system will attempt to automatically cancel the run if it takes longer than this value.

Specify a distributed job configuration

If you want to run a distributed training job, provide the distributed job-specific config to the distributed_job_config parameter. Supported config types include MpiConfiguration and TensorflowConfiguration.
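A minimal sketch of an MPI configuration passed through distributed_job_config might look like the following. It assumes the azureml-core package is installed; the helper name build_distributed_config, the node and process counts, and the one-hour max_run_duration_seconds value are hypothetical choices for illustration.

```python
def build_distributed_config(project_folder, compute_target, environment):
    """Return a ScriptRunConfig for an MPI job (requires the azureml-core package)."""
    # Imported lazily so this sketch can be loaded without the SDK installed.
    from azureml.core import ScriptRunConfig
    from azureml.core.runconfig import MpiConfiguration

    # Hypothetical sizing: 2 nodes with 1 process per node.
    distr_config = MpiConfiguration(process_count_per_node=1, node_count=2)

    return ScriptRunConfig(
        source_directory=project_folder,
        script="train.py",
        compute_target=compute_target,
        environment=environment,
        distributed_job_config=distr_config,
        max_run_duration_seconds=3600,  # hypothetical one-hour limit
    )
```

Calling build_distributed_config(project_folder, my_compute_target, myenv) would then give a configuration that can be submitted like any other.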

For more information and examples on running distributed Horovod, TensorFlow, and PyTorch jobs, see:

Submit the experiment

run = experiment.submit(config=src)
run.wait_for_completion(show_output=True)

Important

When you submit the training run, a snapshot of the directory that contains your training scripts is created and sent to the compute target. It is also stored as part of the experiment in your workspace. If you change files and submit the run again, only the changed files will be uploaded.

To prevent unnecessary files from being included in the snapshot, make an ignore file (.gitignore or .amlignore) in the directory. Add the files and directories to exclude to this file. For more information on the syntax to use inside this file, see syntax and patterns for .gitignore. The .amlignore file uses the same syntax. If both files exist, the .amlignore file takes precedence.

For more information about snapshots, see Snapshots.
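To make the ignore file concrete, here is a small sketch that writes an .amlignore file into a source directory. The patterns listed and the helper write_amlignore are hypothetical, and a temporary directory stands in for the real project folder.

```python
import tempfile
from pathlib import Path

# Hypothetical patterns; list whatever your project does not need in the snapshot.
IGNORE_PATTERNS = ["data/", "*.ckpt", "__pycache__/", ".ipynb_checkpoints/"]


def write_amlignore(project_folder, patterns):
    """Write an .amlignore file (gitignore syntax) into the source directory."""
    ignore_path = Path(project_folder) / ".amlignore"
    ignore_path.write_text("\n".join(patterns) + "\n", encoding="utf-8")
    return ignore_path


# A temporary directory stands in for the real project folder.
project_folder = tempfile.mkdtemp()
ignore_file = write_amlignore(project_folder, IGNORE_PATTERNS)
print(ignore_file.read_text(encoding="utf-8"))
```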

Important

Special folders: Two folders, outputs and logs, receive special treatment by Azure Machine Learning. During training, when you write files to folders named outputs and logs that are relative to the root directory (./outputs and ./logs, respectively), the files are automatically uploaded to your run history so that you have access to them once your run is finished.

To create artifacts during training (such as model files, checkpoints, data files, or plotted images), write them to the ./outputs folder.

Similarly, you can write any logs from your training run to the ./logs folder. To use Azure Machine Learning's TensorBoard integration, make sure you write your TensorBoard logs to this folder. While your run is in progress, you will be able to launch TensorBoard and stream these logs. Later, you will also be able to restore the logs from any of your previous runs.

For example, to download a file written to the outputs folder to your local machine after your remote training run: run.download_file(name='outputs/my_output_file', output_file_path='my_destination_path')
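Inside a training script, writing to the two special folders could be sketched as follows. The file names train.log and model.json, and the dictionary standing in for a trained model, are placeholders rather than anything Azure Machine Learning requires.

```python
import json
import logging
import os

# Both folders are relative to the directory the script runs from.
os.makedirs("outputs", exist_ok=True)
os.makedirs("logs", exist_ok=True)

# Send log records to a file under ./logs (the file name is a placeholder).
logger = logging.getLogger("train")
logger.setLevel(logging.INFO)
handler = logging.FileHandler(os.path.join("logs", "train.log"))
logger.addHandler(handler)
logger.info("training started")

# Stand-in for a trained model, saved as an artifact under ./outputs.
model = {"weights": [0.1, 0.2, 0.3]}
model_path = os.path.join("outputs", "model.json")
with open(model_path, "w", encoding="utf-8") as f:
    json.dump(model, f)

logger.info("model saved to %s", model_path)
handler.flush()
```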

Git tracking and integration

When you start a training run where the source directory is a local Git repository, information about the repository is stored in the run history. For more information, see Git integration for Azure Machine Learning.

Notebook examples

See these notebooks for examples of configuring runs for various training scenarios:

Learn how to run notebooks by following the article Use Jupyter notebooks to explore this service.

Next steps