Migrating from Estimators to ScriptRunConfig

Until now, there have been multiple ways to configure a training job in Azure Machine Learning via the SDK, including Estimators, ScriptRunConfig, and the lower-level RunConfiguration. To address this ambiguity and inconsistency, we are simplifying the job configuration process in Azure ML. ScriptRunConfig is now the recommended option for configuring training jobs.

Estimators are deprecated as of the 1.19.0 release of the Python SDK. You should also generally avoid explicitly instantiating a RunConfiguration object yourself, and instead configure your job using the ScriptRunConfig class.

This article covers common considerations when migrating from Estimators to ScriptRunConfig.

Important

To migrate from Estimators to ScriptRunConfig, make sure you are using version 1.15.0 or later of the Python SDK.

ScriptRunConfig documentation and samples

Azure Machine Learning documentation and samples have been updated to use ScriptRunConfig for job configuration and submission.

For information on using ScriptRunConfig, refer to the following documentation:

In addition, refer to the following samples and tutorials:

Defining the training environment

While the various framework estimators have preconfigured environments that are backed by Docker images, the Dockerfiles for these images are private, so you have little visibility into what these environments contain. In addition, the estimators take environment-related configurations as individual parameters (such as pip_packages and custom_docker_image) on their respective constructors.

When using ScriptRunConfig, all environment-related configurations are encapsulated in the Environment object that is passed to the environment parameter of the ScriptRunConfig constructor. To configure a training job, provide an environment that has all the dependencies required by your training script. If no environment is provided, Azure ML uses one of the Azure ML base images, specifically the one defined by azureml.core.environment.DEFAULT_CPU_IMAGE, as the default environment. There are a couple of ways to provide an environment.

Here is an example of using the curated PyTorch 1.6 environment for training:

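from azureml.core import Workspace, ScriptRunConfig, Environment

# Load the workspace from a local config.json (assumed to exist)
ws = Workspace.from_config()

# Retrieve the curated PyTorch 1.6 GPU environment by name
curated_env_name = 'AzureML-PyTorch-1.6-GPU'
pytorch_env = Environment.get(workspace=ws, name=curated_env_name)

# Look up an existing compute target and configure the training job
compute_target = ws.compute_targets['my-cluster']
src = ScriptRunConfig(source_directory='.',
                      script='train.py',
                      compute_target=compute_target,
                      environment=pytorch_env)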

If you want to specify environment variables that will be set on the process where the training script is executed, use the Environment object:

# myenv is the Environment object that you pass to ScriptRunConfig
myenv.environment_variables = {"MESSAGE":"Hello from Azure Machine Learning"}
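
On the training-script side, the variable can be read like any other process environment variable. The following is a minimal sketch, assuming the MESSAGE variable from the example above; the helper name read_message and its fallback value are illustrative and not part of the SDK:

```python
import os


def read_message(default="(MESSAGE not set)"):
    """Return the MESSAGE environment variable, or a fallback when it is absent."""
    # Variables set on the Environment object are visible to the training
    # process through os.environ, like any other environment variable.
    return os.environ.get("MESSAGE", default)


if __name__ == "__main__":
    print(read_message())
```

Supplying a fallback keeps the script usable when it is run locally, where the variable may not be set.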

For information on configuring and managing Azure ML environments, see:

Using data for training

Datasets

If you are using an Azure ML dataset for training, pass the dataset as an argument to your script using the arguments parameter. Your training script then receives the data path (the mount point or the download path) through its command-line arguments.

The following example configures a training job in which the FileDataset mnist_ds is mounted on the remote compute:

src = ScriptRunConfig(source_directory='.',
                      script='train.py',
                      arguments=['--data-folder', mnist_ds.as_mount()], # or mnist_ds.as_download() to download
                      compute_target=compute_target,
                      environment=pytorch_env)
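
Inside train.py, the path arrives as an ordinary command-line argument. The following is a minimal sketch of the script side, assuming the --data-folder argument name from the example above; the helper names parse_args and list_data_files are illustrative and not part of the SDK:

```python
import argparse
import os


def parse_args(argv=None):
    """Parse the arguments that the job configuration passes to train.py."""
    parser = argparse.ArgumentParser()
    # At run time the dataset argument is a plain path string
    # (the mount point or the download location).
    parser.add_argument("--data-folder", type=str, dest="data_folder",
                        help="path where the dataset is mounted or downloaded")
    return parser.parse_args(argv)


def list_data_files(data_folder):
    """Return the sorted relative paths of all files under data_folder."""
    found = []
    for root, _dirs, files in os.walk(data_folder):
        for name in files:
            full_path = os.path.join(root, name)
            found.append(os.path.relpath(full_path, data_folder))
    return sorted(found)


if __name__ == "__main__":
    args = parse_args()
    print("Data folder:", args.data_folder)
    if args.data_folder:
        print(list_data_files(args.data_folder))
```

Because the script only sees a path, the same code works whether the dataset was mounted or downloaded.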

DataReference (old)

We recommend using Azure ML Datasets rather than the older DataReference approach. If you are still using DataReferences for any reason, you must configure your job as follows:

# if you want to pass a DataReference object, such as the below:
datastore = ws.get_default_datastore()
data_ref = datastore.path('./foo').as_mount()

src = ScriptRunConfig(source_directory='.',
                      script='train.py',
                      arguments=['--data-folder', str(data_ref)], # cast the DataReference object to str
                      compute_target=compute_target,
                      environment=pytorch_env)
# Assign a dict of the DataReference(s) you want to use to the `data_references`
# attribute of the ScriptRunConfig's underlying RunConfiguration object.
src.run_config.data_references = {data_ref.data_reference_name: data_ref.to_config()}

For more information on using data for training, see:

Distributed training

If you need to configure a distributed job for training, specify the distributed_job_config parameter in the ScriptRunConfig constructor. Pass in an MpiConfiguration, PyTorchConfiguration, or TensorflowConfiguration for distributed jobs of the respective types.

The following example configures a PyTorch training job to use distributed training with MPI/Horovod:

from azureml.core.runconfig import MpiConfiguration

src = ScriptRunConfig(source_directory='.',
                      script='train.py',
                      compute_target=compute_target,
                      environment=pytorch_env,
                      distributed_job_config=MpiConfiguration(node_count=2, process_count_per_node=2))
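
Within the training script, each process usually needs to know its own rank. The following is a minimal sketch, assuming the job is launched with Open MPI, which sets the OMPI_COMM_WORLD_* variables (other MPI implementations use different names); the helper name mpi_process_info is illustrative and not part of the SDK. With Horovod, the equivalent values would normally come from Horovod itself after initialization.

```python
import os


def mpi_process_info(environ=os.environ):
    """Read rank information from Open MPI environment variables.

    Falls back to a single-process layout when the variables are absent,
    for example when the script runs locally without an MPI launcher.
    """
    return {
        "rank": int(environ.get("OMPI_COMM_WORLD_RANK", 0)),
        "local_rank": int(environ.get("OMPI_COMM_WORLD_LOCAL_RANK", 0)),
        "world_size": int(environ.get("OMPI_COMM_WORLD_SIZE", 1)),
    }


if __name__ == "__main__":
    info = mpi_process_info()
    print("rank {rank} of {world_size} (local rank {local_rank})".format(**info))
```

With node_count=2 and process_count_per_node=2 as configured above, the job runs four processes in total, so the world size would be 4.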

For more information, see:

Miscellaneous

If you need to access the underlying RunConfiguration object of a ScriptRunConfig for any reason, you can do so as follows:

src.run_config

Next steps