Train models with Azure Machine Learning using an estimator

APPLIES TO: Basic edition, Enterprise edition

With Azure Machine Learning, you can easily submit your training script to various compute targets by using a RunConfiguration object and a ScriptRunConfig object. That pattern gives you a lot of flexibility and maximum control.

The estimator class makes it easier to train models with deep learning and reinforcement learning. It provides a high-level abstraction that lets you easily construct a run configuration. You can create a generic Estimator and use it to submit a training script that uses any learning framework you choose (such as scikit-learn) to any compute target you choose, whether that is your local machine, a single VM in Azure, or a GPU cluster in Azure. For PyTorch, TensorFlow, Chainer, and reinforcement learning tasks, Azure Machine Learning also provides dedicated PyTorch, TensorFlow, Chainer, and reinforcement learning estimators that simplify using these frameworks.

Train with an estimator

Once you've created your workspace and set up your development environment, training a model in Azure Machine Learning involves the following steps:

  1. Create a remote compute target (you can also use your local computer as the compute target)
  2. Upload your training data to a datastore (optional)
  3. Create your training script
  4. Create an Estimator object
  5. Submit the estimator to an experiment object under the workspace

This article focuses on steps 4-5. For an example of steps 1-3, refer to the train a model tutorial.

Single-node training

Use an Estimator for a single-node training run on remote compute in Azure for a scikit-learn model. You should have already created your compute target object compute_target and your FileDataset object ds.

from azureml.train.estimator import Estimator

script_params = {
    # to mount files referenced by mnist dataset
    '--data-folder': ds.as_named_input('mnist').as_mount(),
    '--regularization': 0.8
}

sk_est = Estimator(source_directory='./my-sklearn-proj',
                   script_params=script_params,
                   compute_target=compute_target,
                   entry_script='train.py',
                   conda_packages=['scikit-learn'])

This code snippet specifies the following parameters to the Estimator constructor.

Parameter | Description
source_directory | Local directory that contains all of the code needed for the training job. This folder is copied from your local machine to the remote compute.
script_params | Dictionary specifying the command-line arguments to pass to your training script entry_script, in the form of <command-line argument, value> pairs. To specify a verbose flag in script_params, use <command-line argument, "">.
compute_target | Remote compute target that your training script will run on, in this case an Azure Machine Learning Compute (AmlCompute) cluster. (Although an AmlCompute cluster is the most commonly used target, you can also choose other compute target types, such as Azure VMs or even your local computer.)
entry_script | File path (relative to source_directory) of the training script to be run on the remote compute. This file, and any additional files it depends on, should be located in this folder.
conda_packages | List of Python packages, needed by your training script, to install via conda.
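On the receiving side, the training script has to parse the arguments that script_params supplies. The following is a minimal sketch of how a train.py might do that with argparse. The function name parse_args, the default regularization value, and the --verbose flag are illustrative assumptions, and the sketch assumes that a flag given as <'--verbose', ''> reaches the script without a value.

```python
import argparse


def parse_args(argv=None):
    """Parse the command-line arguments that script_params passes to train.py."""
    parser = argparse.ArgumentParser(description="Hypothetical train.py arguments")
    # Location where the mounted dataset is made available to the script.
    parser.add_argument("--data-folder", dest="data_folder", type=str, required=True)
    # Numeric hyperparameter; the default value here is an arbitrary assumption.
    parser.add_argument("--regularization", type=float, default=0.5)
    # Hypothetical flag, assumed to arrive without a value.
    parser.add_argument("--verbose", action="store_true")
    return parser.parse_args(argv)
```

Inside train.py, calling parse_args() with no argument reads sys.argv, so args.data_folder and args.regularization would hold the values supplied through script_params.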

The constructor has another parameter, pip_packages, that you use for any pip packages your script needs.
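As a sketch of how conda_packages and pip_packages can be combined, the helper below collects the constructor arguments in a plain dictionary and only imports the SDK when the estimator is actually built. The helper names and the choice of joblib as a pip package are assumptions made for illustration.

```python
def estimator_settings(compute_target, script_params):
    """Collect Estimator constructor arguments in a plain dictionary."""
    return {
        "source_directory": "./my-sklearn-proj",
        "script_params": script_params,
        "compute_target": compute_target,
        "entry_script": "train.py",
        "conda_packages": ["scikit-learn"],  # installed via conda
        "pip_packages": ["joblib"],          # installed via pip (hypothetical choice)
    }


def build_estimator(compute_target, script_params):
    """Create the Estimator; requires the Azure Machine Learning SDK."""
    # Imported here so the settings can be inspected without the SDK installed.
    from azureml.train.estimator import Estimator

    return Estimator(**estimator_settings(compute_target, script_params))
```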

Now that you've created your Estimator object, submit the training job to run on the remote compute by calling the submit function on your Experiment object experiment.

run = experiment.submit(sk_est)
print(run.get_portal_url())
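Submission returns immediately, so a script that needs the results has to wait for the run. The following sketch wraps the submit call with wait_for_completion and get_metrics from the SDK's Run class. The wrapper function name is an assumption, and which metrics appear depends on what the training script logs.

```python
def submit_and_wait(experiment, estimator):
    """Submit a training job, block until it finishes, and return its metrics."""
    run = experiment.submit(estimator)
    print(run.get_portal_url())
    # Stream the run's log output to the console while waiting.
    run.wait_for_completion(show_output=True)
    # Metrics that the training script logged during the run, if any.
    return run.get_metrics()
```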

Important

Special folders: Two folders, outputs and logs, receive special treatment from Azure Machine Learning. During training, when you write files to folders named outputs and logs that are relative to the root directory (./outputs and ./logs, respectively), the files are automatically uploaded to your run history so that you can access them once your run is finished.

To create artifacts during training (such as model files, checkpoints, data files, or plotted images), write them to the ./outputs folder.

Similarly, you can write any logs from your training run to the ./logs folder. To use Azure Machine Learning's TensorBoard integration, make sure you write your TensorBoard logs to this folder. While your run is in progress, you can launch TensorBoard and stream these logs. Later, you can also restore the logs from any of your previous runs.

For example, to download a file written to the outputs folder to your local machine after your remote training run: run.download_file(name='outputs/my_output_file', output_file_path='my_destination_path')
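Inside the training script, using the special folders amounts to ordinary file writes. The sketch below shows one way to do it. The function name, the file names model.json and train.log, and the JSON format are illustrative assumptions, not something the service requires.

```python
import json
import os


def save_artifacts(model_params, log_lines, root="."):
    """Write a model file to <root>/outputs and a log file to <root>/logs."""
    outputs_dir = os.path.join(root, "outputs")
    logs_dir = os.path.join(root, "logs")
    # The folders may not exist yet when the script starts.
    os.makedirs(outputs_dir, exist_ok=True)
    os.makedirs(logs_dir, exist_ok=True)

    model_path = os.path.join(outputs_dir, "model.json")
    with open(model_path, "w") as f:
        json.dump(model_params, f)

    log_path = os.path.join(logs_dir, "train.log")
    with open(log_path, "w") as f:
        f.write("\n".join(log_lines) + "\n")

    return model_path, log_path
```

With the default root of ".", the files land in ./outputs and ./logs, which are the folders described above.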

Distributed training and custom Docker images

There are two additional training scenarios you can carry out with the Estimator:

  • Using a custom Docker image
  • Distributed training on a multi-node cluster

The following code shows how to carry out distributed training for a Keras model. In addition, instead of using the default Azure Machine Learning images, it specifies a custom Docker image from Docker Hub, continuumio/miniconda, for training.

You should have already created your compute target object compute_target. Create the estimator as follows:

from azureml.train.estimator import Estimator
from azureml.core.runconfig import MpiConfiguration

estimator = Estimator(source_directory='./my-keras-proj',
                      compute_target=compute_target,
                      entry_script='train.py',
                      node_count=2,
                      process_count_per_node=1,
                      distributed_training=MpiConfiguration(),
                      conda_packages=['tensorflow', 'keras'],
                      custom_docker_image='continuumio/miniconda')

The code above uses the following additional parameters of the Estimator constructor:

Parameter | Description | Default
custom_docker_image | Name of the image you want to use. Only provide images available in public Docker repositories (in this case, Docker Hub). To use an image from a private Docker repository, use the constructor's environment_definition parameter instead. | None
node_count | Number of nodes to use for your training job. | 1
process_count_per_node | Number of processes (or "workers") to run on each node. In this example, one process runs on each node. | 1
distributed_training | MpiConfiguration object for launching distributed training using the MPI backend. | None
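For the private-repository case mentioned for custom_docker_image, the following is a minimal sketch of preparing an Environment object for the environment_definition parameter. It assumes the Environment class's docker.base_image and docker.base_image_registry settings from the SDK. The function names, and any image name, registry address, and credentials you pass in, are hypothetical.

```python
def configure_private_image(env, image, registry_address, username, password):
    """Point an Environment-like object at an image in a private registry."""
    env.docker.enabled = True
    env.docker.base_image = image
    # Registry details needed to pull the private image.
    env.docker.base_image_registry.address = registry_address
    env.docker.base_image_registry.username = username
    env.docker.base_image_registry.password = password
    return env


def build_estimator_with_environment(compute_target, env):
    """Create an Estimator that takes its image from the given environment."""
    # Imported here because it requires the Azure Machine Learning SDK.
    from azureml.train.estimator import Estimator

    return Estimator(source_directory='./my-keras-proj',
                     compute_target=compute_target,
                     entry_script='train.py',
                     environment_definition=env)
```

A caller would typically create the object with Environment(name=...) from azureml.core and pass it through configure_private_image before building the estimator. Avoid hard-coding the password; read it from a secure source instead.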

Finally, submit the training job:

run = experiment.submit(estimator)
print(run.get_portal_url())
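With node_count=2 and process_count_per_node=1, two copies of train.py are launched, and each copy usually needs to know which one it is. The sketch below assumes the job is launched with Open MPI, which sets the OMPI_COMM_WORLD_RANK and OMPI_COMM_WORLD_SIZE environment variables; other MPI implementations use different variables. The helper names and the round-robin sharding are illustrative assumptions.

```python
import os


def mpi_rank_and_size(environ=os.environ):
    """Return (rank, world size), falling back to a single process."""
    rank = int(environ.get("OMPI_COMM_WORLD_RANK", 0))
    size = int(environ.get("OMPI_COMM_WORLD_SIZE", 1))
    return rank, size


def shard(items, rank, size):
    """Give each process every size-th item, starting at its own rank."""
    return items[rank::size]
```

For example, each process could call shard(file_list, *mpi_rank_and_size()) to pick the subset of input files it should read.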

Registering a model

Once you've trained the model, you can save and register it to your workspace. Model registration lets you store and version your models in your workspace to simplify model management and deployment.

Running the following code registers the model to your workspace and makes it available to reference by name in remote compute contexts or deployment scripts. See register_model in the reference docs for more information and additional parameters.

model = run.register_model(model_name='sklearn-sample', model_path=None)
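If the training script saved the model under the outputs folder, model_path can point at that file explicitly. The following is a minimal sketch; the helper name is an assumption, and the file name is whatever your script actually wrote.

```python
def register_from_outputs(run, model_name, file_name):
    """Register a model file that the run wrote to its outputs folder."""
    # Path of the file relative to the run's stored files.
    model_path = "outputs/" + file_name
    return run.register_model(model_name=model_name, model_path=model_path)
```

A call might look like register_from_outputs(run, 'sklearn-sample', 'sklearn_model.pkl'), where sklearn_model.pkl is a hypothetical file name.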

GitHub tracking and integration

When you start a training run where the source directory is a local Git repository, information about the repository is stored in the run history. For more information, see Git integration for Azure Machine Learning.
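One way to inspect the stored repository information is through the run's properties. The sketch below filters a properties dictionary, such as the one returned by run.get_properties(), for Git-related entries. It assumes these entries use an "azureml.git." key prefix, which you should verify against the Git integration article; the helper name is hypothetical.

```python
def git_properties(run_properties, prefix="azureml.git."):
    """Return the Git-related entries of a run's properties, without the prefix."""
    return {
        key[len(prefix):]: value
        for key, value in run_properties.items()
        if key.startswith(prefix)
    }
```

For a submitted run, git_properties(run.get_properties()) would then return only the Git-related entries, if any were recorded.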

Examples

For a notebook that shows the basics of the estimator pattern, see:

For a notebook that trains a scikit-learn model by using an estimator, see:

For notebooks on training models by using deep-learning-framework-specific estimators, see:

Learn how to run notebooks by following the article Use Jupyter notebooks to explore this service.

Next steps