Train TensorFlow models at scale with Azure Machine Learning

In this article, learn how to run your TensorFlow training scripts at scale using Azure Machine Learning.

This example trains and registers a TensorFlow model to classify handwritten digits using a deep neural network (DNN).

Whether you're developing a TensorFlow model from the ground up or bringing an existing model into the cloud, you can use Azure Machine Learning to scale out open-source training jobs to build, deploy, version, and monitor production-grade models.

Prerequisites

Run this code in either of these environments:

  • Azure Machine Learning compute instance - no downloads or installation necessary

    • Complete the Tutorial: Setup environment and workspace to create a dedicated notebook server pre-loaded with the SDK and the sample repository.
    • In the samples deep learning folder on the notebook server, find a completed and expanded notebook by navigating to this directory: how-to-use-azureml > ml-frameworks > tensorflow > train-hyperparameter-tune-deploy-with-tensorflow.
  • Your own Jupyter Notebook server

    You can also find a completed Jupyter Notebook version of this guide on the GitHub samples page. The notebook includes expanded sections covering intelligent hyperparameter tuning, model deployment, and notebook widgets.

Set up the experiment

This section sets up the training experiment by loading the required Python packages, initializing a workspace, creating the compute target, and defining the training environment.

Import packages

First, import the necessary Python libraries.

import os
import urllib
import shutil
import azureml

from azureml.core import Experiment
from azureml.core import Workspace, Run
from azureml.core import Environment

from azureml.core.compute import ComputeTarget, AmlCompute
from azureml.core.compute_target import ComputeTargetException

Initialize a workspace

The Azure Machine Learning workspace is the top-level resource for the service. It provides you with a centralized place to work with all the artifacts you create. In the Python SDK, you can access the workspace artifacts by creating a workspace object.

Create a workspace object from the config.json file created in the prerequisites section.

ws = Workspace.from_config()

Create a file dataset

A FileDataset object references one or multiple files in your workspace datastore or public URLs. The files can be of any format, and the class provides you with the ability to download or mount the files to your compute. By creating a FileDataset, you create a reference to the data source location. If you applied any transformations to the dataset, they will be stored in the dataset as well. The data remains in its existing location, so no extra storage cost is incurred. For more information, see the how-to guide on the Dataset package.

from azureml.core.dataset import Dataset

web_paths = [
            'http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz',
            'http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz',
            'http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz',
            'http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz'
            ]
dataset = Dataset.File.from_files(path=web_paths)

Use the register() method to register the dataset to your workspace so it can be shared with others, reused across various experiments, and referred to by name in your training script.

dataset = dataset.register(workspace=ws,
                           name='mnist-dataset',
                           description='training and test dataset',
                           create_new_version=True)

# list the files referenced by dataset
dataset.to_path()
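
For example, a later experiment or script can retrieve the registered dataset by name instead of re-creating it from the web paths. A minimal sketch, assuming the workspace object ws and the dataset name registered above:

from azureml.core.dataset import Dataset

# Retrieve the latest version of the registered dataset by name
mnist_ds = Dataset.get_by_name(workspace=ws, name='mnist-dataset')

# List the files it references
print(mnist_ds.to_path())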

Create a compute target

Create a compute target for your TensorFlow job to run on. In this example, create a GPU-enabled Azure Machine Learning compute cluster.

cluster_name = "gpu-cluster"

try:
    compute_target = ComputeTarget(workspace=ws, name=cluster_name)
    print('Found existing compute target')
except ComputeTargetException:
    print('Creating a new compute target...')
    compute_config = AmlCompute.provisioning_configuration(vm_size='STANDARD_NC6', 
                                                           max_nodes=4)

    compute_target = ComputeTarget.create(ws, cluster_name, compute_config)

    compute_target.wait_for_completion(show_output=True, min_node_count=None, timeout_in_minutes=20)

For more information on compute targets, see the what is a compute target article.

Define your environment

To define the Azure ML Environment that encapsulates your training script's dependencies, you can either define a custom environment or use an Azure ML curated environment.

Use a curated environment

Azure ML provides prebuilt, curated environments if you don't want to define your own environment. Azure ML has several CPU and GPU curated environments for TensorFlow, corresponding to different versions of TensorFlow. For more information, see the curated environments documentation.

If you want to use a curated environment, you can run the following command instead:

curated_env_name = 'AzureML-TensorFlow-2.2-GPU'
tf_env = Environment.get(workspace=ws, name=curated_env_name)

To see the packages included in the curated environment, you can write out the conda dependencies to disk:

tf_env.save_to_directory(path=curated_env_name)

Make sure the curated environment includes all the dependencies required by your training script. If not, you will have to modify the environment to include the missing dependencies. Note that if the environment is modified, you will have to give it a new name, as the 'AzureML' prefix is reserved for curated environments. If you modified the conda dependencies YAML file, you can create a new environment from it with a new name, for example:

tf_env = Environment.from_conda_specification(name='tensorflow-2.2-gpu', file_path='./conda_dependencies.yml')

If you instead modified the curated environment object directly, you can clone that environment with a new name:

tf_env = tf_env.clone(new_name='tensorflow-2.2-gpu')

Create a custom environment

You can also create your own Azure ML environment that encapsulates your training script's dependencies.

First, define your conda dependencies in a YAML file; in this example the file is named conda_dependencies.yml.

channels:
- conda-forge
dependencies:
- python=3.6.2
- pip:
  - azureml-defaults
  - tensorflow-gpu==2.2.0

Create an Azure ML environment from this conda environment specification. The environment will be packaged into a Docker container at runtime.

By default, if no base image is specified, Azure ML will use a CPU image, azureml.core.environment.DEFAULT_CPU_IMAGE, as the base image. Since this example runs training on a GPU cluster, you will need to specify a GPU base image that has the necessary GPU drivers and dependencies. Azure ML maintains a set of base images published on Microsoft Container Registry (MCR) that you can use; see the Azure/AzureML-Containers GitHub repo for more information.

tf_env = Environment.from_conda_specification(name='tensorflow-2.2-gpu', file_path='./conda_dependencies.yml')

# Specify a GPU base image
tf_env.docker.enabled = True
tf_env.docker.base_image = 'mcr.microsoft.com/azureml/openmpi3.1.2-cuda10.1-cudnn7-ubuntu18.04'

Tip

Optionally, you can capture all your dependencies directly in a custom Docker image or Dockerfile, and create your environment from that. For more information, see Train with custom image.
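
As a minimal sketch of that approach, assuming a hypothetical image my-registry.azurecr.io/my-training-image:latest that already contains Python, TensorFlow, and the azureml-defaults package:

from azureml.core import Environment

custom_env = Environment(name='tensorflow-custom-image')
custom_env.docker.enabled = True
# Hypothetical image; replace with your own registry/image/tag
custom_env.docker.base_image = 'my-registry.azurecr.io/my-training-image:latest'
# Use the Python environment baked into the image instead of having Azure ML build one
custom_env.python.user_managed_dependencies = True

If the image lives in a private registry, you would also need to supply the registry address and credentials via the environment's docker.base_image_registry property.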

For more information on creating and using environments, see Create and use software environments in Azure Machine Learning.

Configure and submit your training run

Create a ScriptRunConfig

Create a ScriptRunConfig object to specify the configuration details of your training job, including your training script, the environment to use, and the compute target to run on. Any arguments to your training script will be passed via the command line if specified in the arguments parameter.

from azureml.core import ScriptRunConfig

# Folder containing the training script tf_mnist.py; adjust to your project layout
script_folder = './tf-mnist'

args = ['--data-folder', dataset.as_mount(),
        '--batch-size', 64,
        '--first-layer-neurons', 256,
        '--second-layer-neurons', 128,
        '--learning-rate', 0.01]

src = ScriptRunConfig(source_directory=script_folder,
                      script='tf_mnist.py',
                      arguments=args,
                      compute_target=compute_target,
                      environment=tf_env)

Warning

Azure Machine Learning runs training scripts by copying the entire source directory. If you have sensitive data that you don't want to upload, use a .ignore file or don't include it in the source directory. Instead, access your data using an Azure ML dataset.

For more information on configuring jobs with ScriptRunConfig, see Configure and submit training runs.

Warning

If you were previously using the TensorFlow estimator to configure your TensorFlow training jobs, note that Estimators have been deprecated as of the 1.19.0 SDK release. With Azure ML SDK >= 1.15.0, ScriptRunConfig is the recommended way to configure training jobs, including those using deep learning frameworks. For common migration questions, see the Estimator to ScriptRunConfig migration guide.

Submit a run

The Run object provides the interface to the run history while the job is running and after it has completed.

run = Experiment(workspace=ws, name='Tutorial-TF-Mnist').submit(src)
run.wait_for_completion(show_output=True)

What happens during run execution

As the run is executed, it goes through the following stages:

  • Preparing: A Docker image is created according to the environment defined. The image is uploaded to the workspace's container registry and cached for later runs. Logs are also streamed to the run history and can be viewed to monitor progress. If a curated environment is specified instead, the cached image backing that curated environment will be used.

  • Scaling: If the cluster requires more nodes to execute the run than are currently available, the cluster attempts to scale up.

  • Running: All scripts in the script folder are uploaded to the compute target, data stores are mounted or copied, and the script is executed. Outputs from stdout and the ./logs folder are streamed to the run history and can be used to monitor the run, as shown in the sketch after this list.

  • Post-Processing: The ./outputs folder of the run is copied over to the run history.
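
If you want to observe these stages programmatically instead of blocking on wait_for_completion(), you can poll the run object from the submission step. A minimal sketch:

import time

# Poll the run while it moves through Preparing, Scaling, Running, and beyond
while run.get_status() not in ['Completed', 'Failed', 'Canceled']:
    print(run.get_status())
    time.sleep(30)

# Inspect the final state and links to the streamed logs
print(run.get_details()['status'])

In a notebook, the RunDetails widget from the azureml-widgets package shows the same information interactively.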

Register or download a model

Once you've trained the model, you can register it to your workspace. Model registration lets you store and version your models in your workspace to simplify model management and deployment. Optional: by specifying the parameters model_framework, model_framework_version, and resource_configuration, no-code model deployment becomes available. This allows you to deploy your model directly as a web service from the registered model, and the ResourceConfiguration object defines the compute resource for the web service.

from azureml.core import Model
from azureml.core.resource_configuration import ResourceConfiguration

model = run.register_model(model_name='tf-mnist', 
                           model_path='outputs/model',
                           model_framework=Model.Framework.TENSORFLOW,
                           model_framework_version='2.0',
                           resource_configuration=ResourceConfiguration(cpu=1, memory_in_gb=0.5))

You can also download a local copy of the model by using the Run object. In the training script tf_mnist.py, a TensorFlow saver object persists the model to a local folder (local to the compute target). You can use the Run object to download a copy.

# Create a model folder in the current directory
os.makedirs('./model', exist_ok=True)
run.download_files(prefix='outputs/model', output_directory='./model', append_prefix=False)

Distributed training

Azure Machine Learning also supports multi-node distributed TensorFlow jobs so that you can scale your training workloads. You can easily run distributed TensorFlow jobs, and Azure ML will manage the orchestration for you.

Azure ML supports running distributed TensorFlow jobs with both Horovod and TensorFlow's built-in distributed training API.

Horovod

Horovod is an open-source, all-reduce framework for distributed training developed by Uber. It offers an easy path to writing distributed TensorFlow code for training.

Your training code will have to be instrumented with Horovod for distributed training. For more information on using Horovod with TensorFlow, refer to the Horovod documentation.
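
As a rough illustration of what that instrumentation looks like with TensorFlow 2, the sketch below shows the typical Horovod steps: initialize Horovod, pin each process to one GPU, scale the learning rate by the number of workers, wrap the gradient tape, and broadcast the initial state from rank 0. The model and loss are hypothetical placeholders, not the MNIST script from this article.

import tensorflow as tf
import horovod.tensorflow as hvd

# Initialize Horovod and pin each process to a single GPU
hvd.init()
gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
    tf.config.experimental.set_visible_devices(gpus[hvd.local_rank()], 'GPU')

model = tf.keras.Sequential([tf.keras.layers.Dense(10)])  # placeholder model
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
# Scale the learning rate by the number of workers
opt = tf.keras.optimizers.Adam(0.001 * hvd.size())

@tf.function
def train_step(images, labels, first_batch):
    with tf.GradientTape() as tape:
        loss = loss_fn(labels, model(images, training=True))
    # Average gradients across all workers
    tape = hvd.DistributedGradientTape(tape)
    grads = tape.gradient(loss, model.trainable_variables)
    opt.apply_gradients(zip(grads, model.trainable_variables))
    # After the first step, sync variables from rank 0 to the other workers
    if first_batch:
        hvd.broadcast_variables(model.variables, root_rank=0)
        hvd.broadcast_variables(opt.variables(), root_rank=0)
    return loss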

Additionally, make sure your training environment includes the horovod package. If you are using a TensorFlow curated environment, horovod is already included as one of the dependencies. If you are using your own environment, make sure the horovod dependency is included, for example:

channels:
- conda-forge
dependencies:
- python=3.6.2
- pip:
  - azureml-defaults
  - tensorflow-gpu==2.2.0
  - horovod==0.19.5

In order to execute a distributed job using MPI/Horovod on Azure ML, you must specify an MpiConfiguration to the distributed_job_config parameter of the ScriptRunConfig constructor. The following code configures a 2-node distributed job running one process per node. If you would also like to run multiple processes per node (that is, if your cluster SKU has multiple GPUs), additionally specify the process_count_per_node parameter in MpiConfiguration (the default is 1).

from azureml.core import ScriptRunConfig
from azureml.core.runconfig import MpiConfiguration

# Folder containing the training script tf_horovod_word2vec.py; adjust to your project layout
project_folder = './tf-distr-horovod'

src = ScriptRunConfig(source_directory=project_folder,
                      script='tf_horovod_word2vec.py',
                      arguments=['--input_data', dataset.as_mount()],
                      compute_target=compute_target,
                      environment=tf_env,
                      distributed_job_config=MpiConfiguration(node_count=2))

For a full tutorial on running distributed TensorFlow with Horovod on Azure ML, see Distributed TensorFlow with Horovod.

tf.distribute

If you are using native distributed TensorFlow in your training code, for example TensorFlow 2.x's tf.distribute.Strategy API, you can also launch the distributed job via Azure ML.

To do so, specify a TensorflowConfiguration to the distributed_job_config parameter of the ScriptRunConfig constructor. If you are using tf.distribute.experimental.MultiWorkerMirroredStrategy, specify a worker_count in the TensorflowConfiguration corresponding to the number of nodes for your training job.

import os
from azureml.core import ScriptRunConfig
from azureml.core.runconfig import TensorflowConfiguration

distr_config = TensorflowConfiguration(worker_count=2, parameter_server_count=0)

model_path = os.path.join("./outputs", "keras-model")

# Folder containing the training script train.py; adjust to your project layout
source_dir = './tf-distr-md'

src = ScriptRunConfig(source_directory=source_dir,
                      script='train.py',
                      arguments=["--epochs", 30, "--model-dir", model_path],
                      compute_target=compute_target,
                      environment=tf_env,
                      distributed_job_config=distr_config)

In TensorFlow, the TF_CONFIG environment variable is required for training on multiple machines. Azure ML will configure and set the TF_CONFIG variable appropriately for each worker before executing your training script. You can access TF_CONFIG from your training script if you need to, via os.environ['TF_CONFIG'].

Example structure of TF_CONFIG set on a chief worker node:

TF_CONFIG='{
    "cluster": {
        "worker": ["host0:2222", "host1:2222"]
    },
    "task": {"type": "worker", "index": 0},
    "environment": "cloud"
}'
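
If your training script needs the cluster topology, for example to decide whether the current process is the chief worker, you can parse TF_CONFIG yourself. A minimal sketch:

import json
import os

tf_config = json.loads(os.environ.get('TF_CONFIG', '{}'))
task = tf_config.get('task', {})
# By convention, worker 0 acts as the chief when no explicit chief task is set
is_chief = task.get('type') == 'worker' and task.get('index') == 0
print('Running as chief:', is_chief)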

If your training script uses the parameter server strategy for distributed training (that is, for legacy TensorFlow 1.x), you will also need to specify the number of parameter servers to use in the job, for example distr_config = TensorflowConfiguration(worker_count=2, parameter_server_count=1).

Deploy a TensorFlow model

The deployment how-to contains a section on registering models, but since you already have a registered model, you can skip directly to creating a compute target for deployment.

(Preview) No-code model deployment

Instead of the traditional deployment route, you can also use the no-code deployment feature (preview) for TensorFlow. By registering your model as shown above with the model_framework, model_framework_version, and resource_configuration parameters, you can simply use the deploy() static function to deploy your model.

service = Model.deploy(ws, "tensorflow-web-service", [model])
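
Deployment is asynchronous; a brief sketch of waiting for it to complete and locating the endpoint follows. The commented-out request payload is hypothetical, since the expected schema depends on your model's inputs.

service.wait_for_deployment(show_output=True)
print(service.scoring_uri)

# import json
# response = service.run(json.dumps({'data': [[0.0] * 784]}))  # hypothetical payload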

The full how-to covers deployment in Azure Machine Learning in greater depth.

Next steps

In this article, you trained and registered a TensorFlow model, and learned about options for deployment. See these other articles to learn more about Azure Machine Learning.