Build a TensorFlow deep learning model at scale with Azure Machine Learning

APPLIES TO: Basic edition, Enterprise edition

This article shows you how to run your TensorFlow training scripts at scale using Azure Machine Learning's TensorFlow estimator class. The example trains and registers a TensorFlow model that classifies handwritten digits using a deep neural network (DNN).

Whether you're developing a TensorFlow model from the ground up or bringing an existing model into the cloud, you can use Azure Machine Learning to scale out open-source training jobs to build, deploy, version, and monitor production-grade models.

Learn more about deep learning vs. machine learning.

Prerequisites

Run this code in either of these environments:

  • Azure Machine Learning compute instance - no downloads or installation necessary

    • Complete the Tutorial: Setup environment and workspace to create a dedicated notebook server pre-loaded with the SDK and the sample repository.
    • In the samples deep learning folder on the notebook server, find a completed and expanded notebook by navigating to this directory: how-to-use-azureml > ml-frameworks > tensorflow > deployment > train-hyperparameter-tune-deploy-with-tensorflow folder.
  • Your own Jupyter Notebook server

    You can also find a completed Jupyter Notebook version of this guide on the GitHub samples page. The notebook includes expanded sections covering intelligent hyperparameter tuning, model deployment, and notebook widgets.

Set up the experiment

This section sets up the training experiment by loading the required Python packages, initializing a workspace, creating an experiment, and uploading the training data and training scripts.

Import packages

First, import the necessary Python libraries.

import os
import urllib
import shutil
import azureml

from azureml.core import Experiment
from azureml.core import Workspace, Run

from azureml.core.compute import ComputeTarget, AmlCompute
from azureml.core.compute_target import ComputeTargetException
from azureml.train.dnn import TensorFlow

Initialize a workspace

The Azure Machine Learning workspace is the top-level resource for the service. It provides you with a centralized place to work with all the artifacts you create. In the Python SDK, you can access the workspace artifacts by creating a workspace object.

Create a workspace object from the config.json file created in the prerequisites section.

ws = Workspace.from_config()
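
For orientation, the config.json file that Workspace.from_config() reads is a small JSON document with three keys: subscription_id, resource_group, and workspace_name. The sketch below writes such a file into a temporary folder. The helper name write_workspace_config and the placeholder values are assumptions made for illustration; they are not part of the SDK, and in practice the file is usually downloaded from the portal or created during workspace setup.

```python
import json
import os
import tempfile


def write_workspace_config(folder, subscription_id, resource_group, workspace_name):
    """Write a config.json with the three keys the workspace lookup expects.

    Hypothetical helper for illustration only; not an SDK function.
    """
    config = {
        "subscription_id": subscription_id,
        "resource_group": resource_group,
        "workspace_name": workspace_name,
    }
    os.makedirs(folder, exist_ok=True)
    path = os.path.join(folder, "config.json")
    with open(path, "w") as f:
        json.dump(config, f, indent=4)
    return path


# Placeholder values; replace them with the details of your own workspace.
config_path = write_workspace_config(
    tempfile.mkdtemp(),
    subscription_id="<subscription-id>",
    resource_group="<resource-group>",
    workspace_name="<workspace-name>",
)
print(config_path)
```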

Create a deep learning experiment

Create an experiment and a folder to hold your training scripts. In this example, create an experiment called "tf-mnist".

script_folder = './tf-mnist'
os.makedirs(script_folder, exist_ok=True)

exp = Experiment(workspace=ws, name='tf-mnist')

Create a file dataset

A FileDataset object references one or more files in your workspace datastore or at public URLs. The files can be of any format, and the class lets you download or mount the files to your compute. By creating a FileDataset, you create a reference to the data source location. If you apply any transformations to the dataset, they are stored in the dataset as well. The data remains in its existing location, so no extra storage cost is incurred. See the how-to guide on the Dataset package for more information.

from azureml.core.dataset import Dataset

web_paths = [
            'http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz',
            'http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz',
            'http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz',
            'http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz'
            ]
dataset = Dataset.File.from_files(path=web_paths)

Use the register() method to register the dataset to your workspace so that it can be shared with others, reused across experiments, and referred to by name in your training script.

dataset = dataset.register(workspace=ws,
                           name='mnist dataset',
                           description='training and test dataset',
                           create_new_version=True)

# list the files referenced by dataset
dataset.to_path()

Create a compute target

Create a compute target for your TensorFlow job to run on. In this example, create a GPU-enabled Azure Machine Learning compute cluster.

cluster_name = "gpucluster"

try:
    compute_target = ComputeTarget(workspace=ws, name=cluster_name)
    print('Found existing compute target')
except ComputeTargetException:
    print('Creating a new compute target...')
    compute_config = AmlCompute.provisioning_configuration(vm_size='STANDARD_NC6', 
                                                           max_nodes=4)

    compute_target = ComputeTarget.create(ws, cluster_name, compute_config)

    compute_target.wait_for_completion(show_output=True, min_node_count=None, timeout_in_minutes=20)

For more information on compute targets, see the what is a compute target article.

Create a TensorFlow estimator

The TensorFlow estimator provides a simple way of launching a TensorFlow training job on a compute target.

The TensorFlow estimator is implemented through the generic estimator class, which can be used to support any framework. For more information about training models using the generic estimator, see Train models with Azure Machine Learning using estimator.

If your training script needs additional pip or conda packages to run, you can have the packages installed on the resulting Docker image by passing their names through the pip_packages and conda_packages arguments.

script_params = {
    '--data-folder': dataset.as_named_input('mnist').as_mount(),
    '--batch-size': 50,
    '--first-layer-neurons': 300,
    '--second-layer-neurons': 100,
    '--learning-rate': 0.01
}

est = TensorFlow(source_directory=script_folder,
                 entry_script='tf_mnist.py',
                 script_params=script_params,
                 compute_target=compute_target,
                 use_gpu=True,
                 pip_packages=['azureml-dataprep[pandas,fuse]'])
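
Each key in script_params is passed to the entry script as a command-line argument. As a minimal sketch of the receiving side, the training script might parse them with argparse as shown below. The destination names (data_folder, n_hidden_1, and so on), the default values, and the sample path /tmp/mnist are assumptions chosen for illustration; the actual tf_mnist.py may differ.

```python
import argparse


def parse_args(argv=None):
    """Parse the arguments that script_params passes to the entry script."""
    parser = argparse.ArgumentParser()
    # Path where the mounted dataset is visible on the compute target
    parser.add_argument('--data-folder', type=str, dest='data_folder',
                        help='mount point of the training data')
    parser.add_argument('--batch-size', type=int, dest='batch_size', default=50)
    parser.add_argument('--first-layer-neurons', type=int, dest='n_hidden_1',
                        default=100)
    parser.add_argument('--second-layer-neurons', type=int, dest='n_hidden_2',
                        default=100)
    parser.add_argument('--learning-rate', type=float, dest='learning_rate',
                        default=0.01)
    return parser.parse_args(argv)


# Simulate the command line that the estimator would build from script_params.
args = parse_args([
    '--data-folder', '/tmp/mnist',
    '--batch-size', '50',
    '--first-layer-neurons', '300',
    '--second-layer-neurons', '100',
    '--learning-rate', '0.01',
])
print(args.data_folder, args.batch_size, args.n_hidden_1,
      args.n_hidden_2, args.learning_rate)
```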

Tip

Support for TensorFlow 2.0 has been added to the TensorFlow estimator class. See the blog post for more information.

For more information on customizing your Python environment, see Create and manage environments for training and deployment.

Submit a run

The Run object provides the interface to the run history while the job is running and after it has completed.

run = exp.submit(est)
run.wait_for_completion(show_output=True)

As the run is executed, it goes through the following stages:

  • Preparing: A Docker image is created according to the TensorFlow estimator. The image is uploaded to the workspace's container registry and cached for later runs. Logs are also streamed to the run history and can be viewed to monitor progress.

  • Scaling: The cluster attempts to scale up if the Batch AI cluster requires more nodes to execute the run than are currently available.

  • Running: All scripts in the script folder are uploaded to the compute target, data stores are mounted or copied, and the entry_script is executed. Outputs from stdout and the ./logs folder are streamed to the run history and can be used to monitor the run.

  • Post-Processing: The ./outputs folder of the run is copied over to the run history.
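
Because the ./logs and ./outputs folders are the ones picked up during the Running and Post-Processing stages, a training script only has to write its files there. The sketch below illustrates the idea with the standard library alone. The helper name save_run_artifacts, the file names, and the accuracy value are placeholders for illustration; the example writes under a temporary directory so that it can run anywhere.

```python
import json
import os
import tempfile


def save_run_artifacts(metrics, base_dir="."):
    """Write a metrics file under outputs/ and a log line under logs/.

    Hypothetical helper; inside a real run, base_dir would be the
    script's working directory on the compute target.
    """
    outputs_dir = os.path.join(base_dir, "outputs")
    logs_dir = os.path.join(base_dir, "logs")
    os.makedirs(outputs_dir, exist_ok=True)
    os.makedirs(logs_dir, exist_ok=True)

    # Files under outputs/ are copied to the run history after the run
    metrics_path = os.path.join(outputs_dir, "metrics.json")
    with open(metrics_path, "w") as f:
        json.dump(metrics, f)

    # Files under logs/ are streamed to the run history during the run
    log_path = os.path.join(logs_dir, "train.log")
    with open(log_path, "a") as f:
        f.write("final accuracy: {}\n".format(metrics["accuracy"]))

    return metrics_path, log_path


demo_dir = tempfile.mkdtemp()
# 0.97 is a placeholder value, not a measured result
metrics_path, log_path = save_run_artifacts({"accuracy": 0.97}, base_dir=demo_dir)
print(metrics_path)
print(log_path)
```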

Register or download a model

Once you've trained the model, you can register it to your workspace. Model registration lets you store and version your models in your workspace to simplify model management and deployment. By specifying the parameters model_framework, model_framework_version, and resource_configuration, no-code model deployment becomes available. This lets you deploy your model as a web service directly from the registered model; the ResourceConfiguration object defines the compute resource for the web service.

from azureml.core import Model
from azureml.core.resource_configuration import ResourceConfiguration

model = run.register_model(model_name='tf-dnn-mnist', 
                           model_path='outputs/model',
                           model_framework=Model.Framework.TENSORFLOW,
                           model_framework_version='1.13.0',
                           resource_configuration=ResourceConfiguration(cpu=1, memory_in_gb=0.5))

You can also download a local copy of the model by using the Run object. In the training script tf_mnist.py, a TensorFlow saver object persists the model to a local folder (local to the compute target). The following code uses the Run object to download a copy.

# Create a model folder in the current directory
os.makedirs('./model', exist_ok=True)

for f in run.get_file_names():
    if f.startswith('outputs/model'):
        output_file_path = os.path.join('./model', f.split('/')[-1])
        print('Downloading from {} to {} ...'.format(f, output_file_path))
        run.download_file(name=f, output_file_path=output_file_path)

Distributed training

The TensorFlow estimator also supports distributed training across CPU and GPU clusters. You can easily run distributed TensorFlow jobs, and Azure Machine Learning manages the orchestration for you.

Azure Machine Learning supports two methods of distributed training in TensorFlow:

Horovod

Horovod is an open-source framework for distributed training developed by Uber. It offers an easy path to distributed GPU TensorFlow jobs.

To use Horovod, specify an MpiConfiguration object for the distributed_training parameter in the TensorFlow constructor. This parameter ensures that the Horovod library is installed for you to use in your training script.

from azureml.core.runconfig import MpiConfiguration
from azureml.train.dnn import TensorFlow

# TensorFlow constructor
# project_folder is assumed to point to the folder that holds script.py
estimator = TensorFlow(source_directory=project_folder,
                       compute_target=compute_target,
                       script_params=script_params,
                       entry_script='script.py',
                       node_count=2,
                       process_count_per_node=1,
                       distributed_training=MpiConfiguration(),
                       framework_version='1.13',
                       use_gpu=True,
                       pip_packages=['azureml-dataprep[pandas,fuse]'])

Parameter server

You can also run native distributed TensorFlow, which uses the parameter server model. In this method, you train across a cluster of parameter servers and workers. The workers calculate the gradients during training, while the parameter servers aggregate the gradients.

To use the parameter server method, specify a TensorflowConfiguration object for the distributed_training parameter in the TensorFlow constructor.

from azureml.core.runconfig import TensorflowConfiguration
from azureml.train.dnn import TensorFlow

distributed_training = TensorflowConfiguration()
distributed_training.worker_count = 2

# TensorFlow constructor
tf_est = TensorFlow(source_directory=project_folder,
                    compute_target=compute_target,
                    script_params=script_params,
                    entry_script='script.py',
                    node_count=2,
                    process_count_per_node=1,
                    distributed_training=distributed_training,
                    use_gpu=True,
                    pip_packages=['azureml-dataprep[pandas,fuse]'])

# submit the TensorFlow job
run = exp.submit(tf_est)

Define cluster specifications in 'TF_CONFIG'

You also need the network addresses and ports of the cluster for the tf.train.ClusterSpec, so Azure Machine Learning sets the TF_CONFIG environment variable for you.

The TF_CONFIG environment variable is a JSON string. Here is an example of the variable for a parameter server:

TF_CONFIG='{
    "cluster": {
        "ps": ["host0:2222", "host1:2222"],
        "worker": ["host2:2222", "host3:2222", "host4:2222"]
    },
    "task": {"type": "ps", "index": 0},
    "environment": "cloud"
}'

For TensorFlow's high-level tf.estimator API, TensorFlow parses the TF_CONFIG variable and builds the cluster spec for you.

For TensorFlow's lower-level core APIs for training, parse the TF_CONFIG variable and build the tf.train.ClusterSpec in your training code.

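import json
import os

import tensorflow as tf

# Azure Machine Learning sets TF_CONFIG on each node of the job
tf_config = os.environ.get('TF_CONFIG')
if not tf_config:
    raise ValueError("TF_CONFIG not found.")
tf_config_json = json.loads(tf_config)

# Extract the cluster definition and build the cluster spec from it
cluster = tf_config_json.get('cluster')
cluster_spec = tf.train.ClusterSpec(cluster)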
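
The parsing step can also be tried out without TensorFlow installed. The sketch below uses only the standard library to read the cluster definition and the role of the current process from a TF_CONFIG string. The helper name read_tf_config is hypothetical, and the host names are the sample values from the example above rather than real addresses.

```python
import json
import os


def read_tf_config(environ=os.environ):
    """Return (cluster, task_type, task_index) parsed from TF_CONFIG."""
    raw = environ.get('TF_CONFIG')
    if not raw:
        raise ValueError("TF_CONFIG not found.")
    config = json.loads(raw)
    cluster = config.get('cluster', {})
    task = config.get('task', {})
    return cluster, task.get('type'), task.get('index')


# Sample environment mirroring the parameter server example
example_env = {
    'TF_CONFIG': json.dumps({
        "cluster": {
            "ps": ["host0:2222", "host1:2222"],
            "worker": ["host2:2222", "host3:2222", "host4:2222"],
        },
        "task": {"type": "ps", "index": 0},
        "environment": "cloud",
    })
}

cluster, task_type, task_index = read_tf_config(example_env)
# A process typically branches on its role: parameter server or worker
is_parameter_server = task_type == 'ps'
print(task_type, task_index, len(cluster['worker']))
```

The cluster dictionary returned here has the same shape as the one passed to tf.train.ClusterSpec in the preceding code.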

Deploy a TensorFlow model

The model you just registered can be deployed in exactly the same way as any other registered model in Azure Machine Learning, regardless of which estimator you used for training. The deployment how-to contains a section on registering models, but you can skip directly to creating a compute target for deployment, since you already have a registered model.

(Preview) No-code model deployment

Instead of the traditional deployment route, you can also use the no-code deployment feature (preview) for TensorFlow. If you registered your model as shown above with the model_framework, model_framework_version, and resource_configuration parameters, you can use the deploy() static function to deploy it.

service = Model.deploy(ws, "tensorflow-web-service", [model])

The full how-to covers deployment in Azure Machine Learning in greater depth.

Next steps

In this article, you trained and registered a TensorFlow model, and learned about options for deployment. See these other articles to learn more about Azure Machine Learning.