Train Keras models at scale with Azure Machine Learning

In this article, learn how to run your Keras training scripts with Azure Machine Learning.

The example code in this article shows you how to train and register a Keras classification model built using the TensorFlow backend with Azure Machine Learning. It uses the popular MNIST dataset to classify handwritten digits using a deep neural network (DNN) built with the Keras Python library running on top of TensorFlow.

Keras is a high-level neural network API capable of running on top of other popular DNN frameworks to simplify development. With Azure Machine Learning, you can rapidly scale out training jobs using elastic cloud compute resources. You can also track your training runs, version models, deploy models, and much more.

Whether you're developing a Keras model from the ground up or you're bringing an existing model into the cloud, Azure Machine Learning can help you build production-ready models.

Note

If you are using the Keras API tf.keras built into TensorFlow, and not the standalone Keras package, refer instead to Train TensorFlow models.

Prerequisites

Run this code on either of these environments:

  • Azure Machine Learning compute instance - no downloads or installation necessary

    • Complete the Tutorial: Setup environment and workspace to create a dedicated notebook server pre-loaded with the SDK and the sample repository.
    • In the samples folder on the notebook server, find a completed and expanded notebook by navigating to this directory: how-to-use-azureml > ml-frameworks > keras > train-hyperparameter-tune-deploy-with-keras.
  • Your own Jupyter Notebook server

    You can also find a completed Jupyter Notebook version of this guide on the GitHub samples page. The notebook includes expanded sections covering intelligent hyperparameter tuning, model deployment, and notebook widgets.

Set up the experiment

This section sets up the training experiment by loading the required Python packages, initializing a workspace, creating the FileDataset for the input training data, creating the compute target, and defining the training environment.

Import packages

First, import the necessary Python libraries.

import os
import azureml
from azureml.core import Experiment
from azureml.core import Environment
from azureml.core import Workspace, Run
from azureml.core.compute import ComputeTarget, AmlCompute
from azureml.core.compute_target import ComputeTargetException

Initialize a workspace

The Azure Machine Learning workspace is the top-level resource for the service. It provides you with a centralized place to work with all the artifacts you create. In the Python SDK, you can access the workspace artifacts by creating a workspace object.

Create a workspace object from the config.json file created in the prerequisites section.

ws = Workspace.from_config()

Create a file dataset

A FileDataset object references one or multiple files in your workspace datastore or public URLs. The files can be of any format, and the class provides you with the ability to download or mount the files to your compute. By creating a FileDataset, you create a reference to the data source location. If you applied any transformations to the dataset, they will be stored in the dataset as well. The data remains in its existing location, so no extra storage cost is incurred. See the how-to guide on the Dataset package for more information.

from azureml.core.dataset import Dataset

web_paths = [
            'http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz',
            'http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz',
            'http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz',
            'http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz'
            ]
dataset = Dataset.File.from_files(path=web_paths)
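The four MNIST files referenced above are gzip-compressed binaries in the IDX format. If you want to inspect them locally after downloading or mounting the dataset, a standard-library parser sketch can help; the helper name `load_idx_images` and the synthetic byte blob below are illustrative (a real file would first be decompressed with `gzip.decompress`), not part of the Azure ML SDK:

```python
import struct

def load_idx_images(raw: bytes):
    """Parse an IDX3 image file: magic 0x00000803, count, rows, cols, then pixels."""
    magic, count, rows, cols = struct.unpack('>IIII', raw[:16])
    if magic != 0x803:
        raise ValueError('not an IDX3 image file')
    size = rows * cols
    pixels = raw[16:]
    return [pixels[i * size:(i + 1) * size] for i in range(count)]

# Synthetic blob standing in for a decompressed train-images file:
# two 2x2 "images" with pixel values 0..7.
blob = struct.pack('>IIII', 0x803, 2, 2, 2) + bytes(range(8))
images = load_idx_images(blob)
print(len(images))       # 2
print(list(images[0]))   # [0, 1, 2, 3]
```

The label files use the analogous IDX1 layout (magic 0x00000801, count, then one byte per label).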

You can use the register() method to register the dataset to your workspace so it can be shared with others, reused across various experiments, and referred to by name in your training script.

dataset = dataset.register(workspace=ws,
                           name='mnist-dataset',
                           description='training and test dataset',
                           create_new_version=True)

Create a compute target

Create a compute target for your training job to run on. In this example, create a GPU-enabled Azure Machine Learning compute cluster.

cluster_name = "gpu-cluster"

try:
    compute_target = ComputeTarget(workspace=ws, name=cluster_name)
    print('Found existing compute target')
except ComputeTargetException:
    print('Creating a new compute target...')
    compute_config = AmlCompute.provisioning_configuration(vm_size='STANDARD_NC6',
                                                           max_nodes=4)

    compute_target = ComputeTarget.create(ws, cluster_name, compute_config)

    compute_target.wait_for_completion(show_output=True, min_node_count=None, timeout_in_minutes=20)

For more information on compute targets, see the what is a compute target article.

Define your environment

Define the Azure ML environment that encapsulates your training script's dependencies.

First, define your conda dependencies in a YAML file; in this example the file is named conda_dependencies.yml.

channels:
- conda-forge
dependencies:
- python=3.6.2
- pip:
  - azureml-defaults
  - tensorflow-gpu==2.0.0
  - keras<=2.3.1
  - matplotlib

Create an Azure ML environment from this conda environment specification. The environment will be packaged into a Docker container at runtime.

By default, if no base image is specified, Azure ML will use a CPU image, azureml.core.environment.DEFAULT_CPU_IMAGE, as the base image. Since this example runs training on a GPU cluster, you will need to specify a GPU base image that has the necessary GPU drivers and dependencies. Azure ML maintains a set of base images published on Microsoft Container Registry (MCR) that you can use; see the Azure/AzureML-Containers GitHub repo for more information.

keras_env = Environment.from_conda_specification(name='keras-env', file_path='conda_dependencies.yml')

# Specify a GPU base image
keras_env.docker.enabled = True
keras_env.docker.base_image = 'mcr.microsoft.com/azureml/openmpi3.1.2-cuda10.0-cudnn7-ubuntu18.04'

For more information on creating and using environments, see Create and use software environments in Azure Machine Learning.

Configure and submit your training run

Create a ScriptRunConfig

First, get the data from the workspace datastore using the Dataset class.

dataset = Dataset.get_by_name(ws, 'mnist-dataset')

# list the files referenced by mnist-dataset
dataset.to_path()

Create a ScriptRunConfig object to specify the configuration details of your training job, including your training script, the environment to use, and the compute target to run on.

Any arguments to your training script will be passed via the command line if specified in the arguments parameter. The DatasetConsumptionConfig for our FileDataset is passed as an argument to the training script, for the --data-folder argument. Azure ML will resolve this DatasetConsumptionConfig to the mount point of the backing datastore, which can then be accessed from the training script.

from azureml.core import ScriptRunConfig

# Local folder containing keras_mnist.py; adjust to your project layout
script_folder = './keras-mnist'

args = ['--data-folder', dataset.as_mount(),
        '--batch-size', 50,
        '--first-layer-neurons', 300,
        '--second-layer-neurons', 100,
        '--learning-rate', 0.001]

src = ScriptRunConfig(source_directory=script_folder,
                      script='keras_mnist.py',
                      arguments=args,
                      compute_target=compute_target,
                      environment=keras_env)
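On the training-script side, keras_mnist.py can read these values with argparse. A minimal sketch, with parameter names mirroring the args list above; the values are passed explicitly here so the sketch runs standalone, whereas on the compute target they arrive via sys.argv:

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument('--data-folder', type=str,
                    help='mount point of the input dataset on the compute target')
parser.add_argument('--batch-size', type=int, default=50)
parser.add_argument('--first-layer-neurons', type=int, default=300)
parser.add_argument('--second-layer-neurons', type=int, default=100)
parser.add_argument('--learning-rate', type=float, default=0.001)

# Illustrative values; on the compute target, call parser.parse_args() with no
# arguments to read the command line Azure ML constructs from ScriptRunConfig.
args = parser.parse_args(['--data-folder', '/tmp/mnist',
                          '--batch-size', '50',
                          '--learning-rate', '0.001'])
print(args.data_folder, args.batch_size, args.learning_rate)
# → /tmp/mnist 50 0.001
```

Note that argparse converts the dashes in option names to underscores, so `--data-folder` becomes `args.data_folder`.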

For more information on configuring jobs with ScriptRunConfig, see Configure and submit training runs.

Warning

If you were previously using the TensorFlow estimator to configure your Keras training jobs, please note that Estimators have been deprecated as of the 1.19.0 SDK release. With Azure ML SDK >= 1.15.0, ScriptRunConfig is the recommended way to configure training jobs, including those that use deep learning frameworks. For common migration questions, see the Estimator to ScriptRunConfig migration guide.

Submit your run

The Run object provides the interface to the run history while the job is running and after it has completed.

run = Experiment(workspace=ws, name='Tutorial-Keras-Minst').submit(src)
run.wait_for_completion(show_output=True)

What happens during run execution

As the run is executed, it goes through the following stages:

  • Preparing: A Docker image is created according to the environment defined. The image is uploaded to the workspace's container registry and cached for later runs. Logs are also streamed to the run history and can be viewed to monitor progress. If a curated environment is specified instead, the cached image backing that curated environment will be used.

  • Scaling: The cluster attempts to scale up if it requires more nodes to execute the run than are currently available.

  • Running: All scripts in the script folder are uploaded to the compute target, data stores are mounted or copied, and the script is executed. Outputs from stdout and the ./logs folder are streamed to the run history and can be used to monitor the run.

  • Post-Processing: The ./outputs folder of the run is copied over to the run history.

Register the model

Once you've trained the model, you can register it to your workspace. Model registration lets you store and version your models in your workspace to simplify model management and deployment.

model = run.register_model(model_name='keras-mnist', model_path='outputs/model')

Tip

The deployment how-to contains a section on registering models, but since you already have a registered model, you can skip directly to creating a compute target for deployment.

You can also download a local copy of the model. This can be useful for doing additional model validation work locally. In the training script, keras_mnist.py, a TensorFlow saver object persists the model to a local folder (local to the compute target). You can use the Run object to download a copy from the run history.

# Create a model folder in the current directory
os.makedirs('./model', exist_ok=True)

for f in run.get_file_names():
    if f.startswith('outputs/model'):
        output_file_path = os.path.join('./model', f.split('/')[-1])
        print('Downloading from {} to {} ...'.format(f, output_file_path))
        run.download_file(name=f, output_file_path=output_file_path)

Next steps

In this article, you trained and registered a Keras model on Azure Machine Learning. To learn how to deploy a model, continue on to our model deployment article.