Train a model using a custom Docker image

In this article, learn how to use a custom Docker image when training models with Azure Machine Learning.

The example scripts in this article are used to classify pet images by creating a convolutional neural network.

While Azure Machine Learning provides a default Docker base image, you can also use Azure Machine Learning environments to specify a different base image, such as one of the maintained Azure ML base images or your own custom image. Custom base images allow you to closely manage your dependencies and maintain tighter control over component versions when executing training jobs.

Prerequisites

Run this code on either of these environments:

Set up the experiment

This section sets up the training experiment by initializing a workspace, creating an experiment, and uploading the training data and training scripts.

Initialize a workspace

The Azure Machine Learning workspace is the top-level resource for the service. It provides you with a centralized place to work with all the artifacts you create. In the Python SDK, you can access the workspace artifacts by creating a workspace object.

Create a workspace object from the config.json file created in the prerequisites section.

from azureml.core import Workspace

ws = Workspace.from_config()

Prepare scripts

For this tutorial, the training script train.py is provided. In practice, you can take any custom training script, as is, and run it with Azure Machine Learning.
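Where you place train.py matters, because the whole source directory is uploaded with the run. A minimal sketch of staging the script, assuming train.py is in your current working directory and using the fastai-example folder name that the ScriptRunConfig below expects:

import os
import shutil

# Create the source directory that will be uploaded with the run.
source_dir = 'fastai-example'
os.makedirs(source_dir, exist_ok=True)

# Copy the training script into the source directory.
shutil.copy('train.py', source_dir)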

Define your environment

Create an environment object and enable Docker.

from azureml.core import Environment

fastai_env = Environment("fastai2")
fastai_env.docker.enabled = True

This specified base image supports the fast.ai library, which allows for distributed deep learning capabilities. For more information, see the fast.ai DockerHub.

When you are using a custom Docker image, you might already have your Python environment properly set up. In that case, set the user_managed_dependencies flag to True to use your custom image's built-in Python environment. By default, Azure ML builds a Conda environment with the dependencies you specify, and executes the run in that environment instead of using any Python libraries that you installed on the base image.

fastai_env.docker.base_image = "fastdotai/fastai2:latest"
fastai_env.python.user_managed_dependencies = True
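For contrast, if you leave user_managed_dependencies at its default of False, Azure ML builds a Conda environment on top of the base image from the dependencies you declare. A minimal sketch, assuming the fastai pip package is what your script needs (the package list is illustrative):

from azureml.core.conda_dependencies import CondaDependencies

# Let Azure ML build a Conda environment on top of the base image.
fastai_env.python.user_managed_dependencies = False
fastai_env.python.conda_dependencies = CondaDependencies.create(
    pip_packages=['fastai', 'azureml-defaults'])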

To use an image from a private container registry that is not in your workspace, you must use docker.base_image_registry to specify the address of the repository and a user name and password:

# Set the container registry information
fastai_env.docker.base_image_registry.address = "myregistry.azurecr.io"
fastai_env.docker.base_image_registry.username = "username"
fastai_env.docker.base_image_registry.password = "password"

It is also possible to use a custom Dockerfile. Use this approach if you need to install non-Python packages as dependencies, and remember to set the base image to None.

# Specify docker steps as a string. 
dockerfile = r"""
FROM mcr.microsoft.com/azureml/base:intelmpi2018.3-ubuntu16.04
RUN echo "Hello from custom container!"
"""

# Set base image to None, because the image is defined by dockerfile.
fastai_env.docker.base_image = None
fastai_env.docker.base_dockerfile = dockerfile

# Alternatively, load the string from a file.
fastai_env.docker.base_image = None
fastai_env.docker.base_dockerfile = "./Dockerfile"
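Whichever way you define the environment, you can optionally register it and build its image ahead of time to catch Dockerfile or base-image errors before submitting a training run. A short sketch using the Environment.register and Environment.build methods; the image is built in your workspace's Azure Container Registry:

# Optionally pre-build the environment image to validate the Dockerfile early.
fastai_env.register(workspace=ws)
build = fastai_env.build(workspace=ws)
build.wait_for_completion(show_output=True)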

Create or attach existing AmlCompute

You will need to create a compute target for training your model. In this tutorial, you create AmlCompute as your training compute resource.

Creation of AmlCompute takes approximately 5 minutes. If an AmlCompute cluster with that name is already in your workspace, this code skips the creation process.

As with other Azure services, there are limits on certain resources (for example, AmlCompute) associated with the Azure Machine Learning service. Read this article to learn about the default limits and how to request more quota.

from azureml.core.compute import ComputeTarget, AmlCompute
from azureml.core.compute_target import ComputeTargetException

# choose a name for your cluster
cluster_name = "gpu-cluster"

try:
    compute_target = ComputeTarget(workspace=ws, name=cluster_name)
    print('Found existing compute target.')
except ComputeTargetException:
    print('Creating a new compute target...')
    compute_config = AmlCompute.provisioning_configuration(vm_size='STANDARD_NC6',
                                                           max_nodes=4)

    # create the cluster
    compute_target = ComputeTarget.create(ws, cluster_name, compute_config)

    compute_target.wait_for_completion(show_output=True)

# use get_status() to get a detailed status for the current AmlCompute
print(compute_target.get_status().serialize())

Create a ScriptRunConfig

This ScriptRunConfig will configure your job for execution on the desired compute target.

from azureml.core import ScriptRunConfig

fastai_config = ScriptRunConfig(source_directory='fastai-example', script='train.py')
fastai_config.run_config.environment = fastai_env
fastai_config.run_config.target = compute_target

Submit your run

When a training run is submitted using a ScriptRunConfig object, the submit method returns an object of type ScriptRun. The returned ScriptRun object gives you programmatic access to information about the training run.

from azureml.core import Experiment

run = Experiment(ws,'fastai-custom-image').submit(fastai_config)
run.wait_for_completion(show_output=True)
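Because submit returns the run object, you can also query the run programmatically once it finishes (or while it is still running). A short usage sketch with standard Run methods; the metrics and file names depend on what your train.py logs:

# Inspect the run programmatically.
print(run.get_status())       # for example, 'Completed'
print(run.get_metrics())      # metrics logged by train.py, if any
print(run.get_file_names())   # files written to the run's outputs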

Warning

Azure Machine Learning runs training scripts by copying the entire source directory. If you have sensitive data that you don't want to upload, use a .ignore file or don't include it in the source directory. Instead, access your data using a datastore.
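For example, here is a minimal sketch of pointing at data in the workspace's default datastore instead of placing it in the source directory; the pet-images path is a hypothetical location in your datastore:

from azureml.core import Dataset

# Reference data that already lives in the workspace's default datastore
# rather than copying it into the source directory.
datastore = ws.get_default_datastore()
pets_dataset = Dataset.File.from_files(path=(datastore, 'pet-images/**'))

# train.py can then consume the dataset as a mounted input, for example by
# passing pets_dataset.as_named_input('pets').as_mount() in the run's arguments.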

For more information about customizing your Python environment, see create & use software environments.

Next steps

In this article, you trained a model using a custom Docker image. See these other articles to learn more about Azure Machine Learning.