Train a model by using a custom Docker image

In this article, learn how to use a custom Docker image when you're training models with Azure Machine Learning. You'll use the example scripts in this article to classify pet images by creating a convolutional neural network.

Azure Machine Learning provides a default Docker base image. You can also use Azure Machine Learning environments to specify a different base image, such as one of the maintained Azure Machine Learning base images or your own custom image. Custom base images allow you to closely manage your dependencies and maintain tighter control over component versions when running training jobs.

Prerequisites

Run the code on either of these environments:

Set up a training experiment

In this section, you set up your training experiment by initializing a workspace, defining your environment, and configuring a compute target.

Initialize a workspace

The Azure Machine Learning workspace is the top-level resource for the service. It gives you a centralized place to work with all the artifacts that you create. In the Python SDK, you can access the workspace artifacts by creating a Workspace object.

Create a Workspace object from the config.json file that you created as a prerequisite.

from azureml.core import Workspace

ws = Workspace.from_config()

Define your environment

Create an Environment object and enable Docker.

from azureml.core import Environment

fastai_env = Environment("fastai2")
fastai_env.docker.enabled = True

The base image specified in the following code supports the fast.ai library, which allows for distributed deep-learning capabilities. For more information, see the fast.ai Docker Hub repository.

When you're using your custom Docker image, you might already have your Python environment properly set up. In that case, set the user_managed_dependencies flag to True to use your custom image's built-in Python environment. By default, Azure Machine Learning builds a Conda environment with dependencies that you specified. The service runs the script in that environment instead of using any Python libraries that you installed on the base image.

fastai_env.docker.base_image = "fastdotai/fastai2:latest"
fastai_env.python.user_managed_dependencies = True
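If you would rather let Azure Machine Learning build the Conda environment on top of the base image, leave user_managed_dependencies at its default of False and declare the packages your script needs. The following sketch is illustrative only; the package list is an assumption and not part of this tutorial.

from azureml.core.conda_dependencies import CondaDependencies

# Illustrative alternative: have the service build a Conda environment
# with these pip packages instead of using the image's own Python setup.
fastai_env.python.user_managed_dependencies = False
fastai_env.python.conda_dependencies = CondaDependencies.create(
    pip_packages=['fastai', 'azureml-defaults'])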

Use a private container registry (optional)

To use an image from a private container registry that isn't in your workspace, use docker.base_image_registry to specify the address of the repository and a username and password:

# Set the container registry information.
fastai_env.docker.base_image_registry.address = "myregistry.azurecr.io"
fastai_env.docker.base_image_registry.username = "username"
fastai_env.docker.base_image_registry.password = "password"
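Avoid hardcoding registry credentials in your code. One option, shown here as a sketch using standard Python rather than a specific Azure Machine Learning feature, is to read them from environment variables (the variable names are hypothetical):

import os

# Read registry credentials from environment variables instead of hardcoding them.
fastai_env.docker.base_image_registry.username = os.environ["REGISTRY_USERNAME"]
fastai_env.docker.base_image_registry.password = os.environ["REGISTRY_PASSWORD"]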

Use a custom Dockerfile (optional)

It's also possible to use a custom Dockerfile. Use this approach if you need to install non-Python packages as dependencies. Remember to set the base image to None.

# Specify Docker steps as a string. 
dockerfile = r"""
FROM mcr.microsoft.com/azureml/base:intelmpi2018.3-ubuntu16.04
RUN echo "Hello from custom container!"
"""

# Set the base image to None, because the image is defined by Dockerfile.
fastai_env.docker.base_image = None
fastai_env.docker.base_dockerfile = dockerfile

# Alternatively, load the string from a file.
fastai_env.docker.base_image = None
fastai_env.docker.base_dockerfile = "./Dockerfile"

Important

Azure Machine Learning only supports Docker images that provide the following software:

  • Ubuntu 16.04 or greater.
  • Conda 4.5.# or greater.
  • Python 3.5+.

For more information about creating and managing Azure Machine Learning environments, see Create and use software environments.
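For example, once an environment is defined you can register it with the workspace and retrieve it later by name. The following is a minimal sketch based on the azureml-core Environment API:

# Register the environment so it can be reused and versioned in the workspace.
fastai_env.register(workspace=ws)

# Later, retrieve the registered environment by name.
restored_env = Environment.get(workspace=ws, name="fastai2")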

Create or attach a compute target

You need to create a compute target for training your model. In this tutorial, you create AmlCompute as your training compute resource.

Creation of AmlCompute takes a few minutes. If the AmlCompute resource is already in your workspace, this code skips the creation process.

As with other Azure services, there are limits on certain resources (for example, AmlCompute) associated with the Azure Machine Learning service. For more information, see Default limits and how to request a higher quota.

from azureml.core.compute import ComputeTarget, AmlCompute
from azureml.core.compute_target import ComputeTargetException

# Choose a name for your cluster.
cluster_name = "gpu-cluster"

try:
    compute_target = ComputeTarget(workspace=ws, name=cluster_name)
    print('Found existing compute target.')
except ComputeTargetException:
    print('Creating a new compute target...')
    compute_config = AmlCompute.provisioning_configuration(vm_size='STANDARD_NC6',
                                                           max_nodes=4)

    # Create the cluster.
    compute_target = ComputeTarget.create(ws, cluster_name, compute_config)

    compute_target.wait_for_completion(show_output=True)

# Use get_status() to get a detailed status for the current AmlCompute.
print(compute_target.get_status().serialize())

Configure your training job

For this tutorial, use the training script train.py on GitHub. In practice, you can take any custom training script and run it, as is, with Azure Machine Learning.
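The script itself isn't reproduced here. As a rough illustration only, a minimal fast.ai pet-classification script could look like the sketch below; it follows the standard fast.ai pets example and is not the actual train.py from the GitHub repository.

# train.py (illustrative sketch, not the script from the example repository)
from fastai.vision.all import *

# Download the Oxford-IIIT Pet images; filenames encode the labels.
path = untar_data(URLs.PETS) / 'images'

def is_cat(name):
    # In this dataset, cat breeds have filenames that start with an uppercase letter.
    return name[0].isupper()

dls = ImageDataLoaders.from_name_func(
    path, get_image_files(path), valid_pct=0.2,
    label_func=is_cat, item_tfms=Resize(224))

# Fine-tune a pretrained convolutional neural network.
learn = cnn_learner(dls, resnet34, metrics=error_rate)
learn.fine_tune(1)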

Create a ScriptRunConfig resource to configure your job for running on the desired compute target.

from azureml.core import ScriptRunConfig

src = ScriptRunConfig(source_directory='fastai-example',
                      script='train.py',
                      compute_target=compute_target,
                      environment=fastai_env)

Submit your training job

When you submit a training run by using a ScriptRunConfig object, the submit method returns an object of type ScriptRun. The returned ScriptRun object gives you programmatic access to information about the training run.

from azureml.core import Experiment

run = Experiment(ws,'Tutorial-fastai').submit(src)
run.wait_for_completion(show_output=True)
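After the run finishes (or while it's in progress), the returned run object exposes details programmatically. For example:

# Inspect the run programmatically.
print(run.get_status())        # for example, 'Completed'
print(run.get_details())       # run configuration, timing, and error information
print(run.get_metrics())       # any metrics logged by the training script
print(run.get_file_names())    # files and logs associated with the run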

Warning

Azure Machine Learning runs training scripts by copying the entire source directory. If you have sensitive data that you don't want to upload, use an .ignore file or don't include it in the source directory. Instead, access your data by using a datastore.
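For example, you can list files and folders to exclude in an ignore file at the root of the source directory, and read large datasets from a workspace datastore instead of copying them with the code. The datastore sketch below is illustrative; 'datasets/pets' is a hypothetical path on the datastore.

from azureml.core import Dataset

# Reference data that already lives on the workspace's default datastore,
# rather than including it in the uploaded source directory.
datastore = ws.get_default_datastore()
pets_data = Dataset.File.from_files(path=(datastore, 'datasets/pets'))  # hypothetical path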

Next steps

In this article, you trained a model by using a custom Docker image. See these other articles to learn more about Azure Machine Learning: