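Train PyTorch models at scale with Azure Machine Learning

Applies to: Python SDK azure-ai-ml v2 (current)

This article shows how to train, hyperparameter-tune, and deploy a PyTorch model by using the Azure Machine Learning Python SDK v2.

You use example scripts to classify chicken and turkey images and build a deep learning neural network (DNN) based on PyTorch's transfer learning tutorial. Transfer learning is a technique that applies knowledge gained from solving one problem to a different but related problem. Compared with training from scratch, transfer learning requires less data, time, and compute, which shortens the training process. For more information about transfer learning, see Deep learning vs. machine learning.

Whether you're training a deep learning PyTorch model from the ground up or bringing an existing model into the cloud, you can use Azure Machine Learning to scale out open-source training jobs with elastic cloud compute resources. You can build, deploy, version, and monitor production-grade models with Azure Machine Learning.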

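Prerequisites

An Azure subscription. If you don't have one, create a trial subscription.
Python 3.10 or later.
Run the code in this article by using either an Azure Machine Learning compute instance or your own Jupyter notebook.
- Azure Machine Learning compute instance (no downloads or installation necessary):
  - Complete Quickstart: Get started with Azure Machine Learning to create a dedicated notebook server preloaded with the SDK and the sample repository.
  - Under the Samples tab in the Notebooks section of your workspace, find a completed and expanded notebook by navigating to this directory: SDK v2/sdk/python/jobs/single-step/pytorch/train-hyperparameter-tune-deploy-with-pytorch
- Your Jupyter notebook server:
  - Install the Azure Machine Learning SDK (v2).
  - Download the training script file pytorch_train.py.

You can also find a completed Jupyter notebook version of this guide on the GitHub samples page.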

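Set up the job

This section sets up the training job by loading the required Python packages, connecting to a workspace, creating a compute resource to run a command job, and creating an environment to run the job in.

Connect to the workspace

First, connect to your Azure Machine Learning workspace. The workspace is the top-level resource for the service. It provides a centralized place to work with all the artifacts you create when you use Azure Machine Learning.

Use DefaultAzureCredential to access the workspace. This credential handles most Azure SDK authentication scenarios.

If DefaultAzureCredential doesn't work for you, see the azure.identity package or Set up authentication for more available credentials.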

# Handle to the workspace
from azure.ai.ml import MLClient

# Authentication package
from azure.identity import DefaultAzureCredential

credential = DefaultAzureCredential()

Tip

To sign in and authenticate through a browser instead, use InteractiveBrowserCredential:

from azure.identity import InteractiveBrowserCredential
credential = InteractiveBrowserCredential()

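Next, get a handle to the workspace by providing your subscription ID, resource group name, and workspace name. To find these parameters:

Look for your workspace name in the upper-right corner of the Azure Machine Learning studio toolbar.
Select your workspace name to show your resource group and subscription ID.
Copy the values for the resource group and subscription ID into the code.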

# Get a handle to the workspace
ml_client = MLClient(
    credential=credential,
    subscription_id="<SUBSCRIPTION_ID>",
    resource_group_name="<RESOURCE_GROUP>",
    workspace_name="<AML_WORKSPACE_NAME>",
)

Running this script gives you a workspace handle that you can use to manage other resources and jobs.
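Rather than pasting the three identifiers into a notebook, one option is to read them from environment variables. The following is a minimal sketch of that idea. The helper name workspace_settings and the variable names AZURE_SUBSCRIPTION_ID, AZURE_RESOURCE_GROUP, and AZUREML_WORKSPACE_NAME are assumptions chosen for illustration; the SDK doesn't require them.

```python
import os


def workspace_settings(env=os.environ):
    """Collect the three MLClient keyword arguments from environment variables.

    The environment variable names are hypothetical; pick any convention you like.
    """
    names = {
        "subscription_id": "AZURE_SUBSCRIPTION_ID",
        "resource_group_name": "AZURE_RESOURCE_GROUP",
        "workspace_name": "AZUREML_WORKSPACE_NAME",
    }
    # Fail early with a clear message instead of passing empty values along.
    missing = [var for var in names.values() if not env.get(var)]
    if missing:
        raise KeyError(f"Missing environment variables: {', '.join(missing)}")
    return {arg: env[var] for arg, var in names.items()}
```

With the variables set, the client could then be built as MLClient(credential=credential, **workspace_settings()), which keeps identifiers out of the notebook itself.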

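Note

Creating MLClient doesn't connect the client to the workspace. Client initialization is lazy: it waits until the first time it needs to make a call. In this article, that call happens during compute creation.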
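The lazy behavior described in the note can be pictured with a small toy class. This is only a sketch of the general pattern, not the SDK's implementation; LazyClient and its methods are hypothetical names.

```python
class LazyClient:
    """Toy client that defers its (simulated) connection until first use."""

    def __init__(self, workspace_name):
        self.workspace_name = workspace_name
        self._connection = None  # nothing is contacted at construction time
        self.connect_count = 0

    def _connect(self):
        # Stand-in for the first real call to the service.
        if self._connection is None:
            self.connect_count += 1
            self._connection = f"connected:{self.workspace_name}"
        return self._connection

    def get_compute(self, name):
        # The first operation triggers the connection; later ones reuse it.
        return f"{self._connect()}/compute/{name}"


client = LazyClient("my-workspace")
print(client.connect_count)  # constructing the client did not connect
client.get_compute("gpu-cluster")
client.get_compute("gpu-cluster")
print(client.connect_count)  # connected once, on first use
```

One practical consequence of this pattern is that a wrong identifier or credential tends to surface at the first operation rather than when the client object is created.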

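Create a compute resource to run the job

Azure Machine Learning needs a compute resource to run a job. This resource can be a single-node or multi-node machine with a Linux or Windows OS, or a specific compute fabric such as Spark.

In the following example script, you provision a Linux compute cluster. For the full list of VM sizes and prices, see the Azure Machine Learning pricing page. Because this example needs a GPU cluster, choose the Standard_NC4as_T4_v3 size and create an Azure Machine Learning compute.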

from azure.ai.ml.entities import AmlCompute

gpu_compute_taget = "gpu-cluster"

try:
    # let's see if the compute target already exists
    gpu_cluster = ml_client.compute.get(gpu_compute_taget)
    print(
        f"You already have a cluster named {gpu_compute_taget}, we'll reuse it as is."
    )

except Exception:
    print("Creating a new gpu compute target...")

    # Let's create the Azure ML compute object with the intended parameters
    gpu_cluster = AmlCompute(
        # Name assigned to the compute cluster
        name="gpu-cluster",
        # Azure ML Compute is the on-demand VM service
        type="amlcompute",
        # VM Family
        size="Standard_NC4as_T4_v3",
        # Minimum running nodes when there is no job running
        min_instances=0,
        # Nodes in cluster
        max_instances=4,
        # Seconds a node stays idle after a job ends before it scales down
        idle_time_before_scale_down=180,
        # Dedicated or LowPriority. The latter is cheaper but there is a chance of job termination
        tier="Dedicated",
    )

    # Pass the object to MLClient's begin_create_or_update method and wait for the result
    gpu_cluster = ml_client.begin_create_or_update(gpu_cluster).result()

print(
    f"AMLCompute with name {gpu_cluster.name} is created, the compute size is {gpu_cluster.size}"
)

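Create a job environment

To run an Azure Machine Learning job, you need an environment. An Azure Machine Learning environment encapsulates the dependencies (such as the software runtime and libraries) needed to run your machine learning training script on your compute resource. This environment is similar to a Python environment on your local machine.

Azure Machine Learning lets you use a curated (ready-made) environment, or create a custom environment from a Docker image or a Conda configuration. In this article, you reuse the curated Azure Machine Learning environment AzureML-acpt-pytorch-2.8-cuda12.6. Use the @latest directive to get the latest version of this environment.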

curated_env_name = "AzureML-acpt-pytorch-2.8-cuda12.6@latest"

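Configure and submit your training job

This section begins by introducing the data for training. It then covers how to run a training job by using the provided training script. You build the training job by configuring the command that runs the training script, and then submit the job to run in Azure Machine Learning.

Obtain the training data

You can use the dataset in this zipped file. The dataset contains about 120 training images for each of two classes (turkeys and chickens), with 100 validation images for each class. The images are a subset of the Open Images v5 Dataset. The training script pytorch_train.py downloads and extracts the dataset.

Prepare the training script

In the prerequisites section, you obtained the training script pytorch_train.py. In practice, you should be able to take any custom training script as is and run it with Azure Machine Learning without modifying your code.

The provided training script downloads the data, trains a model, and registers the model.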

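Build the training job

Now that you have all the assets required to run your job, build it by using the Azure Machine Learning Python SDK v2. For this example, create a command.

An Azure Machine Learning command is a resource that specifies all the details needed to execute your training code in the cloud. These details include the inputs and outputs, the type of hardware to use, the software to install, and how to run your code. The command contains the information needed to execute a single command.

Configure the command

Use the general-purpose command to run the training script and perform the tasks you need. Create a command object to specify the configuration details of your training job.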

from azure.ai.ml import command
from azure.ai.ml import Input

job = command(
    inputs=dict(
        num_epochs=30, learning_rate=0.001, momentum=0.9, output_dir="./outputs"
    ),
    compute=gpu_compute_taget,
    environment=curated_env_name,
    code="./src/",  # location of source code
    command="python pytorch_train.py --num_epochs ${{inputs.num_epochs}} --output_dir ${{inputs.output_dir}}",
    experiment_name="pytorch-birds",
    display_name="pytorch-birds-image",
)

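The inputs for this command include the number of epochs, the learning rate, the momentum, and the output directory.
For the parameter values:
1. Provide the compute cluster gpu-cluster that you created for running this command.
2. Provide the curated environment that you initialized earlier.
3. If you aren't using the completed notebook in the samples folder, specify the location of the pytorch_train.py file.
4. Configure the command line action itself. In this case, the command is python pytorch_train.py. You can access the inputs and outputs in the command through the ${{ ... }} notation.
5. Configure metadata such as the display name and the experiment name. An experiment is a container for all the iterations performed on a certain project. In Azure Machine Learning studio, all jobs submitted under the same experiment name are listed next to each other.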
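To make the ${{ ... }} notation concrete, the sketch below substitutes input values into a command template with a regular expression. It only illustrates the idea; render_command is a hypothetical helper, and the service performs its own resolution when the job runs.

```python
import re


def render_command(template, inputs):
    """Substitute ${{inputs.<name>}} placeholders with values from `inputs`."""

    def replace(match):
        name = match.group(1)
        if name not in inputs:
            raise KeyError(f"Command references unknown input: {name}")
        return str(inputs[name])

    return re.sub(r"\$\{\{\s*inputs\.(\w+)\s*\}\}", replace, template)


template = (
    "python pytorch_train.py --num_epochs ${{inputs.num_epochs}} "
    "--output_dir ${{inputs.output_dir}}"
)
inputs = dict(num_epochs=30, learning_rate=0.001, momentum=0.9, output_dir="./outputs")
rendered = render_command(template, inputs)
print(rendered)
```

In this sketch, only inputs that the template references appear in the rendered command line. The command string shown earlier references num_epochs and output_dir, so an input that should reach the script needs its own placeholder in that string.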

Submit the job

Now, submit the job to run in Azure Machine Learning. This time, use create_or_update on ml_client.jobs.

ml_client.jobs.create_or_update(job)

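After the job completes, it registers a model in your workspace as a result of training. It also outputs a link for viewing the job in Azure Machine Learning studio.

Warning

Azure Machine Learning runs training scripts by copying the entire source directory. If you have sensitive data that you don't want to upload, use an ignore file (such as .amlignore or .gitignore) or don't include the data in the source directory.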
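Before submitting, it can help to preview which files in the source directory would survive a set of ignore patterns. The sketch below uses fnmatch from the standard library, which only approximates gitignore-style matching. The helper files_to_upload and the patterns shown are illustrative assumptions, not part of the SDK.

```python
import fnmatch
import os
import tempfile


def files_to_upload(source_dir, ignore_patterns):
    """Return relative paths under source_dir that match no ignore pattern."""
    kept = []
    for root, _dirs, files in os.walk(source_dir):
        for file_name in files:
            rel_path = os.path.relpath(os.path.join(root, file_name), source_dir)
            rel_path = rel_path.replace(os.sep, "/")
            # Check the pattern against both the relative path and the bare name.
            ignored = any(
                fnmatch.fnmatch(rel_path, pattern)
                or fnmatch.fnmatch(file_name, pattern)
                for pattern in ignore_patterns
            )
            if not ignored:
                kept.append(rel_path)
    return sorted(kept)


# Demo on a throwaway directory holding one script and one secrets file.
with tempfile.TemporaryDirectory() as src:
    for name in ("pytorch_train.py", "secrets.env"):
        with open(os.path.join(src, name), "w") as handle:
            handle.write("placeholder")
    preview = files_to_upload(src, ["*.env", "data/*"])
    print(preview)
```

Treat the result as a rough check only. The service applies its own ignore rules at upload time.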

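What happens during job execution

As the job runs, it goes through the following stages:

Preparing: The process creates a Docker image according to the environment you defined. It uploads the image to the workspace's container registry and caches it for later runs. The process also streams logs to the job history so you can view them and monitor progress. If you specify a curated environment, the process uses the cached image backing that curated environment.
Scaling: If the cluster needs more nodes to execute the run than are currently available, the cluster attempts to scale up.
Running: The process uploads all scripts in the src script folder to the compute target. It mounts or copies the data stores. It runs the script. The process streams output from stdout and the ./logs folder to the job history. You can use these outputs to monitor the job.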

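Tune model hyperparameters

You trained the model with one set of parameters. Now, see whether you can further improve the model's accuracy. Use the sweep capability of Azure Machine Learning to tune and optimize the model's hyperparameters.

To tune the model's hyperparameters, define the parameter space to search during training. Replace some of the parameters passed to the training job with special inputs from the azure.ai.ml.sweep package.

Because the training script uses a learning rate schedule to decay the learning rate as the epochs progress, you can tune the initial learning rate and the momentum parameters.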

from azure.ai.ml.sweep import Uniform

# Reuse the command job created earlier. Calling it as a function lets us override its inputs.
job_for_sweep = job(
    learning_rate=Uniform(min_value=0.0005, max_value=0.005),
    momentum=Uniform(min_value=0.9, max_value=0.99),
)
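To get a feel for this search space, the sketch below draws a few configurations uniformly from the same two ranges by using Python's random module. It's a local illustration of uniform sampling only, not the sweep service's sampler. The helper sample_trial, the seed, and the number of draws are arbitrary choices.

```python
import random


def sample_trial(search_space, rng):
    """Draw one configuration; each value is uniform within its (low, high) range."""
    return {name: rng.uniform(low, high) for name, (low, high) in search_space.items()}


search_space = {
    "learning_rate": (0.0005, 0.005),
    "momentum": (0.9, 0.99),
}
rng = random.Random(0)  # fixed seed so the sketch is repeatable
trials = [sample_trial(search_space, rng) for _ in range(8)]
for trial in trials:
    print(trial)
```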

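Then, configure the sweep on the command job with some sweep-specific parameters, such as the primary metric to watch and the sampling algorithm to use.

In the following code, random sampling tries different sets of hyperparameter values to maximize the primary metric, best_val_acc.

You can also define an early termination policy, BanditPolicy, to end poorly performing runs early. BanditPolicy terminates any run that falls outside the slack factor of the primary evaluation metric. The policy is applied every epoch, because the best_val_acc metric is reported every epoch and evaluation_interval=1. The first policy evaluation is delayed until after the first 10 epochs (delay_evaluation=10).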

from azure.ai.ml.sweep import BanditPolicy

sweep_job = job_for_sweep.sweep(
    compute="gpu-cluster",
    sampling_algorithm="random",
    primary_metric="best_val_acc",
    goal="Maximize",
    max_total_trials=8,
    max_concurrent_trials=4,
    early_termination_policy=BanditPolicy(
        slack_factor=0.15, evaluation_interval=1, delay_evaluation=10
    ),
)
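The effect of slack_factor can be worked through with plain arithmetic. The sketch below assumes the rule for a maximized metric is that a run is stopped when its metric falls below best_metric / (1 + slack_factor), and that the policy applies at intervals that are multiples of evaluation_interval and at least delay_evaluation. Both the helper bandit_should_stop and that reading of the delay are assumptions for illustration.

```python
def bandit_should_stop(run_metric, best_metric, interval,
                       slack_factor=0.15, evaluation_interval=1, delay_evaluation=10):
    """Return True if a run would be stopped when the goal is to maximize.

    Assumed rule: at an evaluated interval, stop a run whose metric is below
    best_metric / (1 + slack_factor).
    """
    if interval < delay_evaluation or interval % evaluation_interval != 0:
        return False  # the policy is not applied at this interval
    return run_metric < best_metric / (1 + slack_factor)


best = 0.92
cutoff = best / (1 + 0.15)
print(round(cutoff, 3))                              # cutoff for this interval
print(bandit_should_stop(0.75, best, interval=12))   # below the cutoff
print(bandit_should_stop(0.85, best, interval=12))   # within the slack
print(bandit_should_stop(0.10, best, interval=5))    # still inside the delay window
```

With a best accuracy of 0.92 and a slack factor of 0.15, the cutoff works out to about 0.8, so a run reporting 0.75 would be stopped and a run reporting 0.85 would continue.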

Now, submit this job as before. This time, you run a sweep job that sweeps over your training job.

returned_sweep_job = ml_client.create_or_update(sweep_job)

# stream the output and wait until the job is finished
ml_client.jobs.stream(returned_sweep_job.name)

# refresh the latest status of the job after streaming
returned_sweep_job = ml_client.jobs.get(name=returned_sweep_job.name)

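You can monitor the job by using the studio user interface link that's shown while the job runs.

Find the best model

After all the runs complete, find the run that produced the model with the highest accuracy.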

from azure.ai.ml.entities import Model

if returned_sweep_job.status == "Completed":

    # First let us get the run which gave us the best result
    best_run = returned_sweep_job.properties["best_child_run_id"]

    # lets get the model from this run
    model = Model(
        # the script stores the model as "outputs"
        path="azureml://jobs/{}/outputs/artifacts/paths/outputs/".format(best_run),
        name="run-model-example",
        description="Model created from run.",
        type="custom_model",
    )

else:
    print(
        "Sweep job status: {}. Please wait until it completes".format(
            returned_sweep_job.status
        )
    )
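Conceptually, the best child run is the trial whose primary metric is best for the chosen goal. The sketch below shows that selection on made-up data and then formats the artifact path the same way as the code above. The run IDs, metric values, and the helper best_trial are hypothetical.

```python
def best_trial(metrics_by_run, goal="Maximize"):
    """Return the run ID with the best primary metric for the given goal."""
    chooser = max if goal == "Maximize" else min
    return chooser(metrics_by_run, key=metrics_by_run.get)


# Hypothetical run IDs and best_val_acc values, for illustration only.
metrics_by_run = {"trial_0": 0.91, "trial_1": 0.95, "trial_2": 0.89}
best_run_id = best_trial(metrics_by_run)
model_path = "azureml://jobs/{}/outputs/artifacts/paths/outputs/".format(best_run_id)
print(best_run_id, model_path)
```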

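Deploy the model as an online endpoint

You can now deploy your model as an online endpoint, that is, as a web service in the Azure cloud.

To deploy a machine learning service, you typically need:

The model assets that you want to deploy. These assets include the model's file and the metadata that you already registered in your training job.
Some code to run as a service. The code executes the model on a given input request (an entry script). This entry script receives data submitted to the deployed web service and passes it to the model. After the model processes the data, the script returns the model's response to the client. The script is specific to your model and must understand the data that the model expects and returns. When you use an MLflow model, Azure Machine Learning automatically creates this script for you.

For more information about deploying and scoring machine learning models with managed online endpoints by using the Python SDK v2, see Deploy and score a machine learning model with the Python SDK v2.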
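An entry script for an online deployment commonly defines an init() function that loads the model once and a run() function that handles each request. The skeleton below sketches that shape without PyTorch: the model is a placeholder that classifies by the sign of the mean input value, and the label order in CLASS_NAMES is an assumption. A real score.py would load the trained network from the folder named by the AZUREML_MODEL_DIR environment variable.

```python
import json
import os

CLASS_NAMES = ["chicken", "turkey"]  # assumed label order
model = None


def _flatten(values):
    """Yield every number in an arbitrarily nested list."""
    for value in values:
        if isinstance(value, list):
            yield from _flatten(value)
        else:
            yield value


def _placeholder_model(batch):
    """Stand-in for a trained network: class 1 if the mean input is >= 0."""
    predictions = []
    for item in batch:
        pixels = list(_flatten(item))
        predictions.append(1 if sum(pixels) / len(pixels) >= 0 else 0)
    return predictions


def init():
    """Runs once when the deployment starts; load the model here."""
    global model
    model_dir = os.getenv("AZUREML_MODEL_DIR", ".")
    print(f"A real script would load the model from {model_dir}")
    model = _placeholder_model


def run(raw_data):
    """Runs per request: parse the JSON body, predict, return a JSON-friendly result."""
    batch = json.loads(raw_data)["data"]
    return [{"label": CLASS_NAMES[index]} for index in model(batch)]


init()
result = run(json.dumps({"data": [[[[0.5, 0.2]]], [[[-0.4, -0.1]]]]}))
print(result)
```

The request body uses a top-level "data" key to match the JSON payload built later in this article.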

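Create a new online endpoint

The first step in deploying your model is to create an online endpoint. The endpoint name must be unique within the entire Azure region. In this article, you create a unique name by using a universally unique identifier (UUID).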

import uuid

# Creating a unique name for the endpoint
online_endpoint_name = "aci-birds-endpoint-" + str(uuid.uuid4())[:8]

from azure.ai.ml.entities import ManagedOnlineEndpoint

# create an online endpoint
endpoint = ManagedOnlineEndpoint(
    name=online_endpoint_name,
    description="Classify turkey/chickens using transfer learning with PyTorch",
    auth_mode="key",
    tags={"data": "birds", "method": "transfer learning", "framework": "pytorch"},
)

endpoint = ml_client.begin_create_or_update(endpoint).result()

print(f"Endpoint {endpoint.name} provisioning state: {endpoint.provisioning_state}")

After you create the endpoint, retrieve it as follows:

endpoint = ml_client.online_endpoints.get(name=online_endpoint_name)

print(
    f'Endpoint "{endpoint.name}" with provisioning state "{endpoint.provisioning_state}" is retrieved'
)

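Deploy the model to the endpoint

Deploy the model with the entry script. An endpoint can have multiple deployments. By using rules, the endpoint can direct traffic to these deployments.

In the following code, you create a single deployment that handles 100% of the incoming traffic. The code uses an arbitrary color name, blue (aci-blue), for the deployment. You can use any other name, such as green or red, for the deployment.

The code to deploy the model to the endpoint:

Deploys the best version of the model that you registered earlier.
Scores the model by using the score.py file.
Uses the curated environment that you specified earlier to perform inferencing.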

from azure.ai.ml.entities import (
    ManagedOnlineDeployment,
    Model,
    Environment,
    CodeConfiguration,
)

online_deployment_name = "aci-blue"

# create an online deployment.
blue_deployment = ManagedOnlineDeployment(
    name=online_deployment_name,
    endpoint_name=online_endpoint_name,
    model=model,
    environment=curated_env_name,
    code_configuration=CodeConfiguration(code="./score/", scoring_script="score.py"),
    instance_type="Standard_NC6s_v3",
    instance_count=1,
)

blue_deployment = ml_client.begin_create_or_update(blue_deployment).result()
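The idea of directing traffic to deployments by percentage can be simulated locally. The sketch below picks a deployment name with probability proportional to its traffic share. It's only an illustration, since the service does the actual routing. The helper pick_deployment and the second deployment name aci-green are hypothetical.

```python
import random


def pick_deployment(traffic, rng):
    """Choose a deployment name with probability proportional to its traffic share."""
    names = list(traffic)
    return rng.choices(names, weights=[traffic[name] for name in names], k=1)[0]


rng = random.Random(42)
single = {"aci-blue": 100}
split = {"aci-blue": 90, "aci-green": 10}  # "aci-green" is a hypothetical second deployment

# With one deployment at 100%, every request lands on it.
only = {pick_deployment(single, rng) for _ in range(1000)}
print(only)

# With a 90/10 split, most requests still reach the first deployment.
counts = {name: 0 for name in split}
for _ in range(1000):
    counts[pick_deployment(split, rng)] += 1
print(counts)
```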

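Note

Expect this deployment to take some time to finish.

Test the deployed model

Now that you deployed the model to the endpoint, use the invoke method on the endpoint to predict the output of the deployed model.

To test the endpoint, use a sample image for prediction. First, display the image.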

# install pillow if PIL cannot be imported
# %pip install pillow
import json
from PIL import Image
import matplotlib.pyplot as plt

%matplotlib inline
plt.imshow(Image.open("test_img.jpg"))

Create a function to format and resize the image.

# install torch and torchvision if needed
#%pip install torch
#%pip install torchvision

import torch
from torchvision import transforms


def preprocess(image_file):
    """Preprocess the input image."""
    data_transforms = transforms.Compose(
        [
            transforms.Resize(256),
            transforms.CenterCrop(224),
            transforms.ToTensor(),
            transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
        ]
    )

    image = Image.open(image_file)
    # ToTensor already returns a tensor, so only the batch dimension is added
    image = data_transforms(image).float()
    image = image.unsqueeze(0)
    return image.numpy()
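The Normalize step in preprocess maps each channel value x to (x - mean) / std. The sketch below applies that formula to single RGB pixels in plain Python, using the same means and standard deviations as the function above. The helper normalize_pixel is illustrative only.

```python
MEAN = [0.485, 0.456, 0.406]
STD = [0.229, 0.224, 0.225]


def normalize_pixel(rgb):
    """Apply (value - mean) / std per channel to one RGB pixel scaled to [0, 1]."""
    return [(value - mean) / std for value, mean, std in zip(rgb, MEAN, STD)]


mean_pixel = normalize_pixel([0.485, 0.456, 0.406])  # a pixel equal to the channel means
white_pixel = normalize_pixel([1.0, 1.0, 1.0])       # a fully white pixel
print(mean_pixel)
print(white_pixel)
```

A pixel equal to the channel means maps to zeros, and brighter pixels map to positive values, which keeps the input on the scale these constants assume.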

Format the image and convert it to a JSON file.

image_data = preprocess("test_img.jpg")
input_data = json.dumps({"data": image_data.tolist()})
with open("request.json", "w") as outfile:
    outfile.write(input_data)
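Before calling the endpoint, a quick sanity check on the request body can catch shape mistakes early. The sketch below assumes an RGB test image, so that preprocess yields a batch of shape (1, 3, 224, 224). The helpers nested_shape and check_request are hypothetical, and the zero-filled payload stands in for the real image data.

```python
import json


def nested_shape(values):
    """Return the shape of a regular nested list, for example (1, 3, 224, 224)."""
    shape = []
    while isinstance(values, list):
        shape.append(len(values))
        values = values[0] if values else None
    return tuple(shape)


def check_request(payload_text, expected=(1, 3, 224, 224)):
    """Parse a request body and confirm its 'data' field has the expected shape."""
    shape = nested_shape(json.loads(payload_text)["data"])
    if shape != expected:
        raise ValueError(f"Expected shape {expected}, got {shape}")
    return shape


# Build a zero-filled stand-in for the preprocessed image.
dummy = [[[[0.0] * 224 for _ in range(224)] for _ in range(3)]]
checked = check_request(json.dumps({"data": dummy}))
print(checked)
```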

Invoke the endpoint with this JSON and print the result.

# test the blue deployment
result = ml_client.online_endpoints.invoke(
    endpoint_name=online_endpoint_name,
    request_file="request.json",
    deployment_name=online_deployment_name,
)

print(result)

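Clean up resources

If you no longer need the endpoint, delete it to stop using the resource. Before you delete the endpoint, make sure no other deployments are using it.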

ml_client.online_endpoints.begin_delete(name=online_endpoint_name)

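Note

Expect this cleanup to take some time to finish.

This article showed how to train and register a deep learning neural network by using PyTorch on Azure Machine Learning. You also deployed the model to an online endpoint. To learn more about Azure Machine Learning, see the following articles: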

Last updated on 2026-04-22