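Distributed GPU training guide (SDK v2)

APPLIES TO: Python SDK azure-ai-ml v2 (current)

This article describes how to use distributed GPU training code in Azure Machine Learning. You'll learn how to run your existing code, with tips and examples for PyTorch, DeepSpeed, and TensorFlow. You'll also learn how to accelerate distributed GPU training with InfiniBand.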

Tip

More than 90% of the time, you should use distributed data parallelism as your distributed parallelism type.

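Prerequisites

Understand the basic concepts of distributed GPU training, such as data parallelism, distributed data parallelism, and model parallelism.
Install the azure-ai-ml package for Python, version 1.5.0 or later.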

PyTorch

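Azure Machine Learning supports running distributed jobs with PyTorch's native distributed training capabilities, torch.distributed.

For data parallelism, the official PyTorch guidance is to use DistributedDataParallel (DDP) rather than DataParallel for both single-node and multi-node distributed training. PyTorch also recommends DistributedDataParallel over the multiprocessing package. For this reason, Azure Machine Learning documentation and examples focus on DistributedDataParallel training.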

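Process group initialization

The backbone of any distributed training is a group of processes that know each other and can communicate with each other through a backend. For PyTorch, you create the process group by calling torch.distributed.init_process_group in every distributed process, so that the processes collectively form the group.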

torch.distributed.init_process_group(backend='nccl', init_method='env://', ...)

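The most common communication backends are mpi, nccl, and gloo. For GPU-based training, use nccl for the best performance.

To run distributed PyTorch on Azure Machine Learning, use the init_method parameter in your training code. This parameter specifies how each process discovers the other processes, and how they initialize and verify the process group by using the communication backend. By default, if init_method isn't specified, PyTorch uses the environment variable initialization method, env://.

PyTorch looks for the following environment variables for initialization:

MASTER_ADDR: The IP address of the machine that hosts the process with rank 0.
MASTER_PORT: A free port on the machine that hosts the process with rank 0.
WORLD_SIZE: The total number of processes. It should equal the total number of GPU devices used for distributed training.
RANK: The global rank of the current process. Possible values are 0 to (<world size> - 1).

For more information about process group initialization, see the PyTorch documentation.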
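As an illustration of what the env:// method relies on, the following minimal sketch reads and validates the four variables listed above using only the standard library. The helper name read_rendezvous_env and the sample address and port are hypothetical, chosen for this example; they aren't part of PyTorch or Azure Machine Learning.

```python
import os
from typing import Mapping, NamedTuple


class RendezvousInfo(NamedTuple):
    master_addr: str
    master_port: int
    world_size: int
    rank: int


def read_rendezvous_env(env: Mapping[str, str] = os.environ) -> RendezvousInfo:
    """Read and validate the variables that the env:// init method relies on."""
    required = ("MASTER_ADDR", "MASTER_PORT", "WORLD_SIZE", "RANK")
    missing = [name for name in required if name not in env]
    if missing:
        raise KeyError(f"missing environment variables: {', '.join(missing)}")
    info = RendezvousInfo(
        master_addr=env["MASTER_ADDR"],
        master_port=int(env["MASTER_PORT"]),
        world_size=int(env["WORLD_SIZE"]),
        rank=int(env["RANK"]),
    )
    # RANK must fall in 0..(WORLD_SIZE - 1).
    if not 0 <= info.rank < info.world_size:
        raise ValueError(f"RANK {info.rank} is outside 0..{info.world_size - 1}")
    return info


# Hypothetical values for a 2-node job with 1 GPU per node, seen from rank 1.
example_env = {
    "MASTER_ADDR": "10.0.0.4",
    "MASTER_PORT": "6105",
    "WORLD_SIZE": "2",
    "RANK": "1",
}
info = read_rendezvous_env(example_env)
print(info)
```

In a training script, a check like this could run just before the call to torch.distributed.init_process_group(backend='nccl', init_method='env://'), so that a misconfigured job fails with a clear message rather than hanging at rendezvous.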

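Environment variables

Many applications also need the following environment variables:

LOCAL_RANK: The local (relative) rank of the process within its node. Possible values are 0 to (<# of processes on the node> - 1). This information is useful because many operations, such as data preparation, should run only once per node, usually on local_rank = 0.
NODE_RANK: The rank of the node in multi-node training. Possible values are 0 to (<total # of nodes> - 1).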
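The once-per-node pattern described for LOCAL_RANK can be sketched as follows. The function names are hypothetical, and the sketch assumes an unset LOCAL_RANK means a single-process run.

```python
import os
from typing import Callable, Mapping


def local_rank(env: Mapping[str, str] = os.environ) -> int:
    """Return this process's rank within its node (0 if the variable is unset)."""
    return int(env.get("LOCAL_RANK", "0"))


def prepare_data_once_per_node(env: Mapping[str, str], prepare: Callable[[], None]) -> bool:
    """Run `prepare` only on the process with LOCAL_RANK 0; report whether it ran."""
    if local_rank(env) == 0:
        prepare()
        return True
    return False


# Simulate four processes on one node; only the first should prepare the data.
calls = []
ran = [
    prepare_data_once_per_node({"LOCAL_RANK": str(r)}, lambda: calls.append("prepared"))
    for r in range(4)
]
print(ran, calls)
```

In a real job the other processes on the node would usually wait until the preparation finishes, for example at a torch.distributed.barrier() call, before they read the prepared data.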

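Run a distributed PyTorch job

You don't need a launcher utility such as torch.distributed.launch to run a distributed PyTorch job. You can:

Specify the training script and arguments.
Create a command, and in the distribution parameter specify the type as PyTorch and set process_count_per_instance.

process_count_per_instance is the number of processes to launch on each node for your job, and it should typically equal the number of GPUs per node. If process_count_per_instance isn't specified, Azure Machine Learning launches one process per node by default.

Azure Machine Learning sets the MASTER_ADDR, MASTER_PORT, WORLD_SIZE, and NODE_RANK environment variables on each node, and sets the process-level RANK and LOCAL_RANK environment variables.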
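To make the relationship between these settings concrete, the sketch below enumerates the rank-related values each process would be expected to see for a given instance_count and process_count_per_instance. It assumes global ranks are assigned contiguously, node by node, which is the usual convention but isn't stated in this article. The function name is hypothetical.

```python
from typing import Dict, List


def expected_rank_layout(instance_count: int, process_count_per_instance: int) -> List[Dict[str, int]]:
    """List the rank-related variables for every process in the job.

    Assumption: global ranks are contiguous within each node, so
    RANK = NODE_RANK * process_count_per_instance + LOCAL_RANK.
    """
    world_size = instance_count * process_count_per_instance
    layout = []
    for node_rank in range(instance_count):
        for local_rank in range(process_count_per_instance):
            layout.append(
                {
                    "NODE_RANK": node_rank,
                    "LOCAL_RANK": local_rank,
                    "RANK": node_rank * process_count_per_instance + local_rank,
                    "WORLD_SIZE": world_size,
                }
            )
    return layout


# Two nodes with four GPUs each gives a world size of 8.
layout = expected_rank_layout(instance_count=2, process_count_per_instance=4)
for entry in layout:
    print(entry)
```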

PyTorch example

from azure.ai.ml import command
from azure.ai.ml.entities import Data
from azure.ai.ml import Input
from azure.ai.ml import Output
from azure.ai.ml.constants import AssetTypes

# === Note on path ===
# The path can be a local path or a cloud path. Azure Machine Learning supports `https://`, `abfss://`, `wasbs://`, and `azureml://` URIs.
# Local paths are automatically uploaded to the default datastore in the cloud.
# More details on supported paths: https://docs.azure.cn/machine-learning/how-to-read-write-data-v2#supported-paths

inputs = {
    "cifar": Input(
        # returned_job is assumed to be the data-preparation job submitted earlier in the full notebook.
        type=AssetTypes.URI_FOLDER,
        path=returned_job.outputs.cifar.path,
        # Alternatives: path="azureml:azureml_stoic_cartoon_wgb3lgvgky_output_data_cifar:1"
        # or path="azureml://datastores/workspaceblobstore/paths/azureml/stoic_cartoon_wgb3lgvgky/cifar/"
    ),
    "epoch": 10,
    "batchsize": 64,
    "workers": 2,
    "lr": 0.01,
    "momen": 0.9,
    "prtfreq": 200,
    "output": "./outputs",
}

job = command(
    code="./src",  # local path where the code is stored
    command="python train.py --data-dir ${{inputs.cifar}} --epochs ${{inputs.epoch}} --batch-size ${{inputs.batchsize}} --workers ${{inputs.workers}} --learning-rate ${{inputs.lr}} --momentum ${{inputs.momen}} --print-freq ${{inputs.prtfreq}} --model-dir ${{inputs.output}}",
    inputs=inputs,
    environment="azureml:AzureML-pytorch-1.9-ubuntu18.04-py37-cuda11-gpu:6",
    compute="gpu-cluster",  # Change the name to the gpu cluster of your workspace.
    instance_count=2,  # This example assumes a 2-node cluster.
    distribution={
        "type": "PyTorch",
        # set process count to the number of gpus per node
        # NV6 has only 1 GPU
        "process_count_per_instance": 1,
    },
)

For the full notebook that runs the PyTorch example, see azureml-examples: Distributed training with PyTorch on CIFAR-10.
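The command string in the example passes a set of flags to train.py. The script itself isn't shown in this article, so the following is only a hypothetical sketch of how such a script might declare those flags with argparse. The flag names come from the command string and the defaults mirror the example inputs; the real train.py in azureml-examples may differ.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Declare the flags used in the example command string (hypothetical train.py CLI)."""
    parser = argparse.ArgumentParser(description="Hypothetical CLI for train.py")
    parser.add_argument("--data-dir", type=str, required=True, help="folder with the CIFAR data")
    parser.add_argument("--epochs", type=int, default=10)
    parser.add_argument("--batch-size", type=int, default=64)
    parser.add_argument("--workers", type=int, default=2, help="data loader worker processes")
    parser.add_argument("--learning-rate", type=float, default=0.01)
    parser.add_argument("--momentum", type=float, default=0.9)
    parser.add_argument("--print-freq", type=int, default=200)
    parser.add_argument("--model-dir", type=str, default="./outputs")
    return parser


# Parse a sample argument list instead of sys.argv so the sketch runs anywhere.
args = build_parser().parse_args(
    ["--data-dir", "/tmp/cifar", "--epochs", "10", "--learning-rate", "0.01"]
)
print(args.data_dir, args.epochs, args.learning_rate)
```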

DeepSpeed

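Azure Machine Learning supports DeepSpeed as a first-class citizen for running distributed jobs with near-linear scalability as model size and the number of GPUs increase.

You can enable DeepSpeed for distributed training by using either the PyTorch distribution or Message Passing Interface (MPI). Azure Machine Learning supports the DeepSpeed launcher for launching distributed training, as well as autotuning to find an optimal ds configuration.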
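For orientation, a ds configuration is a JSON document. The sketch below builds a small one as a Python dictionary and keeps the batch-size fields consistent, on the understanding that DeepSpeed expects train_batch_size to equal train_micro_batch_size_per_gpu × gradient_accumulation_steps × the number of GPUs. The function name and the chosen values are illustrative assumptions, not a recommended configuration.

```python
import json
from typing import Any, Dict


def make_ds_config(micro_batch_per_gpu: int, grad_accum_steps: int, world_size: int) -> Dict[str, Any]:
    """Build a minimal DeepSpeed-style config with consistent batch-size fields."""
    return {
        "train_micro_batch_size_per_gpu": micro_batch_per_gpu,
        "gradient_accumulation_steps": grad_accum_steps,
        # Global batch size = per-GPU micro batch x accumulation steps x GPU count.
        "train_batch_size": micro_batch_per_gpu * grad_accum_steps * world_size,
        "fp16": {"enabled": True},
        "zero_optimization": {"stage": 2},
    }


# Illustrative values: 2 GPUs in total, micro batch of 8, 4 accumulation steps.
config = make_ds_config(micro_batch_per_gpu=8, grad_accum_steps=4, world_size=2)
print(json.dumps(config, indent=2))
```

Autotuning searches over settings like these, so a hand-written file such as this one is typically only a starting point.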

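For DeepSpeed training jobs, you can use a curated environment that provides an out-of-the-box setup with current technologies, including DeepSpeed, ONNX (Open Neural Network Exchange) Runtime (ORT), the Microsoft Collective Communication Library (MSCCL), and PyTorch.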

DeepSpeed example

For DeepSpeed training and autotuning examples, see https://github.com/Azure/azureml-examples/cli/jobs/deepspeed.

TensorFlow

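If your training code uses native distributed TensorFlow, such as the TensorFlow 2.x tf.distribute.Strategy API, you can launch the distributed job through Azure Machine Learning by using the distribution parameter or the TensorFlowDistribution object.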

from azure.ai.ml import TensorFlowDistribution, command

# Create the command.
job = command(
    code="./src",  # local path where the code is stored
    command="python main.py --epochs ${{inputs.epochs}} --model-dir ${{inputs.model_dir}}",
    inputs={"epochs": 1, "model_dir": "outputs/keras-model"},
    environment="AzureML-tensorflow-2.4-ubuntu18.04-py37-cuda11-gpu@latest",
    compute="cpu-cluster",
    instance_count=2,
    # distribution = {"type": "mpi", "process_count_per_instance": 1},
    distribution={
        "type": "tensorflow",
        "parameter_server_count": 1,
        "worker_count": 2,
        "added_property": 7,
    },
    # distribution = {
    #        "type": "pytorch",
    #        "process_count_per_instance": 4,
    #        "additional_prop": {"nested_prop": 3},
    #    },
    display_name="tensorflow-mnist-distributed-example"
    # experiment_name: tensorflow-mnist-distributed-example
    # description: Train a basic neural network with TensorFlow on the MNIST dataset, distributed via TensorFlow.
)

# You can also set the distribution in a separate step, using the typed object instead of a dict.
job.distribution = TensorFlowDistribution(parameter_server_count=1, worker_count=2)

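If your training script uses the parameter server strategy for distributed training, as in legacy TensorFlow 1.x, you also need to specify the number of parameter servers to use in the job. In the preceding example, you specify "parameter_server_count": 1 and "worker_count": 2 in the distribution parameter.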

TF_CONFIG

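In TensorFlow, the TF_CONFIG environment variable is required for training on multiple machines. For TensorFlow jobs, Azure Machine Learning sets the TF_CONFIG variable appropriately for each worker before it runs your training script.

If you need to, you can access TF_CONFIG from your training script through os.environ['TF_CONFIG'].

The following example shows TF_CONFIG as set on a chief worker node: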

TF_CONFIG='{
    "cluster": {
        "worker": ["host0:2222", "host1:2222"]
    },
    "task": {"type": "worker", "index": 0},
    "environment": "cloud"
}'
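A training script can decode this value with the standard json module. The following sketch parses the example value shown above and extracts the fields a script most often needs. The helper name parse_tf_config is hypothetical, and the chief rule (worker 0 acts as chief when the cluster has no explicit chief entry) reflects common TensorFlow behavior rather than anything stated in this article.

```python
import json
from typing import Any, Dict

# The example value shown above, as it would appear in os.environ['TF_CONFIG'].
EXAMPLE_TF_CONFIG = """{
    "cluster": {
        "worker": ["host0:2222", "host1:2222"]
    },
    "task": {"type": "worker", "index": 0},
    "environment": "cloud"
}"""


def parse_tf_config(raw: str) -> Dict[str, Any]:
    """Extract the cluster size and this process's role from a TF_CONFIG string."""
    config = json.loads(raw)
    cluster = config["cluster"]
    task = config["task"]
    task_type, task_index = task["type"], task["index"]
    # Assumption: without an explicit "chief" entry, worker 0 takes the chief role.
    is_chief = task_type == "chief" or (
        task_type == "worker" and task_index == 0 and "chief" not in cluster
    )
    return {
        "num_workers": len(cluster.get("worker", [])),
        "task_type": task_type,
        "task_index": task_index,
        "address": cluster[task_type][task_index],
        "is_chief": is_chief,
    }


summary = parse_tf_config(EXAMPLE_TF_CONFIG)
print(summary)
```

Inside a job, the same function would be called as parse_tf_config(os.environ['TF_CONFIG']) after importing os.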

TensorFlow example

For the full notebook that runs the TensorFlow example, see tensorflow-mnist-distributed-example.

InfiniBand

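As you increase the number of virtual machines (VMs) that train a model, the time required to train that model should decrease. Ideally, the decrease is linearly proportional to the number of training VMs. For example, if training a model on one VM takes 100 seconds, then training the same model on two VMs should ideally take 50 seconds, on four VMs 25 seconds, and so on.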
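The arithmetic behind ideal linear scaling, and a simple way to compare a measured run against it, can be sketched as follows. The function names and the 31.25-second measurement are hypothetical values chosen for illustration.

```python
def ideal_time(single_vm_seconds: float, vm_count: int) -> float:
    """Training time under perfect linear scaling: T(n) = T(1) / n."""
    if vm_count < 1:
        raise ValueError("vm_count must be at least 1")
    return single_vm_seconds / vm_count


def scaling_efficiency(single_vm_seconds: float, vm_count: int, measured_seconds: float) -> float:
    """Ratio of ideal to measured time; 1.0 means perfectly linear scaling."""
    return ideal_time(single_vm_seconds, vm_count) / measured_seconds


# The figures from the text: 100 seconds on one VM.
ideal = [ideal_time(100, n) for n in (1, 2, 4)]
print(ideal)

# A hypothetical measurement of 31.25 seconds on 4 VMs.
efficiency = scaling_efficiency(100, 4, 31.25)
print(efficiency)
```

An efficiency well below 1.0 usually points to communication overhead between nodes, which is the cost a faster interconnect is meant to reduce.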

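InfiniBand can be an important factor in achieving this linear scaling. It enables low-latency, GPU-to-GPU communication across the nodes in a cluster. InfiniBand requires specialized hardware, such as the Azure VM NC-, ND-, or H-series. These VMs are capable of remote direct memory access (RDMA) and support single root I/O virtualization (SR-IOV) and InfiniBand.

These VMs communicate over the low-latency, high-bandwidth InfiniBand network, which performs much better than Ethernet-based connectivity. SR-IOV for InfiniBand enables near bare-metal performance for any MPI library. MPI is used by many distributed training frameworks and tools, including the NVIDIA Collective Communications Library (NCCL).

These SKUs are designed to meet the needs of computationally intensive, GPU-accelerated machine learning workloads. For more information, see Accelerating Distributed Training in Azure Machine Learning with SR-IOV.

Typically, only VM SKUs with an r in their name (referring to RDMA) contain the required InfiniBand hardware. For example, the VM SKU Standard_NC24rs_v3 is InfiniBand-enabled, but Standard_NC24s_v3 is not. Apart from the InfiniBand capability, the specifications of the two SKUs are largely the same. Both have 24 cores, 448 GB of RAM, 4 GPUs of the same SKU, and so on. For more information about RDMA- and InfiniBand-enabled machine SKUs, see High performance compute.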

Note

The older-generation machine SKU Standard_NC24r is RDMA-enabled, but it doesn't contain the SR-IOV hardware required for InfiniBand.
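The naming rule and its exception can be expressed as a rough heuristic. The sketch below is an assumption-laden illustration only: the regular expression reflects the SKU names quoted in this article, the function names are hypothetical, and the exception set contains just the one SKU called out in the note. The Azure VM size documentation remains the source to check before choosing a size.

```python
import re

# Family letters, core count, then lowercase feature letters, e.g. "NC" + "24" + "rs".
_SKU_PATTERN = re.compile(r"^Standard_([A-Z]+)(\d+)([a-z]*)(?:_.*)?$")

# RDMA-enabled by name, but lacking the SR-IOV hardware InfiniBand needs (per the note above).
KNOWN_WITHOUT_SRIOV = {"Standard_NC24r"}


def name_suggests_rdma(sku: str) -> bool:
    """Return True when the feature letters after the core count include 'r'."""
    match = _SKU_PATTERN.match(sku)
    if match is None:
        raise ValueError(f"unrecognized SKU name: {sku}")
    return "r" in match.group(3)


def likely_infiniband(sku: str) -> bool:
    """Heuristic only: an 'r' in the name, minus the known exception."""
    return name_suggests_rdma(sku) and sku not in KNOWN_WITHOUT_SRIOV


for sku in ("Standard_NC24rs_v3", "Standard_NC24s_v3", "Standard_NC24r"):
    print(sku, likely_infiniband(sku))
```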

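If you create an AmlCompute cluster with one of these RDMA-capable, InfiniBand-enabled sizes, the OS image comes with the Mellanox OpenFabrics Enterprise Distribution (OFED) driver that InfiniBand requires, preinstalled and preconfigured.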

Last updated on 2026-01-23
