Distributed training with Azure Machine Learning

In this article, you learn about distributed training and how Azure Machine Learning supports it for deep learning models.

In distributed training, the workload to train a model is split up and shared among multiple mini processors, called worker nodes. These worker nodes work in parallel to speed up model training. Distributed training can be used for traditional ML models, but it is better suited for compute- and time-intensive tasks, such as deep learning for training deep neural networks.

Deep learning and distributed training

There are two main types of distributed training: data parallelism and model parallelism. For distributed training on deep learning models, the Azure Machine Learning SDK in Python supports integrations with the popular frameworks PyTorch and TensorFlow. Both frameworks employ data parallelism for distributed training, and can leverage Horovod to optimize compute speed.
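As context for how such a job is launched, the sketch below shows one hedged way to submit a distributed PyTorch run with the v1 Azure Machine Learning Python SDK: a ScriptRunConfig combined with a PyTorchConfiguration that requests multiple nodes and processes. The compute target name, source directory, training script, and environment name are placeholders for your own workspace resources, not values taken from this article.

```python
# A minimal sketch, assuming the v1 Azure ML Python SDK (azureml-core).
# "my-cluster", "./src", "train.py", and "my-pytorch-env" are hypothetical
# placeholders for resources that already exist in your workspace.
from azureml.core import Workspace, Experiment, Environment, ScriptRunConfig
from azureml.core.runconfig import PyTorchConfiguration

ws = Workspace.from_config()  # loads the workspace from a local config.json

# Request 2 nodes running 8 worker processes in total (4 per node).
distr_config = PyTorchConfiguration(process_count=8, node_count=2)

src = ScriptRunConfig(
    source_directory="./src",
    script="train.py",                      # your data-parallel training script
    compute_target="my-cluster",            # an existing AmlCompute cluster
    environment=Environment.get(ws, "my-pytorch-env"),
    distributed_job_config=distr_config,
)

run = Experiment(ws, "distributed-training").submit(src)
```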

For ML models that don't require distributed training, see Train models with Azure Machine Learning for the different ways to train models using the Python SDK.

Data parallelism

Data parallelism is the easier of the two distributed training approaches to implement, and it is sufficient for most use cases.

In this approach, the data is divided into partitions, where the number of partitions is equal to the total number of available nodes in the compute cluster. The model is copied to each of these worker nodes, and each worker operates on its own subset of the data. Keep in mind that each node has to have the capacity to support the model that's being trained; that is, the model has to fit entirely on each node. The following diagram provides a visual demonstration of this approach.

Diagram of the data-parallelism concept
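As an illustration of the partitioning described above, the hedged PyTorch sketch below uses DistributedSampler so that each worker process receives its own shard of the data while holding a full copy of the model; the synthetic dataset is purely hypothetical.

```python
# A minimal sketch of data partitioning in PyTorch, assuming the script is
# launched with one process per worker (for example via torchrun).
import torch
import torch.distributed as dist
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

dist.init_process_group(backend="nccl")        # one process per worker
rank, world_size = dist.get_rank(), dist.get_world_size()

# Hypothetical dataset: 10,000 samples with 20 features and a binary label.
dataset = TensorDataset(torch.randn(10_000, 20), torch.randint(0, 2, (10_000,)))

# Each of the world_size workers sees a distinct 1/world_size slice of the data.
sampler = DistributedSampler(dataset, num_replicas=world_size, rank=rank)
loader = DataLoader(dataset, batch_size=64, sampler=sampler)
```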

Each node independently computes the errors between its predictions for its training samples and the labeled outputs. In turn, each node updates its model based on those errors and must communicate all of its changes to the other nodes so they can update their corresponding models. This means that the worker nodes need to synchronize the model parameters, or gradients, at the end of the batch computation to ensure they are training a consistent model.
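A minimal sketch of that synchronization step is shown below: after each batch, the gradients are summed across workers with all_reduce and averaged so that every replica applies the same update. In practice, PyTorch's DistributedDataParallel (or Horovod) performs this communication for you; the model and loss here are illustrative, and the sketch reuses the process group and data loader from the previous example.

```python
# A minimal sketch of gradient synchronization across workers, assuming the
# process group, loader, and dataset shapes from the previous sketch.
import torch
import torch.distributed as dist
import torch.nn as nn

model = nn.Linear(20, 2).cuda()                # full copy of the model on every worker
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()

for features, labels in loader:
    optimizer.zero_grad()
    loss = loss_fn(model(features.cuda()), labels.cuda())
    loss.backward()

    # Average each gradient across all workers so every replica applies the
    # same update and the models stay consistent.
    for param in model.parameters():
        dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)
        param.grad /= dist.get_world_size()

    optimizer.step()
```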

Model parallelism

In model parallelism, also known as network parallelism, the model is segmented into different parts that can run concurrently on different nodes, and each part runs on the same data. The scalability of this method depends on the degree of task parallelization of the algorithm, and it is more complex to implement than data parallelism.

In model parallelism, worker nodes only need to synchronize the shared parameters, usually once for each forward- or backward-propagation step. Also, larger models aren't a concern, since each node operates on a subsection of the model using the same training data.
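As a hedged illustration, the sketch below splits a small PyTorch model into two stages placed on different GPUs; each batch flows through both stages, and the activations at the stage boundary are passed between devices on every forward and backward pass. The layer sizes and device IDs are illustrative only.

```python
# A minimal sketch of model parallelism in PyTorch: two stages of a
# hypothetical model are placed on two different GPUs.
import torch
import torch.nn as nn

class TwoStageModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.stage1 = nn.Sequential(nn.Linear(1024, 512), nn.ReLU()).to("cuda:0")
        self.stage2 = nn.Linear(512, 10).to("cuda:1")

    def forward(self, x):
        # Run the first part on GPU 0, move the activations, finish on GPU 1.
        hidden = self.stage1(x.to("cuda:0"))
        return self.stage2(hidden.to("cuda:1"))

model = TwoStageModel()
logits = model(torch.randn(64, 1024))          # the same batch flows through both stages
```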

Next steps