Distributed training

When possible, Azure Databricks recommends that you train neural networks on a single machine: distributed code for training and inference is more complex than single-machine code, and communication overhead makes it slower. However, consider distributed training and inference if your model or your data is too large to fit in memory on a single machine.
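As a rough way to reason about "too large to fit in memory", the following sketch estimates a training footprint and compares it with the memory of one machine. Everything in it is an assumption made for illustration and does not come from Azure Databricks guidance: the function names, the 4 bytes per parameter (float32), the factor of 4 state copies (weights, gradients, and two optimizer moments, as with Adam), and the 80% headroom. Activation memory is ignored, so treat the result as a lower bound.

```python
def estimate_training_bytes(num_params, bytes_per_param=4, state_copies=4):
    """Rough lower bound on model memory during training.

    Assumed defaults: float32 parameters (4 bytes) and 4 copies of the
    parameter tensor (weights, gradients, two optimizer moments).
    Activations are not counted.
    """
    return num_params * bytes_per_param * state_copies


def fits_on_single_machine(num_params, dataset_bytes, machine_memory_bytes,
                           headroom=0.8):
    """Return True if the estimated footprint fits in the usable memory.

    `headroom` is a hypothetical safety factor: only this fraction of the
    machine's memory is treated as available to the training job.
    """
    needed = estimate_training_bytes(num_params) + dataset_bytes
    return needed <= machine_memory_bytes * headroom
```

If a check like this fails by a wide margin, distributed training is worth considering; if it passes, single-machine training is likely the simpler and faster choice.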

Note

Accelerated networking is not available on the GPU VMs supported by Azure Databricks. Therefore, we do not recommend running distributed DL training on a multi-node GPU cluster. Instead, use a single multi-GPU node or a multi-node CPU cluster for distributed DL training.

Horovod

Horovod is a distributed training framework for TensorFlow, Keras, and PyTorch. Azure Databricks supports distributed DL training using HorovodRunner.
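A minimal sketch of what a HorovodRunner job can look like with PyTorch is shown here. It assumes a Databricks Runtime ML cluster where the sparkdl, horovod, and torch packages are available; the tiny linear model, the random data, and the names train and launch are illustrative placeholders, not part of any official example.

```python
def train(learning_rate=0.01, epochs=1):
    """Training function that HorovodRunner executes in every worker process."""
    # Imports live inside the function because it is shipped to the workers.
    import torch
    import horovod.torch as hvd

    hvd.init()  # set up communication between the Horovod processes

    model = torch.nn.Linear(10, 1)  # placeholder model
    # Scaling the learning rate by the worker count is a common convention.
    optimizer = torch.optim.SGD(model.parameters(),
                                lr=learning_rate * hvd.size())

    # Start every worker from the same weights and optimizer state.
    hvd.broadcast_parameters(model.state_dict(), root_rank=0)
    hvd.broadcast_optimizer_state(optimizer, root_rank=0)

    # Wrap the optimizer so gradients are averaged across workers.
    optimizer = hvd.DistributedOptimizer(
        optimizer, named_parameters=model.named_parameters())

    x, y = torch.randn(64, 10), torch.randn(64, 1)  # placeholder data
    for _ in range(epochs):
        optimizer.zero_grad()
        loss = torch.nn.functional.mse_loss(model(x), y)
        loss.backward()
        optimizer.step()
    return loss.item()


def launch(num_processes):
    """Run `train` under HorovodRunner with the given process count."""
    from sparkdl import HorovodRunner  # provided by Databricks Runtime ML

    runner = HorovodRunner(np=num_processes)
    return runner.run(train, learning_rate=0.01, epochs=1)
```

In a notebook, calling launch(-2) would run two processes on the driver node, which matches the single multi-GPU node setup, while a positive value such as launch(2) distributes the processes across the cluster's workers.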

Requirements

Databricks Runtime ML.

Use Horovod

The following pages provide general information about distributed deep learning with Horovod and example notebooks illustrating how to use HorovodRunner.

Troubleshoot Horovod installation

Problem: Importing horovod.{torch|tensorflow} raises ImportError: Extension horovod.{torch|tensorflow} has not been built

Solution: Horovod comes pre-installed on Databricks Runtime ML, so this error typically occurs if updating an environment goes wrong. The error indicates that Horovod was installed before a required library (PyTorch or TensorFlow). Because Horovod is compiled during installation, horovod.{torch|tensorflow} is not compiled if those packages are not present when Horovod is installed. To fix the issue, follow these steps:

  1. Verify that you are on a Databricks Runtime ML cluster.
  2. Ensure that the PyTorch or TensorFlow package is already installed.
  3. Uninstall Horovod (%pip uninstall -y horovod).
  4. Install cmake (%pip install cmake).
  5. Reinstall Horovod.
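Before working through the steps above, it can help to confirm which case applies. The following sketch uses only the Python standard library to check whether a framework is installed and whether the matching Horovod extension imports cleanly. The function names and the returned messages are hypothetical, and the check does not replace the steps themselves.

```python
import importlib
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module can be found, without importing it."""
    return importlib.util.find_spec(module_name) is not None


def extension_built(framework):
    """Return True if horovod.<framework> imports without an ImportError."""
    try:
        importlib.import_module(f"horovod.{framework}")
    except ImportError:
        # Covers both a missing Horovod package and an unbuilt extension.
        return False
    return True


def diagnose(frameworks=("torch", "tensorflow")):
    """Map the current environment to the troubleshooting step it needs."""
    available = [name for name in frameworks if is_installed(name)]
    if not available:
        return "step 2: install PyTorch or TensorFlow before Horovod"

    broken = [name for name in available if not extension_built(name)]
    if broken:
        names = ", ".join(f"horovod.{name}" for name in broken)
        return f"steps 3-5: reinstall Horovod so that {names} is compiled"

    return "ok: Horovod extensions import for " + ", ".join(available)
```

Running print(diagnose()) in a notebook cell reports which step to start from. Note that when both frameworks are installed but only one is used, an unbuilt extension for the other one is also reported.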

HorovodEstimator

Databricks Runtime ML 6.6 and below support HorovodEstimator, which is similar to HorovodRunner but constrains you to TensorFlow Estimators and Apache Spark ML Pipeline APIs.

spark-tensorflow-distributor

spark-tensorflow-distributor is an open-source native package in TensorFlow for distributed training with TensorFlow on Spark clusters.
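As a minimal sketch, a training job with this package can be launched through its MirroredStrategyRunner class. The example assumes that the spark_tensorflow_distributor and tensorflow packages are installed on the cluster; the one-layer model, the random data, and the names train and launch are placeholders chosen for illustration.

```python
def train():
    """Training function that the runner executes on each Spark task slot."""
    # Imports live inside the function because it is shipped to the workers.
    import tensorflow as tf

    # Placeholder data: 256 random samples with 10 features each.
    features = tf.random.normal((256, 10))
    labels = tf.random.normal((256, 1))
    dataset = tf.data.Dataset.from_tensor_slices((features, labels))
    dataset = dataset.repeat().shuffle(256).batch(32)

    # Shard by data so that each worker reads a different slice.
    options = tf.data.Options()
    options.experimental_distribute.auto_shard_policy = (
        tf.data.experimental.AutoShardPolicy.DATA)
    dataset = dataset.with_options(options)

    model = tf.keras.Sequential(
        [tf.keras.layers.Dense(1, input_shape=(10,))])
    model.compile(optimizer="sgd", loss="mse")
    model.fit(dataset, epochs=1, steps_per_epoch=5, verbose=0)


def launch(num_slots, use_gpu=False):
    """Run `train` across `num_slots` Spark task slots."""
    from spark_tensorflow_distributor import MirroredStrategyRunner

    runner = MirroredStrategyRunner(num_slots=num_slots, use_gpu=use_gpu)
    return runner.run(train)
```

Here use_gpu defaults to False to match a multi-node CPU cluster; on a single multi-GPU node, setting it to True would be the natural choice.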