horovod.spark: distributed deep learning with Horovod
Important
Horovod and HorovodRunner are now deprecated and will not be pre-installed in Databricks Runtime 16.0 ML and above. For distributed deep learning, Databricks recommends using TorchDistributor for distributed training with PyTorch or the tf.distribute.Strategy API for distributed training with TensorFlow.
Learn how to use the horovod.spark package to perform distributed training of machine learning models.
horovod.spark on Azure Databricks
Azure Databricks supports the horovod.spark package, which provides an estimator API that you can use in ML pipelines with Keras and PyTorch. For details, see Horovod on Spark, which includes a section on Horovod on Databricks.
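As a rough illustration of the estimator API mentioned above, the following sketch wraps a Keras model in a Horovod Spark estimator. It assumes the KerasEstimator and Store classes from the Horovod on Spark documentation; the helper names, the column names, and the store path are hypothetical and only serve the example.

```python
def estimator_kwargs(feature_cols, label_cols, num_proc=2, batch_size=32, epochs=1):
    """Collect the keyword arguments used by the estimator sketch below."""
    return {
        "num_proc": num_proc,
        "feature_cols": list(feature_cols),
        "label_cols": list(label_cols),
        "batch_size": batch_size,
        "epochs": epochs,
    }


def build_keras_estimator(model, optimizer, loss, store_path, **kwargs):
    # Imported lazily so this module can be loaded where Horovod is absent.
    from horovod.spark.common.store import Store
    from horovod.spark.keras import KerasEstimator

    # The store holds intermediate training data and checkpoints.
    store = Store.create(store_path)
    return KerasEstimator(
        model=model, optimizer=optimizer, loss=loss, store=store, **kwargs
    )
```

On a cluster, a call might look like build_keras_estimator(model, optimizer, "mse", "/dbfs/tmp/horovod_store", **estimator_kwargs(["features"], ["label"])), followed by estimator.fit(train_df) on a Spark DataFrame. The path and column names here are placeholders, not values from this article.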
Note
- Azure Databricks installs the horovod package with dependencies. If you upgrade or downgrade these dependencies, there might be compatibility issues.
- When using horovod.spark with custom callbacks in Keras, you must save models in the TensorFlow SavedModel format.
  - With TensorFlow 2.x, use the .tf suffix in the file name.
  - With TensorFlow 1.x, set the option save_weights_only=True.
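The checkpoint naming rule in the note above could be applied along these lines. This is a minimal sketch, assuming the Keras ModelCheckpoint callback from a Keras 2-era TensorFlow (newer Keras releases may reject the .tf suffix); the helper names and the file name "checkpoint" are hypothetical.

```python
import os


def checkpoint_path(ckpt_dir, tf_major=2):
    """Return a checkpoint path; under TensorFlow 2.x it carries the .tf suffix."""
    name = "checkpoint.tf" if tf_major >= 2 else "checkpoint"
    return os.path.join(ckpt_dir, name)


def make_checkpoint_callback(ckpt_dir):
    # Imported lazily so the helper above works without TensorFlow installed.
    import tensorflow as tf

    tf_major = int(tf.__version__.split(".")[0])
    return tf.keras.callbacks.ModelCheckpoint(
        filepath=checkpoint_path(ckpt_dir, tf_major),
        # Per the note above, TensorFlow 1.x uses save_weights_only=True.
        save_weights_only=(tf_major < 2),
    )
```

The returned callback can then be included in the list of callbacks used for training.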
Requirements
Databricks Runtime ML 7.4 or above.
Note
horovod.spark does not support pyarrow versions 11.0 and above (see the relevant GitHub issue). Databricks Runtime 15.0 ML includes pyarrow version 14.0.1. To use horovod.spark with Databricks Runtime 15.0 ML or above, you must manually install pyarrow, specifying a version below 11.0.
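To confirm that the installed pyarrow satisfies this constraint before training, a small check like the following may help. It is a sketch using only the Python standard library; the function names are hypothetical. The install itself would typically be done separately, for example with a notebook command such as %pip install "pyarrow<11.0".

```python
from importlib import metadata


def pyarrow_is_compatible(version):
    """Return True if the given pyarrow version string is below 11.0."""
    major = int(version.split(".")[0])
    return major < 11


def installed_pyarrow_is_compatible():
    # Returns None when pyarrow is not installed at all.
    try:
        return pyarrow_is_compatible(metadata.version("pyarrow"))
    except metadata.PackageNotFoundError:
        return None
```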
Example: Distributed training function
Here is a basic example that runs a distributed training function using horovod.spark:
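def train():
    # Import inside the function so the import happens on each worker.
    import horovod.tensorflow as hvd
    hvd.init()

import horovod.spark

# Run train() on 2 parallel worker processes.
horovod.spark.run(train, num_proc=2)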
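A slightly extended variant of the example above returns a value from each worker so that the driver can verify that every process took part. This sketch assumes, per the Horovod on Spark documentation, that horovod.spark.run returns a list with one result per rank; the summarize and run_distributed helpers are hypothetical.

```python
def train():
    # Imported inside the function so the import happens on each worker.
    import horovod.tensorflow as hvd

    hvd.init()
    # Report this worker's rank and the total number of workers.
    return hvd.rank(), hvd.size()


def summarize(results):
    """Return the sorted ranks found in a list of (rank, size) tuples."""
    return sorted(rank for rank, _ in results)


def run_distributed(num_proc=2):
    # Requires a Spark cluster with Horovod installed.
    import horovod.spark

    results = horovod.spark.run(train, num_proc=num_proc)
    return summarize(results)
```

With num_proc=2, the summary is expected to list ranks 0 and 1 if both workers started.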
Example notebooks: Horovod Spark estimators using Keras and PyTorch
The following notebooks demonstrate how to use the Horovod Spark Estimator API with Keras and PyTorch.