Use Hyperopt with HorovodRunner and Apache Spark MLlib

Hyperopt is typically used to optimize objective functions that can be evaluated on a single machine. However, you can also use Hyperopt to optimize objective functions that require distributed training to evaluate. In this case, Hyperopt generates trials with different hyperparameter settings on the driver node, and each trial is evaluated by a distributed training algorithm that can use your full cluster. This setup works with any distributed machine learning algorithm or library, including:

  • Apache Spark MLlib: the Apache Spark scalable machine learning library, consisting of common learning algorithms and utilities.
  • HorovodRunner: a general API for running distributed deep learning workloads on Azure Databricks using the Horovod framework. By integrating Horovod with Spark's barrier mode, Azure Databricks provides higher stability for long-running deep learning training jobs on Spark.

The following section shows an example of hyperparameter tuning for distributed training using Hyperopt with HorovodRunner.

How to use Hyperopt with HorovodRunner

At a high level, you launch HorovodRunner in distributed mode, that is, with np greater than 0 (HorovodRunner(np>0)), inside the objective function that you pass to Hyperopt.

The low-level technical details are as follows. In Hyperopt, all trials are evaluated sequentially on the Spark driver node. HorovodRunner, in contrast, is launched on the Spark driver node but distributes each training job to Spark worker nodes. HorovodRunner then collects the return values on the driver node and passes them to Hyperopt.
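The flow described above can be sketched as follows. This is a minimal illustration rather than the notebook's code: `train_fn`, `make_objective`, `run_search`, the search space, and the worker count are all hypothetical names and values chosen for the example, and the training function returns a made-up loss instead of training a real model. The sketch assumes a cluster where `hyperopt` and `sparkdl` are installed, and that `HorovodRunner.run` returns the value produced by the training function.

```python
def train_fn(learning_rate):
    """Hypothetical training function that HorovodRunner runs on the workers.

    A real version would initialize Horovod, build and train a model, and
    return the validation loss. This placeholder returns a made-up loss so
    that the control flow stays visible.
    """
    return (learning_rate - 0.01) ** 2


def make_objective(run_distributed):
    """Build a Hyperopt objective that evaluates one trial via distributed training.

    `run_distributed` is any callable with the shape run_distributed(fn, **kwargs);
    in run_search below it wraps HorovodRunner.
    """
    def objective(params):
        # Launched from the driver; the training itself runs on the workers.
        loss = run_distributed(train_fn, learning_rate=params["learning_rate"])
        # "ok" is the value of hyperopt.STATUS_OK.
        return {"loss": loss, "status": "ok"}

    return objective


def run_search(max_evals=8, num_workers=2):
    """Run the search; intended for a cluster with hyperopt and sparkdl installed."""
    from hyperopt import Trials, fmin, hp, tpe
    from sparkdl import HorovodRunner

    def run_distributed(fn, **kwargs):
        # np > 0 selects distributed mode on that many worker processes.
        return HorovodRunner(np=num_workers).run(fn, **kwargs)

    # Assumed search space: learning rate sampled log-uniformly.
    space = {"learning_rate": hp.loguniform("learning_rate", -7, -2)}

    # Regular Trials: trials are generated and evaluated one at a time on the driver.
    return fmin(
        fn=make_objective(run_distributed),
        space=space,
        algo=tpe.suggest,
        max_evals=max_evals,
        trials=Trials(),
    )
```

Calling `run_search()` from a notebook attached to such a cluster would run the trials one after another, with each trial using the workers for its training job.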

Note

Azure Databricks does not support automatic logging to MLflow with regular trials, so you must call MLflow manually to log Hyperopt trials.
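One possible way to do this manual logging is to wrap the objective function so that each trial records its parameters and loss in a nested MLflow run. The wrapper below is a sketch under assumptions: `with_mlflow_logging` and the `tracker` parameter are hypothetical names, and `tracker` is expected to be the `mlflow` module, passed in as an argument so the wrapper does not import it directly. It relies only on `mlflow.start_run(nested=True)`, `mlflow.log_params`, and `mlflow.log_metric`.

```python
def with_mlflow_logging(objective, tracker):
    """Wrap a Hyperopt objective so that every trial is logged manually.

    `objective` must return a dict containing a "loss" entry.
    `tracker` is assumed to be the mlflow module.
    """
    def wrapped(params):
        # One nested run per trial, so trials group under a parent run.
        with tracker.start_run(nested=True):
            tracker.log_params(params)
            result = objective(params)
            tracker.log_metric("loss", result["loss"])
        return result

    return wrapped
```

In a notebook, this might be used as `fmin(fn=with_mlflow_logging(objective, mlflow), ...)` inside an enclosing `with mlflow.start_run():` block, so that all trials appear as child runs of a single parent run.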

Hyperopt and HorovodRunner distributed training notebook

Get notebook