Distributed Hyperopt and automated MLflow tracking

Databricks Runtime for Machine Learning includes Hyperopt, augmented with an implementation powered by Apache Spark. By using the SparkTrials extension of hyperopt.Trials, you can easily distribute a Hyperopt run without making other changes to your Hyperopt usage. When applying the hyperopt.fmin() function, you pass in the SparkTrials class. SparkTrials can accelerate single-machine tuning by distributing trials to Spark workers.
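As a sketch of what this looks like (illustrative only; it assumes a Databricks Runtime ML cluster, where hyperopt and SparkTrials come preinstalled), the only change from single-machine Hyperopt is passing a SparkTrials instance to fmin():

```python
# Hypothetical minimal example; assumes a Databricks Runtime ML cluster
# where hyperopt (with SparkTrials) is preinstalled.
from hyperopt import fmin, hp, tpe, SparkTrials

def objective(x):
    # Toy loss: minimized at x = 3.
    return (x - 3) ** 2

spark_trials = SparkTrials(parallelism=4)  # evaluate up to 4 trials at once

best = fmin(
    fn=objective,
    space=hp.uniform("x", -10, 10),
    algo=tpe.suggest,
    max_evals=32,
    trials=spark_trials,  # the only change vs. single-machine Hyperopt
)
```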

MLflow is an open source platform for managing the end-to-end machine learning lifecycle. The level of support for MLflow depends on the Databricks Runtime version:

  • Databricks Runtime 5.5 LTS ML
    • Supports automated MLflow tracking for hyperparameter tuning with Hyperopt and SparkTrials in Python. When automated MLflow tracking is enabled and you run fmin() with SparkTrials, hyperparameters and evaluation metrics are automatically logged in MLflow. Without automated MLflow tracking, you must make explicit API calls to log to MLflow. Automated MLflow tracking is enabled by default. To disable it, set the Spark configuration spark.databricks.mlflow.trackHyperopt.enabled to false. You can still use SparkTrials to distribute tuning even without automated MLflow tracking.
    • Includes MLflow, so you do not need to install it separately.
  • Databricks Runtime 6.3 ML (Unsupported) and above support logging to MLflow from workers. You can add custom logging code in the objective function you pass to Hyperopt.
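For example, automated tracking could be disabled with the configuration key named above (a sketch; it assumes a Databricks notebook where `spark` is the active SparkSession):

```python
# Disable automated MLflow tracking for Hyperopt.
# Assumes `spark` is the SparkSession preconfigured in a Databricks notebook.
spark.conf.set("spark.databricks.mlflow.trackHyperopt.enabled", "false")
```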

How to use Hyperopt with SparkTrials

This section describes how to configure the arguments you pass to Hyperopt, best practices for using Hyperopt, and how to troubleshoot issues that may arise when using Hyperopt.


For models created with MLlib, do not use SparkTrials, because the model-building process is automatically parallelized on the cluster. Instead, use the Hyperopt class Trials.

fmin() arguments

The fmin() documentation has detailed explanations for all the arguments. Here are the important ones:

  • fn: The objective function, called with a value generated from the hyperparameter space (space). fn can return the loss as a scalar value or in a dictionary (refer to the Hyperopt docs for details). This is usually where most of your code lives, for example, loss calculation and model training.
  • space: An expression that generates the hyperparameter space Hyperopt searches. A simple example is hp.uniform('x', -10, 10), which defines a single-dimension search space between -10 and 10. Hyperopt provides great flexibility in defining the hyperparameter space. Once you are familiar with Hyperopt, you can use this argument to make your tuning more efficient.
  • algo: The search algorithm Hyperopt uses to search the hyperparameter space (space). Typical values are hyperopt.rand.suggest for random search and hyperopt.tpe.suggest for TPE.
  • max_evals: The number of hyperparameter settings to try, that is, the number of models to fit. This number should be large enough to amortize overhead.
  • max_queue_len: The number of hyperparameter settings Hyperopt should generate ahead of time. Because the Hyperopt TPE generation algorithm can take some time, it can be helpful to increase this beyond the default value of 1, but generally no larger than the SparkTrials setting parallelism.

SparkTrials arguments and implementation

This section describes how to configure the arguments you pass to SparkTrials and implementation aspects of SparkTrials.


  • parallelism: The maximum number of trials to evaluate concurrently. Greater parallelism allows scale-out testing of more hyperparameter settings.
    • Trade-offs: Because Hyperopt proposes new trials based on past results, there is a trade-off between parallelism and adaptivity. For a fixed max_evals (the maximum number of trials evaluated by fmin()), greater parallelism leads to speedups, but lower parallelism may yield better results, since each iteration has access to more past results.
    • Limits: There is currently a hard cap of 128 on parallelism. SparkTrials also checks the cluster's configuration to see how many concurrent tasks Spark allows; if parallelism exceeds this maximum, SparkTrials reduces parallelism to that maximum.
  • timeout: The maximum number of seconds an fmin() call can take. Once this number is exceeded, all runs are terminated and fmin() exits. All information about completed runs is preserved, and the best model is still selected from among them. This argument can save you time and help you control your cluster cost.

The complete SparkTrials API is included in the example notebook. To find it, search for help(SparkTrials).


When defining the objective function fn passed to fmin(), and when selecting a cluster setup, it is helpful to understand how SparkTrials distributes tuning tasks.

In Hyperopt, a trial generally corresponds to fitting one model on one setting of hyperparameters. Hyperopt iteratively generates trials, evaluates them, and repeats.

With SparkTrials, the driver node of your cluster generates new trials, and worker nodes evaluate those trials. Each trial is generated with a Spark job that has one task, and is evaluated in that task on a worker machine. If your cluster is set up to run multiple tasks per worker, then multiple trials may be evaluated at once on that worker.

Manage MLflow runs with SparkTrials

SparkTrials logs tuning results as nested MLflow runs as follows:

  • Main or parent run: The call to fmin() is logged as the “main” run. If there is an active run, SparkTrials logs under this active run and does not end the run when fmin() returns. If there is no active run, SparkTrials creates a new run, logs under it, and ends the run before fmin() returns.
  • Child runs: Each hyperparameter setting tested (a “trial”) is logged as a child run under the main run. For clusters running Databricks Runtime 6.0 ML and above, MLflow log records from workers are also stored under the corresponding child runs.

When calling fmin(), we recommend active MLflow run management; that is, wrap the call to fmin() inside a with mlflow.start_run(): statement. This ensures that each fmin() call is logged under its own MLflow “main” run, and it makes it easier to log extra tags, parameters, or metrics to that run.
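A sketch of this recommended pattern (assuming a Databricks Runtime ML cluster; the objective, tag, and metric names are illustrative):

```python
import mlflow
from hyperopt import fmin, hp, tpe, SparkTrials

def objective(x):
    # Toy loss: minimized at x = 3.
    return (x - 3) ** 2

# Wrapping fmin() in an active run ensures it logs under its own "main" run
# and lets you attach extra tags or metrics to that run.
with mlflow.start_run():
    best = fmin(
        fn=objective,
        space=hp.uniform("x", -10, 10),
        algo=tpe.suggest,
        max_evals=32,
        trials=SparkTrials(parallelism=4),
    )
    mlflow.set_tag("model_family", "toy_quadratic")      # extra tag on the main run
    mlflow.log_metric("best_loss", (best["x"] - 3) ** 2)  # extra metric on the main run
```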


When fmin() is called multiple times within the same active MLflow run, it logs those calls to the same “main” run. To resolve name conflicts for MLflow parameters and tags, conflicting names are mangled by appending a UUID.

When logging from workers in Databricks Runtime 6.3 ML (Unsupported) and above, you do not need to manage runs explicitly in the objective function. Call mlflow.log_param("param_from_worker", x) in the objective function to log a parameter to the child run. You can log parameters, metrics, tags, and artifacts in the objective function.
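For example (a sketch assuming Databricks Runtime 6.3 ML or above, where workers can log directly; the names are illustrative):

```python
import mlflow
from hyperopt import fmin, hp, tpe, SparkTrials

def objective(x):
    loss = (x - 3) ** 2
    # This code runs on a worker; no explicit run management is needed.
    # These calls land in the child run for this trial.
    mlflow.log_param("param_from_worker", x)
    mlflow.log_metric("loss_from_worker", loss)
    return loss

best = fmin(
    fn=objective,
    space=hp.uniform("x", -10, 10),
    algo=tpe.suggest,
    max_evals=16,
    trials=SparkTrials(parallelism=4),
)
```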

The notebook shows distributed Hyperopt + automated MLflow tracking in action.

Distributed Hyperopt + automated MLflow tracking notebook

Get notebook

After you run the actions in the last cell of the notebook, your MLflow UI should display:

Hyperopt MLflow demo

See Model search using distributed Hyperopt and automated MLflow tracking for a demonstration of how to tune the hyperparameters for multiple models and arrive at the best model overall.