Apache Spark MLlib and automated MLflow tracking

MLflow is an open source platform for managing the end-to-end machine learning lifecycle. MLflow supports tracking for machine learning model tuning in Python, R, and Scala. For Python notebooks only, Databricks Runtime and Databricks Runtime for Machine Learning support automated MLflow tracking for Apache Spark MLlib model tuning.

When automated MLflow tracking from MLlib is enabled and you run tuning code that uses CrossValidator or TrainValidationSplit, hyperparameters and evaluation metrics are automatically logged in MLflow. Without automated MLflow tracking, you must make explicit API calls to log to MLflow.

Manage MLflow runs

CrossValidator and TrainValidationSplit log tuning results as nested MLflow runs:

  • Main or parent run: The information for CrossValidator or TrainValidationSplit is logged as the “main” run. If a run is already active, the information is logged under that run, and the run is not ended. If no run is active, a new run is created, the information is logged under it, and the run is ended before fit() returns.
  • Child runs: Each hyperparameter setting tested, together with its evaluation metric, is logged as a child run under the main run.
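The parent and child layout described above can be reproduced by hand with the MLflow tracking API, which may help when reading the runs that automated tracking produces. The following is a minimal sketch and not the code Databricks runs internally: the function name log_nested_results, the "tuning_sketch" tag, and the default metric name are hypothetical, and the results argument is assumed to be a list of (params dict, metric value) pairs.

```python
def log_nested_results(results, metric_name="avg_metric"):
    """Log one parent run with one nested child run per hyperparameter setting.

    `results` is a list of (params_dict, metric_value) pairs (hypothetical shape).
    Returns the run ID of the parent ("main") run.
    """
    import mlflow  # requires the mlflow package

    # Reuse a run that is already active; otherwise start a new one.
    already_active = mlflow.active_run()
    parent = already_active or mlflow.start_run()
    try:
        # Illustrative tag on the parent run (name and value are made up).
        mlflow.set_tag("tuning_sketch", "manual")

        # One child run per hyperparameter setting, nested under the parent.
        for params, metric in results:
            with mlflow.start_run(nested=True):
                mlflow.log_params(params)
                mlflow.log_metric(metric_name, metric)
    finally:
        # End the parent only if this function created it.
        if already_active is None:
            mlflow.end_run()
    return parent.info.run_id
```

For example, log_nested_results([({"regParam": 0.01}, 0.81), ({"regParam": 0.1}, 0.84)]) would create one parent run with two child runs. The metric values here are invented for illustration.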

When calling fit(), we recommend active MLflow run management; that is, wrap the call to fit() inside a “with mlflow.start_run():” statement. This ensures that the information is logged under its own MLflow “main” run, and it makes it easier to log extra tags, params, or metrics to that run.
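As a sketch of this recommendation, the helper below wraps fit() in a “with mlflow.start_run():” block and attaches extra tags to the same run. The function names, the run name, the hyperparameter grid, and the choice of LogisticRegression with BinaryClassificationEvaluator are illustrative assumptions, not taken from the notebook. The training DataFrame is assumed to have the default "features" and "label" columns.

```python
def build_cross_validator():
    """Build a small CrossValidator (illustrative estimator and grid)."""
    from pyspark.ml.classification import LogisticRegression
    from pyspark.ml.evaluation import BinaryClassificationEvaluator
    from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

    lr = LogisticRegression(maxIter=10)
    # Two candidate settings; each would be logged as a child run.
    grid = ParamGridBuilder().addGrid(lr.regParam, [0.01, 0.1]).build()
    return CrossValidator(
        estimator=lr,
        estimatorParamMaps=grid,
        evaluator=BinaryClassificationEvaluator(),
        numFolds=3,
    )


def fit_with_main_run(tuner, train_df, run_name="mllib-tuning", extra_tags=None):
    """Fit a CrossValidator or TrainValidationSplit inside its own MLflow run.

    Returns the fitted model and the ID of the "main" run.
    """
    import mlflow  # requires the mlflow package

    with mlflow.start_run(run_name=run_name) as run:
        # With automated tracking enabled, tuning results are logged
        # under this active run.
        model = tuner.fit(train_df)
        # Extra tags go to the same "main" run.
        if extra_tags:
            mlflow.set_tags(extra_tags)
        return model, run.info.run_id
```

In a notebook, this could be called as model, run_id = fit_with_main_run(build_cross_validator(), train_df, extra_tags={"team": "demo"}), where train_df is a hypothetical training DataFrame.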

Note

When fit() is called multiple times within the same active MLflow run, the results of every call are logged to the same “main” run. To resolve name conflicts for MLflow params and tags, conflicting names are mangled by appending a UUID.
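To illustrate the idea of mangling conflicting names, here is a small standalone sketch. The exact naming scheme that automated tracking uses is not stated here, so the underscore separator, the hex UUID format, and the function name mangle_conflicts are all assumptions made for illustration.

```python
import uuid


def mangle_conflicts(existing_names, new_params):
    """Return a copy of new_params in which names that clash with
    existing_names get a UUID suffix (hypothetical naming scheme)."""
    resolved = {}
    for name, value in new_params.items():
        if name in existing_names:
            # Append a random UUID so the name no longer collides.
            name = f"{name}_{uuid.uuid4().hex}"
        resolved[name] = value
    return resolved
```

With this sketch, a second fit() that logs "numFolds" into a run that already has a "numFolds" param would store it under a name such as "numFolds_" followed by 32 hexadecimal characters, while names that do not conflict are kept unchanged.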

The following Python notebook demonstrates automated MLflow tracking in action.

Automated MLflow tracking notebook

Get notebook

After you perform the actions in the last cell in the notebook, your MLflow UI should display:

MLlib-MLflow demo