Run MLflow Projects on Azure Databricks

An MLflow Project is a format for packaging data science code in a reusable and reproducible way. The MLflow Projects component includes an API and command-line tools for running projects, which also integrate with the Tracking component to automatically record the parameters and Git commit of your source code for reproducibility.

This article describes the format of an MLflow Project and how to run an MLflow project remotely on Azure Databricks clusters using the MLflow CLI, which makes it easy to vertically scale your data science code.

MLflow Project execution is not supported on Databricks Community Edition.

MLflow project format

Any local directory or Git repository can be treated as an MLflow project. The following conventions define a project:

  • The project's name is the name of the directory.
  • The Conda environment is specified in conda.yaml, if present. If no conda.yaml file is present, MLflow uses a Conda environment containing only Python (specifically, the latest Python available to Conda) when running the project.
  • Any .py or .sh file in the project can be an entry point, with no parameters explicitly declared. When you run such a command with a set of parameters, MLflow passes each parameter on the command line using --key <value> syntax, as sketched just below.
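
For example, a minimal sketch of how such a parameter is forwarded; the project directory, train.py script, and alpha parameter are hypothetical:

# Run the train.py entry point with one parameter; MLflow forwards it
# to the script on the command line as: python train.py --alpha 0.5
mlflow run . -e train.py -P alpha=0.5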

You can specify more options by adding an MLproject file, which is a text file in YAML syntax. An example MLproject file looks like this:

name: My Project

conda_env: my_env.yaml

entry_points:
  main:
    parameters:
      data_file: path
      regularization: {type: float, default: 0.1}
    command: "python train.py -r {regularization} {data_file}"
  validate:
    parameters:
      data_file: path
    command: "python validate.py {data_file}"

Run an MLflow project

To run an MLflow project on an Azure Databricks cluster in the default workspace, use the command:

mlflow run <uri> -b databricks --backend-config <json-new-cluster-spec>

where <uri> is a Git repository URI or folder containing an MLflow project and <json-new-cluster-spec> is a JSON document containing a cluster specification. The Git URI should be of the form: https://github.com/<repo>#<project-folder>.

An example cluster specification is:

{
  "spark_version": "5.2.x-scala2.11",
  "num_workers": 1,
  "node_type_id": "Standard_DS3_v2"
}
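
Putting the pieces together, a sketch of a full invocation, assuming the specification above is saved to a local file named cluster-spec.json:

mlflow run https://github.com/<repo>#<project-folder> -b databricks --backend-config cluster-spec.json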

If you need to install libraries on the worker, use the extended cluster specification format, with the cluster specification nested under a new_cluster field and the libraries listed under a libraries field. Note that wheels must be uploaded to DBFS and specified as pypi dependencies. For example:

{
  "new_cluster": {
    "spark_version": "5.2.x-scala2.11",
    "num_workers": 1,
    "node_type_id": "Standard_DS3_v2"
  },
  "libraries": [
    {
      "pypi": {
        "package": "tensorflow"
      }
    },
    {
      "pypi": {
         "package": "/dbfs/path_to_my_lib.whl"
      }
    }
  ]
}
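
One way to get a wheel onto DBFS is with the Databricks CLI; the local and DBFS paths here are hypothetical:

# dbfs:/path_to_my_lib.whl is exposed to the cluster as /dbfs/path_to_my_lib.whl,
# the path referenced in the libraries section above.
databricks fs cp ./dist/my_lib.whl dbfs:/path_to_my_lib.whl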

Important

  • .egg and .jar dependencies are not supported for MLflow projects.
  • Execution of MLflow projects with Docker environments is not supported.
  • You must use a new cluster specification when running an MLflow Project on Databricks. Running Projects against existing clusters is not supported.

Using SparkR

In order to use SparkR in an MLflow Project run, your project code must first install and import SparkR as follows:

# Databricks clusters ship SparkR as a package on the local filesystem;
# install it from there when available, otherwise fall back to CRAN.
if (file.exists("/databricks/spark/R/pkg")) {
    install.packages("/databricks/spark/R/pkg", repos = NULL)
} else {
    install.packages("SparkR")
}

library(SparkR)

Your project can then initialize a SparkR session and use SparkR as normal:

sparkR.session()
...

Example

This example shows how to create an experiment, run the MLflow tutorial project on an Azure Databricks cluster, view the job run output, and view the run in the experiment.

Prerequisites

  1. Install MLflow using pip install mlflow.
  2. Install and configure the Databricks CLI; see the sketch after this list. The Databricks CLI authentication mechanism is required to run jobs on an Azure Databricks cluster.
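
As a sketch, token-based authentication can be configured with the Databricks CLI, which prompts for your workspace URL and a personal access token:

databricks configure --token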

Step 1: Create an experiment

  1. In the Workspace, select Create > MLflow Experiment.

  2. In the Name field, enter Tutorial.

  3. Click Create. Note the Experiment ID. In this example, it is 14622565.

    [Screenshot: Experiment ID]

Step 2: Run the MLflow tutorial project

The following steps set the MLFLOW_TRACKING_URI environment variable and run the project, recording the training parameters, metrics, and the trained model to the experiment noted in the preceding step:

  1. MLFLOW_TRACKING_URI 环境变量设置为 Azure Databricks 工作区。Set the MLFLOW_TRACKING_URI environment variable to the Azure Databricks workspace.

    export MLFLOW_TRACKING_URI=databricks
    
  2. Run the MLflow tutorial project, training a wine model. Replace <experiment-id> with the Experiment ID you noted in the preceding step.

    mlflow run https://github.com/mlflow/mlflow#examples/sklearn_elasticnet_wine -b databricks --backend-config cluster-spec.json --experiment-id <experiment-id>
    
    === Fetching project from https://github.com/mlflow/mlflow#examples/sklearn_elasticnet_wine into /var/folders/kc/l20y4txd5w3_xrdhw6cnz1080000gp/T/tmpbct_5g8u ===
    === Uploading project to DBFS path /dbfs/mlflow-experiments/<experiment-id>/projects-code/16e66ccbff0a4e22278e4d73ec733e2c9a33efbd1e6f70e3c7b47b8b5f1e4fa3.tar.gz ===
    === Finished uploading project to /dbfs/mlflow-experiments/<experiment-id>/projects-code/16e66ccbff0a4e22278e4d73ec733e2c9a33efbd1e6f70e3c7b47b8b5f1e4fa3.tar.gz ===
    === Running entry point main of project https://github.com/mlflow/mlflow#examples/sklearn_elasticnet_wine on Databricks ===
    === Launched MLflow run as Databricks job run with ID 8651121. Getting run status page URL... ===
    === Check the run's status at https://<databricks-instance>#job/<job-id>/run/1 ===
    
  3. Copy the URL https://<databricks-instance>#job/<job-id>/run/1 from the last line of the MLflow run output.

Step 3: View the Azure Databricks job run

  1. Open the URL you copied in the preceding step in a browser to view the Azure Databricks job run output:

    [Screenshot: Job run output]

Step 4: View the experiment and MLflow run details

  1. Navigate to the experiment in your Azure Databricks workspace.

    [Screenshot: Go to experiment]

  2. Click the experiment.

    [Screenshot: View experiment]

  3. To display run details, click a link in the Date column.

    [Screenshot: Run details]

You can view logs from your run by clicking the Logs link in the Job Output field.

Resources

For example MLflow projects, see the MLflow App Library, which contains a repository of ready-to-run projects aimed at making it easy to include ML functionality in your code.