Track machine learning training runs

The MLflow tracking component lets you log source properties, parameters, metrics, tags, and artifacts related to training a machine learning model. MLflow tracking is based on two concepts, experiments and runs:

  • An MLflow experiment is the primary unit of organization and access control for MLflow runs; all MLflow runs belong to an experiment. Experiments let you visualize, search for, and compare runs, as well as download run artifacts and metadata for analysis in other tools.
  • An MLflow run corresponds to a single execution of model code. Each run records the following information:
    • Source: Name of the notebook that launched the run, or the project name and entry point for the run.
    • Version: Notebook revision if run from a notebook, or Git commit hash if run from an MLflow Project.
    • Start & end time: Start and end time of the run.
    • Parameters: Model parameters saved as key-value pairs. Both keys and values are strings.
    • Metrics: Model evaluation metrics saved as key-value pairs. The value is numeric. Each metric can be updated throughout the course of the run (for example, to track how your model's loss function is converging), and MLflow records the metric's history and lets you visualize it.
    • Tags: Run metadata saved as key-value pairs. You can update tags during a run and after it completes. Both keys and values are strings.
    • Artifacts: Output files in any format. For example, you can record images, models (for example, a pickled scikit-learn model), and data files (for example, a Parquet file) as artifacts.

Use the MLflow Tracking API to log parameters, metrics, tags, and artifacts from a model run. The Tracking API communicates with an MLflow tracking server. When you use Databricks, a Databricks-hosted tracking server logs the data. The hosted MLflow tracking server has Python, Java, and R APIs.

For information about controlling access to experiments, see MLflow Experiment permissions.

Note

MLflow is installed on Databricks Runtime ML clusters. To use MLflow on a Databricks Runtime cluster, you must install the PyPI library mlflow[extras]. See Install a library on a cluster.

Where MLflow runs are logged

All MLflow runs are logged to the active experiment, which you can set explicitly.

If no active experiment is set, runs are logged to the notebook experiment.

Experiments

There are two types of experiments: workspace and notebook.

  • You can create a workspace experiment from the Workspace UI or the MLflow API. Workspace experiments are not associated with any notebook, and any notebook can log a run to these experiments by using the experiment ID or the experiment name.
  • A notebook experiment is associated with a specific notebook. Azure Databricks automatically creates a notebook experiment if there is no active experiment when you start a run using mlflow.start_run().

To learn how to control access to experiments, see MLflow Experiment permissions.

Workspace experiments

This section describes how to create a workspace experiment using the Azure Databricks UI. You can also use the MLflow API.

For instructions on logging runs to workspace experiments, see Log runs to a notebook or workspace experiment.

Create workspace experiment

  1. Click the Workspace button or the Home button in the sidebar.

  2. Go to the folder in which you want to create the experiment.

  3. Do one of the following:

    • Next to any folder, click the drop-down menu to the right of the folder name and select Create > MLflow Experiment.

    • In the Workspace or a user folder, click the down caret and select Create > MLflow Experiment.

  4. In the Create MLflow Experiment dialog, enter a name for the experiment and an optional artifact location. If you do not specify an artifact location, artifacts are stored in dbfs:/databricks/mlflow-tracking/<experiment-id>.

    Azure Databricks supports DBFS and Azure Blob storage artifact locations.

    To store artifacts in Azure Blob storage, specify a URI of the form wasbs://<container>@<storage-account>.blob.core.chinacloudapi.cn/<path>. Artifacts stored in Azure Blob storage cannot be viewed in the MLflow UI; you must download them using a blob storage client.

  5. Click Create. An empty experiment displays.
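The two artifact location formats from step 4 can be assembled with a small helper. This is only an illustrative sketch: the function names, container, storage account, path, and experiment ID below are hypothetical, and the helpers simply format strings in the shapes given above.

```python
def wasbs_artifact_uri(container, storage_account, path,
                       endpoint_suffix="blob.core.chinacloudapi.cn"):
    """Build an Azure Blob storage artifact location in the wasbs:// form."""
    return f"wasbs://{container}@{storage_account}.{endpoint_suffix}/{path.lstrip('/')}"


def default_dbfs_artifact_uri(experiment_id):
    """Build the default DBFS location used when no artifact location is given."""
    return f"dbfs:/databricks/mlflow-tracking/{experiment_id}"


# Hypothetical values, for illustration only.
example_uri = wasbs_artifact_uri("mlflow", "mystorageacct", "/experiments/churn")
default_uri = default_dbfs_artifact_uri("1234")
```

When creating an experiment through the Python API instead of the UI, a string like `example_uri` could be passed as the `artifact_location` argument of `mlflow.create_experiment()`.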

View workspace experiment

  1. Click the Workspace button or the Home button in the sidebar.
  2. Go to the folder containing the experiment.
  3. Click the experiment name.

Delete workspace experiment

  1. Click the Workspace button or the Home button in the sidebar.
  2. Go to the folder containing the experiment.
  3. Click the drop-down menu to the right of the experiment and select Move to Trash.

Notebook experiments

When you use the mlflow.start_run() command in a notebook, the run logs metrics and parameters to the active experiment. If no experiment is active, Azure Databricks creates a notebook experiment. A notebook experiment shares the same name and ID as its corresponding notebook. The notebook ID is the numerical identifier at the end of the notebook URL (see Notebook URL and ID).

For instructions on logging runs to notebook experiments, see Log runs to a notebook or workspace experiment.

Note

If you delete a notebook experiment using the API (for example, mlflow.tracking.MlflowClient.delete_experiment() in Python), the notebook itself is moved into the Trash folder.

View notebook experiment

To view a notebook experiment and its associated runs, click the Experiment icon in the notebook toolbar.

From the Experiment Runs sidebar, you can view the run parameters and metrics.

You can also view the version of the notebook that created the run. Click the Notebook icon in the box for that experiment run. The version of the notebook associated with that run appears in the main window, with a highlight bar showing the date and time of the run.

To view the experiment, click the external link icon in the Experiment Runs context bar.

To view a run, click the external link icon next to the run's date in the Experiment Runs sidebar.

Delete notebook experiment

Notebook experiments are part of the notebook and cannot be deleted separately. If you delete the notebook, the notebook experiment is deleted. If you delete a notebook experiment using the API (for example, mlflow.tracking.MlflowClient.delete_experiment() in Python), the notebook is also deleted.

Runs

All MLflow runs are logged to the active experiment. If you have not explicitly set an experiment as the active experiment, runs are logged to the notebook experiment.

Log runs to a notebook or workspace experiment

This notebook shows examples of how to log runs to a notebook experiment and to a workspace experiment. Only MLflow runs initiated within a notebook can be logged to the notebook experiment. MLflow runs launched from any notebook or from the APIs can be logged to a workspace experiment. For information about viewing logged runs, see View notebook experiment and View workspace experiment.

Log MLflow runs notebook

You can use the MLflow Python, Java or Scala, and R APIs to start runs and record run data. The following sections provide details. For example notebooks, see the Quick start.

Python

  1. Install the PyPI library mlflow[extras] to a cluster, where the extra dependencies are:

    • scikit-learn when Python version >= '3.5'
    • scikit-learn == 0.20 when Python version < '3.5'
    • boto3 >= 1.7.12
    • mleap >= 0.8.1
    • azure-storage
    • google-cloud-storage
  2. Import the MLflow library:

    import mlflow
    
  3. Start an MLflow run and log parameters, metrics, and artifacts inside the with block, which ends the run when it exits:

    with mlflow.start_run() as run:
        # Log a parameter (key-value pair)
        mlflow.log_param("param1", 5)
        # Log a metric; metrics can be updated throughout the run
        mlflow.log_metric("foo", 2, step=1)
        mlflow.log_metric("foo", 4, step=2)
        mlflow.log_metric("foo", 6, step=3)
        # Log an artifact (output file)
        with open("output.txt", "w") as f:
            f.write("Hello world!")
        mlflow.log_artifact("output.txt")
    

Scala

  1. Install the PyPI library mlflow and the Maven library org.mlflow:mlflow-client:1.0.0 to a cluster.

  2. Import MLflow and file libraries:

    import org.mlflow.tracking.ActiveRun
    import org.mlflow.tracking.MlflowContext
    import java.io.{File,PrintWriter}
    
  3. Create MLflow context:

    val mlflowContext = new MlflowContext()
    
  4. Create an experiment if it does not already exist, and make it the active experiment:

    val experimentName = "/Shared/QuickStart"
    val client = mlflowContext.getClient()
    val experimentOpt = client.getExperimentByName(experimentName)
    if (!experimentOpt.isPresent()) {
      client.createExperiment(experimentName)
    }
    mlflowContext.setExperimentName(experimentName)
    
  5. Log parameters, metrics, and a file:

    import java.nio.file.Paths
    val run = mlflowContext.startRun("run")
    // Log a parameter (key-value pair)
    run.logParam("param1", "5")
    // Log a metric; metrics can be updated throughout the run
    run.logMetric("foo", 2.0, 1)
    run.logMetric("foo", 4.0, 2)
    run.logMetric("foo", 6.0, 3)
    // Log an artifact (output file)
    new PrintWriter("/tmp/output.txt") { write("Hello, world!"); close() }
    run.logArtifact(Paths.get("/tmp/output.txt"))
    
  6. Close the run:

    run.endRun()
    

R

  1. Install the CRAN library mlflow to a cluster.

  2. Import and install MLflow libraries:

    library(mlflow)
    install_mlflow()
    
  3. Create a new run:

    run <- mlflow_start_run()
    
  4. Log parameters, metrics, and a file:

    # Log a parameter (key-value pair)
    mlflow_log_param("param1", 5)
    # Log a metric; metrics can be updated throughout the run
    mlflow_log_metric("foo", 2, step = 1)
    mlflow_log_metric("foo", 4, step = 2)
    mlflow_log_metric("foo", 6, step = 3)
    # Log an artifact (output file)
    write("Hello world!", file = "output.txt")
    mlflow_log_artifact("output.txt")
    
  5. Close the run:

    mlflow_end_run()
    

View and manage runs in experiments

Within an experiment you can perform many operations on the runs it contains.

Filter runs

To filter runs by a parameter or metric name, type the name in the Filter [Params|Metric] field and press Enter.

To filter runs that match an expression containing parameter and metric values:

  1. In the Search Runs field, specify an expression. For example: metrics.r2 > 0.3.

  2. Click Search.

Download runs

  1. Select one or more runs.
  2. Click Download CSV. A CSV file downloads with the following fields: Run ID,Name,Source Type,Source Name,User,Status,<parameter1>,<parameter2>,...,<metric1>,<metric2>,...
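Once downloaded, a file with those fields can be analyzed with standard tools. The following sketch uses Python's csv module on an inline sample; the rows, the `alpha` parameter column, and the `rmse` metric column are hypothetical stand-ins for whatever parameters and metrics your runs logged.

```python
import csv
import io

# Hypothetical sample in the downloaded layout: fixed fields, then parameters, then metrics.
sample_csv = """Run ID,Name,Source Type,Source Name,User,Status,alpha,rmse
a1b2c3,run-1,NOTEBOOK,/Users/someone/train,someone@example.com,FINISHED,0.5,0.82
d4e5f6,run-2,NOTEBOOK,/Users/someone/train,someone@example.com,FINISHED,0.1,0.74
"""

# Each row becomes a dict keyed by the header names; all values are strings.
rows = list(csv.DictReader(io.StringIO(sample_csv)))

# Find the run with the lowest rmse, converting the metric to a number first.
best = min(rows, key=lambda row: float(row["rmse"]))
best_run_id = best["Run ID"]
```

For a real download, replace `io.StringIO(sample_csv)` with an open file handle for the CSV.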

Display run details

Click the date link of a run. The run details screen displays. This screen shows the parameters used for the run, the metrics resulting from the run, and any tags or notes. You can also access artifacts saved from a run in this screen.

To view the specific version of the notebook or Git project used for a run:

  • If the run was launched locally in an Azure Databricks notebook or job, click the link in the Source field to open the specific notebook version used in the run.
  • If the run was launched remotely from a Git project, click the link in the Git Commit field to open the specific version of the project used in the run. The link in the Source field opens the master branch of the Git project used in the run.

Compare runs

  1. In the experiment, select two or more runs by clicking the checkbox to the left of each run.
  2. Click Compare. The Comparing Runs screen displays.
  3. Do one of the following:
    • Select a metric name to display a graph of the metric.

    • Select parameters and metrics from the X-axis and Y-axis drop-down lists to generate a scatter plot.

Delete runs

  1. In the experiment, select one or more runs by clicking the checkbox to the left of each run.
  2. Click Delete.
  3. If the run is a parent run, decide whether you also want to delete descendant runs. This option is selected by default.
  4. Click Delete to confirm or Cancel to cancel. Deleted runs are saved for 30 days. To display deleted runs, select Deleted in the State field.

Access the MLflow tracking server from outside Azure Databricks

You can also write to and read from the tracking server from outside Azure Databricks, for example by using the MLflow CLI.

Analyze MLflow runs using DataFrames

You can access MLflow run data programmatically using two DataFrame APIs.

This example demonstrates how to use the MLflow Python client to build a dashboard that visualizes changes in evaluation metrics over time, tracks the number of runs started by a specific user, and measures the total number of runs across all users.

Examples

The following notebooks demonstrate how to train several types of models and track the training data in MLflow, and how to store tracking data in Delta Lake.