Train and track ML models with MLflow and Azure Machine Learning (preview)

In this article, learn how to enable MLflow's tracking URI and logging API, collectively known as MLflow Tracking, to connect Azure Machine Learning as the backend of your MLflow experiments.

Supported capabilities include:

  • Track and log experiment metrics and artifacts in your Azure Machine Learning workspace. If you already use MLflow Tracking for your experiments, the workspace provides a centralized, secure, and scalable location to store training metrics and models.

  • Submit training jobs with MLflow Projects with Azure Machine Learning backend support (preview). You can submit jobs locally with Azure Machine Learning tracking, or migrate your runs to the cloud, for example via Azure Machine Learning Compute.

  • Track and manage models in MLflow and the Azure Machine Learning model registry.

MLflow is an open-source library for managing the life cycle of your machine learning experiments. MLflow Tracking is a component of MLflow that logs and tracks your training run metrics and model artifacts, regardless of your experiment's environment: locally on your computer, on a remote compute target, on a virtual machine, or on an Azure Databricks cluster.

Note

As an open-source library, MLflow changes frequently. As such, the functionality made available via the Azure Machine Learning and MLflow integration should be considered a preview and is not fully supported by Microsoft.

The following diagram illustrates that with MLflow Tracking, you track an experiment's run metrics and store model artifacts in your Azure Machine Learning workspace.

MLflow with Azure Machine Learning diagram

Tip

The information in this document is primarily for data scientists and developers who want to monitor the model training process. If you are an administrator interested in monitoring resource usage and events from Azure Machine Learning, such as quotas, completed training runs, or completed model deployments, see Monitoring Azure Machine Learning.

Compare MLflow and Azure Machine Learning clients

The following table summarizes the different clients that can use Azure Machine Learning, and their respective capabilities.

MLflow Tracking offers metric logging and artifact storage functionality that is otherwise available only via the Azure Machine Learning Python SDK.

Capability | MLflow Tracking & Deployment | Azure Machine Learning Python SDK | Azure Machine Learning CLI | Azure Machine Learning studio
Manage workspace
Use data stores
Log metrics
Upload artifacts
View metrics
Manage compute
Deploy models
Monitor model performance
Detect data drift

Prerequisites

Track local runs

MLflow Tracking with Azure Machine Learning lets you store the logged metrics and artifacts from your local runs in your Azure Machine Learning workspace.

Import the mlflow package and the Workspace class to access MLflow's tracking URI and configure your workspace.

In the following code, the get_mlflow_tracking_uri() method returns a unique tracking URI for the workspace, ws, and set_tracking_uri() points MLflow at that address.

import mlflow
from azureml.core import Workspace

ws = Workspace.from_config()

mlflow.set_tracking_uri(ws.get_mlflow_tracking_uri())

Note

The tracking URI is valid for up to an hour. If you restart your script after some idle time, use the get_mlflow_tracking_uri API to get a new URI.
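Because the URI expires, a long-lived script or notebook may need to refresh it from time to time. The following is a minimal sketch of one way to do that, not something either library provides: `TrackingUriRefresher`, `max_age_seconds`, and the 50-minute default are illustrative assumptions. The fetch and apply steps are passed in as callables, which in practice would be `ws.get_mlflow_tracking_uri` and `mlflow.set_tracking_uri`.

```python
import time


class TrackingUriRefresher:
    """Re-fetch a tracking URI once it is older than max_age_seconds.

    Hypothetical helper for illustration; not part of mlflow or azureml.
    """

    def __init__(self, get_uri, set_uri, max_age_seconds=50 * 60,
                 clock=time.monotonic):
        self._get_uri = get_uri      # e.g. ws.get_mlflow_tracking_uri
        self._set_uri = set_uri      # e.g. mlflow.set_tracking_uri
        self._max_age = max_age_seconds
        self._clock = clock          # injectable so the logic can be tested
        self._fetched_at = None      # None means "never fetched"

    def ensure_fresh(self):
        """Fetch and apply a new URI if none exists or the last one is too old.

        Returns True when a refresh happened, False otherwise.
        """
        now = self._clock()
        if self._fetched_at is None or now - self._fetched_at >= self._max_age:
            self._set_uri(self._get_uri())
            self._fetched_at = now
            return True
        return False
```

With the workspace from the code above, creating `TrackingUriRefresher(ws.get_mlflow_tracking_uri, mlflow.set_tracking_uri)` and calling `ensure_fresh()` before each batch of logging calls would keep the URI current.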

Set the MLflow experiment name with set_experiment() and start your training run with start_run(). Then use log_metric() to activate the MLflow logging API and begin logging your training run metrics.

experiment_name = 'experiment_with_mlflow'
mlflow.set_experiment(experiment_name)

with mlflow.start_run():
    mlflow.log_metric('alpha', 0.03)
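log_metric() also accepts an optional `step` argument, which is useful for recording a value that changes over the course of training. The following is a minimal sketch: `log_loss_curve` and the `loss` metric name are illustrative, and the `log_metric` parameter exists only so the function can be exercised without a tracking backend. When it is omitted, the helper falls back to `mlflow.log_metric`.

```python
def log_loss_curve(losses, log_metric=None):
    """Log one 'loss' value per training step and return how many were logged.

    Hypothetical helper; call it inside a `with mlflow.start_run():` block.
    """
    if log_metric is None:
        import mlflow  # imported lazily so the helper loads without mlflow
        log_metric = mlflow.log_metric

    count = 0
    for step, loss in enumerate(losses):
        # `step` orders the points of the metric within the run.
        log_metric("loss", float(loss), step=step)
        count += 1
    return count
```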

Track remote runs

Remote runs let you train your models on more powerful compute, such as GPU-enabled virtual machines or Machine Learning Compute clusters. See Use compute targets for model training to learn about different compute options.

MLflow Tracking with Azure Machine Learning lets you store the logged metrics and artifacts from your remote runs in your Azure Machine Learning workspace. Any run that contains MLflow Tracking code has its metrics logged automatically to the workspace.

The following example conda environment includes mlflow and azureml-mlflow as pip packages.

name: sklearn-example
dependencies:
  - python=3.6.2
  - scikit-learn
  - matplotlib
  - numpy
  - pip:
    - mlflow
    - azureml-mlflow

In your script, configure your compute and training run environment with the Environment class. Then, construct ScriptRunConfig with your remote compute as the compute target.

import mlflow

with mlflow.start_run():
    mlflow.log_metric('example', 1.23)
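The snippet above is the kind of logging code that would live in the training script. The submission side described earlier, an Environment plus a ScriptRunConfig, could be sketched as follows. This is an illustration under assumptions rather than the article's own code: the function name `build_remote_run`, the file names `conda_dependencies.yml` and `train.py`, the environment name `mlflow-env`, and the compute target name `cpu-cluster` are all placeholders.

```python
def build_remote_run(ws, experiment_name, compute_name="cpu-cluster",
                     conda_file="conda_dependencies.yml", script="train.py"):
    """Return (experiment, script_run_config) for a remote training run.

    All default values are placeholders; replace them with your own names.
    """
    # Imported lazily so this function can be defined without the Azure ML SDK.
    from azureml.core import Environment, Experiment, ScriptRunConfig

    # Build the run environment from the conda file that lists azureml-mlflow.
    env = Environment.from_conda_specification(name="mlflow-env",
                                               file_path=conda_file)

    # Point the training script at the remote compute target.
    src = ScriptRunConfig(source_directory=".", script=script,
                          compute_target=compute_name, environment=env)

    exp = Experiment(workspace=ws, name=experiment_name)
    return exp, src
```

Calling `exp, src = build_remote_run(ws, 'experiment_with_mlflow')` would produce the `exp` and `src` objects that the submit call expects.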

With this compute and training run configuration, use the Experiment.submit() method to submit a run. This method automatically sets the MLflow tracking URI and directs the logging from MLflow to your workspace.

run = exp.submit(src)

Train with MLflow Projects

MLflow Projects let you organize and describe your code so that other data scientists (or automated tools) can run it. MLflow Projects with Azure Machine Learning enable you to track and manage your training runs in your workspace.

This example shows how to submit MLflow projects locally with Azure Machine Learning tracking.

Install the azureml-mlflow package to use MLflow Tracking with Azure Machine Learning on your local experiments. Your experiments can run via a Jupyter notebook or code editor.

pip install azureml-mlflow

Import the mlflow package and the Workspace class to access MLflow's tracking URI and configure your workspace.

import mlflow
from azureml.core import Workspace

ws = Workspace.from_config()

mlflow.set_tracking_uri(ws.get_mlflow_tracking_uri())

Set the MLflow experiment name with set_experiment() and start your training run with start_run(). Then use log_metric() to activate the MLflow logging API and begin logging your training run metrics.

experiment_name = 'experiment-with-mlflow-projects'
mlflow.set_experiment(experiment_name)

Create the backend configuration object to store the information the integration needs, such as the compute target and which type of managed environment to use.

backend_config = {"USE_CONDA": False}
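For a remote run, the same dictionary would also name the compute target. The sketch below assumes that `COMPUTE` is the key the azureml backend reads for the target, and uses `cpu-cluster` as a placeholder cluster name; check both against the version of azureml-mlflow you have installed. `make_backend_config` is a hypothetical helper, not part of either library.

```python
def make_backend_config(compute_name=None, use_conda=False):
    """Build a backend_config dict for mlflow.projects.run (illustrative helper)."""
    config = {"USE_CONDA": use_conda}
    if compute_name is not None:
        # Assumed key name for the remote compute target.
        config["COMPUTE"] = compute_name
    return config


# Local run: only the environment setting is needed.
local_config = make_backend_config()
# Remote run: also names the (placeholder) compute cluster.
remote_config = make_backend_config("cpu-cluster")
```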

Add the azureml-mlflow package as a pip dependency to your environment configuration file in order to track metrics and key artifacts in your workspace.

name: mlflow-example
channels:
  - defaults
  - anaconda
  - conda-forge
dependencies:
  - python=3.6
  - scikit-learn=0.19.1
  - pip
  - pip:
    - mlflow
    - azureml-mlflow

Submit the local run and ensure you set the parameter backend = "azureml". With this setting, you can submit runs locally and get the added support of automatic output tracking, log files, snapshots, and printed errors in your workspace.

View your runs and metrics in Azure Machine Learning studio.

local_env_run = mlflow.projects.run(
    uri=".",
    parameters={"alpha": 0.3},
    backend="azureml",
    use_conda=False,
    backend_config=backend_config,
)

View metrics and artifacts in your workspace

The metrics and artifacts from MLflow logging are kept in your workspace. To view them at any time, open your workspace in Azure Machine Learning studio and find the experiment by name. Alternatively, run the following code.

run.get_metrics()
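The call returns a dictionary keyed by metric name. A metric logged several times typically comes back as a list of values, so a small helper can reduce the result to the latest value per metric. This is a sketch only: `latest_metrics` is a hypothetical name, and the list-versus-scalar shape is an assumption about the returned dictionary.

```python
def latest_metrics(metrics):
    """Reduce each metric to its most recent value.

    Scalars are kept as-is; lists contribute their last element;
    empty lists are skipped.
    """
    latest = {}
    for name, value in metrics.items():
        if isinstance(value, (list, tuple)):
            if value:
                latest[name] = value[-1]
        else:
            latest[name] = value
    return latest
```

For example, `latest_metrics(run.get_metrics())` would give one value per metric name, which is convenient for a quick comparison between runs.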

Manage models

Register and track your models with the Azure Machine Learning model registry, which supports the MLflow model registry. Azure Machine Learning models are aligned with the MLflow model schema, making it easy to export and import these models across different workflows. MLflow-related metadata, such as the run ID, is also tagged on the registered model for traceability. Users can submit training runs, and register and deploy models produced from MLflow runs.

If you want to deploy and register your production-ready model in one step, see Deploy and register MLflow models.

To register and view a model from a run, use the following steps:

  1. Once the run is complete, call the register_model() method.

    # the model folder produced from the run is registered. This includes the MLmodel file, model.pkl and the conda.yaml.
    run.register_model(model_name = 'my-model', model_path = 'model')
    
  2. View the registered model in your workspace with Azure Machine Learning studio.

    In the following example, the registered model, my-model, is tagged with MLflow tracking metadata.

    register-mlflow-model

  3. Select the Artifacts tab to see all the model files that align with the MLflow model schema (conda.yaml, MLmodel, model.pkl).

    model-schema

  4. Select MLmodel to see the MLmodel file generated by the run.

    MLmodel-schema

Clean up resources

If you don't plan to use the logged metrics and artifacts in your workspace, note that the ability to delete them individually is currently unavailable. Instead, delete the resource group that contains the storage account and workspace, so you don't incur any charges:

  1. In the Azure portal, select Resource groups on the far left.

    Delete in the Azure portal

  2. From the list, select the resource group you created.

  3. Select Delete resource group.

  4. Enter the resource group name. Then select Delete.

Example notebooks

The MLflow with Azure ML notebooks demonstrate and expand upon the concepts presented in this article.

Note

A community-driven repository of examples using MLflow can be found at https://github.com/Azure/azureml-examples.

Next steps