Track model metrics and deploy ML models with MLflow and Azure Machine Learning (preview)

APPLIES TO: Basic edition, Enterprise edition (Upgrade to Enterprise edition)

This article demonstrates how to enable MLflow's tracking URI and logging API, collectively known as MLflow Tracking, to connect your MLflow experiments and Azure Machine Learning. Doing so enables you to:

  • Track and log experiment metrics and artifacts in your Azure Machine Learning workspace. If you already use MLflow Tracking for your experiments, the workspace provides a centralized, secure, and scalable location to store training metrics and models.

  • Deploy your MLflow experiments as an Azure Machine Learning web service. By deploying as a web service, you can apply the Azure Machine Learning monitoring and data drift detection functionalities to your production models.

MLflow is an open-source library for managing the life cycle of your machine learning experiments. MLflow Tracking is a component of MLflow that logs and tracks your training run metrics and model artifacts, no matter your experiment's environment: locally on your computer, on a remote compute target, on a virtual machine, or on an Azure Databricks cluster.

Note

As an open-source library, MLflow changes frequently. As such, the functionality made available via the Azure Machine Learning and MLflow integration should be considered a preview, and is not fully supported by Microsoft.

The following diagram illustrates that with MLflow Tracking, you track an experiment's run metrics and store model artifacts in your Azure Machine Learning workspace.

Diagram of MLflow with Azure Machine Learning

Tip

The information in this document is primarily for data scientists and developers who want to monitor the model training process. If you're an administrator interested in monitoring resource usage and events from Azure Machine Learning, such as quotas, completed training runs, or completed model deployments, see Monitoring Azure Machine Learning.

Compare MLflow and Azure Machine Learning clients

The table below summarizes the different clients that can use Azure Machine Learning and their respective capabilities.

MLflow Tracking offers metric logging and artifact storage functionalities that are otherwise only available via the Azure Machine Learning Python SDK.

Capability    MLflow Tracking & Deployment    Azure Machine Learning Python SDK    Azure Machine Learning CLI    Azure Machine Learning studio
Manage workspace
Use data stores
Log metrics
Upload artifacts
View metrics
Manage compute
Deploy models
Monitor model performance
Detect data drift

Prerequisites

Track local runs

MLflow Tracking with Azure Machine Learning lets you store the logged metrics and artifacts from your local runs in your Azure Machine Learning workspace.

Install the azureml-mlflow package to use MLflow Tracking with Azure Machine Learning on experiments you run locally in a Jupyter Notebook or code editor.

pip install azureml-mlflow

Import the mlflow and Workspace classes to access MLflow's tracking URI and configure your workspace.

In the following code, the get_mlflow_tracking_uri() method assigns a unique tracking URI address to the workspace, ws, and set_tracking_uri() points the MLflow tracking URI to that address.

import mlflow
from azureml.core import Workspace

ws = Workspace.from_config()

mlflow.set_tracking_uri(ws.get_mlflow_tracking_uri())

Note

The tracking URI is valid for an hour or less. If you restart your script after some idle time, use the get_mlflow_tracking_uri API to get a new URI.
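
For example, if you come back to an idle notebook session, a minimal sketch like the following (reusing the Workspace object from the snippet above) requests a fresh tracking URI before you resume logging:

import mlflow
from azureml.core import Workspace

# Reload the workspace and point MLflow at a freshly issued tracking URI
ws = Workspace.from_config()
mlflow.set_tracking_uri(ws.get_mlflow_tracking_uri())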

Set the MLflow experiment name with set_experiment() and start your training run with start_run(). Then use log_metric() to activate the MLflow logging API and begin logging your training run metrics.

experiment_name = 'experiment_with_mlflow'
mlflow.set_experiment(experiment_name)

with mlflow.start_run():
    mlflow.log_metric('alpha', 0.03)

Track remote runs

MLflow Tracking with Azure Machine Learning lets you store the logged metrics and artifacts from your remote runs in your Azure Machine Learning workspace.

Remote runs let you train your models on more powerful compute, such as GPU-enabled virtual machines or Machine Learning Compute clusters. See Use compute targets for model training to learn about different compute options.
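
If you don't already have a remote compute target, the following sketch shows one way to provision a Machine Learning Compute cluster with the SDK; the cluster name my-remote-compute, the VM size, and the node counts are illustrative assumptions rather than values prescribed by this article, and ws is the Workspace object from the earlier snippet.

from azureml.core.compute import AmlCompute, ComputeTarget

# Provision a small CPU cluster to use as the remote compute target
# (name, VM size, and node counts are illustrative)
compute_config = AmlCompute.provisioning_configuration(vm_size='STANDARD_D2_V2',
                                                       min_nodes=0,
                                                       max_nodes=4)
compute_target = ComputeTarget.create(ws, 'my-remote-compute', compute_config)
compute_target.wait_for_completion(show_output=True)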

Configure your compute and training run environment with the Environment class. Include the mlflow and azureml-mlflow pip packages in the environment's CondaDependencies section. Then construct a ScriptRunConfig with your remote compute as the compute target.

from azureml.core import Experiment, ScriptRunConfig, Workspace
from azureml.core.environment import Environment
from azureml.core.conda_dependencies import CondaDependencies

ws = Workspace.from_config()

exp = Experiment(workspace=ws,
                 name='my_experiment')

mlflow_env = Environment(name='mlflow-env')

cd = CondaDependencies.create(pip_packages=['mlflow', 'azureml-mlflow'])

mlflow_env.python.conda_dependencies = cd

src = ScriptRunConfig(source_directory='./my_script_location', script='my_training_script.py')

src.run_config.target = 'my-remote-compute'
src.run_config.environment = mlflow_env

In your training script, import mlflow to use the MLflow logging APIs, and start logging your run metrics.

import mlflow

with mlflow.start_run():
    mlflow.log_metric('example', 1.23)

With this compute and training run configuration, use the Experiment.submit() method to submit a run. This method automatically sets the MLflow tracking URI and directs the logging from MLflow to your workspace.

run = exp.submit(src)
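
Once the run has been submitted, you can optionally block until it completes and read back the metrics that were logged through MLflow. This is a small sketch that continues from the snippet above rather than part of the original walkthrough:

# Wait for the remote run to finish, then read back the logged metrics
run.wait_for_completion(show_output=True)
print(run.get_metrics())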

Deploy MLflow models as a web service

Deploying your MLflow experiments as an Azure Machine Learning web service allows you to leverage the Azure Machine Learning model management and data drift detection capabilities and apply them to your production models.

The following diagram demonstrates that with the MLflow deploy API, you can deploy your existing MLflow models as an Azure Machine Learning web service regardless of their framework (PyTorch, TensorFlow, scikit-learn, ONNX, and so on), and manage your production models in your workspace.

Diagram of MLflow with Azure Machine Learning

Log your model

Before you can deploy, be sure that your model is saved so you can reference it and its path location for deployment. In your training script, there should be code similar to the following mlflow.sklearn.log_model() method, which saves your model to the specified outputs directory.

# change sklearn to pytorch, tensorflow, etc. based on your experiment's framework 
import mlflow.sklearn

# Save the model to the outputs directory for capture
mlflow.sklearn.log_model(regression_model, model_save_path)

Note

Include the conda_env parameter to pass a dictionary representation of the dependencies and environment this model should be run in.
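
As an illustration, a conda_env dictionary might look like the following sketch; the package list and versions are assumptions that you should adapt to match your own training environment.

import mlflow.sklearn

# Illustrative dependency specification; pin the versions used during training
conda_env = {
    'name': 'mlflow-env',
    'channels': ['defaults'],
    'dependencies': [
        'python=3.7',
        'scikit-learn',
        {'pip': ['mlflow', 'azureml-mlflow']}
    ]
}

mlflow.sklearn.log_model(regression_model, model_save_path, conda_env=conda_env)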

Retrieve model from previous run

To retrieve the run, you need the run ID and the path in run history where the model was saved.

# gets the list of runs for your experiment as an array
experiment_name = 'experiment_with_mlflow'
exp = ws.experiments[experiment_name]
runs = list(exp.get_runs())

# get the run ID and the path in run history
runid = runs[0].id
model_save_path = 'model'

Deploy the model

Use the Azure Machine Learning SDK to deploy the model as a web service.

First, specify the deployment configuration. Azure Container Instances (ACI) is a suitable choice for a quick dev-test deployment, while Azure Kubernetes Service (AKS) is suitable for scalable production deployments.

Deploy to ACI

Set up your deployment configuration with the deploy_configuration() method. You can also add tags and descriptions to help keep track of your web service.

from azureml.core.webservice import AciWebservice, Webservice

# Configure 
aci_config = AciWebservice.deploy_configuration(cpu_cores=1, 
                                                memory_gb=1, 
                                                tags={'method' : 'sklearn'}, 
                                                description='Diabetes model',
                                                location='chinaeast')

Then, register and deploy the model by using the Azure Machine Learning SDK deploy method.

import mlflow.azureml

(webservice, model) = mlflow.azureml.deploy(model_uri='runs:/{}/{}'.format(runid, model_save_path),
                                            workspace=ws,
                                            model_name='sklearn-model',
                                            service_name='diabetes-model-1',
                                            deployment_config=aci_config,
                                            tags=None, mlflow_home=None, synchronous=True)

webservice.wait_for_deployment(show_output=True)
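
If the deployment doesn't reach a healthy state, the service state and logs are a reasonable first thing to check; these lines are a suggested troubleshooting step, not part of the original walkthrough.

# Inspect the service state and recent logs if the deployment fails
print(webservice.state)
print(webservice.get_logs())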

Deploy to AKS

To deploy to AKS, first create an AKS cluster by using the ComputeTarget.create() method. It can take 20-25 minutes to create a new cluster.

from azureml.core.compute import AksCompute, ComputeTarget

# Use the default configuration (can also provide parameters to customize)
prov_config = AksCompute.provisioning_configuration()

aks_name = 'aks-mlflow' 

# Create the cluster
aks_target = ComputeTarget.create(workspace=ws, 
                                  name=aks_name, 
                                  provisioning_configuration=prov_config)

aks_target.wait_for_completion(show_output = True)

print(aks_target.provisioning_state)
print(aks_target.provisioning_errors)

Set up your deployment configuration with the deploy_configuration() method. You can also add tags and descriptions to help keep track of your web service.

from azureml.core.webservice import Webservice, AksWebservice

# Set the web service configuration (using default here with app insights)
aks_config = AksWebservice.deploy_configuration(enable_app_insights=True, compute_target_name='aks-mlflow')

Then, register and deploy the model by using the Azure Machine Learning SDK deploy method.

# Webservice creation using single command
import mlflow.azureml
from azureml.core.webservice import AksWebservice, Webservice

(webservice, model) = mlflow.azureml.deploy(model_uri='runs:/{}/{}'.format(runid, model_save_path),
                                            workspace=ws,
                                            model_name='sklearn-model',
                                            service_name='my-aks',
                                            deployment_config=aks_config,
                                            tags=None, mlflow_home=None, synchronous=True)


webservice.wait_for_deployment()

The service deployment can take several minutes.
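
Once the service reports a healthy state, you can send it a test request. The sketch below uses Webservice.run() and assumes the model expects a pandas-style 'split' JSON payload, which is common for MLflow models; the column names and values are placeholders to replace with your model's actual inputs.

import json

# Placeholder payload in pandas 'split' orientation; adapt to your model's schema
sample_input = json.dumps({
    'columns': ['feature_1', 'feature_2'],
    'data': [[0.5, 1.2]]
})

# Send the payload to the deployed service and print the prediction
prediction = webservice.run(sample_input)
print(prediction)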

Clean up resources

If you don't plan to use the logged metrics and artifacts in your workspace, the ability to delete them individually is currently unavailable. Instead, delete the resource group that contains the storage account and workspace, so you don't incur any charges:

  1. In the Azure portal, select Resource groups on the far left.

    Delete in the Azure portal

  2. From the list, select the resource group you created.

  3. Select Delete resource group.

  4. Enter the resource group name. Then select Delete.

Example notebooks

The MLflow with Azure ML notebooks demonstrate and expand upon the concepts presented in this article.

Next steps