MLOps: Model management, deployment, and monitoring with Azure Machine Learning

In this article, learn how to use Azure Machine Learning to manage the lifecycle of your models. Azure Machine Learning uses a Machine Learning Operations (MLOps) approach. MLOps improves the quality and consistency of your machine learning solutions.

What is MLOps?

Machine Learning Operations (MLOps) is based on DevOps principles and practices that increase the efficiency of workflows, such as continuous integration, continuous delivery, and continuous deployment. MLOps applies these principles to the machine learning process, with the goal of:

  • Faster experimentation and development of models
  • Faster deployment of models into production
  • Quality assurance

Azure Machine Learning provides the following MLOps capabilities:

  • Create reproducible ML pipelines. Machine Learning pipelines allow you to define repeatable and reusable steps for your data preparation, training, and scoring processes.
  • Create reusable software environments for training and deploying models.
  • Register, package, and deploy models from anywhere. You can also track the associated metadata required to use the model.
  • Capture the governance data for the end-to-end ML lifecycle. The logged information can include who published a model, why changes were made, and when models were deployed or used in production.
  • Notify and alert on events in the ML lifecycle. For example, experiment completion, model registration, model deployment, and data drift detection.
  • Monitor ML applications for operational and ML-related issues. Compare model inputs between training and inference, explore model-specific metrics, and provide monitoring and alerts on your ML infrastructure.
  • Automate the end-to-end ML lifecycle with Azure Machine Learning and Azure Pipelines. Pipelines allow you to frequently update models, test new models, and continuously roll out new ML models alongside your other applications and services.

Create reproducible ML pipelines

Use ML pipelines from Azure Machine Learning to stitch together all of the steps involved in your model training process.

An ML pipeline can contain steps from data preparation to feature extraction to hyperparameter tuning to model evaluation. For more information, see ML pipelines.
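As a rough illustration, the following is a minimal sketch of such a pipeline using the Azure Machine Learning Python SDK v1 (azureml-pipeline-core and azureml-pipeline-steps); the script names, compute target name, and experiment name are illustrative.

    # A two-step training pipeline: data preparation followed by training.
    from azureml.core import Experiment, Workspace
    from azureml.pipeline.core import Pipeline
    from azureml.pipeline.steps import PythonScriptStep

    ws = Workspace.from_config()

    prep_step = PythonScriptStep(name="prepare_data",
                                 script_name="prep.py",        # illustrative script
                                 source_directory="./src",
                                 compute_target="cpu-cluster",  # illustrative compute target
                                 allow_reuse=True)

    train_step = PythonScriptStep(name="train_model",
                                  script_name="train.py",       # illustrative script
                                  source_directory="./src",
                                  compute_target="cpu-cluster")
    train_step.run_after(prep_step)

    pipeline = Pipeline(workspace=ws, steps=[train_step])
    Experiment(ws, "mlops-training").submit(pipeline)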

If you use the designer to create your ML pipelines, you can at any time click the "..." at the top-right of the designer page and then select Clone. Cloning your pipeline lets you iterate on your pipeline design without losing your old versions.

Create reusable software environments

Azure Machine Learning environments allow you to track and reproduce your projects' software dependencies as they evolve. Environments let you ensure that builds are reproducible without manual software configuration.

Environments describe the pip and Conda dependencies for your projects, and can be used for both training and deployment of models. For more information, see What are Azure Machine Learning environments.
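As a sketch of how this looks in the Azure Machine Learning Python SDK v1 (azureml-core), the environment name and conda file path below are illustrative.

    # Define an environment from a conda specification and register it for reuse.
    from azureml.core import Environment, Workspace

    ws = Workspace.from_config()

    env = Environment.from_conda_specification(name="train-env",             # illustrative name
                                               file_path="environment.yml")  # illustrative conda file
    env.register(workspace=ws)

The same registered environment can later be attached to training runs and to the inference configuration used at deployment time, so training and serving share identical dependencies.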

Register, package, and deploy models from anywhere

Register and track ML models

Model registration allows you to store and version your models in the Azure cloud, in your workspace. The model registry makes it easy to organize and keep track of your trained models.

Tip

A registered model is a logical container for one or more files that make up your model. For example, if you have a model that is stored in multiple files, you can register them as a single model in your Azure Machine Learning workspace. After registration, you can download or deploy the registered model and receive all the files that were registered.

Registered models are identified by name and version. Each time you register a model with the same name as an existing one, the registry increments the version. You can provide additional metadata tags during registration and then use them when searching for a model. Azure Machine Learning supports any model that can be loaded using Python 3.5.2 or higher.
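A minimal registration sketch with the Python SDK v1 follows; the model file path, model name, and tags are illustrative.

    # Register a model file from local disk (or from a run's outputs) into the workspace registry.
    from azureml.core import Workspace
    from azureml.core.model import Model

    ws = Workspace.from_config()

    model = Model.register(workspace=ws,
                           model_path="outputs/sklearn_model.pkl",  # illustrative file
                           model_name="sklearn-regression",          # illustrative name
                           tags={"area": "regression", "framework": "scikit-learn"})

    # The version increments automatically on each registration under the same name.
    print(model.name, model.version)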

Tip

You can also register models trained outside Azure Machine Learning.

You can't delete a registered model that is being used in an active deployment. For more information, see the register model section of Deploy models.

Profile models

Azure Machine Learning can help you understand the CPU and memory requirements of the service that will be created when you deploy your model. Profiling tests the service that runs your model and returns information such as CPU usage, memory usage, and response latency. It also provides a CPU and memory recommendation based on the resource usage. For more information, see the profiling section of Deploy models.
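The following is a sketch of profiling with the SDK v1 Model.profile API; the registered model, environment, entry script, and sample-request dataset names are illustrative, and parameter names can differ between SDK versions.

    # Profile a registered model against a dataset of sample requests.
    from azureml.core import Dataset, Environment, Workspace
    from azureml.core.model import InferenceConfig, Model

    ws = Workspace.from_config()
    model = Model(ws, name="sklearn-regression")   # illustrative registered model
    env = Environment.get(ws, name="train-env")    # illustrative registered environment
    inference_config = InferenceConfig(entry_script="score.py", environment=env)

    sample_requests = Dataset.get_by_name(ws, name="sample-requests")  # illustrative dataset

    profile = Model.profile(ws, "sklearn-regression-profile", [model],
                            inference_config, input_dataset=sample_requests)
    profile.wait_for_completion(show_output=True)
    print(profile.get_details())  # includes the CPU and memory recommendation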

Package and debug models

Before a model is deployed into production, it is packaged into a Docker image. In most cases, image creation happens automatically in the background during deployment. You can also specify the image manually.

If you run into problems with the deployment, you can deploy on your local development environment for troubleshooting and debugging.

For more information, see Deploy models and Troubleshooting deployments.
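For troubleshooting, a local deployment can be created with the SDK v1 as sketched below; the model, environment, and entry-script names are illustrative.

    # Deploy a registered model as a local web service for debugging.
    from azureml.core import Environment, Workspace
    from azureml.core.model import InferenceConfig, Model
    from azureml.core.webservice import LocalWebservice

    ws = Workspace.from_config()
    model = Model(ws, name="sklearn-regression")   # illustrative registered model
    env = Environment.get(ws, name="train-env")    # illustrative registered environment

    inference_config = InferenceConfig(entry_script="score.py", environment=env)
    local_config = LocalWebservice.deploy_configuration(port=8890)

    service = Model.deploy(ws, "local-debug", [model], inference_config, local_config)
    service.wait_for_deployment(show_output=True)
    print(service.run('{"data": [[1.0, 2.0, 3.0, 4.0]]}'))  # send a test request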

Convert and optimize models

Converting your model to Open Neural Network Exchange (ONNX) may improve performance. On average, converting to ONNX can yield a 2x performance increase.

For more information on ONNX with Azure Machine Learning, see the Create and accelerate ML models article.
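As an example of one conversion path, the following sketch converts a trained scikit-learn model to ONNX, assuming the skl2onnx package; the model file, input shape, and output path are illustrative.

    # Convert a scikit-learn model to ONNX format.
    import joblib
    from skl2onnx import convert_sklearn
    from skl2onnx.common.data_types import FloatTensorType

    model = joblib.load("outputs/sklearn_model.pkl")  # illustrative model file
    onnx_model = convert_sklearn(model,
                                 initial_types=[("input", FloatTensorType([None, 4]))])

    with open("outputs/sklearn_model.onnx", "wb") as f:
        f.write(onnx_model.SerializeToString())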

Use models

Trained machine learning models are deployed as web services in the cloud or locally. You can also deploy models to Azure IoT Edge devices. Deployments use CPU, GPU, or field-programmable gate arrays (FPGA) for inferencing. You can also use models from Power BI.

When using a model as a web service or on an IoT Edge device, you provide the following items:

  • The model(s) that are used to score data submitted to the service/device.
  • An entry script. This script accepts requests, uses the model(s) to score the data, and returns a response (a minimal sketch follows this list).
  • An Azure Machine Learning environment that describes the pip and Conda dependencies required by the model(s) and entry script.
  • Any additional assets, such as text and data, that are required by the model(s) and entry script.
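The entry script typically implements an init() function that loads the model and a run() function that scores each request. The sketch below assumes a scikit-learn model saved with joblib; the file names and input format are illustrative.

    # score.py -- minimal entry script sketch
    import json
    import os

    import joblib
    import numpy as np


    def init():
        # Called once when the service starts: load the registered model file(s).
        global model
        model_path = os.path.join(os.environ["AZUREML_MODEL_DIR"], "sklearn_model.pkl")
        model = joblib.load(model_path)


    def run(raw_data):
        # Called for each request: parse the input, score it, and return the result.
        data = np.array(json.loads(raw_data)["data"])
        predictions = model.predict(data)
        return predictions.tolist()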

You also provide the configuration of the target deployment platform. For example, the VM family type, available memory, and number of cores when deploying to Azure Kubernetes Service.

When the image is created, components required by Azure Machine Learning are also added. For example, the assets needed to run the web service and to interact with IoT Edge.

Batch scoring

Batch scoring is supported through ML pipelines. For more information, see Batch predictions on big data.

Real-time web services

You can use your models in web services with the following compute targets:

  • Azure Container Instances
  • Azure Kubernetes Service
  • Local development environment

To deploy the model as a web service, you must provide the following items:

  • The model or ensemble of models.
  • Dependencies required to use the model. For example, a script that accepts requests and invokes the model, conda dependencies, etc.
  • A deployment configuration that describes how and where to deploy the model.

For more information, see Deploy models.
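A minimal deployment sketch with the SDK v1 and Azure Container Instances follows; the model, environment, entry-script, and service names are illustrative.

    # Deploy a registered model as a real-time web service on Azure Container Instances.
    from azureml.core import Environment, Workspace
    from azureml.core.model import InferenceConfig, Model
    from azureml.core.webservice import AciWebservice

    ws = Workspace.from_config()
    model = Model(ws, name="sklearn-regression")   # illustrative registered model
    env = Environment.get(ws, name="train-env")    # illustrative registered environment

    inference_config = InferenceConfig(entry_script="score.py", environment=env)
    aci_config = AciWebservice.deploy_configuration(cpu_cores=1, memory_gb=1)

    service = Model.deploy(ws, "sklearn-regression-svc", [model], inference_config, aci_config)
    service.wait_for_deployment(show_output=True)
    print(service.scoring_uri)  # REST endpoint for scoring requests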

Controlled rollout

When deploying to Azure Kubernetes Service, you can use controlled rollout to enable the following scenarios:

  • Create multiple versions of an endpoint for a deployment.
  • Perform A/B testing by routing traffic to different versions of the endpoint.
  • Switch between endpoint versions by updating the traffic percentage in the endpoint configuration.

For more information, see Controlled rollout of ML models.

IoT Edge devices

You can use models with IoT devices through Azure IoT Edge modules. IoT Edge modules are deployed to a hardware device, which enables inference, or model scoring, on the device.

For more information, see Deploy models.

Analytics

Microsoft Power BI supports using machine learning models for data analytics. For more information, see Azure Machine Learning integration in Power BI (preview).

Capture the governance data for the end-to-end ML lifecycle

Azure ML gives you the capability to track the end-to-end audit trail of all of your ML assets by using metadata.

  • Azure ML integrates with Git to track information on which repository, branch, and commit your code came from.
  • Azure ML Datasets help you track, profile, and version data.
  • Interpretability allows you to explain your models, meet regulatory compliance, and understand how models arrive at a result for a given input.
  • Azure ML Run history stores a snapshot of the code, data, and computes used to train a model.
  • The Azure ML Model Registry captures all of the metadata associated with your model: which experiment trained it, where it is deployed, and whether its deployments are healthy.
  • Integration with Azure allows you to act on events in the ML lifecycle. For example, model registration, deployment, data drift, and training (run) events.

Tip

While some information on models and datasets is automatically captured, you can add additional information by using tags. When looking for registered models and datasets in your workspace, you can use tags as a filter.

Associating a dataset with a registered model is an optional step. For information on referencing a dataset when registering a model, see the Model class reference.
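As a sketch of how tags work as a filter, the following lists registered models that carry a given tag; the tag key and value are illustrative.

    # Filter registered models by a tag that was added at registration time.
    from azureml.core import Workspace
    from azureml.core.model import Model

    ws = Workspace.from_config()

    for m in Model.list(ws, tags=[["area", "regression"]]):  # illustrative tag filter
        print(m.name, m.version, m.tags)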

Notify, automate, and alert on events in the ML lifecycle

Azure ML publishes key events to Azure Event Grid, which can be used to notify and automate on events in the ML lifecycle. For more information, see this document.

Monitor for operational and ML issues

Monitoring enables you to understand what data is being sent to your model and the predictions that it returns.

This information helps you understand how your model is being used. The collected input data may also be useful in training future versions of the model.

For more information, see How to enable model data collection.
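As a rough sketch of what model data collection looks like inside the entry script, the following assumes the azureml-monitoring package (ModelDataCollector) and data collection enabled on the web service; the model name, feature names, and input format are illustrative.

    # score.py with data collection -- sketch
    import json
    import os

    import joblib
    import numpy as np
    from azureml.monitoring import ModelDataCollector


    def init():
        global model, inputs_dc, predictions_dc
        model = joblib.load(os.path.join(os.environ["AZUREML_MODEL_DIR"], "sklearn_model.pkl"))
        inputs_dc = ModelDataCollector("sklearn-regression", designation="inputs",
                                       feature_names=["f1", "f2", "f3", "f4"])
        predictions_dc = ModelDataCollector("sklearn-regression", designation="predictions",
                                            feature_names=["prediction"])


    def run(raw_data):
        data = np.array(json.loads(raw_data)["data"])
        predictions = model.predict(data)
        inputs_dc.collect(data)              # store the inputs sent to the model
        predictions_dc.collect(predictions)  # store the predictions returned
        return predictions.tolist()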

Retrain your model on new data

Often, you'll want to validate your model, update it, or even retrain it from scratch as you receive new information. Sometimes, receiving new data is an expected part of the domain. Other times, as discussed in Detect data drift (preview) on datasets, model performance can degrade because of changes to a particular sensor, natural data changes such as seasonal effects, or features shifting in their relation to other features.

There is no universal answer to "How do I know if I should retrain?", but the Azure ML event and monitoring tools previously discussed are good starting points for automation. Once you have decided to retrain, you should:

  • Preprocess your data using a repeatable, automated process
  • Train your new model
  • Compare the outputs of your new model to those of your old model
  • Use predefined criteria to decide whether to replace your old model

A theme of the above steps is that your retraining should be automated, not ad hoc. Azure Machine Learning pipelines are a good answer for creating workflows relating to data preparation, training, validation, and deployment. Read Retrain models with Azure Machine Learning designer to see how pipelines and the Azure Machine Learning designer fit into a retraining scenario.
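As an illustration of a predefined replacement criterion, the following hypothetical gate compares a metric logged by the new training run against a tag on the currently registered model, and registers the new model only if it improves; the metric name, tag, run ID placeholder, and model name are all assumptions.

    # Promote the new model only if it beats the current one on a chosen metric.
    from azureml.core import Run, Workspace
    from azureml.core.model import Model

    ws = Workspace.from_config()
    new_run = Run.get(ws, run_id="<new-training-run-id>")  # placeholder run ID

    new_rmse = new_run.get_metrics().get("rmse")            # assumes "rmse" was logged once per run
    current = Model(ws, name="sklearn-regression")          # illustrative registered model
    current_rmse = float(current.tags.get("rmse", "inf"))

    if new_rmse is not None and new_rmse < current_rmse:
        new_run.register_model(model_name="sklearn-regression",
                               model_path="outputs/sklearn_model.pkl",
                               tags={"rmse": str(new_rmse)})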

Automate the ML lifecycle

You can use GitHub and Azure Pipelines to create a continuous integration process that trains a model. In a typical scenario, when a data scientist checks a change into the Git repo for a project, the Azure Pipeline starts a training run. The results of the run can then be inspected to see the performance characteristics of the trained model. You can also create a pipeline that deploys the model as a web service.

The Azure Machine Learning extension makes it easier to work with Azure Pipelines. It provides the following enhancements to Azure Pipelines:

  • Enables workspace selection when defining a service connection.
  • Enables release pipelines to be triggered by trained models created in a training pipeline.

For more information on using Azure Pipelines with Azure Machine Learning, see the following links:

You can also use Azure Data Factory to create a data ingestion pipeline that prepares data for use with training. For more information, see Data ingestion pipeline.

Next steps

Learn more by reading and exploring the following resources: