How Azure Machine Learning works: Architecture and concepts

Learn about the architecture and concepts for Azure Machine Learning. This article gives you a high-level understanding of the components and how they work together to help you build, deploy, and maintain machine learning models.

Workspace

A machine learning workspace is the top-level resource for Azure Machine Learning.

Diagram: Azure Machine Learning architecture of a workspace and its components

The workspace is the centralized place to:

A workspace includes other Azure resources that are used by the workspace:

  • Azure Container Registry (ACR): Registers Docker containers that you use during training and when you deploy a model. To minimize costs, ACR is created only when deployment images are created.
  • Azure Storage account: Serves as the default datastore for the workspace. Jupyter notebooks that are used with your Azure Machine Learning compute instances are stored here as well.
  • Azure Application Insights: Stores monitoring information about your models.
  • Azure Key Vault: Stores secrets that are used by compute targets and other sensitive information that's needed by the workspace.

You can share a workspace with others.

Computes

A compute target is any machine or set of machines you use to run your training script or host your service deployment. You can use your local machine or a remote compute resource as a compute target. With compute targets, you can start training on your local machine and then scale out to the cloud without changing your training script.

Azure Machine Learning introduces two fully managed cloud-based virtual machines (VMs) that are configured for machine learning tasks:

  • Compute instance: A compute instance is a VM that includes multiple tools and environments installed for machine learning. Its primary use is as your development workstation. You can start running sample notebooks with no setup required. A compute instance can also be used as a compute target for training and inferencing jobs.

  • Compute clusters: A compute cluster is a cluster of VMs with multi-node scaling capabilities. Compute clusters are better suited as compute targets for large jobs and production. The cluster scales up automatically when a job is submitted. Use a compute cluster as a training compute target or for dev/test deployment.

For more information about training compute targets, see Training compute targets. For more information about deployment compute targets, see Deployment targets.

Datasets and datastores

Azure Machine Learning Datasets make it easier to access and work with your data. By creating a dataset, you create a reference to the data source location along with a copy of its metadata. Because the data remains in its existing location, you incur no extra storage cost, and don't risk the integrity of your data sources.

For more information, see Create and register Azure Machine Learning Datasets. For more examples using Datasets, see the sample notebooks.

Datasets use datastores to securely connect to your Azure storage services. Datastores store connection information without putting your authentication credentials and the integrity of your original data source at risk. They keep connection details, such as your subscription ID and token authorization, in the Key Vault associated with the workspace, so you can securely access your storage without hard-coding those details in your scripts.

Environments

Workspace > Environments

An environment is the encapsulation of the setting in which training or scoring of your machine learning model happens. The environment specifies the Python packages, environment variables, and software settings around your training and scoring scripts.

For code samples, see the "Manage environments" section of How to use environments.

Experiments

Workspace > Experiments

An experiment is a grouping of many runs from a specified script. It always belongs to a workspace. When you submit a run, you provide an experiment name. Information for the run is stored under that experiment. If the name doesn't exist when you submit the run, a new experiment is automatically created.

For an example of using an experiment, see Tutorial: Train your first model.

Runs

Workspace > Experiments > Run

A run is a single execution of a training script. An experiment will typically contain multiple runs.

Azure Machine Learning records all runs and stores the following information in the experiment:

  • Metadata about the run (timestamp, duration, and so on)
  • Metrics that are logged by your script
  • Output files that are autocollected by the experiment or explicitly uploaded by you
  • A snapshot of the directory that contains your scripts, prior to the run

You produce a run when you submit a script to train a model. A run can have zero or more child runs. For example, the top-level run might have two child runs, each of which might have its own child run.
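The grouping of runs under an experiment and the parent/child relationship between runs can be pictured with a small in-memory model. This is only an illustrative sketch: the `Run` and `Workspace` classes and their methods are hypothetical names invented here, not the classes of the Azure Machine Learning SDK.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Run:
    """Hypothetical stand-in for a run record; not the SDK's Run class."""
    name: str
    metrics: Dict[str, float] = field(default_factory=dict)
    children: List["Run"] = field(default_factory=list)

    def add_child(self, name: str) -> "Run":
        """Attach a child run and return it."""
        child = Run(name)
        self.children.append(child)
        return child

    def count(self) -> int:
        """Number of runs in this subtree, including this run."""
        return 1 + sum(child.count() for child in self.children)


class Workspace:
    """Hypothetical in-memory workspace that groups runs by experiment name."""

    def __init__(self) -> None:
        self.experiments: Dict[str, List[Run]] = {}

    def submit(self, experiment_name: str, run_name: str) -> Run:
        # The experiment is created on first use of its name.
        runs = self.experiments.setdefault(experiment_name, [])
        run = Run(run_name)
        runs.append(run)
        return run


ws = Workspace()
top = ws.submit("demo-experiment", "top-level")
first = top.add_child("child-1")
second = top.add_child("child-2")
first.add_child("grandchild-1")
second.add_child("grandchild-2")
top.metrics["accuracy"] = 0.9  # a metric logged by the script

print(sorted(ws.experiments), top.count())
```

The sketch mirrors the example in the text: one top-level run with two child runs, each of which has a child run of its own.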

Run configurations

Workspace > Experiments > Run > Run configuration

A run configuration is a set of instructions that defines how a script should be run in a specified compute target. The configuration includes a wide set of behavior definitions, such as whether to use an existing Python environment or to use a Conda environment that's built from a specification.

A run configuration can be persisted into a file inside the directory that contains your training script. Or it can be constructed as an in-memory object and used to submit a run.

For example run configurations, see Use a compute target to train your model.

Estimators

To facilitate model training with popular frameworks, the estimator class allows you to easily construct run configurations. You can create and use a generic Estimator to submit training scripts that use any learning framework you choose (such as scikit-learn).

For more information about estimators, see Train ML models with estimators.

Snapshots

Workspace > Experiments > Run > Snapshot

When you submit a run, Azure Machine Learning compresses the directory that contains the script as a zip file and sends it to the compute target. The zip file is then extracted, and the script is run there. Azure Machine Learning also stores the zip file as a snapshot as part of the run record. Anyone with access to the workspace can browse a run record and download the snapshot.
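The compress, send, and extract sequence described above can be approximated locally with Python's standard `zipfile` module. This is a minimal sketch of the idea, not how the service is implemented: the helper names `make_snapshot` and `extract_snapshot`, the file names, and the `compute_target` directory are all assumptions made for illustration.

```python
import tempfile
import zipfile
from pathlib import Path


def make_snapshot(source_dir: Path, zip_path: Path) -> list:
    """Zip every file under source_dir; return the archived relative paths."""
    archived = []
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for path in sorted(source_dir.rglob("*")):
            if path.is_file():
                rel = path.relative_to(source_dir).as_posix()
                zf.write(path, rel)
                archived.append(rel)
    return archived


def extract_snapshot(zip_path: Path, target_dir: Path) -> None:
    """Unpack the snapshot where the script is going to run."""
    with zipfile.ZipFile(zip_path) as zf:
        zf.extractall(target_dir)


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    src = root / "project"
    (src / "utils").mkdir(parents=True)
    (src / "train.py").write_text("print('training')\n")
    (src / "utils" / "helpers.py").write_text("X = 1\n")

    snapshot = root / "snapshot.zip"          # kept outside the source directory
    files = make_snapshot(src, snapshot)

    target = root / "compute_target"          # stands in for the remote machine
    extract_snapshot(snapshot, target)
    extracted_ok = (target / "train.py").read_text() == "print('training')\n"

print(files, extracted_ok)
```

Because the whole directory is archived, anything placed next to the training script ends up in the snapshot.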

Logging

Azure Machine Learning automatically logs standard run metrics for you. However, you can also use the Python SDK to log arbitrary metrics.

There are multiple ways to view your logs: monitoring run status in real time, or viewing results after completion. For more information, see Monitor and view ML run logs.

Note

To prevent unnecessary files from being included in the snapshot, make an ignore file (.gitignore or .amlignore) in the directory. Add the files and directories to exclude to this file. For more information on the syntax to use inside this file, see syntax and patterns for .gitignore. The .amlignore file uses the same syntax. If both files exist, the .amlignore file takes precedence.
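The precedence rule in this note can be sketched as follows. The matching here is deliberately simplified and built on the standard `fnmatch` module: it does not implement the full .gitignore syntax (no negation, anchoring, or `**` handling). The function names and sample file names are hypothetical.

```python
import fnmatch
import tempfile
from pathlib import Path


def load_ignore_patterns(directory: Path) -> list:
    """Read patterns from .amlignore if present, otherwise from .gitignore."""
    for name in (".amlignore", ".gitignore"):  # .amlignore takes precedence
        ignore_file = directory / name
        if ignore_file.is_file():
            lines = ignore_file.read_text().splitlines()
            return [ln.strip() for ln in lines
                    if ln.strip() and not ln.strip().startswith("#")]
    return []


def is_ignored(rel_path: str, patterns: list) -> bool:
    """Simplified matching: a pattern hits the full path or any path component."""
    parts = rel_path.split("/")
    for pattern in patterns:
        pattern = pattern.rstrip("/")
        if fnmatch.fnmatch(rel_path, pattern):
            return True
        if any(fnmatch.fnmatch(part, pattern) for part in parts):
            return True
    return False


with tempfile.TemporaryDirectory() as tmp:
    project = Path(tmp)
    (project / ".gitignore").write_text("*.log\n")
    (project / ".amlignore").write_text("# large inputs\ndata/\n*.csv\n")
    patterns = load_ignore_patterns(project)

candidates = ["train.py", "data/raw.bin", "out/results.csv", "debug.log"]
kept = [p for p in candidates if not is_ignored(p, patterns)]
print(patterns, kept)
```

In this sketch `debug.log` is kept even though .gitignore lists `*.log`, because only the .amlignore patterns are consulted when both files exist.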

Git tracking and integration

When you start a training run where the source directory is a local Git repository, information about the repository is stored in the run history. This works with runs submitted using an estimator, ML pipeline, or script run. It also works for runs submitted from the SDK or Machine Learning CLI.

For more information, see Git integration for Azure Machine Learning.

Models

At its simplest, a model is a piece of code that takes an input and produces output. Creating a machine learning model involves selecting an algorithm, providing it with data, and tuning hyperparameters. Training is an iterative process that produces a trained model, which encapsulates what the model learned during the training process.

You can bring a model that was trained outside of Azure Machine Learning. Or you can train a model by submitting a run of an experiment to a compute target in Azure Machine Learning. Once you have a model, you register the model in the workspace.

Azure Machine Learning is framework agnostic. When you create a model, you can use any popular machine learning framework, such as Scikit-learn, XGBoost, PyTorch, TensorFlow, and Chainer.

For an example of training a model using Scikit-learn, see Tutorial: Train an image classification model with Azure Machine Learning.

Model registry

Workspace > Models

The model registry lets you keep track of all the models in your Azure Machine Learning workspace.

Models are identified by name and version. Each time you register a model with the same name as an existing one, the registry assumes that it's a new version. The version is incremented, and the new model is registered under the same name.

When you register the model, you can provide additional metadata tags and then use the tags when you search for models.
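The name-and-version behavior, together with tag-based search, can be illustrated with a toy in-memory registry. This is a sketch under stated assumptions, not the service's implementation: `ModelRegistry`, `RegisteredModel`, `register`, and `find_by_tag` are hypothetical names, and starting the version count at 1 is an assumption of this example.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class RegisteredModel:
    """Hypothetical record for one registered model version."""
    name: str
    version: int
    tags: Dict[str, str] = field(default_factory=dict)


class ModelRegistry:
    """Toy registry: registering an existing name creates the next version."""

    def __init__(self) -> None:
        self._models: Dict[str, List[RegisteredModel]] = {}

    def register(self, name: str,
                 tags: Optional[Dict[str, str]] = None) -> RegisteredModel:
        versions = self._models.setdefault(name, [])
        # Assumption of this sketch: versions start at 1 and grow by 1.
        model = RegisteredModel(name, len(versions) + 1, dict(tags or {}))
        versions.append(model)
        return model

    def find_by_tag(self, key: str, value: str) -> List[RegisteredModel]:
        """Return every registered version whose tag `key` equals `value`."""
        return [m for versions in self._models.values()
                for m in versions if m.tags.get(key) == value]


registry = ModelRegistry()
v1 = registry.register("mnist", {"stage": "dev"})
v2 = registry.register("mnist", {"stage": "prod"})    # same name, new version
other = registry.register("churn", {"stage": "prod"})

prod = [(m.name, m.version) for m in registry.find_by_tag("stage", "prod")]
print(v1.version, v2.version, other.version, prod)
```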

Tip

A registered model is a logical container for one or more files that make up your model. For example, if you have a model that is stored in multiple files, you can register them as a single model in your Azure Machine Learning workspace. After registration, you can then download or deploy the registered model and receive all the files that were registered.

You can't delete a registered model that is being used by an active deployment.

For an example of registering a model, see Train an image classification model with Azure Machine Learning.

Deployment

You deploy a registered model as a service endpoint. You need the following components:

  • Environment. This environment encapsulates the dependencies required to run your model for inference.
  • Scoring code. This script accepts requests, scores the requests by using the model, and returns the results.
  • Inference configuration. The inference configuration specifies the environment, entry script, and other components needed to run the model as a service.
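As a rough illustration of the scoring code component, the sketch below accepts a JSON request, scores it with a model, and returns a JSON result. The `init` and `run` function names follow the convention commonly used for Azure Machine Learning entry scripts; the `MeanModel` class, the `"data"` request key, and the response shape are hypothetical choices made for this example. A real script would load the registered model files in place of the stand-in.

```python
import json

model = None  # populated by init()


class MeanModel:
    """Hypothetical stand-in for a trained model: predicts the mean of each row."""

    def predict(self, rows):
        return [sum(row) / len(row) for row in rows]


def init():
    """Load the model once, when the service starts."""
    global model
    model = MeanModel()  # a real script would load registered model files here


def run(raw_data: str) -> str:
    """Score one JSON request and return a JSON response."""
    try:
        rows = json.loads(raw_data)["data"]
        return json.dumps({"result": model.predict(rows)})
    except (KeyError, TypeError, ValueError, ZeroDivisionError) as exc:
        # Return the failure to the caller instead of crashing the service.
        return json.dumps({"error": str(exc)})


init()
response = run(json.dumps({"data": [[1, 2, 3], [4, 6]]}))
print(response)
```

Loading the model in `init` rather than in `run` keeps per-request work limited to parsing and scoring.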

For more information about these components, see Deploy models with Azure Machine Learning.

Endpoints

Workspace > Endpoints

An endpoint is an instantiation of your model into either a web service that can be hosted in the cloud or an IoT module for integrated device deployments.

Web service endpoint

When deploying a model as a web service, the endpoint can be deployed on Azure Container Instances, Azure Kubernetes Service, or FPGAs. You create the service from your model, script, and associated files. These are placed into a base container image, which contains the execution environment for the model. The image has a load-balanced HTTP endpoint that receives scoring requests that are sent to the web service.

You can enable Application Insights telemetry or model telemetry to monitor your web service. The telemetry data is accessible only to you. It's stored in your Application Insights and storage account instances.

If you've enabled automatic scaling, Azure automatically scales your deployment.

For an example of deploying a model as a web service, see Deploy an image classification model in Azure Container Instances.

Real-time endpoints

When you deploy a trained model in the designer (preview), you can deploy the model as a real-time endpoint. A real-time endpoint commonly receives a single request via the REST endpoint and returns a prediction in real time. This is in contrast to batch processing, which processes multiple values at once and saves the results after completion to a datastore.

Pipeline endpoints

Pipeline endpoints let you call your ML Pipelines programmatically via a REST endpoint. Pipeline endpoints let you automate your pipeline workflows.

A pipeline endpoint is a collection of published pipelines. This logical organization lets you manage and call multiple pipelines using the same endpoint. Each published pipeline in a pipeline endpoint is versioned. You can select a default pipeline for the endpoint, or specify a version in the REST call.
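The relationship between an endpoint, its versioned pipelines, and the default version can be modeled in a few lines. This is a hypothetical in-memory sketch, not the SDK class or the REST interface: the `PipelineEndpoint` class, its method names, and numbering versions from 1 are all assumptions made here.

```python
from typing import Callable, Dict, Optional


class PipelineEndpoint:
    """Hypothetical model of one endpoint fronting several pipeline versions."""

    def __init__(self, name: str) -> None:
        self.name = name
        self._versions: Dict[int, Callable[[str], str]] = {}
        self.default_version: Optional[int] = None

    def add(self, pipeline: Callable[[str], str], make_default: bool = False) -> int:
        """Publish a pipeline under the next version number and return it."""
        version = len(self._versions) + 1  # assumption: versions start at 1
        self._versions[version] = pipeline
        if make_default or self.default_version is None:
            self.default_version = version
        return version

    def submit(self, payload: str, version: Optional[int] = None) -> str:
        """Invoke the requested version, or the default when none is given."""
        chosen = self.default_version if version is None else version
        if chosen not in self._versions:
            raise KeyError(f"no pipeline version {chosen} on endpoint {self.name}")
        return self._versions[chosen](payload)


endpoint = PipelineEndpoint("training-endpoint")
endpoint.add(lambda payload: f"v1:{payload}")
endpoint.add(lambda payload: f"v2:{payload}")

by_default = endpoint.submit("data")              # uses the default version
explicit = endpoint.submit("data", version=2)     # caller picks a version
endpoint.default_version = 2
after_switch = endpoint.submit("data")
print(by_default, explicit, after_switch)
```

Callers that never pass a version pick up a new pipeline as soon as the default changes, while callers that pin a version are unaffected.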

IoT module endpoints

A deployed IoT module endpoint is a Docker container that includes your model and associated script or application and any additional dependencies. You deploy these modules by using Azure IoT Edge on edge devices.

If you've enabled monitoring, Azure collects telemetry data from the model inside the Azure IoT Edge module. The telemetry data is accessible only to you, and it's stored in your storage account instance.

Azure IoT Edge ensures that your module is running, and it monitors the device that's hosting it.

Automation

Azure Machine Learning CLI

The Azure Machine Learning CLI is an extension to the Azure CLI, a cross-platform command-line interface for the Azure platform. This extension provides commands to automate your machine learning activities.

ML Pipelines

You use machine learning pipelines to create and manage workflows that stitch together machine learning phases. For example, a pipeline might include data preparation, model training, model deployment, and inference/scoring phases. Each phase can encompass multiple steps, each of which can run unattended in various compute targets.

Pipeline steps are reusable, and can be run without rerunning the previous steps if the output of those steps hasn't changed. For example, you can retrain a model without rerunning costly data preparation steps if the data hasn't changed. Pipelines also allow data scientists to collaborate while working on separate areas of a machine learning workflow.
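One way to picture step reuse is a cache keyed on a step's name and inputs: a step executes only when that key has not been seen before. The sketch below is an assumption about how such reuse could work, not a description of the service's actual criteria. `StepCache`, `run_step`, and the two toy steps are hypothetical names.

```python
import hashlib
import json


class StepCache:
    """Toy cache: a step reruns only when its name or inputs change."""

    def __init__(self):
        self._results = {}
        self.executions = []  # names of the steps that actually ran

    def run_step(self, name, func, inputs):
        # Key the result on the step name plus a stable dump of its inputs.
        payload = json.dumps([name, inputs], sort_keys=True).encode()
        key = hashlib.sha256(payload).hexdigest()
        if key not in self._results:
            self._results[key] = func(inputs)
            self.executions.append(name)
        return self._results[key]


def prepare(inputs):
    """Stand-in for a costly data preparation step."""
    return [x * 2 for x in inputs["raw"]]


def train(inputs):
    """Stand-in for a training step that depends on a hyperparameter."""
    return sum(inputs["prepared"]) * inputs["learning_rate"]


def run_pipeline(cache, raw, learning_rate):
    prepared = cache.run_step("prepare", prepare, {"raw": raw})
    return cache.run_step(
        "train", train, {"prepared": prepared, "learning_rate": learning_rate}
    )


cache = StepCache()
first_result = run_pipeline(cache, [1, 2, 3], 0.5)
second_result = run_pipeline(cache, [1, 2, 3], 0.25)  # data unchanged
print(first_result, second_result, cache.executions)
```

In the second call only the training step executes, because the raw data, and therefore the preparation step's key, is unchanged.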

Interacting with your workspace

Studio

Azure Machine Learning studio provides a web view of all the artifacts in your workspace. You can view results and details of your datasets, experiments, pipelines, models, and endpoints. You can also manage compute resources and datastores in the studio.

The studio is also where you access the interactive tools that are part of Azure Machine Learning:

Programming tools

Important

Tools marked (preview) below are currently in public preview. The preview version is provided without a service level agreement, and it's not recommended for production workloads. Certain features might not be supported or might have constrained capabilities.

Next steps

To get started with Azure Machine Learning, see: