How Azure Machine Learning works: Architecture and concepts

Learn about the architecture, concepts, and workflow for Azure Machine Learning. The major components of the service and the general workflow for using the service are shown in the following diagram:

[Diagram: Azure Machine Learning architecture and workflow]


The machine learning model workflow generally follows this sequence:

  1. Train

    • Develop machine learning training scripts in Python or with the visual designer.
    • Create and configure a compute target.
    • Submit the scripts to the configured compute target to run in that environment. During training, the scripts can read from or write to a datastore. The records of execution are saved as runs in the workspace and grouped under experiments.
  2. Package - After a satisfactory run is found, register the persisted model in the model registry.

  3. Validate - Query the experiment for logged metrics from the current and past runs. If the metrics don't indicate a desired outcome, loop back to step 1 and iterate on your scripts.

  4. Deploy - Develop a scoring script that uses the model, and deploy the model as a web service in Azure or to an IoT Edge device.

  5. Monitor - Monitor for data drift between the training dataset and inference data of a deployed model. When necessary, loop back to step 1 to retrain the model with new training data.

Tools for Azure Machine Learning

Use these tools for Azure Machine Learning:


Although this article defines terms and concepts used by Azure Machine Learning, it does not define terms and concepts for the Azure platform. For more information about Azure platform terminology, see the Microsoft Azure glossary.



An activity represents a long-running operation. The following operations are examples of activities:

  • Creating or deleting a compute target
  • Running a script on a compute target

Activities can provide notifications through the SDK or the web UI so that you can easily monitor the progress of these operations.


The workspace is the top-level resource for Azure Machine Learning. It provides a centralized place to work with all the artifacts you create when you use Azure Machine Learning. You can share a workspace with others. For a detailed description of workspaces, see What is an Azure Machine Learning workspace?.


An experiment is a grouping of many runs from a specified script. It always belongs to a workspace. When you submit a run, you provide an experiment name. Information for the run is stored under that experiment. If you submit a run and specify an experiment name that doesn't exist, a new experiment with that name is automatically created.

For an example of using an experiment, see Tutorial: Train your first model.
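
The grouping behavior described above can be pictured with a small toy model. This is only an illustrative sketch in plain Python, not the Azure Machine Learning SDK: the `Workspace` and `Experiment` classes, the `submit` method, and the run ID format are all hypothetical names chosen for this example.

```python
from dataclasses import dataclass, field


@dataclass
class Experiment:
    """Hypothetical stand-in for an experiment: a named group of runs."""
    name: str
    runs: list = field(default_factory=list)


class Workspace:
    """Hypothetical stand-in for a workspace that owns experiments."""

    def __init__(self):
        self.experiments = {}

    def submit(self, experiment_name, script):
        # Create the experiment on first use, mirroring the behavior described above.
        experiment = self.experiments.setdefault(
            experiment_name, Experiment(experiment_name)
        )
        # Illustrative run ID format; the real service assigns its own IDs.
        run_id = f"{experiment_name}_{len(experiment.runs) + 1}"
        experiment.runs.append({"run_id": run_id, "script": script})
        return run_id


ws = Workspace()
first = ws.submit("churn-model", "train.py")   # the experiment is created here
second = ws.submit("churn-model", "train.py")  # the same experiment is reused
print(sorted(ws.experiments), first, second)
```

Both submissions land under the single "churn-model" experiment, which is the point of the sketch: the experiment name acts as the grouping key.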


A run is a single execution of a training script. An experiment will typically contain multiple runs.

Azure Machine Learning records all runs and stores the following information in the experiment:

  • Metadata about the run (timestamp, duration, and so on)
  • Metrics that are logged by your script
  • Output files that are autocollected by the experiment or explicitly uploaded by you
  • A snapshot of the directory that contains your scripts, prior to the run

You produce a run when you submit a script to train a model. A run can have zero or more child runs. For example, the top-level run might have two child runs, each of which might have its own child run.
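
The parent and child relationship between runs forms a tree. The following is a minimal sketch of that structure in plain Python; `RunNode` and its methods are hypothetical names for illustration and are not part of the Azure Machine Learning SDK.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class RunNode:
    """Hypothetical record for one run and its child runs."""
    run_id: str
    children: List["RunNode"] = field(default_factory=list)

    def add_child(self, run_id: str) -> "RunNode":
        child = RunNode(run_id)
        self.children.append(child)
        return child

    def walk(self, depth: int = 0):
        # Yield (depth, run_id) for this run and every descendant, depth first.
        yield depth, self.run_id
        for child in self.children:
            yield from child.walk(depth + 1)


# Mirror the example above: a top-level run with two child runs,
# each of which has its own child run.
top = RunNode("top")
a = top.add_child("child-a")
b = top.add_child("child-b")
a.add_child("grandchild-a1")
b.add_child("grandchild-b1")

for depth, run_id in top.walk():
    print("  " * depth + run_id)
```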

Run configurations

A run configuration is a set of instructions that defines how a script should be run in a specified compute target. The configuration includes a wide set of behavior definitions, such as whether to use an existing Python environment or to use a Conda environment that's built from a specification.

A run configuration can be persisted into a file inside the directory that contains your training script, or it can be constructed as an in-memory object and used to submit a run.
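
To make the two forms concrete (an in-memory object versus a file saved next to the training script), here is a rough sketch using only the standard library. The `ToyRunConfig` class, its field names, and the `toy.runconfig.json` file name are assumptions made for illustration; the real run configuration format is defined by the service.

```python
import json
import tempfile
from dataclasses import dataclass, asdict
from pathlib import Path


@dataclass
class ToyRunConfig:
    """Hypothetical run configuration; field names are illustrative only."""
    script: str
    compute_target: str
    use_existing_python: bool = False
    conda_packages: tuple = ()

    def save(self, directory: Path, name: str = "toy.runconfig.json") -> Path:
        # Persist the configuration as a file inside the script directory.
        path = directory / name
        path.write_text(json.dumps(asdict(self), indent=2))
        return path

    @classmethod
    def load(cls, path: Path) -> "ToyRunConfig":
        data = json.loads(path.read_text())
        # JSON has no tuple type, so convert the list back to a tuple.
        data["conda_packages"] = tuple(data["conda_packages"])
        return cls(**data)


# Form 1: construct the configuration as an in-memory object.
in_memory = ToyRunConfig("train.py", "local", conda_packages=("scikit-learn",))

# Form 2: persist it to a file and load it back.
with tempfile.TemporaryDirectory() as tmp:
    saved_path = in_memory.save(Path(tmp))
    reloaded = ToyRunConfig.load(saved_path)

print(reloaded == in_memory)
```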

For example run configurations, see Select and use a compute target to train your model.


When you submit a run, Azure Machine Learning compresses the directory that contains the script as a zip file and sends it to the compute target. The zip file is then extracted, and the script is run there. Azure Machine Learning also stores the zip file as a snapshot as part of the run record. Anyone with access to the workspace can browse a run record and download the snapshot.
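
The compress, send, and extract sequence can be approximated locally with Python's `zipfile` module. This sketch only imitates the idea on a temporary directory; the function names `make_snapshot` and `extract_snapshot` and the folder names are hypothetical, and the actual transfer to a compute target is not modeled.

```python
import tempfile
import zipfile
from pathlib import Path


def make_snapshot(source_dir: Path, zip_path: Path) -> list:
    """Zip every file under source_dir and return the archived relative paths."""
    archived = []
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for path in sorted(source_dir.rglob("*")):
            if path.is_file():
                relative = path.relative_to(source_dir).as_posix()
                zf.write(path, relative)
                archived.append(relative)
    return archived


def extract_snapshot(zip_path: Path, target_dir: Path) -> None:
    """Unpack the snapshot in the place where the script will run."""
    with zipfile.ZipFile(zip_path) as zf:
        zf.extractall(target_dir)


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    source = root / "project"
    (source / "utils").mkdir(parents=True)
    (source / "train.py").write_text("print('training')\n")
    (source / "utils" / "helpers.py").write_text("# helpers\n")

    # The zip file lives outside the source directory so it is not archived itself.
    snapshot_files = make_snapshot(source, root / "snapshot.zip")
    extract_snapshot(root / "snapshot.zip", root / "compute_target")
    extracted_ok = (root / "compute_target" / "train.py").exists()

print(snapshot_files, extracted_ok)
```

Because the whole directory is archived, anything stored beside the script ends up in the snapshot, which is why keeping the directory small matters.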


To prevent unnecessary files from being included in the snapshot, create an ignore file (.gitignore or .amlignore). Place this file in the snapshot directory and add the names of the files to ignore. The .amlignore file uses the same syntax and patterns as the .gitignore file. If both files exist, the .amlignore file takes precedence.
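
The precedence rule can be sketched as follows. This is a simplified illustration, not the service's implementation: it assumes that when .amlignore exists the .gitignore file is not consulted at all, and it uses `fnmatch` glob matching, which covers only a subset of real .gitignore syntax (no negation or directory-only patterns). The function names are hypothetical.

```python
import fnmatch
import tempfile
from pathlib import Path


def load_ignore_patterns(snapshot_dir: Path) -> list:
    """Read patterns from .amlignore if present, otherwise from .gitignore."""
    for name in (".amlignore", ".gitignore"):  # .amlignore takes precedence
        ignore_file = snapshot_dir / name
        if ignore_file.exists():
            lines = ignore_file.read_text().splitlines()
            return [ln.strip() for ln in lines
                    if ln.strip() and not ln.startswith("#")]
    return []


def files_to_snapshot(snapshot_dir: Path) -> list:
    """Return relative file paths that no ignore pattern matches."""
    patterns = load_ignore_patterns(snapshot_dir)
    kept = []
    for path in sorted(snapshot_dir.rglob("*")):
        if not path.is_file():
            continue
        relative = path.relative_to(snapshot_dir).as_posix()
        # Simplification: match each pattern against the full path and the file name.
        ignored = any(
            fnmatch.fnmatch(relative, p) or fnmatch.fnmatch(path.name, p)
            for p in patterns
        )
        if not ignored:
            kept.append(relative)
    return kept


with tempfile.TemporaryDirectory() as tmp:
    project = Path(tmp)
    (project / "train.py").write_text("print('training')\n")
    (project / "data.csv").write_text("a,b\n1,2\n")
    (project / "debug.log").write_text("log line\n")
    (project / ".gitignore").write_text("*.log\n")
    (project / ".amlignore").write_text("*.csv\n")
    kept = files_to_snapshot(project)

print(kept)
```

In this sketch data.csv is excluded because of .amlignore, while debug.log is kept because the .gitignore pattern is never read once .amlignore is found.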

GitHub tracking and integration

When you start a training run where the source directory is a local Git repository, information about the repository is stored in the run history. This works with runs submitted using an estimator, ML pipeline, or script run. It also works for runs submitted from the SDK or Machine Learning CLI.

For more information, see Git integration for Azure Machine Learning.


When you develop your solution, use the Azure Machine Learning Python SDK in your Python script to log arbitrary metrics. After the run, query the metrics to determine whether the run has produced the model you want to deploy.
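
The log-then-query pattern looks roughly like the sketch below. It deliberately avoids the real SDK so that it runs anywhere: `ToyRun`, its `log` and `latest` methods, and the `best_run` helper are hypothetical stand-ins, and the accuracy values are made-up sample numbers.

```python
from collections import defaultdict


class ToyRun:
    """Hypothetical run record that stores logged metrics by name."""

    def __init__(self, run_id):
        self.run_id = run_id
        self.metrics = defaultdict(list)

    def log(self, name, value):
        # Logging the same name repeatedly builds up a series of values.
        self.metrics[name].append(value)

    def latest(self, name):
        return self.metrics[name][-1]


def best_run(runs, metric, higher_is_better=True):
    """Pick the run whose most recent value of `metric` is best."""
    chooser = max if higher_is_better else min
    return chooser(runs, key=lambda r: r.latest(metric))


# Made-up sample values, logged once per epoch.
run_a, run_b = ToyRun("run-a"), ToyRun("run-b")
for acc in (0.71, 0.78, 0.80):
    run_a.log("accuracy", acc)
for acc in (0.69, 0.84):
    run_b.log("accuracy", acc)

winner = best_run([run_a, run_b], "accuracy")
print(winner.run_id, winner.latest("accuracy"))
```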

ML Pipelines

You use machine learning pipelines to create and manage workflows that stitch together machine learning phases. For example, a pipeline might include data preparation, model training, model deployment, and inference/scoring phases. Each phase can encompass multiple steps, each of which can run unattended in various compute targets.

Pipeline steps are reusable, and can be run without rerunning the previous steps if the output of those steps hasn't changed. For example, you can retrain a model without rerunning costly data preparation steps if the data hasn't changed. Pipelines also allow data scientists to collaborate while working on separate areas of a machine learning workflow.
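
One way to picture step reuse is a cache keyed by a fingerprint of each step's name and inputs. The sketch below is an assumption about how such reuse could work in principle, not a description of how Azure Machine Learning decides it; `ToyPipeline`, `run_step`, and the two step functions are hypothetical.

```python
import hashlib
import json


class ToyPipeline:
    """Hypothetical pipeline that reuses a step's output when its inputs are unchanged."""

    def __init__(self):
        self.cache = {}     # fingerprint -> cached step output
        self.executed = []  # names of steps that actually ran

    def run_step(self, name, func, inputs):
        # Fingerprint the step name together with its inputs.
        payload = json.dumps({"step": name, "inputs": inputs}, sort_keys=True)
        fingerprint = hashlib.sha256(payload.encode("utf-8")).hexdigest()
        if fingerprint not in self.cache:
            self.cache[fingerprint] = func(inputs)
            self.executed.append(name)
        return self.cache[fingerprint]


def prepare(raw):
    # Stand-in for a costly data preparation step.
    return [x * 2 for x in raw]


def train(config):
    # Stand-in for model training.
    return sum(config["data"]) * config["learning_rate"]


pipeline = ToyPipeline()
raw_data = [1, 2, 3]

prepared = pipeline.run_step("prepare", prepare, raw_data)
pipeline.run_step("train", train, {"data": prepared, "learning_rate": 0.1})

# Retrain with a new hyperparameter: the data is unchanged, so "prepare" is reused.
prepared = pipeline.run_step("prepare", prepare, raw_data)
pipeline.run_step("train", train, {"data": prepared, "learning_rate": 0.5})

print(pipeline.executed)
```

The preparation step runs once and the training step runs twice, which matches the retraining scenario described above.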

For more information about machine learning pipelines with this service, see Pipelines and Azure Machine Learning.


At its simplest, a model is a piece of code that takes an input and produces output. Creating a machine learning model involves selecting an algorithm, providing it with data, and tuning hyperparameters. Training is an iterative process that produces a trained model, which encapsulates what the model learned during the training process.

A model is produced by a run in Azure Machine Learning. You can also use a model that's trained outside of Azure Machine Learning. You can register a model in an Azure Machine Learning workspace.

Azure Machine Learning is framework agnostic. When you create a model, you can use any popular machine learning framework, such as Scikit-learn, XGBoost, PyTorch, TensorFlow, and Chainer.

For an example of training a model using Scikit-learn and an estimator, see Tutorial: Train an image classification model with Azure Machine Learning.

The model registry keeps track of all the models in your Azure Machine Learning workspace.

Models are identified by name and version. Each time you register a model with the same name as an existing one, the registry assumes that it's a new version. The version is incremented, and the new model is registered under the same name.

When you register the model, you can provide additional metadata tags and then use the tags when you search for models.
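
The name-and-version rule and tag-based search can be illustrated with a toy in-memory registry. This is a sketch under stated assumptions: `ToyModelRegistry`, `RegisteredModel`, and `find_by_tag` are hypothetical names, versions are assumed to start at 1, and nothing here calls the real service.

```python
from dataclasses import dataclass, field


@dataclass
class RegisteredModel:
    name: str
    version: int
    files: tuple
    tags: dict = field(default_factory=dict)


class ToyModelRegistry:
    """Hypothetical in-memory registry keyed by (name, version)."""

    def __init__(self):
        self._models = {}

    def register(self, name, files, tags=None):
        # Reusing an existing name produces the next version under that name.
        existing = [v for (n, v) in self._models if n == name]
        version = max(existing, default=0) + 1
        model = RegisteredModel(name, version, tuple(files), dict(tags or {}))
        self._models[(name, version)] = model
        return model

    def find_by_tag(self, key, value):
        return [m for m in self._models.values() if m.tags.get(key) == value]


registry = ToyModelRegistry()
v1 = registry.register("digits", ["model.pkl"], tags={"stage": "dev"})
v2 = registry.register("digits", ["model.pkl", "labels.json"], tags={"stage": "prod"})
prod = registry.find_by_tag("stage", "prod")
print(v1.version, v2.version, [(m.name, m.version) for m in prod])
```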


A registered model is a logical container for one or more files that make up your model. For example, if you have a model that is stored in multiple files, you can register them as a single model in your Azure Machine Learning workspace. After registration, you can download or deploy the registered model and receive all the files that were registered.

You can't delete a registered model that is being used by an active deployment.

For an example of registering a model, see Train an image classification model with Azure Machine Learning.


Azure ML Environments are used to specify the configuration (Docker, Python, Spark, and so on) used to create a reproducible environment for data preparation, model training, and model serving. They are managed and versioned entities within your Azure Machine Learning workspace that enable reproducible, auditable, and portable machine learning workflows across different compute targets.

You can use an environment object on your local compute to develop your training script, reuse that same environment on Azure Machine Learning Compute for model training at scale, and even deploy your model with that same environment.

Learn how to create and manage a reusable ML environment for training and inference.

Training scripts

To train a model, you specify the directory that contains the training script and associated files. You also specify an experiment name, which is used to store information that's gathered during training. During training, the entire directory is copied to the training environment (compute target), and the script that's specified by the run configuration is started. A snapshot of the directory is also stored under the experiment in the workspace.

For an example, see Tutorial: Train an image classification model with Azure Machine Learning.


To facilitate model training with popular frameworks, the estimator class allows you to easily construct run configurations. You can create and use a generic Estimator to submit training scripts that use any learning framework you choose (such as scikit-learn).

For PyTorch, TensorFlow, and Chainer tasks, Azure Machine Learning also provides respective PyTorch, TensorFlow, and Chainer estimators to simplify using these frameworks.

For more information, see the following articles:


An endpoint is an instantiation of your model into either a web service that can be hosted in the cloud or an IoT module for integrated device deployments.

Web service endpoint

When you deploy a model as a web service, the endpoint can be deployed on Azure Container Instances, Azure Kubernetes Service, or FPGAs. You create the service from your model, script, and associated files. These are placed into a base container image, which contains the execution environment for the model. The image has a load-balanced HTTP endpoint that receives scoring requests that are sent to the web service.
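
A scoring script of the kind mentioned above might look like the following sketch. It assumes an entry-script layout with an `init()` function that loads the model once and a `run()` function that handles each request; the article itself does not spell out this layout, and the averaging "model" and the `{"data": ...}` request shape are placeholders invented for illustration.

```python
import json

model = None


def init():
    """Load the model once, when the service container starts."""
    global model
    # Placeholder model: a real script would load the registered model files here.
    model = lambda features: sum(features) / len(features)


def run(raw_data):
    """Handle one scoring request carrying a JSON body like {"data": [[...], ...]}."""
    try:
        rows = json.loads(raw_data)["data"]
        return json.dumps({"predictions": [model(row) for row in rows]})
    except (KeyError, TypeError, ValueError, ZeroDivisionError) as exc:
        # Return the error as JSON so the caller gets a structured response.
        return json.dumps({"error": str(exc)})


# Simulate what the web service does: initialize once, then score a request.
init()
response = run(json.dumps({"data": [[1, 2, 3], [10, 20]]}))
print(response)
```

Separating one-time loading from per-request scoring keeps request latency low, because the model is not reloaded for every call.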

Azure helps you monitor your web service by collecting Application Insights telemetry or model telemetry, if you've chosen to enable this feature. The telemetry data is accessible only to you, and it's stored in your Application Insights and storage account instances.

If you've enabled automatic scaling, Azure automatically scales your deployment.

For an example of deploying a model as a web service, see Deploy an image classification model in Azure Container Instances.

IoT module endpoints

A deployed IoT module endpoint is a Docker container that includes your model and associated script or application and any additional dependencies. You deploy these modules by using Azure IoT Edge on edge devices.

If you've enabled monitoring, Azure collects telemetry data from the model inside the Azure IoT Edge module. The telemetry data is accessible only to you, and it's stored in your storage account instance.

Azure IoT Edge ensures that your module is running, and it monitors the device that's hosting it.

Compute instance (preview)

An Azure Machine Learning compute instance (formerly Notebook VM) is a fully managed cloud-based workstation that includes multiple tools and environments installed for machine learning. Compute instances can be used as a compute target for training and inferencing jobs. For large tasks, Azure Machine Learning compute clusters with multi-node scaling capabilities are a better compute target choice.

Learn more about compute instances.

Datasets and datastores

Azure Machine Learning Datasets (preview) make it easier to access and work with your data. Datasets manage data in various scenarios such as model training and pipeline creation. Using the Azure Machine Learning SDK, you can access underlying storage, explore data, and manage the life cycle of different Dataset definitions.

Datasets provide methods for working with data in popular formats, such as using from_delimited_files() or to_pandas_dataframe().

For more information, see Create and register Azure Machine Learning Datasets. For more examples using Datasets, see the sample notebooks.

A datastore is a storage abstraction over an Azure storage account. The datastore can use either an Azure blob container or an Azure file share as the back-end storage. Each workspace has a default datastore, and you can register additional datastores. Use the Python SDK API or the Azure Machine Learning CLI to store and retrieve files from the datastore.

Compute targets

A compute target lets you specify the compute resource where you run your training script or host your service deployment. This location may be your local machine or a cloud-based compute resource.

Learn more about the available compute targets for training and deployment.

Next steps

To get started with Azure Machine Learning, see: