What are Azure Machine Learning pipelines?

In this article, you learn how a machine learning pipeline helps you build, optimize, and manage your machine learning workflow.

Which Azure pipeline technology should I use?

The Azure cloud provides several types of pipeline, each with a different purpose. The following table lists the different pipelines and what they are used for:

| Scenario | Primary persona | Azure offering | OSS offering | Canonical pipe | Strengths |
| --- | --- | --- | --- | --- | --- |
| Model orchestration (Machine learning) | Data scientist | Azure Machine Learning Pipelines | Kubeflow Pipelines | Data -> Model | Distribution, caching, code-first, reuse |
| Data orchestration (Data prep) | Data engineer | Azure Data Factory pipelines | Apache Airflow | Data -> Data | Strongly typed movement, data-centric activities |
| Code & app orchestration (CI/CD) | App developer / Ops | Azure Pipelines | Jenkins | Code + Model -> App/Service | Most open and flexible activity support, approval queues, phases with gating |

What can machine learning pipelines do?

An Azure Machine Learning pipeline is an independently executable workflow of a complete machine learning task. Subtasks are encapsulated as a series of steps within the pipeline. An Azure Machine Learning pipeline can be as simple as one that calls a Python script, so it may do just about anything. Pipelines should focus on machine learning tasks such as:

  • Data preparation, including importing, validating and cleaning, munging and transformation, normalization, and staging
  • Training configuration, including parameterizing arguments, file paths, and logging / reporting configurations
  • Training and validating efficiently and repeatedly. Efficiency might come from specifying specific data subsets, different hardware compute resources, distributed processing, and progress monitoring
  • Deployment, including versioning, scaling, provisioning, and access control

Independent steps allow multiple data scientists to work on the same pipeline at the same time without over-taxing compute resources. Separate steps also make it easy to use different compute types/sizes for each step.

After the pipeline is designed, there is often more fine-tuning around the training loop of the pipeline. When you rerun a pipeline, the run jumps to the steps that need to be rerun, such as an updated training script. Steps that do not need to be rerun are skipped.
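The reuse behavior described above can be illustrated with a small, self-contained sketch. This is not the SDK's actual caching logic, just a simulation of the idea: each step is fingerprinted from its script and inputs, and a rerun executes only the steps whose fingerprints changed.

```python
import hashlib

def fingerprint(script: str, inputs: tuple) -> str:
    """Hash a step's script content and input identifiers together."""
    h = hashlib.sha256()
    h.update(script.encode())
    for item in inputs:
        h.update(item.encode())
    return h.hexdigest()

def run_pipeline(steps, cache):
    """Run (name, script, inputs) steps; skip steps whose fingerprint is unchanged."""
    executed = []
    for name, script, inputs in steps:
        fp = fingerprint(script, inputs)
        if cache.get(name) == fp:
            continue  # unchanged step: reuse the cached result
        executed.append(name)
        cache[name] = fp
    return executed

cache = {}
steps = [("prep", "clean(df)", ("raw.csv",)),
         ("train", "fit(model)", ("prepped",))]
run_pipeline(steps, cache)          # first run executes both steps
steps[1] = ("train", "fit(model, lr=0.01)", ("prepped",))  # edit the training script
run_pipeline(steps, cache)          # the rerun executes only "train"
```

A real pipeline run also tracks data dependencies between steps, so a changed upstream step invalidates its downstream consumers as well.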

With pipelines, you may choose to use different hardware for different tasks. Azure coordinates the various compute targets you use, so your intermediate data seamlessly flows to downstream compute targets.

You can track the metrics for your pipeline experiments directly in the Azure portal or your workspace landing page (preview). After a pipeline has been published, you can configure a REST endpoint, which allows you to rerun the pipeline from any platform or stack.
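A rerun triggered through the REST endpoint is just an authenticated HTTP POST. The sketch below only builds the request payload; the endpoint URI and token are placeholders (a real URI comes from the published pipeline object, and the token from Azure Active Directory), and the exact body schema should be checked against the current service documentation.

```python
import json

# Placeholders, not real values:
endpoint = "https://<your-pipeline-rest-endpoint>"
headers = {
    "Authorization": "Bearer <aad-token>",
    "Content-Type": "application/json",
}
# The body names the experiment to run under and any pipeline-parameter overrides.
body = {
    "ExperimentName": "MyExperiment",
    "ParameterAssignments": {"learning_rate": "0.01"},
}
payload = json.dumps(body)
# In practice you would send it with any HTTP client, for example:
# response = requests.post(endpoint, headers=headers, data=payload)
```

Because the trigger is plain HTTP, any platform or stack that can make a REST call can rerun the pipeline.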

In short, all of the complex tasks of the machine learning lifecycle can be helped with pipelines. Other Azure pipeline technologies have their own strengths. Azure Data Factory pipelines excel at working with data, and Azure Pipelines is the right tool for continuous integration and deployment. But if your focus is machine learning, Azure Machine Learning pipelines are likely to be the best choice for your workflow needs.

Analyzing dependencies

Many programming ecosystems have tools that orchestrate resource, library, or compilation dependencies. Generally, these tools use file timestamps to calculate dependencies. When a file is changed, only it and its dependents are updated (downloaded, recompiled, or packaged). Azure Machine Learning pipelines extend this concept. Like traditional build tools, pipelines calculate dependencies between steps and only perform the necessary recalculations.
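The build-tool analogy can be made concrete with a make-style check, shown here as an illustrative sketch rather than any tool's actual implementation: a target is stale when any of its dependencies carries a newer timestamp.

```python
def needs_rebuild(target_mtime, dep_mtimes):
    """A make-style check: rebuild when any dependency is newer than the target."""
    return any(dep > target_mtime for dep in dep_mtimes)

def stale_targets(graph, mtimes):
    """graph maps target -> list of dependencies; return the targets to update.
    A real build tool would also propagate staleness transitively downstream."""
    return [target for target, deps in graph.items()
            if needs_rebuild(mtimes[target], [mtimes[d] for d in deps])]

graph = {"model.pkl": ["train.py", "data.csv"],
         "report.html": ["model.pkl"]}
mtimes = {"train.py": 200, "data.csv": 100, "model.pkl": 150, "report.html": 160}
stale = stale_targets(graph, mtimes)  # train.py (200) is newer than model.pkl (150)
```

As the next paragraph explains, Azure Machine Learning goes further than timestamps, because each step may also differ in its hardware and software requirements.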

The dependency analysis in Azure Machine Learning pipelines is more sophisticated than simple timestamps, though. Every step may run in a different hardware and software environment. Data preparation might be a time-consuming process that doesn't need to run on hardware with powerful GPUs; certain steps might require OS-specific software; you might want to use distributed training; and so forth.

Azure Machine Learning automatically orchestrates all of the dependencies between pipeline steps. This orchestration might include spinning up and down Docker images, attaching and detaching compute resources, and moving data between the steps in a consistent and automatic manner.

Coordinating the steps involved

When you create and run a Pipeline object, the following high-level steps occur:

  • For each step, the service calculates requirements for:
    • Hardware compute resources
    • OS resources (Docker image(s))
    • Software resources (Conda / virtualenv dependencies)
    • Data inputs
  • The service determines the dependencies between steps, resulting in a dynamic execution graph
  • When each node in the execution graph runs:
    • The service configures the necessary hardware and software environment (perhaps reusing existing resources)
    • The step runs, providing logging and monitoring information to its containing Experiment object
    • When the step completes, its outputs are prepared as inputs to the next step and/or written to storage
    • Resources that are no longer needed are finalized and detached
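The execution graph above is, at its core, a topological ordering of steps by their data dependencies. As a minimal sketch (the step names are made up, and the real service builds its graph from declared inputs and outputs), Python's standard-library graph sorter shows the idea:

```python
from graphlib import TopologicalSorter

# Each step maps to the set of steps whose outputs it consumes.
dependencies = {
    "prep": set(),
    "train": {"prep"},
    "evaluate": {"train"},
    "register": {"evaluate"},
}

# static_order yields a linear order that respects every dependency.
order = list(TopologicalSorter(dependencies).static_order())
```

Steps with no path between them in the graph have no ordering constraint, which is what lets the service run them concurrently on different compute targets.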


Building pipelines with the Python SDK

In the Azure Machine Learning Python SDK, a pipeline is a Python object defined in the azureml.pipeline.core module. A Pipeline object contains an ordered sequence of one or more PipelineStep objects. The PipelineStep class is abstract; the actual steps are instances of subclasses such as EstimatorStep, PythonScriptStep, or DataTransferStep. The ModuleStep class holds a reusable sequence of steps that can be shared among pipelines. A Pipeline runs as part of an Experiment.

An Azure Machine Learning pipeline is associated with an Azure Machine Learning workspace, and a pipeline step is associated with a compute target available within that workspace. For more information, see Create and manage Azure Machine Learning workspaces in the Azure portal or What are compute targets in Azure Machine Learning?

A simple Python Pipeline

This snippet shows the objects and calls needed to create and run a Pipeline:

from azureml.core import Workspace, Datastore, Experiment, Dataset
from azureml.data import OutputFileDatasetConfig
from azureml.data.datapath import DataPath
from azureml.pipeline.core import Pipeline
from azureml.pipeline.steps import PythonScriptStep

ws = Workspace.from_config()
blob_store = Datastore(ws, "workspaceblobstore")
compute_target = ws.compute_targets["STANDARD_NC6"]
experiment = Experiment(ws, 'MyExperiment')

input_data = Dataset.File.from_files(
    DataPath(blob_store, '20newsgroups/20news.pkl'))
prepped_data_path = OutputFileDatasetConfig(name="output_path")

# The script names below are illustrative; point them at your own scripts
dataprep_step = PythonScriptStep(
    name="prep_data",
    script_name="dataprep.py",
    compute_target=compute_target,
    arguments=["--prepped_data_path", prepped_data_path],
    inputs=[input_data.as_named_input('raw_data').as_mount()]
    )

prepped_data = prepped_data_path.read_delimited_files()

train_step = PythonScriptStep(
    name="train",
    script_name="train.py",
    compute_target=compute_target,
    arguments=["--prepped_data", prepped_data]
    )

steps = [dataprep_step, train_step]

pipeline = Pipeline(workspace=ws, steps=steps)

pipeline_run = experiment.submit(pipeline)
pipeline_run.wait_for_completion()

The snippet starts with common Azure Machine Learning objects: a Workspace, a Datastore, a ComputeTarget, and an Experiment. Then, the code creates the objects that hold the input and output data. The input_data is an instance of FileDataset, and prepped_data_path is an instance of OutputFileDatasetConfig. For OutputFileDatasetConfig, the default behavior is to copy the output to the workspaceblobstore datastore under the path /dataset/{run-id}/{output-name}, where run-id is the Run's ID and output-name is an autogenerated value if not specified by the developer. The data preparation code (not shown) writes delimited files to prepped_data_path. These outputs from the data preparation step are passed as prepped_data to the training step.
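The default destination pattern described above can be spelled out with a tiny helper. This function is purely illustrative (it is not part of the SDK); it just shows how the run ID and output name slot into the documented path template:

```python
def default_output_path(run_id: str, output_name: str) -> str:
    """Illustrative helper: the default OutputFileDatasetConfig destination
    pattern /dataset/{run-id}/{output-name} described in the text."""
    return f"/dataset/{run_id}/{output_name}"

# For a run with ID "a1b2c3" and the output named "output_path" in the snippet:
path = default_output_path("a1b2c3", "output_path")
```

Because the run ID is part of the path, each pipeline run writes its outputs to a distinct location in the workspaceblobstore datastore.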

The array steps holds the two PythonScriptSteps, dataprep_step and train_step. Azure Machine Learning will analyze the data dependency of prepped_data and run dataprep_step before train_step.

Then, the code instantiates the Pipeline object itself, passing in the workspace and steps array. The call to experiment.submit(pipeline) begins the Azure ML pipeline run. The call to wait_for_completion() blocks until the pipeline is finished.

To learn more about connecting your pipeline to your data, see the articles Data access in Azure Machine Learning and Moving data into and between ML pipeline steps (Python).

Building pipelines with the designer

Developers who prefer a visual design surface can use the Azure Machine Learning designer to create pipelines. You can access this tool from the Designer selection on the homepage of your workspace. The designer allows you to drag and drop steps onto the design surface.

When you visually design pipelines, the inputs and outputs of a step are displayed visibly. You can drag and drop data connections, allowing you to quickly understand and modify the dataflow of your pipeline.

(Figure: Azure Machine Learning designer example)

Key advantages

The key advantages of using pipelines for your machine learning workflows are:

| Key advantage | Description |
| --- | --- |
| Unattended runs | Schedule steps to run in parallel or in sequence in a reliable and unattended manner. Data preparation and modeling can last days or weeks, and pipelines allow you to focus on other tasks while the process is running. |
| Heterogeneous compute | Use multiple pipelines that are reliably coordinated across heterogeneous and scalable compute resources and storage locations. Make efficient use of available compute resources by running individual pipeline steps on different compute targets, such as HDInsight, GPU Data Science VMs, and Databricks. |
| Reusability | Create pipeline templates for specific scenarios, such as retraining and batch-scoring. Trigger published pipelines from external systems via simple REST calls. |
| Tracking and versioning | Instead of manually tracking data and result paths as you iterate, use the pipelines SDK to explicitly name and version your data sources, inputs, and outputs. You can also manage scripts and data separately for increased productivity. |
| Modularity | Separating areas of concern and isolating changes allows software to evolve at a faster rate with higher quality. |
| Collaboration | Pipelines allow data scientists to collaborate across all areas of the machine learning design process while concurrently working on pipeline steps. |

Next steps

Azure Machine Learning pipelines are a powerful facility that begins delivering value in the early development stages. The value increases as the team and project grow. This article has explained how pipelines are specified with the Azure Machine Learning Python SDK and orchestrated on Azure. You've seen some simple source code and been introduced to a few of the PipelineStep classes that are available. You should have a sense of when to use Azure Machine Learning pipelines and how Azure runs them.