Create and run machine learning pipelines with Azure Machine Learning SDK

APPLIES TO: Basic edition, Enterprise edition (Upgrade to Enterprise edition)

In this article, you learn how to create, publish, run, and track a machine learning pipeline by using the Azure Machine Learning SDK. Use ML pipelines to create a workflow that stitches together various ML phases, and then publish that pipeline into your Azure Machine Learning workspace to access later or share with others. ML pipelines are ideal for batch scoring scenarios: they can use various computes, reuse steps instead of rerunning them, and share ML workflows with others.

While you can use a different kind of pipeline called an Azure Pipeline for CI/CD automation of ML tasks, that type of pipeline is never stored inside your workspace. Compare these different pipelines.

Each phase of an ML pipeline, such as data preparation and model training, can include one or more steps.

The ML pipelines you create are visible to the members of your Azure Machine Learning workspace.

ML pipelines use remote compute targets for computation and for storage of the intermediate and final data associated with that pipeline. They can read and write data to and from supported Azure Storage locations.

If you don't have an Azure subscription, create a trial account before you begin. Try the free or paid version of Azure Machine Learning.

Prerequisites

Start by attaching your workspace:

import azureml.core
from azureml.core import Workspace, Datastore

ws = Workspace.from_config()

Set up machine learning resources

Create the resources required to run an ML pipeline:

  • Set up a datastore used to access the data needed in the pipeline steps.

  • Configure a Dataset object to point to persistent data that lives in, or is accessible in, a datastore. Configure a PipelineData object for temporary data passed between pipeline steps.

  • Set up the compute targets on which your pipeline steps will run.

Set up a datastore

A datastore stores the data for the pipeline to access. Each workspace has a default datastore. You can register additional datastores.

When you create your workspace, Azure Files and Azure Blob storage are attached to the workspace. A default datastore is registered to connect to the Azure Blob storage. To learn more, see Deciding when to use Azure Files, Azure Blobs, or Azure Disks.

# Default datastore 
def_data_store = ws.get_default_datastore()

# Get the blob storage associated with the workspace
def_blob_store = Datastore(ws, "workspaceblobstore")

# Get file storage associated with the workspace
def_file_store = Datastore(ws, "workspacefilestore")
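
If you need a datastore beyond the defaults above, you can register one explicitly. The following is a minimal sketch; the container name, account name, and key are placeholders to replace with your own values:

from azureml.core import Datastore

# Register an additional Blob container as a datastore (names and key are placeholders)
extra_blob_store = Datastore.register_azure_blob_container(
    workspace=ws,
    datastore_name="extra_blob_store",
    container_name="my-container",
    account_name="my_storage_account",
    account_key="<my-account-key>")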

Upload data files or directories to the datastore for them to be accessible from your pipelines. This example uses the Blob storage as the datastore:

def_blob_store.upload_files(
    ["iris.csv"],
    target_path="train-dataset",
    overwrite=True)

A pipeline consists of one or more steps. A step is a unit run on a compute target. Steps might consume data sources and produce "intermediate" data. A step can create data such as a model, a directory with model and dependent files, or temporary data. This data is then available for other steps later in the pipeline.

To learn more about connecting your pipeline to your data, see the articles How to Access Data and How to Register Datasets.

Configure data using Dataset and PipelineData objects

You just created a data source that can be referenced in a pipeline as an input to a step. The preferred way to provide data to a pipeline is a Dataset object. The Dataset object points to data that lives in or is accessible from a datastore or at a Web URL. The Dataset class is abstract, so you will create an instance of either a FileDataset (referring to one or more files) or a TabularDataset (created from one or more files with delimited columns of data).

Dataset objects support versioning, diffs, and summary statistics. Datasets are lazily evaluated (like Python generators), and it's efficient to subset them by splitting or filtering.

You create a Dataset using methods like from_files or from_delimited_files.

from azureml.core import Dataset

iris_tabular_dataset = Dataset.Tabular.from_delimited_files([(def_blob_store, 'train-dataset/iris.csv')])
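
For comparison, the sketch below creates a FileDataset from the same uploaded file and splits the tabular dataset into subsets. It reuses def_blob_store and iris_tabular_dataset from above; the split ratio and seed are arbitrary.

# FileDataset referring to one or more files in the datastore
iris_file_dataset = Dataset.File.from_files(path=[(def_blob_store, 'train-dataset/iris.csv')])

# Datasets are lazily evaluated, so subsetting by splitting is inexpensive
train_ds, test_ds = iris_tabular_dataset.random_split(percentage=0.8, seed=223)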

Intermediate data (or output of a step) is represented by a PipelineData object. output_data1 is produced as the output of a step, and used as the input of one or more future steps. PipelineData introduces a data dependency between steps, and creates an implicit execution order in the pipeline. This object will be used later when creating pipeline steps.

from azureml.pipeline.core import PipelineData

output_data1 = PipelineData(
    "output_data1",
    datastore=def_blob_store,
    output_name="output_data1")

More details and sample code for working with datasets and pipeline data can be found in Moving data into and between ML pipeline steps (Python).

Set up a compute target

In Azure Machine Learning, the term compute (or compute target) refers to the machines or clusters that perform the computational steps in your machine learning pipeline. See compute targets for model training for a full list of compute targets and how to create and attach them to your workspace. The process for creating and attaching a compute target is the same whether you are training a model or running a pipeline step. After you create and attach your compute target, use the ComputeTarget object in your pipeline step.

Important

Performing management operations on compute targets is not supported from inside remote jobs. Since machine learning pipelines are submitted as a remote job, do not use management operations on compute targets from inside the pipeline.

Below are examples of creating and attaching compute targets for:

  • Azure Machine Learning Compute
  • Azure Data Lake Analytics

Azure Machine Learning compute

You can create an Azure Machine Learning compute for running your steps.

from azureml.core.compute import ComputeTarget, AmlCompute

compute_name = "aml-compute"
vm_size = "STANDARD_NC6"
if compute_name in ws.compute_targets:
    compute_target = ws.compute_targets[compute_name]
    if compute_target and type(compute_target) is AmlCompute:
        print('Found compute target: ' + compute_name)
else:
    print('Creating a new compute target...')
    provisioning_config = AmlCompute.provisioning_configuration(vm_size=vm_size,  # STANDARD_NC6 is GPU-enabled
                                                                min_nodes=0,
                                                                max_nodes=4)
    # create the compute target
    compute_target = ComputeTarget.create(
        ws, compute_name, provisioning_config)

    # Can poll for a minimum number of nodes and for a specific timeout.
    # If no min node count is provided it will use the scale settings for the cluster
    compute_target.wait_for_completion(
        show_output=True, min_node_count=None, timeout_in_minutes=20)

    # For a more detailed view of current cluster status, use the 'status' property
    print(compute_target.status.serialize())

Azure Data Lake Analytics

Azure Data Lake Analytics is a big data analytics platform in the Azure cloud. It can be used as a compute target with an Azure Machine Learning pipeline.

Create an Azure Data Lake Analytics account before using it. To create this resource, see the Get started with Azure Data Lake Analytics document.

To attach Data Lake Analytics as a compute target, you must use the Azure Machine Learning SDK and provide the following information:

  • Compute name: The name you want to assign to this compute resource.
  • Resource group: The resource group that contains the Data Lake Analytics account.
  • Account name: The Data Lake Analytics account name.

The following code demonstrates how to attach Data Lake Analytics as a compute target:

import os
from azureml.core.compute import ComputeTarget, AdlaCompute
from azureml.exceptions import ComputeTargetException


adla_compute_name = os.environ.get(
    "AML_ADLA_COMPUTE_NAME", "<adla_compute_name>")
adla_resource_group = os.environ.get(
    "AML_ADLA_RESOURCE_GROUP", "<adla_resource_group>")
adla_account_name = os.environ.get(
    "AML_ADLA_ACCOUNT_NAME", "<adla_account_name>")

try:
    adla_compute = ComputeTarget(workspace=ws, name=adla_compute_name)
    print('Compute target already exists')
except ComputeTargetException:
    print('compute not found')
    print('adla_compute_name {}'.format(adla_compute_name))
    print('adla_resource_group {}'.format(adla_resource_group))
    print('adla_account_name {}'.format(adla_account_name))
    # create attach config
    attach_config = AdlaCompute.attach_configuration(resource_group=adla_resource_group,
                                                     account_name=adla_account_name)
    # Attach ADLA
    adla_compute = ComputeTarget.attach(
        ws,
        adla_compute_name,
        attach_config
    )

    adla_compute.wait_for_completion(True)

For a more detailed example, see an example notebook on GitHub.

Tip

Azure Machine Learning pipelines can only work with data stored in the default data store of the Data Lake Analytics account. If the data you need to work with is in a non-default store, you can use a DataTransferStep to copy the data before training.
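
A minimal sketch of such a copy step follows. It assumes an Azure Data Factory compute target named adf_compute has already been attached to the workspace and that adls_datastore is a registered Data Lake datastore; the names and paths are illustrative only.

from azureml.data.data_reference import DataReference
from azureml.pipeline.steps import DataTransferStep

# Source: data in a non-default store; destination: the Data Lake store used by ADLA (illustrative)
source_ref = DataReference(datastore=def_blob_store,
                           data_reference_name="raw_input",
                           path_on_datastore="train-dataset/iris.csv")
dest_ref = DataReference(datastore=adls_datastore,
                         data_reference_name="adla_input",
                         path_on_datastore="train-dataset/iris.csv")

copy_step = DataTransferStep(name="copy_to_adla",
                             source_data_reference=source_ref,
                             destination_data_reference=dest_ref,
                             compute_target=adf_compute)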

Construct your pipeline steps

Once you create and attach a compute target to your workspace, you are ready to define a pipeline step. There are many built-in steps available via the Azure Machine Learning SDK. The most basic of these is PythonScriptStep, which runs a Python script on a specified compute target:

from azureml.pipeline.steps import PythonScriptStep

ds_input = my_dataset.as_named_input('input1')

trainStep = PythonScriptStep(
    script_name="train.py",
    arguments=["--input", ds_input.as_download(), "--output", output_data1],
    inputs=[ds_input],
    outputs=[output_data1],
    compute_target=compute_target,
    source_directory=project_folder,
    allow_reuse=True
)
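
Inside train.py, the --input and --output arguments resolve to paths on the compute: the downloaded dataset and the location where the PipelineData output should be written. A minimal sketch of that script, assuming argparse and an illustrative output file name:

# train.py (sketch)
import argparse
import os

parser = argparse.ArgumentParser()
parser.add_argument("--input", type=str, help="path to the downloaded input data")
parser.add_argument("--output", type=str, help="path where the PipelineData output is written")
args = parser.parse_args()

# Create the output location and write a result for downstream steps
os.makedirs(args.output, exist_ok=True)
with open(os.path.join(args.output, "model.txt"), "w") as f:
    f.write("trained model placeholder")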

Reuse of previous results (allow_reuse) is key when using pipelines in a collaborative environment, since eliminating unnecessary reruns offers agility. Reuse is the default behavior when the script_name, inputs, and the parameters of a step remain the same. When the output of the step is reused, the job is not submitted to the compute; instead, the results from the previous run are immediately available to the next step's run. If allow_reuse is set to False, a new run will always be generated for this step during pipeline execution.

After you define your steps, you build the pipeline by using some or all of those steps.

Note

No file or data is uploaded to Azure Machine Learning when you define the steps or build the pipeline.

# list of steps to run
compareModels = [trainStep, extractStep, compareStep]

from azureml.pipeline.core import Pipeline

# Build the pipeline
pipeline1 = Pipeline(workspace=ws, steps=[compareModels])
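
Before submitting, you can optionally check the assembled graph for problems such as disconnected inputs; validate() returns a list of any issues it finds.

# Optional: validate the pipeline graph before submission
errors = pipeline1.validate()
print("Validation errors:", errors)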

The following example uses an Azure Databricks compute target; it assumes a databricks_compute target has already been attached to the workspace:

from azureml.pipeline.steps import DatabricksStep

dbStep = DatabricksStep(
    name="databricksmodule",
    inputs=[step_1_input],
    outputs=[step_1_output],
    num_workers=1,
    notebook_path=notebook_path,
    notebook_params={'myparam': 'testparam'},
    run_name='demo run name',
    compute_target=databricks_compute,
    allow_reuse=False
)
# List of steps to run
steps = [dbStep]

# Build the pipeline
pipeline1 = Pipeline(workspace=ws, steps=steps)

Use a dataset

Datasets created from Azure Blob storage, Azure Files, Azure Data Lake Storage Gen1, Azure Data Lake Storage Gen2, Azure SQL Database, and Azure Database for PostgreSQL can be used as input to any pipeline step. You can write output to a DataTransferStep or DatabricksStep, or, if you want to write data to a specific datastore, use PipelineData.

Important

Writing output data back to a datastore using PipelineData is only supported for Azure Blob and Azure File share datastores. This functionality is not supported for ADLS Gen 2 datastores at this time.

dataset_consuming_step = PythonScriptStep(
    script_name="iris_train.py",
    inputs=[iris_tabular_dataset.as_named_input("iris_data")],
    compute_target=compute_target,
    source_directory=project_folder
)

You then retrieve the dataset in your pipeline by using the Run.input_datasets dictionary.

# iris_train.py
from azureml.core import Run, Dataset

run_context = Run.get_context()
iris_dataset = run_context.input_datasets['iris_data']
dataframe = iris_dataset.to_pandas_dataframe()

The line Run.get_context() is worth highlighting. This function retrieves a Run representing the current experimental run. In the above sample, we use it to retrieve a registered dataset. Another common use of the Run object is to retrieve both the experiment itself and the workspace in which the experiment resides:

# Within a PythonScriptStep

ws = Run.get_context().experiment.workspace

For more detail, including alternate ways to pass and access data, see Moving data into and between ML pipeline steps (Python).

Submit the pipeline

When you submit the pipeline, Azure Machine Learning checks the dependencies for each step and uploads a snapshot of the source directory you specified. If no source directory is specified, the current local directory is uploaded. The snapshot is also stored as part of the experiment in your workspace.

Important

To prevent unnecessary files from being included in the snapshot, make an ignore file (.gitignore or .amlignore) in the directory. Add the files and directories to exclude to this file. For more information on the syntax to use inside this file, see syntax and patterns for .gitignore. The .amlignore file uses the same syntax. If both files exist, the .amlignore file takes precedence.
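
For example, an .amlignore that keeps notebooks, checkpoints, and local outputs out of the snapshot might look like the following (entries are illustrative):

# .amlignore (illustrative)
.ipynb_checkpoints/
outputs/
*.ipynb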

For more information, see Snapshots.

from azureml.core import Experiment

# Submit the pipeline to be run
pipeline_run1 = Experiment(ws, 'Compare_Models_Exp').submit(pipeline1)
pipeline_run1.wait_for_completion()

When you first run a pipeline, Azure Machine Learning:

  • Downloads the project snapshot to the compute target from the Blob storage associated with the workspace.
  • Builds a Docker image corresponding to each step in the pipeline.
  • Downloads the Docker image for each step to the compute target from the container registry.
  • Configures access to Dataset and PipelineData objects. For the as_mount() access mode, FUSE is used to provide virtual access. If mount is not supported, or if the user specified access as as_download(), the data is instead copied to the compute target (see the sketch after this list).
  • Runs the step in the compute target specified in the step definition.
  • Creates artifacts, such as logs, stdout and stderr, metrics, and output specified by the step. These artifacts are then uploaded and kept in the user's default datastore.
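
A minimal sketch of choosing an access mode on a dataset input, reusing the my_dataset object assumed in the step definition above:

ds_input = my_dataset.as_named_input('input1')

mounted = ds_input.as_mount()        # FUSE mount where supported
downloaded = ds_input.as_download()  # data copied to the compute target instead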

(Diagram: running an experiment as a pipeline)

For more information, see the Experiment class reference.

View results of a pipeline

See the list of all your pipelines and their run details in the studio:

  1. Sign in to Azure Machine Learning studio.

  2. View your workspace.

  3. On the left, select Pipelines to see all your pipeline runs.

  4. Select a specific pipeline to see the run results.

Git tracking and integration

When you start a training run where the source directory is a local Git repository, information about the repository is stored in the run history. For more information, see Git integration for Azure Machine Learning.

Publish a pipeline

You can publish a pipeline to run it with different inputs later. For the REST endpoint of an already published pipeline to accept parameters, you must parameterize the pipeline before publishing.

  1. To create a pipeline parameter, use a PipelineParameter object with a default value.

    from azureml.pipeline.core.graph import PipelineParameter
    
    pipeline_param = PipelineParameter(
      name="pipeline_arg",
      default_value=10)
    
  2. Add this PipelineParameter object as a parameter to any of the steps in the pipeline as follows:

    compareStep = PythonScriptStep(
      script_name="compare.py",
      arguments=["--comp_data1", comp_data1, "--comp_data2", comp_data2, "--output_data", out_data3, "--param1", pipeline_param],
      inputs=[ comp_data1, comp_data2],
      outputs=[out_data3],
      compute_target=compute_target,
      source_directory=project_folder)
    
  3. Publish this pipeline; it will accept a parameter when invoked.

    published_pipeline1 = pipeline_run1.publish_pipeline(
         name="My_Published_Pipeline",
         description="My Published Pipeline Description",
         version="1.0")
    

Run a published pipeline

All published pipelines have a REST endpoint. With this endpoint, you can invoke a run of the pipeline from external systems, such as non-Python clients. This endpoint enables "managed repeatability" in batch scoring and retraining scenarios.

To invoke a run of the preceding pipeline, you need an Azure Active Directory authentication header token, as described in the AzureCliAuthentication class reference, or get more details in the Authentication in Azure Machine Learning notebook.
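
A minimal sketch of obtaining such a header with the SDK's interactive authentication; the resulting dictionary is what the requests call below passes as headers:

from azureml.core.authentication import InteractiveLoginAuthentication

interactive_auth = InteractiveLoginAuthentication()
aad_token = interactive_auth.get_authentication_header()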

from azureml.pipeline.core import PublishedPipeline
import requests

response = requests.post(published_pipeline1.endpoint,
                         headers=aad_token,
                         json={"ExperimentName": "My_Pipeline",
                               "ParameterAssignments": {"pipeline_arg": 20}})

Create a versioned pipeline endpoint

You can create a Pipeline Endpoint with multiple published pipelines behind it. It can be used like a published pipeline, but gives you a fixed REST endpoint as you iterate on and update your ML pipelines.

from azureml.pipeline.core import PipelineEndpoint

# PublishedPipeline.get looks up a pipeline by ID; reuse the pipeline published above
published_pipeline = PublishedPipeline.get(workspace=ws, id=published_pipeline1.id)
pipeline_endpoint = PipelineEndpoint.publish(workspace=ws, name="PipelineEndpointTest",
                                             pipeline=published_pipeline, description="Test description Notebook")

Submit a job to a pipeline endpoint

You can submit a job to the default version of a pipeline endpoint:

pipeline_endpoint_by_name = PipelineEndpoint.get(workspace=ws, name="PipelineEndpointTest")
run_id = pipeline_endpoint_by_name.submit("PipelineEndpointExperiment")
print(run_id)

You can also submit a job to a specific version:

run_id = pipeline_endpoint_by_name.submit("PipelineEndpointExperiment", pipeline_version="0")
print(run_id)

The same can be accomplished using the REST API:

rest_endpoint = pipeline_endpoint_by_name.endpoint
response = requests.post(rest_endpoint, 
                         headers=aad_token, 
                         json={"ExperimentName": "PipelineEndpointExperiment",
                               "RunSource": "API",
                               "ParameterAssignments": {"1": "united", "2":"city"}})

Use published pipelines in the studio

You can also run a published pipeline from the studio:

  1. Sign in to Azure Machine Learning studio.

  2. View your workspace.

  3. On the left, select Endpoints.

  4. On the top, select Pipeline endpoints.

  5. Select a specific pipeline to run, consume, or review results of previous runs of the pipeline endpoint.

Disable a published pipeline

To hide a pipeline from your list of published pipelines, you disable it, either in the studio or from the SDK:

# Get the pipeline by using its ID from Azure Machine Learning studio
p = PublishedPipeline.get(ws, id="068f4885-7088-424b-8ce2-eeb9ba5381a6")
p.disable()

You can enable it again with p.enable(). For more information, see the PublishedPipeline class reference.

Caching & reuse

To optimize and customize the behavior of your pipelines, you can do a few things around caching and reuse. For example, you can choose to:

  • Turn off the default reuse of the step run output by setting allow_reuse=False during step definition. Reuse is key when using pipelines in a collaborative environment, since eliminating unnecessary runs offers agility. However, you can opt out of reuse.
  • Force output regeneration for all steps in a run with pipeline_run = exp.submit(pipeline, regenerate_outputs=True).

By default, allow_reuse for steps is enabled, and the source_directory specified in the step definition is hashed. So, if the script for a given step remains the same (script_name, inputs, and the parameters), and nothing else in the source_directory has changed, the output of a previous step run is reused, the job is not submitted to the compute, and the results from the previous run are immediately available to the next step instead.

step = PythonScriptStep(name="Hello World",
                        script_name="hello_world.py",
                        compute_target=aml_compute,
                        source_directory=source_directory,
                        allow_reuse=False,
                        hash_paths=['hello_world.ipynb'])

Next steps

Learn how to run notebooks by following the article Use Jupyter notebooks to explore this service.