Tutorial: Build an Azure Machine Learning pipeline for image classification

APPLIES TO: Azure Machine Learning SDK v1 for Python

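Important

This article provides information about using the Azure Machine Learning SDK v1. SDK v1 is deprecated as of March 31, 2025. Support for it ends on June 30, 2026. You can install and use SDK v1 until that date. Existing workflows that use SDK v1 continue to operate after the end-of-support date. However, they could be exposed to security risks or breaking changes if the product's architecture changes.

We recommend that you transition to SDK v2 before June 30, 2026. For more information about SDK v2, see What is Azure Machine Learning CLI and Python SDK v2? and the SDK v2 reference.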

Note

For a tutorial that uses SDK v2 to build a pipeline, see Tutorial: Use ML pipelines for production ML workflows with Python SDK v2 in a Jupyter Notebook.

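In this tutorial, you learn how to build an Azure Machine Learning pipeline to prepare data and train a machine learning model. Machine learning pipelines optimize your workflow with speed, portability, and reuse, so you can focus on machine learning instead of infrastructure and automation.

The example trains a small Keras convolutional neural network to classify images in the Fashion MNIST dataset.

In this tutorial, you complete the following tasks: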

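Configure the workspace
Create an Experiment to hold your work
Provision a ComputeTarget to do the work
Create a Dataset in which to store compressed data
Create a pipeline step to prepare the data for training
Define a runtime Environment in which to perform training
Create a pipeline step to define the neural network and perform the training
Compose a Pipeline from the pipeline steps
Run the pipeline in the experiment
Review the output of the steps and the trained neural network
Register the model for further use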

If you don't have an Azure subscription, create a trial subscription before you begin.

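Prerequisites

If you don't already have an Azure Machine Learning workspace, complete Create resources to get started.
A Python environment in which you've installed both the azureml-core and azureml-pipeline packages. Use this environment to define and control your Azure Machine Learning resources. It's separate from the environment used at runtime for training.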

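Important

The SDK v1 packages (azureml-core and azureml-pipeline) require Python 3.8-3.10. Python 3.10 is recommended because it's still within security support. If you have difficulty installing the packages, make sure that python --version reports a compatible release. For instructions, consult the documentation of your Python virtual environment manager (venv, conda, and so on).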
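As a quick sanity check before installing the packages, the version range stated above can be tested from inside the interpreter. This is only a sketch: the helper name is_supported_python and the two constants are illustrative and not part of the SDK.

```python
import sys

# Supported range taken from the note above: Python 3.8 through 3.10.
MIN_SUPPORTED = (3, 8)
MAX_SUPPORTED = (3, 10)


def is_supported_python(version_info=sys.version_info):
    """Return True if the major.minor version lies within the supported range."""
    major_minor = tuple(version_info[:2])
    return MIN_SUPPORTED <= major_minor <= MAX_SUPPORTED


if __name__ == "__main__":
    current = ".".join(str(part) for part in sys.version_info[:3])
    if is_supported_python():
        print(f"Python {current} is in the supported range.")
    else:
        print(f"Python {current} is outside the 3.8-3.10 range; use another environment.")
```

Only the major and minor components are compared, so any patch release of 3.8, 3.9, or 3.10 passes.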

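Start an interactive Python session

This tutorial uses the Python SDK for Azure Machine Learning to create and control an Azure Machine Learning pipeline. The tutorial assumes that you run the code snippets interactively in either a Python REPL environment or a Jupyter notebook.

This tutorial is based on the image-classification.ipynb notebook found in the v1/python-sdk/tutorials/using-pipelines directory of the v1-archive branch of the Azure Machine Learning examples repository. The source code for the steps themselves is in the keras-mnist-fashion subdirectory.

Import types

Import all the Azure Machine Learning types that you need for this tutorial: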

import os
import azureml.core
from azureml.core import (
    Workspace,
    Experiment,
    Dataset,
    Datastore,
    ComputeTarget,
    Environment,
    ScriptRunConfig
)
from azureml.data import OutputFileDatasetConfig
from azureml.core.compute import AmlCompute
from azureml.core.compute_target import ComputeTargetException
from azureml.pipeline.steps import PythonScriptStep
from azureml.pipeline.core import Pipeline

# check core SDK version number
print("Azure Machine Learning SDK Version: ", azureml.core.VERSION)

The Azure Machine Learning SDK version should be 1.61 or the latest available version. If it isn't, upgrade with pip install --upgrade azureml-core.
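If you prefer to check the version in code rather than by eye, a small comparison helper is enough. The sketch below assumes the version string is dot-separated with leading numeric components. The names parse_version and meets_minimum are hypothetical helpers, not SDK functions.

```python
import re


def parse_version(text):
    """Extract the leading numeric components, e.g. '1.61.0.post1' -> (1, 61, 0)."""
    parts = []
    for piece in text.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break  # stop at the first non-numeric component such as 'post1'
        parts.append(int(match.group()))
    return tuple(parts)


def meets_minimum(installed, minimum="1.61"):
    """Return True if the installed version is at least the minimum version."""
    return parse_version(installed) >= parse_version(minimum)
```

In the interactive session you could then call meets_minimum(azureml.core.VERSION) after the imports above.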

Configure the workspace

Create a workspace object from the existing Azure Machine Learning workspace.

workspace = Workspace.from_config()

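Important

This code snippet expects the workspace configuration to be saved in the current directory or its parent. For more information on creating a workspace, see Create workspace resources. For more information on saving the configuration to a file, see Create a workspace configuration file.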
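The configuration file itself is a small JSON document. As a rough sketch, assuming the file is named config.json and holds the three identifying fields subscription_id, resource_group, and workspace_name, a helper such as the following could write one. The function name write_workspace_config and every value passed to it are placeholders for illustration.

```python
import json
import tempfile
from pathlib import Path


def write_workspace_config(directory, subscription_id, resource_group, workspace_name):
    """Write a config.json containing the fields that identify a workspace."""
    config = {
        "subscription_id": subscription_id,
        "resource_group": resource_group,
        "workspace_name": workspace_name,
    }
    path = Path(directory) / "config.json"
    path.write_text(json.dumps(config, indent=4), encoding="utf-8")
    return path


if __name__ == "__main__":
    # Placeholder values only; substitute the details of your own workspace.
    with tempfile.TemporaryDirectory() as folder:
        created = write_workspace_config(
            folder, "<subscription-id>", "<resource-group>", "<workspace-name>"
        )
        print(created.read_text(encoding="utf-8"))
```

If you already hold a Workspace object from another session, its write_config() method serves the same purpose without hand-editing JSON.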

Create the infrastructure for your pipeline

Create an Experiment object to hold the results of your pipeline runs:

exp = Experiment(workspace=workspace, name="keras-mnist-fashion")

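Create a ComputeTarget that represents the machine resource on which your pipeline runs. The simple neural network used in this tutorial trains in just a few minutes, even on a CPU-based machine. If you want to use a GPU for training, set use_gpu to True. Provisioning a compute target generally takes about five minutes.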

use_gpu = False

# choose a name for your cluster
cluster_name = "gpu-cluster" if use_gpu else "cpu-cluster"

found = False
# Check if this compute target already exists in the workspace.
cts = workspace.compute_targets
if cluster_name in cts and cts[cluster_name].type == "AmlCompute":
    found = True
    print("Found existing compute target.")
    compute_target = cts[cluster_name]
if not found:
    print("Creating a new compute target...")
    compute_config = AmlCompute.provisioning_configuration(
        vm_size="STANDARD_NC4AS_T4_V3" if use_gpu else "STANDARD_D2_V2",
        # vm_priority = 'lowpriority', # optional
        max_nodes=4,
    )

    # Create the cluster.
    compute_target = ComputeTarget.create(workspace, cluster_name, compute_config)

    # Can poll for a minimum number of nodes and for a specific timeout.
    # If no min_node_count is provided, it will use the scale settings for the cluster.
    compute_target.wait_for_completion(
        show_output=True, min_node_count=None, timeout_in_minutes=10
    )
# For a more detailed view of current AmlCompute status, use get_status().
print(compute_target.get_status().serialize())

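Note

GPU availability depends on the quota of your Azure subscription and on Azure capacity. See Manage and increase quotas for resources with Azure Machine Learning.

Create a dataset for the Azure-stored data

Fashion-MNIST is a dataset of fashion images divided into 10 classes. Each image is a 28x28 grayscale image, and there are 60,000 training images and 10,000 test images. As an image classification problem, Fashion-MNIST is harder than the classic MNIST handwritten digit database. It's distributed in the same compressed binary form as the original handwritten digit database.

To create a Dataset that references the web-based data, run: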

data_urls = ["https://data4mldemo6150520719.blob.core.chinacloudapi.cn/demo/mnist-fashion"]
fashion_ds = Dataset.File.from_files(data_urls)

# list the files referenced by fashion_ds
print(fashion_ds.to_path())

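This code completes quickly. The underlying data remains in the Azure storage resource specified in the data_urls array.

Create the data-preparation pipeline step

The first step in this pipeline converts the compressed data files of fashion_ds into a dataset in your own workspace consisting of CSV files ready for use in training. Once the dataset is registered with the workspace, your collaborators can access this data for their own analysis, training, and so on.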

datastore = workspace.get_default_datastore()
prepared_fashion_ds = OutputFileDatasetConfig(
    destination=(datastore, "outputdataset/{run-id}")
).register_on_complete(name="prepared_fashion_ds")

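The previous code specifies a dataset that's based on the output of a pipeline step. The underlying processed files are put in the blob storage of the workspace's default datastore, at the path specified in destination. The dataset is registered in the workspace with the name prepared_fashion_ds.

Create the pipeline step's source

The code that you've executed so far creates and controls Azure resources. Now it's time to write code that does the first step in the domain.

If you're following along with the example in the Azure Machine Learning examples repository, the source file is already available as keras-mnist-fashion/prepare.py.

If you're working from scratch, create a subdirectory called keras-mnist-fashion/. Create a new file, add the following code to it, and name the file prepare.py.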

# prepare.py
# Converts MNIST-formatted files at the passed-in input path to a passed-in output path
import os
import sys

# Conversion routine for MNIST binary format
def convert(imgf, labelf, outf, n):
    f = open(imgf, "rb")
    l = open(labelf, "rb")
    o = open(outf, "w")

    f.read(16)
    l.read(8)
    images = []

    for i in range(n):
        image = [ord(l.read(1))]
        for j in range(28 * 28):
            image.append(ord(f.read(1)))
        images.append(image)

    for image in images:
        o.write(",".join(str(pix) for pix in image) + "\n")
    f.close()
    o.close()
    l.close()

# The MNIST-formatted source
mounted_input_path = sys.argv[1]
# The output directory at which the outputs will be written
mounted_output_path = sys.argv[2]

# Create the output directory
os.makedirs(mounted_output_path, exist_ok=True)

# Convert the training data
convert(
    os.path.join(mounted_input_path, "mnist-fashion/train-images-idx3-ubyte"),
    os.path.join(mounted_input_path, "mnist-fashion/train-labels-idx1-ubyte"),
    os.path.join(mounted_output_path, "mnist_train.csv"),
    60000,
)

# Convert the test data
convert(
    os.path.join(mounted_input_path, "mnist-fashion/t10k-images-idx3-ubyte"),
    os.path.join(mounted_input_path, "mnist-fashion/t10k-labels-idx1-ubyte"),
    os.path.join(mounted_output_path, "mnist_test.csv"),
    10000,
)

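The code in prepare.py takes two command-line arguments: the first is assigned to mounted_input_path and the second to mounted_output_path. If the output directory doesn't exist, the call to os.makedirs creates it. Then the program converts the training and testing data and writes the comma-separated files to mounted_output_path.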
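The f.read(16) and l.read(8) calls in convert skip the headers of the MNIST IDX binary format. The sketch below decodes those headers instead of skipping them, assuming the standard IDX layout: big-endian 32-bit integers, magic number 2051 for image files and 2049 for label files. The helper names are illustrative and don't appear in prepare.py.

```python
import struct

IMAGE_MAGIC = 2051  # identifies an IDX file of unsigned bytes with 3 dimensions
LABEL_MAGIC = 2049  # identifies an IDX file of unsigned bytes with 1 dimension


def read_image_header(raw):
    """Decode the 16-byte image header: magic, image count, rows, columns."""
    magic, count, rows, cols = struct.unpack(">IIII", raw[:16])
    if magic != IMAGE_MAGIC:
        raise ValueError(f"unexpected image magic number: {magic}")
    return count, rows, cols


def read_label_header(raw):
    """Decode the 8-byte label header: magic, label count."""
    magic, count = struct.unpack(">II", raw[:8])
    if magic != LABEL_MAGIC:
        raise ValueError(f"unexpected label magic number: {magic}")
    return count


def csv_row(label, pixels):
    """Format one CSV line the way convert() does: label first, then pixels."""
    return ",".join(str(value) for value in [label, *pixels])


if __name__ == "__main__":
    # Synthetic header and a two-pixel "image", only to show the layout.
    header = struct.pack(">IIII", IMAGE_MAGIC, 1, 1, 2)
    print(read_image_header(header))
    print(csv_row(9, bytes([0, 255])))
```

Each row of the resulting CSV files therefore holds 785 values for Fashion-MNIST: one label followed by 28 x 28 = 784 pixel intensities.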

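Specify the pipeline step

Back in the Python environment that you're using to specify the pipeline, run this code to create a PythonScriptStep for your preparation code: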

script_folder = "./keras-mnist-fashion"

prep_step = PythonScriptStep(
    name="prepare step",
    script_name="prepare.py",
    # On the compute target, mount fashion_ds dataset as input, prepared_fashion_ds as output
    arguments=[fashion_ds.as_named_input("fashion_ds").as_mount(), prepared_fashion_ds],
    source_directory=script_folder,
    compute_target=compute_target,
    allow_reuse=True,
)

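The call to PythonScriptStep specifies that, when the pipeline step is run:

All the files in the script_folder directory are uploaded to the compute_target
Among those uploaded source files, the file prepare.py is run
The fashion_ds and prepared_fashion_ds datasets are mounted on the compute_target and appear as directories
The path to the fashion_ds files is the first argument to prepare.py. In prepare.py, this argument is assigned to mounted_input_path
The path to prepared_fashion_ds is the second argument to prepare.py. In prepare.py, this argument is assigned to mounted_output_path
Because allow_reuse is True, the step isn't rerun until its source files or inputs change
This PythonScriptStep is named prepare step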
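Because the two mounted paths arrive as plain positional command-line arguments, a step script could also read them with argparse instead of indexing sys.argv directly. This is an alternative sketch, not the code used in prepare.py, and the argument names input_path and output_path are chosen here for illustration.

```python
import argparse


def parse_args(argv=None):
    """Parse the two positional paths that the pipeline step passes to the script."""
    parser = argparse.ArgumentParser(description="Convert MNIST-formatted files to CSV")
    parser.add_argument("input_path", help="directory where the input dataset is mounted")
    parser.add_argument("output_path", help="directory that backs the output dataset")
    return parser.parse_args(argv)


if __name__ == "__main__":
    # Example paths only; on the compute target the real mount points are passed in.
    args = parse_args(["/mnt/fashion_ds", "/mnt/prepared_fashion_ds"])
    print(args.input_path, args.output_path)
```

The benefit over sys.argv[1] and sys.argv[2] is a clear error message when an argument is missing.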

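Modularity and reuse are key benefits of pipelines. Azure Machine Learning can automatically detect changes to source code or datasets. If allow_reuse is True, the pipeline reuses the output of a step that isn't affected, without rerunning the step. If a step relies on a data source external to Azure Machine Learning that might change (for instance, a URL that contains sales data), set allow_reuse to False so that the pipeline step runs every time the pipeline is run.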
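To make the idea of step reuse concrete, here is a toy model of it. It is not how Azure Machine Learning implements reuse internally. It only assumes that a step can be identified by a fingerprint of its source files, inputs, and arguments, and that a cached output is returned when the fingerprint is unchanged. All names (step_fingerprint, ReuseCache) are hypothetical.

```python
import hashlib


def step_fingerprint(source_files, input_ids, arguments):
    """Hash everything that can change a step's result into one hex digest.

    source_files maps file names to their contents as bytes.
    """
    digest = hashlib.sha256()
    for name in sorted(source_files):
        digest.update(name.encode("utf-8") + b"\0")
        digest.update(source_files[name] + b"\0")
    for item in list(input_ids) + list(arguments):
        digest.update(str(item).encode("utf-8") + b"\0")
    return digest.hexdigest()


class ReuseCache:
    """Remember step outputs by fingerprint and count real executions."""

    def __init__(self):
        self._outputs = {}
        self.executions = 0

    def run(self, fingerprint, compute, allow_reuse=True):
        if allow_reuse and fingerprint in self._outputs:
            return self._outputs[fingerprint]  # unchanged step: reuse the output
        self.executions += 1
        result = compute()
        self._outputs[fingerprint] = result
        return result
```

In this model, editing prepare.py changes the fingerprint and forces a rerun, while passing allow_reuse=False reruns the step regardless of the fingerprint.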

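Create the training step

Once the data is converted from the compressed format to CSV files, you can use it to train a convolutional neural network.

Create the training step's source

With larger pipelines, put each step's source code in a separate directory, such as src/prepare/ or src/train/. For this tutorial, use or create the file train.py in the keras-mnist-fashion/ source directory.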

import keras
from keras.models import Sequential
from keras.layers import Dense, Dropout, Flatten
from keras.layers import Conv2D, MaxPooling2D
from keras.layers import BatchNormalization
from keras.utils import to_categorical
from keras.callbacks import Callback

import numpy as np
import pandas as pd
import os
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from azureml.core import Run

# dataset object from the run
run = Run.get_context()
dataset = run.input_datasets["prepared_fashion_ds"]

# split dataset into train and test set
(train_dataset, test_dataset) = dataset.random_split(percentage=0.8, seed=111)

# load dataset into pandas dataframe
data_train = train_dataset.to_pandas_dataframe()
data_test = test_dataset.to_pandas_dataframe()

img_rows, img_cols = 28, 28
input_shape = (img_rows, img_cols, 1)

X = np.array(data_train.iloc[:, 1:])
y = to_categorical(np.array(data_train.iloc[:, 0]))

# split off validation data to optimize the classifier during training
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=13)

# test data
X_test = np.array(data_test.iloc[:, 1:])
y_test = to_categorical(np.array(data_test.iloc[:, 0]))


X_train = (
    X_train.reshape(X_train.shape[0], img_rows, img_cols, 1).astype("float32") / 255
)
X_test = X_test.reshape(X_test.shape[0], img_rows, img_cols, 1).astype("float32") / 255
X_val = X_val.reshape(X_val.shape[0], img_rows, img_cols, 1).astype("float32") / 255

batch_size = 256
num_classes = 10
epochs = 10

# construct the neural network
model = Sequential()
model.add(
    Conv2D(
        32,
        kernel_size=(3, 3),
        activation="relu",
        kernel_initializer="he_normal",
        input_shape=input_shape,
    )
)
model.add(MaxPooling2D((2, 2)))
model.add(Dropout(0.25))
model.add(Conv2D(64, (3, 3), activation="relu"))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.25))
model.add(Conv2D(128, (3, 3), activation="relu"))
model.add(Dropout(0.4))
model.add(Flatten())
model.add(Dense(128, activation="relu"))
model.add(Dropout(0.3))
model.add(Dense(num_classes, activation="softmax"))

model.compile(
    loss=keras.losses.categorical_crossentropy,
    optimizer=keras.optimizers.Adam(),
    metrics=["accuracy"],
)

# get the current run context for metric logging (same run as retrieved above)
run = Run.get_context()


class LogRunMetrics(Callback):
    # callback at the end of every epoch
    def on_epoch_end(self, epoch, log):
        # logging the same metric name repeatedly builds up a list of values
        run.log("Loss", log["loss"])
        run.log("Accuracy", log["accuracy"])


history = model.fit(
    X_train,
    y_train,
    batch_size=batch_size,
    epochs=epochs,
    verbose=1,
    validation_data=(X_val, y_val),
    callbacks=[LogRunMetrics()],
)

score = model.evaluate(X_test, y_test, verbose=0)

# log a single value
run.log("Final test loss", score[0])
print("Test loss:", score[0])

run.log("Final test accuracy", score[1])
print("Test accuracy:", score[1])

plt.figure(figsize=(6, 3))
plt.title("Fashion MNIST with Keras ({} epochs)".format(epochs), fontsize=14)
plt.plot(history.history["accuracy"], "b-", label="Accuracy", lw=4, alpha=0.5)
plt.plot(history.history["loss"], "r--", label="Loss", lw=4, alpha=0.5)
plt.legend(fontsize=12)
plt.grid(True)

# log an image
run.log_image("Loss v.s. Accuracy", plot=plt)

# create a ./outputs/model folder in the compute target
# files saved in the "./outputs" folder are automatically uploaded into run history
os.makedirs("./outputs/model", exist_ok=True)

# serialize NN architecture to JSON
model_json = model.to_json()
# save model JSON
with open("./outputs/model/model.json", "w") as f:
    f.write(model_json)
# save model weights
model.save_weights("./outputs/model/model.h5")
print("model saved in ./outputs/model folder")

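Most of this code should be familiar to ML developers:

The data is partitioned into train and validation sets for training, and a separate test subset for final scoring.
The input shape is 28x28x1 (only 1 because the input is grayscale), there are 256 inputs in a batch, and there are 10 classes.
The number of training epochs is 10.
The model has three convolutional layers, with max pooling and dropout, followed by a dense layer and a softmax head.
The model is fitted for 10 epochs and then evaluated.
The model architecture is written to outputs/model/model.json and the weights to outputs/model/model.h5.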
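To see how the 28x28x1 input flows through the three convolutional layers, the shapes and parameter counts can be worked out by hand. The sketch below does that arithmetic in plain Python. It assumes the Keras defaults that train.py doesn't override: unpadded ("valid") 3x3 convolutions with stride 1 and bias terms, and pooling whose stride equals its pool size. The names LAYERS and trace are illustrative.

```python
KERNEL = 3  # every Conv2D layer in train.py uses a 3x3 kernel

# (kind, value): filter count for "conv", pool size for "pool".
# Dropout layers are omitted because they change neither shape nor parameters.
LAYERS = [("conv", 32), ("pool", 2), ("conv", 64), ("pool", 2), ("conv", 128)]


def trace(size=28, channels=1, dense_units=128, num_classes=10):
    """Return per-layer output shapes, the flattened length, and total parameters."""
    shapes = []
    params = 0
    for kind, value in LAYERS:
        if kind == "conv":
            params += (KERNEL * KERNEL * channels + 1) * value  # weights plus bias
            size = size - KERNEL + 1  # unpadded convolution shrinks each side by 2
            channels = value
        else:
            size = size // value  # non-overlapping pooling, remainder dropped
        shapes.append((kind, size, size, channels))
    flat = size * size * channels
    params += (flat + 1) * dense_units  # hidden dense layer
    params += (dense_units + 1) * num_classes  # softmax output layer
    return shapes, flat, params


if __name__ == "__main__":
    layer_shapes, flattened, total = trace()
    for shape in layer_shapes:
        print(shape)
    print("flattened length:", flattened)
    print("trainable parameters:", total)
```

Under these assumptions the final convolution yields a 3x3x128 volume, which flattens to 1,152 values, and the whole network has 241,546 trainable parameters.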

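Some of the code, though, is specific to Azure Machine Learning. run = Run.get_context() retrieves a Run object, which contains the current service context. The train.py source uses this run object to retrieve the input dataset by its name. This technique is an alternative to the code in prepare.py, which retrieves the dataset through the argv array of script arguments.

The run object is also used to log the training progress at the end of every epoch and, at the end of training, to log the graph of loss and accuracy over time.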
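The comments in train.py distinguish between logging a value repeatedly, which creates a list, and logging a single value. A minimal stand-in makes that behavior easy to picture. MetricRecorder is a hypothetical class written for this illustration, not part of the SDK, and it only approximates what a real Run object does.

```python
from collections import defaultdict


class MetricRecorder:
    """Toy stand-in for the metric-logging behavior of a run object."""

    def __init__(self):
        self._values = defaultdict(list)

    def log(self, name, value):
        """Append a value under a metric name."""
        self._values[name].append(value)

    def get_metrics(self):
        """Return scalars for names logged once and lists for names logged repeatedly."""
        return {
            name: values[0] if len(values) == 1 else list(values)
            for name, values in self._values.items()
        }


if __name__ == "__main__":
    recorder = MetricRecorder()
    for loss in (0.62, 0.41, 0.35):  # made-up numbers, one per epoch
        recorder.log("Loss", loss)
    recorder.log("Final test loss", 0.33)  # made-up number, logged once
    print(recorder.get_metrics())
```

This is why the per-epoch Loss and Accuracy metrics appear as series, while the final test metrics appear as single values.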

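Create the training pipeline step

The training step has a slightly more complex configuration than the preparation step. The preparation step uses only standard Python libraries. More commonly, you need to modify the runtime environment in which your source code runs.

Create a file named conda_dependencies.yml with the following contents: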

dependencies:
- python=3.10
- pip:
  - azureml-core
  - azureml-dataset-runtime
  - keras==2.4.3
  - tensorflow>=2.15
  - numpy
  - scikit-learn
  - pandas
  - matplotlib

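The Environment class represents the runtime environment in which a machine learning task runs. Associate the preceding specification with the training code by using the following code: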

keras_env = Environment.from_conda_specification(
    name="keras-env", file_path="./conda_dependencies.yml"
)

train_cfg = ScriptRunConfig(
    source_directory=script_folder,
    script="train.py",
    compute_target=compute_target,
    environment=keras_env,
)

The code that creates the training step is similar to the code that creates the preparation step:

train_step = PythonScriptStep(
    name="train step",
    arguments=[
        prepared_fashion_ds.read_delimited_files().as_input(name="prepared_fashion_ds")
    ],
    source_directory=train_cfg.source_directory,
    script_name=train_cfg.script,
    runconfig=train_cfg.run_config,
)

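Create and run the pipeline

Now that you've specified data inputs and outputs and created your pipeline's steps, you can compose them into a pipeline and run it: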

pipeline = Pipeline(workspace, steps=[prep_step, train_step])
run = exp.submit(pipeline)

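The Pipeline object that you create runs in your workspace and is composed of the preparation and training steps that you specified.

Note

This pipeline has a simple dependency graph: the training step relies on the preparation step, and the preparation step relies on the fashion_ds dataset. Production pipelines often have much more complex dependencies. Steps can rely on multiple upstream steps. A source code change in an early step can have far-reaching consequences. Azure Machine Learning tracks these concerns for you. You only need to pass in the array of steps. Azure Machine Learning takes care of calculating the execution graph.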
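The dependency graph described in this note can be pictured as a topological ordering problem. The following sketch uses the standard-library graphlib module (Python 3.9 or later) to order this tutorial's three nodes. It illustrates the concept only and makes no claim about how Azure Machine Learning computes its execution graph. The function name execution_order is hypothetical.

```python
from graphlib import TopologicalSorter


def execution_order(dependencies):
    """Order nodes so that each appears after everything it depends on.

    dependencies maps a node name to the set of nodes it relies on.
    """
    return list(TopologicalSorter(dependencies).static_order())


if __name__ == "__main__":
    graph = {
        "train step": {"prepare step"},
        "prepare step": {"fashion_ds"},
    }
    print(execution_order(graph))
```

With more steps and several upstream dependencies per step, the same call still yields a valid order, and it raises graphlib.CycleError if the steps depend on each other in a loop.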

The call to submit the Experiment completes quickly and produces output similar to the following:

Submitted PipelineRun 5968530a-abcd-1234-9cc1-46168951b5eb
Link to Azure Machine Learning Portal: https://studio.ml.azure.cn/runs/abc-xyz...

You can monitor the pipeline run by opening the link, or you can block until the run completes by running the following code:

run.wait_for_completion(show_output=True)

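Important

The first pipeline run takes roughly 15 minutes. All dependencies must be downloaded, a Docker image must be created, and the Python environment must be provisioned and created. Running the pipeline again takes significantly less time because those resources are reused instead of created. However, the total run time for the pipeline depends on the workload of your scripts and the processes that run in each pipeline step.

Once the pipeline completes, you can retrieve the metrics that you logged in the training step: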

run.find_step_run("train step")[0].get_metrics()
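Once you have the metrics dictionary, ordinary Python is enough to summarize it. The sketch below assumes the dictionary maps metric names to either a single value or a list of per-epoch values, as the logging calls in train.py suggest. The helper name best_epoch is hypothetical, and the numbers in the example are made up, not results from a real run.

```python
def best_epoch(metrics, name="Accuracy"):
    """Return (1-based epoch, value) for the highest value logged under name."""
    values = metrics[name]
    if not isinstance(values, list):
        values = [values]  # a metric logged once is treated as a single epoch
    index = max(range(len(values)), key=values.__getitem__)
    return index + 1, values[index]


if __name__ == "__main__":
    # Made-up values that only mimic the shape of the logged metrics.
    example = {"Accuracy": [0.70, 0.90, 0.85], "Final test accuracy": 0.88}
    print(best_epoch(example))
```

You could pass the result of get_metrics() from the training step run to this helper in place of the example dictionary.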

If you're satisfied with the metrics, you can register the model in your workspace:

run.find_step_run("train step")[0].register_model(
    model_name="keras-model",
    model_path="outputs/model/",
    datasets=[("train test data", fashion_ds)],
)

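Clean up resources

Don't complete this section if you plan to run other Azure Machine Learning tutorials.

Stop the compute instance

If you used a compute instance, stop the VM when you aren't using it to reduce cost.

In your workspace, select Compute.
From the list, select the name of the compute instance.
Select Stop.
When you're ready to use the server again, select Start.

Delete everything

If you don't plan to use the resources that you created, delete them so that you don't incur any charges:

In the Azure portal, select Resource groups in the left menu.
In the list of resource groups, select the resource group that you created.
Select Delete resource group.
Enter the resource group name. Then select Delete.

You can also keep the resource group but delete a single workspace. Display the workspace properties, and then select Delete.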

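Next steps

In this tutorial, you used the following types:

The Workspace represents your Azure Machine Learning workspace. It contained:
- The Experiment that holds the results of training runs of your pipeline
- The Dataset that lazily loaded the data held in the Fashion-MNIST datastore
- The ComputeTarget that represents the machines on which the pipeline steps run
- The Environment that is the runtime environment in which the pipeline steps run
- The Pipeline that composes the PythonScriptStep steps into a whole
- The Model that you registered after being satisfied with the training process

The Workspace object contains references to other resources (notebooks, endpoints, and so on) that weren't used in this tutorial. For more information, see What is an Azure Machine Learning workspace?

OutputFileDatasetConfig promotes the output of a run to a file-based dataset. For more information on datasets and working with data, see How to access data.

For more information on compute targets and environments, see What are compute targets in Azure Machine Learning? and What are Azure Machine Learning environments?

ScriptRunConfig associates a ComputeTarget and Environment with Python source files. A PythonScriptStep takes that ScriptRunConfig and defines its inputs and outputs. In this pipeline, the output is the file dataset built by OutputFileDatasetConfig.

For more examples of how to build pipelines by using the machine learning SDK, see the example repository.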

Last updated on 2026-03-10