Train scikit-learn models at scale with Azure Machine Learning

2024/08/14

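This article shows how to run a scikit-learn training script by using the Azure Machine Learning Python SDK v2.

The example script in this article classifies iris flowers, building a machine learning model from scikit-learn's iris dataset.

Whether you're training a scikit-learn model from the ground up or bringing an existing model into the cloud, you can use Azure Machine Learning to scale out open-source training jobs on elastic cloud compute resources. You can also build, deploy, version, and monitor production-grade models with Azure Machine Learning.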

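Prerequisites

You can run the code for this article in either an Azure Machine Learning compute instance or your own Jupyter Notebook.

Azure Machine Learning compute instance
- Complete "Create resources to get started" to create a compute instance. Every compute instance includes a dedicated notebook server preloaded with the SDK and the notebooks sample repository.
- Select the Notebooks tab in Azure Machine Learning studio. In the samples training folder, find a completed and expanded notebook by navigating to this directory: v2 > sdk > jobs > single-step > scikit-learn > train-hyperparameter-tune-deploy-with-sklearn.
- You can use the prepopulated code in the samples training folder to complete this tutorial.
Jupyter Notebook server
- Install the Azure Machine Learning SDK (v2).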

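Set up the job

This section sets up the job for training by loading the required Python packages, connecting to a workspace, creating a compute resource to run a command job, and creating an environment to run the job.

Connect to the workspace

First, connect to your Azure Machine Learning workspace. The workspace is the top-level resource for the service. It provides a centralized place to work with all the artifacts you create when you use Azure Machine Learning.

Use DefaultAzureCredential to get access to the workspace. This credential should be capable of handling most Azure SDK authentication scenarios.

If DefaultAzureCredential doesn't work for you, see the azure-identity reference documentation or Set up authentication for more available credentials.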

# Handle to the workspace
from azure.ai.ml import MLClient

# Authentication package
from azure.identity import DefaultAzureCredential

credential = DefaultAzureCredential()

If you prefer to use a browser to sign in and authenticate, uncomment the following code and use it instead.

# Handle to the workspace
# from azure.ai.ml import MLClient

# Authentication package
# from azure.identity import InteractiveBrowserCredential
# credential = InteractiveBrowserCredential()

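Next, get a handle to the workspace by providing your subscription ID, resource group name, and workspace name. To find these parameters:

- Look for your workspace name in the upper-right corner of the Azure Machine Learning studio toolbar.
- Select your workspace name to show your resource group and subscription ID.
- Copy the values for resource group and subscription ID into the code.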

# Get a handle to the workspace
ml_client = MLClient(
    credential=credential,
    subscription_id="<SUBSCRIPTION_ID>",
    resource_group_name="<RESOURCE_GROUP>",
    workspace_name="<AML_WORKSPACE_NAME>",
)

Running this script gives you a workspace handle that you use to manage other resources and jobs.
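
Instead of pasting the three identifiers into the notebook, one option is to read them from environment variables. The following is a minimal sketch, not part of the original sample: the variable names AZURE_SUBSCRIPTION_ID, AZURE_RESOURCE_GROUP, and AZURE_ML_WORKSPACE and the helper workspace_settings_from_env are assumptions made for illustration.

```python
import os

# Hypothetical environment variable names; pick whatever fits your setup.
REQUIRED_VARS = {
    "subscription_id": "AZURE_SUBSCRIPTION_ID",
    "resource_group_name": "AZURE_RESOURCE_GROUP",
    "workspace_name": "AZURE_ML_WORKSPACE",
}


def workspace_settings_from_env(environ=os.environ):
    """Collect the three workspace identifiers from environment variables."""
    missing = [var for var in REQUIRED_VARS.values() if not environ.get(var)]
    if missing:
        raise KeyError(f"Missing environment variables: {', '.join(missing)}")
    # Keys match the keyword arguments passed to MLClient in this article.
    return {kwarg: environ[var] for kwarg, var in REQUIRED_VARS.items()}
```

The returned dictionary could then be unpacked into the client, for example MLClient(credential=credential, **workspace_settings_from_env()).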

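Note

Creating MLClient doesn't connect the client to the workspace. Client initialization is lazy and waits until the first time it needs to make a call. In this article, that happens during compute creation.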

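Create a compute resource

Azure Machine Learning needs a compute resource to run a job. This resource can be a single-node or multi-node machine with a Linux or Windows OS, or a specific compute fabric such as Spark.

The following example script provisions a Linux compute cluster. See the Azure Machine Learning pricing page for the full list of VM sizes and prices. This example needs only a basic cluster, so it uses the Standard_DS3_v2 VM size, which has 4 vCPU cores and 14 GB of RAM, to create the Azure Machine Learning compute.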

from azure.ai.ml.entities import AmlCompute

# Name assigned to the compute cluster
cpu_compute_target = "cpu-cluster"

try:
    # let's see if the compute target already exists
    cpu_cluster = ml_client.compute.get(cpu_compute_target)
    print(
        f"You already have a cluster named {cpu_compute_target}, we'll reuse it as is."
    )

except Exception:
    print("Creating a new cpu compute target...")

    # Let's create the Azure ML compute object with the intended parameters
    cpu_cluster = AmlCompute(
        name=cpu_compute_target,
        # Azure ML Compute is the on-demand VM service
        type="amlcompute",
        # VM Family
        size="STANDARD_DS3_V2",
        # Minimum running nodes when there is no job running
        min_instances=0,
        # Nodes in cluster
        max_instances=4,
        # How many seconds an idle node keeps running after a job finishes before it is scaled down
        idle_time_before_scale_down=180,
        # Dedicated or LowPriority. The latter is cheaper but there is a chance of job termination
        tier="Dedicated",
    )

    # Now, we pass the object to MLClient's create_or_update method
    cpu_cluster = ml_client.compute.begin_create_or_update(cpu_cluster).result()

print(
    f"AMLCompute with name {cpu_cluster.name} is created, the compute size is {cpu_cluster.size}"
)

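Create a job environment

To run an Azure Machine Learning job, you need an environment. An Azure Machine Learning environment encapsulates the dependencies (such as the software runtime and libraries) needed to run your machine learning training script on your compute resource. This environment is similar to a Python environment on your local machine.

Azure Machine Learning lets you either use a curated (ready-made) environment or create a custom environment from a Docker image or a Conda configuration. In this article, you create a custom environment for your jobs from a Conda YAML file.

Create a custom environment

To create a custom environment, define your Conda dependencies in a YAML file. First, create a directory to store the file. In this example, the directory is named env.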

import os

dependencies_dir = "./env"
os.makedirs(dependencies_dir, exist_ok=True)

Then, create the file in the dependencies directory. In this example, the file is named conda.yml.

%%writefile {dependencies_dir}/conda.yml
name: sklearn-env
channels:
  - conda-forge
dependencies:
  - python=3.8
  - pip=21.2.4
  - scikit-learn=0.24.2
  - scipy=1.7.1
  - pip:
    - mlflow==1.26.1
    - azureml-mlflow==1.42.0

The specification contains some common packages (such as scikit-learn, scipy, and pip) that the job uses.
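
The %%writefile cell magic only works inside IPython or Jupyter. If you run these steps as a plain Python script instead, a rough equivalent is to write the same specification with pathlib. This is a sketch under that assumption; the helper name write_conda_file is made up for illustration.

```python
from pathlib import Path

# Same Conda specification as in the notebook cell above.
CONDA_SPEC = """\
name: sklearn-env
channels:
  - conda-forge
dependencies:
  - python=3.8
  - pip=21.2.4
  - scikit-learn=0.24.2
  - scipy=1.7.1
  - pip:
    - mlflow==1.26.1
    - azureml-mlflow==1.42.0
"""


def write_conda_file(dependencies_dir="./env", filename="conda.yml"):
    """Write the Conda specification to disk and return the file path."""
    target_dir = Path(dependencies_dir)
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / filename
    target.write_text(CONDA_SPEC, encoding="utf-8")
    return target
```

Calling write_conda_file() with no arguments produces ./env/conda.yml, the same location the notebook cell writes to.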

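Next, use the YAML file to create and register this custom environment in your workspace. The environment is packaged into a Docker container at runtime.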

from azure.ai.ml.entities import Environment

custom_env_name = "sklearn-env"

job_env = Environment(
    name=custom_env_name,
    description="Custom environment for sklearn image classification",
    conda_file=os.path.join(dependencies_dir, "conda.yml"),
    image="mcr.microsoft.com/azureml/openmpi3.1.2-ubuntu18.04:latest",
)
job_env = ml_client.environments.create_or_update(job_env)

print(
    f"Environment with name {job_env.name} is registered to workspace, the environment version is {job_env.version}"
)

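For more information on creating and using environments, see Create and use software environments in Azure Machine Learning.

Configure and submit your training job

This section covers how to run a training job by using the provided training script. First, you build the training job by configuring the command that runs the training script. Then, you submit the training job to run in Azure Machine Learning.

Prepare the training script

This article provides the training script train_iris.py. In practice, you should be able to take any custom training script as is and run it with Azure Machine Learning without modifying your code.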

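Note

The provided training script does the following:

- Shows how to log some metrics to your Azure Machine Learning run;
- Loads the training data by using iris = datasets.load_iris(); and
- Trains a model, then saves and registers it.

To use and access your own data, see how to read and write data in a job to make data available during training.

To use the training script, first create a directory to store the file.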

import os

src_dir = "./src"
os.makedirs(src_dir, exist_ok=True)

Next, create the script file in the source directory.

%%writefile {src_dir}/train_iris.py
# Modified from https://www.geeksforgeeks.org/multiclass-classification-using-scikit-learn/

import argparse
import os

# importing necessary libraries
import numpy as np

from sklearn import datasets
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

import joblib

import mlflow
import mlflow.sklearn

def main():
    parser = argparse.ArgumentParser()

    parser.add_argument('--kernel', type=str, default='linear',
                        help='Kernel type to be used in the algorithm')
    parser.add_argument('--penalty', type=float, default=1.0,
                        help='Penalty parameter of the error term')

    # Start Logging
    mlflow.start_run()

    # enable autologging
    mlflow.sklearn.autolog()

    args = parser.parse_args()
    mlflow.log_param('Kernel type', str(args.kernel))
    mlflow.log_metric('Penalty', float(args.penalty))

    # loading the iris dataset
    iris = datasets.load_iris()

    # X -> features, y -> label
    X = iris.data
    y = iris.target

    # dividing X, y into train and test data
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # training a linear SVM classifier
    from sklearn.svm import SVC
    svm_model_linear = SVC(kernel=args.kernel, C=args.penalty)
    svm_model_linear = svm_model_linear.fit(X_train, y_train)
    svm_predictions = svm_model_linear.predict(X_test)

    # model accuracy for X_test
    accuracy = svm_model_linear.score(X_test, y_test)
    print('Accuracy of SVM classifier on test set: {:.2f}'.format(accuracy))
    mlflow.log_metric('Accuracy', float(accuracy))
    # creating a confusion matrix
    cm = confusion_matrix(y_test, svm_predictions)
    print(cm)

    registered_model_name="sklearn-iris-flower-classify-model"

    ##########################
    #<save and register model>
    ##########################
    # Registering the model to the workspace
    print("Registering the model via MLFlow")
    mlflow.sklearn.log_model(
        sk_model=svm_model_linear,
        registered_model_name=registered_model_name,
        artifact_path=registered_model_name
    )

    # # Saving the model to a file
    print("Saving the model via MLFlow")
    mlflow.sklearn.save_model(
        sk_model=svm_model_linear,
        path=os.path.join(registered_model_name, "trained_model"),
    )
    ###########################
    #</save and register model>
    ###########################
    mlflow.end_run()

if __name__ == '__main__':
    main()
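
The script accepts any string for --kernel and any float for --penalty, so a typo only surfaces once the job is already running on the cluster. One optional hardening, which is not part of the provided script, is to validate both arguments while parsing. The sketch below assumes you only want the four kernels used later in the sweep and relies on the fact that SVC requires a strictly positive C; the names KNOWN_KERNELS, positive_float, and build_parser are illustrative.

```python
import argparse

# Assumption: restrict to the four kernels used in the sweep later in this article.
KNOWN_KERNELS = ["linear", "rbf", "poly", "sigmoid"]


def positive_float(text):
    """Parse a float and reject values that SVC would refuse as C."""
    value = float(text)
    if value <= 0:
        raise argparse.ArgumentTypeError("penalty must be greater than 0")
    return value


def build_parser():
    """Build a parser with the same two options as train_iris.py, plus validation."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--kernel", type=str, default="linear", choices=KNOWN_KERNELS,
                        help="Kernel type to be used in the algorithm")
    parser.add_argument("--penalty", type=positive_float, default=1.0,
                        help="Penalty parameter of the error term")
    return parser
```

With this parser, an invalid value makes argparse exit with a usage message before any training starts.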

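Build the training job

Now that you have all the assets required to run your job, it's time to build it by using the Azure Machine Learning Python SDK v2. To run this job, you create a command.

An Azure Machine Learning command is a resource that specifies all the details needed to execute your training code in the cloud. These details include the inputs and outputs, the type of hardware to use, the software to install, and how to run your code. The command contains the information needed to execute a single command.

Configure the command

Use the general-purpose command to run the training script and perform your desired tasks. Create a Command object to specify the configuration details of your training job.

The inputs for this command are the SVM kernel type (kernel) and the penalty parameter (penalty).
For the parameter values:
- Provide the compute cluster cpu_compute_target = "cpu-cluster" that you created for running this command;
- Provide the custom environment sklearn-env that you created for running the Azure Machine Learning job;
- Configure the command line action itself; in this case, the command is python train_iris.py. You can access the inputs and outputs in the command through the ${{ ... }} notation; and
- Configure metadata such as the display name and experiment name. An experiment is a container for all the iterations done on a certain project, and all jobs submitted under the same experiment name are listed next to each other in Azure Machine Learning studio.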

from azure.ai.ml import command
from azure.ai.ml import Input

job = command(
    inputs=dict(kernel="linear", penalty=1.0),
    compute=cpu_compute_target,
    environment=f"{job_env.name}:{job_env.version}",
    code="./src/",
    command="python train_iris.py --kernel ${{inputs.kernel}} --penalty ${{inputs.penalty}}",
    experiment_name="sklearn-iris-flowers",
    display_name="sklearn-classify-iris-flower-images",
)
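
To make the ${{inputs.<name>}} notation concrete, here is a toy stand-in that fills the placeholders from a dictionary. It only illustrates the idea of the substitution: Azure Machine Learning resolves these expressions itself when the job runs, and render_command is a hypothetical helper, not an SDK function.

```python
import re

# Matches placeholders such as ${{inputs.kernel}} and captures the input name.
PLACEHOLDER = re.compile(r"\$\{\{\s*inputs\.(\w+)\s*\}\}")


def render_command(template, inputs):
    """Replace each ${{inputs.<name>}} placeholder with the matching input value."""

    def substitute(match):
        name = match.group(1)
        if name not in inputs:
            raise KeyError(f"No input named {name!r}")
        return str(inputs[name])

    return PLACEHOLDER.sub(substitute, template)


template = "python train_iris.py --kernel ${{inputs.kernel}} --penalty ${{inputs.penalty}}"
rendered = render_command(template, {"kernel": "linear", "penalty": 1.0})
print(rendered)  # python train_iris.py --kernel linear --penalty 1.0
```

The rendered string is the command line that train_iris.py's argument parser ends up receiving for these inputs.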

Submit the job

It's now time to submit the job to run in Azure Machine Learning. This time, you use create_or_update on ml_client.jobs.

ml_client.jobs.create_or_update(job)

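Once completed, the job registers a model in your workspace (as a result of training) and outputs a link for viewing the job in Azure Machine Learning studio.

Warning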

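Azure Machine Learning runs training scripts by copying the entire source directory. If you have sensitive data that you don't want to upload, use a .amlignore or .gitignore file, or don't include the data in the source directory.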
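
As a sketch of the ignore-file option, the following helper writes a .amlignore file, which uses the same pattern syntax as .gitignore, into the source directory. The patterns listed are hypothetical placeholders, and write_amlignore is an illustrative name; replace the patterns with whatever should stay out of the upload.

```python
from pathlib import Path

# Hypothetical patterns; list whatever must not be copied with the source directory.
DEFAULT_PATTERNS = [".env", "*.pem", "secrets/", "data/raw/"]


def write_amlignore(src_dir="./src", patterns=DEFAULT_PATTERNS):
    """Create <src_dir>/.amlignore with one pattern per line and return its path."""
    ignore_file = Path(src_dir) / ".amlignore"
    ignore_file.parent.mkdir(parents=True, exist_ok=True)
    ignore_file.write_text("\n".join(patterns) + "\n", encoding="utf-8")
    return ignore_file
```

Run it once before submitting the job so that the file is already in place when the source directory is copied.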

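What happens during job execution

As the job is executed, it goes through the following stages:

- Preparing: A Docker image is created according to the environment defined. The image is uploaded to the workspace's container registry and cached for later runs. Logs are also streamed to the run history and can be viewed to monitor progress. If a curated environment is specified, the cached image backing that curated environment is used.
- Scaling: The cluster attempts to scale up if it requires more nodes to execute the run than are currently available.
- Running: All scripts in the script folder src are uploaded to the compute target, data stores are mounted or copied, and the script is executed. Outputs from stdout and the ./logs folder are streamed to the run history and can be used to monitor the run.

Tune model hyperparameters

Now that you've seen how to do a simple scikit-learn training run by using the SDK, let's see if you can further improve the accuracy of your model. You can tune and optimize your model's hyperparameters by using Azure Machine Learning's sweep capabilities.

To tune the model's hyperparameters, define the parameter space in which to search during training. You do this by replacing some of the parameters (kernel and penalty) passed to the training job with special inputs from the azure.ai.ml.sweep package.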

from azure.ai.ml.sweep import Choice

# Reuse the command job created earlier; calling it as a function lets us override its inputs.
# Both inputs (kernel and penalty) are replaced with search-space expressions.
job_for_sweep = job(
    kernel=Choice(values=["linear", "rbf", "poly", "sigmoid"]),
    penalty=Choice(values=[0.5, 1, 1.5]),
)

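Then, configure sweep on the command job by using some sweep-specific parameters, such as the primary metric to watch and the sampling algorithm to use.

The following code uses random sampling to try different configuration sets of hyperparameters in an attempt to maximize the primary metric, Accuracy.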

sweep_job = job_for_sweep.sweep(
    compute="cpu-cluster",
    sampling_algorithm="random",
    primary_metric="Accuracy",
    goal="Maximize",
    max_total_trials=12,
    max_concurrent_trials=4,
)
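
To get a feel for the size of this search, note that four kernels and three penalty values give 4 x 3 = 12 combinations, the same number as max_total_trials. The sketch below enumerates that grid locally and draws trials by picking each hyperparameter independently. It is a plain-Python stand-in for reasoning about the search space, not Azure Machine Learning's sampler, and full_grid and sample_trials are illustrative names.

```python
import itertools
import random

KERNELS = ["linear", "rbf", "poly", "sigmoid"]
PENALTIES = [0.5, 1, 1.5]


def full_grid(kernels=KERNELS, penalties=PENALTIES):
    """Return every (kernel, penalty) combination in the search space."""
    return list(itertools.product(kernels, penalties))


def sample_trials(n_trials, seed=0, kernels=KERNELS, penalties=PENALTIES):
    """Draw n_trials configurations, choosing each hyperparameter independently."""
    rng = random.Random(seed)
    return [(rng.choice(kernels), rng.choice(penalties)) for _ in range(n_trials)]


grid = full_grid()
print(f"{len(grid)} combinations in the search space")
```

In this local stand-in, independent draws can repeat a combination, so a random budget equal to the grid size does not by itself guarantee that every combination is tried.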

Now, you can submit this job as before. This time, you run a sweep job that sweeps over your training job.

returned_sweep_job = ml_client.create_or_update(sweep_job)

# stream the output and wait until the job is finished
ml_client.jobs.stream(returned_sweep_job.name)

# refresh the latest status of the job after streaming
returned_sweep_job = ml_client.jobs.get(name=returned_sweep_job.name)

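You can monitor the job by using the studio user interface link that's presented during the job run.

Find and register the best model

Once all the runs complete, you can find the run that produced the model with the highest accuracy.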

from azure.ai.ml.entities import Model

if returned_sweep_job.status == "Completed":

    # First let us get the run which gave us the best result
    best_run = returned_sweep_job.properties["best_child_run_id"]

    # lets get the model from this run
    model = Model(
        # the script stores the model as "sklearn-iris-flower-classify-model"
        path="azureml://jobs/{}/outputs/artifacts/paths/sklearn-iris-flower-classify-model/".format(
            best_run
        ),
        name="run-model-example",
        description="Model created from run.",
        type="custom_model",
    )

else:
    print(
        "Sweep job status: {}. Please wait until it completes".format(
            returned_sweep_job.status
        )
    )

You can then register this model.

registered_model = ml_client.models.create_or_update(model=model)
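
Note that the registration call assumes the model variable exists, which is only the case when the sweep finished with the status Completed. A small guard, sketched here with the hypothetical helper best_model_path, makes that precondition explicit by building the artifact path only for a completed sweep and failing early otherwise.

```python
def best_model_path(status, best_run_id,
                    artifact_path="sklearn-iris-flower-classify-model"):
    """Return the azureml:// path to the best run's model, or fail if the sweep isn't done."""
    if status != "Completed":
        raise RuntimeError(
            f"Sweep job status: {status}. Wait until it completes before registering."
        )
    # Same path layout as in the Model definition above.
    return f"azureml://jobs/{best_run_id}/outputs/artifacts/paths/{artifact_path}/"
```

For example, best_model_path(returned_sweep_job.status, returned_sweep_job.properties["best_child_run_id"]) yields the value to pass as path when constructing the Model.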

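Deploy the model

After you register your model, you can deploy it the same way as any other registered model in Azure Machine Learning. For more information about deployment, see Deploy and score a machine learning model with a managed online endpoint using Python SDK v2.

Next steps

In this article, you trained and registered a scikit-learn model, and you learned about deployment options. To learn more about Azure Machine Learning, see the following articles.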