Train scikit-learn models at scale with Azure Machine Learning

In this article, learn how to run your scikit-learn training scripts with Azure Machine Learning.

The example scripts in this article classify iris flowers to build a machine learning model based on scikit-learn's iris dataset.

Whether you're training a machine learning scikit-learn model from the ground up or you're bringing an existing model into the cloud, you can use Azure Machine Learning to scale out open-source training jobs using elastic cloud compute resources. You can build, deploy, version, and monitor production-grade models with Azure Machine Learning.

Prerequisites

Run this code on either of these environments:

  • Azure Machine Learning compute instance - no downloads or installation necessary

    • Complete the Tutorial: Setup environment and workspace to create a dedicated notebook server pre-loaded with the SDK and the sample repository.
    • In the samples training folder on the notebook server, find a completed and expanded notebook by navigating to this directory: how-to-use-azureml > ml-frameworks > scikit-learn > train-hyperparameter-tune-deploy-with-sklearn folder.
  • Your own Jupyter Notebook server

Set up the experiment

This section sets up the training experiment by loading the required Python packages, initializing a workspace, defining the training environment, and preparing the training script.

Initialize a workspace

The Azure Machine Learning workspace is the top-level resource for the service. It provides you with a centralized place to work with all the artifacts you create. In the Python SDK, you can access the workspace artifacts by creating a workspace object.

Create a workspace object from the config.json file created in the prerequisites section.

from azureml.core import Workspace

ws = Workspace.from_config()

Prepare scripts

In this tutorial, the training script train_iris.py is already provided for you here. In practice, you should be able to take any custom training script as is and run it with Azure ML without having to modify your code.

Notes:

  • The provided training script shows how to log some metrics to your Azure ML run by using the Run object within the script (see the sketch after these notes).
  • The provided training script uses example data from the iris = datasets.load_iris() function. To use and access your own data, see how to train with datasets to make data available during training.
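
The exact contents of train_iris.py are in the sample repository; the following is only a minimal sketch of a script in the same spirit, assuming the --kernel and --penalty arguments used later in this article. The Run.get_context() and run.log() calls illustrate the metric logging mentioned in the notes above.

# Hypothetical minimal training script; the provided train_iris.py may differ.
import argparse

from azureml.core import Run
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

parser = argparse.ArgumentParser()
parser.add_argument('--kernel', type=str, default='linear')
parser.add_argument('--penalty', type=float, default=1.0)
args = parser.parse_args()

# Handle to the current Azure ML run; metrics logged here appear in the run history
run = Run.get_context()
run.log('Kernel type', args.kernel)
run.log('Penalty', args.penalty)

iris = datasets.load_iris()
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.2)

svm_model_linear = SVC(kernel=args.kernel, C=args.penalty).fit(X_train, y_train)
run.log('Accuracy', svm_model_linear.score(X_test, y_test))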

Define your environment

To define the Azure ML Environment that encapsulates your training script's dependencies, you can either define a custom environment or use an Azure ML curated environment.

Use a curated environment

Optionally, Azure ML provides prebuilt, curated environments if you don't want to define your own environment. For more info, see here. If you want to use a curated environment, you can run the following command instead:

from azureml.core import Environment

sklearn_env = Environment.get(workspace=ws, name='AzureML-Tutorial')

Create a custom environment

You can also create your own custom environment. Define your conda dependencies in a YAML file; in this example the file is named conda_dependencies.yml.

dependencies:
  - python=3.6.2
  - scikit-learn
  - numpy
  - pip:
    - azureml-defaults

Create an Azure ML environment from this Conda environment specification. The environment will be packaged into a Docker container at runtime.

from azureml.core import Environment

sklearn_env = Environment.from_conda_specification(name='sklearn-env', file_path='conda_dependencies.yml')

For more information on creating and using environments, see Create and use software environments in Azure Machine Learning.

Configure and submit your training run

Create a ScriptRunConfig

Create a ScriptRunConfig object to specify the configuration details of your training job, including your training script, the environment to use, and the compute target to run on. Any arguments to your training script will be passed via command line if specified in the arguments parameter.

The following code will configure a ScriptRunConfig object for submitting your job for execution on your local machine.

from azureml.core import ScriptRunConfig

src = ScriptRunConfig(source_directory='.',
                      script='train_iris.py',
                      arguments=['--kernel', 'linear', '--penalty', 1.0],
                      environment=sklearn_env)

If you want to instead run your job on a remote cluster, you can specify the desired compute target to the compute_target parameter of ScriptRunConfig.

from azureml.core import ScriptRunConfig

compute_target = ws.compute_targets['<my-cluster-name>']
src = ScriptRunConfig(source_directory='.',
                      script='train_iris.py',
                      arguments=['--kernel', 'linear', '--penalty', 1.0],
                      compute_target=compute_target,
                      environment=sklearn_env)
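
The example above assumes a cluster with the placeholder name '<my-cluster-name>' already exists in the workspace. If you still need one, a sketch along these lines, using a hypothetical cluster name and VM size, provisions an AmlCompute cluster or reuses an existing one:

from azureml.core.compute import AmlCompute, ComputeTarget
from azureml.core.compute_target import ComputeTargetException

cluster_name = 'cpu-cluster'  # hypothetical name; pick your own

try:
    # Reuse the cluster if it already exists in the workspace
    compute_target = ComputeTarget(workspace=ws, name=cluster_name)
except ComputeTargetException:
    # Otherwise provision a CPU cluster that autoscales between 0 and 4 nodes
    compute_config = AmlCompute.provisioning_configuration(vm_size='STANDARD_D2_V2',
                                                           min_nodes=0,
                                                           max_nodes=4)
    compute_target = ComputeTarget.create(ws, cluster_name, compute_config)
    compute_target.wait_for_completion(show_output=True)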

Submit your run

from azureml.core import Experiment

run = Experiment(ws,'train-iris').submit(src)
run.wait_for_completion(show_output=True)

Warning

Azure Machine Learning runs training scripts by copying the entire source directory. If you have sensitive data that you don't want to upload, use a .ignore file or don't include it in the source directory. Instead, access your data using an Azure ML dataset.
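
For example, assuming you have already registered a FileDataset under the hypothetical name 'iris-data' in your workspace, you could mount it onto the compute target and hand its path to the script as an extra argument, roughly like this:

from azureml.core import Dataset, ScriptRunConfig

# 'iris-data' is a hypothetical registered dataset name; replace with your own
dataset = Dataset.get_by_name(ws, name='iris-data')

src = ScriptRunConfig(source_directory='.',
                      script='train_iris.py',
                      arguments=['--data-path', dataset.as_named_input('iris').as_mount()],
                      compute_target=compute_target,
                      environment=sklearn_env)

The training script would then need to accept a --data-path argument and read its data from that location instead of calling datasets.load_iris().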

What happens during run execution

As the run is executed, it goes through the following stages:

  • Preparing: A docker image is created according to the environment defined. The image is uploaded to the workspace's container registry and cached for later runs. Logs are also streamed to the run history and can be viewed to monitor progress. If a curated environment is specified instead, the cached image backing that curated environment will be used.

  • Scaling: The cluster attempts to scale up if the Batch AI cluster requires more nodes to execute the run than are currently available.

  • Running: All scripts in the script folder are uploaded to the compute target, data stores are mounted or copied, and the script is executed. Outputs from stdout and the ./logs folder are streamed to the run history and can be used to monitor the run (see the sketch after this list).

  • Post-Processing: The ./outputs folder of the run is copied over to the run history.
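
Because logs and metrics are streamed to the run history, you can watch the run from your notebook. A minimal sketch, assuming the azureml-widgets package is installed in your notebook environment:

from azureml.widgets import RunDetails

# Interactive widget that refreshes with streamed logs and logged metrics
RunDetails(run).show()

# Or poll the run programmatically
print(run.get_status())
print(run.get_metrics())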

Save and register the model

Once you've trained the model, you can save and register it to your workspace. Model registration lets you store and version your models in your workspace to simplify model management and deployment.

Add the following code to your training script, train_iris.py, to save the model.

import os
import joblib

os.makedirs('outputs', exist_ok=True)
# Files in ./outputs are uploaded to the run history, matching the registration path below
joblib.dump(svm_model_linear, 'outputs/model.joblib')

Register the model to your workspace with the following code. By specifying the parameters model_framework, model_framework_version, and resource_configuration, no-code model deployment becomes available. No-code model deployment allows you to directly deploy your model as a web service from the registered model, and the ResourceConfiguration object defines the compute resource for the web service.

from azureml.core import Model
from azureml.core.resource_configuration import ResourceConfiguration

model = run.register_model(model_name='sklearn-iris', 
                           model_path='outputs/model.joblib',
                           model_framework=Model.Framework.SCIKITLEARN,
                           model_framework_version='0.19.1',
                           resource_configuration=ResourceConfiguration(cpu=1, memory_in_gb=0.5))

Deployment

The model you just registered can be deployed the exact same way as any other registered model in Azure ML. The deployment how-to contains a section on registering models, but you can skip directly to creating a compute target for deployment, since you already have a registered model.

(Preview) No-code model deployment

Instead of the traditional deployment route, you can also use the no-code deployment feature (preview) for scikit-learn. No-code model deployment is supported for all built-in scikit-learn model types. By registering your model as shown above with the model_framework, model_framework_version, and resource_configuration parameters, you can simply use the deploy() static function to deploy your model.

web_service = Model.deploy(ws, "scikit-learn-service", [model])
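
Model.deploy returns a Webservice object. A typical follow-up is to wait for the deployment to finish and then read the scoring endpoint, roughly as follows:

# Block until the service is deployed, streaming deployment output
web_service.wait_for_deployment(show_output=True)
print(web_service.scoring_uri)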

Note: These dependencies are included in the pre-built scikit-learn inference container.

    - azureml-defaults
    - inference-schema[numpy-support]
    - scikit-learn
    - numpy

The full how-to covers deployment in Azure Machine Learning in greater depth.

Next steps

In this article, you trained and registered a scikit-learn model, and learned about deployment options. See these other articles to learn more about Azure Machine Learning.