Tutorial: Train image classification models with MNIST data and scikit-learn using Azure Machine Learning

APPLIES TO: Basic edition, Enterprise edition (Upgrade to Enterprise edition)

In this tutorial, you train a machine learning model on remote compute resources. You use the training and deployment workflow for Azure Machine Learning in a Python Jupyter notebook. You can then use the notebook as a template to train your own machine learning model with your own data. This tutorial is part one of a two-part tutorial series.

This tutorial trains a simple logistic regression by using the MNIST dataset and scikit-learn with Azure Machine Learning. MNIST is a popular dataset consisting of 70,000 grayscale images. Each image is a handwritten digit of 28 x 28 pixels, representing a number from zero to nine. The goal is to create a multiclass classifier to identify the digit that a given image represents.
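As a local, stand-alone illustration of the same modeling idea (a sketch that is not part of the tutorial's notebook), scikit-learn's bundled 8x8 digits dataset can stand in for the full MNIST set:

```python
# Train a multiclass logistic regression on small grayscale digit images.
# scikit-learn's built-in digits dataset (1,797 images of 8x8 pixels,
# labels 0-9) stands in here for the 70,000-image MNIST dataset.
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data / 16.0,          # scale pixel intensities to the 0-1 range
    digits.target,
    test_size=0.25,
    random_state=42)

clf = LogisticRegression(C=100.0, solver="liblinear")
clf.fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))
```

The liblinear solver handles the ten classes with a one-vs-rest scheme; the tutorial's remote training script uses the same estimator class on the real MNIST files.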

Learn how to take the following actions:

  • Set up your development environment.
  • Access and examine the data.
  • Train a simple logistic regression model on a remote cluster.
  • Review training results and register the best model.

You learn how to select a model and deploy it in part two of this tutorial.

If you don't have an Azure subscription, create a trial account before you begin.


The code in this article was tested with Azure Machine Learning SDK version 1.0.65.


  • Complete the Tutorial: Get started creating your first Azure ML experiment to:

    • Create a workspace.
    • Clone the tutorials notebook to your folder in the workspace.
    • Create a cloud-based compute instance.
  • In your cloned tutorials/image-classification-mnist-data folder, open the img-classification-part1-training.ipynb notebook.

The tutorial and its accompanying utils.py file are also available on GitHub if you want to use them in your own local environment. Run pip install azureml-sdk[notebooks] azureml-opendatasets matplotlib to install the dependencies for this tutorial.


The rest of this article contains the same content as the notebook.

Switch to the Jupyter notebook now if you want to read along as you run the code. To run a single code cell in the notebook, click the code cell and press Shift+Enter. Or run the entire notebook by choosing Run all from the top toolbar.

Set up your development environment

All the setup for your development work can be accomplished in a Python notebook. Setup includes the following actions:

  • Import Python packages.
  • Connect to a workspace, so that your local computer can communicate with remote resources.
  • Create an experiment to track all your runs.
  • Create a remote compute target to use for training.

Import packages

Import the Python packages you need in this session. Also display the Azure Machine Learning SDK version:

%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt

import azureml.core
from azureml.core import Workspace

# check core SDK version number
print("Azure ML SDK Version: ", azureml.core.VERSION)

Connect to a workspace

Create a workspace object from the existing workspace. Workspace.from_config() reads the file config.json and loads the details into an object named ws:

# load workspace configuration from the config.json file in the current folder.
ws = Workspace.from_config()
print(ws.name, ws.location, ws.resource_group, sep='\t')
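For reference, Workspace.from_config() expects a config.json in the current folder (or a parent folder) shaped like the following; the values shown are placeholders for your own subscription details:

```json
{
    "subscription_id": "<your-subscription-id>",
    "resource_group": "<your-resource-group>",
    "workspace_name": "<your-workspace-name>"
}
```

You can download this file from the workspace's Overview page in the Azure portal instead of writing it by hand.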

Create an experiment

Create an experiment to track the runs in your workspace. A workspace can have multiple experiments:

from azureml.core import Experiment
experiment_name = 'sklearn-mnist'

exp = Experiment(workspace=ws, name=experiment_name)

Create or attach an existing compute target

Azure Machine Learning Compute is a managed service that lets data scientists train machine learning models on clusters of Azure virtual machines, including VMs with GPU support. In this tutorial, you create Azure Machine Learning Compute as your training environment. Later in the tutorial, you submit Python code to run on this VM.

The following code creates the compute cluster for you if it doesn't already exist in your workspace.

Creation of the compute target takes about five minutes. If the compute resource is already in the workspace, the code uses it and skips the creation process.

from azureml.core.compute import AmlCompute
from azureml.core.compute import ComputeTarget
import os

# choose a name for your cluster
compute_name = os.environ.get("AML_COMPUTE_CLUSTER_NAME", "cpucluster")
compute_min_nodes = os.environ.get("AML_COMPUTE_CLUSTER_MIN_NODES", 0)
compute_max_nodes = os.environ.get("AML_COMPUTE_CLUSTER_MAX_NODES", 4)

# This example uses CPU VM. For using GPU VM, set SKU to STANDARD_NC6
vm_size = os.environ.get("AML_COMPUTE_CLUSTER_SKU", "STANDARD_D2_V2")

if compute_name in ws.compute_targets:
    compute_target = ws.compute_targets[compute_name]
    if compute_target and type(compute_target) is AmlCompute:
        print('found compute target. just use it. ' + compute_name)
else:
    print('creating a new compute target...')
    provisioning_config = AmlCompute.provisioning_configuration(vm_size=vm_size,
                                                                min_nodes=compute_min_nodes,
                                                                max_nodes=compute_max_nodes)

    # create the cluster
    compute_target = ComputeTarget.create(
        ws, compute_name, provisioning_config)

    # can poll for a minimum number of nodes and for a specific timeout.
    # if no min node count is provided it will use the scale settings for the cluster
    compute_target.wait_for_completion(
        show_output=True, min_node_count=None, timeout_in_minutes=20)

    # For a more detailed view of current AmlCompute status, use get_status()
    print(compute_target.get_status().serialize())

You now have the necessary packages and compute resources to train a model in the cloud.

Explore data

Before you train a model, you need to understand the data that you're using to train it. In this section you learn how to:

  • Download the MNIST dataset.
  • Display some sample images.

Download the MNIST dataset

Use Azure Open Datasets to get the raw MNIST data files. Azure Open Datasets are curated public datasets that you can use to add scenario-specific features to machine learning solutions for more accurate models. Each dataset has a corresponding class, MNIST in this case, to retrieve the data in different ways.

This code retrieves the data as a FileDataset object, which is a subclass of Dataset. A FileDataset references single or multiple files of any format in your datastores or public URLs. The class provides the ability to download or mount the files to your compute by creating a reference to the data source location. Additionally, you register the dataset to your workspace for easy retrieval during training.

Follow the how-to guide to learn more about datasets and their usage in the SDK.

from azureml.core import Dataset
from azureml.opendatasets import MNIST

data_folder = os.path.join(os.getcwd(), 'data')
os.makedirs(data_folder, exist_ok=True)

mnist_file_dataset = MNIST.get_file_dataset()
mnist_file_dataset.download(data_folder, overwrite=True)

mnist_file_dataset = mnist_file_dataset.register(workspace=ws,
                                                 name='mnist_opendataset',
                                                 description='training and test dataset',
                                                 create_new_version=True)

Display some sample images

Load the compressed files into numpy arrays. Then use matplotlib to plot 30 random images from the dataset with their labels above them. This step requires a load_data function that's included in a utils.py file. This file is included in the sample folder. Make sure it's placed in the same folder as this notebook. The load_data function simply parses the compressed files into numpy arrays.
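For context, here's a hypothetical sketch of the kind of parsing such a helper performs, assuming the standard IDX layout the MNIST files use (a big-endian integer header followed by raw uint8 bytes). The actual utils.py in the sample folder may differ in its details:

```python
# Hypothetical load_data-style helper: parse a gzipped IDX file into a
# numpy array. IDX files start with a big-endian header of 32-bit ints
# (magic number, item count, and for images the row/column sizes),
# followed by the raw uint8 data.
import gzip
import struct

import numpy as np

def load_idx(path, is_labels):
    with gzip.open(path, 'rb') as f:
        if is_labels:
            magic, n_items = struct.unpack('>II', f.read(8))
            return np.frombuffer(f.read(), dtype=np.uint8)
        magic, n_items, rows, cols = struct.unpack('>IIII', f.read(16))
        # one flattened row per image, matching the shape load_data returns
        return np.frombuffer(f.read(), dtype=np.uint8).reshape(n_items, rows * cols)
```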

# make sure utils.py is in the same directory as this code
from utils import load_data

# note we also shrink the intensity values (X) from 0-255 to 0-1. This helps the model converge faster.
X_train = load_data(os.path.join(data_folder, "train-images-idx3-ubyte.gz"), False) / 255.0
X_test = load_data(os.path.join(data_folder, "t10k-images-idx3-ubyte.gz"), False) / 255.0
y_train = load_data(os.path.join(data_folder, "train-labels-idx1-ubyte.gz"), True).reshape(-1)
y_test = load_data(os.path.join(data_folder, "t10k-labels-idx1-ubyte.gz"), True).reshape(-1)

# now let's show some randomly chosen images from the training set.
count = 0
sample_size = 30
plt.figure(figsize=(16, 6))
for i in np.random.permutation(X_train.shape[0])[:sample_size]:
    count = count + 1
    plt.subplot(1, sample_size, count)
    plt.text(x=10, y=-10, s=y_train[i], fontsize=18)
    plt.imshow(X_train[i].reshape(28, 28), cmap=plt.cm.Greys)

A random sample of images displays:


Now you have an idea of what these images look like and the expected prediction outcome.

Train on a remote cluster

For this task, you submit the job to run on the remote training cluster you set up earlier. To submit a job, you:

  • Create a directory
  • Create a training script
  • Create an estimator object
  • Submit the job

Create a directory

Create a directory to deliver the necessary code from your computer to the remote resource.

script_folder = os.path.join(os.getcwd(), "sklearn-mnist")
os.makedirs(script_folder, exist_ok=True)

Create a training script

To submit the job to the cluster, first create a training script. Run the following code to create the training script called train.py in the directory you just created.

%%writefile $script_folder/train.py

import argparse
import os
import numpy as np
import glob

from sklearn.linear_model import LogisticRegression
from sklearn.externals import joblib

from azureml.core import Run
from utils import load_data

# let user feed in 2 parameters, the dataset to mount or download, and the regularization rate of the logistic regression model
parser = argparse.ArgumentParser()
parser.add_argument('--data-folder', type=str, dest='data_folder', help='data folder mounting point')
parser.add_argument('--regularization', type=float, dest='reg', default=0.01, help='regularization rate')
args = parser.parse_args()

data_folder = args.data_folder
print('Data folder:', data_folder)

# load train and test set into numpy arrays
# note we scale the pixel intensity values to 0-1 (by dividing it with 255.0) so the model can converge faster.
X_train = load_data(glob.glob(os.path.join(data_folder, '**/train-images-idx3-ubyte.gz'), recursive=True)[0], False) / 255.0
X_test = load_data(glob.glob(os.path.join(data_folder, '**/t10k-images-idx3-ubyte.gz'), recursive=True)[0], False) / 255.0
y_train = load_data(glob.glob(os.path.join(data_folder, '**/train-labels-idx1-ubyte.gz'), recursive=True)[0], True).reshape(-1)
y_test = load_data(glob.glob(os.path.join(data_folder, '**/t10k-labels-idx1-ubyte.gz'), recursive=True)[0], True).reshape(-1)
print(X_train.shape, y_train.shape, X_test.shape, y_test.shape, sep = '\n')

# get hold of the current run
run = Run.get_context()

print('Train a logistic regression model with regularization rate of', args.reg)
clf = LogisticRegression(C=1.0/args.reg, solver="liblinear", multi_class="auto", random_state=42)
clf.fit(X_train, y_train)

print('Predict the test set')
y_hat = clf.predict(X_test)

# calculate accuracy on the prediction
acc = np.average(y_hat == y_test)
print('Accuracy is', acc)

run.log('regularization rate', np.float(args.reg))
run.log('accuracy', np.float(acc))

os.makedirs('outputs', exist_ok=True)
# note file saved in the outputs folder is automatically uploaded into experiment record
joblib.dump(value=clf, filename='outputs/sklearn_mnist_model.pkl')

Notice how the script gets data and saves models:

  • The training script reads an argument to find the directory that contains the data. When you submit the job later, you point to the datastore for this argument: parser.add_argument('--data-folder', type=str, dest='data_folder', help='data folder mounting point')

  • The training script saves your model into a directory named outputs. Anything written in this directory is automatically uploaded into your workspace. You access your model from this directory later in the tutorial: joblib.dump(value=clf, filename='outputs/sklearn_mnist_model.pkl')

  • The training script requires the file utils.py to load the dataset correctly. The following code copies utils.py into script_folder so that the file can be accessed along with the training script on the remote resource.

    import shutil
    shutil.copy('utils.py', script_folder)
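The save-and-reload pattern the script depends on can be sketched locally as a stand-alone example. Note that it imports joblib directly, the modern replacement for the sklearn.externals.joblib import the tutorial script uses:

```python
# Round-trip a trained scikit-learn model through a pickle file, the same
# pattern train.py uses with its outputs/ directory. On the remote run,
# everything written to outputs/ is uploaded to the experiment record.
import os

import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression

# fit a tiny model on trivially separable one-dimensional data
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])
clf = LogisticRegression().fit(X, y)

os.makedirs('outputs', exist_ok=True)
joblib.dump(value=clf, filename='outputs/model_demo.pkl')

# reload the file and verify the restored model predicts identically
restored = joblib.load('outputs/model_demo.pkl')
print((restored.predict(X) == clf.predict(X)).all())
```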

Create an estimator

An SKLearn estimator object is used to submit the run. Create your estimator by running the following code to define these items:

  • The name of the estimator object, est.
  • The directory that contains your scripts. All the files in this directory are uploaded to the cluster nodes for execution.
  • The compute target. In this case, you use the Azure Machine Learning compute cluster you created.
  • The training script name, train.py.
  • Parameters required by the training script.

In this tutorial, the target is AmlCompute. All files in the script folder are uploaded to the cluster nodes for the run. The data_folder parameter is set to use the dataset. First, create an environment object that specifies the dependencies required for training.

from azureml.core.environment import Environment
from azureml.core.conda_dependencies import CondaDependencies

env = Environment('my_env')
cd = CondaDependencies.create(pip_packages=['azureml-sdk','scikit-learn','azureml-dataprep[pandas,fuse]>=1.1.14'])
env.python.conda_dependencies = cd

Then create the estimator with the following code.

from azureml.train.sklearn import SKLearn

script_params = {
    '--data-folder': mnist_file_dataset.as_named_input('mnist_opendataset').as_mount(),
    '--regularization': 0.5
}

est = SKLearn(source_directory=script_folder,
              script_params=script_params,
              compute_target=compute_target,
              environment_definition=env,
              entry_script='train.py')

Submit the job to the cluster

Run the experiment by submitting the estimator object:

run = exp.submit(config=est)

Because the call is asynchronous, it returns a Preparing or Running state as soon as the job is started.

Monitor a remote run

In total, the first run takes about 10 minutes. But for subsequent runs, as long as the script dependencies don't change, the same image is reused, so the container startup time is much faster.

Here's what happens while you wait:

  • Image creation: A Docker image is created that matches the Python environment specified by the estimator. The image is uploaded to the workspace. Image creation and uploading takes about five minutes.

    This stage happens once for each Python environment because the container is cached for subsequent runs. During image creation, logs are streamed to the run history. You can monitor the image creation progress by using these logs.

  • Scaling: If the remote cluster requires more nodes to do the run than are currently available, additional nodes are added automatically. Scaling typically takes about five minutes.

  • Running: In this stage, the necessary scripts and files are sent to the compute target. Then datastores are mounted or copied, and the entry_script is run. While the job runs, stdout and the ./logs directory are streamed to the run history. You can monitor the run's progress by using these logs.

  • Post-processing: The ./outputs directory of the run is copied over to the run history in your workspace, so you can access these results.

You can check the progress of a running job in several ways. This tutorial uses a Jupyter widget and the wait_for_completion method.

Jupyter widget

Watch the progress of the run with a Jupyter widget. Like the run submission, the widget is asynchronous and provides live updates every 10 to 15 seconds until the job finishes:

from azureml.widgets import RunDetails
RunDetails(run).show()

The widget looks like the following at the end of training:

Notebook widget

If you need to cancel a run, you can follow these instructions.

Get log results upon completion

Model training and monitoring happen in the background. Wait until the model has finished training before you run more code. Use wait_for_completion to show when model training is finished:

run.wait_for_completion(show_output=False)  # specify True for a verbose log

Display run results

You now have a model trained on a remote cluster. Retrieve the accuracy of the model:

print(run.get_metrics())

The output shows that the remote model has an accuracy of 0.9204:

{'regularization rate': 0.8, 'accuracy': 0.9204}

In the next tutorial, you explore this model in more detail.

Register model

The last step in the training script wrote the file outputs/sklearn_mnist_model.pkl in a directory named outputs on the VM of the cluster where the job ran. outputs is a special directory: all content in it is automatically uploaded to your workspace. This content appears in the run record in the experiment under your workspace, so the model file is now also available in your workspace.

You can see the files associated with that run:

print(run.get_file_names())
Register the model in the workspace, so that you or other collaborators can later query, examine, and deploy this model:

# register model
model = run.register_model(model_name='sklearn_mnist',
                           model_path='outputs/sklearn_mnist_model.pkl')
print(model.name, model.id, model.version, sep='\t')

Clean up resources


The resources you created can be used as prerequisites for other Azure Machine Learning tutorials and how-to articles.

If you don't plan to use the resources you created, delete them so you don't incur any charges:

  1. In the Azure portal, select Resource groups on the far left.

    Delete in the Azure portal

  2. From the list, select the resource group you created.

  3. Select Delete resource group.

  4. Enter the resource group name. Then select Delete.

You can also delete just the Azure Machine Learning Compute cluster. However, autoscale is turned on and the cluster minimum is zero, so this particular resource won't incur additional compute charges when not in use:

# Optionally, delete the Azure Machine Learning Compute cluster
compute_target.delete()

Next steps

In this Azure Machine Learning tutorial, you used Python for the following tasks:

  • Set up your development environment.
  • Access and examine the data.
  • Train multiple models on a remote cluster by using the popular scikit-learn machine learning library.
  • Review training details and register the best model.

You're ready to deploy this registered model by using the instructions in the next part of the tutorial series: