Tutorial: Use your own data (part 4 of 4)

This tutorial shows you how to upload and use your own data to train machine learning models in Azure Machine Learning.

This tutorial is part 4 of a four-part tutorial series in which you learn the fundamentals of Azure Machine Learning and complete jobs-based machine learning tasks in Azure. This tutorial builds on the work you completed in Part 1: Set up, Part 2: Run "Hello World!", and Part 3: Train a model.

In Part 3: Train a model, data was downloaded through the built-in torchvision.datasets.CIFAR10 method in the PyTorch API. In many cases, though, you'll want to use your own data in a remote training run. This article shows the workflow you can use to work with your own data in Azure Machine Learning.

In this tutorial, you:

  • Configure a training script to use data in a local directory.
  • Test the training script locally.
  • Upload data to Azure.
  • Create a control script.
  • Understand the new Azure Machine Learning concepts (passing parameters, datasets, datastores).
  • Submit and run your training script.
  • View your code output in the cloud.

Prerequisites

  • Completion of part 3 of the series.
  • Introductory knowledge of the Python language and machine learning workflows.
  • A local development environment, such as Visual Studio Code, Jupyter, or PyCharm.
  • Python (version 3.5 to 3.7).

Adjust the training script

By now you have your training script (tutorial/src/train.py) running in Azure Machine Learning, and you can monitor the model performance. Let's parameterize the training script by introducing arguments. Using arguments allows you to easily compare different hyperparameters.

Our training script is currently set up to download the CIFAR10 dataset on each run. The following Python code has been adjusted to read the data from a directory instead.

Note

The use of argparse parameterizes the script.

# tutorial/src/train.py
import os
import argparse
import torch
import torch.optim as optim
import torchvision
import torchvision.transforms as transforms

from model import Net
from azureml.core import Run

run = Run.get_context()

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument('--data_path', type=str, help='Path to the training data')
    parser.add_argument('--learning_rate', type=float, default=0.001, help='Learning rate for SGD')
    parser.add_argument('--momentum', type=float, default=0.9, help='Momentum for SGD')
    args = parser.parse_args()
    
    print("===== DATA =====")
    print("DATA PATH: " + args.data_path)
    print("LIST FILES IN DATA PATH...")
    print(os.listdir(args.data_path))
    print("================")
    
    # prepare DataLoader for CIFAR10 data
    transform = transforms.Compose([transforms.ToTensor(), transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))])
    trainset = torchvision.datasets.CIFAR10(
        root=args.data_path,
        train=True,
        download=False,
        transform=transform,
    )
    trainloader = torch.utils.data.DataLoader(trainset, batch_size=4, shuffle=True, num_workers=2)

    # define convolutional network
    net = Net()

    # set up pytorch loss /  optimizer
    criterion = torch.nn.CrossEntropyLoss()
    optimizer = optim.SGD(
        net.parameters(),
        lr=args.learning_rate,
        momentum=args.momentum,
    )

    # train the network
    for epoch in range(2):

        running_loss = 0.0
        for i, data in enumerate(trainloader, 0):
            # unpack the data
            inputs, labels = data

            # zero the parameter gradients
            optimizer.zero_grad()

            # forward + backward + optimize
            outputs = net(inputs)
            loss = criterion(outputs, labels)
            loss.backward()
            optimizer.step()

            # print statistics
            running_loss += loss.item()
            if i % 2000 == 1999:
                loss = running_loss / 2000
                run.log('loss', loss) # log loss metric to AML
                print(f'epoch={epoch + 1}, batch={i + 1:5}: loss {loss:.2f}')
                running_loss = 0.0

    print('Finished Training')

Understanding the code changes

The code in train.py uses the argparse library to set up data_path, learning_rate, and momentum.

# .... other code
parser = argparse.ArgumentParser()
parser.add_argument('--data_path', type=str, help='Path to the training data')
parser.add_argument('--learning_rate', type=float, default=0.001, help='Learning rate for SGD')
parser.add_argument('--momentum', type=float, default=0.9, help='Momentum for SGD')
args = parser.parse_args()
# ... other code

Also, the train.py script was adapted to update the optimizer to use the user-defined parameters:

optimizer = optim.SGD(
    net.parameters(),
    lr=args.learning_rate,     # get learning rate from command-line argument
    momentum=args.momentum,    # get momentum from command-line argument
)

Test the script locally

Your script now accepts the data path as an argument. To start with, test it locally. Add a folder called data to your tutorial directory structure. Your directory structure should look like:

tutorial
└──.azureml
|  └──config.json
|  └──pytorch-env.yml
└──data
└──src
|  └──hello.py
|  └──model.py
|  └──train.py
└──01-create-workspace.py
└──02-create-compute.py
└──03-run-hello.py
└──04-run-pytorch.py

If you didn't run train.py locally in the previous tutorial, you won't have the data/ directory. In that case, run the torchvision.datasets.CIFAR10 method locally with download=True in your train.py script to download the data.
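If you'd rather do that download as a separate one-off step, here's a minimal sketch (this helper script is an assumption, not one of the tutorial's files; it uses the same ./data folder shown in the directory structure above):

# download-cifar10.py (hypothetical one-off helper, not part of the tutorial scripts)
import torchvision

# With download=True, this creates ./data/cifar-10-batches-py locally.
torchvision.datasets.CIFAR10(root='./data', train=True, download=True)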

To run the modified training script locally, call:

python src/train.py --data_path ./data --learning_rate 0.003 --momentum 0.92

Passing in a local path to the data means you avoid having to download the CIFAR10 dataset each time. You can also experiment with different values for the learning rate and momentum hyperparameters without having to hard-code them in the training script.
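For example, a second local run that tries different hyperparameter values (the numbers below are arbitrary) only changes the command line, not the script:

python src/train.py --data_path ./data --learning_rate 0.01 --momentum 0.9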

Upload the data to Azure

To run this script in Azure Machine Learning, you need to make your training data available in Azure. Your Azure Machine Learning workspace comes equipped with a default datastore: an Azure Blob Storage account where you can store your training data.

Note

Azure Machine Learning allows you to connect other cloud-based datastores that store your data. For more details, see the datastores documentation.

Create a new Python control script called 05-upload-data.py in the tutorial directory:

# tutorial/05-upload-data.py
from azureml.core import Workspace
ws = Workspace.from_config()
datastore = ws.get_default_datastore()
datastore.upload(src_dir='./data', target_path='datasets/cifar10', overwrite=True)

The target_path value specifies the path on the datastore to which the CIFAR10 data will be uploaded.
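If you want to sanity-check the upload from Python, one possible approach (a sketch, not a required step of this tutorial) is to create a FileDataset over the same datastore path and list the files it resolves to:

# verify-upload.py (hypothetical check, assuming the 'datasets/cifar10' path used above)
from azureml.core import Workspace, Dataset

ws = Workspace.from_config()
datastore = ws.get_default_datastore()

dataset = Dataset.File.from_files(path=(datastore, 'datasets/cifar10'))
# to_path() lists the relative paths of the files that back the dataset
print(dataset.to_path())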

Tip

While you're using Azure Machine Learning to upload the data, you can use Azure Storage Explorer to upload ad hoc files. If you need an ETL tool, you can use Azure Data Factory to ingest your data into Azure.

Run the Python file to upload the data. (The upload should take less than 60 seconds.)

python 05-upload-data.py

You should see the following standard output:

Uploading ./data\cifar-10-batches-py\data_batch_2
Uploaded ./data\cifar-10-batches-py\data_batch_2, 4 files out of an estimated total of 9
.
.
Uploading ./data\cifar-10-batches-py\data_batch_5
Uploaded ./data\cifar-10-batches-py\data_batch_5, 9 files out of an estimated total of 9
Uploaded 9 files

Create a control script

As you've done previously, create a new Python control script called 06-run-pytorch-data.py:

# tutorial/06-run-pytorch-data.py
from azureml.core import Workspace
from azureml.core import Experiment
from azureml.core import Environment
from azureml.core import ScriptRunConfig
from azureml.core import Dataset

if __name__ == "__main__":
    ws = Workspace.from_config()
    
    datastore = ws.get_default_datastore()
    dataset = Dataset.File.from_files(path=(datastore, 'datasets/cifar10'))

    experiment = Experiment(workspace=ws, name='day1-experiment-data')

    config = ScriptRunConfig(
        source_directory='./src',
        script='train.py',
        compute_target='cpu-cluster',
        arguments=[
            '--data_path', dataset.as_named_input('input').as_mount(),
            '--learning_rate', 0.003,
            '--momentum', 0.92],
        )
    
    # set up pytorch environment
    env = Environment.from_conda_specification(name='pytorch-env', file_path='.azureml/pytorch-env.yml')
    config.run_config.environment = env

    run = experiment.submit(config)
    aml_url = run.get_portal_url()
    print("Submitted to an Azure Machine Learning compute cluster. Click on the link below")
    print("")
    print(aml_url)

Understand the code changes

The control script is similar to the one from part 3 of this series, with the following new lines:

dataset = Dataset.File.from_files( ... )

A dataset is used to reference the data you uploaded to Azure Blob Storage. Datasets are an abstraction layer on top of your data, designed to improve reliability and trustworthiness.
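Registering the dataset is optional for this tutorial, but if you want to reuse it by name in later runs, a possible sketch looks like this (the name cifar10-files is an example, not something the tutorial defines):

# Optional: register the dataset so later scripts can look it up by name.
dataset = dataset.register(workspace=ws, name='cifar10-files', create_new_version=True)

# A later script could then retrieve it with:
# dataset = Dataset.get_by_name(ws, 'cifar10-files')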

config = ScriptRunConfig(...)

ScriptRunConfig is modified to include a list of arguments that will be passed into train.py. The dataset.as_named_input('input').as_mount() argument means that the specified directory will be mounted to the compute target.
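Mounting is what this tutorial uses. If you prefer to copy the files onto the compute target instead, the SDK also offers as_download(); the following is a sketch of that variation (not the configuration used in this tutorial):

# Variation: download the dataset to the compute target instead of mounting it.
from azureml.core import Workspace, Dataset, ScriptRunConfig

ws = Workspace.from_config()
datastore = ws.get_default_datastore()
dataset = Dataset.File.from_files(path=(datastore, 'datasets/cifar10'))

config = ScriptRunConfig(
    source_directory='./src',
    script='train.py',
    compute_target='cpu-cluster',
    arguments=[
        '--data_path', dataset.as_named_input('input').as_download(),
        '--learning_rate', 0.003,
        '--momentum', 0.92],
)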

Submit the run to Azure Machine Learning

Now resubmit the run to use the new configuration:

python 06-run-pytorch-data.py

This code will print a URL to the experiment in the Azure Machine Learning studio. If you go to that link, you'll be able to see your code running.

Inspect the log file

In the studio, go to the experiment run (by selecting the URL output from the previous step) and then select Outputs + logs. Select the 70_driver_log.txt file. You should see the following output:

Processing 'input'.
Processing dataset FileDataset
{
  "source": [
    "('workspaceblobstore', 'datasets/cifar10')"
  ],
  "definition": [
    "GetDatastoreFiles"
  ],
  "registration": {
    "id": "XXXXX",
    "name": null,
    "version": null,
    "workspace": "Workspace.create(name='XXXX', subscription_id='XXXX', resource_group='X')"
  }
}
Mounting input to /tmp/tmp9kituvp3.
Mounted input to /tmp/tmp9kituvp3 as folder.
Exit __enter__ of DatasetContextManager
Entering Run History Context Manager.
Current directory:  /mnt/batch/tasks/shared/LS_root/jobs/dsvm-aml/azureml/tutorial-session-3_1600171983_763c5381/mounts/workspaceblobstore/azureml/tutorial-session-3_1600171983_763c5381
Preparing to call script [ train.py ] with arguments: ['--data_path', '$input', '--learning_rate', '0.003', '--momentum', '0.92']
After variable expansion, calling script [ train.py ] with arguments: ['--data_path', '/tmp/tmp9kituvp3', '--learning_rate', '0.003', '--momentum', '0.92']

Script type = None
===== DATA =====
DATA PATH: /tmp/tmp9kituvp3
LIST FILES IN DATA PATH...
['cifar-10-batches-py', 'cifar-10-python.tar.gz']

Notice:

  • Azure Machine Learning mounted Blob Storage to the compute cluster automatically for you.
  • The dataset.as_named_input('input').as_mount() used in the control script resolves to the mount point. (An optional way to read this path from inside train.py is sketched below.)
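Because the dataset was passed with as_named_input('input'), the run context inside the training script also exposes the resolved path. The following is an optional sketch (not used by this tutorial's train.py) that assumes the input name 'input' from the control script:

# Inside train.py, an alternative way to obtain the mounted data path:
from azureml.core import Run

run = Run.get_context()
# 'input' matches the name passed to as_named_input() in the control script.
data_path = run.input_datasets['input']
print(data_path)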

Clean up resources

Important

The resources that you created can be used as prerequisites to other Azure Machine Learning tutorials and how-to articles.

If you don't plan to use the resources that you created, delete them so you don't incur any charges:

  1. In the Azure portal, select Resource groups on the far left.

    Delete in the Azure portal

  2. From the list, select the resource group that you created.

  3. Select Delete resource group.

  4. Enter the resource group name. Then select Delete.

You can also keep the resource group but delete a single workspace. Display the workspace properties and select Delete.

Next steps

In this tutorial, we saw how to upload data to Azure by using Datastore. The datastore served as cloud storage for your workspace, giving you a persistent and flexible place to keep your data.

You saw how to modify your training script to accept a data path via the command line. By using Dataset, you were able to mount a directory to the remote run.

Now that you have a model, learn: