Train models with Azure Machine Learning datasets

In this article, you learn how to work with Azure Machine Learning datasets to train machine learning models. You can use datasets in your local or remote compute target without worrying about connection strings or data paths.

Azure Machine Learning datasets provide seamless integration with Azure Machine Learning training functionality like ScriptRunConfig, HyperDrive, and Azure Machine Learning pipelines.

If you are not ready to make your data available for model training, but want to load your data into your notebook for data exploration, see how to explore the data in your dataset.

Prerequisites

To create and train with datasets, you need:

Note

Some Dataset classes have dependencies on the azureml-dataprep package. For Linux users, these classes are supported only on the following distributions: Red Hat Enterprise Linux, Ubuntu, Fedora, and CentOS.

Consume datasets in machine learning training scripts

If you have structured data not yet registered as a dataset, create a TabularDataset and use it directly in your training script for your local or remote experiment.

In this example, you create an unregistered TabularDataset and specify it as a script argument in the ScriptRunConfig object for training. If you want to reuse this TabularDataset with other experiments in your workspace, see how to register datasets to your workspace.

Create a TabularDataset

The following code creates an unregistered TabularDataset from a web URL.

from azureml.core.dataset import Dataset

web_path = 'https://dprepdata.blob.core.windows.net/demo/Titanic.csv'
titanic_ds = Dataset.Tabular.from_delimited_files(path=web_path)

TabularDataset objects provide the ability to load the data in your TabularDataset into a pandas or Spark DataFrame so that you can work with familiar data preparation and training libraries without having to leave your notebook.
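
For example, a minimal sketch, assuming the titanic_ds object created above and an existing Workspace object ws (not defined in this article), that loads the data into a pandas DataFrame and optionally registers the dataset for reuse:

# assumes `titanic_ds` from the snippet above and a Workspace object `ws`
df = titanic_ds.to_pandas_dataframe()   # load into pandas for exploration
print(df.head())

# optionally register the dataset so it can be reused across experiments in the workspace
titanic_ds = titanic_ds.register(workspace=ws,
                                 name='titanic_ds',
                                 description='Titanic passenger data',
                                 create_new_version=True)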

Access dataset in training script

The following code configures a script argument --input-data that you will specify when you configure your training run (see the next section). When the tabular dataset is passed in as the argument value, Azure ML resolves that to the ID of the dataset, which you can then use to access the dataset in your training script (without having to hardcode the name or ID of the dataset in your script). The script then uses the to_pandas_dataframe() method to load that dataset into a pandas dataframe for further data exploration and preparation prior to training.

Note

If your original data source contains NaN, empty strings, or blank values, those values are replaced with Null when you use to_pandas_dataframe().

If you need to load the prepared data into a new dataset from an in-memory pandas dataframe, write the data to a local file, like a parquet file, and create a new dataset from that file. Learn more about how to create datasets.
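
A minimal sketch of that pattern, assuming an in-memory DataFrame df and a Workspace object ws; the folder, file, and dataset paths here are illustrative:

import os
from azureml.core import Dataset

# write the prepared DataFrame to a local parquet file (illustrative names)
os.makedirs('prepared_data', exist_ok=True)
df.to_parquet('prepared_data/titanic_prepared.parquet')

# upload the file to the workspace's default datastore
datastore = ws.get_default_datastore()
datastore.upload(src_dir='prepared_data', target_path='prepared_data', overwrite=True)

# create a new TabularDataset from the uploaded parquet file
prepared_ds = Dataset.Tabular.from_parquet_files(
    path=(datastore, 'prepared_data/titanic_prepared.parquet'))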

%%writefile $script_folder/train_titanic.py

import argparse
from azureml.core import Dataset, Run

parser = argparse.ArgumentParser()
parser.add_argument("--input-data", type=str)
args = parser.parse_args()

run = Run.get_context()
ws = run.experiment.workspace

# get the input dataset by ID
dataset = Dataset.get_by_id(ws, id=args.input_data)

# load the TabularDataset to pandas DataFrame
df = dataset.to_pandas_dataframe()

Configure the training run

A ScriptRunConfig object is used to configure and submit the training run.

This code creates a ScriptRunConfig object, src, that specifies:

  • A script directory for your scripts. All the files in this directory are uploaded to the cluster nodes for execution.
  • The training script, train_titanic.py.
  • The input dataset for training, titanic_ds, as a script argument. Azure ML resolves this to the corresponding ID of the dataset when it is passed to your script.
  • The compute target for the run.
  • The environment for the run.

from azureml.core import ScriptRunConfig

src = ScriptRunConfig(source_directory=script_folder,
                      script='train_titanic.py',
                      # pass dataset as an input with friendly name 'titanic'
                      arguments=['--input-data', titanic_ds.as_named_input('titanic')],
                      compute_target=compute_target,
                      environment=myenv)
                             
# Submit the run configuration for your training run
run = experiment.submit(src)
run.wait_for_completion(show_output=True)                             

Mount files to remote compute targets

If you have unstructured data, create a FileDataset and either mount or download your data files to make them available to your remote compute target for training. Learn about when to use mount vs. download for your remote training experiments.

The following example creates a FileDataset and mounts the dataset to the compute target by passing it as an argument to the training script.

Note

If you are using a custom Docker base image, you will need to install fuse via apt-get install -y fuse as a dependency for dataset mount to work. Learn how to build a custom image.

Create a FileDataset

The following example creates an unregistered FileDataset from web URLs. Learn more about how to create datasets from other sources.

from azureml.core.dataset import Dataset

web_paths = [
            'http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz',
            'http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz',
            'http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz',
            'http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz'
            ]
mnist_ds = Dataset.File.from_files(path = web_paths)

Configure the training run

We recommend passing the dataset as an argument when mounting via the arguments parameter of the ScriptRunConfig constructor. By doing so, you get the data path (mount point) in your training script via arguments. This way, you can use the same training script for local debugging and remote training on any cloud platform.

The following example creates a ScriptRunConfig that passes in the FileDataset via arguments. After you submit the run, the data files referenced by the dataset mnist_ds will be mounted to the compute target.

from azureml.core import ScriptRunConfig

src = ScriptRunConfig(source_directory=script_folder,
                      script='train_mnist.py',
                      # the dataset will be mounted on the remote compute and the mounted path passed as an argument to the script
                      arguments=['--data-folder', mnist_ds.as_mount(), '--regularization', 0.5],
                      compute_target=compute_target,
                      environment=myenv)

# Submit the run configuration for your training run
run = experiment.submit(src)
run.wait_for_completion(show_output=True)

Retrieve data in your training script

The following code shows how to retrieve the data in your script.

%%writefile $script_folder/train_mnist.py

import argparse
import os
import numpy as np
import glob

from utils import load_data

# retrieve the 2 arguments configured through `arguments` in the ScriptRunConfig
parser = argparse.ArgumentParser()
parser.add_argument('--data-folder', type=str, dest='data_folder', help='data folder mounting point')
parser.add_argument('--regularization', type=float, dest='reg', default=0.01, help='regularization rate')
args = parser.parse_args()

data_folder = args.data_folder
print('Data folder:', data_folder)

# get the file paths on the compute
X_train_path = glob.glob(os.path.join(data_folder, '**/train-images-idx3-ubyte.gz'), recursive=True)[0]
X_test_path = glob.glob(os.path.join(data_folder, '**/t10k-images-idx3-ubyte.gz'), recursive=True)[0]
y_train_path = glob.glob(os.path.join(data_folder, '**/train-labels-idx1-ubyte.gz'), recursive=True)[0]
y_test_path = glob.glob(os.path.join(data_folder, '**/t10k-labels-idx1-ubyte.gz'), recursive=True)[0]

# load train and test set into numpy arrays
X_train = load_data(X_train_path, False) / 255.0
X_test = load_data(X_test_path, False) / 255.0
y_train = load_data(y_train_path, True).reshape(-1)
y_test = load_data(y_test_path, True).reshape(-1)

Mount vs download

Mounting or downloading files of any format is supported for datasets created from Azure Blob storage, Azure Files, Azure Data Lake Storage Gen1, Azure Data Lake Storage Gen2, Azure SQL Database, and Azure Database for PostgreSQL.

When you mount a dataset, you attach the files referenced by the dataset to a directory (mount point) and make it available on the compute target. Mounting is supported for Linux-based computes, including Azure Machine Learning Compute, virtual machines, and HDInsight.

When you download a dataset, all the files referenced by the dataset are downloaded to the compute target. Downloading is supported for all compute types.

If your script processes all files referenced by the dataset and your compute disk can fit the full dataset, downloading is recommended to avoid the overhead of streaming data from storage services. If your data size exceeds the compute disk size, downloading is not possible. For this scenario, we recommend mounting, since only the data files used by your script are loaded at the time of processing.

The following code mounts the dataset to a temporary directory at mounted_path.

import tempfile
mounted_path = tempfile.mkdtemp()

# mount dataset onto the mounted_path of a Linux-based compute
mount_context = dataset.mount(mounted_path)

mount_context.start()

import os
print(os.listdir(mounted_path))
print (mounted_path)
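
For comparison, a sketch that downloads instead of mounting, assuming the same dataset object; the target directory is created just for illustration:

import os
import tempfile

download_path = tempfile.mkdtemp()

# download all files referenced by the dataset to the compute's local disk
dataset.download(target_path=download_path, overwrite=True)

print(os.listdir(download_path))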

Get datasets in machine learning scripts

Registered datasets are accessible both locally and remotely on compute clusters like Azure Machine Learning compute. To access your registered dataset across experiments, use the following code to access your workspace and get the dataset that was used in your previously submitted run. By default, the get_by_name() method on the Dataset class returns the latest version of the dataset that's registered with the workspace.

%%writefile $script_folder/train.py

from azureml.core import Dataset, Run

run = Run.get_context()
workspace = run.experiment.workspace

dataset_name = 'titanic_ds'

# Get a dataset by name
titanic_ds = Dataset.get_by_name(workspace=workspace, name=dataset_name)

# Load a TabularDataset into pandas DataFrame
df = titanic_ds.to_pandas_dataframe()
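
If a run needs to be pinned to a specific registered version rather than the latest, get_by_name() also accepts a version argument; the version number below is illustrative:

# pin to a specific dataset version instead of the latest (version number is illustrative)
titanic_ds_v1 = Dataset.get_by_name(workspace=workspace, name=dataset_name, version=1)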

Access source code during training

Azure Blob storage has higher throughput speeds than an Azure file share and will scale to large numbers of jobs started in parallel. For this reason, we recommend configuring your runs to use Blob storage for transferring source code files.

The following code example specifies in the run configuration which blob datastore to use for source code transfers.

# workspaceblobstore is the default blob storage
src.run_config.source_directory_data_store = "workspaceblobstore" 

Notebook examples

Troubleshooting

  • Dataset initialization failed: Waiting for mount point to be ready has timed out:
    • If you don't have any outbound network security group rules and are using azureml-sdk>=1.12.0, update azureml-dataset-runtime and its dependencies to the latest for the specific minor version, or if you are using it in a run, recreate your environment so it can have the latest patch with the fix.
    • If you are using azureml-sdk<1.12.0, upgrade to the latest version.
    • If you have outbound NSG rules, make sure there is an outbound rule that allows all traffic for the service tag AzureResourceMonitor.

Overloaded AzureFile storage

If you receive the error Unable to upload project files to working directory in AzureFile because the storage is overloaded, apply the following workarounds.

If you are using the file share for other workloads, such as data transfer, the recommendation is to use blobs so that the file share is free to be used for submitting runs. You may also split the workload between two different workspaces.

Passing data as input

  • TypeError: FileNotFound: No such file or directory: This error occurs if the file path you provide isn't where the file is located. You need to make sure the way you refer to the file is consistent with where you mounted your dataset on your compute target. To ensure a deterministic state, we recommend using the abstract path when mounting a dataset to a compute target. For example, in the following code we mount the dataset under the root of the filesystem of the compute target, /tmp.

    # Note the leading / in '/tmp/dataset'
    script_params = {
        '--data-folder': dset.as_named_input('dogscats_train').as_mount('/tmp/dataset'),
    } 
    

    If you don't include the leading forward slash, '/', you'll need to prefix the working directory, e.g. /mnt/batch/.../tmp/dataset, on the compute target to indicate where you want the dataset to be mounted.

Next steps