Create Azure Machine Learning datasets

In this article, you learn how to create Azure Machine Learning datasets to access data for your local or remote experiments with the Azure Machine Learning Python SDK. To understand where datasets fit in Azure Machine Learning's overall data access workflow, see the Securely access data article.

By creating a dataset, you create a reference to the data source location, along with a copy of its metadata. Because the data remains in its existing location, you incur no extra storage cost and don't risk the integrity of your data sources. Datasets are also lazily evaluated, which improves workflow performance.

With Azure Machine Learning datasets, you can:

  • Keep a single copy of data in your storage, referenced by datasets.

  • Seamlessly access data during model training without worrying about connection strings or data paths. Learn more about how to train with datasets.

  • Share data and collaborate with other users.

Prerequisites

To create and work with datasets, you need:

Note

Some dataset classes have dependencies on the azureml-dataprep package, which is only compatible with 64-bit Python. For Linux users, these classes are supported only on the following distributions: Red Hat Enterprise Linux (7, 8), Ubuntu (14.04, 16.04, 18.04), Fedora (27, 28), Debian (8, 9), and CentOS (7). If you're using an unsupported distribution, follow this guide to install .NET Core 2.1 to proceed.

Compute size guidance

When creating a dataset, review your compute processing power and the size of your data in memory. The size of your data in storage is not the same as the size of data in a dataframe. For example, data in CSV files can expand up to 10x in a dataframe, so a 1 GB CSV file can become 10 GB in a dataframe.

If your data is compressed, it can expand further; 20 GB of relatively sparse data stored in compressed Parquet format can expand to ~800 GB in memory. Because Parquet files store data in a columnar format, if you only need half of the columns, you only need to load ~400 GB in memory.
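As a rough local illustration of this effect (not Azure-specific; the column names and sample values here are made up), you can compare a dataframe's serialized CSV size with its in-memory footprint using pandas:

```python
import pandas as pd

# build a small sample dataframe with a repeated string column
df = pd.DataFrame({"city": ["Seattle"] * 1000, "temp": range(1000)})

# size of the data serialized as CSV (roughly what you'd see in storage)
on_disk = len(df.to_csv(index=False).encode("utf-8"))

# size of the same data held in memory, counting the Python string objects
in_memory = int(df.memory_usage(deep=True).sum())

print(f"CSV: {on_disk} bytes, in-memory: {in_memory} bytes")
```

String columns stored as Python objects are the usual source of the blow-up; numeric columns stay close to their on-disk size.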

Learn more about optimizing data processing in Azure Machine Learning.

Dataset types

There are two dataset types, based on how users consume them in training: FileDataset and TabularDataset. Both types can be used in Azure Machine Learning training workflows involving estimators, AutoML, HyperDrive, and pipelines.

FileDataset

A FileDataset references single or multiple files in your datastores or public URLs. If your data is already cleansed and ready to use in training experiments, you can download or mount the files to your compute as a FileDataset object.

We recommend FileDatasets for your machine learning workflows, since the source files can be in any format, which enables a wider range of machine learning scenarios, including deep learning.

Create a FileDataset with the Python SDK.

TabularDataset

A TabularDataset represents data in a tabular format by parsing the provided file or list of files. This lets you materialize the data into a pandas or Spark DataFrame so you can work with familiar data preparation and training libraries without leaving your notebook. You can create a TabularDataset object from .csv, .tsv, .parquet, and .jsonl files, and from SQL query results.

With TabularDatasets, you can specify a time stamp from a column in the data, or from the path pattern the data is stored under, to enable a time series trait. This specification allows for easy and efficient filtering by time. For an example, see the Tabular time series-related API demo with NOAA weather data.
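As a sketch of how the trait is enabled (assuming a registered datastore `datastore` holding CSVs with a `datetime` column — both names hypothetical), the timestamp column is declared with with_timestamp_columns(), after which the time-filtering methods become available:

```python
from datetime import datetime
from azureml.core import Dataset

# assumes `datastore` is an existing registered Datastore object
weather_ds = Dataset.Tabular.from_delimited_files(
    path=[(datastore, 'weather/**/*.csv')])

# declare which column carries the timestamp to enable the time series trait
weather_ds = weather_ds.with_timestamp_columns(timestamp='datetime')

# filter efficiently by time, e.g. keep only rows before 2019
pre_2019 = weather_ds.time_before(datetime(2019, 1, 1))
```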

Create a TabularDataset with the Python SDK.

Note

AutoML workflows generated via the Azure Machine Learning studio currently only support TabularDatasets.

Access datasets in a virtual network

If your workspace is in a virtual network, you must configure the dataset to skip validation. For more information on how to use datastores and datasets in a virtual network, see Secure a workspace and associated resources.
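For example, a sketch of skipping validation when creating a TabularDataset from a datastore behind a virtual network (the datastore and path here are hypothetical):

```python
from azureml.core import Dataset

# `datastore` is assumed to be a registered datastore that sits behind a virtual network;
# validate=False skips the data-access check that would otherwise fail from outside the network
secure_ds = Dataset.Tabular.from_delimited_files(
    path=(datastore, 'secure/data.csv'),
    validate=False)
```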

Create datasets

For the data to be accessible by Azure Machine Learning, datasets must be created from paths in Azure datastores or public web URLs.

To create datasets from an Azure datastore with the Python SDK:

  1. Verify that you have contributor or owner access to the registered Azure datastore.

  2. Create the dataset by referencing paths in the datastore. You can create a dataset from multiple paths in multiple datastores. There is no hard limit on the number of files or data size that you can create a dataset from.

Note

For each data path, a few requests are sent to the storage service to check whether it points to a file or a folder. This overhead may lead to degraded performance or failure. A dataset referencing one folder with 1,000 files inside is considered to reference one data path. For optimal performance, we recommend creating datasets that reference fewer than 100 paths in datastores.

Create a FileDataset

Use the from_files() method on the FileDatasetFactory class to load files in any format and to create an unregistered FileDataset.

If your storage is behind a virtual network or firewall, set the parameter validate=False in your from_files() method. This bypasses the initial validation step and ensures that you can create your dataset from these secure files. Learn more about how to use datastores and datasets in a virtual network.

from azureml.core import Dataset

# create a FileDataset pointing to files in 'animals' folder and its subfolders recursively
# (assumes `datastore` is an existing registered Datastore object)
datastore_paths = [(datastore, 'animals')]
animal_ds = Dataset.File.from_files(path=datastore_paths)

# create a FileDataset from image and label files behind public web urls
web_paths = ['https://azureopendatastorage.blob.core.windows.net/mnist/train-images-idx3-ubyte.gz',
             'https://azureopendatastorage.blob.core.windows.net/mnist/train-labels-idx1-ubyte.gz']
mnist_ds = Dataset.File.from_files(path=web_paths)

To reuse and share datasets across experiments in your workspace, register your dataset.

Tip

Upload files from a local directory and create a FileDataset in a single call with the public preview method upload_directory(). This method is an experimental preview feature, and may change at any time.

This method uploads data to your underlying storage, and as a result incurs storage costs.
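A sketch of what that call can look like (the local folder and target path are hypothetical):

```python
from azureml.core import Workspace, Dataset

workspace = Workspace.from_config()
datastore = workspace.get_default_datastore()

# upload a local folder to the datastore and get a FileDataset back in one call
# (experimental preview API; behavior may change)
animal_ds = Dataset.File.upload_directory(
    src_dir='./animals',              # hypothetical local folder
    target=(datastore, 'animals'))    # destination path in the datastore
```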

Create a TabularDataset

Use the from_delimited_files() method on the TabularDatasetFactory class to read files in .csv or .tsv format and to create an unregistered TabularDataset. If you're reading from multiple files, results are aggregated into one tabular representation.

If your storage is behind a virtual network or firewall, set the parameter validate=False in your from_delimited_files() method. This bypasses the initial validation step and ensures that you can create your dataset from these secure files. Learn more about how to use datastores and datasets in a virtual network.

The following code gets the existing workspace and the desired datastore by name, and then passes the datastore and file locations to the path parameter to create a new TabularDataset, weather_ds.

from azureml.core import Workspace, Datastore, Dataset

datastore_name = 'your datastore name'

# get existing workspace
workspace = Workspace.from_config()
    
# retrieve an existing datastore in the workspace by name
datastore = Datastore.get(workspace, datastore_name)

# create a TabularDataset from 3 file paths in datastore
datastore_paths = [(datastore, 'weather/2018/11.csv'),
                   (datastore, 'weather/2018/12.csv'),
                   (datastore, 'weather/2019/*.csv')]
weather_ds = Dataset.Tabular.from_delimited_files(path=datastore_paths)

Set data schema

By default, when you create a TabularDataset, column data types are inferred automatically. If the inferred types don't match your expectations, you can update your dataset schema by specifying column types with the following code. The parameter infer_column_type is only applicable for datasets created from delimited files. Learn more about supported data types.

from azureml.core import Dataset
from azureml.data.dataset_factory import DataType

# create a TabularDataset from a delimited file behind a public web url and convert column "Survived" to boolean
web_path ='https://dprepdata.blob.core.windows.net/demo/Titanic.csv'
titanic_ds = Dataset.Tabular.from_delimited_files(path=web_path, set_column_types={'Survived': DataType.to_bool()})

# preview the first 3 rows of titanic_ds
titanic_ds.take(3).to_pandas_dataframe()
(Index)  PassengerId  Survived  Pclass  Name                                                Sex     Age   SibSp  Parch  Ticket            Fare     Cabin  Embarked
0        1            False     3       Braund, Mr. Owen Harris                             male    22.0  1      0      A/5 21171         7.2500          S
1        2            True      1       Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0  1      0      PC 17599          71.2833  C85    C
2        3            True      3       Heikkinen, Miss. Laina                              female  26.0  0      0      STON/O2. 3101282  7.9250          S

To reuse and share datasets across experiments in your workspace, register your dataset.

Explore data

After you create and register your dataset, you can load it into your notebook for data exploration prior to model training. If you don't need to do any data exploration, see how to consume datasets in your training scripts for submitting ML experiments in Train with datasets.

For FileDatasets, you can either mount or download your dataset and apply the Python libraries you'd normally use for data exploration. Learn more about mount vs. download.

# download the dataset 
dataset.download(target_path='.', overwrite=False) 

# mount dataset to the temp directory at `mounted_path`

import tempfile
mounted_path = tempfile.mkdtemp()
mount_context = dataset.mount(mounted_path)

mount_context.start()

For TabularDatasets, use the to_pandas_dataframe() method to view your data in a dataframe.

# preview the first 3 rows of titanic_ds
titanic_ds.take(3).to_pandas_dataframe()
(Index)  PassengerId  Survived  Pclass  Name                                                Sex     Age   SibSp  Parch  Ticket            Fare     Cabin  Embarked
0        1            False     3       Braund, Mr. Owen Harris                             male    22.0  1      0      A/5 21171         7.2500          S
1        2            True      1       Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0  1      0      PC 17599          71.2833  C85    C
2        3            True      3       Heikkinen, Miss. Laina                              female  26.0  0      0      STON/O2. 3101282  7.9250          S

Create a dataset from a pandas dataframe

To create a TabularDataset from an in-memory pandas dataframe, write the data to a local file, like a CSV, and create your dataset from that file. The following code demonstrates this workflow.

# azureml-core of version 1.0.72 or higher is required
# azureml-dataprep[pandas] of version 1.1.34 or higher is required

from azureml.core import Workspace, Dataset

# `dataframe` is assumed to be an in-memory pandas DataFrame prepared earlier;
# the local 'data' folder must exist before writing
local_path = 'data/prepared.csv'
dataframe.to_csv(local_path)

# upload the local file to a datastore on the cloud

subscription_id = 'xxxxxxxxxxxxxxxxxxxxx'
resource_group = 'xxxxxx'
workspace_name = 'xxxxxxxxxxxxxxxx'

workspace = Workspace(subscription_id, resource_group, workspace_name)

# get the datastore to upload prepared data
datastore = workspace.get_default_datastore()

# upload the local file from src_dir to the target_path in datastore
datastore.upload(src_dir='data', target_path='data')
# create a dataset referencing the cloud location
dataset = Dataset.Tabular.from_delimited_files(path = [(datastore, ('data/prepared.csv'))])

Tip

Create and register a TabularDataset from an in-memory Spark or pandas dataframe in a single call with the public preview methods register_spark_dataframe() and register_pandas_dataframe(). These register methods are experimental preview features, and may change at any time.

These methods upload data to your underlying storage, and as a result incur storage costs.
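A sketch of the pandas variant (assuming `dataframe` is an in-memory pandas DataFrame; the target folder and dataset name are hypothetical):

```python
from azureml.core import Workspace, Dataset

workspace = Workspace.from_config()
datastore = workspace.get_default_datastore()

# upload the dataframe to the datastore and register it as a TabularDataset in one call
# (experimental preview API; behavior may change)
titanic_ds = Dataset.Tabular.register_pandas_dataframe(
    dataframe=dataframe,
    target=(datastore, 'prepared-data'),  # folder in the datastore to upload to
    name='titanic_prepared')
```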

Register datasets

To complete the creation process, register your datasets with a workspace. Use the register() method to register datasets with your workspace in order to share them with others and reuse them across experiments in your workspace:

titanic_ds = titanic_ds.register(workspace=workspace,
                                 name='titanic_ds',
                                 description='titanic training data')
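Once registered, the dataset can be retrieved by name from any script or notebook with access to the workspace, for example:

```python
from azureml.core import Dataset, Workspace

workspace = Workspace.from_config()

# retrieve the latest version of the registered dataset
titanic_ds = Dataset.get_by_name(workspace, name='titanic_ds')

# or pin a specific version for reproducibility
titanic_v1 = Dataset.get_by_name(workspace, name='titanic_ds', version=1)
```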

Create datasets using Azure Resource Manager

There are a number of templates at https://github.com/Azure/azure-quickstart-templates/tree/master/101-machine-learning-dataset-create-* that can be used to create datasets.

For information on using these templates, see Use an Azure Resource Manager template to create a workspace for Azure Machine Learning.

Train with datasets

Use your datasets in your machine learning experiments for training ML models. Learn more about how to train with datasets.

Version datasets

You can register a new dataset under the same name by creating a new version. A dataset version is a way to bookmark the state of your data so that you can apply a specific version of the dataset for experimentation or future reproduction. Learn more about dataset versions.

# create a TabularDataset from Titanic training data
web_paths = ['https://dprepdata.blob.core.windows.net/demo/Titanic.csv',
             'https://dprepdata.blob.core.windows.net/demo/Titanic2.csv']
titanic_ds = Dataset.Tabular.from_delimited_files(path=web_paths)

# create a new version of titanic_ds
titanic_ds = titanic_ds.register(workspace = workspace,
                                 name = 'titanic_ds',
                                 description = 'new titanic training data',
                                 create_new_version = True)

Next steps