Create and explore Azure Machine Learning datasets with labels

In this article, you'll learn how to export the data labels from an Azure Machine Learning data labeling project and load them into popular formats, such as a pandas dataframe for data exploration or a Torchvision dataset for image transformation.

What are datasets with labels

We refer to Azure Machine Learning datasets with labels as labeled datasets. Labeled datasets are a specific dataset type that is created only as the output of an Azure Machine Learning data labeling project. To get one, first create a data labeling project. Machine Learning supports data labeling projects for image classification, either multi-label or multi-class, and for object identification with bounding boxes.

Prerequisites

Export data labels

When you complete a data labeling project, you can export the label data from it. The export captures both the reference to the data and its labels, in COCO format or as an Azure Machine Learning dataset. Use the Export button on the Project details page of your labeling project.

COCO

The COCO file is created in the default blob store of the Azure Machine Learning workspace, in a folder within export/coco.
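
Once the COCO file has been downloaded from the blob store, it can be inspected with the Python standard library. The following is a minimal sketch, not the export's documented schema: it assumes the file uses the standard COCO keys (`images`, `categories`, `annotations`), and the sample content, file names, and the `summarize_coco` helper are hypothetical stand-ins for illustration. Check the keys and the bounding-box convention against your own exported file.

```python
import json
from collections import Counter, defaultdict

# Hand-written stand-in for an exported COCO file (hypothetical content).
# In practice you would read the text from the file you downloaded.
coco_text = json.dumps({
    "images": [
        {"id": 1, "file_name": "cat1.jpg", "width": 640, "height": 480},
        {"id": 2, "file_name": "dog1.jpg", "width": 640, "height": 480},
    ],
    "categories": [{"id": 1, "name": "cat"}, {"id": 2, "name": "dog"}],
    "annotations": [
        {"id": 10, "image_id": 1, "category_id": 1, "bbox": [10, 20, 100, 80]},
        {"id": 11, "image_id": 2, "category_id": 2, "bbox": [5, 5, 50, 60]},
        {"id": 12, "image_id": 2, "category_id": 1, "bbox": [200, 40, 30, 30]},
    ],
})


def summarize_coco(text):
    """Group label names by image file and count how often each label occurs."""
    coco = json.loads(text)
    # Lookup tables from numeric ids to human-readable names.
    names = {c["id"]: c["name"] for c in coco["categories"]}
    files = {i["id"]: i["file_name"] for i in coco["images"]}

    per_image = defaultdict(list)
    for ann in coco["annotations"]:
        per_image[files[ann["image_id"]]].append(names[ann["category_id"]])

    counts = Counter(label for labels in per_image.values() for label in labels)
    return dict(per_image), counts


labels_by_file, label_counts = summarize_coco(coco_text)
print(labels_by_file)
print(label_counts)
```

Grouping annotations by image first makes it easy to spot unlabeled or heavily labeled images before training.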

Azure Machine Learning dataset

You can access the exported Azure Machine Learning dataset in the Datasets section of Azure Machine Learning studio. The dataset Details page also provides sample code for accessing your labels from Python.

[Figure: Exported dataset]

Explore labeled datasets

Load your labeled datasets into a pandas dataframe or a Torchvision dataset to use popular open-source libraries for data exploration, as well as the libraries that PyTorch provides for image transformation and training.

Pandas dataframe

You can load labeled datasets into a pandas dataframe with the to_pandas_dataframe() method, which is provided by the azureml-contrib-dataset package. Install the package with the following shell command:

pip install azureml-contrib-dataset

Note

The azureml.contrib namespace changes frequently as we work to improve the service. Anything in this namespace should therefore be considered a preview and is not fully supported by Microsoft.

We offer the following file handling options for file streams when converting to a pandas dataframe.

  • Download: Download your data files to a local path.
  • Mount: Mount your data files to a mount point. Mount works only on Linux-based compute, including Azure Machine Learning notebook VM and Azure Machine Learning Compute.

# Assumes animal_labels is a labeled dataset that was already retrieved from the workspace
import azureml.contrib.dataset
from azureml.contrib.dataset import FileHandlingOption

# Download the image files locally and load the labels into a pandas dataframe
animal_pd = animal_labels.to_pandas_dataframe(file_handling_option=FileHandlingOption.DOWNLOAD, target_path='./download/', overwrite_download=True)

import matplotlib.pyplot as plt
import matplotlib.image as mpimg

# Read the first image from the downloaded path and display it
img = mpimg.imread(animal_pd.loc[0, 'image_url'])
imgplot = plt.imshow(img)
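
Once the labels are in a dataframe, ordinary pandas operations apply. The sketch below is illustrative only: it builds a small stand-in dataframe instead of calling to_pandas_dataframe(), and while the `image_url` column appears in the example above, the `label` column name and all values are assumptions. Inspect `animal_pd.columns` on your own data to confirm the actual schema.

```python
import pandas as pd

# Stand-in for the dataframe returned by to_pandas_dataframe().
# The 'label' column name and the rows below are hypothetical.
animal_pd_example = pd.DataFrame({
    "image_url": [
        "./download/cat1.jpg",
        "./download/dog1.jpg",
        "./download/cat2.jpg",
    ],
    "label": ["cat", "dog", "cat"],
})

# Class distribution: how many images carry each label.
label_counts = animal_pd_example["label"].value_counts()

# Select only the rows for one class, for example to preview those images.
cat_rows = animal_pd_example[animal_pd_example["label"] == "cat"]

print(label_counts)
print(cat_rows["image_url"].tolist())
```

Checking the class distribution early helps reveal imbalance in the labels before any training run.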

Torchvision datasets

You can also load labeled datasets into a Torchvision dataset with the to_torchvision() method, which is likewise provided by the azureml-contrib-dataset package. To use this method, you need to have PyTorch installed.

import matplotlib.pyplot as plt
from torchvision.transforms import functional as F

# Load the animal_labels dataset into a Torchvision dataset
pytorch_dataset = animal_labels.to_torchvision()
# Each item is indexed as [sample][field]; field 0 holds the image
img = pytorch_dataset[0][0]
print(type(img))

# Use torchvision functions to convert the image to grayscale
pil_image = F.to_pil_image(img)
gray_image = F.to_grayscale(pil_image, num_output_channels=3)

imgplot = plt.imshow(gray_image)

Next steps