Create a data labeling project and export labels

Labeling large volumes of data in machine learning projects is often a headache. Projects that have a computer-vision component, such as image classification or object detection, generally require labels for thousands of images.

Azure Machine Learning data labeling gives you a central place to create, manage, and monitor labeling projects. Use it to coordinate data, labels, and team members to efficiently manage labeling tasks. Machine Learning supports image classification, either multi-label or multi-class, and object identification with bounding boxes.

Data labeling tracks progress and maintains the queue of incomplete labeling tasks.

You can start and stop the project and control the labeling progress. You can review the labeled data and export it in COCO format or as an Azure Machine Learning dataset.


Only image classification and object identification labeling projects are currently supported. Additionally, the data images must be available in an Azure blob datastore. (If you don't have an existing datastore, you can upload images during project creation.)

In this article, you'll learn how to:

  • Create a project
  • Specify the project's data and structure
  • Run and monitor the project
  • Export the labels


Prerequisites

  • The data that you want to label, either in local files or in Azure blob storage.
  • The set of labels that you want to apply.
  • The instructions for labeling.
  • An Azure subscription. If you don't have an Azure subscription, create a trial account before you begin.
  • A Machine Learning workspace. See Create an Azure Machine Learning workspace.

Create a labeling project

Labeling projects are administered from Azure Machine Learning. You use the Labeling projects page to manage your projects.

If your data is already in Azure Blob storage, make it available as a datastore before you create the labeling project. For an example of using a datastore, see Tutorial: Create your first image classification labeling project.

To create a project, select Add project. Give the project an appropriate name and select Labeling task type.


  • Choose Image Classification Multi-class for projects when you want to apply only a single label from a set of labels to an image.
  • Choose Image Classification Multi-label for projects when you want to apply one or more labels from a set of labels to an image. For instance, a photo of a dog might be labeled with both dog and daytime.
  • Choose Object Identification (Bounding Box) for projects when you want to assign a label and a bounding box to each object within an image.

Select Next when you're ready to continue.

Specify the data to label

If you already created a dataset that contains your data, select it from the Select an existing dataset drop-down list. Alternatively, select Create a dataset to use an existing Azure datastore or to upload local files.


A project cannot contain more than 500,000 images. If your dataset has more, only the first 500,000 images will be loaded.

Create a dataset from an Azure datastore

In many cases, it's fine to just upload local files. However, Azure Storage Explorer provides a faster and more robust way to transfer a large amount of data. We recommend Storage Explorer as the default way to move files.

To create a dataset from data that you've already stored in Azure Blob storage:

  1. Select Create a dataset > From datastore.
  2. Assign a Name to your dataset.
  3. Choose File as the Dataset type. Only file dataset types are supported.
  4. Select the datastore.
  5. If your data is in a subfolder within your blob storage, choose Browse to select the path.
    • Append "/**" to the path to include all the files in subfolders of the selected path.
    • Append "**/*.*" to include all the data in the current container and its subfolders.
  6. Provide a description for your dataset.
  7. Select Next.
  8. Confirm the details. Select Back to modify the settings or Create to create the dataset.

Create a dataset from uploaded data

To directly upload your data:

  1. Select Create a dataset > From local files.
  2. Assign a Name to your dataset.
  3. Choose File as the Dataset type.
  4. Optional: Select Advanced settings to customize the datastore, container, and path to your data.
  5. Select Browse to select the local files to upload.
  6. Provide a description of your dataset.
  7. Select Next.
  8. Confirm the details. Select Back to modify the settings or Create to create the dataset.

The data is uploaded to the default blob store ("workspaceblobstore") of your Machine Learning workspace.

Configure incremental refresh

If you plan to add new images to your dataset, use incremental refresh to add these new images to your project. When incremental refresh is enabled, the dataset is checked periodically for new images to be added to the project, based on the labeling completion rate. The check for new data stops when the project contains the maximum of 500,000 images.
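
The capping behavior can be pictured with a short Python sketch. This is not the service's implementation: `refresh_project` and its arguments are hypothetical names used only for illustration, and `MAX_PROJECT_IMAGES` simply mirrors the 500,000 limit stated in this article.

```python
MAX_PROJECT_IMAGES = 500_000  # project limit stated in this article


def refresh_project(project_images, datastore_images, limit=MAX_PROJECT_IMAGES):
    """Return the project image list after one hypothetical refresh pass.

    New images found in the datastore are appended in order until the
    project reaches the limit; anything beyond the limit is left out.
    """
    known = set(project_images)
    updated = list(project_images)
    for path in datastore_images:
        if len(updated) >= limit:
            break  # the check for new data stops at the maximum
        if path not in known:
            updated.append(path)
            known.add(path)
    return updated


# A small limit makes the behavior easy to see.
project = ["a.jpg", "b.jpg"]
datastore = ["a.jpg", "b.jpg", "c.jpg", "d.jpg", "e.jpg"]
result = refresh_project(project, datastore, limit=4)
print(result)
```

With the small illustrative limit of 4, only the first two new images are added and the rest stay out of the project.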

To add more images to your project, use Azure Storage Explorer to upload them to the appropriate folder in the blob storage.

Select the Enable incremental refresh check box when you want your project to continually monitor for new data in the datastore.

Clear this check box if you don't want new images that appear in the datastore to be added to your project.

You can find the timestamp for the latest refresh in the Incremental refresh section of the Details tab for your project.

Specify label classes

On the Label classes page, specify the set of classes to categorize your data. Do this carefully, because your labelers' accuracy and speed will be affected by their ability to choose among the classes. For instance, instead of spelling out the full genus and species for plants or animals, use a field code or abbreviate the genus.

Enter one label per row. Use the + button to add a new row. If you have more than 3 or 4 labels but fewer than 10, you may want to prefix the names with numbers ("1: ", "2: ") so the labelers can use the number keys to speed their work.
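
If you keep your label names in a list, a few lines of Python can produce the numbered names before you type them into the page. This is only a convenience sketch: `prefix_labels` is a hypothetical helper, not part of any Azure SDK, and the "1: " format follows the example above.

```python
def prefix_labels(labels):
    """Prefix each label with its 1-based position, for example '1: cat'.

    Lists with 10 or more labels are returned unchanged, because the
    numbering suggestion above applies to fewer than 10 labels.
    """
    if len(labels) >= 10:
        return list(labels)
    return [f"{i}: {name}" for i, name in enumerate(labels, start=1)]


labels = ["cat", "dog", "bird", "other"]
numbered = prefix_labels(labels)
print(numbered)
```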

Describe the labeling task

It's important to clearly explain the labeling task. On the Labeling instructions page, you can add a link to an external site for labeling instructions, or provide instructions in the edit box on the page. Keep the instructions task-oriented and appropriate to the audience. Consider these questions:

  • What are the labels they'll see, and how will they choose among them? Is there a reference text to refer to?
  • What should they do if no label seems appropriate?
  • What should they do if multiple labels seem appropriate?
  • What confidence threshold should they apply to a label? Do you want their "best guess" if they aren't certain?
  • What should they do with partially occluded or overlapping objects of interest?
  • What should they do if an object of interest is clipped by the edge of the image?
  • What should they do after they submit a label if they think they made a mistake?

For bounding boxes, important questions include:

  • How is the bounding box defined for this task? Should it be entirely on the interior of the object, or should it be on the exterior? Should it be cropped as closely as possible, or is some clearance acceptable?
  • What level of care and consistency do you expect the labelers to apply in defining bounding boxes?
  • How should labelers label an object that is only partially shown in the image?
  • How should labelers label an object that is partially covered by another object?


Be sure to note that the labelers will be able to select the first 9 labels by using number keys 1-9.

Use ML assisted labeling

The ML assisted labeling page lets you trigger automatic machine learning models to accelerate the labeling task. At the beginning of your labeling project, the images are shuffled into a random order to reduce potential bias. However, any biases that are present in the dataset will be reflected in the trained model. For example, if 80% of your images are of a single class, then approximately 80% of the data used to train the model will be of that class. This training does not include active learning.

Select Enable ML assisted labeling and specify a GPU to enable assisted labeling, which consists of two phases:

  • Clustering
  • Prelabeling

The exact number of labeled images necessary to start assisted labeling is not fixed. It can vary significantly from one labeling project to another. For some projects, it is sometimes possible to see prelabel or cluster tasks after 300 images have been manually labeled. ML assisted labeling uses a technique called transfer learning, which uses a pre-trained model to jump-start the training process. If your dataset's classes are similar to those in the pre-trained model, pre-labels may be available after only a few hundred manually labeled images. If your dataset is significantly different from the data used to pre-train the model, it may take much longer.

Because the final labels still rely on input from the labeler, this technology is sometimes called human-in-the-loop labeling.


ML assisted data labeling does not support default storage accounts secured behind a virtual network. You must use a non-default storage account for ML assisted data labeling. The non-default storage account can be secured behind the virtual network.


After a certain number of labels are submitted, the machine learning model for image classification starts to group together similar images. These similar images are presented to the labelers on the same screen to speed up manual tagging. Clustering is especially useful when the labeler is viewing a grid of 4, 6, or 9 images.

Once a machine learning model has been trained on your manually labeled data, the model is truncated to its last fully-connected layer. Unlabeled images are then passed through the truncated model in a process commonly known as "embedding" or "featurization." This embeds each image in a high-dimensional space defined by this model layer. Images that are nearest neighbors in the space are used for clustering tasks.
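
The nearest-neighbor idea can be illustrated with a minimal Python sketch. It assumes each image has already been reduced to an embedding vector; the three-dimensional vectors, the image names, and the use of Euclidean distance are all made up for illustration, and the service's actual model, distance measure, and grouping logic are not described in this article.

```python
import math


def euclidean(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest_neighbors(embeddings, query_name, k):
    """Return the names of the k images closest to query_name."""
    query = embeddings[query_name]
    distances = [
        (euclidean(query, vector), name)
        for name, vector in embeddings.items()
        if name != query_name
    ]
    distances.sort()  # closest first
    return [name for _, name in distances[:k]]


# Hypothetical embeddings; real ones have many more dimensions.
embeddings = {
    "img_001": [0.9, 0.1, 0.0],
    "img_002": [0.8, 0.2, 0.1],
    "img_003": [0.0, 0.9, 0.8],
    "img_004": [0.1, 1.0, 0.7],
    "img_005": [0.85, 0.15, 0.05],
}

group = nearest_neighbors(embeddings, "img_001", k=2)
print(group)
```

Images returned together this way are the kind of candidates that could be shown on one screen, so a labeler can tag visually similar images in a single pass.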

The clustering phase does not appear for object detection models.


After enough image labels are submitted, a classification model is used to predict image tags, or an object detection model is used to predict bounding boxes. The labeler now sees pages that contain predicted labels already present on each image. For object detection, predicted boxes are also shown. The task is then to review these predictions and correct any mislabeled images before submitting the page.

Once a machine learning model has been trained on your manually labeled data, the model is evaluated on a test set of manually labeled images to determine its accuracy at a variety of different confidence thresholds. This evaluation process is used to determine a confidence threshold above which the model is accurate enough to show pre-labels. The model is then evaluated against unlabeled data. Images with predictions more confident than this threshold are used for pre-labeling.
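
The threshold selection can be sketched in Python as follows. This is an illustration under stated assumptions, not the service's algorithm: the candidate thresholds, the 0.9 target accuracy, the sample data, and the function names are all hypothetical.

```python
def accuracy_above(predictions, threshold):
    """Accuracy over predictions whose confidence exceeds the threshold.

    predictions is a list of (confidence, is_correct) pairs from the
    manually labeled test set. Returns None when nothing exceeds it.
    """
    kept = [is_correct for confidence, is_correct in predictions
            if confidence > threshold]
    if not kept:
        return None
    return sum(kept) / len(kept)


def pick_threshold(predictions, candidates, target_accuracy):
    """Return the lowest candidate threshold that meets the target."""
    for threshold in sorted(candidates):
        accuracy = accuracy_above(predictions, threshold)
        if accuracy is not None and accuracy >= target_accuracy:
            return threshold
    return None  # the model is not accurate enough at any candidate


# Hypothetical test-set results: (confidence, prediction was correct).
test_set = [
    (0.95, True), (0.90, True), (0.85, True), (0.70, False),
    (0.65, True), (0.55, False), (0.40, False),
]
threshold = pick_threshold(test_set, [0.5, 0.6, 0.7, 0.8, 0.9],
                           target_accuracy=0.9)

# Hypothetical confidences for unlabeled images.
unlabeled = {"u1": 0.92, "u2": 0.75, "u3": 0.61}
prelabeled = sorted(name for name, confidence in unlabeled.items()
                    if threshold is not None and confidence > threshold)
print(threshold, prelabeled)
```

Choosing the lowest threshold that still meets the accuracy target lets as many images as possible receive pre-labels without showing labelers predictions that are likely to be wrong.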

Initialize the labeling project

After the labeling project is initialized, some aspects of the project are immutable. You can't change the task type or dataset. You can modify labels and the URL for the task description. Carefully review the settings before you create the project. After you submit the project, you're returned to the Data Labeling homepage, which shows the project as Initializing.


This page may not refresh automatically. After a pause, manually refresh the page to see the project's status as Created.

Run and monitor the project

After you initialize the project, Azure begins running it. Select the project on the main Data Labeling page to see details of the project.

To pause or restart the project, toggle the Running status on the top right. You can only label data when the project is running.


The Dashboard tab shows the progress of the labeling task.


The progress chart shows how many items have been labeled and how many are not yet done. Pending items may be:

  • Not yet added to a task
  • Included in a task that is assigned to a labeler but not yet completed
  • In the queue of tasks yet to be assigned

The middle section shows the queue of tasks yet to be assigned. When ML assisted labeling is off, this section shows the number of manual tasks to be assigned. When ML assisted labeling is on, it also shows:

  • Tasks containing clustered items in the queue
  • Tasks containing prelabeled items in the queue

Additionally, when ML assisted labeling is enabled, a small progress bar shows when the next training run will occur. The Experiments section gives links for each of the machine learning runs.

  • Training - trains a model to predict the labels
  • Validation - determines whether this model's prediction will be used for pre-labeling the items
  • Inference - prediction run for new items
  • Featurization - clusters items (only for image classification projects)

On the right-hand side is a distribution of the labels for those tasks that are complete. Remember that in some project types, an item can have multiple labels, in which case the total number of labels can be greater than the total number of items.
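
A small Python example shows why the label total can exceed the item total in a multi-label project. The image names and labels below are invented for illustration.

```python
from collections import Counter

# Hypothetical completed items from a multi-label project.
completed = {
    "img_001": ["dog", "daytime"],
    "img_002": ["cat"],
    "img_003": ["dog", "nighttime"],
}

# Count every label applied across all completed items.
distribution = Counter(
    label for labels in completed.values() for label in labels
)
total_labels = sum(distribution.values())
total_items = len(completed)

print(distribution)
print(total_labels, total_items)
```

Here three items carry five labels in total, because two of the images each have two labels.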

Data tab

On the Data tab, you can see your dataset and review labeled data. If you see incorrectly labeled data, select it and choose Reject, which removes the labels and puts the data back into the unlabeled queue.

Details tab

View details of your project. In this tab you can:

  • View project details and input datasets
  • Enable incremental refresh
  • View details of the storage container used to store labeled outputs in your project
  • Add labels to your project
  • Edit the instructions you give to your labelers
  • Edit details of ML assisted labeling, including enable/disable

Access for labelers

Anyone who has access to your workspace can label data in your project. You can also customize the permissions for your labelers so that they can access labeling but not other parts of the workspace or your labeling project. For more details, see Manage access to an Azure Machine Learning workspace, and learn how to create the labeler custom role.

Add new label class to a project

During the labeling process, you may find that additional labels are needed to classify your images. For example, you may want to add an "Unknown" or "Other" label to indicate confusing images.

Use these steps to add one or more labels to a project:

  1. Select the project on the main Data Labeling page.
  2. At the top right of the page, toggle Running to Paused to stop labelers from their activity.
  3. Select the Details tab.
  4. In the list on the left, select Label classes.
  5. At the top of the list, select + Add Labels.
  6. In the form, add your new label and choose how to proceed. Because you've changed the available labels for an image, choose how to treat the already labeled data:
    • Start over, removing all existing labels. Choose this option if you want to start labeling from the beginning with the new full set of labels.
    • Start over, keeping all existing labels. Choose this option to mark all data as unlabeled, but keep the existing labels as a default tag for images that were previously labeled.
    • Continue, keeping all existing labels. Choose this option to keep all data already labeled as is, and start using the new label for data not yet labeled.
  7. Modify your instructions page as necessary for the new label(s).
  8. Once you have added all new labels, at the top right of the page toggle Paused to Running to restart the project.

Export the labels

You can export the label data for Machine Learning experimentation at any time. Image labels can be exported in COCO format or as an Azure Machine Learning dataset with labels. Use the Export button on the Project details page of your labeling project.

The COCO file is created in the default blob store of the Azure Machine Learning workspace in a folder within export/coco. You can access the exported Azure Machine Learning dataset in the Datasets section of Machine Learning. The dataset details page also provides sample code to access your labels from Python.
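
Once you have downloaded a COCO export, you can inspect it with Python's standard json module. The sketch below assumes the conventional COCO layout with "images", "annotations", and "categories" lists; the inline sample is made up, and the exact fields in an Azure Machine Learning export may differ, so check an exported file before relying on it.

```python
import json
from collections import defaultdict

# Made-up sample in the conventional COCO layout. In practice, load the
# downloaded file instead, for example with json.load(open(path)).
coco_text = """
{
  "images": [
    {"id": 1, "file_name": "dog_park.jpg"},
    {"id": 2, "file_name": "kitchen.jpg"}
  ],
  "annotations": [
    {"id": 10, "image_id": 1, "category_id": 1},
    {"id": 11, "image_id": 1, "category_id": 2},
    {"id": 12, "image_id": 2, "category_id": 3}
  ],
  "categories": [
    {"id": 1, "name": "dog"},
    {"id": 2, "name": "daytime"},
    {"id": 3, "name": "cat"}
  ]
}
"""

coco = json.loads(coco_text)

# Lookup tables from numeric IDs to readable names.
category_names = {c["id"]: c["name"] for c in coco["categories"]}
file_names = {img["id"]: img["file_name"] for img in coco["images"]}

# Collect the label names attached to each image file.
grouped = defaultdict(list)
for annotation in coco["annotations"]:
    file_name = file_names[annotation["image_id"]]
    grouped[file_name].append(category_names[annotation["category_id"]])
labels_by_file = dict(grouped)

print(labels_by_file)
```

For object identification projects, COCO annotations conventionally also carry a "bbox" entry for each annotation, which you can read in the same loop.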


Next steps