Transform data in Azure Machine Learning designer (preview)

APPLIES TO: Basic edition: no | Enterprise edition: yes (Upgrade to Enterprise)

In this article, you learn how to transform and save datasets in Azure Machine Learning designer so that you can prepare your own data for machine learning.

You use the sample Adult Census Income Binary Classification dataset to prepare two datasets: one that contains census information only for adults from the United States, and another that contains census information for non-US adults.

In this article, you learn how to:

  1. Transform a dataset to prepare it for training.
  2. Export the resulting datasets to a datastore.
  3. View results.

This how-to is a prerequisite for the how to retrain designer models article. In that article, you learn how to use the transformed datasets to train multiple models with pipeline parameters.

Transform a dataset

In this section, you learn how to import the sample dataset and split the data into US and non-US datasets. For more information on how to import your own data into the designer, see how to import data.

Import data

Use the following steps to import the sample dataset.

  1. Sign in to ml.azure.com, and select the workspace you want to work with.

  2. Go to the designer. Select Easy-to-use prebuilt modules to create a new pipeline.

  3. Select a default compute target to run the pipeline.

  4. To the left of the pipeline canvas is a palette of datasets and modules. Select Datasets. Then view the Samples section.

  5. Drag and drop the Adult Census Income Binary classification dataset onto the canvas.

  6. Select the Adult Census Income dataset module.

  7. In the details pane that appears to the right of the canvas, select Outputs.

  8. Select the visualize icon.

  9. Use the data preview window to explore the dataset. Take special note of the "native-country" column values.
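
If you prefer to inspect the column outside the designer, the same exploration can be approximated in a few lines of Python. This is a minimal sketch, not part of the designer itself: the function name `count_column_values` and the sample rows are hypothetical and are not taken from the real dataset.

```python
from collections import Counter


def count_column_values(rows, column):
    """Tally how often each distinct value appears in one column."""
    return Counter(row[column] for row in rows)


# Hypothetical sample rows that stand in for the census data.
sample_rows = [
    {"age": "39", "native-country": "United-States"},
    {"age": "28", "native-country": "Cuba"},
    {"age": "49", "native-country": "Jamaica"},
    {"age": "52", "native-country": "United-States"},
]

counts = count_column_values(sample_rows, "native-country")
for value, count in counts.most_common():
    print(value, count)
```

A tally like this makes it easy to see which values the regular expression in the next section needs to match.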

Split the data

In this section, you use the Split Data module to identify and split rows that contain "United-States" in the "native-country" column.

  1. In the module palette to the left of the canvas, expand the Data Transformation section and find the Split Data module.

  2. Drag the Split Data module onto the canvas, and drop the module below the dataset module.

  3. Connect the dataset module to the Split Data module.

  4. Select the Split Data module.

  5. In the module details pane to the right of the canvas, set Splitting mode to Regular Expression.

  6. Enter the Regular Expression: \"native-country" United-States

    The Regular Expression mode tests a single column for a value. For more information on the Split Data module, see the related algorithm module reference page.

Your pipeline should look like this:

Screenshot showing how to configure the pipeline and the Split Data module.
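
To make the behavior of the Regular Expression mode concrete, here is a minimal Python sketch of the same idea: test one column against a pattern and route each row to one of two outputs. It is an illustration only, not the module's implementation. The helper `split_by_regex` and the sample rows are hypothetical, and the use of `re.search` is an assumption; the module's exact matching rules may differ.

```python
import re


def split_by_regex(rows, column, pattern):
    """Partition rows into (matching, non_matching) by testing one column."""
    regex = re.compile(pattern)
    matching, non_matching = [], []
    for row in rows:
        # Assumption: a substring search; the designer may match differently.
        if regex.search(row[column]):
            matching.append(row)
        else:
            non_matching.append(row)
    return matching, non_matching


# Hypothetical sample rows that stand in for the census data.
sample_rows = [
    {"age": "39", "native-country": "United-States"},
    {"age": "28", "native-country": "Cuba"},
    {"age": "49", "native-country": "Jamaica"},
    {"age": "52", "native-country": "United-States"},
]

us_rows, non_us_rows = split_by_regex(sample_rows, "native-country", "United-States")
print(len(us_rows), len(non_us_rows))
```

The first returned list plays the role of the module's first output port (rows where the expression matches), and the second list plays the role of the second port.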

Save the datasets

Now that your pipeline is set up to split the data, you need to specify where to persist the datasets. For this example, use the Export Data module to save each dataset to a datastore. For more information on datastores, see Connect to Azure storage services.

  1. In the module palette to the left of the canvas, expand the Data Input and Output section and find the Export Data module.

  2. Drag and drop two Export Data modules below the Split Data module.

  3. Connect each output port of the Split Data module to a different Export Data module.

    Your pipeline should look something like this:

    Screenshot showing how to connect the Export Data modules.

  4. Select the Export Data module that is connected to the left-most port of the Split Data module.

    The order of the output ports matters for the Split Data module. The first output port contains the rows where the regular expression is true. In this case, the first port contains the rows for US-based income, and the second port contains the rows for non-US-based income.

  5. In the module details pane to the right of the canvas, set the following options:

    Datastore type: Azure Blob Storage

    Datastore: Select an existing datastore or select "New datastore" to create one now.

    Path: /data/us-income

    File format: csv

    Note

    This article assumes that you have access to a datastore registered to the current Azure Machine Learning workspace. For instructions on how to set up a datastore, see Connect to Azure storage services.

    If you don't have a datastore, you can create one now. For example purposes, this article saves the datasets to the default blob storage account associated with the workspace. It saves the datasets into the azureml container in a new folder called data.

  6. Select the Export Data module connected to the right-most port of the Split Data module.

  7. In the module details pane to the right of the canvas, set the following options:

    Datastore type: Azure Blob Storage

    Datastore: Select the same datastore as above

    Path: /data/non-us-income

    File format: csv

  8. Confirm that the Export Data module connected to the left port of the Split Data module has the Path /data/us-income.

  9. Confirm that the Export Data module connected to the right port has the Path /data/non-us-income.

    Your pipeline and settings should look like this:

    Screenshot showing how to configure the Export Data modules.
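
The effect of the two Export Data modules can be pictured as writing each half of the split to its own CSV file under a shared root. The following is a local sketch only: it writes to a temporary folder instead of Azure Blob Storage, and the helper `export_csv`, the column list, and the sample rows are hypothetical. Only the two paths come from the steps above.

```python
import csv
import tempfile
from pathlib import Path


def export_csv(rows, fieldnames, root, relative_path):
    """Write rows (a list of dicts) as CSV under root/relative_path."""
    target = Path(root) / relative_path.lstrip("/")
    target.parent.mkdir(parents=True, exist_ok=True)
    with target.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
    return target


# Hypothetical rows standing in for the two outputs of the split.
columns = ["age", "native-country"]
us_rows = [{"age": "39", "native-country": "United-States"}]
non_us_rows = [{"age": "28", "native-country": "Cuba"}]

# A temporary local folder stands in for the datastore container.
local_root = tempfile.mkdtemp()
us_file = export_csv(us_rows, columns, local_root, "/data/us-income")
non_us_file = export_csv(non_us_rows, columns, local_root, "/data/non-us-income")
print(us_file)
print(non_us_file)
```

Keeping the two outputs under the same data folder with distinct names mirrors the Path settings above and makes it harder to mix up the left and right ports.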

Submit the run

Now that your pipeline is set up to split and export the data, submit a pipeline run.

  1. At the top of the canvas, select Submit.

  2. In the Set up pipeline run dialog, select Create new to create an experiment.

    Experiments logically group together related pipeline runs. If you run this pipeline in the future, use the same experiment for logging and tracking purposes.

  3. Provide a descriptive experiment name like "split-census-data".

  4. Select Submit.

View results

After the pipeline finishes running, you can view your results by navigating to your blob storage in the Azure portal. You can also view the intermediate results of the Split Data module to confirm that your data has been split correctly.

  1. Select the Split Data module.

  2. In the module details pane to the right of the canvas, select Outputs + logs.

  3. Select the visualize icon next to Results dataset1.

  4. Verify that the "native-country" column contains only the value "United-States".

  5. Select the visualize icon next to Results dataset2.

  6. Verify that the "native-country" column does not contain the value "United-States".
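
The two visual checks above can also be expressed as a small programmatic test if you download the outputs. This is a minimal sketch under stated assumptions: the function `check_split` and the sample rows are hypothetical, and stripping whitespace around values is a precaution rather than documented behavior of the dataset.

```python
def check_split(first_output, second_output,
                column="native-country", value="United-States"):
    """Return True if the first output holds only `value` and the second never does."""
    first_ok = all(row[column].strip() == value for row in first_output)
    second_ok = all(row[column].strip() != value for row in second_output)
    return first_ok and second_ok


# Hypothetical rows standing in for Results dataset1 and Results dataset2.
dataset1 = [
    {"age": "39", "native-country": "United-States"},
    {"age": "52", "native-country": "United-States"},
]
dataset2 = [
    {"age": "28", "native-country": "Cuba"},
    {"age": "49", "native-country": "Jamaica"},
]

print(check_split(dataset1, dataset2))
```

If the function returns False, the most likely cause in this walkthrough is that the Export Data modules are attached to the opposite output ports.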

Clean up resources

Skip this section if you want to continue with part 2 of this how-to, Retrain models with Azure Machine Learning designer.

Important

You can use the resources that you created as prerequisites for other Azure Machine Learning tutorials and how-to articles.

Delete everything

If you don't plan to use anything that you created, delete the entire resource group so you don't incur any charges.

  1. In the Azure portal, select Resource groups on the left side of the window.

    Image showing how to delete a resource group in the Azure portal.

  2. In the list, select the resource group that you created.

  3. Select Delete resource group.

Deleting the resource group also deletes all resources that you created in the designer.

Delete individual assets

In the designer where you created your experiment, delete individual assets by selecting them and then selecting the Delete button.

The compute target that you created here automatically scales down to zero nodes when it's not being used. This action is taken to minimize charges. If you want to delete the compute target, take these steps:

Image showing how to delete assets.

You can unregister datasets from your workspace by selecting each dataset and selecting Unregister.

Image showing how to unregister a dataset.

To delete a dataset, go to the storage account by using the Azure portal or Azure Storage Explorer and manually delete those assets.

Next steps

In this article, you learned how to transform a dataset and save it to a registered datastore.

Continue to the next part of this how-to series, Retrain models with Azure Machine Learning designer, to use your transformed datasets and pipeline parameters to train machine learning models.