Convert to Dataset

This article describes how to use the Convert to Dataset module in Azure Machine Learning designer (preview) to convert any data for a pipeline to the designer's internal format.

Conversion is not required in most cases. Azure Machine Learning implicitly converts data to its native dataset format when any operation is performed on the data.

If you've normalized or cleaned a set of data and want those changes to carry over to other pipelines, we recommend saving the data in the dataset format.

Note

Convert to Dataset changes only the format of the data. It does not save a new copy of the data in the workspace. To save the dataset, double-click the output port, select Save as dataset, and enter a new name.

How to use Convert to Dataset

We recommend that you use the Edit Metadata module to prepare the dataset before you use Convert to Dataset. You can add or change column names, adjust data types, and make other changes as needed.
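As a rough illustration of the kind of preparation meant here, the following plain-Python sketch renames columns and casts a column to a numeric type. It is only an analogy for what Edit Metadata does in the designer, not the module's API; the `prepare` helper, the column names, and the sample rows are all hypothetical.

```python
from typing import Any, Callable, Dict, List

Row = Dict[str, Any]


def prepare(rows: List[Row],
            renames: Dict[str, str],
            types: Dict[str, Callable[[Any], Any]]) -> List[Row]:
    """Rename columns, then cast the columns listed in `types` (hypothetical helper)."""
    prepared = []
    for row in rows:
        new_row = {}
        for col, cell in row.items():
            name = renames.get(col, col)   # keep the old name if no rename is given
            cast = types.get(name)         # cast only the columns that were listed
            new_row[name] = cast(cell) if cast else cell
        prepared.append(new_row)
    return prepared


# Illustrative data: generic column names and string-typed numbers.
raw_rows = [{"col1": "1", "col2": "a"}, {"col1": "2", "col2": "b"}]

prepared_rows = prepare(
    raw_rows,
    renames={"col1": "trip_count", "col2": "vendor"},
    types={"trip_count": int},
)
```

After this step, `prepared_rows` carries descriptive column names and an integer `trip_count` column, which is the state you would want the data in before converting it.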

  1. Add the Convert to Dataset module to your pipeline. You can find this module in the Data transformation category in the designer.

  2. Connect it to any module that outputs a dataset.

    As long as the data is tabular, you can convert it to a dataset. This includes data loaded through Import Data, data created through Enter Data Manually, or datasets transformed through Apply Transformation.

  3. In the Action drop-down list, indicate whether you want to do any cleanup on the data before you save the dataset:

    • None: Use the data as is.

    • SetMissingValue: Treat a specific value in the dataset as a missing value. The default placeholder is the question mark character (?), but you can use the Custom missing value option to enter a different value. For example, if you enter Taxi for Custom missing value, then all instances of Taxi in the dataset are changed to the missing value.

    • ReplaceValues: Use this option to specify a single exact value to be replaced with any other exact value. You can replace missing values or custom values by setting the Replace method:

      • Missing: Choose this option to replace missing values in the input dataset. For New value, enter the value to replace the missing values with.
      • Custom: Choose this option to replace custom values in the input dataset. For Custom value, enter the value that you want to find. For example, if your data contains the string obs used as a placeholder for missing values, enter obs. For New value, enter the new value to replace the original string with.

    Note that the ReplaceValues operation applies only to exact matches. For example, these strings would not be affected: obs., obsolete.

  4. Submit the pipeline.
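The Action options in step 3 can be sketched in plain Python as follows. This is a minimal model of the described behavior, not the module's implementation: the function names, the use of `None` as the missing-value marker, and the sample rows are assumptions made for illustration.

```python
from typing import Any, Dict, List

Row = Dict[str, Any]
MISSING = None  # assumption: None stands in for the designer's missing-value marker


def set_missing_value(rows: List[Row], placeholder: str = "?") -> List[Row]:
    """SetMissingValue: cells that exactly equal `placeholder` become MISSING."""
    return [
        {col: (MISSING if cell == placeholder else cell) for col, cell in row.items()}
        for row in rows
    ]


def replace_missing(rows: List[Row], new_value: Any) -> List[Row]:
    """ReplaceValues, Missing method: fill MISSING cells with `new_value`."""
    return [
        {col: (new_value if cell is MISSING else cell) for col, cell in row.items()}
        for row in rows
    ]


def replace_custom(rows: List[Row], custom_value: Any, new_value: Any) -> List[Row]:
    """ReplaceValues, Custom method: only whole-cell, exact matches are replaced."""
    return [
        {col: (new_value if cell == custom_value else cell) for col, cell in row.items()}
        for row in rows
    ]


# Illustrative rows reusing the Taxi and obs examples from the steps above.
rows = [
    {"vehicle": "Taxi", "note": "obs"},
    {"vehicle": "?", "note": "obsolete"},
    {"vehicle": "Bus", "note": "obs."},
]

marked = set_missing_value(rows, placeholder="Taxi")                    # Taxi -> missing
filled = replace_missing(set_missing_value(rows), new_value="unknown")  # ? -> missing -> unknown
cleaned = replace_custom(rows, custom_value="obs", new_value="n/a")     # only the exact "obs" changes
```

In `cleaned`, the cells containing `obsolete` and `obs.` are left alone, which matches the exact-match rule stated in step 3.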

Results

  • To save the resulting dataset with a new name, select the Register dataset icon under the Outputs tab in the right panel of the module.

Technical notes

  • Any module that takes a dataset as input can also take data from a CSV or TSV file. Before any module code is run, the inputs are preprocessed. Preprocessing is equivalent to running the Convert to Dataset module on the input.

  • You can't convert from the SVMLight format to a dataset.

  • When you specify a custom replace operation, the search-and-replace operation applies to complete values. Partial matches are not allowed. For example, you can replace a 3 with a -1 or with 33, but you can't replace a 3 in a two-digit number such as 35.

  • For custom replace operations, the replacement fails silently if the replacement value contains any character that does not conform to the current data type of the column.
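The last two notes can be modeled with a short Python sketch. It assumes that a silent failure means the column is returned unchanged and no error is raised; the `replace_in_column` helper and the sample column are hypothetical and only illustrate the described behavior.

```python
from typing import Any, Callable, List


def replace_in_column(values: List[Any], find: Any, new: Any,
                      column_type: Callable[[Any], Any]) -> List[Any]:
    """Replace complete values equal to `find` with `new` in one typed column.

    Assumption: if `new` cannot be converted to `column_type`, the column is
    returned unchanged and no error is raised (the "silent failure" case).
    """
    try:
        typed_new = column_type(new)
    except (TypeError, ValueError):
        return list(values)  # silent no-op: nothing is replaced
    # Whole-value comparison only, so the 3 inside 35 or 13 never matches.
    return [typed_new if value == find else value for value in values]


numbers = [3, 35, 13, 3]

replaced = replace_in_column(numbers, find=3, new=-1, column_type=int)
unchanged = replace_in_column(numbers, find=3, new="n/a", column_type=int)
```

Here `replaced` changes only the standalone 3 values, while `unchanged` is identical to the input because "n/a" cannot be converted to an integer. Since no error is reported in the second case, it is worth checking the output after a custom replace.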

Next steps

See the set of modules available to Azure Machine Learning.