Edit Metadata module

This article describes a module included in Azure Machine Learning designer (preview).

Use the Edit Metadata module to change the metadata that's associated with columns in a dataset. The values and data types in the dataset change after you use the Edit Metadata module.

Typical metadata changes might include:

  • Treating Boolean or numeric columns as categorical values.

  • Indicating which column contains the class label or contains the values you want to categorize or predict.

  • Marking columns as features.

  • Changing date/time values to numeric values or vice versa.

  • Renaming columns.

Use Edit Metadata anytime you need to modify the definition of a column, typically to meet requirements for a downstream module. For example, some modules work only with specific data types or require flags on the columns, such as IsFeature or IsCategorical.

After you perform the required operation, you can reset the metadata to its original state.

Configure Edit Metadata

  1. In Azure Machine Learning designer, add the Edit Metadata module to your pipeline and connect the dataset you want to update. You can find the module in the Data Transformation category.

  2. Click Edit column in the right panel of the module and choose the column or set of columns to work with. You can choose columns individually by name or index, or you can choose a group of columns by type.

  3. Select the Data type option if you need to assign a different data type to the selected columns. Certain operations require a specific data type. For example, if your source dataset has numbers stored as text, you must change those columns to a numeric data type before using math operations.

    • The supported data types are String, Integer, Double, Boolean, and DateTime.

    • If you select multiple columns, the metadata changes apply to all selected columns. For example, let's say you choose two or three numeric columns. You can change them all to a string data type and rename them in one operation. However, you can't change one column to a string data type and another column from a float to an integer in the same operation.

    • If you don't specify a new data type, the column metadata is unchanged.

    • The column type and values change after you perform the Edit Metadata operation. You can recover the original data type at any time by using Edit Metadata to reset the column data type.

    Note

     > The **DateTime Format** follows [Python built-in datetime format](https://docs.python.org/3/library/datetime.html#strftime-and-strptime-behavior).
     > If you change any type of number to the **DateTime** type, leave the **DateTime Format** field blank. Currently it isn't possible to specify the target data format.
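To make the effect of a data type change concrete, here is a minimal sketch in plain Python. It is not the designer's implementation: the helper name `convert_column` and the sample values are hypothetical, and Boolean conversion is left out. It shows text values being converted to Integer, Double, or DateTime, with the DateTime case using a Python strftime/strptime-style format string as described in the note.

```python
from datetime import datetime


def convert_column(values, target_type, datetime_format=None):
    """Convert a list of text values to the requested type (illustrative only)."""
    if target_type == "String":
        return [str(v) for v in values]
    if target_type == "Integer":
        return [int(v) for v in values]
    if target_type == "Double":
        return [float(v) for v in values]
    if target_type == "DateTime":
        if datetime_format is None:
            raise ValueError("A format string is required to parse text as DateTime")
        # Format codes follow Python's strftime/strptime conventions.
        return [datetime.strptime(v, datetime_format) for v in values]
    raise ValueError(f"Unsupported target type: {target_type}")


# Numbers stored as text cannot be summed until they are converted.
prices_as_text = ["10", "25", "7"]
prices = convert_column(prices_as_text, "Integer")
total = sum(prices)

# Text dates parsed with an explicit format string.
dates = convert_column(["2020-03-15", "2020-04-01"], "DateTime", "%Y-%m-%d")
```

Note that the same target type is applied to every value in the selection, which mirrors the rule that one operation applies one change to all selected columns.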
    
  4. Select the Categorical option to specify that the values in the selected columns should be treated as categories.

    For example, you might have a column that contains the numbers 0, 1, and 2, but know that the numbers actually mean "Smoker," "Non-smoker," and "Unknown." In that case, by flagging the column as categorical you ensure that the values are used only to group data and not in numeric calculations.
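The smoker example can be sketched in plain Python to show why the distinction matters. This is only an illustration, not how the designer stores categorical columns; the sample data and the `SMOKING_CODES` mapping are assumptions built from the example above.

```python
from collections import Counter

# Hypothetical encoding taken from the example above.
SMOKING_CODES = {0: "Smoker", 1: "Non-smoker", 2: "Unknown"}

smoking_status = [0, 1, 1, 2, 0, 1]

# Treated as numbers, the codes produce an average that has no meaning.
misleading_mean = sum(smoking_status) / len(smoking_status)

# Treated as categories, the codes are used only to group rows.
counts = Counter(SMOKING_CODES[code] for code in smoking_status)
```

Here `counts` groups the six rows by status, whereas `misleading_mean` is a number that corresponds to no real category.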

  5. Use the Fields option if you want to change the way that Azure Machine Learning uses the data in a model.

    • Feature: Use this option to flag a column as a feature in modules that operate only on feature columns. By default, all columns are initially treated as features.

    • Label: Use this option to mark the label, which is also known as the predictable attribute or target variable. Many modules require that exactly one label column is present in the dataset.

      In many cases, Azure Machine Learning can infer that a column contains a class label. By setting this metadata, you can ensure that the column is identified correctly. Setting this option does not change data values. It changes only the way that some machine-learning algorithms handle the data.

    Tip

    Do you have data that doesn't fit into these categories? For example, your dataset might contain values such as unique identifiers that aren't useful as variables. Sometimes such IDs can cause problems when used in a model.

    Fortunately, Azure Machine Learning keeps all of your data, so that you don't have to delete such columns from the dataset. When you need to perform operations on some special set of columns, just remove all other columns temporarily by using the Select Columns in Dataset module. Later you can merge the columns back into the dataset by using the Add Columns module.
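The select-then-merge pattern from the tip can be sketched in plain Python. The functions `select_columns` and `add_columns` below are hypothetical stand-ins named after the modules, not their actual implementation, and the sample rows are made up.

```python
rows = [
    {"patient_id": "A17", "age": 34, "bmi": 22.5},
    {"patient_id": "B42", "age": 51, "bmi": 27.1},
]


def select_columns(rows, names):
    """Keep only the named columns in every row."""
    return [{name: row[name] for name in names} for row in rows]


def add_columns(left, right):
    """Combine two row lists of equal length side by side."""
    if len(left) != len(right):
        raise ValueError("Both inputs must have the same number of rows")
    return [{**l, **r} for l, r in zip(left, right)]


# Set the identifier aside so it is not treated as a variable.
ids = select_columns(rows, ["patient_id"])
numeric = select_columns(rows, ["age", "bmi"])

# Work on the numeric columns only.
mean_age = sum(row["age"] for row in numeric) / len(numeric)

# Merge the identifier back once the numeric work is done.
restored = add_columns(ids, numeric)
```

Because no rows are added, removed, or reordered in between, the side-by-side merge lines the identifier up with the same rows it came from.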

  6. Use the following options to clear previous selections and restore metadata to the default values.

    • Clear feature: Use this option to remove the feature flag.

      All columns are initially treated as features. For modules that perform mathematical operations, you might need to use this option in order to prevent numeric columns from being treated as variables.

    • Clear label: Use this option to remove the label metadata from the specified column.

    • Clear score: Use this option to remove the score metadata from the specified column.

      You currently can't explicitly mark a column as a score in Azure Machine Learning. However, some operations result in a column being flagged as a score internally. Also, a custom R module might output score values.
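One way to picture the Fields and clear options is as flags attached to each column, separate from the column's values. The sketch below is a hypothetical model in plain Python: the `ColumnMetadata` class and its field names are assumptions for illustration and do not reflect the designer's internal representation.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class ColumnMetadata:
    """Hypothetical per-column flags; data values are stored elsewhere."""
    name: str
    is_feature: bool = True   # all columns start out as features
    is_label: bool = False
    is_score: bool = False


def mark_label(column):
    return replace(column, is_label=True)


def clear_feature(column):
    return replace(column, is_feature=False)


def clear_label(column):
    return replace(column, is_label=False)


def clear_score(column):
    return replace(column, is_score=False)


record_id = ColumnMetadata("record_id")
outcome = ColumnMetadata("outcome")

# Stop an identifier column from being treated as a variable.
record_id_cleared = clear_feature(record_id)

# Mark the target column, then undo the change.
outcome_labeled = mark_label(outcome)
outcome_reset = clear_label(outcome_labeled)
```

Each function returns a new object and leaves the original untouched, which keeps the "flags change, values do not" idea easy to see.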

  7. For New column names, enter the new name of the selected column or columns.

    • Column names can use only characters that are supported by UTF-8 encoding. Empty strings, nulls, or names that consist entirely of spaces aren't allowed.

    • To rename multiple columns, enter the names as a comma-separated list in order of the column indexes.

    • All selected columns must be renamed. You can't omit or skip columns.
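The renaming rules above can be sketched as a small validation routine. This is an illustration in plain Python under stated assumptions, not the module's code: the function name `rename_columns` is hypothetical, and trimming whitespace around each name is an assumption about how the comma-separated list is parsed.

```python
def rename_columns(columns, selected_indexes, new_names_text):
    """Return a copy of `columns` with the selected columns renamed.

    New names are matched to the selected columns in order of column index.
    """
    # Assumption: surrounding whitespace in each list entry is ignored.
    new_names = [name.strip() for name in new_names_text.split(",")]

    # Every selected column needs a name; none may be omitted or skipped.
    if len(new_names) != len(selected_indexes):
        raise ValueError("Provide exactly one new name per selected column")

    # Empty names and names made only of spaces are rejected.
    if any(not name for name in new_names):
        raise ValueError("Column names can't be empty or whitespace only")

    renamed = list(columns)
    for index, name in zip(sorted(selected_indexes), new_names):
        renamed[index] = name
    return renamed


original = ["c1", "c2", "c3"]
# Columns 0 and 2 are selected; the names apply in index order.
result = rename_columns(original, [2, 0], "first, third")
```

Sorting the selected indexes before pairing them with the names reflects the "in order of the column indexes" rule, regardless of the order in which the columns were picked.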

  8. Submit the pipeline.

Next steps

See the set of modules available to Azure Machine Learning.