Remove Duplicate Rows module

This article describes a module in Azure Machine Learning designer.

Use this module to remove potential duplicates from a dataset.

For example, assume your data looks like the following table, which contains multiple patient records.

PatientID  Initials  Gender  Age  Admitted
1          F.M.      M       53   Jan
2          F.A.M.    M       53   Jan
3          F.A.M.    M       24   Jan
3          F.M.      M       24   Feb
4          F.M.      M       23   Feb
           F.M.      M       23
5          F.A.M.    M       53
6          F.A.M.    M       NaN
7          F.A.M.    M       NaN

Clearly, this example has multiple columns with potentially duplicate data. Whether the values are actually duplicates depends on your knowledge of the data.

  • For example, you might know that many patients have the same name. In that case you would not eliminate duplicates using any name columns, only the ID column. That way, only the rows with duplicate ID values are filtered out, regardless of whether the patients have the same name.

  • Alternatively, you might decide to allow duplicates in the ID field, and use some other combination of fields to find unique records, such as first name, last name, age, and gender.

To set the criteria for whether a row is a duplicate, you specify a single column or a set of columns to use as keys. Two rows are considered duplicates only when the values in all key columns are equal. A row that is missing a value in any key column is never considered a duplicate. For example, if Gender and Age are set as keys in the table above, the rows with PatientID 6 and 7 are not duplicates, because both are missing a value in Age.
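The key rules above can be illustrated with a minimal Python sketch. This is not the module's implementation: `remove_duplicate_rows` and `is_missing` are hypothetical helpers written for this example, the rows mirror the sample table, and the sketch assumes that the first matching row is kept and that row order is preserved.

```python
import math

def is_missing(value):
    """Treat None and NaN as missing values (an assumption of this sketch)."""
    return value is None or (isinstance(value, float) and math.isnan(value))

def remove_duplicate_rows(rows, keys):
    """Hypothetical helper: keep the first row seen for each key combination."""
    seen = set()
    kept = []
    for row in rows:
        key = tuple(row[k] for k in keys)
        if any(is_missing(v) for v in key):
            # A row with a missing key value is never treated as a duplicate.
            kept.append(row)
            continue
        if key not in seen:
            seen.add(key)
            kept.append(row)
    return kept  # a new list; the input rows are left untouched

NAN = float("nan")
patients = [
    {"PatientID": 1, "Initials": "F.M.", "Gender": "M", "Age": 53, "Admitted": "Jan"},
    {"PatientID": 2, "Initials": "F.A.M.", "Gender": "M", "Age": 53, "Admitted": "Jan"},
    {"PatientID": 3, "Initials": "F.A.M.", "Gender": "M", "Age": 24, "Admitted": "Jan"},
    {"PatientID": 3, "Initials": "F.M.", "Gender": "M", "Age": 24, "Admitted": "Feb"},
    {"PatientID": 4, "Initials": "F.M.", "Gender": "M", "Age": 23, "Admitted": "Feb"},
    {"PatientID": None, "Initials": "F.M.", "Gender": "M", "Age": 23, "Admitted": None},
    {"PatientID": 5, "Initials": "F.A.M.", "Gender": "M", "Age": 53, "Admitted": None},
    {"PatientID": 6, "Initials": "F.A.M.", "Gender": "M", "Age": NAN, "Admitted": None},
    {"PatientID": 7, "Initials": "F.A.M.", "Gender": "M", "Age": NAN, "Admitted": None},
]

deduped = remove_duplicate_rows(patients, ["Gender", "Age"])
print([row["PatientID"] for row in deduped])  # [1, 3, 4, 6, 7]
```

With Gender and Age as keys, one row survives for each distinct age, and both rows with a missing Age are kept because they are never compared as duplicates.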

When you run the module, it creates a candidate dataset, and returns a set of rows that have no duplicates across the set of columns you specified.

Important

The source dataset is not altered; this module creates a new dataset that is filtered to exclude duplicates, based on the criteria you specify.

How to use Remove Duplicate Rows

  1. Add the module to your pipeline. You can find the Remove Duplicate Rows module under Data Transformation, Manipulation.

  2. Connect the dataset that you want to check for duplicate rows.

  3. In the Properties pane, under Key column selection filter expression, click Launch column selector to choose the columns to use in identifying duplicates.

    In this context, Key does not mean a unique identifier. All columns that you select using the Column Selector are designated as key columns. All unselected columns are considered non-key columns. The combination of columns that you select as keys determines the uniqueness of the records. (Think of it as a SQL statement that uses multiple equality joins.)

    Examples:

    • "I want to ensure that IDs are unique": Select only the ID column.
    • "I want to ensure that the combination of first name, last name, and ID is unique": Select all three columns.
  4. Use the Retain first duplicate row checkbox to indicate which row to return when duplicates are found:

    • If selected, the first row is returned and the others are discarded.
    • If cleared, the last duplicate row is kept in the results and the others are discarded.
  5. Submit the pipeline.

  6. To review the results, right-click the module, and select Visualize.
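The effect of the Retain first duplicate row option in step 4 can be sketched in plain Python. This is an illustration under stated assumptions rather than the module's actual code: `remove_duplicate_rows` and its `retain_first` parameter are hypothetical names, the sample rows contain no missing key values, and row order is assumed to be preserved.

```python
def remove_duplicate_rows(rows, keys, retain_first=True):
    """Hypothetical helper; retain_first mirrors the checkbox in step 4."""
    # To keep the last duplicate, scan the rows in reverse and keep the first hit.
    ordered = rows if retain_first else list(reversed(rows))
    seen = set()
    kept = []
    for row in ordered:
        key = tuple(row[k] for k in keys)
        if key not in seen:
            seen.add(key)
            kept.append(row)
    # Restore the original ordering when the scan ran in reverse.
    return kept if retain_first else list(reversed(kept))

visits = [
    {"PatientID": 3, "Initials": "F.A.M.", "Admitted": "Jan"},
    {"PatientID": 3, "Initials": "F.M.", "Admitted": "Feb"},
    {"PatientID": 4, "Initials": "F.M.", "Admitted": "Feb"},
]

first = remove_duplicate_rows(visits, ["PatientID"], retain_first=True)
last = remove_duplicate_rows(visits, ["PatientID"], retain_first=False)

print([(r["Initials"], r["Admitted"]) for r in first])  # [('F.A.M.', 'Jan'), ('F.M.', 'Feb')]
print([(r["Initials"], r["Admitted"]) for r in last])   # [('F.M.', 'Feb'), ('F.M.', 'Feb')]
```

Both calls return one row per PatientID; the option only changes which of the two rows for PatientID 3 survives.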

Tip

If the results are difficult to understand, or if you want to exclude some columns from consideration, you can remove columns by using the Select Columns in Dataset module.

Next steps

See the set of modules available to Azure Machine Learning.