Dedupe rows and find nulls by using data flow snippets

Applies to: Azure Data Factory, Azure Synapse Analytics

Code snippets in mapping data flows make common tasks such as data deduplication and null filtering easy to perform. This article explains how to add those functions to your pipelines by using data flow script snippets.

Create a pipeline

  1. Select New Pipeline.

  2. Add a data flow activity.

  3. Select the Source settings tab, add a source transformation, and then connect it to one of your datasets.

    Screenshot of the Source settings pane for adding a source type.

    The dedupe and null check snippets use generic patterns that take advantage of data flow schema drift. The snippets work with any schema from your dataset, or with datasets that have no predefined schema.

  4. Go to the Data flow script (DFS) documentation page and find the "Distinct row using all columns" section.

  5. Copy the code snippet for DistinctRows.

    Screenshot of the source snippet.

  6. In your script, after the definition for source1, press Enter, and then paste the code snippet.

  7. Do either of the following:

    • Connect the pasted code snippet to the source transformation that you created earlier in the graph by typing source1 in front of the pasted code.

    • Alternatively, connect the new transformation in the designer by selecting the incoming stream from the new transformation node in the graph.

      Screenshot of the Conditional Split settings pane.

    Your data flow now removes duplicate rows from your source by using the aggregate transformation, which groups rows by a general hash computed across all column values.

  8. Add a code snippet that splits your data into one stream that contains rows with nulls and another stream without nulls. To do so:

    a. Go back to the snippet library and this time copy the code for the NULL checks.

    b. In your data flow designer, select Script again, and then paste this new transformation code at the bottom. Connect the pasted code to your previous transformation by placing the name of that transformation in front of the pasted snippet.

    Your data flow graph should now look similar to this:

    Screenshot of the data flow graph.

You have now created a working data flow with generic deduping and null checks by taking existing code snippets from the Data Flow Script library and adding them to your existing design.
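To make the logic of the two snippets easier to picture, here is a minimal Python sketch of the same idea outside the service. It is not the data flow script itself, and it does not reproduce the snippet library code. The function names (`row_hash`, `distinct_rows`, `split_nulls`), the sample rows, the choice of SHA-256, and the rule of keeping the first row in each group are all assumptions made for illustration. What it shares with the steps above is the pattern: no column names are hard-coded, so it works for any schema.

```python
import hashlib


def row_hash(row, columns):
    # One hash over every column value. SHA-256 is an assumption here;
    # the article only says "a general hash across all column values".
    payload = "\x1f".join(repr(row.get(col)) for col in columns)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def distinct_rows(rows):
    # Discover the columns from the data itself instead of naming them,
    # which mirrors the schema-drift-friendly pattern described above.
    columns = sorted({col for row in rows for col in row})
    groups = {}
    for row in rows:
        # Group by the hash and keep the first row seen in each group
        # (keeping the first row is an illustrative assumption).
        groups.setdefault(row_hash(row, columns), row)
    return list(groups.values())


def split_nulls(rows):
    # Send each row to exactly one of two streams, depending on whether
    # any of its column values is null (None in Python).
    has_nulls = [r for r in rows if any(v is None for v in r.values())]
    no_nulls = [r for r in rows if all(v is not None for v in r.values())]
    return has_nulls, no_nulls


# Hypothetical sample data: one exact duplicate and one row with a null.
rows = [
    {"id": 1, "name": "Ana", "city": "Lima"},
    {"id": 1, "name": "Ana", "city": "Lima"},
    {"id": 2, "name": None, "city": "Oslo"},
    {"id": 3, "name": "Bo", "city": "Rome"},
]

unique = distinct_rows(rows)
has_nulls, no_nulls = split_nulls(unique)
```

With this sample, `unique` holds three rows, `has_nulls` holds the row with `id` 2, and `no_nulls` holds the rows with `id` 1 and 3. Hashing the whole row rather than comparing named columns is what lets the same code run unchanged against a dataset with a different schema.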

Next steps

  • Build the rest of your data flow logic by using mapping data flow transformations.