Join Data

This article describes how to use the Join Data module in Azure Machine Learning designer (preview) to merge two datasets using a database-style join operation.

How to configure Join Data

Two datasets can be joined only if they are related by a key column. Composite keys that span multiple columns are also supported.
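As a rough illustration of what a key column and a composite key mean, the following sketch joins two small tables held as plain Python lists of dictionaries. It is not the module's implementation; the function name `inner_join` and the sample column names (`region`, `year`, `area`, `yr`) are hypothetical and chosen only for this example.

```python
def inner_join(left, right, left_keys, right_keys):
    """Inner-join two lists of dicts on one or more key columns.

    left_keys and right_keys must list the key columns in the same order,
    because the values are compared position by position.
    """
    # Index the right rows by the tuple of their key values.
    index = {}
    for row in right:
        index.setdefault(tuple(row[k] for k in right_keys), []).append(row)

    joined = []
    for row in left:
        key = tuple(row[k] for k in left_keys)
        for match in index.get(key, []):
            # Simplification: left values win if column names clash.
            joined.append({**match, **row})
    return joined


# Hypothetical sample data.
left = [
    {"region": "east", "year": 2019, "sales": 10},
    {"region": "east", "year": 2020, "sales": 12},
    {"region": "west", "year": 2019, "sales": 7},
]
right = [
    {"area": "east", "yr": 2019, "target": 9},
    {"area": "west", "yr": 2019, "target": 8},
    {"area": "west", "yr": 2020, "target": 11},
]

# Single key: rows match when the region alone is equal.
single = inner_join(left, right, ["region"], ["area"])

# Composite key: rows match only when every key column is equal.
composite = inner_join(left, right, ["region", "year"], ["area", "yr"])
```

With this sample data the single-key join pairs every "east" row with every "east" row (and likewise for "west"), while the composite key narrows the result to rows that agree on both region and year.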

  1. Add the datasets you want to combine, and then drag the Join Data module into your pipeline.

    You can find the module in the Data Transformation category, under Manipulation.

  2. Connect the datasets to the Join Data module.

  3. Select Launch column selector to choose the key column(s). Remember to choose columns for both the left and right inputs.

    For a single key:

    Select a single key column for each of the two inputs.

    For a composite key:

    Select all the key columns from the left input and the right input in the same order. The Join Data module joins the tables when all key columns match. If the column order isn't the same as in the original table, select the option Allow duplicates and preserve column order in selection.

    [Image: column selector]

  4. Select the Match case option if the join on a text column should be case-sensitive.

  5. Use the Join type dropdown list to specify how the datasets should be combined.

    • Inner Join: An inner join is the most common join operation. It returns the combined rows only when the values of the key columns match.

    • Left Outer Join: A left outer join returns joined rows for all rows from the left table. When a row in the left table has no matching rows in the right table, the returned row contains missing values for all columns that come from the right table. You can also specify a replacement value for missing values.

    • Full Outer Join: A full outer join returns all rows from the left table (table1) and from the right table (table2).

      For each row in either table that has no matching row in the other, the result includes a row containing missing values.

    • Left Semi-Join: A left semi-join returns only the values from the left table when the values of the key columns match.

  6. For the option Keep right key columns in joined table:

    • Select this option to view the keys from both input tables.
    • Deselect this option to return only the key columns from the left input.
  7. Submit the pipeline.

  8. To view the results, right-click the Join Data module and select Visualize.
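The join types and options in the steps above can be sketched in plain Python to make their row-level behavior concrete. This is only an approximation under stated assumptions, not the module's implementation: the function `join_data`, its parameter names (`match_case`, `keep_right_key`, `fill`), the `_right` suffix, and the sample data are all hypothetical, and the sketch handles a single text key column only.

```python
def join_data(left, right, key, join_type="inner", match_case=True,
              keep_right_key=False, fill=None):
    """Join two lists of dicts on one text key column (illustrative only)."""

    def norm(value):
        # With match_case off, keys are compared case-insensitively.
        return value if match_case else value.lower()

    left_cols = list(left[0])
    right_cols = [c for c in right[0] if c != key]

    # Index the right rows by their (possibly normalized) key value.
    index = {}
    for row in right:
        index.setdefault(norm(row[key]), []).append(row)

    def combine(l_row, r_row):
        # A missing side is filled with the replacement value `fill`.
        # Simplification: a right-only row keeps its key only when
        # keep_right_key is True.
        out = {c: (l_row[c] if l_row else fill) for c in left_cols}
        if keep_right_key:
            out[key + "_right"] = r_row[key] if r_row else fill
        out.update({c: (r_row[c] if r_row else fill) for c in right_cols})
        return out

    result, matched = [], set()
    for l_row in left:
        hits = index.get(norm(l_row[key]), [])
        matched.update(id(r) for r in hits)
        if join_type == "left_semi":
            # Only left columns, and only for rows that have a match.
            if hits:
                result.append(dict(l_row))
        elif hits:
            result.extend(combine(l_row, r) for r in hits)
        elif join_type in ("left_outer", "full_outer"):
            result.append(combine(l_row, None))
    if join_type == "full_outer":
        # Append right rows that never matched any left row.
        result.extend(combine(None, r) for r in right if id(r) not in matched)
    return result


# Hypothetical sample data: keys differ in case ("A" vs "a").
left = [{"id": "A", "x": 1}, {"id": "b", "x": 2}, {"id": "c", "x": 3}]
right = [{"id": "a", "y": 10}, {"id": "b", "y": 20}, {"id": "d", "y": 40}]

inner = join_data(left, right, "id", "inner")
inner_nocase = join_data(left, right, "id", "inner", match_case=False)
left_outer = join_data(left, right, "id", "left_outer", fill=0)
full_outer = join_data(left, right, "id", "full_outer")
semi = join_data(left, right, "id", "left_semi")
both_keys = join_data(left, right, "id", "inner", keep_right_key=True)
```

In this sketch the case-sensitive inner join matches only the "b" rows, while the case-insensitive one also pairs "A" with "a". The left outer join keeps all three left rows and fills the unmatched `y` values with the replacement value, and the full outer join additionally appends the two right rows that had no match.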

Next steps

See the set of modules available to Azure Machine Learning.