映射数据流中的联接转换Join transformation in mapping data flow

适用于: Azure 数据工厂 Azure Synapse Analytics

使用联接转换可以在映射数据流中合并来自两个源或流的数据。Use the join transformation to combine data from two sources or streams in a mapping data flow. 输出流将包含这两个源中基于联接条件匹配的所有列。The output stream will include all columns from both sources matched based on a join condition.

联接类型Join types

映射数据流目前支持五种不同的联接类型。Mapping data flows currently supports five different join types.

内部联接Inner Join

内联仅输出在两个表中具有匹配值的行。Inner join only outputs rows that have matching values in both tables.

左外部Left Outer

左外部联接返回左侧流中的所有行以及右侧流中的匹配记录。Left outer join returns all rows from the left stream and matched records from the right stream. 如果左侧流中的某一行没有匹配项,则右侧流中的输出列将设置为 NULL。If a row from the left stream has no match, the output columns from the right stream are set to NULL. 输出为内联返回的行以及左侧流中不匹配的行。The output will be the rows returned by an inner join plus the unmatched rows from the left stream.

备注

由于联接条件中可能出现笛卡尔积,因此,数据流使用的 Spark 引擎有时会失败。The Spark engine used by data flows will occasionally fail due to possible cartesian products in your join conditions. 如果发生这种情况,可以切换到自定义交叉联接并手动输入联接条件。If this occurs, you can switch to a custom cross join and manually enter your join condition. 这样做可能会导致数据流性能下降,因为执行引擎可能需要计算关系两侧的所有行,然后对行进行筛选。This may result in slower performance in your data flows as the execution engine may need to calculate all rows from both sides of the relationship and then filter rows.

右外部Right Outer

右外部联接返回右侧流中的所有行以及左侧流中的匹配记录。Right outer join returns all rows from the right stream and matched records from the left stream. 如果右侧流中的某一行没有匹配项,则左侧流中的输出列将设置为 NULL。If a row from the right stream has no match, the output columns from the left stream are set to NULL. 输出为内联返回的行以及右侧流中不匹配的行。The output will be the rows returned by an inner join plus the unmatched rows from the right stream.

完全外部Full Outer

完全外部联接输出两侧中的所有列和行,其中不匹配的列为 NULL 值。Full outer join outputs all columns and rows from both sides with NULL values for columns that aren't matched.

自定义交叉联接Custom cross join

交叉联接根据条件输出两个流的叉积。Cross join outputs the cross product of the two streams based upon a condition. 如果使用的条件不相等,请指定一个自定义表达式作为交叉联接条件。If you're using a condition that isn't equality, specify a custom expression as your cross join condition. 输出流为所有符合联接条件的行。The output stream will be all rows that meet the join condition.

可以将此联接类型用于非等值联接和 OR 条件。You can use this join type for non-equi joins and OR conditions.

如果要显式生成完整的笛卡尔积,请在联接之前,在两个独立流的每个流中使用派生列转换来创建要匹配的合成键。If you would like to explicitly produce a full cartesian product, use the Derived Column transformation in each of the two independent streams before the join to create a synthetic key to match on. 例如,在每个名为 SyntheticKey 的流的派生列中创建一个新列,并将其设置为等于 1For example, create a new column in Derived Column in each stream called SyntheticKey and set it equal to 1. 然后使用 a.SyntheticKey == b.SyntheticKey 作为自定义联接表达式。Then use a.SyntheticKey == b.SyntheticKey as your custom join expression.

备注

在自定义交叉联接中,确保包含左右关系每一侧中的至少一列。Make sure to include at least one column from each side of your left and right relationship in a custom cross join. 如果对静态值而不是每一侧的列执行交叉联接,则会对整个数据集进行完全扫描,从而导致数据流性能不佳。Executing cross joins with static values instead of columns from each side results in full scans of the entire dataset, causing your data flow to perform poorly.

配置Configuration

  1. 在“右侧流”下拉菜单中选择要联接的数据流。Choose which data stream you're joining with in the Right stream dropdown.
  2. 选择“联接类型”Select your Join type
  3. 选择要根据联接条件进行匹配的键列。Choose which key columns you want to match on for you join condition. 默认情况下,数据流在每个流中的一列之间查找相等性。By default, data flow looks for equality between one column in each stream. 若要通过计算值进行比较,请将鼠标悬停在列下拉菜单上,然后选择“计算列”。To compare via a computed value, hover over the column dropdown and select Computed column.

联接转换Join Transformation

非等值联接Non-equi joins

若要在联接条件中使用条件运算符,例如不等于 (!=) 或大于 (>),请在两列之间更改运算符下拉菜单。To use a conditional operator such as not equals (!=) or greater than (>) in your join conditions, change the operator dropdown between the two columns. 非等值联接要求使用“优化”选项卡中的“固定”广播来广播两个流中的至少一个流。Non-equi joins require at least one of the two streams to be broadcasted using Fixed broadcasting in the Optimize tab.

非等值联接Non-equi join

优化联接性能Optimizing join performance

与 SSIS 等工具中的合并联接不同,联接转换不是强制性的合并联接操作。Unlike merge join in tools like SSIS, the join transformation isn't a mandatory merge join operation. 联接键无需排序。The join keys don't require sorting. 联接操作基于 Spark 中的最佳联接操作(广播或映射侧联接)进行。The join operation occurs based on the optimal join operation in Spark, either broadcast or map-side join.

联接转换优化Join Transformation optimize

在联接、查找和存在转换中,如果工作器节点内存可容纳一个数据流或同时容纳两个数据流,则可以通过启用“广播”来优化性能。In joins, lookups and exists transformation, if one or both data streams fit into worker node memory, you can optimize performance by enabling Broadcasting. 默认情况下,Spark 引擎将自动决定是否广播一侧。By default, the spark engine will automatically decide whether or not to broadcast one side. 若要手动选择要广播的一侧,请选择“固定”。To manually choose which side to broadcast, select Fixed.

建议不要通过“关闭”选项来禁用广播,除非联接遇到超时错误。It's not recommended to disable broadcasting via the Off option unless your joins are running into timeout errors.

自联接Self-Join

若要对数据流进行自联接,请使用选择转换为现有流添加别名。To self-join a data stream with itself, alias an existing stream with a select transformation. 通过单击转换旁边的加号图标并选择“新建分支”,来创建新的分支。Create a new branch by clicking on the plus icon next to a transformation and selecting New branch. 添加选择转换,以便为原始流添加别名。Add a select transformation to alias the original stream. 添加联接转换,并选择原始流作为“左侧流”,选择转换作为“右侧流”。Add a join transformation and choose the original stream as the Left stream and the select transformation as the Right stream.

自联接Self-join

测试联接条件Testing join conditions

在调试模式下使用数据预览测试联接转换时,请使用一小组已知数据。When testing the join transformations with data preview in debug mode, use a small set of known data. 对大型数据集中的行进行采样时,无法预测将读取哪些行和键进行测试。When sampling rows from a large dataset, you can't predict which rows and keys will be read for testing. 结果是非确定性的,这意味着你的联接条件可能不会返回任何匹配项。The result is non-deterministic, meaning that your join conditions may not return any matches.

数据流脚本Data flow script

语法Syntax

<leftStream>, <rightStream>
    join(
        <conditionalExpression>,
        joinType: { 'inner'> | 'outer' | 'left_outer' | 'right_outer' | 'cross' }
        broadcast: { 'auto' | 'left' | 'right' | 'both' | 'off' }
    ) ~> <joinTransformationName>

内联示例Inner join example

下面的示例是一个名为 JoinMatchedData 的联接转换,它采用左侧流 TripData 和右侧流 TripFareThe below example is a join transformation named JoinMatchedData that takes left stream TripData and right stream TripFare. 联接条件为表达式 hack_license == { hack_license} && TripData@medallion == TripFare@medallion && vendor_id == { vendor_id} && pickup_datetime == { pickup_datetime},如果每个流中的 hack_licensemedallionvendor_idpickup_datetime 列均匹配,则返回 true。The join condition is the expression hack_license == { hack_license} && TripData@medallion == TripFare@medallion && vendor_id == { vendor_id} && pickup_datetime == { pickup_datetime} that returns true if the hack_license, medallion, vendor_id, and pickup_datetime columns in each stream match. joinType'inner'The joinType is 'inner'. 我们仅在左侧流中启用广播,因此 broadcast 的值为 'left'We're enabling broadcasting in only the left stream so broadcast has value 'left'.

在数据工厂 UX 中,此转换如下图所示:In the Data Factory UX, this transformation looks like the below image:

屏幕截图显示转换,其中包含已选择的“联接设置”选项卡和内部联接类型。Screenshot shows the transformation with the Join Settings tab selected and a Join type of Inner.

此转换的数据流脚本位于下面的代码片段中:The data flow script for this transformation is in the snippet below:

TripData, TripFare
    join(
        hack_license == { hack_license}
        && TripData@medallion == TripFare@medallion
        && vendor_id == { vendor_id}
        && pickup_datetime == { pickup_datetime},
        joinType:'inner',
        broadcast: 'left'
    )~> JoinMatchedData

自定义交叉联接示例Custom cross join example

下面的示例是一个名为 JoiningColumns 的联接转换,它采用左侧流 LeftStream 和右侧流 RightStreamThe below example is a join transformation named JoiningColumns that takes left stream LeftStream and right stream RightStream. 此转换接收两个流,并将 leftstreamcolumn 列大于 rightstreamcolumn 列的所有行联接在一起。This transformation takes in two streams and joins together all rows where column leftstreamcolumn is greater than column rightstreamcolumn. joinTypecrossThe joinType is cross. 未启用广播,因此 broadcast 的值为 'none'Broadcasting is not enabled broadcast has value 'none'.

在数据工厂 UX 中,此转换如下图所示:In the Data Factory UX, this transformation looks like the below image:

屏幕截图显示转换,其中包含已选择的“联接设置”选项卡和自定义(交叉)联接类型。Screenshot shows the transformation with the Join Settings tab selected and a Join type of Custom (cross).

此转换的数据流脚本位于下面的代码片段中:The data flow script for this transformation is in the snippet below:

LeftStream, RightStream
    join(
        leftstreamcolumn > rightstreamcolumn,
        joinType:'cross',
        broadcast: 'none'
    )~> JoiningColumns

后续步骤Next steps

联接数据后,创建一个派生列并将数据接收到目标数据存储中。After joining data, create a derived column and sink your data to a destination data store.