范围联接优化Range Join optimization

当使用间隔中的点或间隔重叠条件联接两个关系时,将发生“范围联接”。A range join occurs when two relations are joined using a point in interval or interval overlap condition. Databricks Runtime 中的范围联接优化支持可以在查询性能方面带来数量级的改进,但需要仔细地进行手动优化。The range join optimization support in Databricks Runtime can bring orders of magnitude improvement in query performance, but requires careful manual tuning.

间隔中的点范围联接Point in interval range join

“间隔中的点范围联接”是一种联接,其中的条件包含的谓词指定一个关系中的值介于另一个关系中的两个值之间。A point in interval range join is a join in which the condition contains predicates specifying that a value from one relation is between two values from the other relation. 例如: 。For example:

-- using BETWEEN expressions
SELECT *
FROM points JOIN ranges ON points.p BETWEEN ranges.start and ranges.end;

-- using inequality expressions
SELECT *
FROM points JOIN ranges ON points.p >= ranges.start AND points.p < ranges.end;

-- with fixed length interval
SELECT *
FROM points JOIN ranges ON points.p >= ranges.start AND points.p < ranges.start + 100;

-- join two sets of point values within a fixed distance from each other
SELECT *
FROM points1 p1 JOIN points2 p2 ON p1.p >= p2.p - 10 AND p1.p <= p2.p + 10;

-- a range condition together with other join conditions
SELECT *
FROM points, ranges
WHERE points.symbol = ranges.symbol
  AND points.p >= ranges.start
  AND points.p < ranges.end;

间隔重叠范围联接Interval overlap range join

“间隔重叠范围联接”是一种联接,其中的条件包含的谓词指定每个关系中两个值之间的间隔的重叠。An interval overlap range join is a join in which the condition contains predicates specifying an overlap of intervals between two values from each relation. 例如: 。For example:

-- overlap of [r1.start, r1.end] with [r2.start, r2.end]
SELECT *
FROM r1 JOIN r2 ON r1.start < r2.end AND r2.start < r1.end;

-- overlap of fixed length intervals
SELECT *
FROM r1 JOIN r2 ON r1.start < r2.start + 100 AND r2.start < r1.start + 100;

-- a range condition together with other join conditions
SELECT *
FROM r1 JOIN r2 ON r1.symbol = r2.symbol
  AND r1.start <= r2.end
  AND r1.end >= r2.start;

范围联接优化Range join optimization

范围联接优化针对满足以下条件的联接执行:The range join optimization is performed for joins that:

  • 有一个条件可解释为间隔中的点范围联接或间隔重叠范围联接。Have a condition that can be interpreted as a point in interval or interval overlap range join.
  • 范围联接条件中涉及的所有值均为数值类型(整型、浮点、小数)、DATETIMESTAMPAll values involved in the range join condition are of a numeric type (integral, floating point, decimal), DATE, or TIMESTAMP.
  • 范围联接条件中涉及的所有值都属于同一类型。All values involved in the range join condition are of the same type. 对于十进制类型,这些值还需要具有相同的刻度和精度。In the case of the decimal type, the values also need to be of the same scale and precision.
  • 这是一个 INNER JOIN,或者,在间隔中的点范围联接中,这是在左侧具有点值的 LEFT OUTER JOIN,或者是在右侧具有点值的 RIGHT OUTER JOINIt is an INNER JOIN, or in case of point in interval range join, a LEFT OUTER JOIN with point value on the left side, or RIGHT OUTER JOIN with point value on the right side.
  • 具有箱大小优化参数。Have a bin size tuning parameter.

箱大小Bin size

“箱大小”是一个数值优化参数,它将范围条件的值域拆分为多个大小相等的“箱”。The bin size is a numeric tuning parameter that splits the values domain of the range condition into multiple bins of equal size. 例如,如果箱大小为 10,则优化会将域拆分为长度为 10 的间隔的箱。For example, with a bin size of 10, the optimization splits the domain into bins that are intervals of length 10. 如果你在范围条件 p BETWEEN start AND end 中有一个点,并且 start 为 8,end 为 22,则此值间隔与长度为 10 的三个箱重叠 – 第一个箱从 0 到 10,第二个箱从 10 到 20,第三个箱从 20 到 30。If you have a point in range condition of p BETWEEN start AND end, and start is 8 and end is 22, this value interval overlaps with three bins of length 10 – the first bin from 0 to 10, the second bin from 10 to 20, and the third bin from 20 to 30. 只有位于相同的三个箱中的点需要被视为该间隔的可能的联接匹配项。Only the points that fall within the same three bins need to be considered as possible join matches for that interval. 例如,如果 p 为 32,则可以排除它位于 start 为 8 且 end 为 22 的范围中,因为它位于从 30 到 40 的箱中。For example, if p is 32, it can be ruled out as falling between start of 8 and end of 22, because it falls in the bin from 30 to 40.

备注

  • 对于 DATE 值,箱大小的值被解释为天数。For DATE values, the value of the bin size is interpreted as days. 例如,箱大小值为 7 表示一周。For example, a bin size value of 7 represents a week.
  • 对于 TIMESTAMP 值,箱大小的值被解释为秒数。For TIMESTAMP values, the value of the bin size is interpreted as seconds. 如果需要亚秒值,则可以使用小数值。If a sub-second value is required, fractional values can be used. 例如,箱大小值为 60 表示一分钟,而箱大小值为 0.1 表示 100 毫秒。For example, a bin size value of 60 represents a minute, and a bin size value of 0.1 represents 100 milliseconds.

你可以通过在查询中使用范围联接提示或通过设置会话配置参数来指定箱大小。You can specify the bin size either by using a range join hint in the query or by setting a session configuration parameter. 仅当手动指定箱大小时,才会应用范围联接优化。The range join optimization is applied only if you manually specify the bin size. 选择箱大小部分介绍了如何选择最佳箱大小。Section Choose the bin size describes how to choose an optimal bin size.

使用范围联接提示启用范围联接Enable range join using a range join hint

若要在 SQL 查询中启用范围联接优化,可以使用范围联接提示来指定箱大小。To enable the range join optimization in a SQL query, you can use a range join hint to specify the bin size. 提示必须包含所联接的关系之一的关系名称和箱大小数值参数。The hint must contain the relation name of one of the joined relations and the numeric bin size parameter. 关系名称可以是表、视图或子查询。The relation name can be a table, a view, or a subquery.

SELECT /*+ RANGE_JOIN(points, 10) */ *
FROM points JOIN ranges ON points.p >= ranges.start AND points.p < ranges.end;

SELECT /*+ RANGE_JOIN(r1, 0.1) */ *
FROM (SELECT * FROM ranges WHERE ranges.amount < 100) r1, ranges r2
WHERE r1.start < r2.start + 100 AND r2.start < r1.start + 100;

SELECT /*+ RANGE_JOIN(c, 500) */ *
FROM a
  JOIN b ON (a.b_key = b.id)
  JOIN c ON (a.ts BETWEEN c.start_time AND c.end_time)

备注

在第三个示例中,你必须在 c 上放置提示。In the third example, you must place the hint on c. 这是因为联接是左结合,因此,该查询将被解释为 (a JOIN b) JOIN ca 上的提示应用于 ab 的联接,而不是与 c 的联接。This is because joins are left associative, so the query is interpreted as (a JOIN b) JOIN c, and the hint on a applies to the join of a with b and not the join with c.

你还可以在某个已联接的数据帧上放置范围联接提示。You can also place a range join hint on one of the joined DataFrames. 在这种情况下,提示仅包含箱大小数值参数。In that case, the hint contains just the numeric bin size parameter.

val df1 = spark.table("ranges").as("left")
val df2 = spark.table("ranges").as("right")

val joined = df1.hint("range_join", 10)
  .join(df2, $"left.type" === $"right.type" &&
     $"left.end" > $"right.start" &&
     $"left.start" < $"right.end")

val joined2 = df1
  .join(df2.hint("range_join", 0.5), $"left.type" === $"right.type" &&
     $"left.end" > $"right.start" &&
     $"left.start" < $"right.end")

使用会话配置启用范围联接Enable range join using session configuration

如果你不想修改查询,则可以将箱大小指定为配置参数。If you don’t want to modify the query, you can specify the bin size as a configuration parameter.

SET spark.databricks.optimizer.rangeJoin.binSize=5

此配置适用于具有范围条件的任何联接。This configuration applies to any join with a range condition. 但是,通过范围联接提示设置的不同箱大小始终会覆盖通过配置设置的大小。However, a different bin size set through a range join hint always overrides the one set through the configuration.

选择箱大小 Choose the bin size

范围联接优化的有效性取决于选择的箱大小是否合适。The effectiveness of the range join optimization depends on choosing the appropriate bin size.

箱大小越小,箱数量越多,这有助于筛选潜在的匹配项。A small bin size results in a larger number of bins, which helps in filtering the potential matches. 但是,如果箱大小显著小于所遇的值间隔,并且值间隔与多个箱间隔重叠,则会降低效率。However, it becomes inefficient if the bin size is significantly smaller than the encountered value intervals, and the value intervals overlap multiple bin intervals. 例如,如果条件为 p BETWEEN start AND end,其中 start 为 1,000,000 且 end 为 1,999,999,箱大小为 10,则值间隔与 100,000 个箱重叠。For example, with a condition p BETWEEN start AND end, where start is 1,000,000 and end is 1,999,999, and a bin size of 10, the value interval overlaps with 100,000 bins.

如果间隔的长度比较统一并且是已知的,则建议你将箱大小设置为值间隔的典型预期长度。If the length of the interval is fairly uniform and known, we recommend that you set the bin size to the typical expected length of the value interval. 但是,如果间隔的长度是变化且扭曲的,则必须设置一个能有效过滤短间隔的箱大小,同时防止长间隔重叠过多的箱,这就需要进行平衡。假设有一个 ranges 表,且 startend 列之间存在间隔,则可使用以下查询确定扭曲间隔长度值的不同百分位数:However, if the length of the interval is varying and skewed, a balance must be found to set a bin size that filters the short intervals efficiently, while preventing the long intervals from overlapping too many bins. Assuming a table ranges, with intervals that are between columns start and end, you can determine different percentiles of the skewed interval length value with the following query:

SELECT APPROX_PERCENTILE(CAST(end - start AS DOUBLE), ARRAY(0.5, 0.9, 0.99, 0.999, 0.9999)) FROM ranges

建议设置的箱大小为第 90 个百分位的值的最大值,或第 99 个百分位的值除以 10,或第 99.9 个百分位的值除以 100,依此类推。A recommended setting of bin size would be the maximum of the value at the 90th percentile, or the value at the 99th percentile divided by 10, or the value at the 99.9th percentile divided by 100 and so on. 基本原理是:The rationale is:

  • 如果第 90 个百分位的值是箱大小,则只有 10% 的值间隔长度大于箱间隔,因此将跨 2 个以上的相邻箱间隔。If the value at the 90th percentile is the bin size, only 10% of the value interval lengths are longer than the bin interval, so span more than 2 adjacent bin intervals.
  • 如果第 99 个百分位的值是箱大小,则只有 1% 的值间隔长度跨 11 个以上的相邻箱间隔。If the value at the 99th percentile is the bin size, only 1% of the value interval lengths span more than 11 adjacent bin intervals.
  • 如果第 99.9 个百分位的值是箱大小,则只有 0.1% 的值间隔长度跨 101 个以上的相邻箱间隔。If the value at the 99.9th percentile is the bin size, only 0.1% of the value interval lengths span more than 101 adjacent bin intervals.
  • 如果需要,可以针对第 99.99 个百分位、第 99.999 个百分位等百分位的值重复相同的步骤。The same can be repeated for the values at the 99.99th, the 99.999th percentile, and so on if needed.

上述方法限制了与多个箱间隔重叠的扭曲长值间隔的数量。The described method limits the amount of skewed long value intervals that overlap multiple bin intervals. 这样获得的箱大小值只是进行优调的起点,实际结果可能取决于具体的工作负荷。The bin size value obtained this way is only a starting point for fine tuning; actual results may depend on the specific workload.