倾斜联接优化 Skew Join optimization

数据倾斜是指表的数据不均匀地分布在群集的多个分区中的一种情况。Data skew is a condition in which a table’s data is unevenly distributed among partitions in the cluster. 数据偏斜可能会严重降低查询(尤其是那些使用联接的查询)的性能。Data skew can severely downgrade performance of queries, especially those with joins. 大表之间的联接需要混排数据,而倾斜可能导致群集中的工作极其不平衡。Joins between big tables require shuffling data and the skew can lead to an extreme imbalance of work in the cluster. 如果一个查询在完成很少的任务(例如,200 个任务中的最后 3 个任务)时停滞不前,那么很可能是数据倾斜在影响该查询。It’s likely that data skew is affecting a query if a query appears to be stuck finishing very few tasks (for example, the last 3 tasks out of 200). 要验证:To verify:

  1. 单击停滞的阶段,验证它是否正在进行联接。Click the stage that is stuck and verify that it is doing a join.
  2. 查询完成后,请找到执行联接的阶段并检查任务持续时间分布。After the query finishes, find the stage that does a join and check the task duration distribution.
  3. 通过减少持续时间对任务进行排序,并检查前几个任务。Sort the tasks by decreasing duration and check the first few tasks. 如果一个任务完成的时间比其他任务长得多,则存在倾斜。If one task took much longer to complete than the other tasks, there is skew.

为了减轻倾斜,Azure Databricks 上的 Delta Lake SQL 会接受查询中的倾斜提示。To ameliorate skew, Delta Lake on Azure Databricks SQL accepts skew hints in queries. 使用这些提示中的信息,Spark 可以构造更好的查询计划,该计划不会受到数据倾斜的影响。With the information from these hints, Spark can construct a better query plan, one that does not suffer from data skew.

仅关系名称Only relation name

倾斜提示必须至少包含具有倾斜的关系的名称。A skew hint must contain at least the name of the relation with skew. 关系是表、视图或子查询。A relation is a table, view, or a subquery. 然后,与此关系进行的所有联接都会使用倾斜联接优化。All joins with this relation then use skew join optimization.

-- table with skew
SELECT /*+ SKEW('orders') */ * FROM orders, customers WHERE c_custId = o_custId

-- subquery with skew
SELECT /*+ SKEW('C1') */ *
  FROM (SELECT * FROM customers WHERE c_custId < 100) C1, orders
  WHERE C1.c_custId = o_custId

关系和列Relation and columns

一个关系可能有多个联接,只有其中一部分会受到倾斜的影响。There might be multiple joins on a relation and only some of them will suffer from skew. 倾斜联接优化有一些开销,因此最好是只在需要时使用它。Skew join optimization has some overhead so it is better to use it only when needed. 倾斜提示会为此接受列名。For this purpose, the skew hint accepts column names. 只有使用这些列的联接才使用倾斜联接优化。Only joins with these columns use skew join optimization.

-- single column
SELECT /*+ SKEW('orders', 'o_custId') */ *
  FROM orders, customers
  WHERE o_custId = c_custId

-- multiple columns
SELECT /*+ SKEW('orders', ('o_custId', 'o_storeRegionId')) */ *
  FROM orders, customers
  WHERE o_custId = c_custId AND o_storeRegionId = c_regionId

关系、列和倾斜值Relation, columns, and skew values

你还可以在提示中指定倾斜值。You can also specify skew values in the hint. 倾斜值可能是已知的值(例如,是从不改变的值),也可能是很容易找到的值,具体取决于查询和数据。这样做可以减少倾斜联接优化的开销。Depending on the query and data, the skew values might be known (for example, because they never change) or might be easy to find out. Doing this reduces the overhead of skew join optimization. 如果不这样做,则 Delta Lake 会自动检测这些值。Otherwise, Delta Lake detects them automatically.

-- single column, single skew value
SELECT /*+ SKEW('orders', 'o_custId', 0) */ *
  FROM orders, customers
  WHERE o_custId = c_custId

-- single column, multiple skew values
SELECT /*+ SKEW('orders', 'o_custId', (0, 1, 2)) */ *
  FROM orders, customers
  WHERE o_custId = c_custId

-- multiple columns, multiple skew values
SELECT /*+ SKEW('orders', ('o_custId', 'o_storeRegionId'), ((0, 1001), (1, 1002))) */ *
  FROM orders, customers
  WHERE o_custId = c_custId AND o_storeRegionId = c_regionId