How to specify skew hints in Dataset and DataFrame-based join commands

When you perform a join command with DataFrame or Dataset objects and the query is stuck finishing a small number of tasks because of data skew, you can specify a skew hint with the hint("skew") method: df.hint("skew"). The skew join optimization is performed on the DataFrame for which you specify the skew hint.

In addition to the basic hint, you can call the hint method with the following combinations of parameters: a column name, a list of column names, or a column name and a skew value.

  • DataFrame and column name. The skew join optimization is performed on the specified column of the DataFrame.

    df.hint("skew", "col1")
    
  • DataFrame and multiple columns. The skew join optimization is performed for multiple columns in the DataFrame.

    df.hint("skew", ["col1","col2"])
    
  • DataFrame, column name, and skew value. The skew join optimization is performed on the rows whose value in the specified column equals the skew value.

    df.hint("skew", "col1", "value")
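
The column-and-value form needs a concrete skew value. The following Scala sketch shows one possible way to pick it: take the most frequent value in the join column and pass it to the hint. The helper mostFrequentValue, the object name, and the sample data are hypothetical and not part of the original article. The sketch assumes a runtime that honors the skew hint; open-source Spark is expected to ignore an unrecognized hint with a warning rather than fail.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, desc}

object SkewValueSketch {
  // Hypothetical helper: returns the most frequent value of `column`
  // as a simple heuristic for choosing a skew value.
  // Throws if the DataFrame is empty.
  def mostFrequentValue(df: DataFrame, column: String): Any =
    df.groupBy(col(column)).count().orderBy(desc("count")).first().get(0)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("skew-value-sketch")
      .master("local[*]") // assumption: local run; an existing session is reused if present
      .getOrCreate()
    import spark.implicits._

    // Hypothetical sample data in which the key "a" dominates col1.
    val df = Seq(("a", 1), ("a", 2), ("a", 3), ("b", 4)).toDF("col1", "col2")

    // Pick the dominant key and pass it as the skew value.
    val skewValue = mostFrequentValue(df, "col1").toString
    val hinted = df.hint("skew", "col1", skewValue)

    // The hint only takes effect when `hinted` is later used in a join.
    hinted.show()

    spark.stop() // local run only
  }
}
```

On real data, the same grouping query can also be used to confirm that a column is actually skewed before adding any hint.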
    

Example

This example shows how to specify the skew hint for multiple DataFrame objects involved in a join operation:

val joinResults = ds1.hint("skew").as("L").join(ds2.hint("skew").as("R"), $"L.col1" === $"R.col1")
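
For readers who want to try the join above end to end, here is a minimal self-contained sketch. The sample rows, the column names left_value and right_value, and the object name are assumptions made for illustration; only the join expression itself comes from the example. Whether the hint changes the physical plan depends on the runtime, so the sketch prints the plan for inspection instead of assuming a particular result.

```scala
import org.apache.spark.sql.SparkSession

object SkewJoinSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("skew-join-sketch")
      .master("local[*]") // assumption: local run; an existing session is reused if present
      .getOrCreate()
    import spark.implicits._

    // Hypothetical sample data: most rows in ds1 share the join key 1.
    val ds1 = Seq((1, "x"), (1, "y"), (1, "z"), (2, "w")).toDF("col1", "left_value")
    val ds2 = Seq((1, "p"), (2, "q")).toDF("col1", "right_value")

    // Same join as in the example: both sides carry the skew hint.
    val joinResults = ds1.hint("skew").as("L")
      .join(ds2.hint("skew").as("R"), $"L.col1" === $"R.col1")

    // Print the physical plan to see how the runtime treated the hint.
    joinResults.explain()
    joinResults.show()

    spark.stop() // local run only
  }
}
```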