How to improve performance with bucketing

Bucketing is an optimization technique in Apache Spark SQL. Data is distributed across a specified number of buckets according to values derived from one or more bucketing columns. Bucketing improves performance by shuffling and sorting data before downstream operations such as table joins. The tradeoff is the up-front cost of that shuffle and sort, but for certain data transformations paying it once avoids later shuffling and sorting and so improves overall performance.
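To make the idea concrete, here is a minimal pure-Python sketch of bucket assignment, not Spark code. It assumes a CRC32 hash as a stand-in for Spark's internal hash function, and the bucket count, table contents, and function names are all hypothetical. It illustrates why two tables bucketed on the same key with the same number of buckets can be joined bucket by bucket without redistributing rows.

```python
import zlib
from collections import defaultdict

NUM_BUCKETS = 4  # hypothetical bucket count, chosen only for illustration


def bucket_id(key, num_buckets=NUM_BUCKETS):
    """Map a key to a bucket. CRC32 is a stand-in, not Spark's actual hash."""
    return zlib.crc32(str(key).encode("utf-8")) % num_buckets


def bucketize(rows, num_buckets=NUM_BUCKETS):
    """Group (key, value) rows by bucket and sort each bucket by key."""
    buckets = defaultdict(list)
    for row in rows:
        buckets[bucket_id(row[0], num_buckets)].append(row)
    return {b: sorted(rs, key=lambda r: r[0]) for b, rs in buckets.items()}


def join_bucketed(left, right):
    """Join two bucketed tables one bucket at a time.

    Matching keys always land in the same bucket id on both sides,
    so no row has to be compared against a different bucket.
    """
    joined = []
    for b, left_rows in left.items():
        for lkey, lval in left_rows:
            for rkey, rval in right.get(b, []):
                if lkey == rkey:
                    joined.append((lkey, lval, rval))
    return sorted(joined)


# Hypothetical sample tables: (customer_id, item) and (customer_id, name).
orders = bucketize([(1, "book"), (2, "pen"), (3, "lamp"), (1, "desk")])
customers = bucketize([(1, "Ana"), (2, "Bo"), (3, "Cy")])

print(join_bucketed(orders, customers))
```

The bucketing work (hashing and sorting) happens once in `bucketize`; every later join on the same key can reuse that layout, which mirrors the tradeoff described above.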

This technique is useful for dimension tables, which are frequently used tables containing primary keys. It is also useful when join operations between large and small tables are frequent.

The example notebook below shows the differences in physical plans when performing joins of bucketed and unbucketed tables.

Bucketing example notebook
