Dynamic file pruning

Dynamic file pruning (DFP) can significantly improve the performance of many queries on Delta tables. DFP is especially efficient for non-partitioned tables and for joins on non-partitioned columns. The performance impact of DFP is often correlated with how well the data is clustered, so consider using Z-Ordering to maximize the benefit of DFP.

For background and use cases for DFP, see Faster SQL Queries on Delta Lake with Dynamic File Pruning.

Note

Available in Databricks Runtime 6.1 and above.

DFP is controlled by the following Apache Spark configuration options:

  • spark.databricks.optimizer.dynamicPartitionPruning (default is true): The main flag that directs the optimizer to push down DFP filters. When set to false, DFP is not in effect.
  • spark.databricks.optimizer.deltaTableSizeThreshold (default is 10,000,000,000 bytes (10 GB)): Represents the minimum size (in bytes) of the Delta table on the probe side of the join required to trigger DFP. If the probe side is not very large, pushing down the filters is probably not worthwhile, and the whole table can simply be scanned. You can find the size of a Delta table by running the DESCRIBE DETAIL table_name command and then looking at the sizeInBytes column.
  • spark.databricks.optimizer.deltaTableFilesThreshold (default is 1000): Represents the minimum number of files of the Delta table on the probe side of the join required to trigger DFP. When the probe side table contains fewer files than the threshold value, DFP is not triggered. If a table has only a few files, enabling DFP is probably not worthwhile. You can find the number of files in a Delta table by running the DESCRIBE DETAIL table_name command and then looking at the numFiles column.
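
The three options above act together as a gate on the probe-side table. The following is a minimal Python sketch of that gate, not the optimizer's actual implementation. It assumes that the size and file-count thresholds must both be met and that a value exactly equal to a threshold qualifies. TableDetail and meets_dfp_thresholds are hypothetical names introduced only for illustration. Passing the gate is a necessary condition according to the descriptions above; the optimizer may still decide not to apply DFP for other reasons.

```python
from dataclasses import dataclass

# Default values quoted in the option list above.
DEFAULT_SIZE_THRESHOLD_BYTES = 10_000_000_000  # deltaTableSizeThreshold (10 GB)
DEFAULT_FILES_THRESHOLD = 1000                 # deltaTableFilesThreshold


@dataclass(frozen=True)
class TableDetail:
    """Hypothetical holder for two columns of DESCRIBE DETAIL output."""
    size_in_bytes: int  # the sizeInBytes column
    num_files: int      # the numFiles column


def meets_dfp_thresholds(
    detail: TableDetail,
    dfp_enabled: bool = True,
    size_threshold: int = DEFAULT_SIZE_THRESHOLD_BYTES,
    files_threshold: int = DEFAULT_FILES_THRESHOLD,
) -> bool:
    """Return True if a probe-side table passes the documented thresholds.

    Assumption: both thresholds must be met, and a value equal to the
    threshold counts as meeting it.
    """
    if not dfp_enabled:
        # Main flag set to false: DFP is not in effect at all.
        return False
    big_enough = detail.size_in_bytes >= size_threshold
    enough_files = detail.num_files >= files_threshold
    return big_enough and enough_files


if __name__ == "__main__":
    # A 50 GB table with 4000 files passes; the same table with 999 files does not.
    print(meets_dfp_thresholds(TableDetail(50_000_000_000, 4000)))
    print(meets_dfp_thresholds(TableDetail(50_000_000_000, 999)))
```

On a cluster, the two inputs could be read from the first row of DESCRIBE DETAIL table_name, for example with spark.sql("DESCRIBE DETAIL table_name").select("sizeInBytes", "numFiles").first(), where table_name is a placeholder for the probe-side table.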