Optimize (Delta Lake on Azure Databricks)

OPTIMIZE [db_name.]table_name [WHERE predicate]
  [ZORDER BY (col_name1, col_name2, ...)]

Optimize the layout of Delta Lake data. Optionally optimize a subset of data or colocate data by column. If you do not specify colocation, bin-packing optimization is performed.

Note

  • Bin-packing optimization is idempotent: if it is run twice on the same dataset, the second run has no effect. It aims to produce data files that are evenly balanced with respect to their size on disk, but not necessarily the number of tuples per file. However, the two measures are most often correlated.
  • Z-Ordering is not idempotent but aims to be an incremental operation. The time it takes for Z-Ordering is not guaranteed to decrease over multiple runs. However, if no new data was added to a partition that was just Z-Ordered, another Z-Ordering of that partition will have no effect. It aims to produce data files that are evenly balanced with respect to the number of tuples, but not necessarily data size on disk. The two measures are most often correlated, but there can be situations where they are not, leading to skew in optimize task times.
  • To control the output file size, set the Spark configuration spark.databricks.delta.optimize.maxFileSize. The default value is 1073741824 (1 GB). For example, specifying the value 134217728 sets the maximum output file size to 128 MB.
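As a minimal sketch of the note above, the output file size setting can be applied per session before running OPTIMIZE (the 128 MB target is illustrative, not a recommendation):

```sql
-- Cap OPTIMIZE output files at 128 MB (134217728 bytes) for this session
SET spark.databricks.delta.optimize.maxFileSize = 134217728;

OPTIMIZE events;
```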

WHERE

Optimize the subset of rows matching the given partition predicate. Only filters involving partition key attributes are supported.

ZORDER BY

Colocate column information in the same set of files. This co-locality is used by Delta Lake data-skipping algorithms to dramatically reduce the amount of data that needs to be read. You can specify multiple columns for ZORDER BY as a comma-separated list; however, the effectiveness of the locality drops with each additional column.
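As a sketch of the comma-separated form, the statement below Z-orders by two columns; `deviceId` is a hypothetical column used only for illustration, and columns most frequently used in filters should generally be listed first:

```sql
-- Z-order by two columns; locality effectiveness drops with each additional column
OPTIMIZE events
ZORDER BY (eventType, deviceId);
```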

Examples

OPTIMIZE events

OPTIMIZE events WHERE date >= '2017-01-01'

OPTIMIZE events
WHERE date >= current_timestamp() - INTERVAL 1 day
ZORDER BY (eventType)