OPTIMIZE(Azure Databricks 上的 Delta Lake) OPTIMIZE (Delta Lake on Azure Databricks)

优化 Delta Lake 数据的布局。Optimizes the layout of Delta Lake data. (可选)优化数据子集或按列归置数据。Optionally optimize a subset of data or colocate data by column. 如果未指定归置,则将执行二进制打包优化。If you do not specify colocation, bin-packing optimization is performed.

语法Syntax

OPTIMIZE table_identifier [WHERE predicate]
  [ZORDER BY (col_name1, col_name2, ...)]
  • 二进制打包优化幂等,这意味着如果在同一数据集上运行两次,则第二次运行不起作用。Bin-packing optimization is idempotent, meaning that if it is run twice on the same dataset, the second run has no effect. 它旨在根据文件在磁盘上的大小生成均衡的数据文件,但不一定是每个文件的元组数。It aims to produce evenly-balanced data files with respect to their size on disk, but not necessarily number of tuples per file. 但是,这两个度量值通常是相关的。However, the two measures are most often correlated.

  • Z 排序不是幂等的,而应该是增量操作。Z-Ordering is not idempotent but aims to be an incremental operation. 多次运行不能保证 Z 排序所需的时间减少。The time it takes for Z-Ordering is not guaranteed to reduce over multiple runs. 但是,如果没有将新数据添加到刚刚进行 Z 排序的分区,则该分区的另一个 Z 排序将不会产生任何效果。However, if no new data was added to a partition that was just Z-Ordered, another Z-Ordering of that partition will not have any effect. 它旨在根据元组的数量生成均衡的数据文件,但不一定是磁盘上的数据大小。It aims to produce evenly-balanced data files with respect to the number of tuples, but not necessarily data size on disk. 这两个度量值通常是相关的,但可能会有例外的情况,导致优化任务时间出现偏差。The two measures are most often correlated, but there can be situations when that is not the case, leading to skew in optimize task times. * 若要控制输出文件大小,请设置 Spark配置 spark.databricks.delta.optimize.maxFileSize* To control the output file size, set the Spark configuration spark.databricks.delta.optimize.maxFileSize. 默认值为 1073741824The default value is 1073741824. 指定值 134217728 会将最大输出文件大小设置为 100 MB。Specifying the value 134217728 sets the max output file size to 100MB.

  • table_identifiertable_identifier

    • [database_name.] table_name:表名,可选择使用数据库名称进行限定。[database_name.] table_name: A table name, optionally qualified with a database name.
    • delta.`<path-to-table>`:现有 Delta 表的位置。delta.`<path-to-table>`: The location of an existing Delta table.
  • WHERE

    优化与给定分区谓词匹配的行子集。Optimize the subset of rows matching the given partition predicate. 仅支持涉及分区键属性的筛选器。Only filters involving partition key attributes are supported.

  • ZORDER BY

    将列信息并置在同一组文件中。Colocate column information in the same set of files. Delta Lake 数据跳过算法会使用并置,大幅减少需要读取的数据量。Co-locality is used by Delta Lake data-skipping algorithms to dramatically reduce the amount of data that needs to be read. 可以将 ZORDER BY 的多个列指定为以逗号分隔的列表。You can specify multiple columns for ZORDER BY as a comma-separated list. 但是,区域的有效性会随每个附加列一起删除。However, the effectiveness of the locality drops with each additional column.

示例Examples

OPTIMIZE events

OPTIMIZE events WHERE date >= '2017-01-01'

OPTIMIZE events
WHERE date >= current_timestamp() - INTERVAL 1 day
ZORDER BY (eventType)

有关 OPTIMIZE 命令的详细信息,请参阅使用文件管理优化性能For more information about the OPTIMIZE command, see Optimize performance with file management.