Auto Optimize

Auto Optimize is an optional set of features that automatically compact small files during individual writes to a Delta table. Paying a small cost during writes offers significant benefits for tables that are queried actively. Auto Optimize is particularly useful in the following scenarios:

  • Streaming use cases where latency in the order of minutes is acceptable
  • MERGE INTO is the preferred method of writing into Delta Lake
  • CREATE TABLE AS SELECT or INSERT INTO are commonly used operations

How Auto Optimize works

Auto Optimize consists of two complementary features: Optimized Writes and Auto Compaction.

Optimized Writes

Azure Databricks dynamically optimizes Apache Spark partition sizes based on the actual data, and attempts to write out 128 MB files for each table partition. This is an approximate size and can vary depending on dataset characteristics.
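As an illustration, here is a minimal PySpark sketch of a batch write with Optimized Writes enabled for the session; the source path and table name are hypothetical stand-ins:

    # Enable Optimized Writes for this session so Spark coalesces output
    # into approximately 128 MB files per table partition on write.
    spark.conf.set("spark.databricks.delta.optimizeWrite.enabled", "true")

    df = spark.read.json("/data/raw/events")  # hypothetical source path
    (df.write
       .format("delta")
       .partitionBy("date")    # output files are sized per table partition
       .mode("append")
       .saveAsTable("events")) # hypothetical table name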

Auto Compaction

After an individual write, Azure Databricks checks whether files can be compacted further, and runs a quick OPTIMIZE job (with 128 MB file sizes instead of 1 GB) to further compact files for the partitions that have the largest number of small files.
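One way to observe the effect is to watch the table's file count after small appends; a sketch, assuming a hypothetical Delta table named events:

    # With delta.autoOptimize.autoCompact = true set on the table, small
    # files left behind by appends are compacted into ~128 MB files by the
    # synchronous post-write job.
    spark.sql("ALTER TABLE events SET TBLPROPERTIES (delta.autoOptimize.autoCompact = true)")

    small_batch = spark.range(1000).toDF("value")  # tiny hypothetical batch
    small_batch.write.format("delta").mode("append").saveAsTable("events")

    # DESCRIBE DETAIL reports numFiles; it should stay low even after many
    # small appends, since compaction runs after the writes.
    spark.sql("DESCRIBE DETAIL events").select("numFiles").show()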

Usage

Auto Optimize is designed to be configured for specific Delta tables. You enable Optimized Writes for a table by setting the table property delta.autoOptimize.optimizeWrite = true. Similarly, you set delta.autoOptimize.autoCompact = true to enable Auto Compaction.

  • For existing tables, run:

    ALTER TABLE [table_name | delta.`<table-path>`] SET TBLPROPERTIES (delta.autoOptimize.optimizeWrite = true, delta.autoOptimize.autoCompact = true)
    
  • To ensure all new Delta tables have these features enabled, set the SQL configuration:

    set spark.databricks.delta.properties.defaults.autoOptimize.optimizeWrite = true;
    set spark.databricks.delta.properties.defaults.autoOptimize.autoCompact = true;
    

In addition, you can enable and disable both of these features for the Spark session with the following configurations:

  • spark.databricks.delta.optimizeWrite.enabled
  • spark.databricks.delta.autoCompact.enabled

The session configurations take precedence over the table properties, allowing you to better control when to opt in or opt out of these features.
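For example, a sketch of using the session configurations to override what the table properties say:

    # Session configurations override table properties. Setting this to
    # false opts this session out of Auto Compaction even for tables that
    # have delta.autoOptimize.autoCompact = true.
    spark.conf.set("spark.databricks.delta.autoCompact.enabled", "false")

    # Likewise, opt the session in to Optimized Writes regardless of the
    # table properties:
    spark.conf.set("spark.databricks.delta.optimizeWrite.enabled", "true")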

When to opt in and opt out

This section provides guidance on when to opt in and opt out of Auto Optimize features.

Optimized Writes

Optimized Writes aim to maximize the throughput of data being written to a storage service. This can be achieved by reducing the number of files being written, without sacrificing too much parallelism.

Optimized Writes require the shuffling of data according to the partitioning structure of the target table. This shuffle naturally incurs additional cost. However, the throughput gains during the write may pay off the cost of the shuffle. If not, the throughput gains when querying the data should still make this feature worthwhile.

The key part of Optimized Writes is that it is an adaptive shuffle. If you have a streaming ingest use case and input data rates change over time, the adaptive shuffle adjusts itself to the incoming data rates across micro-batches. If you have code snippets where you coalesce(n) or repartition(n) just before you write out your stream, you can remove those lines.
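A before/after sketch of such a streaming write; the rate source, checkpoint path, and table name are hypothetical stand-ins:

    # df is the streaming DataFrame to be written (a rate source stands in
    # for the real ingest stream here).
    df = spark.readStream.format("rate").load()

    # Before: a df.repartition(n) here to control file counts per micro-batch.
    # After: with Optimized Writes enabled, the adaptive shuffle sizes the
    # output files per micro-batch, so the repartition call can be dropped.
    (df.writeStream
       .format("delta")
       .option("checkpointLocation", "/checkpoints/events")  # hypothetical path
       .toTable("events"))                                   # hypothetical table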

When to opt in

  • Streaming use cases where minutes of latency is acceptable
  • When using SQL commands like MERGE, UPDATE, DELETE, INSERT INTO, or CREATE TABLE AS SELECT

When to opt out

When the written data is in the order of terabytes and storage-optimized instances are unavailable.

Auto Compaction

Auto Compaction occurs after a write to a table has succeeded and runs synchronously on the cluster that performed the write. This means that if you have code patterns where you write to Delta Lake and then immediately call OPTIMIZE, you can remove the OPTIMIZE call once you enable Auto Compaction.
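A sketch of the pattern this removes, with a hypothetical table named events:

    df = spark.range(100).toDF("value")  # hypothetical small batch
    df.write.format("delta").mode("append").saveAsTable("events")

    # spark.sql("OPTIMIZE events")  # redundant: with
    # delta.autoOptimize.autoCompact = true on the table, compaction runs
    # synchronously right after the write above commits.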

Auto Compaction uses different heuristics than OPTIMIZE. Since it runs synchronously after a write, we have tuned Auto Compaction to run with the following properties:

  • Azure Databricks does not support Z-Ordering with Auto Compaction, as Z-Ordering is significantly more expensive than compaction alone.
  • Auto Compaction generates smaller files (128 MB) than OPTIMIZE (1 GB).
  • Auto Compaction greedily chooses a limited set of partitions that would benefit most from compaction. The number of partitions selected varies depending on the size of the cluster it is launched on. If your cluster has more CPUs, more partitions can be optimized.

When to opt in

  • Streaming use cases where minutes of latency is acceptable
  • When you don’t have regular OPTIMIZE calls on your table

When to opt out

When other writers may be performing operations like DELETE, MERGE, UPDATE, or OPTIMIZE concurrently, because Auto Compaction may cause a transaction conflict for those jobs. If Auto Compaction fails due to a transaction conflict, Azure Databricks does not fail or retry the compaction.

Example workflow: Streaming ingest with concurrent deletes or updates

This workflow assumes that you have one cluster running a 24/7 streaming job ingesting data, and one cluster that runs on an hourly, daily, or ad-hoc basis to delete or update a batch of records. For this use case, Azure Databricks recommends that you:

  • Enable Optimized Writes at the table level using:

    ALTER TABLE <table_name|delta.`table_path`> SET TBLPROPERTIES (delta.autoOptimize.optimizeWrite = true)
    

    This makes sure that the files written by the stream and by the delete and update jobs are of optimal size.

  • Enable Auto Compaction at the session level, using the following setting on the job that performs the delete or update:

    spark.sql("set spark.databricks.delta.autoCompact.enabled = true")
    

    This allows files to be compacted across your table. Since it happens after the delete or update, you mitigate the risks of a transaction conflict.
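Putting the second step together, a sketch of the periodic delete job; the table name, column, and retention predicate are hypothetical:

    # Hourly/ad-hoc job on the second cluster: opt this session in to Auto
    # Compaction, then run the delete. Compaction runs after the DELETE
    # commits, so it does not conflict with it.
    spark.sql("set spark.databricks.delta.autoCompact.enabled = true")
    spark.sql("DELETE FROM events WHERE event_date < date_sub(current_date(), 30)")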

Frequently asked questions (FAQ)

Does Auto Optimize Z-Order files?

Auto Optimize performs compaction only on small files. It does not Z-Order files.

Does Auto Optimize corrupt Z-Ordered files?

Auto Optimize ignores files that are Z-Ordered. It only compacts new files.

If I have Auto Optimize enabled on a table that I’m streaming into, and a concurrent transaction conflicts with the optimize, will my job fail?

No. Transaction conflicts that cause Auto Optimize to fail are ignored, and the stream will continue to operate normally.

Do I need to schedule OPTIMIZE jobs if Auto Optimize is enabled on my table?

For tables larger than 10 TB, we recommend that you keep OPTIMIZE running on a schedule to further consolidate files and reduce the metadata of your Delta table. Since Auto Optimize does not support Z-Ordering, you should still schedule OPTIMIZE ... ZORDER BY jobs to run periodically.
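A sketch of such a scheduled job, assuming a hypothetical table events that is commonly filtered by an event_type column:

    # Periodic maintenance job (for example, nightly): consolidate files up
    # to the default 1 GB target and cluster the data by a frequently
    # filtered column. Auto Optimize never Z-Orders, so this still needs to
    # be scheduled explicitly.
    spark.sql("OPTIMIZE events ZORDER BY (event_type)")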