Best practices

This article describes best practices when using Delta Lake.

Provide data location hints

If you expect a column to be commonly used in query predicates and if that column has high cardinality (that is, a large number of distinct values), then use Z-ORDER BY. Delta Lake automatically lays out the data in the files based on the column values and uses the layout information to skip irrelevant data while querying.

For details, see Z-Ordering (multi-dimensional clustering).
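For example, assuming a Delta table named events with a high-cardinality column eventType that appears frequently in query predicates (both names are illustrative), a Z-order command might look like:

```sql
-- Co-locate rows with similar eventType values in the same files,
-- so queries filtering on eventType can skip irrelevant files
OPTIMIZE events
ZORDER BY (eventType)
```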

Choose the right partition column

You can partition a Delta table by a column. The most commonly used partition column is date. Follow these two rules of thumb for deciding on which column to partition by:

  • If the cardinality of a column will be very high, do not use that column for partitioning. For example, if you partition by a column userId and there can be 1M distinct user IDs, that is a bad partitioning strategy.
  • Amount of data in each partition: You can partition by a column if you expect data in that partition to be at least 1 GB.
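Applying both rules, a low-cardinality date column where each day accumulates a gigabyte or more of data is a reasonable choice. A sketch, with illustrative table and column names:

```sql
-- date is low-cardinality and each daily partition holds enough data
CREATE TABLE events (
  date DATE,
  eventType STRING,
  data STRING)
USING DELTA
PARTITIONED BY (date)
```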

Compact files

If you continuously write data to a Delta table, it will over time accumulate a large number of files, especially if you add data in small batches. This can have an adverse effect on the efficiency of table reads, and it can also affect the performance of your file system. Ideally, a large number of small files should be rewritten into a smaller number of larger files on a regular basis. This is known as compaction.

You can compact a table using the OPTIMIZE command.
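For instance, you can compact the entire table, or only the partitions that receive frequent small writes (the table name and predicate below are placeholders):

```sql
-- Rewrite small files into fewer, larger files across the whole table
OPTIMIZE events;

-- Or restrict compaction to a subset of partitions
OPTIMIZE events WHERE date >= '2021-01-01';
```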

Replace the content or schema of a table

Sometimes you may want to replace a Delta table. For example:

  • You discover the data in the table is incorrect and want to replace the content.
  • You want to rewrite the whole table to make incompatible schema changes (such as dropping columns or changing column types).

While you can delete the entire directory of a Delta table and create a new table on the same path, it’s not recommended because:

  • Deleting a directory is not efficient. A directory containing very large files can take hours or even days to delete.
  • You lose all of the content in the deleted files; it’s hard to recover if you delete the wrong table.
  • The directory deletion is not atomic. While you are deleting the table, a concurrent query reading the table can fail or see a partial table.

If you don’t need to change the table schema, you can delete data from a Delta table and insert your new data, or update the table to fix the incorrect values.
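For example, incorrect values can be fixed in place without touching the schema (the table name, column, and predicates below are illustrative):

```sql
-- Remove the bad rows, then append corrected data ...
DELETE FROM events WHERE date = '2021-01-01';

-- ... or fix the values directly with an update
UPDATE events SET eventType = 'click' WHERE eventType = 'clck';
```

Both statements are Delta Lake transactions, so concurrent readers see either the old state or the new one, never a partial result.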

If you want to change the table schema, you can replace the whole table atomically. For example:

Python

dataframe.write \
  .format("delta") \
  .mode("overwrite") \
  .option("overwriteSchema", "true") \
  .partitionBy(<your-partition-columns>) \
  .saveAsTable("<your-table>") # Managed table
dataframe.write \
  .format("delta") \
  .mode("overwrite") \
  .option("overwriteSchema", "true") \
  .option("path", "<your-table-path>") \
  .partitionBy(<your-partition-columns>) \
  .saveAsTable("<your-table>") # External table

SQL

REPLACE TABLE <your-table> USING DELTA PARTITIONED BY (<your-partition-columns>) AS SELECT ... -- Managed table
REPLACE TABLE <your-table> USING DELTA PARTITIONED BY (<your-partition-columns>) LOCATION "<your-table-path>" AS SELECT ... -- External table

Scala

dataframe.write
  .format("delta")
  .mode("overwrite")
  .option("overwriteSchema", "true")
  .partitionBy(<your-partition-columns>)
  .saveAsTable("<your-table>") // Managed table
dataframe.write
  .format("delta")
  .mode("overwrite")
  .option("overwriteSchema", "true")
  .option("path", "<your-table-path>")
  .partitionBy(<your-partition-columns>)
  .saveAsTable("<your-table>") // External table

There are multiple benefits to this approach:

  • Overwriting a table is much faster because it doesn’t need to list the directory recursively or delete any files.
  • The old version of the table still exists. If you overwrite the wrong table, you can easily retrieve the old data using time travel.
  • It’s an atomic operation. Concurrent queries can still read the table while you are overwriting it.
  • Because of Delta Lake’s ACID transaction guarantees, if overwriting the table fails, the table will be in its previous state.

In addition, if you want to delete old files to save storage cost after overwriting the table, you can use VACUUM to delete them. It’s optimized for file deletion and is usually faster than deleting the entire directory.
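For example, to remove files that are no longer referenced by the table and are older than the default retention threshold (the table name is a placeholder):

```sql
-- Permanently delete unreferenced files older than the retention period;
-- note this also removes the data needed to time travel to those versions
VACUUM events;
```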