VACUUM

Clean up files associated with a table. There are different versions of this command for Apache Spark and Delta tables.

Vacuum a Spark table

Recursively vacuums directories associated with the Spark table and removes uncommitted files older than a retention threshold. The default threshold is 7 days. Azure Databricks automatically triggers VACUUM operations as data is written. See Clean up uncommitted files.

Syntax

VACUUM [ table_identifier | path] [RETAIN num HOURS]
  • table_identifier

    [database_name.] table_name: A table name, optionally qualified with a database name.

  • path

    Path to the table files.

  • RETAIN num HOURS

    The retention threshold, in hours.
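
The syntax above might be used as in the following sketch. The table name `events`, the database name `logs`, and the path `/data/events` are hypothetical placeholders, not names taken from this page, and the path is assumed to be passed as a quoted string literal.

```sql
-- Vacuum a table by name, using the default 7-day retention threshold.
-- `events` is a hypothetical table name.
VACUUM events;

-- Vacuum a table qualified with a (hypothetical) database name.
VACUUM logs.events;

-- Vacuum by path instead of by name; the path is a placeholder.
VACUUM '/data/events';

-- Remove only uncommitted files older than 240 hours (10 days).
VACUUM events RETAIN 240 HOURS;
```

Because Azure Databricks triggers VACUUM automatically as data is written to Spark tables, running these statements by hand is typically only needed for an explicit cleanup.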

Vacuum a Delta table (Delta Lake on Azure Databricks)

Recursively vacuums directories associated with the Delta table and removes data files that are no longer in the latest state of the transaction log for the table and are older than a retention threshold. A file becomes eligible for deletion based on the time it was logically removed from Delta's transaction log plus the retention hours, not on its modification timestamp on the storage system. The default threshold is 7 days. Azure Databricks does not automatically trigger VACUUM operations on Delta tables. See Remove files no longer referenced by a Delta table.

If you run VACUUM on a Delta table, you lose the ability to time travel back to a version older than the specified data retention period.

VACUUM table_identifier [RETAIN num HOURS] [DRY RUN]
  • table_identifier

    • [database_name.] table_name: A table name, optionally qualified with a database name.
    • delta.`<path-to-table>`: The location of an existing Delta table.
  • RETAIN num HOURS

    The retention threshold, in hours.

  • DRY RUN

    Return a list of files to be deleted, without deleting them.
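
Since vacuuming a Delta table removes the ability to time travel past the retention period, one cautious workflow is to preview the deletion with DRY RUN before running the command for real. The sketch below assumes a hypothetical Delta table named `events` stored at the placeholder path `/data/events`; neither name comes from this page.

```sql
-- Preview: list the files that would be deleted, using the default
-- 7-day threshold. `events` is a hypothetical table name.
VACUUM events DRY RUN;

-- Same preview with an explicit 240-hour (10-day) threshold.
VACUUM events RETAIN 240 HOURS DRY RUN;

-- Perform the deletion once the preview looks right.
VACUUM events RETAIN 240 HOURS;

-- Address the table by location instead of by name; the path is a placeholder.
VACUUM delta.`/data/events` RETAIN 240 HOURS;
```

The default of 7 days corresponds to `RETAIN 168 HOURS`. The examples use a longer period on purpose: a shorter one deletes more history and narrows the window available for time travel.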