Convert To Delta (Delta Lake on Azure Databricks)

CONVERT TO DELTA [ [db_name.]table_name | parquet.`<path-to-table>` ] [NO STATISTICS]
[PARTITIONED BY (col_name1 col_type1, col_name2 col_type2, ...)]

Note

CONVERT TO DELTA [db_name.]table_name requires Databricks Runtime 6.6 or above.

Convert an existing Parquet table to a Delta table in-place. This command lists all the files in the directory, creates a Delta Lake transaction log that tracks these files, and automatically infers the data schema by reading the footers of all Parquet files. The conversion process collects statistics to improve query performance on the converted Delta table. If you provide a table name, the metastore is also updated to reflect that the table is now a Delta table.
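For example, here is a minimal sketch of both invocation forms; the table name events_db.events and the path /mnt/data/events are hypothetical placeholders:

-- Convert a Parquet table registered in the metastore (hypothetical name)
CONVERT TO DELTA events_db.events

-- Convert a path-based Parquet table (hypothetical path)
CONVERT TO DELTA parquet.`/mnt/data/events`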

NO STATISTICS

Bypass statistics collection during the conversion process and finish the conversion faster. After the table is converted to Delta Lake, you can use OPTIMIZE ZORDER BY to reorganize the data layout and generate statistics.
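For example, a sketch of a fast conversion followed by Z-ordering; the table name and the Z-order column event_date are hypothetical:

-- Faster conversion: skip statistics collection
CONVERT TO DELTA events_db.events NO STATISTICS

-- Later: reorganize the data layout and generate statistics
OPTIMIZE events_db.events ZORDER BY (event_date)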

PARTITIONED BY

Partition the created table by the specified columns. Required if the data is partitioned. The conversion process aborts and throws an exception if the directory structure does not conform to the PARTITIONED BY specification. If you do not provide the PARTITIONED BY clause, the command assumes that the table is not partitioned.
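For example, assuming a hypothetical directory laid out as /mnt/data/events/date=2020-01-01/..., the partition column and its type must be declared explicitly:

-- PARTITIONED BY must match the directory structure (hypothetical path and column)
CONVERT TO DELTA parquet.`/mnt/data/events` PARTITIONED BY (date DATE)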

Caveats

Any file not tracked by Delta Lake is invisible and can be deleted when you run VACUUM. Avoid updating or appending data files during the conversion process. After the table is converted, make sure all writes go through Delta Lake.

Multiple external tables may share the same underlying Parquet directory. In this case, if you run CONVERT on one of the external tables, you will not be able to access the other external tables because their underlying directory has been converted from Parquet to Delta Lake. To query or write to these external tables again, you must run CONVERT on them as well.
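For example, if two hypothetical external tables events_a and events_b point at the same Parquet directory, converting one leaves the other unusable until it is converted as well:

-- Converting events_a switches the shared directory to Delta Lake
CONVERT TO DELTA events_db.events_a

-- events_b must also be converted before it can be queried or written again
CONVERT TO DELTA events_db.events_b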

CONVERT populates the catalog information, such as schema and table properties, into the Delta Lake transaction log. If the underlying directory has already been converted to Delta Lake and its metadata differs from the catalog metadata, a convertMetastoreMetadataMismatchException is thrown. If you want CONVERT to overwrite the existing metadata in the Delta Lake transaction log, set the SQL configuration spark.databricks.delta.convert.metadataCheck.enabled to false.
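For example, a sketch of disabling the metadata check before re-running CONVERT on a hypothetical table; use this only when you intend the catalog metadata to overwrite what is already in the transaction log:

-- Allow CONVERT to overwrite existing metadata in the Delta Lake transaction log
SET spark.databricks.delta.convert.metadataCheck.enabled = false
CONVERT TO DELTA events_db.events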

Undo the conversion

If you have performed Delta Lake operations such as DELETE or OPTIMIZE that can change the data files, first run the following command for garbage collection:

VACUUM delta.`<path-to-table>` RETAIN 0 HOURS

Then, delete the <path-to-table>/_delta_log directory.