Table utility commands

Delta tables support a number of utility commands.

Remove files no longer referenced by a Delta table

You can remove files that are no longer referenced by a Delta table and that are older than the retention threshold by running the vacuum command on the table. vacuum is not triggered automatically. The default retention threshold for the files is 7 days.

Important

  • vacuum deletes only data files, not log files. Log files are deleted automatically and asynchronously after checkpoint operations. The default retention period of log files is 30 days, configurable through the delta.logRetentionDuration property, which you set with the ALTER TABLE SET TBLPROPERTIES SQL method (see the example below). See Table properties.
  • The ability to time travel back to a version older than the retention period is lost after running vacuum.
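For example, a minimal Python sketch of changing the log retention period; the table name events and the 60-day value are illustrative assumptions, not defaults:

# Hypothetical table name and value; delta.logRetentionDuration is the documented property.
spark.sql("ALTER TABLE events SET TBLPROPERTIES ('delta.logRetentionDuration' = '60 days')")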

SQL

VACUUM eventsTable   -- vacuum files not required by versions older than the default retention period

VACUUM '/data/events' -- vacuum files in path-based table

VACUUM delta.`/data/events/`

VACUUM delta.`/data/events/` RETAIN 100 HOURS  -- vacuum files not required by versions more than 100 hours old

VACUUM eventsTable DRY RUN    -- do dry run to get the list of files to be deleted

For syntax details, see

  • Databricks Runtime 7.x: VACUUM
  • Databricks Runtime 5.5 LTS and 6.x: Vacuum

Python

Note

The Python API is available in Databricks Runtime 6.1 and above.

from delta.tables import *

deltaTable = DeltaTable.forPath(spark, pathToTable)  # path-based tables, or
deltaTable = DeltaTable.forName(spark, tableName)    # Hive metastore-based tables

deltaTable.vacuum()        # vacuum files not required by versions older than the default retention period

deltaTable.vacuum(100)     # vacuum files not required by versions more than 100 hours old

Scala

Note

The Scala API is available in Databricks Runtime 6.0 and above.

import io.delta.tables._

val deltaTable = DeltaTable.forPath(spark, pathToTable)

deltaTable.vacuum()        // vacuum files not required by versions older than the default retention period

deltaTable.vacuum(100)     // vacuum files not required by versions more than 100 hours old

Java

Note

The Java API is available in Databricks Runtime 6.0 and above.

import io.delta.tables.*;
import org.apache.spark.sql.functions;

DeltaTable deltaTable = DeltaTable.forPath(spark, pathToTable);

deltaTable.vacuum();        // vacuum files not required by versions older than the default retention period

deltaTable.vacuum(100);     // vacuum files not required by versions more than 100 hours old

See the API reference for Scala, Java, and Python syntax details.

Warning

We do not recommend that you set a retention interval shorter than 7 days, because old snapshots and uncommitted files can still be in use by concurrent readers or writers to the table. If vacuum cleans up active files, concurrent readers can fail or, worse, tables can be corrupted when vacuum deletes files that have not yet been committed.

Delta Lake has a safety check to prevent you from running a dangerous vacuum command. If you are certain that there are no operations being performed on this table that take longer than the retention interval you plan to specify, you can turn off this safety check by setting the Apache Spark configuration property spark.databricks.delta.retentionDurationCheck.enabled to false. You must choose an interval that is longer than the longest running concurrent transaction and the longest period that any stream can lag behind the most recent update to the table.
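A minimal Python sketch of turning off the check and vacuuming with a shorter window; the 48-hour value and the table path are assumptions you must validate against your own workloads:

# Assumption: no concurrent transaction or stream on this table runs longer than 48 hours.
spark.conf.set("spark.databricks.delta.retentionDurationCheck.enabled", "false")
spark.sql("VACUUM delta.`/data/events/` RETAIN 48 HOURS")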

Retrieve Delta table history

You can retrieve information on the operations, user, timestamp, and so on for each write to a Delta table by running the history command. The operations are returned in reverse chronological order. By default, table history is retained for 30 days.

SQL

DESCRIBE HISTORY '/data/events/'          -- get the full history of the table

DESCRIBE HISTORY delta.`/data/events/`

DESCRIBE HISTORY '/data/events/' LIMIT 1  -- get the last operation only

DESCRIBE HISTORY eventsTable

For Spark SQL syntax details, see

Python

Note

The Python API is available in Databricks Runtime 6.1 and above.

from delta.tables import *

deltaTable = DeltaTable.forPath(spark, pathToTable)

fullHistoryDF = deltaTable.history()    # get the full history of the table

lastOperationDF = deltaTable.history(1) # get the last operation

Scala

Note

The Scala API is available in Databricks Runtime 6.0 and above.

import io.delta.tables._

val deltaTable = DeltaTable.forPath(spark, pathToTable)

val fullHistoryDF = deltaTable.history()    // get the full history of the table

val lastOperationDF = deltaTable.history(1) // get the last operation

Java

Note

The Java API is available in Databricks Runtime 6.0 and above.

import io.delta.tables.*;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

DeltaTable deltaTable = DeltaTable.forPath(spark, pathToTable);

Dataset<Row> fullHistoryDF = deltaTable.history();       // get the full history of the table

Dataset<Row> lastOperationDF = deltaTable.history(1);    // fetch the last operation on the DeltaTable

See the API reference for Scala/Java/Python syntax details.

History schema

The output of the history operation has the following columns.

Column | Type | Description
version | long | Table version generated by the operation.
timestamp | timestamp | When this version was committed.
userId | string | ID of the user that ran the operation.
userName | string | Name of the user that ran the operation.
operation | string | Name of the operation.
operationParameters | map | Parameters of the operation (for example, predicates).
job | struct | Details of the job that ran the operation.
notebook | struct | Details of the notebook from which the operation was run.
clusterId | string | ID of the cluster on which the operation ran.
readVersion | long | Version of the table that was read to perform the write operation.
isolationLevel | string | Isolation level used for this operation.
isBlindAppend | boolean | Whether this operation appended data.
operationMetrics | map | Metrics of the operation (for example, number of rows and files modified).
userMetadata | string | User-defined commit metadata if it was specified.
+-------+-------------------+------+--------+---------+--------------------+----+--------+---------+-----------+-----------------+-------------+--------------------+
|version|          timestamp|userId|userName|operation| operationParameters| job|notebook|clusterId|readVersion|   isolationLevel|isBlindAppend|    operationMetrics|
+-------+-------------------+------+--------+---------+--------------------+----+--------+---------+-----------+-----------------+-------------+--------------------+
|      5|2019-07-29 14:07:47|   ###|     ###|   DELETE|[predicate -> ["(...|null|     ###|      ###|          4|WriteSerializable|        false|[numTotalRows -> ...|
|      4|2019-07-29 14:07:41|   ###|     ###|   UPDATE|[predicate -> (id...|null|     ###|      ###|          3|WriteSerializable|        false|[numTotalRows -> ...|
|      3|2019-07-29 14:07:29|   ###|     ###|   DELETE|[predicate -> ["(...|null|     ###|      ###|          2|WriteSerializable|        false|[numTotalRows -> ...|
|      2|2019-07-29 14:06:56|   ###|     ###|   UPDATE|[predicate -> (id...|null|     ###|      ###|          1|WriteSerializable|        false|[numTotalRows -> ...|
|      1|2019-07-29 14:04:31|   ###|     ###|   DELETE|[predicate -> ["(...|null|     ###|      ###|          0|WriteSerializable|        false|[numTotalRows -> ...|
|      0|2019-07-29 14:01:40|   ###|     ###|    WRITE|[mode -> ErrorIfE...|null|     ###|      ###|       null|WriteSerializable|         true|[numFiles -> 2, n...|
+-------+-------------------+------+--------+---------+--------------------+----+--------+---------+-----------+-----------------+-------------+--------------------+

Note

  • Operation metrics are available only when the history command and the operation in the history were run using Databricks Runtime 6.5 or above.
  • A few of the other columns are not available if you write into a Delta table using the following methods:
  • Columns added in the future will always be added after the last column.

Operation metrics keys

The history operation returns a collection of operation metrics in the operationMetrics column map.

The following tables list the map key definitions by operation; a short example of reading the map follows the tables.

Operation: WRITE, CREATE TABLE AS SELECT, REPLACE TABLE AS SELECT, COPY INTO
  numFiles: Number of files written.
  numOutputBytes: Size in bytes of the written contents.
  numOutputRows: Number of rows written.

Operation: STREAMING UPDATE
  numAddedFiles: Number of files added.
  numRemovedFiles: Number of files removed.
  numOutputRows: Number of rows written.
  numOutputBytes: Size of write in bytes.

Operation: DELETE
  numAddedFiles: Number of files added. Not provided when partitions of the table are deleted.
  numRemovedFiles: Number of files removed.
  numDeletedRows: Number of rows removed. Not provided when partitions of the table are deleted.
  numCopiedRows: Number of rows copied in the process of deleting files.

Operation: TRUNCATE
  numRemovedFiles: Number of files removed.

Operation: MERGE
  numSourceRows: Number of rows in the source DataFrame.
  numTargetRowsInserted: Number of rows inserted into the target table.
  numTargetRowsUpdated: Number of rows updated in the target table.
  numTargetRowsDeleted: Number of rows deleted in the target table.
  numTargetRowsCopied: Number of target rows copied.
  numOutputRows: Total number of rows written out.
  numTargetFilesAdded: Number of files added to the sink (target).
  numTargetFilesRemoved: Number of files removed from the sink (target).

Operation: UPDATE
  numAddedFiles: Number of files added.
  numRemovedFiles: Number of files removed.
  numUpdatedRows: Number of rows updated.
  numCopiedRows: Number of rows copied over in the process of updating files.

Operation: FSCK
  numRemovedFiles: Number of files removed.

Operation: CONVERT
  numConvertedFiles: Number of Parquet files that have been converted.
Operation: CLONE (1)
  sourceTableSize: Size in bytes of the source table at the version that's cloned.
  sourceNumOfFiles: Number of files in the source table at the version that's cloned.
  numRemovedFiles: Number of files removed from the target table if a previous Delta table was replaced.
  removedFilesSize: Total size in bytes of the files removed from the target table if a previous Delta table was replaced.
  numCopiedFiles: Number of files that were copied over to the new location. 0 for shallow clones.
  copiedFilesSize: Total size in bytes of the files that were copied over to the new location. 0 for shallow clones.

Operation: RESTORE (2)
  tableSizeAfterRestore: Table size in bytes after restore.
  numOfFilesAfterRestore: Number of files in the table after restore.
  numRemovedFiles: Number of files removed by the restore operation.
  numRestoredFiles: Number of files that were added as a result of the restore.
  removedFilesSize: Size in bytes of files removed by the restore.
  restoredFilesSize: Size in bytes of files added by the restore.

Operation: OPTIMIZE
  numAddedFiles: Number of files added.
  numRemovedFiles: Number of files optimized.
  numAddedBytes: Number of bytes added after the table was optimized.
  numRemovedBytes: Number of bytes removed.
  minFileSize: Size of the smallest file after the table was optimized.
  p25FileSize: Size of the 25th percentile file after the table was optimized.
  p50FileSize: Median file size after the table was optimized.
  p75FileSize: Size of the 75th percentile file after the table was optimized.
  maxFileSize: Size of the largest file after the table was optimized.

(1) Requires Databricks Runtime 7.3 LTS or above.

(2) Requires Databricks Runtime 7.4 or above.
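As an illustration, a minimal Python sketch of reading these metrics for the most recent operation; the table path is a placeholder carried over from the earlier examples:

from delta.tables import *

deltaTable = DeltaTable.forPath(spark, "/data/events/")  # hypothetical path

# operationMetrics is a map column, so it can be inspected alongside the operation name.
deltaTable.history(1).select("operation", "operationMetrics").show(truncate=False)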

Retrieve Delta table details

You can retrieve detailed information about a Delta table (for example, number of files, data size) using DESCRIBE DETAIL.

DESCRIBE DETAIL '/data/events/'

DESCRIBE DETAIL eventsTable

For Spark SQL syntax details, see

Detail schema

The output of this operation has only one row with the following schema.

Column | Type | Description
format | string | Format of the table, that is, "delta".
id | string | Unique ID of the table.
name | string | Name of the table as defined in the metastore.
description | string | Description of the table.
location | string | Location of the table.
createdAt | timestamp | When the table was created.
lastModified | timestamp | When the table was last modified.
partitionColumns | array of strings | Names of the partition columns if the table is partitioned.
numFiles | long | Number of files in the latest version of the table.
properties | string-string map | All the properties set for this table.
minReaderVersion | int | Minimum version of readers (according to the log protocol) that can read the table.
minWriterVersion | int | Minimum version of writers (according to the log protocol) that can write to the table.
+------+--------------------+------------------+-----------+--------------------+--------------------+-------------------+----------------+--------+-----------+----------+----------------+----------------+
|format|                  id|              name|description|            location|           createdAt|       lastModified|partitionColumns|numFiles|sizeInBytes|properties|minReaderVersion|minWriterVersion|
+------+--------------------+------------------+-----------+--------------------+--------------------+-------------------+----------------+--------+-----------+----------+----------------+----------------+
| delta|d31f82d2-a69f-42e...|default.deltatable|       null|file:/Users/tdas/...|2020-06-05 12:20:...|2020-06-05 12:20:20|              []|      10|      12345|        []|               1|               2|
+------+--------------------+------------------+-----------+--------------------+--------------------+-------------------+----------------+--------+-----------+----------+----------------+----------------+

Convert a Parquet table to a Delta table

Convert a Parquet table to a Delta table in place. This command lists all the files in the directory, creates a Delta Lake transaction log that tracks these files, and automatically infers the data schema by reading the footers of all Parquet files. If your data is partitioned, you must specify the schema of the partition columns as a DDL-formatted string (that is, <column-name1> <type>, <column-name2> <type>, ...).

SQL

-- Convert unpartitioned parquet table at path '<path-to-table>'
CONVERT TO DELTA parquet.`<path-to-table>`

-- Convert partitioned Parquet table at path '<path-to-table>' and partitioned by integer columns named 'part' and 'part2'
CONVERT TO DELTA parquet.`<path-to-table>` PARTITIONED BY (part int, part2 int)

For syntax details, see

Python

Note

The Python API is available in Databricks Runtime 6.1 and above.

from delta.tables import *

# Convert unpartitioned parquet table at path '<path-to-table>'
deltaTable = DeltaTable.convertToDelta(spark, "parquet.`<path-to-table>`")

# Convert partitioned parquet table at path '<path-to-table>' and partitioned by integer column named 'part'
partitionedDeltaTable = DeltaTable.convertToDelta(spark, "parquet.`<path-to-table>`", "part int")

Scala

Note

The Scala API is available in Databricks Runtime 6.0 and above.

import io.delta.tables._

// Convert unpartitioned Parquet table at path '<path-to-table>'
val deltaTable = DeltaTable.convertToDelta(spark, "parquet.`<path-to-table>`")

// Convert partitioned Parquet table at path '<path-to-table>' and partitioned by integer columns named 'part' and 'part2'
val partitionedDeltaTable = DeltaTable.convertToDelta(spark, "parquet.`<path-to-table>`", "part int, part2 int")

Java

Note

The Java API is available in Databricks Runtime 6.0 and above.

import io.delta.tables.*;

// Convert unpartitioned parquet table at path '<path-to-table>'
DeltaTable deltaTable = DeltaTable.convertToDelta(spark, "parquet.`<path-to-table>`");

// Convert partitioned Parquet table at path '<path-to-table>' and partitioned by integer columns named 'part' and 'part2'
DeltaTable deltaTable = DeltaTable.convertToDelta(spark, "parquet.`<path-to-table>`", "part int, part2 int");

Note

Any file not tracked by Delta Lake is invisible and can be deleted when you run vacuum. You should avoid updating or appending data files during the conversion process. After the table is converted, make sure all writes go through Delta Lake.

Convert a Delta table to a Parquet table

You can easily convert a Delta table back to a Parquet table using the following steps (a sketch of both steps follows this list):

  1. If you have performed Delta Lake operations that can change the data files (for example, delete or merge), run vacuum with a retention of 0 hours to delete all data files that do not belong to the latest version of the table.
  2. Delete the _delta_log directory in the table directory.
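A minimal Python sketch of these two steps on Databricks (where dbutils is available); the path is an assumption, and the safety check must be disabled before a 0-hour vacuum:

# Assumption: the table lives at /data/events/ and no concurrent readers or writers are active.
spark.conf.set("spark.databricks.delta.retentionDurationCheck.enabled", "false")

# Step 1: delete data files that are not part of the latest table version.
spark.sql("VACUUM delta.`/data/events/` RETAIN 0 HOURS")

# Step 2: delete the transaction log so the directory is read as a plain Parquet table.
dbutils.fs.rm("/data/events/_delta_log", True)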

Restore a Delta table to an earlier state

Important

This feature is in Public Preview.

Note

Available in Databricks Runtime 7.4 and above.

You can restore a Delta table to its earlier state by using the RESTORE command. A Delta table internally maintains historic versions of the table that enable it to be restored to an earlier state. A version corresponding to the earlier state, or a timestamp of when the earlier state was created, are supported as options by the RESTORE command.

Important

  • You can restore an already restored table and cloned tables.
  • Restoring a table to an older version where the data files were deleted manually or by vacuum will fail. Restoring to this version partially is still possible if spark.sql.files.ignoreMissingFiles is set to true (see the sketch after this list).
  • The timestamp format for restoring to an earlier state is yyyy-MM-dd HH:mm:ss. Providing only a date (yyyy-MM-dd) string is also supported.
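If you need the partial restore described above, a minimal Python sketch of enabling the setting before restoring; the table name and version are placeholders:

# Allow RESTORE to skip data files that have already been deleted.
spark.conf.set("spark.sql.files.ignoreMissingFiles", "true")
spark.sql("RESTORE TABLE db.target_table TO VERSION AS OF 2")  # hypothetical version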

SQL

RESTORE TABLE db.target_table TO VERSION AS OF <version>
RESTORE TABLE delta.`/data/target/` TO TIMESTAMP AS OF <timestamp>

Python

from delta.tables import *

deltaTable = DeltaTable.forPath(spark, <path-to-table>)  # path-based tables, or
deltaTable = DeltaTable.forName(spark, <table-name>)    # Hive metastore-based tables

deltaTable.restoreToVersion(0) # restore table to oldest version

deltaTable.restoreToTimestamp('2019-02-14') # restore to a specific timestamp

Scala

import io.delta.tables._

val deltaTable = DeltaTable.forPath(spark, <path-to-table>)
val deltaTable = DeltaTable.forName(spark, <table-name>)

deltaTable.restoreToVersion(0) // restore table to oldest version

deltaTable.restoreToTimestamp("2019-02-14") // restore to a specific timestamp

Java

import io.delta.tables.*;

DeltaTable deltaTable = DeltaTable.forPath(spark, <path-to-table>);
DeltaTable deltaTable = DeltaTable.forName(spark, <table-name>);

deltaTable.restoreToVersion(0); // restore table to oldest version

deltaTable.restoreToTimestamp("2019-02-14"); // restore to a specific timestamp

For syntax details, see RESTORE (Delta Lake on Azure Databricks).

Table access control

You must have MODIFY permission on the table being restored.

Clone a Delta table

Note

Available in Databricks Runtime 7.2 and above.

You can create a copy of an existing Delta table at a specific version using the clone command. Clones can be either deep or shallow.

In this section:

Clone types

  • A deep clone is a clone that copies the source table data to the clone target in addition to the metadata of the existing table. Additionally, stream metadata is also cloned such that a stream that writes to the Delta table can be stopped on a source table and continued on the target of a clone from where it left off.
  • A shallow clone is a clone that does not copy the data files to the clone target. The table metadata is equivalent to the source. These clones are cheaper to create.

Any changes made to either deep or shallow clones affect only the clones themselves and not the source table.

The metadata that is cloned includes: schema, partitioning information, invariants, nullability. For deep clones only, stream and COPY INTO (Delta Lake on Azure Databricks) metadata are also cloned. Metadata not cloned are the table description and user-defined commit metadata.

Important

  • Shallow clones reference data files in the source directory. If you run vacuum on the source table, clients will no longer be able to read the referenced data files and a FileNotFoundException will be thrown. In this case, running clone with replace over the shallow clone will repair the clone. If this occurs often, consider using a deep clone instead, which does not depend on the source table.
  • Deep clones do not depend on the source from which they were cloned, but are expensive to create because a deep clone copies the data as well as the metadata.
  • Cloning with replace to a target that already has a table at that path creates a Delta log if one does not exist at that path. You can clean up any existing data by running vacuum. If the existing table is a Delta table, a new commit is created on the existing Delta table that includes the new metadata and new data from the source table.
  • Cloning a table is not the same as Create Table As Select or CTAS. A clone copies the metadata of the source table in addition to the data. Cloning also has simpler syntax: you don't need to specify partitioning, format, invariants, nullability, and so on, as they are taken from the source table.
  • A cloned table has an independent history from its source table. Time travel queries on a cloned table will not work with the same inputs as they work on its source table.

SQL

 CREATE TABLE delta.`/data/target/` CLONE delta.`/data/source/` -- Create a deep clone of /data/source at /data/target

 CREATE OR REPLACE TABLE db.target_table CLONE db.source_table -- Replace the target

 CREATE TABLE IF NOT EXISTS delta.`/data/target/` CLONE db.source_table -- No-op if the target table exists

 CREATE TABLE db.target_table SHALLOW CLONE delta.`/data/source`

 CREATE TABLE db.target_table SHALLOW CLONE delta.`/data/source` VERSION AS OF version

 CREATE TABLE db.target_table SHALLOW CLONE delta.`/data/source` TIMESTAMP AS OF timestamp_expression -- timestamp can be like "2019-01-01" or like date_sub(current_date(), 1)

Python

 from delta.tables import *

 deltaTable = DeltaTable.forPath(spark, pathToTable)  # path-based tables, or
 deltaTable = DeltaTable.forName(spark, tableName)    # Hive metastore-based tables

 deltaTable.clone(target, isShallow, replace) # clone the source at latest version

 deltaTable.cloneAtVersion(version, target, isShallow, replace) # clone the source at a specific version

 # clone the source at a specific timestamp such as timestamp="2019-01-01"
 deltaTable.cloneAtTimestamp(timestamp, target, isShallow, replace)

Scala

 import io.delta.tables._

 val deltaTable = DeltaTable.forPath(spark, pathToTable)
 val deltaTable = DeltaTable.forName(spark, tableName)

 deltaTable.clone(target, isShallow, replace) // clone the source at latest version

 deltaTable.cloneAtVersion(version, target, isShallow, replace) // clone the source at a specific version

 deltaTable.cloneAtTimestamp(timestamp, target, isShallow, replace) // clone the source at a specific timestamp

Java

 import io.delta.tables.*;

 DeltaTable deltaTable = DeltaTable.forPath(spark, pathToTable);
 DeltaTable deltaTable = DeltaTable.forName(spark, tableName);

 deltaTable.clone(target, isShallow, replace); // clone the source at latest version

 deltaTable.cloneAtVersion(version, target, isShallow, replace); // clone the source at a specific version

 deltaTable.cloneAtTimestamp(timestamp, target, isShallow, replace); // clone the source at a specific timestamp

For syntax details, see CLONE (Delta Lake on Azure Databricks).

Permissions

You must configure permissions for Azure Databricks table access control and your cloud provider.

Table access control

The following permissions are required for both deep and shallow clones:

  • SELECT permission on the source table.
  • If you are using CLONE to create a new table, CREATE permission on the database in which you are creating the table.
  • If you are using CLONE to replace a table, you must have MODIFY permission on the table.

Cloud provider permissions

If you have created a deep clone, any user that reads the deep clone must have read access to the clone's directory. To make changes to the clone, users must have write access to the clone's directory.

If you have created a shallow clone, any user that reads the shallow clone needs permission to read the files in the original table (since the data files remain in the source table with shallow clones) as well as read access to the clone's directory. To make changes to the clone, users need write access to the clone's directory.

Clone use cases

In this section:

Data archiving

Data may need to be kept for longer than is feasible with time travel, or for disaster recovery. In these cases, you can create a deep clone to preserve the state of a table at a certain point in time for archival. Incremental archiving is also possible to keep a continually updating state of a source table for disaster recovery.

-- Every month run
CREATE OR REPLACE TABLE delta.`/some/archive/path` CLONE my_prod_table

Machine learning flow reproduction

When doing machine learning, you may want to archive a certain version of a table on which you trained an ML model. Future models can be tested using this archived data set.

-- Trained model on version 15 of Delta table
CREATE TABLE delta.`/model/dataset` CLONE entire_dataset VERSION AS OF 15

Short-term experiments on a production table

To test a workflow on a production table without corrupting the table, you can easily create a shallow clone. This allows you to run arbitrary workflows on the cloned table that contains all the production data but does not affect any production workloads.

-- Perform shallow clone
CREATE OR REPLACE TABLE my_test SHALLOW CLONE my_prod_table;

UPDATE my_test SET invalid=true WHERE user_id is null;
-- Run a bunch of validations. Once happy:

-- This should leverage the update information in the clone to prune to only
-- changed files in the clone if possible
MERGE INTO my_prod_table
USING my_test
ON my_test.user_id <=> my_prod_table.user_id
WHEN MATCHED AND my_test.user_id is null THEN UPDATE *;

DROP TABLE my_test;

Data sharing

Other business units within a single organization may want to access the same data but may not require the latest updates. Instead of giving access to the source table directly, you can provide clones with different permissions for different business units. The performance of the clone can exceed that of a simple view.

-- Perform deep clone
CREATE OR REPLACE TABLE shared_table CLONE my_prod_table;

-- Grant other users access to the shared table
GRANT SELECT ON shared_table TO `<user-name>@<user-domain>.com`;

Table property overrides

Table property overrides are particularly useful for:

  • Annotating tables with owner or user information when sharing data with different business units.

  • Archiving Delta tables where time travel is required. You can specify the log retention period independently for the archive table. For example:

    SQL
    CREATE OR REPLACE TABLE archive.my_table CLONE prod.my_table
    LOCATION 'xx://archive/my_table'
    TBLPROPERTIES (
      delta.logRetentionDuration = '3650 days',
      delta.deletedFileRetentionDuration = '3650 days'
    )
    
    Python
    dt = DeltaTable.forName(spark, "prod.my_table")
    tblProps = {
      "delta.logRetentionDuration": "3650 days",
      "delta.deletedFileRetentionDuration": "3650 days"
    }
    dt.clone('xx://archive/my_table', isShallow=False, replace=True, properties=tblProps)
    
    Scala
    val dt = DeltaTable.forName(spark, "prod.my_table")
    val tblProps = Map(
      "delta.logRetentionDuration" -> "3650 days",
      "delta.deletedFileRetentionDuration" -> "3650 days"
    )
    dt.clone("xx://archive/my_table", isShallow = false, replace = true, properties = tblProps)