Migration guide

Migrate workloads to Delta Lake

When you migrate workloads to Delta Lake, you should be aware of the following simplifications and differences compared with the data sources provided by Apache Spark and Apache Hive.

Delta Lake handles the following operations automatically, which you should never perform manually:

  • REFRESH TABLE: Delta tables always return the most up-to-date information, so there is no need to manually call REFRESH TABLE after changes.

  • Add and remove partitions: Delta Lake automatically tracks the set of partitions present in a table and updates the list as data is added or removed. As a result, there is no need to run ALTER TABLE [ADD|DROP] PARTITION or MSCK.

  • Load a single partition: As an optimization, you may sometimes directly load the partition of data you are interested in, for example spark.read.parquet("/data/date=2017-01-01"). This is unnecessary with Delta Lake, since it can quickly read the list of files from the transaction log to find the relevant ones. If you are interested in a single partition, specify it using a WHERE clause, for example spark.read.delta("/data").where("date = '2017-01-01'"). For large tables with many files in the partition, this can be much faster than loading a single partition (with a direct partition path, or with WHERE) from a Parquet table, because listing the files in the directory is often slower than reading the list of files from the transaction log.
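
A minimal PySpark sketch of the difference, using the standard DataFrame reader (the path and date column are illustrative, not part of this guide):

    # Plain Parquet: one partition is read by pointing at its directory.
    parquet_df = spark.read.parquet("/data/date=2017-01-01")

    # Delta Lake: read the table root and filter; the relevant files are found
    # through the transaction log rather than by listing directories.
    delta_df = spark.read.format("delta").load("/data").where("date = '2017-01-01'")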

When you port an existing application to Delta Lake, you should avoid the following operations, which bypass the transaction log:

  • Manually modify data: Delta Lake uses the transaction log to atomically commit changes to the table. Because the log is the source of truth, files that are written out but not added to the transaction log are not read by Spark. Similarly, even if you manually delete a file, a pointer to the file is still present in the transaction log. Instead of manually modifying files stored in a Delta table, always use the commands described in this guide.

  • External readers: The data stored in Delta Lake is encoded as Parquet files. However, accessing these files with an external reader is not safe: you will see duplicate and uncommitted data, and reads may fail when someone runs VACUUM to remove files that are no longer referenced by the Delta table.

    Note

    Because the files are encoded in an open format, you always have the option to move them outside Delta Lake. At that point, you can run VACUUM with RETAIN 0 HOURS and delete the transaction log. This leaves the table's files in a consistent state that can be read by the external reader of your choice.
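
A minimal PySpark sketch of that handoff, assuming a Delta table named events; the retention-check setting is the one used by open-source Delta Lake and is relaxed here only because the files are being handed off:

    # Allow a zero-hour retention interval (normally blocked by a safety check).
    spark.conf.set("spark.databricks.delta.retentionDurationCheck.enabled", "false")

    # Remove every file that is not referenced by the current table version.
    spark.sql("VACUUM events RETAIN 0 HOURS")

    # Finally, delete the transaction log (the _delta_log directory). The remaining
    # Parquet files form a consistent snapshot for any external reader.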

Example

Suppose you have Parquet data stored in the directory /data-pipeline and want to create a table named events. You can always read the data into a DataFrame and save it as a Delta table; this approach copies the data and lets Spark manage the table. Alternatively, you can convert the data in place to Delta Lake, which is faster but results in an unmanaged table.

Save as Delta table

  1. Read the data into a DataFrame and save it to a new directory in delta format:

    data = spark.read.parquet("/data-pipeline")
    data.write.format("delta").save("/mnt/delta/data-pipeline/")
    
  2. Create a Delta table events that refers to the files in the Delta Lake directory:

    spark.sql("CREATE TABLE events USING DELTA LOCATION '/mnt/delta/data-pipeline/'")
    

Convert to Delta table

You have two options for converting a Parquet table to a Delta table (a PySpark sketch of the first option follows the list):

  • Convert the files to Delta Lake format and then create the Delta table:

    CONVERT TO DELTA parquet.`/data-pipeline/`
    CREATE TABLE events USING DELTA LOCATION '/data-pipeline/'
    
  • Create a Parquet table and then convert it to a Delta table:

    CREATE TABLE events USING PARQUET OPTIONS (path '/data-pipeline/')
    CONVERT TO DELTA events
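
A minimal PySpark sketch of the first option, assuming the delta-spark Python package is available (for partitioned data, the partition schema must be passed as an extra argument to convertToDelta):

    from delta.tables import DeltaTable

    # Convert the Parquet files in place to Delta Lake format.
    DeltaTable.convertToDelta(spark, "parquet.`/data-pipeline/`")

    # Register a table over the converted directory.
    spark.sql("CREATE TABLE events USING DELTA LOCATION '/data-pipeline/'")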
    

For details, see Convert a Parquet table to a Delta table.