如何填充或更新现有 Delta 表中的列How to populate or update columns in an existing Delta table

问题Problem

你有一个现有的 Delta 表,其中包含几个空白列。You have an existing Delta table, with a few empty columns. 你需要使用原始 Parquet 文件中的数据填充或更新这些列。You need to populate or update those columns with data from a raw Parquet file.

解决方案Solution

在此示例中,有一个 customers 表,它是现有的 Delta 表。In this example, there is a customers table, which is an existing Delta table. 该表包含具有缺失值的 address 列。It has an address column with missing values. 更新后的数据以 Parquet 格式存在。The updated data exists in Parquet format.

  1. 使用 Apache Spark API 语句从 Parquet 文件创建 DataFrameCreate a DataFrame from the Parquet file using an Apache Spark API statement:

    updatesDf = spark.read.parquet("/path/to/raw-file")
    
  2. 查看 updatesDF DataFrame 的内容:View the contents of the updatesDF DataFrame:

    display(updatesDf)
    
  3. updatesDf DataFrame 创建表。Create a table from the updatesDf DataFrame. 在此示例中,该类名为 updatesIn this example, it is named updates.

    updatesDf.createOrReplaceTempView("updates")
    
  4. 检查 updates 表的内容,并将其与 customers 的内容进行比较:Check the contents of the updates table, and compare it to the contents of customers:

    display(customers)
    
  5. 使用 MERGE INTO 语句将 updates 表中的数据合并到原始 customers 表中。Use the MERGE INTO statement to merge the data from the updates table into the original customers table.

    MERGE INTO customers
    USING updates
    ON customers.customerId = source.customerId
    WHEN MATCHED THEN
      UPDATE SET address = updates.address
    WHEN NOT MATCHED
      THEN INSERT (customerId, address) VALUES (updates.customerId, updates.address)
    

这里,customers 是原始 Delta 表,包含具有缺失值的 address 列。Here, customers is the original Delta table that has an address column with missing values. updates 是从 DataFrame updatesDf 创建的表,通过读取原始文件中的数据创建的。updates is the table created from the DataFrame updatesDf, which is created by reading data from the raw file. 原始 Delta 表的 address 列使用 updates 中的值进行填充,并覆盖 address 列中的任何现有值。The address column of the original Delta table is populated with the values from updates, overwriting any existing values in the address column.

如果 updates 包含 customers 表中尚未包含的客户,则该命令将添加这些新客户记录。If updates contains customers that are not already in the customers table, then the command adds these new customer records.

有关使用 MERGE INTO 的更多示例,请参阅合并到 (Delta Lake)For more examples of using MERGE INTO, see Merge Into (Delta Lake).