Delta Lake UPDATE query fails with IllegalState exception

Problem

When you execute a Delta Lake UPDATE, DELETE, or MERGE query that uses a Python UDF in any of its transformations, the query fails with the following exception:

java.lang.UnsupportedOperationException: Error in SQL statement:
IllegalStateException: File (adl://xxx/table1) to be rewritten not found among candidate files:
adl://xxx/table1/part-00001-39cae1bb-9406-49d2-99fb-8c865516fbaa-c000.snappy.parquet

Version

This problem occurs on Databricks Runtime 5.5 and below.

Cause

Delta Lake internally depends on the input_file_name() function for operations like UPDATE, DELETE, and MERGE. input_file_name() returns an empty value if you use it in a SELECT statement that evaluates a Python UDF. UPDATE calls SELECT internally, so the file names come back empty and Delta Lake cannot match the file to be rewritten against its candidate files, which leads to the error. This error does not occur with Scala UDFs.
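To check whether a cluster is affected, one option is to select input_file_name() next to a Python UDF and count the rows where the file name comes back empty. The following is a minimal sketch, not an official diagnostic: the helper names to_upper and count_empty_file_names are hypothetical, and the table path and column are placeholders you supply. It assumes an existing SparkSession and a Delta table with a string column.

```python
def to_upper(value):
    """Plain Python function that is wrapped as a Python UDF below."""
    return value.upper() if value is not None else None


def count_empty_file_names(spark, table_path, column):
    """Count rows whose input_file_name() is empty when a Python UDF is evaluated.

    spark, table_path, and column are supplied by the caller (assumptions:
    an active SparkSession and a Delta table with a string column).
    """
    # Imported inside the function so to_upper can be used without Spark installed.
    from pyspark.sql.functions import col, input_file_name, udf
    from pyspark.sql.types import StringType

    to_upper_udf = udf(to_upper, StringType())
    df = (
        spark.read.format("delta")
        .load(table_path)
        .select(
            to_upper_udf(col(column)).alias("udf_result"),
            input_file_name().alias("source_file"),
        )
    )
    # On an affected runtime, source_file is expected to be empty for these rows.
    return df.filter(col("source_file") == "").count()
```

A nonzero count suggests the runtime shows the behavior described above. A count of zero on a non-empty table suggests file names are returned correctly.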

Solution

You have two options:

  • Use Databricks Runtime 6.0 or above, which includes the fix for this issue: [SPARK-28153].
  • If you can’t use Databricks Runtime 6.0 or above, use Scala UDFs instead of Python UDFs.
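If the rest of the job has to stay in Python, one possible way to apply the second option is to implement the UDF on the JVM and call it from PySpark through spark.udf.registerJavaFunction. The sketch below makes several assumptions: com.example.udfs.ToUpper is a placeholder for a class you compile yourself (written in Scala or Java, implementing org.apache.spark.sql.api.java.UDF1) and attach to the cluster, and build_update_sql, update_with_jvm_udf, and to_upper_jvm are hypothetical names used only for illustration.

```python
def build_update_sql(table_path, column, udf_name):
    """Build an UPDATE statement that applies a registered SQL function to one column."""
    return "UPDATE delta.`{0}` SET {1} = {2}({1})".format(table_path, column, udf_name)


def update_with_jvm_udf(spark, table_path, column):
    """Register a JVM UDF and use it in a Delta Lake UPDATE instead of a Python UDF."""
    # Imported inside the function so build_update_sql can be used without Spark installed.
    from pyspark.sql.types import StringType

    # Placeholder class name (assumption): it must be on the cluster classpath
    # and implement org.apache.spark.sql.api.java.UDF1.
    spark.udf.registerJavaFunction(
        "to_upper_jvm", "com.example.udfs.ToUpper", StringType()
    )
    spark.sql(build_update_sql(table_path, column, "to_upper_jvm"))
```

Because the function runs on the JVM, the internal SELECT does not evaluate a Python UDF, which is the condition the Cause section ties to the empty file names.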