Create table in overwrite mode fails when interrupted

Problem

When you attempt to rerun an Apache Spark write operation by cancelling the currently running job, the following error occurs:

Error: org.apache.spark.sql.AnalysisException: Cannot create the managed table('`testdb`.`testtable`').
The associated location ('dbfs:/user/hive/warehouse/testdb.db/metastore_cache_testtable') already exists.;

Version

This problem can occur in Databricks Runtime 5.0 and above.

Cause

This problem is due to a change in the default behavior of Spark in version 2.4.

This problem can occur if:

  • The cluster is terminated while a write operation is in progress.
  • A temporary network issue occurs.
  • The job is interrupted.

Once the metastore data for a particular table is corrupted, it is hard to recover except by dropping the files in that location manually. Basically, the problem is that a metadata directory called _STARTED isn't deleted automatically when Azure Databricks tries to overwrite it.
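
A quick way to confirm this is to list the table's location and check for the leftover _STARTED directory. The sketch below assumes the location reported in the error message above; substitute the path from your own error:

// List the contents of the table location; a leftover _STARTED metadata directory
// indicates an interrupted overwrite as described above.
display(dbutils.fs.ls("dbfs:/user/hive/warehouse/testdb.db/metastore_cache_testtable"))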

You can reproduce the problem by following these steps (a combined sketch follows the list):

  1. Create a DataFrame:

    val df = spark.range(1000)

  2. Write the DataFrame to a location in overwrite mode:

    df.write.mode(SaveMode.Overwrite).saveAsTable("testdb.testtable")

  3. Cancel the command while it is executing.

  4. Re-run the write command.
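
For convenience, here is a minimal reproduction sketch that combines steps 1 and 2 in a single notebook cell; it creates the testdb database first in case it does not exist. Cancel the cell while the write is running, then run it again to hit the error shown in the Problem section:

// Create the database if needed, build a small DataFrame, and write it as a
// managed table in overwrite mode. Cancelling this cell mid-write and then
// re-running it reproduces the AnalysisException above.
import org.apache.spark.sql.SaveMode

spark.sql("CREATE DATABASE IF NOT EXISTS testdb")
val df = spark.range(1000)
df.write.mode(SaveMode.Overwrite).saveAsTable("testdb.testtable")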

Solution

Set the flag spark.sql.legacy.allowCreatingManagedTableUsingNonemptyLocation to true. This flag deletes the _STARTED directory and returns the process to the original state. For example, you can set it in the notebook:

spark.conf.set("spark.sql.legacy.allowCreatingManagedTableUsingNonemptyLocation","true")

Or you can set it as a cluster-level Spark configuration:

spark.sql.legacy.allowCreatingManagedTableUsingNonemptyLocation true
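
With either setting in place, re-running the interrupted write should succeed. As a minimal sketch of the notebook-level recovery, reusing the example table from the reproduction steps:

// Allow Spark to create the managed table even though its location is not empty,
// then retry the overwrite that previously failed.
import org.apache.spark.sql.SaveMode

spark.conf.set("spark.sql.legacy.allowCreatingManagedTableUsingNonemptyLocation", "true")
spark.range(1000).write.mode(SaveMode.Overwrite).saveAsTable("testdb.testtable")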

Another option is to manually clean up the data directory specified in the error message. You can do this with dbutils.fs.rm.

dbutils.fs.rm("<path-to-directory>", True)
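
For the example table above, assuming the location reported in the error message, the cleanup would look like this (the delete is recursive, so double-check the path before running it):

// Recursively remove the leftover table directory named in the AnalysisException,
// then re-run the original write.
dbutils.fs.rm("dbfs:/user/hive/warehouse/testdb.db/metastore_cache_testtable", true)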