Create table in overwrite mode fails when interrupted
Problem
When you attempt to rerun an Apache Spark write operation by cancelling the currently running job, the following error occurs:
Error: org.apache.spark.sql.AnalysisException: Cannot create the managed table('`testdb`.`testtable`').
The associated location ('dbfs:/user/hive/warehouse/testdb.db/metastore_cache_testtable') already exists.;
Version
This problem can occur in Databricks Runtime 5.0 and above.
Cause
This problem is due to a change in the default behavior of Spark in version 2.4.
This problem can occur if:
- The cluster is terminated while a write operation is in progress.
- A temporary network issue occurs.
- The job is interrupted.
Once the metastore data for a particular table is corrupted, it is hard to recover except by dropping the files in that location manually. Basically, the problem is that a metadata directory called _STARTED isn't deleted automatically when Azure Databricks tries to overwrite it.
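If you want to confirm that this is what happened, you can list the database location and look for leftover entries before deciding how to recover. The following is a minimal Scala sketch; the warehouse path is taken from the example error message above, and the name prefixes used in the filter are assumptions, so adjust both for your table:
// Inspect the database location for metadata left behind by an interrupted write.
// The path and the name prefixes below are assumptions based on the example error above.
val dbLocation = "dbfs:/user/hive/warehouse/testdb.db/"
dbutils.fs.ls(dbLocation)
  .filter(f => f.name.startsWith("_STARTED") || f.name.startsWith("metastore_cache_"))
  .foreach(f => println(s"Leftover entry: ${f.path}"))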
You can reproduce the problem by following these steps:
Create a DataFrame:
val df = spark.range(1000)
Write the DataFrame to a location in overwrite mode:
import org.apache.spark.sql.SaveMode
df.write.mode(SaveMode.Overwrite).saveAsTable("testdb.testtable")
Cancel the command while it is executing.
Re-run the write command.
Solution
Set the flag spark.sql.legacy.allowCreatingManagedTableUsingNonemptyLocation to true. This flag deletes the _STARTED directory and returns the process to the original state.
For example, you can set it in the notebook:
spark.conf.set("spark.sql.legacy.allowCreatingManagedTableUsingNonemptyLocation","true")
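After setting the flag in the notebook, you can verify it took effect and retry the interrupted write in the same session. A minimal sketch, reusing the example table from the reproduction steps above:
// Confirm the flag is active for the current Spark session.
println(spark.conf.get("spark.sql.legacy.allowCreatingManagedTableUsingNonemptyLocation"))
// Retry the write that previously failed; with the flag set, the leftover
// _STARTED directory no longer blocks creating the managed table.
import org.apache.spark.sql.SaveMode
spark.range(1000).write.mode(SaveMode.Overwrite).saveAsTable("testdb.testtable")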
Or you can set it as a cluster-level Spark configuration:
spark.sql.legacy.allowCreatingManagedTableUsingNonemptyLocation true
Another option is to manually clean up the data directory specified in the error message. You can do this with dbutils.fs.rm.
dbutils.fs.rm("<path-to-directory>", true)
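For example, to remove the location reported in the example error message above (substitute the path from your own error message):
// Recursively remove the stale managed-table location from the error message,
// then re-run the original write command.
// The path below is the one from the example error above; replace it with the path in your error.
dbutils.fs.rm("dbfs:/user/hive/warehouse/testdb.db/metastore_cache_testtable", true)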