Checkpoint files not being deleted when using display()

Problem

You have a streaming job that uses display() to display DataFrames.

val streamingDF = spark.readStream.schema(schema).parquet(<input_path>) // read a stream of Parquet files with a known schema
display(streamingDF) // starts a streaming query that renders each micro-batch in the notebook

Checkpoint files are being created, but are not being deleted.

You can verify the problem by navigating to the root directory and looking in the /local_disk0/tmp/ folder. Checkpoint files remain in the folder.
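For example, you can check for leftover checkpoint directories from a notebook cell. This is a minimal sketch that assumes a Databricks notebook where dbutils is in scope and the driver's local disk is reachable through the file:/ scheme; the "temporary" name prefix is an assumption about how Spark names auto-generated checkpoint directories.

// List leftover checkpoint directories on the driver's local disk.
dbutils.fs.ls("file:/local_disk0/tmp/")
  .filter(_.name.startsWith("temporary")) // assumed prefix for auto-generated checkpoint dirs
  .foreach(f => println(f.path))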

Cause

The command display(streamingDF) is a memory sink implementation that can display the data from the streaming DataFrame for every micro-batch. A checkpoint directory is required to track the streaming updates.
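Conceptually, display(streamingDF) behaves like starting a streaming query against a memory sink. The sketch below approximates that behavior for illustration only; it is not Databricks' actual implementation, and the query name is a hypothetical example.

// Rough equivalent of what display(streamingDF) sets in motion:
// a streaming query that writes each micro-batch to an in-memory table.
val query = streamingDF.writeStream
  .format("memory")          // memory sink: micro-batches are kept on the driver
  .queryName("display_demo") // hypothetical name; display() generates its own
  .start()                   // no checkpointLocation set, so Spark creates a temporary one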

If you have not specified a custom checkpoint location, a default checkpoint directory is created at /local_disk0/tmp/.

Azure Databricks uses the checkpoint directory to ensure correct and consistent progress information. When a stream is shut down, either purposely or accidentally, the checkpoint directory allows Azure Databricks to restart and pick up exactly where it left off.

If a stream is shut down by cancelling it from the notebook, the Azure Databricks job attempts to clean up the checkpoint directory on a best-effort basis. If the stream is terminated in any other way, or if the job is terminated, the checkpoint directory is not cleaned up.

This behavior is by design.

Solution

You can prevent unwanted checkpoint files with the following guidelines.

  • You should not use display(streamingDF) in production jobs.
  • If display(streamingDF) is mandatory for your use case, you should manually specify the checkpoint directory by using the Apache Spark config option spark.sql.streaming.checkpointLocation, as shown in the sketch after this list.
  • If you manually specify the checkpoint directory, you should periodically delete any remaining files in this directory. This can be done on a weekly basis.
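The following sketch puts the last two guidelines together. The checkpoint path is a hypothetical example, and the cleanup call is meant to run from a separate scheduled (for example, weekly) job.

// Route checkpoints for display() to an explicit, known location.
spark.conf.set("spark.sql.streaming.checkpointLocation", "dbfs:/tmp/streaming_checkpoints") // hypothetical path

val streamingDF = spark.readStream.schema(schema).parquet(<input_path>)
display(streamingDF) // checkpoints now land under the configured location

// Periodic cleanup, e.g. from a weekly scheduled job.
// recurse = true deletes the whole directory tree.
dbutils.fs.rm("dbfs:/tmp/streaming_checkpoints", true)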