Spark job fails with Driver is temporarily unavailable

Problem

A Databricks notebook returns the following error:

Driver is temporarily unavailable

This issue can be intermittent or persistent.

A related error message is:

Lost connection to cluster. The notebook may have been detached.

Cause

One common cause of this error is that the driver is experiencing a memory bottleneck. When this happens, the driver either crashes with an out-of-memory (OOM) condition and restarts, or becomes unresponsive due to frequent full garbage collection. The memory bottleneck can be caused by any of the following:

  • The driver instance type is not optimal for the load executed on the driver.
  • Memory-intensive operations are executed on the driver (illustrated in the sketch after this list).
  • Many notebooks or jobs are running in parallel on the same cluster.
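
For example, driver-side patterns like the following pull an entire distributed dataset into driver memory. This is only a minimal illustration of the second cause above; the table name is hypothetical.

# Illustrative anti-pattern (hypothetical table name): both calls below funnel
# the whole distributed DataFrame through the driver's memory and can trigger
# a driver OOM or long full-GC pauses on large data.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
big_df = spark.read.table("clickstream_raw")   # hypothetical large table

all_rows = big_df.collect()     # materializes every row on the driver
pdf = big_df.toPandas()         # converts the full DataFrame to pandas on the driver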

Solution

The solution varies from case to case. In the absence of specific details, the easiest way to resolve the issue is to increase the driver memory. You can increase driver memory by upgrading the driver node type on the cluster edit page in your Azure Databricks workspace.
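
If you manage clusters programmatically rather than through the UI, the same change can be made with the Databricks Clusters API (clusters/edit). The snippet below is a minimal sketch only: the workspace URL, token, cluster ID, runtime version, and node types are placeholders, and the edit request must carry the full specification of your existing cluster.

import requests

# Placeholders -- replace with your workspace URL, personal access token, and cluster ID.
WORKSPACE_URL = "https://<your-workspace>.azuredatabricks.net"
TOKEN = "<personal-access-token>"

# clusters/edit expects the complete cluster specification; only the fields
# relevant to driver sizing are highlighted here.
cluster_spec = {
    "cluster_id": "<cluster-id>",
    "cluster_name": "etl-cluster",
    "spark_version": "13.3.x-scala2.12",        # example runtime version
    "node_type_id": "Standard_DS3_v2",          # worker node type
    "driver_node_type_id": "Standard_DS5_v2",   # larger driver node for more driver memory
    "num_workers": 4,
}

resp = requests.post(
    f"{WORKSPACE_URL}/api/2.0/clusters/edit",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=cluster_spec,
)
resp.raise_for_status()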

Other points to consider:

  • Avoid memory-intensive operations such as:

    • The collect() operator, which brings a large amount of data to the driver.
    • Conversion of a large DataFrame to Pandas.

    If these operations are essential, ensure that enough driver memory is available (see the sketch after this list for driver-friendly alternatives).

  • Avoid running batch jobs on a shared interactive cluster.

  • Distribute the workloads across different clusters. No matter how big the cluster is, the functionality of the Spark driver cannot be distributed within a cluster.
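
Where collect() or toPandas() appears in a notebook, driver pressure can often be reduced by sampling, aggregating before conversion, or keeping results distributed. The following sketch illustrates these alternatives; the table and column names are hypothetical.

# Driver-friendly alternatives to pulling a full DataFrame onto the driver.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.read.table("events")          # hypothetical large table

# Instead of df.collect(), inspect only a small sample on the driver.
preview = df.limit(100).collect()

# Instead of df.toPandas() on the full DataFrame, aggregate first ...
daily_counts = df.groupBy("event_date").count()
small_pdf = daily_counts.toPandas()      # only the aggregated result reaches the driver

# ... or keep the full result distributed by writing it out instead.
df.write.mode("overwrite").saveAsTable("events_processed")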