Jobs failing with shuffle fetch failures

Problem

You are seeing intermittent Apache Spark job failures on jobs that use shuffle fetch.

21/02/01 05:59:55 WARN TaskSetManager: Lost task 0.0 in stage 4.0 (TID 4, 10.79.1.45, executor 0): FetchFailed(BlockManagerId(1, 10.79.1.134, 4048, None), shuffleId=1, mapId=0, reduceId=0, message=
org.apache.spark.shuffle.FetchFailedException: Failed to connect to /10.79.1.134:4048
at org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:553)
at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:484)
at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:63)
... 1 more
Caused by: io.netty.channel.AbstractChannel$AnnotatedNoRouteToHostException: No route to host: /10.79.1.134:4048
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
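When many of these failures occur, it can help to list which executor addresses could not be reached. The following is a minimal sketch, assuming the driver log is available as plain text and that the message keeps the "Failed to connect to /<ip>:<port>" form shown above. The function name `unreachable_endpoints` is hypothetical and not part of Spark.

```python
import re

# Matches the "Failed to connect to /<ip>:<port>" message seen in the trace above.
FETCH_FAILED = re.compile(r"Failed to connect to /(\d{1,3}(?:\.\d{1,3}){3}):(\d+)")

def unreachable_endpoints(log_text):
    """Return the sorted, de-duplicated (ip, port) pairs named in fetch failures."""
    return sorted({(ip, int(port)) for ip, port in FETCH_FAILED.findall(log_text)})

# Sample line copied from the log excerpt above.
sample = (
    "org.apache.spark.shuffle.FetchFailedException: "
    "Failed to connect to /10.79.1.134:4048"
)
print(unreachable_endpoints(sample))
```

Comparing the reported addresses with the subnet ranges configured for the workspace shows whether the unreachable executors sit outside the range the workspace was deployed with.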

Cause

This can happen if you modified the Azure Databricks subnet CIDR range after deployment. Modifying the subnet CIDR range after deployment is not supported.

Consider the following two configurations:

Original Azure Databricks subnet CIDR

  • Private subnet: 10.10.0.0/24 (10.10.0.0 - 10.10.0.255)
  • Public subnet: 10.10.1.0/24 (10.10.1.0 - 10.10.1.255)

Modified Azure Databricks subnet CIDR

  • Private subnet: 10.10.0.0/18 (10.10.0.0 - 10.10.63.255)
  • Public subnet: 10.10.64.0/18 (10.10.64.0 - 10.10.127.255)
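The address ranges listed for each CIDR block can be checked with Python's standard `ipaddress` module. This is a small sketch for verification only. The helper name `cidr_range` is hypothetical, and the modified public subnet is treated as a /18 on the assumption that its listed range of 10.10.64.0 - 10.10.127.255 is the intended one.

```python
import ipaddress

def cidr_range(cidr):
    """Return the first and last address of a CIDR block as strings."""
    net = ipaddress.ip_network(cidr)
    return str(net.network_address), str(net.broadcast_address)

# Original subnets, followed by the modified subnets from the example above.
for cidr in ("10.10.0.0/24", "10.10.1.0/24", "10.10.0.0/18", "10.10.64.0/18"):
    first, last = cidr_range(cidr)
    print(f"{cidr}: {first} - {last}")
```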

With the original settings, everything works as intended.

With the modified settings, if executors are assigned IP addresses in the range 10.10.1.0 - 10.10.63.255 and the driver is assigned an IP address in the range 10.10.0.0 - 10.10.0.255, communication between executors is blocked because a firewall rule limits communication to the original CIDR range of 10.10.0.0/24.

If the executors and the driver are all assigned IP addresses in 10.10.0.0/24, no communication is blocked and the job runs as intended. However, this assignment is not guaranteed under the modified settings.
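To illustrate why the failures are intermittent, the sketch below flags node addresses that are valid in the modified private subnet but fall outside the original one. It assumes the example CIDR values above. The node IP list and the function name `outside_original` are hypothetical.

```python
import ipaddress

# Example values from this article; substitute your workspace's original CIDR.
ORIGINAL_PRIVATE = ipaddress.ip_network("10.10.0.0/24")

def outside_original(ips, original=ORIGINAL_PRIVATE):
    """Return the addresses that are not covered by the original CIDR range."""
    return [ip for ip in ips if ipaddress.ip_address(ip) not in original]

# Hypothetical driver and executor addresses handed out under the /18 subnet.
node_ips = ["10.10.0.12", "10.10.0.47", "10.10.5.9"]
print(outside_original(node_ips))
```

An empty result corresponds to the case where every node happened to land inside 10.10.0.0/24, so the job succeeds. Any address in the result is a node whose traffic the firewall rule may block.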

Solution

  1. Revert any subnet CIDR changes and restore the original VNet configuration that you used to create the Azure Databricks workspace.
  2. Restart your cluster.
  3. Resubmit your jobs.