Scenario: RpcTimeoutException for Apache Spark thrift server in Azure HDInsight

This article describes troubleshooting steps and possible resolutions for issues that occur when using Apache Spark components in Azure HDInsight clusters.

Issue

A Spark application fails with an org.apache.spark.rpc.RpcTimeoutException exception and the message Futures timed out, as in the following example:

org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [120 seconds]. This timeout is controlled by spark.rpc.askTimeout
 at org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:48)

OutOfMemoryError and GC overhead limit exceeded errors may also appear in sparkthriftdriver.log, as in the following example:

WARN  [rpc-server-3-4] server.TransportChannelHandler: Exception in connection from /10.0.0.17:53218
java.lang.OutOfMemoryError: GC overhead limit exceeded
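
To check whether a log contains both symptoms, a short script along the following lines may help. This is only a minimal sketch and not part of any HDInsight tooling: the helper name `find_memory_pressure_signs`, the pattern labels, and the in-memory sample are illustrative assumptions, and the patterns are copied from the two example messages above.

```python
import re

# Patterns copied from the example messages in this article.
# The dictionary keys are illustrative labels, not Spark terminology.
PATTERNS = {
    "rpc_timeout": re.compile(
        r"org\.apache\.spark\.rpc\.RpcTimeoutException: Futures timed out"
    ),
    "gc_overhead": re.compile(
        r"java\.lang\.OutOfMemoryError: GC overhead limit exceeded"
    ),
}


def find_memory_pressure_signs(lines):
    """Count the lines that match each symptom (hypothetical helper)."""
    counts = {name: 0 for name in PATTERNS}
    for line in lines:
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[name] += 1
    return counts


# Sample lines taken from the excerpts above; in practice, pass an open
# file object for sparkthriftdriver.log instead of this list.
sample_log = [
    "org.apache.spark.rpc.RpcTimeoutException: Futures timed out after "
    "[120 seconds]. This timeout is controlled by spark.rpc.askTimeout",
    "WARN  [rpc-server-3-4] server.TransportChannelHandler: "
    "Exception in connection from /10.0.0.17:53218",
    "java.lang.OutOfMemoryError: GC overhead limit exceeded",
]

print(find_memory_pressure_signs(sample_log))
```

Seeing both counts above zero in the same time window is consistent with the memory-pressure cause described in this article, although it does not prove it.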

Cause

These errors are caused by a lack of memory resources during data processing. When memory runs low, Java garbage collection can take up so much time that the Spark application stops responding. Queries then begin to time out and stop processing. The Futures timed out error indicates a cluster under severe stress.

Resolution

Increase the cluster size by adding more worker nodes or by increasing the memory capacity of the existing cluster nodes. You can also adjust the data pipeline to reduce the amount of data processed at once.
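
One way to reduce the amount of data processed at once is to split the input into smaller batches and run one query per batch. The following is a minimal sketch under stated assumptions: it assumes the workload can be divided by some key (here, hypothetical date partitions), and the helper `in_batches` and the partition values are illustrative, not part of Spark or HDInsight.

```python
from itertools import islice


def in_batches(items, batch_size):
    """Yield successive lists of at most batch_size items (hypothetical helper)."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


# Hypothetical partition keys; a real pipeline would list its own.
partitions = ["2024-01-01", "2024-01-02", "2024-01-03", "2024-01-04", "2024-01-05"]

for batch in in_batches(partitions, 2):
    # Instead of one query over all partitions, submit one smaller
    # query per batch so that less data is held in memory at a time.
    print("process partitions:", batch)
```

Smaller batches lower peak memory use at the cost of more queries, so the batch size is a trade-off to tune for the workload.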

The spark.network.timeout setting controls the default timeout for all network connections. Spark also uses it in place of spark.rpc.askTimeout, the setting named in the error message, when that setting is not configured explicitly. Increasing the network timeout may allow more time for some critical operations to finish, but it will not resolve the issue completely.
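
As an illustration of raising the timeout, the sketch below formats Spark settings as `--conf key=value` arguments, the form that `spark-submit` accepts. The helper `spark_conf_args`, the value `300s`, and the application name `app.py` are assumptions for illustration only; for the thrift server itself, the setting would normally be changed in the cluster's Spark configuration rather than on a command line.

```python
def spark_conf_args(settings):
    """Turn a dict of Spark settings into --conf arguments (hypothetical helper)."""
    args = []
    for key, value in sorted(settings.items()):
        args.extend(["--conf", f"{key}={value}"])
    return args


# 300s is an illustrative value, not a recommendation.
timeout_settings = {"spark.network.timeout": "300s"}

# app.py is a placeholder application name.
command = ["spark-submit"] + spark_conf_args(timeout_settings) + ["app.py"]
print(" ".join(command))
```

A longer timeout only gives slow operations more time to finish; the underlying memory shortage still needs to be addressed.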

Next steps

If you didn't see your problem or are unable to solve your issue, visit the following channel for more support: