Apache Spark job fails with maxResultSize exception

Problem

An Apache Spark job fails with a maxResultSize exception:

org.apache.spark.SparkException: Job aborted due to stage failure: Total size of serialized
results of XXXX tasks (X.0 GB) is bigger than spark.driver.maxResultSize (X.0 GB)

Cause

This error occurs because the configured size limit was exceeded. The limit applies to the total serialized results of Spark actions across all partitions. Spark actions include operations such as collect() to the driver node, toPandas(), and saving a large file to the driver's local file system.
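
As a minimal PySpark sketch of how this can happen (the DataFrame and its size are hypothetical), collect() serializes the results of every partition and sends them all to the driver:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical large DataFrame spread across many partitions.
df = spark.range(0, 500_000_000).repartition(2000)

# collect() gathers the serialized results of every partition on the driver;
# if their combined size exceeds spark.driver.maxResultSize, the job fails
# with the exception shown above.
rows = df.collect()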

Solution

In some situations, you may have to refactor the code to prevent the driver node from collecting a large amount of data. You can change the code so that the driver node collects only a limited amount of data, or increase the driver instance memory size. For example, you can call toPandas with Arrow enabled, or write the data to files and read those files back instead of collecting large amounts of data on the driver, as sketched below.
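
The following PySpark sketch shows both approaches. The data and the output path are placeholders, and the Arrow configuration key shown (spark.sql.execution.arrow.pyspark.enabled) is the Spark 3.x name:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(0, 1_000_000)  # hypothetical data

# Option 1: enable Arrow so toPandas() uses an efficient columnar transfer
# instead of row-by-row serialization.
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")
pdf = df.toPandas()

# Option 2: write the results to distributed storage and read them back
# instead of collecting them on the driver. The path is a placeholder.
df.write.mode("overwrite").parquet("/tmp/example_output")
result = spark.read.parquet("/tmp/example_output")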

If it is absolutely necessary, you can set the property spark.driver.maxResultSize in the cluster Spark configuration to a value <X>g higher than the value reported in the exception message:

spark.driver.maxResultSize <X>g

The default value is 4g. For details, see Application Properties.
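
If you create the SparkSession yourself (for example, in a standalone application), the same property can also be set through the session builder before the driver starts. A minimal sketch; the application name and the 8g value are placeholders for your own values:

from pyspark.sql import SparkSession

# spark.driver.maxResultSize must be set before the SparkSession is created;
# it cannot be changed on a running session.
spark = (
    SparkSession.builder
    .appName("increase-max-result-size")         # hypothetical name
    .config("spark.driver.maxResultSize", "8g")  # placeholder for <X>g
    .getOrCreate()
)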

If you set a high limit, out-of-memory errors can occur on the driver (depending on spark.driver.memory and the memory overhead of objects in the JVM). Set an appropriate limit to prevent out-of-memory errors.