"NativeAzureFileSystem...RequestBodyTooLarge" appears in Apache Spark streaming app log in HDInsight

This article describes troubleshooting steps and possible resolutions for issues when using Apache Spark components in Azure HDInsight clusters.

Issue

The error NativeAzureFileSystem ... RequestBodyTooLarge appears in the driver log for an Apache Spark streaming app.

Cause

Your Spark event log file is probably hitting the file length limit for WASB.

In Spark 2.3, each Spark app generates one Spark event log file. The event log file for a Spark streaming app keeps growing while the app is running. Today a file on WASB has a 50,000 block limit, and the default block size is 4 MB, so in the default configuration the maximum file size is 195 GB. However, Azure Storage has increased the maximum block size to 100 MB, which effectively raised the single-file limit to 4.75 TB. For more information, see Scalability and performance targets for Blob storage.
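For reference, the file-size limits above come directly from multiplying the block limit by the block size:

    50,000 blocks x   4 MB/block = 200,000 MB   ≈ 195 GB   (default block size)
    50,000 blocks x 100 MB/block = 5,000,000 MB ≈ 4.75 TB  (maximum block size)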

Resolution

There are three solutions available for this error:

  • Increase the block size to up to 100 MB. In Ambari UI, modify the HDFS configuration property fs.azure.write.request.size (or create it in the Custom core-site section). Set the property to a larger value, for example 33554432 (32 MB), then save the updated configuration and restart the affected components (see the sample setting after this list).

  • Periodically stop and resubmit the spark-streaming job.

  • Use HDFS to store Spark event logs. Note that using HDFS for storage may result in loss of Spark event data during cluster scaling or Azure upgrades.

    1. Make changes to spark.eventlog.dir and spark.history.fs.logDirectory via Ambari UI:

      spark.eventlog.dir = hdfs://mycluster/hdp/spark2-events
      spark.history.fs.logDirectory = hdfs://mycluster/hdp/spark2-events
      
    2. Create directories on HDFS:

      hadoop fs -mkdir -p hdfs://mycluster/hdp/spark2-events
      hadoop fs -chown -R spark:hadoop hdfs://mycluster/hdp
      hadoop fs -chmod -R 777 hdfs://mycluster/hdp/spark2-events
      hadoop fs -chmod -R o+t hdfs://mycluster/hdp/spark2-events
      
    3. Restart all affected services via Ambari UI.
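For the first option, a minimal sketch of the Custom core-site entry (assuming you pick the 32 MB value from the example above; tune it to your workload):

    fs.azure.write.request.size = 33554432

Ambari persists entries added under Custom core-site into core-site.xml, so the new value takes effect after the affected components are restarted.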

Next steps

If you didn't see your problem or are unable to solve your issue, visit the following channel for more support: