RequestBodyTooLarge appears in Apache Spark Streaming application log in HDInsight
This article describes troubleshooting steps and possible resolutions for issues when using Apache Spark components in Azure HDInsight clusters.
You receive the following errors in an Apache Spark Streaming application log:
NativeAzureFileSystem ... RequestBodyTooLarge
Or
java.io.IOException: Operation failed: "The request body is too large and exceeds the maximum permissible limit.", 413, PUT, https://<storage account>.dfs.core.chinacloudapi.cn/<container>/hdp/spark2-events/application_1620341592106_0004_1.inprogress?action=flush&retainUncommittedData=false&position=9238349177&close=false&timeout=90, RequestBodyTooLarge, "The request body is too large and exceeds the maximum permissible limit. RequestId:0259adb6-101f-0041-0660-43f672000000 Time:2021-05-07T16:48:00.2660760Z"
at org.apache.hadoop.fs.azurebfs.services.AbfsOutputStream.flushWrittenBytesToServiceInternal(AbfsOutputStream.java:362)
at org.apache.hadoop.fs.azurebfs.services.AbfsOutputStream.flushWrittenBytesToService(AbfsOutputStream.java:337)
at org.apache.hadoop.fs.azurebfs.services.AbfsOutputStream.flushInternal(AbfsOutputStream.java:272)
at org.apache.hadoop.fs.azurebfs.services.AbfsOutputStream.hflush(AbfsOutputStream.java:230)
at org.apache.hadoop.fs.FSDataOutputStream.hflush(FSDataOutputStream.java:134)
at org.apache.spark.scheduler.EventLoggingListener$$anonfun$logEvent$3.apply(EventLoggingListener.scala:144)
at org.apache.spark.scheduler.EventLoggingListener$$anonfun$logEvent$3.apply(EventLoggingListener.scala:144)
at scala.Option.foreach(Option.scala:257)
at org.apache.spark.scheduler.EventLoggingListener.logEvent(EventLoggingListener.scala:144)
Files created over the ABFS driver create block blobs in Azure Storage. Your Spark event log file is probably hitting the file length limit for WASB: a block blob can hold at most 50,000 blocks.
In Spark 2.3, each Spark app generates one Spark event log file. The Spark event log file for a Spark streaming app continues to grow while the app is running. Today a file on WASB has a 50,000-block limit, and the default block size is 4 MB, so in the default configuration the maximum file size is 195 GB. However, Azure Storage has increased the maximum block size to 100 MB, which effectively raises the single-file limit to 4.75 TB. For more information, see Scalability and performance targets for Blob storage.
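Both limits follow directly from multiplying the 50,000-block maximum by the block size in use:

50,000 blocks x 4 MB   = 200,000 MB   ≈ 195 GB  (default block size)
50,000 blocks x 100 MB = 5,000,000 MB ≈ 4.75 TB (maximum block size)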
There are four solutions available for this error:
1. Increase the block size to up to 100 MB. In the Ambari UI, modify the HDFS configuration property fs.azure.write.request.size (or create it in the Custom core-site section). Set the property to a larger value, for example 33554432 (32 MB). Save the updated configuration and restart the affected components. A sketch of the resulting core-site.xml entry appears after this list.

2. Periodically stop and resubmit the spark-streaming job.
3. Use HDFS to store Spark event logs. Note that using HDFS for storage may result in loss of Spark event data during cluster scaling or Azure upgrades.

   Make changes to spark.eventLog.dir and spark.history.fs.logDirectory via the Ambari UI:

   spark.eventLog.dir = hdfs://mycluster/hdp/spark2-events
   spark.history.fs.logDirectory = hdfs://mycluster/hdp/spark2-events

   Create the directories on HDFS:

   hadoop fs -mkdir -p hdfs://mycluster/hdp/spark2-events
   hadoop fs -chown -R spark:hadoop hdfs://mycluster/hdp
   hadoop fs -chmod -R 777 hdfs://mycluster/hdp/spark2-events
   hadoop fs -chmod -R o+t hdfs://mycluster/hdp/spark2-events

   Restart all affected services via the Ambari UI.
4. Add --conf spark.hadoop.fs.azure.enable.flush=false to your spark-submit command to disable auto flush. An example invocation appears after this list.
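For solution 1, the Custom core-site change is equivalent to the following core-site.xml entry. This is only a minimal sketch for reference; the 33554432 value is the example from above, and on a real cluster the file is managed by Ambari, so make the change through the UI rather than by hand:

<property>
  <name>fs.azure.write.request.size</name>
  <value>33554432</value>
</property>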
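For solution 4, a hypothetical spark-submit invocation with the flush setting disabled might look like the following. The master, deploy mode, class name, and application JAR are placeholders, not values from this article:

spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --conf spark.hadoop.fs.azure.enable.flush=false \
  --class com.example.StreamingApp \
  streaming-app.jar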
If you didn't see your problem or are unable to solve your issue, visit one of the following channels for more support:
- If you need more help, you can submit a support request from the Azure portal. Select Support from the menu bar or open the Help + support hub.