Scenario: Cluster node runs out of disk space in Azure HDInsight

This article describes troubleshooting steps and possible resolutions for issues that occur when interacting with Azure HDInsight clusters.

Issue

A job may fail with an error message similar to: /usr/hdp/2.6.3.2-14/hadoop/libexec/hadoop-config.sh: fork: No space left on device.

Or you may receive an Apache Ambari alert similar to: local-dirs usable space is below configured utilization percentage.

Cause

The Apache Hadoop YARN application cache may have consumed all available disk space. This usually indicates that your Spark application is running inefficiently.

Resolution

  1. Use the Ambari UI to determine which node is running out of disk space.

  2. Determine which folder on the affected node uses most of the disk space. SSH to the node first, then run df to list disk usage for all mounts. Usually the fullest mount is /mnt, a temporary disk used by the open-source software (OSS) components. To drill down, change into a folder and run sudo du -hs to show the summarized size of the files under it. If you see a folder similar to /mnt/resource/hadoop/yarn/local/usercache/livy/appcache/application_1537280705629_0007, the application is still running. The space is probably being consumed by RDD persistence or intermediate shuffle files.

  3. To mitigate the issue, kill the application. This releases the disk space used by that application.

  4. To resolve the issue permanently, optimize your application.
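
The inspection in steps 2 and 3 can also be scripted. The following is a minimal Python sketch, not part of the official procedure. It assumes Python 3 is available on the node and that the script is run with enough privileges to read the YARN local directories. The helper names (mount_usage_percent, directory_size, largest_subdirectories, application_id_from_path, kill_command) are hypothetical and introduced only for illustration. The usercache path is taken from the example path in step 2 and may differ on your cluster. The sketch only prints a suggested yarn application -kill command; it does not run it.

```python
import os
import re
import shutil

# YARN application IDs have the form application_<cluster timestamp>_<sequence>.
APP_ID_PATTERN = re.compile(r"application_\d+_\d+")


def mount_usage_percent(path):
    """Return the used percentage of the filesystem that holds `path`."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        return 0.0
    return 100.0 * usage.used / usage.total


def directory_size(path):
    """Sum the sizes of regular files under `path`, in bytes.

    This approximates `du -s`. It adds up file lengths rather than
    allocated blocks, so the totals can differ slightly from `du`.
    """
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            full = os.path.join(root, name)
            try:
                if not os.path.islink(full):
                    total += os.path.getsize(full)
            except OSError:
                # The file vanished or is unreadable; skip it.
                pass
    return total


def largest_subdirectories(path, top=5):
    """Return up to `top` (size_in_bytes, path) pairs, largest first."""
    sizes = []
    with os.scandir(path) as entries:
        for entry in entries:
            if entry.is_dir(follow_symlinks=False):
                sizes.append((directory_size(entry.path), entry.path))
    sizes.sort(reverse=True)
    return sizes[:top]


def application_id_from_path(path):
    """Extract a YARN application ID from a cache path, or return None."""
    match = APP_ID_PATTERN.search(path)
    return match.group(0) if match else None


def kill_command(app_id):
    """Build (but do not run) the command that kills a YARN application."""
    return ["yarn", "application", "-kill", app_id]


if __name__ == "__main__":
    # Assumption: the user is livy, as in the example path in step 2.
    cache = "/mnt/resource/hadoop/yarn/local/usercache/livy/appcache"
    if os.path.isdir(cache):
        print("Mount usage: %.1f%%" % mount_usage_percent(cache))
        for size, folder in largest_subdirectories(cache):
            app_id = application_id_from_path(folder)
            print("%12d bytes  %s" % (size, folder))
            if app_id:
                print("  to release: " + " ".join(kill_command(app_id)))
    else:
        print("Cache folder not found: " + cache)
```

Confirm that an application is safe to stop before running the printed command, because killing it ends the job and discards its cached data.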

Next steps

If you didn't see your problem or are unable to solve your issue, visit the following channel for more support:

  • If you need more help, you can submit a support request from the Azure portal. Select Support from the menu bar or open the Help + support hub.