Scenario: Cluster node runs out of disk space in Azure HDInsight

This article describes troubleshooting steps and possible resolutions for issues when interacting with Azure HDInsight clusters.

Issue

A job may fail with an error message similar to: /usr/hdp/2.6.3.2-14/hadoop/libexec/hadoop-config.sh: fork: No space left on device.

Or you may receive an Apache Ambari alert similar to: local-dirs usable space is below configured utilization percentage.

Cause

The Apache YARN application cache may have consumed all available disk space. Your Spark application is likely running inefficiently.

Resolution

  1. Use the Ambari UI to determine which node is running out of disk space.

  2. Determine which folder on the problematic node is consuming most of the disk space. SSH to the node first, then run df to list disk usage for all mounts. Usually it is /mnt, a temporary disk used by OSS. You can change into a folder and then run sudo du -hs to show the total size of the files under it. If you see a folder similar to /mnt/resource/hadoop/yarn/local/usercache/livy/appcache/application_1537280705629_0007, the application is still running. This could be due to RDD persistence or intermediate shuffle files. A command sketch follows this list.

  3. To mitigate the issue, kill the application, which releases the disk space it is using. A YARN CLI sketch follows this list.

  4. If the issue happens frequently on the worker nodes, you can tune the YARN local cache settings on the cluster.

    Open the Ambari UI and navigate to YARN --> Configs --> Advanced.
    Add the following two properties to the custom yarn-site.xml section and save:

    yarn.nodemanager.localizer.cache.target-size-mb=2048
    yarn.nodemanager.localizer.cache.cleanup.interval-ms=300000

    The first property sets the target size of the NodeManager's localized resource cache to 2,048 MB; the second runs the cache cleanup every 300,000 ms (5 minutes).

  5. If the above does not permanently fix the issue, optimize your application.
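
For step 2, a minimal command sketch (run over SSH on the affected node) might look like the following; the usercache path mirrors the example above and may differ on your cluster:

    # List disk usage for every mount point; /mnt (the temp disk) is usually the one filling up.
    df -h

    # Show the size of each application's cache folder under the YARN local directory
    # (three levels down: <user>/appcache/<application_id>).
    sudo du -h --max-depth=3 /mnt/resource/hadoop/yarn/local/usercache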
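
For step 3, one way to kill the application is the YARN CLI from a cluster head node, sketched below; the application ID shown is the example ID from step 2 and must be replaced with your own:

    # Find the ID of the running application that owns the large appcache folder.
    yarn application -list

    # Kill it; YARN then cleans up the application's appcache directory and frees the space.
    yarn application -kill application_1537280705629_0007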

Next steps

If you didn't see your problem or are unable to solve your issue, visit the following channel for more support: