Optimize Apache Spark applications in HDInsight

This article gives an overview of strategies for optimizing Apache Spark applications on Azure HDInsight.

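Overview

You might encounter the following common scenarios

The performance of an Apache Spark job depends on multiple factors. These performance factors include: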

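Check whether ResourceManager or NodeManager has raised any alerts
Check the status of ResourceManager and NodeManager in YARN >: every NodeManager should be in the "Started" state, and only the "Active" ResourceManager should be in the "Started" state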

Check whether the Yarn UI is accessible at https://YOURCLUSTERNAME.azurehdinsight.cn/yarnui/hn/cluster
Check the ResourceManager log at /var/log/hadoop-yarn/yarn/hadoop-yarn-resourcemanager-*.log for any exceptions or errors

For more information, see Yarn common issues

Go to the Yarn UI and check the Yarn scheduler metrics at https://YOURCLUSTERNAME.azurehdinsight.cn/yarnui/hn/cluster/scheduler
Alternatively, you can check the Yarn scheduler metrics through the Yarn REST API. For example, curl -u "xxxx" -sS -G "https://YOURCLUSTERNAME.azurehdinsight.cn/ws/v1/cluster/scheduler". For ESP clusters, use a domain admin user.
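Once the REST call above returns, the JSON body can be reduced to a per-queue summary. The following is a minimal sketch, assuming the response follows the Capacity Scheduler layout (scheduler > schedulerInfo > queues > queue, with queueName, capacity, and usedCapacity fields). The sample payload is hypothetical and heavily trimmed, and summarize_scheduler is an illustrative helper, not part of any HDInsight tooling.

```python
import json


def summarize_scheduler(payload):
    """Return (queue name, used capacity %, configured capacity %) per top-level queue.

    Assumes the Capacity Scheduler response layout; other schedulers
    return a different structure.
    """
    info = json.loads(payload)["scheduler"]["schedulerInfo"]
    queues = (info.get("queues") or {}).get("queue", [])
    return [(q["queueName"], q["usedCapacity"], q["capacity"]) for q in queues]


# Hypothetical, trimmed response body; a real response carries many more fields.
sample = json.dumps({
    "scheduler": {
        "schedulerInfo": {
            "type": "capacityScheduler",
            "queueName": "root",
            "queues": {
                "queue": [
                    {"queueName": "default", "capacity": 100.0, "usedCapacity": 37.5}
                ]
            },
        }
    }
})

for name, used, capacity in summarize_scheduler(sample):
    print(f"{name}: {used}% used of {capacity}% configured capacity")
```

In practice the payload would be the text returned by the curl command shown above rather than the inline sample.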

All executor resources: spark.executor.instances * (spark.executor.memory + spark.yarn.executor.memoryOverhead) and spark.executor.instances * spark.executor.cores. For more information, see Spark executor configuration
ApplicationMaster
- In cluster mode, uses spark.driver.memory and spark.driver.cores
- In client mode, uses spark.yarn.am.memory + spark.yarn.am.memoryOverhead and spark.yarn.am.cores
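The resource formulas above can be worked through with a few lines of arithmetic. This is a rough sketch only: the helper names are hypothetical, all memory values are assumed to be plain integers in MB (real Spark settings accept strings such as 4g), and the example numbers are made up for illustration.

```python
def executor_totals(instances, memory_mb, overhead_mb, cores):
    """Total memory (MB) and vcores requested for all executors."""
    total_memory_mb = instances * (memory_mb + overhead_mb)
    total_vcores = instances * cores
    return total_memory_mb, total_vcores


def application_master_request(deploy_mode, conf):
    """Memory (MB) and cores for the ApplicationMaster, per the list above.

    conf is a dict keyed by Spark setting name; values are integers
    (memory in MB), which is a simplification for this sketch.
    """
    if deploy_mode == "cluster":
        return conf["spark.driver.memory"], conf["spark.driver.cores"]
    if deploy_mode == "client":
        memory = conf["spark.yarn.am.memory"] + conf["spark.yarn.am.memoryOverhead"]
        return memory, conf["spark.yarn.am.cores"]
    raise ValueError(f"unknown deploy mode: {deploy_mode}")


# Hypothetical job: 4 executors, 4096 MB + 410 MB overhead each, 2 cores each.
memory_mb, vcores = executor_totals(4, 4096, 410, 2)
print(memory_mb, vcores)
```

Comparing these totals with the free capacity of the target queue shows whether the job can get all of its executors at once.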

Note

yarn.scheduler.minimum-allocation-mb <= spark.executor.memory + spark.yarn.executor.memoryOverhead <= yarn.scheduler.maximum-allocation-mb
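The constraint in this note is easy to check before submitting a job. A minimal sketch, assuming all four values are already known as integers in MB; fits_yarn_allocation is a hypothetical helper and the numbers are illustrative.

```python
def fits_yarn_allocation(executor_memory_mb, overhead_mb, min_alloc_mb, max_alloc_mb):
    """True if the executor container size lies within the YARN allocation bounds."""
    container_mb = executor_memory_mb + overhead_mb
    return min_alloc_mb <= container_mb <= max_alloc_mb


# Hypothetical cluster limits: 1024 MB minimum, 8192 MB maximum allocation.
print(fits_yarn_allocation(4096, 410, 1024, 8192))  # container of 4506 MB
print(fits_yarn_allocation(8192, 819, 1024, 8192))  # container of 9011 MB
```

The second call exceeds the maximum allocation, which is the situation the note warns about.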

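We need to identify the following symptoms through the Spark UI or Spark History UI:

For more information, see Monitoring your Spark applications

Many optimizations can help you overcome these challenges, such as caching and allowing for data skew.

Each of the following articles covers a different aspect of Spark optimization.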

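spark.sql.shuffle.partitions defaults to 200. It sets the number of partitions used when shuffling data for joins or aggregations, and can be adjusted to suit business needs.
spark.sql.files.maxPartitionBytes defaults to 1G in HDI. It is the maximum number of bytes to pack into a single partition when reading files. This setting takes effect only with file-based sources such as Parquet, JSON, and ORC.
AQE in Spark 3.0. See Adaptive Query Execution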
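One way to apply the settings above is to pass them to spark-submit as --conf flags. The sketch below only builds the argument list. The values are hypothetical placeholders rather than recommendations, to_conf_args is an illustrative helper, and spark.sql.adaptive.enabled is the Spark 3.0 setting that turns AQE on.

```python
def to_conf_args(settings):
    """Turn a dict of Spark settings into spark-submit --conf arguments."""
    args = []
    for key, value in settings.items():
        args += ["--conf", f"{key}={value}"]
    return args


# Hypothetical values; tune them for the actual workload.
settings = {
    "spark.sql.shuffle.partitions": 400,
    "spark.sql.files.maxPartitionBytes": 268435456,  # 256 MiB, in bytes
    "spark.sql.adaptive.enabled": "true",
}

print(" ".join(to_conf_args(settings)))
```

The resulting flags can be appended to a spark-submit command line, or the same keys can be set in the session configuration of the application.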