Cluster configuration optimization for Apache Spark

This article discusses how to optimize the configuration of your Apache Spark cluster for best performance on Azure HDInsight.

Overview

Depending on your Spark cluster workload, you may find that a non-default Spark configuration gives better job execution. Benchmark any non-default cluster configuration with sample workloads before you adopt it.

Here are some common parameters you can adjust:

Parameter | Description
--num-executors | Sets the number of executors.
--executor-cores | Sets the number of cores for each executor. Typically you should have middle-sized executors, as other processes consume some of the available memory.
--executor-memory | Sets the memory size for each executor, which controls the heap size on YARN. Leave some memory for execution overhead.
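
As an illustration of how the three parameters above fit together, the following minimal Python sketch assembles a spark-submit command line. The helper name `build_spark_submit_args`, the application path `my_job.py`, and the numeric values are hypothetical placeholders, not recommendations; validate any values with your own benchmarks.

```python
def build_spark_submit_args(app_path, num_executors, executor_cores, executor_memory_gb):
    """Return a spark-submit argument list that sets the three executor flags."""
    return [
        "spark-submit",
        "--num-executors", str(num_executors),          # number of executors
        "--executor-cores", str(executor_cores),        # cores per executor
        "--executor-memory", f"{executor_memory_gb}g",  # heap size per executor
        app_path,
    ]


# Hypothetical example values; benchmark before relying on them.
args = build_spark_submit_args(
    "my_job.py", num_executors=12, executor_cores=5, executor_memory_gb=30
)
print(" ".join(args))
```

The list form can be passed to `subprocess.run` on a machine where `spark-submit` is on the path.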

Select the correct executor size

When deciding your executor configuration, consider the Java garbage collection (GC) overhead.

  • Factors to reduce executor size:

    1. Reduce heap size below 32 GB to keep GC overhead < 10%.
    2. Reduce the number of cores to keep GC overhead < 10%.
  • Factors to increase executor size:

    1. Reduce communication overhead between executors.
    2. Reduce the number of open connections between executors (N²) on larger clusters (> 100 executors).
    3. Increase heap size to accommodate memory-intensive tasks.
    4. Optional: Reduce per-executor memory overhead.
    5. Optional: Increase usage and concurrency by oversubscribing CPU.
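
To illustrate why larger executors reduce connection overhead, the sketch below counts distinct executor pairs for a fixed pool of cores. It assumes that every executor may open a connection to every other executor, so the count grows quadratically with the number of executors. The 400-core total and the function name `executor_pairs` are hypothetical.

```python
def executor_pairs(total_cores, cores_per_executor):
    """Return (executor count, number of distinct executor pairs)."""
    n = total_cores // cores_per_executor
    # Assumption: any two executors may need a connection, giving n*(n-1)/2 pairs.
    return n, n * (n - 1) // 2


# Hypothetical cluster with 400 cores available to executors.
for cores in (2, 4, 8):
    n, pairs = executor_pairs(400, cores)
    print(f"{cores} cores/executor -> {n} executors, {pairs} possible pairs")
```

Under these assumptions, quadrupling the executor size from 2 to 8 cores cuts the number of possible pairs by roughly a factor of 16.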

As a general rule, when selecting the executor size:

  1. Start with 30 GB per executor and distribute the available machine cores.
  2. Increase the number of executor cores for larger clusters (> 100 executors).
  3. Modify the size based both on trial runs and on the preceding factors, such as GC overhead.
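
The starting point in the steps above can be sketched as a small calculation. This is only a rough sketch under stated assumptions: the node sizes are hypothetical, the memory figure is taken to be what YARN can allocate on each node, and the per-executor overhead is modeled on Spark's default of 10% of executor memory with a 384 MB minimum. Treat the result as a first guess to refine with trial runs.

```python
def plan_executors(node_memory_mb, node_cores, num_nodes, executor_memory_mb=30 * 1024):
    """Suggest a starting executor layout from per-node resources."""
    # Assumed overhead model: 10% of executor memory, at least 384 MB.
    overhead_mb = max(384, executor_memory_mb // 10)
    container_mb = executor_memory_mb + overhead_mb
    # How many executors fit in one node's memory (at least one).
    executors_per_node = max(1, node_memory_mb // container_mb)
    # Distribute the node's cores evenly across its executors.
    cores_per_executor = max(1, node_cores // executors_per_node)
    return {
        "num_executors": executors_per_node * num_nodes,
        "executor_cores": cores_per_executor,
        "executor_memory_mb": executor_memory_mb,
    }


# Hypothetical worker nodes: 128 GB allocatable memory, 16 cores, 4 nodes.
plan = plan_executors(node_memory_mb=128 * 1024, node_cores=16, num_nodes=4)
print(plan)
```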

When running concurrent queries, consider:

  1. Start with 30 GB per executor and all machine cores.
  2. Create multiple parallel Spark applications by oversubscribing CPU (around 30% latency improvement).
  3. Distribute queries across parallel applications.
  4. Modify the size based both on trial runs and on the preceding factors, such as GC overhead.
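
One simple way to distribute queries across parallel applications is round-robin assignment, sketched below. The source does not prescribe a distribution strategy, so round-robin is an assumption here, and the query names and the function `distribute_queries` are hypothetical.

```python
def distribute_queries(queries, num_apps):
    """Assign queries to applications in round-robin order."""
    buckets = [[] for _ in range(num_apps)]
    for i, query in enumerate(queries):
        buckets[i % num_apps].append(query)
    return buckets


# Hypothetical query names spread over two parallel Spark applications.
assignment = distribute_queries(["q1", "q2", "q3", "q4", "q5"], num_apps=2)
print(assignment)
```

If query costs differ widely, a strategy that balances by estimated cost may work better than plain round-robin.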

For more information on using Ambari to configure executors, see Apache Spark settings - Spark executors.

Monitor query performance for outliers or other performance issues by looking at the timeline view. You can also review the SQL graph, job statistics, and so forth. For information on debugging Spark jobs using YARN and the Spark History Server, see Debug Apache Spark jobs running on Azure HDInsight. For tips on using YARN Timeline Server, see Access Apache Hadoop YARN application logs.

Sometimes one or a few of the executors are slower than the others, and their tasks take much longer to execute. This slowness frequently happens on larger clusters (> 30 nodes). In this case, divide the work into a larger number of tasks so the scheduler can compensate for slow tasks. For example, have at least twice as many tasks as the number of executor cores in the application. You can also enable speculative execution of tasks with conf: spark.speculation = true.
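
The rule of thumb above can be turned into a quick calculation, shown here as a minimal sketch. The executor counts are hypothetical, and the helper names `min_task_count` and `speculation_conf_args` are illustrative only; the `--conf spark.speculation=true` argument is the spark-submit form of the setting named above.

```python
def min_task_count(num_executors, executor_cores, factor=2):
    """Return the minimum task count: factor times the total executor cores."""
    return factor * num_executors * executor_cores


def speculation_conf_args():
    """Return spark-submit arguments that enable speculative execution."""
    return ["--conf", "spark.speculation=true"]


# Hypothetical application with 12 executors of 5 cores each.
tasks = min_task_count(num_executors=12, executor_cores=5)
print(f"Use at least {tasks} tasks (partitions)")
print(" ".join(speculation_conf_args()))
```

The resulting number could be used as a target when repartitioning data or setting the shuffle partition count.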

Next steps