Memory usage optimization for Apache Spark

This article discusses how to optimize memory management of your Apache Spark cluster for best performance on Azure HDInsight.

Overview

Spark operates by placing data in memory, so managing memory resources is a key aspect of optimizing the execution of Spark jobs. There are several techniques you can apply to use your cluster's memory efficiently.

  • Prefer smaller data partitions, and account for data size, types, and distribution in your partitioning strategy.
  • Consider the newer, more efficient Kryo data serialization rather than the default Java serialization (see the configuration sketch after this list).
  • Prefer using YARN, as it separates spark-submit by batch.
  • Monitor and tune Spark configuration settings.
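As a minimal sketch of the first two points, the code below enables Kryo serialization and repartitions a DataFrame into smaller, evenly distributed partitions. The application name, the LogRecord class, and the input path are hypothetical, not part of this article.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

// Hypothetical record type; registering classes with Kryo avoids
// serializing full class names with every object.
case class LogRecord(host: String, status: Int)

val conf = new SparkConf()
  .setAppName("kryo-example") // hypothetical name
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .registerKryoClasses(Array(classOf[LogRecord]))

val spark = SparkSession.builder().config(conf).getOrCreate()

// Smaller, well-distributed partitions: repartition by a key column so that
// each task holds only a modest slice of the data in memory at once.
val logs = spark.read.parquet("wasb:///example/data/logs") // hypothetical path
val partitioned = logs.repartition(200, logs("host"))
```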

For your reference, the Spark memory structure and some key executor memory parameters are shown in the next image.
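Several of these executor memory parameters can also be set when the session is created. A minimal sketch follows; the values shown are illustrative, not tuning recommendations.

```scala
import org.apache.spark.sql.SparkSession

// Minimal sketch; values are illustrative, not tuning advice.
val spark = SparkSession.builder()
  .appName("memory-settings") // hypothetical name
  .config("spark.executor.memory", "4g")           // JVM heap per executor
  .config("spark.executor.memoryOverhead", "512m") // off-heap allowance added to each YARN container
  .config("spark.memory.fraction", "0.6")          // share of heap for execution and storage
  .config("spark.memory.storageFraction", "0.5")   // portion of that protected from eviction
  .getOrCreate()
```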

Spark memory considerations

If you're using Apache Hadoop YARN, then YARN controls the memory used by all containers on each Spark node. The following diagram shows the key objects and their relationships.

(Diagram: YARN Spark memory management.)
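As a worked example of how these objects relate: each executor's YARN container request is spark.executor.memory plus the memory overhead (spark.executor.memoryOverhead, or spark.yarn.executor.memoryOverhead on older Spark releases), which defaults to the larger of 384 MB or 10% of the executor memory. With spark.executor.memory set to 4g, the container request is roughly 4g + 410m ≈ 4.4g, and that total must fit within the node's yarn.nodemanager.resource.memory-mb.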

To address 'out of memory' messages, try:

  • Review DAG Management Shuffles. Reduce by map-side reducing, pre-partition (or bucketize) source data, maximize single shuffles, and reduce the amount of data sent.
  • Prefer ReduceByKey with its fixed memory limit to GroupByKey, which provides aggregations, windowing, and other functions but has an unbounded memory limit (see the sketch after this list).
  • Prefer TreeReduce, which does more work on the executors or partitions, to Reduce, which does all work on the driver.
  • Use DataFrames rather than the lower-level RDD objects.
  • Create ComplexTypes that encapsulate actions, such as "Top N", various aggregations, or windowing operations.
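The following sketch illustrates the first two preferences on an illustrative word-count RDD; the application name and input path are hypothetical.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("shuffle-example").getOrCreate() // hypothetical name
val sc = spark.sparkContext

val pairs = sc.textFile("wasb:///example/data/sample.log") // hypothetical path
  .flatMap(_.split("\\s+"))
  .map(word => (word, 1))

// Avoid: groupByKey ships every value across the shuffle before any aggregation.
// val counts = pairs.groupByKey().mapValues(_.sum)

// Prefer: reduceByKey combines values map-side first, so shuffle memory stays bounded.
val counts = pairs.reduceByKey(_ + _)

// Prefer treeReduce to reduce for large aggregations: partial results are merged
// on executors in rounds rather than all at once on the driver.
val total = counts.map(_._2).treeReduce(_ + _, depth = 2)
```

Both reduceByKey and treeReduce assume the combining function is associative and commutative, which is what lets partial results be merged safely on the map side or in intermediate rounds.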

For additional troubleshooting steps, see OutOfMemoryError exceptions for Apache Spark in Azure HDInsight.

Next steps