Optimize Apache Spark jobs in HDInsight

This article provides an overview of strategies to optimize Apache Spark jobs on Azure HDInsight.

Overview

The performance of your Apache Spark jobs depends on multiple factors: how your data is stored, how the cluster is configured, and the operations used when processing the data.
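
For example, storage format alone can have a large effect on scan performance. The following minimal sketch (the paths and input dataset are hypothetical) converts a CSV dataset to Parquet, a columnar format that Spark can scan much more efficiently than raw text:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("StorageFormatExample").getOrCreate()

# Hypothetical paths: convert a CSV dataset to Parquet so later queries
# read only the columns they need instead of parsing every row of text.
df = spark.read.csv("wasbs:///example/data/events.csv",
                    header=True, inferSchema=True)
df.write.mode("overwrite").parquet("wasbs:///example/data/events_parquet")
```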

Common challenges you might face include memory constraints due to improperly sized executors, long-running operations, and tasks that result in Cartesian operations.
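
Executor sizing in particular is controlled through a handful of Spark configuration properties. The sketch below shows one way to set them when building a session; the specific values are illustrative assumptions, not recommendations, and should be derived from the memory and cores available on your worker nodes:

```python
from pyspark.sql import SparkSession

# A minimal sketch of right-sizing executors. All values here are
# illustrative assumptions -- tune them to your cluster's node size.
spark = (
    SparkSession.builder
    .appName("ExecutorSizingExample")
    .config("spark.executor.memory", "4g")            # heap per executor
    .config("spark.executor.memoryOverhead", "512m")  # off-heap overhead reserved by YARN
    .config("spark.executor.cores", "3")              # concurrent tasks per executor
    .config("spark.executor.instances", "10")         # total executors requested
    .getOrCreate()
)
```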

There are also many optimizations that can help you overcome these challenges, such as caching and accounting for data skew.
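
As a concrete illustration, the sketch below shows both techniques: caching a DataFrame that several actions reuse, and "salting" a skewed join key so that a hot key is spread across partitions. The DataFrames and the salt count are hypothetical, and salting is one common skew mitigation among several, not the only approach:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("CacheAndSkewExample").getOrCreate()

# Hypothetical input: a large fact table whose "key" column is heavily skewed.
events = spark.range(1_000_000).withColumn("key", (F.col("id") % 10).cast("string"))

# Caching: persist a DataFrame reused by multiple actions so it is
# computed once instead of re-evaluated for every action.
events.cache()
events.count()                              # first action materializes the cache
events.filter(F.col("key") == "1").count()  # subsequent actions read from the cache

# Salting: spread a hot join key across partitions by appending a random
# salt on the large side and replicating the small side once per salt value.
SALT_BUCKETS = 8
salted = events.withColumn("salt", (F.rand() * SALT_BUCKETS).cast("int"))

dim = spark.createDataFrame([("1", "hot"), ("2", "warm")], ["key", "label"])
salts = spark.range(SALT_BUCKETS).withColumnRenamed("id", "salt")
dim_replicated = dim.crossJoin(salts)

# Joining on (key, salt) distributes each skewed key over SALT_BUCKETS tasks.
joined = salted.join(dim_replicated, on=["key", "salt"])
joined.count()
```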

Each of the following articles covers a different aspect of Spark optimization.

Next steps