Configure Apache Spark settings

An HDInsight Spark cluster includes an installation of the Apache Spark library. Each HDInsight cluster includes default configuration parameters for all its installed services, including Spark. A key aspect of managing an HDInsight Apache Hadoop cluster is monitoring workload, including Spark jobs. To best run Spark jobs, consider the physical cluster configuration when determining the cluster's logical configuration.

The default HDInsight Apache Spark cluster includes the following nodes: three Apache ZooKeeper nodes, two head nodes, and one or more worker nodes:

Diagram: Spark HDInsight architecture

The number of VMs, and the VM sizes, for the nodes in your HDInsight cluster can affect your Spark configuration. Non-default HDInsight configuration values often require non-default Spark configuration values. When you create an HDInsight Spark cluster, you're shown suggested VM sizes for each of the components. Currently, the memory-optimized Linux VM sizes for Azure are D12 v2 or greater.

Apache Spark versions

Use the best Spark version for your cluster. The HDInsight service includes several versions of both Spark and HDInsight itself. Each version of Spark includes a set of default cluster settings.

When you create a new cluster, there are multiple Spark versions to choose from. To see the full list, see HDInsight Components and Versions.

Note

The default version of Apache Spark in the HDInsight service may change without notice. If you have a version dependency, Microsoft recommends that you specify that particular version when you create clusters using the .NET SDK, Azure PowerShell, or the Azure Classic CLI.

Apache Spark has three system configuration locations:

  • Spark properties control most application parameters and can be set by using a SparkConf object, or through Java system properties (a minimal sketch follows this list).
  • Environment variables can be used to set per-machine settings, such as the IP address, through the conf/spark-env.sh script on each node.
  • Logging can be configured through log4j.properties.
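
As an illustration of the first option, the following is a minimal PySpark sketch of setting Spark properties through a SparkConf object. The property values and application name are placeholders for illustration, not tuning recommendations.

    # Minimal sketch: set Spark properties with a SparkConf object (PySpark).
    # The values below are placeholders, not tuning recommendations.
    from pyspark import SparkConf
    from pyspark.sql import SparkSession

    conf = (
        SparkConf()
        .set("spark.executor.memory", "3g")
        .set("spark.executor.cores", "4")
        .set("spark.sql.files.maxPartitionBytes", "1073741824")
    )

    spark = (
        SparkSession.builder
        .appName("spark-config-demo")   # placeholder application name
        .config(conf=conf)
        .getOrCreate()
    )

    # Properties set this way are visible on the application's SparkConf.
    print(spark.sparkContext.getConf().get("spark.executor.memory"))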

When you select a particular version of Spark, your cluster includes the default configuration settings. You can change the default Spark configuration values by using a custom Spark configuration file. An example is shown below.

    spark.hadoop.io.compression.codecs org.apache.hadoop.io.compress.GzipCodec
    spark.hadoop.mapreduce.input.fileinputformat.split.minsize 1099511627776
    spark.hadoop.parquet.block.size 1099511627776
    spark.sql.files.maxPartitionBytes 1099511627776
    spark.sql.files.openCostInBytes 1099511627776

The example shown above overrides the default values for five Spark configuration parameters: the compression codec, the Apache Hadoop MapReduce split minimum size, the Parquet block size, and the Spark SQL partition and open file size defaults. These configuration changes are chosen because the associated data and jobs (in this example, genomic data) have particular characteristics, which perform better with these custom configuration settings.
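If you want to confirm that a custom configuration file took effect, one option is to read the values back from a running PySpark session. The sketch below assumes the overrides above were applied through spark-defaults.

    # Sketch: read back two of the overridden values from a live session.
    # Assumes the custom settings above were applied cluster-wide.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("verify-config").getOrCreate()
    print(spark.conf.get("spark.sql.files.maxPartitionBytes"))
    print(spark.sparkContext.getConf().get(
        "spark.hadoop.parquet.block.size", "not set"))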


View cluster configuration settings

Verify the current HDInsight cluster configuration settings before you do performance optimization on the cluster. Launch the HDInsight Dashboard from the Azure portal by clicking the Dashboard link on the Spark cluster pane. Sign in with the cluster administrator's username and password.

The Apache Ambari Web UI appears, with a dashboard of key cluster resource usage metrics. The Ambari Dashboard shows you the Apache Spark configuration and other installed services. The Dashboard includes a Config History tab, where you can view information for installed services, including Spark.

To see configuration values for Apache Spark, select Config History, then select Spark2. Select the Configs tab, then select the Spark (or Spark2, depending on your version) link in the service list. You see a list of configuration values for your cluster:

Screenshot: Spark configurations

To see and change individual Spark configuration values, select any link with "spark" in the title. Configurations for Spark include both custom and advanced configuration values in these categories:

  • Custom Spark2-defaults
  • Custom Spark2-metrics-properties
  • Advanced Spark2-defaults
  • Advanced Spark2-env
  • Advanced spark2-hive-site-override

If you create a non-default set of configuration values, your update history is visible. This configuration history can be helpful to see which non-default configuration has optimal performance.

Note

To see, but not change, common Spark cluster configuration settings, select the Environment tab on the top-level Spark Job UI interface.

Configuring Spark executors

The following diagram shows key Spark objects: the driver program and its associated Spark Context, and the cluster manager and its n worker nodes. Each worker node includes an Executor, a cache, and n task instances.

Diagram: cluster objects

Spark jobs use worker resources, particularly memory, so it's common to adjust Spark configuration values for worker node Executors.

Three key parameters that are often adjusted to tune Spark configurations to meet application requirements are spark.executor.instances, spark.executor.cores, and spark.executor.memory. An Executor is a process launched for a Spark application. An Executor runs on a worker node and is responsible for the application's tasks. The number of worker nodes and the worker node size determine the number of executors and the executor sizes. These values are stored in spark-defaults.conf on the cluster head nodes. You can edit these values in a running cluster by selecting Custom spark-defaults in the Ambari web UI. After you make changes, the UI prompts you to Restart all the affected services.

Note

These three configuration parameters can be configured at the cluster level (for all applications that run on the cluster) and also specified for each individual application.
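
As a hedged example of the per-application route, the sketch below sets the three parameters on a SparkSession builder before the session starts, which overrides the cluster-level spark-defaults for that application only. The numbers are illustrative and should come from your own benchmark testing.

    # Sketch: per-application executor settings (override cluster defaults).
    # Values are illustrative only; size them for your workload and node VMs.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("executor-tuning-demo")          # placeholder name
        .config("spark.executor.instances", "10")
        .config("spark.executor.cores", "4")
        .config("spark.executor.memory", "3g")
        .getOrCreate()
    )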

Another source of information about the resources used by Spark Executors is the Spark Application UI. In the UI, the Executors page displays Summary and Detail views of the configuration and consumed resources. Determine whether to change executor values for the entire cluster, or for a particular set of job executions.

Screenshot: Spark Executors

Or you can use the Ambari REST API to programmatically verify HDInsight and Spark cluster configuration settings. More information is available at the Apache Ambari API reference on GitHub.
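
For instance, a hedged sketch of that approach with Python's requests library might look like the following. The cluster name, credentials, and configuration tag are placeholders, and the endpoint shape follows the Ambari configurations API described in that reference.

    # Sketch: read Spark2 configuration through the Ambari REST API.
    # CLUSTERNAME, the password, and the tag value are placeholders.
    import requests

    ambari = "https://CLUSTERNAME.azurehdinsight.net"
    auth = ("admin", "PASSWORD")  # cluster administrator credentials

    # List configuration types and tags registered for the cluster.
    types = requests.get(
        f"{ambari}/api/v1/clusters/CLUSTERNAME/configurations", auth=auth)
    types.raise_for_status()

    # Fetch one configuration blob, such as spark2-defaults, by type and tag.
    spark_defaults = requests.get(
        f"{ambari}/api/v1/clusters/CLUSTERNAME/configurations",
        params={"type": "spark2-defaults", "tag": "TAG"},
        auth=auth)
    print(spark_defaults.json())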

Depending on your Spark workload, you may determine that a non-default Spark configuration provides more optimized Spark job executions. Do benchmark testing with sample workloads to validate any non-default cluster configurations. Some of the common parameters that you may consider adjusting are:

  • --num-executors: Sets the number of executors.
  • --executor-cores: Sets the number of cores for each executor. We recommend using middle-sized executors, as other processes also consume some portion of the available memory.
  • --executor-memory: Controls the memory size (heap size) of each executor on Apache Hadoop YARN; you'll need to leave some memory for execution overhead.

Here is an example of two worker nodes with different configuration values:

Diagram: two-node configuration

The following list shows key Spark executor memory parameters; a small arithmetic sketch follows the list.

  • spark.executor.memory: Defines the total amount of memory available for an executor.
  • spark.storage.memoryFraction: (default ~60%) Defines the amount of memory available for storing persisted RDDs.
  • spark.shuffle.memoryFraction: (default ~20%) Defines the amount of memory reserved for shuffle.
  • spark.storage.unrollFraction and spark.storage.safetyFraction: (totaling ~30% of total memory) Used internally by Spark; these values shouldn't be changed.
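
To make the fractions concrete, here is a back-of-the-envelope sketch of how a hypothetical executor heap would be divided under the fraction settings in this list; the 4 GB figure is an assumption chosen only for illustration.

    # Sketch: rough split of a hypothetical 4 GB executor heap using the
    # fraction settings listed above. Figures are illustrative only.
    executor_memory_mb = 4096          # assumed spark.executor.memory
    storage_fraction = 0.60            # spark.storage.memoryFraction
    shuffle_fraction = 0.20            # spark.shuffle.memoryFraction

    storage_mb = executor_memory_mb * storage_fraction
    shuffle_mb = executor_memory_mb * shuffle_fraction
    other_mb = executor_memory_mb - storage_mb - shuffle_mb
    print(f"storage ~{storage_mb:.0f} MB, shuffle ~{shuffle_mb:.0f} MB, "
          f"other ~{other_mb:.0f} MB")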

YARN controls the maximum sum of memory used by the containers on each Spark node. The following diagram shows the per-node relationships between YARN configuration objects and Spark objects.

Diagram: YARN Spark memory management

Change parameters for an application running in a Jupyter Notebook

Spark clusters in HDInsight include a number of components by default. Each of these components includes default configuration values, which can be overridden as needed.

  • Spark Core: Spark Core, Spark SQL, Spark streaming APIs, GraphX, and Apache Spark MLlib.
  • Anaconda: A Python package manager.
  • Apache Livy: The Apache Spark REST API, used to submit remote jobs to an HDInsight Spark cluster (a submission sketch follows this list).
  • Jupyter and Apache Zeppelin notebooks: Interactive browser-based UIs for interacting with your Spark cluster.
  • ODBC driver: Connects Spark clusters in HDInsight to business intelligence (BI) tools such as Microsoft Power BI and Tableau.
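
Because Apache Livy accepts executor settings in its job-submission payload, a hedged Python sketch of a remote batch submission could look like the following. The cluster name, credentials, application JAR path, and class name are placeholders.

    # Sketch: submit a remote batch job through the HDInsight Livy endpoint
    # with per-job executor settings. All names and paths are placeholders.
    import requests

    livy = "https://CLUSTERNAME.azurehdinsight.net/livy/batches"
    auth = ("admin", "PASSWORD")  # cluster administrator credentials

    payload = {
        "file": "wasbs:///example/jars/my-spark-app.jar",  # placeholder JAR
        "className": "com.example.MySparkApp",             # placeholder class
        "numExecutors": 4,
        "executorCores": 2,
        "executorMemory": "3G",
    }

    response = requests.post(livy, json=payload,
                             headers={"X-Requested-By": "admin"}, auth=auth)
    print(response.status_code, response.json())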

For applications running in a Jupyter Notebook, use the %%configure command to make configuration changes from within the notebook itself. These configuration changes are applied to the Spark jobs run from your notebook instance. Make such changes at the beginning of the application, before you run your first code cell. The changed configuration is applied to the Livy session when it gets created.

Note

To change the configuration at a later stage in the application, use the -f (force) parameter. However, all progress in the application will be lost.

The code below shows how to change the configuration for an application running in a Jupyter Notebook.

    %%configure
    {"executorMemory": "3072M", "executorCores": 4, "numExecutors":10}
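
If you also need to override individual Spark properties from the notebook, the %%configure body can carry a conf map that is passed to the Livy session. The property and value below are placeholders for illustration, and -f is only needed if a session has already been started.

    %%configure -f
    {"executorMemory": "3072M", "executorCores": 4, "numExecutors": 10,
     "conf": {"spark.sql.files.maxPartitionBytes": "1073741824"}}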

Conclusion

Monitor core configuration settings to ensure your Spark jobs run in a predictable and performant way. These settings help determine the best Spark cluster configuration for your particular workloads. You'll also need to monitor the execution of long-running and/or resource-consuming Spark jobs. The most common challenges center around memory pressure from improper configurations, such as incorrectly sized executors. Other challenges include long-running operations and tasks that result in Cartesian operations.

Next steps