Use Apache Ambari to optimize HDInsight cluster configurations

HDInsight provides Apache Hadoop clusters for large-scale data processing applications. Managing, monitoring, and optimizing these complex multi-node clusters can be challenging. Apache Ambari is a web interface for managing and monitoring HDInsight Linux clusters. For Windows clusters, use the Ambari REST API.

For an introduction to using the Ambari Web UI, see Manage HDInsight clusters by using the Apache Ambari Web UI.

Log in to Ambari at https://CLUSTERNAME.azurehdinsight.cn with your cluster credentials. The initial screen displays an overview dashboard.

Ambari dashboard
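
The cluster can also be inspected programmatically through the Ambari REST API. The following is a minimal sketch, assuming the Python requests package, a hypothetical cluster named mycluster, and the cluster login credentials; it lists the configuration types the cluster currently desires and their version tags.

    # Minimal sketch: read the cluster's desired configuration versions
    # through the Ambari REST API. The cluster name and password are
    # placeholders; substitute your own.
    import requests

    CLUSTER = "mycluster"
    BASE = f"https://{CLUSTER}.azurehdinsight.cn/api/v1/clusters/{CLUSTER}"

    resp = requests.get(
        BASE,
        params={"fields": "Clusters/desired_configs"},
        auth=("admin", "PASSWORD"),  # Ambari (cluster login) credentials
    )
    resp.raise_for_status()

    for conf_type, info in resp.json()["Clusters"]["desired_configs"].items():
        print(conf_type, "->", info["tag"])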

The Ambari web UI can be used to manage hosts, services, alerts, configurations, and views. Ambari can't be used to create an HDInsight cluster, upgrade services, manage stacks and versions, decommission or recommission hosts, or add services to the cluster.

Manage your cluster's configuration

Configuration settings help tune a particular service. To modify a service's configuration settings, select the service from the Services sidebar (on the left), and then navigate to the Configs tab on the service detail page.

Services sidebar

Modify NameNode Java heap size

The NameNode Java heap size depends on many factors, such as the load on the cluster and the numbers of files and blocks. The default size of 1 GB works well with most clusters, although some workloads can require more or less memory.

To modify the NameNode Java heap size:

  1. Select HDFS from the Services sidebar and navigate to the Configs tab.

    HDFS Configs

  2. Find the NameNode Java heap size setting. You can also use the filter text box to type and find a particular setting. Select the pen icon beside the setting name.

    NameNode Java heap size

  3. Type the new value in the text box, and then press Enter to save the change.

    Editing the NameNode Java heap size

  4. The NameNode Java heap size is changed from 2 GB to 1 GB.

    Edited NameNode Java heap size

  5. Save your changes by clicking the green Save button at the top of the configuration screen.

    Save changes

Apache Hive optimization

The following sections describe configuration options for optimizing overall Apache Hive performance.

  1. To modify Hive configuration parameters, select Hive from the Services sidebar.
  2. Navigate to the Configs tab.

Set the Hive execution engine

Hive provides two execution engines: Apache Hadoop MapReduce and Apache Tez. Tez is faster than MapReduce. HDInsight Linux clusters use Tez as the default execution engine. To change the execution engine:

  1. In the Hive Configs tab, type execution engine in the filter box.

    Search for execution engine

  2. The Optimization property's default value is Tez.

    Optimization - Tez

Tune mappers

Hadoop tries to split (map) a single file into multiple files and process the resulting files in parallel. The number of mappers depends on the number of splits. The following two configuration parameters drive the number of splits for the Tez execution engine:

  • tez.grouping.min-size: Lower limit on the size of a grouped split, with a default value of 16 MB (16,777,216 bytes).
  • tez.grouping.max-size: Upper limit on the size of a grouped split, with a default value of 1 GB (1,073,741,824 bytes).

As a performance rule of thumb, decrease both of these parameters to improve latency; increase them for more throughput.

For example, to set four mapper tasks for a data size of 128 MB, you would set both parameters to 32 MB each (33,554,432 bytes).
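
The arithmetic behind that example can be sketched as follows. This is illustrative only; real Tez grouping also weighs cluster capacity and file boundaries.

    # Illustrative arithmetic only: with grouped splits bounded to
    # [min_size, max_size], the mapper count falls in this range.
    import math

    def mapper_range(data_bytes, min_size, max_size):
        fewest = math.ceil(data_bytes / max_size)  # largest allowed splits
        most = math.ceil(data_bytes / min_size)    # smallest allowed splits
        return fewest, most

    MB = 1024 * 1024
    # Pinning both bounds to 32 MB forces four mappers for 128 MB of data:
    print(mapper_range(128 * MB, 32 * MB, 32 * MB))  # (4, 4)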

  1. To modify the limit parameters, navigate to the Configs tab of the Tez service. Expand the General panel and locate the tez.grouping.max-size and tez.grouping.min-size parameters.

  2. Set both parameters to 33,554,432 bytes (32 MB).

    Tez grouping sizes

These changes affect all Tez jobs across the server. To get an optimal result, choose appropriate parameter values.

Tune reducers

Apache ORCSnappy 都提供高性能。Apache ORC and Snappy both offer high performance. 但是,Hive 默认提供的化简器可能太少,从而导致瓶颈。However, Hive may have too few reducers by default, causing bottlenecks.

For example, say you have an input data size of 50 GB. That data in ORC format with Snappy compression is 1 GB. Hive estimates the number of reducers needed as: (number of bytes input to mappers / hive.exec.reducers.bytes.per.reducer).

With the default settings, this example has four reducers.

The hive.exec.reducers.bytes.per.reducer parameter specifies the number of bytes processed per reducer. The default value is 64 MB. Tuning this value down increases parallelism and may improve performance. Tuning it too low could also produce too many reducers, potentially hurting performance. Choose this parameter based on your particular data requirements, compression settings, and other environmental factors.
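
A simplified sketch of this estimate, including the hive.exec.reducers.max cap described in step 3 below:

    # Sketch of Hive's reducer estimate (simplified):
    # reducers = clamp(ceil(input_bytes / bytes_per_reducer), 1, reducers_max)
    import math

    def estimated_reducers(input_bytes, bytes_per_reducer, reducers_max=1009):
        return max(1, min(reducers_max, math.ceil(input_bytes / bytes_per_reducer)))

    MB = 1024 * 1024
    # 1,024 MB of mapper input at 128 MB per reducer -> 8 reducers
    print(estimated_reducers(1024 * MB, 128 * MB))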

  1. To modify the parameter, navigate to the Hive Configs tab and find the Data per Reducer parameter on the Settings page.

    Data per Reducer

  2. Select Edit to modify the value to 128 MB (134,217,728 bytes), and then press Enter to save.

    Data per Reducer - edited

    Given an input size of 1,024 MB, with 128 MB of data per reducer, there are 8 reducers (1024/128).

  3. An incorrect value for the Data per Reducer parameter may result in a large number of reducers, adversely affecting query performance. To limit the maximum number of reducers, set hive.exec.reducers.max to an appropriate value. The default value is 1009.

Enable parallel execution

A Hive query is executed in one or more stages. If the independent stages can be run in parallel, that will increase query performance.

  1. To enable parallel query execution, navigate to the Hive Configs tab and search for the hive.exec.parallel property. The default value is false. Change the value to true, and then press Enter to save the value.

  2. To limit the number of jobs to run in parallel, modify the hive.exec.parallel.thread.number property. The default value is 8.

    Hive parallel execution

Enable vectorization

Hive processes data row by row. Vectorization directs Hive to process data in blocks of 1,024 rows rather than one row at a time. Vectorization is only applicable to the ORC file format.

  1. To enable vectorized query execution, navigate to the Hive Configs tab and search for the hive.vectorized.execution.enabled parameter. The default value is true for Hive 0.13.0 or later.

  2. To enable vectorized execution for the reduce side of the query, set the hive.vectorized.execution.reduce.enabled parameter to true. The default value is false.

    Hive vectorized execution

Enable cost-based optimization (CBO)

By default, Hive follows a set of rules to find one optimal query execution plan. Cost-based optimization (CBO) evaluates multiple plans for executing a query, assigns a cost to each plan, and then determines the cheapest plan to execute.

To enable CBO, navigate to the Hive Configs tab, search for the hive.cbo.enable parameter, and then switch the toggle button to On.

CBO configuration

The following additional configuration parameters increase Hive query performance when CBO is enabled:

  • hive.compute.query.using.stats

    When set to true, Hive uses statistics stored in its metastore to answer simple queries like count(*).

    CBO stats

  • hive.stats.fetch.column.stats

    Column statistics are created when CBO is enabled. Hive uses column statistics, which are stored in the metastore, to optimize queries. Fetching column statistics for each column takes longer when the number of columns is high. When set to false, this setting disables fetching column statistics from the metastore.

    Hive stats - set column stats

  • hive.stats.fetch.partition.stats

    Basic partition statistics such as number of rows, data size, and file size are stored in the metastore. When set to true, the partition stats are fetched from the metastore. When false, the file size is fetched from the file system, and the number of rows is fetched from the row schema.

    Hive stats - set partition stats

Enable intermediate compression

Map tasks create intermediate files that are used by the reducer tasks. Intermediate compression shrinks the intermediate file size.

Hadoop jobs are usually I/O bottlenecked. Compressing data can speed up I/O and overall network transfer.

The available compression types are:

Format   Tool    Algorithm   File extension   Splittable?
Gzip     Gzip    DEFLATE     .gz              No
Bzip2    Bzip2   Bzip2       .bz2             Yes
LZO      Lzop    LZO         .lzo             Yes, if indexed
Snappy   N/A     Snappy      Snappy           No

As a general rule, having a splittable compression method is important; otherwise, very few mappers will be created. If the input data is text, bzip2 is the best option. For ORC format, Snappy is the fastest compression option.
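
To see why splittability matters, compare the mapper counts for a splittable and a non-splittable codec. This is illustrative arithmetic only, assuming a 128 MB split size:

    # Illustrative: a non-splittable compressed file is read by a single
    # mapper, while a splittable one can be divided at split boundaries.
    import math

    def mappers_for(file_bytes, split_bytes, splittable):
        return math.ceil(file_bytes / split_bytes) if splittable else 1

    GB, MB = 1024**3, 1024**2
    print(mappers_for(1 * GB, 128 * MB, splittable=True))   # bzip2: 8 mappers
    print(mappers_for(1 * GB, 128 * MB, splittable=False))  # gzip: 1 mapper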

  1. To enable intermediate compression, navigate to the Hive Configs tab, and then set the hive.exec.compress.intermediate parameter to true. The default value is false.

    Hive exec - intermediate compression

    Note

    To compress intermediate files, choose a compression codec with lower CPU cost, even if the codec doesn't have a high compression output.

  2. To set the intermediate compression codec, add the custom property mapred.map.output.compression.codec to the hive-site.xml or mapred-site.xml file.

  3. To add a custom setting:

    a. Navigate to the Hive Configs tab and select the Advanced tab.

    b. Under the Advanced tab, find and expand the Custom hive-site pane.

    c. Click the Add Property link at the bottom of the Custom hive-site pane.

    d. In the Add Property window, enter mapred.map.output.compression.codec as the key and org.apache.hadoop.io.compress.SnappyCodec as the value.

    e. Click Add.

    Hive custom property

    This will compress the intermediate files using Snappy compression. Once the property is added, it appears in the Custom hive-site pane. A scripted way to make the same change through the Ambari REST API is sketched after the note below.

    Note

    This procedure modifies the $HADOOP_HOME/conf/hive-site.xml file.
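
As referenced above, the same property can also be set without the UI by posting a new configuration version through the Ambari REST API. The following is a hedged sketch rather than a production script: the cluster name and credentials are placeholders, and because desired_config replaces every property of the configuration type, the current properties are read and merged first.

    # Sketch: push a new hive-site version through the Ambari REST API.
    # CAUTION (assumption): desired_config replaces ALL properties of the
    # type, so merge the new key into the current properties first.
    import requests, time

    CLUSTER = "mycluster"
    BASE = f"https://{CLUSTER}.azurehdinsight.cn/api/v1/clusters/{CLUSTER}"
    AUTH = ("admin", "PASSWORD")
    HEADERS = {"X-Requested-By": "ambari"}  # required by Ambari for writes

    # 1. Read the currently desired hive-site tag and its properties.
    tag = requests.get(
        BASE, params={"fields": "Clusters/desired_configs"}, auth=AUTH
    ).json()["Clusters"]["desired_configs"]["hive-site"]["tag"]
    current = requests.get(
        f"{BASE}/configurations",
        params={"type": "hive-site", "tag": tag},
        auth=AUTH,
    ).json()["items"][0]["properties"]

    # 2. Merge in the new property and post it back as a new version.
    current["mapred.map.output.compression.codec"] = \
        "org.apache.hadoop.io.compress.SnappyCodec"
    body = {"Clusters": {"desired_config": {
        "type": "hive-site",
        "tag": f"version{int(time.time())}",  # any unused tag name works
        "properties": current,
    }}}
    requests.put(BASE, json=body, auth=AUTH, headers=HEADERS).raise_for_status()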

Compress final output

The final Hive output can also be compressed.

  1. To compress the final Hive output, navigate to the Hive Configs tab, and then set the hive.exec.compress.output parameter to true. The default value is false.

  2. To choose the output compression codec, add the mapred.output.compression.codec custom property to the Custom hive-site pane, as described in step 3 of the previous section.

    Hive custom property

Enable speculative execution

Speculative execution launches a certain number of duplicate tasks in order to detect and blacklist slow-running task trackers, while improving the overall job execution by optimizing individual task results.

Speculative execution shouldn't be turned on for long-running MapReduce tasks with large amounts of input.

  1. To enable speculative execution, navigate to the Hive Configs tab, and then set the hive.mapred.reduce.tasks.speculative.execution parameter to true. The default value is false.

    Hive mapred reduce tasks speculative execution

Tune dynamic partitions

Hive allows for creating dynamic partitions when inserting records into a table, without predefining each and every partition. This is a powerful feature, although it may result in the creation of a large number of partitions, and a large number of files for each partition.

  1. For Hive to do dynamic partitions, the hive.exec.dynamic.partition parameter value should be true (the default).

  2. Change the dynamic partition mode to strict. In strict mode, at least one partition has to be static. This prevents queries without a partition filter in the WHERE clause; that is, strict mode prevents queries that scan all partitions. Navigate to the Hive Configs tab, and then set hive.exec.dynamic.partition.mode to strict. The default value is nonstrict.

  3. To limit the number of dynamic partitions to be created, modify the hive.exec.max.dynamic.partitions parameter. The default value is 5,000.

  4. To limit the total number of dynamic partitions per node, modify hive.exec.max.dynamic.partitions.pernode. The default value is 2,000. A quick sketch of how these two caps interact follows this list.
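
As referenced above, the following illustrative sketch shows how the two caps interact. The partition counts are hypothetical, and it assumes partitions spread evenly across nodes:

    # Illustrative check of the dynamic-partition caps described above.
    def insert_allowed(partitions_total, nodes,
                       max_total=5000, max_per_node=2000):
        per_node = partitions_total / nodes  # assumes an even spread
        return partitions_total <= max_total and per_node <= max_per_node

    print(insert_allowed(4800, nodes=4))  # True: 4800 total, 1200 per node
    print(insert_allowed(4800, nodes=2))  # False: 2400 per node exceeds 2000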

Enable local mode

Local mode enables Hive to perform all tasks of a job on a single machine, or sometimes in a single process. This improves query performance if the input data is small and the overhead of launching tasks for queries consumes a significant percentage of the overall query execution.

To enable local mode, add the hive.exec.mode.local.auto parameter to the Custom hive-site pane, as explained in step 3 of the Enable intermediate compression section.

Hive exec mode - local auto

Set single MapReduce MultiGROUP BY

When this property is set to true, a MultiGROUP BY query with common group-by keys generates a single MapReduce job.

To enable this behavior, add the hive.multigroupby.singlereducer parameter to the Custom hive-site pane, as explained in step 3 of the Enable intermediate compression section.

Hive set single MapReduce MultiGROUP BY

Additional Hive optimizations

The following sections describe additional Hive-related optimizations you can set.

Join optimizations

The default join type in Hive is a shuffle join. In Hive, special mappers read the input and emit a join key/value pair to an intermediate file. Hadoop sorts and merges these pairs in a shuffle stage. This shuffle stage is expensive. Selecting the right join for your data can significantly improve performance.

Shuffle Join
  • When: the default choice; always works.
  • How: reads from part of one of the tables, buckets and sorts on the join key, sends one bucket to each reducer, and performs the join on the reduce side.
  • Hive settings: no significant Hive setting needed.
  • Comments: works every time.

Map Join
  • When: one table can fit in memory.
  • How: reads the small table into an in-memory hash table, streams through part of the large file, and joins each record from the hash table; the join is performed by the mapper alone.
  • Hive settings: hive.auto.convert.join=true
  • Comments: very fast, but limited.

Sort Merge Bucket
  • When: both tables are sorted the same, bucketed the same, and joined on the sorted/bucketed column.
  • How: each process reads a bucket from each table and processes the row with the lowest value.
  • Hive settings: hive.auto.convert.sortmerge.join=true
  • Comments: very efficient.

Execution engine optimizations

Additional recommendations for optimizing the Hive execution engine:

Setting                                            Recommended                             HDInsight default
hive.mapjoin.hybridgrace.hashtable                 True = safer, slower; false = faster    false
tez.am.resource.memory.mb                          4 GB upper bound for most               Auto-tuned
tez.session.am.dag.submit.timeout.secs             300+                                    300
tez.am.container.idle.release-timeout-min.millis   20000+                                  10000
tez.am.container.idle.release-timeout-max.millis   40000+                                  20000

Apache Pig optimization

Apache Pig properties can be modified from the Ambari web UI to tune Pig queries. Modifying Pig properties from Ambari directly modifies the Pig properties in the /etc/pig/2.4.2.0-258.0/pig.properties file.

  1. To modify Pig properties, navigate to the Pig Configs tab, and then expand the Advanced pig-properties pane.

  2. Find, uncomment, and change the value of the property you wish to modify.

  3. Select Save on the top right side of the window to save the new value. Some properties may require a service restart.

    Advanced pig-properties

Note

Any session-level settings override property values in the pig.properties file.

Tune execution engine

Two execution engines are available to execute Pig scripts: MapReduce and Tez. Tez is an optimized engine and is much faster than MapReduce.

  1. To modify the execution engine, in the Advanced pig-properties pane, find the exectype property.

  2. The default value is MapReduce. Change it to Tez.

Enable local mode

Similar to Hive, local mode is used to speed up jobs that process relatively small amounts of data.

  1. To enable local mode, set pig.auto.local.enabled to true. The default value is false.

  2. Jobs with an input data size less than the pig.auto.local.input.maxbytes property value are considered to be small jobs. The default value is 1 GB.

Copy user jar cache

Pig copies the JAR files required by UDFs to a distributed cache to make them available for task nodes. These jars don't change frequently. When enabled, the pig.user.cache.enabled setting allows jars to be placed in a cache and reused for jobs run by the same user. This results in a minor increase in job performance.

  1. To enable, set pig.user.cache.enabled to true. The default is false.

  2. To set the base path of the cached jars, set pig.user.cache.location to the base path. The default is /tmp.

Optimize performance with memory settings

The following memory settings can help optimize Pig script performance.

  • pig.cachedbag.memusage: The amount of memory allocated to a bag. A bag is a collection of tuples; a tuple is an ordered set of fields, and a field is a piece of data. If the data in a bag exceeds the allocated memory, it spills to disk. The default value is 0.2, which represents 20 percent of available memory. This memory is shared across all bags in an application.

  • pig.spill.size.threshold: Bags larger than this spill size threshold (in bytes) are spilled to disk. The default value is 5 MB.

Compress temporary files

Pig generates temporary files during job execution. Compressing the temporary files results in a performance increase when reading or writing files to disk. The following settings can be used to compress temporary files.

  • pig.tmpfilecompression: When true, enables temporary file compression. The default value is false.

  • pig.tmpfilecompression.codec: The compression codec to use for compressing the temporary files. The recommended compression codecs are LZO and Snappy for lower CPU utilization.

Enable split combining

When split combining is enabled, small files are combined so that fewer map tasks are created, which improves the efficiency of jobs with many small files. Split combining is controlled by the pig.noSplitCombination property; its default value is false, which leaves combining enabled, while setting it to true disables combining.

Tune mappers

The number of mappers is controlled by the pig.maxCombinedSplitSize property, which specifies the size of the data to be processed by a single map task. The default value is the file system's default block size. Increasing this value reduces the number of mapper tasks.
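
For illustration, a sketch of the relationship, assuming a 128 MB block size (typical, but cluster-dependent):

    # Illustrative only: larger combined splits -> fewer map tasks.
    import math

    def pig_mappers(input_bytes, max_combined_split_bytes):
        return math.ceil(input_bytes / max_combined_split_bytes)

    GB, MB = 1024**3, 1024**2
    print(pig_mappers(10 * GB, 128 * MB))  # 80 map tasks at a 128 MB split size
    print(pig_mappers(10 * GB, 256 * MB))  # 40 map tasks after doubling the size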

Tune reducers

The number of reducers is calculated from the pig.exec.reducers.bytes.per.reducer parameter, which specifies the number of bytes processed per reducer (1 GB by default). To limit the maximum number of reducers, set the pig.exec.reducers.max property (999 by default).
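
A simplified sketch of the estimate, mirroring the Hive calculation earlier:

    # Sketch of Pig's reducer estimate (simplified).
    import math

    def pig_reducers(input_bytes, bytes_per_reducer=1024**3, reducers_max=999):
        return max(1, min(reducers_max, math.ceil(input_bytes / bytes_per_reducer)))

    GB = 1024**3
    print(pig_reducers(50 * GB))  # 50 reducers at the 1 GB default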

Apache HBase optimization with the Ambari web UI

Apache HBase configuration is modified from the HBase Configs tab. The following sections describe some of the important configuration settings that affect HBase performance.

Set HBASE_HEAPSIZE

The HBase heap size specifies, in megabytes, the maximum amount of heap to be used by the region and master servers. The default value is 1,000 MB. This value should be tuned for the cluster workload.

  1. To modify, navigate to the Advanced HBase-env pane in the HBase Configs tab, and then find the HBASE_HEAPSIZE setting.

  2. Change the default value to 5,000 MB.

    HBASE_HEAPSIZE

Optimize read-heavy workloads

The following configurations are important to improve the performance of read-heavy workloads.

Block cache size

The block cache is the read cache. Its size is controlled by the hfile.block.cache.size parameter. The default value is 0.4, which is 40 percent of the total region server memory. The larger the block cache size, the faster random reads will be.

  1. To modify this parameter, navigate to the Settings tab in the HBase Configs tab, and then locate % of RegionServer Allocated to Read Buffers.

    HBase block cache size

  2. To change the value, select the Edit icon.

Memstore size

All edits are stored in a memory buffer called a Memstore. Buffering increases the total amount of data that can be written to disk in a single operation, and it speeds up subsequent access to the recent edits. The Memstore size is defined by the following two parameters:

  • hbase.regionserver.global.memstore.UpperLimit: Defines the maximum percentage of the region server memory that the Memstores combined can use.

  • hbase.regionserver.global.memstore.LowerLimit: Defines the minimum percentage of the region server memory that the Memstores combined can use.

To optimize for random reads, you can reduce the Memstore upper and lower limits.
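
As a rough sanity check on these read-side and write-side budgets, the following sketch splits a region server heap between the block cache, the Memstore upper limit, and everything else. The 0.8 combined ceiling is an assumption, based on the limit many HBase versions enforce between the block cache and the Memstore; the arithmetic is illustrative only.

    # Rough read/write memory budget for a region server (illustrative).
    # Assumption: block cache + Memstore upper limit should stay at or
    # below 0.8 of the heap, leaving 20% for general-purpose use.
    def heap_budget(heap_mb, block_cache_frac=0.4, memstore_upper_frac=0.4):
        assert block_cache_frac + memstore_upper_frac <= 0.8, \
            "read + write fractions leave too little general-purpose heap"
        return {
            "block_cache_mb": heap_mb * block_cache_frac,
            "memstore_mb": heap_mb * memstore_upper_frac,
            "other_mb": heap_mb * (1 - block_cache_frac - memstore_upper_frac),
        }

    print(heap_budget(5000))  # the 5,000 MB HBASE_HEAPSIZE from the earlier example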

Number of rows fetched when scanning from disk

The hbase.client.scanner.caching setting defines the number of rows read from disk when the next method is called on a scanner. The default value is 100. The higher the number, the fewer remote calls are made from the client to the region server, resulting in faster scans. However, this also increases memory pressure on the client. The tradeoff is sketched after the note below.

HBase number of rows fetched

Important

Do not set the value such that the time between invocations of the next method on a scanner is greater than the scanner timeout. The scanner timeout duration is defined by the hbase.regionserver.lease.period property.
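
As referenced above, the tradeoff can be made concrete with a small sketch; the row size is hypothetical and the arithmetic is illustrative only:

    # Illustrative tradeoff for hbase.client.scanner.caching: fewer RPCs
    # versus more client memory held per next() call.
    import math

    def scan_cost(total_rows, caching, avg_row_bytes):
        rpcs = math.ceil(total_rows / caching)          # next() calls to the server
        batch_mb = caching * avg_row_bytes / (1024**2)  # client memory per batch
        return rpcs, batch_mb

    print(scan_cost(1_000_000, caching=100, avg_row_bytes=1024))   # (10000, ~0.1 MB)
    print(scan_cost(1_000_000, caching=5000, avg_row_bytes=1024))  # (200, ~4.9 MB)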

Optimize write-heavy workloads

The following configurations are important to improve the performance of write-heavy workloads.

Maximum region file size

HBase stores data in an internal file format called HFile. The hbase.hregion.max.filesize property defines the size of a single HFile for a region. A region is split into two regions if the sum of all HFiles in the region is greater than this setting.

HBase HRegion max file size

The larger the region file size, the smaller the number of splits. You can increase the file size to find the value that maximizes write performance.

Avoid update blocking

  • The hbase.hregion.memstore.flush.size property defines the size at which the Memstore is flushed to disk. The default size is 128 MB.

  • The HBase region block multiplier is defined by hbase.hregion.memstore.block.multiplier. The default value is 4. The maximum allowed is 8.

  • HBase blocks updates if the Memstore reaches (hbase.hregion.memstore.flush.size * hbase.hregion.memstore.block.multiplier) bytes.

    With the default values of flush size and block multiplier, updates are blocked when the Memstore reaches 128 * 4 = 512 MB in size. To reduce the update blocking count, increase the value of hbase.hregion.memstore.block.multiplier. The arithmetic is sketched after the screenshot below.

HBase region block multiplier
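
A minimal sketch of that arithmetic:

    # The update-blocking threshold described above (illustrative).
    def memstore_block_mb(flush_size_mb=128, multiplier=4):
        return flush_size_mb * multiplier

    print(memstore_block_mb())              # 512 MB with the defaults
    print(memstore_block_mb(multiplier=8))  # 1,024 MB with the maximum multiplier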

Define Memstore size

Memstore size is defined by the hbase.regionserver.global.memstore.UpperLimit and hbase.regionserver.global.memstore.LowerLimit parameters. Setting these values equal to each other reduces pauses during writes (at the cost of more frequent flushing) and results in increased write performance.

Set Memstore local allocation buffer

Memstore local allocation buffer usage is determined by the hbase.hregion.memstore.mslab.enabled property. When enabled (true), this setting prevents heap fragmentation during heavy write operations. The default value is true.

hbase.hregion.memstore.mslab.enabled

Next steps