Optimize Apache Hive with Apache Ambari in Azure HDInsight

Apache Ambari is a web interface to manage and monitor HDInsight clusters. For an introduction to the Ambari Web UI, see Manage HDInsight clusters by using the Apache Ambari Web UI.

The following sections describe configuration options for optimizing overall Apache Hive performance.

  1. To modify Hive configuration parameters, select Hive from the Services sidebar.
  2. Navigate to the Configs tab.

Set the Hive execution engine

Hive provides two execution engines: Apache Hadoop MapReduce and Apache Tez. Tez is faster than MapReduce. HDInsight Linux clusters have Tez as the default execution engine. To change the execution engine (a session-level alternative is sketched after these steps):

  1. In the Hive Configs tab, type execution engine in the filter box.

    Apache Ambari search execution engine

  2. The Optimization property's default value is Tez.

    Optimization - Apache Tez engine
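
If you want to override the engine for a single session instead of changing the cluster-wide default, the same underlying property can be set from a Hive client such as Beeline. A minimal sketch; hive.execution.engine is assumed here to be the standard Hive property behind the Ambari setting:

```sql
-- Override the execution engine for this session only.
SET hive.execution.engine=tez;  -- use 'mr' to fall back to MapReduce

-- Print the current value to confirm the override.
SET hive.execution.engine;
```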

Tune mappers

Hadoop tries to split (map) a single file into multiple files and process the resulting files in parallel. The number of mappers depends on the number of splits. The following two configuration parameters drive the number of splits for the Tez execution engine:

  • tez.grouping.min-size: Lower limit on the size of a grouped split, with a default value of 16 MB (16,777,216 bytes).
  • tez.grouping.max-size: Upper limit on the size of a grouped split, with a default value of 1 GB (1,073,741,824 bytes).

As a performance guideline, lower both of these parameters to improve latency; raise them to get more throughput.

For example, to set four mapper tasks for a data size of 128 MB, you would set both parameters to 32 MB each (33,554,432 bytes).
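
To make that arithmetic concrete, here's a minimal session-level sketch; the parameter names come from this article, and the table name is hypothetical:

```sql
-- Force grouped splits of exactly 32 MB, so 128 MB of input
-- produces 128 / 32 = 4 mapper tasks.
SET tez.grouping.min-size=33554432;  -- 32 MB lower bound
SET tez.grouping.max-size=33554432;  -- 32 MB upper bound

SELECT COUNT(*) FROM example_table;  -- hypothetical table to exercise the setting
```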

  1. To modify the limit parameters, navigate to the Configs tab of the Tez service. Expand the General panel, and locate the tez.grouping.max-size and tez.grouping.min-size parameters.

  2. Set both parameters to 33,554,432 bytes (32 MB).

    Apache Ambari Tez grouping sizes

These changes affect all Tez jobs across the server. To get an optimal result, choose appropriate parameter values.

Tune reducers

Apache ORC and Snappy both offer high performance. However, Hive may have too few reducers by default, causing bottlenecks.

For example, say you have an input data size of 50 GB. That data in ORC format with Snappy compression is 1 GB. Hive estimates the number of reducers needed as: (number of bytes input to mappers / hive.exec.reducers.bytes.per.reducer).

With the default settings, this example yields four reducers (1 GB / 256 MB).

The hive.exec.reducers.bytes.per.reducer parameter specifies the number of bytes processed per reducer. The default value is 256 MB. Tuning this value down increases parallelism and may improve performance. Tuning it too low could also produce too many reducers, potentially adversely affecting performance. This parameter is based on your particular data requirements, compression settings, and other environmental factors.
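
A session-level sketch of the estimate, assuming the 256 MB default described above; the property names come from this article:

```sql
-- reducers = ceil(bytes input to mappers / hive.exec.reducers.bytes.per.reducer)
--   1 GB / 256 MB (default) = 4 reducers
--   1 GB / 128 MB           = 8 reducers
SET hive.exec.reducers.bytes.per.reducer=134217728;  -- 128 MB
SET hive.exec.reducers.max=1009;                     -- cap on reducers (the default)
```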

  1. To modify the parameter, navigate to the Hive Configs tab and find the Data per Reducer parameter on the Settings page.

    Apache Ambari data per reducer

  2. Select Edit to modify the value to 128 MB (134,217,728 bytes), and then press Enter to save.

    Ambari data per reducer - edited

    Given an input size of 1,024 MB, with 128 MB of data per reducer, there are eight reducers (1024/128).

  3. Setting the Data per Reducer parameter too low may result in a large number of reducers, adversely affecting query performance. To limit the maximum number of reducers, set hive.exec.reducers.max to an appropriate value. The default value is 1009.

Enable parallel execution

A Hive query is executed in one or more stages. If the independent stages can be run in parallel, that increases query performance; a session-level sketch follows the steps below.

  1. To enable parallel query execution, navigate to the Hive Configs tab and search for the hive.exec.parallel property. The default value is false. Change the value to true, and then press Enter to save the value.

  2. To limit the number of jobs to run in parallel, modify the hive.exec.parallel.thread.number property. The default value is 8.

    Apache Hive parallel execution displayed
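
The same two properties can be set per session before running a query. A minimal sketch, with property names from this article:

```sql
SET hive.exec.parallel=true;             -- run independent stages in parallel
SET hive.exec.parallel.thread.number=8;  -- limit concurrent jobs (8 is the default)
```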

Enable vectorization

Hive processes data row by row. Vectorization directs Hive to process data in blocks of 1,024 rows rather than one row at a time. Vectorization is only applicable to the ORC file format.
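
Both switches can also be toggled per session; the numbered steps that follow show the Ambari route. A minimal sketch, with property names from this article:

```sql
-- Process ORC data in 1,024-row blocks instead of row by row.
SET hive.vectorized.execution.enabled=true;         -- map side (default: true)
SET hive.vectorized.execution.reduce.enabled=true;  -- reduce side (default: false)
```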

  1. To enable vectorized query execution, navigate to the Hive Configs tab and search for the hive.vectorized.execution.enabled parameter. The default value is true for Hive 0.13.0 or later.

  2. To enable vectorized execution for the reduce side of the query, set the hive.vectorized.execution.reduce.enabled parameter to true. The default value is false.

    Apache Hive vectorized execution

Enable cost-based optimization (CBO)

By default, Hive follows a set of rules to find one optimal query execution plan. Cost-based optimization (CBO) instead evaluates multiple plans for a query, assigns a cost to each plan, and then chooses the cheapest plan to execute.

To enable CBO, navigate to Hive > Configs > Settings and find Enable Cost Based Optimizer, then switch the toggle button to On.

HDInsight cost-based optimizer

The following additional configuration parameters increase Hive query performance when CBO is enabled (a combined session-level sketch follows this list):

  • hive.compute.query.using.stats

    When set to true, Hive uses statistics stored in its metastore to answer simple queries like count(*).

    Apache Hive compute query using stats

  • hive.stats.fetch.column.stats

    Column statistics are created when CBO is enabled. Hive uses the column statistics stored in the metastore to optimize queries. Fetching column statistics for every column takes longer when the number of columns is high. When set to false, this setting disables fetching column statistics from the metastore.

    Apache Hive stats set column stats

  • hive.stats.fetch.partition.stats

    Basic partition statistics such as number of rows, data size, and file size are stored in the metastore. If set to true, the partition stats are fetched from the metastore. When false, the file size is fetched from the file system, and the number of rows is fetched from the row schema.

    Hive stats set partition stats
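
Taken together, a session-level sketch of the CBO settings; hive.cbo.enable is assumed here to be the property behind the Ambari toggle, and the rest are named in the list above:

```sql
SET hive.cbo.enable=true;                   -- assumed property behind the Ambari toggle
SET hive.compute.query.using.stats=true;    -- answer queries like count(*) from metastore stats
SET hive.stats.fetch.column.stats=true;     -- fetch column statistics from the metastore
SET hive.stats.fetch.partition.stats=true;  -- fetch basic partition statistics from the metastore
```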

Enable intermediate compression

Map tasks create intermediate files that are used by the reducer tasks. Intermediate compression shrinks the intermediate file size.

Hadoop jobs are usually I/O bottlenecked. Compressing data can speed up I/O and overall network transfer.

The available compression types are:

| Format | Tool | Algorithm | File extension | Splittable? |
| --- | --- | --- | --- | --- |
| Gzip | Gzip | DEFLATE | .gz | No |
| Bzip2 | Bzip2 | Bzip2 | .bz2 | Yes |
| LZO | Lzop | LZO | .lzo | Yes, if indexed |
| Snappy | N/A | Snappy | Snappy | No |

As a general rule, choose a splittable compression method where possible; otherwise, very few mappers are created. If the input data is text, bzip2 is the best option. For ORC format, Snappy is the fastest compression option.

  1. To enable intermediate compression, navigate to the Hive Configs tab, and then set the hive.exec.compress.intermediate parameter to true. The default value is false.

    Hive exec compress intermediate

    Note

    To compress intermediate files, choose a compression codec with lower CPU cost, even if the codec doesn't have a high compression output.

  2. To set the intermediate compression codec, add the custom property mapred.map.output.compression.codec to the hive-site.xml or mapred-site.xml file.

  3. To add a custom setting (the steps are combined in a session-level sketch after the note below):

    a. Navigate to Hive > Configs > Advanced > Custom hive-site.

    b. Select Add Property... at the bottom of the Custom hive-site pane.

    c. In the Add Property window, enter mapred.map.output.compression.codec as the key and org.apache.hadoop.io.compress.SnappyCodec as the value.

    d. Select Add.

    Apache Hive custom property add

    This setting compresses the intermediate files using Snappy compression. Once the property is added, it appears in the Custom hive-site pane.

    Note

    This procedure modifies the $HADOOP_HOME/conf/hive-site.xml file.
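
For a quick experiment before changing cluster configuration, the equivalent pair of settings can be issued per session. A sketch using the property and codec names from this article:

```sql
SET hive.exec.compress.intermediate=true;  -- compress map-side intermediate files
SET mapred.map.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
```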

Compress final output

The final Hive output can also be compressed.

  1. To compress the final Hive output, navigate to the Hive Configs tab, and then set the hive.exec.compress.output parameter to true. The default value is false.

  2. To choose the output compression codec, add the mapred.output.compression.codec custom property to the Custom hive-site pane, as described in step 3 of the previous section; a session-level sketch follows.

    Apache Hive custom property add 2
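
A matching session-level sketch for the final output, with property names from this article:

```sql
SET hive.exec.compress.output=true;  -- compress the final query output
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
```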

Enable speculative execution

Speculative execution launches a certain number of duplicate tasks to detect and deny-list the slow-running task tracker, while improving the overall job execution by optimizing individual task results.

Speculative execution shouldn't be turned on for long-running MapReduce tasks with large amounts of input.

  • To enable speculative execution, navigate to the Hive Configs tab, and then set the hive.mapred.reduce.tasks.speculative.execution parameter to true. The default value is false. A session-level sketch follows.

    Hive mapred reduce tasks speculative execution
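
A session-level sketch, with the property name from this article; the caveat above about long-running tasks still applies:

```sql
-- Launch duplicate attempts of slow reduce tasks (default: false).
SET hive.mapred.reduce.tasks.speculative.execution=true;
```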

Tune dynamic partitions

Hive allows for creating dynamic partitions when inserting records into a table, without predefining every partition. This ability is a powerful feature, although it may result in the creation of a large number of partitions and a large number of files for each partition. The settings below are combined in a sketch at the end of this section.

  1. For Hive to do dynamic partitions, the hive.exec.dynamic.partition parameter value should be true (the default).

  2. Change the dynamic partition mode to strict. In strict mode, at least one partition has to be static. This setting prevents queries without the partition filter in the WHERE clause, that is, strict prevents queries that scan all partitions. Navigate to the Hive Configs tab, and then set hive.exec.dynamic.partition.mode to strict. The default value is nonstrict.

  3. To limit the number of dynamic partitions to be created, modify the hive.exec.max.dynamic.partitions parameter. The default value is 5000.

  4. To limit the total number of dynamic partitions per node, modify hive.exec.max.dynamic.partitions.pernode. The default value is 2000.
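
Putting the four settings together with a dynamic-partition insert; the property names come from this article, while the table and column names are hypothetical:

```sql
SET hive.exec.dynamic.partition=true;               -- allow dynamic partitions (the default)
SET hive.exec.dynamic.partition.mode=strict;        -- require at least one static partition key
SET hive.exec.max.dynamic.partitions=5000;          -- overall cap (the default)
SET hive.exec.max.dynamic.partitions.pernode=2000;  -- per-node cap (the default)

-- 'year' is static, so this statement is allowed in strict mode;
-- 'month' partitions are created dynamically from the data.
INSERT OVERWRITE TABLE sales PARTITION (year = 2020, month)
SELECT id, amount, month FROM staging_sales;  -- hypothetical tables
```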

Enable local mode

Local mode enables Hive to do all tasks of a job on a single machine, or sometimes in a single process. This setting improves query performance if the input data is small and the overhead of launching tasks for queries consumes a significant percentage of the overall query execution.

To enable local mode, add the hive.exec.mode.local.auto parameter to the Custom hive-site panel, as explained in step 3 of the Enable intermediate compression section; a session-level sketch follows.

Apache Hive execution mode local auto
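
A minimal session-level sketch, with the property name from this article:

```sql
-- Let Hive decide automatically when a query is small enough to run locally.
SET hive.exec.mode.local.auto=true;
```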

Set single MapReduce MultiGROUP BY

When this property is set to true, a MultiGROUP BY query with common group-by keys generates a single MapReduce job.

To enable this behavior, add the hive.multigroupby.singlereducer parameter to the Custom hive-site pane, as explained in step 3 of the Enable intermediate compression section; a sketch follows.

Set single MapReduce MultiGROUP BY in Hive
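
A sketch of the behavior; the property name comes from this article, and the tables and columns are hypothetical:

```sql
SET hive.multigroupby.singlereducer=true;

-- Both GROUP BYs share the key 'dt', so this multi-insert
-- compiles into a single MapReduce job.
FROM page_views                      -- hypothetical source table
INSERT OVERWRITE TABLE daily_counts  -- hypothetical target tables
  SELECT dt, count(*) GROUP BY dt
INSERT OVERWRITE TABLE daily_users
  SELECT dt, count(DISTINCT user_id) GROUP BY dt;
```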

Additional Hive optimizations

The following sections describe additional Hive-related optimizations you can set.

Join optimizations

The default join type in Hive is a shuffle join. In Hive, special mappers read the input and emit a join key/value pair to an intermediate file. Hadoop sorts and merges these pairs in a shuffle stage. This shuffle stage is expensive. Selecting the right join for your data can significantly improve performance; the settings from the table below are combined in a sketch after it.

| Join type | When | How | Hive settings | Comments |
| --- | --- | --- | --- | --- |
| Shuffle join | Default choice; always works | Reads from part of one of the tables; buckets and sorts on the join key; sends one bucket to each reducer; the join is done on the reduce side | No significant Hive setting needed | Works every time |
| Map join | One table can fit in memory | Reads the small table into an in-memory hash table; streams through part of the large file; joins each record from the hash table; joins are done by the mapper alone | hive.auto.convert.join=true | Fast, but limited |
| Sort merge bucket | Both tables are sorted the same, bucketed the same, and joined on the sorted/bucketed column | Each process reads a bucket from each table and processes the row with the lowest value | hive.auto.convert.sortmerge.join=true | Efficient |
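
A session-level sketch combining the settings from the table; the small-table size threshold is illustrative, not a recommendation from this article:

```sql
SET hive.auto.convert.join=true;                -- use a map join when one side is small
SET hive.mapjoin.smalltable.filesize=25000000;  -- ~25 MB small-table threshold (illustrative)
SET hive.auto.convert.sortmerge.join=true;      -- use sort merge bucket join when tables qualify
```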

Execution engine optimizations

Additional recommendations for optimizing the Hive execution engine (a session-level sketch follows the table):

| Setting | Recommended | HDInsight default |
| --- | --- | --- |
| hive.mapjoin.hybridgrace.hashtable | True = safer, slower; false = faster | false |
| tez.am.resource.memory.mb | 4-GB upper bound for most | Auto-tuned |
| tez.session.am.dag.submit.timeout.secs | 300+ | 300 |
| tez.am.container.idle.release-timeout-min.millis | 20000+ | 10000 |
| tez.am.container.idle.release-timeout-max.millis | 40000+ | 20000 |
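
A session-level sketch of the tunable values above; note that these are Tez application master settings, so they only take effect when a new Tez session starts (an assumption worth validating for your cluster):

```sql
SET hive.mapjoin.hybridgrace.hashtable=false;               -- faster, per the recommendation
SET tez.session.am.dag.submit.timeout.secs=300;
SET tez.am.container.idle.release-timeout-min.millis=20000;
SET tez.am.container.idle.release-timeout-max.millis=40000;
```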

Next steps