Optimize Apache Hive with Apache Ambari in Azure HDInsight
Apache Ambari is a web interface to manage and monitor HDInsight clusters. For an introduction to the Ambari Web UI, see Manage HDInsight clusters by using the Apache Ambari Web UI.

The following sections describe configuration options for optimizing overall Apache Hive performance.

- To modify Hive configuration parameters, select Hive from the Services sidebar.
- Navigate to the Configs tab.
Set the Hive execution engine

Hive provides two execution engines: Apache Hadoop MapReduce and Apache Tez. Tez is faster than MapReduce. HDInsight Linux clusters have Tez as the default execution engine. To change the execution engine:

In the Hive Configs tab, type execution engine in the filter box. The Optimization property's default value is Tez.
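Ambari applies the change cluster-wide. As a quick sketch, the same property can typically also be set per-session from Beeline or the Hive shell, assuming your cluster doesn't restrict it via the SET whitelist:

```sql
-- Switch the current session to the Tez execution engine.
SET hive.execution.engine=tez;
```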
Tune mappers

Hadoop tries to split (map) a single file into multiple files and process the resulting files in parallel. The number of mappers depends on the number of splits. The following two configuration parameters drive the number of splits for the Tez execution engine:
- `tez.grouping.min-size`: Lower limit on the size of a grouped split, with a default value of 16 MB (16,777,216 bytes).
- `tez.grouping.max-size`: Upper limit on the size of a grouped split, with a default value of 1 GB (1,073,741,824 bytes).
As a performance guideline, lower both of these parameters to improve latency; increase them for more throughput.

For example, to set four mapper tasks for a data size of 128 MB, you would set both parameters to 32 MB each (33,554,432 bytes).
To modify the limit parameters, navigate to the Configs tab of the Tez service. Expand the General panel, and locate the `tez.grouping.max-size` and `tez.grouping.min-size` parameters. Set both parameters to 33,554,432 bytes (32 MB).

These changes affect all Tez jobs across the server. To get an optimal result, choose appropriate parameter values.
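Rather than changing the server-wide defaults, the split-size bounds can usually be tried per-session first (assuming the properties are settable in your environment):

```sql
-- Group splits at ~32 MB, e.g. four mappers for a 128 MB input.
SET tez.grouping.min-size=33554432;
SET tez.grouping.max-size=33554432;
```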
Tune reducers

Apache ORC and Snappy both offer high performance. However, Hive may have too few reducers by default, causing bottlenecks.

For example, say you have an input data size of 50 GB. That data in ORC format with Snappy compression is 1 GB. Hive estimates the number of reducers needed as: (number of bytes input to mappers / `hive.exec.reducers.bytes.per.reducer`). With the default settings, this example is four reducers.
The `hive.exec.reducers.bytes.per.reducer` parameter specifies the number of bytes processed per reducer. The default value is 64 MB. Tuning this value down increases parallelism and may improve performance. Tuning it too low could also produce too many reducers, potentially adversely affecting performance. This parameter is based on your particular data requirements, compression settings, and other environmental factors.
To modify the parameter, navigate to the Hive Configs tab and find the Data per Reducer parameter on the Settings page. Select Edit to modify the value to 128 MB (134,217,728 bytes), and then press Enter to save.
Given an input size of 1,024 MB, with 128 MB of data per reducer, there are eight reducers (1024/128).
An incorrect value for the Data per Reducer parameter may result in a large number of reducers, adversely affecting query performance. To limit the maximum number of reducers, set `hive.exec.reducers.max` to an appropriate value. The default value is 1009.
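The same tuning can be sketched per-session, assuming these properties are settable in your session:

```sql
-- 128 MB per reducer: a 1,024 MB input yields 8 reducers (1024/128).
SET hive.exec.reducers.bytes.per.reducer=134217728;
-- Cap the reducer count so a bad estimate can't spawn too many.
SET hive.exec.reducers.max=1009;
```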
Enable parallel execution

A Hive query is executed in one or more stages. If the independent stages can be run in parallel, query performance improves.
To enable parallel query execution, navigate to the Hive Configs tab and search for the `hive.exec.parallel` property. The default value is false. Change the value to true, and then press Enter to save the value.

To limit the number of jobs to run in parallel, modify the `hive.exec.parallel.thread.number` property. The default value is 8.
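As a per-session sketch of the same settings (assuming your session allows them):

```sql
-- Run independent query stages in parallel, at most 8 at a time.
SET hive.exec.parallel=true;
SET hive.exec.parallel.thread.number=8;
```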
Enable vectorization

Hive processes data row by row. Vectorization directs Hive to process data in blocks of 1,024 rows rather than one row at a time. Vectorization is only applicable to the ORC file format.
To enable vectorized query execution, navigate to the Hive Configs tab and search for the `hive.vectorized.execution.enabled` parameter. The default value is true for Hive 0.13.0 or later.

To enable vectorized execution for the reduce side of the query, set the `hive.vectorized.execution.reduce.enabled` parameter to true. The default value is false.
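Both switches can be sketched per-session (remember they take effect only on ORC data):

```sql
-- Process ORC data in 1,024-row blocks, on both map and reduce sides.
SET hive.vectorized.execution.enabled=true;
SET hive.vectorized.execution.reduce.enabled=true;
```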
Enable cost-based optimization (CBO)

By default, Hive follows a set of rules to find one optimal query execution plan. Cost-based optimization (CBO) evaluates multiple plans to execute a query, assigns a cost to each plan, and then determines the cheapest plan to execute a query.
To enable CBO, navigate to Hive > Configs > Settings and find Enable Cost Based Optimizer, then switch the toggle button to On.

The following additional configuration parameters increase Hive query performance when CBO is enabled:
- `hive.compute.query.using.stats`: When set to true, Hive uses statistics stored in its metastore to answer simple queries like `count(*)`.
- `hive.stats.fetch.column.stats`: Column statistics are created when CBO is enabled. Hive uses column statistics, which are stored in the metastore, to optimize queries. Fetching column statistics for each column takes longer when the number of columns is high. When set to false, this setting disables fetching column statistics from the metastore.
- `hive.stats.fetch.partition.stats`: Basic partition statistics such as number of rows, data size, and file size are stored in the metastore. If set to true, the partition stats are fetched from the metastore. When false, the file size is fetched from the file system, and the number of rows is fetched from the row schema.
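As a per-session sketch, CBO and its statistics helpers can be enabled together (`hive.cbo.enable` is the property behind the Ambari toggle):

```sql
-- Turn on the cost-based optimizer and let it use metastore statistics.
SET hive.cbo.enable=true;
SET hive.compute.query.using.stats=true;
SET hive.stats.fetch.column.stats=true;
SET hive.stats.fetch.partition.stats=true;
```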
Enable intermediate compression

Map tasks create intermediate files that are used by the reducer tasks. Intermediate compression shrinks the intermediate file size.

Hadoop jobs are usually I/O bottlenecked. Compressing data can speed up I/O and overall network transfer.

The available compression types are:
| Format | Tool | Algorithm | File Extension | Splittable? |
|---|---|---|---|---|
| Gzip | Gzip | DEFLATE | .gz | No |
| Bzip2 | Bzip2 | Bzip2 | .bz2 | Yes |
| LZO | Lzop | LZO | .lzo | Yes, if indexed |
| Snappy | N/A | Snappy | Snappy | No |
As a general rule, it's important for the compression method to be splittable; otherwise, few mappers will be created. If the input data is text, `bzip2` is the best option. For ORC format, Snappy is the fastest compression option.
To enable intermediate compression, navigate to the Hive Configs tab, and then set the `hive.exec.compress.intermediate` parameter to true. The default value is false.

Note: To compress intermediate files, choose a compression codec with lower CPU cost, even if the codec doesn't have a high compression output.
To set the intermediate compression codec, add the custom property `mapred.map.output.compression.codec` to the `hive-site.xml` or `mapred-site.xml` file. To add a custom setting:
a. Navigate to Hive > Configs > Advanced > Custom hive-site.
b. Select Add Property... at the bottom of the Custom hive-site pane.
c. In the Add Property window, enter `mapred.map.output.compression.codec` as the key and `org.apache.hadoop.io.compress.SnappyCodec` as the value.
d. Select Add.
This setting compresses the intermediate file using Snappy compression. Once the property is added, it appears in the Custom hive-site pane.
Note: This procedure modifies the `$HADOOP_HOME/conf/hive-site.xml` file.
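For a quick per-session test before committing the cluster-wide change, the equivalent settings can usually be applied from Beeline:

```sql
-- Compress intermediate map output with the low-CPU Snappy codec.
SET hive.exec.compress.intermediate=true;
SET mapred.map.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
```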
Compress final output

The final Hive output can also be compressed.
To compress the final Hive output, navigate to the Hive Configs tab, and then set the `hive.exec.compress.output` parameter to true. The default value is false.

To choose the output compression codec, add the `mapred.output.compression.codec` custom property to the Custom hive-site pane, as described in the previous section.
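The per-session equivalent can be sketched as follows (assuming the codec property is settable in your session):

```sql
-- Compress the final query output with Snappy.
SET hive.exec.compress.output=true;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
```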
Enable speculative execution

Speculative execution launches a certain number of duplicate tasks to detect and deny-list the slow-running task tracker, while improving the overall job execution by optimizing individual task results.

Speculative execution shouldn't be turned on for long-running MapReduce tasks with large amounts of input.
To enable speculative execution, navigate to the Hive Configs tab, and then set the `hive.mapred.reduce.tasks.speculative.execution` parameter to true. The default value is false.
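Per-session, this is a single switch:

```sql
-- Launch duplicate attempts for straggling reduce tasks.
SET hive.mapred.reduce.tasks.speculative.execution=true;
```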
Tune dynamic partitions

Hive allows for creating dynamic partitions when inserting records into a table, without predefining every partition. This ability is a powerful feature, although it may result in the creation of a large number of partitions, and a large number of files for each partition.
- For Hive to do dynamic partitions, the `hive.exec.dynamic.partition` parameter value should be true (the default).
- Change the dynamic partition mode to strict. In strict mode, at least one partition has to be static. This setting prevents queries without the partition filter in the WHERE clause; that is, strict prevents queries that scan all partitions. Navigate to the Hive Configs tab, and then set `hive.exec.dynamic.partition.mode` to strict. The default value is nonstrict.
- To limit the number of dynamic partitions to be created, modify the `hive.exec.max.dynamic.partitions` parameter. The default value is 5000.
- To limit the total number of dynamic partitions per node, modify `hive.exec.max.dynamic.partitions.pernode`. The default value is 2000.
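The settings above can be sketched per-session as:

```sql
SET hive.exec.dynamic.partition=true;
-- strict: at least one static partition, so inserts can't scan all partitions.
SET hive.exec.dynamic.partition.mode=strict;
SET hive.exec.max.dynamic.partitions=5000;
SET hive.exec.max.dynamic.partitions.pernode=2000;
```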
Enable local mode

Local mode enables Hive to do all tasks of a job on a single machine, or sometimes in a single process. This setting improves query performance if the input data is small and the overhead of launching tasks for queries consumes a significant percentage of the overall query execution.
To enable local mode, add the `hive.exec.mode.local.auto` parameter to the Custom hive-site pane, as explained in the Enable intermediate compression section.
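Per-session, local mode is a single auto switch:

```sql
-- Let Hive decide when a small job can run locally in a single JVM.
SET hive.exec.mode.local.auto=true;
```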
Set single MapReduce MultiGROUP BY

When this property is set to true, a MultiGROUP BY query with common group-by keys generates a single MapReduce job.
To enable this behavior, add the `hive.multigroupby.singlereducer` parameter to the Custom hive-site pane, as explained in the Enable intermediate compression section.
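Per-session, the property can be sketched as:

```sql
-- Collapse a MultiGROUP BY on common keys into one MapReduce job.
SET hive.multigroupby.singlereducer=true;
```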
Additional Hive optimizations

The following sections describe additional Hive-related optimizations you can set.

Join optimizations

The default join type in Hive is a shuffle join. In Hive, special mappers read the input and emit a join key/value pair to an intermediate file. Hadoop sorts and merges these pairs in a shuffle stage. This shuffle stage is expensive. Selecting the right join type based on your data can significantly improve performance.
| Join Type | When | How | Hive settings | Comments |
|---|---|---|---|---|
| Shuffle Join | Default choice; always works | Reads from part of one of the tables; buckets and sorts on the join key; sends one bucket to each reduce; the join is done on the reduce side | No significant Hive setting needed | Works every time |
| Map Join | One table can fit in memory | Reads the small table into a memory hash table; streams through part of the large file; joins each record from the hash table; joins are done by the mapper alone | `hive.auto.convert.join=true` | Fast, but limited |
| Sort Merge Bucket | If both tables are sorted the same, bucketed the same, and joined on the sorted/bucketed column | Each process reads a bucket from each table and processes the row with the lowest value | `hive.auto.convert.sortmerge.join=true` | Efficient |
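As a sketch, the automatic join conversions from the table can be enabled per-session:

```sql
-- Convert to a map join when the small table fits in memory,
-- and to a sort-merge-bucket join when both tables qualify.
SET hive.auto.convert.join=true;
SET hive.auto.convert.sortmerge.join=true;
```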
Execution engine optimizations

Additional recommendations for optimizing the Hive execution engine:
| Setting | Recommended | HDInsight Default |
|---|---|---|
| `hive.mapjoin.hybridgrace.hashtable` | True = safer, slower; false = faster | false |
| `tez.am.resource.memory.mb` | 4-GB upper bound for most | Auto-tuned |
| `tez.session.am.dag.submit.timeout.secs` | 300+ | 300 |
| `tez.am.container.idle.release-timeout-min.millis` | 20000+ | 10000 |
| `tez.am.container.idle.release-timeout-max.millis` | 40000+ | 20000 |
Next steps

- Manage HDInsight clusters with the Apache Ambari web UI
- Apache Ambari REST API
- Optimize Apache Hive queries in Azure HDInsight
- Optimize clusters
- Optimize Apache HBase
- Optimize Apache Pig