Scenario: Reducer is slow in Azure HDInsight

This article describes troubleshooting steps and possible resolutions for issues when using Interactive Query components in Azure HDInsight clusters.

Issue

When you run a query such as insert into table1 partition(a,b) select a,b,c from table2, the query plan starts many reducers, but the data from each partition goes to a single reducer. As a result, the query takes as long as the reducer that handles the largest partition.

Cause

Open beeline and check the value of the hive.optimize.sort.dynamic.partition setting.
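As a minimal sketch, the current value can be inspected from a beeline session by issuing set with the property name and no value. This assumes you are already connected to the cluster's HiveServer2 endpoint.

```sql
-- Print the current session value of the property.
-- Beeline echoes it back in name=value form.
set hive.optimize.sort.dynamic.partition;
```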

Set this variable to true or false based on the nature of the data.

If the input table has few partitions (say, fewer than 10), the number of output partitions is also small, and the variable is set to true, the data is globally sorted and written using a single reducer per partition. Even if more reducers are available, a few reducers may lag behind because of data skew, and maximum parallelism cannot be attained. When the variable is changed to false, more than one reducer may handle a single partition and multiple smaller files are written out, which makes the insert faster. The smaller files might slow down subsequent queries, though.

A value of true makes sense when the number of partitions is large and the data is not skewed. In such cases, the result of the map phase is written out so that each partition is handled by a single reducer, which results in better performance for subsequent queries.
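One way to judge whether the data is skewed is to count the rows that would land in each output partition. The following is a sketch only, assuming the source table table2 and the partition columns a and b from the example query above; the limit of 20 is an arbitrary choice.

```sql
-- Count rows per (a, b) combination in the source table.
-- A handful of combinations with far more rows than the rest
-- suggests that a few reducers will do most of the work.
SELECT a, b, COUNT(*) AS row_count
FROM table2
GROUP BY a, b
ORDER BY row_count DESC
LIMIT 20;
```

If the top few counts dominate the total, the insert is likely bound by the reducers that handle those partitions.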

Resolution

  1. Try to repartition the data so that it is spread more evenly across multiple partitions.

  2. If #1 is not possible, set the value of the config to false in the beeline session and try the query again: set hive.optimize.sort.dynamic.partition=false. Setting the value to false at the cluster level is not recommended. A value of true is generally optimal; override the parameter only as needed, based on the nature of the data and the query.
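The session-level override in step 2 can be sketched as follows. The insert statement is the example query from the Issue section, used here as a placeholder for your own query; the override applies only to the current beeline session and does not change the cluster-level default.

```sql
-- Override the setting for this beeline session only.
set hive.optimize.sort.dynamic.partition=false;

-- Confirm that the session now reports the new value.
set hive.optimize.sort.dynamic.partition;

-- Rerun the slow query (placeholder: the article's example query).
insert into table1 partition(a,b) select a,b,c from table2;
```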

Next steps

If you didn't see your problem or are unable to solve your issue, visit the following channel for more support:

  • If you need more help, you can submit a support request from the Azure portal. Select Support from the menu bar or open the Help + support hub.