Tune performance: Hive, HDInsight & Azure Data Lake Storage Gen2

The default settings are chosen to provide good performance across many different use cases. For I/O-intensive queries, Hive can be tuned to get better performance with Azure Data Lake Storage Gen2.

Prerequisites

Parameters

Here are the most important settings to tune for improved Data Lake Storage Gen2 performance:

  • hive.tez.container.size - the amount of memory used by each task

  • tez.grouping.min-size - minimum size of each mapper

  • tez.grouping.max-size - maximum size of each mapper

  • hive.exec.reducer.bytes.per.reducer - size of each reducer

hive.tez.container.size - The container size determines how much memory is available for each task. This is the main input for controlling the concurrency in Hive.

tez.grouping.min-size - This parameter sets the minimum size of each mapper. If the size that Tez chooses for a mapper is smaller than the value of this parameter, Tez uses the value set here.

tez.grouping.max-size - This parameter sets the maximum size of each mapper. If the size that Tez chooses for a mapper is larger than the value of this parameter, Tez uses the value set here.
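The effect of the two grouping bounds can be pictured as a simple clamp. The following Python sketch is only an illustration of that idea, not Tez's actual grouping algorithm; the function name and the 16 MB and 1024 MB bounds are hypothetical values chosen for the example.

```python
MB = 1024 * 1024


def clamp_split_size(desired_bytes: int, min_size: int, max_size: int) -> int:
    """Clamp a desired mapper input size to the [min_size, max_size] range.

    Simplified model: the real Tez split grouping considers more inputs.
    """
    if min_size > max_size:
        raise ValueError("min_size must not exceed max_size")
    return max(min_size, min(desired_bytes, max_size))


# Hypothetical bounds, chosen only for illustration.
MIN_SIZE = 16 * MB
MAX_SIZE = 1024 * MB

small = clamp_split_size(4 * MB, MIN_SIZE, MAX_SIZE)      # raised to the minimum
inside = clamp_split_size(256 * MB, MIN_SIZE, MAX_SIZE)   # left unchanged
large = clamp_split_size(4096 * MB, MIN_SIZE, MAX_SIZE)   # lowered to the maximum

print(small // MB, inside // MB, large // MB)
```

A size below the minimum is raised to the minimum, a size above the maximum is lowered to the maximum, and anything in between is left alone.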

hive.exec.reducer.bytes.per.reducer - This parameter sets the size of each reducer. By default, each reducer is 256 MB.

Guidance

Set hive.exec.reducer.bytes.per.reducer - The default value works well when the data is uncompressed. For data that is compressed, you should reduce the size of the reducer.
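To see why a smaller reducer size increases parallelism, the sketch below estimates the reducer count as the input size divided by the bytes per reducer, capped at a maximum. This is a simplified model and an assumption of this example: Hive's real estimate takes other settings into account, and the function name, the 100 GB input, and the cap of 1000 are hypothetical.

```python
import math

MB = 1024 * 1024
GB = 1024 * MB


def estimate_reducers(input_bytes: int, bytes_per_reducer: int, max_reducers: int) -> int:
    """Rough reducer count: one reducer per bytes_per_reducer of input, capped."""
    if bytes_per_reducer <= 0:
        raise ValueError("bytes_per_reducer must be positive")
    wanted = math.ceil(input_bytes / bytes_per_reducer)
    # Always at least one reducer, never more than the cap.
    return max(1, min(wanted, max_reducers))


INPUT_SIZE = 100 * GB   # hypothetical input size
REDUCER_CAP = 1000      # hypothetical upper limit on reducers

default_count = estimate_reducers(INPUT_SIZE, 256 * MB, REDUCER_CAP)
smaller_count = estimate_reducers(INPUT_SIZE, 128 * MB, REDUCER_CAP)

print(default_count, smaller_count)
```

Under this model, halving the bytes per reducer doubles the number of reducers for the same input, which is the direction the guidance recommends for compressed data.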

Set hive.tez.container.size - In each node, memory is specified by yarn.nodemanager.resource.memory-mb and should be correctly set on an HDInsight (HDI) cluster by default. For additional information on setting the appropriate memory in YARN, see this post.

I/O-intensive workloads can benefit from more parallelism, which you get by decreasing the Tez container size. This gives the user more containers, which increases concurrency. However, some Hive queries (for example, MapJoin) require a significant amount of memory. If a task does not have enough memory, you will get an out-of-memory exception at runtime. If you receive out-of-memory exceptions, increase the memory.

The number of tasks running concurrently (the parallelism) is bounded by the total YARN memory. The number of YARN containers dictates how many concurrent tasks can run. To find the YARN memory per node, go to Ambari. Navigate to YARN and view the Configs tab. The YARN memory is displayed in this window.

  • Total YARN memory = nodes * YARN memory per node
  • # of YARN containers = Total YARN memory / Tez container size

The key to improving performance with Data Lake Storage Gen2 is to increase the concurrency as much as possible. Tez automatically calculates the number of tasks that should be created, so you do not need to set it.

Example calculation

Let's say you have an 8-node D14 cluster.

  • Total YARN memory = nodes * YARN memory per node
  • Total YARN memory = 8 nodes * 96 GB = 768 GB
  • # of YARN containers = 768 GB / 3072 MB = 256
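The arithmetic above can be checked with a short Python sketch. It makes the GB-to-MB conversion explicit (1 GB = 1024 MB) and also shows what a smaller container would give. The function name is hypothetical, and the 1536 MB container size is an illustrative value, not a recommendation.

```python
GB_IN_MB = 1024


def yarn_containers(nodes: int, yarn_memory_per_node_mb: int, tez_container_mb: int) -> int:
    """Number of YARN containers = total YARN memory / Tez container size."""
    if tez_container_mb <= 0:
        raise ValueError("tez_container_mb must be positive")
    total_yarn_memory_mb = nodes * yarn_memory_per_node_mb
    # Only whole containers can be allocated, so round down.
    return total_yarn_memory_mb // tez_container_mb


# Values from the example: 8 nodes, 96 GB of YARN memory each, 3072 MB containers.
containers = yarn_containers(8, 96 * GB_IN_MB, 3072)

# Hypothetical smaller container size, to illustrate the effect on concurrency.
containers_halved = yarn_containers(8, 96 * GB_IN_MB, 1536)

print(containers, containers_halved)
```

Halving the container size doubles the number of containers, and with it the number of tasks that can run at once, as long as each task still has enough memory to complete.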

Further information on Hive tuning

Here are a few blogs that will help tune your Hive queries: