Optimize Apache Hive queries in Azure HDInsight

In Azure HDInsight, there are several cluster types and technologies that can run Apache Hive queries. When you create your HDInsight cluster, choose the appropriate cluster type to help optimize performance for your workload needs.

For example, choose the Interactive Query cluster type to optimize for ad hoc, interactive queries. Choose the Apache Hadoop cluster type to optimize for Hive queries used as a batch process. Spark and HBase cluster types can also run Hive queries. For more information on running Hive queries on various HDInsight cluster types, see What is Apache Hive and HiveQL on Azure HDInsight?.

HDInsight clusters of the Hadoop cluster type are not optimized for performance by default. This article describes some of the most common Hive performance optimization methods that you can apply to your queries.

Scale out worker nodes

Increasing the number of worker nodes in an HDInsight cluster lets the work use more mappers and reducers running in parallel. There are two ways to scale out in HDInsight:

  • At cluster creation time, you can specify the number of worker nodes by using the Azure portal, Azure PowerShell, or the command-line interface. For more information, see Create HDInsight clusters. The following screenshot shows the worker node configuration in the Azure portal:

    (Screenshot: worker node configuration in the Azure portal)

  • After creation, you can also edit the number of worker nodes to scale out the cluster further, without re-creating it:

    (Screenshot: editing the number of worker nodes after cluster creation)

For more information about scaling HDInsight, see Scale HDInsight clusters.
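
The benefit of scaling out comes from parallelism: with more worker slots, more map tasks run at the same time, and wall-clock time approaches (number of splits / number of workers) × per-task time. A minimal Python sketch of the idea (illustrative only, not HDInsight code):

```python
from concurrent.futures import ThreadPoolExecutor
import time

def map_task(split):
    # Simulate a mapper processing one input split.
    time.sleep(0.01)
    return sum(split)

splits = [[i, i + 1] for i in range(8)]  # 8 input splits

# With more worker slots, more map tasks run concurrently.
with ThreadPoolExecutor(max_workers=4) as pool:
    partial_sums = list(pool.map(map_task, splits))

total = sum(partial_sums)  # the "reduce" step
print(total)
```

Doubling `max_workers` roughly halves the elapsed time here, which is the same effect adding worker nodes has on a map-heavy Hive query.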

Use Apache Tez instead of Map Reduce

Apache Tez is an alternative execution engine to the MapReduce engine. Linux-based HDInsight clusters have Tez enabled by default.


Tez is faster because it:

  • Executes the directed acyclic graph (DAG) as a single job. In the MapReduce engine, the DAG requires each set of mappers to be followed by one set of reducers, which causes multiple MapReduce jobs to be spun off for each Hive query. Tez doesn't have this constraint and can process a complex DAG as one job, minimizing job startup overhead.
  • Avoids unnecessary writes. In the MapReduce engine, multiple jobs are used to process the same Hive query, and the output of each MapReduce job is written to HDFS as intermediate data. Because Tez minimizes the number of jobs for each Hive query, it can avoid unnecessary writes.
  • Minimizes start-up delays. Tez is better able to minimize start-up delay by reducing the number of mappers it needs to start and by improving optimization throughout.
  • Reuses containers. Whenever possible, Tez reuses containers to reduce the latency of starting them up.
  • Uses continuous optimization techniques. Traditionally, optimization is done during the compilation phase. However, more information about the inputs becomes available during runtime, which allows for better optimization. Tez uses continuous optimization techniques that allow it to optimize the plan further into the runtime phase.
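
The first two points can be made concrete with a toy Python model (not Tez itself): a chain of MapReduce jobs materializes its output to HDFS after every stage, while a single Tez DAG streams intermediates between vertices and writes out once.

```python
def run_as_mapreduce(stages, data):
    # Each stage is a separate job: its output is materialized
    # (written to HDFS) before the next job can start.
    hdfs_writes = 0
    for stage in stages:
        data = stage(data)
        hdfs_writes += 1  # one materialization per job
    return data, hdfs_writes

def run_as_tez(stages, data):
    # One DAG, one job: intermediates stream between vertices;
    # only the final result is written out.
    for stage in stages:
        data = stage(data)
    return data, 1

stages = [
    lambda rows: [r * 2 for r in rows],       # map
    lambda rows: [r for r in rows if r > 2],  # filter
    lambda rows: [sum(rows)],                 # reduce
]

mr_result, mr_writes = run_as_mapreduce(stages, [1, 2, 3])
tez_result, tez_writes = run_as_tez(stages, [1, 2, 3])
print(mr_result, mr_writes)   # same answer, 3 writes
print(tez_result, tez_writes) # same answer, 1 write
```

Both paths compute the same result; the difference is how many times intermediate data hits storage, which is where much of the MapReduce overhead comes from.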

For more information on these concepts, see Apache TEZ.

You can run any Hive query on Tez by prefixing the query with the following set command:

set hive.execution.engine=tez;

Hive partitioning

I/O operations are the major performance bottleneck for running Hive queries. Performance improves if the amount of data that needs to be read can be reduced. By default, Hive queries scan entire Hive tables. For queries that only need to scan a small amount of data (for example, queries with filtering), this behavior creates unnecessary overhead. Hive partitioning allows Hive queries to access only the necessary data in Hive tables.

Hive partitioning is implemented by reorganizing the raw data into new directories; each partition has its own file directory. The partitioning is defined by the user. The following diagram illustrates partitioning a Hive table by the column Year. A new directory is created for each year.

(Diagram: a Hive table partitioned by the Year column, with one directory per year)

Some partitioning considerations:

  • Do not under-partition - Partitioning on a column with only a few values produces only a few partitions. For example, partitioning on gender creates just two partitions (male and female), which reduces latency by at most half.
  • Do not over-partition - At the other extreme, partitioning on a column with nearly unique values (for example, userid) creates an enormous number of partitions. Over-partitioning puts heavy stress on the cluster namenode, which has to handle the resulting large number of directories.
  • Avoid data skew - Choose your partitioning key wisely so that all partitions are of even size. For example, partitioning on the State column may skew the distribution of data: because the state of California has a population almost 30x that of Vermont, the partition sizes are potentially skewed and performance may vary tremendously.
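
The skew concern can be made concrete with a small Python sketch. The per-partition row counts below are hypothetical, chosen only to illustrate the roughly 30x population gap:

```python
# Hypothetical row counts per State partition (illustrative numbers only).
rows_per_partition = {
    "California": 30_000_000,
    "Texas": 22_000_000,
    "Vermont": 1_000_000,
}

largest = max(rows_per_partition.values())
smallest = min(rows_per_partition.values())
skew_ratio = largest / smallest

# A query touching only the largest partition scans 30x the data
# of one touching the smallest, so per-partition latency varies widely.
print(f"skew ratio: {skew_ratio:.0f}x")
```

In a skewed layout, the tasks reading the largest partitions become stragglers that dominate total query time.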

To create a partitioned table, use the Partitioned By clause:

CREATE TABLE lineitem_part
      (L_ORDERKEY INT, L_PARTKEY INT, L_SUPPKEY INT, L_LINENUMBER INT,
      L_QUANTITY DOUBLE, L_EXTENDEDPRICE DOUBLE, L_DISCOUNT DOUBLE,
      L_TAX DOUBLE, L_RETURNFLAG STRING, L_LINESTATUS STRING,
      L_SHIPDATE_PS STRING, L_COMMITDATE STRING, L_RECEIPTDATE STRING, 
      L_SHIPINSTRUCT STRING, L_SHIPMODE STRING, L_COMMENT STRING)
PARTITIONED BY(L_SHIPDATE STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;

Once the partitioned table is created, you can use either static partitioning or dynamic partitioning.

  • Static partitioning means that you have already sharded the data into the appropriate directories. With static partitions, you add Hive partitions manually based on the directory location. The following code snippet is an example.

    INSERT OVERWRITE TABLE lineitem_part
    PARTITION (L_SHIPDATE = '5/23/1996 12:00:00 AM')
    SELECT * FROM lineitem
    WHERE lineitem.L_SHIPDATE = '5/23/1996 12:00:00 AM';

    ALTER TABLE lineitem_part ADD PARTITION (L_SHIPDATE = '5/23/1996 12:00:00 AM')
    LOCATION 'wasb://sampledata@ignitedemo.blob.core.chinacloudapi.cn/partitions/5_23_1996/';
    
  • Dynamic partitioning means that you want Hive to create partitions automatically for you. Since you have already created the partitioned table from the staging table, all you need to do is insert data into it:

    SET hive.exec.dynamic.partition = true;
    SET hive.exec.dynamic.partition.mode = nonstrict;
    INSERT INTO TABLE lineitem_part
    PARTITION (L_SHIPDATE)
    SELECT L_ORDERKEY as L_ORDERKEY, L_PARTKEY as L_PARTKEY , 
         L_SUPPKEY as L_SUPPKEY, L_LINENUMBER as L_LINENUMBER,
         L_QUANTITY as L_QUANTITY, L_EXTENDEDPRICE as L_EXTENDEDPRICE,
         L_DISCOUNT as L_DISCOUNT, L_TAX as L_TAX, L_RETURNFLAG as L_RETURNFLAG,
         L_LINESTATUS as L_LINESTATUS, L_SHIPDATE as L_SHIPDATE_PS,
         L_COMMITDATE as L_COMMITDATE, L_RECEIPTDATE as L_RECEIPTDATE,
         L_SHIPINSTRUCT as L_SHIPINSTRUCT, L_SHIPMODE as L_SHIPMODE, 
         L_COMMENT as L_COMMENT, L_SHIPDATE as L_SHIPDATE FROM lineitem;
    

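Conceptually, dynamic partitioning routes each inserted row to a directory derived from its partition-column value. A minimal Python sketch of that routing (illustrative only; Hive does this internally at insert time):

```python
from collections import defaultdict

# Rows from the staging table: (orderkey, shipdate).
staging_rows = [
    (1, "1996-05-23"),
    (2, "1996-05-23"),
    (3, "1997-01-02"),
]

# Dynamic partitioning: the partition a row lands in is derived
# from the value of its partition column, not declared up front.
partitions = defaultdict(list)
for orderkey, shipdate in staging_rows:
    partitions[f"L_SHIPDATE={shipdate}"].append(orderkey)

for directory in sorted(partitions):
    print(directory, partitions[directory])
```

Every distinct partition-column value in the source data yields one `L_SHIPDATE=<value>` directory, which is why the number of distinct values directly controls the number of partitions created.
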
For more information, see Partitioned Tables.

Use the ORCFile format

Hive supports different file formats. For example:

  • Text: the default file format; works for most scenarios.
  • Avro: works well for interoperability scenarios.
  • ORC/Parquet: best suited for performance.

ORC (Optimized Row Columnar) format is a highly efficient way to store Hive data. Compared to other formats, ORC has the following advantages:

  • Support for complex types, including DateTime and complex semi-structured types.
  • Up to 70% compression.
  • Indexes every 10,000 rows, which allows skipping rows.
  • A significant drop in run-time execution.
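
The row-skipping benefit of the per-10,000-row index can be sketched in Python. This toy model keeps only min/max statistics per row group and skips any group whose range cannot match the predicate (a simplified stand-in for ORC's row-group index, not its actual on-disk format):

```python
# Each "row group" carries min/max statistics over its rows.
row_groups = [
    {"min": 1, "max": 100, "rows": list(range(1, 101))},
    {"min": 500, "max": 600, "rows": list(range(500, 601))},
    {"min": 900, "max": 1000, "rows": list(range(900, 1001))},
]

def scan_with_index(groups, lo, hi):
    matched, groups_read = [], 0
    for g in groups:
        # Skip the whole group when its min/max range cannot match.
        if g["max"] < lo or g["min"] > hi:
            continue
        groups_read += 1
        matched.extend(v for v in g["rows"] if lo <= v <= hi)
    return matched, groups_read

values, read = scan_with_index(row_groups, 550, 560)
print(len(values), read)  # 11 matching values, only 1 of 3 groups read
```

A selective filter therefore touches only the row groups whose statistics overlap the predicate, which is where much of ORC's I/O saving comes from.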

To enable the ORC format, first create a table with the clause Stored as ORC:

CREATE TABLE lineitem_orc_part
      (L_ORDERKEY INT, L_PARTKEY INT, L_SUPPKEY INT, L_LINENUMBER INT,
      L_QUANTITY DOUBLE, L_EXTENDEDPRICE DOUBLE, L_DISCOUNT DOUBLE,
      L_TAX DOUBLE, L_RETURNFLAG STRING, L_LINESTATUS STRING,
      L_SHIPDATE_PS STRING, L_COMMITDATE STRING, L_RECEIPTDATE STRING,
      L_SHIPINSTRUCT STRING, L_SHIPMODE STRING, L_COMMENT STRING)
PARTITIONED BY(L_SHIPDATE STRING)
STORED AS ORC;

Next, insert data into the ORC table from the staging table. For example:

INSERT INTO TABLE lineitem_orc_part
PARTITION (L_SHIPDATE)
SELECT L_ORDERKEY as L_ORDERKEY,
         L_PARTKEY as L_PARTKEY,
         L_SUPPKEY as L_SUPPKEY,
         L_LINENUMBER as L_LINENUMBER,
         L_QUANTITY as L_QUANTITY,
         L_EXTENDEDPRICE as L_EXTENDEDPRICE,
         L_DISCOUNT as L_DISCOUNT,
         L_TAX as L_TAX,
         L_RETURNFLAG as L_RETURNFLAG,
         L_LINESTATUS as L_LINESTATUS,
         L_SHIPDATE as L_SHIPDATE_PS,
         L_COMMITDATE as L_COMMITDATE,
         L_RECEIPTDATE as L_RECEIPTDATE,
         L_SHIPINSTRUCT as L_SHIPINSTRUCT,
         L_SHIPMODE as L_SHIPMODE,
         L_COMMENT as L_COMMENT,
         L_SHIPDATE as L_SHIPDATE
FROM lineitem;

You can read more on the ORC format in the Apache Hive Language Manual.

Vectorization

Vectorization allows Hive to process a batch of 1,024 rows together instead of processing one row at a time. This means that simple operations finish faster because less internal code needs to run.

To enable vectorization, prefix your Hive query with the following setting:

 set hive.vectorized.execution.enabled = true;
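
The effect of batching can be sketched in Python. This is a toy model of operator invocations, not Hive internals: the per-call overhead of an operator is paid once per row in the row-at-a-time model, but once per 1,024-row batch when vectorized.

```python
BATCH_SIZE = 1024  # Hive's vectorized batch size

def run_row_at_a_time(rows):
    calls, out = 0, []
    for r in rows:            # one operator invocation per row
        calls += 1
        out.append(r + 1)
    return out, calls

def run_vectorized(rows):
    calls, out = 0, []
    for start in range(0, len(rows), BATCH_SIZE):
        calls += 1            # one invocation per batch
        out.extend(v + 1 for v in rows[start:start + BATCH_SIZE])
    return out, calls

data = list(range(3000))
row_out, row_calls = run_row_at_a_time(data)
vec_out, vec_calls = run_vectorized(data)
print(row_calls, vec_calls)   # 3000 invocations vs 3
```

Both paths produce identical output; the vectorized path just amortizes the fixed per-invocation cost across each batch, which is why simple scans and filters speed up the most.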

For more information, see Vectorized query execution.

Other optimization methods

There are more optimization methods that you can consider. For example:

  • Hive bucketing: a technique to cluster or segment large sets of data to optimize query performance.
  • Join optimization: optimization of Hive's query execution planning to improve the efficiency of joins and reduce the need for user hints. For more information, see Join optimization.
  • Increase reducers.

Next steps

In this article, you have learned several common Hive query optimization methods. To learn more, see the following articles: