What is Apache Hive and HiveQL on Azure HDInsight?

Apache Hive is a data warehouse system for Apache Hadoop. Hive enables the summarization, querying, and analysis of data. Hive queries are written in HiveQL, which is a query language similar to SQL.

Hive allows you to project structure on largely unstructured data. After you define the structure, you can use HiveQL to query the data without knowledge of Java or MapReduce.

HDInsight provides several cluster types, which are tuned for specific workloads. The following cluster types are most often used for Hive queries:

| Cluster type | Description |
| --- | --- |
| Interactive Query | A Hadoop cluster that provides Low Latency Analytical Processing (LLAP) functionality to improve response times for interactive queries. For more information, see the Start with Interactive Query in HDInsight document. |
| Hadoop | A Hadoop cluster that is tuned for batch processing workloads. For more information, see the Start with Apache Hadoop in HDInsight document. |
| Spark | Apache Spark has built-in functionality for working with Hive. For more information, see the Start with Apache Spark on HDInsight document. |
| HBase | HiveQL can be used to query data stored in Apache HBase. For more information, see the Start with Apache HBase on HDInsight document. |

How to use Hive

Use the following table to discover the different ways to use Hive with HDInsight:

| Use this method if you want... | ...interactive queries | ...batch processing | ...from this client operating system |
| --- | --- | --- | --- |
| HDInsight tools for Visual Studio Code | ✔ | ✔ | Linux, Unix, Mac OS X, or Windows |
| HDInsight tools for Visual Studio | ✔ | ✔ | Windows |
| Hive View | ✔ | ✔ | Any (browser based) |
| Beeline client | ✔ | ✔ | Linux, Unix, Mac OS X, or Windows |
| REST API |  | ✔ | Linux, Unix, Mac OS X, or Windows |
| Windows PowerShell |  | ✔ | Windows |

HiveQL language reference

The HiveQL language reference is available in the language manual (https://cwiki.apache.org/confluence/display/Hive/LanguageManual).
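
For quick orientation, the following minimal sketch shows a few common HiveQL constructs covered in the manual; the products table and its columns are hypothetical:

-- Create a hypothetical table, load a few rows, then query and clean up.
CREATE TABLE IF NOT EXISTS products (id INT, name STRING, price DOUBLE);
INSERT INTO TABLE products VALUES (1, 'widget', 2.50), (2, 'gadget', 3.75);
SELECT name, price FROM products WHERE price > 3.0;
DROP TABLE IF EXISTS products;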

Hive and data structure

Hive understands how to work with structured and semi-structured data, for example, text files where the fields are delimited by specific characters. The following HiveQL statement creates a table over space-delimited data:

CREATE EXTERNAL TABLE log4jLogs (
    t1 string,
    t2 string,
    t3 string,
    t4 string,
    t5 string,
    t6 string,
    t7 string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ' '
STORED AS TEXTFILE LOCATION '/example/data/';

Hive also supports custom serializers/deserializers (SerDe) for complex or irregularly structured data. For more information, see the How to use a custom JSON SerDe with HDInsight document.
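
As an illustration, the following is a minimal sketch of a table backed by the JSON SerDe that ships with Hive's HCatalog component; the storage path and column names are assumptions, and on some cluster versions the SerDe JAR must be added to the session first:

-- Hypothetical table over JSON documents, parsed by the HCatalog JSON SerDe.
CREATE EXTERNAL TABLE jsonLogs (
    id string,
    message string)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
STORED AS TEXTFILE LOCATION '/example/jsondata/';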

For more information on the file formats supported by Hive, see the language manual (https://cwiki.apache.org/confluence/display/Hive/LanguageManual).

Hive internal tables vs external tables

There are two types of tables that you can create with Hive:

  • Internal: Data is stored in the Hive data warehouse. The data warehouse is located at /hive/warehouse/ on the default storage for the cluster.

    Use internal tables when one of the following conditions applies:

    • Data is temporary.
    • You want Hive to manage the lifecycle of the table and data.
  • External: Data is stored outside the data warehouse. The data can be stored on any storage accessible by the cluster.

    Use external tables when one of the following conditions applies:

    • The data is also used outside of Hive. For example, the data files are updated by another process (that doesn't lock the files).
    • Data needs to remain in the underlying location, even after dropping the table.
    • You need a custom location, such as a non-default storage account.
    • A program other than Hive manages the data format, location, and so on.

For more information, see the Hive Internal and External Tables Intro blog post.
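
To make the distinction concrete, the following sketch contrasts the two table types; the table names and storage path are hypothetical:

-- Internal (managed) table: Hive owns both the definition and the data.
CREATE TABLE managedLogs (line string);
DROP TABLE managedLogs;    -- removes the table definition AND the data files

-- External table: Hive owns only the definition; the files stay in place.
CREATE EXTERNAL TABLE externalLogs (line string)
LOCATION '/example/externaldata/';
DROP TABLE externalLogs;   -- removes only the table definition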

User-defined functions (UDF)

Hive can also be extended through user-defined functions (UDF). A UDF allows you to implement functionality or logic that isn't easily modeled in HiveQL. For examples of using UDFs with Hive, see the HDInsight documentation.
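
The registration pattern looks like the following sketch; the JAR path, class name, and function name are hypothetical:

-- Register a hypothetical Java UDF from a JAR, then call it like a built-in.
ADD JAR wasbs:///example/jars/ExampleUdf.jar;
CREATE TEMPORARY FUNCTION to_upper AS 'com.example.hive.udf.ToUpper';
SELECT to_upper(t1) FROM log4jLogs LIMIT 10;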

Example data

Hive on HDInsight comes pre-loaded with an internal table named hivesampletable. HDInsight also provides example data sets that can be used with Hive. These data sets are stored in the /example/data and /HdiSamples directories, which exist in the default storage for your cluster.
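
For example, you can inspect and preview the pre-loaded table directly (the exact columns vary by cluster version):

-- Show the schema of the pre-loaded sample table, then preview a few rows.
DESCRIBE hivesampletable;
SELECT * FROM hivesampletable LIMIT 10;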

Example Hive query

The following HiveQL statements project columns onto the /example/data/sample.log file:

DROP TABLE log4jLogs;
CREATE EXTERNAL TABLE log4jLogs (
    t1 string,
    t2 string,
    t3 string,
    t4 string,
    t5 string,
    t6 string,
    t7 string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ' '
STORED AS TEXTFILE LOCATION '/example/data/';
SELECT t4 AS sev, COUNT(*) AS count FROM log4jLogs 
    WHERE t4 = '[ERROR]' AND INPUT__FILE__NAME LIKE '%.log' 
    GROUP BY t4;

In the previous example, the HiveQL statements perform the following actions:

| Statement | Description |
| --- | --- |
| DROP TABLE | If the table already exists, delete it. |
| CREATE EXTERNAL TABLE | Creates a new external table in Hive. External tables only store the table definition in Hive. The data is left in the original location and in the original format. |
| ROW FORMAT | Tells Hive how the data is formatted. In this case, the fields in each log are separated by a space. |
| STORED AS TEXTFILE LOCATION | Tells Hive where the data is stored (the /example/data directory) and that it's stored as text. The data can be in one file or spread across multiple files within the directory. |
| SELECT | Selects a count of all rows where the column t4 contains the value [ERROR]. This statement returns a value of 3 because there are three rows that contain this value. |
| INPUT__FILE__NAME LIKE '%.log' | Hive attempts to apply the schema to all files in the directory. In this case, the directory contains files that don't match the schema. To prevent garbage data in the results, this statement tells Hive to return data only from files ending in .log. |

Note

External tables should be used when you expect the underlying data to be updated by an external source, for example, an automated data upload process or a MapReduce operation.

Dropping an external table does not delete the data; it only deletes the table definition.

To create an internal table instead of an external one, use the following HiveQL:

CREATE TABLE IF NOT EXISTS errorLogs (
    t1 string,
    t2 string,
    t3 string,
    t4 string,
    t5 string,
    t6 string,
    t7 string)
STORED AS ORC;
INSERT OVERWRITE TABLE errorLogs
SELECT t1, t2, t3, t4, t5, t6, t7 
    FROM log4jLogs WHERE t4 = '[ERROR]';

These statements perform the following actions:

| Statement | Description |
| --- | --- |
| CREATE TABLE IF NOT EXISTS | If the table doesn't exist, create it. Because the EXTERNAL keyword isn't used, this statement creates an internal table. The table is stored in the Hive data warehouse and is managed completely by Hive. |
| STORED AS ORC | Stores the data in Optimized Row Columnar (ORC) format. ORC is a highly optimized and efficient format for storing Hive data. |
| INSERT OVERWRITE ... SELECT | Selects rows from the log4jLogs table that contain [ERROR], and then inserts the data into the errorLogs table. |

Note

Unlike external tables, dropping an internal table also deletes the underlying data.
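
A brief usage sketch of the errorLogs table, illustrating that dropping it removes the data as well:

-- Query the internal table, then drop it; the drop also deletes the ORC data files.
SELECT COUNT(*) AS errorCount FROM errorLogs;
DROP TABLE errorLogs;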

Improve Hive query performance

Apache Tez

Apache Tez is a framework that allows data-intensive applications, such as Hive, to run much more efficiently at scale. Tez is enabled by default. The Apache Hive on Tez design documents contain details about the implementation choices and tuning configurations.
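
Tez is already the default on HDInsight clusters, but the execution engine can be checked or set per session; a minimal sketch:

-- Display the current execution engine, then set it explicitly if needed.
SET hive.execution.engine;
SET hive.execution.engine=tez;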

Low Latency Analytical Processing (LLAP)

LLAP (sometimes known as Live Long and Process) is a new feature in Hive 2.0 that allows in-memory caching of queries. LLAP makes Hive queries much faster, up to 26x faster than Hive 1.x in some cases.

HDInsight provides LLAP in the Interactive Query cluster type. For more information, see the Start with Interactive Query document.
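
On an Interactive Query cluster, LLAP is preconfigured; the session-level execution mode can still be inspected or adjusted. A minimal sketch (property support depends on the Hive version):

-- Display the LLAP execution mode, then run eligible work inside LLAP daemons.
SET hive.llap.execution.mode;
SET hive.llap.execution.mode=all;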

Scheduling Hive queries

There are several services that can be used to run Hive queries as part of a scheduled or on-demand workflow.

Azure Data Factory

Azure Data Factory allows you to use HDInsight as part of a Data Factory pipeline. For more information on using Hive from a pipeline, see the Transform data using Hive activity in Azure Data Factory document.

Hive jobs and SQL Server Integration Services

You can use SQL Server Integration Services (SSIS) to run a Hive job. The Azure Feature Pack for SSIS provides components that work with Hive jobs on HDInsight.

For more information, see the Azure Feature Pack documentation.

Apache Oozie

Apache Oozie is a workflow and coordination system that manages Hadoop jobs. For more information on using Oozie with Hive, see the Use Apache Oozie to define and run a workflow document.

Next steps

Now that you've learned what Hive is and how to use it with Hadoop in HDInsight, use the following links to explore other ways to work with Azure HDInsight.