使用针对 Visual Studio 的 Data Lake 工具运行 Apache Hive 查询Run Apache Hive queries using the Data Lake tools for Visual Studio

了解如何使用用于 Visual Studio 的 Data Lake 工具查询 Apache Hive。Learn how to use the Data Lake tools for Visual Studio to query Apache Hive. 使用 Data Lake 工具,可以轻松创建 Hive 查询,将其提交到 Azure HDInsight 上的 Apache Hadoop 并进行监视。The Data Lake tools allow you to easily create, submit, and monitor Hive queries to Apache Hadoop on Azure HDInsight.

先决条件Prerequisites

使用 Visual Studio 运行 Apache Hive 查询Run Apache Hive queries using the Visual Studio

可以使用两个选项来创建并运行 Hive 查询:You have two options for creating and running Hive queries:

  • 创建临时查询。Create ad-hoc queries.
  • 创建 Hive 应用程序。Create a Hive application.

创建临时 Hive 查询。Create an ad-hoc Hive query

即席查询可以批处理交互式模式执行。Ad hoc queries can be executed in either Batch or Interactive mode.

  1. 启动 Visual Studio 并选择“继续但无需代码” 。Launch Visual Studio and select Continue without code.

  2. 在服务器资源管理器中右键单击“Azure”并选择“连接到 Microsoft Azure 订阅...”,然后完成登录过程。 From Server Explorer, right-click Azure, select Connect to Microsoft Azure Subscription..., and complete the sign in process.

  3. 展开“HDInsight”,右键单击要运行查询的群集,然后选择“编写 Hive 查询” 。Expand HDInsight, right-click the cluster where you want to run the query, and then select Write a Hive Query.

  4. 输入以下 Hive 查询:Enter the following hive query:

    SELECT * FROM hivesampletable;
    
  5. 选择“执行” 。Select Execute. 执行模式默认为“交互式”。 The execution mode defaults to Interactive.

    执行交互式 Hive 查询,Visual Studio

  6. 若要以批处理模式下运行同一查询,请将下拉列表从“交互式” 切换到“批处理” 。To run the same query in Batch mode, toggle the drop-down list from Interactive to Batch. 执行按钮将从“执行” 更改为“提交” 。The execution button changes from Execute to Submit.

    提交批处理 Hive 查询,Visual Studio

    Hive 编辑器支持 IntelliSense。The Hive editor supports IntelliSense. 用于 Visual Studio 的 Data Lake 工具支持在编辑 Hive 脚本时加载远程元数据。Data Lake Tools for Visual Studio supports loading remote metadata when you edit your Hive script. 例如,如果键入 SELECT * FROM,则 IntelliSense 会列出所有建议的表名称。For example, if you type SELECT * FROM, IntelliSense lists all the suggested table names. 在指定表名称后,IntelliSense 会列出列名称。When a table name is specified, IntelliSense lists the column names. 这些工具支持大多数 Hive DML 语句、子查询和内置 UDF。The tools support most Hive DML statements, subqueries, and built-in UDFs. IntelliSense 只建议 HDInsight 工具栏中所选群集的元数据。IntelliSense suggests only the metadata of the cluster that is selected in the HDInsight toolbar.

  7. 在查询工具栏(查询选项卡下面以及查询文本上面的区域)中,选择“提交”,或者选择“提交”旁边的下拉箭头并从下拉列表中选择“高级”。 In the query toolbar (the area below the query tab and above the query text), either select Submit, or select the pulldown arrow next to Submit and choose Advanced from the pulldown list. 如果选择后一个选项,If you select the latter option,

  8. 即选择高级提交选项,请在“提交脚本”对话框中配置“作业名称”、“参数”、“其他配置”和“状态目录”。 If you selected the advanced submit option, configure Job Name, Arguments, Additional Configurations, and Status Directory in the Submit Script dialog box. 然后选择“提交”。 Then select Submit.

    “提交脚本”对话框,HDInsight Hadoop Hive 查询

创建 Hive 应用程序Create a Hive application

若要通过创建 Hive 应用程序来运行 Hive 查询,请执行以下步骤:To run a Hive query by creating a Hive application, follow these steps:

  1. 打开 Visual StudioOpen Visual Studio.

  2. 在“开始”窗口中,选择“创建新项目”。 In the Start window, select Create a new project.

  3. 在“创建新项目”窗口中的“搜索模板”框内,输入 HiveIn the Create a new project window, in the Search for templates box, enter Hive. 然后依次选择“Hive 应用程序”、“下一步”。 Then choose Hive Application and select Next.

  4. 在“配置新项目”窗口中输入一个项目名称,选择或创建新项目的位置,然后选择“创建”。 In the Configure your new project window, enter a Project name, select or create a Location for the new project, and then select Create.

  5. 打开在创建此项目时产生的 Script.hql 文件,并在其中粘贴以下 HiveQL 语句:Open the Script.hql file that is created with this project, and paste in the following HiveQL statements:

    set hive.execution.engine=tez;
    DROP TABLE log4jLogs;
    CREATE EXTERNAL TABLE log4jLogs (t1 string, t2 string, t3 string, t4 string, t5 string, t6 string, t7 string)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ' '
    STORED AS TEXTFILE LOCATION '/example/data/';
    SELECT t4 AS sev, COUNT(*) AS count FROM log4jLogs WHERE t4 = '[ERROR]' AND  INPUT__FILE__NAME LIKE '%.log' GROUP BY t4;
    

    这些语句执行以下操作:These statements do the following actions:

    • DROP TABLE:如果表存在,将删除该表。DROP TABLE: Deletes the table if it exists.

    • CREATE EXTERNAL TABLE:在 Hive 中创建一个新的“外部”表。CREATE EXTERNAL TABLE: Creates a new 'external' table in Hive. 外部表只会在 Hive 中存储表定义。External tables only store the table definition in Hive. (数据保留在原始位置。)(The data is left in the original location.)

      备注

      如果你预期基础数据会由外部源(例如 MapReduce 作业或 Azure 服务)更新,则应使用外部表。External tables should be used when you expect the underlying data to be updated by an external source, such as a MapReduce job or an Azure service.

      删除外部表 不会 删除数据,只会删除表定义。Dropping an external table does not delete the data, only the table definition.

    • ROW FORMAT:让 Hive 知道数据的格式已如何进行了设置。ROW FORMAT: Tells Hive how the data is formatted. 在此情况下,每个日志中的字段以空格分隔。In this case, the fields in each log are separated by a space.

    • STORED AS TEXTFILE LOCATION:告知 Hive 数据已以文本形式存储在 example/data 目录中。STORED AS TEXTFILE LOCATION: Tells Hive that the data is stored in the example/data directory, and that it's stored as text.

    • SELECT:选择 t4 列包含值 [ERROR] 的所有行计数。SELECT: Selects a count of all rows where column t4 contains the value [ERROR]. 此语句会返回值 3,因为有三个行包含此值。This statement returns a value of 3, because three rows contain this value.

    • INPUT__FILE__NAME LIKE '%.log':告知 Hive 只会返回以 .log 结尾的文件中的数据。INPUT__FILE__NAME LIKE '%.log': Tells Hive to only return data from files ending in .log. 此子句将搜索限定为包含数据的 sample.log 文件。This clause restricts the search to the sample.log file that contains the data.

  6. 在查询文件工具栏(其外观与临时查询工具栏类似)中,选择要用于此查询的 HDInsight 群集。From the query file toolbar (which has a similar appearance to the ad-hoc query toolbar), select the HDInsight cluster that you want to use for this query. 然后,将“交互式”更改为“批处理”(如果需要),并选择“提交”以 Hive 作业形式运行语句。 Then change Interactive to Batch (if necessary) and select Submit to run the statements as a Hive job.

    “Hive 作业摘要” 将会出现并显示有关正在运行的作业的信息。The Hive Job Summary appears and displays information about the running job. 在“作业状态” 更改为“已完成” 之前,使用“刷新” 链接刷新作业信息。Use the Refresh link to refresh the job information, until the Job Status changes to Completed.

    已完成 Hive 作业摘要,Hive 应用程序,Visual Studio

  7. 选择“作业输出”查看此作业的输出。 Select Job Output to view the output of this job. [ERROR] 3,这是此查询返回的值。It displays [ERROR] 3, which is the value returned by this query.

其他示例Additional example

以下示例依赖于在前一过程创建 Hive 应用程序中创建的 log4jLogs 表。The following example relies on the log4jLogs table created in the previous procedure, Create a Hive application.

  1. 在“服务器资源管理器” 中,右键单击群集,然后选择“编写 Hive 查询” 。From Server Explorer, right-click your cluster and select Write a Hive Query.

  2. 输入以下 Hive 查询:Enter the following hive query:

    set hive.execution.engine=tez;
    CREATE TABLE IF NOT EXISTS errorLogs (t1 string, t2 string, t3 string, t4 string, t5 string, t6 string, t7 string) STORED AS ORC;
    INSERT OVERWRITE TABLE errorLogs SELECT t1, t2, t3, t4, t5, t6, t7 FROM log4jLogs WHERE t4 = '[ERROR]' AND INPUT__FILE__NAME LIKE '%.log';
    

    这些语句执行以下操作:These statements do the following actions:

    • CREATE TABLE IF NOT EXISTS:如果表尚不存在,将创建表。CREATE TABLE IF NOT EXISTS: Creates a table if it doesn't already exist. 由于未使用 EXTERNAL 关键字,因此此语句会创建内部表。Because the EXTERNAL keyword isn't used, this statement creates an internal table. 内部表存储在 Hive 数据仓库中,并由 Hive 托管。Internal tables are stored in the Hive data warehouse and are managed by Hive.

      备注

      EXTERNAL 表不同,删除内部表会同时删除基础数据。Unlike EXTERNAL tables, dropping an internal table also deletes the underlying data.

    • STORED AS ORC:以优化的行纵栏式 (ORC) 格式存储数据。 STORED AS ORC: Stores the data in optimized row columnar (ORC) format. ORC 是高度优化且有效的 Hive 数据存储格式。ORC is a highly optimized and efficient format for storing Hive data.

    • INSERT OVERWRITE ... SELECT:从包含 [ERROR]log4jLogs 表中选择行,然后将数据插入 errorLogs 表中。INSERT OVERWRITE ... SELECT: Selects rows from the log4jLogs table that contain [ERROR], then inserts the data into the errorLogs table.

  3. 根据需要将“交互式”更改为“批处理”,然后选择“提交”。 Change Interactive to Batch if necessary, then select Submit.

  4. 若要验证该作业是否已创建表,请转到“服务器资源管理器”并展开“Azure” > “HDInsight”。 To verify that the job created the table, go to Server Explorer and expand Azure > HDInsight. 展开 HDInsight 群集,然后展开“Hive 数据库” > “默认”。 Expand your HDInsight cluster, and then expand Hive Databases > default. 此时会列出 errorLogs 表和 log4jLogs 表。The errorLogs table and the log4jLogs table are listed.

后续步骤Next steps

如你所见,用于 Visual Studio 的 HDInsight 工具提供了在 HDInsight 上使用 Hive 查询的轻松方式。As you can see, the HDInsight tools for Visual Studio provide an easy way to work with Hive queries on HDInsight.

有关 HDInsight 中的 Hive 的一般信息:For general information about Hive in HDInsight:

有关 HDInsight 上 Hadoop 的其他使用方法的信息:For information about other ways you can work with Hadoop on HDInsight:

有关适用于 Visual Studio 的 HDInsight 工具的详细信息:For more information about the HDInsight tools for Visual Studio: