Use the Apache Beeline client with Apache Hive

Learn how to use Apache Beeline to run Apache Hive queries on HDInsight.

Beeline is a Hive client that's included on the head nodes of your HDInsight cluster. To connect to the Beeline client installed on your HDInsight cluster, or to install Beeline locally, see Connect to or install Apache Beeline. Beeline uses JDBC to connect to HiveServer2, a service hosted on your HDInsight cluster. You can also use Beeline to access Hive on HDInsight remotely over the internet. The following examples provide the most common connection strings used to connect to HDInsight from Beeline.
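As a hedged sketch, the two connection strings you'll see most often can be built as below. The cluster name is a placeholder, and the public endpoint and httpPath reflect commonly documented defaults rather than values verified against your cluster.

```shell
# Sketch only: CLUSTERNAME is a hypothetical placeholder.
CLUSTERNAME="mycluster"

# From an SSH session on a head node (HTTP, port 10001):
LOCAL_URL='jdbc:hive2://headnodehost:10001/;transportMode=http'

# From outside the cluster, over the public HTTPS endpoint (port 443):
REMOTE_URL="jdbc:hive2://${CLUSTERNAME}.azurehdinsight.cn:443/;ssl=true;transportMode=http;httpPath=/hive2"

echo "beeline -u '${LOCAL_URL}'"
echo "beeline -u '${REMOTE_URL}' -n admin -p"
```

The remote form authenticates with the cluster login account (shown here as `admin`) and prompts for its password.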

Prerequisites for examples

  • A Hadoop cluster on HDInsight. See Get Started with HDInsight on Linux.

  • Notice the URI scheme for your cluster's primary storage. For example, wasb:// for Azure Storage and abfs:// for Azure Data Lake Storage Gen2. If secure transfer is enabled for Azure Storage, the URI is wasbs://. For more information, see secure transfer.

  • Option 1: An SSH client. For more information, see Connect to HDInsight (Apache Hadoop) using SSH. Most of the steps in this document assume that you're using Beeline from an SSH session to the cluster.

  • Option 2: A local Beeline client.

Run a Hive query

This example is based on using the Beeline client from an SSH connection.

  1. Open an SSH connection to the cluster with the command below. Replace sshuser with the SSH user for your cluster, and replace CLUSTERNAME with the name of your cluster. When prompted, enter the password for the SSH user account.

    ssh sshuser@CLUSTERNAME-ssh.azurehdinsight.cn
    
  2. Connect to HiveServer2 with your Beeline client from your open SSH session by entering the following command:

    beeline -u 'jdbc:hive2://headnodehost:10001/;transportMode=http'
    
  3. Beeline commands begin with a ! character; for example, !help displays help. However, the ! can be omitted for some commands; for example, help also works.

    There's !sql, which is used to execute HiveQL statements. However, HiveQL is so commonly used that you can omit the preceding !sql. The following two statements are equivalent:

    !sql show tables;
    show tables;
    

    On a new cluster, only one table is listed: hivesampletable.

  4. Use the following command to display the schema for the hivesampletable:

    describe hivesampletable;
    

    This command returns the following information:

    +-----------------------+------------+----------+--+
    |       col_name        | data_type  | comment  |
    +-----------------------+------------+----------+--+
    | clientid              | string     |          |
    | querytime             | string     |          |
    | market                | string     |          |
    | deviceplatform        | string     |          |
    | devicemake            | string     |          |
    | devicemodel           | string     |          |
    | state                 | string     |          |
    | country               | string     |          |
    | querydwelltime        | double     |          |
    | sessionid             | bigint     |          |
    | sessionpagevieworder  | bigint     |          |
    +-----------------------+------------+----------+--+
    

    This information describes the columns in the table.

  5. Enter the following statements to create a table named log4jLogs by using sample data provided with the HDInsight cluster (revise the storage path as needed for your URI scheme):

    DROP TABLE log4jLogs;
    CREATE EXTERNAL TABLE log4jLogs (
        t1 string,
        t2 string,
        t3 string,
        t4 string,
        t5 string,
        t6 string,
        t7 string)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ' '
    STORED AS TEXTFILE LOCATION 'wasbs:///example/data/';
    SELECT t4 AS sev, COUNT(*) AS count FROM log4jLogs 
        WHERE t4 = '[ERROR]' AND INPUT__FILE__NAME LIKE '%.log' 
        GROUP BY t4;
    

    These statements do the following actions:

    | Statement | Description |
    | --- | --- |
    | DROP TABLE | If the table exists, it's deleted. |
    | CREATE EXTERNAL TABLE | Creates an external table in Hive. External tables only store the table definition in Hive; the data is left in the original location. |
    | ROW FORMAT | How the data is formatted. In this case, the fields in each log are separated by a space. |
    | STORED AS TEXTFILE LOCATION | Where the data is stored and in what file format. |
    | SELECT | Selects a count of all rows where column t4 contains the value [ERROR]. This query returns a value of 3 because three rows contain this value. |
    | INPUT__FILE__NAME LIKE '%.log' | Hive attempts to apply the schema to all files in the directory. In this case, the directory contains files that don't match the schema. To prevent garbage data in the results, this statement tells Hive to return data only from files ending in .log. |

    Note

    External tables should be used when you expect the underlying data to be updated by an external source, such as an automated data upload process or a MapReduce operation.

    Dropping an external table does not delete the data, only the table definition.

    The output of this command is similar to the following text:

    INFO  : Tez session hasn't been created yet. Opening session
    INFO  :
    
    INFO  : Status: Running (Executing on YARN cluster with App id application_1443698635933_0001)
    
    INFO  : Map 1: -/-      Reducer 2: 0/1
    INFO  : Map 1: 0/1      Reducer 2: 0/1
    INFO  : Map 1: 0/1      Reducer 2: 0/1
    INFO  : Map 1: 0/1      Reducer 2: 0/1
    INFO  : Map 1: 0/1      Reducer 2: 0/1
    INFO  : Map 1: 0(+1)/1  Reducer 2: 0/1
    INFO  : Map 1: 0(+1)/1  Reducer 2: 0/1
    INFO  : Map 1: 1/1      Reducer 2: 0/1
    INFO  : Map 1: 1/1      Reducer 2: 0(+1)/1
    INFO  : Map 1: 1/1      Reducer 2: 1/1
    +----------+--------+--+
    |   sev    | count  |
    +----------+--------+--+
    | [ERROR]  | 3      |
    +----------+--------+--+
    1 row selected (47.351 seconds)
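    The note above about external tables can be sketched in HiveQL. This is a hypothetical, self-contained illustration; the uploads table and its storage path don't exist in the sample cluster.

```sql
-- Hypothetical external table over a directory that some external
-- upload process writes to.
CREATE EXTERNAL TABLE uploads (line string)
STORED AS TEXTFILE LOCATION 'wasbs:///example/uploads/';

-- Files added to that directory later are visible to queries
-- immediately; no reload step is needed.
SELECT COUNT(*) FROM uploads;

-- Removes only the Hive table definition; the files under
-- 'wasbs:///example/uploads/' are left in place.
DROP TABLE uploads;
```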
    
  6. Exit Beeline:

    !exit
    

Run a HiveQL file

This example is a continuation of the prior example. Use the following steps to create a file, then run it using Beeline.

  1. Use the following command to create a file named query.hql:

    nano query.hql
    
  2. Use the following text as the contents of the file. This query creates a new 'internal' table named errorLogs:

    CREATE TABLE IF NOT EXISTS errorLogs (t1 string, t2 string, t3 string, t4 string, t5 string, t6 string, t7 string) STORED AS ORC;
    INSERT OVERWRITE TABLE errorLogs SELECT t1, t2, t3, t4, t5, t6, t7 FROM log4jLogs WHERE t4 = '[ERROR]' AND INPUT__FILE__NAME LIKE '%.log';
    

    These statements do the following actions:

    | Statement | Description |
    | --- | --- |
    | CREATE TABLE IF NOT EXISTS | If the table doesn't already exist, it's created. Because the EXTERNAL keyword isn't used, this statement creates an internal table. Internal tables are stored in the Hive data warehouse and are managed completely by Hive. |
    | STORED AS ORC | Stores the data in Optimized Row Columnar (ORC) format, a highly optimized and efficient format for storing Hive data. |
    | INSERT OVERWRITE ... SELECT | Selects rows from the log4jLogs table that contain [ERROR], then inserts the data into the errorLogs table. |

    Note

    Unlike external tables, dropping an internal table deletes the underlying data as well.

  3. To save the file, press Ctrl+X, then enter Y, and finally press Enter.

  4. Use the following command to run the file using Beeline:

    beeline -u 'jdbc:hive2://headnodehost:10001/;transportMode=http' -i query.hql
    

    Note

    The -i parameter starts Beeline and runs the statements in the query.hql file. Once the query completes, you arrive at the jdbc:hive2://headnodehost:10001/> prompt. You can also run a file using the -f parameter, which exits Beeline after the query completes.
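    As a sketch, the two invocation styles look like this. The commands are printed rather than executed here, since actually running them requires a live cluster.

```shell
# The same head-node URL used throughout this tutorial.
URL='jdbc:hive2://headnodehost:10001/;transportMode=http'

# -i: run query.hql, then remain at the interactive Beeline prompt.
echo "beeline -u '$URL' -i query.hql"

# -f: run query.hql, then exit Beeline (handy for scripting).
echo "beeline -u '$URL' -f query.hql"
```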

  5. To verify that the errorLogs table was created, use the following statement to return all the rows from errorLogs:

    SELECT * FROM errorLogs;
    

    Three rows of data should be returned, all containing [ERROR] in column t4:

    +---------------+---------------+---------------+---------------+---------------+---------------+---------------+--+
    | errorlogs.t1  | errorlogs.t2  | errorlogs.t3  | errorlogs.t4  | errorlogs.t5  | errorlogs.t6  | errorlogs.t7  |
    +---------------+---------------+---------------+---------------+---------------+---------------+---------------+--+
    | 2012-02-03    | 18:35:34      | SampleClass0  | [ERROR]       | incorrect     | id            |               |
    | 2012-02-03    | 18:55:54      | SampleClass1  | [ERROR]       | incorrect     | id            |               |
    | 2012-02-03    | 19:25:27      | SampleClass4  | [ERROR]       | incorrect     | id            |               |
    +---------------+---------------+---------------+---------------+---------------+---------------+---------------+--+
    3 rows selected (0.813 seconds)
    

Next steps