Use the Apache Beeline client with Apache Hive

Learn how to use Apache Beeline to run Apache Hive queries on HDInsight.

Beeline is a Hive client that is included on the head nodes of your HDInsight cluster. To install Beeline locally, see Install beeline client, below. Beeline uses JDBC to connect to HiveServer2, a service hosted on your HDInsight cluster. You can also use Beeline to access Hive on HDInsight remotely over the internet. The following examples provide the most common connection strings used to connect to HDInsight from Beeline.

Types of connections

From an SSH session

When connecting from an SSH session to a cluster head node, you can then connect to the headnodehost address on port 10001:

beeline -u 'jdbc:hive2://headnodehost:10001/;transportMode=http'

Over an Azure Virtual Network

When connecting from a client to HDInsight over an Azure Virtual Network, you must provide the fully qualified domain name (FQDN) of a cluster head node. Since this connection is made directly to the cluster nodes, the connection uses port 10001:

beeline -u 'jdbc:hive2://<headnode-FQDN>:10001/;transportMode=http'

Replace <headnode-FQDN> with the fully qualified domain name of a cluster head node. To find the fully qualified domain name of a head node, use the information in the Manage HDInsight using the Apache Ambari REST API document.
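As an illustration only (assuming the default admin cluster login and substituting your own CLUSTERNAME), the cluster's host names can also be listed with a direct call to the Ambari REST API:

# Sketch: list the host names registered in Ambari; head-node FQDNs typically start with "hn".
curl -u admin -sS "https://CLUSTERNAME.azurehdinsight.cn/api/v1/clusters/CLUSTERNAME/hosts" | grep host_name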


Over public or private endpoints

When connecting to a cluster using the public or private endpoints, you must provide the cluster login account name (default admin) and password. For example, using Beeline from a client system to connect to the <clustername>.azurehdinsight.cn address. This connection is made over port 443, and is encrypted using SSL:

beeline -u 'jdbc:hive2://clustername.azurehdinsight.cn:443/;ssl=true;transportMode=http;httpPath=/hive2' -n admin -p password

Or, for a private endpoint:

beeline -u 'jdbc:hive2://clustername-int.azurehdinsight.cn:443/;ssl=true;transportMode=http;httpPath=/hive2' -n admin -p password

Replace clustername with the name of your HDInsight cluster. Replace admin with the cluster login account for your cluster. Replace password with the password for the cluster login account.

Private endpoints point to a basic load balancer, which can only be accessed from VNets peered in the same region. See constraints on global VNet peering and load balancers for more information. You can use the curl command with the -v option to troubleshoot any connectivity problems with public or private endpoints before using Beeline.
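For example, a quick reachability check against the public endpoint (a minimal sketch; clustername, admin, and password are placeholders for your own values) could look like the following:

curl -v -u admin:password https://clustername.azurehdinsight.cn/

Swap in clustername-int.azurehdinsight.cn to test a private endpoint instead.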


Use Beeline with Apache Spark

Apache Spark provides its own implementation of HiveServer2, which is sometimes referred to as the Spark Thrift server. This service uses Spark SQL to resolve queries instead of Hive, and may provide better performance depending on your query.

Through public or private endpoints

The connection string used is slightly different. Instead of containing httpPath=/hive2, it contains httpPath=/sparkhive2:

beeline -u 'jdbc:hive2://clustername.azurehdinsight.cn:443/;ssl=true;transportMode=http;httpPath=/sparkhive2' -n admin -p password

Or, for a private endpoint:

beeline -u 'jdbc:hive2://clustername-int.azurehdinsight.cn:443/;ssl=true;transportMode=http;httpPath=/sparkhive2' -n admin -p password

Replace clustername with the name of your HDInsight cluster. Replace admin with the cluster login account for your cluster. Replace password with the password for the cluster login account. Private endpoints point to a basic load balancer, which can only be accessed from VNets peered in the same region. See constraints on global VNet peering and load balancers for more information. You can use the curl command with the -v option to troubleshoot any connectivity problems with public or private endpoints before using Beeline.


From the cluster head node or inside an Azure Virtual Network with Apache Spark

When connecting directly from the cluster head node, or from a resource inside the same Azure Virtual Network as the HDInsight cluster, port 10002 should be used for the Spark Thrift server instead of 10001. The following example shows how to connect directly to the head node:

/usr/hdp/current/spark2-client/bin/beeline -u 'jdbc:hive2://headnodehost:10002/;transportMode=http'

Prerequisites

  • A Hadoop cluster on HDInsight. See Get Started with HDInsight on Linux.

  • Notice the URI scheme for your cluster's primary storage. For example, wasb:// for Azure Storage, abfs:// for Azure Data Lake Storage Gen2, or adl:// for Azure Data Lake Storage Gen1. If secure transfer is enabled for Azure Storage or Data Lake Storage Gen2, the URI is wasbs:// or abfss://, respectively. For more information, see secure transfer.

  • Option 1: An SSH client. For more information, see Connect to HDInsight (Apache Hadoop) using SSH. Most of the steps in this document assume that you are using Beeline from an SSH session to the cluster.

  • Option 2: A local Beeline client.

Run a Hive query

This example is based on using the Beeline client from an SSH connection.

  1. Open an SSH connection to the cluster with the command below. Replace sshuser with the SSH user for your cluster, and replace CLUSTERNAME with the name of your cluster. When prompted, enter the password for the SSH user account.

    ssh sshuser@CLUSTERNAME-ssh.azurehdinsight.cn
    
  2. Connect to HiveServer2 with your Beeline client from your open SSH session by entering the following command:

    beeline -u 'jdbc:hive2://headnodehost:10001/;transportMode=http'
    
  3. Beeline commands begin with a ! character, for example !help displays help. However, the ! can be omitted for some commands. For example, help also works.

    There's !sql, which is used to execute HiveQL statements. However, HiveQL is so commonly used that you can omit the preceding !sql. The following two statements are equivalent:

    !sql show tables;
    show tables;
    

    On a new cluster, only one table is listed: hivesampletable.
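    Other meta-commands work the same way; for example (a brief, illustrative sketch; availability can vary slightly between Beeline versions), !outputformat csv2 changes how results are displayed, !tables lists tables through JDBC metadata, and !quit exits the shell:

    !outputformat csv2
    !tables
    !quit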

  4. Use the following command to display the schema for the hivesampletable:

    describe hivesampletable;
    

    This command returns the following information:

     +-----------------------+------------+----------+--+
     |       col_name        | data_type  | comment  |
     +-----------------------+------------+----------+--+
     | clientid              | string     |          |
     | querytime             | string     |          |
     | market                | string     |          |
     | deviceplatform        | string     |          |
     | devicemake            | string     |          |
     | devicemodel           | string     |          |
     | state                 | string     |          |
     | country               | string     |          |
     | querydwelltime        | double     |          |
     | sessionid             | bigint     |          |
     | sessionpagevieworder  | bigint     |          |
     +-----------------------+------------+----------+--+
    

    This information describes the columns in the table.

  5. Enter the following statements to create a table named log4jLogs by using sample data provided with the HDInsight cluster (revise as needed based on your URI scheme):

    DROP TABLE log4jLogs;
    CREATE EXTERNAL TABLE log4jLogs (
        t1 string,
        t2 string,
        t3 string,
        t4 string,
        t5 string,
        t6 string,
        t7 string)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ' '
    STORED AS TEXTFILE LOCATION 'wasbs:///example/data/';
    SELECT t4 AS sev, COUNT(*) AS count FROM log4jLogs 
        WHERE t4 = '[ERROR]' AND INPUT__FILE__NAME LIKE '%.log' 
        GROUP BY t4;
    

    These statements do the following actions:

    • DROP TABLE - If the table exists, it's deleted.

    • CREATE EXTERNAL TABLE - Creates an external table in Hive. External tables only store the table definition in Hive. The data is left in the original location.

    • ROW FORMAT - How the data is formatted. In this case, the fields in each log are separated by a space.

    • STORED AS TEXTFILE LOCATION - Where the data is stored and in what file format.

    • SELECT - Selects a count of all rows where column t4 contains the value [ERROR]. This query returns a value of 3 because there are three rows that contain this value.

    • INPUT__FILE__NAME LIKE '%.log' - Hive attempts to apply the schema to all files in the directory. In this case, the directory contains files that do not match the schema. To prevent garbage data in the results, this statement tells Hive that it should only return data from files ending in .log.

    Note

    External tables should be used when you expect the underlying data to be updated by an external source, for example an automated data upload process or a MapReduce operation.

    Dropping an external table does not delete the data, only the table definition.

    The output of this command is similar to the following text:

    INFO  : Tez session hasn't been created yet. Opening session
    INFO  :
    
    INFO  : Status: Running (Executing on YARN cluster with App id application_1443698635933_0001)
    
    INFO  : Map 1: -/-      Reducer 2: 0/1
    INFO  : Map 1: 0/1      Reducer 2: 0/1
    INFO  : Map 1: 0/1      Reducer 2: 0/1
    INFO  : Map 1: 0/1      Reducer 2: 0/1
    INFO  : Map 1: 0/1      Reducer 2: 0/1
    INFO  : Map 1: 0(+1)/1  Reducer 2: 0/1
    INFO  : Map 1: 0(+1)/1  Reducer 2: 0/1
    INFO  : Map 1: 1/1      Reducer 2: 0/1
    INFO  : Map 1: 1/1      Reducer 2: 0(+1)/1
    INFO  : Map 1: 1/1      Reducer 2: 1/1
    +----------+--------+--+
    |   sev    | count  |
    +----------+--------+--+
    | [ERROR]  | 3      |
    +----------+--------+--+
    1 row selected (47.351 seconds)
    

Run a HiveQL file

This is a continuation from the prior example. Use the following steps to create a file, then run it using Beeline.

  1. Use the following command to create a file named query.hql:

    nano query.hql
    
  2. Use the following text as the contents of the file. This query creates a new 'internal' table named errorLogs:

    CREATE TABLE IF NOT EXISTS errorLogs (t1 string, t2 string, t3 string, t4 string, t5 string, t6 string, t7 string) STORED AS ORC;
    INSERT OVERWRITE TABLE errorLogs SELECT t1, t2, t3, t4, t5, t6, t7 FROM log4jLogs WHERE t4 = '[ERROR]' AND INPUT__FILE__NAME LIKE '%.log';
    

    These statements do the following actions:

    • CREATE TABLE IF NOT EXISTS - If the table doesn't already exist, it's created. Since the EXTERNAL keyword isn't used, this statement creates an internal table. Internal tables are stored in the Hive data warehouse and are managed completely by Hive.

    • STORED AS ORC - Stores the data in Optimized Row Columnar (ORC) format. ORC is a highly optimized and efficient format for storing Hive data.

    • INSERT OVERWRITE ... SELECT - Selects rows from the log4jLogs table that contain [ERROR], then inserts the data into the errorLogs table.

    Note

    Unlike external tables, dropping an internal table deletes the underlying data as well.

  3. To save the file, press Ctrl+X, then enter Y, and finally Enter.

  4. Use the following command to run the file using Beeline:

    beeline -u 'jdbc:hive2://headnodehost:10001/;transportMode=http' -i query.hql
    

    Note

    The -i parameter starts Beeline and runs the statements in the query.hql file. Once the query completes, you arrive at the jdbc:hive2://headnodehost:10001/> prompt. You can also run a file using the -f parameter, which exits Beeline after the query completes; a brief example follows these steps.

  5. To verify that the errorLogs table was created, use the following statement to return all the rows from errorLogs:

    SELECT * from errorLogs;
    

    Three rows of data should be returned, all containing [ERROR] in column t4:

     +---------------+---------------+---------------+---------------+---------------+---------------+---------------+--+
     | errorlogs.t1  | errorlogs.t2  | errorlogs.t3  | errorlogs.t4  | errorlogs.t5  | errorlogs.t6  | errorlogs.t7  |
     +---------------+---------------+---------------+---------------+---------------+---------------+---------------+--+
     | 2012-02-03    | 18:35:34      | SampleClass0  | [ERROR]       | incorrect     | id            |               |
     | 2012-02-03    | 18:55:54      | SampleClass1  | [ERROR]       | incorrect     | id            |               |
     | 2012-02-03    | 19:25:27      | SampleClass4  | [ERROR]       | incorrect     | id            |               |
     +---------------+---------------+---------------+---------------+---------------+---------------+---------------+--+
     3 rows selected (0.813 seconds)
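
As referenced in the note above, a minimal sketch of running the same file non-interactively uses the -f parameter instead of -i, so Beeline exits once the statements in query.hql complete:

beeline -u 'jdbc:hive2://headnodehost:10001/;transportMode=http' -f query.hql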
    

Install beeline client

Although Beeline is included on the head nodes of your HDInsight cluster, you may want to install it on a local machine. The steps below to install Beeline on a local machine are based on the Windows Subsystem for Linux.

  1. Update package lists. Enter the following command in your bash shell:

    sudo apt-get update
    
  2. Install Java if it isn't already installed. You can check with the which java command.

    1. If no Java package is installed, enter the following command:

      sudo apt install openjdk-11-jre-headless
      
    2. Open the bashrc file (usually found at ~/.bashrc): nano ~/.bashrc.

    3. Amend the bashrc file. Add the following line at the end of the file:

      export JAVA_HOME=/usr/lib/jvm/java-1.11.0-openjdk-amd64
      

      Then press Ctrl+X, then Y, then Enter.

  3. Download the Hadoop and Beeline archives by entering the following commands:

    wget https://archive.apache.org/dist/hadoop/core/hadoop-2.7.3/hadoop-2.7.3.tar.gz
    wget https://archive.apache.org/dist/hive/hive-1.2.1/apache-hive-1.2.1-bin.tar.gz
    
  4. Unpack the archives by entering the following commands:

    tar -xvzf hadoop-2.7.3.tar.gz
    tar -xvzf apache-hive-1.2.1-bin.tar.gz
    
  5. Further amend the bashrc file. You'll need to identify the path where the archives were unpacked. If you're using the Windows Subsystem for Linux and followed the steps exactly, your path would be /mnt/c/Users/user/, where user is your user name.

    1. Open the file: nano ~/.bashrc

    2. Modify the commands below with the appropriate path, and then enter them at the end of the bashrc file:

      export HADOOP_HOME=/path_where_the_archives_were_unpacked/hadoop-2.7.3
      export HIVE_HOME=/path_where_the_archives_were_unpacked/apache-hive-1.2.1-bin
      PATH=$PATH:$HIVE_HOME/bin
      
    3. Then press Ctrl+X, then Y, then Enter.

  6. Close and then reopen your bash session.

  7. Test your connection. Use the connection format from Over public or private endpoints, above.
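
    For instance, a minimal test from the local client against the public endpoint might be the following (clustername, admin, and password are placeholders for your own values):

    beeline -u 'jdbc:hive2://clustername.azurehdinsight.cn:443/;ssl=true;transportMode=http;httpPath=/hive2' -n admin -p password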

Next steps