在 HDInsight 中通过 JDBC 驱动程序查询 Apache HiveQuery Apache Hive through the JDBC driver in HDInsight

了解如何从 Java 应用程序使用 JDBC 驱动程序将 Apache Hive 查询提交到 Azure HDInsight 中的 Apache Hadoop。Learn how to use the JDBC driver from a Java application to submit Apache Hive queries to Apache Hadoop in Azure HDInsight. 本文档中的信息演示如何以编程方式从 SQuirreL SQL 客户端进行连接。The information in this document demonstrates how to connect programmatically, and from the SQuirreL SQL client.

有关 Hive JDBC 接口的详细信息,请参阅 HiveJDBCInterfaceFor more information on the Hive JDBC Interface, see HiveJDBCInterface.

先决条件Prerequisites

JDBC 连接字符串JDBC connection string

JDBC 通过端口 443 连接到 Azure 上的 HDInsight 群集。JDBC connections to an HDInsight cluster on Azure are made over port 443. 使用 TLS/SSL 保护流量。The traffic is secured using TLS/SSL. 公用网关(群集位于其后)会将通信重定向到 HiveServer2 实际进行侦听的端口。The public gateway that the clusters sit behind redirects the traffic to the port that HiveServer2 is actually listening on. 以下连接字符串显示要用于 HDInsight 的格式:The following connection string shows the format to use for HDInsight:

    jdbc:hive2://CLUSTERNAME.azurehdinsight.cn:443/default;transportMode=http;ssl=true;httpPath=/hive2

CLUSTERNAME 替换为 HDInsight 群集的名称。Replace CLUSTERNAME with the name of your HDInsight cluster.

或者,可以通过“Ambari UI”>“Hive”>“配置”>“高级”获取连接。Or you can get the connection through Ambari UI > Hive > Configs > Advanced.

通过 Ambari 获取 JDBC 连接字符串

连接字符串中的主机名Host name in connection string

连接字符串中的主机名“CLUSTERNAME.azurehdinsight.cn”与群集 URL 相同。Host name 'CLUSTERNAME.azurehdinsight.cn' in the connection string is the same as your cluster URL. 可以通过 Azure 门户获取主机名。You can get it through Azure portal.

连接字符串中的端口Port in connection string

只能使用端口 443 从 Azure 虚拟网络外部的某个位置连接到群集。You can only use port 443 to connect to the cluster from some places outside of the Azure virtual network. HDInsight 是一种托管服务,这意味着与群集的所有连接都通过安全网关进行管理。HDInsight is a managed service, which means all connections to the cluster are managed via a secure Gateway. 不能直接在端口 10001 或 10000 上连接到 HiveServer 2。You can't connect to HiveServer 2 directly on ports 10001 or 10000. 这些端口不向外公开。These ports aren't exposed to the outside.

身份验证Authentication

建立连接时,请使用 HDInsight 群集管理员名称和密码进行身份验证。When establishing the connection, use the HDInsight cluster admin name and password to authenticate. 在 JDBC 客户端(如 SQuirreL SQL)的客户端设置中输入管理员名称和密码。From JDBC clients such as SQuirreL SQL, enter admin name and password in client settings.

从 Java 应用程序建立连接时,必须使用该名称和密码。From a Java application, you must use the name and password when establishing a connection. 例如,以下 Java 代码会打开新的连接:For example, the following Java code opens a new connection:

DriverManager.getConnection(connectionString,clusterAdmin,clusterPassword);

使用 SQuirreL SQL 客户端进行连接Connect with SQuirreL SQL client

SQuirreL SQL 是一个 JDBC 客户端,可用于通过 HDInsight 群集远程运行 Hive 查询。SQuirreL SQL is a JDBC client that can be used to remotely run Hive queries with your HDInsight cluster. 以下步骤假设已安装 SQuirreL SQL。The following steps assume that you have already installed SQuirreL SQL.

  1. 创建一个目录来包含要从群集复制的某些文件。Create a directory to contain certain files to be copied from your cluster.

  2. 在下面的脚本中,将 sshuser 替换为群集的 SSH 用户帐户名。In the following script, replace sshuser with the SSH user account name for the cluster. CLUSTERNAME 替换为 HDInsight 群集名称。Replace CLUSTERNAME with the HDInsight cluster name. 从命令行将工作目录更改为上一步中创建的目录,然后输入以下命令以从 HDInsight 群集复制文件:From a command line, change your work directory to the one created in the prior step, and then enter the following command to copy files from an HDInsight cluster:

    scp sshuser@CLUSTERNAME-ssh.azurehdinsight.cn:/usr/hdp/current/hadoop-client/{hadoop-auth.jar,hadoop-common.jar,lib/log4j-*.jar,lib/slf4j-*.jar,lib/curator-*.jar} .
    
    scp sshuser@CLUSTERNAME-ssh.azurehdinsight.cn:/usr/hdp/current/hive-client/lib/{commons-codec*.jar,commons-logging-*.jar,hive-*-*.jar,httpclient-*.jar,httpcore-*.jar,libfb*.jar,libthrift-*.jar} .
    
  3. 启动 SQuirreL SQL 应用程序。Start the SQuirreL SQL application. 在窗口左侧中,选择“驱动程序”。From the left of the window, select Drivers.

    窗口左侧的“驱动程序”选项卡

  4. 从“驱动程序” 对话框顶部的图标中,选择 + 图标创建驱动程序。From the icons at the top of the Drivers dialog, select the + icon to create a driver.

    驱动程序图标

  5. 在“添加驱动程序”对话框中,添加以下信息:In the Add Driver dialog, add the following information:

    属性Property ValueValue
    名称Name HiveHive
    示例 URLExample URL jdbc:hive2://localhost:443/default;transportMode=http;ssl=true;httpPath=/hive2
    额外类路径Extra Class Path 使用“添加”按钮添加此前下载的所有 jar 文件。Use the Add button to add the all of jar files downloaded earlier.
    类名Class Name org.apache.hive.jdbc.HiveDriverorg.apache.hive.jdbc.HiveDriver

    添加驱动程序对话框

    选择“确定” 保存这些设置。Select OK to save these settings.

  6. 在 SQuirreL SQL 窗口左侧,选择“别名”。On the left of the SQuirreL SQL window, select Aliases. 然后选择 + 图标来创建连接别名。Then select the + icon to create a connection alias.

    SQuirreL SQL“添加新别名”对话框

  7. 将以下值用于“添加别名” 对话框。Use the following values for the Add Alias dialog.

    属性Property ValueValue
    名称Name Hive on HDInsightHive on HDInsight
    驱动程序Driver 使用下拉列表选择 Hive 驱动程序。Use the drop-down to select the Hive driver.
    URLURL jdbc:hive2://CLUSTERNAME.azurehdinsight.cn:443/default;transportMode=http;ssl=true;httpPath=/hive2jdbc:hive2://CLUSTERNAME.azurehdinsight.cn:443/default;transportMode=http;ssl=true;httpPath=/hive2. CLUSTERNAME 替换为 HDInsight 群集的名称。Replace CLUSTERNAME with the name of your HDInsight cluster.
    用户名User Name HDInsight 群集的群集登录帐户名。The cluster login account name for your HDInsight cluster. 默认值为“admin”。The default is admin.
    密码Password 群集登录帐户的密码。The password for the cluster login account.

    添加别名对话框

    重要

    使用“测试” 按钮验证连接是否有效。Use the Test button to verify that the connection works. 出现“连接到: Hive on HDInsight”对话框时,选择“连接” 进行测试。When Connect to: Hive on HDInsight dialog appears, select Connect to perform the test. 如果测试成功,将会显示“连接成功” 对话框。If the test succeeds, you see a Connection successful dialog. 如果发生错误,请参阅故障排除If an error occurs, see Troubleshooting.

    若要保存连接别名,请使用“添加别名” 对话框底部的“确定” 按钮。To save the connection alias, use the Ok button at the bottom of the Add Alias dialog.

  8. 在 SQuirreL SQL 顶部的“连接到” 下拉列表中,选择“Hive on HDInsight”。From the Connect to dropdown at the top of SQuirreL SQL, select Hive on HDInsight. 出现提示时,选择“连接”。When prompted, select Connect.

    连接对话框

  9. 连接后,在 SQL 查询对话框中输入以下查询,然后选择“运行” 图标(一个正在跑步的人)。Once connected, enter the following query into the SQL query dialog, and then select the Run icon (a running person). 结果区域会显示查询的结果。The results area should show the results of the query.

    select * from hivesampletable limit 10;
    

    sql 查询对话框,其中包括结果

从 Java 应用程序示例进行连接Connect from an example Java application

https://github.com/Azure-Samples/hdinsight-java-hive-jdbc 上提供了使用 Java 客户端查询 Hive on HDInsight 的示例。An example of using a Java client to query Hive on HDInsight is available at https://github.com/Azure-Samples/hdinsight-java-hive-jdbc. 按照存储库中的说明生成并运行该示例。Follow the instructions in the repository to build and run the sample.

故障排除Troubleshooting

尝试打开 SQL 连接时发生意外错误Unexpected Error occurred attempting to open an SQL connection

症状:连接到 HDInsight 群集版本 3.3 或更高版本时,可能会遇到意外错误。Symptoms: When connecting to an HDInsight cluster that is version 3.3 or greater, you may receive an error that an unexpected error occurred. 此错误的堆栈跟踪的开头为以下行:The stack trace for this error begins with the following lines:

java.util.concurrent.ExecutionException: java.lang.RuntimeException: java.lang.NoSuchMethodError: org.apache.commons.codec.binary.Base64.<init>(I)V
at java.util.concurrent.FutureTas...(FutureTask.java:122)
at java.util.concurrent.FutureTask.get(FutureTask.java:206)

原因: 此错误由 SQuirreL 随附的较旧版本 commons-codec.jar 文件引起。Cause: This error is caused by an older version commons-codec.jar file included with SQuirreL.

解决方法:若要解决此错误,请使用以下步骤:Resolution: To fix this error, use the following steps:

  1. 退出 SQuirreL,并转到系统上安装 SQuirreL 的目录(可能是 C:\Program Files\squirrel-sql-4.0.0\lib)。Exit SQuirreL, and then go to the directory where SQuirreL is installed on your system, perhaps C:\Program Files\squirrel-sql-4.0.0\lib. 在 SquirreL 目录的 lib 目录下,将现有的 commons-codec.jar 替换为从 HDInsight 群集下载的文件。In the SquirreL directory, under the lib directory, replace the existing commons-codec.jar with the one downloaded from the HDInsight cluster.

  2. 重新启动 SQuirreL。Restart SQuirreL. 连接到 HDInsight 上的 Hive 时,应不再会出现该错误。The error should no longer occur when connecting to Hive on HDInsight.

HDInsight 断开了连接Connection disconnected by HDInsight

症状:尝试通过 JDBC/ODBC 下载大量数据(例如数 GB)时,下载过程中 HDInsight 意外地断开了连接。Symptoms: When trying to download huge amount of data (say several GBs) through JDBC/ODBC, the connection is disconnected by HDInsight unexpectedly while downloading.

原因: 此错误是由网关节点上的限制引起的。Cause: This error is caused by the limitation on Gateway nodes. 从 JDBC/ODBC 获取数据时,所有数据都需要通过网关节点。When getting data from JDBC/ODBC, all data needs to pass through the Gateway node. 但网关并非设计用于下载大量数据,因此,如果网关无法处理该流量,它可能会关闭连接。However, a gateway isn't designed to download a huge amount of data, so the Gateway might close the connection if it can't handle the traffic.

解决方法:避免使用 JDBC/ODBC 驱动程序下载大量数据,Resolution: Avoid using JDBC/ODBC driver to download huge amounts of data. 而是直接从 blob 存储复制数据。Copy data directly from blob storage instead.

后续步骤Next steps

现在,已了解如何将 JDBC 与 Hive 配合使用,请使用以下链接学习 Azure HDInsight 的其他用法。Now that you have learned how to use JDBC to work with Hive, use the following links to explore other ways to work with Azure HDInsight.