使用 REST 在 HDInsight 中通过 Apache Hadoop 运行 Apache Hive 查询Run Apache Hive queries with Apache Hadoop in HDInsight using REST

了解如何使用 WebHCat REST API 通过 Apache Hadoop on Azure HDInsight 群集运行 Apache Hive 查询。Learn how to use the WebHCat REST API to run Apache Hive queries with Apache Hadoop on Azure HDInsight cluster.

必备条件Prerequisites

用于 Rest API 的基 URIBase URI for Rest API

HDInsight 上 REST API 的基本统一资源标识符 (URI) 为 https://CLUSTERNAME.azurehdinsight.cn/api/v1/clusters/CLUSTERNAME,其中 CLUSTERNAME 是群集的名称。The base Uniform Resource Identifier (URI) for the REST API on HDInsight is https://CLUSTERNAME.azurehdinsight.cn/api/v1/clusters/CLUSTERNAME, where CLUSTERNAME is the name of your cluster. URI 中的群集名称区分大小写Cluster names in URIs are case-sensitive. 虽然 URI (CLUSTERNAME.azurehdinsight.cn) 的完全限定域名 (FQDN) 部分中的群集名称不区分大小写,但 URI 中的其他部分是区分大小写的。While the cluster name in the fully qualified domain name (FQDN) part of the URI (CLUSTERNAME.azurehdinsight.cn) is case-insensitive, other occurrences in the URI are case-sensitive.

身份验证Authentication

使用 cURL 或者与 WebHCat 进行任何其他形式的 REST 通信时,必须提供 HDInsight 群集管理员的用户名和密码以对请求进行身份验证。When using cURL or any other REST communication with WebHCat, you must authenticate the requests by providing the user name and password for the HDInsight cluster administrator. REST API 通过 基本身份验证进行保护。The REST API is secured via basic authentication. 为了有助于确保将凭据安全地发送到服务器,应始终使用安全 HTTP (HTTPS) 发出请求。To help ensure that your credentials are securely sent to the server, always make requests by using Secure HTTP (HTTPS).

设置(保留凭据)Setup (Preserve credentials)

请保留凭据,以免在每个示例中重复输入。Preserve your credentials to avoid reentering them for each example. 群集名称将在单独的步骤中保留。The cluster name will be preserved in a separate step.

A.BashA. Bash
编辑以下脚本,将 PASSWORD 替换为实际密码。Edit the script below by replacing PASSWORD with your actual password. 然后输入该命令。Then enter the command.

export password='PASSWORD'

B.PowerShell 执行以下代码并在弹出窗口中输入凭据:B. PowerShell Execute the code below and enter your credentials at the pop-up window:

$creds = Get-Credential -UserName "admin" -Message "Enter the HDInsight login"

识别大小写正确的群集名称Identify correctly cased cluster name

群集名称的实际大小写格式可能出乎预期,具体取决于群集的创建方式。The actual casing of the cluster name may be different than you expect, depending on how the cluster was created. 此处的步骤将显示实际大小写,然后将其存储在某个变量中,以便在后续示例中使用。The steps here will show the actual casing, and then store it in a variable for all subsequent examples.

编辑以下脚本,将 CLUSTERNAME 替换为群集名称。Edit the scripts below to replace CLUSTERNAME with your cluster name. 然后输入该命令。Then enter the command. (FQDN 的群集名称不区分大小写。)(The cluster name for the FQDN is not case-sensitive.)

export clusterName=$(curl -u admin:$password -sS -G "https://CLUSTERNAME.azurehdinsight.cn/api/v1/clusters" | jq -r '.items[].Clusters.cluster_name')
echo $clusterName
# Identify properly cased cluster name
$resp = Invoke-WebRequest -Uri "https://CLUSTERNAME.azurehdinsight.cn/api/v1/clusters" `
    -Credential $creds -UseBasicParsing
$clusterName = (ConvertFrom-Json $resp.Content).items.Clusters.cluster_name;

# Show cluster name
$clusterName

运行 Hive 查询Run a Hive query

  1. 若要验证是否可以连接到 HDInsight 群集,请使用下列命令之一:To verify that you can connect to your HDInsight cluster, use one of the following commands:

    curl -u admin:$password -G https://$CLUSTERNAME.azurehdinsight.cn/templeton/v1/status
    
    $resp = Invoke-WebRequest -Uri "https://$clusterName.azurehdinsight.cn/templeton/v1/status" `
       -Credential $creds `
       -UseBasicParsing
    $resp.Content
    

    将收到类似于以下文本的响应:You receive a response similar to the following text:

    {"status":"ok","version":"v1"}
    

    此命令中使用的参数如下:The parameters used in this command are as follows:

    • -u - 用来对请求进行身份验证的用户名和密码。-u - The user name and password used to authenticate the request.
    • -G - 指示此请求是一个 GET 操作。-G - Indicates that this request is a GET operation.

    URL 的开头 (https://$CLUSTERNAME.azurehdinsight.cn/templeton/v1) 对于所有请求都是相同的。The beginning of the URL, https://$CLUSTERNAME.azurehdinsight.cn/templeton/v1, is the same for all requests. 路径 /status 指示请求将返回服务器的 WebHCat(也称为 Templeton)状态。The path, /status, indicates that the request is to return a status of WebHCat (also known as Templeton) for the server. 还可以通过使用以下命令请求 Hive 的版本:You can also request the version of Hive by using the following command:

    curl -u admin:$password -G https://$clusterName.azurehdinsight.cn/templeton/v1/version/hive
    
    $resp = Invoke-WebRequest -Uri "https://$clusterName.azurehdinsight.cn/templeton/v1/version/hive" `
       -Credential $creds `
       -UseBasicParsing
    $resp.Content
    

    此请求返回的响应类似于以下文本:This request returns a response similar to the following text:

    {"module":"hive","version":"1.2.1000.2.6.5.3008-11"}
    
  2. 使用以下命令创建名为 log4jLogs 的表:Use the following to create a table named log4jLogs:

    jobid=$(curl -s -u admin:$password -d user.name=admin -d execute="DROP+TABLE+log4jLogs;CREATE+EXTERNAL+TABLE+log4jLogs(t1+string,t2+string,t3+string,t4+string,t5+string,t6+string,t7+string)+ROW+FORMAT+DELIMITED+FIELDS+TERMINATED+BY+' '+STORED+AS+TEXTFILE+LOCATION+'/example/data/';SELECT+t4+AS+sev,COUNT(*)+AS+count+FROM+log4jLogs+WHERE+t4+=+'[ERROR]'+AND+INPUT__FILE__NAME+LIKE+'%25.log'+GROUP+BY+t4;" -d statusdir="/example/rest" https://$clusterName.azurehdinsight.cn/templeton/v1/hive | jq -r .id)
    echo $jobid
    
    $reqParams = @{"user.name"="admin";"execute"="DROP TABLE log4jLogs;CREATE EXTERNAL TABLE log4jLogs(t1 string, t2 string, t3 string, t4 string, t5 string, t6 string, t7 string) ROW FORMAT DELIMITED BY ' ' STORED AS TEXTFILE LOCATION '/example/data/;SELECT t4 AS sev,COUNT(*) AS count FROM log4jLogs WHERE t4 = '[ERROR]' GROUP BY t4;";"statusdir"="/example/rest"}
    $resp = Invoke-WebRequest -Uri "https://$clusterName.azurehdinsight.cn/templeton/v1/hive" `
       -Credential $creds `
       -Body $reqParams `
       -Method POST `
       -UseBasicParsing
    $jobID = (ConvertFrom-Json $resp.Content).id
    $jobID
    

    此请求使用 POST 方法,该方法将数据作为请求的一部分发送到 REST API。This request uses the POST method, which sends data as part of the request to the REST API. 将随请求发送以下数据值:The following data values are sent with the request:

    • user.name - 正在运行命令的用户。user.name - The user that is running the command.
    • execute - 要执行的 HiveQL 语句。execute - The HiveQL statements to execute.
    • statusdir - 此作业的状态要写入到的目录。statusdir - The directory that the status for this job is written to.

    这些语句执行以下操作:These statements perform the following actions:

    • DROP TABLE - 如果表已存在,则删除该表。DROP TABLE - If the table already exists, it is deleted.

    • CREATE EXTERNAL TABLE - 在 Hive 中创建一个新的“外部”表。CREATE EXTERNAL TABLE - Creates a new 'external' table in Hive. 外部表仅在 Hive 中存储表定义。External tables store only the table definition in Hive. 数据保留在原始位置。The data is left in the original location.

      Note

      如果希望通过外部源更新基础数据,应使用外部表。External tables should be used when you expect the underlying data to be updated by an external source. 例如,使用自动化数据上传过程或其他 MapReduce 操作。For example, an automated data upload process or another MapReduce operation.

      删除外部表 不会 删除数据,只会删除表定义。Dropping an external table does not delete the data, only the table definition.

    • ROW FORMAT - 如何设置数据的格式。ROW FORMAT - How the data is formatted. 每个日志中的字段都用空格分隔。The fields in each log are separated by a space.

    • STORED AS TEXTFILE LOCATION - 数据的存储位置(example/data 目录),并且数据存储为文本。STORED AS TEXTFILE LOCATION - Where the data is stored (the example/data directory) and that it is stored as text.

    • SELECT - 选择 t4 列包含值 [ERROR] 的所有行的计数。SELECT - Selects a count of all rows where column t4 contains the value [ERROR]. 此语句返回的值为 3,因为有三行包含此值。This statement returns a value of 3 as there are three rows that contain this value.

      Note

      请注意,在与 Curl 配合使用时,将用 + 字符替换 HiveQL 语句之间的空格。Notice that the spaces between HiveQL statements are replaced by the + character when used with Curl. 如果带引号的值包含空格(例如分隔符),则不应替换为 +Quoted values that contain a space, such as the delimiter, should not be replaced by +.

      此命令会返回可用来检查作业状态的作业 ID。This command returns a job ID that can be used to check the status of the job.

  3. 若要检查作业的状态,请使用以下命令:To check the status of the job, use the following command:

    curl -u admin:$password -d user.name=admin -G https://$clusterName.azurehdinsight.cn/templeton/v1/jobs/$jobid | jq .status.state
    
    $reqParams=@{"user.name"="admin"}
    $resp = Invoke-WebRequest -Uri "https://$clusterName.azurehdinsight.cn/templeton/v1/jobs/$jobID" `
       -Credential $creds `
       -Body $reqParams `
       -UseBasicParsing
    # ConvertFrom-JSON can't handle duplicate names with different case
    # So change one to prevent the error
    $fixDup=$resp.Content.Replace("jobID","job_ID")
    (ConvertFrom-Json $fixDup).status.state
    

    如果作业已完成,状态是 SUCCEEDEDIf the job has finished, the state is SUCCEEDED.

  4. 在作业的状态更改为“SUCCEEDED” 后,可以从 Azure Blob 存储中检索作业的结果。Once the state of the job has changed to SUCCEEDED, you can retrieve the results of the job from Azure Blob storage. 随查询一起传递的 statusdir 参数包含输出文件的位置;在本例中,该位置为 /example/restThe statusdir parameter passed with the query contains the location of the output file; in this case, /example/rest. 此地址将输出存储在群集默认存储中的 example/curl 目录。This address stores the output in the example/curl directory in the clusters default storage.

    可以使用 Azure CLI 列出并下载这些文件。You can list and download these files by using the Azure CLI. 有关将 Azure CLI 与 Azure 存储配合使用的详细信息,请参阅将 Azure CLI 与 Azure 存储配合使用文档。For more information on using the Azure CLI with Azure Storage, see the Use Azure CLI with Azure Storage document.

后续步骤Next steps

有关 HDInsight 上 Hadoop 的其他使用方法的信息:For information on other ways you can work with Hadoop on HDInsight:

有关在本文档中使用的 REST API 的详细信息,请参阅 WebHCat 参考文档。For more information on the REST API used in this document, see the WebHCat reference document.