Run Apache Pig jobs with Apache Hadoop on HDInsight by using REST

Learn how to run Apache Pig Latin jobs by making REST requests to an Azure HDInsight cluster. Curl is used to demonstrate how you can interact with HDInsight by using the WebHCat REST API.

Note

If you are already familiar with using Linux-based Apache Hadoop servers, but are new to HDInsight, see Linux-based HDInsight Tips.

Prerequisites

  • An Azure HDInsight (Hadoop on HDInsight) cluster (Linux-based or Windows-based)

    Important

    Linux is the only operating system used on HDInsight version 3.4 or greater. For more information, see HDInsight retirement on Windows.

  • Curl

  • jq

Run Apache Pig jobs by using Curl

Note

The REST API is secured via basic access authentication. Always make requests by using Secure HTTP (HTTPS) to ensure that your credentials are securely sent to the server.

When using the commands in this section, replace USERNAME with the user to authenticate to the cluster, and replace PASSWORD with the password for the user account. Replace CLUSTERNAME with the name of your cluster.

  1. From a command line, use the following command to verify that you can connect to your HDInsight cluster:

    curl -u USERNAME:PASSWORD -G https://CLUSTERNAME.azurehdinsight.cn/templeton/v1/status
    

    You should receive the following JSON response:

     {"status":"ok","version":"v1"}
    

    The parameters used in this command are as follows:

    • -u: The user name and password used to authenticate the request

    • -G: Indicates that this request is a GET request

      The beginning of the URL, https://CLUSTERNAME.azurehdinsight.cn/templeton/v1, is the same for all requests. The path, /status, indicates that the request is to return the status of WebHCat (also known as Templeton) for the server.
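
    To avoid retyping the cluster name and credentials in every command, you can optionally keep them in shell variables. The following is a minimal sketch, assuming a Bash-like shell; the values shown are placeholders for your own cluster name and cluster login:

    CLUSTERNAME=CLUSTERNAME            # replace with your cluster name
    USERNAME=USERNAME                  # replace with your cluster login (HTTP) user
    read -s -p 'Password: ' PASSWORD   # prompts for the password without echoing it

    curl -u "$USERNAME:$PASSWORD" -G "https://$CLUSTERNAME.azurehdinsight.cn/templeton/v1/status"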

  2. Use the following command to submit a Pig Latin job to the cluster:

    curl -u USERNAME:PASSWORD -d user.name=USERNAME -d execute="LOGS=LOAD+'/example/data/sample.log';LEVELS=foreach+LOGS+generate+REGEX_EXTRACT($0,'(TRACE|DEBUG|INFO|WARN|ERROR|FATAL)',1)+as+LOGLEVEL;FILTEREDLEVELS=FILTER+LEVELS+by+LOGLEVEL+is+not+null;GROUPEDLEVELS=GROUP+FILTEREDLEVELS+by+LOGLEVEL;FREQUENCIES=foreach+GROUPEDLEVELS+generate+group+as+LOGLEVEL,COUNT(FILTEREDLEVELS.LOGLEVEL)+as+count;RESULT=order+FREQUENCIES+by+count+desc;DUMP+RESULT;" -d statusdir="/example/pigcurl" https://CLUSTERNAME.azurehdinsight.cn/templeton/v1/pig
    

    The parameters used in this command are as follows:

    • -d: Because -G is not used, the request defaults to the POST method. -d specifies the data values that are sent with the request.

    • user.name: The user who is running the command

    • execute: The Pig Latin statements to execute

    • statusdir: The directory that the status for this job is written to

      Note

      Notice that the spaces in the Pig Latin statements are replaced by the + character when used with Curl. The same script is shown in readable form at the end of this step.

      This command should return a job ID that can be used to check the status of the job, for example:

      {"id":"job_1415651640909_0026"}{"id":"job_1415651640909_0026"}

  3. To check the status of the job, use the following command:

    curl -G -u USERNAME:PASSWORD -d user.name=USERNAME https://CLUSTERNAME.azurehdinsight.cn/templeton/v1/jobs/JOBID | jq .status.state
    

    Replace JOBID with the value returned in the previous step. For example, if the return value was {"id":"job_1415651640909_0026"}, then JOBID is job_1415651640909_0026.

    If the job has finished, the state is SUCCEEDED.

    Note

    This Curl request returns a JavaScript Object Notation (JSON) document with information about the job; jq is used to retrieve only the state value.
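
    If you want to wait for the job to finish from a script, you can poll this endpoint until the state reaches a terminal value. The following is a minimal sketch, assuming a Bash-like shell and the same USERNAME, PASSWORD, CLUSTERNAME, and JOBID placeholders used above; the terminal state names shown (SUCCEEDED, FAILED, KILLED) are the usual Hadoop job states:

    STATE=""
    # Poll every 10 seconds until the job reaches a terminal state.
    while [ "$STATE" != "SUCCEEDED" ] && [ "$STATE" != "FAILED" ] && [ "$STATE" != "KILLED" ]; do
        sleep 10
        STATE=$(curl -s -G -u USERNAME:PASSWORD -d user.name=USERNAME "https://CLUSTERNAME.azurehdinsight.cn/templeton/v1/jobs/JOBID" | jq -r .status.state)
        echo "Current state: $STATE"
    done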

View results

When the state of the job has changed to SUCCEEDED, you can retrieve the results of the job. The statusdir parameter passed with the job submission contains the location of the output files; in this case, /example/pigcurl.

HDInsight can use Azure Storage as the default data store. For more information, see the storage section of the Linux-based HDInsight information document.
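
The status directory is written to the cluster's default storage, so its contents can be read with any tool that can access that storage. One option, sketched below, assumes you have SSH access to the cluster head node and that the standard -ssh endpoint name applies; the job's console output is written to a stdout file under the statusdir path:

    # Connect to the cluster head node over SSH (the -ssh endpoint name is an assumption),
    # then print the job output written to the status directory.
    ssh USERNAME@CLUSTERNAME-ssh.azurehdinsight.cn
    hdfs dfs -cat /example/pigcurl/stdout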

Summary

As demonstrated in this document, you can use raw HTTP requests to run, monitor, and view the results of Pig jobs on your HDInsight cluster.

For more information about the REST interface used in this article, see the WebHCat Reference.

Next steps

For general information about Pig on HDInsight:

For information about other ways you can work with Hadoop on HDInsight: