Run MapReduce jobs with Apache Hadoop on HDInsight using REST

Learn how to use the Apache Hive WebHCat REST API to run MapReduce jobs on an Apache Hadoop on HDInsight cluster. Curl is used to demonstrate how you can interact with HDInsight by using raw HTTP requests to run MapReduce jobs.

Note

If you are already familiar with using Linux-based Hadoop servers, but you are new to HDInsight, see the What you need to know about Linux-based Apache Hadoop on HDInsight document.

Prerequisites

Either:

  • Windows PowerShell, or
  • Curl with jq

Run a MapReduce job

Note

When you use Curl or any other REST communication with WebHCat, you must authenticate the requests by providing the HDInsight cluster administrator user name and password. You must use the cluster name as part of the URI that is used to send the requests to the server.

The REST API is secured by using basic access authentication. You should always make requests by using HTTPS to ensure that your credentials are securely sent to the server.
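As an illustrative sketch (not part of the official tooling; the helper names and values are placeholders), basic access authentication amounts to sending an `Authorization: Basic <base64(user:password)>` header over HTTPS. The following Python fragment shows how the WebHCat base URI and that header are formed:

```python
import base64

def webhcat_base_url(cluster_name: str) -> str:
    # Every WebHCat request in this article shares this URI prefix.
    return f"https://{cluster_name}.azurehdinsight.cn/templeton/v1"

def basic_auth_header(user: str, password: str) -> str:
    # Basic access authentication: base64-encode "user:password".
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    return f"Basic {token}"
```

For example, `basic_auth_header("admin", "pw")` returns `Basic YWRtaW46cHc=`. The credentials are only encoded, not encrypted, which is why the requests should always go over HTTPS.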

Curl

  1. For ease of use, set the variables below. This example is based on a Windows environment; revise as needed for your environment.

    set CLUSTERNAME=
    set PASSWORD=
    
  2. From a command line, use the following command to verify that you can connect to your HDInsight cluster:

    curl -u admin:%PASSWORD% -G https://%CLUSTERNAME%.azurehdinsight.cn/templeton/v1/status
    

    The parameters used in this command are as follows:

    • -u: Indicates the user name and password used to authenticate the request
    • -G: Indicates that this operation is a GET request

    The beginning of the URI, https://CLUSTERNAME.azurehdinsight.cn/templeton/v1, is the same for all requests.

    You receive a response similar to the following JSON:

    {"version":"v1","status":"ok"}
    
  3. To submit a MapReduce job, use the following command. Modify the path to jq as needed.

    curl -u admin:%PASSWORD% -d user.name=admin ^
    -d jar=/example/jars/hadoop-mapreduce-examples.jar ^
    -d class=wordcount -d arg=/example/data/gutenberg/davinci.txt -d arg=/example/data/output ^
    https://%CLUSTERNAME%.azurehdinsight.cn/templeton/v1/mapreduce/jar | ^
    C:\HDI\jq-win64.exe .id
    

    The end of the URI (/mapreduce/jar) tells WebHCat that this request starts a MapReduce job from a class in a jar file. The parameters used in this command are as follows:

    • -d: -G isn't used, so the request defaults to the POST method. -d specifies the data values that are sent with the request.
      • user.name: The user who is running the command
      • jar: The location of the jar file that contains the class to run
      • class: The class that contains the MapReduce logic
      • arg: The arguments to be passed to the MapReduce job. In this case, the input text file and the directory that are used for the output

    This command should return a job ID that can be used to check the status of the job:

    job_1415651640909_0026
    
  4. To check the status of the job, use the following command. Replace the value for JOBID with the actual value returned in the previous step. Revise the location of jq as needed.

    set JOBID=job_1415651640909_0026
    
    curl -G -u admin:%PASSWORD% -d user.name=admin https://%CLUSTERNAME%.azurehdinsight.cn/templeton/v1/jobs/%JOBID% | ^
    C:\HDI\jq-win64.exe .status.state
    
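The curl steps above can be sketched in Python to show exactly what WebHCat receives. This is a hedged illustration only: the helper names are invented, no cluster is contacted, and the sketch just builds the form body and parses a sample response.

```python
import json
from urllib.parse import urlencode

def mapreduce_jar_body(user: str, jar: str, job_class: str, args: list) -> str:
    # Each -d flag becomes one form field; "arg" repeats once per value,
    # matching the two -d arg=... flags in the curl command above.
    fields = [("user.name", user), ("jar", jar), ("class", job_class)]
    fields += [("arg", a) for a in args]
    return urlencode(fields)

def job_id(response_text: str) -> str:
    # Equivalent of piping the JSON response through `jq .id`.
    return json.loads(response_text)["id"]
```

For the example job, `mapreduce_jar_body("admin", "/example/jars/hadoop-mapreduce-examples.jar", "wordcount", ["/example/data/gutenberg/davinci.txt", "/example/data/output"])` produces the POST body that curl sends to the /mapreduce/jar endpoint.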

PowerShell

  1. For ease of use, set the variables below. Replace CLUSTERNAME with your actual cluster name. Execute the command and enter the cluster login password when prompted.

    $clusterName="CLUSTERNAME"
    $creds = Get-Credential -UserName admin -Message "Enter the cluster login password"
    
  2. Use the following command to verify that you can connect to your HDInsight cluster:

    $resp = Invoke-WebRequest -Uri "https://$clustername.azurehdinsight.cn/templeton/v1/status" `
        -Credential $creds `
        -UseBasicParsing
    $resp.Content
    

    You receive a response similar to the following JSON:

    {"version":"v1","status":"ok"}
    
  3. To submit a MapReduce job, use the following command:

    $reqParams = @{}
    $reqParams."user.name" = "admin"
    $reqParams.jar = "/example/jars/hadoop-mapreduce-examples.jar"
    $reqParams.class = "wordcount"
    $reqParams.arg = @()
    $reqParams.arg += "/example/data/gutenberg/davinci.txt"
    $reqparams.arg += "/example/data/output"
    $resp = Invoke-WebRequest -Uri "https://$clusterName.azurehdinsight.cn/templeton/v1/mapreduce/jar" `
       -Credential $creds `
       -Body $reqParams `
       -Method POST `
       -UseBasicParsing
    $jobID = (ConvertFrom-Json $resp.Content).id
    $jobID
    

    The end of the URI (/mapreduce/jar) tells WebHCat that this request starts a MapReduce job from a class in a jar file. The parameters used in this command are as follows:

    • user.name: The user who is running the command

    • jar: The location of the jar file that contains the class to run

    • class: The class that contains the MapReduce logic

    • arg: The arguments to be passed to the MapReduce job. In this case, the input text file and the directory that are used for the output

      This command should return a job ID that can be used to check the status of the job:

      job_1415651640909_0026

  4. To check the status of the job, use the following command:

    $reqParams=@{"user.name"="admin"}
    $resp = Invoke-WebRequest -Uri "https://$clusterName.azurehdinsight.cn/templeton/v1/jobs/$jobID" `
       -Credential $creds `
       -Body $reqParams `
       -UseBasicParsing
    # ConvertFrom-JSON can't handle duplicate names with different case
    # So change one to prevent the error
    $fixDup=$resp.Content.Replace("jobID","job_ID")
    (ConvertFrom-Json $fixDup).status.state
    
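The Replace workaround in the script above exists because ConvertFrom-Json rejects response keys that differ only in case (the script renames jobID to job_ID to avoid the clash). As a point of comparison, and purely as a sketch against a made-up sample payload, Python's json module treats such keys as distinct, so extracting the state needs no renaming:

```python
import json

def job_state(response_text: str) -> str:
    # Equivalent of `jq .status.state`; keys differing only in case
    # are separate dictionary keys in Python, so no rename is needed.
    return json.loads(response_text)["status"]["state"]
```

For example, `job_state('{"status": {"jobId": "x", "jobID": "y", "state": "SUCCEEDED"}}')` returns `SUCCEEDED`.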

Both methods

If the job is complete, the state returned is `SUCCEEDED`. When the state of the job has changed to `SUCCEEDED`, you can retrieve the results of the job from Azure Blob storage. The statusdir parameter that is passed with the query contains the location of the output file; in this example, the location is /example/curl. This address stores the output of the job in the cluster's default storage at /example/curl.

You can list and download these files by using the Azure CLI. For more information on working with blobs from the Azure CLI, see the Using the Azure CLI with Azure Storage document.

Next steps

For information about other ways you can work with Hadoop on HDInsight:

For more information about the REST interface that is used in this article, see the WebHCat Reference.