Run Apache Sqoop jobs with Hadoop in HDInsight with Curl

Learn how to use Curl to run Apache Sqoop jobs on an Apache Hadoop cluster in HDInsight.

Curl is used to demonstrate how you can interact with HDInsight by using raw HTTP requests to run, monitor, and retrieve the results of Sqoop jobs. This works by using the WebHCat REST API (formerly known as Templeton) provided by your HDInsight cluster.

Prerequisites

To complete the steps in this article, you need:

  • Completion of Use Apache Sqoop with Hadoop in HDInsight, to configure an environment with an HDInsight cluster and an Azure SQL database.
  • Curl. Curl is a tool to transfer data to or from an HDInsight cluster.
  • jq. The jq utility is used to process the JSON data returned from REST requests.

Submit Apache Sqoop jobs by using Curl

Note

When using Curl or any other REST communication with WebHCat, you must authenticate the requests by providing the user name and password for the HDInsight cluster administrator. You must also use the cluster name as part of the Uniform Resource Identifier (URI) used to send the requests to the server.

For the commands in this section, replace USERNAME with the user to authenticate to the cluster, and replace PASSWORD with the password for the user account. Replace CLUSTERNAME with the name of your cluster.

The REST API is secured via basic authentication. You should always make requests by using Secure HTTP (HTTPS) to help ensure that your credentials are securely sent to the server.
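Basic authentication simply base64-encodes the user name and password into an Authorization header, and curl's -u flag builds that header for you. As a quick local illustration (the credentials shown here are made up, not real cluster values):

```shell
# Encode a hypothetical "username:password" pair the way curl's -u flag does.
# curl then sends the result as: Authorization: Basic <encoded>
printf 'admin:MyPassword1' | base64
```

Because the encoding is trivially reversible, it offers no protection on its own, which is why every request in this article uses an https:// URL.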

  1. From a command line, use the following command to verify that you can connect to your HDInsight cluster:

    curl -u USERNAME:PASSWORD -G https://CLUSTERNAME.azurehdinsight.cn/templeton/v1/status
    

    You should receive a response similar to the following:

    {"status":"ok","version":"v1"}
    

    The parameters used in this command are as follows:

    • -u - The user name and password used to authenticate the request.

    • -G - Indicates that this is a GET request.

      The beginning of the URL, https://CLUSTERNAME.azurehdinsight.cn/templeton/v1, is the same for all requests. The path, /status, indicates that the request is to return the status of WebHCat (also known as Templeton) for the server.
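If you want to check only the status field from a script, the response can be piped through the jq utility listed in the prerequisites. A minimal local sketch, using the sample response shown above in place of a live cluster call:

```shell
# Extract the "status" field from a WebHCat status response with jq.
# The JSON is echoed locally here as a stand-in for the real curl output.
echo '{"status":"ok","version":"v1"}' | jq -r .status
# → ok
```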

  2. Use the following command to submit a Sqoop job:

    curl -u USERNAME:PASSWORD -d user.name=USERNAME -d command="export --connect jdbc:sqlserver://SQLDATABASESERVERNAME.database.chinacloudapi.cn;user=USERNAME@SQLDATABASESERVERNAME;password=PASSWORD;database=SQLDATABASENAME --table log4jlogs --export-dir /example/data/sample.log --input-fields-terminated-by \0x20 -m 1" -d statusdir="wasb:///example/data/sqoop/curl" https://CLUSTERNAME.azurehdinsight.cn/templeton/v1/sqoop
    

    The parameters used in this command are as follows:

    • -d - Since -G is not used, the request defaults to the POST method. -d specifies the data values that are sent with the request.

      • user.name - The user that is running the command.

      • command - The Sqoop command to execute.

      • statusdir - The directory that the status for this job is written to.

      This command returns a job ID that can be used to check the status of the job:

      {"id":"job_1415651640909_0026"}
      
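In a script, it is handy to capture that job ID directly. A minimal sketch using jq on a sample response; in practice you would pipe the output of the curl submission above into jq instead of echoing a literal:

```shell
# Extract the job ID from the submission response. The JSON is echoed
# locally here as a stand-in for the real curl output.
RESPONSE='{"id":"job_1415651640909_0026"}'
JOBID=$(echo "$RESPONSE" | jq -r .id)
echo "$JOBID"
# → job_1415651640909_0026
```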
  3. To check the status of the job, use the following command. Replace JOBID with the value returned in the previous step. For example, if the return value was {"id":"job_1415651640909_0026"}, then JOBID would be job_1415651640909_0026.

    curl -G -u USERNAME:PASSWORD -d user.name=USERNAME https://CLUSTERNAME.azurehdinsight.cn/templeton/v1/jobs/JOBID | jq .status.state
    

    If the job has finished, the state will be SUCCEEDED.

    Note

    This Curl request returns a JavaScript Object Notation (JSON) document with information about the job; jq is used to retrieve only the state value.
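Rather than re-running the status command by hand, a script can poll until the job reaches a terminal state. The sketch below is a generic helper, not part of the WebHCat API: poll_state repeatedly invokes whatever state-fetching command you pass it. The fetch_state wiring in the comment is a hypothetical example that would wrap the curl-and-jq pipeline from the previous step and requires a live cluster.

```shell
# Generic polling helper: runs the given state-fetching command until it
# reports a terminal YARN state (SUCCEEDED, FAILED, or KILLED), then
# prints that state.
poll_state() {
  fetch="$1"
  while true; do
    state=$("$fetch")
    case "$state" in
      SUCCEEDED|FAILED|KILLED) echo "$state"; return ;;
    esac
    sleep 10
  done
}

# Hypothetical wiring against a real cluster (placeholder values):
# fetch_state() {
#   curl -s -G -u USERNAME:PASSWORD -d user.name=USERNAME \
#     "https://CLUSTERNAME.azurehdinsight.cn/templeton/v1/jobs/JOBID" \
#     | jq -r .status.state
# }
# poll_state fetch_state
```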

  4. Once the state of the job has changed to SUCCEEDED, you can retrieve the results of the job from Azure Blob storage. The statusdir parameter passed with the query contains the location of the output file; in this case, wasb:///example/data/sqoop/curl. This address stores the output of the job in the example/data/sqoop/curl directory on the default storage container used by your HDInsight cluster.

    You can use the Azure portal to access the stderr and stdout blobs. You can also use Microsoft SQL Server Management Studio to check the data that was uploaded to the log4jlogs table.

Limitations

  • Bulk export - With Linux-based HDInsight, the Sqoop connector used to export data to Microsoft SQL Server or Azure SQL Database does not currently support bulk inserts.
  • Batching - With Linux-based HDInsight, when using the -batch switch while performing inserts, Sqoop performs multiple inserts instead of batching the insert operations.

Summary

As demonstrated in this document, you can use raw HTTP requests to run, monitor, and view the results of Sqoop jobs on your HDInsight cluster.

For more information on the REST interface used in this article, see the Apache Sqoop REST API guide.

Next steps

For general information on Hive with HDInsight:

For information on other ways you can work with Hadoop on HDInsight:

For other HDInsight articles involving curl: