Use Apache Spark REST API to submit remote jobs to an HDInsight Spark cluster

Learn how to use Apache Livy, the Apache Spark REST API, to submit remote jobs to an Azure HDInsight Spark cluster. For detailed documentation, see Apache Livy.

You can use Livy to run interactive Spark shells or to submit batch jobs to run on Spark. This article covers submitting batch jobs with Livy. The snippets in this article use cURL to make REST API calls to the Livy Spark endpoint.

Prerequisites

Submit an Apache Livy Spark batch job

Before you submit a batch job, you must upload the application jar to the cluster storage associated with the cluster. You can use AzCopy, a command-line utility, to do so. Various other clients can also upload data; for more information, see Upload data for Apache Hadoop jobs in HDInsight.

curl -k --user "admin:password" -v -H "Content-Type: application/json" -X POST -d '{ "file":"<path to application jar>", "className":"<classname in jar>" }' 'https://<spark_cluster_name>.azurehdinsight.cn/livy/batches' -H "X-Requested-By: admin"

Examples

  • If the jar file is on the cluster storage (WASBS)

    curl -k --user "admin:mypassword1!" -v -H "Content-Type: application/json" -X POST -d '{ "file":"wasbs://mycontainer@mystorageaccount.blob.core.windows.net/data/SparkSimpleTest.jar", "className":"com.microsoft.spark.test.SimpleFile" }' "https://mysparkcluster.azurehdinsight.cn/livy/batches" -H "X-Requested-By: admin"
    
  • If you want to pass the jar filename and the class name as part of an input file (in this example, input.txt)

    curl -k  --user "admin:mypassword1!" -v -H "Content-Type: application/json" -X POST --data @C:\Temp\input.txt "https://mysparkcluster.azurehdinsight.cn/livy/batches" -H "X-Requested-By: admin"
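
For readers scripting the submission rather than typing cURL by hand, the following is a minimal Python sketch of the same POST request. It only builds the request and does not send it. The function name `build_batch_request` is hypothetical, and reusing the user name as the `X-Requested-By` value is an assumption based on the cURL examples above, which pass `admin` for both.

```python
import base64
import json
import urllib.request


def build_batch_request(cluster, user, password, jar_path, class_name):
    """Build (but do not send) the POST request that submits a Livy batch."""
    url = f"https://{cluster}.azurehdinsight.cn/livy/batches"
    # Same JSON body as the -d argument in the cURL example.
    body = json.dumps({"file": jar_path, "className": class_name}).encode("utf-8")
    # Equivalent of cURL's --user "user:password" (HTTP basic authentication).
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    headers = {
        "Content-Type": "application/json",
        "X-Requested-By": user,  # assumption: mirrors the "admin" value used above
        "Authorization": f"Basic {token}",
    }
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


request = build_batch_request(
    "mysparkcluster",
    "admin",
    "mypassword1!",
    "wasbs://mycontainer@mystorageaccount.blob.core.windows.net/data/SparkSimpleTest.jar",
    "com.microsoft.spark.test.SimpleFile",
)
print(request.get_method(), request.full_url)
```

Sending is left to the caller, for example with `urllib.request.urlopen(request)`. Unlike `curl -k`, this sketch does not disable certificate verification.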
    

Get information on Livy Spark batches running on the cluster

Syntax:

curl -k --user "admin:password" -v -X GET "https://<spark_cluster_name>.azurehdinsight.cn/livy/batches"

Examples

  • If you want to retrieve all the Livy Spark batches running on the cluster:

    curl -k --user "admin:mypassword1!" -v -X GET "https://mysparkcluster.azurehdinsight.cn/livy/batches" 
    
  • If you want to retrieve a specific batch with a given batch ID:

    curl -k --user "admin:mypassword1!" -v -X GET "https://mysparkcluster.azurehdinsight.cn/livy/batches/{batchId}"
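
The listing call returns a JSON body with `from`, `total`, and `sessions` fields, as the sample output in the walkthrough later in this article shows. A small Python sketch for reading that body might look like the following. The helper name `summarize_batches` is hypothetical, and the populated sample entry is illustrative only. The function expects the JSON body alone, without the header lines that `curl -v` prints.

```python
import json


def summarize_batches(response_text):
    """Return (total, [(id, state), ...]) from a GET /livy/batches response body."""
    payload = json.loads(response_text)
    sessions = payload.get("sessions", [])
    # Fall back to the list length if the "total" field is absent.
    total = payload.get("total", len(sessions))
    return total, [(s.get("id"), s.get("state")) for s in sessions]


# Body returned by a cluster with no batches.
print(summarize_batches('{"from":0,"total":0,"sessions":[]}'))
```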
    

Delete a Livy Spark batch job

curl -k --user "admin:mypassword1!" -v -X DELETE "https://<spark_cluster_name>.azurehdinsight.cn/livy/batches/{batchId}"

Example

Deleting a batch job with batch ID 5.

curl -k --user "admin:mypassword1!" -v -X DELETE "https://mysparkcluster.azurehdinsight.cn/livy/batches/5"

Livy Spark and high-availability

Livy provides high-availability for Spark jobs running on the cluster. Here are a couple of examples.

  • If the Livy service goes down after you have submitted a job remotely to a Spark cluster, the job continues to run in the background. When Livy is back up, it restores the status of the job and reports it back.
  • Jupyter notebooks for HDInsight are powered by Livy in the backend. If a notebook is running a Spark job and the Livy service gets restarted, the notebook continues to run the code cells.

Show me an example

In this section, we look at examples that use Livy Spark to submit a batch job, monitor the progress of the job, and then delete it. The application we use in this example is the one developed in the article Create a standalone Scala application and to run on HDInsight Spark cluster. The steps here assume that:

  • You have already copied the application jar to the storage account associated with the cluster.
  • You have cURL installed on the computer where you are trying these steps.

Perform the following steps:

  1. For ease of use, set environment variables. This example is based on a Windows environment; revise the variables as needed for your environment. Replace CLUSTERNAME and PASSWORD with the appropriate values.

    set clustername=CLUSTERNAME
    set password=PASSWORD
    
  2. Verify that Livy Spark is running on the cluster. We can do so by getting a list of running batches. If this is the first time you are running a job with Livy, the output should report zero batches.

    curl -k --user "admin:%password%" -v -X GET "https://%clustername%.azurehdinsight.cn/livy/batches"
    

    You should get output similar to the following snippet:

    < HTTP/1.1 200 OK
    < Content-Type: application/json; charset=UTF-8
    < Server: Microsoft-IIS/8.5
    < X-Powered-By: ARR/2.5
    < X-Powered-By: ASP.NET
    < Date: Fri, 20 Nov 2015 23:47:53 GMT
    < Content-Length: 34
    <
    {"from":0,"total":0,"sessions":[]}* Connection #0 to host mysparkcluster.azurehdinsight.cn left intact
    

    Notice that the last line of the output says total:0, which means no batches are running.

  3. Let us now submit a batch job. The following snippet uses an input file (input.txt) to pass the jar name and the class name as parameters. If you are running these steps from a Windows computer, using an input file is the recommended approach.

    curl -k --user "admin:%password%" -v -H "Content-Type: application/json" -X POST --data @C:\Temp\input.txt "https://%clustername%.azurehdinsight.cn/livy/batches" -H "X-Requested-By: admin"
    

    The parameters in the file input.txt are defined as follows:

    { "file":"wasb:///example/jars/SparkSimpleApp.jar", "className":"com.microsoft.spark.example.WasbIOTest" }
    

    You should see output similar to the following snippet:

    < HTTP/1.1 201 Created
    < Content-Type: application/json; charset=UTF-8
    < Location: /0
    < Server: Microsoft-IIS/8.5
    < X-Powered-By: ARR/2.5
    < X-Powered-By: ASP.NET
    < Date: Fri, 20 Nov 2015 23:51:30 GMT
    < Content-Length: 36
    <
    {"id":0,"state":"starting","log":[]}* Connection #0 to host mysparkcluster.azurehdinsight.cn left intact
    

    Notice that the last line of the output says state:starting. It also says id:0; here, 0 is the batch ID.

  4. You can now retrieve the status of this specific batch using the batch ID.

    curl -k --user "admin:%password%" -v -X GET "https://%clustername%.azurehdinsight.cn/livy/batches/0"
    

    You should see output similar to the following snippet:

    < HTTP/1.1 200 OK
    < Content-Type: application/json; charset=UTF-8
    < Server: Microsoft-IIS/8.5
    < X-Powered-By: ARR/2.5
    < X-Powered-By: ASP.NET
    < Date: Fri, 20 Nov 2015 23:54:42 GMT
    < Content-Length: 509
    <
    {"id":0,"state":"success","log":["\t diagnostics: N/A","\t ApplicationMaster host: 10.0.0.4","\t ApplicationMaster RPC port: 0","\t queue: default","\t start time: 1448063505350","\t final status: SUCCEEDED","\t tracking URL: http://myspar.lpel.jx.internal.cloudapp.net:8088/proxy/application_1447984474852_0002/","\t user: root","15/11/20 23:52:47 INFO Utils: Shutdown hook called","15/11/20 23:52:47 INFO Utils: Deleting directory /tmp/spark-b72cd2bf-280b-4c57-8ceb-9e3e69ac7d0c"]}* Connection #0 to host mysparkcluster.azurehdinsight.cn left intact
    

    The output now shows state:success, which means the job completed successfully.

  5. If you want, you can now delete the batch.

    curl -k --user "admin:%password%" -v -X DELETE "https://%clustername%.azurehdinsight.cn/livy/batches/0"
    

    You should see output similar to the following snippet:

    < HTTP/1.1 200 OK
    < Content-Type: application/json; charset=UTF-8
    < Server: Microsoft-IIS/8.5
    < X-Powered-By: ARR/2.5
    < X-Powered-By: ASP.NET
    < Date: Sat, 21 Nov 2015 18:51:54 GMT
    < Content-Length: 17
    <
    {"msg":"deleted"}* Connection #0 to host mysparkcluster.azurehdinsight.cn left intact
    

    The last line of the output shows that the batch was successfully deleted. Deleting a job while it is running also kills the job. If you delete a job that has completed, successfully or otherwise, its job information is deleted completely.
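
The submit, monitor, and delete steps above can be sketched as a single Python routine. This is an illustrative outline under stated assumptions, not a definitive client: `send` is a hypothetical transport callable that you would implement with cURL, urllib, or another HTTP client, and the set of terminal states other than `success` is an assumption, since only `starting` and `success` appear in the outputs above.

```python
import time

# Assumption: states after which polling should stop. Only "success" is shown above.
TERMINAL_STATES = {"success", "dead", "killed", "error"}


def run_batch(send, jar_path, class_name, poll_seconds=5.0, max_polls=60):
    """Submit a batch, poll until it reaches a terminal state, then delete it.

    `send(method, path, body)` is a hypothetical transport callable that
    performs the HTTP call and returns the decoded JSON response.
    """
    created = send("POST", "/batches", {"file": jar_path, "className": class_name})
    batch_id = created["id"]
    state = created["state"]
    polls = 0
    while state not in TERMINAL_STATES and polls < max_polls:
        time.sleep(poll_seconds)
        state = send("GET", f"/batches/{batch_id}", None)["state"]
        polls += 1
    # Deleting a batch that is still running also kills it, so a timed-out
    # job is stopped here as well.
    deleted = send("DELETE", f"/batches/{batch_id}", None)
    return batch_id, state, deleted.get("msg")


def fake_send(method, path, body):
    """Stand-in transport that replays the responses shown in the steps above."""
    if method == "POST":
        return {"id": 0, "state": "starting", "log": []}
    if method == "GET":
        return {"id": 0, "state": "success", "log": []}
    return {"msg": "deleted"}


result = run_batch(
    fake_send,
    "wasb:///example/jars/SparkSimpleApp.jar",
    "com.microsoft.spark.example.WasbIOTest",
    poll_seconds=0,
)
print(result)
```

Passing the transport in as a parameter keeps the polling logic testable without a cluster; a real `send` would prepend the Livy endpoint URL and add the authentication and `X-Requested-By` headers used in the cURL commands.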

Updates to Livy configuration starting with HDInsight 3.5

By default, HDInsight 3.5 and later clusters disable the use of local file paths to access sample data files or jars. We encourage you to use the wasb:// path instead to access jars or sample data files from the cluster.
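
As a rough illustration of this restriction, a script could check the `file` value before submitting a batch. The helper name `uses_cluster_storage` is hypothetical, and treating both `wasb` and `wasbs` as acceptable schemes is an assumption based on the paths used earlier in this article.

```python
from urllib.parse import urlparse


def uses_cluster_storage(path):
    """Return True if `path` uses a wasb:// or wasbs:// scheme, not a local path."""
    return urlparse(path).scheme in {"wasb", "wasbs"}


for candidate in (
    "wasb:///example/jars/SparkSimpleApp.jar",
    "/home/user/SparkSimpleApp.jar",
):
    print(candidate, uses_cluster_storage(candidate))
```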

Submitting Livy jobs for a cluster within an Azure virtual network

If you connect to an HDInsight Spark cluster from within an Azure Virtual Network, you can connect directly to Livy on the cluster. In that case, the URL for the Livy endpoint is http://<IP address of the headnode>:8998/batches, where 8998 is the port on which Livy runs on the cluster headnode. For more information on accessing services on non-public ports, see Ports used by Apache Hadoop services on HDInsight.
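
The two endpoint forms used in this article can be captured in one small helper. This is only a sketch: the function name `livy_batches_url` and its parameters are hypothetical, and the IP address in the usage lines is a placeholder.

```python
def livy_batches_url(cluster_name=None, headnode_ip=None):
    """Return the Livy batches URL for public or in-virtual-network access."""
    if headnode_ip is not None:
        # Direct access from inside the virtual network: Livy listens on port 8998.
        return f"http://{headnode_ip}:8998/batches"
    if cluster_name is not None:
        # Public endpoint, as used by the cURL commands in this article.
        return f"https://{cluster_name}.azurehdinsight.cn/livy/batches"
    raise ValueError("Provide either cluster_name or headnode_ip")


print(livy_batches_url(cluster_name="mysparkcluster"))
print(livy_batches_url(headnode_ip="10.0.0.4"))  # placeholder headnode address
```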

Next steps