Running Apache Spark jobs on AKS

Apache Spark is a fast engine for large-scale data processing. As of the [Spark 2.3.0 release][spark-latest-release], Apache Spark supports native integration with Kubernetes clusters. Azure Kubernetes Service (AKS) is a managed Kubernetes environment running in Azure. This document details preparing and running Apache Spark jobs on an Azure Kubernetes Service (AKS) cluster.

Prerequisites

In order to complete the steps within this article, you need the following: an Azure subscription, the Azure CLI, kubectl, Docker, git, a JDK 8 installation, and sbt. All of these tools are used in the steps below.

Create an AKS cluster

Spark is used for large-scale data processing and requires that Kubernetes nodes are sized to meet the Spark resource requirements. We recommend a minimum size of Standard_D3_v2 for your Azure Kubernetes Service (AKS) nodes.

If you need an AKS cluster that meets this minimum recommendation, run the following commands.

Create a resource group for the cluster.

az group create --name mySparkCluster --location chinaeast2

Create a Service Principal for the cluster. After it is created, you will need the Service Principal appId and password for the next command.

az ad sp create-for-rbac --name SparkSP
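
If you prefer not to copy the appId and password out of the JSON output by hand, you can capture them into shell variables instead. This is an optional sketch; the APPID and PASSWORD variable names are illustrative, and the query syntax assumes a current Azure CLI.

# Optional: create the Service Principal once and capture both values (variable names are illustrative).
read APPID PASSWORD <<< $(az ad sp create-for-rbac --name SparkSP --query "[appId,password]" -o tsv)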

Create the AKS cluster with nodes that are of size Standard_D3_v2, and values of appId and password passed as service-principal and client-secret parameters.

az aks create --resource-group mySparkCluster --name mySparkCluster --node-vm-size Standard_D3_v2 --generate-ssh-keys --service-principal <APPID> --client-secret <PASSWORD>

Connect to the AKS cluster.

az aks get-credentials --resource-group mySparkCluster --name mySparkCluster

If you are using Azure Container Registry (ACR) to store container images, configure authentication between AKS and ACR. See the ACR authentication documentation for these steps. A minimal sketch of one such approach follows.
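
As a minimal sketch (assuming a reasonably recent Azure CLI), one way to configure this is to attach the registry to the cluster, which grants the cluster pull access; <ACR_NAME> is a placeholder for your registry name.

# Grant the AKS cluster pull access to an ACR registry (<ACR_NAME> is a placeholder).
az aks update --resource-group mySparkCluster --name mySparkCluster --attach-acr <ACR_NAME>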

Build the Spark source

Before running Spark jobs on an AKS cluster, you need to build the Spark source code and package it into a container image. The Spark source includes scripts that can be used to complete this process.

Clone the Spark project repository to your development system.

git clone -b branch-2.4 https://github.com/apache/spark

Change into the directory of the cloned repository and save the path of the Spark source to a variable.

cd spark
sparkdir=$(pwd)

If you have multiple JDK versions installed, set JAVA_HOME to use version 8 for the current session.

export JAVA_HOME=`/usr/libexec/java_home -d 64 -v "1.8*"`
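
The command above uses the /usr/libexec/java_home helper, which exists only on macOS. On a Linux development system, point JAVA_HOME at your JDK 8 installation directory instead; the path below is only an assumption based on typical Debian/Ubuntu packaging and may differ on your machine.

# Linux example; the path assumes the Debian/Ubuntu OpenJDK 8 package and may differ on your system.
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64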

Run the following command to build the Spark source code with Kubernetes support.

./build/mvn -Pkubernetes -DskipTests clean package

The following commands create the Spark container image and push it to a container image registry. Replace registry.example.com with the name of your container registry and v1 with the tag you prefer to use. If using Docker Hub, this value is the registry name. If using Azure Container Registry (ACR), this value is the ACR login server name.

REGISTRY_NAME=registry.example.com
REGISTRY_TAG=v1
./bin/docker-image-tool.sh -r $REGISTRY_NAME -t $REGISTRY_TAG build
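
Before pushing, the local Docker client must already be authenticated to the target registry. The commands below are a sketch; <ACR_NAME> is a placeholder, and only the line matching your registry type is needed.

# Authenticate the Docker client to the registry before pushing (<ACR_NAME> is a placeholder).
az acr login --name <ACR_NAME>   # if using Azure Container Registry
docker login                     # if using Docker Hub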

Push the container image to your container image registry.

./bin/docker-image-tool.sh -r $REGISTRY_NAME -t $REGISTRY_TAG push

Prepare a Spark job

Next, prepare a Spark job. A jar file is used to hold the Spark job and is needed when running the spark-submit command. The jar can be made accessible through a public URL or pre-packaged within a container image. In this example, a sample jar is created to calculate the value of Pi. This jar is then uploaded to Azure storage. If you have an existing jar, feel free to substitute it.

Create a directory where you would like to create the project for a Spark job.

mkdir myprojects
cd myprojects

Create a new Scala project from a template.

sbt new sbt/scala-seed.g8

When prompted, enter SparkPi for the project name.

name [Scala Seed Project]: SparkPi

Navigate to the newly created project directory.

cd sparkpi

Run the following commands to add an SBT plugin, which allows packaging the project as a jar file.

touch project/assembly.sbt
echo 'addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.10")' >> project/assembly.sbt

Run these commands to copy the sample code into the newly created project and add all necessary dependencies.

EXAMPLESDIR="src/main/scala/org/apache/spark/examples"
mkdir -p $EXAMPLESDIR
cp $sparkdir/examples/$EXAMPLESDIR/SparkPi.scala $EXAMPLESDIR/SparkPi.scala

cat <<EOT >> build.sbt
// https://mvnrepository.com/artifact/org.apache.spark/spark-sql
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.3.0" % "provided"
EOT

sed -ie 's/scalaVersion.*/scalaVersion := "2.11.11"/' build.sbt
sed -ie 's/name.*/name := "SparkPi",/' build.sbt

To package the project into a jar, run the following command.

sbt assembly

After successful packaging, you should see output similar to the following.

[info] Packaging /Users/me/myprojects/sparkpi/target/scala-2.11/SparkPi-assembly-0.1.0-SNAPSHOT.jar ...
[info] Done packaging.
[success] Total time: 10 s, completed Mar 6, 2018 11:07:54 AM

Copy job to storage

Create an Azure storage account and container to hold the jar file.

RESOURCE_GROUP=sparkdemo
STORAGE_ACCT=sparkdemo$RANDOM
az group create --name $RESOURCE_GROUP --location chinaeast2
az storage account create --resource-group $RESOURCE_GROUP --name $STORAGE_ACCT --sku Standard_LRS
export AZURE_STORAGE_CONNECTION_STRING=`az storage account show-connection-string --resource-group $RESOURCE_GROUP --name $STORAGE_ACCT -o tsv`

Upload the jar file to the Azure storage account with the following commands.

CONTAINER_NAME=jars
BLOB_NAME=SparkPi-assembly-0.1.0-SNAPSHOT.jar
FILE_TO_UPLOAD=target/scala-2.11/SparkPi-assembly-0.1.0-SNAPSHOT.jar

echo "Creating the container..."
az storage container create --name $CONTAINER_NAME
az storage container set-permission --name $CONTAINER_NAME --public-access blob

echo "Uploading the file..."
az storage blob upload --container-name $CONTAINER_NAME --file $FILE_TO_UPLOAD --name $BLOB_NAME

jarUrl=$(az storage blob url --container-name $CONTAINER_NAME --name $BLOB_NAME | tr -d '"')

The jarUrl variable now contains the publicly accessible path to the jar file.
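
As an optional check (assuming curl is available on your system), you can print the URL and confirm that the blob is reachable anonymously before submitting the job.

# Optional: print the URL and verify that the blob responds to an unauthenticated request.
echo $jarUrl
curl -I $jarUrl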

Submit a Spark job

Start kubectl proxy in a separate command-line session with the following command.

kubectl proxy
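
By default, kubectl proxy serves the Kubernetes API on 127.0.0.1:8001, which is the address used for the --master parameter in the spark-submit command later in this section. As an optional sanity check (assuming curl is available), you can query the API server version through the proxy.

# Optional: confirm the proxy is forwarding requests to the Kubernetes API server.
curl http://127.0.0.1:8001/version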

Navigate back to the root of the Spark repository.

cd $sparkdir

Create a service account that has sufficient permissions for running a job.

kubectl create serviceaccount spark
kubectl create clusterrolebinding spark-role --clusterrole=edit --serviceaccount=default:spark --namespace=default
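
As an optional check, you can verify that the new service account is allowed to create pods in the default namespace before submitting the job. This sketch uses kubectl's built-in authorization check and assumes your own credentials permit impersonation.

# Optional: verify the spark service account can create pods in the default namespace.
kubectl auth can-i create pods --as=system:serviceaccount:default:spark --namespace=default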

Submit the job using spark-submit.

./bin/spark-submit \
  --master k8s://http://127.0.0.1:8001 \
  --deploy-mode cluster \
  --name spark-pi \
  --class org.apache.spark.examples.SparkPi \
  --conf spark.executor.instances=3 \
  --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
  --conf spark.kubernetes.container.image=$REGISTRY_NAME/spark:$REGISTRY_TAG \
  $jarUrl

This operation starts the Spark job, which streams job status to your shell session. While the job is running, you can see the Spark driver pod and executor pods using the kubectl get pods command. Open a second terminal session to run these commands.

kubectl get pods
NAME                                               READY     STATUS     RESTARTS   AGE
spark-pi-2232778d0f663768ab27edc35cb73040-driver   1/1       Running    0          16s
spark-pi-2232778d0f663768ab27edc35cb73040-exec-1   0/1       Init:0/1   0          4s
spark-pi-2232778d0f663768ab27edc35cb73040-exec-2   0/1       Init:0/1   0          4s
spark-pi-2232778d0f663768ab27edc35cb73040-exec-3   0/1       Init:0/1   0          4s

While the job is running, you can also access the Spark UI. In the second terminal session, use the kubectl port-forward command to provide access to the Spark UI.

kubectl port-forward spark-pi-2232778d0f663768ab27edc35cb73040-driver 4040:4040

To access the Spark UI, open the address 127.0.0.1:4040 in a browser.

Spark UI

Get job results and logs

After the job has finished, the driver pod will be in a "Completed" state. Get the name of the pod with the following command.

kubectl get pods

Output:

NAME                                               READY     STATUS      RESTARTS   AGE
spark-pi-2232778d0f663768ab27edc35cb73040-driver   0/1       Completed   0          1m

Use the kubectl logs command to get logs from the Spark driver pod. Replace the pod name with your driver pod's name.

kubectl logs spark-pi-2232778d0f663768ab27edc35cb73040-driver

Within these logs, you can see the result of the Spark job, which is the value of Pi.

Pi is roughly 3.152155760778804

Package jar with container image

In the above example, the Spark jar file was uploaded to Azure storage. Another option is to package the jar file into a custom-built Docker image.

To do so, find the Dockerfile for the Spark image, located in the $sparkdir/resource-managers/kubernetes/docker/src/main/dockerfiles/spark/ directory. Add an ADD statement for the Spark job jar somewhere between the WORKDIR and ENTRYPOINT declarations.

Update the jar path to the location of the SparkPi-assembly-0.1.0-SNAPSHOT.jar file on your development system. You can also use your own custom jar file.

WORKDIR /opt/spark/work-dir

ADD /path/to/SparkPi-assembly-0.1.0-SNAPSHOT.jar SparkPi-assembly-0.1.0-SNAPSHOT.jar

ENTRYPOINT [ "/opt/entrypoint.sh" ]

Build and push the image with the included Spark scripts.

./bin/docker-image-tool.sh -r <your container repository name> -t <tag> build
./bin/docker-image-tool.sh -r <your container repository name> -t <tag> push

When running the job, instead of specifying a remote jar URL, you can use the local:// scheme with the path to the jar file inside the Docker image.

./bin/spark-submit \
    --master k8s://https://<k8s-apiserver-host>:<k8s-apiserver-port> \
    --deploy-mode cluster \
    --name spark-pi \
    --class org.apache.spark.examples.SparkPi \
    --conf spark.executor.instances=3 \
    --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
    --conf spark.kubernetes.container.image=<spark-image> \
    local:///opt/spark/work-dir/<your-jar-name>.jar

Warning

From the Spark documentation: "The Kubernetes scheduler is currently experimental. In future versions, there may be behavioral changes around configuration, container images and entrypoints."

Next steps

Check out the Spark documentation for more details.