Use Caffe on Azure HDInsight Spark for distributed deep learning

Introduction

Deep learning is impacting everything from healthcare to transportation to manufacturing, and more. Companies are turning to deep learning to solve hard problems, such as image classification, speech recognition, object recognition, and machine translation.

There are many popular deep learning frameworks, including Microsoft Cognitive Toolkit, TensorFlow, Apache MXNet, and Theano. Caffe is one of the most famous non-symbolic (imperative) neural network frameworks, and it is widely used in many areas, including computer vision. Furthermore, CaffeOnSpark combines Caffe with Apache Spark, so deep learning can easily be used on an existing Hadoop cluster. You can use deep learning together with Spark ETL pipelines, reducing system complexity and the latency of the complete learning solution.

HDInsight is a cloud Apache Hadoop offering that provides optimized open-source analytic clusters for Apache Spark, Apache Hive, Apache Hadoop, Apache HBase, Apache Storm, and Apache Kafka. HDInsight is backed by a 99.9% SLA. Each of these big data technologies and ISV applications can easily be deployed as managed clusters with enterprise-grade security and monitoring.

This article demonstrates how to install Caffe on Spark for an HDInsight cluster. It also uses the built-in MNIST demo to show how to run distributed deep learning with HDInsight Spark on CPUs.

There are four steps to accomplish the task:

  1. Install the required dependencies on all the nodes.
  2. Build Caffe on Spark for HDInsight on the head node.
  3. Distribute the required libraries to all the worker nodes.
  4. Compose a Caffe model and run it in a distributed manner.

Because HDInsight is a PaaS solution, it offers great platform features that make some tasks easy to perform. One of the features used in this article is called Script Action, which lets you execute shell commands to customize cluster nodes (head node, worker node, or edge node).

Step 1: Install the required dependencies on all the nodes

To get started, you need to install the dependencies. The Caffe site and the CaffeOnSpark site offer useful wiki pages for installing the dependencies for Spark on YARN mode, which is also the mode HDInsight uses. However, you need to add a few more dependencies for the HDInsight platform. To do so, use a script action and run it on all the head nodes and worker nodes. This script action takes about 20 minutes, because those dependencies also depend on other packages. Put the script in a location that is accessible to your HDInsight cluster, such as GitHub or the default BLOB storage account.

#!/bin/bash
#Be aware that installing the packages below adds about 20 minutes to cluster creation because of the dependencies.
#Install all dependencies, including the ones mentioned in https://caffe.berkeleyvision.org/install_apt.html, as well as a few packages that are not included in HDInsight, such as gflags, glog, lmdb, and numpy.
#numpy appears to be needed only at compile time, but to be safe it is installed on all the nodes.

sudo apt-get install -y libprotobuf-dev libleveldb-dev libsnappy-dev libopencv-dev libhdf5-serial-dev protobuf-compiler maven libatlas-base-dev libgflags-dev libgoogle-glog-dev liblmdb-dev build-essential  libboost-all-dev python-numpy python-scipy python-matplotlib ipython ipython-notebook python-pandas python-sympy python-nose

#install protobuf
wget https://github.com/google/protobuf/releases/download/v2.5.0/protobuf-2.5.0.tar.gz
sudo tar xzvf protobuf-2.5.0.tar.gz -C /tmp/
cd /tmp/protobuf-2.5.0/
sudo ./configure
sudo make
sudo make check
sudo make install
sudo ldconfig
echo "protobuf installation done"

There are two steps in the script action. The first step is to install all the required libraries. Those libraries include the ones needed to compile Caffe (such as gflags and glog) and the ones needed to run Caffe (such as numpy). This article uses libatlas for CPU optimization, but you can always follow the CaffeOnSpark wiki to install other optimization libraries, such as MKL or CUDA (for GPU).

The second step is to download, compile, and install protobuf 2.5.0 for Caffe. Protobuf 2.5.0 is required, but this version is not available as a package on Ubuntu 16, so you need to compile it from the source code. There are also a few resources on the Internet on how to compile it. For more information, see here.
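After the script action completes, a quick optional check on any node can confirm that the compiled protobuf is the expected version and that numpy is importable. The commands below are a minimal sketch; the expected version (2.5.0) comes from the script above.

#optional sanity check for the installed dependencies
protoc --version                                          # expected: libprotoc 2.5.0
python -c "import numpy; print(numpy.__version__)"        # numpy should import without errors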

To get started, you can just run this script action against your cluster on all the worker nodes and head nodes (for HDInsight 3.5). You can either run the script action on an existing cluster, or use it during cluster creation. For more information about script actions, see the documentation here.

Script action to install dependencies
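If you prefer the command line to the Azure portal, the following is one hedged way to apply the script action with the Azure CLI. The resource group, cluster name, action name, and script URI are placeholders you must replace, and the command assumes the az hdinsight command group available in recent Azure CLI versions.

#a sketch: apply the script action from the Azure CLI (assumes the az hdinsight command group; replace placeholder values)
az hdinsight script-action execute \
    --resource-group <your-resource-group> \
    --cluster-name <your-cluster-name> \
    --name install-caffe-dependencies \
    --script-uri https://<your-storage-or-github-location>/install-caffe-dependencies.sh \
    --roles headnode workernode \
    --persist-on-success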

Step 2: Build Caffe on Apache Spark for HDInsight on the head node

The second step is to build Caffe on the head node and then distribute the compiled libraries to all the worker nodes. To do this, ssh into your head node and follow the CaffeOnSpark build process. Below is the script you can use to build CaffeOnSpark, with a few additional steps.

#!/bin/bash
git clone https://github.com/yahoo/CaffeOnSpark.git --recursive
export CAFFE_ON_SPARK=$(pwd)/CaffeOnSpark

pushd ${CAFFE_ON_SPARK}/caffe-public/
cp Makefile.config.example Makefile.config
echo "INCLUDE_DIRS += ${JAVA_HOME}/include" >> Makefile.config
#The configurations below might need to be updated for your case. For example, if you are using a GPU or a different BLAS library, update these settings accordingly.
echo "CPU_ONLY := 1" >> Makefile.config
echo "BLAS := atlas" >> Makefile.config
echo "INCLUDE_DIRS += /usr/include/hdf5/serial/" >> Makefile.config
echo "LIBRARY_DIRS += /usr/lib/x86_64-linux-gnu/hdf5/serial/" >> Makefile.config
popd

#compile CaffeOnSpark
pushd ${CAFFE_ON_SPARK}
#always clean up the environment before building (especially when rebuilding), or there will be errors such as "failed to execute goal org.apache.maven.plugins:maven-antrun-plugin:1.7:run (proto) on project caffe-distri: An Ant BuildException has occurred: exec returned: 2"
make clean 
#the build step usually takes 20~30 minutes, since it has a lot of Maven dependencies
make build 
popd
export LD_LIBRARY_PATH=${CAFFE_ON_SPARK}/caffe-public/distribute/lib:${CAFFE_ON_SPARK}/caffe-distri/distribute/lib

hadoop fs -mkdir -p wasb:///projects/machine_learning/image_dataset

${CAFFE_ON_SPARK}/scripts/setup-mnist.sh
hadoop fs -put -f ${CAFFE_ON_SPARK}/data/mnist_*_lmdb wasb:///projects/machine_learning/image_dataset/

${CAFFE_ON_SPARK}/scripts/setup-cifar10.sh
hadoop fs -put -f ${CAFFE_ON_SPARK}/data/cifar10_*_lmdb wasb:///projects/machine_learning/image_dataset/

#Put the compiled CaffeOnSpark libraries into wasb storage, then copy them back to each node using script actions. This is because CaffeOnSpark requires all the nodes to have the libraries.
hadoop fs -mkdir -p /CaffeOnSpark/caffe-public/distribute/lib/
hadoop fs -mkdir -p /CaffeOnSpark/caffe-distri/distribute/lib/
hadoop fs -put CaffeOnSpark/caffe-distri/distribute/lib/* /CaffeOnSpark/caffe-distri/distribute/lib/
hadoop fs -put CaffeOnSpark/caffe-public/distribute/lib/* /CaffeOnSpark/caffe-public/distribute/lib/

You may need to do more than what the CaffeOnSpark documentation describes. The changes are:

  • Change to CPU-only mode and use libatlas for this purpose.
  • Put the datasets in BLOB storage, a shared location that all worker nodes can access for later use.
  • Put the compiled Caffe libraries in BLOB storage so you can later copy them to all the nodes by using script actions, avoiding the need to compile again.

Troubleshooting: An Ant BuildException has occurred: exec returned: 2

When first trying to build CaffeOnSpark, it sometimes fails with the following error message:

failed to execute goal org.apache.maven.plugins:maven-antrun-plugin:1.7:run (proto) on project caffe-distri: An Ant BuildException has occurred: exec returned: 2

Clean the code repository with "make clean" and then run "make build" again to resolve this issue, as long as the dependencies are correct.
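In other words, re-run the build from a clean tree. This is a minimal sketch, using the same CAFFE_ON_SPARK variable as the build script above:

#clean up any partial build output, then rebuild
pushd ${CAFFE_ON_SPARK}
make clean
make build
popd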

Troubleshooting: Maven repository connection time-out

Sometimes Maven gives a connection time-out error, similar to the following snippet:

Retry:
[INFO] Downloading: https://repo.maven.apache.org/maven2/com/twitter/chill_2.11/0.8.0/chill_2.11-0.8.0.jar
Feb 01, 2017 5:14:49 AM org.apache.maven.wagon.providers.http.httpclient.impl.execchain.RetryExec execute
INFO: I/O exception (java.net.SocketException) caught when processing request to {s}->https://repo.maven.apache.org:443: Connection timed out (Read failed)

Retry the build after a few minutes.
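If the time-outs keep recurring, a simple retry loop around the build can help. The attempt count and delay below are arbitrary assumptions; adjust them as needed.

#retry the CaffeOnSpark build a few times if Maven downloads time out (attempt count and delay are assumptions)
pushd ${CAFFE_ON_SPARK}
for attempt in 1 2 3; do
    make build && break
    echo "Build attempt ${attempt} failed; retrying in 5 minutes..."
    sleep 300
done
popd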

Troubleshooting: Test failure for Caffe

You might see a test failure during the final check for CaffeOnSpark. It is probably related to UTF-8 encoding, but it should not affect the use of Caffe:

Run completed in 32 seconds, 78 milliseconds.
Total number of tests run: 7
Suites: completed 5, aborted 0
Tests: succeeded 6, failed 1, canceled 0, ignored 0, pending 0
*** 1 TEST FAILED ***

Step 3: Distribute the required libraries to all the worker nodes

The next step is to distribute the libraries (basically the libraries in CaffeOnSpark/caffe-public/distribute/lib/ and CaffeOnSpark/caffe-distri/distribute/lib/) to all the nodes. In step 2, you put those libraries in BLOB storage; in this step, you use script actions to copy them to all the head nodes and worker nodes.

To do this, run a script action like the following snippet:

#!/bin/bash
hadoop fs -get wasb:///CaffeOnSpark /home/changetoyourusername/

Make sure you point to the right location specific to your cluster.

Because in step 2 you put the libraries in BLOB storage, which is accessible to all the nodes, in this step you simply copy them to every node.
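As an optional check after the script action finishes, you can confirm on a node that the libraries were copied. This is a sketch; the home directory below matches the placeholder path used in the script action above.

#confirm that the CaffeOnSpark libraries exist on this node
ls /home/changetoyourusername/CaffeOnSpark/caffe-public/distribute/lib/
ls /home/changetoyourusername/CaffeOnSpark/caffe-distri/distribute/lib/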

Step 4: Compose a Caffe model and run it in a distributed manner

Caffe is installed after running the preceding steps. The next step is to write a Caffe model.

Caffe uses an "expressive architecture": to compose a model, you just define a configuration file, with no coding at all (in most cases). Let's take a look.

The model that you train is a sample model for MNIST training. The MNIST database of handwritten digits has a training set of 60,000 examples and a test set of 10,000 examples. It is a subset of a larger set available from NIST. The digits have been size-normalized and centered in a fixed-size image. CaffeOnSpark has scripts to download the dataset and convert it into the right format.
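The build script in step 2 already ran setup-mnist.sh and uploaded the converted LMDB data to BLOB storage. If you want to confirm the data is in place before training, a quick optional check is:

#list the converted MNIST (and CIFAR-10) LMDB datasets uploaded in step 2
hadoop fs -ls wasb:///projects/machine_learning/image_dataset/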

CaffeOnSpark provides some example network topologies for MNIST training. It has a nice design that separates the network architecture (the topology of the network) from the optimization. In this case, two files are required:

The "solver" file (${CAFFE_ON_SPARK}/data/lenet_memory_solver.prototxt) is used for overseeing the optimization and generating parameter updates. For example, it defines whether CPU or GPU is used, the momentum, the number of iterations, and so on. It also defines which neural network topology the program should use (which is the second file you need). For more information about solvers, see the Caffe documentation.

For this example, because you are using CPU rather than GPU, change the last line to:

# solver mode: CPU or GPU
solver_mode: CPU

Caffe Config1

You can change other lines as needed.
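You can make this edit in any text editor, or with a one-line sed command such as the following sketch (it assumes the file still contains the default "solver_mode: GPU" line):

#switch the solver from GPU to CPU mode
sed -i 's/solver_mode: GPU/solver_mode: CPU/' ${CAFFE_ON_SPARK}/data/lenet_memory_solver.prototxt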

The second file (${CAFFE_ON_SPARK}/data/lenet_memory_train_test.prototxt) defines what the neural network looks like, along with the relevant input and output files. You also need to update this file to reflect the training data location. Change the following parts in lenet_memory_train_test.prototxt, pointing to the right location specific to your cluster (a sed sketch of these changes appears below):

  • Change "file:/Users/mridul/bigml/demodl/mnist_train_lmdb" to "wasb:///projects/machine_learning/image_dataset/mnist_train_lmdb".
  • Change "file:/Users/mridul/bigml/demodl/mnist_test_lmdb/" to "wasb:///projects/machine_learning/image_dataset/mnist_test_lmdb".

Caffe Config2

For more information on how to define the network, see the Caffe documentation on the MNIST dataset.
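One way to make both path changes from the head node is with sed. This sketch assumes the file still contains the original paths shown in the list above; verify the exact strings in your copy of the file.

#point the training and test data sources at the LMDB datasets in BLOB storage
sed -i 's|file:/Users/mridul/bigml/demodl/mnist_train_lmdb|wasb:///projects/machine_learning/image_dataset/mnist_train_lmdb|' ${CAFFE_ON_SPARK}/data/lenet_memory_train_test.prototxt
sed -i 's|file:/Users/mridul/bigml/demodl/mnist_test_lmdb/|wasb:///projects/machine_learning/image_dataset/mnist_test_lmdb|' ${CAFFE_ON_SPARK}/data/lenet_memory_train_test.prototxt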

For the purpose of this article, use this MNIST example. Run the following command from the head node:

spark-submit --master yarn --deploy-mode cluster --num-executors 8 --files ${CAFFE_ON_SPARK}/data/lenet_memory_solver.prototxt,${CAFFE_ON_SPARK}/data/lenet_memory_train_test.prototxt --conf spark.driver.extraLibraryPath="${LD_LIBRARY_PATH}" --conf spark.executorEnv.LD_LIBRARY_PATH="${LD_LIBRARY_PATH}" --class com.yahoo.ml.caffe.CaffeOnSpark ${CAFFE_ON_SPARK}/caffe-grid/target/caffe-grid-0.1-SNAPSHOT-jar-with-dependencies.jar -train -features accuracy,loss -label label -conf lenet_memory_solver.prototxt -devices 1 -connection ethernet -model wasb:///mnist.model -output wasb:///mnist_features_result

The preceding command distributes the required files (lenet_memory_solver.prototxt and lenet_memory_train_test.prototxt) to each YARN container. It also sets LD_LIBRARY_PATH for each Spark driver and executor; LD_LIBRARY_PATH is defined in the earlier build script and points to the location of the CaffeOnSpark libraries.

Monitoring and troubleshooting

Because you are using YARN cluster mode, the Spark driver is scheduled to an arbitrary container (on an arbitrary worker node), so the console only shows output similar to the following:

17/02/01 23:22:16 INFO Client: Application report for application_1485916338528_0015 (state: RUNNING)

If you want to know what happened, you usually need the Spark driver's log, which has more information. In this case, go to the YARN UI to find the relevant YARN logs. You can reach the YARN UI at this URL:

https://yourclustername.azurehdinsight.cn/yarnui

YARN UI

You can take a look at how many resources are allocated for this particular application. Click the "Scheduler" link to see that nine containers are running for this application: you asked YARN for eight executors, and the remaining container hosts the driver process.

YARN Scheduler

If there are failures, you may want to check the driver logs or the container logs. For driver logs, click the application ID in the YARN UI, and then click the "Logs" button. The driver logs are written to stderr.

YARN UI 2
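You can also pull the same driver logs from the head node with the YARN CLI; the application ID below is the one from the console output shown earlier.

#fetch the aggregated logs (including the driver log) for the application
yarn logs -applicationId application_1485916338528_0015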

For example, you might see the following error in the driver logs, indicating that too many executors were allocated:

17/02/01 07:26:06 ERROR ApplicationMaster: User class threw exception: java.lang.IllegalStateException: Insufficient training data. Please adjust hyperparameters or increase dataset.
java.lang.IllegalStateException: Insufficient training data. Please adjust hyperparameters or increase dataset.
    at com.yahoo.ml.caffe.CaffeOnSpark.trainWithValidation(CaffeOnSpark.scala:261)
    at com.yahoo.ml.caffe.CaffeOnSpark$.main(CaffeOnSpark.scala:42)
    at com.yahoo.ml.caffe.CaffeOnSpark.main(CaffeOnSpark.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:627)

Sometimes the issue happens in the executors rather than the driver. In that case, check the container logs: retrieve the container logs and locate the failed container. For example, you might encounter the following failure when running Caffe:

17/02/01 07:12:05 WARN YarnAllocator: Container marked as failed: container_1485916338528_0008_05_000005 on host: 10.0.0.14. Exit status: 134. Diagnostics: Exception from container-launch.
Container id: container_1485916338528_0008_05_000005
Exit code: 134
Exception message: /bin/bash: line 1: 12230 Aborted                 (core dumped) LD_LIBRARY_PATH=/usr/hdp/current/hadoop-client/lib/native:/usr/hdp/current/hadoop-client/lib/native/Linux-amd64-64:/home/xiaoyzhu/CaffeOnSpark/caffe-public/distribute/lib:/home/xiaoyzhu/CaffeOnSpark/caffe-distri/distribute/lib /usr/lib/jvm/java-8-openjdk-amd64/bin/java -server -Xmx4608m '-Dhdp.version=' '-Detwlogger.component=sparkexecutor' '-DlogFilter.filename=SparkLogFilters.xml' '-DpatternGroup.filename=SparkPatternGroups.xml' '-Dlog4jspark.root.logger=INFO,console,RFA,ETW,Anonymizer' '-Dlog4jspark.log.dir=/var/log/sparkapp/${user.name}' '-Dlog4jspark.log.file=sparkexecutor.log' '-Dlog4j.configuration=file:/usr/hdp/current/spark2-client/conf/log4j.properties' '-Djavax.xml.parsers.SAXParserFactory=com.sun.org.apache.xerces.internal.jaxp.SAXParserFactoryImpl' -Djava.io.tmpdir=/mnt/resource/hadoop/yarn/local/usercache/xiaoyzhu/appcache/application_1485916338528_0008/container_1485916338528_0008_05_000005/tmp '-Dspark.driver.port=43942' '-Dspark.history.ui.port=18080' '-Dspark.ui.port=0' -Dspark.yarn.app.container.log.dir=/mnt/resource/hadoop/yarn/log/application_1485916338528_0008/container_1485916338528_0008_05_000005 -XX:OnOutOfMemoryError='kill %p' org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url spark://CoarseGrainedScheduler@10.0.0.13:43942 --executor-id 4 --hostname 10.0.0.14 --cores 3 --app-id application_1485916338528_0008 --user-class-path file:/mnt/resource/hadoop/yarn/local/usercache/xiaoyzhu/appcache/application_1485916338528_0008/container_1485916338528_0008_05_000005/__app__.jar > /mnt/resource/hadoop/yarn/log/application_1485916338528_0008/container_1485916338528_0008_05_000005/stdout 2> /mnt/resource/hadoop/yarn/log/application_1485916338528_0008/container_1485916338528_0008_05_000005/stderr

Stack trace: ExitCodeException exitCode=134: /bin/bash: line 1: 12230 Aborted                 (core dumped) LD_LIBRARY_PATH=/usr/hdp/current/hadoop-client/lib/native:/usr/hdp/current/hadoop-client/lib/native/Linux-amd64-64:/home/xiaoyzhu/CaffeOnSpark/caffe-public/distribute/lib:/home/xiaoyzhu/CaffeOnSpark/caffe-distri/distribute/lib /usr/lib/jvm/java-8-openjdk-amd64/bin/java -server -Xmx4608m '-Dhdp.version=' '-Detwlogger.component=sparkexecutor' '-DlogFilter.filename=SparkLogFilters.xml' '-DpatternGroup.filename=SparkPatternGroups.xml' '-Dlog4jspark.root.logger=INFO,console,RFA,ETW,Anonymizer' '-Dlog4jspark.log.dir=/var/log/sparkapp/${user.name}' '-Dlog4jspark.log.file=sparkexecutor.log' '-Dlog4j.configuration=file:/usr/hdp/current/spark2-client/conf/log4j.properties' '-Djavax.xml.parsers.SAXParserFactory=com.sun.org.apache.xerces.internal.jaxp.SAXParserFactoryImpl' -Djava.io.tmpdir=/mnt/resource/hadoop/yarn/local/usercache/xiaoyzhu/appcache/application_1485916338528_0008/container_1485916338528_0008_05_000005/tmp '-Dspark.driver.port=43942' '-Dspark.history.ui.port=18080' '-Dspark.ui.port=0' -Dspark.yarn.app.container.log.dir=/mnt/resource/hadoop/yarn/log/application_1485916338528_0008/container_1485916338528_0008_05_000005 -XX:OnOutOfMemoryError='kill %p' org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url spark://CoarseGrainedScheduler@10.0.0.13:43942 --executor-id 4 --hostname 10.0.0.14 --cores 3 --app-id application_1485916338528_0008 --user-class-path file:/mnt/resource/hadoop/yarn/local/usercache/xiaoyzhu/appcache/application_1485916338528_0008/container_1485916338528_0008_05_000005/__app__.jar > /mnt/resource/hadoop/yarn/log/application_1485916338528_0008/container_1485916338528_0008_05_000005/stdout 2> /mnt/resource/hadoop/yarn/log/application_1485916338528_0008/container_1485916338528_0008_05_000005/stderr

    at org.apache.hadoop.util.Shell.runCommand(Shell.java:933)
    at org.apache.hadoop.util.Shell.run(Shell.java:844)
    at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:1123)
    at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:225)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:317)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:83)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)


Container exited with a non-zero exit code 134

In this case, you need to get the ID of the failed container (in the preceding example, it is container_1485916338528_0008_05_000005), and then run the following command from the head node:

yarn logs -containerId container_1485916338528_0008_05_000005

After checking the container failure, you can see that it was caused by using GPU mode in lenet_memory_solver.prototxt (where CPU mode should be used instead):

17/02/01 07:10:48 INFO LMDB: Batch size:100
WARNING: Logging before InitGoogleLogging() is written to STDERR
F0201 07:10:48.309725 11624 common.cpp:79] Cannot use GPU in CPU-only Caffe: check mode.

Getting results

Because you allocated eight executors and the network topology is simple, getting the result should only take around 30 minutes. From the command-line arguments, you can see that the model is written to wasb:///mnist.model and the results to a folder named wasb:///mnist_features_result.
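Once the job finishes, an optional check confirms that the model and the feature output were written to the locations passed on the spark-submit command line:

#confirm the trained model and the feature results exist in BLOB storage
hadoop fs -ls wasb:///mnist.model
hadoop fs -ls wasb:///mnist_features_result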

You can get the results by running:

hadoop fs -cat wasb:///mnist_features_result/*

The result looks like the following:

{"SampleID":"00009597","accuracy":[1.0],"loss":[0.028171852],"label":[2.0]}
{"SampleID":"00009598","accuracy":[1.0],"loss":[0.028171852],"label":[6.0]}
{"SampleID":"00009599","accuracy":[1.0],"loss":[0.028171852],"label":[1.0]}
{"SampleID":"00009600","accuracy":[0.97],"loss":[0.0677709],"label":[5.0]}
{"SampleID":"00009601","accuracy":[0.97],"loss":[0.0677709],"label":[0.0]}
{"SampleID":"00009602","accuracy":[0.97],"loss":[0.0677709],"label":[1.0]}
{"SampleID":"00009603","accuracy":[0.97],"loss":[0.0677709],"label":[2.0]}
{"SampleID":"00009604","accuracy":[0.97],"loss":[0.0677709],"label":[3.0]}
{"SampleID":"00009605","accuracy":[0.97],"loss":[0.0677709],"label":[4.0]}

The SampleID represents the ID in the MNIST dataset, and the label is the digit that the model identified.

Conclusion

In this article, you installed CaffeOnSpark and ran a simple example. HDInsight is a fully managed cloud distributed compute platform and is well suited to running machine learning and advanced analytics workloads on large data sets. For distributed deep learning, you can use Caffe on HDInsight Spark to perform deep learning tasks.

See also

Scenarios

Manage resources