Debugging Apache Spark Streaming applications

This guide walks you through the different debugging options available to peek at the internals of your Apache Spark Streaming application. The three important places to look are:

  • Spark UI
  • Driver logs
  • Executor logs

Spark UI

Once you start the streaming job, there is a wealth of information available in the Spark and Streaming UI to help you understand what is happening in your streaming application. To get to the Spark UI, click the attached cluster:

Select Spark UI

Streaming tab

Once you get to the Spark UI, you will see a Streaming tab if a streaming job is running in this cluster. If no streaming job is running in this cluster, this tab will not be visible. You can skip ahead to Driver logs to learn how to check for exceptions that might have occurred while starting the streaming job.

The first thing to look for on this page is whether your streaming application is receiving any input events from your source. In this case, you can see the job receives 1000 events/second.

If you have an application that receives multiple input streams, you can click the Input Rate link, which shows the number of events received by each receiver.

Input rate

Processing time

As you scroll down, find the graph for Processing Time. This is one of the key graphs for understanding the performance of your streaming job. As a general rule of thumb, it is good if you can process each batch within 80% of your batch interval.

For this application, the batch interval was 2 seconds. The average processing time is 450 ms, which is well under the batch interval. If the average processing time is close to or greater than your batch interval, your streaming application will start queuing up batches, quickly building a backlog that can eventually bring down your streaming job.
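For reference, the batch interval is fixed when the StreamingContext is created. The minimal sketch below (the application name and the socket source host/port are hypothetical) shows where the 2-second value in this example would come from:

    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext

    # In a notebook, `sc` typically already exists; it is created here only for completeness.
    sc = SparkContext(appName="StreamingDebugExample")   # hypothetical app name
    ssc = StreamingContext(sc, 2)                        # 2-second batch interval

    lines = ssc.socketTextStream("source-host", 9999)    # hypothetical source
    lines.count().pprint()                               # per-batch record count, visible in the driver logs

    ssc.start()
    ssc.awaitTermination()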

Processing time

Completed batches

Towards the end of the page, you will see a list of all the completed batches. The page displays details about the last 1000 batches that completed. From the table, you can get the number of events processed for each batch and its processing time. If you want to know more about what happened in one of the batches, you can click the batch link to get to the Batch Details page.

Completed batches

Batch details page

This page has all the details you want to know about a batch. Two key things are:

  • Input: It has details about the input to the batch. In this case, it has details about the Apache Kafka topic, partitions, and offsets read by Spark Streaming for this batch. For a TextFileStream, you will see the list of file names that were read for this batch. This is the best way to start debugging a streaming application that reads from text files (see the sketch after this list).
  • Processing: You can click the link to the Job ID, which has all the details about the processing done during this batch.
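As a rough illustration of the TextFileStream case, a stream like the one sketched below (the application name and the monitored directory path are hypothetical) would have the file names it picked up for each batch listed under Input on the Batch Details page:

    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext

    sc = SparkContext(appName="TextFileStreamExample")   # hypothetical app name
    ssc = StreamingContext(sc, 2)

    files = ssc.textFileStream("/mnt/incoming-data")      # hypothetical directory to monitor
    files.count().pprint()                                # per-batch record count in the driver logs

    ssc.start()
    ssc.awaitTermination()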

Batch details

Job details page

The job details page shows the DStream DAG visualization for the batch. This is a very useful visualization for understanding the DAG of operations in every batch. In this case, you can see that the batch read input from a Kafka direct stream, followed by a flat map operation and then a map operation. The resulting DStream was then used to update a global state using updateStateByKey. (The greyed boxes represent skipped stages. Spark is smart enough to skip stages that do not need to be recomputed: if the data is checkpointed or cached, Spark does not recompute those stages. In this case, those stages correspond to the dependency on previous batches because of updateStateByKey. Since Spark Streaming internally checkpoints the DStream and reads from the checkpoint instead of depending on the previous batches, they are shown as greyed stages.)
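A rough sketch of the kind of pipeline such a DAG corresponds to is shown below. The broker address, topic name, and checkpoint path are hypothetical, and KafkaUtils.createDirectStream comes from the older DStream-era Kafka integration (pyspark.streaming.kafka), which is only available in Spark releases that shipped it. The checkpoint is what lets updateStateByKey read prior state instead of recomputing earlier batches:

    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext
    from pyspark.streaming.kafka import KafkaUtils    # DStream Kafka support in older Spark releases

    sc = SparkContext(appName="KafkaStateExample")     # hypothetical app name
    ssc = StreamingContext(sc, 2)
    ssc.checkpoint("/checkpoints/word-state")          # hypothetical checkpoint directory

    kafka_stream = KafkaUtils.createDirectStream(
        ssc, ["events"], {"metadata.broker.list": "broker:9092"})   # hypothetical topic and broker

    def update_count(new_values, running_count):
        # Add this batch's counts to the running total kept in the global state.
        return sum(new_values) + (running_count or 0)

    counts = (kafka_stream
              .flatMap(lambda kv: kv[1].split())       # flat map: split each message into words
              .map(lambda word: (word, 1))             # map: pair each word with a count of 1
              .updateStateByKey(update_count))         # update global state across batches

    counts.pprint()
    ssc.start()
    ssc.awaitTermination()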

At the bottom of the page, you will also find the list of jobs that were executed for this batch. You can click the links in the description to drill further into the task-level execution.

Job details

Completed stages

Task details page

This is the most granular level of debugging you can get into from the Spark UI for a Spark Streaming application. This page has all the tasks that were executed for this batch. If you are investigating performance issues in your streaming application, this page provides information such as the number of tasks that were executed, where they were executed (on which executors), shuffle information, and so on.

Tip

Ensure that the tasks are executed on multiple executors (nodes) in your cluster so that you have enough parallelism while processing. If you have a single receiver, sometimes only one executor might be doing all the work, even though you have more than one executor in your cluster.
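One common way to address this with receiver-based sources is to create several receivers, union them, and repartition before the heavy processing. The sketch below uses hypothetical host/port values and an arbitrary partition count:

    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext

    sc = SparkContext(appName="MultiReceiverExample")   # hypothetical app name
    ssc = StreamingContext(sc, 2)

    # Several receivers, each occupying its own executor core.
    streams = [ssc.socketTextStream("source-host", 9990 + i) for i in range(3)]
    unified = ssc.union(*streams)

    # Spread the downstream work across the cluster regardless of where the receivers run.
    processed = unified.repartition(12).map(lambda line: line.upper())
    processed.count().pprint()

    ssc.start()
    ssc.awaitTermination()

With this layout, the Task details page should show tasks spread over several executors rather than piling up on the one hosting a single receiver.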

Task details

Driver logs

Driver logs are helpful for two purposes:

  • Exceptions: Sometimes you may not see the Streaming tab in the Spark UI. This is because the streaming job did not start due to some exception. You can drill into the driver logs to look at the stack trace of the exception. In other cases, the streaming job may have started properly, but you will see that no batches ever move to the Completed batches section: they might all be stuck in a processing or failed state. In such cases too, the driver logs are handy for understanding the nature of the underlying issues.

  • Prints: Any print statements that are part of the DStream DAG also show up in the logs. If you want to quickly check the contents of a DStream, you can call dstream.print() (Scala) or dstream.pprint() (Python); the contents of the DStream will appear in the logs. You can also call dstream.foreachRDD{ print statements here }, and those prints will show up in the logs as well.

    Note

    Print statements in the streaming function that are outside of the DStream DAG will not show up in the logs. Spark Streaming generates and executes only the DStream DAG, so the print statements have to be part of that DAG.

The following table shows where the corresponding log output appears when a DStream transformation contains a print statement (a short sketch follows the table):

Transformation               Location
foreachRDD(), transform()    Driver stdout logs
foreachPartition()           Executor's stdout logs
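The sketch below illustrates the table; dstream is assumed to be an existing DStream, such as one created in the earlier examples. Code that runs on the driver (inside foreachRDD) prints to the driver's stdout log, while code inside foreachPartition runs on the executors and prints to their stdout logs:

    # `dstream` is assumed to be an existing DStream (see the sketches above).

    def show_batch(rdd):
        # Runs on the driver once per batch -> driver stdout logs.
        print("batch size:", rdd.count())

    def show_partition(records):
        # Runs on an executor once per partition -> that executor's stdout logs.
        for record in records:
            print("record:", record)

    dstream.foreachRDD(show_batch)
    dstream.foreachRDD(lambda rdd: rdd.foreachPartition(show_partition))

    # Quick alternative: print the first few elements of each batch to the driver log.
    dstream.pprint()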

To get to the driver logs, click the attached cluster.

Select driver logs

Driver log

Note

For PySpark streaming, prints and exceptions do not automatically show up in the logs. The current limitation is that a notebook cell needs to be active for the logs to show up; since the streaming job runs in a background thread, the logs are otherwise lost. If you want to see the logs while running a PySpark streaming application, you can call ssc.awaitTerminationOrTimeout(x) in one of the cells in the notebook. This puts the cell on hold for x seconds, and after those x seconds, all the prints and exceptions from that window will be present in the logs.
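A minimal sketch of this workaround, assuming ssc is the StreamingContext of the running job and using an arbitrary 60-second window:

    # Keep this notebook cell active so prints/exceptions from the background
    # streaming thread are surfaced; the 60-second value is an arbitrary choice.
    ssc.start()
    ssc.awaitTerminationOrTimeout(60)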

Executor logs

Executor logs are sometimes helpful if you see that certain tasks are misbehaving and would like to see the logs for specific tasks. From the Task details page shown above, you can get the executor where the task was run. Once you have that, you can go to the clusters UI page, click on the # nodes, and then the master. The master page lists all the workers. You can choose the worker where the suspicious task was run and then get to the log4j output.

Select master

Spark master

Spark worker