Distinguish active and dead jobs

This article describes how to distinguish between active and dead jobs.

Problem

On clusters running too many concurrent jobs, you often see some jobs stuck in the Spark UI without making any progress. This makes it hard to tell which jobs/stages are active and which are dead.


Cause

Whenever there are too many concurrent jobs running on a cluster, the Spark internal eventListenerBus may drop events. These events are used to track job progress in the Spark UI. Whenever the event listener drops events, you start seeing dead jobs/stages in the Spark UI that never finish. The jobs have actually finished, but they are not shown as completed in the Spark UI.

You observe traces like the following in the driver logs:

18/01/25 06:37:32 WARN LiveListenerBus: Dropped 5044 SparkListenerEvents since Thu Jan 25 06:36:32 UTC 2018
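The warning above means the LiveListenerBus event queue filled up faster than listeners could drain it. As a hedged mitigation sketch (assuming Spark 2.3+, where the queue capacity is configurable; a larger queue reduces dropped events at the cost of extra driver memory), the capacity can be raised in the cluster's Spark configuration before startup:

```
spark.scheduler.listenerbus.eventqueue.capacity 20000
```

This does not recover events that were already dropped; it only makes future drops less likely on busy drivers.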

Solution

There is no way to remove dead jobs from the Spark UI without restarting the cluster. However, you can identify the active jobs and stages by running the following commands:

sc.statusTracker.getActiveJobIds()  // Returns an array containing the IDs of all active jobs.
sc.statusTracker.getActiveStageIds() // Returns an array containing the IDs of all active stages.
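As a follow-up sketch (assuming a spark-shell session where `sc` is the active SparkContext), the job IDs returned above can be resolved to richer details with `SparkStatusTracker.getJobInfo`, which lets you confirm a job's status and the stages it owns:

```scala
// Sketch, assuming spark-shell where `sc` is the active SparkContext.
// For each currently active job, print its status and stage IDs.
sc.statusTracker.getActiveJobIds().foreach { jobId =>
  // getJobInfo returns Option[SparkJobInfo]; None if the job aged out.
  sc.statusTracker.getJobInfo(jobId).foreach { info =>
    println(s"Job $jobId: status=${info.status}, stages=${info.stageIds.mkString(",")}")
  }
}
```

Any job ID shown in the Spark UI that is absent from `getActiveJobIds()` is a candidate for a dead entry left behind by dropped events.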