Monitor Batch solutions by counting tasks and nodes by state

To monitor and manage large-scale Azure Batch solutions, you may need to determine how many resources are in each state. Azure Batch provides efficient operations that return counts for Batch tasks and compute nodes. Use these operations instead of list queries, which return detailed information about large collections of tasks or nodes and can be time-consuming.

  • Get Task Counts gets an aggregate count of active, running, and completed tasks in a job, and of tasks that succeeded or failed.

    By counting tasks in each state, you can more easily display job progress to a user, or detect unexpected delays or failures that may affect the job. Get Task Counts is available as of Batch Service API version 2017-06-01.5.1 and related SDKs and tools.

  • List Pool Node Counts gets the number of dedicated compute nodes in each pool that are in various states: creating, idle, offline, preempted, rebooting, reimaging, starting, and others.

    By counting nodes in each state, you can determine when you have adequate compute resources to run your jobs, and identify potential issues with your pools. List Pool Node Counts is available as of Batch Service API version 2018-03-01.6.1 and related SDKs and tools.

The numbers returned by these operations may at times be out of date. If you need to be sure that a count is accurate, use a list query to count these resources. List queries also let you get information about other Batch resources, such as applications. For more information about applying filters to list queries, see Create queries to list Batch resources efficiently.

Task state counts

The Get Task Counts operation counts tasks by the following states:

  • Active - A task that is queued and able to run, but is not currently assigned to a compute node. A task is also active if it depends on a parent task that has not yet completed.
  • Running - A task that has been assigned to a compute node, but has not yet completed. A task is counted as running when its state is either preparing or running, as indicated by the Get information about a task operation.
  • Completed - A task that is no longer eligible to run, because it either finished successfully, or finished unsuccessfully and also exhausted its retry limit.
  • Succeeded - A task whose result of task execution is success. Batch determines whether a task has succeeded or failed by checking the TaskExecutionResult property of the executionInfo property.
  • Failed - A task whose result of task execution is failure.
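
The mapping from individual task states to the aggregate buckets above can be sketched in a few lines. This is only an illustration of the rules in the list, not how the Batch service computes its counts: the class TaskStateTally and its methods are hypothetical names, and the state strings are assumed to be the lowercase per-task states (active, preparing, running, completed).

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper for illustration only; it is not part of the Batch SDK.
public static class TaskStateTally
{
    // Maps a per-task state to the bucket that the aggregate counts report.
    // Both "preparing" and "running" are counted as running.
    public static string Bucket(string taskState)
    {
        switch (taskState)
        {
            case "active":
                return "active";
            case "preparing":
            case "running":
                return "running";
            case "completed":
                return "completed";
            default:
                throw new ArgumentException("Unknown task state: " + taskState);
        }
    }

    // Counts how many of the given task states fall into each bucket.
    public static Dictionary<string, int> Tally(IEnumerable<string> taskStates)
    {
        var counts = new Dictionary<string, int>
        {
            { "active", 0 },
            { "running", 0 },
            { "completed", 0 }
        };

        foreach (var state in taskStates)
        {
            counts[Bucket(state)]++;
        }

        return counts;
    }
}
```

For example, tallying the states active, preparing, running, and completed yields one active task, two running tasks, and one completed task.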

The following .NET code sample shows how to retrieve task counts by state:

// Retrieve the aggregate task counts for the job whose ID is "job-1".
var taskCounts = await batchClient.JobOperations.GetJobTaskCountsAsync("job-1");

Console.WriteLine("Task count in active state: {0}", taskCounts.Active);
Console.WriteLine("Task count in preparing or running state: {0}", taskCounts.Running);
Console.WriteLine("Task count in completed state: {0}", taskCounts.Completed);
Console.WriteLine("Succeeded task count: {0}", taskCounts.Succeeded);
Console.WriteLine("Failed task count: {0}", taskCounts.Failed);

You can use a similar pattern for REST and other supported languages to get task counts for a job.
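
As a rough sketch of how these counts might feed a progress display or a failure check, the helper below turns plain integer counts into a completion percentage and a failure rate. The class JobProgress and its methods are hypothetical and not part of the Batch SDK; the sketch assumes the count values are plain integers like those printed in the sample above.

```csharp
using System;

// Hypothetical helper for illustration only; it is not part of the Batch SDK.
public static class JobProgress
{
    // Percentage of tasks that have completed, out of all tasks counted
    // as active, running, or completed. Returns 0 for a job with no tasks.
    public static double PercentComplete(int active, int running, int completed)
    {
        int total = active + running + completed;
        return total == 0 ? 0.0 : 100.0 * completed / total;
    }

    // Fraction of completed tasks whose execution result was failure.
    // Returns 0 when no task has completed yet.
    public static double FailureRate(int completed, int failed)
    {
        return completed == 0 ? 0.0 : (double)failed / completed;
    }
}
```

With the taskCounts object from the sample, a call such as JobProgress.PercentComplete(taskCounts.Active, taskCounts.Running, taskCounts.Completed) would give the value to display. Because the counts may lag behind the actual task states, treat the result as an approximation.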

Note

Batch Service API versions before 2018-08-01.7.0 also return a validationStatus property in the Get Task Counts response. This property indicates whether Batch checked the state counts for consistency with the states reported in the List Tasks API. A value of validated indicates only that Batch checked for consistency at least once for the job. The value of the validationStatus property does not indicate whether the counts that Get Task Counts returns are currently up to date.

Node state counts

The List Pool Node Counts operation counts compute nodes by the following states in each pool. Separate aggregate counts are provided for dedicated nodes in each pool.

  • Creating - An Azure-allocated VM that has not yet started to join a pool.

  • Idle - An available compute node that is not currently running a task.

  • LeavingPool - A node that is leaving the pool, either because the user explicitly removed it or because the pool is resizing or autoscaling down.

  • Offline - A node that Batch cannot use to schedule new tasks.

  • Rebooting - A node that is restarting.

  • Reimaging - A node on which the operating system is being reinstalled.

  • Running - A node that is running one or more tasks (other than the start task).

  • Starting - A node on which the Batch service is starting.

  • StartTaskFailed - A node on which the start task failed and exhausted all retries, and on which waitForSuccess is set on the start task. The node is not usable for running tasks.

  • Unknown - A node that lost contact with the Batch service and whose state isn't known.

  • Unusable - A node that can't be used for task execution because of errors.

  • WaitingForStartTask - A node on which the start task started running, but waitForSuccess is set and the start task has not completed.

The following C# snippet shows how to list node counts for all pools in the current account:

foreach (var nodeCounts in batchClient.PoolOperations.ListPoolNodeCounts())
{
    Console.WriteLine("Pool Id: {0}", nodeCounts.PoolId);

    Console.WriteLine("Total dedicated node count: {0}", nodeCounts.Dedicated.Total);

    // Get dedicated node counts in Idle and Offline states; you can get additional states.
    Console.WriteLine("Dedicated node count in Idle state: {0}", nodeCounts.Dedicated.Idle);
    Console.WriteLine("Dedicated node count in Offline state: {0}", nodeCounts.Dedicated.Offline);

}

The following C# snippet shows how to list node counts for a given pool in the current account:

foreach (var nodeCounts in batchClient.PoolOperations.ListPoolNodeCounts(new ODATADetailLevel(filterClause: "poolId eq 'testpool'")))
{
    Console.WriteLine("Pool Id: {0}", nodeCounts.PoolId);

    Console.WriteLine("Total dedicated node count: {0}", nodeCounts.Dedicated.Total);

    // Get dedicated node counts in Idle and Offline states; you can get additional states.
    Console.WriteLine("Dedicated node count in Idle state: {0}", nodeCounts.Dedicated.Idle);
    Console.WriteLine("Dedicated node count in Offline state: {0}", nodeCounts.Dedicated.Offline);

}

You can use a similar pattern for REST and other supported languages to get node counts for pools.
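
One way to act on node counts is to check whether a pool has enough nodes for the work at hand and how many nodes are in states that cannot run tasks. The sketch below works on plain integers. The class PoolHealth, its methods, and the choice of StartTaskFailed, Unusable, and Unknown as the "problem" states are assumptions made for illustration, based on the state descriptions above; they are not part of the Batch SDK.

```csharp
using System;

// Hypothetical helper for illustration only; it is not part of the Batch SDK.
public static class PoolHealth
{
    // Nodes that are either free to accept tasks (idle) or already running tasks.
    public static int WorkingNodes(int idle, int running)
    {
        return idle + running;
    }

    // True when the pool has at least the required number of working nodes.
    public static bool HasEnoughNodes(int idle, int running, int required)
    {
        return WorkingNodes(idle, running) >= required;
    }

    // Nodes in states that, per the descriptions above, cannot run tasks
    // because of a failure: start task failed, unusable, or unknown.
    public static int ProblemNodes(int startTaskFailed, int unusable, int unknown)
    {
        return startTaskFailed + unusable + unknown;
    }

    // True when the share of problem nodes exceeds the given threshold (0.0 to 1.0).
    public static bool NeedsAttention(int total, int problemNodes, double threshold)
    {
        return total > 0 && (double)problemNodes / total > threshold;
    }
}
```

The inputs would come from the per-state counts of a pool, such as the Idle and Offline values printed in the snippets above. Which states count as a problem, and what threshold is acceptable, depends on the workload.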

Next steps