Monitor Batch solutions by counting tasks and nodes by state

To monitor and manage large-scale Azure Batch solutions, you may need to determine counts of resources in various states. Azure Batch provides efficient operations to get counts for Batch tasks and compute nodes. You can use these operations instead of potentially time-consuming list queries that return detailed information about large collections of tasks or nodes.

  • Get Task Counts gets an aggregate count of active, running, and completed tasks in a job, and of tasks that succeeded or failed. By counting tasks in each state, you can more easily display job progress to a user, or detect unexpected delays or failures that may affect the job.

  • List Pool Node Counts gets the number of dedicated and Spot (low-priority) compute nodes in each pool that are in various states: creating, idle, offline, preempted, rebooting, reimaging, starting, and others. By counting nodes in each state, you can determine when you have adequate compute resources to run your jobs, and identify potential issues with your pools.

The numbers returned by these operations may not always be up to date. If you need to be sure that a count is accurate, use a list query to count these resources. List queries also let you get information about other Batch resources, such as applications. For more information about applying filters to list queries, see Create queries to list Batch resources efficiently.
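As a rough sketch of the list-query alternative described above, the following counts the tasks in a single state by enumerating a filtered ListTasks query. It assumes an authenticated BatchClient from the Microsoft.Azure.Batch library; the class and method names (TaskListCounter, StateFilter, CountTasksInState) are hypothetical and chosen only for illustration.

```csharp
using System.Linq;
using Microsoft.Azure.Batch;

public static class TaskListCounter
{
    // Builds an OData filter that matches tasks in one state,
    // for example "active", "preparing", "running", or "completed".
    public static string StateFilter(string state)
    {
        return $"state eq '{state}'";
    }

    // Counts tasks in one state by enumerating a filtered list query.
    // Assumes batchClient is already authenticated and jobId names an existing job.
    public static int CountTasksInState(BatchClient batchClient, string jobId, string state)
    {
        // Filter on the server and select only the id to keep each returned page small.
        var detailLevel = new ODATADetailLevel(
            filterClause: StateFilter(state),
            selectClause: "id");

        // Enumerating the result pages through every matching task.
        return batchClient.JobOperations.ListTasks(jobId, detailLevel).Count();
    }
}
```

For example, TaskListCounter.CountTasksInState(batchClient, "job-1", "active") would return the number of active tasks at the time the query ran. Because every matching task is paged back to the client, expect this to be slower than Get Task Counts on large jobs.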

Task state counts

The Get Task Counts operation counts tasks by the following states:

  • Active: A task that is queued and able to run, but is not currently assigned to a compute node. A task is also active if it is dependent on a parent task that has not yet completed.
  • Running: A task that has been assigned to a compute node, but has not yet completed. A task is counted as running when its state is either preparing or running, as indicated by the Get information about a task operation.
  • Completed: A task that is no longer eligible to run, because it either finished successfully, or finished unsuccessfully and also exhausted its retry limit.
  • Succeeded: A task where the result of task execution is success. Batch determines whether a task has succeeded or failed by checking the TaskExecutionResult property of the executionInfo property.
  • Failed: A task where the result of task execution is failure.

The following .NET code sample shows how to retrieve task counts by state.

// batchClient is an authenticated BatchClient; "job-1" is the ID of an existing job.
var taskCounts = await batchClient.JobOperations.GetJobTaskCountsAsync("job-1");

Console.WriteLine("Task count in active state: {0}", taskCounts.Active);
Console.WriteLine("Task count in preparing or running state: {0}", taskCounts.Running);
Console.WriteLine("Task count in completed state: {0}", taskCounts.Completed);
Console.WriteLine("Succeeded task count: {0}", taskCounts.Succeeded);
Console.WriteLine("Failed task count: {0}", taskCounts.Failed);

You can use a similar pattern for REST and other supported languages to get task counts for a job.
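To illustrate how these counts can drive a progress display, here is a minimal sketch that turns them into a percentage. The JobProgress class and PercentComplete method are hypothetical names, and the numbers in Main are made-up stand-ins for taskCounts.Active, taskCounts.Running, and taskCounts.Completed. Succeeded and failed tasks are already included in the completed count, so they are not added to the total again.

```csharp
using System;

public static class JobProgress
{
    // Percentage of tasks that have reached the completed state.
    // Every task is counted in exactly one of active, running, or completed.
    public static double PercentComplete(int active, int running, int completed)
    {
        int total = active + running + completed;

        // A job with no tasks yet is reported as 0% rather than dividing by zero.
        return total == 0 ? 0.0 : 100.0 * completed / total;
    }

    public static void Main()
    {
        // Hypothetical counts used only for illustration.
        double percent = PercentComplete(active: 20, running: 5, completed: 75);
        Console.WriteLine("Job progress: {0:F1}% complete", percent);
    }
}
```

Because the counts may lag slightly behind the actual task states, a figure like this is better suited to progress display than to deciding that a job has finished.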

Node state counts

The List Pool Node Counts operation counts compute nodes by the following states in each pool. Separate aggregate counts are provided for dedicated nodes and Spot nodes in each pool.

  • Creating: An Azure-allocated VM that has not yet started to join a pool.
  • Idle: An available compute node that is not currently running a task.
  • LeavingPool: A node that is leaving the pool, either because the user explicitly removed it or because the pool is resizing or autoscaling down.
  • Offline: A node that Batch cannot use to schedule new tasks.
  • Preempted: A Spot node that was removed from the pool because Azure reclaimed the VM. A preempted node can be reinitialized when replacement Spot VM capacity is available.
  • Rebooting: A node that is restarting.
  • Reimaging: A node on which the operating system is being reinstalled.
  • Running: A node that is running one or more tasks (other than the start task).
  • Starting: A node on which the Batch service is starting.
  • StartTaskFailed: A node on which the start task failed and exhausted all retries, and on which waitForSuccess is set on the start task. The node is not usable for running tasks.
  • Unknown: A node that lost contact with the Batch service and whose state isn't known.
  • Unusable: A node that can't be used for task execution because of errors.
  • WaitingForStartTask: A node on which the start task started running, but waitForSuccess is set and the start task has not completed.

The following C# snippet shows how to list node counts for all pools in the current account:

foreach (var nodeCounts in batchClient.PoolOperations.ListPoolNodeCounts())
{
    Console.WriteLine("Pool Id: {0}", nodeCounts.PoolId);

    Console.WriteLine("Total dedicated node count: {0}", nodeCounts.Dedicated.Total);

    // Get dedicated node counts in Idle and Offline states; you can get additional states.
    Console.WriteLine("Dedicated node count in Idle state: {0}", nodeCounts.Dedicated.Idle);
    Console.WriteLine("Dedicated node count in Offline state: {0}", nodeCounts.Dedicated.Offline);

}

The following C# snippet shows how to list node counts for a given pool in the current account.

foreach (var nodeCounts in batchClient.PoolOperations.ListPoolNodeCounts(new ODATADetailLevel(filterClause: "poolId eq 'testpool'")))
{
    Console.WriteLine("Pool Id: {0}", nodeCounts.PoolId);

    Console.WriteLine("Total dedicated node count: {0}", nodeCounts.Dedicated.Total);

    // Get dedicated node counts in Idle and Offline states; you can get additional states.
    Console.WriteLine("Dedicated node count in Idle state: {0}", nodeCounts.Dedicated.Idle);
    Console.WriteLine("Dedicated node count in Offline state: {0}", nodeCounts.Dedicated.Offline);

}

You can use a similar pattern for REST and other supported languages to get node counts for pools.
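One way to act on these counts is to summarize them into a simple health signal per pool. The sketch below is an assumption about how you might do that, not a Batch feature: the PoolHealth class and its methods are hypothetical, and treating StartTaskFailed, Unusable, and Unknown as the states that need attention is a judgment based on the state descriptions above.

```csharp
using System;

public static class PoolHealth
{
    // Sums the nodes in states that cannot run tasks and are unlikely to
    // recover on their own. The choice of these three states is an assumption.
    public static int NodesNeedingAttention(int startTaskFailed, int unusable, int unknown)
    {
        return startTaskFailed + unusable + unknown;
    }

    // True when the nodes that are idle or already running tasks
    // meet the minimum number of nodes the caller asks for.
    public static bool HasEnoughUsableNodes(int idle, int running, int requiredNodes)
    {
        return idle + running >= requiredNodes;
    }

    public static void Main()
    {
        // Hypothetical values standing in for nodeCounts.Dedicated.StartTaskFailed,
        // .Unusable, .Unknown, .Idle, and .Running.
        Console.WriteLine("Nodes needing attention: {0}", NodesNeedingAttention(1, 0, 2));
        Console.WriteLine("Enough usable nodes: {0}", HasEnoughUsableNodes(idle: 3, running: 5, requiredNodes: 8));
    }
}
```

Inside either loop above, you could pass the values from nodeCounts.Dedicated directly, for example PoolHealth.NodesNeedingAttention(nodeCounts.Dedicated.StartTaskFailed, nodeCounts.Dedicated.Unusable, nodeCounts.Dedicated.Unknown).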

Next steps