Error handling and detection in Azure Batch

At times, you may need to handle both task and application failures within your Batch solution. This article describes the types of errors and how to resolve them.

Error codes

General types of errors include:

  • Networking failures, where a request never reached Batch or the Batch response didn't reach the client in time.
  • Internal server errors (standard 5xx status code HTTP response).
  • Throttling-related errors, such as 429 or 503 status code HTTP responses with the Retry-After header.
  • 4xx errors such as AlreadyExists and InvalidOperation. These mean that the resource is not in the correct state for the requested state transition.
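
The categories above can be sketched as a small classifier. This is only an illustration: the function name classify_response and the category labels are hypothetical and not part of any Batch SDK, and the sketch assumes the caller passes None as the status when no response arrived.

```python
def classify_response(status, headers=None):
    """Map an HTTP status (or None for 'no response') to a coarse category.

    Hypothetical helper for illustration; not part of the Batch SDKs.
    """
    # Header names are case-insensitive, so normalize them first.
    headers = {k.lower(): v for k, v in (headers or {}).items()}
    if status is None:
        return "network"      # request or response was lost
    if status in (429, 503) and "retry-after" in headers:
        return "throttled"    # server asked the client to slow down
    if 500 <= status <= 599:
        return "server"       # internal server error
    if 400 <= status <= 499:
        return "client"       # e.g. AlreadyExists, InvalidOperation
    return "ok"               # not an error covered by the list


print(classify_response(429, {"Retry-After": "5"}))
print(classify_response(409))
```

Network and throttling failures are usually worth retrying after a wait, whereas a 4xx state error generally needs the request or the resource state to change first.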

For detailed information about specific error codes, including error codes for REST API, Batch service, and job task/scheduling, see Batch Status and Error Codes.

Application failures

During execution, an application might produce diagnostic output that you can use to troubleshoot issues. As described in Files and directories, the Batch service writes standard output and standard error output to the stdout.txt and stderr.txt files in the task directory on the compute node.

You can use the Azure portal or one of the Batch SDKs to download these files. For example, you can retrieve these and other files for troubleshooting by using ComputeNode.GetNodeFile and CloudTask.GetNodeFile in the Batch .NET library.

Task errors

Task errors fall into several categories.

Pre-processing errors

If a task fails to start, a pre-processing error is set for the task.

Pre-processing errors can occur if the task's resource files have moved, the storage account is no longer available, or another issue prevented the files from being copied to the node.

File upload errors

If files that are specified for a task fail to upload for any reason, a file upload error is set for the task.

File upload errors can occur if the SAS supplied for accessing Azure Storage is invalid or does not provide write permissions, if the storage account is no longer available, or if another issue prevented the files from being copied from the node.

Application errors

The process that is specified by the task's command line can also fail. The process is deemed to have failed when it returns a nonzero exit code (see Task exit codes in the next section).

For application errors, you can configure Batch to automatically retry the task up to a specified number of times.

Constraint errors

You can set a constraint, maxWallClockTime, that specifies the maximum execution duration for a job or task. This can be useful for terminating tasks that fail to progress.

When the maximum amount of time has been exceeded, the task is marked as completed, but the exit code is set to 0xC000013A and the schedulingError field is marked as { category:"ServerError", code:"TaskEnded" }.

Task exit codes

As mentioned earlier, the Batch service marks a task as failed if the process that is executed by the task returns a nonzero exit code. When a task executes a process, Batch populates the task's exit code property with the return code of the process.

It is important to note that a task's exit code is not determined by the Batch service. It is determined by the process itself or by the operating system on which the process executed.
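
As a rough illustration of reading an exit code, the sketch below treats zero as success and anything else as failure, and prints the value in hexadecimal so that codes such as the 0xC000013A mentioned under constraint errors are easy to recognize. The function names are hypothetical. Whether a given client reports the code as a signed or an unsigned 32-bit integer is an assumption here, so the helper accepts both.

```python
def to_unsigned_32(exit_code):
    """Reinterpret a possibly negative 32-bit exit code as unsigned."""
    return exit_code & 0xFFFFFFFF


def describe_exit_code(exit_code):
    """Summarize a task exit code (hypothetical helper, not an SDK call)."""
    if exit_code is None:
        return "no exit code recorded"
    if exit_code == 0:
        return "succeeded"
    # Nonzero means the process is deemed to have failed.
    return f"failed with exit code {exit_code} (0x{to_unsigned_32(exit_code):08X})"


print(describe_exit_code(0))
print(describe_exit_code(-1073741510))
```

The signed value -1073741510 and the unsigned value 0xC000013A are the same 32-bit pattern, which is why the hexadecimal form is useful when comparing against documented codes.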

Task failures or interruptions

Tasks might occasionally fail or be interrupted. The task application itself might fail, the node on which the task is running might be rebooted, or the node might be removed from the pool during a resize operation (if the pool's deallocation policy is set to remove nodes immediately without waiting for tasks to finish). In all cases, Batch can automatically requeue the task for execution on another node.

An intermittent issue can also cause a task to stop responding or take too long to execute. You can set the maximum execution interval for a task. If the maximum execution interval is exceeded, the Batch service interrupts the task application.

Connect to compute nodes

You can perform additional debugging and troubleshooting by signing in to a compute node remotely. You can use the Azure portal to download a Remote Desktop Protocol (RDP) file for Windows nodes and obtain Secure Shell (SSH) connection information for Linux nodes. You can also do this by using the Batch APIs, such as Batch .NET or Batch Python.

Important

To connect to a node via RDP or SSH, you must first create a user on the node. To do this, you can use the Azure portal, add a user account to a node by using the Batch REST API, call the ComputeNode.CreateComputeNodeUser method in Batch .NET, or call the add_user method in the Batch Python module.

If you need to restrict or disable RDP or SSH access to compute nodes, see Configure or disable remote access to compute nodes in an Azure Batch pool.

Troubleshoot problem nodes

In situations where some of your tasks are failing, your Batch client application or service can examine the metadata of the failed tasks to identify a misbehaving node. Each node in a pool is given a unique ID, and the node on which a task runs is included in the task metadata. After you've identified a problem node, you can take several actions with it:

  • Reboot the node (REST | .NET)

    Restarting the node can sometimes clear up latent issues like stuck or crashed processes. If your pool uses a start task or your job uses a job preparation task, they are executed when the node restarts.

  • Reimage the node (REST | .NET)

    This reinstalls the operating system on the node. As with rebooting a node, start tasks and job preparation tasks are rerun after the node has been reimaged.

  • Remove the node from the pool (REST | .NET)

    Sometimes it is necessary to completely remove the node from the pool.

  • Disable task scheduling on the node (REST | .NET)

    This effectively takes the node offline so that no further tasks are assigned to it, but allows the node to remain running and in the pool. This enables you to investigate the cause of the failures without losing the failed task's data, and without the node causing additional task failures. For example, you can disable task scheduling on the node, then sign in remotely to examine the node's event logs or perform other troubleshooting. After you've finished your investigation, you can bring the node back online by enabling task scheduling (REST | .NET), or perform one of the other actions discussed earlier.

Important

With the actions described above, you can specify how tasks currently running on the node are handled when you perform the action. For example, when you disable task scheduling on a node by using the Batch .NET client library, you can specify a DisableComputeNodeSchedulingOption enum value to specify whether to Terminate running tasks, Requeue them for scheduling on other nodes, or allow running tasks to complete before performing the action (TaskCompletion).
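
Identifying a misbehaving node from task metadata, as described at the start of this section, could look roughly like the following. The sketch assumes the task metadata has already been fetched and flattened into plain dictionaries. The keys node_id and exit_code, the function name find_problem_nodes, and the min_failures threshold are all hypothetical choices for illustration, not names from a Batch SDK.

```python
from collections import Counter


def find_problem_nodes(tasks, min_failures=2):
    """Return (node_id, failure_count) pairs for nodes with repeated failures.

    `tasks` is a list of dicts with hypothetical keys 'node_id' and
    'exit_code'; a nonzero exit code counts as a failure.
    """
    failures = Counter(
        t["node_id"]
        for t in tasks
        if t.get("node_id") and t.get("exit_code") not in (0, None)
    )
    # most_common() orders nodes from most to fewest failures.
    return [(node, n) for node, n in failures.most_common() if n >= min_failures]


tasks = [
    {"id": "t1", "node_id": "node-a", "exit_code": 1},
    {"id": "t2", "node_id": "node-a", "exit_code": 2},
    {"id": "t3", "node_id": "node-b", "exit_code": 0},
    {"id": "t4", "node_id": "node-b", "exit_code": 1},
]
print(find_problem_nodes(tasks))
```

A threshold above one helps avoid flagging a node for a single failure that may have been caused by the task itself rather than the node.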

Retry after errors

The Batch APIs notify you if there is a failure. All operations can be retried, and all include a global retry handler for that purpose. It is best to use this built-in mechanism.

After a failure, you should wait a bit (several seconds between retries) before retrying. If you retry too frequently or too quickly, the retry handler will throttle.
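
The built-in retry handler remains the preferred option. For code paths where it is not available, such as a custom REST client, the wait-then-retry guidance could be sketched as follows. Everything here is hypothetical: TransientError, call_with_retries, and the default delays are illustrative assumptions, not Batch SDK names or recommended values.

```python
import time


class TransientError(Exception):
    """Hypothetical error type for failures that are safe to retry."""

    def __init__(self, message, retry_after=None):
        super().__init__(message)
        self.retry_after = retry_after  # seconds suggested by the server, if any


def call_with_retries(operation, max_attempts=4, base_delay=2.0,
                      max_delay=30.0, sleep=time.sleep):
    """Call operation(), waiting several seconds between failed attempts."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError as err:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            # Double the wait after each failure, capped at max_delay.
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Never wait less than the server asked for.
            if err.retry_after is not None:
                delay = max(delay, err.retry_after)
            sleep(delay)
```

Passing the sleep function as a parameter keeps the helper easy to test without real waiting. For example, call_with_retries(op, sleep=recorded.append) collects the delays in a list instead of pausing.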

Next steps