Job and task error checking

Various errors can occur when you add jobs and tasks. Detecting failures for these operations is straightforward because any failure is returned immediately by the API, CLI, or UI. However, failures can also happen later, when jobs and tasks are scheduled and run.

This article covers the errors that can occur after jobs and tasks are submitted, and how to check for and handle them.

Jobs

A job is a grouping of one or more tasks; the tasks specify the command lines to be run.

When you add a job, you can specify the following parameters, which influence how the job can fail:

  • Job constraints
    • The maxWallClockTime property can optionally be specified to set the maximum amount of time a job can be active or running. If this limit is exceeded, the job is terminated and the terminateReason property is set in the job's executionInfo.
  • Job preparation task
    • If specified, a job preparation task is run the first time a task for the job is run on a node. The job preparation task can fail, in which case the task is not run and the job does not complete.
  • Job release task
    • A job release task can only be specified if a job preparation task is configured. When a job is being terminated, the job release task is run on each of the pool nodes where a job preparation task was run. A job release task can fail, but the job still moves to the completed state.

Job properties

Check the following job properties for errors:

  • executionInfo:
    • The terminateReason property can indicate that the maxWallClockTime specified in the job constraints was exceeded and that the job was therefore terminated. It can also indicate that a task failed, if the job's onTaskFailure property was set appropriately.
    • The schedulingError property is set if there has been a scheduling error.
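The checks above can be sketched in Python. This is a minimal illustration, not the Batch SDK: it assumes the job has already been fetched and is available as a plain dict shaped like the REST response, and the terminateReason strings "MaxWallClockTimeExpiry" and "TaskFailed" are assumed values that should be confirmed against the API reference. The function name job_error_summary is hypothetical.

```python
def job_error_summary(job):
    """Return a list of problems found in a job's executionInfo (dict form)."""
    info = job.get("executionInfo") or {}
    problems = []

    # Assumed terminateReason values; verify them against the API reference.
    reason = info.get("terminateReason")
    if reason == "MaxWallClockTimeExpiry":
        problems.append("job exceeded maxWallClockTime")
    elif reason == "TaskFailed":
        problems.append("job terminated because a task failed")

    # schedulingError is only present when scheduling went wrong.
    sched = info.get("schedulingError")
    if sched:
        problems.append(
            "scheduling error: {}: {}".format(sched.get("code"), sched.get("message"))
        )
    return problems
```

An empty list means none of the conditions described above were detected; it does not by itself prove that every task in the job succeeded.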

Job preparation tasks

If a job preparation task is specified for a job, an instance of that task is run the first time a task for the job is run on a node. The job preparation task configured on the job can be thought of as a task template: multiple job preparation task instances are run, up to the number of nodes in the pool.

Check the job preparation task instances to determine whether there were errors:

  • When a job preparation task is run, the task that triggered it moves to the preparing state. If the job preparation task then fails, the triggering task reverts to the active state and is not run.
  • All instances of the job preparation task that have been run can be obtained from the job using the List Preparation and Release Task Status API. As with any task, execution information is available through properties such as failureInfo, exitCode, and result.
  • If job preparation tasks fail, the triggering job tasks are not run, and the job does not complete and remains stuck. The pool may go unutilized if there are no other jobs with tasks that can be scheduled.
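As a rough sketch of how the preparation task instances might be scanned for failures, the following Python function works on a list of dicts shaped like the entries returned by the List Preparation and Release Task Status API. The key names (nodeId, jobPreparationTaskExecutionInfo) and the lowercase "failure" result value are assumptions based on the REST response shape, and failed_preparation_instances is a hypothetical helper name.

```python
def failed_preparation_instances(statuses):
    """Return one summary dict per node whose job preparation task failed."""
    failed = []
    for status in statuses:
        info = status.get("jobPreparationTaskExecutionInfo")
        if not info:
            # No preparation task has run on this node.
            continue
        exit_code = info.get("exitCode")
        if info.get("result") == "failure" or exit_code not in (None, 0):
            failed.append(
                {
                    "nodeId": status.get("nodeId"),
                    "exitCode": exit_code,
                    "failureInfo": info.get("failureInfo"),
                }
            )
    return failed
```

The same loop could be pointed at the release task information instead by reading a different key from each entry.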

Job release tasks

If a job release task is specified for a job, then when the job is being terminated, an instance of the job release task is run on each pool node where a job preparation task was run. Check the job release task instances to determine whether there were errors:

  • All instances of the job release task that are being run can be obtained from the job using the List Preparation and Release Task Status API. As with any task, execution information is available through properties such as failureInfo, exitCode, and result.
  • If one or more job release tasks fail, the job is still terminated and moves to the completed state.

Tasks

Job tasks can fail for multiple reasons:

  • The task command line fails, returning a non-zero exit code.
  • resourceFiles are specified for the task, but a failure meant that one or more files didn't download.
  • outputFiles are specified for the task, but a failure meant that one or more files didn't upload.
  • The task ran longer than the time allowed by the maxWallClockTime property in the task constraints.

In all cases, check the following properties for errors and for information about the errors:

  • The task's executionInfo property contains multiple properties that provide information about an error. result indicates whether the task failed for any reason, and exitCode and failureInfo provide more information about the failure.
  • The task always moves to the completed state, regardless of whether it succeeded or failed, so the state alone does not indicate success.
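Because a completed task may have either succeeded or failed, a check along the following lines can help. This is a minimal sketch that assumes the task is available as a dict shaped like the REST response, with lowercase "success" and "failure" as the result values; describe_task_outcome is a hypothetical name.

```python
def describe_task_outcome(task):
    """Summarize a task's outcome from its state and executionInfo."""
    if task.get("state") != "completed":
        return "pending"

    info = task.get("executionInfo") or {}
    result = info.get("result")
    if result == "success":
        return "succeeded"
    if result != "failure":
        return "unknown"

    # On failure, exitCode and failureInfo carry the details.
    failure = info.get("failureInfo") or {}
    return "failed (exitCode={}, category={}, code={})".format(
        info.get("exitCode"), failure.get("category"), failure.get("code")
    )
```

The key point is that the function never treats the completed state as success; it always reads result first.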

Consider the impact of task failures on the job and on any task dependencies. The exitConditions property can be specified for a task to configure an action for dependencies and for the job.

  • For dependencies, DependencyAction controls whether the tasks that depend on the failed task are blocked or run.
  • For the job, JobAction controls whether the failed task leads to the job being disabled, terminated, or left unchanged.
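To make the two actions concrete, the sketch below shows an exitConditions structure as a Python dict and a simplified lookup that picks the options for a given exit code. The field names (exitCodes, exitCodeRanges, default, exitOptions, jobAction, dependencyAction) follow the REST request shape as an assumption, the lookup is an illustrative model rather than the service's actual resolution logic, and it ignores other conditions such as file upload errors. resolve_exit_options is a hypothetical helper.

```python
# Example configuration: exit code 2 terminates the job, codes 10-19 let
# dependent tasks proceed, and everything else blocks dependents.
EXIT_CONDITIONS = {
    "exitCodes": [
        {"code": 2, "exitOptions": {"jobAction": "terminate", "dependencyAction": "block"}}
    ],
    "exitCodeRanges": [
        {"start": 10, "end": 19,
         "exitOptions": {"jobAction": "none", "dependencyAction": "satisfy"}}
    ],
    "default": {"jobAction": "none", "dependencyAction": "block"},
}


def resolve_exit_options(exit_conditions, exit_code):
    """Pick the exitOptions for an exit code: exact match, then range, then default."""
    for entry in exit_conditions.get("exitCodes", []):
        if entry["code"] == exit_code:
            return entry["exitOptions"]
    for entry in exit_conditions.get("exitCodeRanges", []):
        if entry["start"] <= exit_code <= entry["end"]:
            return entry["exitOptions"]
    # None means nothing was configured; the service's own defaults would apply.
    return exit_conditions.get("default")
```

For example, resolve_exit_options(EXIT_CONDITIONS, 12) selects the range entry, so dependent tasks would be allowed to run and the job would be left unchanged.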

Task command line failures

When the task command line is run, output is written to stderr.txt and stdout.txt. In addition, the application may write to application-specific log files.

If the pool node on which a task ran still exists, the log files can be obtained and viewed. For example, the Azure portal lists log files for a task or a pool node and lets you view them. Multiple APIs also allow task files to be listed and obtained, such as Get From Task.

Because pools and pool nodes are frequently ephemeral, with nodes being continuously added and deleted, we recommend saving log files. Task output files are a convenient way to save log files to Azure Storage.
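A task's outputFiles setting could be built along these lines to persist stdout.txt and stderr.txt whether the task succeeds or fails. This is a sketch under assumptions: the field names and the "taskcompletion" upload condition follow the REST request shape, the "../std*.txt" pattern assumes the standard log files sit one level above the task working directory, and log_output_files, the container URL, and the path layout are hypothetical.

```python
def log_output_files(container_url, task_id):
    """Build an outputFiles list that uploads the task's stdout/stderr logs."""
    return [
        {
            # Assumed location of stdout.txt and stderr.txt relative to
            # the task working directory.
            "filePattern": "../std*.txt",
            "destination": {
                "container": {
                    "containerUrl": container_url,
                    "path": "{}/logs".format(task_id),
                }
            },
            # Upload on success and on failure, so logs from failed
            # tasks are kept too.
            "uploadOptions": {"uploadCondition": "taskcompletion"},
        }
    ]
```

The container URL passed in must grant write access to the container; how that access is granted is outside the scope of this sketch.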

The command lines executed by tasks on compute nodes do not run under a shell, so they can't natively take advantage of shell features such as environment variable expansion. To use such features, you must invoke the shell in the command line.
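One way to invoke the shell explicitly is to wrap the command before submitting the task. The helper below is a minimal sketch: wrap_in_shell is a hypothetical name, it assumes /bin/sh on Linux nodes and cmd on Windows nodes, and the Windows branch does no escaping, so commands that themselves contain double quotes would need extra care.

```python
import shlex


def wrap_in_shell(command, windows=False):
    """Wrap a command so that it runs under a shell on the compute node."""
    if windows:
        # Simplistic quoting; assumes the command contains no double quotes.
        return 'cmd /c "{}"'.format(command)
    # shlex.quote keeps the whole command as a single argument to sh -c.
    return "/bin/sh -c {}".format(shlex.quote(command))
```

With this wrapper, a command such as echo $AZ_BATCH_TASK_ID is passed to the shell intact, and the shell performs the variable expansion on the node.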

Output file failures

On every file upload, Batch writes two log files to the compute node: fileuploadout.txt and fileuploaderr.txt. You can examine these log files to learn more about a specific failure. If the file upload was never attempted, for example because the task itself couldn't run, these log files will not exist.

Next steps

  • Check that your application implements comprehensive error checking; it can be critical to detect and diagnose issues promptly.
  • Learn more about jobs and tasks.