Job and task error checking

Various errors can occur when you add jobs and tasks. Detecting failures for these operations is straightforward, because the API, CLI, or UI returns any failure immediately. However, failures can also happen later, when jobs and tasks are scheduled and run.

This article covers the errors that can occur after jobs and tasks are submitted. It lists and explains the errors that need to be checked and handled.

Jobs

A job is a grouping of one or more tasks. The tasks specify the command lines to be run.

When you add a job, you can specify the following parameters, which influence how the job can fail:

  • Job constraints
    • The maxWallClockTime property can optionally be specified to set the maximum amount of time a job can be active or running. If this time is exceeded, the job is terminated and the terminateReason property is set in the executionInfo for the job.
  • Job preparation task
    • If specified, a job preparation task is run the first time a task is run for a job on a node. The job preparation task can fail, in which case the task is not run and the job does not complete.
  • Job release task
    • A job release task can only be specified if a job preparation task is configured. When a job is being terminated, the job release task is run on each of the pool nodes where a job preparation task was run. A job release task can fail, but the job still moves to the completed state.

Job properties

Check the following job properties for errors:

  • executionInfo:
    • The terminateReason property can have a value indicating that the maxWallClockTime specified in the job constraints was exceeded and the job was therefore terminated. It can also indicate that a task failed, if the job's onTaskFailure property was set appropriately.
    • The schedulingError property is set if there has been a scheduling error.
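The check described above could be sketched as follows. This is a minimal illustration, assuming the job has already been fetched and is available as a plain dictionary shaped like the REST JSON response. The function name job_errors is hypothetical, and the terminateReason values MaxWallClockTimeExpiry and TaskFailed are assumptions that should be confirmed against the Batch REST reference.

```python
from typing import Any


def job_errors(job: dict[str, Any]) -> list[str]:
    """Collect human-readable error notes from a job's executionInfo."""
    info = job.get("executionInfo") or {}
    notes: list[str] = []

    # Assumed terminateReason values; confirm them against the Batch REST reference.
    reason = info.get("terminateReason")
    if reason == "MaxWallClockTimeExpiry":
        notes.append("job exceeded maxWallClockTime")
    elif reason == "TaskFailed":
        notes.append("job terminated because a task failed")

    # schedulingError is only present when scheduling went wrong.
    sched = info.get("schedulingError")
    if sched:
        notes.append(f"scheduling error {sched.get('code')}: {sched.get('message')}")
    return notes
```

An empty list means neither property reported a problem. Other terminateReason values are ignored here on purpose, since they do not necessarily indicate an error.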

Job preparation tasks

If a job preparation task is specified for a job, an instance of that task is run the first time a task for the job is run on a node. The job preparation task configured on the job can be thought of as a task template, with multiple job preparation task instances being run, up to the number of nodes in the pool.

Check the job preparation task instances to determine whether there were errors:

  • When a job preparation task is run, the task that triggered it moves to the preparing state. If the job preparation task then fails, the triggering task reverts to the active state and is not run.
  • All the instances of the job preparation task that have been run can be obtained from the job using the List Preparation and Release Task Status API. As with any task, execution information is available in properties such as failureInfo, exitCode, and result.
  • If job preparation tasks fail, the triggering job tasks are not run, and the job does not complete and remains stuck. The pool may go unutilized if there are no other jobs with tasks that can be scheduled.
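As a rough sketch of that check, the function below scans the entries returned by the List Preparation and Release Task Status API for failed job preparation task instances. It assumes the response has already been retrieved and parsed into a list of dictionaries. The function name failed_preparation_nodes is hypothetical, and the field names nodeId and jobPreparationTaskExecutionInfo are assumptions about the REST response shape.

```python
def failed_preparation_nodes(statuses):
    """Return (nodeId, exitCode, failure message) for each failed prep task instance."""
    failed = []
    for entry in statuses:
        # Assumed field name for the per-node preparation task execution details.
        info = entry.get("jobPreparationTaskExecutionInfo")
        if not info:
            continue  # no preparation task information for this node
        if info.get("result") == "failure":
            failure = info.get("failureInfo") or {}
            failed.append(
                (entry.get("nodeId"), info.get("exitCode"), failure.get("message"))
            )
    return failed
```

A non-empty result identifies the nodes on which triggering tasks will not run, which is a signal that the job may be stuck.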

Job release tasks

If a job release task is specified for a job, then when the job is being terminated, an instance of the job release task is run on each of the pool nodes where a job preparation task was run. Check the job release task instances to determine whether there were errors:

  • All the instances of the job release task being run can be obtained from the job using the List Preparation and Release Task Status API. As with any task, execution information is available in properties such as failureInfo, exitCode, and result.
  • If one or more job release tasks fail, the job is still terminated and moves to the completed state.

Tasks

Job tasks can fail for multiple reasons:

  • The task command line fails, returning a non-zero exit code.
  • resourceFiles are specified for the task, but one or more files failed to download.
  • outputFiles are specified for the task, but one or more files failed to upload.
  • The task ran longer than the maximum elapsed time specified by the maxWallClockTime property in the task constraints.

In all cases, check the following properties for errors and for information about the errors:

  • The task's executionInfo property contains multiple properties that provide information about an error. result indicates whether the task failed for any reason, and exitCode and failureInfo provide more information about the failure.
  • The task always moves to the completed state, whether it succeeded or failed.
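Because a completed state alone says nothing about success, the result property has to be inspected as well. The sketch below illustrates this, assuming the task is available as a dictionary shaped like the REST JSON response. The function name summarize_task is hypothetical, and the result values success and failure and the failureInfo fields category and code are assumptions about that response shape.

```python
def summarize_task(task):
    """Describe a task's outcome using state, result, exitCode, and failureInfo."""
    if task.get("state") != "completed":
        return "not finished"

    info = task.get("executionInfo") or {}
    # A completed task may still have failed, so result must be checked.
    if info.get("result") == "success":
        return "succeeded"

    failure = info.get("failureInfo") or {}
    return "failed (exit code {}, category {}, code {})".format(
        info.get("exitCode"), failure.get("category"), failure.get("code")
    )
```

The sketch treats anything other than an explicit success as a failure, which is a deliberately conservative choice for error checking.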

Consider the impact of task failures on the job and on any task dependencies. The exitConditions property can be specified for a task to configure an action for dependencies and for the job.

  • For dependencies, DependencyAction controls whether the tasks dependent on the failed task are blocked or are run.
  • For the job, JobAction controls whether the failed task leads to the job being disabled, terminated, or left unchanged.
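To make the shape of such a configuration concrete, here is a minimal sketch of a task request body with exitConditions, written as a Python dictionary and printed as JSON. The property names and the values none, terminate, satisfy, and block are assumptions based on the REST-style representation. The task id, the command line, and the exit code 3 are hypothetical. The job action is expected to take effect only if the job's onTaskFailure property is set appropriately.

```python
import json

# Assumed REST-style exitConditions layout; verify names and values before use.
exit_conditions = {
    "exitCodes": [
        {
            # Hypothetical exit code that the application treats as a soft failure.
            "code": 3,
            "exitOptions": {"jobAction": "none", "dependencyAction": "satisfy"},
        }
    ],
    # Applied to failures not matched by a more specific entry.
    "default": {"jobAction": "terminate", "dependencyAction": "block"},
}

task_body = {
    "id": "task-1",  # hypothetical task id
    "commandLine": "/bin/bash -c 'exit 3'",  # hypothetical command line
    "exitConditions": exit_conditions,
}

print(json.dumps(task_body, indent=2))
```

In this sketch, exit code 3 lets dependent tasks run and leaves the job unchanged, while any other failure blocks dependent tasks and terminates the job.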

Task command line failures

When the task command line is run, output is written to stderr.txt and stdout.txt. In addition, the application may write to application-specific log files.

If the pool node on which a task ran still exists, the log files can be obtained and viewed. For example, the Azure portal can list and display the log files for a task or a pool node. Multiple APIs, such as Get From Task, also allow task files to be listed and obtained.

Pools and pool nodes are frequently ephemeral, with nodes being continuously added and deleted, so it is recommended that log files are persisted. Task output files are a convenient way to save log files to Azure Storage.
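One way to persist stdout.txt and stderr.txt is an outputFiles entry on the task. The sketch below builds such an entry as a dictionary in the REST-style shape. The function name log_output_files is hypothetical, the container URL is a placeholder that the caller must supply (typically including a SAS token with write access), and the property names and the taskcompletion upload condition are assumptions to verify against the Batch REST reference.

```python
def log_output_files(container_url, task_id):
    """Build an outputFiles list that uploads the task's std*.txt logs."""
    return [
        {
            # Assumes stdout.txt and stderr.txt sit one level above the
            # task working directory, so the pattern starts with "../".
            "filePattern": "../std*.txt",
            "destination": {
                "container": {
                    "containerUrl": container_url,
                    # Virtual directory prefix, so logs are grouped per task.
                    "path": f"{task_id}/logs",
                }
            },
            # Upload whether the task succeeded or failed.
            "uploadOptions": {"uploadCondition": "taskcompletion"},
        }
    ]
```

Uploading on task completion rather than only on success matters here, because the logs are most useful when the task has failed.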

Output file failures

On every file upload, Batch writes two log files to the compute node: fileuploadout.txt and fileuploaderr.txt. You can examine these log files to learn more about a specific failure. If the file upload was never attempted, for example because the task itself couldn't run, these log files do not exist.

Next steps

Check that your application implements comprehensive error checking. Promptly detecting and diagnosing issues can be critical.