Develop large-scale parallel compute solutions with Batch

This overview of the core components of the Azure Batch service covers the primary service features and resources that Batch developers can use to build large-scale parallel compute solutions.

Whether you're developing a distributed computational application or service that issues direct REST API calls or you're using one of the Batch SDKs, you'll use many of the resources and features discussed in this article.

Tip

For a higher-level introduction to the Batch service, see Basics of Azure Batch.

Batch service workflow

The following high-level workflow is typical of nearly all applications and services that use the Batch service for processing parallel workloads:

  1. Upload the data files that you want to process to an Azure Storage account. Batch includes built-in support for accessing Azure Blob storage, and your tasks can download these files to compute nodes when the tasks run.
  2. Upload the application files that your tasks will run. These files can be binaries or scripts and their dependencies, and are executed by the tasks in your jobs. Your tasks can download these files from your Storage account, or you can use the application packages feature of Batch for application management and deployment.
  3. Create a pool of compute nodes. When you create a pool, you specify the number of compute nodes for the pool, their size, and the operating system. When each task in your job runs, it's assigned to execute on one of the nodes in your pool.
  4. Create a job. A job manages a collection of tasks. You associate each job with a specific pool where that job's tasks will run.
  5. Add tasks to the job. Each task runs the application or script that you uploaded to process the data files it downloads from your Storage account. As each task completes, it can upload its output to Azure Storage.
  6. Monitor job progress and retrieve the task output from Azure Storage.

The following sections discuss these and the other Batch resources that enable your distributed computational scenario.
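Steps 3 to 6 of the workflow correspond to a short sequence of Batch REST operations (steps 1 and 2, and the retrieval of output, go through Azure Storage instead). The sketch below only lists those calls in order and sends nothing. The account URL, pool ID, and job ID are hypothetical placeholders, and the resource paths are written as I understand the Batch REST API, so verify them against the API reference before relying on them.

```python
# Sketch: order of the Batch REST calls behind steps 3 to 6 of the workflow.
# The account URL, pool ID, and job ID are hypothetical placeholders.

def workflow_calls(account_url, pool_id, job_id):
    """Return (method, url, purpose) tuples in the order a client would issue them."""
    return [
        ("POST", f"{account_url}/pools", f"create pool '{pool_id}'"),
        ("POST", f"{account_url}/jobs", f"create job '{job_id}' bound to the pool"),
        ("POST", f"{account_url}/jobs/{job_id}/tasks", "add a task to the job"),
        ("GET", f"{account_url}/jobs/{job_id}/tasks", "list tasks to monitor progress"),
    ]

calls = workflow_calls("https://myaccount.myregion.batch.azure.com", "pool1", "job1")
for method, url, purpose in calls:
    print(f"{method:4} {url}  # {purpose}")
```

A real client would also append an `api-version` query parameter and authenticate each request; both are omitted here to keep the ordering visible.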

Note

You need a Batch account to use the Batch service. Most Batch solutions also use an associated Azure Storage account for file storage and retrieval.

Batch service resources

Some of the following resources (accounts, compute nodes, pools, jobs, and tasks) are required by all solutions that use the Batch service. Others, like job schedules and application packages, are helpful but optional features.

Account

A Batch account is a uniquely identified entity within the Batch service. All processing is associated with a Batch account.

You can create an Azure Batch account using the Azure portal or programmatically, such as with the Batch Management .NET library. When creating the account, you can associate an Azure Storage account for storing job-related input and output data or applications.

You can run multiple Batch workloads in a single Batch account, or distribute your workloads among Batch accounts that are in the same subscription but in different Azure regions.

Note

When creating a Batch account, you can choose between two pool allocation modes: user subscription and Batch service. For most cases, you should use the default Batch service mode, in which pools are allocated behind the scenes in Azure-managed subscriptions. In the alternative user subscription mode, Batch VMs and other resources are created directly in your subscription when a pool is created. User subscription mode is required if you want to create Batch pools using Azure Reserved VM Instances. To create a Batch account in user subscription mode, you must also register your subscription with Azure Batch and associate the account with an Azure Key Vault.

Azure Storage account

Most Batch solutions use Azure Storage for storing resource files and output files. For example, your Batch tasks (including standard tasks, start tasks, job preparation tasks, and job release tasks) typically specify resource files that reside in a storage account.

Batch supports the following types of Azure Storage accounts:

  • General-purpose v2 (GPv2) accounts
  • General-purpose v1 (GPv1) accounts
  • Blob storage accounts (currently supported for pools in the Virtual Machine configuration)

For more information about storage accounts, see Azure storage account overview.

You can associate a storage account with your Batch account when you create the Batch account, or later. Consider your cost and performance requirements when choosing a storage account. For example, the GPv2 and Blob storage account options support greater capacity and scalability limits than GPv1. (Contact Azure Support to request an increase in a storage limit.) These account options can improve the performance of Batch solutions that contain a large number of parallel tasks that read from or write to the storage account.

Compute node

A compute node is an Azure virtual machine (VM) or cloud service VM that is dedicated to processing a portion of your application's workload. The size of a node determines the number of CPU cores, the memory capacity, and the local file system size allocated to the node. You can create pools of Windows or Linux nodes by using Azure Cloud Services, images from the Azure Virtual Machines Marketplace, or custom images that you prepare. See the following Pool section for more information on these options.

Nodes can run any executable or script that is supported by the operating system environment of the node. This includes *.exe, *.cmd, *.bat, and PowerShell scripts for Windows, and binaries, shell scripts, and Python scripts for Linux.

All compute nodes in Batch also include:

Pool

A pool is a collection of nodes that your application runs on. You can create the pool manually, or the Batch service can create it automatically when you specify the work to be done. You can create and manage a pool that meets the resource requirements of your application. A pool can be used only by the Batch account in which it was created. A Batch account can have more than one pool.

Azure Batch pools build on top of the core Azure compute platform. They provide large-scale allocation, application installation, data distribution, health monitoring, and flexible adjustment of the number of compute nodes within a pool (scaling).

Every node that is added to a pool is assigned a unique name and IP address. When a node is removed from a pool, any changes that were made to the operating system or files are lost, and its name and IP address are released for future use. When a node leaves a pool, its lifetime is over.

When you create a pool, you can specify the following attributes:

  • Compute node operating system and version
  • Compute node type and target number of nodes
  • Size of the compute nodes
  • Scaling policy
  • Task scheduling policy
  • Communication status for compute nodes
  • Start tasks for compute nodes
  • Application packages
  • Network configuration

Each of these settings is described in more detail in the following sections.

Important

Batch accounts have a default quota that limits the number of cores in a Batch account. The number of cores consumed depends on the number of compute nodes and their size. You can find the default quotas and instructions on how to increase a quota in Quotas and limits for the Azure Batch service. If your pool is not reaching its target number of nodes, the core quota might be the reason.

Compute node operating system and version

When you create a Batch pool, you can specify the Azure virtual machine configuration and the type of operating system you want to run on each compute node in the pool. The two types of configurations available in Batch are:

  • The Virtual Machine Configuration, which specifies that the pool is composed of Azure virtual machines. These VMs may be created from either Linux or Windows images.

    When you create a pool based on the Virtual Machine Configuration, you must specify not only the size of the nodes and the source of the images used to create them, but also the virtual machine image reference and the Batch node agent SKU to be installed on the nodes. For more information about specifying these pool properties, see Provision Linux compute nodes in Azure Batch pools. You can optionally attach one or more empty data disks to pool VMs created from Marketplace images, or include data disks in custom images used to create the VMs.

  • The Cloud Services Configuration, which specifies that the pool is composed of Azure Cloud Services nodes. Cloud Services provide Windows compute nodes only.

    Available operating systems for Cloud Services Configuration pools are listed in the Azure Guest OS releases and SDK compatibility matrix. When you create a pool that contains Cloud Services nodes, you need to specify the node size and its OS Family. Cloud Services are deployed to Azure more quickly than virtual machines running Windows. If you want pools of Windows compute nodes, you may find that Cloud Services provide a benefit in terms of deployment time.

    • The OS Family also determines which versions of .NET are installed with the OS.
    • As with worker roles within Cloud Services, you can specify an OS Version (for more information on worker roles, see the Cloud Services overview).
    • As with worker roles, we recommend that you specify * for the OS Version so that the nodes are upgraded automatically and no work is required to cater to newly released versions. The primary reason to select a specific OS version is to ensure application compatibility: you can run backward-compatibility tests before allowing the version to be updated. After validation, you can update the OS Version for the pool and install the new OS image. Any running tasks are interrupted and requeued.

When you create a pool, you need to select the appropriate nodeAgentSkuId, depending on the OS of the base image of your VHD. You can get a mapping of available node agent SKU IDs to their OS image references by calling the List Supported Node Agent SKUs operation.
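To make the Virtual Machine Configuration settings concrete, here is a minimal sketch of a pool-creation request body built as a plain Python dictionary. The pool ID, VM size, image, and node agent SKU are illustrative values only and may no longer be offered; the property names (including the `nodeAgentSKUId` spelling) follow the Batch REST API as I recall it, so treat them as assumptions and confirm valid image and agent pairs with the List Supported Node Agent SKUs operation.

```python
import json

# Sketch of a request body for a Virtual Machine Configuration pool.
# Every value below is illustrative; property names are assumed from the REST API.
pool_body = {
    "id": "linux-pool",              # hypothetical pool ID
    "vmSize": "Standard_D2_v2",      # node size
    "virtualMachineConfiguration": {
        "imageReference": {          # which image the nodes are created from
            "publisher": "Canonical",
            "offer": "UbuntuServer",
            "sku": "16.04-LTS",
            "version": "latest",
        },
        # Must match the OS of the image referenced above.
        "nodeAgentSKUId": "batch.node.ubuntu 16.04",
    },
    "targetDedicatedNodes": 2,
}

print(json.dumps(pool_body, indent=2))
```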

Container support in Virtual Machine pools

When creating a Virtual Machine Configuration pool using the Batch APIs, you can set up the pool to run tasks in Docker containers. Currently, you must create the pool using an image that supports Docker containers. Use the Windows Server 2016 Datacenter with Containers image from the Azure Marketplace, or supply a custom VM image that includes Docker Community Edition or Enterprise Edition and any required drivers. The pool settings must include a container configuration that copies container images to the VMs when the pool is created. Tasks that run on the pool can then reference the container images and container run options.

Compute node type and target number of nodes

When you create a pool, you can specify which types of compute nodes you want and the target number for each. The two types of compute nodes are:

  • Dedicated compute nodes. Dedicated compute nodes are reserved for your workloads. They are more expensive than low-priority nodes, but they are guaranteed never to be preempted.

  • Low-priority compute nodes. Low-priority nodes take advantage of surplus capacity in Azure to run your Batch workloads. Low-priority nodes are less expensive per hour than dedicated nodes, and enable workloads that require a lot of compute power. For more information, see Use low-priority VMs with Batch.

    Low-priority compute nodes may be preempted when Azure has insufficient surplus capacity. If a node is preempted while running tasks, the tasks are requeued and run again once a compute node becomes available. Low-priority nodes are a good option for workloads where the job completion time is flexible and the work is distributed across many nodes. Before you decide to use low-priority nodes for your scenario, make sure that any work lost due to preemption will be minimal and easy to recreate.

You can have both low-priority and dedicated compute nodes in the same pool. Each type of node (low-priority and dedicated) has its own target setting, for which you can specify the desired number of nodes.

The number of compute nodes is referred to as a target because, in some situations, your pool might not reach the desired number of nodes. For example, a pool might not achieve the target if it reaches the core quota for your Batch account first. Or, the pool might not achieve the target if you have applied an auto-scaling formula to the pool that limits the maximum number of nodes.
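The effect of the core quota on a target can be shown with a little arithmetic. The function below is a simplified model, not Batch's actual allocation logic: it assumes every node in the pool has the same number of cores and that the quota is the only limit. The function name and all numbers are hypothetical.

```python
def achievable_nodes(target_nodes, cores_per_node, core_quota, cores_in_use=0):
    """Upper bound on the nodes a pool can reach before hitting the core quota."""
    free_cores = max(0, core_quota - cores_in_use)
    return min(target_nodes, free_cores // cores_per_node)

# A pool asking for 10 four-core nodes under a 20-core quota stops at 5 nodes.
print(achievable_nodes(target_nodes=10, cores_per_node=4, core_quota=20))  # 5
```

If other pools in the same account already hold cores, pass them as `cores_in_use`; the bound shrinks accordingly.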

For pricing information for both low-priority and dedicated compute nodes, see Batch Pricing.

Size of the compute nodes

When you create an Azure Batch pool, you can choose from among almost all the VM families and sizes available in Azure. Azure offers a range of VM sizes for different workloads, including specialized GPU-enabled VM sizes.

For more information, see Choose a VM size for compute nodes in an Azure Batch pool.

Scaling policy

For dynamic workloads, you can write an auto-scaling formula and apply it to a pool. The Batch service periodically evaluates your formula and adjusts the number of nodes within the pool based on various pool, job, and task parameters that you can specify.
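The idea behind a scaling formula can be sketched in ordinary Python. This is an analogy only: Batch formulas are written in their own formula language and evaluated by the service, and the function name, sampling scheme, and limits here are assumptions made for illustration.

```python
from statistics import mean

def evaluate_target(pending_task_samples, max_nodes, min_nodes=0):
    """Choose a node count from recent pending-task samples, clamped to the limits."""
    if not pending_task_samples:
        return min_nodes                      # no data yet: stay at the floor
    wanted = round(mean(pending_task_samples))
    return max(min_nodes, min(max_nodes, wanted))

# Three recent samples of the pending-task count, capped at 25 nodes.
print(evaluate_target([12, 18, 30], max_nodes=25))
```

Each periodic evaluation would call such a function with fresh samples and resize the pool toward the returned value, which is why a formula with a low maximum can keep a pool below its nominal target.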

Task scheduling policy

The max tasks per node configuration option determines the maximum number of tasks that can run in parallel on each compute node within the pool.

The default configuration runs one task at a time on a node, but there are scenarios where it is beneficial to run two or more tasks on a node simultaneously. See the example scenario in the concurrent node tasks article to see how you can benefit from multiple tasks per node.

You can also specify a fill type, which determines whether Batch spreads the tasks evenly across all nodes in a pool, or packs each node with the maximum number of tasks before assigning tasks to another node.
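The difference between the two fill types is easiest to see in a small simulation. The sketch below is a toy model, not the Batch scheduler: it assumes all nodes are idle and identical, and the function name and the "spread" and "pack" labels are chosen for this example.

```python
def assign_tasks(num_tasks, num_nodes, max_tasks_per_node, fill_type):
    """Return per-node task counts for a 'spread' or 'pack' fill type."""
    counts = [0] * num_nodes
    capacity = num_nodes * max_tasks_per_node
    for i in range(min(num_tasks, capacity)):   # tasks beyond capacity stay queued
        if fill_type == "pack":
            node = i // max_tasks_per_node      # fill one node before the next
        elif fill_type == "spread":
            node = i % num_nodes                # round-robin across nodes
        else:
            raise ValueError(f"unknown fill type: {fill_type}")
        counts[node] += 1
    return counts

print(assign_tasks(5, 3, 4, "spread"))  # [2, 2, 1]
print(assign_tasks(5, 3, 4, "pack"))    # [4, 1, 0]
```

With five tasks, three nodes, and four tasks per node, spreading touches every node while packing leaves one node idle, which matters if idle nodes can later be removed by scaling.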

Communication status for compute nodes

In most scenarios, tasks operate independently and do not need to communicate with one another. However, there are some applications in which tasks must communicate, such as MPI scenarios.

You can configure a pool to allow internode communication, so that nodes within a pool can communicate at runtime. When internode communication is enabled, nodes in Cloud Services Configuration pools can communicate with each other on ports greater than 1100, and Virtual Machine Configuration pools do not restrict traffic on any port.

Note that enabling internode communication also affects the placement of the nodes within clusters and might limit the maximum number of nodes in a pool because of deployment restrictions. If your application does not require communication between nodes, the Batch service can allocate a potentially large number of nodes to the pool from many different clusters and datacenters to increase parallel processing power.

Start tasks for compute nodes

The optional start task executes on each node as that node joins the pool, and each time a node is restarted or reimaged. The start task is especially useful for preparing compute nodes for the execution of tasks, for example by installing the applications that your tasks run on the compute nodes.

Application packages

You can specify application packages to deploy to the compute nodes in the pool. Application packages provide simplified deployment and versioning of the applications that your tasks run. Application packages that you specify for a pool are installed on every node that joins the pool, and again each time a node is rebooted or reimaged.

Note

Application packages are supported on all Batch pools created after 5 July 2017. They are supported on Batch pools created between 10 March 2016 and 5 July 2017 only if the pool was created using a Cloud Services Configuration. Batch pools created prior to 10 March 2016 do not support application packages. For more information about using application packages to deploy your applications to your Batch nodes, see Deploy applications to compute nodes with Batch application packages.

Network configuration

You can specify the subnet of an Azure virtual network (VNet) in which the pool's compute nodes should be created. See the Pool network configuration section for more information.

Job

A job is a collection of tasks. It manages how computation is performed by its tasks on the compute nodes in a pool.

  • The job specifies the pool in which the work is to be run. You can create a new pool for each job, or use one pool for many jobs. You can create a pool for each job that is associated with a job schedule, or for all jobs that are associated with a job schedule.

  • You can specify an optional job priority. When a job is submitted with a higher priority than jobs that are currently in progress, the tasks for the higher-priority job are inserted into the queue ahead of tasks for the lower-priority jobs. Tasks in lower-priority jobs that are already running are not preempted.

  • You can use job constraints to specify certain limits for your jobs:

    You can set a maximum wallclock time, so that if a job runs for longer than the specified maximum, the job and all of its tasks are terminated.

    Batch can detect and then retry failed tasks. You can specify the maximum number of task retries as a constraint, including whether a task is always or never retried. Retrying a task means that the task is requeued to be run again.

  • Your client application can add tasks to a job, or you can specify a job manager task. A job manager task contains the information that is necessary to create the required tasks for a job, and runs on one of the compute nodes in the pool. Batch handles the job manager task specially: it is queued as soon as the job is created, and is restarted if it fails. A job manager task is required for jobs that are created by a job schedule because it is the only way to define the tasks before the job is instantiated.

  • By default, jobs remain in the active state when all tasks within the job are complete. You can change this behavior so that the job is automatically terminated when all tasks in the job are complete. Set the job's onAllTasksComplete property (OnAllTasksComplete in Batch .NET) to terminatejob to automatically terminate the job when all of its tasks are in the completed state.

    Note that the Batch service considers a job with no tasks to have all of its tasks completed. Therefore, this option is most commonly used with a job manager task. If you want to use automatic job termination without a job manager, you should initially set a new job's onAllTasksComplete property to noaction, then set it to terminatejob only after you've finished adding tasks to the job.
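The noaction-then-terminatejob pattern described in the last bullet can be sketched as two request bodies. The IDs and limits are hypothetical, and the property names other than onAllTasksComplete (poolInfo, constraints, maxWallClockTime, maxTaskRetryCount) are written as I understand the Batch REST job resource, so treat them as assumptions.

```python
import json

# Body used when the job is created. Values and IDs are illustrative.
job_body = {
    "id": "render-job",
    "poolInfo": {"poolId": "linux-pool"},  # the pool that runs this job's tasks
    "priority": 100,
    "constraints": {
        "maxWallClockTime": "PT2H",        # ISO 8601 duration: stop after 2 hours
        "maxTaskRetryCount": 3,            # retry a failed task up to 3 times
    },
    # An empty job counts as "all tasks complete", so start with noaction.
    "onAllTasksComplete": "noaction",
}

# ... the client adds all tasks to the job here ...

# Body of a follow-up update, sent only after the last task has been added.
job_update = {"onAllTasksComplete": "terminatejob"}

print(json.dumps(job_body, indent=2))
print(json.dumps(job_update))
```

Sending terminatejob at creation time instead would risk the job terminating before any task is added, which is the situation the two-step sequence avoids.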

Job priority

You can assign a priority to jobs that you create in Batch. The Batch service uses the priority value of the job to determine the order of job scheduling within an account (this is not to be confused with a scheduled job). The priority values range from -1000 to 1000, with -1000 being the lowest priority and 1000 being the highest. To update the priority of a job, call the Update the properties of a job operation (Batch REST), or modify the CloudJob.Priority property (Batch .NET).

Within the same account, higher-priority jobs have scheduling precedence over lower-priority jobs. A job with a higher priority value in one account does not have scheduling precedence over another job with a lower priority value in a different account.

Job scheduling across pools is independent. Between different pools, it is not guaranteed that a higher-priority job is scheduled first if its associated pool is short of idle nodes. In the same pool, jobs with the same priority level have an equal chance of being scheduled.
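A client can check the priority range before submitting a job. The helpers below are hypothetical and model only the ordering rule within one pool; they say nothing about how Batch breaks ties, which the text above describes as an equal chance for jobs of equal priority.

```python
def validate_priority(priority):
    """Accept integers from -1000 (lowest) to 1000 (highest); reject anything else."""
    if not isinstance(priority, int) or not -1000 <= priority <= 1000:
        raise ValueError(f"priority must be an integer in [-1000, 1000], got {priority!r}")
    return priority

def scheduling_order(jobs):
    """Order (job_id, priority) pairs of one pool from highest to lowest priority."""
    return sorted(jobs, key=lambda job: -validate_priority(job[1]))

print(scheduling_order([("nightly", 0), ("urgent", 500), ("cleanup", -10)]))
```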

Scheduled jobs

Job schedules enable you to create recurring jobs within the Batch service. A job schedule specifies when to run jobs and includes the specifications for the jobs to be run. You can specify the duration of the schedule (how long and when the schedule is in effect) and how frequently jobs are created during the scheduled period.
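The interplay of schedule duration and frequency can be illustrated with a simplified model. The parameter names below echo schedule settings as I recall them from the Batch REST API (a do-not-run-until time, a do-not-run-after time, and a recurrence interval); they are assumptions, and the real service applies further rules that this sketch ignores, such as what happens when the previous job is still active.

```python
from datetime import datetime, timedelta

def job_creation_times(do_not_run_until, do_not_run_after, recurrence_interval):
    """List the times at which a recurring schedule would create a job."""
    if recurrence_interval <= timedelta(0):
        raise ValueError("recurrence_interval must be positive")
    times = []
    current = do_not_run_until
    while current <= do_not_run_after:
        times.append(current)
        current += recurrence_interval
    return times

start = datetime(2024, 1, 1, 0, 0)
end = datetime(2024, 1, 1, 6, 0)
for t in job_creation_times(start, end, timedelta(hours=2)):
    print(t.isoformat())
```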

Task

A task is a unit of computation that is associated with a job. It runs on a node. Tasks are assigned to a node for execution, or are queued until a node becomes free. Put simply, a task runs one or more programs or scripts on a compute node to perform the work you need done.

When you create a task, you can specify:

  • The command line for the task. This is the command line that runs your application or script on the compute node.

    It is important to note that the command line does not actually run under a shell. Therefore, it cannot natively take advantage of shell features like environment variable expansion (this includes the PATH). To take advantage of such features, you must invoke the shell in the command line, for example by launching cmd.exe on Windows nodes or /bin/sh on Linux:

    cmd /c MyTaskApplication.exe %MY_ENV_VAR%

    /bin/sh -c "MyTaskApplication $MY_ENV_VAR"

    If your tasks need to run an application or script that is not in the node's PATH, or need to reference environment variables, invoke the shell explicitly in the task command line.

  • Resource files that contain the data to be processed. These files are automatically copied to the node from Blob storage in an Azure Storage account before the task's command line is executed. For more information, see the sections Start task and Files and directories.

  • The environment variables that are required by your application. For more information, see the Environment settings for tasks section.

  • The constraints under which the task should execute. For example, constraints include the maximum time that the task is allowed to run, the maximum number of times a failed task should be retried, and the maximum time that files in the task's working directory are retained.

  • Application packages to deploy to the compute node on which the task is scheduled to run. Application packages provide simplified deployment and versioning of the applications that your tasks run. Task-level application packages are especially useful in shared-pool environments, where different jobs run on one pool and the pool is not deleted when a job is completed. If your job has fewer tasks than there are nodes in the pool, task application packages can minimize data transfer because your application is deployed only to the nodes that run tasks.

  • A container image reference in Docker Hub or a private registry, and additional settings to create a Docker container in which the task runs on the node. You specify this information only if the pool is set up with a container configuration.
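Because the task command line does not run under a shell, client code often wraps commands before submitting them. The helper below is a hypothetical sketch of that wrapping. For Linux it quotes the whole command with `shlex.quote` so that `/bin/sh -c` receives it as a single argument; this assumes the command line is split into arguments the way a POSIX shell would split it.

```python
import shlex

def wrap_in_shell(command, node_os):
    """Prefix a task command so that it runs under a shell on the node."""
    if node_os == "windows":
        return f"cmd /c {command}"
    if node_os == "linux":
        # Quote the whole command so the shell receives it as one -c argument.
        return f"/bin/sh -c {shlex.quote(command)}"
    raise ValueError(f"unsupported OS: {node_os}")

print(wrap_in_shell("MyTaskApplication.exe %MY_ENV_VAR%", "windows"))
print(wrap_in_shell("MyTaskApplication $MY_ENV_VAR", "linux"))
```

The single quotes produced for Linux stop any client-side shell from expanding `$MY_ENV_VAR` early; the variable is expanded by `/bin/sh` on the node, where the task's environment settings are in effect.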

Note

The maximum lifetime of a task, from when it is added to the job to when it completes, is 180 days. Completed tasks persist for 7 days; data for tasks not completed within the maximum lifetime is not accessible.

In addition to the tasks you define to perform computation on a node, the Batch service also provides the following special tasks:

启动任务Start task

通过将启动任务与池相关联,可以准备池节点的操作环境。By associating a start task with a pool, you can prepare the operating environment of its nodes. 可以执行各种操作,例如,安装任务所要运行的应用程序或启动后台进程。For example, you can perform actions such as installing the applications that your tasks run, or starting background processes. 启动任务在节点每次启动时运行,且只要保留在池中就会持续运行(包括首次将节点添加到池时,以及节点重新启动或重置映像时)。The start task runs every time a node starts, for as long as it remains in the pool--including when the node is first added to the pool and when it is restarted or reimaged.

启动任务的主要优点是可以包含全部所需的信息,使你能够配置计算节点,以及安装执行任务所需的应用程序。A primary benefit of the start task is that it can contain all the information necessary to configure a compute node and install the applications required for task execution. 因此,增加池中的节点数和指定新的目标节点计数一样简单。Therefore, increasing the number of nodes in a pool is as simple as specifying the new target node count. 启动任务向 Batch 服务提供配置新节点并使其准备好接受任务所需的信息。The start task provides the Batch service the information needed to configure the new nodes and get them ready for accepting tasks.

与任何 Azure Batch 任务一样,除了指定要执行的命令行以外,还可以指定 Azure 存储中的资源文件列表。As with any Azure Batch task, you can specify a list of resource files in Azure Storage, in addition to a command line to be executed. Batch 服务先将资源文件从 Azure 存储复制到节点,然后运行命令行。The Batch service first copies the resource files to the node from Azure Storage, and then runs the command line. 对于池启动任务,文件列表通常包含任务应用程序及其依赖项。For a pool start task, the file list typically contains the task application and its dependencies.

但是,启动任务还可能包含计算节点上运行的所有任务使用的引用数据。However, the start task could also include reference data to be used by all tasks that are running on the compute node. 例如,启动任务的命令行可执行 robocopy 操作,将应用程序文件(已指定为资源文件并下载到节点)从启动任务的工作目录复制到共享文件夹,运行 MSI 或 setup.exeFor example, a start task's command line could perform a robocopy operation to copy application files (which were specified as resource files and downloaded to the node) from the start task's working directory to the shared folder, and then run an MSI or setup.exe.

通常,Batch 服务需要等待启动任务完成,然后认为节点已准备好分配任务,但可以配置这种行为。It is typically desirable for the Batch service to wait for the start task to complete before considering the node ready to be assigned tasks, but you can configure this.

如果某个计算节点上的启动任务失败,则节点的状态将会更新以反映失败状态,同时,不会为该节点分配任何任务。If a start task fails on a compute node, then the state of the node is updated to reflect the failure, and the node is not assigned any tasks. 如果从存储中复制启动任务的资源文件时出现问题,或由其命令行执行的进程返回了非零退出代码,则启动任务可能会失败。A start task can fail if there is an issue copying its resource files from storage, or if the process executed by its command line returns a nonzero exit code.

如果添加或更新现有池的启动任务,必须重新启动其计算节点,启动任务才应用到节点。If you add or update the start task for an existing pool, you must reboot its compute nodes for the start task to be applied to the nodes.
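As a rough illustration of the pieces described above, the following Python sketch assembles a start task definition as a plain dictionary shaped like the startTask element of a Batch REST pool request. The property names (commandLine, resourceFiles, httpUrl, filePath, waitForSuccess, maxTaskRetryCount) are believed to match the Batch REST API and should be checked against the API reference. The `build_start_task` helper, the storage URL, and the `setup.exe /quiet` command are hypothetical placeholders.

```python
import json


def build_start_task(command_line, resource_urls, wait_for_success=True):
    """Return a dict shaped like a Batch REST `startTask` element (sketch only)."""
    return {
        # Command line that Batch runs after the resource files are copied.
        "commandLine": command_line,
        # Each resource file is downloaded to the start task's working directory.
        "resourceFiles": [
            {"httpUrl": url, "filePath": url.rsplit("/", 1)[-1]}
            for url in resource_urls
        ],
        # When True, the node is not assigned tasks until the start task succeeds.
        "waitForSuccess": wait_for_success,
        # Illustrative retry count, not a recommended value.
        "maxTaskRetryCount": 2,
    }


start_task = build_start_task(
    command_line='cmd /c "setup.exe /quiet"',  # hypothetical installer command
    resource_urls=["https://example.invalid/apps/setup.exe"],  # placeholder URL
)
print(json.dumps(start_task, indent=2))
```

Setting `wait_for_success=False` corresponds to the configurable behavior mentioned above, where nodes can accept tasks before the start task completes.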

Note

Batch limits the total size of a start task, which includes resource files and environment variables. If you need to reduce the size of a start task, you can use one of two approaches:

  1. You can use application packages to distribute applications or data across each node in your Batch pool. For more information about application packages, see Deploy applications to compute nodes with Batch application packages.

  2. You can manually create a zipped archive containing your application files. Upload the zipped archive to Azure Storage as a blob, and specify it as a resource file for your start task. In the start task's command line, unzip the archive before running the rest of the command.

    To unzip the archive, you can use the archiving tool of your choice. You need to include the tool that you use to unzip the archive as a resource file for the start task.
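The second approach can be sketched in Python using only the standard library. This assumes that a Python interpreter is available on the node image, which is not guaranteed; otherwise the archiving tool must be shipped as a resource file as described above. The `extract_archive` helper and the file names are hypothetical, and the demonstration builds a small archive locally in place of one downloaded from Azure Storage.

```python
import tempfile
import zipfile
from pathlib import Path


def extract_archive(archive_path, dest_dir):
    """Extract a zip archive into dest_dir and return the member names."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(archive_path) as archive:
        archive.extractall(dest)
        return sorted(archive.namelist())


# Local demonstration: create a tiny archive, then extract it the way a
# start task command line might before launching an installer.
with tempfile.TemporaryDirectory() as tmp:
    archive_path = Path(tmp) / "app.zip"
    with zipfile.ZipFile(archive_path, "w") as z:
        z.writestr("app/run.cmd", "echo hello")
    extracted = extract_archive(archive_path, Path(tmp) / "out")
    extracted_ok = (Path(tmp) / "out" / "app" / "run.cmd").is_file()

print(extracted, extracted_ok)
```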

Job manager task

You typically use a job manager task to control or monitor job execution, for example, to create and submit the tasks for a job, determine additional tasks to run, and determine when work is complete. However, a job manager task is not restricted to these activities. It is a fully fledged task that can perform any actions that are required for the job. For example, a job manager task might download a file that is specified as a parameter, analyze the contents of that file, and submit additional tasks based on those contents.

A job manager task is started before all other tasks. It provides the following features:

  • It is automatically submitted as a task by the Batch service when the job is created.
  • It is scheduled to execute before the other tasks in a job.
  • Its associated node is the last to be removed from a pool when the pool is being downsized.
  • Its termination can be tied to the termination of all tasks in the job.
  • A job manager task is given the highest priority when it needs to be restarted. If an idle node is not available, the Batch service might terminate one of the other running tasks in the pool to make room for the job manager task to run.
  • A job manager task in one job does not have priority over the tasks of other jobs. Across jobs, only job-level priorities are observed.

Job preparation and release tasks

Batch provides job preparation tasks for setup before a job's tasks execute. Job release tasks are for post-job maintenance or cleanup.

  • Job preparation task: A job preparation task runs on all compute nodes that are scheduled to run tasks, before any of the other job tasks are executed. For example, you can use a job preparation task to copy data that is shared by all tasks but is unique to the job.
  • Job release task: When a job has completed, a job release task runs on each node in the pool that executed at least one task. For example, you can use a job release task to delete data that was copied by the job preparation task, or to compress and upload diagnostic log data.

Both job preparation and release tasks allow you to specify a command line to run when the task is invoked. They offer features like file download, elevated execution, custom environment variables, maximum execution duration, retry count, and file retention time.
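The following Python sketch shows how a job definition might pair the two tasks. It builds a dictionary shaped like a Batch REST job request. The property names (jobPreparationTask, jobReleaseTask, commandLine, waitForSuccess, environmentSettings, poolInfo) are believed to match the REST API but should be verified. The job ID, pool ID, directory name, and shell commands are hypothetical.

```python
import json

# Directory under the node's shared folder that holds per-job data (hypothetical name).
JOB_DATA_DIR = "$AZ_BATCH_NODE_SHARED_DIR/job-data"

job = {
    "id": "example-job",                   # hypothetical job ID
    "poolInfo": {"poolId": "example-pool"},  # hypothetical pool ID
    "jobPreparationTask": {
        # Runs on a node before any of the job's tasks run there.
        "commandLine": f'/bin/bash -c "mkdir -p {JOB_DATA_DIR}"',
        "environmentSettings": [{"name": "JOB_STAGE", "value": "prepare"}],
        "waitForSuccess": True,
    },
    "jobReleaseTask": {
        # Runs on each node that executed at least one task, after the job completes.
        "commandLine": f'/bin/bash -c "rm -rf {JOB_DATA_DIR}"',
    },
}

print(json.dumps(job, indent=2))
```

Keeping the directory name in one constant makes it harder for the release task to clean up a different path than the one the preparation task created.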

For more information on job preparation and release tasks, see Run job preparation and completion tasks on Azure Batch compute nodes.

Multi-instance task

A multi-instance task is a task that is configured to run on more than one compute node simultaneously. With multi-instance tasks, you can enable high-performance computing scenarios, such as Message Passing Interface (MPI) applications, that require a group of compute nodes allocated together to process a single workload.

For a detailed discussion on running MPI jobs in Batch by using the Batch .NET library, see Use multi-instance tasks to run Message Passing Interface (MPI) applications in Azure Batch.

Task dependencies

Task dependencies, as the name implies, allow you to specify that a task depends on the completion of other tasks before it executes. This feature supports situations in which a "downstream" task consumes the output of an "upstream" task, or in which an upstream task performs some initialization that a downstream task requires. To use this feature, you must first enable task dependencies on your Batch job. Then, for each task that depends on one or more other tasks, you specify the tasks that it depends on.

With task dependencies, you can configure scenarios like the following:

  • taskB depends on taskA (taskB will not begin execution until taskA has completed).
  • taskC depends on both taskA and taskB.
  • taskD depends on a range of tasks, such as tasks 1 through 10, before it executes.
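The three scenarios above can be modeled with a short Python sketch. The dependsOn, taskIds, and taskIdRanges names follow the shape used by the Batch REST API as far as is known, with ranges treated as inclusive. The `expand_dependencies` and `runnable_tasks` helpers are hypothetical and only simulate which tasks would be eligible to run; they are not part of any Batch SDK.

```python
def expand_dependencies(depends_on):
    """Flatten explicit task IDs and inclusive ID ranges into one set of IDs."""
    ids = set(depends_on.get("taskIds", []))
    for id_range in depends_on.get("taskIdRanges", []):
        ids.update(str(n) for n in range(id_range["start"], id_range["end"] + 1))
    return ids


def runnable_tasks(tasks, completed):
    """Return IDs of tasks that are not yet done and whose dependencies are all done."""
    return sorted(
        task["id"]
        for task in tasks
        if task["id"] not in completed
        and expand_dependencies(task.get("dependsOn", {})) <= completed
    )


tasks = [
    {"id": "taskA"},
    {"id": "taskB", "dependsOn": {"taskIds": ["taskA"]}},
    {"id": "taskC", "dependsOn": {"taskIds": ["taskA", "taskB"]}},
    {"id": "taskD", "dependsOn": {"taskIdRanges": [{"start": 1, "end": 10}]}},
]

print(runnable_tasks(tasks, completed=set()))               # only taskA is eligible
print(runnable_tasks(tasks, completed={"taskA"}))           # taskB becomes eligible
print(runnable_tasks(tasks, completed={"taskA", "taskB"}))  # taskC becomes eligible
```

In this model, taskD stays blocked until tasks with IDs "1" through "10" are all in the completed set.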

For more detail on this feature, see Task dependencies in Azure Batch and the TaskDependencies code sample in the azure-batch-samples GitHub repository.

Environment settings for tasks

Each task executed by the Batch service has access to environment variables that the service sets on compute nodes. These include environment variables defined by the Batch service (service-defined) and custom environment variables that you can define for your tasks. The applications and scripts that your tasks execute have access to these environment variables during execution.

You can set custom environment variables at the task or job level by populating the environment settings property for these entities. For example, see the Add a task to a job operation (Batch REST API), or the CloudTask.EnvironmentSettings and CloudJob.CommonEnvironmentSettings properties in Batch .NET.

Your client application or service can obtain a task's environment variables, both service-defined and custom, by using the Get information about a task operation (Batch REST) or by accessing the CloudTask.EnvironmentSettings property (Batch .NET). Processes executing on a compute node can access these and other environment variables on the node, for example, by using the familiar %VARIABLE_NAME% (Windows) or $VARIABLE_NAME (Linux) syntax.
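From inside a task, a script can read these variables like any other process environment. The Python sketch below separates service-defined from custom variables, assuming that service-defined names start with AZ_BATCH_ (an assumption based on the variable names in this article). The helper names, the INPUT_CONTAINER variable, and the sample values are hypothetical.

```python
import os


def service_defined_variables(environ=None):
    """Return variables whose names start with AZ_BATCH_ (assumed service prefix)."""
    env = os.environ if environ is None else environ
    return {name: value for name, value in env.items() if name.startswith("AZ_BATCH_")}


def read_setting(name, default=None, environ=None):
    """Read one variable, falling back to a default when it is not set."""
    env = os.environ if environ is None else environ
    return env.get(name, default)


# A stand-in environment so the sketch also runs outside a Batch node.
sample_env = {
    "AZ_BATCH_TASK_WORKING_DIR": "/example/wd",  # placeholder path
    "INPUT_CONTAINER": "inputs",                 # hypothetical custom variable
    "PATH": "/usr/bin",
}

print(service_defined_variables(sample_env))
print(read_setting("INPUT_CONTAINER", environ=sample_env))
```

On a real node, calling the helpers without the `environ` argument reads the task's actual environment.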

You can find a full list of all service-defined environment variables in Compute node environment variables.

Files and directories

Each task has a working directory under which it creates zero or more files and directories. This working directory can be used for storing the program that is run by the task, the data that it processes, and the output of the processing it performs. All files and directories of a task are owned by the task user.

The Batch service exposes a portion of the file system on a node as the root directory. Tasks can access the root directory by referencing the AZ_BATCH_NODE_ROOT_DIR environment variable. For more information about using environment variables, see Environment settings for tasks.

The root directory contains the following directory structure:

Figure: Compute node directory structure

  • applications: Contains details of the application packages installed on the compute node. Tasks can access this directory by referencing the AZ_BATCH_APP_PACKAGE environment variable.

  • fsmounts: This directory contains any file systems that are mounted on a compute node. Tasks can access this directory by referencing the AZ_BATCH_NODE_MOUNTS_DIR environment variable.

  • shared: This directory provides read/write access to all tasks that run on a node. Any task that runs on the node can create, read, update, and delete files in this directory. Tasks can access this directory by referencing the AZ_BATCH_NODE_SHARED_DIR environment variable.

  • startup: This directory is used by a start task as its working directory. All of the files that are downloaded to the node by the start task are stored here. The start task can create, read, update, and delete files under this directory. Tasks can access this directory by referencing the AZ_BATCH_NODE_STARTUP_DIR environment variable.

  • volatile: This directory is for internal purposes. There's no guarantee that any files in this directory, or the directory itself, will exist in the future.

  • workitems: This directory contains the directories for jobs and their tasks on the compute node.

  • Tasks: Within the workitems directory, a directory is created for each task that runs on the node. It's accessed by referencing the AZ_BATCH_TASK_DIR environment variable.

    Within each task directory, the Batch service creates a working directory (wd) whose unique path is specified by the AZ_BATCH_TASK_WORKING_DIR environment variable. This directory provides read/write access to the task. The task can create, read, update, and delete files under this directory. This directory is retained based on the RetentionTime constraint that is specified for the task.

    stdout.txt and stderr.txt: These files are written to the task folder during the execution of the task.
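As a small illustration of how a task might use this layout, the Python sketch below writes an output file into the task working directory named by AZ_BATCH_TASK_WORKING_DIR. The `resolve_dir` and `write_output` helpers are hypothetical, and the fallback directory exists only so the sketch can run outside a Batch node.

```python
import os
import tempfile
from pathlib import Path


def resolve_dir(env_name, fallback):
    """Return the directory named by an environment variable, or a fallback path."""
    return Path(os.environ.get(env_name, str(fallback)))


def write_output(filename, text, fallback_root):
    """Write text to a file in the task working directory and return its path."""
    working_dir = resolve_dir("AZ_BATCH_TASK_WORKING_DIR", Path(fallback_root) / "wd")
    working_dir.mkdir(parents=True, exist_ok=True)
    target = working_dir / filename
    target.write_text(text, encoding="utf-8")
    return target


# Off-node demonstration using a temporary directory as the fallback root.
with tempfile.TemporaryDirectory() as tmp:
    out = write_output("result.txt", "done", tmp)
    written = out.read_text(encoding="utf-8")

print(out.name, written)
```

The same `resolve_dir` pattern could be applied to AZ_BATCH_NODE_SHARED_DIR when several tasks on a node need to share files.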

Important

When a node is removed from the pool, all of the files that are stored on the node are removed.

Application packages

The application packages feature provides easy management and deployment of applications to the compute nodes in your pools. You can upload and manage multiple versions of the applications run by your tasks, including their binaries and support files. Then you can automatically deploy one or more of these applications to the compute nodes in your pool.

You can specify application packages at the pool and task level. When you specify pool application packages, the application is deployed to every node in the pool. When you specify task application packages, the application is deployed only to nodes that are scheduled to run at least one of the job's tasks, just before the task's command line is run.

Batch handles the details of working with Azure Storage to store your application packages and deploy them to compute nodes, which simplifies both your code and your management overhead.

To find out more about the application package feature, see Deploy applications to compute nodes with Batch application packages.

Note

If you add pool application packages to an existing pool, you must reboot its compute nodes for the application packages to be deployed to the nodes.

Pool and compute node lifetime

When you design your Azure Batch solution, you have to decide how and when pools are created, and how long compute nodes within those pools are kept available.

On one end of the spectrum, you can create a pool for each job that you submit, and delete the pool as soon as its tasks finish execution. This maximizes utilization because the nodes are allocated only when needed, and shut down as soon as they're idle. While this means that the job must wait for the nodes to be allocated, it's important to note that tasks are scheduled for execution as soon as individual nodes are allocated and available and their start task has completed. Batch does not wait until all nodes within a pool are available before assigning tasks to the nodes. This ensures maximum utilization of all available nodes.

At the other end of the spectrum, if having jobs start immediately is the highest priority, you can create a pool ahead of time and make its nodes available before jobs are submitted. In this scenario, tasks can start immediately, but nodes might sit idle while waiting for tasks to be assigned.

A combined approach is typically used for handling a variable, but ongoing, load. You can have a pool that multiple jobs are submitted to, and scale the number of nodes up or down according to the job load (see Scaling compute resources in the following section). You can do this reactively, based on current load, or proactively, if load can be predicted.

Virtual network (VNet) and firewall configuration

When you provision a pool of compute nodes in Batch, you can associate the pool with a subnet of an Azure virtual network (VNet). To use an Azure VNet, the Batch client API must use Azure Active Directory (AD) authentication. Azure Batch support for Azure AD is documented in Authenticate Batch service solutions with Active Directory.

VNet requirements

  • The VNet must be in the same Azure region and subscription as the Batch account.

  • For pools created with a virtual machine configuration, only Azure Resource Manager-based VNets are supported. For pools created with a cloud services configuration, only classic VNets are supported.

  • To use a classic VNet, the MicrosoftAzureBatch service principal must have the Classic Virtual Machine Contributor Role-Based Access Control (RBAC) role for the specified VNet. To use an Azure Resource Manager-based VNet, you need to have permissions to access the VNet and to deploy VMs in the subnet.

  • The subnet specified for the pool must have enough unassigned IP addresses to accommodate the number of VMs targeted for the pool; that is, the sum of the targetDedicatedNodes and targetLowPriorityNodes properties of the pool. If the subnet doesn't have enough unassigned IP addresses, the pool partially allocates the compute nodes, and a resize error occurs.

  • Pools in the virtual machine configuration deployed in an Azure VNet automatically allocate additional Azure networking resources. The following resources are needed for every 50 pool nodes in a VNet: 1 network security group, 1 public IP address, and 1 load balancer. These resources are limited by quotas in the subscription that contains the virtual network supplied when creating the Batch pool.

  • The VNet must allow communication from the Batch service so that the service can schedule tasks on the compute nodes. You can verify this by checking whether the VNet has any associated network security groups (NSGs). If communication to the compute nodes in the specified subnet is denied by an NSG, the Batch service sets the state of the compute nodes to unusable.

  • If the specified VNet has associated network security groups (NSGs) and/or a firewall, configure the inbound and outbound ports as shown in the following tables:

    | Destination port(s) | Source IP address | Source port | Does Batch add NSGs? | Required for VM to be usable? | Action from user |
    |---|---|---|---|---|---|
    | For pools created with the virtual machine configuration: 29876, 29877. For pools created with the cloud services configuration: 10100, 20100, 30100 | *. Although this effectively requires "allow all", the Batch service applies an NSG at the network interface level on each VM created under the virtual machine configuration that filters out all non-Batch service IP addresses. | * or 443 | Yes. Batch adds NSGs at the level of the network interfaces (NICs) attached to VMs. These NSGs allow traffic only from Batch service role IP addresses. Even if you open these ports to the Internet, the traffic is blocked at the NIC. | Yes | You do not need to specify an NSG, because Batch allows only Batch IP addresses. However, if you do specify an NSG, ensure that these ports are open for inbound traffic. |
    | 3389 (Windows), 22 (Linux) | User machines, used for debugging purposes, so that you can remotely access the VM. | * | No | No | Add NSGs if you want to permit remote access (RDP or SSH) to the VM. |

    | Outbound port(s) | Destination | Does Batch add NSGs? | Required for VM to be usable? | Action from user |
    |---|---|---|---|---|
    | 443 | Azure Storage | No | Yes | If you add any NSGs, ensure that this port is open to outbound traffic. |

    Also, ensure that your Azure Storage endpoint can be resolved by any custom DNS servers that serve your VNet. Specifically, URLs of the form <account>.table.core.chinacloudapi.cn, <account>.queue.core.chinacloudapi.cn, and <account>.blob.core.chinacloudapi.cn should be resolvable.

    If you add a Resource Manager-based NSG, you can use service tags to select the Storage IP addresses for the specific region for outbound connections. Note that the Storage IP addresses must be in the same region as your Batch account and VNet. Service tags are currently in preview in selected Azure regions.

For more information about setting up a Batch pool in a VNet, see Create a pool of virtual machines with your virtual network.
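The subnet sizing and per-50-node resource requirements listed above lend themselves to a quick pre-flight check. The Python sketch below estimates whether a subnet can hold a pool and how many sets of networking resources (network security group, public IP address, load balancer) the pool would need. The figure of five addresses reserved by Azure in each subnet is an assumption that does not come from this article, and the `check_pool_fits` helper and its parameters are hypothetical.

```python
import ipaddress
import math

# Assumption: Azure reserves five addresses in every subnet.
AZURE_RESERVED_PER_SUBNET = 5
# From the requirements above: one NSG, public IP, and load balancer per 50 nodes.
NODES_PER_RESOURCE_SET = 50


def subnet_capacity(cidr):
    """Return the number of addresses in the subnet that VMs could use."""
    total = ipaddress.ip_network(cidr).num_addresses
    return max(total - AZURE_RESERVED_PER_SUBNET, 0)


def check_pool_fits(cidr, target_dedicated, target_low_priority, in_use=0):
    """Compare the pool's target node count with the subnet's free addresses."""
    needed = target_dedicated + target_low_priority
    free = subnet_capacity(cidr) - in_use
    return {
        "needed": needed,
        "free": free,
        "fits": needed <= free,
        "resource_sets": math.ceil(needed / NODES_PER_RESOURCE_SET),
    }


print(check_pool_fits("10.0.0.0/24", target_dedicated=200, target_low_priority=60))
print(check_pool_fits("10.0.0.0/23", target_dedicated=200, target_low_priority=60))
```

Under these assumptions, a pool of 260 nodes does not fit in a /24 subnet but does fit in a /23, and it needs six sets of networking resources in either case.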

Scaling compute resources

With automatic scaling, you can have the Batch service dynamically adjust the number of compute nodes in a pool according to the current workload and resource usage of your compute scenario. This allows you to lower the overall cost of running your application by using only the resources you need, and releasing those you don't need.

You enable automatic scaling by writing an automatic scaling formula and associating that formula with a pool. The Batch service uses the formula to determine the target number of nodes in the pool for the next scaling interval (an interval that you can configure). You can specify the automatic scaling settings for a pool when you create it, or enable scaling on a pool later. You can also update the scaling settings on a scaling-enabled pool.

As an example, perhaps a job requires that you submit a large number of tasks to be executed. You can assign a scaling formula to the pool that adjusts the number of nodes in the pool based on the current number of queued tasks and the completion rate of the tasks in the job. The Batch service periodically evaluates the formula and resizes the pool, based on workload and your other formula settings. The service adds nodes as needed when there are a large number of queued tasks, and removes nodes when there are no queued or running tasks.

A scaling formula can be based on the following metrics:

  • Time metrics are based on statistics collected every five minutes in the specified number of hours.
  • Resource metrics are based on CPU usage, bandwidth usage, memory usage, and number of nodes.
  • Task metrics are based on task state, such as Active (queued), Running, or Completed.

When automatic scaling decreases the number of compute nodes in a pool, you must consider how to handle tasks that are running at the time of the decrease operation. To accommodate this, Batch provides a node deallocation option that you can include in your formulas. For example, you can specify that running tasks are stopped immediately and then requeued for execution on another node, or allowed to finish before the node is removed from the pool.
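To make the idea concrete, the Python sketch below holds an example formula as a string and mirrors its logic in a plain function. The formula text is modeled on commonly published Batch autoscale examples and uses names such as $PendingTasks, $TargetDedicatedNodes, and $NodeDeallocationOption; treat its exact syntax as an assumption and verify it against the autoscale documentation before use. The `target_nodes` function is a hypothetical, simplified stand-in for what the service evaluates, not the service's implementation.

```python
# Example formula text (assumed syntax; verify before use).
AUTOSCALE_FORMULA = """
startingNumberOfVMs = 1;
maxNumberofVMs = 25;
pendingTaskSamplePercent = $PendingTasks.GetSamplePercent(180 * TimeInterval_Second);
pendingTaskSamples = pendingTaskSamplePercent < 70 ? startingNumberOfVMs : avg($PendingTasks.GetSample(180 * TimeInterval_Second));
$TargetDedicatedNodes = min(maxNumberofVMs, pendingTaskSamples);
$NodeDeallocationOption = taskcompletion;
"""


def target_nodes(pending_samples, sample_percent, start_nodes=1, max_nodes=25):
    """Mirror the formula: average pending tasks, capped at max_nodes.

    Falls back to start_nodes when too few samples are available.
    """
    if sample_percent < 70 or not pending_samples:
        pending = start_nodes
    else:
        pending = sum(pending_samples) / len(pending_samples)
    # Truncating to an integer is a simplification made by this sketch.
    return int(min(max_nodes, pending))


print(target_nodes([10, 20, 30], sample_percent=100))  # average of samples
print(target_nodes([100, 200], sample_percent=100))    # capped at max_nodes
print(target_nodes([10], sample_percent=50))           # too few samples
```

The last line of the formula corresponds to the deallocation choice described above: running tasks are allowed to finish before their node is removed.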

For more information about automatically scaling an application, see Automatically scale compute nodes in an Azure Batch pool.

Tip

To maximize compute resource utilization, set the target number of nodes to zero at the end of a job, but allow running tasks to finish.

Security with certificates

You typically need to use certificates when you encrypt or decrypt sensitive information for tasks, like the key for an Azure Storage account. To support this, you can install certificates on nodes. Encrypted secrets are passed to tasks via command-line parameters or embedded in one of the task resources, and the installed certificates can be used to decrypt them.

You use the Add certificate operation (Batch REST) or the CertificateOperations.CreateCertificate method (Batch .NET) to add a certificate to a Batch account. You can then associate the certificate with a new or existing pool. When a certificate is associated with a pool, the Batch service installs the certificate on each node in the pool. The Batch service installs the appropriate certificates when the node starts up, before launching any tasks (including the start task and job manager task).

If you add certificates to an existing pool, you must reboot its compute nodes for the certificates to be applied to the nodes.

Error handling

You might find it necessary to handle both task and application failures within your Batch solution.

Task failure handling

Task failures fall into these categories:

  • Pre-processing failures

    If a task fails to start, a pre-processing error is set for the task.

    Pre-processing errors can occur if the task's resource files have moved, the Storage account is no longer available, or another issue was encountered that prevented the successful copying of files to the node.

  • File upload failures

    If uploading files that are specified for a task fails for any reason, a file upload error is set for the task.

    File upload errors can occur if the SAS supplied for accessing Azure Storage is invalid or does not provide write permissions, if the storage account is no longer available, or if another issue was encountered that prevented the successful copying of files from the node.

  • Application failures

    The process that is specified by the task's command line can also fail. The process is deemed to have failed when the process executed by the task returns a nonzero exit code (see Task exit codes in the next section).

    For application failures, you can configure Batch to automatically retry the task up to a specified number of times.

  • Constraint failures

    You can set a constraint that specifies the maximum execution duration for a job or task, the maxWallClockTime. This can be useful for terminating tasks that fail to progress.

    When the maximum amount of time has been exceeded, the task is marked as completed, but the exit code is set to 0xC000013A and the schedulingError field is marked as { category:"ServerError", code="TaskEnded"}.
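A client that inspects finished tasks could sort them into these categories along the following lines. This Python sketch works on a simplified, hypothetical task record with `exit_code` and `scheduling_error` keys rather than on real SDK objects. The constraint check uses the exit code and error code quoted above. Treating a scheduling error with no exit code as a pre-processing failure is an assumption, and file upload failures are not distinguished.

```python
# Exit code reported when maxWallClockTime is exceeded (from the text above).
CONSTRAINT_EXIT_CODE = 0xC000013A


def classify_task(task):
    """Return a coarse failure category for a simplified task record."""
    error = task.get("scheduling_error")
    exit_code = task.get("exit_code")

    if error and error.get("code") == "TaskEnded" and exit_code == CONSTRAINT_EXIT_CODE:
        return "constraint failure"
    if error and exit_code is None:
        # Assumption: the process never started, so no exit code was recorded.
        return "pre-processing failure"
    if exit_code not in (None, 0):
        return "application failure"
    if exit_code == 0 and not error:
        return "succeeded"
    return "other"


samples = [
    {"exit_code": 0},
    {"exit_code": 3},
    {"exit_code": None, "scheduling_error": {"category": "UserError", "code": "BlobNotFound"}},
    {"exit_code": CONSTRAINT_EXIT_CODE,
     "scheduling_error": {"category": "ServerError", "code": "TaskEnded"}},
]

for sample in samples:
    print(classify_task(sample))
```

The "BlobNotFound" code in the third sample is an invented placeholder used only to exercise the pre-processing branch.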

Debugging application failures

  • stderr and stdout

    During execution, an application might produce diagnostic output that you can use to troubleshoot issues. As mentioned in the earlier section Files and directories, the Batch service writes standard output and standard error output to stdout.txt and stderr.txt files in the task directory on the compute node. You can use the Azure portal or one of the Batch SDKs to download these files. For example, you can retrieve these and other files for troubleshooting purposes by using ComputeNode.GetNodeFile and CloudTask.GetNodeFile in the Batch .NET library.

  • Task exit codes

    As mentioned earlier, a task is marked as failed by the Batch service if the process that is executed by the task returns a nonzero exit code. When a task executes a process, Batch populates the task's exit code property with the return code of the process. It is important to note that a task's exit code is not determined by the Batch service. It is determined by the process itself or by the operating system on which the process executed.

Accounting for task failures or interruptions

Tasks might occasionally fail or be interrupted. The task application itself might fail, the node on which the task is running might be rebooted, or the node might be removed from the pool during a resize operation if the pool's deallocation policy is set to remove nodes immediately without waiting for tasks to finish. In all cases, the task can be automatically requeued by Batch for execution on another node.

It is also possible for an intermittent issue to cause a task to stop responding or take too long to execute. You can set the maximum execution interval for a task. If the maximum execution interval is exceeded, the Batch service interrupts the task application.

Connecting to compute nodes

You can perform additional debugging and troubleshooting by signing in to a compute node remotely. You can use the Azure portal to download a Remote Desktop Protocol (RDP) file for Windows nodes and obtain Secure Shell (SSH) connection information for Linux nodes. You can also do this by using the Batch APIs, for example with Batch .NET or Batch Python.

Important

To connect to a node via RDP or SSH, you must first create a user on the node. To do this, you can use the Azure portal, add a user account to a node by using the Batch REST API, call the ComputeNode.CreateComputeNodeUser method in Batch .NET, or call the add_user method in the Batch Python module.

If you need to restrict or disable RDP or SSH access to compute nodes, see Configure or disable remote access to compute nodes in an Azure Batch pool.

Troubleshooting problematic compute nodes

When some of your tasks are failing, your Batch client application or service can examine the metadata of the failed tasks to identify a misbehaving node. Each node in a pool is given a unique ID, and the node on which a task runs is included in the task metadata. After you've identified a problem node, you can take several actions with it:

  • Reboot the node (REST | .NET)

    Restarting the node can sometimes clear up latent issues like stuck or crashed processes. If your pool uses a start task or your job uses a job preparation task, they are executed when the node restarts.

  • Reimage the node (REST | .NET)

    This reinstalls the operating system on the node. As with rebooting a node, start tasks and job preparation tasks are rerun after the node has been reimaged.

  • Remove the node from the pool (REST | .NET)

    Sometimes it is necessary to completely remove the node from the pool.

  • Disable task scheduling on the node (REST | .NET)

    This effectively takes the node offline so that no further tasks are assigned to it, but allows the node to remain running and in the pool. This enables you to investigate the cause of the failures without losing the failed task's data, and without the node causing additional task failures. For example, you can disable task scheduling on the node, then sign in remotely to examine the node's event logs or perform other troubleshooting. After you've finished your investigation, you can bring the node back online by enabling task scheduling (REST | .NET), or perform one of the other actions discussed earlier.

Important

With each action that is described in this section (reboot, reimage, remove, and disable task scheduling), you can specify how tasks currently running on the node are handled when you perform the action. For example, when you disable task scheduling on a node by using the Batch .NET client library, you can pass a DisableComputeNodeSchedulingOption enum value to indicate whether to terminate running tasks (Terminate), requeue them for scheduling on other nodes (Requeue), or allow running tasks to complete before performing the action (TaskCompletion).
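Identifying a misbehaving node from failed-task metadata, as described at the start of this section, amounts to grouping failures by node ID. The following is a minimal sketch under stated assumptions: the task records are plain dictionaries with hypothetical `task_id`, `node_id`, and `exit_code` keys rather than objects returned by a Batch SDK, and the `min_failures` threshold is an arbitrary illustrative choice.

```python
from collections import Counter


def find_suspect_nodes(tasks, min_failures=2):
    """Return node IDs with at least min_failures failed tasks.

    Hypothetical helper: a task counts as failed when its exit code is
    nonzero. Nodes are returned with the most failures first.
    """
    failures = Counter(
        task["node_id"] for task in tasks if task["exit_code"] != 0
    )
    return [
        node_id
        for node_id, count in failures.most_common()
        if count >= min_failures
    ]


# Illustrative task metadata; all values are made up for this example.
tasks = [
    {"task_id": "task1", "node_id": "node-a", "exit_code": 0},
    {"task_id": "task2", "node_id": "node-b", "exit_code": 1},
    {"task_id": "task3", "node_id": "node-b", "exit_code": 1},
    {"task_id": "task4", "node_id": "node-c", "exit_code": 0},
    {"task_id": "task5", "node_id": "node-b", "exit_code": 137},
    {"task_id": "task6", "node_id": "node-c", "exit_code": 2},
]

suspects = find_suspect_nodes(tasks)
print(suspects)
```

A threshold above one avoids flagging a node for a single failure that may come from the task's own input rather than from the node. A node flagged this way is a candidate for one of the actions listed above, such as disabling task scheduling while you investigate.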

Next steps

  • Learn about the Batch APIs and tools available for building Batch solutions.
  • Learn the basics of developing a Batch-enabled application by using the Batch .NET client library or Python. These quickstarts guide you through a sample application that uses the Batch service to execute a workload on multiple compute nodes, and they show how to use Azure Storage for workload file staging and retrieval.
  • Download and install Batch Explorer for use while you develop your Batch solutions. Use Batch Explorer to help create, debug, and monitor Azure Batch applications.
  • See community resources including Stack Overflow, the Batch Community repo, and the Azure Batch forum on MSDN.