Azure Batch best practices

This article collects best practices and useful tips for using the Azure Batch service effectively, based on real-life experience with Batch. These tips can help you improve performance and avoid design pitfalls in your Azure Batch solutions.


Pools are the compute resources for executing jobs on the Batch service. The following sections provide recommendations for working with Batch pools.

Pool configuration and naming

  • Pool allocation mode. When creating a Batch account, you can choose between two pool allocation modes: Batch service or user subscription. In most cases, you should use the default Batch service mode, in which pools are allocated behind the scenes in Batch-managed subscriptions. In the alternative user subscription mode, Batch VMs and other resources are created directly in your subscription when a pool is created. User subscription accounts are primarily used to enable an important but small subset of scenarios. You can read more about user subscription mode at Additional configuration for user subscription mode.

  • 'cloudServiceConfiguration' or 'virtualMachineConfiguration'. Use 'virtualMachineConfiguration'. All Batch features are supported by 'virtualMachineConfiguration' pools. Not all features are supported for 'cloudServiceConfiguration' pools, and no new capabilities are planned for them.

  • Consider job and task run time when determining job to pool mapping. If your jobs are composed primarily of short-running tasks, and the expected total task counts are small, so that the overall expected run time of the job is not long, do not allocate a new pool for each job. The time spent allocating nodes would make up a disproportionate share of the job's overall run time.

  • Pools should have more than one compute node. Individual nodes are not guaranteed to always be available. While uncommon, hardware failures, operating system updates, and a host of other issues can cause individual nodes to be offline. If your Batch workload requires deterministic, guaranteed progress, you should allocate pools with multiple nodes.

  • Do not reuse resource names. Batch resources (jobs, pools, etc.) often come and go over time. For example, you may create a pool on Monday, delete it on Tuesday, and then create another pool on Thursday. Give each new resource a unique name that you haven't used before. You can do this by using a GUID (either as the entire resource name or as part of it) or by embedding the resource's creation time in its name. Batch supports DisplayName, which lets you give a resource a human-readable name even if the actual resource ID is not human friendly. Unique names make it easier to tell which particular resource did something in logs and metrics. They also remove ambiguity if you ever have to file a support case for a resource.

  • Continuity during pool maintenance and failure. It's best to have your jobs use pools dynamically. If your jobs use the same pool for everything, there's a chance that your jobs won't run if something goes wrong with the pool. This is especially important for time-sensitive workloads. To fix this, select or create a pool dynamically when you schedule each job, or have a way to override the pool name so that you can bypass an unhealthy pool.

  • Business continuity during pool maintenance and failure. Many things can prevent a pool from growing to the size you need, such as internal errors or capacity constraints. For this reason, you should be ready to retarget jobs at a different pool (possibly with a different VM size - Batch supports this via UpdateJob) if necessary. Avoid using a static pool ID with the expectation that it will never be deleted and never change.
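
The naming advice above (a GUID, a creation time, or both in every resource name) can be sketched in a few lines of Python. This is an illustration only: the helper name make_resource_id and the prefix are hypothetical, and the 64-character limit and allowed character set are stated here as assumptions about Batch ID rules rather than taken from this article.

```python
import re
import uuid
from datetime import datetime, timezone

# Assumption: IDs may hold up to 64 letters, digits, hyphens, and underscores.
_ID_PATTERN = re.compile(r"[A-Za-z0-9_-]{1,64}")


def make_resource_id(prefix: str) -> str:
    """Build an ID from a prefix, the UTC creation time, and a GUID fragment."""
    created = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    suffix = uuid.uuid4().hex[:8]  # random part keeps same-second IDs distinct
    resource_id = f"{prefix}-{created}-{suffix}"
    if not _ID_PATTERN.fullmatch(resource_id):
        raise ValueError(f"invalid resource ID: {resource_id!r}")
    return resource_id


pool_id = make_resource_id("render-pool")
print(pool_id)
```

The readable prefix keeps log lines easy to scan, while the timestamp and random suffix keep a pool created on Thursday from colliding with the one deleted on Tuesday.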

Pool lifetime and billing

Pool lifetime can vary depending on the method of allocation and the options applied to the pool configuration. Pools can have an arbitrary lifetime and a varying number of compute nodes at any point in time. It's your responsibility to manage the compute nodes in the pool, either explicitly or through features provided by the service (autoscale or autopool).

  • Keep pools fresh. Resize your pools to zero every few months to ensure you get the latest node agent updates and bug fixes. Your pool won't receive node agent updates unless it's recreated or resized to 0 compute nodes. Before you recreate or resize your pool, it's recommended to download any node agent logs for debugging purposes, as discussed in the Nodes section.

  • Pool re-creation. On a similar note, it's not recommended to delete and re-create your pools on a daily basis. Instead, create a new pool and update your existing jobs to point to it. Once all of the tasks have been moved to the new pool, delete the old pool.

  • Pool efficiency and billing. Batch itself incurs no extra charges, but you do incur charges for the compute resources used. You're billed for every compute node in the pool, regardless of the state it's in. This includes any charges required for the node to run, such as storage and networking costs. For more best practices, see Cost analysis and budgets for Azure Batch.

Pool allocation failures

Pool allocation failures can happen at any point during first allocation or subsequent resizes. They can be caused by temporary capacity exhaustion in a region or by failures in other Azure services that Batch relies on. Your core quota is a limit, not a guarantee.

Unplanned downtime

It's possible for Batch pools to experience downtime events in Azure. Keep this in mind when planning and developing your scenario or workflow for Batch.

If a node fails, Batch automatically attempts to recover it on your behalf. This may trigger rescheduling of any task that was running on the recovered node. See Designing for retries to learn more about interrupted tasks.

Custom image pools

When you create an Azure Batch pool using the Virtual Machine Configuration, you specify a VM image that provides the operating system for each compute node in the pool. You can create the pool with a supported Azure Marketplace image, or you can create a custom image with a Shared Image Gallery image. While you can also use a managed image to create a custom image pool, we recommend creating custom images using the Shared Image Gallery whenever possible. Using the Shared Image Gallery helps you provision pools faster, scale larger quantities of VMs, and improve reliability when provisioning VMs.

Third-party images

Pools can be created using third-party images published to Azure Marketplace. With user subscription mode Batch accounts, you may see the error "Allocation failed due to marketplace purchase eligibility check" when creating a pool with certain third-party images. To resolve this error, accept the terms set by the publisher of the image. You can do so by using Azure PowerShell or Azure CLI.

Azure region dependency

If you have a time-sensitive or production workload, avoid depending on a single Azure region. While rare, there are issues that can affect an entire region. For example, if your processing needs to start at a specific time, consider scaling up the pool in your primary region well before your start time. If that pool scale fails, you can fall back to scaling up a pool in a backup region (or regions). Pools across multiple accounts in different regions provide a ready, easily accessible backup if something goes wrong with another pool. For more information, see Design your application for high availability.
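
The fallback described above can be sketched as a loop over regions in priority order. This is a minimal illustration, not Batch SDK code: scale_with_fallback, fake_try_scale, and the region names are hypothetical, and try_scale stands in for whatever call resizes a pool in the Batch account for that region.

```python
from typing import Callable, Optional, Sequence


def scale_with_fallback(
    regions: Sequence[str],
    try_scale: Callable[[str], bool],
) -> Optional[str]:
    """Try each region in priority order; return the first that scales, else None."""
    for region in regions:
        try:
            if try_scale(region):
                return region
        except Exception as exc:  # any failure means "move on to the next region"
            print(f"scale-up failed in {region}: {exc}")
    return None


# Stand-in for a real resize call: pretend the primary region has no capacity.
def fake_try_scale(region: str) -> bool:
    return region != "eastus"


chosen = scale_with_fallback(["eastus", "westus2", "northeurope"], fake_try_scale)
print(chosen)  # westus2
```

Running this well before the required start time leaves room to work through the whole list if the primary region cannot deliver the nodes.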


A job is a container that can hold hundreds, thousands, or even millions of tasks. Follow these guidelines when creating jobs.

Fewer jobs, more tasks

Using a job to run a single task is inefficient. For example, it's more efficient to use a single job containing 1000 tasks than to create 100 jobs that contain 10 tasks each. Running 1000 jobs, each with a single task, would be the least efficient, slowest, and most expensive approach.

Because of this, do not design a Batch solution that requires thousands of simultaneously active jobs. There is no quota for tasks, so executing many tasks under as few jobs as possible makes efficient use of your job and job schedule quotas.

Job lifetime

A Batch job has an indefinite lifetime until it's deleted from the system. Its state designates whether it can accept more tasks for scheduling.

A job does not move to the completed state unless it is terminated. You can terminate it explicitly, or have termination triggered automatically through the onAllTasksComplete property or maxWallClockTime.

There is a default active job and job schedule quota. Jobs and job schedules in completed state do not count towards this quota.
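
As a rough illustration of the two automatic triggers, the following sketch builds a job definition as a Python dictionary shaped like a Batch REST request body. The IDs and the 12-hour limit are hypothetical, and the exact body layout (poolInfo, constraints, the "terminatejob" value, and the ISO 8601 duration format) is stated here as an assumption to check against the REST reference.

```python
import json

# Hypothetical IDs; onAllTasksComplete and maxWallClockTime are the two
# properties named in the text above.
job_body = {
    "id": "nightly-render-20240101",
    "poolInfo": {"poolId": "render-pool-20240101"},
    # Terminate the job once every task in it has completed.
    "onAllTasksComplete": "terminatejob",
    # Upper bound on the job's run time, written as an ISO 8601 duration.
    "constraints": {"maxWallClockTime": "PT12H"},
}

print(json.dumps(job_body, indent=2))
```

One caveat worth verifying for your workflow: if automatic termination is enabled while the job still has no tasks, the job may complete before any task is added, so it can be safer to enable it only after task submission has finished.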


Tasks are the individual units of work that make up a job. Tasks are submitted by the user and scheduled by Batch onto compute nodes. There are several design considerations to keep in mind when creating and executing tasks. The following sections explain common scenarios and how to design your tasks to handle issues and perform efficiently.

Save task data

Compute nodes are by their nature ephemeral. Many Batch features, such as autopool and autoscale, can make it easy for nodes to disappear. When nodes leave a pool (due to a resize or a pool delete), all the files on those nodes are also deleted. Because of this, a task should move its output off the node it is running on and into a durable store before it completes. Similarly, if a task fails, it should move the logs required to diagnose the failure to a durable store.

Batch has integrated support for uploading data to Azure Storage via OutputFiles, as well as support for a variety of shared file systems, or you can perform the upload yourself in your tasks.

Manage task lifetime

Delete tasks when they are no longer needed, or set a retentionTime task constraint. If a retentionTime is set, Batch automatically cleans up the disk space used by the task when the retentionTime expires.

Deleting tasks accomplishes two things. It ensures that you do not have a build-up of tasks in the job, which can make it harder to query or find the task you're interested in (because you'll have to filter through the Completed tasks). It also cleans up the corresponding task data on the node (provided retentionTime has not already been hit). This helps ensure that your nodes don't fill up with task data and run out of disk space.
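
A retentionTime value has to be written as a duration. The sketch below assumes the REST convention of ISO 8601 durations and shows one way to produce them from a Python timedelta; the helper to_iso8601_duration, the task ID, the command line, and the six-hour value are all hypothetical.

```python
from datetime import timedelta


def to_iso8601_duration(delta: timedelta) -> str:
    """Format a non-negative timedelta as an ISO 8601 duration such as 'P1DT2H'."""
    total = int(delta.total_seconds())
    if total < 0:
        raise ValueError("duration must be non-negative")
    days, rem = divmod(total, 86400)
    hours, rem = divmod(rem, 3600)
    minutes, seconds = divmod(rem, 60)
    date_part = f"{days}D" if days else ""
    time_part = "".join(
        f"{value}{unit}"
        for value, unit in ((hours, "H"), (minutes, "M"), (seconds, "S"))
        if value
    )
    if not date_part and not time_part:
        return "PT0S"
    return "P" + date_part + ("T" + time_part if time_part else "")


# Hypothetical task definition that keeps its files for six hours after it runs.
task_body = {
    "id": "task-0001",
    "commandLine": "/bin/bash -c 'echo hello'",
    "constraints": {"retentionTime": to_iso8601_duration(timedelta(hours=6))},
}
print(task_body["constraints"])  # {'retentionTime': 'PT6H'}
```

A short retention time suits tasks that upload their own output, because the files on the node are no longer needed once the upload succeeds.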

Submit large numbers of tasks in collections

Tasks can be submitted individually or in collections. When doing bulk submission, submit tasks in collections of up to 100 at a time to reduce overhead and submission time.
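
Splitting a large task list into collections of at most 100 is simple to sketch. In this illustration the submit callable is a hypothetical stand-in for the SDK or REST call that adds one collection of tasks to a job; only the chunking logic is the point.

```python
from typing import Callable, Iterator, Sequence, TypeVar

T = TypeVar("T")
MAX_TASKS_PER_COLLECTION = 100  # collection size recommended in the text


def chunked(items: Sequence[T], size: int = MAX_TASKS_PER_COLLECTION) -> Iterator[Sequence[T]]:
    """Yield consecutive slices of items, each holding at most size elements."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def submit_in_collections(tasks: Sequence[T], submit: Callable[[Sequence[T]], None]) -> int:
    """Call submit once per collection and return the number of calls made."""
    calls = 0
    for batch in chunked(tasks):
        submit(batch)
        calls += 1
    return calls


tasks = [{"id": f"task-{i:04d}", "commandLine": "echo hello"} for i in range(250)]
sizes = []
calls = submit_in_collections(tasks, lambda batch: sizes.append(len(batch)))
print(calls, sizes)  # 3 [100, 100, 50]
```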

Set max tasks per node appropriately

Batch supports oversubscribing tasks on nodes (running more tasks than a node has cores). It's up to you to ensure that your tasks "fit" into the nodes in your pool. For example, you may have a degraded experience if you attempt to schedule eight tasks that each consume 25% CPU usage onto one node (in a pool with taskSlotsPerNode = 8).

Design for retries and re-execution

Tasks can be automatically retried by Batch. There are two types of retries: user-controlled and internal. User-controlled retries are specified by the task's maxTaskRetryCount. When a program specified in the task exits with a non-zero exit code, the task is retried up to the value of maxTaskRetryCount.

Although rare, a task can be retried internally due to failures on the compute node, such as not being able to update internal state or a failure on the node while the task is running. The task is retried on the same compute node, if possible, up to an internal limit. After that, Batch gives up on the node and defers the task to be rescheduled, potentially on a different compute node.
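
The user-controlled retry rule can be modeled locally to make the counting explicit. The sketch below is a simulation, not Batch code: run_with_retries is a hypothetical helper, and it assumes that the retry count is added on top of one initial attempt.

```python
from typing import Callable, List


def run_with_retries(run_once: Callable[[], int], max_task_retry_count: int) -> List[int]:
    """Run until exit code 0 or until retries are used up; return every exit code."""
    if max_task_retry_count < 0:
        raise ValueError("this sketch only models non-negative retry counts")
    exit_codes = []
    for _ in range(1 + max_task_retry_count):  # one initial try plus the retries
        code = run_once()
        exit_codes.append(code)
        if code == 0:  # a zero exit code means success, so stop retrying
            break
    return exit_codes


flaky = iter([1, 1, 0])  # fails twice, then succeeds
recovered = run_with_retries(lambda: next(flaky), max_task_retry_count=3)
exhausted = run_with_retries(lambda: 1, max_task_retry_count=2)
print(recovered, exhausted)  # [1, 1, 0] [1, 1, 1]
```

Because any attempt may be a rerun, whatever the task writes should be safe to write again.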

Build durable tasks

Tasks should be designed to withstand failure and accommodate retry. This is especially important for long-running tasks. To do this, ensure tasks generate the same, single result even if they are run more than once. One way to achieve this is to make your tasks "goal seeking." Another way is to make sure your tasks are idempotent (tasks have the same outcome no matter how many times they are run).

A common example is a task that copies files to a compute node. A simple approach is a task that copies all the specified files every time it runs, which is inefficient and isn't built to withstand failure. Instead, create a task that ensures the files are on the compute node and doesn't recopy files that are already present. In this way, the task picks up where it left off if it was interrupted.
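
The file copy example can be sketched with the Python standard library. This is one possible shape for a goal-seeking copy, under stated assumptions: ensure_files is a hypothetical helper, a matching file size is treated as "already present", and each file is written under a temporary name first so that an interrupted copy never leaves a half-written file under the final name.

```python
import os
import shutil
import tempfile
from pathlib import Path
from typing import List


def ensure_files(source_dir: Path, target_dir: Path) -> List[str]:
    """Copy only the files missing from target_dir; return the names copied."""
    target_dir.mkdir(parents=True, exist_ok=True)
    copied = []
    for source in sorted(source_dir.iterdir()):
        if not source.is_file():
            continue
        target = target_dir / source.name
        if target.exists() and target.stat().st_size == source.stat().st_size:
            continue  # already in place, nothing to do
        partial = target.with_name(target.name + ".partial")
        shutil.copyfile(source, partial)
        os.replace(partial, target)  # rename into place once the copy is complete
        copied.append(source.name)
    return copied


with tempfile.TemporaryDirectory() as tmp:
    src, dst = Path(tmp, "src"), Path(tmp, "dst")
    src.mkdir()
    for name in ("a.dat", "b.dat"):
        (src / name).write_text(name)
    first = ensure_files(src, dst)   # copies both files
    second = ensure_files(src, dst)  # finds both present and copies nothing
    print(first, second)  # ['a.dat', 'b.dat'] []
```

A stricter variant could compare checksums instead of sizes, at the cost of reading every file on each run.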

Avoid short execution time

Tasks that only run for one to two seconds are not ideal. Try to do a significant amount of work in an individual task (10 seconds at minimum, going up to hours or days). If each task executes for one minute or more, the scheduling overhead is a small fraction of overall compute time.

Use pool scope for short tasks on Windows nodes

When scheduling a task on Batch nodes, you can choose whether to run it with task scope or pool scope. If the task will only run for a short time, task scope can be inefficient due to the resources needed to create the auto-user account for that task. For greater efficiency, consider setting these tasks to pool scope. For more information, see Run a task as an auto-user with pool scope.


A compute node is an Azure virtual machine (VM) or cloud service VM that is dedicated to processing a portion of your application's workload. Follow these guidelines when working with nodes.

Idempotent start tasks

Just as with other tasks, the node start task should be idempotent, as it will be rerun every time the node boots. An idempotent task is simply one that produces a consistent result when run multiple times.

Isolated nodes

Consider using isolated VM sizes for workloads with compliance or regulatory requirements. Supported isolated sizes in virtual machine configuration mode include Standard_M128ms, Standard_F72s_v2, and Standard_E64i_v3. For more information about isolated VM sizes, see Virtual machine isolation in Azure.

Manage long-running services via the operating system services interface

Sometimes there is a need to run another agent alongside the Batch agent on the node. For example, you may want to gather data from the node and report it. We recommend that these agents be deployed as OS services, such as a Windows service or a Linux systemd service.

These services must not take file locks on any files in Batch-managed directories on the node; otherwise, Batch will be unable to delete those directories. For example, when installing a Windows service in a start task, do not launch the service directly from the start task working directory. Instead, copy the files elsewhere (or skip the copy if the files already exist), and then install the service from that location. When Batch reruns your start task, it deletes the start task working directory and creates it again. This works because the service holds file locks on the other directory, not on the start task working directory.

Avoid creating directory junctions in Windows

Directory junctions, sometimes called directory hard-links, are difficult to deal with during task and job cleanup. Use symlinks (soft-links) rather than hard-links.

Collect the Batch agent logs

If you notice a problem involving the behavior of a node or tasks running on a node, collect the Batch agent logs prior to deallocating the nodes in question. The Batch agent logs can be collected using the Upload Batch service logs API. These logs can be supplied as part of a support ticket to Azure and will help with issue troubleshooting and resolution.

Manage OS upgrades

For user subscription mode Batch accounts, automated OS upgrades can interrupt task progress, especially if the tasks are long-running. Building idempotent tasks can help to reduce errors caused by these interruptions. We also recommend scheduling OS image upgrades for times when tasks aren't expected to run.

Isolation security

If your scenario requires isolating jobs from each other, place them in separate pools. A pool is the security isolation boundary in Batch, and by default, two pools are not visible to each other and cannot communicate with each other. Avoid using separate Batch accounts as a means of isolation.

Moving Batch accounts across regions

There are scenarios in which it might be helpful to move an existing Batch account from one region to another. For example, you may want to move to another region as part of disaster recovery planning.

Azure Batch accounts cannot be moved directly from one region to another. You can, however, use an Azure Resource Manager template to export the existing configuration of your Batch account. You can then stage the resource in another region by exporting the Batch account to a template, modifying the parameters to match the destination region, and then deploying the template to the new region.

After you deploy the template to the new region, you will have to recreate certificates, job schedules, and application packages. To commit the changes and complete the move of the Batch account, remember to delete the original Batch account or resource group.

For more information on Resource Manager and templates, see Quickstart: Create and deploy Azure Resource Manager templates by using the Azure portal.


Review the following guidance related to connectivity in your Batch solutions.

Network Security Groups (NSGs) and User Defined Routes (UDRs)

When provisioning Batch pools in a virtual network, ensure that you closely follow the guidelines regarding the use of the BatchNodeManagement service tag, ports, protocols, and direction of the rule. Use of the service tag is highly recommended, rather than using the underlying Batch service IP addresses, because the IP addresses can change over time. Using Batch service IP addresses directly can cause instability, interruptions, or outages for your Batch pools.

For User Defined Routes (UDRs), ensure that you have a process in place to update Batch service IP addresses periodically in your route table, since these addresses change over time. To learn how to obtain the list of Batch service IP addresses, see Service tags on-premises. The Batch service IP addresses are associated with the BatchNodeManagement service tag (or the regional variant that matches your Batch account region).
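
A periodic route table refresh needs a step that extracts the Batch prefixes from the published service tag data. The sketch below assumes that data is a JSON document with a top-level "values" array whose entries carry a "name" and a "properties.addressPrefixes" list; that layout, the helper batch_prefixes, and the sample entries (which use reserved documentation address ranges, not real Batch addresses) are all assumptions for illustration.

```python
import json
from typing import List, Optional

# Illustrative sample only; the addresses are reserved documentation ranges.
SAMPLE = json.loads("""
{
  "values": [
    {"name": "BatchNodeManagement.EastUS",
     "properties": {"addressPrefixes": ["203.0.113.0/27", "198.51.100.8/29"]}},
    {"name": "Storage.EastUS",
     "properties": {"addressPrefixes": ["192.0.2.0/24"]}}
  ]
}
""")


def batch_prefixes(document: dict, region: Optional[str] = None) -> List[str]:
    """Return sorted address prefixes for the BatchNodeManagement tag or its regional variant."""
    wanted = "BatchNodeManagement" + (f".{region}" if region else "")
    prefixes = []
    for entry in document.get("values", []):
        if entry.get("name", "").lower() == wanted.lower():
            prefixes.extend(entry.get("properties", {}).get("addressPrefixes", []))
    return sorted(set(prefixes))


print(batch_prefixes(SAMPLE, "EastUS"))  # ['198.51.100.8/29', '203.0.113.0/27']
```

Comparing the returned list with the routes currently in place shows which entries to add or remove on each refresh.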

Honoring DNS

Ensure that your systems honor the DNS Time-to-Live (TTL) for your Batch account service URL. Additionally, ensure that your Batch service clients and other connectivity mechanisms to the Batch service do not rely on IP addresses (or create a pool with static public IP addresses as described below).

If your requests receive 5xx level HTTP responses and there is a "Connection: close" header in the response, your Batch service client should observe the recommendation by closing the existing connection, re-resolving DNS for the Batch account service URL, and attempting subsequent requests on a new connection.
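
The check itself is small and can be kept separate from whichever HTTP client you use. In this sketch, should_reset_connection is a hypothetical helper that only makes the decision; closing the connection and re-resolving DNS are left to the client.

```python
from typing import Mapping


def should_reset_connection(status: int, headers: Mapping[str, str]) -> bool:
    """Return True when a 5xx response carries a 'Connection: close' header."""
    if not 500 <= status <= 599:
        return False
    for name, value in headers.items():
        if name.lower() == "connection":  # header names are case-insensitive
            tokens = {token.strip().lower() for token in value.split(",")}
            return "close" in tokens
    return False


print(should_reset_connection(503, {"Connection": "close"}))       # True
print(should_reset_connection(503, {"Connection": "keep-alive"}))  # False
print(should_reset_connection(200, {"Connection": "close"}))       # False
```

When the function returns True, the caller would discard the pooled connection and open a fresh one so that the next request picks up any changed DNS record.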

Retry requests automatically

Ensure that your Batch service clients have appropriate retry policies in place to automatically retry your requests, even during normal operation and not only during service maintenance periods. These retry policies should span an interval of at least 5 minutes. Automatic retry capabilities are provided with various Batch SDKs, such as the .NET RetryPolicyProvider class.
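
If you implement retries yourself, it helps to verify that the chosen schedule really covers five minutes. The sketch below generates capped exponential backoff delays until their total reaches the window; backoff_delays and its default values are hypothetical choices, not SDK settings.

```python
from typing import List

MIN_RETRY_WINDOW_SECONDS = 5 * 60  # the five-minute span recommended in the text


def backoff_delays(
    base: float = 1.0,
    factor: float = 2.0,
    cap: float = 60.0,
    window: float = MIN_RETRY_WINDOW_SECONDS,
) -> List[float]:
    """Return capped exponential delays whose sum is at least window seconds."""
    if base <= 0 or cap <= 0 or factor < 1:
        raise ValueError("base and cap must be positive and factor at least 1")
    delays: List[float] = []
    delay = base
    while sum(delays) < window:
        delays.append(min(delay, cap))  # never wait longer than cap between tries
        delay *= factor
    return delays


delays = backoff_delays()
print(len(delays), sum(delays))  # 10 303.0
```

Adding random jitter to each delay is a common refinement that keeps many clients from retrying in lockstep.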

Static public IP addresses

Typically, virtual machines in a Batch pool are accessed through public IP addresses that can change over the lifetime of the pool. This can make it difficult to interact with a database or other external service that limits access to certain IP addresses. To ensure that the public IP addresses in your pool don't change unexpectedly, you can create a pool using a set of static public IP addresses that you control. For more information, see Create an Azure Batch pool with specified public IP addresses.

Testing connectivity with Cloud Services configuration

You can't use the normal "ping"/ICMP protocol with cloud services, because the ICMP protocol is not permitted through the Azure load balancer. For more information, see Connectivity and networking for Azure Cloud Services.

Batch node underlying dependencies

Consider the following dependencies and restrictions when designing your Batch solutions.

System-created resources

Azure Batch creates and manages a set of users and groups on the VM. Do not alter them. They are as follows:


  • A user named PoolNonAdmin
  • A user group named WATaskCommon


  • A user named _azbatch

File cleanup

Batch actively tries to clean up the working directory that tasks are run in once their retention time expires. Any files written outside of this directory are your responsibility to clean up, to avoid filling up disk space.

The automated cleanup of the working directory is blocked if you run a service on Windows from the startTask working directory, because the folder is still in use. This results in degraded performance. To fix this, change the directory for that service to a separate directory that isn't managed by Batch.

Next steps