Nodes and pools in Azure Batch

In an Azure Batch workflow, a compute node (or node) is a virtual machine that processes a portion of your application's workload. A pool is a collection of these nodes for your application to run on. This article explains more about nodes and pools, along with considerations when creating and using them in an Azure Batch workflow.

Nodes

A node is an Azure virtual machine (VM) or cloud service VM that is dedicated to processing a portion of your application's workload. The size of a node determines the number of CPU cores, the memory capacity, and the local file system size that are allocated to the node.

You can create pools of Windows or Linux nodes by using Azure Cloud Services, images from the Azure Virtual Machines Marketplace, or custom images that you prepare.

Nodes can run any executable or script that is supported by the operating system environment of the node. Executables or scripts include *.exe, *.cmd, *.bat, and PowerShell scripts (for Windows) and binaries, shell scripts, and Python scripts (for Linux).

All compute nodes in Batch also include:

By default, nodes can communicate with each other, but they can't communicate with virtual machines that are not part of the same pool. To allow nodes to communicate securely with other virtual machines, or with an on-premises network, you can provision the pool in a subnet of an Azure virtual network (VNet). When you do so, your nodes can be accessed through public IP addresses. These public IP addresses are created by Batch and may change over the lifetime of the pool. You can also create a pool with static public IP addresses that you control, which ensures that they won't change unexpectedly.

Pools

A pool is the collection of nodes that your application runs on.

Azure Batch pools build on top of the core Azure compute platform. They provide large-scale allocation, application installation, data distribution, health monitoring, and flexible adjustment (scaling) of the number of compute nodes within a pool.

Every node that is added to a pool is assigned a unique name and IP address. When a node is removed from a pool, any changes that were made to the operating system or files are lost, and its name and IP address are released for future use. When a node leaves a pool, its lifetime is over.

A pool can be used only by the Batch account in which it was created. A Batch account can create multiple pools to meet the resource requirements of the applications it will run.

The pool can be created manually, or automatically by the Batch service when you specify the work to be done. When you create a pool, you can specify the following attributes:

Important

Batch accounts have a default quota that limits the number of cores in a Batch account. The number of cores corresponds to the number of compute nodes. You can find the default quotas and instructions on how to increase a quota in Quotas and limits for the Azure Batch service. If your pool is not achieving its target number of nodes, the core quota might be the reason.

Operating system and version

When you create a Batch pool, you specify the Azure virtual machine configuration and the type of operating system you want to run on each compute node in the pool.

Configurations

There are two types of pool configurations available in Batch.

Virtual Machine Configuration

The Virtual Machine Configuration specifies that the pool is composed of Azure virtual machines. These VMs may be created from either Linux or Windows images.

When you create a pool based on the Virtual Machine Configuration, you must specify not only the size of the nodes and the source of the images used to create them, but also the virtual machine image reference and the Batch node agent SKU to be installed on the nodes. For more information about specifying these pool properties, see Provision Linux compute nodes in Azure Batch pools. You can optionally attach one or more empty data disks to pool VMs created from Marketplace images, or include data disks in custom images used to create the VMs. When you include data disks, you need to mount and format the disks from within a VM before you can use them.
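As a rough illustration of how these properties fit together, the following Python sketch assembles a JSON request body for a Virtual Machine Configuration pool. The property names follow the Batch REST API's pool body as commonly documented; the helper name `build_vm_pool_body` and all values (pool ID, VM size, image, node agent SKU ID) are placeholders chosen for this example, not recommendations, and should be checked against the images and SKUs your account supports.

```python
import json


def build_vm_pool_body(pool_id, vm_size, image_reference, node_agent_sku_id,
                       target_dedicated_nodes):
    """Assemble a request body for a Virtual Machine Configuration pool.

    Hypothetical helper: it only builds a dictionary and makes no service call.
    """
    return {
        "id": pool_id,
        "vmSize": vm_size,
        "virtualMachineConfiguration": {
            # The image reference and the node agent SKU must match each other.
            "imageReference": image_reference,
            "nodeAgentSKUId": node_agent_sku_id,
        },
        "targetDedicatedNodes": target_dedicated_nodes,
    }


# Illustrative values only (assumptions, not taken from this article).
body = build_vm_pool_body(
    pool_id="example-pool",
    vm_size="standard_d2_v3",
    image_reference={
        "publisher": "canonical",
        "offer": "ubuntuserver",
        "sku": "18.04-lts",
        "version": "latest",
    },
    node_agent_sku_id="batch.node.ubuntu 18.04",
    target_dedicated_nodes=2,
)
print(json.dumps(body, indent=2))
```

Keeping the body as plain data makes it easy to review the image reference and node agent SKU side by side before submitting the pool through whichever client you use.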

Cloud Services Configuration

The Cloud Services Configuration specifies that the pool is composed of Azure Cloud Services nodes. Cloud Services provides Windows compute nodes only.

Available operating systems for Cloud Services Configuration pools are listed in the Azure Guest OS releases and SDK compatibility matrix. When you create a pool that contains Cloud Services nodes, you need to specify the node size and its OS Family (which determines which versions of .NET are installed with the OS). Cloud Services is deployed to Azure more quickly than virtual machines running Windows. If you want pools of Windows compute nodes, you may find that Cloud Services provides a performance benefit in terms of deployment time.

As with worker roles within Cloud Services, you can specify an OS Version (for more information on worker roles, see the Cloud Services overview). We recommend that you specify Latest (*) for the OS Version so that the nodes are automatically upgraded and no work is required to cater to newly released versions. The primary use case for selecting a specific OS version is to ensure application compatibility: it allows backward compatibility testing to be performed before the version is allowed to update. After validation, the OS Version for the pool can be updated and the new OS image can be installed. Any running tasks will be interrupted and requeued.

Node Agent SKUs

When you create a pool, you need to select the appropriate nodeAgentSkuId, depending on the OS of the base image of your VHD. You can get a mapping of available node agent SKU IDs to their OS Image references by calling the List Supported Node Agent SKUs operation.

Custom images for Virtual Machine pools

To learn how to create a pool with custom images, see Use the Shared Image Gallery to create a custom pool.

Alternatively, you can create a custom pool of virtual machines using a managed image resource. For information about preparing custom Linux images from Azure VMs, see How to create an image of a virtual machine or VHD. For information about preparing custom Windows images from Azure VMs, see Create a managed image of a generalized VM in Azure.

Container support in Virtual Machine pools

When creating a Virtual Machine Configuration pool using the Batch APIs, you can set up the pool to run tasks in Docker containers. Currently, you must create the pool using an image that supports Docker containers. Use the Windows Server 2016 Datacenter with Containers image from the Azure Marketplace, or supply a custom VM image that includes Docker Community Edition or Enterprise Edition and any required drivers. The pool settings must include a container configuration that copies container images to the VMs when the pool is created. Tasks that run on the pool can then reference the container images and container run options.

For more information, see Run Docker container applications on Azure Batch.

Node type and target

When you create a pool, you can specify which types of nodes you want and the target number for each. The two types of nodes are:

  • Dedicated nodes. Dedicated compute nodes are reserved for your workloads. They are guaranteed to never be preempted.

The dedicated node type has its own target setting, for which you can specify the desired number of nodes.

The number of compute nodes is referred to as a target because, in some situations, your pool might not reach the desired number of nodes. For example, a pool might not achieve the target if it reaches the core quota for your Batch account first. Or, the pool might not achieve the target if you have applied an automatic scaling formula to the pool that limits the maximum number of nodes.
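To make the difference between a target and the node count actually reached more concrete, here is a minimal Python sketch. It assumes that the core quota is consumed per node according to the VM size's core count and that an autoscale formula may impose an upper bound; the function name `achievable_nodes` and its parameters are hypothetical and are not part of any Batch API.

```python
def achievable_nodes(target_nodes, cores_per_node, core_quota,
                     cores_in_use=0, autoscale_max=None):
    """Estimate how many nodes a pool can reach for a given target.

    Hypothetical model: every node consumes `cores_per_node` cores of the
    account's core quota, and an optional autoscale cap limits the result.
    """
    if cores_per_node <= 0:
        raise ValueError("cores_per_node must be positive")
    free_cores = max(core_quota - cores_in_use, 0)
    limit_from_quota = free_cores // cores_per_node
    reachable = min(target_nodes, limit_from_quota)
    if autoscale_max is not None:
        reachable = min(reachable, autoscale_max)
    return reachable


# A target of 10 four-core nodes against a 20-core quota stops at 5 nodes.
print(achievable_nodes(target_nodes=10, cores_per_node=4, core_quota=20))
```

If the result is lower than the target, the quota or the scaling cap is the likely reason the pool is not reaching its desired size.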

For pricing information for dedicated nodes, see Batch Pricing.

Node size

When you create an Azure Batch pool, you can choose from among almost all the VM families and sizes available in Azure. Azure offers a range of VM sizes for different workloads, including specialized GPU-enabled VM sizes.

For more information, see Choose a VM size for compute nodes in an Azure Batch pool.

Automatic scaling policy

For dynamic workloads, you can apply an automatic scaling policy to a pool. The Batch service periodically evaluates your formula and dynamically adjusts the number of nodes within the pool according to the current workload and resource usage of your compute scenario. This allows you to lower the overall cost of running your application by using only the resources you need and releasing those you don't need.

You enable automatic scaling by writing an automatic scaling formula and associating that formula with a pool. The Batch service uses the formula to determine the target number of nodes in the pool for the next scaling interval (an interval that you can configure). You can specify the automatic scaling settings for a pool when you create it, or enable scaling on a pool later. You can also update the scaling settings on a scaling-enabled pool.

As an example, perhaps a job requires that you submit a large number of tasks to be executed. You can assign a scaling formula to the pool that adjusts the number of nodes in the pool based on the current number of queued tasks and the completion rate of the tasks in the job. The Batch service periodically evaluates the formula and resizes the pool, based on workload and your other formula settings. The service adds nodes as needed when there are a large number of queued tasks, and removes nodes when there are no queued or running tasks.
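The scaling behavior described in this example can be sketched in Python as follows. This is not the Batch autoscale formula language; it is a simplified model of one evaluation step, and the function name `next_target_nodes`, the `tasks_per_node` parameter, and the default cap of 25 nodes are assumptions made for illustration.

```python
import math


def next_target_nodes(queued_tasks, running_tasks, tasks_per_node=1,
                      max_nodes=25):
    """Return a target node count for the next scaling interval.

    Simplified model: size the pool to the outstanding work, never exceed
    `max_nodes`, and shrink to zero when nothing is queued or running.
    """
    if tasks_per_node < 1:
        raise ValueError("tasks_per_node must be at least 1")
    outstanding = queued_tasks + running_tasks
    if outstanding == 0:
        return 0
    wanted = math.ceil(outstanding / tasks_per_node)
    return min(wanted, max_nodes)


print(next_target_nodes(queued_tasks=100, running_tasks=0))  # capped at 25
print(next_target_nodes(queued_tasks=0, running_tasks=0))    # scales to 0
```

A real formula would typically smooth over several samples rather than react to a single reading, which this sketch deliberately omits.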

A scaling formula can be based on the following metrics:

  • Time metrics are based on statistics collected every five minutes in the specified number of hours.
  • Resource metrics are based on CPU usage, bandwidth usage, memory usage, and number of nodes.
  • Task metrics are based on task state, such as Active (queued), Running, or Completed.

When automatic scaling decreases the number of compute nodes in a pool, you must consider how to handle tasks that are running at the time of the decrease operation. To accommodate this, Batch provides a node deallocation option that you can include in your formulas. For example, you can specify that running tasks are stopped immediately and then requeued for execution on another node, or allowed to finish before the node is removed from the pool. Note that setting the node deallocation option to taskcompletion or retaineddata prevents pool resize operations until all tasks have completed or all task retention periods have expired, respectively.

For more information about automatically scaling an application, see Automatically scale compute nodes in an Azure Batch pool.

Tip

To maximize compute resource utilization, set the target number of nodes to zero at the end of a job, but allow running tasks to finish.

Task scheduling policy

The max tasks per node configuration option determines the maximum number of tasks that can be run in parallel on each compute node within the pool.

The default configuration specifies that one task at a time runs on a node, but there are scenarios where it is beneficial to have two or more tasks executed on a node simultaneously. See the example scenario in the concurrent node tasks article to see how you can benefit from multiple tasks per node.

You can also specify a fill type, which determines whether Batch spreads the tasks evenly across all nodes in a pool, or packs each node with the maximum number of tasks before assigning tasks to another node.
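The difference between the two fill types can be illustrated with a small Python simulation. It is a sketch of the idea only, not the scheduler Batch actually uses; the function name `assign_tasks` and its parameters are hypothetical.

```python
def assign_tasks(task_count, node_count, max_tasks_per_node, fill_type="spread"):
    """Return the number of tasks placed on each node for a fill type.

    "spread" distributes tasks round-robin across all nodes; "pack" fills
    each node up to `max_tasks_per_node` before moving to the next one.
    Tasks beyond the pool's total capacity stay queued and are not placed.
    """
    capacity = node_count * max_tasks_per_node
    to_place = min(task_count, capacity)
    loads = [0] * node_count
    if fill_type == "pack":
        for index in range(node_count):
            take = min(max_tasks_per_node, to_place)
            loads[index] = take
            to_place -= take
    elif fill_type == "spread":
        for task in range(to_place):
            loads[task % node_count] += 1
    else:
        raise ValueError("fill_type must be 'spread' or 'pack'")
    return loads


print(assign_tasks(5, 3, 4, "spread"))  # [2, 2, 1]
print(assign_tasks(5, 3, 4, "pack"))    # [4, 1, 0]
```

Packing leaves whole nodes idle, which pairs well with scaling in, while spreading gives each task more of a node's resources.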

Communication status

In most scenarios, tasks operate independently and do not need to communicate with one another. However, there are some applications in which tasks must communicate, such as MPI scenarios.

You can configure a pool to allow internode communication so that nodes within a pool can communicate at runtime. When internode communication is enabled, nodes in Cloud Services Configuration pools can communicate with each other on ports greater than 1100, and Virtual Machine Configuration pools do not restrict traffic on any port.

Enabling internode communication also impacts the placement of the nodes within clusters and might limit the maximum number of nodes in a pool because of deployment restrictions. If your application does not require communication between nodes, the Batch service can allocate a potentially large number of nodes to the pool from many different clusters and data centers to enable increased parallel processing power.

Start tasks

If desired, you can add a start task that executes on each node as that node joins the pool, and each time a node is restarted or reimaged. The start task is especially useful for preparing compute nodes for the execution of tasks, for example by installing the applications that your tasks run on the compute nodes.

Application packages

You can specify application packages to deploy to the compute nodes in the pool. Application packages provide simplified deployment and versioning of the applications that your tasks run. Application packages that you specify for a pool are installed on every node that joins that pool, and again every time a node is rebooted or reimaged.

For more information about using application packages to deploy your applications to your Batch nodes, see Deploy applications to compute nodes with Batch application packages.

Virtual network (VNet) and firewall configuration

When you provision a pool of compute nodes in Batch, you can associate the pool with a subnet of an Azure virtual network (VNet). To use an Azure VNet, the Batch client API must use Azure Active Directory (AD) authentication. Azure Batch support for Azure AD is documented in Authenticate Batch service solutions with Active Directory.

VNet requirements

General requirements

  • The VNet must be in the same subscription and region as the Batch account you use to create your pool.

  • The pool using the VNet can have a maximum of 4096 nodes.

  • The subnet specified for the pool must have enough unassigned IP addresses to accommodate the number of VMs targeted for the pool; that is, the sum of the targetDedicatedNodes and targetLowPriorityNodes properties of the pool. If the subnet doesn't have enough unassigned IP addresses, the pool partially allocates the compute nodes, and a resize error occurs.

  • Your Azure Storage endpoint needs to be resolved by any custom DNS servers that serve your VNet. Specifically, URLs of the form <account>.table.core.chinacloudapi.cn, <account>.queue.core.chinacloudapi.cn, and <account>.blob.core.chinacloudapi.cn should be resolvable.
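The subnet sizing requirement in the list above lends itself to a quick pre-flight check. The Python sketch below uses the standard `ipaddress` module to compare the pool's target node counts with the addresses a subnet can offer. The function name `subnet_has_room` and the `addresses_in_use` parameter are hypothetical, and the constant of five reserved addresses per subnet is an assumption about Azure's general subnet behavior that this article does not state.

```python
import ipaddress

# Assumption: Azure reserves five addresses in every subnet.
AZURE_RESERVED_PER_SUBNET = 5


def subnet_has_room(subnet_cidr, target_dedicated_nodes,
                    target_low_priority_nodes, addresses_in_use=0):
    """Check whether a subnet can hold all VMs targeted for a pool.

    The pool needs one unassigned IP address per targeted node, that is,
    targetDedicatedNodes + targetLowPriorityNodes.
    """
    subnet = ipaddress.ip_network(subnet_cidr)
    free = subnet.num_addresses - AZURE_RESERVED_PER_SUBNET - addresses_in_use
    needed = target_dedicated_nodes + target_low_priority_nodes
    return needed <= free


print(subnet_has_room("10.0.0.0/24", 200, 51))  # 251 needed, 251 free
print(subnet_has_room("10.0.0.0/24", 200, 52))  # one address short
```

Running a check like this before a resize helps avoid the partial allocation and resize error described above.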

Additional VNet requirements differ, depending on whether the Batch pool is in the Virtual Machine configuration or the Cloud Services configuration. For new pool deployments into a VNet, the Virtual Machine configuration is recommended.

Pools in the Virtual Machine configuration

Supported VNets - Azure Resource Manager-based VNets only

Subnet ID - When specifying the subnet using the Batch APIs, use the resource identifier of the subnet. The subnet identifier is of the form:

/subscriptions/{subscription}/resourceGroups/{group}/providers/Microsoft.Network/virtualNetworks/{network}/subnets/{subnet}
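A small Python helper can build this identifier from its four parts and reject obviously malformed input. The function name `vm_pool_subnet_id` is hypothetical, and the validation shown (non-empty segments without slashes) is only a basic sanity check assumed for this sketch.

```python
def vm_pool_subnet_id(subscription, group, network, subnet):
    """Build the subnet resource identifier for a Resource Manager VNet."""
    for name, value in (("subscription", subscription), ("group", group),
                        ("network", network), ("subnet", subnet)):
        # Each segment must be present and must not contain a path separator.
        if not value or "/" in value:
            raise ValueError(f"invalid {name}: {value!r}")
    return (
        f"/subscriptions/{subscription}/resourceGroups/{group}"
        f"/providers/Microsoft.Network/virtualNetworks/{network}"
        f"/subnets/{subnet}"
    )


# Placeholder values for illustration only.
print(vm_pool_subnet_id("00000000-0000-0000-0000-000000000000",
                        "example-rg", "example-vnet", "batch-subnet"))
```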

Permissions - Check whether your security policies or locks on the VNet's subscription or resource group restrict a user's permissions to manage the VNet.

Additional networking resources - Batch automatically allocates additional networking resources in the resource group containing the VNet.

Important

For each 100 dedicated or low-priority nodes, Batch allocates: one network security group (NSG), one public IP address, and one load balancer. These resources are limited by the subscription's resource quotas. For large pools, you might need to request a quota increase for one or more of these resources.
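To estimate how many of these resources a pool will consume, the rule above can be turned into simple arithmetic. The sketch below assumes that dedicated and low-priority nodes are counted together and that a partial group of 100 nodes still consumes a full set of resources; neither detail is stated explicitly in this article, and the function name `network_resources_for` is hypothetical.

```python
import math


def network_resources_for(dedicated_nodes, low_priority_nodes=0):
    """Estimate the networking resources Batch allocates for a pool.

    Assumption: one NSG, one public IP address, and one load balancer per
    started group of 100 nodes, counting both node types together.
    """
    total_nodes = dedicated_nodes + low_priority_nodes
    groups = math.ceil(total_nodes / 100)
    return {
        "network_security_groups": groups,
        "public_ip_addresses": groups,
        "load_balancers": groups,
    }


print(network_resources_for(dedicated_nodes=200, low_priority_nodes=50))
```

Comparing the result with your subscription's quotas shows in advance whether a quota increase is needed for a large pool.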

Network security groups: Batch default

The subnet must allow inbound communication from the Batch service to be able to schedule tasks on the compute nodes, and outbound communication to communicate with Azure Storage or other resources as needed by your workload. For pools in the Virtual Machine configuration, Batch adds NSGs at the level of the network interfaces (NICs) attached to compute nodes. These NSGs are configured with the following additional rules:

  • Inbound TCP traffic on ports 29876 and 29877 from Batch service IP addresses that correspond to the BatchNodeManagement service tag.
  • Inbound TCP traffic on port 22 (Linux nodes) or port 3389 (Windows nodes) to permit remote access. For certain types of multi-instance tasks on Linux (such as MPI), you will also need to allow SSH port 22 traffic for IPs in the subnet containing the Batch compute nodes. This may be blocked per subnet-level NSG rules (see below).
  • Outbound traffic on any port to the virtual network. This may be amended per subnet-level NSG rules (see below).
  • Outbound traffic on any port to the Internet. This may be amended per subnet-level NSG rules (see below).

Important

Use caution if you modify or add inbound or outbound rules in Batch-configured NSGs. If communication to the compute nodes in the specified subnet is denied by an NSG, the Batch service will set the state of the compute nodes to unusable. Additionally, no resource locks should be applied to any resource created by Batch, since this can prevent cleanup of resources as a result of user-initiated actions such as deleting a pool.

Network security groups: Specifying subnet-level rules

You don't have to specify NSGs at the virtual network subnet level, because Batch configures its own NSGs (see above). If you have an NSG associated with the subnet where Batch compute nodes are deployed, or if you would like to apply custom NSG rules to override the defaults applied, you must configure this NSG with at least the inbound and outbound security rules shown in the following tables.

Configure inbound traffic on port 3389 (Windows) or 22 (Linux) only if you need to permit remote access to the compute nodes from outside sources. You may need to enable port 22 rules on Linux if you require support for multi-instance tasks with certain MPI runtimes. Allowing traffic on these ports is not strictly required for the pool compute nodes to be usable.

Inbound security rules

Source IP addresses | Source service tag | Source ports | Destination | Destination ports | Protocol | Action
N/A | BatchNodeManagement service tag (if using regional variant, in the same region as your Batch account) | * | Any | 29876-29877 | TCP | Allow
User source IPs for remotely accessing compute nodes and/or compute node subnet for Linux multi-instance tasks, if required. | N/A | * | Any | 3389 (Windows), 22 (Linux) | TCP | Allow

Warning

Batch service IP addresses can change over time. Therefore, it is highly recommended to use the BatchNodeManagement service tag (or regional variant) for NSG rules. Avoid populating NSG rules with specific Batch service IP addresses.

Outbound security rules

Source | Source ports | Destination | Destination service tag | Destination ports | Protocol | Action
Any | * | Service tag | Storage (if using regional variant, in the same region as your Batch account) | 443 | TCP | Allow
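The two mandatory rules from the tables above (inbound TCP 29876-29877 from BatchNodeManagement, outbound TCP 443 to Storage) can be expressed as data and checked against a subnet-level rule set. The following Python sketch is a simplified model: the `Rule` tuple, its field names, and the `missing_rules` function are hypothetical, and the check ignores rule priorities and Deny rules, which a real NSG evaluation would take into account.

```python
from collections import namedtuple

# direction: "inbound" or "outbound"; peer: source tag (inbound) or
# destination tag (outbound); ports are an inclusive range.
Rule = namedtuple("Rule", "direction peer port_low port_high protocol action")

REQUIRED_RULES = [
    Rule("inbound", "BatchNodeManagement", 29876, 29877, "TCP", "Allow"),
    Rule("outbound", "Storage", 443, 443, "TCP", "Allow"),
]


def covers(rule, required):
    """Return True if an Allow rule fully covers a required rule."""
    return (
        rule.action == "Allow"
        and rule.direction == required.direction
        and rule.peer in (required.peer, "*")
        and rule.protocol in (required.protocol, "*")
        and rule.port_low <= required.port_low
        and rule.port_high >= required.port_high
    )


def missing_rules(subnet_rules):
    """List the required rules that no rule in `subnet_rules` covers."""
    return [req for req in REQUIRED_RULES
            if not any(covers(rule, req) for rule in subnet_rules)]


custom_nsg = [Rule("outbound", "*", 0, 65535, "*", "Allow")]
print(missing_rules(custom_nsg))  # only the inbound Batch rule is missing
```

Using the service tags rather than fixed IP addresses in such a rule set keeps it valid when the Batch service addresses change.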

Pools in the Cloud Services configuration

Supported VNets - Classic VNets only

Subnet ID - When specifying the subnet using the Batch APIs, use the resource identifier of the subnet. The subnet identifier is of the form:

/subscriptions/{subscription}/resourceGroups/{group}/providers/Microsoft.ClassicNetwork/virtualNetworks/{network}/subnets/{subnet}

Permissions - The Microsoft Azure Batch service principal must have the Classic Virtual Machine Contributor Azure role for the specified VNet.

Network security groups

The subnet must allow inbound communication from the Batch service to be able to schedule tasks on the compute nodes, and outbound communication to communicate with Azure Storage or other resources.

You do not need to specify an NSG, because Batch configures inbound communication only from Batch IP addresses to the pool nodes. However, if the specified subnet has associated NSGs and/or a firewall, configure the inbound and outbound security rules as shown in the following tables. If communication to the compute nodes in the specified subnet is denied by an NSG, the Batch service sets the state of the compute nodes to unusable.

Configure inbound traffic on port 3389 for Windows if you need to permit RDP access to the pool nodes. This is not required for the pool nodes to be usable.

Inbound security rules

Source IP addresses | Source ports | Destination | Destination ports | Protocol | Action
Any (Although this requires effectively "allow all", the Batch service applies an ACL rule at the level of each node that filters out all non-Batch service IP addresses.) | * | Any | 10100, 20100, 30100 | TCP | Allow
Optional, to allow RDP access to compute nodes. | * | Any | 3389 | TCP | Allow

Outbound security rules

Source | Source ports | Destination | Destination ports | Protocol | Action
Any | * | Any | 443 | Any | Allow

For more information about setting up a Batch pool in a VNet, see Create a pool of virtual machines with your virtual network.

Tip

To ensure that the public IP addresses used to access nodes don't change, you can create a pool with specified public IP addresses that you control.

Pool and compute node lifetime

When you design your Azure Batch solution, you must specify how and when pools are created, and how long compute nodes within those pools are kept available.

On one end of the spectrum, you can create a pool for each job that you submit, and delete the pool as soon as its tasks finish execution. This maximizes utilization because the nodes are only allocated when needed, and they are shut down once they're idle. While this means that the job must wait for the nodes to be allocated, it's important to note that tasks are scheduled for execution as soon as nodes are individually allocated and the start task has completed. Batch does not wait until all nodes within a pool are available before assigning tasks to the nodes. This ensures maximum utilization of all available nodes.

At the other end of the spectrum, if having jobs start immediately is the highest priority, you can create a pool ahead of time and make its nodes available before jobs are submitted. In this scenario, tasks can start immediately, but nodes might sit idle while waiting for tasks to be assigned.

A combined approach is typically used for handling a variable but ongoing load. You can have a pool to which multiple jobs are submitted, and you can scale the number of nodes up or down according to the job load. You can do this reactively, based on current load, or proactively, if load can be predicted. For more information, see Automatic scaling policy.

Security with certificates

You typically need to use certificates when you encrypt or decrypt sensitive information for tasks, like the key for an Azure Storage account. To support this, you can install certificates on nodes. Encrypted secrets are passed to tasks via command-line parameters or embedded in one of the task resources, and the installed certificates can be used to decrypt them.

You use the Add certificate operation (Batch REST) or CertificateOperations.CreateCertificate method (Batch .NET) to add a certificate to a Batch account. You can then associate the certificate with a new or existing pool.

When a certificate is associated with a pool, the Batch service installs the certificate on each node in the pool. The Batch service installs the appropriate certificates when the node starts up, before launching any tasks (including the start task and job manager task).

If you add a certificate to an existing pool, you must reboot its compute nodes in order for the certificate to be applied to the nodes.

Next steps