Use RDMA or GPU instances in Batch pools

To run certain Batch jobs, you can take advantage of Azure VM sizes designed for large-scale computation. For example:

  • To run multi-instance MPI workloads, choose H-series or other sizes that have a network interface for Remote Direct Memory Access (RDMA). These sizes connect to an InfiniBand network for inter-node communication, which can accelerate MPI applications.

  • For CUDA applications, choose N-series sizes that include NVIDIA Tesla graphics processing unit (GPU) cards.

This article provides guidance and examples for using some of Azure's specialized sizes in Batch pools. For specs and background, see the Azure documentation for these VM sizes.

Note

Certain VM sizes might not be available in the regions where you create your Batch accounts. To check that a size is available, see Products available by region and Choose a VM size for a Batch pool.

Dependencies

The RDMA or GPU capabilities of compute-intensive sizes in Batch are supported only in certain operating systems. (The list of supported operating systems is a subset of those supported for virtual machines created in these sizes.) Depending on how you create your Batch pool, you might need to install or configure additional drivers or other software on the nodes. The following tables summarize these dependencies. See the linked articles for details. For options to configure Batch pools, see later in this article.

Linux pools - Virtual machine configuration

| Size | Capability | Operating systems | Required software | Pool settings |
|---|---|---|---|---|
| NC24rs_v3 * | RDMA | Ubuntu 16.04 LTS, or CentOS-based HPC (Azure Marketplace) | Intel MPI 5; Linux RDMA drivers | Enable inter-node communication, disable concurrent task execution |
| NCv3 series | NVIDIA Tesla GPU (varies by series) | Ubuntu 16.04 LTS, or CentOS 7.3 or 7.4 (Azure Marketplace) | NVIDIA CUDA or CUDA Toolkit drivers | N/A |

*RDMA-capable N-series sizes also include NVIDIA Tesla GPUs.

Windows pools - Virtual machine configuration

| Size | Capability | Operating systems | Required software | Pool settings |
|---|---|---|---|---|
| NC24rs_v3 * | RDMA | Windows Server 2016, 2012 R2, or 2012 (Azure Marketplace) | Microsoft MPI 2012 R2 or later, or Intel MPI 5; Windows RDMA drivers | Enable inter-node communication, disable concurrent task execution |
| NCv3 series | NVIDIA Tesla GPU (varies by series) | Windows Server 2016 or 2012 R2 (Azure Marketplace) | NVIDIA CUDA or CUDA Toolkit drivers | N/A |

*RDMA-capable N-series sizes also include NVIDIA Tesla GPUs.
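
The pool settings listed for the RDMA-capable sizes can be sketched as a fragment of a pool definition. The following Python snippet is only an illustration: it assumes the property names of the Batch REST API pool body (`enableInterNodeCommunication` and `taskSlotsPerNode`, which older API versions call `maxTasksPerNode`), and it assumes `Standard_NC24rs_v3` as the API name for the NC24rs_v3 size. The pool ID and node count are hypothetical, and the fragment leaves out the image and node agent settings that a real pool requires.

```python
import json


def rdma_pool_settings(pool_id: str, vm_size: str, node_count: int) -> dict:
    """Return the pool-body fields that matter for RDMA/MPI workloads."""
    return {
        "id": pool_id,                          # hypothetical pool name
        "vmSize": vm_size,
        "targetDedicatedNodes": node_count,
        # "Enable inter-node communication" from the tables.
        "enableInterNodeCommunication": True,
        # "Disable concurrent task execution": one task slot per node.
        "taskSlotsPerNode": 1,
    }


settings = rdma_pool_settings("rdma-pool-example", "Standard_NC24rs_v3", 2)
print(json.dumps(settings, indent=2))
```

Limiting each node to a single task slot keeps a multi-instance MPI task from sharing a node with other tasks, which is what the "disable concurrent task execution" setting asks for.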

Pool configuration options

To configure a specialized VM size for your Batch pool, you have several options to install required software or drivers:

  • For pools in the virtual machine configuration, choose a preconfigured Azure Marketplace VM image that has drivers and software preinstalled.

  • Create a custom Windows or Linux VM image on which you have installed the drivers, software, or other settings required for the VM size.

  • Create a Batch application package from a zipped driver or application installer, and configure Batch to deploy the package to pool nodes and install it once when each node is created. For example, if the application package is an installer, create a start task command line to silently install the app on all pool nodes. Consider using an application package and a pool start task if your workload depends on a particular driver version.

    Note

    The start task must run with elevated (admin) permissions and must be set to wait for success. Long-running tasks will increase the time to provision a Batch pool.

  • Batch Shipyard automatically configures the GPU and RDMA drivers to work transparently with containerized workloads on Azure Batch. Batch Shipyard is entirely driven by configuration files. Many sample recipe configurations are available that enable GPU and RDMA workloads, such as the CNTK GPU Recipe, which preconfigures GPU drivers on N-series VMs and loads Microsoft Cognitive Toolkit software as a Docker image.

Example: NVIDIA GPU drivers on a Windows NC VM pool

To run CUDA applications on a pool of Windows NC nodes, you need to install NVIDIA GPU drivers. The following sample steps use an application package to install the NVIDIA GPU drivers. You might choose this option if your workload depends on a specific GPU driver version.

  1. Download a setup package for the GPU drivers on Windows Server 2016 from the NVIDIA website, for example, version 411.82. Save the file locally using a short name like GPUDriverSetup.exe.

  2. Create a zip file of the package.

  3. Upload the package to your Batch account. For steps, see the application packages guidance. Specify an application ID such as GPUDriver, and a version such as 411.82.

  4. Using the Batch APIs or Azure portal, create a pool in the virtual machine configuration with the desired number of nodes and scale. The following table shows sample settings to install the NVIDIA GPU drivers silently using a start task:

    | Setting | Value |
    |---|---|
    | Image Type | Marketplace (Linux/Windows) |
    | Publisher | MicrosoftWindowsServer |
    | Offer | WindowsServer |
    | Sku | 2016-Datacenter |
    | Node size | NC6 Standard |
    | Application package references | GPUDriver, version 411.82 |
    | Start task enabled | True |
    | Start task command line | cmd /c "%AZ_BATCH_APP_PACKAGE_GPUDriver#411.82%\\GPUDriverSetup.exe /s" |
    | Start task user identity | Pool autouser, admin |
    | Start task wait for success | True |
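
Steps 2 and 4 above can be sketched in Python using only the standard library. This is a hedged illustration, not the official procedure: the dictionary assumes the property names of the Batch REST API pool body, `Standard_NC6` is assumed to be the API name for the "NC6 Standard" size, the node agent SKU `batch.node.windows amd64` is an assumption, and the pool ID and node count are hypothetical. The doubled backslash in the command line is copied verbatim from the settings table.

```python
import json
import zipfile
from pathlib import Path

APP_ID = "GPUDriver"
APP_VERSION = "411.82"
INSTALLER = "GPUDriverSetup.exe"


def zip_installer(installer_path: str, zip_path: str) -> str:
    """Step 2: place the installer at the root of a zip archive."""
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.write(installer_path, arcname=Path(installer_path).name)
    return zip_path


def start_task_command(app_id: str, version: str, installer: str) -> str:
    """Build the silent-install command line shown in the settings table."""
    package_dir = f"%AZ_BATCH_APP_PACKAGE_{app_id}#{version}%"
    # The doubled backslash mirrors the table entry exactly.
    return 'cmd /c "' + package_dir + r"\\" + installer + ' /s"'


def windows_gpu_pool(pool_id: str, node_count: int) -> dict:
    """Step 4: pool definition matching the sample settings."""
    return {
        "id": pool_id,                      # hypothetical pool name
        "vmSize": "Standard_NC6",           # assumed name for "NC6 Standard"
        "targetDedicatedNodes": node_count,
        "virtualMachineConfiguration": {
            "imageReference": {
                "publisher": "MicrosoftWindowsServer",
                "offer": "WindowsServer",
                "sku": "2016-Datacenter",
            },
            "nodeAgentSKUId": "batch.node.windows amd64",  # assumption
        },
        "applicationPackageReferences": [
            {"applicationId": APP_ID, "version": APP_VERSION}
        ],
        "startTask": {
            "commandLine": start_task_command(APP_ID, APP_VERSION, INSTALLER),
            # "Pool autouser, admin" from the table.
            "userIdentity": {
                "autoUser": {"scope": "pool", "elevationLevel": "admin"}
            },
            "waitForSuccess": True,
        },
    }


pool = windows_gpu_pool("gpu-win-pool-example", 1)
print(json.dumps(pool, indent=2))
```

With the installer downloaded, `zip_installer("GPUDriverSetup.exe", "GPUDriver.zip")` would produce the archive to upload in step 3. The printed JSON could then serve as a starting point for a pool created through the Batch APIs.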

Example: NVIDIA GPU drivers on a Linux NC VM pool

To run CUDA applications on a pool of Linux NC nodes, you need to install the necessary NVIDIA Tesla GPU drivers from the CUDA Toolkit. The following sample steps create and deploy a custom Ubuntu 16.04 LTS image with the GPU drivers:

  1. Deploy an Azure NC-series VM running Ubuntu 16.04 LTS. For example, create the VM in the South Central US region.

  2. Follow the steps to connect to the VM and install CUDA drivers manually.

  3. Follow the steps to create a Shared Image Gallery image for Batch.

  4. Create a Batch account in a region that supports NC VMs.

  5. Using the Batch APIs or Azure portal, create a pool using the custom image and with the desired number of nodes and scale. The following table shows sample pool settings for the image:

    | Setting | Value |
    |---|---|
    | Image Type | Custom Image |
    | Custom Image | Name of the image |
    | Node agent SKU | batch.node.ubuntu 16.04 |
    | Node size | NC6 Standard |
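
The sample settings for the custom image can be sketched the same way. This Python snippet is a minimal illustration under stated assumptions: it uses the property names of the Batch REST API pool body, assumes `Standard_NC6` as the API name for the "NC6 Standard" size, and uses a placeholder where the resource ID of your Shared Image Gallery image belongs. The pool ID and node count are hypothetical.

```python
import json

# Placeholder only: substitute the resource ID of your own image version.
IMAGE_ID = "<resource ID of your Shared Image Gallery image>"


def linux_custom_image_pool(pool_id: str, image_id: str, node_count: int) -> dict:
    """Pool definition for a custom Ubuntu 16.04 LTS image with GPU drivers."""
    return {
        "id": pool_id,                  # hypothetical pool name
        "vmSize": "Standard_NC6",       # assumed name for "NC6 Standard"
        "targetDedicatedNodes": node_count,
        "virtualMachineConfiguration": {
            # A custom image is referenced by resource ID rather than by
            # publisher, offer, and SKU.
            "imageReference": {"virtualMachineImageId": image_id},
            # Node agent SKU from the settings table.
            "nodeAgentSKUId": "batch.node.ubuntu 16.04",
        },
    }


pool = linux_custom_image_pool("gpu-linux-pool-example", IMAGE_ID, 1)
print(json.dumps(pool, indent=2))
```

Because the drivers are already baked into the image, this pool needs no start task or application package.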

Next steps

  • To run MPI jobs on an Azure Batch pool, see the Windows or Linux examples.

  • For examples of GPU workloads on Batch, see the Batch Shipyard recipes.