Deploy container instances that use GPU resources

2024-11-04

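To run certain compute-intensive workloads on Azure Container Instances, deploy your container groups with GPU resources. The container instances in the group can access one or more NVIDIA Tesla GPUs while running container workloads such as CUDA and deep learning applications.

This article shows how to add GPU resources when you deploy a container group by using a YAML file or a Resource Manager template. You can also specify GPU resources when you deploy a container instance using the Azure portal.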

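Important

The K80 and P100 GPU SKUs were scheduled for retirement on August 31, 2023, because the underlying VMs they use (NC series and NCv2 series) are being retired. Although V100 SKUs remain available, we recommend using Azure Kubernetes Service (AKS) instead. GPU resources are not fully supported and should not be used for production workloads. Use the following resource to migrate to AKS today: How to migrate to AKS.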

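Important

This feature is currently in preview, and some limitations apply. Previews are made available to you on the condition that you agree to the supplemental terms of use. Some aspects of this feature may change prior to general availability (GA).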

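Prerequisites

Note

Due to some current limitations, not all limit increase requests are guaranteed to be approved.

If you want to use this SKU for production container deployments, create an Azure support request to increase the limit.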

Preview limitations

In preview, the following limitations apply when using GPU resources in container groups.

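Region availability

Region	OS	Available GPU SKUs
China East 2, China North 3	Linux	V100

Support will be added for additional regions over time.

Supported OS types: Linux only

Additional limitations: GPU resources can't be used when deploying a container group into a virtual network.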

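About GPU resources

Count and SKU

To use GPUs in a container instance, specify a GPU resource with the following information:

Count - The number of GPUs: 1, 2, or 4.
SKU - The GPU SKU: V100. Each SKU maps to the NVIDIA Tesla GPU in one of the following Azure GPU-enabled VM families: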

SKU	VM family
V100	NCv3

Maximum resources per SKU

OS	GPU SKU	GPU count	Max CPU	Max memory (GB)	Storage (GB)
Linux	V100	1	6	112	50
Linux	V100	2	12	224	50
Linux	V100	4	24	448	50

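When deploying GPU resources, set CPU and memory resources appropriate for the workload, up to the maximum values shown in the preceding table. These values are currently larger than the CPU and memory resources available in container groups without GPU resources.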
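As an illustration of how the table can be applied, the following is a minimal Python sketch that checks a requested CPU and memory size against the V100 maximums before you deploy. The function name check_gpu_request and the V100_LIMITS dictionary are hypothetical helpers written for this example; they are not part of the Azure CLI or any Azure SDK, and the limits are copied by hand from the table above.

```python
# Per-group maximums for the V100 SKU, copied by hand from the table above.
# Keys are GPU counts; values are (max CPU cores, max memory in GB).
# This dictionary and the function below are illustrative, not an Azure API.
V100_LIMITS = {1: (6, 112), 2: (12, 224), 4: (24, 448)}


def check_gpu_request(gpu_count, cpu, memory_gb):
    """Return a list of problems with a V100 request; an empty list means it fits."""
    if gpu_count not in V100_LIMITS:
        return [f"GPU count must be 1, 2, or 4, got {gpu_count}"]
    max_cpu, max_memory_gb = V100_LIMITS[gpu_count]
    problems = []
    if cpu > max_cpu:
        problems.append(f"cpu {cpu} exceeds the maximum of {max_cpu}")
    if memory_gb > max_memory_gb:
        problems.append(
            f"memory {memory_gb} GB exceeds the maximum of {max_memory_gb} GB"
        )
    return problems


print(check_gpu_request(1, 1.0, 1.5))  # prints []
print(check_gpu_request(1, 8, 64))     # prints ['cpu 8 exceeds the maximum of 6']
```

A check like this only catches requests that exceed the documented maximums. It does not account for the subscription quota, which can be lower.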

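Important

Default subscription limits (quotas) for GPU resources differ by SKU. The default CPU limit for the V100 SKU is initially set to 0. To request an increase in an available region, submit an Azure support request.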

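Things to know

Deployment time - Creating a container group that contains GPU resources takes up to 8-10 minutes, because provisioning and configuring a GPU VM in Azure takes additional time.
Pricing - As with container groups without GPU resources, Azure bills for the resources consumed over the duration of a container group with GPU resources. The duration is calculated from the time the first container's image starts to be pulled until the container group terminates. It doesn't include the time to deploy the container group.

See pricing details.
CUDA drivers - Container instances with GPU resources are pre-provisioned with NVIDIA CUDA drivers and container runtimes, so you can use container images developed for CUDA workloads.

At this stage, CUDA 11 is supported. For example, you can use the following base images for your Dockerfile:
- nvidia/cuda:11.4.2-base-ubuntu20.04
- tensorflow/tensorflow:devel-gpu

Note

To improve reliability when using a public container image from Docker Hub, import and manage the image in a private Azure container registry, and update your Dockerfile to use your privately managed base image. Learn more about working with public images.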

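YAML example

One way to add GPU resources is to deploy a container group by using a YAML file. Copy the following YAML into a new file named gpu-deploy-aci.yaml, then save the file. This YAML creates a container group named gpucontainergroup specifying a container instance with a V100 GPU. The instance runs a sample CUDA vector addition application. The resource requests are sufficient to run the workload.

Note

The following example uses a public container image. To improve reliability, import and manage the image in a private Azure container registry, and update your YAML to use your privately managed base image. Learn more about working with public images.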

additional_properties: {}
apiVersion: '2021-09-01'
name: gpucontainergroup
properties:
  containers:
  - name: gpucontainer
    properties:
      image: k8s-gcrio.azureedge.net/cuda-vector-add:v0.1
      resources:
        requests:
          cpu: 1.0
          memoryInGB: 1.5
          gpu:
            count: 1
            sku: V100
  osType: Linux
  restartPolicy: OnFailure

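Deploy the container group with the az container create command, specifying the YAML file name for the --file parameter. You need to supply the name of a resource group and a location for the container group, such as chinaeast2, that supports GPU resources.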

az container create --resource-group myResourceGroup --file gpu-deploy-aci.yaml --location chinaeast2

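The deployment takes several minutes to complete. Then the container starts and runs a CUDA vector addition operation. Run the az container logs command to view the log output: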

az container logs --resource-group myResourceGroup --name gpucontainergroup --container-name gpucontainer

Output:

[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
Done

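Resource Manager template example

Another way to deploy a container group with GPU resources is by using a Resource Manager template. Start by creating a file named gpudeploy.json, then copy the following JSON into it. This example deploys a container instance with a V100 GPU that runs a TensorFlow training job against the MNIST dataset. The resource requests are sufficient to run the workload.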

{
    "$schema": "https://schema.management.azure.com/schemas/2015-01-01/deploymentTemplate.json#",
    "contentVersion": "1.0.0.0",
    "parameters": {
      "containerGroupName": {
        "type": "string",
        "defaultValue": "gpucontainergrouprm",
        "metadata": {
          "description": "Container Group name."
        }
      }
    },
    "variables": {
      "containername": "gpucontainer",
      "containerimage": "mcr.microsoft.com/azuredocs/samples-tf-mnist-demo:gpu"
    },
    "resources": [
      {
        "name": "[parameters('containerGroupName')]",
        "type": "Microsoft.ContainerInstance/containerGroups",
        "apiVersion": "2021-09-01",
        "location": "[resourceGroup().location]",
        "properties": {
          "containers": [
            {
              "name": "[variables('containername')]",
              "properties": {
                "image": "[variables('containerimage')]",
                "resources": {
                  "requests": {
                    "cpu": 4.0,
                    "memoryInGb": 12.0,
                    "gpu": {
                      "count": 1,
                      "sku": "V100"
                    }
                  }
                }
              }
            }
          ],
          "osType": "Linux",
          "restartPolicy": "OnFailure"
        }
      }
    ]
}
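If you need several variations of the container definition, for example with different GPU counts, one option is to generate the entry for the template's containers array from a script. The following Python sketch shows one way this might look. The gpu_container function is a hypothetical helper for illustration only; the key names it emits mirror the template above.

```python
import json


def gpu_container(name, image, cpu, memory_gb, gpu_count, sku="V100"):
    """Build one entry for the template's "containers" array (illustrative helper)."""
    return {
        "name": name,
        "properties": {
            "image": image,
            "resources": {
                "requests": {
                    "cpu": cpu,
                    # Key spelled as in the template above.
                    "memoryInGb": memory_gb,
                    "gpu": {"count": gpu_count, "sku": sku},
                }
            },
        },
    }


# Same values as the template in this article.
container = gpu_container(
    "gpucontainer",
    "mcr.microsoft.com/azuredocs/samples-tf-mnist-demo:gpu",
    cpu=4.0,
    memory_gb=12.0,
    gpu_count=1,
)
print(json.dumps(container, indent=2))
```

The printed JSON can be pasted into the containers array in place of the hand-written entry. Template expressions such as [variables('containername')] are not used here, so the values are written out literally.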

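Deploy the template with the az deployment group create command. You need to supply the name of a resource group that was created in a region, such as chinaeast2, that supports GPU resources.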

az deployment group create --resource-group myResourceGroup --template-file gpudeploy.json

The deployment takes several minutes to complete. Then the container starts and runs the TensorFlow job. Run the az container logs command to view the log output:

az container logs --resource-group myResourceGroup --name gpucontainergrouprm --container-name gpucontainer

Output:

2018-10-25 18:31:10.155010: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
2018-10-25 18:31:10.305937: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Found device 0 with properties:
name: Tesla V100 major: 3 minor: 7 memoryClockRate(GHz): 0.8235
pciBusID: ccb6:00:00.0
totalMemory: 11.92GiB freeMemory: 11.85GiB
2018-10-25 18:31:10.305981: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: Tesla V100, pci bus id: ccb6:00:00.0, compute capability: 3.7)
2018-10-25 18:31:14.941723: I tensorflow/stream_executor/dso_loader.cc:139] successfully opened CUDA library libcupti.so.8.0 locally
Successfully downloaded train-images-idx3-ubyte.gz 9912422 bytes.
Extracting /tmp/tensorflow/input_data/train-images-idx3-ubyte.gz
Successfully downloaded train-labels-idx1-ubyte.gz 28881 bytes.
Extracting /tmp/tensorflow/input_data/train-labels-idx1-ubyte.gz
Successfully downloaded t10k-images-idx3-ubyte.gz 1648877 bytes.
Extracting /tmp/tensorflow/input_data/t10k-images-idx3-ubyte.gz
Successfully downloaded t10k-labels-idx1-ubyte.gz 4542 bytes.
Extracting /tmp/tensorflow/input_data/t10k-labels-idx1-ubyte.gz
Accuracy at step 0: 0.097
Accuracy at step 10: 0.6993
Accuracy at step 20: 0.8208
Accuracy at step 30: 0.8594
...
Accuracy at step 990: 0.969
Adding run metadata for 999

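Clean up resources

Because using GPU resources may be expensive, make sure that your containers don't run unexpectedly for long periods. Monitor your containers in the Azure portal, or check the status of a container group with the az container show command. For example: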

az container show --resource-group myResourceGroup --name gpucontainergroup --output table
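If you prefer to check the state from a script, you can request JSON output (--output json) instead of a table and inspect it. The following Python sketch assumes the JSON contains name, provisioningState, and instanceView.state fields; treat those field names as assumptions and confirm them against the output of your CLI version. The summarize_container_group function and the abbreviated sample string are hypothetical and exist only to exercise the code.

```python
import json


def summarize_container_group(show_output):
    """Extract name and state fields from `az container show --output json` text.

    The field names read here are assumptions about the JSON shape.
    """
    group = json.loads(show_output)
    instance_view = group.get("instanceView") or {}
    return {
        "name": group.get("name"),
        "provisioningState": group.get("provisioningState"),
        "state": instance_view.get("state"),
    }


# Hypothetical, abbreviated sample used only to exercise the function;
# it is not captured output from a real deployment.
sample = (
    '{"name": "gpucontainergroup", "provisioningState": "Succeeded", '
    '"instanceView": {"state": "Running"}}'
)
summary = summarize_container_group(sample)
print(summary)
if summary["state"] == "Running":
    print("Container group is still running and using GPU resources.")
```

In practice, you would pass the text captured from the az container show command to the function instead of the sample string.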

When you're done working with the container instances you created, delete them with the following commands:

az container delete --resource-group myResourceGroup --name gpucontainergroup -y
az container delete --resource-group myResourceGroup --name gpucontainergrouprm -y

Next steps

Learn more about deploying a container group by using a YAML file or a Resource Manager template.
Learn more about GPU optimized VM sizes in Azure.
