Deploy container instances that use GPU resources

To run certain compute-intensive workloads on Azure Container Instances, deploy your container groups with GPU resources. The container instances in the group can access one or more NVIDIA Tesla GPUs while running container workloads such as CUDA and deep learning applications.

This article shows how to add GPU resources when you deploy a container group by using a YAML file or Resource Manager template. You can also specify GPU resources when you deploy a container instance using the Azure portal.

Important

This feature is currently in preview, and some limitations apply. Previews are made available to you on the condition that you agree to the supplemental terms of use. Some aspects of this feature may change prior to general availability (GA).

Preview limitations

In preview, the following limitations apply when using GPU resources in container groups.

Region availability

| Regions | OS | Available GPU SKUs |
|---------|----|--------------------|
| China East 2 | Linux | V100 |

Support will be added for additional regions over time.

Supported OS types: Linux only

Additional limitations: GPU resources can't be used when deploying a container group into a virtual network.

About GPU resources

Count and SKU

To use GPUs in a container instance, specify a GPU resource with the following information:

  • Count - The number of GPUs: 1, 2, or 4.

  • SKU - The GPU SKU: V100. Each SKU maps to the NVIDIA Tesla GPU in one of the following Azure GPU-enabled VM families:

    | SKU | VM family |
    |-----|-----------|
    | V100 | NCv3 |

Maximum resources per SKU

| OS | GPU SKU | GPU count | Max CPU | Max memory (GB) | Storage (GB) |
|----|---------|-----------|---------|-----------------|--------------|
| Linux | V100 | 1 | 6 | 112 | 50 |
| Linux | V100 | 2 | 12 | 224 | 50 |
| Linux | V100 | 4 | 24 | 448 | 50 |

When deploying GPU resources, set CPU and memory resources appropriate for the workload, up to the maximum values shown in the preceding table. These values are currently larger than the CPU and memory resources available in container groups without GPU resources.
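For illustration, a resource request at the 4-GPU maximum from the table above would look like this in the YAML resource-request format used later in this article (the values are taken from the table; adjust them for your workload):

```yaml
resources:
  requests:
    cpu: 24.0        # Max CPU for the 4-GPU V100 row
    memoryInGB: 448.0  # Max memory for the 4-GPU V100 row
    gpu:
      count: 4
      sku: V100
```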

Important

Default subscription limits (quotas) for GPU resources differ by SKU. The default CPU limits for the P100 and V100 SKUs are initially set to 0. To request an increase in an available region, please submit an Azure support request.

Things to know

  • Deployment time - Creation of a container group containing GPU resources takes up to 8-10 minutes. This is due to the additional time to provision and configure a GPU VM in Azure.

  • Pricing - Similar to container groups without GPU resources, Azure bills for resources consumed over the duration of a container group with GPU resources. The duration is calculated from the time to pull your first container's image until the container group terminates. It does not include the time to deploy the container group.

    See pricing details.

  • CUDA drivers - Container instances with GPU resources are pre-provisioned with NVIDIA CUDA drivers and container runtimes, so you can use container images developed for CUDA workloads.

    We support only CUDA 9.0 at this stage. For example, you can use the following base images for your Dockerfile:
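    As an illustration only (the specific image tag is an assumption, not taken from this article), a Dockerfile for a CUDA 9.0 workload might start from a publicly available CUDA 9.0 base image:

    ```dockerfile
    # Hypothetical sketch: the base image tag and application paths are assumptions.
    FROM nvidia/cuda:9.0-runtime-ubuntu16.04

    # Copy a CUDA application built against CUDA 9.0 into the image.
    COPY ./vectoradd /app/vectoradd

    CMD ["/app/vectoradd"]
    ```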

YAML example

One way to add GPU resources is to deploy a container group by using a YAML file. Copy the following YAML into a new file named gpu-deploy-aci.yaml, then save the file. This YAML creates a container group named gpucontainergroup specifying a container instance with a V100 GPU. The instance runs a sample CUDA vector addition application. The resource requests are sufficient to run the workload.

additional_properties: {}
apiVersion: '2019-12-01'
name: gpucontainergroup
properties:
  containers:
  - name: gpucontainer
    properties:
      image: k8s-gcrio.azureedge.net/cuda-vector-add:v0.1
      resources:
        requests:
          cpu: 1.0
          memoryInGB: 1.5
          gpu:
            count: 1
            sku: V100
  osType: Linux
  restartPolicy: OnFailure

Deploy the container group with the az container create command, specifying the YAML file name for the --file parameter. You need to supply the name of a resource group and a location for the container group, such as chinaeast2, that supports GPU resources.

az container create --resource-group myResourceGroup --file gpu-deploy-aci.yaml --location chinaeast2

The deployment takes several minutes to complete. Then, the container starts and runs a CUDA vector addition operation. Run the az container logs command to view the log output:

az container logs --resource-group myResourceGroup --name gpucontainergroup --container-name gpucontainer

Output:

[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
Done

Resource Manager template example

Another way to deploy a container group with GPU resources is by using a Resource Manager template. Start by creating a file named gpudeploy.json, then copy the following JSON into it. This example deploys a container instance with a V100 GPU that runs a TensorFlow training job against the MNIST dataset. The resource requests are sufficient to run the workload.

{
    "$schema": "https://schema.management.azure.com/schemas/2015-01-01/deploymentTemplate.json#",
    "contentVersion": "1.0.0.0",
    "parameters": {
      "containerGroupName": {
        "type": "string",
        "defaultValue": "gpucontainergrouprm",
        "metadata": {
          "description": "Container Group name."
        }
      }
    },
    "variables": {
      "containername": "gpucontainer",
      "containerimage": "mcr.microsoft.com/azuredocs/samples-tf-mnist-demo:gpu"
    },
    "resources": [
      {
        "name": "[parameters('containerGroupName')]",
        "type": "Microsoft.ContainerInstance/containerGroups",
        "apiVersion": "2019-12-01",
        "location": "[resourceGroup().location]",
        "properties": {
            "containers": [
            {
              "name": "[variables('containername')]",
              "properties": {
                "image": "[variables('containerimage')]",
                "resources": {
                  "requests": {
                    "cpu": 4.0,
                    "memoryInGb": 12.0,
                    "gpu": {
                      "count": 1,
                      "sku": "V100"
                    }
                  }
                }
              }
            }
          ],
        "osType": "Linux",
        "restartPolicy": "OnFailure"
        }
      }
    ]
}

Deploy the template with the az deployment group create command. You need to supply the name of a resource group that was created in a region, such as chinaeast2, that supports GPU resources.

az deployment group create --resource-group myResourceGroup --template-file gpudeploy.json

The deployment takes several minutes to complete. Then, the container starts and runs the TensorFlow job. Run the az container logs command to view the log output:

az container logs --resource-group myResourceGroup --name gpucontainergrouprm --container-name gpucontainer

Output:

2018-10-25 18:31:10.155010: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
2018-10-25 18:31:10.305937: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Found device 0 with properties:
name: Tesla K80 major: 3 minor: 7 memoryClockRate(GHz): 0.8235
pciBusID: ccb6:00:00.0
totalMemory: 11.92GiB freeMemory: 11.85GiB
2018-10-25 18:31:10.305981: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: Tesla K80, pci bus id: ccb6:00:00.0, compute capability: 3.7)
2018-10-25 18:31:14.941723: I tensorflow/stream_executor/dso_loader.cc:139] successfully opened CUDA library libcupti.so.8.0 locally
Successfully downloaded train-images-idx3-ubyte.gz 9912422 bytes.
Extracting /tmp/tensorflow/input_data/train-images-idx3-ubyte.gz
Successfully downloaded train-labels-idx1-ubyte.gz 28881 bytes.
Extracting /tmp/tensorflow/input_data/train-labels-idx1-ubyte.gz
Successfully downloaded t10k-images-idx3-ubyte.gz 1648877 bytes.
Extracting /tmp/tensorflow/input_data/t10k-images-idx3-ubyte.gz
Successfully downloaded t10k-labels-idx1-ubyte.gz 4542 bytes.
Extracting /tmp/tensorflow/input_data/t10k-labels-idx1-ubyte.gz
Accuracy at step 0: 0.097
Accuracy at step 10: 0.6993
Accuracy at step 20: 0.8208
Accuracy at step 30: 0.8594
...
Accuracy at step 990: 0.969
Adding run metadata for 999

Clean up resources

Because using GPU resources may be expensive, ensure that your containers don't run unexpectedly for long periods. Monitor your containers in the Azure portal, or check the status of a container group with the az container show command. For example:

az container show --resource-group myResourceGroup --name gpucontainergroup --output table

When you're done working with the container instances you created, delete them with the following commands:

az container delete --resource-group myResourceGroup --name gpucontainergroup -y
az container delete --resource-group myResourceGroup --name gpucontainergrouprm -y

Next steps