Use AMD GPUs for compute-intensive workloads on Azure Kubernetes Service (AKS)

AMD GPU Virtual Machine (VM) sizes on Azure can provide flexibility in performance and cost, offering high compute capacity while allowing you to choose the right configuration for your workload requirements. AKS supports AMD GPU-enabled Linux node pools to run compute-intensive Kubernetes workloads.

This article helps you provision nodes with schedulable AMD GPUs on new and existing AKS clusters.

Limitations

  • AKS currently supports the Standard_ND96isr_MI300X_v5 Azure VM size, powered by the AMD Instinct MI300X GPU.
  • Updating an existing node pool to add an AMD GPU VM size is not supported on AKS.
  • Updating a non-AMD GPU-enabled node pool with an AMD GPU VM size is not supported.
  • Azure Linux and Windows node pools aren't yet supported with AMD GPUs.

Before you begin

  • This article assumes you have an existing AKS cluster. If you don't have a cluster, create one using the Azure CLI, Azure PowerShell, or the Azure portal.
  • You need the Azure CLI version 2.72.2 or later installed to set the --gpu-driver field. Run az --version to find the version. If you need to install or upgrade, see [Install Azure CLI][install-azure-cli].
  • If you have the aks-preview Azure CLI extension installed, update it to version 18.0.0b2 or later.
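
For example, you can install or update the aks-preview extension using the az extension commands:

az extension add --name aks-preview
az extension update --name aks-preview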

Note

GPU-enabled VMs contain specialized hardware subject to higher pricing and region availability. For more information, see the [pricing][azure-pricing] tool and [region availability][azure-availability].
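
For example, you can check which regions offer this VM size, and whether your subscription has any restrictions on it, using the az vm list-skus command (eastus is just an example region):

az vm list-skus --location eastus --size Standard_ND96isr_MI300X_v5 --output table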

Get the credentials for your cluster

Get the credentials for your AKS cluster using the az aks get-credentials command. The following example command gets the credentials for the cluster myAKSCluster in the myResourceGroup resource group:

az aks get-credentials --resource-group myResourceGroup --name myAKSCluster
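
To confirm that the credentials work and kubectl is pointed at the cluster, list the nodes:

kubectl get nodes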

Options for using AMD GPUs

Using AMD GPUs involves the installation of various AMD GPU software components such as the AMD device plugin for Kubernetes, GPU drivers, and more.

Note

Currently, AKS doesn't manage or automate the installation of GPU drivers or the AMD GPU device plugin on AMD GPU-enabled node pools.

Register the AKSInfinibandSupport feature

  1. If your AMD GPU VM size is RDMA-enabled, indicated by the r in the size name (for example, Standard_ND96isr_MI300X_v5), you need to ensure that the machines in the node pool land on the same physical InfiniBand network. To achieve this, register the AKSInfinibandSupport feature flag using the az feature register command:

    az feature register --name AKSInfinibandSupport --namespace Microsoft.ContainerService
    
  2. Verify the registration status using the az feature show command. It takes a few minutes for the status to show Registered:

    az feature show \
    --namespace "Microsoft.ContainerService" \
    --name AKSInfinibandSupport
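
    # When the state shows "Registered", refresh the resource provider
    # registration to propagate the change (the standard Azure feature-flag flow):
    az provider register --namespace Microsoft.ContainerService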
    
  3. Create an AMD GPU-enabled node pool using the az aks nodepool add command, and skip the default driver installation by setting the --gpu-driver field to none (you can verify the result after these steps):

    az aks nodepool add \
        --resource-group myResourceGroup \
        --cluster-name myAKSCluster \
        --name gpunp \
        --node-count 1 \
        --node-vm-size Standard_ND96isr_MI300X_v5 \
        --gpu-driver none
    

    Note

    AKS currently enforces the use of the gpu-driver field to skip automatic driver installation at AMD GPU node pool creation time.
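
After the node pool is created, you can confirm that provisioning succeeded using the az aks nodepool show command:

az aks nodepool show \
    --resource-group myResourceGroup \
    --cluster-name myAKSCluster \
    --name gpunp \
    --query provisioningState \
    --output tsv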

Deploy the AMD GPU Operator on AKS

The AMD GPU Operator automates the management and deployment of all AMD software components needed to provision GPUs, including driver installation, the AMD device plugin for Kubernetes, the AMD container runtime, and more. Because the AMD GPU Operator handles these components, it's not necessary to separately install the AMD device plugin on your AKS cluster. This also means that automatic GPU driver installation must be skipped to use the AMD GPU Operator on AKS, which is why the node pool was created with --gpu-driver set to none.
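
For reference, installing the operator with Helm generally looks like the following sketch. The repository URL, chart name, and namespace are taken from AMD's GPU Operator documentation at the time of writing and may change, so verify them against the current AMD docs (which also list cert-manager as a prerequisite):

# Repository URL, chart name, and namespace per AMD's docs; verify before use.
helm repo add rocm https://rocm.github.io/gpu-operator
helm repo update

# Install the operator into its own namespace.
helm install amd-gpu-operator rocm/gpu-operator-charts \
    --namespace kube-amd-gpu \
    --create-namespace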

  1. Follow the AMD documentation to Install the GPU Operator.

  2. Check the status of the AMD GPUs in your node pool using the kubectl get nodes command:

    kubectl get nodes -o custom-columns=NAME:.metadata.name,GPUs:.status.capacity.'amd\.com/gpu'
    

    Your output should look similar to the following example output:

    NAME                 GPUs
    aks-gpunp-00000000   1
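
    You can also confirm that the operator's own components are running. This example assumes the default kube-amd-gpu namespace from AMD's installation instructions; adjust it if you installed the operator into a different namespace:

    kubectl get pods --namespace kube-amd-gpu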
    

Confirm that the AMD GPUs are schedulable

After creating your node pool, confirm that GPUs are schedulable in your AKS cluster.

  1. List the nodes in your cluster using the kubectl get nodes command.

    kubectl get nodes
    
  2. Confirm the GPUs are schedulable using the kubectl describe node command.

    kubectl describe node aks-gpunp-00000000
    

    Under the Capacity section, the GPU is listed as amd.com/gpu: 1. Your output should look similar to the following condensed example output:

    Name:               aks-gpunp-00000000
    Roles:              agent
    Labels:             accelerator=amd
    
    [...]
    
    Capacity:
    [...]
     amd.com/gpu:                 1
    [...]
    

Run an AMD GPU-enabled workload

To see the AMD GPU in action, you can schedule a GPU-enabled workload with the appropriate resource request. In this example, you run a TensorFlow job against the MNIST dataset.

  1. Create a file named samples-tf-mnist-demo.yaml and paste the following YAML manifest, which includes a resource limit of amd.com/gpu: 1:

    apiVersion: batch/v1
    kind: Job
    metadata:
      labels:
        app: samples-tf-mnist-demo
      name: samples-tf-mnist-demo
    spec:
      template:
        metadata:
          labels:
            app: samples-tf-mnist-demo
        spec:
          containers:
          - name: samples-tf-mnist-demo
            image: mcr.azk8s.cn/azuredocs/samples-tf-mnist-demo:gpu
            args: ["--max_steps", "500"]
            imagePullPolicy: IfNotPresent
            resources:
              limits:
                amd.com/gpu: 1
          restartPolicy: OnFailure
          tolerations:
          - key: "sku"
            operator: "Equal"
            value: "gpu"
            effect: "NoSchedule"
    
  2. Run the job using the kubectl apply command, which parses the manifest file and creates the defined Kubernetes objects.

    kubectl apply -f samples-tf-mnist-demo.yaml
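
    The output should show that the job was created:

    job.batch/samples-tf-mnist-demo created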
    

View the status of the GPU-enabled workload

  1. Monitor the progress of the job using the kubectl get jobs command with the --watch flag. It might take a few minutes to first pull the image and process the dataset.

    kubectl get jobs samples-tf-mnist-demo --watch
    

    When the COMPLETIONS column shows 1/1, the job has successfully finished, as shown in the following example output:

    NAME                    COMPLETIONS   DURATION   AGE
    samples-tf-mnist-demo   0/1           3m29s      3m29s
    samples-tf-mnist-demo   1/1           3m10s      3m36s
    
  2. Exit the kubectl --watch process with Ctrl-C.

  3. Get the name of the pod using the kubectl get pods command.

    kubectl get pods --selector app=samples-tf-mnist-demo
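
    The pod name includes a generated suffix. To view the job's training output without copying the exact name, you can pass the same label selector to the kubectl logs command (--tail raises the 10-line default that applies when a selector is used):

    kubectl logs --selector app=samples-tf-mnist-demo --tail=100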
    

Clean up resources

Remove the associated Kubernetes objects you created in this article using the kubectl delete job command.

kubectl delete jobs samples-tf-mnist-demo
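
If you no longer need the AMD GPU node pool, you can also delete it using the az aks nodepool delete command:

az aks nodepool delete \
    --resource-group myResourceGroup \
    --cluster-name myAKSCluster \
    --name gpunp \
    --no-wait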

Next steps