AMD GPU Virtual Machine (VM) sizes on Azure can provide flexibility in performance and cost, offering high compute capacity while allowing you to choose the right configuration for your workload requirements. AKS supports AMD GPU-enabled Linux node pools to run compute-intensive Kubernetes workloads.
This article helps you provision nodes with schedulable AMD GPUs on new and existing AKS clusters.
Limitations
- AKS currently supports the `Standard_ND96isr_MI300X_v5` Azure VM size, powered by the MI300 series AMD GPU.
- Updating an existing node pool to add an AMD GPU VM size isn't supported on AKS.
- Updating a non-AMD GPU-enabled node pool to an AMD GPU VM size isn't supported.
- Azure Linux, Windows, and Flatcar aren't supported with AMD GPUs.
Before you begin
- This article assumes you have an existing AKS cluster. If you don't have a cluster, create one using the Azure CLI, Azure PowerShell, or the Azure portal.
- You need Azure CLI version 2.72.2 or later installed to set the `--gpu-driver` field. Run `az --version` to find the version. If you need to install or upgrade, see Install Azure CLI.
- If you have the `aks-preview` Azure CLI extension installed, update it to version 18.0.0b2 or later.
Note
GPU-enabled VMs contain specialized hardware subject to higher pricing and region availability. For more information, see the pricing tool and region availability.
Get the credentials for your cluster
Get the credentials for your AKS cluster using the az aks get-credentials command. The following example command gets the credentials for the cluster myAKSCluster in the myResourceGroup resource group:
```azurecli
az aks get-credentials --resource-group myResourceGroup --name myAKSCluster
```
Options for using AMD GPUs
Using AMD GPUs involves the installation of various AMD GPU software components such as the AMD device plugin for Kubernetes, GPU drivers, and more.
Note
Currently, AKS doesn't manage or automate the installation of GPU drivers or the AMD GPU device plugin on AMD GPU-enabled node pools.
Register the AKSInfinibandSupport feature
If your AMD GPU VM size is RDMA-enabled, following the `r` naming convention (for example, `Standard_ND96isr_MI300X_v5`), you need to ensure that the machines in the node pool land on the same physical InfiniBand network. To achieve this, register the `AKSInfinibandSupport` feature flag using the `az feature register` command:

```azurecli
az feature register --name AKSInfinibandSupport --namespace Microsoft.ContainerService
```

Verify the registration status using the `az feature show` command:

```azurecli
az feature show \
    --namespace "Microsoft.ContainerService" \
    --name AKSInfinibandSupport
```

Create an AMD GPU-enabled node pool using the `az aks nodepool add` command and skip default driver installation by setting the API field `--gpu-driver` to the value `none`:

```azurecli
az aks nodepool add \
    --resource-group myResourceGroup \
    --cluster-name myAKSCluster \
    --name gpunp \
    --node-count 1 \
    --node-vm-size Standard_ND96isr_MI300X_v5 \
    --gpu-driver none
```

Note

AKS currently enforces the use of the `--gpu-driver` field to skip automatic driver installation at AMD GPU node pool creation time.
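The sample workload later in this article tolerates a `sku=gpu:NoSchedule` taint. If you want to reserve the expensive GPU nodes for GPU workloads only, you can apply that taint at node pool creation time with the standard `--node-taints` flag. A minimal sketch, reusing the resource names from this article:

```shell
# Optional: taint the GPU node pool at creation so only pods that
# tolerate sku=gpu:NoSchedule are scheduled onto the GPU nodes.
# The taint key/value matches the toleration in the sample job manifest.
az aks nodepool add \
    --resource-group myResourceGroup \
    --cluster-name myAKSCluster \
    --name gpunp \
    --node-count 1 \
    --node-vm-size Standard_ND96isr_MI300X_v5 \
    --gpu-driver none \
    --node-taints sku=gpu:NoSchedule
```

Without the taint, non-GPU pods can also be scheduled onto the GPU nodes; with it, only workloads that explicitly tolerate the taint land there.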
Deploy the AMD GPU Operator on AKS
The AMD GPU Operator automates the management and deployment of all AMD software components needed to provision GPUs, including driver installation, the AMD device plugin for Kubernetes, the AMD container runtime, and more. Since the AMD GPU Operator handles these components, it's not necessary to separately install the AMD device plugin on your AKS cluster. This also means that automatic GPU driver installation should be skipped in order to use the AMD GPU Operator on AKS.
Follow the AMD documentation to Install the GPU Operator.
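As a rough orientation, the installation is typically done with Helm. The following is a sketch only; the repository URL, chart name, namespace, and any prerequisites (such as cert-manager) are assumptions based on AMD's documentation at the time of writing, so verify them against the linked AMD docs before use:

```shell
# Sketch of a Helm-based AMD GPU Operator install (verify repo URL,
# chart name, and namespace against AMD's current documentation).
helm repo add rocm https://rocm.github.io/gpu-operator
helm repo update
helm install amd-gpu-operator rocm/gpu-operator-charts \
    --namespace kube-amd-gpu --create-namespace
```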
Check the status of the AMD GPUs in your node pool using the `kubectl get nodes` command:

```bash
kubectl get nodes -o custom-columns=NAME:.metadata.name,GPUs:.status.capacity.'amd\.com/gpu'
```

Your output should look similar to the following example output:

```output
NAME                 GPUs
aks-gpunp-00000000   1
```
Confirm that the AMD GPUs are schedulable
After creating your node pool, confirm that GPUs are schedulable in your AKS cluster.
List the nodes in your cluster using the `kubectl get nodes` command.

```bash
kubectl get nodes
```

Confirm the GPUs are schedulable using the `kubectl describe node` command.

```bash
kubectl describe node aks-gpunp-00000000
```

Under the Capacity section, the GPU should list as `amd.com/gpu: 1`. Your output should look similar to the following condensed example output:

```output
Name:               aks-gpunp-00000000
Roles:              agent
Labels:             accelerator=amd
[...]
Capacity:
[...]
 amd.com/gpu:                 1
[...]
```
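Capacity reports the total GPUs on the node, while Allocatable is what the scheduler can actually hand out to pods. If you want a quick per-node summary of allocatable GPUs across the cluster, a JSONPath query works (note the escaped dot in the `amd.com/gpu` resource name):

```shell
# Print each node's name and allocatable AMD GPU count.
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.allocatable.amd\.com/gpu}{"\n"}{end}'
```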
Run an AMD GPU-enabled workload
To see the AMD GPU in action, you can schedule a GPU-enabled workload with the appropriate resource request. In this example, we'll run a TensorFlow job against the MNIST dataset.
Create a file named `samples-tf-mnist-demo.yaml` and paste the following YAML manifest, which includes a resource limit of `amd.com/gpu: 1`:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  labels:
    app: samples-tf-mnist-demo
  name: samples-tf-mnist-demo
spec:
  template:
    metadata:
      labels:
        app: samples-tf-mnist-demo
    spec:
      containers:
      - name: samples-tf-mnist-demo
        image: mcr.azk8s.cn/azuredocs/samples-tf-mnist-demo:gpu
        args: ["--max_steps", "500"]
        imagePullPolicy: IfNotPresent
        resources:
          limits:
            amd.com/gpu: 1
      restartPolicy: OnFailure
      tolerations:
      - key: "sku"
        operator: "Equal"
        value: "gpu"
        effect: "NoSchedule"
```

Run the job using the `kubectl apply` command, which parses the manifest file and creates the defined Kubernetes objects.

```bash
kubectl apply -f samples-tf-mnist-demo.yaml
```
View the status of the GPU-enabled workload
Monitor the progress of the job using the `kubectl get jobs` command with the `--watch` flag. It might take a few minutes to first pull the image and process the dataset.

```bash
kubectl get jobs samples-tf-mnist-demo --watch
```

When the COMPLETIONS column shows 1/1, the job has successfully finished, as shown in the following example output:

```output
NAME                    COMPLETIONS   DURATION   AGE
samples-tf-mnist-demo   0/1           3m29s      3m29s
samples-tf-mnist-demo   1/1           3m10s      3m36s
```

Exit the `kubectl --watch` process with Ctrl-C.

Get the name of the pod using the `kubectl get pods` command.

```bash
kubectl get pods --selector app=samples-tf-mnist-demo
```
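To inspect the training output of the finished job, you can fetch the pod logs. Using the same label selector avoids having to copy the generated pod name:

```shell
# Print the logs of the sample job's pod, selected by its app label.
kubectl logs --selector app=samples-tf-mnist-demo
```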
Clean up resources
Remove the associated Kubernetes objects you created in this article using the kubectl delete job command.
```bash
kubectl delete jobs samples-tf-mnist-demo
```
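If you no longer need the GPU nodes themselves, you can also delete the node pool to stop incurring charges for the GPU VMs. A sketch, reusing the node pool name from this article:

```shell
# Delete the GPU node pool created earlier; --no-wait returns
# immediately without blocking on the long-running operation.
az aks nodepool delete \
    --resource-group myResourceGroup \
    --cluster-name myAKSCluster \
    --name gpunp \
    --no-wait
```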
Next steps
- Explore the different storage options for your GPU-based application on AKS.
- Learn more about Ray clusters on AKS.
- Use NVIDIA GPUs for your compute-intensive AKS workloads.