AMD GPU Virtual Machine (VM) sizes on Azure can provide flexibility in performance and cost, offering high compute capacity while allowing you to choose the right configuration for your workload requirements. AKS supports AMD GPU-enabled Linux node pools to run compute-intensive Kubernetes workloads.
This article helps you provision nodes with schedulable AMD GPUs on new and existing AKS clusters.
Limitations
- AKS currently supports the `Standard_ND96isr_MI300X_v5` Azure VM size, powered by the AMD MI300 series GPU.
- Updating an existing node pool to add an AMD GPU VM size isn't supported on AKS.
- Updating a non-AMD GPU-enabled node pool with an AMD GPU VM size isn't supported.
- Azure Linux and Windows aren't yet supported with AMD GPUs.
Before you begin
- This article assumes you have an existing AKS cluster. If you don't have a cluster, create one using the Azure CLI, Azure PowerShell, or the Azure portal.
- You need Azure CLI version 2.72.2 or later installed to set the `--gpu-driver` field. Run `az --version` to find the version. If you need to install or upgrade, see [Install Azure CLI][install-azure-cli].
- If you have the `aks-preview` Azure CLI extension installed, update it to version 18.0.0b2 or later.
Note
GPU-enabled VMs contain specialized hardware subject to higher pricing and region availability. For more information, see the [pricing][azure-pricing] tool and [region availability][azure-availability].
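Before creating a node pool, you can check whether the GPU VM size is available (and unrestricted) in your subscription and region by querying the SKU list with the Azure CLI. The region below is only an example; substitute your own:

```azurecli
# List availability and restrictions for the MI300X VM size in a given region
az vm list-skus \
    --location eastus \
    --size Standard_ND96isr_MI300X_v5 \
    --output table
```

If the size appears with a value in the Restrictions column, it isn't usable in that region or subscription.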
Get the credentials for your cluster
Get the credentials for your AKS cluster using the `az aks get-credentials` command. The following example command gets the credentials for the cluster `myAKSCluster` in the `myResourceGroup` resource group:

```azurecli
az aks get-credentials --resource-group myResourceGroup --name myAKSCluster
```
Options for using AMD GPUs
Using AMD GPUs involves the installation of various AMD GPU software components such as the AMD device plugin for Kubernetes, GPU drivers, and more.
Note
Currently, AKS doesn't manage or automate the installation of GPU drivers or the AMD GPU device plugin on AMD GPU-enabled node pools.
Register the AKSInfinibandSupport feature
If your AMD GPU VM size is RDMA-enabled, indicated by an `r` in the naming convention (for example, `Standard_ND96isr_MI300X_v5`), you need to ensure that the machines in the node pool land on the same physical InfiniBand network. To achieve this, register the `AKSInfinibandSupport` feature flag using the `az feature register` command:

```azurecli
az feature register --name AKSInfinibandSupport --namespace Microsoft.ContainerService
```

Verify the registration status using the `az feature show` command:

```azurecli
az feature show \
    --namespace "Microsoft.ContainerService" \
    --name AKSInfinibandSupport
```
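Once the feature reports a state of `Registered`, it's a standard follow-up step for AKS preview features to refresh the resource provider registration so the change propagates:

```azurecli
# Propagate the newly registered feature flag to the resource provider
az provider register --namespace Microsoft.ContainerService
```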
Create an AMD GPU-enabled node pool using the `az aks nodepool add` command, and skip default driver installation by setting the `--gpu-driver` API field to `none`:

```azurecli
az aks nodepool add \
    --resource-group myResourceGroup \
    --cluster-name myAKSCluster \
    --name gpunp \
    --node-count 1 \
    --node-vm-size Standard_ND96isr_MI300X_v5 \
    --gpu-driver none
```
Note
AKS currently enforces the use of the `--gpu-driver` field to skip automatic driver installation at AMD GPU node pool creation time.
Deploy the AMD GPU Operator on AKS
The AMD GPU Operator automates the management and deployment of all AMD software components needed to provision GPUs, including driver installation, the AMD device plugin for Kubernetes, the AMD container runtime, and more. Because the AMD GPU Operator handles these components, you don't need to separately install the AMD device plugin on your AKS cluster. This also means that automatic GPU driver installation must be skipped in order to use the AMD GPU Operator on AKS.
Follow the AMD documentation to Install the GPU Operator.
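As a rough sketch, installing the operator with Helm typically looks like the following. The repository URL, chart name, and namespace here are assumptions, not confirmed by this article; use the exact values from AMD's installation documentation:

```bash
# Add the AMD GPU Operator Helm repository (URL is an assumption; verify in AMD's docs)
helm repo add rocm https://rocm.github.io/gpu-operator
helm repo update

# Install the operator into a dedicated namespace (chart and namespace names are assumptions)
helm install amd-gpu-operator rocm/gpu-operator-charts \
    --namespace kube-amd-gpu \
    --create-namespace
```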
Check the status of the AMD GPUs in your node pool using the `kubectl get nodes` command:

```bash
kubectl get nodes -o custom-columns=NAME:.metadata.name,GPUs:.status.capacity.'amd\.com/gpu'
```

Your output should look similar to the following example output:

```output
NAME                 GPUs
aks-gpunp-00000000   1
```
Confirm that the AMD GPUs are schedulable
After creating your node pool, confirm that GPUs are schedulable in your AKS cluster.
List the nodes in your cluster using the `kubectl get nodes` command:

```bash
kubectl get nodes
```

Confirm the GPUs are schedulable using the `kubectl describe node` command:

```bash
kubectl describe node aks-gpunp-00000000
```

Under the Capacity section, the GPU should list as `amd.com/gpu: 1`. Your output should look similar to the following condensed example output:

```output
Name:               aks-gpunp-00000000
Roles:              agent
Labels:             accelerator=amd
[...]
Capacity:
[...]
  amd.com/gpu: 1
[...]
```
Run an AMD GPU-enabled workload
To see the AMD GPU in action, you can schedule a GPU-enabled workload with the appropriate resource request. In this example, we run a TensorFlow job against the MNIST dataset.
Create a file named `samples-tf-mnist-demo.yaml` and paste the following YAML manifest, which includes a resource limit of `amd.com/gpu: 1`:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  labels:
    app: samples-tf-mnist-demo
  name: samples-tf-mnist-demo
spec:
  template:
    metadata:
      labels:
        app: samples-tf-mnist-demo
    spec:
      containers:
      - name: samples-tf-mnist-demo
        image: mcr.azk8s.cn/azuredocs/samples-tf-mnist-demo:gpu
        args: ["--max_steps", "500"]
        imagePullPolicy: IfNotPresent
        resources:
          limits:
            amd.com/gpu: 1
      restartPolicy: OnFailure
      tolerations:
      - key: "sku"
        operator: "Equal"
        value: "gpu"
        effect: "NoSchedule"
```
Run the job using the `kubectl apply` command, which parses the manifest file and creates the defined Kubernetes objects:

```bash
kubectl apply -f samples-tf-mnist-demo.yaml
```
View the status of the GPU-enabled workload
Monitor the progress of the job using the `kubectl get jobs` command with the `--watch` flag. It might take a few minutes to first pull the image and process the dataset:

```bash
kubectl get jobs samples-tf-mnist-demo --watch
```

When the COMPLETIONS column shows 1/1, the job has successfully finished, as shown in the following example output:

```output
NAME                    COMPLETIONS   DURATION   AGE
samples-tf-mnist-demo   0/1           3m29s      3m29s
samples-tf-mnist-demo   1/1           3m10s      3m36s
```
Exit the `kubectl --watch` process with Ctrl-C.

Get the name of the pod using the `kubectl get pods` command:

```bash
kubectl get pods --selector app=samples-tf-mnist-demo
```
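To see the training output of the finished workload, you can view the pod's logs. A minimal sketch that captures the pod name from the selector above and prints its logs:

```bash
# Capture the name of the job's pod, then print its logs
POD_NAME=$(kubectl get pods --selector app=samples-tf-mnist-demo \
    --output jsonpath='{.items[0].metadata.name}')
kubectl logs "$POD_NAME"
```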
Clean up resources
Remove the associated Kubernetes objects you created in this article using the `kubectl delete job` command:

```bash
kubectl delete job samples-tf-mnist-demo
```
Next steps
- Explore the different storage options for your GPU-based application on AKS.
- Learn more about Ray clusters on AKS.
- Use NVIDIA GPUs for your compute-intensive AKS workloads.