The NVIDIA GPU Operator automates the management and deployment of all NVIDIA software components needed to provision GPUs, including driver installation, the NVIDIA device plugin for Kubernetes, the NVIDIA container runtime, and more. Because the NVIDIA GPU Operator handles these components, you don't need to install the NVIDIA device plugin separately on your AKS cluster. It also means you must skip the automatic GPU driver installation to use the NVIDIA GPU Operator on AKS.
Important
Open-source software is mentioned throughout AKS documentation and samples. Software that you deploy is excluded from AKS service-level agreements, limited warranty, and Azure support. As you use open-source technology alongside AKS, consult the support options available from the respective communities and project maintainers to develop a plan.
For example, the Ray GitHub repository describes several platforms that vary in response time, purpose, and support level.
Microsoft takes responsibility for building the open-source packages that we deploy on AKS. That responsibility includes having complete ownership of the build, scan, sign, validate, and hotfix process, along with control over the binaries in container images. For more information, see Vulnerability management for AKS and AKS support coverage.
- This article assumes you have an existing AKS cluster. If you don't have a cluster, create one using the Azure CLI, Azure PowerShell, or the Azure portal.
- You need the Azure CLI version 2.72.2 or later installed to set the --gpu-driver field. Run az --version to find the version. If you need to install or upgrade, see [Install Azure CLI][install-azure-cli].
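The version prerequisite above can be checked with a short script. This is a minimal sketch, not an official check: the helper name version_ge is hypothetical, and it assumes a sort implementation that supports -V (version ordering). It only queries the CLI when az is on the PATH.

```shell
#!/usr/bin/env bash
# Minimum Azure CLI version this article states for the --gpu-driver field.
MIN_AZ_VERSION="2.72.2"

# version_ge A B: succeeds when version A is greater than or equal to version B.
# Hypothetical helper; assumes `sort -V` is available.
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Only query the CLI if it is installed.
if command -v az >/dev/null 2>&1; then
  installed="$(az version --query '"azure-cli"' --output tsv)"
  if version_ge "$installed" "$MIN_AZ_VERSION"; then
    echo "Azure CLI $installed can set --gpu-driver"
  else
    echo "Azure CLI $installed is older than $MIN_AZ_VERSION; upgrade first"
  fi
fi
```

Comparing with version ordering instead of plain string comparison matters here, because a string comparison would wrongly rank 2.9.0 above 2.72.2.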
Note
GPU-enabled VMs contain specialized hardware subject to higher pricing and region availability. For more information, see the [pricing][azure-pricing] tool and [region availability][azure-availability].
Get the credentials for your AKS cluster using the az aks get-credentials command. The following example command gets the credentials for the cluster myAKSCluster in the myResourceGroup resource group:
az aks get-credentials --resource-group myResourceGroup --name myAKSCluster
Note
The NVIDIA GPU Operator is not compatible with multiple OS versions on the same AKS cluster.
Skip automatic GPU driver installation by creating an NVIDIA GPU-enabled node pool using the [az aks nodepool add][az-aks-nodepool-add] command and setting the API field --gpu-driver to the value none. Setting this field to none during node pool creation skips the default GPU driver installation (see this example). Any existing nodes aren't changed. You can scale the node pool to zero and then back up to make the change take effect.
Follow the NVIDIA documentation to Install the GPU Operator.
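The two steps above (creating a node pool without the default driver, then installing the GPU Operator) might look like the following sketch. The function names, the node pool name gpunp, the VM size Standard_NC6s_v3, and the node count are illustrative assumptions, not values from this article. The Helm commands reflect NVIDIA's published installation instructions at the time of writing, so check the NVIDIA documentation for the current form before running them.

```shell
#!/usr/bin/env bash
# Placeholder values; override them through the environment.
# The node pool name and VM size are assumptions for illustration only.
RESOURCE_GROUP="${RESOURCE_GROUP:-myResourceGroup}"
CLUSTER_NAME="${CLUSTER_NAME:-myAKSCluster}"
NODEPOOL_NAME="${NODEPOOL_NAME:-gpunp}"
GPU_VM_SIZE="${GPU_VM_SIZE:-Standard_NC6s_v3}"

# Create a GPU node pool and skip the default GPU driver installation,
# leaving driver management to the NVIDIA GPU Operator.
add_gpu_nodepool() {
  az aks nodepool add \
    --resource-group "$RESOURCE_GROUP" \
    --cluster-name "$CLUSTER_NAME" \
    --name "$NODEPOOL_NAME" \
    --node-count 1 \
    --node-vm-size "$GPU_VM_SIZE" \
    --gpu-driver none
}

# Install the GPU Operator from NVIDIA's Helm repository into its own namespace.
install_gpu_operator() {
  helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
  helm repo update
  helm install --wait --generate-name \
    --namespace gpu-operator --create-namespace \
    nvidia/gpu-operator
}
```

After signing in and fetching cluster credentials, run add_gpu_nodepool followed by install_gpu_operator. Wrapping each step in a function keeps the script safe to source without immediately changing the cluster.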
Now that you've successfully installed the GPU Operator, you can check that your GPUs are schedulable and run a GPU workload.
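One way to check whether GPUs are schedulable is to list each node's allocatable nvidia.com/gpu resource. The sketch below assumes kubectl is already configured for the cluster; the function name list_gpu_capacity is hypothetical.

```shell
#!/usr/bin/env bash
# Print each node with its allocatable GPU count.
# The dot in "nvidia.com" is escaped so kubectl treats it as part of the key.
list_gpu_capacity() {
  kubectl get nodes \
    -o custom-columns='NAME:.metadata.name,GPUS:.status.allocatable.nvidia\.com/gpu'
}
```

Running list_gpu_capacity should show a number in the GPUS column for GPU nodes once the operator's device plugin has registered them. Nodes without GPUs typically show `<none>`.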
Note
There might be additional considerations when using the NVIDIA GPU Operator on Spot instances. For details, see https://github.com/NVIDIA/gpu-operator/issues/577
- Learn more about Ray clusters on AKS.