Running GPU workloads on an AKS cluster requires proper setup and continuous validation to ensure that compute resources are accessible, secure, and optimally utilized. This article outlines best practices for managing GPU-enabled nodes, validating configurations, and reducing workload interruptions using vendor-specific diagnostic commands.
GPU workloads, like AI model training, real-time inference, simulations, and video processing, often depend on the following configurations:
- Correct GPU driver and runtime compatibility.
- Accurate scheduling of GPU resources.
- Access to GPU hardware devices inside containers.
Misconfigurations can lead to high costs, unexpected job failures, or GPU underutilization.
Enforce GPU workload placement
By default, the AKS scheduler places pods on any available node with sufficient CPU and memory. Without guidance, this can lead to two key issues:
- GPU workloads may be scheduled on nodes without GPUs and fail to start, or
- General-purpose workloads may occupy GPU nodes, wasting costly resources.
To enforce correct placement:
- Taint your GPU nodes using a key like [gpu-vendor].com/gpu with the NoSchedule effect (for example, nvidia.com/gpu=true:NoSchedule). This blocks non-GPU workloads from being scheduled there.
- Add a matching toleration in your GPU workload pod spec so it can be scheduled on the tainted GPU nodes.
- Define GPU resource requests and limits in your pod spec, to ensure the scheduler reserves GPU capacity, such as:

  resources:
    limits:
      [gpu-vendor].com/gpu: 1
- Use validation policies or admission controllers to enforce that GPU workloads include the required tolerations and resource limits.
This approach guarantees that only GPU-ready workloads land on GPU nodes and have access to the specialized compute resources they require.
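As a concrete reference, the following is a minimal sketch of a pod spec that combines a toleration for a tainted GPU node pool with a GPU resource limit. The pod name, container image, and taint key are illustrative assumptions (it assumes an NVIDIA node pool tainted with nvidia.com/gpu=true:NoSchedule); adjust them for your GPU vendor and environment.

```yaml
# Sketch only: a pod that tolerates the GPU node taint and requests one GPU.
# Names, image, and taint key are placeholders for illustration.
apiVersion: v1
kind: Pod
metadata:
  name: gpu-workload-example
spec:
  containers:
  - name: gpu-container
    image: my-gpu-image:latest   # placeholder image
    resources:
      limits:
        nvidia.com/gpu: 1        # reserves one GPU for this container
  tolerations:
  - key: "nvidia.com/gpu"        # matches the taint key applied to the GPU node pool
    operator: "Exists"
    effect: "NoSchedule"
```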
Before deploying production GPU workloads, always validate that your GPU node pools are:
- Equipped with compatible GPU drivers.
- Hosting a healthy Kubernetes Device Plugin DaemonSet.
- Exposing [gpu-vendor].com/gpu as a schedulable resource.
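One way to spot-check these conditions is with kubectl. The following is a quick sketch in which the namespace and node name are placeholders for your environment:

```bash
# Confirm the GPU device plugin DaemonSet is running and healthy (namespace is a placeholder).
kubectl get daemonset -n "${GPU_NAMESPACE}"

# Confirm the node advertises the GPU as an allocatable resource
# (look for an entry such as nvidia.com/gpu under Capacity and Allocatable).
kubectl describe node "${GPU_NODE_NAME}" | grep -i gpu
```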
You can confirm the current driver version running on your GPU node pools with the system management interface (SMI) associated with the GPU vendor.
The following command runs nvidia-smi from inside your GPU device plugin pod to verify driver installation and runtime readiness on an NVIDIA GPU-enabled node pool:

kubectl exec -it "${GPU_DEVICE_PLUGIN_POD}" -n "${GPU_NAMESPACE}" -- nvidia-smi
Your output should resemble the following example:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 570.xx.xx      Driver Version: 570.xx.xx      CUDA Version: 12.x  |
...
...
Repeat the step above for each GPU node pool to confirm the driver version installed on your nodes.
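If you need to populate the placeholders used above, a lookup similar to the following can help. This sketch assumes the upstream NVIDIA device plugin DaemonSet, which labels its pods with name=nvidia-device-plugin-ds; adjust the namespace and label selector to match how the plugin is deployed in your cluster.

```bash
# Assumed namespace and pod label; adjust for your device plugin deployment.
GPU_NAMESPACE="kube-system"
GPU_DEVICE_PLUGIN_POD=$(kubectl get pods -n "${GPU_NAMESPACE}" \
  -l name=nvidia-device-plugin-ds \
  -o jsonpath='{.items[0].metadata.name}')
kubectl exec -it "${GPU_DEVICE_PLUGIN_POD}" -n "${GPU_NAMESPACE}" -- nvidia-smi
```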
On AMD GPU-enabled node pools, deploy the AMD GPU components and run the amd-smi command in the ROCm device plugin pod to confirm the installed driver version.
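As a rough sketch for AMD node pools (the namespace, pod label, and the amd-smi version subcommand are assumptions; adjust them for your ROCm device plugin deployment):

```bash
# Assumed namespace and label for the ROCm/AMD GPU device plugin pod; adjust as needed.
AMD_DEVICE_PLUGIN_POD=$(kubectl get pods -n kube-system \
  -l name=amdgpu-device-plugin-ds \
  -o jsonpath='{.items[0].metadata.name}')
kubectl exec -it "${AMD_DEVICE_PLUGIN_POD}" -n kube-system -- amd-smi version
```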
Keep GPU-enabled nodes updated to the latest node OS image
To ensure the performance, security, and compatibility of your GPU workloads on AKS, it's essential to keep your GPU node pools up to date with the latest recommended node OS images. These updates are critical because they:
- Include the latest production-grade GPU drivers, replacing any deprecated or end-of-life (EOL) versions.
- Are fully tested for compatibility with your current Kubernetes version.
- Address known vulnerabilities identified by GPU vendors.
- Incorporate the latest OS and container runtime improvements for enhanced stability and efficiency.
Upgrade your GPU node pools to the latest recommended node OS image released by AKS, either by setting the node OS auto-upgrade channel or by performing a manual upgrade. You can monitor and track the latest node image releases using the AKS release tracker.
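For example, with the Azure CLI (the resource group, cluster, and node pool names below are placeholders), either path looks roughly like this:

```bash
# Option 1: let AKS roll out new node OS images automatically via the node OS auto-upgrade channel.
az aks update \
  --resource-group myResourceGroup \
  --name myAKSCluster \
  --node-os-upgrade-channel NodeImage

# Option 2: manually upgrade a GPU node pool to the latest node image only.
az aks nodepool upgrade \
  --resource-group myResourceGroup \
  --cluster-name myAKSCluster \
  --name gpunp \
  --node-image-only
```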
Separate GPU workloads when using shared clusters
If a single AKS cluster with GPU node pools is running multiple types of GPU workloads, such as model training, real-time inference, or batch processing, it's important to separate these workloads to:
- Avoid accidental interference or resource contention between different workload types.
- Improve security and maintain compliance boundaries.
- Simplify management and monitoring of GPU resource usage per workload category.
You can isolate GPU workloads within a single AKS cluster by using namespaces and network policies. This enables clearer governance through workload-specific quotas, limits, and logging configurations.
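For example, a per-namespace quota can cap the total number of GPUs a workload category can request. The following sketch assumes NVIDIA GPUs and the gpu-training namespace created in the example scenario that follows; the quota value is illustrative.

```yaml
# Illustrative quota: cap total GPU requests from pods in the gpu-training namespace.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: gpu-training-quota
  namespace: gpu-training
spec:
  hard:
    requests.nvidia.com/gpu: "4"   # placeholder limit; size it to your node pool capacity
```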
Example scenario
Consider an AKS cluster hosting two different GPU workload types that don’t need to communicate with each other:
- Training Workloads: Resource-intensive AI model training jobs.
- Inference Workloads: Latency-sensitive real-time inference services.
You can use the following steps to separate the two workloads:
1. Create dedicated namespaces per workload type using the kubectl create namespace command:

   kubectl create namespace gpu-training
   kubectl create namespace gpu-inference
2. Label GPU workload pods by type, as shown in the following example:

   metadata:
     namespace: gpu-training
     labels:
       workload: training
3. Apply network policies to isolate traffic between workload types. The following manifest blocks all ingress and egress for the gpu-training namespace (unless explicitly allowed):

   apiVersion: networking.k8s.io/v1
   kind: NetworkPolicy
   metadata:
     name: deny-cross-namespace
     namespace: gpu-training
   spec:
     podSelector: {}
     policyTypes:
     - Ingress
     - Egress
     ingress: []
     egress: []
This policy:
- Applies to all pods in the gpu-training namespace.
- Denies all incoming and outgoing traffic by default, supporting strong isolation.
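To then explicitly allow the traffic a workload actually needs, you can layer additive policies on top of the deny-all rule. The following sketch permits only pod-to-pod ingress within the gpu-training namespace; the policy name and selectors are illustrative.

```yaml
# Illustrative additive policy: allow ingress only from pods in the same namespace.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-intra-namespace
  namespace: gpu-training
spec:
  podSelector: {}
  policyTypes:
  - Ingress
  ingress:
  - from:
    - podSelector: {}
```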
This model enhances clarity, control, and safety in shared GPU environments, especially when workload types have different runtime profiles, risk levels, or operational requirements.
Next steps
To learn more about GPU workload deployment and management on AKS, see the following articles:
- Create a GPU-enabled node pool on your AKS cluster.