This article describes how Azure Kubernetes Service (AKS) uses Node Problem Detector (NPD) to monitor the health of GPU-enabled node pools. NPD is a Kubernetes component that detects and reports node-level issues, including hardware faults, driver errors, and connectivity problems that can affect the performance and availability of GPU workloads.
About GPU health monitoring in NPD
AKS supports GPU health monitoring through Node Problem Detector (NPD), enabling automatic detection and reporting of issues that affect GPU-enabled node pools on an AKS cluster. GPU health monitoring helps Kubernetes operators keep GPU nodes healthy and performant by surfacing hardware faults, communication failures, and system-level errors. NPD sets GPU-related node conditions, which enables platform engineering teams to take action before issues impact application performance or availability.
These health signals are vital for ensuring optimal performance and reliability across a range of GPU workloads, including:
- Machine learning (ML) training and inference.
- AI model development.
- High-performance computing (HPC).
- Graphics rendering and data-intensive simulations.
Limitations
AKS Node Problem Detector doesn't run GPU health checks on node pools created with --gpu-driver none, where a self-managed or custom GPU driver is installed on the nodes.
Supported GPU health checks
NPD regularly monitors GPU-enabled node pools and sets conditions when anomalies are detected. The following GPU health checks are currently supported:
GPUMissing: NPD verifies that the number of GPUs detected by the nvidia-smi utility matches the expected GPU count for the VM SKU assigned to the node.
- A mismatch might indicate a hardware fault, driver issue, or misconfiguration. Accurate GPU enumeration is critical for ensuring scheduling accuracy and workload availability on GPU nodes.
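The GPU count comparison can be illustrated with a minimal sketch. This is not NPD's actual implementation. The function names and the EXPECTED_GPUS table are hypothetical, and the per-SKU counts are an assumption that should be confirmed against the VM size documentation. The nvidia-smi options used are standard query flags that print one line per detected GPU.

```python
import subprocess

# Assumption: expected GPU count per VM size (hypothetical lookup table;
# confirm the values against the Azure VM size documentation).
EXPECTED_GPUS = {
    "Standard_ND96asr_v4": 8,
    "Standard_ND96isr_H100_v5": 8,
}


def count_gpus(nvidia_smi_output: str) -> int:
    """Count non-empty lines; each line is expected to hold one GPU UUID."""
    return sum(1 for line in nvidia_smi_output.splitlines() if line.strip())


def gpu_missing(vm_size: str, nvidia_smi_output: str) -> bool:
    """Return True when the detected count differs from the expected count."""
    return count_gpus(nvidia_smi_output) != EXPECTED_GPUS[vm_size]


def query_gpu_uuids() -> str:
    """Run nvidia-smi on the node and return one UUID per line."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=uuid", "--format=csv,noheader"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

On a GPU node, gpu_missing("Standard_ND96asr_v4", query_gpu_uuids()) would report a mismatch under these assumptions.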
GPUXIDErrors: Checks for XID (eXecution ID) errors emitted by the GPU driver in the kernel logs. XID errors are low-level GPU faults that typically occur when:
- The driver misprograms the GPU.
- There's a corruption in the command stream sent to the GPU.
- A hardware failure or instability affects GPU operation.
For more information, see XID errors on NVIDIA GPUs.
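As a rough illustration of scanning kernel logs for XID errors, the following sketch extracts the device and error code from log lines. It assumes the commonly seen "NVRM: Xid (PCI:...): <code>, ..." line shape, which can vary by driver version. The pattern and function name are hypothetical and don't reflect how NPD itself parses logs.

```python
import re

# Assumption: XID lines look like
#   "NVRM: Xid (PCI:0000:00:1e.0): 79, pid=1234, GPU has fallen off the bus."
XID_PATTERN = re.compile(
    r"NVRM: Xid \((?P<device>[^)]+)\): (?P<code>\d+)(?:, (?P<detail>.*))?"
)


def find_xid_errors(log_lines):
    """Return a list of (device, xid_code, detail) tuples found in the lines."""
    errors = []
    for line in log_lines:
        match = XID_PATTERN.search(line)
        if match:
            errors.append(
                (
                    match.group("device"),
                    int(match.group("code")),
                    match.group("detail") or "",
                )
            )
    return errors
```

The log lines could come from dmesg or journalctl -k output split into lines. Which XID codes warrant action is a separate decision that this sketch doesn't make.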
NVLink Status: For NVIDIA VM SKUs that support NVLink, this condition confirms that NVLink is active and functioning.
- NVLink is a high-speed interconnect used to facilitate data transfer between multiple GPUs.
- If NVLink is inactive or degraded, multi-GPU workloads might experience reduced performance or communication bottlenecks.
For more information, see NVIDIA NVLink.
InfiniBand Link Flapping: NPD monitors for InfiniBand (IB) link flapping, or intermittent connectivity of the IB network device.
- Link flapping shouldn't occur under normal operating conditions and might result in degraded inter-node communication for distributed workloads.
- It can also signal physical layer issues, misconfigured firmware, or driver instability.
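Link flapping can be thought of as repeated up-to-down transitions within a short period. The sketch below counts such transitions over a sliding window. The sample format, window length, and threshold are all hypothetical choices made for illustration, not the values or the algorithm that NPD uses.

```python
def count_link_down_transitions(samples, window_seconds, now):
    """Count up-to-down transitions among samples inside the time window.

    samples: list of (timestamp_seconds, is_up) tuples sorted by timestamp.
    """
    recent = [s for s in samples if now - s[0] <= window_seconds]
    return sum(
        1 for prev, cur in zip(recent, recent[1:]) if prev[1] and not cur[1]
    )


def is_flapping(samples, window_seconds=600, min_down_transitions=2, now=None):
    """Return True when the link dropped at least min_down_transitions times.

    Assumption: both default values are illustrative only.
    """
    if now is None:
        # Default to the timestamp of the latest sample.
        now = samples[-1][0] if samples else 0
    drops = count_link_down_transitions(samples, window_seconds, now)
    return drops >= min_down_transitions
```

A single drop might be a planned event such as a driver reload, which is why this sketch only flags repeated drops within the window.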
Frequently asked questions
Does Node Problem Detector (NPD) automatically remediate GPU node issues?
NPD doesn't take direct action to remediate GPU-enabled node issues. NPD detects and reports problems by publishing Kubernetes node conditions and events. Any remediation (for example, draining a node, restarting workloads, or replacing faulty hardware) must be handled manually or through external automation or alerting systems configured by the Kubernetes operator.
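Because remediation is left to the operator, external automation typically starts by reading node conditions. The following sketch filters the conditions of a Node object, such as the parsed output of kubectl get node <node-name> -o json. The function name is hypothetical, and the condition type strings passed in are assumptions that should be checked against what NPD actually publishes on your nodes. It also assumes the usual NPD convention that a status of "True" means the problem is present.

```python
import json


def active_conditions(node: dict, watched_types: set) -> list:
    """Return watched conditions whose status is "True" on the given node."""
    conditions = node.get("status", {}).get("conditions", [])
    return [
        c
        for c in conditions
        if c.get("type") in watched_types and c.get("status") == "True"
    ]


# Example input shaped like a Kubernetes Node object (values are made up).
SAMPLE_NODE = json.loads("""
{
  "status": {
    "conditions": [
      {"type": "Ready", "status": "True", "reason": "KubeletReady"},
      {"type": "GPUMissing", "status": "True", "reason": "GPUMissing"},
      {"type": "GPUXIDErrors", "status": "False", "reason": "NoXIDErrors"}
    ]
  }
}
""")

if __name__ == "__main__":
    for cond in active_conditions(SAMPLE_NODE, {"GPUMissing", "GPUXIDErrors"}):
        print(cond["type"], cond.get("reason", ""))
```

An automation loop could cordon or drain a node when this list is non-empty, but that policy is the operator's choice and isn't something NPD does on its own.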
On which Azure VM sizes does AKS conduct GPU health monitoring through NPD?
Currently, NPD conducts health checks on GPU nodes provisioned with the Standard_ND96asr_v4 or Standard_ND96isr_H100_v5 VM size on AKS.
Next steps
- Learn more about the types of node conditions and events set by NPD on AKS.