Install NVIDIA GPU drivers on N-series VMs running Linux

Caution

This article references CentOS, a Linux distribution that is nearing End Of Life (EOL) status. Please consider your use and planning accordingly. For more information, see the CentOS End Of Life guidance.

Applies to: ✔️ Linux VMs

To take advantage of the GPU capabilities of Azure N-series VMs backed by NVIDIA GPUs, you must install NVIDIA GPU drivers.

If you choose to install NVIDIA GPU drivers manually, this article provides supported distributions, drivers, and installation and verification steps. Manual driver setup information is also available for Windows VMs.

For N-series VM specs, storage capacities, and disk details, see GPU Linux VM sizes.

Supported distributions and drivers

Caution

This article references CentOS, a Linux distribution that is nearing End Of Life (EOL) status. Please consider your use and planning accordingly.

NVIDIA CUDA drivers

For the latest CUDA drivers and supported operating systems, visit the NVIDIA website. Ensure that you install or upgrade to the latest supported CUDA drivers for your distribution.

Tip

As an alternative to manual CUDA driver installation on a Linux VM, you can deploy an Azure Data Science Virtual Machine image. The DSVM editions for Ubuntu 16.04 LTS or CentOS 7.4 pre-install NVIDIA CUDA drivers, the CUDA Deep Neural Network Library, and other tools.

Install CUDA drivers on N-series VMs

Here are steps to install CUDA drivers from the NVIDIA CUDA Toolkit on N-series VMs.

C and C++ developers can optionally install the full Toolkit to build GPU-accelerated applications. For more information, see the CUDA Installation Guide.

To install CUDA drivers, make an SSH connection to each VM. To verify that the system has a CUDA-capable GPU, run the following command:

lspci | grep -i NVIDIA

Output is similar to the following example (showing an NVIDIA Tesla K80 card):

lspci command output

lspci lists the PCIe devices on the VM, including the InfiniBand NIC and GPUs, if any. If lspci doesn't return successfully, you may need to install LIS on CentOS/RHEL.

Then run installation commands specific for your distribution.

Ubuntu

Ubuntu packages NVIDIA proprietary drivers. Those drivers come directly from NVIDIA and are simply packaged by Ubuntu so that they can be automatically managed by the system. Downloading and installing drivers from another source can lead to a broken system. Moreover, installing third-party drivers requires extra-steps on VMs with TrustedLaunch and Secure Boot enabled. They require the user to add a new Machine Owner Key for the system to boot. Drivers from Ubuntu are signed by Canonical and will work with Secure Boot.

  1. Install ubuntu-drivers utility:

    sudo apt update && sudo apt install -y ubuntu-drivers-common
    
  2. Install the latest NVIDIA drivers:

    sudo ubuntu-drivers install
    
  3. Download and install the CUDA toolkit from NVIDIA:

    Note

    The example shows the CUDA package path for Ubuntu 22.04 LTS. Replace the path specific to the version you plan to use.

    Visit the NVIDIA Download Center or the NVIDIA CUDA Resources page for the full path specific to each version.

    wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb
    sudo apt install -y ./cuda-keyring_1.1-1_all.deb
    sudo apt update
    sudo apt -y install cuda-toolkit-12-3
    

    The installation can take several minutes.

  4. Verify that the GPU is correctly recognized:

    nvidia-smi
    

NVIDIA driver updates

We recommend that you periodically update NVIDIA drivers after deployment.

sudo apt update
sudo apt full-upgrade

CentOS

  1. Update the kernel (recommended). If you choose not to update the kernel, ensure that the versions of kernel-devel, and dkms are appropriate for your kernel.

    sudo yum install kernel kernel-tools kernel-headers kernel-devel
    sudo reboot
    
  2. Install the latest Linux Integration Services for Hyper-V and Azure. Check if LIS is required by verifying the results of lspci. If all GPU devices are listed as expected, installing LIS isn't required.

    LIS is applicable to CentOS, and the Oracle Linux Red Hat Compatible Kernel 5.2-5.11, 6.0-6.10, and 7.0-7.7. Refer to the Linux Integration Services documentation for more details. Skip this step if you plan to use CentOS/RHEL 7.8 (or higher versions) as LIS is no longer required for these versions.

    wget https://aka.ms/lis
    tar xvzf lis
    cd LISISO
    
    sudo ./install.sh
    sudo reboot
    
  3. Reconnect to the VM and continue installation with the following commands:

    sudo rpm -Uvh https://dl.fedoraproject.org/pub/epel/epel-release-latest-7.noarch.rpm
    sudo yum-config-manager --add-repo https://developer.download.nvidia.com/compute/cuda/repos/rhel7/x86_64/cuda-rhel7.repo
    sudo yum clean all
    sudo yum -y install nvidia-driver-latest-dkms cuda-drivers
    

    The installation can take several minutes.

    Note

    Visit Fedora and Nvidia CUDA repo to pick the correct package for the CentOS or RHEL version you want to use.

For example, CentOS 8 and RHEL 8 need the following steps.

sudo rpm -Uvh https://dl.fedoraproject.org/pub/epel/epel-release-latest-8.noarch.rpm
sudo yum install dkms

sudo wget https://developer.download.nvidia.com/compute/cuda/repos/rhel8/x86_64/cuda-rhel8.repo -O /etc/yum.repos.d/cuda-rhel8.repo

sudo yum install cuda-drivers
  1. To optionally install the complete CUDA toolkit, type:

    sudo yum install cuda
    

    Note

    If you see an error message related to missing packages like vulkan-filesystem then you may need to edit /etc/yum.repos.d/rh-cloud , look for optional-rpms and set enabled to 1

  2. Reboot the VM and proceed to verify the installation.

Verify driver installation

To query the GPU device state, SSH to the VM and run the nvidia-smi command-line utility installed with the driver.

If the driver is installed, Nvidia SMI lists the GPU-Util as 0% until you run a GPU workload on the VM. Your driver version and GPU details may be different from the ones shown.

NVIDIA device status

RDMA network connectivity

RDMA network connectivity can be enabled on RDMA-capable N-series VMs such as NC24r deployed in the same availability set or in a single placement group in a virtual machine (VM) scale set. The RDMA network supports Message Passing Interface (MPI) traffic for applications running with Intel MPI 5.x or a later version:

Distributions

Deploy RDMA-capable N-series VMs from one of the images in the Azure Marketplace that supports RDMA connectivity on N-series VMs:

  • Ubuntu 16.04 LTS - Configure RDMA drivers on the VM and register with Intel to download Intel MPI:

    1. Install dapl, rdmacm, ibverbs, and mlx4

      sudo apt-get update
      
      sudo apt-get install libdapl2 libmlx4-1
      
      
    2. In /etc/waagent.conf, enable RDMA by uncommenting the following configuration lines. You need root access to edit this file.

      OS.EnableRDMA=y
      
      OS.UpdateRdmaDriver=y
      
    3. Add or change the following memory settings in KB in the /etc/security/limits.conf file. You need root access to edit this file. For testing purposes you can set memlock to unlimited. For example: <User or group name> hard memlock unlimited.

      <User or group name> hard    memlock <memory required for your application in KB>
      
      <User or group name> soft    memlock <memory required for your application in KB>
      
    4. Install Intel MPI Library. Either purchase and download the library from Intel or download the free evaluation version.

      wget http://registrationcenter-download.intel.com/akdlm/irc_nas/tec/9278/l_mpi_p_5.1.3.223.tgz
      

      Only Intel MPI 5.x runtimes are supported.

      For installation steps, see the Intel MPI Library Installation Guide.

    5. Enable ptrace for non-root non-debugger processes (needed for the most recent versions of Intel MPI).

      echo 0 | sudo tee /proc/sys/kernel/yama/ptrace_scope
      
  • CentOS-based 7.4 HPC - RDMA drivers and Intel MPI 5.1 are installed on the VM.

  • CentOS-based HPC - CentOS-HPC 7.6 and later (for SKUs where InfiniBand is supported over SR-IOV). These images have Mellanox OFED and MPI libraries pre-installed.

Note

CX3-Pro cards are supported only through LTS versions of Mellanox OFED. Use LTS Mellanox OFED version (4.9-0.1.7.0) on the N-series VMs with ConnectX3-Pro cards. For more information, see Linux Drivers.

Also, some of the latest Azure Marketplace HPC images have Mellanox OFED 5.1 and later, which don't support ConnectX3-Pro cards. Check the Mellanox OFED version in the HPC image before using it on VMs with ConnectX3-Pro cards.

The following images are the latest CentOS-HPC images that support ConnectX3-Pro cards:

  • OpenLogic:CentOS-HPC:7.6:7.6.2020062900
  • OpenLogic:CentOS-HPC:7_6gen2:7.6.2020062901
  • OpenLogic:CentOS-HPC:7.7:7.7.2020062600
  • OpenLogic:CentOS-HPC:7_7-gen2:7.7.2020062601
  • OpenLogic:CentOS-HPC:8_1:8.1.2020062400
  • OpenLogic:CentOS-HPC:8_1-gen2:8.1.2020062401

Troubleshooting

  • You can set persistence mode using nvidia-smi so the output of the command is faster when you need to query cards. To set persistence mode, execute nvidia-smi -pm 1. Note that if the VM is restarted, the mode setting goes away. You can always script the mode setting to execute upon startup.
  • If you updated the NVIDIA CUDA drivers to the latest version and find RDMA connectivity is no longer working, reinstall the RDMA drivers to reestablish that connectivity.
  • During installation of LIS, if a certain CentOS/RHEL OS version (or kernel) is not supported for LIS, an error "Unsupported kernel version" is thrown. Please report this error along with the OS and kernel versions.
  • If jobs are interrupted by ECC errors on the GPU (either correctable or uncorrectable), first check to see if the GPU meets any of Nvidia's RMA criteria for ECC errors. If the GPU is eligible for RMA, please contact support about getting it serviced; otherwise, reboot your VM to reattach the GPU as described here. Less invasive methods such as nvidia-smi -r don't work with the virtualization solution deployed in Azure.