Install NVIDIA GPU drivers on N-series VMs running Linux

To take advantage of the GPU capabilities of Azure N-series VMs backed by NVIDIA GPUs, you must install NVIDIA GPU drivers.

If you choose to install NVIDIA GPU drivers manually, this article provides supported distributions, drivers, and installation and verification steps. Manual driver setup information is also available for Windows VMs.

For N-series VM specs, storage capacities, and disk details, see GPU Linux VM sizes.

Supported distributions and drivers

NVIDIA CUDA drivers

NVIDIA CUDA drivers for NCv3-series VMs are supported only on the Linux distributions listed in the following table. CUDA driver information is current at time of publication. For the latest CUDA drivers and supported operating systems, visit the NVIDIA website. Ensure that you install or upgrade to the latest CUDA drivers for your distribution.

Tip

As an alternative to manual CUDA driver installation on a Linux VM, you can deploy an Azure Data Science Virtual Machine (DSVM) image. The DSVM editions for Ubuntu 16.04 LTS and CentOS 7.4 come with NVIDIA CUDA drivers, the CUDA Deep Neural Network Library, and other tools preinstalled.

Install CUDA drivers on N-series VMs

The following steps install CUDA drivers from the NVIDIA CUDA Toolkit on N-series VMs.

C and C++ developers can optionally install the full Toolkit to build GPU-accelerated applications. For more information, see the CUDA Installation Guide.

To install CUDA drivers, make an SSH connection to each VM. To verify that the system has a CUDA-capable GPU, run the following command:

lspci | grep -i NVIDIA

You will see output similar to the following example (showing an NVIDIA Tesla K80 card):

[Figure: lspci command output]
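
The check above can be wrapped in a small guard so that an install script stops early on a VM without an NVIDIA device. This is a minimal sketch; the function name `check_gpu` is hypothetical and not part of any NVIDIA or Azure tooling.

```shell
# check_gpu (hypothetical helper): read lspci-style output on stdin and
# report whether an NVIDIA device is listed. Returns 0 if found, 1 otherwise.
check_gpu() {
    if grep -i 'nvidia' >/dev/null; then
        echo "NVIDIA GPU detected"
    else
        echo "No NVIDIA GPU detected" >&2
        return 1
    fi
}
```

A script could call it as `lspci | check_gpu || exit 1` before running the distribution-specific commands.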

Then run the installation commands specific to your distribution.

Ubuntu

  1. Download and install the CUDA drivers from the NVIDIA website. For example, for Ubuntu 16.04 LTS:

    CUDA_REPO_PKG=cuda-repo-ubuntu1604_10.0.130-1_amd64.deb
    
    wget -O /tmp/${CUDA_REPO_PKG} http://developer.download.nvidia.com/compute/cuda/repos/ubuntu1604/x86_64/${CUDA_REPO_PKG} 
    
    sudo dpkg -i /tmp/${CUDA_REPO_PKG}
    
    sudo apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1604/x86_64/7fa2af80.pub 
    
    rm -f /tmp/${CUDA_REPO_PKG}
    
    sudo apt-get update
    
    sudo apt-get install cuda-drivers
    
    

    The installation can take several minutes.

  2. To optionally install the complete CUDA toolkit, type:

    sudo apt-get install cuda
    
  3. Reboot the VM and proceed to verify the installation.
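
The download commands in step 1 hard-code the Ubuntu release in both the package name and the repository path. The sketch below derives both from one argument. The helper names are hypothetical, and the sketch assumes NVIDIA uses the same path layout for other releases and package versions, which you should confirm on the NVIDIA website before relying on it.

```shell
# Hypothetical helpers: build the CUDA repo package name and download URL.
# $1: Ubuntu release without the dot (for example 1604)
# $2: CUDA repo package version (for example 10.0.130-1)
cuda_repo_pkg() {
    printf 'cuda-repo-ubuntu%s_%s_amd64.deb\n' "$1" "$2"
}

cuda_repo_url() {
    # Assumes the same repository layout as the Ubuntu 16.04 example.
    printf 'http://developer.download.nvidia.com/compute/cuda/repos/ubuntu%s/x86_64/%s\n' \
        "$1" "$(cuda_repo_pkg "$1" "$2")"
}
```

For the values used in step 1, `cuda_repo_url 1604 10.0.130-1` prints the same URL that the `wget` command downloads.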

CUDA driver updates

We recommend that you periodically update CUDA drivers after deployment.

sudo apt-get update

sudo apt-get upgrade -y

sudo apt-get dist-upgrade -y

sudo apt-get install cuda-drivers

sudo reboot

CentOS

  1. Update the kernel (recommended). If you choose not to update the kernel, ensure that the versions of kernel-devel and dkms are appropriate for your kernel.

    sudo yum install kernel kernel-tools kernel-headers kernel-devel
    
    sudo reboot
    
  2. Install the latest Linux Integration Services for Hyper-V and Azure.

    wget https://aka.ms/lis
    
    tar xvzf lis
    
    cd LISISO
    
    sudo ./install.sh
    
    sudo reboot
    
  3. Reconnect to the VM and continue installation with the following commands:

    sudo rpm -Uvh https://dl.fedoraproject.org/pub/epel/epel-release-latest-7.noarch.rpm
    
    sudo yum install dkms
    
    CUDA_REPO_PKG=cuda-repo-rhel7-10.0.130-1.x86_64.rpm
    
    wget http://developer.download.nvidia.com/compute/cuda/repos/rhel7/x86_64/${CUDA_REPO_PKG} -O /tmp/${CUDA_REPO_PKG}
    
    sudo rpm -ivh /tmp/${CUDA_REPO_PKG}
    
    rm -f /tmp/${CUDA_REPO_PKG}
    
    sudo yum install cuda-drivers
    

    The installation can take several minutes.

  4. To optionally install the complete CUDA toolkit, type:

    sudo yum install cuda
    
  5. Reboot the VM and proceed to verify the installation.
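
Step 1 asks you to make sure kernel-devel matches your kernel if you skip the kernel update. One way to check this, sketched here with a hypothetical helper, is to compare the running kernel release against the installed kernel-devel versions.

```shell
# kernel_devel_matches (hypothetical helper)
# $1: running kernel release, as printed by `uname -r`
# remaining arguments: installed kernel-devel versions
# Returns 0 if any installed version equals the running kernel, 1 otherwise.
kernel_devel_matches() {
    running=$1
    shift
    for installed in "$@"; do
        if [ "$installed" = "$running" ]; then
            return 0
        fi
    done
    return 1
}
```

On the VM, the arguments can come from `uname -r` and `rpm -q --qf '%{VERSION}-%{RELEASE}.%{ARCH}\n' kernel-devel`, for example `kernel_devel_matches "$(uname -r)" $(rpm -q --qf '%{VERSION}-%{RELEASE}.%{ARCH}\n' kernel-devel)`. A nonzero exit status suggests installing the matching kernel-devel package or updating the kernel as recommended.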

Verify driver installation

To query the GPU device state, SSH to the VM and run the nvidia-smi command-line utility installed with the driver.

If the driver is installed, you will see output similar to the following. Note that GPU-Util shows 0% unless you are currently running a GPU workload on the VM. Your driver version and GPU details may be different from the ones shown.

[Figure: NVIDIA device status]
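
For scripted checks, nvidia-smi can also print selected fields as CSV with `--query-gpu` and `--format=csv,noheader`. The following sketch formats that output one GPU per line; the function name `summarize_gpus` is hypothetical, and the field list is just one reasonable choice.

```shell
# summarize_gpus (hypothetical helper): read "name, driver_version, utilization"
# CSV lines on stdin and print one summary line per GPU.
summarize_gpus() {
    while IFS=, read -r name driver util; do
        # nvidia-smi separates CSV fields with ", ", so drop one leading space.
        printf 'GPU: %s | driver: %s | util: %s\n' "$name" "${driver# }" "${util# }"
    done
}
```

On the VM it can be used as `nvidia-smi --query-gpu=name,driver_version,utilization.gpu --format=csv,noheader | summarize_gpus`.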

RDMA network connectivity

RDMA network connectivity can be enabled on RDMA-capable N-series VMs such as NC24r deployed in the same availability set or in a single placement group in a virtual machine (VM) scale set. The RDMA network supports Message Passing Interface (MPI) traffic for applications running with Intel MPI 5.x or a later version. Additional requirements follow:

Distributions

Deploy RDMA-capable N-series VMs from one of the images in the Azure Marketplace that supports RDMA connectivity on N-series VMs:

  • Ubuntu 16.04 LTS - Configure RDMA drivers on the VM and register with Intel to download Intel MPI:

    1. Install dapl, rdmacm, ibverbs, and mlx4

      sudo apt-get update
      
      sudo apt-get install libdapl2 libmlx4-1
      
    2. In /etc/waagent.conf, enable RDMA by uncommenting the following configuration lines. You need root access to edit this file.

      OS.EnableRDMA=y
      
      OS.UpdateRdmaDriver=y
      
    3. Add or change the following memory settings, in KB, in the /etc/security/limits.conf file. You need root access to edit this file. For testing purposes you can set memlock to unlimited. For example: <User or group name> hard memlock unlimited.

      <User or group name> hard    memlock <memory required for your application in KB>
      
      <User or group name> soft    memlock <memory required for your application in KB>
      
    4. Install Intel MPI Library. Either purchase and download the library from Intel or download the free evaluation version.

      wget http://registrationcenter-download.intel.com/akdlm/irc_nas/tec/9278/l_mpi_p_5.1.3.223.tgz
      

      Only Intel MPI 5.x runtimes are supported.

      For installation steps, see the Intel MPI Library Installation Guide.

    5. Enable ptrace for non-root non-debugger processes (needed for the most recent versions of Intel MPI).

      echo 0 | sudo tee /proc/sys/kernel/yama/ptrace_scope
      
  • CentOS-based 7.4 HPC - RDMA drivers and Intel MPI 5.1 are installed on the VM.
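
The waagent.conf edit in step 2 of the Ubuntu procedure can also be scripted. The sketch below assumes the two settings are present in the file as commented lines such as `# OS.EnableRDMA=y`; the function name `enable_rdma` is hypothetical. It relies on GNU sed's `-i` option, which is available on Ubuntu.

```shell
# enable_rdma (hypothetical helper)
# $1: path to the agent configuration file, normally /etc/waagent.conf
# Uncomments OS.EnableRDMA and OS.UpdateRdmaDriver and sets both to y.
# Lines that are not commented out are left unchanged.
enable_rdma() {
    conf=$1
    sed -i \
        -e 's/^#[[:space:]]*\(OS\.EnableRDMA\)=.*/\1=y/' \
        -e 's/^#[[:space:]]*\(OS\.UpdateRdmaDriver\)=.*/\1=y/' \
        "$conf"
}
```

Run it as root against the real file, for example `enable_rdma /etc/waagent.conf`, and consider keeping a backup copy of the file first.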

Troubleshooting

  • You can set persistence mode using nvidia-smi so the output of the command is faster when you need to query cards. To set persistence mode, execute nvidia-smi -pm 1. Note that if the VM is restarted, the mode setting goes away. You can always script the mode setting to execute upon startup.
  • If you updated the NVIDIA CUDA drivers to the latest version and find RDMA connectivity is no longer working, reinstall the RDMA drivers to reestablish that connectivity.
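
One way to run the persistence mode setting at startup is a oneshot systemd unit. The following is a sketch under assumptions: the unit name `nvidia-persistence.service` is hypothetical, and the path `/usr/bin/nvidia-smi` is assumed (confirm it with `command -v nvidia-smi` on your VM).

```shell
# persistence_unit (hypothetical helper): print a oneshot systemd unit that
# runs "nvidia-smi -pm 1" at boot. The nvidia-smi path is an assumption.
persistence_unit() {
    cat <<'EOF'
[Unit]
Description=Enable NVIDIA persistence mode

[Service]
Type=oneshot
ExecStart=/usr/bin/nvidia-smi -pm 1

[Install]
WantedBy=multi-user.target
EOF
}
```

To install it, write the output to a unit file with `persistence_unit | sudo tee /etc/systemd/system/nvidia-persistence.service`, then enable it with `sudo systemctl enable nvidia-persistence.service`.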

Next steps