CUDA - Installation

1. Installing on Linux

  • Remove other CUDA versions
$ dpkg -l | grep cuda | awk '{print $2}' | xargs -n 1 sudo dpkg -r
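  • Install the NVIDIA driver

Download the latest driver from the official NVIDIA website and install it.

  • Install CUDA

Reference: CUDA 9.0 installation guide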

$ wget https://developer.nvidia.com/compute/cuda/9.0/Prod/local_installers/cuda_9.0.176_384.81_linux-run
$ wget https://developer.nvidia.com/compute/cuda/9.0/Prod/patches/1/cuda_9.0.176.1_linux-run
$ wget https://developer.nvidia.com/compute/cuda/9.0/Prod/patches/2/cuda_9.0.176.2_linux-run
$ wget https://developer.nvidia.com/compute/cuda/9.0/Prod/patches/3/cuda_9.0.176.3_linux-run

$ sudo sh cuda_9.0.176_384.81_linux-run --override
$ sudo sh cuda_9.0.176.1_linux-run
$ sudo sh cuda_9.0.176.2_linux-run
$ sudo sh cuda_9.0.176.3_linux-run
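  • Install cuDNN

First download the cuDNN archive and extract it (it unpacks into a cuda/ directory); see the cuDNN documentation for the installation steps.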

$ sudo cp cuda/include/cudnn.h /usr/local/cuda/include
$ sudo cp cuda/lib64/libcudnn* /usr/local/cuda/lib64
$ sudo chmod a+r /usr/local/cuda/include/cudnn.h /usr/local/cuda/lib64/libcudnn*

If the following error appears when running TensorFlow:

Loaded runtime CuDNN library: 7104 (compatibility version 7100) but source was compiled with 7004 (compatibility version 7000)

download a 7.0.x release of cuDNN instead, for example cudnn-9.0-linux-x64-v7.tgz.
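
The two numbers in the error message are packed cuDNN version codes. As a minimal sketch, assuming the cuDNN 7 encoding used in cudnn.h (major * 1000 + minor * 100 + patch level), they can be decoded as follows. The function name decode_cudnn_version is hypothetical and used only for illustration.

```python
def decode_cudnn_version(code):
    """Split a packed cuDNN 7 version code into (major, minor, patch).

    Assumes the cuDNN 7 scheme: major * 1000 + minor * 100 + patch.
    """
    major, rest = divmod(code, 1000)
    minor, patch = divmod(rest, 100)
    return major, minor, patch


# Values taken from the error message above.
loaded = decode_cudnn_version(7104)    # library found at run time
compiled = decode_cudnn_version(7004)  # version TensorFlow was built with

print("loaded:   %d.%d.%d" % loaded)
print("compiled: %d.%d.%d" % compiled)
```

The loaded library decodes to 7.1.4 and the build-time version to 7.0.4, so the minor versions differ, which is what the message is complaining about.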

  • Set environment variables
$ export PATH=/usr/local/cuda-9.0/bin${PATH:+:${PATH}}
$ export LD_LIBRARY_PATH=/usr/local/cuda-9.0/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}
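
The ${VAR:+:${VAR}} expansion in the export lines appends a colon and the old value only when the variable is already set and non-empty, so no stray separator is left behind. The following Python sketch reproduces that logic for illustration only; prepend_dir is a hypothetical helper, not part of any CUDA tooling.

```python
import os


def prepend_dir(new_dir, current):
    """Mimic NEW${VAR:+:${VAR}}: add ':' + old value only if it is non-empty."""
    if current:
        return new_dir + ":" + current
    return new_dir


# Build the values the two export lines would produce in this process.
path = prepend_dir("/usr/local/cuda-9.0/bin", os.environ.get("PATH", ""))
ld_path = prepend_dir("/usr/local/cuda-9.0/lib64",
                      os.environ.get("LD_LIBRARY_PATH", ""))

print(path)
print(ld_path)
```

Avoiding a leading or trailing colon matters mostly for LD_LIBRARY_PATH, where an empty entry may be interpreted by the dynamic loader as the current directory.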
  • Test
$ python3 cifar10_train.py

Filling queue with 20000 CIFAR images before starting to train. This will take a few minutes.
2018-07-18 12:13:54.529118: I tensorflow/core/platform/cpu_feature_guard.cc:140] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2018-07-18 12:13:55.263256: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1212] Found device 0 with properties:
name: GeForce GTX 1080 major: 6 minor: 1 memoryClockRate(GHz): 1.8095
pciBusID: 0000:03:00.0
totalMemory: 7.92GiB freeMemory: 7.80GiB
2018-07-18 12:13:55.263290: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1312] Adding visible gpu devices: 0
2018-07-18 12:13:55.520475: I tensorflow/core/common_runtime/gpu/gpu_device.cc:993] Creating TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 7537 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080, pci bus id: 0000:03:00.0, compute capability: 6.1)
2018-07-18 12:13:59.150650: step 0, loss = 4.68 (271.6 examples/sec; 0.471 sec/batch)
2018-07-18 12:13:59.354900: step 10, loss = 4.63 (6266.7 examples/sec; 0.020 sec/batch)
2018-07-18 12:13:59.493935: step 20, loss = 4.38 (9206.3 examples/sec; 0.014 sec/batch)
2018-07-18 12:13:59.632250: step 30, loss = 4.33 (9254.3 examples/sec; 0.014 sec/batch)
2018-07-18 12:13:59.773219: step 40, loss = 4.35 (9080.0 examples/sec; 0.014 sec/batch)
2018-07-18 12:13:59.913853: step 50, loss = 4.34 (9101.7 examples/sec; 0.014 sec/batch)
...
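  • Select GPUs

When the machine has several GPUs, you can choose which ones a program runs on in either of the following ways. GPUs are numbered 0, 1, 2, …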

$ CUDA_VISIBLE_DEVICES=1,3 python3 cifar10_train.py

$ export CUDA_VISIBLE_DEVICES=1,3
$ python3 cifar10_train.py
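
Inside the process, the selected GPUs are renumbered from 0 in the order listed, so with CUDA_VISIBLE_DEVICES=1,3 physical GPU 1 appears as device 0 and physical GPU 3 as device 1. The sketch below illustrates that mapping and shows the variable being set from Python before any CUDA-using library is imported. visible_device_map is a hypothetical helper and handles only a plain list of integer indices.

```python
import os


def visible_device_map(value):
    """Map in-process device index -> physical GPU index.

    Handles only a plain comma-separated list of integer indices.
    """
    ids = [int(tok) for tok in value.split(",") if tok.strip()]
    return dict(enumerate(ids))


# Must be set before the framework initializes CUDA, for example before
# importing tensorflow in the same script.
os.environ["CUDA_VISIBLE_DEVICES"] = "1,3"

for logical, physical in visible_device_map(
        os.environ["CUDA_VISIBLE_DEVICES"]).items():
    print("device %d -> physical GPU %d" % (logical, physical))
```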
  • Monitor GPU utilization
$ watch -n 0.5 nvidia-smi

Or view system memory information at the same time:

$ watch -n 0.5 "free -h && echo && nvidia-smi"

2. Installing on Windows

Problem

When installing with the default options, the installer reports that Visual Studio Integration failed to install.

[Screenshot: CUDA 9.2 installation failed]

Solution

Extract the installer package to a directory with 7-Zip, then start PowerShell as Administrator and run setup.exe:

[D:\Downloads\cuda_9.2.88_win10]
$ .\setup.exe -log:"D:\Temp\CUDA" -loglevel:6

The log file D:\Temp\CUDA\LOG.setup.exe.log shows the following errors:

198.566 |    ERROR: [NVI2.NVMsiPhase] 853@CNVMsiPhase::InvokePhase : COM error: -2147467259. 
198.567 |     INFO: [NVI2.NVInstaller] 3004@CNVInstaller::InternalPerformInstallPackagePhases : Exiting Checkpoint: Processing package phase "NsightMSITraffic" ( 0 ms ). 
198.567 |     INFO: [NVI2.NVInstaller] 2043@CNVInstaller::InternalPerformInstall : Exiting Checkpoint: Processing Package Phases in "visual_studio_integration_9.2" ( 0 ms ). 
198.569 |    ERROR: [NVI2.NVInstaller] 2064@CNVInstaller::InternalPerformInstall : Package "visual_studio_integration_9.2" failed with error: Exception {0x80004005 - Unspecified error; File: PerformInstall.cpp; Line: 4029; Phase failure}. 
198.568 |    DEBUG: [DisplayDriver.DisplayDriverExtSite] 1189@CDisplayDriverExtSite::AfterInstallPackage : Package that finished is not Display.Driver , returning early. 
198.569 |    ERROR: [NVI2.NVInstaller] 2123@CNVInstaller::InternalPerformInstall : Failing at package "visual_studio_integration_9.2" failed with error: Exception {0x80004005 - Unspecified error} - aborting install. 
198.570 |     INFO: [NVI2.NVInstaller] 1919@CNVInstaller::InternalPerformInstall : Exiting Checkpoint: Processing Package "visual_studio_integration_9.2" ( 16 ms ). 
198.570 |     INFO: [NVI2.NVInstaller] 1899@CNVInstaller::InternalPerformInstall : Exiting Checkpoint: Processing Packages ( 141 ms ). 
198.569 |     INFO: [system] 464@Nvidia::Logging::Logger::Logger : 2018-Jun-08 10:23:16 :  Logging init OK. Using configuration from HKLM for DefaultProcess, for the setup.exe. 
198.569 |     INFO: [CUDAToolkit.CUDAToolkitExtSite] 700@CCUDAToolkitExtSite::DecideLaunchDocFinishItem : Skipping finish option decision as package not successfully installed. 
198.570 |     INFO: [system] 464@Nvidia::Logging::Logger::Logger : 2018-Jun-08 10:23:16 :  Logging init OK. Using configuration from HKLM for DefaultProcess, for the setup.exe. 
198.570 |     INFO: [DocExt.DocExtSite] 261@CDocExtSite::DecideLaunchDocFinishItem : Skipping finish option decision as package not successfully installed. 
198.572 |    ERROR: [NVI2.InstallThread] 54@CInstallThread::ThreadProc : Install failed - Exception {0x80004005 - Unspecified error; File: PerformInstall.cpp; Line: 4029; Phase failure} - going to fail state. 
198.573 |    DEBUG: [NVI2.NVExtension] 89@CNVExtension::InternalLoad : Loading Extension DLL "D:\Downloads\cuda_9.2.88_win10\MSVCRT\MSVCRTExt.dll". 
198.581 |    DEBUG: [NVI2.NVExtension] 133@CNVExtension::InternalLoad : Loaded Extension DLL "D:\Downloads\cuda_9.2.88_win10\MSVCRT\MSVCRTExt.dll". 

The log confirms that the failure occurs while installing Visual Studio Integration. Rerun the CUDA installer, deselect Visual Studio Integration, and complete the installation.

$ ls

    Directory: C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v9.2


Mode                LastWriteTime         Length Name
----                -------------         ------ ----
d-----       2018-06-08  12:01 PM                bin
d-----       2018-06-08  12:01 PM                doc
d-----       2018-06-08  12:01 PM                extras
d-----       2018-06-08  12:01 PM                include
d-----       2018-06-08  12:00 PM                jre
d-----       2018-06-08  12:01 PM                lib
d-----       2018-06-08  12:00 PM                libnvvp
d-----       2018-06-08  12:01 PM                nvml
d-----       2018-06-08  12:01 PM                nvvm
d-----       2018-06-08  12:01 PM                src
d-----       2018-06-08  12:01 PM                tools
-a----       2018-04-12  02:20 PM           6344 CUDA_Toolkit_Release_Notes.txt
-a----       2018-04-12  02:20 PM          81026 EULA.txt
-a----       2018-04-12  02:20 PM             21 version.txt

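Also note that TensorFlow v1.8 supports only CUDA 9.0; with any other version it reports that cudart64_90.dll cannot be found. CUDA 9.0 and 9.2 can coexist. The key is to configure the environment variables so that PATH points to only one of the two versions, as follows (Command Prompt syntax):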

set CUDA_HOME=C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v9.0
set PATH=%PATH%;%CUDA_HOME%\bin;%CUDA_HOME%\libnvvp;

TensorFlow will then search the following paths for the modules it needs:

  1. C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v9.0\bin

  2. C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v9.0\libnvvp
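
To check which CUDA version a given PATH would actually resolve, it can help to look for cudart64_90.dll in each PATH entry in order. The following is a minimal sketch, assuming it is run in the same environment as TensorFlow; find_on_path is a hypothetical helper, and it inspects PATH only, not the other locations Windows may search for DLLs.

```python
import os


def find_on_path(filename, path_value=None, sep=os.pathsep):
    """Return every directory in a PATH-style string that contains filename."""
    if path_value is None:
        path_value = os.environ.get("PATH", "")
    hits = []
    for entry in path_value.split(sep):
        entry = entry.strip('"')
        if entry and os.path.isfile(os.path.join(entry, filename)):
            hits.append(entry)
    return hits


# The first hit is the one an in-order PATH search would reach; an empty
# list means the DLL is not reachable through PATH at all.
matches = find_on_path("cudart64_90.dll")
print(matches if matches else "cudart64_90.dll not found on PATH")
```

If the list is empty, or its first entry is not the v9.0 bin directory, PATH is the first thing to recheck.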