Working with the CUDA Environment

Not all machines have CUDA-capable video cards. The FastX jump hosts specifically do NOT have GPUs installed, so after logging in via FastX you must ssh to a lab computer. Use the "lspci" command to determine whether an NVIDIA GPU is installed:

[user@l-lnx103 cuda_samples]$  lspci | grep -i nvidia
01:00.0 VGA compatible controller: NVIDIA Corporation GM107GL [Quadro K620] (rev a2)
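
If you need to repeat this check on several machines, it can be wrapped in a small shell helper. This is only a sketch: has_nvidia_gpu is a hypothetical name, not something installed on the lab systems, and it reads lspci output on standard input so that it can also be tried on saved output.

```shell
# Hypothetical helper (not part of the CUDA toolkit): reads lspci-style
# output on stdin and succeeds if any line mentions NVIDIA.
has_nvidia_gpu() {
    grep -qi 'nvidia'
}

# Typical use on a lab machine:
#   lspci | has_nvidia_gpu && echo "NVIDIA GPU found" || echo "no NVIDIA GPU here"
```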

Where is the documentation?

CUDA (/usr/local/cuda/doc/html/index.html)
CUDA Downloads
CUDA Toolkit Documentation
Nvidia's CUDA Forum
Release Notes (/usr/local/cuda/doc/html/cuda-toolkit-release-notes/index.html)
SDK Sample programs: /usr/local/cuda/samples

How do I build the examples that are included with the SDK?

You'll need to make a copy of /usr/local/cuda/samples and set up your shell environment to use an older, side-loaded gcc (version 4.9.2) that is compatible with the CUDA libraries:

[user@l-lnx103:~]$ cp -a /usr/local/cuda/samples ~/
[user@l-lnx103:~]$ cd samples
# set up your environment with version 4.9.2 of gcc to correctly build the samples
[user@l-lnx103:~/samples]$ scl enable devtoolset-3 bash
[user@l-lnx103:~/samples]$ make
# add the built executables to your PATH (use the line that matches your shell)
[user@l-lnx103:~/samples]$ export PATH=$PATH:~/samples/bin/x86_64/linux/release # for bash users
[user@l-lnx103:~/samples]$ setenv PATH ${PATH}:~/samples/bin/x86_64/linux/release # for tcsh users
[user@l-lnx103:~/samples]$ deviceQuery

If you followed the above example, everything is built and the executables are in ~/samples/bin/x86_64/linux/release.

What version of the toolkit is installed?

Follow the /usr/local/cuda symbolic link or use the rpm command. On this workstation, version 8.0 of the CUDA SDK is installed:

[user@l-lnx103 cuda_samples]$ ls -l /usr/local/cuda
lrwxrwxrwx. 1 root root 8 Jun 25 12:58 /usr/local/cuda -> cuda-8.0

[user@l-lnx103 cuda_samples]$ rpm -qav|grep  cuda-toolkit
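
The symbolic link lookup can also be scripted. The following is a minimal sketch, assuming the link target is named cuda-<version> as in the listing above; cuda_toolkit_version is a hypothetical helper name.

```shell
# Hypothetical helper: print the toolkit version encoded in the target of
# the /usr/local/cuda symbolic link (or of the link given as $1).
cuda_toolkit_version() {
    local link="${1:-/usr/local/cuda}"
    local target
    target="$(readlink "$link")" || return 1   # fails if $link is not a symlink
    target="${target##*/}"                     # keep the last path component, e.g. cuda-8.0
    echo "${target#cuda-}"                     # strip the "cuda-" prefix
}

# Typical use:
#   cuda_toolkit_version
```

Running "nvcc --version" from the toolkit's bin directory reports the release as well.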

How can I tell if the card supports double-precision floating point numbers?

If you built the examples, run the deviceQuery executable, which reports the card's properties. Double precision requires compute capability 1.3 or higher. A description of each compute capability level is available in Appendix G of the Programming Guide.

Here's sample output of deviceQuery. The compute capability is on the "CUDA Capability Major/Minor version number" line:

[user@l-lnx103:samples]$ bin/x86_64/linux/release/deviceQuery

bin/x86_64/linux/release/deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: "Quadro K620"
  CUDA Driver Version / Runtime Version                8.0 / 8.0
  CUDA Capability Major/Minor version number:     5.0
  Total amount of global memory:                           2000 MBytes (2097414144 bytes)
  ( 3) Multiprocessors, (128) CUDA Cores/MP:         384 CUDA Cores
  GPU Max Clock rate:                                              1124 MHz (1.12 GHz)
  Memory Clock rate:                                                900 Mhz
  Memory Bus Width:                                                128-bit
  L2 Cache Size:                                                        2097152 bytes
  Maximum Texture Dimension Size (x,y,z)             1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
  Maximum Layered 1D Texture Size, (num) layers  1D=(16384), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(16384, 16384), 2048 layers
  Total amount of constant memory:                         65536 bytes
  Total amount of shared memory per block:            49152 bytes
  Total number of registers available per block:        65536
  Warp size:                                                                 32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:                  1024
  Max dimension size of a thread block (x,y,z):         (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z):             (2147483647, 65535, 65535)
  Maximum memory pitch:                                          2147483647 bytes
  Texture alignment:                                                    512 bytes
  Concurrent copy and kernel execution:                    Yes with 1 copy engine(s)
  Run time limit on kernels:                                         Yes
  Integrated GPU sharing Host Memory:                      No
  Support host page-locked memory mapping:           Yes
  Alignment requirement for Surfaces:                        Yes
  Device has ECC support:                                            Disabled
  Device supports Unified Addressing (UVA):               Yes
  Device PCI Domain ID / Bus ID / location ID:              0 / 1 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 8.0, CUDA Runtime Version = 8.0, NumDevs = 1, Device0 = Quadro K620
Result = PASS
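
The 1.3 threshold can be checked from this output with a short script. This is a sketch only: compute_capability and supports_double are hypothetical helper names, and the parsing assumes the "CUDA Capability Major/Minor version number" line looks like the one in the sample output above.

```shell
# Hypothetical helper: read deviceQuery output on stdin and print the
# compute capability of the first device listed, e.g. "5.0".
compute_capability() {
    awk -F: '/CUDA Capability Major\/Minor version number/ {
        gsub(/[[:space:]]/, "", $2)   # drop the padding around the value
        print $2
        exit
    }'
}

# Hypothetical helper: succeed if the given capability is 1.3 or higher.
supports_double() {
    local cc="$1"
    local major="${cc%%.*}" minor="${cc#*.}"
    [ "$major" -gt 1 ] || { [ "$major" -eq 1 ] && [ "$minor" -ge 3 ]; }
}

# Typical use:
#   cc="$(deviceQuery | compute_capability)"
#   supports_double "$cc" && echo "double precision supported ($cc)"
```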

How do I compile my code to use double precision?

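When nvcc targets an architecture below compute capability 1.3, it converts doubles into floats. To override this behaviour, name a target of sm_13 or higher in the command line options passed to nvcc, for example "--gpu-name sm_13" (the short form of the option is "-arch"). For the Quadro K620 shown above (compute capability 5.0), the matching target is sm_50. Please see the CUDA Toolkit Documentation for more information.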
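
The target name follows directly from the compute capability that deviceQuery reports. As a sketch, the hypothetical helper below (arch_flag is not an nvcc feature) builds the option in nvcc's short -arch form; myprog.cu is a placeholder file name.

```shell
# Hypothetical helper: turn a compute capability such as "5.0" into the
# matching nvcc architecture option, "-arch=sm_50".
arch_flag() {
    local cc="$1"
    echo "-arch=sm_${cc%%.*}${cc#*.}"   # major digits followed by minor digits
}

# Typical use:
#   nvcc $(arch_flag 5.0) -o myprog myprog.cu
```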

deviceQuery gave me an error about API mismatch, what's wrong?

[user@l-lnx103]$ bin/linux/release/deviceQuery
bin/linux/release/deviceQuery Starting...

CUDA Device Query (Runtime API) version (CUDART static linking)

Error: API mismatch: the NVIDIA kernel module has version 260.19.36,
but this NVIDIA driver component has version 260.19.29. Please make
sure that the kernel module and all NVIDIA driver components
have the same version.
cudaGetDeviceCount FAILED CUDA Driver and Runtime version may be mismatched.


Press <Enter> to Quit...

From time to time we update the NVIDIA drivers on the system to fix various issues. When that happens, you'll need to rebuild deviceQuery and any code that you built against a previous version of the CUDA driver library (which is provided by the driver and not by the SDK).

What other issues do I need to be aware of?

If the computer is using the video card to drive a display (there's a graphical login screen), each kernel execution is limited to 5 seconds. Look for the line "Run time limit on kernels" in the output of deviceQuery to see whether the limit applies.
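
A job script can make this check before launching a long-running kernel. The sketch below assumes the deviceQuery line is formatted as in the sample output earlier on this page; has_kernel_time_limit is a hypothetical helper name.

```shell
# Hypothetical helper: read deviceQuery output on stdin and succeed if the
# "Run time limit on kernels" line says Yes.
has_kernel_time_limit() {
    grep -Eq 'Run time limit on kernels:[[:space:]]*Yes'
}

# Typical use:
#   deviceQuery | has_kernel_time_limit && echo "kernels are time-limited on this machine"
```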

If you are using the computer remotely via ssh and someone sits down at the computer and logs in, you'll no longer be able to use the video card to run CUDA executables. The program deviceQuery will return "There is no device supporting CUDA".

If multiple users try to run CUDA programs at the same time, there may be contention. Access appears to be first-come, first-served: if the first user allocates all of the memory on the video card, no one else can run programs until the first user finishes. If the card can accommodate the needs of multiple programs (memory and processing), it will run them all simultaneously; otherwise, you'll have to wait. To be sure that you can run your code, go to the lab and sit at the computer. You will then get exclusive access to the video card, but you will be limited to 5 seconds per kernel execution.
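
Before starting a job remotely, you can check how much memory other programs have already claimed on the card. This is a sketch that assumes the nvidia-smi utility shipped with the NVIDIA driver is available on the lab machine; gpu_free_mib is a hypothetical helper that expects "used, total" pairs in MiB, one line per GPU.

```shell
# Hypothetical helper: read "used, total" memory pairs (MiB, one GPU per
# line) on stdin and print the free memory in MiB for each GPU.
gpu_free_mib() {
    awk -F', *' '{ print $2 - $1 }'
}

# Typical use (assumes nvidia-smi is installed):
#   nvidia-smi --query-gpu=memory.used,memory.total --format=csv,noheader,nounits | gpu_free_mib
```

A value close to zero suggests that another user's program currently holds most of the card's memory.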