Running the nvidia-smi command does not give a usable result:
$ nvidia-smi
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
WARNING:
You should always run with libnvidia-ml.so that is installed with your
NVIDIA Display Driver. By default it's installed in /usr/lib and /usr/lib64.
libnvidia-ml.so in GDK package is a stub library that is attached only for
build purposes (e.g. machine that you build your application doesn't have
to have Display Driver installed).
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
Linked to libnvidia-ml library at wrong path : /usr/local/cuda-10.1/targets/x86_64-linux/lib/stubs/libnvidia-ml.so
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
WARNING:
You should always run with libnvidia-ml.so that is installed with your
NVIDIA Display Driver. By default it's installed in /usr/lib and /usr/lib64.
libnvidia-ml.so in GDK package is a stub library that is attached only for
build purposes (e.g. machine that you build your application doesn't have
to have Display Driver installed).
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
$ locate libnvidia-ml.so
/usr/local/cuda-10.1/targets/x86_64-linux/lib/stubs/libnvidia-ml.so
/usr/local/cuda-10.1/targets/x86_64-linux/lib/stubs/libnvidia-ml.so.1
I don’t have CUDA on EL8. A CUDA 11.4 install on EL7 (from NVIDIA’s yum repository) does add /usr/local/cuda/targets/x86_64-linux/lib to ld’s search path, but not the */stubs/ subdirectory.
There, ldconfig -p does not show anything from “stubs”, and the libnvidia-ml it sees is in /lib64/ (which is a symlink to /usr/lib64/).
How did you install CUDA?
Do you have /usr/local/cuda-10.1/targets/x86_64-linux/lib/stubs/ in LD_LIBRARY_PATH in your current shell session?
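A quick way to check that is to split the variable on colons and look for entries ending in "stubs". This is only a minimal sketch; find_stub_dirs is a helper name made up for this post, not an existing tool:

```shell
#!/bin/sh
# Print every entry of a colon-separated search path that points at a
# "stubs" directory. Prints nothing if there is no such entry.
# find_stub_dirs is a hypothetical helper, not part of any package.
find_stub_dirs() {
    # $1: colon-separated path list, e.g. the value of LD_LIBRARY_PATH
    printf '%s\n' "$1" | tr ':' '\n' | grep '/stubs/*$' || true
}

# Check the current shell session (an unset variable counts as empty).
find_stub_dirs "${LD_LIBRARY_PATH:-}"
```

If this prints the cuda-10.1 stubs directory, the variable is probably being set somewhere in your shell startup files, and that would explain why nvidia-smi picks up the stub library instead of the one shipped with the driver.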
It didn’t work with the 4.18-series kernel either:
$ cat /proc/version
Linux version 4.18.0-348.12.2.el8_5.x86_64 (mockbuild@dal1-prod-builder001.bld.equ.rockylinux.org) (gcc version 8.5.0 20210514 (Red Hat 8.5.0-3) (GCC)) #1 SMP Wed Jan 19 17:53:40 UTC 2022
NVIDIA’s repository has both the NVIDIA drivers and the CUDA toolkit. CUDA requires the drivers. At some point it was so nitpicky that one definitely needed the drivers from the same repository. Or so I was told.
I’ve never installed drivers from NVIDIA’s repo. I’ve installed drivers from the ELRepo repository before even defining NVIDIA’s repo (if I define it at all).
State: Machine has no third-party repositories / content.
State: Machine knows ‘elrepo’ and has NVIDIA’s driver in use.
After that it is possible to define the ‘cuda’ repo and install a CUDA toolkit. It should be ok with the NVIDIA drivers packaged by ELRepo. Note though that “install whole toolkit” does not work; one has to limit the install to the CUDA subpackages that are actually necessary.
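The order described above can be sketched roughly as follows. Treat it as an assumption-laden outline rather than the exact commands: it assumes the elrepo-release package is reachable from the repositories already enabled, that dnf config-manager (from dnf-plugins-core) is available, and that the default CUDA_REPO_URL below is the right repo file for EL8. The run and install_driver_then_cuda names are invented for this sketch. By default it only prints what it would do.

```shell
#!/bin/sh
# Dry-run sketch of the "ELRepo driver first, CUDA repo second" order.
# It only prints the commands unless DRY_RUN=0 is set.
# Assumption: this URL is NVIDIA's repo file for EL8; override if needed.
CUDA_REPO_URL="${CUDA_REPO_URL:-https://developer.download.nvidia.com/compute/cuda/repos/rhel8/x86_64/cuda-rhel8.repo}"

# Hypothetical helper: echo the command in dry-run mode, else execute it.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        printf '+ %s\n' "$*"
    else
        "$@"
    fi
}

# Arguments: the CUDA subpackages you actually need (may be empty).
install_driver_then_cuda() {
    # 1. Make the machine aware of ELRepo.
    run sudo dnf install elrepo-release
    # 2. Install the detection helper from ELRepo.
    run sudo dnf install nvidia-detect
    # 3. Install the driver package that nvidia-detect recommends.
    #    In dry-run mode the substitution is shown as literal text.
    if [ "${DRY_RUN:-1}" = "1" ]; then
        driver_pkg='$(nvidia-detect)'
    else
        driver_pkg=$(nvidia-detect)
    fi
    run sudo dnf install "$driver_pkg"
    # 4. Only then define the 'cuda' repo and install selected subpackages,
    #    not the whole toolkit.
    if [ "$#" -gt 0 ]; then
        run sudo dnf config-manager --add-repo "$CUDA_REPO_URL"
        run sudo dnf install "$@"
    fi
}

install_driver_then_cuda "$@"
```

Run it without arguments to see the driver steps, or pass the names of the CUDA subpackages you need to see the repo and install steps as well. Which subpackages those are depends on what you build, so none are filled in here.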
$ sudo dnf install $(nvidia-detect)
An Intel display controller was also detected
Last metadata expiration check: 0:00:50 ago on Fri 11 Feb 2022 09:00:52 PM PST.
Error:
Problem: package kmod-nvidia-470.103.01-1.el8_5.elrepo.x86_64 requires nvidia-x11-drv = 470.103.01, but none of the providers can be installed
- cannot install the best candidate for the job
- package nvidia-x11-drv-470.103.01-1.el8_5.elrepo.x86_64 is filtered out by modular filtering
(try to add '--skip-broken' to skip uninstallable packages or '--nobest' to use not only best candidate packages)
And nvidia-detect results:
$ nvidia-detect
kmod-nvidia
An Intel display controller was also detected
$ nvidia-smi
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
WARNING:
You should always run with libnvidia-ml.so that is installed with your
NVIDIA Display Driver. By default it's installed in /usr/lib and /usr/lib64.
libnvidia-ml.so in GDK package is a stub library that is attached only for
build purposes (e.g. machine that you build your application doesn't have
to have Display Driver installed).
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
Linked to libnvidia-ml library at wrong path : /usr/local/cuda-10.1/targets/x86_64-linux/lib/stubs/libnvidia-ml.so
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
WARNING:
You should always run with libnvidia-ml.so that is installed with your
NVIDIA Display Driver. By default it's installed in /usr/lib and /usr/lib64.
libnvidia-ml.so in GDK package is a stub library that is attached only for
build purposes (e.g. machine that you build your application doesn't have
to have Display Driver installed).
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!