Nvidia-smi fails

Running the nvidia-smi command does not give a result:

$ nvidia-smi

!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
WARNING:

You should always run with libnvidia-ml.so that is installed with your
NVIDIA Display Driver. By default it's installed in /usr/lib and /usr/lib64.
libnvidia-ml.so in GDK package is a stub library that is attached only for
build purposes (e.g. machine that you build your application doesn't have
to have Display Driver installed).
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
Linked to libnvidia-ml library at wrong path : /usr/local/cuda-10.1/targets/x86_64-linux/lib/stubs/libnvidia-ml.so

NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.


!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
WARNING:

You should always run with libnvidia-ml.so that is installed with your
NVIDIA Display Driver. By default it's installed in /usr/lib and /usr/lib64.
libnvidia-ml.so in GDK package is a stub library that is attached only for
build purposes (e.g. machine that you build your application doesn't have
to have Display Driver installed).
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
$ locate libnvidia-ml.so
/usr/local/cuda-10.1/targets/x86_64-linux/lib/stubs/libnvidia-ml.so
/usr/local/cuda-10.1/targets/x86_64-linux/lib/stubs/libnvidia-ml.so.1

I don’t have CUDA on EL8. A CUDA 11.4 install on EL7 (from NVidia’s yum repository) does add
/usr/local/cuda/targets/x86_64-linux/lib to ld’s search path, but not the */stubs/ directory.

ldconfig -p does not show anything from “stubs”, and the libnvidia-ml it sees is in /lib64/ (which is a symlink to /usr/lib64/).

How did you install CUDA?
Do you have /usr/local/cuda-10.1/targets/x86_64-linux/lib/stubs/ in LD_LIBRARY_PATH in your current shell session?
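A quick way to answer that question is sketched below. The helper name find_stub_dirs is made up for this example; it only splits a colon-separated path list and prints entries that end in a "stubs" directory. The ldconfig line shows what the dynamic linker cache knows about libnvidia-ml.

```shell
# Hypothetical helper: print every entry of a colon-separated path list
# that ends in a "stubs" directory (with or without a trailing slash).
find_stub_dirs() {
    # $1: a colon-separated path list, e.g. "$LD_LIBRARY_PATH"
    printf '%s\n' "$1" | tr ':' '\n' | grep -E '/stubs/?$'
}

# Check the current shell session.
find_stub_dirs "${LD_LIBRARY_PATH:-}" || echo "no stubs directory in LD_LIBRARY_PATH"

# Show which libnvidia-ml the linker cache would pick up.
ldconfig -p 2>/dev/null | grep libnvidia-ml || echo "libnvidia-ml is not in the ldconfig cache"
```

If the first command prints a path, removing that entry from LD_LIBRARY_PATH (for example in ~/.bashrc) is the thing to try before reinstalling anything.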

I think I followed instructions from some articles, such as:
To install kernel 5.13 on Rocky Linux 8:

sudo dnf install https://www.elrepo.org/elrepo-release-8.el8.elrepo.noarch.rpm
sudo rpm --import https://www.elrepo.org/RPM-GPG-KEY-elrepo.org
sudo dnf --enablerepo=elrepo-kernel install kernel-ml kernel-ml-devel kernel-ml-headers

To install NVIDIA drivers on Rocky Linux 8:
https://www.linuxcapable.com/how-to-install-or-upgrade-nvidia-drivers-on-rocky-linux-8/

Does NVidia’s yum repository actually have NVidia drivers that install for / work with the kernel-ml kernel?

It didn’t work with the 4.18-series kernel either:

$ cat /proc/version 
Linux version 4.18.0-348.12.2.el8_5.x86_64 (mockbuild@dal1-prod-builder001.bld.equ.rockylinux.org) (gcc version 8.5.0 20210514 (Red Hat 8.5.0-3) (GCC)) #1 SMP Wed Jan 19 17:53:40 UTC 2022
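Since the thread involves both the stock 4.18 kernel and ELRepo's kernel-ml, it can help to confirm which one is actually running and whether the nvidia module is loaded. The sketch below assumes that ELRepo kernel releases contain ".elrepo." and that stock EL8 kernels contain ".el8"; the function name kernel_flavor is made up for this example.

```shell
# Hypothetical helper: classify a kernel release string (as printed by `uname -r`).
# Assumption: ELRepo kernels carry ".elrepo." in the release, stock EL8 ones ".el8".
kernel_flavor() {
    case "$1" in
        *.elrepo.*) echo "elrepo" ;;
        *.el8*)     echo "distro" ;;
        *)          echo "unknown" ;;
    esac
}

# Classify the running kernel.
kernel_flavor "$(uname -r)"

# Check whether any nvidia kernel module is currently loaded.
lsmod 2>/dev/null | grep -E '^nvidia' || echo "nvidia kernel module is not loaded"
```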

NVidia’s repository has both the NVidia drivers and the CUDA toolkit. CUDA requires the drivers. At some point it was so nitpicky that one definitely needed the drivers from the same repository. Or so I was told.

I’ve never installed drivers from NVidia’s repo. I’ve installed the drivers from the ELRepo repository before even defining NVidia’s repo (if I define it at all).

State: Machine has no third-party repositories / content.

sudo dnf install elrepo-release
sudo dnf install nvidia-detect
sudo dnf install $(nvidia-detect)
reboot

State: Machine knows ‘elrepo’ and has NVidia’s driver in use.

After that it is possible to define the ‘cuda’ repo and install a CUDA toolkit. It should be ok with the NVidia drivers packaged by ELRepo. Note though that “install whole toolkit” does not work; one has to limit the install to the CUDA subpackages that are actually necessary.

Result:

$ sudo dnf install $(nvidia-detect)
An Intel display controller was also detected
Last metadata expiration check: 0:00:50 ago on Fri 11 Feb 2022 09:00:52 PM PST.
Error: 
 Problem: package kmod-nvidia-470.103.01-1.el8_5.elrepo.x86_64 requires nvidia-x11-drv = 470.103.01, but none of the providers can be installed
  - cannot install the best candidate for the job
  - package nvidia-x11-drv-470.103.01-1.el8_5.elrepo.x86_64 is filtered out by modular filtering
(try to add '--skip-broken' to skip uninstallable packages or '--nobest' to use not only best candidate packages)
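To pull the blocked package names out of an error like the one above, something along these lines may help. The function name filtered_packages is made up for this example. The follow-up check assumes, without confirmation from this thread, that a module named nvidia-driver from another repository is what shadows the ELRepo package; adjust the name if `dnf module list` shows something else.

```shell
# Hypothetical helper: read dnf output on stdin and print the packages that
# dnf reports as "filtered out by modular filtering".
filtered_packages() {
    sed -n 's/^ *- package \(.*\) is filtered out by modular filtering.*/\1/p'
}

# Usage (not run here):
#   sudo dnf install $(nvidia-detect) 2>&1 | filtered_packages

# Assumption: the conflicting module is called "nvidia-driver".
if command -v dnf >/dev/null 2>&1; then
    dnf module list nvidia-driver || true
fi
```

If a module does show up, that points to a third-party repository already being defined on the machine.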

And nvidia-detect results:

$ nvidia-detect 
kmod-nvidia
An Intel display controller was also detected
$ nvidia-smi 

!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
WARNING:

You should always run with libnvidia-ml.so that is installed with your
NVIDIA Display Driver. By default it's installed in /usr/lib and /usr/lib64.
libnvidia-ml.so in GDK package is a stub library that is attached only for
build purposes (e.g. machine that you build your application doesn't have
to have Display Driver installed).
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
Linked to libnvidia-ml library at wrong path : /usr/local/cuda-10.1/targets/x86_64-linux/lib/stubs/libnvidia-ml.so

NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.


!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
WARNING:

You should always run with libnvidia-ml.so that is installed with your
NVIDIA Display Driver. By default it's installed in /usr/lib and /usr/lib64.
libnvidia-ml.so in GDK package is a stub library that is attached only for
build purposes (e.g. machine that you build your application doesn't have
to have Display Driver installed).
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

The prerequisite for my approach was to have no third-party content at the start. /usr/local/cuda-10.1 is not from Rocky.
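A rough way to see what third-party content is present is sketched here. It assumes the default Rocky 8 repo ids are appstream, baseos and extras, and that `dnf -q repolist` prints a header line followed by one repo per line; the function name third_party_repos is made up for this example.

```shell
# Hypothetical helper: read `dnf -q repolist` output on stdin, skip the header
# line, and print repo ids that are not assumed Rocky 8 defaults.
third_party_repos() {
    awk 'NR > 1 && $1 !~ /^(appstream|baseos|extras)$/ { print $1 }'
}

# List enabled repositories that did not come with the base install.
if command -v dnf >/dev/null 2>&1; then
    dnf -q repolist | third_party_repos
fi

# Ask rpm whether any installed package owns the CUDA 10.1 directory.
if command -v rpm >/dev/null 2>&1; then
    rpm -qf /usr/local/cuda-10.1 || echo "not owned by any installed package (or the path is absent)"
fi
```

If rpm reports no owner, the directory most likely came from a runfile or manual install and has to be cleaned up by hand.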