CUDA Toolkit installed but nvcc not found

I am managing a Rocky Linux 8.7 computing cluster and was forced to install updates due to complications, which broke the NVIDIA drivers. That has now been mostly fixed, and 'nvidia-smi' is producing the expected output again.

However, running 'nvcc --version' does not work:

[admin@cluster~]$ nvcc --version
-bash: nvcc: command not found

I can make it work by running

export PATH=/usr/local/cuda/bin:$PATH
source ~/.bashrc

But this only lasts until I log off; the next time I log in it doesn't work anymore. There are also multiple users on this cluster, so I want a solution that works for all of them.

When they try to schedule jobs (via Slurm), they get the following error:

/$PATH/pmemd.cuda: error while loading shared libraries: cannot open shared object file: No such file or directory

Additional info:

[admin@cluster~]$ uname -r
[admin@cluster~]$ nvidia-smi
NVIDIA-SMI 530.30.02              Driver Version: 530.30.02    CUDA Version: 12.1

Edit: I suspect it’s mainly a problem with the PATH variables, but I’m not sure how to fix this other than editing the .bashrc file for every user. What leads me to this suspicion is this:

[admin@cluster~]$ echo $LD_LIBRARY_PATH

as there is no /usr/local/cuda-11.2 anymore because I upgraded to 12.1

Assistance with solving this is highly appreciated!

bash sources /etc/profile, which in turn sources all of the /etc/profile.d/*.sh files.
Adding a file there adds it "for everyone".
Note how /etc/profile defines a pathmunge function for those files to use. Convenient, although only for the PATH.
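A minimal sketch of such a drop-in file (the filename /etc/profile.d/cuda.sh is my choice, and the pathmunge fallback mirrors the helper /etc/profile provides, so the file also works when sourced on its own):

```shell
# /etc/profile.d/cuda.sh -- hypothetical drop-in, sourced by every
# login shell via /etc/profile.

# pathmunge is normally defined by /etc/profile; provide a minimal
# fallback with the same behavior: add a directory to PATH only if
# it is not already there.
if ! type pathmunge >/dev/null 2>&1; then
    pathmunge () {
        case ":${PATH}:" in
            *:"$1":*) ;;                  # already on PATH: do nothing
            *)
                if [ "$2" = "after" ]; then
                    PATH="$PATH:$1"       # append
                else
                    PATH="$1:$PATH"       # prepend (default)
                fi
                ;;
        esac
    }
fi

pathmunge /usr/local/cuda/bin
export PATH

# pathmunge only handles PATH; the library path must be set directly.
export LD_LIBRARY_PATH=/usr/local/cuda/lib64${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}
```

Because /usr/local/cuda is a symlink, this file keeps working across toolkit upgrades as long as the symlink is repointed.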

I did recently see a CUDA installation that actually used alternatives to put symlinks to some commands on the PATH.

Note also that /usr/local/cuda is a symlink; each CUDA version lives in its own distinct /usr/local/cuda-* directory.

The different versions can be installed simultaneously because they are in distinct paths. Only the /usr/local/cuda and alternatives symlinks point to one particular version.
If you have one binary compiled with 11.2 and another with 12.1, are you sure that you want to, and can, run both with the 12.1 libs?
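The "cannot open shared object file" error above can be diagnosed with ldd, which lists the shared objects a binary requires and where the dynamic loader resolves them; anything unresolvable is reported as "not found". A sketch (the pmemd.cuda path is a placeholder for your actual install):

```shell
# ldd prints each shared library a dynamically linked binary needs;
# /bin/ls is used here only as a stand-in that exists on any box.
# On the cluster you would point it at the failing binary instead:
#   ldd /path/to/pmemd.cuda | grep 'not found'
ldd /bin/ls
```

If the output names a libcudart or libcufft version that only shipped with the old toolkit, the binary was linked against 11.2 and will not start with only the 12.1 libraries installed.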

A SLURM job does inherit the submitter’s environment, but the job can adjust its environment too.

We have used environment modules; nowadays the Lmod Lua-based version from EPEL.
Yes, the job script has to have one extra line, like:

module load cuda/11.2

to set the environment for the CUDA application. But that way the job script also explicitly records which CUDA was (and is) used, and the user stays in control.
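A hypothetical job script using this approach (the job name, GPU request, module name, and pmemd.cuda arguments are all examples, not your actual setup):

```shell
#!/bin/bash
#SBATCH --job-name=amber-md       # example values throughout
#SBATCH --gres=gpu:1
#SBATCH --time=24:00:00

# Explicitly record which CUDA stack this job runs with.
module load cuda/11.2

pmemd.cuda -O -i md.in -p prmtop -c inpcrd -o md.out
```

With the module line in the script, the job no longer depends on whatever environment the submitter happened to have at submission time.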

Thank you very much! I was able to fix it by creating a script in /etc/profile.d and specifying the PATH there. I also installed CUDA 11.2 again (the Slurm errors were caused by a library that was updated in CUDA 12.1, so the system was not able to find the necessary files).