CUDA Toolkit installed but nvcc not found

I am managing a Rocky Linux 8.7 computing cluster and was forced to install updates due to complications, which broke the NVIDIA drivers. That has now been mostly fixed, and 'nvidia-smi' is producing the expected output again.

However, running 'nvcc --version' does not work:

[admin@cluster~]$ nvcc --version
-bash: nvcc: command not found

I can make it work by running

export PATH=/usr/local/cuda/bin:$PATH
source ~/.bashrc

But this only lasts until I log off; the next time I log in it doesn't work anymore. There are also multiple users on this cluster, so I want a solution that works for all of them.

When they try to schedule jobs (via Slurm), they get the following error:

/$PATH/pmemd.cuda: error while loading shared libraries: cannot open shared object file: No such file or directory

Additional info:

[admin@cluster~]$ uname -r
[admin@cluster~]$ nvidia-smi
NVIDIA-SMI 530.30.02              Driver Version: 530.30.02    CUDA Version: 12.1

Edit: I suspect it’s mainly a problem with the PATH variables, but I’m not sure how to fix this other than editing the .bashrc file for every user. What leads me to this suspicion is this:

[admin@cluster~]$ echo $LD_LIBRARY_PATH

as there is no /usr/local/cuda-11.2 anymore because I upgraded to 12.1

Assistance with solving this is highly appreciated!

bash sources /etc/profile, which in turn sources all of the /etc/profile.d/*.sh files.
Adding a file there adds it "for everyone".
Note how /etc/profile defines a pathmunge function for those files to use. Convenient, although only for the PATH.
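A minimal sketch of such a drop-in file (the filename /etc/profile.d/cuda.sh is my choice, and the pathmunge fallback mirrors the helper /etc/profile provides, so the file also works when sourced on its own):

```shell
# /etc/profile.d/cuda.sh -- hypothetical drop-in, sourced by every
# login shell via /etc/profile.

# pathmunge is normally defined by /etc/profile; provide a minimal
# fallback with the same behavior: add a directory to PATH only if
# it is not already there.
if ! type pathmunge >/dev/null 2>&1; then
    pathmunge () {
        case ":${PATH}:" in
            *:"$1":*) ;;                  # already on PATH: do nothing
            *)
                if [ "$2" = "after" ]; then
                    PATH="$PATH:$1"       # append
                else
                    PATH="$1:$PATH"       # prepend (default)
                fi
                ;;
        esac
    }
fi

pathmunge /usr/local/cuda/bin
export PATH

# pathmunge only handles PATH; the library path must be set directly.
export LD_LIBRARY_PATH=/usr/local/cuda/lib64${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}
```

Because /usr/local/cuda is a symlink, this file keeps working across toolkit upgrades as long as the symlink is repointed.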

I did recently see a CUDA installation that actually used alternatives to put symlinks to some commands on the PATH.

Note also that /usr/local/cuda is a symlink; each CUDA version lives in its own distinct /usr/local/cuda-* directory.

The different versions can be installed simultaneously because they are in distinct paths. Only the /usr/local/cuda and alternatives symlinks point to one particular version.
If you have one binary compiled with 11.2 and another with 12.1, are you sure that you want to, and can, run both with the 12.1 libs?
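The "cannot open shared object file" error above can be diagnosed with ldd, which lists the shared objects a binary requires and where the dynamic loader resolves them; anything unresolvable is reported as "not found". A sketch (the pmemd.cuda path is a placeholder for your actual install):

```shell
# ldd prints each shared library a dynamically linked binary needs;
# /bin/ls is used here only as a stand-in that exists on any box.
# On the cluster you would point it at the failing binary instead:
#   ldd /path/to/pmemd.cuda | grep 'not found'
ldd /bin/ls
```

If the output names a libcudart or libcufft version that only shipped with the old toolkit, the binary was linked against 11.2 and will not start with only the 12.1 libraries installed.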

A SLURM job does inherit the submitter’s environment, but the job can adjust its environment too.

We have used environment modules; nowadays the Lmod Lua-based version from EPEL.
Yes, the job script has to have one extra line, like:

module load cuda/11.2

to set the environment for the CUDA application. But that way the job script also explicitly records which CUDA was (and is) used, and the user stays in control.
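A hypothetical job script using this approach (the job name, GPU request, module name, and pmemd.cuda arguments are all examples, not your actual setup):

```shell
#!/bin/bash
#SBATCH --job-name=amber-md       # example values throughout
#SBATCH --gres=gpu:1
#SBATCH --time=24:00:00

# Explicitly record which CUDA stack this job runs with.
module load cuda/11.2

pmemd.cuda -O -i md.in -p prmtop -c inpcrd -o md.out
```

With the module line in the script, the job no longer depends on whatever environment the submitter happened to have at submission time.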

Thank you very much! I was able to fix it by creating a script in /etc/profile.d and specifying the PATH there. I also installed CUDA 11.2 again (the Slurm errors were caused by a library that was updated in CUDA 12.1, so the system was not able to find the necessary files).