CUDA Toolkit installed but nvcc not found

I am managing a Rocky Linux 8.7 computing cluster and was forced to install updates due to complications, which broke the NVIDIA drivers. That has now been mostly fixed, and 'nvidia-smi' is producing the expected output again.

However, running 'nvcc --version' does not work:

[admin@cluster~]$ nvcc --version
-bash: nvcc: command not found

I can make it work by running

export PATH=/usr/local/cuda/bin:$PATH
source ~/.bashrc

But this only lasts until I log off; the next time I log in it doesn't work anymore. There are also multiple users on this cluster, so I want a solution that works for all of them.

When they try to schedule jobs (via Slurm), they get the following error:

/$PATH/pmemd.cuda: error while loading shared libraries: cannot open shared object file: No such file or directory

Additional info:

[admin@cluster~]$ uname -r
[admin@cluster~]$ nvidia-smi
NVIDIA-SMI 530.30.02              Driver Version: 530.30.02    CUDA Version: 12.1

Edit: I suspect it’s mainly a problem with the PATH variables, but I’m not sure how to fix this other than editing the .bashrc file for every user. What leads me to this suspicion is this:

[admin@cluster~]$ echo $LD_LIBRARY_PATH

as there is no /usr/local/cuda-11.2 anymore because I upgraded to 12.1

Assistance with solving this is highly appreciated!

bash sources /etc/profile, which in turn sources all of the /etc/profile.d/*.sh files.
Adding a file there adds it "for everyone".
Note how /etc/profile defines a pathmunge function for those files to use. Convenient, although only for the PATH.
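A minimal sketch of such a drop-in file (the filename /etc/profile.d/cuda.sh is my choice, and the pathmunge fallback mirrors the helper /etc/profile provides, so the file also works when sourced on its own):

```shell
# /etc/profile.d/cuda.sh -- hypothetical drop-in, sourced by every
# login shell via /etc/profile.

# pathmunge is normally defined by /etc/profile; provide a minimal
# fallback with the same behavior: add a directory to PATH only if
# it is not already there.
if ! type pathmunge >/dev/null 2>&1; then
    pathmunge () {
        case ":${PATH}:" in
            *:"$1":*) ;;                  # already on PATH: do nothing
            *)
                if [ "$2" = "after" ]; then
                    PATH="$PATH:$1"       # append
                else
                    PATH="$1:$PATH"       # prepend (default)
                fi
                ;;
        esac
    }
fi

pathmunge /usr/local/cuda/bin
export PATH

# pathmunge only handles PATH; the library path must be set directly.
export LD_LIBRARY_PATH=/usr/local/cuda/lib64${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}
```

Because /usr/local/cuda is a symlink, this file keeps working across toolkit upgrades as long as the symlink is repointed.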

I did recently see a CUDA installation that actually used alternatives to put symlinks to some commands on the PATH.

Note also that /usr/local/cuda is a symlink; each CUDA version lives in its own distinct /usr/local/cuda-* directory.

The different versions can be installed simultaneously because they are in distinct paths. Only the /usr/local/cuda and alternatives symlinks point to one particular version.
If you have one binary compiled with 11.2 and another with 12.1, are you sure that you want to, and can, run both with the 12.1 libs?
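The "cannot open shared object file" error above can be diagnosed with ldd, which lists the shared objects a binary requires and where the dynamic loader resolves them; anything unresolvable is reported as "not found". A sketch (the pmemd.cuda path is a placeholder for your actual install):

```shell
# ldd prints each shared library a dynamically linked binary needs;
# /bin/ls is used here only as a stand-in that exists on any box.
# On the cluster you would point it at the failing binary instead:
#   ldd /path/to/pmemd.cuda | grep 'not found'
ldd /bin/ls
```

If the output names a libcudart or libcufft version that only shipped with the old toolkit, the binary was linked against 11.2 and will not start with only the 12.1 libraries installed.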

A SLURM job does inherit the submitter’s environment, but the job can adjust its environment too.

We have used environment modules; nowadays the Lmod Lua-based version from EPEL.
Yes, the job script has to have one extra line, like:

module load cuda/11.2

to set the environment for the CUDA application. But that way the job script also explicitly records which CUDA was (and is) used, and the user stays in control.
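A hypothetical job script using this approach (the job name, GPU request, module name, and pmemd.cuda arguments are all examples, not your actual setup):

```shell
#!/bin/bash
#SBATCH --job-name=amber-md       # example values throughout
#SBATCH --gres=gpu:1
#SBATCH --time=24:00:00

# Explicitly record which CUDA stack this job runs with.
module load cuda/11.2

pmemd.cuda -O -i md.in -p prmtop -c inpcrd -o md.out
```

With the module line in the script, the job no longer depends on whatever environment the submitter happened to have at submission time.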

Thank you very much! I was able to fix it by creating a script in /etc/profile.d and specifying the PATH there. I also installed CUDA 11.2 again (the Slurm errors were caused by a library that was updated in CUDA 12.1, so the system was not able to find the necessary files).