Installing additional CUDA versions

I want to install several software packages that rely on PyTorch, dgl-cuda and CUDA, specifically CUDA versions 11.6-11.8, but the only CUDA versions I currently have installed are 11.2, 12.1 and 12.2.

I read that having multiple CUDA versions installed usually isn’t a problem, but I want to ensure that 12.1 remains the “main” one in use and that installing a new one doesn’t brick the system. When I tried to install it with the runfile, it showed:

Existing package manager installation of the driver found. It is strongly recommended that you remove this before continuing.

And when I tried the rpm instead I got this:

[admin@cluster newcuda]$ sudo dnf -y module install nvidia-driver:latest-dkms
Warning: failed loading '/etc/yum.repos.d/oneAPI.repo', skipping.
Rocky Linux 8 - AppStream                        17 MB/s |  11 MB     00:00    
Rocky Linux 8 - BaseOS                           14 MB/s | 7.1 MB     00:00    
Rocky Linux 8 - PowerTools - Source             1.8 MB/s | 655 kB     00:00    
Rocky Linux 8 - Extras                           53 kB/s |  14 kB     00:00    
Rocky Linux 8 - PowerTools                      5.7 MB/s | 2.8 MB     00:00    
Rocky Linux 8 - PowerTools - Source             557 kB/s | 197 kB     00:00    
cuda-rhel8-x86_64                                16 MB/s | 2.7 MB     00:00    
cuda-rhel8-11-1-local                            26 MB/s |  70 kB     00:00    
cuda-rhel8-11-2-local                            30 MB/s |  72 kB     00:00    
cuda-rhel8-11-7-local                            44 MB/s |  87 kB     00:00    
cuda-rhel8-12-1-local                            36 MB/s |  94 kB     00:00    
ELRepo.org Community Enterprise Linux Repositor 399 kB/s | 243 kB     00:00    
Extra Packages for Enterprise Linux 8 - x86_64   12 MB/s |  16 MB     00:01    
Extra Packages for Enterprise Linux 8 - Next -  1.4 MB/s | 368 kB     00:00    
NVIDIA HPC SDK                                   19 MB/s | 3.1 MB     00:00    
NOTE: Skipping kernel installation since no kernel module package kmod-nvidia-530.30.02-4.18.0-477.27.1 for kernel version 4.18.0-477.27.1.el8_8 and NVIDIA driver 535.86.10 could be found
Error: 
 Problem: problem with installed package kmod-nvidia-535.86.10-4.18.0-477.21.1-3:535.86.10-3.el8_8.x86_64
  - package kmod-nvidia-535.86.10-4.18.0-477.21.1-3:535.86.10-3.el8_8.x86_64 conflicts with kmod-nvidia-latest-dkms provided by kmod-nvidia-latest-dkms-3:535.104.12-1.el8.x86_64
  - cannot install the best candidate for the job
(try to add '--allowerasing' to command line to replace conflicting packages or '--skip-broken' to skip uninstallable packages or '--nobest' to use not only best candidate packages)

I fear this might cause problems, because it wants to erase the kernel drivers for 12.1 (?)

Do you have a suggestion on how I can proceed? Otherwise I’d have to ask on the NVIDIA forum.

The software I was trying to install is called RFdiffusion, and the error I got said:

File "/software/anaconda/envs/SE3nv/lib/python3.9/site-packages/torch/cuda/nvtx.py", line 9, in _fail
    raise RuntimeError("NVTX functions not installed. Are you sure you have a CUDA build?")
RuntimeError: NVTX functions not installed. Are you sure you have a CUDA build?

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.

First, avoid the -y. It is easy to press y if the proposed transaction looks OK; it is much harder to undo things you auto-accepted.

You say “CUDA”, but your command tries to install nvidia-driver. Every CUDA version requires the GPU driver, so you must already have it, and a driver that runs CUDA 12.x also supports the older 11.x toolkits. Install the specific CUDA toolkit instead, for example the package cuda-toolkit-11-8.
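As a rough sketch of that advice, the snippet below turns a version such as 11.8 into the toolkit package name and previews the dnf transaction without applying it. The function names cuda_toolkit_pkg and preview_cuda_install are made up for illustration; only the package name cuda-toolkit-11-8 comes from the text above, and the same naming is assumed for other versions.

```shell
# Hypothetical helper: map a CUDA version ("11.8") to the toolkit
# package name ("cuda-toolkit-11-8"). Assumes the cuda-toolkit-X-Y naming.
cuda_toolkit_pkg() {
    # Replace every dot in the version with a dash: 11.8 -> 11-8
    printf 'cuda-toolkit-%s\n' "$(printf '%s' "$1" | sed 's/\./-/g')"
}

# Hypothetical helper: show what dnf would do, without doing it.
# --assumeno answers "no" to the confirmation prompt, so the transaction
# is printed and then aborted.
preview_cuda_install() {
    sudo dnf install --assumeno "$(cuda_toolkit_pkg "$1")"
}

# Example (nothing here runs automatically):
#   preview_cuda_install 11.8
#   sudo dnf install cuda-toolkit-11-8    # then read the list and press y by hand
```

If the preview lists changes to nvidia-driver or kmod-nvidia packages, that is a reason to stop and look closer before confirming.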

NVIDIA’s CUDA packages install under /usr/local (which IMHO is a slightly odd place for files from RPM packages).

$ ls -ld /usr/local/cuda*
lrwxrwxrwx.  1 root root   22 17. 7. 11:40 /usr/local/cuda -> /etc/alternatives/cuda
lrwxrwxrwx.  1 root root   25 17. 7. 11:40 /usr/local/cuda-11 -> /etc/alternatives/cuda-11
drwxr-xr-x. 16 root root 4096 17. 7. 11:40 /usr/local/cuda-11.8

Note how the two “defaults” (cuda and cuda-11) are symlinks to alternatives.

$ ls -l /etc/alternatives/cuda*
lrwxrwxrwx. 1 root root 20 17. 7. 11:40 /etc/alternatives/cuda -> /usr/local/cuda-11.8
lrwxrwxrwx. 1 root root 20 17. 7. 11:40 /etc/alternatives/cuda-11 -> /usr/local/cuda-11.8

There is a command to maintain links in alternatives. See man alternatives.

The main point is that if your current applications look at /usr/local/cuda and installing another CUDA version updates those alternatives, you can restore the previous links afterwards.
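A minimal sketch of how one might check where a default link ends up and then put it back. The helper cuda_default_target is hypothetical, and the path /usr/local/cuda-12.1 in the comments is an assumption about where the 12.1 toolkit lives on the asker's system; check the output of alternatives --display first.

```shell
# Hypothetical helper: print the versioned directory that a CUDA "default"
# link (cuda, cuda-11, ...) finally resolves to. The optional second argument
# is the parent directory; it defaults to /usr/local and exists only so the
# function can be tried on a scratch tree.
cuda_default_target() {
    link="${2:-/usr/local}/$1"
    if [ -L "$link" ]; then
        # readlink -f follows the whole chain, e.g.
        # /usr/local/cuda -> /etc/alternatives/cuda -> /usr/local/cuda-11.8
        readlink -f "$link"
    else
        echo "$link is not a symlink" >&2
        return 1
    fi
}

# Example (nothing here runs automatically):
#   cuda_default_target cuda             # where does the default point now?
#   alternatives --display cuda          # which candidates are registered?
#   sudo alternatives --config cuda      # choose one interactively
#   sudo alternatives --set cuda /usr/local/cuda-12.1   # assumed path, see above
```

Running the check before and after installing a new toolkit shows whether the default has moved.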
