I build a Docker image with the NVIDIA driver and import it into Warewulf for the compute nodes. The driver appears to work, since nvidia-smi runs, but on further testing something is wrong:
>>> print(torch.cuda.is_available())
False
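When nvidia-smi works but `torch.cuda.is_available()` returns False, two common causes are a CPU-only PyTorch wheel (`torch.version.cuda` is None) or a torch CUDA build newer than what the driver supports. A minimal sketch of that check, assuming you read the "CUDA Version" field from the nvidia-smi header yourself (the function names here are illustrative, not from any library):

```python
from typing import Optional


def runtime_compatible(driver_cuda: str, torch_cuda: Optional[str]) -> bool:
    """Return True if the driver's maximum supported CUDA version covers
    the CUDA version PyTorch was built against.

    driver_cuda: "CUDA Version" shown in the nvidia-smi header, e.g. "11.4"
    torch_cuda:  torch.version.cuda, e.g. "11.3"; None means a CPU-only build
    """
    if torch_cuda is None:
        # CPU-only wheel: no driver version will help, reinstall a CUDA build
        return False
    as_tuple = lambda v: tuple(int(x) for x in v.split("."))
    return as_tuple(driver_cuda) >= as_tuple(torch_cuda)
```

Inside the container you would compare `torch.version.cuda` against the version nvidia-smi reports; if `torch.version.cuda` is None the problem is the PyTorch install, not the driver.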
Two months ago I tried installing the drivers directly from the repository via dnf and it didn't work. Later I tried again and it worked perfectly, and I was using it until yesterday, when I rebuilt the image with the repo driver and it stopped working again. I neglected to record the versions and ended up losing the image that worked.
Dockerfile I use for the official NVIDIA installer: https://raw.githubusercontent.com/luvres/hpc/master/dockerfiles/Dockerfile.r8ww-nvidia-slurm
Dockerfile for the NVIDIA driver from the repo: https://raw.githubusercontent.com/luvres/hpc/master/dockerfiles/Dockerfile.r8ww-nvrepo-slurm
The image worked up to kernel 4.18.0-372.26.1.el8_6.x86_64; now, with 4.18.0-372.32.1.el8_6.x86_64, the NVIDIA driver no longer works. Has anyone experienced this? Any help is welcome.
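One likely explanation, given that the breakage lines up exactly with the kernel update: the NVIDIA kernel module (nvidia.ko) is built against a specific kernel, so a module built under 4.18.0-372.26.1 will not load on 4.18.0-372.32.1 unless it is rebuilt (e.g. via DKMS or a matching precompiled kmod). A quick way to confirm is to compare `uname -r` with the module's vermagic from `modinfo -F vermagic nvidia`. A sketch of that comparison (the function name is illustrative; you would feed it the two command outputs):

```python
def module_matches_kernel(running_kernel: str, vermagic: str) -> bool:
    """Return True if nvidia.ko was built for the running kernel.

    running_kernel: output of `uname -r`,
                    e.g. "4.18.0-372.32.1.el8_6.x86_64"
    vermagic:       output of `modinfo -F vermagic nvidia`, whose first
                    field is the kernel the module was built for, e.g.
                    "4.18.0-372.26.1.el8_6.x86_64 SMP mod_unload modversions"
    """
    # The vermagic string starts with the build kernel; extra flags follow
    return vermagic.split()[0] == running_kernel
```

If these differ, the fix is to rebuild the driver for the new kernel inside the image (or pin the kernel packages with `dnf versionlock` so an image update cannot silently pull in a new kernel).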
Driver sources I tried:
- NVIDIA repo via dnf
- NVIDIA official download site