NVIDIA driver inside Docker broke after Rocky Linux 8.6 update

I build a Docker image with the NVIDIA driver and deploy it through Warewulf to the compute nodes. nvidia-smi works on the nodes, but further testing shows something is wrong:

>>> print(torch.cuda.is_available())
False

Two months ago I tried installing the drivers directly from the NVIDIA repository via dnf and it didn't work. Later I tried again, it worked perfectly, and I used that image until yesterday, when I rebuilt it with the repo driver and it stopped working again. I didn't keep track of the versions and ended up losing the image that worked.
Dockerfile I use for the official NVIDIA installer: https://raw.githubusercontent.com/luvres/hpc/master/dockerfiles/Dockerfile.r8ww-nvidia-slurm
Dockerfile for the NVIDIA driver from the repo: https://raw.githubusercontent.com/luvres/hpc/master/dockerfiles/Dockerfile.r8ww-nvrepo-slurm

The image worked with kernel 4.18.0-372.26.1.el8_6.x86_64, but with 4.18.0-372.32.1.el8_6.x86_64 the NVIDIA driver no longer works. Has anyone experienced this? Any help is welcome; I ended up losing the container that worked.
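For anyone hitting the same thing, a rough checklist (none of this is from the original post; the commands are standard driver/module diagnostics) to confirm it is a kernel/module mismatch rather than a CUDA or PyTorch problem:

    # Kernel the node is actually running
    uname -r

    # Version reported by the loaded NVIDIA module (file missing = module not loaded)
    cat /proc/driver/nvidia/version

    # Kernel the installed nvidia.ko was built against
    modinfo -F vermagic nvidia

    # Userspace driver version nvidia-smi is talking to
    nvidia-smi --query-gpu=driver_version --format=csv,noheader

    # Module load / version-mismatch errors
    dmesg | grep -i nvidia

If the module was built against 4.18.0-372.26.1 but the node now boots 4.18.0-372.32.1, the modules have to be rebuilt (or reinstalled) for the new kernel.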

[Attachments: NVIDIA repo dnf, NVIDIA download official site]

Looks like you need updated kernel modules for NVIDIA.

I use the drivers from RPM Fusion on Rocky 8.6; the kernel modules are rebuilt automatically whenever the kernel is updated.

  1. Add the RPM Fusion "free" and "nonfree" repositories: Configuration - RPM Fusion.
  2. Install the drivers: Howto/NVIDIA - RPM Fusion (see the sketch after this list).
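Roughly, on Rocky 8 that comes down to something like the following, run as root (the release URLs and package names are the usual ones from the linked RPM Fusion pages; confirm them there before using):

    # 1. EPEL plus the RPM Fusion free and nonfree release packages
    dnf install -y epel-release
    dnf install -y \
        https://mirrors.rpmfusion.org/free/el/rpmfusion-free-release-8.noarch.rpm \
        https://mirrors.rpmfusion.org/nonfree/el/rpmfusion-nonfree-release-8.noarch.rpm

    # PowerTools provides build dependencies the akmod needs on EL8
    dnf config-manager --enable powertools

    # 2. akmod-nvidia rebuilds the kernel module automatically on kernel updates;
    #    xorg-x11-drv-nvidia-cuda brings nvidia-smi and the CUDA userspace bits
    dnf install -y akmod-nvidia xorg-x11-drv-nvidia-cuda

The akmod rebuild needs kernel-devel for the running kernel, so keep that installed and matching whatever kernel the image boots.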

Oh, that's exactly what I need, thanks for the tip! I ended up brute-forcing an emergency fix: I pinned the host kernel with grubby --set-default "/boot/vmlinuz-4.18.0-372.26.1.el8_6.x86_64" and built the containers with kernel-core-$(uname -r) and kernel-modules-$(uname -r). I'll look into doing it with RPM Fusion.
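For reference, the stopgap in one place (a sketch; the pinned version is simply the last kernel the driver was built against, and kernel-devel is my own addition in case the driver modules are compiled inside the image):

    # On the build host: boot the kernel that matches the working NVIDIA build
    grubby --set-default "/boot/vmlinuz-4.18.0-372.26.1.el8_6.x86_64"
    reboot

    # In the Dockerfile: RUN executes against the build host's kernel, so
    # $(uname -r) now resolves to the pinned version and the container gets
    # matching kernel packages:
    # RUN dnf install -y kernel-core-$(uname -r) kernel-modules-$(uname -r) \
    #                    kernel-devel-$(uname -r)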
