Rocky 9.6 + Nvidia Drivers issue

Folks,

Maybe I’m a special case, but I have two RL9 systems and neither is able to use the nvidia/cuda drivers with the new 9.6 kernel. This may well be expected behavior, as I see the nvidia driver that I’m using is from a build in mid-May, and I’m trying to get it to work with the 9.6 kernels that were built June 2(?).

I’ve given up for the moment, and went back to the previous kernel with both:
kernel-5.14.0-503.40.1.el9_5.x86_64

but I’m wondering if there is a trick that will make the nvidia graphics drivers and cuda work with the kernel:
kernel-5.14.0-570.18.1.el9_6.x86_64

If we should just wait for NVIDIA to catch up, I won’t spend more of my time trying to get these to work.

Cheers,
Mike

NVidia has three sets of RPMs in their repo:

  • Proprietary driver that depends on dkms to build module for new kernel
  • Open source driver that depends on dkms to build module for new kernel
  • Prebuilt kernel modules for some kernel versions

Both package sets that use dkms should install fine on el9_6.

Where are you finding your nvidia drivers?

I don’t know if this helps, but I am able to use cuda with the 9.6 kernel.

5.14.0-570.18.1

nvidia-driver-cuda.x86_64           3:575.57.08-1.el9   @cuda-rhel9-x86_64
nvidia-driver-cuda-libs.x86_64      3:575.57.08-1.el9   @cuda-rhel9-x86_64
dnf-plugin-nvidia.noarch            2.2-2.el9           @cuda-rhel9-x86_64
kmod-nvidia-latest-dkms.x86_64      3:575.57.08-1.el9   @cuda-rhel9-x86_64
libnvidia-cfg.x86_64                3:575.57.08-1.el9   @cuda-rhel9-x86_64
libnvidia-fbc.x86_64                3:575.57.08-1.el9   @cuda-rhel9-x86_64
libnvidia-gpucomp.x86_64            3:575.57.08-1.el9   @cuda-rhel9-x86_64
libnvidia-ml.x86_64                 3:575.57.08-1.el9   @cuda-rhel9-x86_64
nvidia-driver.x86_64                3:575.57.08-1.el9   @cuda-rhel9-x86_64
nvidia-driver-libs.x86_64           3:575.57.08-1.el9   @cuda-rhel9-x86_64
nvidia-kmod-common.noarch           3:575.57.08-1.el9   @cuda-rhel9-x86_64
nvidia-libXNVCtrl.x86_64            3:575.57.08-1.el9   @cuda-rhel9-x86_64
nvidia-libXNVCtrl-devel.x86_64      3:575.57.08-1.el9   @cuda-rhel9-x86_64
nvidia-modprobe.x86_64              3:575.57.08-1.el9   @cuda-rhel9-x86_64
nvidia-persistenced.x86_64          3:575.57.08-1.el9   @cuda-rhel9-x86_64
nvidia-settings.x86_64              3:575.57.08-1.el9   @cuda-rhel9-x86_64
nvidia-xconfig.x86_64               3:575.57.08-1.el9   @cuda-rhel9-x86_64
xorg-x11-nvidia.x86_64              3:575.57.08-1.el9   @cuda-rhel9-x86_64

Hey jlehtone and Sheldon,

I’m using the open source version of the driver, which I’ve been using for well over a year now, installed (from guidance given by nvidia-driver-assistant) with the command:

dnf -y module install nvidia-driver:open-dkms

These continue to work fine with the previous kernel, but X will not start with the 9.6 kernel. It’s not an installation problem, per se. The new kernel boots, but the nvidia loadable kernel modules do not load with the 9.6 kernel. If I reboot with the 9.6 kernel, then (logging in from a different machine) check to see if the nvidia modules loaded, they did not. On the 9.5 kernels, if I issue the command

lsmod | grep nvidia

I see a host of nvidia modules loaded:

nvidia_uvm 4087808 0
nvidia_drm 151552 17
nvidia_modeset 1732608 63 nvidia_drm
nvidia 11579392 729 nvidia_uvm,nvidia_modeset
drm_kms_helper 274432 2 nvidia_drm
drm 782336 21 drm_kms_helper,nvidia,nvidia_drm
video 73728 2 asus_wmi,nvidia_modeset

On the 9.6 kernel, none of these are loaded.
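A slightly stricter version of that check, as a minimal sketch: `lsmod | grep nvidia` may also match the in-tree `nvidia_wmi_ec_backlight` module, which is unrelated to the GPU driver, so this looks only for the core `nvidia` module. The function name `nvidia_loaded` is made up for this example, and the optional file argument exists only so the check can be tried against a saved module list.

```shell
#!/usr/bin/env bash
# Report whether the core nvidia module appears in a module list.
# The first argument is a file in /proc/modules format; it defaults to the
# live /proc/modules. (The function name is hypothetical.)
nvidia_loaded() {
    local modlist="${1:-/proc/modules}"
    # Each /proc/modules line starts with the module name followed by a space,
    # so '^nvidia ' does not match nvidia_drm, nvidia_uvm and friends.
    grep -q '^nvidia ' "$modlist"
}

if nvidia_loaded; then
    echo "nvidia module is loaded on $(uname -r)"
else
    echo "nvidia module is NOT loaded on $(uname -r)"
fi
```

Run it once on the working kernel and once on the failing one to compare.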

I’ve rebooted into the 9.6 kernel, and removed and reinstalled the nvidia drivers, and still no joy. X will not start, and I suspect there will be no cuda occurring either.

Sheldon, your listing did help, at least to confirm I am using the same rpm packages except for one. This might be what the issue is. That package difference is the kmod-nvidia* package. You’re using “latest”, and I’m using “open”.

Mine:

kmod-nvidia-open-dkms.noarch 3:575.57.08-1.el9 @cuda-rhel9-x86_64

Yours:

kmod-nvidia-latest-dkms.x86_64 3:575.57.08-1.el9 @cuda-rhel9-x86_64

I have found the page in the nvidia site that shows how to switch between the two, and I got the switch completed. I’m about to reboot with the “latest-dkms” as the choice, and hopefully that will do the trick. If it does I’ll let you know…

Thanks much!
Mike

EDIT: rebooted, and still no joy, but I could boot with the latest-dkms nvidia driver installed on that system with the 9.5 kernels. I may have muffed something in my thrashes, but I now believe that others are not finding a problem getting the system to work with the nvidia drivers under the 9.6 kernels.

Thanks much for the help. I’ll try to run this to ground and share with the community. Cheers all…

Hi,

not sure if this is the reason why in Rocky / RHEL 10 the “dnf module” was removed, but after almost every upgrade I have to reinstall the nvidia drivers, so I know the commands by heart. I had some problems with epel-multimedia, hence I disable it for the update. Ignore it if you haven’t installed epel-multimedia. If you get a “transaction error”, just remove all the packages listed and reinstall them later, if needed.

In my experience 97% of the problems on a workstation come from nvidia. Don’t fight it! Embrace it!

sudo dnf config-manager --set-enabled crb
sudo dnf -y config-manager --add-repo https://developer.download.nvidia.com/compute/cuda/repos/rhel9/x86_64/cuda-rhel9.repo
sudo dnf config-manager --set-disabled epel-multimedia
sudo dnf -y remove cuda-toolkit
sudo dnf -y remove cuda\*
sudo dnf -y remove nvidia-driver
sudo dnf -y remove nvidia\*
sudo dnf -y module reset nvidia-driver
sudo dnf -y upgrade
# reboot after the upgrade so open-dkms gets built against the new kernel and kernel-headers.
sudo reboot
sudo dnf -y module install nvidia-driver:open-dkms
# reboot to get your screen back
sudo reboot
sudo dnf config-manager --set-enabled epel-multimedia
sudo dnf -y upgrade

The “latest” is proprietary, the “open” is open. The proprietary driver does not work with Blackwell (e.g. RTX 5xxx cards). The open driver does not work with cards older than Turing. If the open driver worked previously, then your card is recent enough for it.


What kernel packages do you have?

dnf rq --installonly

What NVidia kernel modules do you have?

find /lib/modules -type f -name "nvidia*.ko.xz"

You can tell the system to go to text mode in boot with:

systemctl set-default multi-user.target

(and revert to “start GUI on boot” with systemctl set-default graphical.target)

If you have nvidia modules in the el9_6 kernel’s dirs, then I would boot to text mode and ask:

modinfo nvidia
modprobe nvidia

If there are no nvidia modules for the new kernel, then the question is: why did dkms fail to build them?
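The “which kernels have nvidia modules” question can be answered in one go. This is only a sketch: it assumes dkms installs the module as extra/nvidia.ko.xz under each kernel’s directory (which is where such modules usually land on EL9), and the function name `kernels_missing_nvidia` is hypothetical.

```shell
#!/usr/bin/env bash
# List kernels under a modules root that have no built nvidia.ko.xz.
# Assumption: dkms installs the module as <root>/<kernel>/extra/nvidia.ko.xz.
kernels_missing_nvidia() {
    local root="${1:-/lib/modules}"
    local dir
    for dir in "$root"/*/; do
        # Skip the unexpanded glob when the root is empty or missing.
        [ -d "$dir" ] || continue
        if [ ! -e "${dir}extra/nvidia.ko.xz" ]; then
            basename "$dir"
        fi
    done
}

kernels_missing_nvidia

# If the running kernel shows up in that list, the usual next checks
# (run by hand on the real system) are:
#   rpm -q "kernel-devel-$(uname -r)"   # are headers for this kernel installed?
#   dkms status                          # what does dkms think it has built?
```

A kernel with no matching kernel-devel package is one common reason for dkms having nothing to build against, though it is not the only one.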


I have the “open” 575.57.08 on one AlmaLinux 9.6 machine and 570 on multiple others. Rocky’s kernel should be at least as good as the AlmaLinux kernel (and probably even more compatible with the RHEL kernel).


The DNF modules are a nice idea, but the implementation was far from great. Red Hat went totally overboard with modules in RHEL 8, managed to tone it down for RHEL 9, and by the development of RHEL 10 concluded that modules are more trouble than benefit.

kernel-0:5.14.0-503.38.1.el9_5.x86_64
kernel-0:5.14.0-503.40.1.el9_5.x86_64
kernel-0:5.14.0-570.18.1.el9_6.x86_64
kernel-core-0:5.14.0-503.38.1.el9_5.x86_64
kernel-core-0:5.14.0-503.40.1.el9_5.x86_64
kernel-core-0:5.14.0-570.18.1.el9_6.x86_64
kernel-devel-0:5.14.0-503.38.1.el9_5.x86_64
kernel-devel-0:5.14.0-503.40.1.el9_5.x86_64
kernel-devel-0:5.14.0-570.18.1.el9_6.x86_64
kernel-modules-0:5.14.0-503.38.1.el9_5.x86_64
kernel-modules-0:5.14.0-503.40.1.el9_5.x86_64
kernel-modules-0:5.14.0-570.18.1.el9_6.x86_64
kernel-modules-core-0:5.14.0-503.38.1.el9_5.x86_64
kernel-modules-core-0:5.14.0-503.40.1.el9_5.x86_64
kernel-modules-core-0:5.14.0-570.18.1.el9_6.x86_64
kernel-modules-extra-0:5.14.0-503.38.1.el9_5.x86_64
kernel-modules-extra-0:5.14.0-503.40.1.el9_5.x86_64
kernel-modules-extra-0:5.14.0-570.18.1.el9_6.x86_64

/lib/modules/5.14.0-427.16.1.el9_4.x86_64/extra/nvidia.ko.xz
/lib/modules/5.14.0-427.16.1.el9_4.x86_64/extra/nvidia-modeset.ko.xz
/lib/modules/5.14.0-427.16.1.el9_4.x86_64/extra/nvidia-drm.ko.xz
/lib/modules/5.14.0-427.16.1.el9_4.x86_64/extra/nvidia-uvm.ko.xz
/lib/modules/5.14.0-427.16.1.el9_4.x86_64/extra/nvidia-peermem.ko.xz
/lib/modules/5.14.0-427.18.1.el9_4.x86_64/extra/nvidia.ko.xz
/lib/modules/5.14.0-427.18.1.el9_4.x86_64/extra/nvidia-modeset.ko.xz
/lib/modules/5.14.0-427.18.1.el9_4.x86_64/extra/nvidia-drm.ko.xz
/lib/modules/5.14.0-427.18.1.el9_4.x86_64/extra/nvidia-uvm.ko.xz
/lib/modules/5.14.0-427.18.1.el9_4.x86_64/extra/nvidia-peermem.ko.xz
/lib/modules/5.14.0-427.22.1.el9_4.x86_64/extra/nvidia.ko.xz
/lib/modules/5.14.0-427.22.1.el9_4.x86_64/extra/nvidia-modeset.ko.xz
/lib/modules/5.14.0-427.22.1.el9_4.x86_64/extra/nvidia-drm.ko.xz
/lib/modules/5.14.0-427.22.1.el9_4.x86_64/extra/nvidia-uvm.ko.xz
/lib/modules/5.14.0-427.22.1.el9_4.x86_64/extra/nvidia-peermem.ko.xz
/lib/modules/5.14.0-427.24.1.el9_4.x86_64/extra/nvidia.ko.xz
/lib/modules/5.14.0-427.24.1.el9_4.x86_64/extra/nvidia-modeset.ko.xz
/lib/modules/5.14.0-427.24.1.el9_4.x86_64/extra/nvidia-drm.ko.xz
/lib/modules/5.14.0-427.24.1.el9_4.x86_64/extra/nvidia-uvm.ko.xz
/lib/modules/5.14.0-427.24.1.el9_4.x86_64/extra/nvidia-peermem.ko.xz
/lib/modules/5.14.0-427.28.1.el9_4.x86_64/extra/nvidia-fs.ko.xz
/lib/modules/5.14.0-427.31.1.el9_4.x86_64/extra/nvidia.ko.xz
/lib/modules/5.14.0-427.31.1.el9_4.x86_64/extra/nvidia-modeset.ko.xz
/lib/modules/5.14.0-427.31.1.el9_4.x86_64/extra/nvidia-drm.ko.xz
/lib/modules/5.14.0-427.31.1.el9_4.x86_64/extra/nvidia-uvm.ko.xz
/lib/modules/5.14.0-427.31.1.el9_4.x86_64/extra/nvidia-peermem.ko.xz
/lib/modules/5.14.0-503.38.1.el9_5.x86_64/kernel/drivers/platform/x86/nvidia-wmi-ec-backlight.ko.xz
/lib/modules/5.14.0-503.40.1.el9_5.x86_64/kernel/drivers/platform/x86/nvidia-wmi-ec-backlight.ko.xz
/lib/modules/5.14.0-503.40.1.el9_5.x86_64/extra/nvidia.ko.xz
/lib/modules/5.14.0-503.40.1.el9_5.x86_64/extra/nvidia-modeset.ko.xz
/lib/modules/5.14.0-503.40.1.el9_5.x86_64/extra/nvidia-drm.ko.xz
/lib/modules/5.14.0-503.40.1.el9_5.x86_64/extra/nvidia-uvm.ko.xz
/lib/modules/5.14.0-503.40.1.el9_5.x86_64/extra/nvidia-peermem.ko.xz
/lib/modules/5.14.0-570.18.1.el9_6.x86_64/kernel/drivers/platform/x86/nvidia-wmi-ec-backlight.ko.xz

And booted to the 9.6 kernel, no nvidia modules were loaded. I then did the delete them all/reinstall them all step I use:

dnf module remove --all nvidia-driver
dnf module install nvidia-driver:latest-dkms

… and voila, the drivers appeared.

I had tried this before, in my own version of it: with the 9.6 kernel booted but X not running (so essentially text mode), I logged in from another machine and did the remove/reinstall. I would then just reboot w/o doing any checks like the “modinfo nvidia” or my “lsmod | grep nvidia”, and it never came back up in graphical mode.

Those attempts were with the open version of the driver, whereas this success was with the proprietary (“latest”) version of the drivers. I am about to try this with the system that still has the open version, but I will then lose this window. I am going to do the same cleaner boot into “multi-user.target”, and see if it works. If it does, then it has nothing to do with “latest” vs “open”…

More on the flipside. But got one system up!

Much, much thanks!

Mike

And did the same twostep, but with the “open” variant. And it is running just fine. I have not yet rebooted, but I suspect it will work w/o issue given it switched runlevels w/o issue.

I am wondering if I did not do this flush and re-install a few times unsuccessfully in the booted 9.6 kernel, but I have to admit that’s a possibility.

Confused, but happy to have the systems back and running under the newest kernel, with the nvidia driver running. These are two different pieces of hardware, and neither worked until I did the flush/reinstall step while running in text mode (runlevel 3/multi-user.target). I thought I did the equivalent, but something changed between the two. I also don’t understand why neither system just worked with the new kernel.

Regardless: Much, much thanks.
Systems working great now.

Mike

Robbott, you have a much more involved flush and re-install script. Mine is just:

dnf module remove --all nvidia-driver
dnf module install nvidia-driver:latest-dkms

or, for the open driver

dnf module remove --all nvidia-driver
dnf module install nvidia-driver:open-dkms

Really good to have these in a script, just for repeatability. I did have to add the reset when I switched from “open” to “latest”, so I suspect your version is more complete than mine.
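For repeatability, the two variants can be folded into one small script. This is a sketch only: it prints the commands rather than running them, it includes the module reset that turned out to be needed when switching streams, and it accepts only the two streams discussed in this thread. The function name `nvidia_reinstall_cmds` is made up.

```shell
#!/usr/bin/env bash
# Print the flush/reinstall commands for a given nvidia-driver module stream.
# Printing keeps this a dry run; review the output before executing it as root.
nvidia_reinstall_cmds() {
    local stream="${1:-open-dkms}"
    case "$stream" in
        open-dkms|latest-dkms) ;;   # only the two streams from this thread
        *) echo "unsupported stream: $stream" >&2; return 1 ;;
    esac
    echo "dnf -y module remove --all nvidia-driver"
    echo "dnf -y module reset nvidia-driver"
    echo "dnf -y module install nvidia-driver:$stream"
}

# Default to the open driver when no stream is given on the command line.
nvidia_reinstall_cmds "${1:-open-dkms}"
```

Saved as, say, nvidia-reinstall.sh (a hypothetical name), running it with latest-dkms as the argument prints the three commands for the proprietary driver. Going by the experience above, it seems worth running them from text mode (multi-user.target) rather than from a half-started graphical session.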

Much thanks for your help in this. I’m still baffled by what happened for me to require all my gyrations. My money now would be on me doing something wrong as I was booting both of these after having them non-running last week…

Cheers,
Mike

And now I am adding additional details, as the problems re-appeared. On one of the machines I had an issue when the next kernel level arrived. After much fooling around, on this newer machine, neither the open nor the “latest” nvidia drivers will work.

On the earlier kernels, the drivers did work just fine (both types). So I now have two machines, and one had no problem upgrading through the entire chain of the kernels I’ve seen:

kernel-5.14.0-570.18.1.el9_6.x86_64
kernel-5.14.0-570.19.1.el9_6.x86_64
kernel-5.14.0-570.21.1.el9_6.x86_64

and the second (more recent motherboard) I could only get to work through

kernel-5.14.0-570.18.1.el9_6.x86_64
kernel-5.14.0-570.19.1.el9_6.x86_64

That machine is now happily running on kernel-5.14.0-570.21.1.el9_6.x86_64, using the nouveau driver, with the cuda rpms still installed. Haven’t tried to play with the cuda bit. Likely will just wait for the next kernel level.

I can share more details about differences between the machine that’s currently fine, and the one that’s not, but not sure if anyone cares. But I do suspect others might run into the same issue right now.

Much thanks for the help.
Cheers!
Mike