Problem making nvidia driver work on 9.3 on kernel 5.14.0-362.18.1.el9_3.0.1.x86_64

After an upgrade I had problems getting the nvidia drivers to work for a GeForce RTX 3050. After a frustrating day of fiddling around, I decided to install Rocky 9.3 from scratch, but with a similar outcome.

I tried (as root):

dnf install epel-release
curver="rhel$(rpm -E %rhel)"
wget -O /etc/yum.repos.d/cuda-$curver.repo   http://developer.download.nvidia.com/compute/cuda/repos/$curver/$(uname -i)/cuda-$curver.repo
crb enable
dnf update -y
dnf module install nvidia-driver:latest

and it semi-works. On a single (non-4K) monitor I get output that looks fine, but on a 4K monitor the resolution is set to 1024x768 and I cannot change it. In any case, I can only make one monitor work. Interestingly, if I install

dnf module install nvidia-driver:latest

instead, I only get a black screen. I have the impression that the nvidia driver is not working at all.

nvidia-smi

gives me

NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.

and

lshw -C display

tells me

  *-display UNCLAIMED       
       description: VGA compatible controller
       product: GA106 [Geforce RTX 3050]
       vendor: NVIDIA Corporation
       physical id: 0
       bus info: pci@0000:01:00.0
       version: a1
       width: 64 bits
       clock: 33MHz
       capabilities: pm msi pciexpress vga_controller cap_list
       configuration: latency=0
       resources: iomemory:400-3ff iomemory:400-3ff memory:84000000-84ffffff memory:4000000000-400fffffff memory:4010000000-4011ffffff ioport:6000(size=128) memory:c0000-dffff
  *-graphics
       product: EFI VGA
       physical id: 2
       logical name: /dev/fb0
       capabilities: fb
       configuration: depth=32 resolution=1920,1080

which indicates to me that the card is not being used at all (*-display UNCLAIMED). Can someone give me further assistance? The problem is that we are a small workgroup running Rocky 9.3 on our workstations. This was the first machine I updated, and I fear there will be a similar issue on the other machines.

Unfortunately this is normal. Nvidia hard locks to specific kernel versions and doesn’t take advantage of weak modules. This means when we have to rebuild a kernel, the nvidia driver will stop working.
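
If you want to confirm that on an affected box, the precompiled kmod package name carries the kernel it was built against, so a rough check (assuming the kmod-nvidia-&lt;driver&gt;-&lt;kernel&gt; naming that NVidia's repo uses) is:

# compare the kernel the nvidia kmod was built for with the kernel actually running
rpm -qa 'kmod-nvidia*'
uname -r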

Consider the following options:

  • Use rpmfusion for your driver
  • If you wish to continue using nvidia as the source of your drivers, use dkms instead.

See the following post for more information: Nvidia Drivers on Rocky Linux
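
If you go the RPM Fusion route, the short version (a sketch, assuming the usual EL9 release-RPM URLs; check rpmfusion.org for the current ones) is:

# enable RPM Fusion free + nonfree for EL9
sudo dnf install https://mirrors.rpmfusion.org/free/el/rpmfusion-free-release-9.noarch.rpm \
    https://mirrors.rpmfusion.org/nonfree/el/rpmfusion-nonfree-release-9.noarch.rpm
# akmod rebuilds the nvidia module automatically for each new kernel
sudo dnf install akmod-nvidia

A reboot (or waiting for akmods to finish the first build) is needed before the module is actually available.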

This is very unfortunate. Using dkms instead does not help. Using

dnf module install nvidia-driver:latest-dkms

instead just gives me a black screen. When I follow the instructions to install the driver from rpmfusion, I do get output, but the resolution is stuck at 1024x768 and lshw -C display still shows the card as unclaimed:

  *-display UNCLAIMED       
       description: VGA compatible controller
       product: GA106 [Geforce RTX 3050]
       vendor: NVIDIA Corporation
       physical id: 0
       bus info: pci@0000:01:00.0
       version: a1
       width: 64 bits
       clock: 33MHz
       capabilities: pm msi pciexpress vga_controller cap_list
       configuration: latency=0
       resources: iomemory:400-3ff iomemory:400-3ff memory:84000000-84ffffff memory:4000000000-400fffffff memory:4010000000-4011ffffff ioport:6000(size=128) memory:c0000-dffff
  *-graphics
       product: EFI VGA
       physical id: 2
       logical name: /dev/fb0
       capabilities: fb
       configuration: depth=32 resolution=1024,768

We ran into this yesterday patching our edu labs, and what I found was that we needed to reinstall the kernel packages:

dnf reinstall kernel kernel-core kernel-modules kernel-headers

I’m not sure specifically what in the different kernel packages fixes it, or why, but it does.

So the process that works for us:

The box got upgraded to kernel 5.14.0-362.18.1.el9_3.0.1 and nvidia-smi was not working.

  1. dnf -y remove nvidia-driver nvidia-settings cuda-driver
  2. dnf -y install nvidia-driver nvidia-settings cuda-driver
  3. dnf reinstall kernel kernel-core kernel-modules kernel-headers

This pulls nvidia driver 550, and I’ve tried this with DKMS as well; the same process works.

Honestly, I’m ditching the nvidia repo drivers. I’ve tested RPM Fusion using akmod in our kickstart/post.sh setup when imaging new systems, and the GUI comes up fine and nvidia-smi reports happy.
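
For anyone doing similar imaging, the kickstart side can be as small as something like the following %post section (a sketch, not our exact script; the RPM Fusion release URLs are assumptions worth double-checking):

%post --log=/root/ks-post-nvidia.log
# enable RPM Fusion free + nonfree and pull the akmod-built nvidia driver while imaging;
# the kernel module itself is built by akmods on first boot
dnf -y install https://mirrors.rpmfusion.org/free/el/rpmfusion-free-release-9.noarch.rpm \
    https://mirrors.rpmfusion.org/nonfree/el/rpmfusion-nonfree-release-9.noarch.rpm
dnf -y install akmod-nvidia
%end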

A little late to the party, but I’d like to offer yet another option.

ELRepo is now offering kmod-nvidia for el9. It is currently in the elrepo-testing repository [1]. After setting up the elrepo repository [2], you can install it by running:

$ sudo dnf --enablerepo=elrepo-testing install kmod-nvidia

[1] 0001245: kmod-nvidia - ELRepo Bugs
[2] https://elrepo.org/
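
For completeness, setting up the repository in [2] essentially means importing the signing key and installing the release package (paths as documented on elrepo.org; check there for the current ones):

# import the ELRepo signing key and install the release package for el9
sudo rpm --import https://www.elrepo.org/RPM-GPG-KEY-elrepo.org
sudo dnf install https://www.elrepo.org/elrepo-release-9.el9.elrepo.noarch.rpm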

Thank you all for your help. I have a working system again, but the way it happened is really odd. I tried all the drivers, with mixed results: either a black screen, only low resolution, or no output on a 4K display. So I gave up for the moment and wanted to use the internal GPU on the mainboard to check whether it might be sufficient for our setup. After activating the internal GPU, the nvidia driver started to work again. I have no clue how this is possible. I then tried the other drivers again (rpmfusion, elrepo), but they would not work with a 4K display. Only the nvidia driver does the job, though I fear it is not entirely stable either.

For the moment I have a working system and I will do further testing. It feels very unpleasant since I don’t understand what is going on.

NVidia, RPM Fusion, and ELRepo all supposedly take the very same binary blob and package it as an RPM.
That said, at this moment ELRepo has version 550.54.14, RPM Fusion has version 545.29.06, and NVidia has versions 550.54.14, 545.23.08, 535.161.07, 530.30.02, … 515.105.01; you can’t get the exact same version from all three.
In principle, though, they are the “same” and should handle 4K the same – unless the packaging / config differs.

Packaging does differ – NVidia’s own and RPM Fusion’s kernel modules are (re)built for each kernel version. ELRepo’s kernel module ought to work with every kernel of one point release (e.g. el9_3 has 5.14.0-362 kernels and el9_2 had 5.14.0-284 kernels).
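
A rough way to see that difference on a running system (assuming the default module paths) is to look at where the module ended up:

# kABI-tracking kmods (the ELRepo approach) get symlinked here and keep working
# across kernel updates within one point release
ls /lib/modules/$(uname -r)/weak-updates/
# per-kernel builds (NVidia's precompiled packages, RPM Fusion's akmod output) typically land here
ls /lib/modules/$(uname -r)/extra/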


Rocky also has the Nouveau driver, which is definitely different from the NVidia blob. Furthermore, NVidia’s repo has the open-source version of their driver. There are thus three different drivers (Nouveau, NVidia proprietary, and NVidia open) for (recent) NVidia GPUs.
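
Which of the three is actually bound to the card can be checked with lspci (the bus address is the one from the lshw output earlier in the thread):

# "Kernel driver in use:" shows the active driver; "Kernel modules:" lists the candidates
lspci -k -s 01:00.0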

Darn, I’m stuck. I have the proprietary nvidia driver installed.

I got the following error when trying to update:

sudo dnf update              
Last metadata expiration check: 0:06:11 ago on Mon 04 Mar 2024 04:32:01 PM CET.
NOTE: Skipping kernel installation since no kernel module package kmod-nvidia-550.54.14-5.14.0-362.18.1.el9_3.0 for kernel version 5.14.0-362.18.1.el9_3.0.1 and NVIDIA driver 550.54.14 could be found
Dependencies resolved.
Nothing to do.
Complete!

Then I tried the proposed fix:

sudo dnf reinstall kernel kernel-core kernel-modules kernel-headers

but I get the following error:

[...]
Running: dracut -f --kver 5.14.0-362.18.1.el9_3.0.1.x86_64
dracut: Can't write to /boot/efi/05f8068101554226b449ca2d674307a6/5.14.0-362.18.1.el9_3.0.1.x86_64: Directory /boot/efi/05f8068101554226b449ca2d674307a6/5.14.0-362.18.1.el9_3.0.1.x86_64 does not exist or is not accessible.
warning: %posttrans(kernel-modules-5.14.0-362.18.1.el9_3.0.1.x86_64) scriptlet failed, exit status 1

Error in POSTTRANS scriptlet in rpm package kernel-modules

When I try to uninstall nvidia-driver I get:

$ sudo dnf remove nvidia-driver
Error: 
 Problem: The operation would result in removing the following protected packages: kernel-debug-core

Resetting the nvidia-driver module does not work either:

sudo dnf module reset nvidia-driver

Any other hints on what I could try?

Sorry robbott, I think I can’t help you, but I want to share some information. I have now updated the second machine in our lab, which had the same problem, but when installing driver 545.23.08 instead of the latest one, everything seems to work. The “latest” driver, which is 550.54.14 at this point, seems to be broken, at least for the graphics card we are using (Zotac RTX 3050 8GB Twin Edge).
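
In case it helps others, pinning the 545 branch from NVidia’s repo should just be a matter of switching module streams; a sketch, assuming the CUDA repo exposes a versioned stream for that branch (the exact stream names may differ):

# drop the "latest" stream and switch to the 545 branch (dkms variant shown)
sudo dnf module reset nvidia-driver
sudo dnf module install nvidia-driver:545-dkms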

I’m pretty sure that I’ve tried every permutation of every suggestion on this page, to no avail. I’m running this kernel version on a Dell G16 with an RTX 4070. Calls to journalctl -xb consistently show the driver failing to set up its device nodes when it loads:

Mar 14 16:00:51 localhost systemd-udevd[1014]: nvidia: Process '/usr/bin/bash -c '/usr/bin/mknod -Z -m 666 /dev/nvidiactl c $(grep nvidia-frontend /proc/devices | cut -d \  -f 1) 255'' failed with exit code 1.
Mar 14 16:00:51 localhost systemd-udevd[1011]: nvidia: Process '/usr/bin/bash -c '/usr/bin/mknod -Z -m 666 /dev/nvidiactl c $(grep nvidia-frontend /proc/devices | cut -d \  -f 1) 255'' failed with exit code 1.
Mar 14 16:00:51 localhost systemd-udevd[1014]: nvidia: Process '/usr/bin/bash -c 'for i in $(cat /proc/driver/nvidia/gpus/*/information | grep Minor | cut -d \  -f 4); do /usr/bin/mknod -Z -m 666 /dev/nvidia${i} c $(grep nvidia-frontend /proc/devices | cut -d \  -f 1) ${i}; done'' failed with exit code 1.
Mar 14 16:00:51 localhost systemd-udevd[1011]: nvidia: Process '/usr/bin/bash -c 'for i in $(cat /proc/driver/nvidia/gpus/*/information | grep Minor | cut -d \  -f 4); do /usr/bin/mknod -Z -m 666 /dev/nvidia${i} c $(grep nvidia-frontend /proc/devices | cut -d \  -f 1) ${i}; done'' failed with exit code 1.
Mar 14 16:00:51 localhost kernel: nvidia_uvm: module uses symbols from proprietary module nvidia, inheriting taint.
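
The mknod calls fail with exit code 1 when the grep for nvidia-frontend in /proc/devices comes back empty, which would mean the kernel module never registered its character device. A quick way to check (just diagnostics, no fix):

# if the driver were working, it would register a character device and show up as a loaded module
grep nvidia /proc/devices
lsmod | grep nvidia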

It’s been 2 days and I’m giving up. Posting this so no one else endures this frustration and simply waits for the fix.

Just hit this problem as well. I believe it’s to do with the version mismatch of kernel-devel & kernel-headers vs the kernel: after the update these are at 5.14.0-362.24.1.el9_3 for devel & headers, while the kernel is 5.14.0-362.18.1.el9_3.0.1.
I suspect that the nvidia install is expecting these to match…
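
If so, a hedged workaround is to pull in the devel package that matches the running kernel (or reboot into the newest installed kernel) so the module build has matching headers:

# compare what is installed with what is running
rpm -q kernel kernel-devel kernel-headers
uname -r
# install the devel package for the running kernel, if the repos still carry that version
sudo dnf install "kernel-devel-$(uname -r)"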

Unless the mirrors you have connected to were not fully in sync at the time, the package versions should be in sync. (They’ve had 4+ days to sync the updates we pushed)

[root@xmpp01 ~]# dnf repoquery -q kernel-headers kernel-devel
kernel-devel-0:5.14.0-362.24.1.el9_3.x86_64
kernel-headers-0:5.14.0-362.24.1.el9_3.x86_64

Please run dnf clean all and check again.

Does this mean there is a sync issue? I’m assuming (and could be completely wrong here) that all the kernel pieces should be the same version?

If I do a dnf repoquery -q kernel* I see that:
kernel-debug-devel
kernel-debug-devel-matched
kernel-devel
kernel-devel-matched
kernel-headers
are at a different version level (5.14.0-362.24.1.el9_3.x86_64, vs 5.14.0-362.18.1.el9_3.0.1.x86_64 for everything else).

Just rechecked, and all kernel pieces are now at 5.14.0-362.24.1.el9_3. Updated, re-ran the nvidia 550.54.14 install… SUCCESS!