After an upgrade I got problems making the nvidia drivers work for a GeForce RTX 3050. After a frustrating day of fiddling around, I decided to install Rocky 9.3 from scratch, but with a similar outcome.
It semi-works: on a single (non-4K) monitor I get an output which looks fine, but on a 4K monitor the resolution is set to 1024x768 and I cannot change it. In any case I can only make one monitor work. Interestingly, if I install
dnf module install nvidia-driver:latest
instead, I only get a black screen. I have the impression that the nvidia driver is not working at all.
nvidia-smi
gives me
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.
which indicates to me that the card is not being used at all ( *-display UNCLAIMED). Can someone give me further assistance? The problem is that we are a small workgroup running Rocky 9.3 on our workstations. This was the first machine I updated, and I fear there will be a similar issue on the other machines.
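For reference, a few generic checks show whether any driver has actually claimed the card (standard commands, nothing specific to this particular setup):

lspci -k | grep -iA3 -E 'vga|3d'        # which kernel driver, if any, is bound to the GPU
lsmod | grep -E 'nvidia|nouveau'        # is a GPU kernel module loaded at all?
journalctl -b | grep -iE 'nvidia|nvrm'  # module build/load errors from the current boot
dkms status                             # only meaningful for the dkms driver variants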
Unfortunately this is normal. Nvidia hard locks to specific kernel versions and doesn’t take advantage of weak modules. This means when we have to rebuild a kernel, the nvidia driver will stop working.
Consider the following options:
Use rpmfusion for your driver
If you wish to continue using nvidia as the source of your drivers, use dkms instead. (Rough example commands for both options are sketched below.)
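As a sketch, assuming EPEL, CRB and the RPM Fusion repositories are already set up for the first option (see the kickstart example further down for the repo setup itself):

# option 1: RPM Fusion's akmod package, which rebuilds the kernel module locally for each new kernel
dnf install akmod-nvidia
# option 2: stay on NVIDIA's own repo, but pick the dkms stream instead of the precompiled one
dnf module install nvidia-driver:latest-dkms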
This is very unfortunate. Using dkms instead does not help. Using
dnf module install nvidia-driver:latest-dkms
instead gives me just a black screen. When I follow the instructions to install the driver from rpmfusion, I have an output, but the resolution is stuck at 1024x768 and the output of lshw -C display also remains unchanged.
This pulls nvidia driver 550, and I’ve tried this with DKMS as well; the same process works.
Honestly, I am ditching the nvidia repo drivers. I’ve tested RPM Fusion with akmod in our kickstart/post.sh setup for imaging new systems: the GUI comes up fine and nvidia-smi reports the card as happy.
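A minimal sketch of what such a kickstart %post could look like, assuming Rocky 9 and following the RPM Fusion EL howto (the log path and the exact steps are assumptions, not the poster's actual script):

%post --log=/root/ks-post.log
# enable the repositories akmod-nvidia needs on Rocky 9
dnf config-manager --set-enabled crb
dnf -y install epel-release
dnf -y install https://mirrors.rpmfusion.org/free/el/rpmfusion-free-release-9.noarch.rpm \
               https://mirrors.rpmfusion.org/nonfree/el/rpmfusion-nonfree-release-9.noarch.rpm
# akmod-nvidia rebuilds the nvidia kernel module locally whenever the kernel changes
dnf -y install akmod-nvidia
%end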
A little late to the party, but I’d like to offer yet another option.
ELRepo is now offering kmod-nvidia for el9. It is currently in the elrepo-testing repository [1]. After setting up the elrepo repository [2], you can install it by running:
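The exact command was not captured here; based on the description, it would presumably be along the lines of:

dnf --enablerepo=elrepo-testing install kmod-nvidia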
Thank you all for your help. I have a working system again, but it is really odd how… I tried all the drivers, but with mixed results: either a black screen, only a low resolution, or no output on a 4K display. Thus I gave up for the moment and wanted to use the internal GPU of the mainboard to check whether it is perhaps sufficient for our setup. After activating the internal GPU, the nvidia driver started to work again. I have no clue how this is possible. I then tried the other drivers again (rpmfusion, elrepo), but they did not want to work with a 4K display. Only the nvidia driver does the job, but I fear it is not entirely stable either.
For the moment I have a working system and I will do further testing. It feels very unpleasant since I don’t understand what is going on.
NVidia, RPM Fusion, and ELRepo all three supposedly take the very same binary blob and package it as an RPM.
Although, at this moment, ELRepo has version 550.54.14, RPM Fusion has version 545.29.06, and NVidia has versions 550.54.14, 545.23.08, 535.161.07, 530.30.02, … 515.105.01; you can’t get the exact same version from all three.
In principle, though, they are “the same” and should handle 4K the same – unless the packaging / config differs.
Packaging does differ – NVidia’s own and RPM Fusion’s kernel modules are (re)built for each kernel version. ELRepo’s kernel module ought to work with every kernel of one point update (e.g. el9_3 has 5.14.0-362 kernels and el9_2 had 5.14.0-284 kernels).
Rocky also has the Nouveau driver, which is definitely different from the NVidia blob. Furthermore, NVidia’s repo has the open source version of their driver. There are thus three different drivers (Nouveau, NVidia proprietary, and NVidia open) for (recent) NVidia GPUs.
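A quick way to see which of the three is actually in use on a given machine (generic commands, nothing specific to this thread; the license string is how the open and proprietary NVidia modules can usually be told apart):

lsmod | grep -E '^(nvidia|nouveau)'   # which GPU kernel module(s) are loaded
modinfo -F version nvidia             # version of the installed nvidia module
modinfo -F license nvidia             # proprietary reports "NVIDIA", the open module a dual MIT/GPL license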
Darn, I’m stuck. I have the proprietary nvidia driver installed.
I got the following error when trying to update:
sudo dnf update
Last metadata expiration check: 0:06:11 ago on Mon 04 Mar 2024 04:32:01 PM CET.
NOTE: Skipping kernel installation since no kernel module package kmod-nvidia-550.54.14-5.14.0-362.18.1.el9_3.0 for kernel version 5.14.0-362.18.1.el9_3.0.1 and NVIDIA driver 550.54.14 could be found
Dependencies resolved.
Nothing to do.
Complete!
[...]
Running: dracut -f --kver 5.14.0-362.18.1.el9_3.0.1.x86_64
dracut: Can't write to /boot/efi/05f8068101554226b449ca2d674307a6/5.14.0-362.18.1.el9_3.0.1.x86_64: Directory /boot/efi/05f8068101554226b449ca2d674307a6/5.14.0-362.18.1.el9_3.0.1.x86_64 does not exist or is not accessible.
warning: %posttrans(kernel-modules-5.14.0-362.18.1.el9_3.0.1.x86_64) scriptlet failed, exit status 1
Error in POSTTRANS scriptlet in rpm package kernel-modules
When I try to uninstall nvidia-driver I get:
$ sudo dnf remove nvidia-driver
Error:
Problem: The operation would result in removing the following protected packages: kernel-debug-core
Sorry robbott, I think I can’t help you, but I want to share some information. I have now updated the second machine in our lab, which had the same problem, but when installing driver 545.23.08 instead of the latest Nvidia driver, everything seems to work. The “latest” driver, which is 550.54.14 at this point, seems to be broken, at least for the graphics card we are using (Zotac RTX 3050 8GB Twin Edge).
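The NVIDIA repo exposes the driver branches as dnf module streams, so switching to the 545 branch would look roughly like this (the exact stream name, 545-dkms here, is an assumption; dnf module list shows what the repo actually offers):

dnf module reset nvidia-driver
dnf module list nvidia-driver           # see which streams/versions are actually available
dnf module install nvidia-driver:545-dkms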
I’m pretty sure that I’ve tried every permutation of every suggestion on this page, to no avail. I’m running this kernel version on a Dell G16 with an RTX 4070. Calls to journalctl -xb consistently show that the driver fails when the device nodes are being created:
Mar 14 16:00:51 localhost systemd-udevd[1014]: nvidia: Process '/usr/bin/bash -c '/usr/bin/mknod -Z -m 666 /dev/nvidiactl c $(grep nvidia-frontend /proc/devices | cut -d \ -f 1) 255'' failed with exit code 1.
Mar 14 16:00:51 localhost systemd-udevd[1011]: nvidia: Process '/usr/bin/bash -c '/usr/bin/mknod -Z -m 666 /dev/nvidiactl c $(grep nvidia-frontend /proc/devices | cut -d \ -f 1) 255'' failed with exit code 1.
Mar 14 16:00:51 localhost systemd-udevd[1014]: nvidia: Process '/usr/bin/bash -c 'for i in $(cat /proc/driver/nvidia/gpus/*/information | grep Minor | cut -d \ -f 4); do /usr/bin/mknod -Z -m 666 /dev/nvidia${i} c $(grep nvidia-frontend /proc/devices | cut -d \ -f 1) ${i}; done'' failed with exit code 1.
Mar 14 16:00:51 localhost systemd-udevd[1011]: nvidia: Process '/usr/bin/bash -c 'for i in $(cat /proc/driver/nvidia/gpus/*/information | grep Minor | cut -d \ -f 4); do /usr/bin/mknod -Z -m 666 /dev/nvidia${i} c $(grep nvidia-frontend /proc/devices | cut -d \ -f 1) ${i}; done'' failed with exit code 1.
Mar 14 16:00:51 localhost kernel: nvidia_uvm: module uses symbols from proprietary module nvidia, inheriting taint.
It’s been 2 days and I’m giving up. Posting this so no one else endures this frustration and simply waits for the fix.
Just hit this problem as well. I believe it’s to do with the version mismatch of the kernel-devel & kernel-headers packages vs the kernel: after the update these are at 5.14.0-362.24.1.el9_3 for devel & headers, while the kernel is 5.14.0-362.18.1.el9_3.0.1.
I suspect that the nvidia install is expecting these to match…
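For what it’s worth, that suspicion is easy to confirm on an affected box (standard commands, nothing specific to this report):

uname -r
rpm -q kernel kernel-devel kernel-headers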
Unless the mirrors you have connected to were not fully in sync at the time, the package versions should be in sync. (They’ve had 4+ days to sync the updates we pushed)
Does this mean there is a sync issue? I’m assuming (and could be completely wrong here) that all the kernel pieces should be the same version?
If I do a dnf repoquery -q 'kernel*', I see that:
kernel-debug-devel
kernel-debug-devel-matched
kernel-devel
kernel-devel-matched
kernel-headers
are at a different version level (5.14.0-362.24.1.el9_3.x86_64, vs 5.14.0-362.18.1.el9_3.0.1.x86_64 for everything else).
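If matching builds ever show up in the repos, something along these lines should pull in the devel package for the running kernel (purely a sketch; in the situation described above the 5.14.0-362.18.1.el9_3.0.1 builds were apparently not available, so this would simply fail until the mirrors catch up):

dnf install "kernel-devel-$(uname -r)"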