NVIDIA driver on Rocky 9: "RmInitAdapter failed!" and "can't assign; no space"

Hello, we have run into the following problem.
Our previous server configuration was Rocky 8 with the ELRepo kernel 6.8.
One of the servers has 6 Tesla T4 video cards installed.

  • nvidia-smi worked without problems and showed all of the cards.

We had to urgently upgrade to Rocky 9+ and kernel 6.8+ (in our case yum updated it to kernel-ml 6.12.9, and it is now 6.13.2).
I don't remember what the nvidia-smi and CUDA versions were before the upgrade :frowning:

After that the problems started:

  • nvidia-smi no longer shows all of the video cards as it did before.
  • Running nvidia-smi can cause the server to restart.
  • After installing nvidia-driver and rebooting, the server can crash and restart again.

The key errors are:
[ 10.591541] caller _nv046819rm+0x3a/0xb0 [nvidia] mapping multiple BARs
[ 10.600495] [drm:nv_drm_load [nvidia_drm]] ERROR [nvidia-drm] [GPU ID 0x00003f00] Failed to allocate NvKmsKapiDevice
[ 10.600679] [drm:nv_drm_register_drm_device [nvidia_drm]] ERROR [nvidia-drm] [GPU ID 0x00003f00] Failed to register device
[ 10.600836] [drm] [nvidia-drm] [GPU ID 0x00004000] Loading driver
[ 11.819075] resource: resource sanity check: requesting [mem 0x00000000b5700000-0x00000000b66fffff], which spans more than PCI Bus 0000:40 [mem 0xb5000000-0xb63fffff]
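To put a number on the last message, here is a minimal Python sketch that checks how far the requested range sticks out of the bus window. The addresses are copied from the log line above; the helper name `overshoot` is made up for this illustration and is not part of any tool.

```python
def overshoot(req_start, req_end, win_start, win_end):
    """Return (bytes below the window, bytes above the window) for an inclusive range."""
    below = max(0, win_start - req_start)
    above = max(0, req_end - win_end)
    return below, above


# Requested range and "PCI Bus 0000:40" window from the resource sanity check line.
below, above = overshoot(0xB5700000, 0xB66FFFFF, 0xB5000000, 0xB63FFFFF)
print(f"below window: {below:#x}, above window: {above:#x} ({above // 2**20} MiB)")
# prints: below window: 0x0, above window: 0x300000 (3 MiB)
```

So the requested 16 MiB range starts inside the window but runs 3 MiB past its end.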

We currently have driver version 570.86.15.

Here are the PCI error logs (they are the same for the two affected video cards):
nvidia-smi -L
GPU 0: Tesla T4 (UUID: GPU-7e72779e-00d4-6c68-ba84-7726909da764)
GPU 1: Tesla T4 (UUID: GPU-79405ede-bba6-5c34-d48d-1ab4d1d48a8e)
GPU 2: Tesla T4 (UUID: GPU-e6c0ca41-b425-75d8-68c3-6751367eb5b7)
GPU 3: Tesla T4 (UUID: GPU-a1165eb6-a4e4-3a89-0e91-d7e8e20d717f)
[root@scanh2-4 ~]# lspci | grep -i nvidia
1b:00.0 3D controller: NVIDIA Corporation TU104GL [Tesla T4] (rev a1)
1c:00.0 3D controller: NVIDIA Corporation TU104GL [Tesla T4] (rev a1)
1e:00.0 3D controller: NVIDIA Corporation TU104GL [Tesla T4] (rev a1)
3f:00.0 3D controller: NVIDIA Corporation TU104GL [Tesla T4] (rev a1)
40:00.0 3D controller: NVIDIA Corporation TU104GL [Tesla T4] (rev a1)
5e:00.0 3D controller: NVIDIA Corporation TU104GL [Tesla T4] (rev a1)

dmesg | grep -i 40:00.0
[ 0.691219] pci 0000:40:00.0: [10de:1eb8] type 00 class 0x030200 PCIe Endpoint
[ 0.691238] pci 0000:40:00.0: BAR 0 [mem 0xb5000000-0xb5ffffff]
[ 0.691255] pci 0000:40:00.0: BAR 1 [mem 0x3afe80000000-0x3afe8fffffff 64bit pref]
[ 0.691271] pci 0000:40:00.0: BAR 3 [mem 0x3afeb0000000-0x3afeb1ffffff 64bit pref]
[ 0.691294] pci 0000:40:00.0: enabling Extended Tags
[ 0.691321] pci 0000:40:00.0: Enabling HDA controller
[ 0.691379] pci 0000:40:00.0: PME# supported from D0 D3hot D3cold
[ 0.691418] pci 0000:40:00.0: VF BAR 0 [mem 0xb6000000-0xb603ffff]
[ 0.691420] pci 0000:40:00.0: VF BAR 0 [mem 0xb6000000-0xb63fffff]: contains BAR 0 for 16 VFs
[ 0.691429] pci 0000:40:00.0: VF BAR 1 [mem 0x3afd80000000-0x3afd8fffffff 64bit pref]
[ 0.691430] pci 0000:40:00.0: VF BAR 1 [mem 0x3afd80000000-0x3afe7fffffff 64bit pref]: contains BAR 1 for 16 VFs
[ 0.691439] pci 0000:40:00.0: VF BAR 3 [mem 0x3afe90000000-0x3afe91ffffff 64bit pref]
[ 0.691440] pci 0000:40:00.0: VF BAR 3 [mem 0x3afe90000000-0x3afeafffffff 64bit pref]: contains BAR 3 for 16 VFs
[ 0.717385] pci 0000:40:00.0: BAR 1 [mem 0x3afe80000000-0x3afe8fffffff 64bit pref]: can't claim; no compatible bridge window
[ 0.717387] pci 0000:40:00.0: BAR 3 [mem 0x3afeb0000000-0x3afeb1ffffff 64bit pref]: can't claim; no compatible bridge window
[ 0.717388] pci 0000:40:00.0: VF BAR 1 [mem 0x3afd80000000-0x3afe7fffffff 64bit pref]: can't claim; no compatible bridge window
[ 0.717390] pci 0000:40:00.0: VF BAR 3 [mem 0x3afe90000000-0x3afeafffffff 64bit pref]: can't claim; no compatible bridge window
[ 0.752175] pci 0000:40:00.0: BAR 1 [mem size 0x10000000 64bit pref]: can't assign; no space
[ 0.752176] pci 0000:40:00.0: BAR 1 [mem 0x3afe80000000-0x3afe8fffffff 64bit pref]: failed to assign
[ 0.752177] pci 0000:40:00.0: VF BAR 1 [mem size 0x100000000 64bit pref]: can't assign; no space
[ 0.752179] pci 0000:40:00.0: VF BAR 1 [mem 0x3afd80000000-0x3afe7fffffff 64bit pref]: failed to assign
[ 0.752180] pci 0000:40:00.0: BAR 3 [mem size 0x02000000 64bit pref]: can't assign; no space
[ 0.752182] pci 0000:40:00.0: BAR 3 [mem 0x3afeb0000000-0x3afeb1ffffff 64bit pref]: failed to assign
[ 0.752183] pci 0000:40:00.0: VF BAR 3 [mem size 0x20000000 64bit pref]: can't assign; no space
[ 0.752185] pci 0000:40:00.0: VF BAR 3 [mem 0x3afe90000000-0x3afeafffffff 64bit pref]: failed to assign
[ 0.752186] pci 0000:40:00.0: BAR 1 [mem size 0x10000000 64bit pref]: can't assign; no space
[ 0.752188] pci 0000:40:00.0: BAR 1 [mem 0x3afe80000000-0x3afe8fffffff 64bit pref]: failed to assign
[ 0.752189] pci 0000:40:00.0: BAR 3 [mem size 0x02000000 64bit pref]: can't assign; no space
[ 0.752190] pci 0000:40:00.0: BAR 3 [mem 0x3afeb0000000-0x3afeb1ffffff 64bit pref]: failed to assign
[ 0.752192] pci 0000:40:00.0: VF BAR 3 [mem size 0x20000000 64bit pref]: can't assign; no space
[ 0.752193] pci 0000:40:00.0: VF BAR 3 [mem 0x3afe90000000-0x3afeafffffff 64bit pref]: failed to assign
[ 0.752195] pci 0000:40:00.0: VF BAR 1 [mem size 0x100000000 64bit pref]: can't assign; no space
[ 0.752196] pci 0000:40:00.0: VF BAR 1 [mem 0x3afd80000000-0x3afe7fffffff 64bit pref]: failed to assign
[ 2.534998] nvidia 0000:40:00.0: enabling device (0100 -> 0102)
NVRM: BAR1 is 0M @ 0x0 (PCI:0000:40:00.0)
NVRM: BAR2 is 0M @ 0x0 (PCI:0000:40:00.0)
NVRM: BAR3 is 0M @ 0x0 (PCI:0000:40:00.0)
NVRM: BAR4 is 0M @ 0x0 (PCI:0000:40:00.0)
NVRM: BAR5 is 0M @ 0x0 (PCI:0000:40:00.0)
[ 4.780729] [drm] Initialized nvidia-drm 0.0.0 for 0000:40:00.0 on minor 5
[ 135.112806] NVRM: GPU 0000:40:00.0: RmInitAdapter failed! (0x24:0x72:1568)
[ 135.112962] NVRM: GPU 0000:40:00.0: rm_init_adapter failed, device minor number 4
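The sizes in the "no space" lines are easier to compare in MiB. Below is a minimal Python sketch that pulls them out of the log text; the sample lines are copied from the dmesg output above (timestamps dropped, one line per distinct BAR), and the regex and the helper name `failed_bar_sizes` are my own assumptions for illustration, not part of any NVIDIA or kernel tool.

```python
import re

# Sample lines copied from the dmesg output above, one per distinct BAR.
LOG = """\
pci 0000:40:00.0: BAR 1 [mem size 0x10000000 64bit pref]: can't assign; no space
pci 0000:40:00.0: VF BAR 1 [mem size 0x100000000 64bit pref]: can't assign; no space
pci 0000:40:00.0: BAR 3 [mem size 0x02000000 64bit pref]: can't assign; no space
pci 0000:40:00.0: VF BAR 3 [mem size 0x20000000 64bit pref]: can't assign; no space
"""

# "can.t" accepts both the ASCII and the typographic apostrophe.
PATTERN = re.compile(
    r"((?:VF )?BAR \d+) \[mem size (0x[0-9a-f]+) [^\]]*\]: can.t assign"
)


def failed_bar_sizes(text):
    """Map each BAR name to the size in bytes that the kernel could not place."""
    return {name: int(size, 16) for name, size in PATTERN.findall(text)}


sizes = failed_bar_sizes(LOG)
for name, size in sizes.items():
    print(f"{name}: {size // 2**20} MiB")
print(f"total: {sum(sizes.values()) // 2**20} MiB")
```

For this card that comes to 256 MiB (BAR 1), 4096 MiB (VF BAR 1), 32 MiB (BAR 3) and 512 MiB (VF BAR 3), 4896 MiB in total, so the two VF BARs account for most of the 64-bit prefetchable space that could not be assigned.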

We have been trying to solve this problem for over a month now :frowning: and would love some help.