Can you borrow a graphics card or use onboard graphics to see if it makes any difference? I notice you have CSM enabled, along with a number of legacy settings, which I think will become unsustainable as time goes on. I don’t know if you’ve applied UEFI GOP firmware to the card, but it doesn’t say so. Be careful if you disable CSM, because it could cause a black screen (as in not being able to see the BIOS at all).
In general, from RHEL 8.0 onwards, you are expected to use UEFI, GPT partitions and Secure Boot.
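To see which mode a running system actually booted in, a minimal sketch like the following may help. It assumes a Linux system where the kernel exposes `/sys/firmware/efi` only on UEFI boots; the `boot_mode` helper name and its optional sysfs-root argument are made up for this example, and `mokutil` is only called if it happens to be installed.

```shell
#!/bin/sh
# Report whether the running system was booted via UEFI or legacy BIOS/CSM.
# boot_mode is a hypothetical helper; $1 is an optional sysfs root (default /sys).
boot_mode() {
    if [ -d "${1:-/sys}/firmware/efi" ]; then
        echo "UEFI"
    else
        echo "BIOS"
    fi
}

echo "Firmware boot mode: $(boot_mode)"

# If mokutil is available, also show the Secure Boot state.
# It fails on non-UEFI boots, so the error is ignored here.
if command -v mokutil >/dev/null 2>&1; then
    mokutil --sb-state 2>/dev/null || true
fi
```

A result of BIOS on a machine with UEFI firmware suggests the installer was started through CSM/legacy mode.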
After a fresh 9.0 install and upgrade to 9.1, the panic stop is still there. So it seems like the video card, which was supported in 9.0, is no longer supported in 9.1.
When people purchase a RHEL subscription, they expect the major release to run from start to end – a decade. If Red Hat dropped hardware support in a point update, they would break that expectation. “No longer supported” is thus unthinkable in Enterprise Linux.
An error in the (Rocky) build or a regression (introduced by Red Hat) is a much more likely explanation.
The challenge is in how to diagnose the root cause.
This is interesting. People were saying just install 9.0 and then upgrade to 9.1 and everything will be fine, but it didn’t make sense to me; it would imply that the boot process of 9.1 (after an upgrade) is different to the boot process from a boot device such as USB.
We don’t know it’s the video card for sure, but we need to rule it out.
I’m surprised this card works in 9.0 with the “Compatibility Support Module” disabled; it seems impossible.
Did RH deliberately drop support between 9.0 and 9.1? Maybe not. But did they do it accidentally? Maybe. The release notes are not as concise as they should be.
From the panic crash screenshot, I was able to narrow down the location of the unexpected exception that causes the crash… It happens in the amdgpu driver in a C function named “amdgpu_device_fini_sw”.
Googling the name of the routine turned up some interesting recent changes to the driver.
The code follows. Note that the routine seems to take a pointer to the device data structure as an argument and it appears to be “resetting” the device.
I have not worked with GPU drivers, so I don’t know how to debug this issue, especially when it occurs at boot time.
void amdgpu_device_fini_sw(struct amdgpu_device *adev)
{
    int idx;
    amdgpu_fence_driver_sw_fini(adev);
    amdgpu_device_ip_fini(adev);
    release_firmware(adev->firmware.gpu_info_fw);
    adev->firmware.gpu_info_fw = NULL;
    adev->accel_working = false;
    dma_fence_put(rcu_dereference_protected(adev->gang_submit, true));
    amdgpu_reset_fini(adev);
    /* free i2c buses */
    if (!amdgpu_device_has_dc_support(adev))
        amdgpu_i2c_fini(adev);
    if (amdgpu_emu_mode != 1)
        amdgpu_atombios_fini(adev);
    kfree(adev->bios);
    adev->bios = NULL;
    if (amdgpu_device_supports_px(adev_to_drm(adev))) {
        vga_switcheroo_unregister_client(adev->pdev);
        vga_switcheroo_fini_domain_pm_ops(adev->dev);
    }
    if ((adev->pdev->class >> 8) == PCI_CLASS_DISPLAY_VGA)
        vga_client_unregister(adev->pdev);
    if (drm_dev_enter(adev_to_drm(adev), &idx)) {
        iounmap(adev->rmmio);
        adev->rmmio = NULL;
        amdgpu_device_doorbell_fini(adev);
        drm_dev_exit(idx);
    }
    if (IS_ENABLED(CONFIG_PERF_EVENTS))
        amdgpu_pmu_fini(adev);
    if (adev->mman.discovery_bin)
        amdgpu_discovery_fini(adev);
    amdgpu_reset_put_reset_domain(adev->reset_domain);
    adev->reset_domain = NULL;
    kfree(adev->pci_state);
}
There have been some sporadic issues with AMD GPUs needing kernel parameters added, such as iommu=soft or iommu=pt. There are other iommu settings that might be tried, but those are the two that have come up most recently.
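For anyone wanting to try one of these parameters persistently, here is a rough sketch using `grubby`, which is normally present on a Rocky/RHEL 9 install. The `run`, `add_kernel_arg` and `remove_kernel_arg` helpers and the `DRY_RUN` variable are invented for this example; by default the script only prints the command it would run.

```shell
#!/bin/sh
# Sketch: add or remove a kernel argument on every installed kernel.
# DRY_RUN defaults to 1, so the grubby command is only printed; set DRY_RUN=0
# and run as root to actually change the boot entries.

run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Hypothetical helpers wrapping grubby's --args / --remove-args options.
add_kernel_arg() {
    run grubby --update-kernel=ALL --args="$1"
}

remove_kernel_arg() {
    run grubby --update-kernel=ALL --remove-args="$1"
}

# Try pass-through mode first; swap in iommu=soft if that makes no difference.
add_kernel_arg "iommu=pt"
```

For a one-off test it is probably simpler to press `e` at the GRUB menu and append the parameter to the `linux` line, which leaves the saved boot entries untouched.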
There have been some big changes in the AMD graphics stack starting with 9.1, which most people will want, but it’s possible they have not taken older cards into account. Windows 11 has similar changes and the card (Cedar GL) is unlikely to work.
Yes, at first I thought it might be related, as I’m using AMD graphics, but I don’t actually know what caused the upgrade of the kernel to fail, nor why it was fixed by re-installing the failed kernel.
I realized that, since I only need the graphics card for text mode console output, I don’t really need the amdgpu driver, and I could blacklist it so it is never loaded at boot time.
I would like to thank all who participated in this discussion.
Checking that the solution works
I booted the problematic 9.1 entry with the added option “modprobe.blacklist=amdgpu”, which worked.
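One way to confirm after boot that the option took effect is sketched below. It assumes the parameter appears on the kernel command line exactly as typed (a comma-separated list such as `modprobe.blacklist=amdgpu,foo` would not match); the `has_kernel_arg` and `module_loaded` helpers are made up for this example and read `/proc/cmdline` and `/proc/modules` by default.

```shell
#!/bin/sh
# has_kernel_arg ARG [CMDLINE]: succeed if ARG is one of the words in CMDLINE
# (defaults to the running kernel's /proc/cmdline).
has_kernel_arg() {
    cmdline="${2:-$(cat /proc/cmdline 2>/dev/null || true)}"
    for word in $cmdline; do
        if [ "$word" = "$1" ]; then
            return 0
        fi
    done
    return 1
}

# module_loaded NAME [FILE]: succeed if NAME is listed in FILE
# (defaults to /proc/modules, which has one module per line, name first).
module_loaded() {
    grep -q "^$1 " "${2:-/proc/modules}" 2>/dev/null
}

if has_kernel_arg "modprobe.blacklist=amdgpu"; then
    echo "blacklist option is on the kernel command line"
else
    echo "blacklist option is NOT on the kernel command line"
fi

if module_loaded amdgpu; then
    echo "amdgpu is loaded"
else
    echo "amdgpu is not loaded"
fi
```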
It is not necessary to go the “RL9.0 upgrade to 9.1” route. Just add the “modprobe.blacklist=amdgpu” option when booting the RL9.1 USB ISO, and simply blacklist the amdgpu driver as shown previously without worrying about recreating the initramfs.
Rocky includes two AMD drivers in the distribution: “amdgpu” and “radeon”. When amdgpu is blacklisted, the radeon driver is used.
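A quick way to check which driver ended up bound to each card is to filter `lspci -k` output. This is only a sketch: it assumes `lspci` (from pciutils) is installed and relies on the usual layout of `lspci -k` output; the `gpu_drivers` helper name is made up for this example.

```shell
#!/bin/sh
# gpu_drivers: read `lspci -k` style text on stdin and print
# "<slot> <driver>" for each display device that has a driver bound.
gpu_drivers() {
    awk '
        # Device header lines start at column 0 with the PCI slot address.
        /^[0-9a-f]/ {
            is_gpu = ($0 ~ /VGA compatible controller|3D controller|Display controller/)
            slot = $1
        }
        # Indented detail line naming the bound driver.
        is_gpu && /Kernel driver in use:/ { print slot, $NF }
    '
}

if command -v lspci >/dev/null 2>&1; then
    lspci -k | gpu_drivers
fi
```

With amdgpu blacklisted, the AMD card would be expected to show `radeon` here, if a driver is bound at all.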
PS #2
To verify which video drivers are actually used in my machine (which also includes an Nvidia card) I used the lshw command: