I am looking for help resolving a persistent “not syncing” kernel panic that occurs when I try to install Rocky Linux 9.1 but not when I install 9.0 under exactly the same circumstances.
Context
The machine is currently running 9.1 (upgraded from 9.0) without any obvious issues but I’m now planning a fresh re-install of 9.1 in order to put it into service as a KVM/QEMU host.
Hardware
M/B: ASUS Prime X570-P
CPU: Ryzen 5 5600X 4th Gen 6-core
RAM: Crucial Ballistix 3600 MHz DDR4 DRAM (2x16GB)
SSD: SAMSUNG 980 PRO 1TB PCIe NVMe Gen4
VGA: ATI AMD FirePro 2270 512MB PCI-E
UEFI BIOS Settings
“Factory” settings
Secure boot disabled
Never overclocked; RAM running at 2666MHz
Things I have tried
Using different USB drives.
Removing the two 16GB RAM sticks and using only one at a time.
Removing all cards except for the video card used for the console.
I have no experience debugging kernel panics and have had no luck searching this forum or googling for answers.
Kernel Panic
Based on the panic error below, I suspect a video driver issue even though the video card works fine for the console when I boot the existing Rocky 9.1 installation.
Can you clarify the BIOS settings used in both cases (9.0 and 9.1), for example whether CSM is disabled, and the Secure Boot and UEFI settings?
I’m interested in the AMD Firepro 2270, is that the passive cooling one?
I really liked that card, but ran into a problem with it where it didn’t support UEFI (without an unsupported firmware hack), and after looking more into the way it connected to the bus, I decided I’d have to replace it with a newer one.
Alternatively, since installing 9.0 works why not stop beating your head against the wall and just install that? Run a dnf upgrade afterward and you should be all set.
Can you borrow a graphics card or use onboard graphics to see if it makes any difference? I notice you have CSM enabled, along with a number of legacy settings, which I think will become unsustainable as time goes on. I don’t know if you’ve applied UEFI firmware to the GOP, but it doesn’t say so. Be careful if you disable CSM, because it could cause a black screen (as in not being able to see the BIOS at all).
In general, from RHEL 8.0 onwards, you are expected to use UEFI, GPT partitions and Secure Boot.
After a fresh 9.0 install and upgrade to 9.1, the panic stop is still there. So it seems like the video card which was supported in 9.0 is no longer supported in 9.1.
When people purchase a RHEL subscription, they expect the major release to run from start to end, that is, a decade. If Red Hat dropped hardware support in a point update, they would break that expectation. “No longer supported” is thus unthinkable in Enterprise Linux.
An error in the (Rocky) build or a regression (introduced by Red Hat) are much more likely explanations.
The challenge is in how to diagnose the root cause.
This is interesting. People were saying just install 9.0 and then upgrade to 9.1 and everything will be fine, but it didn’t make sense to me; it would imply that the boot process of 9.1 (after an upgrade) is different to the boot process from a boot device such as USB.
We don’t know it’s the video card for sure, but we need to rule it out.
I’m surprised this card works in 9.0 with the “Compatibility Support Module” (CSM) disabled; it seems impossible.
Did RH deliberately drop support between 9.0 and 9.1? Maybe not. But did they do it accidentally? Maybe. The release notes are not as concise as they should be.
From the panic crash screenshot, I was able to narrow down the location of the unexpected exception that causes the crash. It happens in the amdgpu driver, in a C function named “amdgpu_device_fini_sw”.
Googling the name of the routine turned up some interesting recent changes to the driver.
The code follows. Note that the routine seems to take a pointer to the device data structure as an argument and it appears to be “resetting” the device.
I have not worked with GPU drivers, so I don’t know how to debug this issue, especially when it occurs at boot time.
void amdgpu_device_fini_sw(struct amdgpu_device *adev)
{
    int idx;

    amdgpu_fence_driver_sw_fini(adev);
    amdgpu_device_ip_fini(adev);
    release_firmware(adev->firmware.gpu_info_fw);
    adev->firmware.gpu_info_fw = NULL;
    adev->accel_working = false;
    dma_fence_put(rcu_dereference_protected(adev->gang_submit, true));

    amdgpu_reset_fini(adev);

    /* free i2c buses */
    if (!amdgpu_device_has_dc_support(adev))
        amdgpu_i2c_fini(adev);

    if (amdgpu_emu_mode != 1)
        amdgpu_atombios_fini(adev);

    kfree(adev->bios);
    adev->bios = NULL;
    if (amdgpu_device_supports_px(adev_to_drm(adev))) {
        vga_switcheroo_unregister_client(adev->pdev);
        vga_switcheroo_fini_domain_pm_ops(adev->dev);
    }
    if ((adev->pdev->class >> 8) == PCI_CLASS_DISPLAY_VGA)
        vga_client_unregister(adev->pdev);

    if (drm_dev_enter(adev_to_drm(adev), &idx)) {
        iounmap(adev->rmmio);
        adev->rmmio = NULL;
        amdgpu_device_doorbell_fini(adev);
        drm_dev_exit(idx);
    }

    if (IS_ENABLED(CONFIG_PERF_EVENTS))
        amdgpu_pmu_fini(adev);
    if (adev->mman.discovery_bin)
        amdgpu_discovery_fini(adev);

    amdgpu_reset_put_reset_domain(adev->reset_domain);
    adev->reset_domain = NULL;

    kfree(adev->pci_state);
}
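For anyone following along, the one line in that routine that is easy to try outside the kernel is the `(adev->pdev->class >> 8) == PCI_CLASS_DISPLAY_VGA` test. The sketch below is only an illustration of that check, not driver code: `is_vga_class` is a hypothetical helper name, and the constant is copied from the kernel's include/linux/pci_ids.h. It assumes the 24-bit class layout the kernel stores in `pdev->class` (base class, subclass, programming interface), so shifting right by 8 drops the programming-interface byte.

```c
#include <stdbool.h>
#include <stdint.h>

/* Same value as PCI_CLASS_DISPLAY_VGA in include/linux/pci_ids.h:
 * base class 0x03 (display controller), subclass 0x00 (VGA compatible). */
#define PCI_CLASS_DISPLAY_VGA 0x0300

/* Hypothetical helper (not part of amdgpu): mirrors the comparison made in
 * amdgpu_device_fini_sw(). class_code is the 24-bit PCI class value, laid
 * out as base class, subclass, programming interface. */
static inline bool is_vga_class(uint32_t class_code)
{
    /* Discard the low programming-interface byte, then compare the
     * remaining base class and subclass against the VGA constant. */
    return (class_code >> 8) == PCI_CLASS_DISPLAY_VGA;
}
```

The value to feed in can be read from /sys/bus/pci/devices/<address>/class on a running system; a VGA-compatible card typically reports 0x030000 there, for which the helper returns true. This only shows which devices the driver would treat as VGA clients; it does not by itself explain the panic.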