Can install 9.0 but not 9.1 from USB drive

Can you borrow a graphics card or use onboard graphics to see if it makes any difference? I notice you have CSM enabled, and a number of legacy settings, which I think will become unsustainable as time goes on. I don’t know if you’ve applied UEFI firmware to the GOP, but it doesn’t say so. Be careful if you disable CSM, because it could cause a black screen (as in not able to see the BIOS at all).

In general, from RHEL 8.0 onwards, you are expected to use UEFI, GPT partitions and Secure Boot.

Thanks for the suggestion. inst.test and test did not solve the problem.

I noted earlier that I already have a running 9.1 installation upgraded from 9.0 on the box.

Since I’m planning to use this machine as a VM host for all my work, I want to make sure the hardware is good.

That said, I may resort to the 9.0 to 9.1 upgrade route if all else fails.

Ahem.

text != test

You might want to try that again.

My Ryzen processor does not have built-in graphics.

I do have a Quadro P620 but I have not been able to make it work as the console device at bootup.

I tried using UEFI settings for everything earlier without success.

I’m not sure what “applied UEFI firmware to the GOP” means.

Thanks!

Sorry, I did try text, not test :slight_smile:

Post mortem:

After a fresh 9.0 install and upgrade to 9.1, the panic stop is still there. So it seems like the video card which was supported in 9.0 is no longer supported in 9.1.

When people purchase a RHEL subscription, they expect the major release to run from start to end (a decade). If Red Hat dropped hardware support in a point update, they would break that expectation. “No longer supported” is thus unthinkable in Enterprise Linux.

An error in the (Rocky) build or a regression (introduced by Red Hat) is a much more likely explanation.

The challenge is in how to diagnose the root cause.

This is interesting. People were saying just install 9.0 and then upgrade to 9.1 and everything will be fine, but it didn’t make sense to me; it would imply that the boot process of 9.1 (after an upgrade) is different to the boot process from a boot device such as USB.

We don’t know it’s the video card for sure, but we need to rule it out.

I’m surprised this card works in 9.0 with the “Compatibility Support Module” disabled; it seems impossible.

Did RH deliberately drop support between 9.0 and 9.1? Maybe not. Did they do it accidentally? Maybe. The release notes are not as concise as they should be.

From the panic crash screenshot, I was able to narrow down the location of the unexpected exception that causes the crash. It happens in the amdgpu driver, in a C function named “amdgpu_device_fini_sw”.

Googling the name of the routine turned up some interesting recent changes to the driver.

The code follows. Note that the routine seems to take a pointer to the device data structure as an argument and it appears to be “resetting” the device.

I have not worked with GPU drivers, so I don’t know how to debug this issue, especially when it occurs at boot time :frowning:

void amdgpu_device_fini_sw(struct amdgpu_device *adev)
{
	int idx;

	amdgpu_fence_driver_sw_fini(adev);
	amdgpu_device_ip_fini(adev);
	release_firmware(adev->firmware.gpu_info_fw);
	adev->firmware.gpu_info_fw = NULL;
	adev->accel_working = false;
	dma_fence_put(rcu_dereference_protected(adev->gang_submit, true));

	amdgpu_reset_fini(adev);

	/* free i2c buses */
	if (!amdgpu_device_has_dc_support(adev))
		amdgpu_i2c_fini(adev);

	if (amdgpu_emu_mode != 1)
		amdgpu_atombios_fini(adev);

	kfree(adev->bios);
	adev->bios = NULL;
	if (amdgpu_device_supports_px(adev_to_drm(adev))) {
		vga_switcheroo_unregister_client(adev->pdev);
		vga_switcheroo_fini_domain_pm_ops(adev->dev);
	}
	if ((adev->pdev->class >> 8) == PCI_CLASS_DISPLAY_VGA)
		vga_client_unregister(adev->pdev);

	if (drm_dev_enter(adev_to_drm(adev), &idx)) {

		iounmap(adev->rmmio);
		adev->rmmio = NULL;
		amdgpu_device_doorbell_fini(adev);
		drm_dev_exit(idx);
	}

	if (IS_ENABLED(CONFIG_PERF_EVENTS))
		amdgpu_pmu_fini(adev);
	if (adev->mman.discovery_bin)
		amdgpu_discovery_fini(adev);

	amdgpu_reset_put_reset_domain(adev->reset_domain);
	adev->reset_domain = NULL;

	kfree(adev->pci_state);

}

There have been some sporadic issues with AMD GPUs needing kernel parameters added, such as iommu=soft or iommu=pt. There are other iommu settings that might be tried, but those are the two that have come up most often recently.

Thanks! I will give those options a try.

There have been some big changes in the AMD graphics stack starting with 9.1, which most people will want, but it’s possible they have not taken older cards into account. Windows 11 has similar changes and the card (Cedar GL) is unlikely to work.

Tried the two options. No luck.

I’m following your post “Dnf upgrade from 9.0 to 9.1 failed” under General. Interesting.

Yes, at first I thought it might be related, as I’m using AMD graphics, but I don’t actually know what caused the upgrade of the kernel to fail, nor why it was fixed by re-installing the failed kernel.

Are you using a discrete video card? If so, which one?

I’m still getting the same kernel panic after I followed the 9.1 installation method discussed in “Dnf upgrade from 9.0 to 9.1 failed”.

Solved

I realized that, since I only need the graphics card for text mode console output, I don’t really need the amdgpu driver, and I could blacklist it so it is never loaded at boot time.

I would like to thank all who participated in this discussion.

Checking that the solution works

I booted the problematic 9.1 entry with the kernel option “modprobe.blacklist=amdgpu” added, and it worked.
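To confirm that the option really reached the kernel after such a test boot, the running command line in /proc/cmdline can be inspected. The following is only a minimal sketch: the helper name `cmdline_blacklists` is made up for illustration, and it only understands the `modprobe.blacklist=` form with a comma-separated module list.

```shell
# Hypothetical helper (not part of any package): succeeds if module $2 is
# listed in a modprobe.blacklist= option inside the command line string $1.
cmdline_blacklists() {
    # $1 is left unquoted on purpose so it is split into individual options.
    for word in $1; do
        case $word in
            modprobe.blacklist=*)
                list=${word#modprobe.blacklist=}
                # Wrap in commas so only whole module names match.
                case ",$list," in
                    *",$2,"*) return 0 ;;
                esac
                ;;
        esac
    done
    return 1
}

# Usage on a running system:
#   cmdline_blacklists "$(cat /proc/cmdline)" amdgpu && echo "amdgpu is blacklisted"
```

`lsmod | grep amdgpu` printing nothing is a second, independent check that the module was not loaded.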

Making the change permanent

Jailed amdgpu:

# echo "blacklist amdgpu" >> /etc/modprobe.d/blacklist.conf
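One small caveat with `>>` is that running the command twice leaves a duplicate line in the file. A possible guard is sketched below; the function name `blacklist_module` is hypothetical, and the file path is simply the one used above.

```shell
# Hypothetical helper: append "blacklist <module>" to a modprobe.d file
# only if that exact line is not already present.
#   $1 = module name, $2 = configuration file
blacklist_module() {
    grep -qxF "blacklist $1" "$2" 2>/dev/null || echo "blacklist $1" >> "$2"
}

# Usage (as root):
#   blacklist_module amdgpu /etc/modprobe.d/blacklist.conf
```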

This change must be propagated to the correct initramfs .img file in /boot which, in this case, is the RL 9.1 image

/boot/initramfs-5.14.0-162.6.1.el9_1.0.1.x86_64.img

For safety I backed it up

# cp -p /boot/initramfs-5.14.0-162.6.1.el9_1.0.1.x86_64.img \
        /boot/initramfs-5.14.0-162.6.1.el9_1.0.1.x86_64.img.BAK

At this point I rebuilt the image

# dracut -f /boot/initramfs-5.14.0-162.6.1.el9_1.0.1.x86_64.img \
                            5.14.0-162.6.1.el9_1.0.1.x86_64
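The backup and rebuild steps above repeat the long kernel version several times, which is easy to mistype. A sketch of the same steps with the version held in a variable follows; `initramfs_path` is a made-up helper, and it assumes the usual /boot/initramfs-<version>.img naming seen in this thread.

```shell
# Hypothetical helper: print the conventional initramfs path for a kernel version.
initramfs_path() {
    printf '/boot/initramfs-%s.img\n' "$1"
}

# Usage (as root), here for the currently running kernel; substitute another
# version string if the image to rebuild belongs to a different kernel:
#   kver=$(uname -r)
#   cp -p "$(initramfs_path "$kver")" "$(initramfs_path "$kver").BAK"
#   dracut -f "$(initramfs_path "$kver")" "$kver"
```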

RL9.1 now boots fine!

PS

It is not necessary to go the “RL9.0 upgrade to 9.1” route. Just add the “modprobe.blacklist=amdgpu” option when booting the RL9.1 USB ISO and simply blacklist the amdgpu driver as shown previously without worrying about recreating the initramfs.

Rocky includes two AMD drivers in the distribution: “amdgpu” and “radeon”. When amdgpu is blacklisted, the radeon driver is used.

PS #2

To verify which video drivers are actually used in my machine (which also includes an Nvidia card) I used the lshw command:

# lshw -c video
  *-display
       description: VGA compatible controller
       product: GP107GL [Quadro P620]
       vendor: NVIDIA Corporation
       physical id: 0
       bus info: pci@0000:04:00.0
       version: a1
       width: 64 bits
       clock: 33MHz
       capabilities: pm msi pciexpress vga_controller bus_master cap_list rom
       configuration: driver=nouveau latency=0
       resources: irq:103 memory:fb000000-fbffffff memory:d0000000-dfffffff memory:e0000000-e1ffffff ioport:e000(size=128) memory:fc000000-fc07ffff
  *-display:0
       description: VGA compatible controller
       product: RV370 [Radeon X300]
       vendor: Advanced Micro Devices, Inc. [AMD/ATI]
       physical id: 0
       bus info: pci@0000:09:00.0
       logical name: /dev/fb0
       version: 00
       width: 32 bits
       clock: 33MHz
       capabilities: pm pciexpress msi vga_controller bus_master cap_list rom fb
       configuration: depth=32 driver=radeon latency=0 resolution=1920,1080
       resources: irq:102 memory:e8000000-efffffff ioport:f000(size=256) memory:fcc30000-fcc3ffff memory:c0000-dffff
  *-display:1 UNCLAIMED
       description: Display controller
       product: RV370 [Radeon X300 SE]
       vendor: Advanced Micro Devices, Inc. [AMD/ATI]
       physical id: 0.1
       bus info: pci@0000:09:00.1
       version: 00
       width: 32 bits
       clock: 33MHz
       capabilities: pm pciexpress cap_list
       configuration: latency=0
       resources: memory:fcc20000-fcc2ffff
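The lshw output above is long when all that matters is which driver is bound to which card. A rough way to condense it is sketched below; `lshw_drivers` is a hypothetical helper, and it assumes the `product:` and `configuration: driver=...` layout shown in this listing.

```shell
# Hypothetical helper: read `lshw -c video` output on stdin and print
# "product -> driver" per display section ("none" if no driver is bound).
lshw_drivers() {
    awk '
        function emit() {
            if (product != "")
                print product " -> " (driver != "" ? driver : "none")
        }
        # A new "*-display" header closes the previous section.
        /^ *\*-display/ { emit(); product = ""; driver = "" }
        /^ *product: /  { sub(/^ *product: /, ""); product = $0 }
        /^ *configuration: / {
            for (i = 1; i <= NF; i++)
                if ($i ~ /^driver=/) driver = substr($i, 8)
        }
        END { emit() }
    '
}

# Usage (as root):
#   lshw -c video | lshw_drivers
```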

I don’t remember you saying anything about an NVidia card in the previous posts?

I mentioned it back in #15. I unplugged it for most tests.