Random Boot Issues with Rocky Linux 8.9 (Timing related?)

Hello everyone!

I’m running into an issue that’s been driving me mad. I upgraded a fleet of similar workstation machines from Rocky Linux 8.7 to Rocky Linux 8.9 using the standard Yum process. After the upgrade, the machines will now randomly fail to boot. The boot failure shows ‘dracut-initqueue[pid]: Warning: dracut-initqueue timeout - starting timeout scripts’.

Every machine I’m having this issue on is the same hardware and software configuration (scripted via Kickstart/Ansible). The systems in-question are Dell Precision Workstation 3650 Towers with an NVMe boot SSD and a spinning data HDD. The NVMe is configured via LVM and has multiple partitions.

A reboot or two/three will eventually get the machine to boot properly. Frustratingly, as I’ve attempted to collect more information about what is failing, the addition of rd.shell rd.debug and removal of rhgb and quiet from boot parameters seem to completely fix the issue. This (and the random nature of the failure/success) makes me think this is some form of timing issue, but that’s a complete guess. I’ve also tried combinations of the rd.shell/rd.debug and removal of rhgb and quiet parameters. I have not been able to draw a correlation of exactly what combination fixes the issue, but I’ve also been thus far unable to collect additional details.

I’m also unable to collect serial debug information from Dracut as the machine lacks a serial port (and I’ve not yet found a work-around to this, since we are low-level, I assume a USB converter won’t work, do PCI Serial cards work?)

It seems to be ultimately failing on mounting /dev/mapper/vg0-root. Here’s a screenshot from a failed boot attempt:

I also tried modifying journald.conf to have Storage=persistent which allowed me to capture a failed boot and view it with journalctl -b, but the information collected was very similar to the last screenshot posted below, and looking in the logs previous to the reported failures did not yield any clues. It seems (unfortunately) that the cleanup tasks deleted this log since that time. I’ll work to capture another and post a follow-up.

Since I’m new here, I can’t post multiple pictures, but I’ll attempt to follow-up with a few more screenshots.

Any thoughts on anything else I can do to continue troubleshooting, or how do I collect more data?

Thanks!!

And a zoomed-in version of the bottom part of the screen:

While dracut is showing the initqueue-timeout, the following section(s) loop in the console:

After you got it to boot, were you able to scroll backward in /var/log/messages and see the entries from the previous boot, where it failed?