Random Boot Issues with Rocky Linux 8.9 (Continued) - Failure to mount LVM Root

Greetings everyone, I started a thread here a while back regarding a random boot issue I’m seeing with Rocky Linux 8.9 across a number of Dell Precision Workstation 3650 machines. The original thread with pictures and video of the failure: Random Boot Issues with Rocky Linux 8.9 (Timing related?)

I found the time to revisit this issue, and I’ve finally been successful in collecting a log during a failed boot. For those finding this on Google, here’s what I needed to do to successfully obtain the logs:

  • Modify /etc/systemd/journald.conf to contain the line ‘Storage=persistent’ rather than ‘Storage=auto’
  • Modify /etc/default/grub to contain ‘rd.debug rd.shell’ as part of the GRUB_CMDLINE_LINUX
  • Run ‘grub2-mkconfig -o /boot/efi/EFI/rocky/grub.cfg’

With these settings in place, I was finally able to collect boot logs during a failure. The failure to boot only occurs around 20% of the time in my testing. Works perfectly other times. This leads me to believe there is a timing issue of some sort at-play. The logs clearly show dracut-initqueue and dracut-lib failing to find the LVM root volume but I can’t seem to locate anything in the logs that shows an underlying root cause.

Relevant section of the logs (this repeats a lot through the log):

Jun 25 17:43:18 labhost01 dracut-initqueue[491]: /lib/dracut-lib.sh@444(check_finished): local f
Jun 25 17:43:18 labhost01 dracut-initqueue[491]: /lib/dracut-lib.sh@445(check_finished): for f in $hookdir/initqueue/finished/*.sh
Jun 25 17:43:18 labhost01 dracut-initqueue[491]: /lib/dracut-lib.sh@446(check_finished): '[' '/lib/dracut/hooks/initqueue/finished/devexists-\x2fdev\x2fmapper\x2fvg0-root.sh' = '/lib/dracut/hooks/initqueue/finished/*.sh' ']'
Jun 25 17:43:18 labhost01 dracut-initqueue[491]: /lib/dracut-lib.sh@447(check_finished): '[' -e '/lib/dracut/hooks/initqueue/finished/devexists-\x2fdev\x2fmapper\x2fvg0-root.sh' ']'
Jun 25 17:43:18 labhost01 dracut-initqueue[684]: /lib/dracut-lib.sh@447(check_finished): . '/lib/dracut/hooks/initqueue/finished/devexists-\x2fdev\x2fmapper\x2fvg0-root.sh'
Jun 25 17:43:18 labhost01 dracut-initqueue[684]: //lib/dracut/hooks/initqueue/finished/devexists-\x2fdev\x2fmapper\x2fvg0-root.sh@1(source): grep -q After=remote-fs-pre.target '/run/systemd/generator/systemd-cryptsetup@*.service'
Jun 25 17:43:18 labhost01 dracut-initqueue[684]: //lib/dracut/hooks/initqueue/finished/devexists-\x2fdev\x2fmapper\x2fvg0-root.sh@2(source): '[' -e /dev/mapper/vg0-root ']'
Jun 25 17:43:18 labhost01 dracut-initqueue[491]: /lib/dracut-lib.sh@447(check_finished): return 1
Jun 25 17:43:18 labhost01 dracut-initqueue[491]: /bin/dracut-initqueue@48(main): udevadm settle --timeout=0
Jun 25 17:43:18 labhost01 dracut-initqueue[491]: /bin/dracut-initqueue@51(main): sleep 0.5
Jun 25 17:43:18 labhost01 dracut-initqueue[491]: /bin/dracut-initqueue@53(main): for i in /run/systemd/ask-password/ask.*
Jun 25 17:43:18 labhost01 dracut-initqueue[491]: /bin/dracut-initqueue@54(main): '[' -e '/run/systemd/ask-password/ask.*' ']'
Jun 25 17:43:18 labhost01 dracut-initqueue[491]: /bin/dracut-initqueue@57(main): '[' 3 -gt 240 ']'
Jun 25 17:43:18 labhost01 dracut-initqueue[491]: /bin/dracut-initqueue@68(main): main_loop=4
Jun 25 17:43:18 labhost01 dracut-initqueue[491]: /bin/dracut-initqueue@69(main): '[' 4 -gt 360 ']'
Jun 25 17:43:18 labhost01 dracut-initqueue[491]: /bin/dracut-initqueue@22(main): :
Jun 25 17:43:18 labhost01 dracut-initqueue[491]: /bin/dracut-initqueue@24(main): check_finished
Jun 25 17:43:18 labhost01 dracut-initqueue[491]: /lib/dracut-lib.sh@444(check_finished): local f
Jun 25 17:43:18 labhost01 dracut-initqueue[491]: /lib/dracut-lib.sh@445(check_finished): for f in $hookdir/initqueue/finished/*.sh
Jun 25 17:43:18 labhost01 dracut-initqueue[491]: /lib/dracut-lib.sh@446(check_finished): '[' '/lib/dracut/hooks/initqueue/finished/devexists-\x2fdev\x2fmapper\x2fvg0-root.sh' = '/lib/dracut/hooks/initqueue/finished/*.sh' ']'
Jun 25 17:43:18 labhost01 dracut-initqueue[491]: /lib/dracut-lib.sh@447(check_finished): '[' -e '/lib/dracut/hooks/initqueue/finished/devexists-\x2fdev\x2fmapper\x2fvg0-root.sh' ']'
Jun 25 17:43:18 labhost01 dracut-initqueue[688]: /lib/dracut-lib.sh@447(check_finished): . '/lib/dracut/hooks/initqueue/finished/devexists-\x2fdev\x2fmapper\x2fvg0-root.sh'
Jun 25 17:43:18 labhost01 dracut-initqueue[688]: //lib/dracut/hooks/initqueue/finished/devexists-\x2fdev\x2fmapper\x2fvg0-root.sh@1(source): grep -q After=remote-fs-pre.target '/run/systemd/generator/systemd-cryptsetup@*.service'
Jun 25 17:43:18 labhost01 dracut-initqueue[688]: //lib/dracut/hooks/initqueue/finished/devexists-\x2fdev\x2fmapper\x2fvg0-root.sh@2(source): '[' -e /dev/mapper/vg0-root ']'
Jun 25 17:43:18 labhost01 dracut-initqueue[491]: /lib/dracut-lib.sh@447(check_finished): return 1
Jun 25 17:43:18 labhost01 dracut-initqueue[491]: /bin/dracut-initqueue@26(main): udevadm settle --exit-if-exists=/lib/dracut/hooks/initqueue/work
Jun 25 17:43:18 labhost01 dracut-initqueue[491]: /bin/dracut-initqueue@28(main): check_finished
Jun 25 17:43:18 labhost01 dracut-initqueue[491]: /lib/dracut-lib.sh@444(check_finished): local f
Jun 25 17:43:18 labhost01 dracut-initqueue[491]: /lib/dracut-lib.sh@445(check_finished): for f in $hookdir/initqueue/finished/*.sh
Jun 25 17:43:18 labhost01 dracut-initqueue[491]: /lib/dracut-lib.sh@446(check_finished): '[' '/lib/dracut/hooks/initqueue/finished/devexists-\x2fdev\x2fmapper\x2fvg0-root.sh' = '/lib/dracut/hooks/initqueue/finished/*.sh' ']'
Jun 25 17:43:18 labhost01 dracut-initqueue[491]: /lib/dracut-lib.sh@447(check_finished): '[' -e '/lib/dracut/hooks/initqueue/finished/devexists-\x2fdev\x2fmapper\x2fvg0-root.sh' ']'
Jun 25 17:43:18 labhost01 dracut-initqueue[691]: /lib/dracut-lib.sh@447(check_finished): . '/lib/dracut/hooks/initqueue/finished/devexists-\x2fdev\x2fmapper\x2fvg0-root.sh'
Jun 25 17:43:18 labhost01 dracut-initqueue[691]: //lib/dracut/hooks/initqueue/finished/devexists-\x2fdev\x2fmapper\x2fvg0-root.sh@1(source): grep -q After=remote-fs-pre.target '/run/systemd/generator/systemd-cryptsetup@*.service'
Jun 25 17:43:18 labhost01 dracut-initqueue[691]: //lib/dracut/hooks/initqueue/finished/devexists-\x2fdev\x2fmapper\x2fvg0-root.sh@2(source): '[' -e /dev/mapper/vg0-root ']'
Jun 25 17:43:18 labhost01 dracut-initqueue[491]: /lib/dracut-lib.sh@447(check_finished): return 1
Jun 25 17:43:18 labhost01 dracut-initqueue[491]: /bin/dracut-initqueue@30(main): '[' -f /lib/dracut/hooks/initqueue/work ']'
Jun 25 17:43:18 labhost01 dracut-initqueue[491]: /bin/dracut-initqueue@34(main): for job in $hookdir/initqueue/*.sh
Jun 25 17:43:18 labhost01 dracut-initqueue[491]: /bin/dracut-initqueue@35(main): '[' -e '/lib/dracut/hooks/initqueue/*.sh' ']'
Jun 25 17:43:18 labhost01 dracut-initqueue[491]: /bin/dracut-initqueue@35(main): break
Jun 25 17:43:18 labhost01 dracut-initqueue[491]: /bin/dracut-initqueue@40(main): udevadm settle --timeout=0

The device storing the root partition is an NVMe SSD, it initializes:

Jun 25 17:43:15 labhost01 kernel: nvme 0000:02:00.0: platform quirk: setting simple suspend
Jun 25 17:43:15 labhost01 kernel: nvme nvme0: pci function 0000:02:00.0
[...]
Jun 25 17:43:15 labhost01 kernel: nvme nvme0: IO queues not created
Jun 25 17:43:15 labhost01 systemd-udevd[507]: created empty file '/run/udev/data/c243:0' for '/devices/pci0000:00/0000:00:06.0/0000:02:00.0/nvme/nvme0'

I’m not seeing further NVMe hits in the logs. On a successful boot, there are plenty more entries for the PCI device and NVMe SSD. The “IO queues not created” error is a clue here, and does not show up in the logs from a successful boot.

In a successful boot (same machine), I see the following additional information in the logs:

Jul 01 11:34:49 labhost01 systemd-udevd[525]: created empty file '/run/udev/data/c242:0' for '/devices/pci0000:00/0000:00:06.0/0000:02:00.0/nvme/nvme0/ng0n1'
Jul 01 11:34:49 labhost01 systemd-udevd[532]: created db file '/run/udev/data/b259:0' for '/devices/pci0000:00/0000:00:06.0/0000:02:00.0/nvme/nvme0/nvme0n1'
Jul 01 11:34:49 labhost01 systemd-udevd[532]: created db file '/run/udev/data/b259:0' for '/devices/pci0000:00/0000:00:06.0/0000:02:00.0/nvme/nvme0/nvme0n1'
Jul 01 11:34:49 labhost01 systemd-udevd[540]: created db file '/run/udev/data/b259:3' for '/devices/pci0000:00/0000:00:06.0/0000:02:00.0/nvme/nvme0/nvme0n1/nvme0n1p3'
Jul 01 11:34:49 labhost01 systemd[1]: dev-disk-by\x2dpath-pci\x2d0000:02:00.0\x2dnvme\x2d1.device: Changed dead -> plugged
Jul 01 11:34:49 labhost01 systemd[1]: dev-disk-by\x2did-nvme\x2deui.00000000000000008ce38e0500b6c117.device: Changed dead -> plugged
Jul 01 11:34:49 labhost01 systemd[1]: dev-disk-by\x2did-nvme\x2dKXG70ZNV1T02_NVMe_KIOXIA_1024GB_42MFC96QFTC5.device: Changed dead -> plugged
Jul 01 11:34:49 labhost01 systemd[1]: dev-nvme0n1.device: Changed dead -> plugged
Jul 01 11:34:49 labhost01 systemd[1]: sys-devices-pci0000:00-0000:00:06.0-0000:02:00.0-nvme-nvme0-nvme0n1.device: Changed dead -> plugged

(Removed some ‘successfully forked off’ items from the above log snippet.)

In Googling around for the ‘IO Queues not created’ error, I found some references to issues with specific models of SSD. The ones in these machines are Kioxia KXG70ZNV1T02, supplied by Dell. I don’t see anything relating to these SSDs and Linux boot issues on a brief Google journey. I did just notice in typing this post that there is a more recent firmware version for the SSD, version 1080.4106 vs 1080.4105 these are running. I don’t see any release notes, I’ll try applying the update and see if it fixes anything. Will update this post.

In the interim, does anyone have any information or suggested next steps that will help troubleshoot this issue? It’s been driving me mad. I don’t see any reason for the failure between the last entries in the logs relating to the NVMe device.

I will also try to upload full logs of the failure and link them here, PasteBin doesn’t like them (too large).

Thanks!

You can use https://rpa.st/ instead of pastebin.

Removed a ton of spam similar to the following from the logs:

Jun 25 17:43:15 labhost01 systemd-udevd[468]: passed 187 byte device to netlink monitor 0x5595b4e14660
Jun 25 17:43:15 labhost01 systemd-udevd[523]: passed 212 byte device to netlink monitor 0x5595b4e524d0
Jun 25 17:43:15 labhost01 systemd-udevd[519]: passed 212 byte device to netlink monitor 0x5595b4e4ce10
Jun 25 17:43:15 labhost01 systemd-udevd[520]: passed 189 byte device to netlink monitor 0x5595b4e4db30
Jun 25 17:43:15 labhost01 systemd-udevd[521]: passed 156 byte device to netlink monitor 0x5595b4e4e850
Jun 25 17:43:15 labhost01 systemd-udevd[524]: passed 156 byte device to netlink monitor 0x5595b4e55400

This allowed me to upload the full logs here → -- Logs begin at Tue 2024-06-25 17:43:15 UTC, end at Tue 2024-06-25 18:04:17 UTC - Pastebin.com

Oh Ok, thanks. Here’s that link: View paste 36GA

Any ideas on what to try next? Does anyone see any clues in the logs provided?

Thanks in advance!