Boot failed after upgrade from Rocky 9.4 to 9.5

Last weekend I ran dnf update on four VMs and their host, all of which were on Rocky 9.4.

I first upgraded two of the VMs, verified that both booted successfully, and confirmed they worked fine. Then I upgraded the other two VMs and the host.

As usual, I shut down all of the VMs before rebooting the host.

The host never came back up. Did I mention that the hardware is colocated 40 miles away?

I was not able to ssh into or ping the host, nor, of course, any of the VMs.

I tried power cycling the host, just in case that might fix the issue. It didn’t.

On site I found that the host had failed to boot and dropped into the emergency shell. I rebooted it and noted that two LVM volumes were not detected. Two others were detected and mounted, as was the software RAID array. LVM is used on a 1TB SSD (nvme0n1).

The /var and /home volumes were not recognized by the startup process, nor could I find them after logging into the maintenance shell.

I assumed the SSD had failed, though I couldn’t see how that was possible, since some of the volumes seemed to work fine.

So I then tried booting from the previous kernel, and lo and behold, the host and all VMs booted. That of course means there is nothing wrong with the SSD.

I’ve removed the “bad” kernel, 5.14.0-503, so now the host boots correctly to 5.14.0-427, without intervention. When I removed the kernel and its associated files, dnf also uninstalled kmod-kvdo and vdo as dependencies. I don’t know if that is related, but I’d never seen those packages before and thought I’d mention them.
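
If it helps with the diagnosis, dnf should still have a record of that removal transaction, so I can confirm exactly what was pulled out alongside the kernel. I’m thinking of something along these lines (the transaction ID below is only a placeholder I’d read off the list):

dnf history list | head    # find the ID of the kernel-removal transaction
dnf history info 42        # placeholder ID; lists every package removed as a dependency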

Where do I look to help determine what happened? Did this, or something like it, happen to anyone else?

I don’t know enough yet to create a bug report, but I will definitely do that once I know more.

What is the output of this command?

sudo grubby --info=DEFAULT

Did prior kernel updates go without issue, or was 9.4 the base install, never updated until now?

Also, if you have any added kernel parameters, you have to run this command before booting into the new kernel:

grub2-mkconfig --update-bls-cmdline -o /boot/grub2/grub.cfg
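
If you want to check what the boot entries actually contain before and after running it, the per-kernel arguments live in the BLS snippets under /boot/loader/entries on EL9; something like this shows them at a glance:

grep -H ^options /boot/loader/entries/*.conf    # one options line per installed kernel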

This post here on the forum may be of interest:
https://forums.rockylinux.org/t/rocky-linux-9-5-sometimes-boot-into-emergency-mode-for-no-reason/16655

There is no output from

grubby --info=DEFAULT

when run from a root shell. I don’t normally use sudo.

The host started with a fresh install of Rocky 9.1 and was updated every couple of weeks until it got to 9.4. For each update the host booted as expected after running reboot in a root shell.

I waited three or four weeks after the last update at Rocky 9.4 to see if there were any issues with the upgrade to 9.5. Then I first updated and rebooted two of the VMs to verify all was well. But when I took the host server to Rocky 9.5, the reboot failed as described in the original post…

As far as I know there are no added kernel parameters; certainly none that I added.

The post you mentioned was “close”, but not close enough. This host is hardware colocated at a data center. Its main job is to host four VMs; it also serves as our primary DNS server and holds our backup image archives.

Because the data center where the host resides is 40+ miles away, and because the host carries virtually all of our internet-facing infrastructure, it will be difficult to debug simply by adjusting boot parameters and rebooting to see whether it comes back. If that’s what it takes I’ll do it, but only after I have some clues on what to look for when it fails to boot.

Are there any logs remaining, now that I’ve removed the “bad” kernel, that could shed some light on what exactly happened during the attempt to boot it?

Can you suggest how I can determine why the “bad” kernel didn’t see two of the LVM volumes that the previous kernel has no problem seeing, even after multiple reboots? Did it have anything to do with vdo?

I need a way to find out whether a new kernel will boot before I actually reboot after a kernel update. Any suggestions on that?

Is this something specific to Rocky 9.5, or to the kernel? I suspect it’s not about Rocky 9.5, since simply booting the previous kernel while the system is fully updated to Rocky 9.5 works as expected.

Did you mean that I should run

grub2-mkconfig --update-bls-cmdline -o /boot/grub2/grub.cfg

after the update and before rebooting?

Thread starter from the linked thread here. I guess it’s closer than you think, because what I did not mention in that post was that the VM started on Rocky 9.4 and successfully upgraded to 9.5 without any issues. The problems only appeared after a 9.5 kernel update (on 9.5) was installed.

You should be able to list old boot logs via:

journalctl --list-boots

Yes, but if you have no added kernel parameters then there’s no need to run that command, especially since this all worked fine without doing so on all the previous kernel updates.

Try grubby --info=ALL

The other suggestion I have is to read the release notes for the Rocky 9.5 update, which can be found in the documentation drop-down list.

All four of the VMs I upgraded from 9.4 to 9.5 booted fine immediately after each one’s upgrade. It was only the host (hardware) that failed to reboot.

From what you wrote, had I upgraded before the 503 kernel was released, I might not have experienced the boot failure then, but would have eventually, when the 503 kernel was released and deployed on my hardware.

Which tells me the issue has something to do with the 503 kernel.

When I run your suggested journalctl --list-boots command on the currently running server I get:

IDX BOOT ID FIRST ENTRY LAST ENTRY
0 02bb85b853d34f7da3ff5ffd67d81fa9 Sun 2024-12-22 15:56:01 PST Fri 2024-12-27 13:14:19 PST

Unfortunately the failed boot happened on the 21st, so the earliest listed is when I successfully booted the 427 kernel the next day. Indeed, doing journalctl --boot=0 shows me the boot log from the 427 kernel boot.

I guess that means I cannot see any boot logs from when the 503 kernel failed to boot. At least not via journalctl.

I’ll consider reinstalling the 503 kernel while on site, then look to capture the boot log when/if it fails to boot. If it does fail, and I expect it will, I’ll remove that kernel again and keep looking for a way to discover why.
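
If I do that, I’m also considering keeping 427 as the default and booting the reinstalled kernel only once, so a power cycle falls back to a kernel that works. My understanding is that grub2-reboot does this when /etc/default/grub has GRUB_DEFAULT=saved (which I’d verify first); the entry number below is a placeholder I’d take from the grubby output:

grubby --info=ALL | grep -E 'index|title'    # find the index of the 503 entry
grub2-reboot 1                               # placeholder index; boots that entry once, then reverts to the default
reboot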

Running grubby --info=ALL lists only the two 427 kernels and the rescue kernel, which I would expect, as I removed the 503 kernel.

I will reinstall the 503 kernel and run the grubby info command again, then remove it again until I figure out why it fails. I’ll share the before and after grubby results if it looks like that info may shed some light.

I didn’t see anything that looks related to my issue in the release notes, but thanks for the suggestion.

I see. I have the exact same problem. My VM only lists the current boot.

I’ve applied the solution from here: kernel - Why does `journalctl --list-boots` only show the current boot? - Ask Ubuntu

Yes, I’m well aware that’s a seven-year-old post and a solution for Ubuntu, but it works.
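
For reference, as I understand it the core of that fix is simply making the journal persistent so logs from earlier boots are kept on disk; on Rocky it should amount to roughly this:

mkdir -p /var/log/journal          # with the default Storage=auto, journald logs here once the directory exists
systemctl restart systemd-journald # or set Storage=persistent in /etc/systemd/journald.conf
journalctl --list-boots            # older boots accumulate from this point on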

Here, there are nine posts and we still don’t know anything more about your system setup than the first post, which is about nil. How is your system partitioned? How do you do your updates, via ssh and the command line or via the Cockpit web console? And what is on your kernel command line?
Posting output from forensic commands in code blocks (the </> button above) is what helps us understand. So post the output of these commands:

grubby --info=ALL

lsblk -o name,fstype,uuid,mountpoint

less /etc/fstab

What VM tool are you using?

Another thought that came to me: maybe the problem kernel did not complete its install. I’ve had that happen using the Cockpit interface; it does take a minute or two for the kernel scripts to run.
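
One rough way to rule that out after a future kernel update, before rebooting, is to check that the package verifies and that the generated pieces actually landed in /boot (the version string below is only an example for whatever build gets installed):

rpm -V kernel-core-5.14.0-503.XX.1.el9_5                          # example NVR; no output means the installed files verify
ls -l /boot/vmlinuz-5.14.0-503* /boot/initramfs-5.14.0-503*.img   # the initramfs is generated by the install scripts
ls /boot/loader/entries/                                          # there should be a BLS entry for the new kernel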

After removing the 503 (bad) kernel:
grubby --info=ALL

index=0
kernel="/boot/vmlinuz-5.14.0-427.37.1.el9_4.x86_64"
args="ro crashkernel=1G-4G:192M,4G-64G:256M,64G-:512M resume=/dev/mapper/cl-swap rd.lvm.lv=cl/root rd.lvm.lv=cl/swap rhgb quiet $tuned_params"
root="/dev/mapper/cl-root"
initrd="/boot/initramfs-5.14.0-427.37.1.el9_4.x86_64.img $tuned_initrd"
title="Rocky Linux (5.14.0-427.37.1.el9_4.x86_64) 9.4 (Blue Onyx)"
id="4a3876baf4f14bcebefd80a1cda9197f-5.14.0-427.37.1.el9_4.x86_64"
index=1
kernel="/boot/vmlinuz-5.14.0-427.20.1.el9_4.x86_64"
args="ro crashkernel=1G-4G:192M,4G-64G:256M,64G-:512M resume=/dev/mapper/cl-swap rd.lvm.lv=cl/root rd.lvm.lv=cl/swap rhgb quiet $tuned_params"
root="/dev/mapper/cl-root"
initrd="/boot/initramfs-5.14.0-427.20.1.el9_4.x86_64.img $tuned_initrd"
title="Rocky Linux (5.14.0-427.20.1.el9_4.x86_64) 9.4 (Blue Onyx)"
id="4a3876baf4f14bcebefd80a1cda9197f-5.14.0-427.20.1.el9_4.x86_64"
index=2
kernel="/boot/vmlinuz-0-rescue-4a3876baf4f14bcebefd80a1cda9197f"
args="ro crashkernel=1G-4G:192M,4G-64G:256M,64G-:512M resume=/dev/mapper/cl-swap rd.lvm.lv=cl/root rd.lvm.lv=cl/swap rhgb quiet"
root="/dev/mapper/cl-root"
initrd="/boot/initramfs-0-rescue-4a3876baf4f14bcebefd80a1cda9197f.img"
title="Rocky Linux (0-rescue-4a3876baf4f14bcebefd80a1cda9197f) 9.3 (Blue Onyx)"
id="4a3876baf4f14bcebefd80a1cda9197f-0-rescue"

lsblk -o name,fstype,uuid,mountpoint:

NAME                  FSTYPE            UUID                                   MOUNTPOINT
sda                   linux_raid_member a4a8a4d4-9a6f-bea3-d723-26db50f64dc1
└─md127               LVM2_member       HsyG0X-cvYP-ztJS-gSbm-mTUe-4qId-bKchcq
  ├─store-backup      xfs               739429c4-7409-49fb-bef9-9ea0afdf4b32   /backup
  └─store-virt_images xfs               b14df60f-69eb-4a24-8a1b-2428254dc443   /var/lib/libvirt/images
sdb                   linux_raid_member a4a8a4d4-9a6f-bea3-d723-26db50f64dc1
└─md127               LVM2_member       HsyG0X-cvYP-ztJS-gSbm-mTUe-4qId-bKchcq
  ├─store-backup      xfs               739429c4-7409-49fb-bef9-9ea0afdf4b32   /backup
  └─store-virt_images xfs               b14df60f-69eb-4a24-8a1b-2428254dc443   /var/lib/libvirt/images
sdc                   linux_raid_member a4a8a4d4-9a6f-bea3-d723-26db50f64dc1
└─md127               LVM2_member       HsyG0X-cvYP-ztJS-gSbm-mTUe-4qId-bKchcq
  ├─store-backup      xfs               739429c4-7409-49fb-bef9-9ea0afdf4b32   /backup
  └─store-virt_images xfs               b14df60f-69eb-4a24-8a1b-2428254dc443   /var/lib/libvirt/images
sdd                   linux_raid_member a4a8a4d4-9a6f-bea3-d723-26db50f64dc1
└─md127               LVM2_member       HsyG0X-cvYP-ztJS-gSbm-mTUe-4qId-bKchcq
  ├─store-backup      xfs               739429c4-7409-49fb-bef9-9ea0afdf4b32   /backup
  └─store-virt_images xfs               b14df60f-69eb-4a24-8a1b-2428254dc443   /var/lib/libvirt/images
nvme0n1
├─nvme0n1p1           vfat              31EB-D37E                              /boot/efi
├─nvme0n1p2           ext4              82c2eef3-5fdd-453d-ab95-dce74c9ea2f4   /boot
└─nvme0n1p3           LVM2_member       5jCBd9-U14w-mg2R-opiF-nP8T-Iozw-dKpt6b
  ├─cl-root           xfs               98f87ea1-ae61-45b9-acb3-4550aac94473   /
  ├─cl-swap           swap              f9acb95f-40dd-4ab7-990d-33bd9795da7f   [SWAP]
  ├─cl-home           xfs               201060b7-de95-4903-9d2b-25077f2ac9a6   /home
  └─cl-var            xfs               139f6e5c-5dd2-4804-a908-5ebc1e1d7264   /var

less /etc/fstab

/dev/mapper/cl-root / xfs defaults 0 0
/dev/mapper/store-backup /backup xfs defaults 0 0
UUID=82c2eef3-5fdd-453d-ab95-dce74c9ea2f4 /boot ext4 defaults 1 2
UUID=31EB-D37E /boot/efi vfat umask=0077,shortname=winnt 0 2
/dev/mapper/cl-home /home xfs defaults 0 0
/dev/mapper/cl-var /var xfs defaults 0 0
/dev/mapper/store-virt_images /var/lib/libvirt/images xfs defaults 0 0
/dev/mapper/cl-swap none swap defaults 0 0

kinfo output from a VNC connection:

Operating System: Rocky Linux 9.5
KDE Plasma Version: 5.27.11
KDE Frameworks Version: 5.115.0
Qt Version: 5.15.9
Kernel Version: 5.14.0-427.37.1.el9_4.x86_64 (64-bit)
Graphics Platform: offscreen
Processors: 32 × AMD EPYC 7302P 16-Core Processor
Memory: 125.2 GiB of RAM
Graphics Processor: llvmpipe

pvscan
PV /dev/md127 VG store lvm2 [<1.82 TiB / 0 free]
PV /dev/nvme0n1p3 VG cl lvm2 [929.92 GiB / 651.29 GiB free]
Total: 2 [<2.73 TiB] / in use: 2 [<2.73 TiB] / in no VG: 0 [0 ]

lvscan
ACTIVE '/dev/store/backup' [1.33 TiB] inherit
ACTIVE '/dev/store/virt_images' [500.00 GiB] inherit
ACTIVE '/dev/cl/root' [40.00 GiB] inherit
ACTIVE '/dev/cl/home' [20.00 GiB] inherit
ACTIVE '/dev/cl/var' [200.00 GiB] inherit
ACTIVE '/dev/cl/swap' [<18.63 GiB] inherit

--------------------
I seem to remember that when I ran pvscan from the maintenance shell after the boot failed on the 503 kernel, the second line didn’t exist. However, lvscan listed both the swap and root volumes, but not the var and home volumes.

I used the command “dnf update” from an SSH remote shell to update to Rocky 9.5. The update completed without errors, just as it had on the four VMs that machine hosts.

cat /proc/cmdline

BOOT_IMAGE=(hd4,gpt2)/vmlinuz-5.14.0-427.37.1.el9_4.x86_64 root=/dev/mapper/cl-root ro crashkernel=1G-4G:192M,4G-64G:256M,64G-:512M resume=/dev/mapper/cl-swap rd.lvm.lv=cl/root rd.lvm.lv=cl/swap rhgb quiet

Unfortunately I didn’t know to record the kernel command line before removing the 503 kernel and rebooting.

From your original post:

On site I found that the host had failed to boot and dropped into the emergency shell. I rebooted it and noted that two LVM volumes were not detected. Two others were detected and mounted, as was the software RAID array. LVM is used on a 1TB SSD (nvme0n1).

I can only guess that for some unexplained reason the root= of your command line did not get written to the new kernel entry.

One way to ensure that this is written correctly is to create the /etc/kernel/cmdline file via the method I suggested earlier in the thread using grub2-mkconfig.
The only way to confirm that everything is working on the new kernel install is to be there at the server. I would not install the new kernel remotely until sure it booted correctly.
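
If the 503 kernel does get reinstalled, one pre-reboot comparison worth making, since grubby accepts a kernel path directly, would be something like this (the path is an example for whatever 503 build lands):

grubby --info=/boot/vmlinuz-5.14.0-503.XX.1.el9_5.x86_64   # example path; check that args= and root= look right
cat /proc/cmdline                                          # what the working 427 kernel actually booted with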

The root volume was available in the maintenance shell. That and swap were both found and mounted. Ugh.

I did test the upgrade in each VM, all without a problem. Except for the upgrade to 9.5 on the host (hardware), all upgrades to my Rocky instances have gone flawlessly. So I was quite surprised when this one failed.

I plan to do what you suggest and do the next update on that machine while on site. Then, before rebooting, I’ll do all the checks suggested here, and of course run the grub2-mkconfig command you suggested.
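
One extra check I want to add, since the missing volumes make me suspect the initramfs: list what the new kernel’s initramfs actually contains and confirm the LVM (and vdo, given the packages that rode along) pieces are present. The image name below is only a placeholder for whatever version gets installed:

lsinitrd /boot/initramfs-5.14.0-503.XX.1.el9_5.x86_64.img | grep -iE 'lvm|device-mapper'   # placeholder image name
lsinitrd /boot/initramfs-5.14.0-503.XX.1.el9_5.x86_64.img | grep -i vdo                    # curiosity, given kmod-kvdo/vdo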

My plan is to collect all the info I can from the maintenance shell if it fails again, before removing the “bad” kernel and rebooting.
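
Roughly what I intend to capture from that shell, redirected somewhere that is actually mounted since /var may not be (assuming / is mounted read-write and journald is running there):

pvs -v  > /root/failed-boot-lvm.txt 2>&1       # physical volumes as the failed kernel sees them
vgs -v >> /root/failed-boot-lvm.txt 2>&1
lvs -a >> /root/failed-boot-lvm.txt 2>&1
vgchange -ay                                   # see whether the missing volumes can be activated by hand
dmesg         > /root/failed-boot-dmesg.txt
journalctl -b > /root/failed-boot-journal.txt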

I think it likely that this was an anomaly and all will be well next time. From now on I will do updates to the host only when I can get access to the data center, probably after making an appointment with the tech team there so they are ready to help.

Thanks for all the comments and suggestions.

Good luck, Emmett. Sounds like some bad luck. I’ve upgraded CentOS and Rocky Linux dozens of times just by updating the yum repos and running yum/dnf update without a hitch. I hope you figure it out.