Rocky Linux 9.7 Won't Boot Post Kernel Upgrade - lvm /dev/rl_nv/root does not exist - boots fine on older kernel

Like the title says, the latest kernel does not boot on my Dell c1100 CS24-TY 1U server. Here is the dracut emergency log:

Kernel version 5.14.0-570.33.2.el9_6.x86_64 - 9.6 will boot, detect my lvm partitions, and works fine.

Kernel version 5.14.0-570.58.1.el9_6.x86_64 - 9.6 will not boot claiming it can’t find my lvm partitions.

Kernel version 5.14.0-611.1.el9_7.x86_64 - 9.7 will not boot claiming it can’t find my lvm partitions.

Any ideas? I also changed the way the LVM partitions are detected by setting

use_devicesfile = 0 in /etc/lvm/lvm.conf and then regenerating the initramfs with dracut -f --regenerate-all

Ran this through Gemini, and I still wasn’t able to get the newer kernel images to work…

This conversation log documents the systematic troubleshooting of a boot failure in Rocky Linux 9.7 (Kernel 5.14.0-611) on legacy Intel ICH10R (IMSM) hardware. While Kernel 9.6 boots successfully, 9.7 fails to assemble the RAID 10 array, preventing the LVM root volume from mounting.


Troubleshooting Log: Linux Kernel 9.7 & MDRAID/IMSM Failure

1. Initial Assessment: Software vs. Metadata

  • Observation: The system drops to a Dracut emergency shell on Kernel 9.7.

  • Verification: lsinitrd confirmed that mdraid, lvm, ahci.ko, and mdadm.conf are present in the 9.7 initramfs.

  • Metadata: mdadm.conf contains correct UUIDs and the AUTO +imsm directive.

  • Finding: The failure is not due to missing software, but a failure in the initialization handshake between the kernel and the RAID metadata.
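A minimal sketch of how the lsinitrd check above could be repeated for any kernel. The function name missing_from_listing is made up for illustration, and the image path in the example comment assumes the usual /boot/initramfs-<version>.img naming; adjust both to your system.

```shell
# Sketch: report which expected entries are absent from an lsinitrd listing.
# The listing is read from stdin; the strings to look for are the arguments.
# (missing_from_listing is a hypothetical helper, not part of dracut.)
missing_from_listing() (
    listing=$(cat)
    for pattern in "$@"; do
        # Plain substring match; print the pattern only when it is not found.
        case $listing in
            *"$pattern"*) ;;
            *) printf '%s\n' "$pattern" ;;
        esac
    done
)

# Example, run as root on the affected machine (image path is an assumption):
#   lsinitrd /boot/initramfs-5.14.0-611.1.el9_7.x86_64.img \
#       | missing_from_listing mdadm ahci.ko etc/mdadm.conf
```

No output means every string was found somewhere in the listing. That only shows the files are packaged, not that they are used during boot.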

2. Driver Conflict & Platform Checks

  • Hypothesis: Driver mismatch (VMD vs. AHCI) or strict BIOS platform checks.

  • Test: Added imsm_no_platform=1, vmd.max_devices=0, and rd.driver.pre=ahci to GRUB.

  • Result: Failed. The 9.6 kernel uses the ahci driver successfully, but forcing it in 9.7 does not trigger assembly.
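When testing boot parameters like these, it helps to confirm from the running system (or the dracut shell) that they actually reached the kernel command line. A small sketch, assuming only that /proc/cmdline is readable; has_karg is a hypothetical helper name:

```shell
# Sketch: succeed if the exact parameter $1 is present on the kernel command
# line. The optional second argument is a file to read instead of
# /proc/cmdline, which makes the helper easy to try on saved output.
has_karg() (
    set -f    # no filename expansion while splitting the command line
    for word in $(cat "${2:-/proc/cmdline}"); do
        [ "$word" = "$1" ] && exit 0
    done
    exit 1
)

# Example:
#   has_karg rd.driver.pre=ahci && echo "parameter is active"
```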

3. Dracut Shell Diagnostics

  • Manual Assembly: Running mdadm --assemble --scan --run --force in the shell resulted in an active /dev/md127 (Container) but failed to start the member volume /dev/md126.

  • Error: Metadata reports /dev/md126 “has been assembled with 1 device but cannot be started.”

  • Analysis: RAID 10 requires at least 2 disks to start. The kernel is successfully reading metadata from sda but failing to “claim” sdb, sdc, and sdd.
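To see at a glance which md devices the kernel knows about and what state they are in, /proc/mdstat can be summarised as below. This is only a sketch: md_states is a hypothetical name, and it assumes the usual "mdNNN : state ..." line layout. As far as I know, an IMSM container itself is normally listed as inactive, so it is the member volume's line that matters here.

```shell
# Sketch: print "<device> <state>" for every md device in /proc/mdstat.
# An optional argument names a saved copy of mdstat to read instead.
md_states() {
    awk '$1 ~ /^md[0-9]+$/ && $2 == ":" { print $1, $3 }' "${1:-/proc/mdstat}"
}

# Example (dracut shell or running system):
#   md_states
```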

4. Identification of the “Partition Lock” Conflict

  • Critical Observation: cat /proc/partitions shows raw partitions (sda1, sda2, sdb1, etc.) on the physical disks.

  • Diagnosis: The 9.7 kernel is performing a partition scan on the raw disks before the MDRAID assembly. Once the kernel “touches” a partition like sda1, it places a lock on the physical disk. When mdadm tries to claim the disk for the RAID container, it returns a “Device or resource busy” state.

  • Verification: mdadm --examine /dev/sda shows all disks as “Active/Online,” proving the metadata is intact but the disks are locked by the kernel partition manager.
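One way to test the "disk is busy" theory is to ask sysfs which stacked devices (md or device-mapper) currently hold a disk or its partitions. The sketch below assumes the standard /sys/block/<disk>/holders layout; disk_holders is a hypothetical helper name. Note that a partition scan on its own does not create holder entries, so an empty result would suggest the busy state comes from somewhere else.

```shell
# Sketch: list the devices that hold a whole disk or any of its partitions.
# $1 is the disk name (e.g. sda); $2 optionally overrides the sysfs root.
disk_holders() (
    sysfs=${2:-/sys}
    for dir in "$sysfs/block/$1/holders" "$sysfs/block/$1"/*/holders; do
        [ -d "$dir" ] || continue
        dev=${dir%/holders}      # strip the trailing /holders
        dev=${dev##*/}           # keep only the device name (sda, sda1, ...)
        for holder in "$dir"/*; do
            [ -e "$holder" ] || continue
            printf '%s -> %s\n' "$dev" "${holder##*/}"
        done
    done
)

# Example:
#   for d in sda sdb sdc sdd; do disk_holders "$d"; done
```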

5. Attempted Overrides

  • Unlock Attempt: Used partx -d /dev/sda to manually remove partition mappings and release kernel locks.

  • Result: Even after clearing partitions and stopping zombie arrays, the member volume /dev/md126 fails to spawn from the container /dev/md127.

  • Conclusion: There is a fundamental change in the block device discovery order or udev race condition in Kernel 9.7 that prevents Intel IMSM volumes from initializing on ICH10R chipsets when legacy partition tables are present.


Hardware & Environment Summary

Component        Specification
Server           Dell CS24-TY (Quanta 897a)
Chipset          Intel ICH10R SATA Controller [RAID mode]
RAID Level       RAID 10 (Intel IMSM / Matrix Storage)
Working Kernel   5.14.0-570 (Rocky 9.6)
Failing Kernel   5.14.0-611 (Rocky 9.7)
Storage Stack    Physical Disks → IMSM Container → MD-Volume → LVM

Longer conversation log (not formatted as nicely as above): "Full Conversation Log: Linux Kernel & MDRAID Boot Diagnostics" on Pastebin.com. Any ideas?

Ok, I fixed this issue after 7 hours of painful discussion with Gemini.

It’s actually not kernel specific. It has something to do with recent changes made to dracut and how it identifies IMSM Intel BIOS FakeRAID custom RAID setups. Please FIX this. This is RIDICULOUS. Someone needs to figure out when and how this got broken and fix it. I’m not the only one having these kinds of issues!

Full link to Gemini discussion:

https://gemini.google.com/share/ce8ee81276b0

Gemini Summary:

The Problem: After upgrading to the latest kernel and system packages in Rocky Linux 9, complex Intel Matrix RAID (IMSM/DDF) setups often fail to assemble during early boot, dropping users into the dracut emergency shell.

The Root Cause: Modern EL9 dracut modules are “minimalist” by design. They skip loading DDF/IMSM metadata handlers and mdmon binaries unless they detect an active, standard array at build-time. Furthermore, EL9 enforces strict “fail-fast” boot timeouts. If the RAID assembly takes more than a few seconds, the kernel gives up and drops to the emergency shell, regardless of whether your mdadm.conf is correct.

The Permanent Solution: Stop relying on dracut’s “smart” auto-detection. Instead, use a UDEV rule to force an event-driven assembly the moment the disks are detected, bypassing the boot-time timeout issue entirely.

Steps to Implement:

  1. Create an explicit RAID configuration: Ensure your /etc/mdadm.conf is accurate and maps your specific DDF container.

  2. Create a persistent UDEV trigger: Create /etc/udev/rules.d/99-mdadm.rules and add the following line:

ACTION=="add", SUBSYSTEM=="block", ENV{ID_FS_TYPE}=="ddf_raid_member|isw_raid_member", RUN+="/usr/sbin/mdadm --assemble --scan --config=/etc/mdadm.conf", RUN+="/usr/sbin/lvm pvscan", RUN+="/usr/sbin/lvm vgchange -ay"

  3. Force dracut to package these dependencies: Edit or create /etc/dracut.conf.d/raid.conf:

add_drivers+=" dm_mod dm_raid raid0 raid1 raid10 raid456 md_mod "
install_items+=" /etc/mdadm.conf /etc/udev/rules.d/99-mdadm.rules "
# mdadmconf is the dracut.conf option that includes the local /etc/mdadm.conf
mdadmconf="yes"

  4. Regenerate the Initramfs: Run sudo dracut -f -v --regenerate-all.
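For step 1, a rough way to check that /etc/mdadm.conf really covers what is on the disks is to compare it with the ARRAY lines that mdadm --examine --scan reports. The sketch below only compares UUID= tokens; uuids_missing_from_conf is a hypothetical helper name, not an mdadm feature.

```shell
# Sketch: read ARRAY lines on stdin and print every UUID=... token that does
# not appear in the config file given as $1 (comparison ignores case).
uuids_missing_from_conf() (
    conf=$1
    grep -o 'UUID=[0-9a-fA-F:]*' | sort -u | while read -r uuid; do
        grep -Fqi -- "$uuid" "$conf" || printf '%s\n' "$uuid"
    done
)

# Example, run as root:
#   mdadm --examine --scan | uuids_missing_from_conf /etc/mdadm.conf
```

No output means every array UUID found on the disks is also mentioned in the config file. Anything printed is a container or volume the config does not describe.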

Why this works:

  • Event-Driven: The UDEV rule fires immediately upon disk detection, which is faster and more reliable than the sequential dracut boot hooks.

  • Timeout Bypass: By handling assembly through UDEV, you ensure the block devices exist and the Volume Groups are active before the kernel hits its hard-coded boot timeout.

  • Version Immunity: By hard-coding these files into dracut’s install_items, the RAID assembly logic will persist through every future kernel update.


Closing Thought

You’ve essentially “hard-wired” your hardware configuration into the boot process. You aren’t just relying on software to guess correctly anymore; you’ve given the kernel an explicit set of instructions.

No hard wiring has been done. The defaults are relying on auto-detect, and in your case it doesn’t auto-detect it.

You do realise that Rocky is based on RHEL, so if you want to complain, go complain to them about the choices they made. Because what the Rocky team packages is exactly the same as what they have. You can also raise a bug for it on the Rocky bug tracker if you think it needs fixing outside of what you did above - you can find all that info on the Rocky website. Raise the bug, and the Rocky team will look at it, and most likely direct it upstream to RHEL.

Just writing about it on a forum won’t get it fixed, a bug needs to be raised, and since you are the one experiencing this problem then it should be you who opens the bug.

I did open a bug with them that they closed saying they don’t support Rocky:

Added a new comment telling them to fix their stuff, but we’ll see… evidently, they don’t like Rocky.

The bug should have been opened on Rocky’s bug tracker which you could have found on the Rocky website.

If you open a bug with Red Hat, it is for RHEL. They aren’t going to fix other distros, nor can that be expected. So it’s perfectly fair they closed it when it was opened in the wrong place for the wrong distro. Had you reported the bug with Red Hat from a RHEL system, they would have accepted it. They actually said as much in one of their replies: they will look at it if you can reproduce it on RHEL or CentOS.

Also, you aren’t in a position to make demands. Earlier in this forum post, along with your solution, you demanded that someone fix this, and you did the same in the RH bug. That is not how it works. You need to reset your expectations and attitude, because nobody has to do what you say.

Also, going back to my comment earlier, Rocky has the same features and functionality that RHEL has. So if something doesn’t work here, then it most likely doesn’t in RHEL either with it being the upstream distro. So if hardware support is removed, we have no control over it. Sure, third-party kernels can be used that reintroduce hardware support like the elrepo kernels for example.

I never said I was, but the fact that they’re probably breaking several legacy systems that are supposed to be supported in 9 is criminal if you ask me. And most system administrators will probably just give up because it’s almost impossible to figure out.

I thought Linux wasn’t supposed to have the same issues as Windows. I guess no one cares about backwards compatibility, older hardware, and not breaking people’s stuff in general these days…

Again, this wasn’t an issue until recently (past 6 months). I know most people won’t care, but it would be nice if developers would actually not break stuff… that’s what I TRY to do as a developer.

I’m not asking for anything, but I bet I’m not the only person affected by this… But yeah, it should be fixed by whomever broke it, and they really should be called out for their bad work.

I will file a bug with Rocky as well…

I’m curious what year is your hardware?
I’m seeing the launch-to-phase-out as being 2008-2012 for the ICH10R?

2010, but what does that matter? They’re still workhorses, still plenty powerful and capable, and they work great (except when updates break things).

Well it does matter a lot. I’ve got hardware from 2008, but I cannot run EL9 or EL10 on it. And I don’t expect to either due to hardware being deprecated. And you will have to expect the same and accept it whether you like it or not.