Kernel is tainted - memory management and kernel crashes

Hello, would someone be able to help?

Issue: during idle or normal workload, the PC goes into a safe shutdown by itself, then turns back on.

  • Clean install Rocky Linux 9.6.
  • Issue happens on kernel 5.14.0-570.18.1.el9_6.x86_64
  • 9950X3D, RTX 4090, 186 GB RAM, no EXPO.
  • Latest BIOS update
  • Nvidia drivers 550, RPM Fusion
  • CPU stress test fine. GPU stress test fine. Memtest all good.
  • Temps under 100% load: CPU - 90 °C, GPU - 63 °C

Three crash dumps (vmcore-dmesg.txt and kexec-dmesg.log), as interpreted by Claude.ai:

The crash occurred in the LRU (Least Recently Used) page management code during filesystem unmounting. Specifically:

  • Location: lru_gen_del_folio.constprop.0+0x12f/0x1a0
  • Trigger: __list_del_entry_valid+0x2d/0x50
  • Address: Invalid memory address 0xefffe53885663608
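
One quick sanity check on the reported address, as a minimal sketch: on x86_64 a virtual address is only valid ("canonical") if its upper bits are all copies of the top implemented address bit. The function names below are made up for illustration, and the 48-bit address width is an assumption (systems using 5-level paging use 57 bits). The address itself is the one from the dump.

```python
def is_canonical(addr: int, va_bits: int = 48) -> bool:
    """True if bits 63..(va_bits-1) of addr are all 0 or all 1.

    va_bits=48 is an assumption (4-level paging); pass 57 for 5-level paging.
    """
    upper = addr >> (va_bits - 1)
    all_ones = (1 << (64 - va_bits + 1)) - 1
    return upper in (0, all_ones)


def single_bit_repairs(addr: int, va_bits: int = 48) -> list[int]:
    """Bit positions whose flip alone would make addr canonical."""
    return [bit for bit in range(64) if is_canonical(addr ^ (1 << bit), va_bits)]


crash_addr = 0xEFFFE53885663608  # address reported in the crash dump

print("canonical:", is_canonical(crash_addr))
print("single-bit repairs:", single_bit_repairs(crash_addr))
```

For this address the check fails, and flipping bit 60 alone gives 0xffffe53885663608, which is canonical. That fits the idea of a corrupted pointer, but on its own it does not say whether a driver or the hardware did the corrupting.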

What Was Happening

  1. The system was shutting down normally
  2. During filesystem unmount, the kernel was trying to clean up memory pages
  3. When attempting to remove a page from the LRU list, it encountered corrupted list pointers
  4. The list validation detected an invalid memory address and crashed
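
To make steps 3 and 4 more concrete, here is a simplified Python model of the kind of consistency check a doubly linked list can run before unlinking an entry: both neighbours must still point back at the entry. This is only an illustration with made-up names, not the kernel's C code, and `None` stands in for a pointer that no longer leads anywhere valid.

```python
class Node:
    """One entry in a circular doubly linked list."""

    def __init__(self, name):
        self.name = name
        self.prev = self
        self.next = self


def list_add(new, head):
    """Insert `new` right after `head`."""
    nxt = head.next
    new.prev, new.next = head, nxt
    nxt.prev = new
    head.next = new


def entry_is_valid(entry):
    """True only if both neighbours exist and still point back at `entry`."""
    prev, nxt = entry.prev, entry.next
    if prev is None or nxt is None:
        return False
    return prev.next is entry and nxt.prev is entry


def list_del(entry):
    """Unlink `entry`, refusing to touch the list if its links look corrupted."""
    if not entry_is_valid(entry):
        raise RuntimeError(f"list corruption detected at {entry.name}")
    entry.prev.next = entry.next
    entry.next.prev = entry.prev
    entry.prev = entry.next = None


head, a, b = Node("head"), Node("a"), Node("b")
list_add(a, head)
list_add(b, head)      # list is now: head -> b -> a -> head

list_del(a)            # links are consistent, so this succeeds

b.prev = Node("stray")  # simulate a corrupted back pointer
try:
    list_del(b)
except RuntimeError as err:
    print(err)
```

In the model the corrupted entry is caught before the list is modified. In a real kernel the bad pointer may fault as soon as it is dereferenced, which is one way a crash can show up inside the validation helper itself.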

Likely Causes

Most Probable:

  • Memory corruption from one of the tainted kernel modules:
    • nvidia driver (proprietary)
    • vmmon (VMware module)
    • Other out-of-tree modules

Other Possibilities:

  • Hardware memory issues (RAM corruption)
  • Race condition in the kernel’s multi-generational LRU code
  • Filesystem corruption during shutdown

Evidence Supporting This

  • The kernel is tainted (Tainted: P OE) due to proprietary/out-of-tree modules
  • The crash happens in memory management, which proprietary drivers often interfere with
  • The invalid pointer value suggests memory corruption rather than a logic error
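
For reference, the letters in "Tainted: P OE" can be decoded mechanically. The sketch below covers only the three letters seen in this report, with the bit positions they occupy in /proc/sys/kernel/tainted; the kernel defines more flags than these, and the function names are made up for illustration.

```python
# Only the flags seen in this report: letter -> (bit number, meaning).
TAINT_FLAGS = {
    "P": (0, "proprietary module was loaded"),
    "O": (12, "out-of-tree module was loaded"),
    "E": (13, "unsigned module was loaded"),
}


def decode_taint(letters: str):
    """Return (letter, meaning) pairs for each flag in a taint string."""
    decoded = []
    for letter in letters:
        if letter == " ":  # spaces are placeholders for flags that are not set
            continue
        _, meaning = TAINT_FLAGS.get(letter, (None, "not covered by this sketch"))
        decoded.append((letter, meaning))
    return decoded


def taint_mask(letters: str) -> int:
    """Bitmask equivalent of the known letters in a taint string."""
    return sum(1 << TAINT_FLAGS[c][0] for c in letters if c in TAINT_FLAGS)


for letter, meaning in decode_taint("P OE"):
    print(letter, "=", meaning)
print("mask:", taint_mask("P OE"))
```

The flags only record that such modules were loaded at some point since boot. They do not show which module, if any, corrupted memory.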

After I switched to kernel 5.14.0-570.17.1.el9_6.x86_64, the PC works without crashing. Why is the issue happening with the newer kernel?

I feel there is something wrong with that kernel. On my HP ProLiant DL360e Gen8 that kernel doesn’t work either, or at least it runs for a few hours and then crashes on me. I’m currently using the 9.5 kernel (5.14.0-503.40.1.el9_5.x86_64), as this one works fine. I didn’t try the previous 570.17 kernel like you have, but I guess that might also work on mine.

I don’t know if I have the same errors as you, but since the behaviour is similar, I don’t think the problem is with your hardware; rather, something in this kernel is borked.

The 570.19 kernel is available in Rocky 9 now, so perhaps try that one. I’m going to do the same now by running dnf update.

I’m also contemplating enabling elrepo and installing kernel-lt or kernel-ml, since they are 6.x kernels, which I know work on this server: I was using either Debian 12 or Ubuntu with a 6.x kernel before switching to Rocky 9.

Actually, the server died on the 570.19 kernel as well, while performing a restic backup to one of my VMs. So it looks like it will be either an elrepo kernel or the 9.5 one.

EDIT: kernel-lt from elrepo is working fine, with no crashes so far. Worst case, I’ll end up returning to 5.14.0-503.40.1.
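
If it helps anyone keep track, the stock-kernel reports so far can be checked against the running kernel with a short script. This is only a sketch: the table holds what posters in this thread reported for their own machines, keyed by the first two build numbers after the dash in `uname -r`, and the helper names are made up for illustration.

```python
import platform
import re

# Stock EL9 kernel builds and what posters in this thread reported.
# These are anecdotes from two machines, not a compatibility list.
REPORTED = {
    (570, 17): "works",
    (570, 18): "crashes",
    (570, 19): "crashes",
    (503, 40): "works",
}

_RELEASE_RE = re.compile(r"^\d+\.\d+\.\d+-(\d+(?:\.\d+)*)\.el\d+")


def el_build(release: str):
    """Return the EL build numbers, e.g. (570, 18, 1), or None if not EL-style."""
    match = _RELEASE_RE.match(release)
    if match is None:
        return None
    return tuple(int(part) for part in match.group(1).split("."))


def thread_report(release: str) -> str:
    """Look up what this thread reported for a kernel release string."""
    build = el_build(release)
    if build is None or len(build) < 2:
        return "unknown"
    return REPORTED.get(build[:2], "unknown")


running = platform.release()  # same string as `uname -r`
print(running, "->", thread_report(running))
```

Kernels that are not in the table, including the elrepo ones, simply come back as "unknown".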

Thanks for looking into it. I have installed kernel-lt and will test to see how things go. I must say, I have spent 2 days debugging this issue. I hope all will be fine for now. Thx!

@iwalker one question, out of curiosity: is it standard procedure to install different kernels on Rocky, depending on the hardware configuration, to see what works and what doesn’t?

Since I am using the latest 9950X3D CPU, I ended up with the elrepo 6.14 kernel, because I kept having crashes on kernel-lt because of the 3D cache.

Also, I could not use the newest elrepo 6.15 kernel because it was too new for the Nvidia drivers. So 6.14 seems to be a good middle ground for me.

I am just surprised that I have to be so selective about kernels and nvidia drivers in order not to break things and have the hardware work.

You have to remember that EL is Enterprise Linux, and Red Hat builds the distro against a certified hardware list. Since Rocky is based on RHEL, the hardware support in the stock kernels is the same. If you have new hardware, especially something exotic with the absolutely latest CPU, then sure, there can be issues, since it’s most likely not on the hardware compatibility list that Red Hat built the distro for. In that case you need to use kernel-lt or kernel-ml from elrepo. For example, AMD Ryzen CPUs would have the CPU fan going at full whack on older kernels, but not on newer ones once support for the CPU had been added.

EL, or Enterprise Linux, is built for stability, and that doesn’t necessarily mean that the latest and greatest hardware is supported. In this instance, you are probably better off with Fedora or, as already mentioned, a newer kernel while still using EL.

I run Fedora 42 on my Lenovo Thinkpad T15p.

Incidentally, Rocky 10 is due out soon with a 6.12 kernel, so it may work better for you.

Super, thank you. That clears up the confusion I had. I appreciate the explanation 👍
