Rocky 8.7 with kernel 4.18.0-553.44.1.el8_10.x86_64: OS reboots automatically under high load

We have a Slurm-based HPC compute node with 128 cores and 780 GB of memory, running Rocky 8.7 with kernel 4.18.0-553.44.1.el8_10.x86_64. HPE has confirmed there is no hardware problem. kdump is configured so that a vmcore should be created when the OS crashes. Below are the symptoms we see:

  1. Under high CPU/memory load, the OS reboots automatically without any vmcore being created in /var/crash.
  2. No trace is left in the journald log, on the server console via iLO, or in dmesg -T.
  3. Even manually triggering a crash produces no vmcore, only an automatic OS reboot [echo 1 > /proc/sys/kernel/sysrq, echo c > /proc/sysrq-trigger].
  4. The OS is stable when it’s idle.
  5. The hardware firmware was flashed to the latest version this year, and the hardware vendor confirms no hardware error.
  6. We have confirmed that the kdump-related settings are configured correctly to create a vmcore.
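
For points 3 and 6, a few kernel interfaces show whether kdump is actually armed before a crash happens. The following is only a minimal sketch, not a full kdump audit. The helper name has_crashkernel is made up for illustration, and the live checks assume the standard /proc/cmdline and /sys/kernel/kexec_crash_* files are present.

```shell
#!/bin/sh
# Sketch: check the preconditions kdump needs before a crash.
# has_crashkernel is a hypothetical helper, not part of any kdump tooling.

# Succeed if the given kernel command line contains a crashkernel= parameter.
has_crashkernel() {
    case " $1 " in
        *" crashkernel="*) return 0 ;;
        *) return 1 ;;
    esac
}

# Only run the live checks where the kernel interfaces are readable.
if [ -r /proc/cmdline ] && [ -r /sys/kernel/kexec_crash_loaded ]; then
    if has_crashkernel "$(cat /proc/cmdline)"; then
        echo "crashkernel= is present on the kernel command line"
    else
        echo "crashkernel= is missing from the kernel command line"
    fi
    # 1 means a capture kernel is loaded; 0 means none is loaded.
    echo "kexec_crash_loaded: $(cat /sys/kernel/kexec_crash_loaded)"
    # Size in bytes of the memory reserved for the capture kernel.
    echo "kexec_crash_size:   $(cat /sys/kernel/kexec_crash_size)"
fi
```

If kexec_crash_loaded reads 0, a sysrq-triggered crash has no capture kernel to boot into, so no vmcore would be written. On Rocky 8, kdumpctl status reports the readiness of the service itself.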

Our question is why a high load (still within the hardware resource specification) triggers an automatic OS reboot without any vmcore. Is it a severe OS panic that leaves no output on the OS console and no vmcore? How do we troubleshoot this kind of issue?

Rocky 8.7 is no longer supported. The current supported release is 8.10. I would upgrade your entire system first and make sure it’s up to date. It should not be updated piecemeal by applying only the kernel updates from 8.10.

Rocky 8.7 is not the latest, but that doesn’t really explain why it would suddenly reboot. Assuming you have a persistent journal, what was the last entry in the log? You should see a small gap in the timestamps at the exact time of the reboot.
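
To check the persistent-journal assumption and pull the last entries before the reboot, something like the following could be used. It is a rough sketch: journal_is_persistent is a hypothetical helper, and it relies on journald's default Storage=auto behaviour, where logs are kept across reboots only if /var/log/journal exists.

```shell
#!/bin/sh
# Sketch: is the journal persistent, and what was logged just before the reboot?
# journal_is_persistent is a hypothetical helper name.

# Succeed if the journal directory (default /var/log/journal) exists.
journal_is_persistent() {
    [ -d "${1:-/var/log/journal}" ]
}

if journal_is_persistent; then
    if command -v journalctl >/dev/null 2>&1; then
        # List the boots journald knows about, oldest first.
        journalctl --list-boots --no-pager || true
        # Show the last 50 entries of the previous boot (-b -1).
        journalctl -b -1 -n 50 --no-pager || true
    fi
else
    echo "/var/log/journal not found: the journal is likely volatile and lost on reboot"
fi
```

If the directory is missing, creating /var/log/journal (or setting Storage=persistent in /etc/systemd/journald.conf) and restarting systemd-journald should make the logs survive the next reboot.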

Could it be the reboot is being caused by the hardware / firmware?

Mix of 8.7 and 8.10 packages most likely, which is not a good scenario to be in, and is not supported by the Rocky team.

That said, on Rocky 9.6 I had problems with these kernels due to hardware most likely being deprecated and had to stay on a 9.5 kernel, or use kernel-lt from elrepo which is what I did in the end.

So could be a combination of both - mix of packages, and newer kernel drivers/modules causing problems for the hardware.
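
One rough way to spot such a mix is to compare the release the system reports with the minor release encoded in the running kernel's name. This is only a sketch: kernel_minor is a hypothetical helper, and it assumes the .el8_N naming seen in kernel strings like the one in the thread title.

```shell
#!/bin/sh
# Sketch: compare the OS release with the minor release the kernel was built for.
# kernel_minor is a hypothetical helper name.

# Print the N from ".elX_N" in a kernel release string.
# Prints nothing when the string has no "_N" suffix (e.g. a plain ".el8" kernel).
kernel_minor() {
    printf '%s\n' "$1" | sed -n 's/.*\.el[0-9][0-9]*_\([0-9][0-9]*\).*/\1/p'
}

if [ -r /etc/os-release ]; then
    # os-release is a shell-compatible file; it defines VERSION_ID (for example 8.7).
    . /etc/os-release
    echo "os-release VERSION_ID:  ${VERSION_ID:-unknown}"
    echo "running kernel:         $(uname -r)"
    echo "kernel built for minor: $(kernel_minor "$(uname -r)")"
fi
```

For the kernel in this thread, kernel_minor prints 10 while the OS identifies itself as 8.7, which is the mismatch being pointed out. rpm -qa --last lists packages by install time and may help show which ones were updated separately.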

HPE hardware support has confirmed there is no problem with the server platform after analysing the server’s AHS report. I also tend to think the issue is at the OS level, because I can’t even manually trigger a kernel panic vmcore dump; all I get is a quick OS reboot. Yes, there is a journald gap after the reboot. I think that’s because entries still cached in memory by journald are lost in a sudden OS reboot. I also found the KB articles below, and going by their descriptions they may be relevant, since our hardware happens to be based on AMD processors.

https://bugs.rockylinux.org/view.php?id=9109

https://access.redhat.com/solutions/7103432

So those articles imply that the issue (whatever they are talking about) was fixed in a more recent kernel than the one in the title of this post, so the question goes back to why you are not running the latest version.

It’s kind of similar to what I was saying about “hardware”, e.g. a bit like pressing the reset button, and therefore there’s nothing in the logs.

I’m looking for confirmation that the reboots are triggered by this, or for other ways to work around the issue on the kernel version we currently manage.

I don’t think anyone is going to be able to definitively tell you that is “the” problem. But as a basic troubleshooting step the first thing to do will be to update your system to the latest version and then see if the problem you’re having goes away.

If it does, it’s solved. If it doesn’t, then it’s time to look into it further.


Exactly what @FrankCox says, and what I also mentioned in my first reply. The only supported version is 8.10, so you should first make sure your system is up to date with ALL packages. Only then will someone assist you. Updating the system entirely is what you should do first.

Even the Rocky team will tell you to do that before even looking at any bug reports.
