Rocky 8.7 with kernel 4.18.0-553.44.1.el8_10.x86_64: OS reboots automatically under high load

HuangHaiqing · September 17, 2025, 2:36am

We run a Slurm-based HPC compute node with 128 cores and 780 GB of memory. The OS is Rocky 8.7 with kernel 4.18.0-553.44.1.el8_10.x86_64. HPE, the hardware vendor, has confirmed there is no hardware problem. kdump is configured so that a vmcore should be created when the OS crashes. Here is the issue we are seeing:

When we put a high CPU/memory load on the node, the OS reboots automatically and no vmcore is created in /var/crash.
Nothing is left behind in the journald log, on the server console via iLO, or in dmesg -T.
Even manually triggering a crash to create a vmcore does not work; the OS just reboots [echo 1 > /proc/sys/kernel/sysrq, echo c > /proc/sysrq-trigger].
The OS runs fine when it is idle.
The hardware firmware was flashed to the latest version this year, and the hardware vendor confirms there are no hardware errors.
We have confirmed that the kdump settings are configured correctly to create a vmcore.

Our question is why a high load (still within the hardware resources available to the OS) triggers an automatic reboot without any vmcore. Is it a severe OS panic that leaves nothing on the OS console and produces no vmcore? How do we troubleshoot this kind of issue?
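As a starting point for the troubleshooting question above, here is a minimal sketch of a kdump sanity check. It assumes the standard sysfs files /sys/kernel/kexec_crash_loaded and /sys/kernel/kexec_crash_size and the kdumpctl tool from kexec-tools are available; the function names has_crashkernel and kdump_report are made up for illustration. It only reads state and does not trigger a crash.

```shell
#!/bin/sh
# Hypothetical helper: succeed if a kernel command line string
# contains a crashkernel= reservation.
has_crashkernel() {
    case " $1 " in
        *" crashkernel="*) return 0 ;;
        *) return 1 ;;
    esac
}

# Print the state of the pieces kdump depends on.
# Missing files are reported rather than treated as fatal.
kdump_report() {
    if has_crashkernel "$(cat /proc/cmdline 2>/dev/null)"; then
        echo "crashkernel= is present on the kernel command line"
    else
        echo "crashkernel= is NOT present on the kernel command line"
    fi
    for f in /sys/kernel/kexec_crash_loaded /sys/kernel/kexec_crash_size; do
        if [ -r "$f" ]; then
            echo "$f = $(cat "$f")"
        else
            echo "$f is not readable"
        fi
    done
    # kdumpctl is an assumption: it ships with kexec-tools on EL8.
    if command -v kdumpctl >/dev/null 2>&1; then
        kdumpctl status || true
    fi
}

kdump_report
```

If kexec_crash_loaded reads 0, no crash kernel is loaded, and a sysrq-triggered panic would not be expected to produce a vmcore, whatever /etc/kdump.conf says.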

iwalker · September 17, 2025, 7:27am

Rocky 8.7 is no longer supported. The currently supported release is 8.10. I would upgrade your entire system first and make sure it’s up to date. Don’t cherry-pick updates by applying only the kernel from 8.10.

gerry666uk · September 17, 2025, 7:39pm

Rocky 8.7 is not the latest, but that doesn’t really explain why it would suddenly reboot. Assuming you have a persistent journal, what was the last entry in the log? You should see a small gap in the timestamps at the exact time of the reboot.

Could it be the reboot is being caused by the hardware / firmware?
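To look at the journal as suggested above, something like the following sketch may help. It assumes journald's default Storage=auto behaviour, where logs survive a reboot only if /var/log/journal exists; the helper name journal_is_persistent is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: with Storage=auto, the journal is kept across
# reboots only if this directory exists.
journal_is_persistent() {
    [ -d "${1:-/var/log/journal}" ]
}

if journal_is_persistent; then
    echo "journal directory exists; earlier boots should be retained"
    if command -v journalctl >/dev/null 2>&1; then
        # Most recent boots known to the journal.
        journalctl --list-boots --no-pager | tail -n 3 || true
        # Last 20 entries of the previous boot (-b -1).
        journalctl -b -1 -n 20 --no-pager || true
    fi
else
    echo "no /var/log/journal; the journal is probably volatile"
fi
```

Note that journald does not sync every message to disk immediately (see SyncIntervalSec in journald.conf), so the final seconds before a hard reset may be missing even with a persistent journal.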

iwalker · September 17, 2025, 8:47pm

Mix of 8.7 and 8.10 packages most likely, which is not a good scenario to be in, and is not supported by the Rocky team.

That said, on Rocky 9.6 I had problems with these kernels due to hardware most likely being deprecated and had to stay on a 9.5 kernel, or use kernel-lt from elrepo which is what I did in the end.

So could be a combination of both - mix of packages, and newer kernel drivers/modules causing problems for the hardware.
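One rough way to spot the kind of package mix described above is to compare the minor release in /etc/rocky-release with the minor release encoded in the running kernel's version string. This is only a sketch: the helper names release_minor and kernel_minor are made up, and not every kernel build carries an el8_N suffix, in which case the kernel side comes back empty.

```shell
#!/bin/sh
# Hypothetical helper: "Rocky Linux release 8.7 (Green Obsidian)" -> 7
release_minor() {
    printf '%s\n' "$1" | sed -n 's/.*release [0-9]*\.\([0-9]*\).*/\1/p'
}

# Hypothetical helper: "4.18.0-553.44.1.el8_10.x86_64" -> 10
# Prints nothing when the string has no elN_M suffix.
kernel_minor() {
    printf '%s\n' "$1" | sed -n 's/.*\.el[0-9]*_\([0-9]*\)\..*/\1/p'
}

release_file=/etc/rocky-release
if [ -r "$release_file" ]; then
    os_minor=$(release_minor "$(cat "$release_file")")
    k_minor=$(kernel_minor "$(uname -r)")
    echo "release minor: ${os_minor:-unknown}, kernel minor: ${k_minor:-unknown}"
    if [ -n "$os_minor" ] && [ -n "$k_minor" ] && [ "$os_minor" != "$k_minor" ]; then
        echo "kernel and release file disagree; packages may be mixed"
    fi
else
    echo "$release_file not found; this does not look like a Rocky system"
fi
```

For the system in this thread, the title alone would give 7 on the release side and 10 on the kernel side. A full dnf upgrade, rather than a kernel-only update, is the usual way to bring both in line.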

HuangHaiqing · September 18, 2025, 4:47am

HPE hardware support has confirmed there is no problem with the server platform after analysing the server’s AHS report. I also tend to think the issue is at the OS level, because I can’t even manually trigger a kernel panic vmcore dump; the OS just reboots quickly. Yes, there is a gap in the journald log after the reboot. I think that is because entries still cached in memory by journald were lost in the sudden reboot. I also found the two articles below; their descriptions seem relevant, since our hardware happens to be based on AMD processors.

https://bugs.rockylinux.org/view.php?id=9109

https://access.redhat.com/solutions/7103432

gerry666uk · September 18, 2025, 6:44pm

So those articles imply that the issue (whatever they are talking about) was fixed in a more recent kernel than the one in the title of this post, so the question goes back to why you are not running the latest version.

It’s kind of similar to what I was saying about “hardware”, e.g. a bit like pressing the reset button, and therefore there’s nothing in the logs.

HuangHaiqing · September 19, 2025, 1:23am

I’m looking for confirmation that the issue is triggered by this, or for other ways to work around it on the kernel version we currently manage.

FrankCox · September 19, 2025, 2:00am

I don’t think anyone is going to be able to definitively tell you that is “the” problem. But as a basic troubleshooting step the first thing to do will be to update your system to the latest version and then see if the problem you’re having goes away.

If it does, it’s solved. If it doesn’t, then it’s time to look into it further.

iwalker · September 19, 2025, 7:18am

Exactly what @FrankCox says, and what I also mentioned in my first reply. The only supported version is 8.10, so you should first make sure your system is up to date with ALL packages. Only then will someone assist you. Updating the system entirely is what you should do first.

Even the Rocky team will tell you to do that before even looking at any bug reports.
