On two servers running Rocky Linux 9.1, I upgraded to 9.2 with the new kernel. Since then it has been a disaster. After a random period ranging from a few minutes to a couple of hours, two CPUs freeze completely, making the system unusable. The only solution is to restart the server; it works fine again for a while, and then the same thing happens. This is the message:
kernel:watchdog: BUG: soft lockup - CPU#1 stuck for 3151s kworker/2:2:16322
kernel:watchdog: BUG: soft lockup - CPU#2 stuck for 3308s kworker/1:0:10255
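For anyone collecting these messages from several machines, here is a minimal sketch of how the watchdog lines could be parsed to see which CPU and which worker thread are involved. The function name `parse_soft_lockup` and the regular expression are my own assumptions, written only against the message formats quoted in this thread, so other kernels may print something slightly different.

```python
import re

# Assumed pattern, based only on the two message shapes seen in this thread:
#   ... soft lockup - CPU#1 stuck for 3151s kworker/2:2:16322
#   ... soft lockup - CPU#4 stuck for 3397s! [kworker/4:2:519]
LOCKUP_RE = re.compile(
    r"soft lockup - CPU#(?P<cpu>\d+) stuck for (?P<secs>\d+)s!?\s*"
    r"\[?(?P<task>[^\]\s]+?):(?P<pid>\d+)\]?\s*$"
)


def parse_soft_lockup(line):
    """Return CPU, stuck time, task name and PID from one watchdog line, or None."""
    match = LOCKUP_RE.search(line)
    if match is None:
        return None
    return {
        "cpu": int(match["cpu"]),
        "seconds": int(match["secs"]),
        "task": match["task"],  # e.g. "kworker/2:2"
        "pid": int(match["pid"]),
    }


if __name__ == "__main__":
    sample = "kernel:watchdog: BUG: soft lockup - CPU#1 stuck for 3151s kworker/2:2:16322"
    print(parse_soft_lockup(sample))
```

Feeding it the lines from the journal or from `/var/log/messages` makes it easier to check whether the same kind of worker thread shows up every time.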
I was earlier fighting with a “similar” problem:
I think the CPU stuck error is just a symptom, and the root cause can be any of several things. In my case it seemed related to desktop usage (e.g. connected USB devices or power management), which is probably not an issue in server use.
Are these physical servers, or virtual machines, perhaps in a public cloud?
In my case (running Rocky Linux in a laptop), things I felt may have at least affected it (this is the timeline too):
At some point it seemed that some connected USB devices triggered this. Things seemed better if I disconnected all USB devices… but the real reason may have been the next point, as I usually also disconnected AC power along with the USB devices.
At some point I noticed that when this issue took place, disconnecting my laptop from AC power somehow helped, and the laptop would become “unstuck”. This led me to believe it was somehow related to power management.
As I mentioned in the other thread, I disabled hibernate and sleep mode altogether. I felt it at least helped with the problem.
The problem did occur even after that but I got somewhat different error messages (something about some device having to wait for n seconds or something), and googling for it, someone had fixed it by disabling/removing Pipewire and replacing it with PulseAudio, or something like that… I don’t think I ever managed to do that switch.
However, I haven’t seen this issue for a while now, and I am not sure what changed or what fixed it. Of the things mentioned above, I think disabling (masking) hibernate and sleep mode is most probably what helped the most, or fixed it. I haven’t checked lately whether I still get any such errors in the logs, but at least I am not experiencing similar unresponsiveness anymore.
However, since your issues happen in server use, somehow I feel maybe your root cause is not related to USB devices or power management or audio libraries.
In my case I am dealing with a cloud system with dedicated resources. Before upgrading to Rocky Linux 9.2 everything worked fine.
The one time I was able to log into one of the two affected machines during the problem (usually I can’t even log in via ssh and have to do a hard reset), I got the message shown above. Looking for more info, I found that the blocked processes were of this type:
[kworker/2:1-cifsiod]
giving me the idea that CIFS may have something to do with it, but I am not certain.
Two days after upgrading to RL 9.2 from 9.1 I saw one of these CPU lockup messages. The only cure was to power down and restart; a simple restart didn’t work.
USB devices: an APC UPS, keyboard, mouse.
Devs, please take a look at this.
The console message is:
Message from syslogd@localhost at May 18 20:24:43 …
kernel:watchdog: BUG: soft lockup - CPU#4 stuck for 3397s! [kworker/4:2:519]
After rebooting the machine, the issue recurred a couple hours later with a different CPU number.
Hello. I have some simple advice. I also had a lot of problems with Rocky Linux 9.2. I gave up on it and switched to Oracle Linux (UEK kernel). All my problems are gone.
I edited /etc/fstab and removed the CIFS network mount entries. I powered the server off and on again, and the problem disappeared.
Of course, the CIFS shares are no longer mounted as directories, so I have to reach them another way (e.g. FTP). It’s not optimal, but at least the server doesn’t go down anymore.
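If someone wants to try the same workaround without deleting the entries outright, here is a rough sketch that comments out the `cifs` lines in an fstab-style text so they can be restored later. The function name `comment_out_cifs` and the sample share `//fileserver/share` are hypothetical. It works on a string, so nothing is written to `/etc/fstab` unless you do that yourself, after making a backup.

```python
def comment_out_cifs(fstab_text):
    """Return fstab_text with every active entry of type 'cifs' commented out."""
    out = []
    for line in fstab_text.splitlines():
        stripped = line.strip()
        fields = stripped.split()
        # An active entry is non-empty, not a comment, and has at least
        # device, mount point and filesystem type (the third field).
        is_entry = bool(stripped) and not stripped.startswith("#") and len(fields) >= 3
        if is_entry and fields[2] == "cifs":
            out.append("# " + line)
        else:
            out.append(line)
    trailing_newline = "\n" if fstab_text.endswith("\n") else ""
    return "\n".join(out) + trailing_newline


if __name__ == "__main__":
    # Hypothetical sample content, not taken from any real server.
    sample = (
        "UUID=abcd / xfs defaults 0 0\n"
        "//fileserver/share /mnt/share cifs credentials=/root/.smbcred,_netdev 0 0\n"
    )
    print(comment_out_cifs(sample), end="")
```

Commenting the lines out rather than removing them keeps the mount options around for when a fixed kernel is available.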
The BZ reports state that this issue is fixed in kernel-5.14.0-301.el9 or later. Those kernels are not yet available for RHEL/Rocky 9, but they are available in CentOS Stream 9, so if you can’t wait until RHEL/Rocky have a fix, you could download and update to the latest CentOS Stream 9 kernel RPMs …
It depends on upstream (Red Hat). Since the linked posts here suggest it’s fixed in 5.14.0-301.el9, we have to wait until Red Hat releases it. As Rocky is 1:1 with RHEL, Rocky has exactly what RHEL has.
FWIW, in case anyone else is dealing with this: I have been working to migrate to Rocky Linux 9.1, and at the time I synced the repos locally, kernel-5.14.0-162.12.1.el9_1.0.2.x86_64 was available. I ran into the same issues described above while running that kernel, installed the suggested kernel-5.14.0-162.23.1.el9_1.x86_64, and it seems to have stabilized the host. I will let it run for longer and report back, but it looks like a bug that was introduced in kernel-5.14.0-162.12.1.el9_1.0.2.x86_64, fixed in kernel-5.14.0-162.23.1.el9_1.x86_64, and reintroduced in 9.2.
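To check quickly whether a given kernel is at or past the 5.14.0-301.el9 build mentioned in the BZ reports, something like the sketch below could be used. The helper names `kernel_key` and `has_reported_fix` are made up for illustration. Comparing build numbers is only a rough heuristic: Red Hat may backport a fix into a z-stream kernel with a lower build number, so a "False" here does not prove the fix is missing.

```python
import re


def kernel_key(release):
    """Turn '5.14.0-284.11.1.el9_2.x86_64' into ((5, 14, 0), (284, 11, 1))."""
    version, _, rest = release.partition("-")
    # Keep only the leading dotted numbers of the release part (before '.el9...').
    match = re.match(r"\d+(?:\.\d+)*", rest)
    build = tuple(int(p) for p in match.group(0).split(".")) if match else ()
    return tuple(int(p) for p in version.split(".")), build


def has_reported_fix(release, fixed="5.14.0-301.el9"):
    """Heuristic: True if release sorts at or after the build named in the BZ reports."""
    return kernel_key(release) >= kernel_key(fixed)


if __name__ == "__main__":
    import platform

    running = platform.release()  # same string that `uname -r` prints
    print(running, has_reported_fix(running))
```

The check only makes sense for el9 kernels, and it says nothing about whether a CentOS Stream kernel is otherwise suitable for a RHEL or Rocky system.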