Blocking BUG with the new Rocky Linux 9.2 kernel

On two servers running Rocky Linux 9.1, I upgraded to 9.2 with its new kernel. Since then it has been a disaster. After a random period ranging from a few minutes to a couple of hours, two CPUs completely freeze, making the system unusable. The only solution is to restart the server; it works fine again for a while and then the same story repeats. This is the message:

kernel:watchdog: BUG: soft lockup - CPU#1 stuck for 3151s! [kworker/2:2:16322]

kernel:watchdog: BUG: soft lockup - CPU#2 stuck for 3308s! [kworker/1:0:10255]

Do you have any idea what it is and how to fix it?
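
For reference, I pull these messages out of the journal after a forced reboot like this (-b -1 selects the previous boot):

    journalctl -k -b -1 | grep -i "soft lockup"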

I was earlier fighting with a “similar” problem:

I think that CPU-stuck error is just a symptom; the root cause can be any of several different things. In my case I felt it was related to desktop usage (e.g. connected USB devices or power management), which is probably not an issue in server use.

Are they real server hardware, or e.g. virtual machines, even in the public cloud?

In my case (running Rocky Linux on a laptop), these are the things I felt may have at least contributed (this is also the timeline):

  1. At some point it seemed that certain connected USB devices triggered this. Things seemed better when I disconnected all USB devices… but the real reason may have been the next point, since I usually disconnected AC power along with the USB devices.

  2. At some point I noticed that when this issue occurred, disconnecting my laptop from AC power somehow helped and the laptop would become “unstuck”. This led me to believe it was somehow related to power management.

  3. As I mentioned in the other thread, I disabled hibernate and sleep mode altogether (see the sketch after this list). I felt it at least helped with the problem.

  4. The problem still occurred even after that, but with somewhat different error messages (something about a device having to wait for n seconds), and when I googled those, someone had fixed it by removing PipeWire and replacing it with PulseAudio, or something like that… I don’t think I ever managed to do that switch.
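
Regarding point 3, masking sleep and hibernate is normally done through the systemd targets. This is the standard approach, though I can't swear it is the exact command I ran at the time:

    # prevent the system from ever entering sleep or hibernate
    systemctl mask sleep.target suspend.target hibernate.target hybrid-sleep.target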

However, I haven’t seen this issue for a while now, and I am not sure what changed or what fixed it. Of the things above, I feel that disabling (masking) hibernate and sleep mode is most probably what helped the most, or fixed it. I haven’t checked the logs lately for such errors, but at least I am not experiencing similar unresponsiveness anymore.

However, since your issues happen in server use, I suspect your root cause is not related to USB devices, power management, or audio libraries.

In my case I am dealing with a cloud system with dedicated resources. Before upgrading to Rocky Linux 9.2, everything worked fine.

The one time I was able to log into one of the two affected machines while the problem was occurring (usually I can’t even log in via SSH and have to do a hard reset), I got the screen shown above. Digging for more information, the processes that blocked were of this type:

[kworker/2:1-cifsiod]

which gave me the idea that CIFS may have something to do with it, though I have no certainty.
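
If you can still get a shell while it is happening, something like this can confirm whether the cifsiod workers are the ones blocked (the sysrq part assumes sysrq is enabled on your kernel):

    # look for cifs kernel worker threads stuck in uninterruptible (D) state
    ps -eo pid,stat,comm | grep -i cif
    # dump all blocked tasks to the kernel log, then read it back
    echo 1 > /proc/sys/kernel/sysrq
    echo w > /proc/sysrq-trigger
    dmesg | tail -50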

This is the screen of the top command just before the kernel crash:

Is there any way to disable this cifsiod process? It didn’t exist in the previous kernel, and since it is a kernel thread, it won’t let me terminate it.

Is there any way to report this bug to the developers?

Two days after upgrading from RL 9.1 to 9.2, I saw one of these CPU lockup messages. The only cure was to power down and restart; a simple reboot didn’t work.

USB devices: an APC UPS, keyboard, mouse.

Devs, please take a look at this.

The console message is:

Message from syslogd@localhost at May 18 20:24:43 …
kernel:watchdog: BUG: soft lockup - CPU#4 stuck for 3397s! [kworker/4:2:519]

After rebooting the machine, the issue recurred a couple hours later with a different CPU number.

Hi

I can confirm this issue after upgrading yesterday. After ~3 hours, the system (a VM at a cloud hosting provider) becomes unresponsive.

Hello. I have some simple advice. I also had a lot of problems with Rocky Linux 9.2. I gave up on it entirely and switched to Oracle Linux (UEK kernel). All my problems are gone.

Looks like this is a known issue:

https://bugzilla.redhat.com/show_bug.cgi?id=2180423
https://bugzilla.redhat.com/show_bug.cgi?id=2189320

I found a temporary fix.

I edited the /etc/fstab file, removing the CIFS network share entries. I turned the server off and on again, and the problem disappeared.
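
In practice it looks like this; the server and share names below are just examples, not my real paths:

    # in /etc/fstab, comment out any cifs entries, e.g.:
    # //fileserver/share  /mnt/share  cifs  credentials=/etc/cifs-creds,_netdev  0  0

    # then unmount any shares that are still mounted
    umount -a -t cifs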

Of course, I no longer have the CIFS network paths available as directories and have to reach them another way (e.g. FTP). It’s not optimal, but at least the server doesn’t go down anymore.

However, on Rocky Linux 9.1 this did not happen.

Fabio

The BZ reports state this issue is fixed in kernel-5.14.0-301.el9 or later, but those kernels are not yet available for RHEL/Rocky 9. They are available in CentOS Stream 9, so if you can’t wait until RHEL/Rocky ship a fix, you could download and update to the latest CentOS Stream 9 kernel RPMs.
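
Roughly, that means downloading the kernel packages from a CentOS Stream 9 mirror and installing them locally. The file names below are illustrative, not exact:

    # after fetching kernel, kernel-core and kernel-modules from a
    # CentOS Stream 9 mirror:
    dnf install ./kernel-5.14.0-301.el9.x86_64.rpm \
                ./kernel-core-5.14.0-301.el9.x86_64.rpm \
                ./kernel-modules-5.14.0-301.el9.x86_64.rpm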

I used the GRUB boot menu to boot the previous (RL 9.1) kernel, 5.14.0-162.23.1.el9_1.

So far the server has stayed up. Until this is fixed, this is my workaround.
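
To make that choice stick across reboots instead of picking it in the GRUB menu every time, grubby can pin the default kernel (the path below assumes x86_64):

    # list installed kernels and their menu indexes
    grubby --info=ALL | grep -E '^(index|kernel)'
    # make the 9.1 kernel the permanent default
    grubby --set-default /boot/vmlinuz-5.14.0-162.23.1.el9_1.x86_64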

Any ETA on when the fix will get propagated to RL9.2?

It depends on upstream (Red Hat). Since the linked reports say the fix is in 5.14.0-301.el9, we have to wait until Red Hat releases it. As Rocky is a 1:1 rebuild of RHEL, Rocky has exactly what RHEL has.
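
In the meantime you can watch for the fixed kernel to land with a plain dnf query:

    # show every kernel version currently available in the enabled repos
    dnf list --showduplicates kernel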

FWIW, in case anyone else is dealing with this: I have been working to migrate to Rocky Linux 9.1, and at the time I synced the repos locally, kernel-5.14.0-162.12.1.el9_1.0.2.x86_64 was available. I ran into the same issues described above running that kernel, installed the suggested kernel-5.14.0-162.23.1.el9_1.x86_64, and it seems to have stabilized the host. I will let it run longer and report back, but it looks like a bug that was introduced in kernel-5.14.0-162.12.1.el9_1.0.2, fixed in kernel-5.14.0-162.23.1.el9_1, and reintroduced in 9.2.
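
If you downgrade like this and want to stop dnf from pulling the broken kernel back in on the next update, the versionlock plugin can hold it (standard plugin, though I haven’t tested it against this exact bug):

    dnf install python3-dnf-plugin-versionlock
    dnf versionlock add kernel-5.14.0-162.23.1.el9_1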

Have you tried using ELRepo (elrepo.org, Index of /linux/kernel), just to make sure that a kernel upgrade resolves the issue on one machine?
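
If anyone tries that route, ELRepo’s mainline kernel is installed roughly like this; the commands follow ELRepo’s own EL9 instructions, so double-check against their site:

    # import the ELRepo signing key and enable the repository
    rpm --import https://www.elrepo.org/RPM-GPG-KEY-elrepo.org
    dnf install https://www.elrepo.org/elrepo-release-9.el9.elrepo.noarch.rpm
    # install the mainline kernel from the elrepo-kernel repo
    dnf --enablerepo=elrepo-kernel install kernel-ml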