Blocking BUG with the new Rocky Linux 9.2 kernel

On two servers running Rocky Linux 9.1, I upgraded to 9.2 with the new kernel. Since then it has been a disaster. After a random period ranging from a few minutes to a couple of hours, two CPUs completely freeze, making the system unusable. The only solution is to restart the server; it works fine again for a while and then the same story repeats. This is the message:

kernel:watchdog: BUG: soft lockup - CPU#1 stuck for 3151s kworker/2:2:16322

kernel:watchdog: BUG: soft lockup - CPU#2 stuck for 3308s kworker/1:0:10255

Do you have any idea what it is and how to fix it?

I was fighting a “similar” problem earlier:

I think the CPU-stuck error is just a symptom, and the root cause can be any of several different things. In my case I felt it was related to desktop usage (e.g. connected USB devices or power management), which is probably not an issue in server use.

Are they real server hardware, or e.g. virtual machines, perhaps even in a public cloud?

In my case (running Rocky Linux on a laptop), these are the things I felt may have at least contributed (this is also the timeline):

  1. At some point it seemed that certain connected USB devices triggered this. Things seemed better if I disconnected all USB devices… but the real reason may have been the next point, as I usually also disconnected AC power along with the USB devices.

  2. At some point I noticed that if and when this issue occurred, disconnecting my laptop from AC power somehow helped and the laptop would become “unstuck”. This led me to believe it was somehow related to power management.

  3. As I mentioned in the other thread, I disabled hibernate and sleep mode altogether. I felt it at least helped with the problem.

  4. The problem still occurred after that, but with somewhat different error messages (something about some device having had to wait for n seconds). Googling for it, I found someone who had fixed it by removing PipeWire and replacing it with PulseAudio, or something along those lines… I don’t think I ever managed to do that switch.

However, I haven’t seen this issue for a while now, and I’m not sure what changed or what fixed it. Of the things above, I feel that disabling (masking) hibernate and sleep mode is most probably what helped the most, or fixed it. I haven’t checked lately whether I still get any such errors in the logs, but at least I am not experiencing similar unresponsiveness anymore.
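In case anyone wants to try the same mitigation, this is roughly how masking those states looks with systemd (a minimal sketch; verify the target names against your own setup):

# Mask suspend/hibernate so systemd can never enter those states
sudo systemctl mask sleep.target suspend.target hibernate.target hybrid-sleep.target

# Verify that the targets now show as masked
systemctl list-unit-files sleep.target suspend.target hibernate.target hybrid-sleep.target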

However, since your issues happen in server use, I suspect your root cause is not related to USB devices, power management, or audio libraries.

In my case I am dealing with a cloud system with dedicated resources. Before upgrading to Rocky Linux 9.2, everything worked fine.

The one time I was able to log into one of the two affected machines while the problem was happening (usually I can’t even log in via SSH and have to do a hard reset), I got the messages shown above. Looking for more info, I saw that the blocked processes were of this type:

[kworker/2:1-cifsiod]

which gave me the idea that CIFS may have something to do with it, but I have no certainty.

This is the output of the top command before the kernel crash:

Is there any way to disable this cifsiod process? It didn’t exist in the previous kernel, and since it’s a kernel process I can’t terminate it.
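From what I understand, cifsiod is a workqueue created by the cifs kernel module, so it can’t be killed directly; it should only go away once no CIFS share is mounted and the module is unloaded. Something like this (the mount point below is just an example):

# List any mounted CIFS shares
findmnt -t cifs

# Unmount them (example mount point), then unload the cifs module
sudo umount /mnt/share
sudo modprobe -r cifs

# The cifsiod kworkers should disappear once the module is gone
ps ax | grep '[c]ifsiod'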

Is there any way to report this bug to the developers?

Two days after upgrading from RL 9.1 to 9.2, I saw one of these CPU lockup messages. The only cure was to power down and restart; a simple reboot didn’t work.

USB devices: an APC UPS, keyboard, and mouse.

Devs, please take a look at this.

The console message is:

Message from syslogd@localhost at May 18 20:24:43 …
kernel:watchdog: BUG: soft lockup - CPU#4 stuck for 3397s! [kworker/4:2:519]

After rebooting the machine, the issue recurred a couple hours later with a different CPU number.

Hi

I can confirm this issue after upgrading yesterday. After ~3 hours, the system (a VM at a cloud provider) becomes unresponsive.

Hello. I have some simple advice. I also had a lot of problems with Rocky Linux 9.2. I gave up on it entirely and switched to Oracle Linux (UEK kernel). All my problems are gone.

Looks like this is a known issue:

https://bugzilla.redhat.com/show_bug.cgi?id=2180423
https://bugzilla.redhat.com/show_bug.cgi?id=2189320

I found a temporary fix.

I edited /etc/fstab, removing the CIFS network share entries. I turned the server off and on again, and the problem disappeared.

Of course, I no longer have the CIFS network shares mounted as directories and have to reach them another way (e.g. FTP). It’s not optimal, but at least the server doesn’t go down anymore.
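A minimal sketch of the change, in case it helps (the backup step is just a precaution, and the sed pattern assumes the CIFS entries use “cifs” as the filesystem-type field):

# Back up fstab, then comment out every line whose filesystem type is cifs
sudo cp /etc/fstab /etc/fstab.bak
sudo sed -i '/[[:space:]]cifs[[:space:]]/ s/^/#/' /etc/fstab

# Either reboot, or unmount any CIFS shares that are still mounted
sudo umount -a -t cifs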

However, on Rocky Linux 9.1 this did not happen.

Fabio

The BZ reports state this issue is fixed in kernel-5.14.0-301.el9 or later. Those builds are not available for RHEL/Rocky 9 yet, but they are available in CentOS 9 Stream, so if you can’t wait until RHEL/Rocky ship a fix, you could download and update to the latest CentOS 9 Stream kernel RPMs…

I used the GRUB boot menu to boot the previous (RL 9.1) kernel, 5.14.0-162.23.1.el9_1.

So far the server has stayed up. Until this is fixed, this is my workaround.
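To keep the machine on that kernel across reboots (so an unattended restart doesn’t put it back on the 9.2 kernel), something like this should work, assuming the 9.1 kernel is still installed; the exact /boot path should match what --info=ALL prints:

# List the installed kernels known to the bootloader
sudo grubby --info=ALL | grep -E '^(index|kernel)'

# Make the RL 9.1 kernel the default boot entry
sudo grubby --set-default /boot/vmlinuz-5.14.0-162.23.1.el9_1.x86_64

# Confirm the default
sudo grubby --default-kernel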

Any ETA on when the fix will be propagated to RL 9.2?

It depends on upstream (Red Hat). Since the reports linked here suggest it’s fixed in 5.14.0-301.el9, we have to wait until Red Hat releases it. As Rocky is a 1:1 rebuild of RHEL, Rocky has exactly what RHEL has.


FWIW, in case anyone else is dealing with this: I have been working on migrating to Rocky Linux 9.1, and at the time I synced the repos locally, kernel-5.14.0-162.12.1.el9_1.0.2.x86_64 was the available kernel. I ran into the same issues described above while running kernel-5.14.0-162.12.1.el9_1.0.2.x86_64, installed the suggested kernel-5.14.0-162.23.1.el9_1.x86_64, and it seems to have stabilized the host. I will let it run longer and report back, but it looks like a bug that was introduced in kernel-5.14.0-162.12.1.el9_1.0.2.x86_64, fixed in kernel-5.14.0-162.23.1.el9_1.x86_64, and reintroduced in 9.2.

Have you tried using:
ELRepo | HomePage

Index of /linux/kernel (elrepo.org)

just to make sure that a kernel upgrade resolves the issue on one machine?
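For reference, pulling in an ELRepo mainline kernel on a single test box would look roughly like this (repo and package names per ELRepo’s usual instructions; double-check the pages above for the current steps):

# Import the ELRepo signing key and install the EL9 repository package
sudo rpm --import https://www.elrepo.org/RPM-GPG-KEY-elrepo.org
sudo dnf install https://www.elrepo.org/elrepo-release-9.el9.elrepo.noarch.rpm

# Install the mainline kernel from the elrepo-kernel repository, then reboot
sudo dnf --enablerepo=elrepo-kernel install kernel-ml
sudo reboot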

I can confirm kernel-5.14.0-162.23.1.el9_1.x86_64 has resolved the issues with the system locking up. The system has been stable for 5 days. kernel-5.14.0-162.12.1.el9_1.0.2.x86_64 seems to have the same issues as reported with the 9.2 kernel.
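If anyone wants to pin that working kernel until a fixed 9.2 build lands, something along these lines should do it (the versionlock plugin package name is the usual EL9 one; the older 9.1 kernel may have to come from the Rocky vault repos if it has been dropped from the main mirrors):

# Install the known-good 9.1 kernel build
sudo dnf install kernel-5.14.0-162.23.1.el9_1

# Lock the kernel package so updates don't pull in the affected 9.2 build
sudo dnf install python3-dnf-plugin-versionlock
sudo dnf versionlock add kernel-5.14.0-162.23.1.el9_1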

Here is the /var/log/messages output while running on kernel-5.14.0-162.12.1.el9_1.0.2.x86_64:

[root@hostname123 ~]# grep -iE "soft lockup|Workqueue:" /var/log/messages
May 22 09:59:39 hostname123 kernel: [184937.055476] Workqueue: cgroup_destroy css_free_rwork_fn
May 23 12:18:52 hostname123 kernel: [279690.777541] watchdog: BUG: soft lockup - CPU#7 stuck for 23s! [kworker/7:0:2092347]
May 23 12:18:52 hostname123 kernel: [279690.782265] Workqueue: deferredclose smb2_deferred_work_close [cifs]
May 23 12:18:52 hostname123 kernel: [279690.823543] watchdog: BUG: soft lockup - CPU#11 stuck for 22s! [.NET ThreadPool:2161345]
May 23 12:18:52 hostname123 kernel: [279690.866542] watchdog: BUG: soft lockup - CPU#15 stuck for 22s! [migration/15:94]
May 23 12:18:52 hostname123 kernel: [279691.085543] watchdog: BUG: soft lockup - CPU#35 stuck for 22s! [zabbix_agent2:5710]
May 23 12:18:52 hostname123 kernel: [279691.095544] watchdog: BUG: soft lockup - CPU#36 stuck for 22s! [zabbix_agent2:4964]
May 23 12:18:53 hostname123 kernel: [279691.206545] watchdog: BUG: soft lockup - CPU#46 stuck for 22s! [.NET ThreadPool:2143310]
May 23 12:18:53 hostname123 kernel: [279691.428546] watchdog: BUG: soft lockup - CPU#86 stuck for 26s! [cifsd:5197]
May 23 12:18:53 hostname123 kernel: [279691.459548] watchdog: BUG: soft lockup - CPU#94 stuck for 26s! [zabbix_agent2:85411]
May 23 12:18:53 hostname123 kernel: [279691.467547] watchdog: BUG: soft lockup - CPU#96 stuck for 22s! [.NET ThreadPool:2161681]
May 23 12:18:53 hostname123 kernel: [279691.471547] watchdog: BUG: soft lockup - CPU#97 stuck for 26s! [FileWriterThrea:2162018]
May 23 12:18:53 hostname123 kernel: [279691.503547] watchdog: BUG: soft lockup - CPU#105 stuck for 23s! [.NET ThreadPool:2161682]
May 23 12:18:53 hostname123 kernel: [279691.511547] watchdog: BUG: soft lockup - CPU#107 stuck for 23s! [kworker/u226:2:2091130]
May 23 12:18:53 hostname123 kernel: [279691.519547] watchdog: BUG: soft lockup - CPU#109 stuck for 27s! [.NET ThreadPool:2161586]
May 23 12:18:53 hostname123 kernel: [279691.577019] Workqueue: writeback wb_workfn
May 23 12:18:56 hostname123 kernel: [279694.485572] watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [zabbix_agent2:530182]
May 23 12:18:56 hostname123 kernel: [279694.738574] watchdog: BUG: soft lockup - CPU#6 stuck for 27s! [zabbix_agent2:6984]
May 23 12:18:57 hostname123 kernel: [279695.273578] watchdog: BUG: soft lockup - CPU#52 stuck for 22s! [zabbix_agent2:15933]
May 23 12:18:57 hostname123 kernel: [279695.389579] watchdog: BUG: soft lockup - CPU#76 stuck for 22s! [zabbix_agent2:295377]
May 23 12:18:57 hostname123 kernel: [279695.475580] watchdog: BUG: soft lockup - CPU#98 stuck for 22s! [zabbix_agent2:523551]

May 23 12:18:52 hostname123 kernel: [279690.790388] Call Trace:
May 23 12:18:52 hostname123 kernel: [279690.790898] _raw_spin_lock+0x25/0x30
May 23 12:18:52 hostname123 kernel: [279690.791372] _get_xid+0x13/0xa0 [cifs]
May 23 12:18:52 hostname123 kernel: [279690.791921] _cifsFileInfo_put+0x2b6/0x430 [cifs]
May 23 12:18:52 hostname123 kernel: [279690.792434] ? smb2_deferred_work_close+0x30/0x60 [cifs]
May 23 12:18:52 hostname123 kernel: [279690.792954] process_one_work+0x1e5/0x3c0
May 23 12:18:52 hostname123 kernel: [279690.793429] worker_thread+0x50/0x3b0
May 23 12:18:52 hostname123 kernel: [279690.793945] ? rescuer_thread+0x380/0x380
May 23 12:18:52 hostname123 kernel: [279690.794429] kthread+0x146/0x170
May 23 12:18:52 hostname123 kernel: [279690.794967] ? set_kthread_struct+0x50/0x50
May 23 12:18:52 hostname123 kernel: [279690.795507] ret_from_fork+0x1f/0x30
May 23 12:18:52 hostname123 kernel: [279690.823543] watchdog: BUG: soft lockup - CPU#11 stuck for 22s! [.NET ThreadPool:2161345]

Kernel 5.14.0-162.23.1.el9_1.x86_64 is fine. I’ve been running with that for months.

It’s the upgrade to 5.14.0-284.11.1.el9_2.x86_64 that broke things. It was installed with RL 9.2. I’m currently running on the older kernel while waiting for a fix.

I get this error on my Rocky 9 system (repeating in a loop):

Kernel Version:
Linux rock-kvm-server01 5.14.0-284.11.1.el9_2.x86_64 #1 SMP PREEMPT_DYNAMIC Tue May 9 17:09:15 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

Release:
NAME="Rocky Linux"
VERSION="9.2 (Blue Onyx)"
ID="rocky"

Error (repeating in a loop, and the system is very slow…):

Message from syslogd@rock-kvm-server01 at Jun 6 17:01:15 …
kernel:watchdog: BUG: soft lockup - CPU#14 stuck for 1151s! [kworker/14:1:172]

Message from syslogd@rock-kvm-server01 at Jun 6 17:02:03 …
kernel:watchdog: BUG: soft lockup - CPU#14 stuck for 1196s! [kworker/14:1:172]

Message from syslogd@rock-kvm-server01 at Jun 6 17:03:27 …
kernel:watchdog: BUG: soft lockup - CPU#14 stuck for 1274s! [kworker/14:1:172]

top:

We are also dealing with this bug. All machines that went to 9.2 lock up now. The ones on 9.1 are fine.

Posting here in case my fix helps someone else.

I got tired of waiting for RHEL to release a fixed kernel. Having to revert to the RL 9.1 kernel caused other problems (for example, VirtualBox stopped working).

I found CentOS kernel-5.14.0-302.el9 here:

https://kojihub.stream.centos.org/koji/buildinfo?buildID=31868

This is newer than the 284 kernel that RL 9.2 supplies. After installing this newer kernel (plus some other dependencies from the same page), all seems well. The server has been running for over 24 hours with no CPU lockups.
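Roughly what the install looked like; the exact RPM filenames are the ones listed on the koji page (the names below are illustrative), and the kernel, kernel-core, and kernel-modules packages have to go in together so dnf can resolve them against each other:

# After downloading the matching RPMs from the koji build page into the
# current directory, install them in one transaction:
sudo dnf install ./kernel-5.14.0-302.el9.x86_64.rpm \
    ./kernel-core-5.14.0-302.el9.x86_64.rpm \
    ./kernel-modules-5.14.0-302.el9.x86_64.rpm

# Reboot into the new kernel, then verify after logging back in
sudo reboot
uname -r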

I see that even newer builds than 302 are available, for example 326 here:

https://kojihub.stream.centos.org/koji/buildinfo?buildID=33654
