Earlier I was fighting with a “similar” problem:
I think the “CPU stuck” error is just a symptom, and the root cause can be any of several different things. In my case it felt related to desktop usage (e.g. connected USB devices, power management, or something like that), which is probably not an issue in server use.
Are your machines real server hardware, or e.g. virtual machines, maybe even in a public cloud?
In my case (running Rocky Linux on a laptop), these are the things I felt may have at least affected it (listed in the order they happened):
- At some point it seemed that some connected USB devices triggered this. Things seemed better if I disconnected all USB devices… but the real reason may have been the next point, as I usually also disconnected AC power along with the USB devices.
- At some point I noticed that when this issue took place, disconnecting my laptop from AC power somehow helped, and the laptop would become “unstuck”. This led me to believe the problem was somehow related to power management.
- As I mention in the other thread, I disabled hibernate and sleep mode altogether. I felt it at least helped with the problem.
- The problem did occur even after that, but I got somewhat different error messages (something about some device having to wait for n seconds or something), and when I googled them, someone had fixed it by disabling/removing PipeWire and replacing it with PulseAudio, or something like that… I don’t think I ever managed to do that switch.

However, I haven’t seen this issue for a while now, and I’m not sure what changed or what fixed it. Of the things mentioned above, I feel disabling (masking) hibernate and sleep mode is most probably what helped the most, or even fixed it. I haven’t checked lately whether I still get any such errors in the logs, but at least I am not experiencing similar unresponsiveness anymore.
However, since your issues happen in server use, I suspect your root cause is not related to USB devices, power management, or audio libraries.
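If you want to check whether the errors still show up in the kernel log, something like the following might help. It is only a rough sketch: it assumes the messages look like the usual kernel soft lockup line ("soft lockup - CPU#N stuck for Ns"), which may not match exactly what you are seeing, and the function names `find_lockups` and `kernel_log` are just made up for this example.

```python
import re
import subprocess

# Assumed message format, e.g.:
# "watchdog: BUG: soft lockup - CPU#2 stuck for 22s! [kworker/2:1:1234]"
LOCKUP_RE = re.compile(r"soft lockup - CPU#(\d+) stuck for (\d+)s")


def find_lockups(log_text):
    """Return a (cpu, seconds) tuple for every soft lockup message found."""
    return [(int(m.group(1)), int(m.group(2)))
            for m in LOCKUP_RE.finditer(log_text)]


def kernel_log():
    """Read kernel messages from the journal (may need root or the right group)."""
    result = subprocess.run(
        ["journalctl", "-k", "--no-pager"],
        capture_output=True, text=True, check=False,
    )
    return result.stdout


if __name__ == "__main__":
    for cpu, secs in find_lockups(kernel_log()):
        print(f"CPU#{cpu} stuck for {secs}s")
```

By default `journalctl -k` only shows the current boot, so if the machine had to be power-cycled after a hang, the interesting lines would be in the previous boot (`-b -1`), provided the journal is persistent.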