Kernel loses one processor (smpboot: do_boot_cpu failed(-1) to wakeup CPU#128)

Hello,

after upgrading to Rocky Linux 9.3, the kernel loses one of the cores in my systems.

I have 25 identical systems, each a Gigabyte R183-Z90-AAD1 with 2 AMD EPYC 9554 64-core processors. In total there should be 2 × 64 × 2 = 256 hardware threads / logical cores. With the kernel in Rocky Linux 9.2, all 256 were detected, reported and usable.

After upgrading to Rocky Linux 9.3, I see the following on all 25 systems:

prompt> journalctl -k

localhost kernel: smpboot: Allowing 256 CPUs, 0 hotplug CPUs

localhost kernel: smpboot: do_boot_cpu failed(-1) to wakeup CPU#128

The system tells me that one (logical) processor is offline (lscpu output translated from German):
prompt> lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 52 bits physical, 57 bits virtual
Byte Order: Little Endian
CPU(s): 256
On-line CPU(s) list: 0-127,129-255
Off-line CPU(s) list: 128 <------------
Vendor ID: AuthenticAMD
BIOS Vendor ID: Advanced Micro Devices, Inc.
Model name: AMD EPYC 9554 64-Core Processor
BIOS Model name: AMD EPYC 9554 64-Core Processor

It is not possible to bring CPU#128 online:
prompt> echo 1 > /sys/devices/system/cpu/cpu128/online
-bash: echo: write error: I/O error.
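
For checking many machines at once, the same information can be read from sysfs instead of lscpu. The following is only a rough sketch: the helper names parse_cpu_list and read_cpu_list are made up for illustration, and the expected count of 2 × 64 × 2 is hard-coded for the hardware described above. It assumes the standard Linux files /sys/devices/system/cpu/online and /sys/devices/system/cpu/offline.

```python
from pathlib import Path

# Standard Linux sysfs directory for CPU topology files.
SYSFS_CPU = Path("/sys/devices/system/cpu")


def parse_cpu_list(text):
    """Expand a kernel CPU list such as '0-127,129-255' into a sorted list of ints."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue  # an empty file means "no CPUs in this list"
        if "-" in part:
            first, last = part.split("-")
            cpus.update(range(int(first), int(last) + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)


def read_cpu_list(name):
    """Read SYSFS_CPU/<name> (hypothetical helper); return [] if the file is missing."""
    path = SYSFS_CPU / name
    return parse_cpu_list(path.read_text()) if path.exists() else []


if __name__ == "__main__":
    expected = 2 * 64 * 2  # sockets x cores per socket x threads per core (assumed)
    online = read_cpu_list("online")
    offline = read_cpu_list("offline")
    print(f"expected {expected}, online {len(online)}, offline {offline}")
```

On an affected system the offline list should contain 128, matching the lscpu output above.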

Any hints?

Best regards,
Rudi

Hi @Rudi

I’m hitting the same issue with some new Gigabyte systems; mine are R283-Z92-AAE1-000. The problem goes away if I disable hyperthreading (SMT). A few search results pointed to setting the kernel parameter cpu_init_udelay= to various values, but that has not changed the behavior on any of my systems.

griznog

Hi @griznog,

thank you for your feedback.

Yes, in the meantime other admins have seen the same problem. Disabling Hyperthreading solves it, but we have applications that benefit from Hyperthreading. Our interim solution is to leave the unrecognized hardware thread (1 of 256) unused, run on the remaining 255 hardware threads, and hope that this kernel error will be fixed soon.
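
As a minimal sketch of that interim approach, a job can size its worker pool from the CPUs the process may actually run on rather than from a hard-coded 256. This assumes Linux and Python's os.sched_getaffinity; the function name usable_worker_count is hypothetical, and whether the offline CPU is excluded from the affinity set should be verified on the affected systems.

```python
import os


def usable_worker_count(cpus=None):
    """Return how many workers to start, based on the CPUs available to this process.

    cpus may be passed explicitly (for example in tests); otherwise the
    process affinity set is queried, which is only available on Linux.
    """
    if cpus is None:
        cpus = os.sched_getaffinity(0)  # 0 means "the calling process"
    return max(1, len(cpus))


if __name__ == "__main__":
    print(f"starting {usable_worker_count()} workers")
```

The returned value could then be passed as max_workers to concurrent.futures.ProcessPoolExecutor, or used as the thread count of an application.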

Rudi
