[Rocky Linux 8] Bug soft lockup in kernel 4.18.0-513.9.1.el8_9.x86_64

Hello,

I recently encountered an issue on my server: it randomly freezes, sometimes after 1-2 weeks of use (with suspends in between). I have no clue what to look at next; any help is greatly appreciated. Here are the logs from a crash that happened today:

Feb 24 08:00:28 cdn2-new kernel: watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [nginx:435762]
Feb 24 08:00:30 cdn2-new kernel: Modules linked in: cls_bpf sch_ingress mptcp_diag udp_diag raw_diag unix_diag tcp_diag inet_diag binfmt_misc nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nf_tables_set nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set nf_tables libcrc32c nfnetlink intel_rapl_msr intel_rapl_common amd_energy crct10dif_pclmul crc32_pclmul iTCO_wdt iTCO_vendor_support ghash_clmulni_intel joydev pcspkr lpc_ich i2c_i801 ext4 mbcache jbd2 bochs drm_vram_helper drm_kms_helper syscopyarea sysfillrect sysimgblt drm_ttm_helper ttm drm ahci libahci sd_mod t10_pi sg libata crc32c_intel virtio_net serio_raw net_failover failover virtio_scsi
Feb 24 08:00:30 cdn2-new kernel: Red Hat flags: eBPF/cls eBPF/event
Feb 24 08:00:30 cdn2-new kernel: CPU: 0 PID: 435762 Comm: nginx Kdump: loaded Tainted: G             L   --------- -  - 4.18.0-513.9.1.el8_9.x86_64 #1
Feb 24 08:00:30 cdn2-new kernel: Hardware name: Linode Compute Instance/Standard PC (Q35 + ICH9, 2009), BIOS Not Specified 
Feb 24 08:00:30 cdn2-new kernel: RIP: 0010:new_slab+0x242/0x520
Feb 24 08:00:30 cdn2-new kernel: Code: 20 4c 01 ee 4d 39 fd 0f 84 98 00 00 00 48 8b 83 90 01 00 00 49 89 f1 4c 89 ff 4d 89 fd 49 0f c9 48 83 c2 01 4c 31 f8 4c 31 c8 <48> 89 06 0f b7 75 2a 81 e6 ff 7f 00 00 48 39 f2 0f 83 e4 01 00 00
Feb 24 08:00:30 cdn2-new kernel: RSP: 0018:ffffa69c80003a40 EFLAGS: 00000286 ORIG_RAX: ffffffffffffff13
Feb 24 08:00:30 cdn2-new kernel: RAX: f1fae3a2d3b8e228 RBX: ffff91d481002380 RCX: 0000000000001800
Feb 24 08:00:30 cdn2-new kernel: RDX: 0000000000000002 RSI: ffff91d488a9ca00 RDI: ffff91d488a9d800
Feb 24 08:00:30 cdn2-new kernel: RBP: ffffc6a8c022a700 R08: 00000000000396d7 R09: 00caa988d491ffff
Feb 24 08:00:30 cdn2-new kernel: R10: 0000000000000006 R11: 0000000000000006 R12: ffff91d488a9c000
Feb 24 08:00:30 cdn2-new kernel: R13: ffff91d488a9d800 R14: 0000000000000003 R15: ffff91d488a9d800
Feb 24 08:00:30 cdn2-new kernel: FS:  00007ff1cfc25b80(0000) GS:ffff91d4ffc00000(0000) knlGS:0000000000000000
Feb 24 08:00:30 cdn2-new kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Feb 24 08:00:30 cdn2-new kernel: CR2: 00000000ef90c000 CR3: 000000004a5e6000 CR4: 0000000000350ef0
Feb 24 08:00:30 cdn2-new kernel: Call Trace:
Feb 24 08:00:30 cdn2-new kernel: <IRQ>
Feb 24 08:00:30 cdn2-new kernel: ? watchdog_timer_fn.cold.10+0x46/0x9e
Feb 24 08:00:30 cdn2-new kernel: ? watchdog+0x30/0x30
Feb 24 08:00:30 cdn2-new kernel: ? __hrtimer_run_queues+0x101/0x280
Feb 24 08:00:30 cdn2-new kernel: ? hrtimer_interrupt+0x100/0x220
Feb 24 08:00:30 cdn2-new kernel: ? smp_apic_timer_interrupt+0x6a/0x130
Feb 24 08:00:30 cdn2-new kernel: ? apic_timer_interrupt+0xf/0x20
Feb 24 08:00:30 cdn2-new kernel: ? apic_timer_interrupt+0xa/0x20
Feb 24 08:00:30 cdn2-new kernel: ? new_slab+0x242/0x520
Feb 24 08:00:30 cdn2-new kernel: ___slab_alloc.part.91+0x441/0x740
Feb 24 08:00:30 cdn2-new kernel: ? __alloc_skb+0x8c/0x1c0
Feb 24 08:00:30 cdn2-new kernel: ? fib6_table_lookup+0x114/0x330
Feb 24 08:00:30 cdn2-new kernel: __kmalloc_node_track_caller+0xc7/0x2a0
Feb 24 08:00:30 cdn2-new kernel: ? __alloc_skb+0x8c/0x1c0
Feb 24 08:00:30 cdn2-new kernel: kmalloc_reserve+0x2e/0x80
Feb 24 08:00:30 cdn2-new kernel: __alloc_skb+0x8c/0x1c0
Feb 24 08:00:30 cdn2-new kernel: tcp_make_synack+0x54/0x4e0
Feb 24 08:00:30 cdn2-new kernel: ? ip_route_output_key_hash_rcu+0x4dc/0xa10
Feb 24 08:00:30 cdn2-new kernel: ? xfrm_lookup_route+0x1d/0x90
Feb 24 08:00:30 cdn2-new kernel: tcp_v4_send_synack+0x44/0xe0
Feb 24 08:00:30 cdn2-new kernel: tcp_rtx_synack+0x5d/0xc0
Feb 24 08:00:30 cdn2-new kernel: ? internal_add_timer+0x42/0x70
Feb 24 08:00:30 cdn2-new kernel: ? inet_csk_reqsk_queue_drop_and_put+0x90/0x90
Feb 24 08:00:30 cdn2-new kernel: reqsk_timer_handler+0x1f9/0x2a0
Feb 24 08:00:30 cdn2-new kernel: ? inet_csk_reqsk_queue_drop_and_put+0x90/0x90
Feb 24 08:00:30 cdn2-new kernel: call_timer_fn+0x2e/0x130
Feb 24 08:00:30 cdn2-new kernel: run_timer_softirq+0x1e5/0x440
Feb 24 08:00:30 cdn2-new kernel: ? kvm_sched_clock_read+0xd/0x20
Feb 24 08:00:30 cdn2-new kernel: ? sched_clock+0x5/0x10
Feb 24 08:00:30 cdn2-new kernel: __do_softirq+0xdc/0x2cf
Feb 24 08:00:30 cdn2-new kernel: irq_exit_rcu+0xc6/0xd0
Feb 24 08:00:30 cdn2-new kernel: irq_exit+0xa/0x10
Feb 24 08:00:30 cdn2-new kernel: smp_apic_timer_interrupt+0x74/0x130
Feb 24 08:00:30 cdn2-new kernel: apic_timer_interrupt+0xf/0x20
Feb 24 08:00:30 cdn2-new kernel: </IRQ>
Feb 24 08:00:30 cdn2-new kernel: RIP: 0010:new_slab+0x242/0x520
Feb 24 08:00:30 cdn2-new kernel: Code: 20 4c 01 ee 4d 39 fd 0f 84 98 00 00 00 48 8b 83 90 01 00 00 49 89 f1 4c 89 ff 4d 89 fd 49 0f c9 48 83 c2 01 4c 31 f8 4c 31 c8 <48> 89 06 0f b7 75 2a 81 e6 ff 7f 00 00 48 39 f2 0f 83 e4 01 00 00
Feb 24 08:00:30 cdn2-new kernel: RSP: 0018:ffffa69c80bbfae0 EFLAGS: 00000286 ORIG_RAX: ffffffffffffff13
Feb 24 08:00:30 cdn2-new kernel: RAX: f1b2b9acdde2ae28 RBX: ffff91d481002380 RCX: 0000000000001400
Feb 24 08:00:30 cdn2-new kernel: RDX: 0000000000000002 RSI: ffff91d486f38200 RDI: ffff91d486f39400
Feb 24 08:00:30 cdn2-new kernel: RBP: ffffc6a8c01bce00 R08: 00000000000396d7 R09: 0082f386d491ffff
Feb 24 08:00:30 cdn2-new kernel: R10: 0000000000000006 R11: 0000000000000001 R12: ffff91d486f38000
Feb 24 08:00:30 cdn2-new kernel: R13: ffff91d486f39400 R14: 0000000000000000 R15: ffff91d486f39400
Feb 24 08:00:30 cdn2-new kernel: ___slab_alloc.part.91+0x441/0x740
Feb 24 08:00:30 cdn2-new kernel: ? __alloc_skb+0x8c/0x1c0
Feb 24 08:00:30 cdn2-new kernel: ? __alloc_skb+0x182/0x1c0
Feb 24 08:00:30 cdn2-new kernel: __kmalloc_node_track_caller+0xc7/0x2a0
Feb 24 08:00:30 cdn2-new kernel: ? __alloc_skb+0x8c/0x1c0
Feb 24 08:00:30 cdn2-new kernel: kmalloc_reserve+0x2e/0x80
Feb 24 08:00:30 cdn2-new kernel: __alloc_skb+0x8c/0x1c0
Feb 24 08:00:30 cdn2-new kernel: sk_stream_alloc_skb+0xe5/0x2c0
Feb 24 08:00:30 cdn2-new kernel: tcp_sendmsg_locked+0x33a/0xda0
Feb 24 08:00:30 cdn2-new kernel: ? sock_has_perm+0x80/0xa0
Feb 24 08:00:30 cdn2-new kernel: tcp_sendmsg+0x27/0x40
Feb 24 08:00:30 cdn2-new kernel: sock_sendmsg+0x50/0x60
Feb 24 08:00:30 cdn2-new kernel: sock_write_iter+0x97/0x100
Feb 24 08:00:30 cdn2-new kernel: new_sync_write+0x112/0x160
Feb 24 08:00:30 cdn2-new kernel: vfs_write+0xa5/0x1b0
Feb 24 08:00:30 cdn2-new kernel: ksys_write+0x4f/0xb0
Feb 24 08:00:30 cdn2-new kernel: do_syscall_64+0x5b/0x1b0
Feb 24 08:00:30 cdn2-new kernel: entry_SYSCALL_64_after_hwframe+0x61/0xc6
Feb 24 08:00:30 cdn2-new kernel: RIP: 0033:0x7ff1cf5eea17
Feb 24 08:00:30 cdn2-new kernel: Code: c3 66 90 41 54 49 89 d4 55 48 89 f5 53 89 fb 48 83 ec 10 e8 1b fd ff ff 4c 89 e2 48 89 ee 89 df 41 89 c0 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 35 44 89 c7 48 89 44 24 08 e8 54 fd ff ff 48
Feb 24 08:00:30 cdn2-new kernel: RSP: 002b:00007ffc4af6cfd0 EFLAGS: 00000293 ORIG_RAX: 0000000000000001
Feb 24 08:00:30 cdn2-new kernel: RAX: ffffffffffffffda RBX: 00000000000006eb RCX: 00007ff1cf5eea17
Feb 24 08:00:30 cdn2-new kernel: RDX: 000000000000010a RSI: 000055c1851b71e3 RDI: 00000000000006eb
Feb 24 08:00:30 cdn2-new kernel: RBP: 000055c1851b71e3 R08: 0000000000000000 R09: 000055c182a02012
Feb 24 08:00:30 cdn2-new kernel: R10: 000055c1851b71e5 R11: 0000000000000293 R12: 000000000000010a
Feb 24 08:00:30 cdn2-new kernel: R13: 00007ffc4af6d060 R14: 000055c183741ce0 R15: 0000000000000000

Thank you for your attention to this matter.

I searched for that term and found this explanation on the SUSE site:
https://www.suse.com/support/kb/doc/?id=000018705

It suggests that a period of heavy load on the processor is causing the lockup. The article also suggests extending the watchdog timer period to see if that alleviates the lockups.
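For reference, the soft lockup detector fires at roughly twice kernel.watchdog_thresh (default 10 seconds, which matches the "stuck for 22s" message). A sketch of inspecting and raising it with sysctl, assuming root access; the value 30 and the file name are just examples:

```shell
# Show the current watchdog threshold (soft lockup triggers at ~2x this)
sysctl kernel.watchdog_thresh

# Temporarily raise it to 30 seconds (resets on reboot)
sysctl -w kernel.watchdog_thresh=30

# Make the change persistent across reboots
echo 'kernel.watchdog_thresh = 30' > /etc/sysctl.d/90-watchdog.conf
```

Note that this only gives the CPU longer to recover before the warning fires; it does not remove whatever is causing the stall.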


I will try that and report back. Thank you for the suggestion.

I see that you are running a Linode Compute Instance. Is it using a Linode kernel?

Hi @ganphx, no, I am not running a Linode kernel.

Why do you ask about this?

He most likely asked because of this line in your first post:

Feb 24 08:00:30 cdn2-new kernel: Hardware name: Linode Compute Instance/Standard PC (Q35 + ICH9, 2009), BIOS Not Specified

It clearly mentions Linode, and some providers use their own kernel within their VPS. The other reason for asking: if it were a Linode kernel, the problem would need to be reported to Linode. But since you say it's not a Linode kernel, you are most likely running the official kernel that Rocky Linux provides.

What are the specs of the VPS? If it's hitting lockups, maybe you have a lack of CPU/RAM.
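A quick way to confirm what the instance actually has, using standard procps/util-linux tools (the grep pattern is just an example):

```shell
# CPUs visible to the guest
nproc

# Memory and swap, in MiB
free -m

# Any earlier soft lockup or out-of-memory messages in the kernel ring buffer
dmesg | grep -iE 'soft lockup|out of memory' | tail -n 20
```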


Hi @iwalker, thank you for your inquiry! Here are the specifications of the VPS:

Processor: 1 core of an AMD EPYC 7542 32-Core Processor
RAM: 2 GB
Storage: 50 GB
Operating System: Rocky Linux 8.9 (Green Obsidian)
Kernel: Linux 4.18.0-513.9.1.el8_9.x86_64

For what it's worth, another VPS with similar specifications has no soft lockup issues. If you need any further details or have any questions, feel free to let me know!

What are you running on it? I know we see the soft lockup with nginx, but what other things are installed/running on the server? Eg: PHP or MariaDB, etc?

Then we’d need to look at how it is configured: is it configured for more resources than are actually available, or is it running the default configuration for whatever was installed? I usually tune my PHP configuration, especially when using FPM, and some nginx settings as well, but what gets configured depends on the amount of resources the VPS has; otherwise I could cause a situation similar to the one you are experiencing.

Currently, my server is primarily running nginx to serve static files. There is no PHP, FPM, or any database installed or running on it.

And the nginx config is default? No real modifications that could cause permission issues?

You can remove all sssd packages to free up some memory from these daemons running. Or alternatively stop/disable sssd just in case this is causing some issues. If I’m not using sssd for anything I just remove them from my installs. You can check with:

rpm -qa | grep -i sssd

to check what is installed, and also:

systemctl disable sssd
systemctl stop sssd

if you do not want to remove the packages. Then see if you get any improvements.

Yes, the nginx configuration is indeed default, with no significant modifications that could potentially cause permission issues.

The sssd service is already stopped:

[root@cdn2-new ~]# systemctl status sssd
● sssd.service - System Security Services Daemon
Loaded: loaded (/usr/lib/systemd/system/sssd.service; enabled; vendor preset: enabled)
Active: inactive (dead)
Condition: start condition failed at Tue 2023-12-26 10:51:20 UTC; 2 months 1 days ago

So I just disabled it. Thank you so much.

Well said @iwalker, getting down to brass tacks. Surprised to see Rocky 8 can be installed and perform on a 2 GB / 1-core VPS.

“Please note a soft lockup can occur for reasons other than hypervisor overcommitment, such as possible kernel bugs or bugs in 3rd party kernel modules. If the environment of concern exhibits the symptoms noted, please attempt the steps in the resolution section and lower the workload on the hypervisor.”
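On that note, one way to check for hypervisor overcommitment from inside the guest is CPU steal time: the 'st' column of vmstat, or the 8th value on the cpu line of /proc/stat. A consistently non-trivial steal percentage means the host is running other guests during your time slices. A rough sketch:

```shell
# Sample CPU stats 3 times, one second apart; watch the 'st' (steal) column
vmstat 1 3

# Cumulative steal jiffies since boot, straight from /proc/stat
awk '/^cpu /{print "steal jiffies:", $9}' /proc/stat
```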