XFS soft lockup on CPU

Hi, All,

I am wondering whether this patch has been merged/backported into Rocky 8.9.

I am having the issue reported in this thread.

At the end of that thread, the developer said they were going to backport it; that was two years ago.

But I am running the Rocky 8.9 kernel and seeing the exact same error.

Linux cms 4.18.0-513.11.1.el8_9.x86_64 #1 SMP Wed Jan 10 22:58:54 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux

# cat /etc/*release

Rocky Linux release 8.9 (Green Obsidian)

NAME="Rocky Linux"

VERSION="8.9 (Green Obsidian)"

ID="rocky"

ID_LIKE="rhel centos fedora"

VERSION_ID="8.9"

PLATFORM_ID="platform:el8"

PRETTY_NAME="Rocky Linux 8.9 (Green Obsidian)"

ANSI_COLOR="0;32"

LOGO="fedora-logo-icon"

CPE_NAME="cpe:/o:rocky:rocky:8:GA"

SUPPORT_END="2029-05-31"

ROCKY_SUPPORT_PRODUCT="Rocky-Linux-8"

ROCKY_SUPPORT_PRODUCT_VERSION="8.9"

REDHAT_SUPPORT_PRODUCT="Rocky Linux"

REDHAT_SUPPORT_PRODUCT_VERSION="8.9"

Rocky Linux release 8.9 (Green Obsidian)

Rocky Linux release 8.9 (Green Obsidian)

Rocky Linux release 8.9 (Green Obsidian)

The error is below:


[Thu Apr 11 13:02:01 2024] watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [kworker/1:0:80773]

[Thu Apr 11 13:02:01 2024] Modules linked in: rpcsec_gss_krb5 nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nf_tables_set nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set rdma_ucm(OE) rdma_cm(OE) iw_cm(OE) nf_tables nfnetlink ib_ipoib(OE) ib_cm(OE) ib_umad(OE) dm_crypt intel_rapl_msr intel_rapl_common amd64_edac_mod edac_mce_amd kvm_amd kvm irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel rapl pcspkr ipmi_ssif ses enclosure scsi_transport_sas sp5100_tco ipmi_si ccp ipmi_devintf ipmi_msghandler k10temp i2c_piix4 i2c_designware_platform i2c_designware_core acpi_cpufreq vfat fat nfsd auth_rpcgss nfs_acl lockd grace sunrpc knem(OE) xfs libcrc32c mlx5_ib(OE) ib_uverbs(OE) ib_core(OE) sd_mod t10_pi sg crc32c_intel ast drm_shmem_helper drm_kms_helper syscopyarea sysfillrect ahci sysimgblt libahci drm libata mlx5_core(OE) igb megaraid_sas mlxfw(OE) pci_hyperv_intf tls dca i2c_algo_bit psample mlxdevm(OE)

[Thu Apr 11 13:02:01 2024] mlx_compat(OE) dm_mirror dm_region_hash dm_log dm_mod

[Thu Apr 11 13:02:01 2024] CPU: 1 PID: 80773 Comm: kworker/1:0 Kdump: loaded Tainted: G OEL --------- - - 4.18.0-513.11.1.el8_9.x86_64 #1

[Thu Apr 11 13:02:01 2024] Hardware name: Lenovo ThinkSystem SR665 V3/SB27B34573, BIOS KAE112Q-2.12 07/21/2023

[Thu Apr 11 13:02:01 2024] Workqueue: xfs-conv/dm-5 xfs_end_io [xfs]

[Thu Apr 11 13:02:01 2024] RIP: 0010:_raw_spin_unlock_irqrestore+0x11/0x20

[Thu Apr 11 13:02:01 2024] Code: c0 e9 be f7 2e 00 b8 01 00 00 00 e9 b4 f7 2e 00 66 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 c6 07 00 0f 1f 40 00 48 89 f7 57 9d <0f> 1f 44 00 00 e9 90 f7 2e 00 0f 1f 44 00 00 0f 1f 44 00 00 8b 07

[Thu Apr 11 13:02:01 2024] RSP: 0018:ff6dbd42c1973d00 EFLAGS: 00000212 ORIG_RAX: ffffffffffffff13

[Thu Apr 11 13:02:01 2024] RAX: 0000000000000001 RBX: ffffffffa1e08670 RCX: dead000000000200

[Thu Apr 11 13:02:01 2024] RDX: ff6dbd42c4ed7c38 RSI: 0000000000000212 RDI: 0000000000000212

[Thu Apr 11 13:02:01 2024] RBP: 0000000000000212 R08: ff6dbd42c4ed7c38 R09: 0000000000032940

[Thu Apr 11 13:02:01 2024] R10: 0013c67dfc693540 R11: 0000000000000008 R12: 0000000000000ab0

[Thu Apr 11 13:02:01 2024] R13: fff46568dc312780 R14: 0000000000000004 R15: ff1bd57a4bafe040

[Thu Apr 11 13:02:01 2024] FS: 0000000000000000(0000) GS:ff1bd55fef040000(0000) knlGS:0000000000000000

[Thu Apr 11 13:02:01 2024] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033

[Thu Apr 11 13:02:01 2024] CR2: 000055ca5555ff78 CR3: 0000003c61610005 CR4: 0000000000771ee0

[Thu Apr 11 13:02:01 2024] PKRU: 55555554

[Thu Apr 11 13:02:01 2024] Call Trace:

[Thu Apr 11 13:02:01 2024] <IRQ>

[Thu Apr 11 13:02:01 2024] ? watchdog_timer_fn.cold.10+0x46/0x9e

[Thu Apr 11 13:02:01 2024] ? watchdog+0x30/0x30

[Thu Apr 11 13:02:01 2024] ? __hrtimer_run_queues+0x101/0x280

[Thu Apr 11 13:02:01 2024] ? hrtimer_interrupt+0x100/0x220

[Thu Apr 11 13:02:01 2024] ? srso_alias_return_thunk+0x5/0xfcdfd

[Thu Apr 11 13:02:01 2024] ? smp_apic_timer_interrupt+0x6a/0x130

[Thu Apr 11 13:02:01 2024] ? apic_timer_interrupt+0xf/0x20

[Thu Apr 11 13:02:01 2024] </IRQ>

[Thu Apr 11 13:02:01 2024] ? _raw_spin_unlock_irqrestore+0x11/0x20

[Thu Apr 11 13:02:01 2024] wake_up_page_bit+0x8a/0x110

[Thu Apr 11 13:02:01 2024] end_page_writeback+0x72/0x100

[Thu Apr 11 13:02:01 2024] iomap_finish_ioend+0x116/0x1d0

[Thu Apr 11 13:02:01 2024] ? iomap_finish_ioend+0x150/0x1d0

[Thu Apr 11 13:02:01 2024] iomap_finish_ioends+0x7f/0xb0

[Thu Apr 11 13:02:01 2024] xfs_end_ioend+0x58/0x130 [xfs]

[Thu Apr 11 13:02:01 2024] xfs_end_io+0xaf/0xe0 [xfs]

[Thu Apr 11 13:02:01 2024] process_one_work+0x1d3/0x390

[Thu Apr 11 13:02:01 2024] ? process_one_work+0x390/0x390

[Thu Apr 11 13:02:01 2024] worker_thread+0x30/0x390

[Thu Apr 11 13:02:01 2024] ? process_one_work+0x390/0x390

[Thu Apr 11 13:02:01 2024] kthread+0x134/0x150

[Thu Apr 11 13:02:01 2024] ? set_kthread_struct+0x50/0x50

[Thu Apr 11 13:02:01 2024] ret_from_fork+0x1f/0x40

Thanks. WG


I downgraded the kernel to 4.18.0-425.3.1.el8.x86_64 and now see the following error after the previous one. I think it is still the iomap issue, and the nfsd hang is just an after-effect.

[Thu Apr 11 21:42:18 2024] INFO: task nfsd:5435 blocked for more than 120 seconds.
[Thu Apr 11 21:42:18 2024]       Tainted: G           OEL   --------- -  - 4.18.0-425.3.1.el8.x86_64 #1
[Thu Apr 11 21:42:18 2024] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Thu Apr 11 21:42:18 2024] task:nfsd            state:D stack:    0 pid: 5435 ppid:     2 flags:0x80004080
[Thu Apr 11 21:42:18 2024] Call Trace:
[Thu Apr 11 21:42:18 2024]  __schedule+0x2d1/0x860
[Thu Apr 11 21:42:18 2024]  schedule+0x35/0xa0
[Thu Apr 11 21:42:18 2024]  rwsem_down_write_slowpath+0x30c/0x5c0
[Thu Apr 11 21:42:18 2024]  ? xfs_file_buffered_write+0x5f/0x2c0 [xfs]
[Thu Apr 11 21:42:18 2024]  xfs_ilock+0xf5/0x100 [xfs]
[Thu Apr 11 21:42:18 2024]  xfs_file_buffered_write+0x5f/0x2c0 [xfs]
[Thu Apr 11 21:42:18 2024]  ? __inode_security_revalidate+0x63/0x80
[Thu Apr 11 21:42:18 2024]  do_iter_readv_writev+0x156/0x1b0
[Thu Apr 11 21:42:18 2024]  do_iter_write+0x88/0x1a0
[Thu Apr 11 21:42:18 2024]  nfsd_vfs_write+0xe6/0x460 [nfsd]
[Thu Apr 11 21:42:18 2024]  ? ima_file_check+0x30/0x90
[Thu Apr 11 21:42:18 2024]  nfsd_write+0x87/0x170 [nfsd]
[Thu Apr 11 21:42:18 2024]  nfsd3_proc_write+0x105/0x180 [nfsd]
[Thu Apr 11 21:42:18 2024]  nfsd_dispatch+0x146/0x270 [nfsd]
[Thu Apr 11 21:42:18 2024]  svc_process_common+0x360/0x5c0 [sunrpc]
[Thu Apr 11 21:42:18 2024]  ? svc_sock_secure_port+0x12/0x40 [sunrpc]
[Thu Apr 11 21:42:18 2024]  ? nfsd_svc+0x2f0/0x2f0 [nfsd]
[Thu Apr 11 21:42:18 2024]  ? nfsd_shutdown_threads+0x80/0x80 [nfsd]
[Thu Apr 11 21:42:18 2024]  svc_process+0xb7/0xf0 [sunrpc]
[Thu Apr 11 21:42:18 2024]  nfsd+0xe3/0x140 [nfsd]
[Thu Apr 11 21:42:18 2024]  kthread+0x10b/0x130
[Thu Apr 11 21:42:18 2024]  ? set_kthread_struct+0x50/0x50
[Thu Apr 11 21:42:18 2024]  ret_from_fork+0x35/0x40

Hi, can any developers take a look? Thanks.


I changed the LVM device's io_timeout and have been monitoring the system for more than two weeks; the problem has not happened again.

Here is what I have:

# cat /sys/dev/block/253\:5/queue/io_timeout 
240000

253:5 is the device the XFS filesystem sits on.
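For anyone wanting to apply the same workaround, here is a sketch of how to find the backing device for a mount point and raise its I/O timeout. The /data mount point and the 240000 ms value are assumptions; substitute your own XFS mount and a timeout that suits your storage.

```shell
# Resolve the MAJ:MIN of the device backing the mount (here /data, a
# hypothetical example), then raise its I/O timeout to 240 s (240000 ms).
dev=$(lsblk -no MAJ:MIN "$(df --output=source /data | tail -n 1)" | tr -d '[:space:]')
echo 240000 | sudo tee "/sys/dev/block/${dev}/queue/io_timeout"
```

Note that this sysfs setting does not survive a reboot; to make it persistent you would typically add a udev rule or a boot-time script.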


Thank you for coming back to post your solution and confirmation! A rare courtesy these days.
