Kernel errors related to InfiniBand

Hello everybody,
today I updated a server to Rocky Linux 9.5, and during the boot process (which hung for around 1.5 minutes on one task) I saw some errors that worried me. After running dmesg I saw that the errors are related to InfiniBand, but the server doesn't use InfiniBand. I have a 10G Ethernet card, and I have read that this card also supports InfiniBand, but the default should be Ethernet mode.
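
To double-check that the ports really are in Ethernet (RoCE) mode and not InfiniBand mode, I think the link layer that the RDMA devices report can be read from sysfs (the bnxt_re0/bnxt_re1 device names are taken from the dmesg output below):

# cat /sys/class/infiniband/bnxt_re0/ports/1/link_layer
# cat /sys/class/infiniband/bnxt_re1/ports/1/link_layer

Both should print "Ethernet" if the ports are in RoCE mode; ibv_devinfo from rdma-core reports the same information as link_layer.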

The dmesg output is:

[   50.201717] bnxt_en 0000:43:00.0 bnxt_re0: Failed to modify HW QP
[   50.201741] infiniband bnxt_re0: Couldn't change QP1 state to INIT: -110
[   50.201764] infiniband bnxt_re0: Couldn't start port
[   50.202910] bnxt_en 0000:43:00.0 bnxt_re0: Failed to destroy HW QP
[   50.202972] ------------[ cut here ]------------
[   50.202987] WARNING: CPU: 1 PID: 1437 at drivers/infiniband/core/cq.c:322 ib_free_cq+0xf2/0x130 [ib_core]
[   50.203018] Modules linked in: ipmi_ssif amd_atl intel_rapl_msr intel_rapl_common amd64_edac edac_mce_amd kvm_amd kvm bnxt_re(+) ast i2c_algo_bit wmi_bmof ib_uverbs acpi_cpufreq pcspkr rapl drm_shmem_helper acpi_ipmi ses ipmi_si enclosure ch drm_kms_helper ib_core ipmi_devintf k10temp i2c_piix4 ptdma ipmi_msghandler joydev drm nfsd nfs_acl lockd auth_rpcgss grace sunrpc xfs libcrc32c sd_mod raid1 crct10dif_pclmul crc32_pclmul crc32c_intel mpt3sas ahci nvme libahci raid_class ghash_clmulni_intel scsi_transport_sas bnxt_en nvme_core libata ccp nvme_auth t10_pi sp5100_tco wmi rndis_host cdc_ether usbnet mii zfs(POE) zunicode(POE) zzstd(OE) zlua(OE) zavl(POE) icp(POE) zcommon(POE) znvpair(POE) spl(OE) st sg fuse
[   50.203075] CPU: 1 PID: 1437 Comm: systemd-udevd Tainted: P           OE     -------  ---  5.14.0-503.15.1.el9_5.x86_64 #1
[   50.203078] Hardware name: Supermicro AS -1114S-WN10RT/H12SSW-NTR, BIOS 2.7 10/25/2023
[   50.203080] RIP: 0010:ib_free_cq+0xf2/0x130 [ib_core]
[   50.203098] Code: 08 48 89 ee e8 1f 61 02 00 65 ff 0d 70 c9 ae 3e 75 81 0f 1f 44 00 00 e9 77 ff ff ff 48 8d 7f 50 e8 f3 ab 7e de e9 46 ff ff ff <0f> 0b e9 52 e0 5b df 0f 0b 5d e9 4a e0 5b df 80 3d d6 25 03 00 00
[   50.203100] RSP: 0018:ffffb915c931b840 EFLAGS: 00010202
[   50.203102] RAX: 0000000000000002 RBX: ffff9e4cc4a00000 RCX: 0000000000000000
[   50.203104] RDX: 0000000000000000 RSI: ffff9e8a8e8608c0 RDI: ffff9e4c4fdbb000
[   50.203105] RBP: ffff9e4c5bce6000 R08: 0000000000000000 R09: ffffb915c931b570
[   50.203106] R10: ffffb915c931b568 R11: ffffffffa1de93e8 R12: 00000000ffffff92
[   50.203107] R13: 0000000000000246 R14: ffff9e4c5bce68f8 R15: ffff9e4c5bce6870
[   50.203109] FS:  00007fa84118ab40(0000) GS:ffff9e8a8e840000(0000) knlGS:0000000000000000
[   50.203110] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   50.203112] CR2: 00007f567512e4e0 CR3: 000000014694a001 CR4: 0000000000770ef0
[   50.203113] PKRU: 55555554
[   50.203114] Call Trace:
[   50.203118]  <TASK>
[   50.203119]  ? srso_alias_return_thunk+0x5/0xfbef5
[   50.203124]  ? show_trace_log_lvl+0x26e/0x2df
[   50.203131]  ? show_trace_log_lvl+0x26e/0x2df
[   50.203137]  ? ib_mad_port_open+0x267/0x3f0 [ib_core]
[   50.203160]  ? ib_free_cq+0xf2/0x130 [ib_core]
[   50.203176]  ? __warn+0x7e/0xd0
[   50.203180]  ? ib_free_cq+0xf2/0x130 [ib_core]
[   50.203196]  ? report_bug+0x100/0x140
[   50.203201]  ? handle_bug+0x3c/0x70
[   50.203205]  ? exc_invalid_op+0x14/0x70
[   50.203207]  ? asm_exc_invalid_op+0x16/0x20
[   50.203212]  ? ib_free_cq+0xf2/0x130 [ib_core]
[   50.203228]  ib_mad_port_open+0x267/0x3f0 [ib_core]
[   50.203247]  ib_mad_init_device+0x51/0xc0 [ib_core]
[   50.203265]  add_client_context+0x110/0x1b0 [ib_core]
[   50.203284]  enable_device_and_get+0xd7/0x1e0 [ib_core]
[   50.203301]  ib_register_device+0xe7/0x160 [ib_core]
[   50.203319]  bnxt_re_ib_init+0x143/0x160 [bnxt_re]
[   50.203333]  bnxt_re_probe+0x141/0x1b0 [bnxt_re]
[   50.203342]  ? __pfx_bnxt_re_probe+0x10/0x10 [bnxt_re]
[   50.203349]  auxiliary_bus_probe+0x45/0x80
[   50.203353]  ? driver_sysfs_add+0x59/0xc0
[   50.203357]  really_probe+0xe1/0x390
[   50.203360]  ? pm_runtime_barrier+0x50/0x90
[   50.203363]  __driver_probe_device+0xd6/0x130
[   50.203367]  driver_probe_device+0x1e/0x90
[   50.203370]  __driver_attach+0xd2/0x1c0
[   50.203373]  ? __pfx___driver_attach+0x10/0x10
[   50.203375]  bus_for_each_dev+0x78/0xd0
[   50.203379]  bus_add_driver+0xc2/0x1f0
[   50.203383]  driver_register+0x70/0xd0
[   50.203386]  __auxiliary_driver_register+0x6a/0xd0
[   50.203389]  ? __pfx_init_module+0x10/0x10 [bnxt_re]
[   50.203397]  bnxt_re_mod_init+0x3b/0xff0 [bnxt_re]
[   50.203404]  do_one_initcall+0x44/0x210
[   50.203409]  ? srso_alias_return_thunk+0x5/0xfbef5
[   50.203412]  ? kmalloc_trace+0x25/0xa0
[   50.203417]  do_init_module+0x64/0x230
[   50.203422]  __do_sys_init_module+0x12e/0x1b0
[   50.203428]  do_syscall_64+0x5f/0xf0
[   50.203434]  ? srso_alias_return_thunk+0x5/0xfbef5
[   50.203436]  ? __mod_memcg_lruvec_state+0x76/0xc0
[   50.203441]  ? srso_alias_return_thunk+0x5/0xfbef5
[   50.203443]  ? __mod_lruvec_page_state+0x97/0x160
[   50.203446]  ? srso_alias_return_thunk+0x5/0xfbef5
[   50.203448]  ? folio_add_new_anon_rmap+0x44/0xe0
[   50.203452]  ? srso_alias_return_thunk+0x5/0xfbef5
[   50.203454]  ? do_anonymous_page+0x25a/0x410
[   50.203457]  ? srso_alias_return_thunk+0x5/0xfbef5
[   50.203460]  ? __handle_mm_fault+0x2fb/0x690
[   50.203463]  ? nohz_balancer_kick+0x31/0x240
[   50.203469]  ? srso_alias_return_thunk+0x5/0xfbef5
[   50.203471]  ? __count_memcg_events+0x4f/0xb0
[   50.203472]  ? srso_alias_return_thunk+0x5/0xfbef5
[   50.203474]  ? mm_account_fault+0x6c/0x100
[   50.203478]  ? srso_alias_return_thunk+0x5/0xfbef5
[   50.203480]  ? handle_mm_fault+0x116/0x270
[   50.203482]  ? srso_alias_return_thunk+0x5/0xfbef5
[   50.203484]  ? do_user_addr_fault+0x1d6/0x6a0
[   50.203488]  ? srso_alias_return_thunk+0x5/0xfbef5
[   50.203490]  ? exc_page_fault+0x62/0x150
[   50.203493]  entry_SYSCALL_64_after_hwframe+0x78/0x80
[   50.203495] RIP: 0033:0x7fa841d0f01e
[   50.203517] Code: 48 8b 0d fd 9d 0e 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 49 89 ca b8 af 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d ca 9d 0e 00 f7 d8 64 89 01 48
[   50.203518] RSP: 002b:00007ffd1917cf08 EFLAGS: 00000246 ORIG_RAX: 00000000000000af
[   50.203520] RAX: ffffffffffffffda RBX: 0000561c2715fd40 RCX: 00007fa841d0f01e
[   50.203521] RDX: 00007fa8423b932c RSI: 000000000006db86 RDI: 0000561c27a32aa0
[   50.203522] RBP: 0000561c27a32aa0 R08: 0000561c27168a60 R09: 000000000006c010
[   50.203523] R10: 0000000000000005 R11: 0000000000000246 R12: 00007fa8423b932c
[   50.203524] R13: 0000561c27178a00 R14: 0000000000000007 R15: 0000561c27169b70
[   50.203527]  </TASK>
[   50.203528] ---[ end trace 0000000000000000 ]---
[   50.203531] bnxt_en 0000:43:00.0 bnxt_re0: Free MW failed: 0xffffff92
[   50.230787] infiniband bnxt_re0: Couldn't open port 1
[   50.231303] infiniband bnxt_re0: Device registered with IB successfully
[   91.161675] bnxt_en 0000:43:00.1: QPLIB: bnxt_re_is_fw_stalled: FW STALL Detected. cmdq[0xe]=0x3 waited (40888 > 40000) msec active 1 
[   91.162408] bnxt_en 0000:43:00.1 bnxt_re1: Failed to modify HW QP
[   91.163075] infiniband bnxt_re1: Couldn't change QP1 state to INIT: -110
[   91.163553] infiniband bnxt_re1: Couldn't start port
[   91.164692] bnxt_en 0000:43:00.1 bnxt_re1: Failed to destroy HW QP
[   91.165179] bnxt_en 0000:43:00.1 bnxt_re1: Free MW failed: 0xffffff92
[   91.165498] infiniband bnxt_re1: Couldn't open port 1
[   91.166028] infiniband bnxt_re1: Device registered with IB successfully
[   91.180881] XFS (md126): Mounting V5 Filesystem 2f004d25-f83e-4473-94e2-bf5d544112b5
[   91.182397] XFS (md123): Mounting V5 Filesystem 4c75560e-87bf-499d-84d9-bdf8ea51c740
[   91.193783] XFS (md123): Ending clean mount
[   91.205241]  md124:
[   91.370594] XFS (md126): Ending clean mount
[   96.216133] evm: overlay not supported
[   96.378151] Warning: Unmaintained driver is detected: ip_set
[   96.576894] bnxt_en 0000:43:00.0 eno1np0: NIC Link is Up, 10000 Mbps full duplex, Flow control: none
[   96.577506] bnxt_en 0000:43:00.0 eno1np0: EEE is not active
[   96.577878] bnxt_en 0000:43:00.0 eno1np0: FEC autoneg off encoding: None
[   96.749368] bnxt_en 0000:43:00.1 eno2np1: NIC Link is Up, 10000 Mbps full duplex, Flow control: none
[   96.750123] bnxt_en 0000:43:00.1 eno2np1: EEE is not active
[   96.750600] bnxt_en 0000:43:00.1 eno2np1: FEC autoneg off encoding: None
[   96.792985] bridge: filtering via arp/ip/ip6tables is no longer available by default. Update your scripts to load br_netfilter if you need this.
[   96.881166] br0: port 1(bond0) entered blocking state
[   96.881620] br0: port 1(bond0) entered disabled state
[   96.882121] device bond0 entered promiscuous mode
[   96.882529] br0: port 1(bond0) entered blocking state
[   96.882885] br0: port 1(bond0) entered forwarding state
[   96.883385] br0: port 1(bond0) entered disabled state
[   96.971207] bnxt_en 0000:43:00.0 eno1np0: NIC Link is Up, 10000 Mbps full duplex, Flow control: none
[   96.971893] bnxt_en 0000:43:00.0 eno1np0: EEE is not active
[   96.972311] bnxt_en 0000:43:00.0 eno1np0: FEC autoneg off encoding: None
[   96.973704] bnxt_en 0000:43:00.0 bnxt_re0: Failed to add GID: 0xffffff92
[   96.973708] device eno1np0 entered promiscuous mode
[   96.973711] infiniband bnxt_re0: add_roce_gid GID add failed port=1 index=2
[   96.973716] __ib_cache_gid_add: unable to add gid 0000:0000:0000:0000:0000:ffff:0a0a:0013 error=-110
[   96.973720] bnxt_en 0000:43:00.0 bnxt_re0: Failed to add GID: 0xffffff92
[   96.973723] infiniband bnxt_re0: add_roce_gid GID add failed port=1 index=2
[   96.973725] __ib_cache_gid_add: unable to add gid 0000:0000:0000:0000:0000:ffff:0a0a:0013 error=-110
[   96.973783] bond0: (slave eno1np0): Enslaving as a backup interface with an up link
[   97.030469] bnxt_en 0000:43:00.1 bnxt_re1: Failed to add GID: 0xffffff92
[   97.031235] infiniband bnxt_re1: add_roce_gid GID add failed port=1 index=0
[   97.031709] __ib_cache_gid_add: unable to add gid fe80:0000:0000:0000:3eec:efff:fe97:7c62 error=-110
[   97.032142] bnxt_en 0000:43:00.1 bnxt_re1: Failed to add GID: 0xffffff92
[   97.032553] infiniband bnxt_re1: add_roce_gid GID add failed port=1 index=0
[   97.032970] __ib_cache_gid_add: unable to add gid fe80:0000:0000:0000:3eec:efff:fe97:7c62 error=-110
[   97.072530] bnxt_en 0000:43:00.1 eno2np1: NIC Link is Up, 10000 Mbps full duplex, Flow control: none
[   97.073327] bnxt_en 0000:43:00.1 eno2np1: EEE is not active
[   97.073807] bnxt_en 0000:43:00.1 eno2np1: FEC autoneg off encoding: None
[   97.074823] bnxt_en 0000:43:00.1 bnxt_re1: Failed to add GID: 0xffffff92
[   97.074881] device eno2np1 entered promiscuous mode
[   97.075462] infiniband bnxt_re1: add_roce_gid GID add failed port=1 index=0
[   97.076088] bond0: (slave eno2np1): Enslaving as a backup interface with an up link
[   97.076623] __ib_cache_gid_add: unable to add gid fe80:0000:0000:0000:3eec:efff:fe97:7c62 error=-110
[   97.077471] IPv6: ADDRCONF(NETDEV_CHANGE): bond0: link becomes ready
[   97.077833] bnxt_en 0000:43:00.1 bnxt_re1: Failed to add GID: 0xffffff92
[   97.077836] infiniband bnxt_re1: add_roce_gid GID add failed port=1 index=0
[   97.078427] br0: port 1(bond0) entered blocking state
[   97.078953] __ib_cache_gid_add: unable to add gid fe80:0000:0000:0000:3eec:efff:fe97:7c62 error=-110
[   97.079463] br0: port 1(bond0) entered forwarding state
[   97.080025] bnxt_en 0000:43:00.1 bnxt_re1: Failed to add GID: 0xffffff92
[   97.081499] infiniband bnxt_re1: add_roce_gid GID add failed port=1 index=0
[   97.081901] __ib_cache_gid_add: unable to add gid fe80:0000:0000:0000:3eec:efff:fe97:7c62 error=-110
[   97.082298] bnxt_en 0000:43:00.1 bnxt_re1: Failed to add GID: 0xffffff92
[   97.082695] infiniband bnxt_re1: add_roce_gid GID add failed port=1 index=0
[   97.083083] __ib_cache_gid_add: unable to add gid fe80:0000:0000:0000:3eec:efff:fe97:7c62 error=-110
[   97.083488] bnxt_en 0000:43:00.0 bnxt_re0: Failed to add GID: 0xffffff92
[   97.083892] infiniband bnxt_re0: add_roce_gid GID add failed port=1 index=2
[   97.084283] __ib_cache_gid_add: unable to add gid 0000:0000:0000:0000:0000:ffff:0a0a:0013 error=-110
[   97.084687] bnxt_en 0000:43:00.0 bnxt_re0: Failed to add GID: 0xffffff92
[   97.085081] infiniband bnxt_re0: add_roce_gid GID add failed port=1 index=2
[   97.085479] __ib_cache_gid_add: unable to add gid 0000:0000:0000:0000:0000:ffff:0a0a:0013 error=-110
[   97.085891] bnxt_en 0000:43:00.1 bnxt_re1: Failed to add GID: 0xffffff92
[   97.086295] infiniband bnxt_re1: add_roce_gid GID add failed port=1 index=2
[   97.087151] __ib_cache_gid_add: unable to add gid 0000:0000:0000:0000:0000:ffff:0a0a:0013 error=-110
[   97.087659] bnxt_en 0000:43:00.1 bnxt_re1: Failed to add GID: 0xffffff92
[   97.088101] infiniband bnxt_re1: add_roce_gid GID add failed port=1 index=2
[   97.088534] __ib_cache_gid_add: unable to add gid 0000:0000:0000:0000:0000:ffff:0a0a:0013 error=-110

Card info:

# lspci | grep Ethernet          
43:00.0 Ethernet controller: Broadcom Inc. and subsidiaries BCM57416 NetXtreme-E Dual-Media 10G RDMA Ethernet Controller (rev 01)
43:00.1 Ethernet controller: Broadcom Inc. and subsidiaries BCM57416 NetXtreme-E Dual-Media 10G RDMA Ethernet Controller (rev 01)

Loaded modules:

# lsmod | grep bnxt
bnxt_re               188416  0
ib_uverbs             208896  1 bnxt_re
ib_core               557056  6 rdma_cm,rpcrdma,iw_cm,bnxt_re,ib_uverbs,ib_cm
bnxt_en               425984  1 bnxt_re

ethtool output:

# ethtool -i eno1np0
driver: bnxt_en
version: 5.14.0-503.15.1.el9_5.x86_64
firmware-version: 218.0.153.0/pkg 218.0.169.0
expansion-rom-version: 
bus-info: 0000:43:00.0
supports-statistics: yes
supports-test: yes
supports-eeprom-access: yes
supports-register-dump: yes
supports-priv-flags: no

RDMA config:

# rdma link
link bnxt_re0/1 state ACTIVE physical_state LINK_UP netdev eno1np0 
link bnxt_re1/1 state ACTIVE physical_state LINK_UP netdev eno2np1

It looks like the network is running normally. I remember that in the past the server logs also contained InfiniBand errors after a reboot, but because I don't use InfiniBand and the network was working, I never looked into it more closely.

Is this something I have to worry about, and can I fix it somehow?

Have a good day!
Jonathan

A cursory search for "infiniband bnxt_re0: Couldn't start port" reveals [0], which suggests:

1. Update the firmware of your NIC.

2a. Disable the RDMA feature (if you don't need it) on the NIC itself (it is enabled by default; you need to install the niccli tool):

    niccli -i 1 nvm -setoption support_rdma -scope 0 -value 0
    niccli -i 1 reset

2b. Disable loading the RDMA driver (if you don't need it); see the note below for the Rocky Linux equivalent:

    echo "blacklist bnxt_re" >> /etc/modprobe.d/blacklist-bnxt_re.conf
    update-initramfs -u

[0] Network interfaces down on reboot | Proxmox Support Forum
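
One note on 2b: update-initramfs is the Debian/Ubuntu tool used in that Proxmox thread. On Rocky Linux the initramfs is rebuilt with dracut instead, so the equivalent would be roughly (untested here):

# echo "blacklist bnxt_re" > /etc/modprobe.d/blacklist-bnxt_re.conf
# dracut -f

dracut -f regenerates the initramfs for the currently running kernel.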

Thank you @anthyve for finding this for me! I was a bit afraid that deactivating the module might leave me unable to reach the server any more.

I have now tried suggestion 2b, disabling the module and regenerating the initramfs with:

dracut --omit-drivers bnxt_re -f
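
After the next reboot I will check whether the RDMA side is really gone, roughly like this:

# lsmod | grep bnxt_re
# rdma link
# dmesg | grep -iE 'bnxt_re|infiniband'

If I understand the option correctly, --omit-drivers only keeps the driver out of the initramfs, so if bnxt_re still gets loaded later from the root filesystem I will probably also add the blacklist file from suggestion 2b and run dracut -f again.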