Mismatched Version of Kernel Symbol

Hello,

We are testing Rocky on a Grace Hopper (GH200) server.
When attempting to load the nvidia-peermem kernel module via modprobe, we encountered an error about a kernel symbol version mismatch.

  1. Specs:

    • OS: Rocky Linux 9.3 (Blue Onyx)
    • Kernel: 5.14.0-362.8.1.el9_3.aarch64+64k
  2. modprobe

    $ sudo modprobe nvidia-peermem 
    modprobe: ERROR: could not insert 'nvidia_peermem': Invalid argument
    
  3. dmesg

    [348143.219153] nvidia_peermem: disagrees about version of symbol ib_register_peer_memory_client
    [348143.219156] nvidia_peermem: Unknown symbol ib_register_peer_memory_client (err -22)
    
  4. kallsyms

    $ cat /proc/kallsyms | grep ib_register_peer_memory_client
    0000000000000000 r __kstrtab_ib_register_peer_memory_client	[ib_core]
    0000000000000000 r __kstrtabns_ib_register_peer_memory_client	[ib_core]
    0000000000000000 r __ksymtab_ib_register_peer_memory_client	[ib_core]
    0000000000000000 T ib_register_peer_memory_client	[ib_core]
    

So the kernel does provide the symbol ib_register_peer_memory_client.
We are not sure how to pinpoint the origin of the version mismatch.
It also seems that it is not possible to obtain the symbol's version string for diagnostic purposes.

We would much appreciate your insights on this issue.

Thanks.

Kernel modules are often tied to a specific kernel version.
You can check with modinfo, as root; its vermagic field shows the kernel the module was built for:

    modinfo /path/2/yourmodule.ko
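
One way to see where a "disagrees about version of symbol" error comes from is to compare the CRC that the module expects for the symbol with the CRC its provider was built with. The helper below is only a minimal sketch, not an official tool. The function names symbol_crc and compare_crc are made up for illustration, and it assumes the input has the CRC in the first column and the symbol name in the second, which is the layout of both `modprobe --dump-modversions` output and Module.symvers files.

```shell
# Print the CRC listed for one symbol.
# Reads a listing on stdin where column 1 is the CRC and column 2 the symbol.
symbol_crc() {
    awk -v sym="$1" '$2 == sym { print $1; exit }'
}

# Compare the CRC a module expects ($1) with the CRC its provider exports ($2).
# Prints "unknown" if either value is missing.
compare_crc() {
    if [ -z "$1" ] || [ -z "$2" ]; then
        echo "unknown"
    elif [ "$1" = "$2" ]; then
        echo "match"
    else
        echo "mismatch"
    fi
}
```

It could be used roughly like this, where /path/to/Module.symvers is a placeholder for the Module.symvers of the build tree that produced the loaded ib_core (its location depends on how that driver was installed):

    want=$(modprobe --dump-modversions "$(modinfo -n nvidia-peermem)" | symbol_crc ib_register_peer_memory_client)
    have=$(symbol_crc ib_register_peer_memory_client < /path/to/Module.symvers)
    compare_crc "$want" "$have"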

Thanks for the comment.

The kernel module in question is tied to the mellanox_ofed driver.
The solution is to install the mellanox_ofed driver first, then the NVIDIA driver.
With that order, we can load nvidia-peermem without issue.
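
To check that the install order took effect, it can help to see which ib_core file modprobe would load. The helper below is a sketch with a hypothetical name (module_origin). It assumes the usual layout where modules shipped with the kernel live under /lib/modules/<version>/kernel/ and separately installed ones live elsewhere (for example under extra/ or updates/).

```shell
# Classify a module file path as shipped with the kernel or installed separately.
# Assumption: in-tree modules sit under /lib/modules/<version>/kernel/.
module_origin() {
    case "$1" in
        "")                       echo "unknown" ;;
        */lib/modules/*/kernel/*) echo "in-tree" ;;
        *)                        echo "out-of-tree" ;;
    esac
}
```

For example:

    module_origin "$(modinfo -n ib_core)"

An out-of-tree result would be consistent with the mellanox_ofed build of ib_core being the one picked up, though this alone does not confirm it.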
