Insmod fails - module does not match running kernel but module has been compiled against correct kernel

Goal: Compile NVMe driver on Rocky 9 (currently no changes to source).
Problem: Either the Module.symvers distributed in kernel-devel is incorrect, or I’m doing something wrong and can’t figure out what it is. I don’t understand how compilation can succeed against the running kernel’s config and, ostensibly, its symbols, yet I still get symbol errors when I try to load the compiled module.

What I’m doing

  1. Update everything and reboot to make sure it’s all where it needs to be: dnf update -y && reboot
  2. Install headers: dnf install -y kernel-headers ncurses-devel
  3. Get kernel source: dnf download --source kernel && rpm2cpio kernel-5.14.0-362.18.1.el9_3.src.rpm | cpio -idmv && tar -xf linux-5.14.0-362.18.1.el9_3.tar.xz (this is where I’m getting the NVMe source code; I have not pulled it from elsewhere)
  4. Pull in Module.symvers from kernel-devel: cp /usr/src/kernels/5.14.0-362.18.1.el9_3.x86_64/Module.symvers .
  5. Pull in the config from my specific kernel: cp /boot/config-$(uname -r) .config
  6. Run the following to build:
make clean
cp -f /boot/config-$(uname -r) .config
cp -f /usr/src/kernels/5.14.0-362.18.1.el9_3.x86_64/Module.symvers .
make -j$(nproc --all) scripts prepare modules_prepare
make ARCH=x86_64 -j$(nproc --all) M=drivers/nvme
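
For anyone retracing these steps, one thing worth ruling out early is a CRC mismatch for a single symbol between two Module.symvers files (for example the kernel-devel copy and whatever ends up in the build tree). A minimal sketch, assuming the usual tab-separated layout with the CRC in column 1 and the symbol name in column 2; symvers_crc and symvers_crc_matches are made-up helper names, not part of any kernel tooling.

```shell
# Hypothetical helper: print the CRC that a Module.symvers file records
# for one exported symbol. Returns non-zero if the symbol is not listed.
# Assumes tab-separated columns: CRC, symbol, owning module, export type.
symvers_crc() {
    local file="$1" symbol="$2"
    awk -F'\t' -v sym="$symbol" '
        $2 == sym { print $1; found = 1; exit }
        END       { exit !found }
    ' "$file"
}

# Hypothetical helper: succeed only if both files list the symbol with the
# same CRC. Returns 2 if either file does not list the symbol at all.
symvers_crc_matches() {
    local a b
    a=$(symvers_crc "$1" "$3") || return 2
    b=$(symvers_crc "$2" "$3") || return 2
    [ "$a" = "$b" ]
}
```

For example, symvers_crc /usr/src/kernels/$(uname -r)/Module.symvers nvme_auth_gen_pubkey should print the CRC the distribution recorded, provided the symbol is listed there.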

The Problem

If I leave the NVMe options at their defaults in make menuconfig:

[image: NVMe options in make menuconfig at their defaults]

and then build, loading the module produces these errors:

[23890.056877] nvme_core: disagrees about version of symbol nvme_auth_gen_shared_secret
[23890.056881] nvme_core: Unknown symbol nvme_auth_gen_shared_secret (err -22)
[23890.057140] nvme_core: disagrees about version of symbol nvme_auth_gen_pubkey
[23890.057142] nvme_core: Unknown symbol nvme_auth_gen_pubkey (err -22)
[23890.057471] nvme_core: disagrees about version of symbol nvme_auth_gen_privkey
[23890.057472] nvme_core: Unknown symbol nvme_auth_gen_privkey (err -22)
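
When the log is longer than this, it can help to reduce it to the distinct module/symbol pairs that fail the version check. A small sketch that reads kernel log text on stdin; the function name version_mismatches is invented for illustration, and it assumes the message wording shown above.

```shell
# Hypothetical helper: read kernel log text on stdin and print each
# "module symbol" pair named in a "disagrees about version of symbol"
# line, de-duplicated and sorted.
version_mismatches() {
    awk '/disagrees about version of symbol/ {
        for (i = 2; i <= NF; i++) {
            if ($i == "disagrees") {
                mod = $(i - 1)        # field just before "disagrees", e.g. "nvme_core:"
                sub(/:$/, "", mod)    # drop the trailing colon
                print mod, $NF        # the symbol is the last field on the line
            }
        }
    }' | sort -u
}
```

Typical use would be dmesg | version_mismatches.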

If I change the options to remove everything I don’t care about, I receive a different error. These are the options I used (listed as text because new users are limited to one media item per post), followed by the error:

  • [M] NVMe Express block device
  • [*] NVMe multipath support
  • [*] NVMe verbose error reporting
  • NVMe hardware monitoring
  • < > NVMe Express over Fabrics RDMA host driver
  • < > NVMe Express over Fabrics FC host driver
  • < > NVMe Express over Fabrics TCP host driver
  • NVMe Express over Fabrics In-Band Authentication
  • < > NVMe Target support

module: x86/modules: Skipping invalid relocation target, existing value is nonzero for type 1, loc 0000000087bc08cc, val ffffffffc071647a

I’m at a total loss. As far as I know, the first error can only occur when the module and the running kernel have different versions of the same symbol. I’m less sure what’s going on in the second case, but either way there seems to be a difference between what I’m pulling from kernel-devel and the running kernel. As far as I can tell, the .config and Module.symvers I’m compiling against match the running kernel, so the only thing I can conclude is that what ships in Rocky’s kernel-devel is wrong, but that seems unlikely.

Here is how the two modules compare (my build first, then the original):

[root@nvmetest linux-5.14.0-362.18.1.el9_3]# !mod
modinfo ./drivers/nvme/host/nvme-core.ko
filename:       /root/new_driver/linux-5.14.0-362.18.1.el9_3/./drivers/nvme/host/nvme-core.ko
version:        1.0
license:        GPL
rhelversion:    9.3
srcversion:     A869617A5E58420845515F4
depends:        t10-pi
retpoline:      Y
name:           nvme_core
vermagic:       5.14.0 SMP preempt mod_unload modversions
parm:           multipath:turn on native support for multiple controllers per subsystem (bool)
parm:           iopolicy:Default multipath I/O policy; 'numa' (default) or 'round-robin'
parm:           admin_timeout:timeout in seconds for admin commands (uint)
parm:           io_timeout:timeout in seconds for I/O (uint)
parm:           shutdown_timeout:timeout in seconds for controller shutdown (byte)
parm:           max_retries:max number of retries a command may have (byte)
parm:           default_ps_max_latency_us:max power saving latency for new devices; use PM QOS to change per device (ulong)
parm:           force_apst:allow APST for newly enumerated devices even if quirked off (bool)
parm:           apst_primary_timeout_ms:primary APST timeout in ms (ulong)
parm:           apst_secondary_timeout_ms:secondary APST timeout in ms (ulong)
parm:           apst_primary_latency_tol_us:primary APST latency tolerance in us (ulong)
parm:           apst_secondary_latency_tol_us:secondary APST latency tolerance in us (ulong)
[root@nvmetest linux-5.14.0-362.18.1.el9_3]# modinfo /lib/modules/5.14.0-362.18.1.el9_3.x86_64/kernel/drivers/nvme/host/nvme-core.ko.xz
filename:       /lib/modules/5.14.0-362.18.1.el9_3.x86_64/kernel/drivers/nvme/host/nvme-core.ko.xz
version:        1.0
license:        GPL
rhelversion:    9.3
srcversion:     ADFE53FFFB5D30ECFF130B0
depends:        nvme-common,t10-pi
retpoline:      Y
intree:         Y
name:           nvme_core
vermagic:       5.14.0-362.18.1.el9_3.x86_64 SMP preempt mod_unload modversions
sig_id:         PKCS#7
signer:         Rocky kernel signing key
sig_key:        37:B0:46:2C:D4:62:CB:E7:6C:CA:AE:9F:2A:A2:BE:E1:36:3A:8A:AF
sig_hashalgo:   sha256
signature:      60:89:BA:1D:1C:71:38:82:DF:09:73:B4:23:3E:C8:FE:7B:E4:9F:0D:
                62:6E:28:D7:3E:5A:5E:11:CD:7B:D2:52:E2:C6:ED:5E:B6:A7:19:54:
                8A:FB:BB:E8:2D:A5:77:3F:A1:C1:7E:EB:45:74:30:E9:18:1C:3D:9A:
                53:4A:2B:B0:1E:F0:35:D3:D1:E5:B6:A5:D0:47:6C:2F:B7:C6:6F:00:
                30:0E:82:BA:FD:4F:9D:0E:3B:4A:17:A4:1B:E8:31:FC:FB:BC:C2:93:
                1C:6D:5E:94:FD:DE:65:3B:3E:0B:F4:B4:B0:82:67:87:8C:90:C9:74:
                44:BB:14:D9:F9:43:33:CC:CC:77:29:11:2C:3D:79:30:EA:B3:63:74:
                F3:02:F0:DA:68:40:BA:65:B0:E5:D8:90:FF:B0:CA:8D:D7:31:00:47:
                FE:9C:B9:17:8F:81:1D:7F:45:F6:98:E8:14:1F:73:99:00:51:18:48:
                1F:29:98:F4:37:FA:62:46:FF:1B:64:B5:1F:03:C3:5C:87:2E:13:9E:
                EE:8C:32:DE:D8:B6:3F:1D:C2:69:45:46:E2:8B:E4:BD:C2:7C:00:14:
                3F:7B:76:C8:43:4E:ED:24:BE:C8:9D:85:16:C6:9B:55:1F:BA:7B:39:
                07:57:A7:46:1A:E4:98:D5:29:C9:27:07:0B:3A:FE:6D:49:4B:DD:24:
                E0:4C:99:C1:C4:88:4D:E1:D9:78:EC:46:4F:D6:94:D6:93:B0:D4:24:
                23:08:40:35:F9:41:0D:1E:4A:78:3C:B2:A9:DB:51:C9:D0:96:F5:64:
                43:7E:FF:69:71:09:06:9D:79:B0:56:A0:49:71:69:64:3D:50:B6:0B:
                DE:A0:FA:36:D7:86:AD:B9:2A:8C:11:B5:73:F3:C5:2B:C5:2F:C2:A6:
                AB:7F:01:B1:E6:60:8F:6A:F0:A1:AE:A9:32:E1:30:DA:4D:7A:98:F5:
                89:F7:7B:4D:FF:23:4B:77:29:BA:62:1B:54:09:1F:33:57:8F:44:A3:
                FE:19:D6:43
parm:           multipath:turn on native support for multiple controllers per subsystem (bool)
parm:           iopolicy:Default multipath I/O policy; 'numa' (default) or 'round-robin'
parm:           admin_timeout:timeout in seconds for admin commands (uint)
parm:           io_timeout:timeout in seconds for I/O (uint)
parm:           shutdown_timeout:timeout in seconds for controller shutdown (byte)
parm:           max_retries:max number of retries a command may have (byte)
parm:           default_ps_max_latency_us:max power saving latency for new devices; use PM QOS to change per device (ulong)
parm:           force_apst:allow APST for newly enumerated devices even if quirked off (bool)
parm:           apst_primary_timeout_ms:primary APST timeout in ms (ulong)
parm:           apst_secondary_timeout_ms:secondary APST timeout in ms (ulong)
parm:           apst_primary_latency_tol_us:primary APST latency tolerance in us (ulong)
parm:           apst_secondary_latency_tol_us:secondary APST latency tolerance in us (ulong)
[root@nvmetest linux-5.14.0-362.18.1.el9_3]#
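
Most of the two listings above is identical, so the differences are easy to miss by eye. A rough sketch for comparing only the fields that matter here, assuming each modinfo output has been saved to a text file first; modinfo_field and modinfo_diff are hypothetical names. For a single field on a real module, modinfo -F depends <file> gives the same value directly.

```shell
# Hypothetical helper: print the value(s) of FIELD from saved modinfo text.
# Usage: modinfo_field FILE FIELD
modinfo_field() {
    awk -v key="$2:" '$1 == key { sub(/^[^:]*:[[:space:]]*/, ""); print }' "$1"
}

# Hypothetical helper: report depends, vermagic and srcversion only where
# the rebuilt module and the stock module differ.
# Usage: modinfo_diff BUILT_TXT STOCK_TXT
modinfo_diff() {
    local field built stock
    for field in depends vermagic srcversion; do
        built=$(modinfo_field "$1" "$field")
        stock=$(modinfo_field "$2" "$field")
        [ "$built" = "$stock" ] ||
            printf '%s: built="%s" stock="%s"\n' "$field" "$built" "$stock"
    done
}
```

A differing srcversion is expected for any rebuild; depends and vermagic are the fields more likely to say something useful.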

I can force the vermagic strings to match exactly using CONFIG_LOCALVERSION, but the resulting errors are identical. This post also indicates that matching the base kernel version is sufficient.
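
To check whether the release part of a vermagic string equals the running kernel's release, a one-line comparison is enough. This is only a sketch with an invented function name (vermagic_matches); it compares strings and says nothing about whether the kernel would actually accept or reject the module.

```shell
# Hypothetical helper: succeed if the first word of a vermagic string
# (the kernel release) equals the given release string.
# Usage: vermagic_matches "VERMAGIC STRING" "RELEASE"
vermagic_matches() {
    [ "${1%% *}" = "$2" ]
}
```

For example: vermagic_matches "$(modinfo -F vermagic drivers/nvme/host/nvme-core.ko)" "$(uname -r)" || echo "release differs".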

Update - Problem 2

My module, even though compiled with the system’s .config file, does not have matching dependencies. It has only depends: t10-pi, whereas the live module has depends: nvme-common,t10-pi.
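
The gap between the two depends lists can be computed rather than eyeballed. A minimal sketch, taking the two comma-separated lists as arguments; missing_deps is a hypothetical helper name.

```shell
# Hypothetical helper: print every module that appears in the stock
# module's comma-separated depends list but not in the rebuilt module's.
# Usage: missing_deps BUILT_DEPENDS STOCK_DEPENDS
missing_deps() {
    local built=",$1," dep
    local IFS=,                       # split the stock list on commas
    for dep in $2; do
        case "$built" in
            *",$dep,"*) ;;            # also listed by the rebuilt module
            *) printf '%s\n' "$dep" ;;
        esac
    done
}
```

With the values above, missing_deps "t10-pi" "nvme-common,t10-pi" prints nvme-common.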

Update 3

I got past the symbol error by loading my rebuilt nvme-common first, once I realized nvme-core depends on it. I still get the relocation error, though. My new setup is:

dnf install -y rpm-build rpmdevtools git python3-devel make gcc flex bison kernel-headers ncurses-devel tmux elfutils-libelf-devel openssl-devel bc kernel-devel-$(uname -r) dwarves
rpmdev-setuptree
dnf download --source kernel
rpm -ivh kernel-5.14.0-362.18.1.el9_3.src.rpm
rpmbuild -bp kernel.spec
cd /root/rpmbuild/BUILD/kernel-5.14.0-362.18.1.el9_3/linux-5.14.0-362.18.1.el9.x86_64/
cp -f /boot/config-$(uname -r) .config
cp -f /usr/src/kernels/$(uname -r)/Module.symvers .
make -j$(nproc --all) scripts prepare modules_prepare
make ARCH=x86_64 -j$(nproc --all) M=drivers/nvme
insmod drivers/nvme/common/nvme-common.ko
insmod drivers/nvme/host/nvme-core.ko

I ultimately rebuilt the box, made sure it was up to date before I did anything else, and then ran:

dnf install -y rpm-build rpmdevtools git python3-devel make gcc flex bison kernel-headers ncurses-devel tmux elfutils-libelf-devel openssl-devel bc kernel-devel-$(uname -r) dwarves
rpmdev-setuptree
dnf download --source kernel
rpm -ivh kernel-5.14.0-362.18.1.el9_3.src.rpm
cd /root/rpmbuild/SPECS
rpmbuild -bp kernel.spec
cd /root/rpmbuild/BUILD/kernel-5.14.0-362.18.1.el9_3/linux-5.14.0-362.18.1.el9.x86_64/
cp -f /boot/config-$(uname -r) .config
cp -f /usr/src/kernels/$(uname -r)/Module.symvers .
make -j$(nproc --all) scripts prepare modules_prepare
make ARCH=x86_64 -j$(nproc --all) M=drivers/nvme
insmod drivers/nvme/common/nvme-common.ko
insmod drivers/nvme/host/nvme-core.ko

Not sure what the problem was, but that worked first try.
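
Because the build copies Module.symvers from /usr/src/kernels/$(uname -r), one cheap pre-flight check is that a kernel-devel tree for the running release is actually installed before starting. A sketch under that assumption; devel_tree_ready is an invented name, and the optional second argument exists only so the check can be pointed at another directory.

```shell
# Hypothetical helper: succeed only if a kernel-devel tree for RELEASE
# exists under ROOT (default /usr/src/kernels) and contains Module.symvers.
# Usage: devel_tree_ready RELEASE [ROOT]
devel_tree_ready() {
    local release="$1" root="${2:-/usr/src/kernels}"
    [ -f "$root/$release/Module.symvers" ]
}
```

For example: devel_tree_ready "$(uname -r)" || echo "no kernel-devel tree for the running kernel".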