NFS mounts hang forever

We have run into issues with NFS mounts on some of our machines.

NFS server

  • Linux server 5.14.0-427.22.1.el9_4.x86_64 #1 SMP PREEMPT_DYNAMIC Wed Jun 19 17:35:04 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
  • Exports a total of 13 file systems (NFS v3 + v4)
[root@server ~]# exportfs -s
/export  IPv4/23(sync,wdelay,hide,no_subtree_check,fsid=0,sec=sys,ro,secure,root_squash,no_all_squash)
/export  IPv6/48(sync,wdelay,hide,no_subtree_check,fsid=0,sec=sys,ro,secure,root_squash,no_all_squash)
/export/homea  IPv4/23(sync,wdelay,hide,no_subtree_check,fsid=1,sec=sys,rw,secure,root_squash,no_all_squash)
/export/homea  IPv6/48(sync,wdelay,hide,no_subtree_check,fsid=1,sec=sys,rw,secure,root_squash,no_all_squash)

NFS client

  • Linux client 5.14.0-427.40.1.el9_4.x86_64 #1 SMP PREEMPT_DYNAMIC Wed Oct 16 14:57:47 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux

  • The server and the client are in the same IPv4/IPv6 subnet
  • SELinux is disabled
  • The firewall is off on both server and client

The command

mount -vvvv server:/homea /mounts/homea -s -o rw

executed on the client hangs forever. The same command run with strace results in

mount.nfs: trying text-based options 'sloppy,vers=4.2,addr=SERVERIPV6,clientaddr=CLIENTIPV6'
wait4(161989, 0x7ffd28c2c358, 0, NULL)  = ? ERESTARTSYS (To be restarted if SA_RESTART is set)
--- SIGWINCH {si_signo=SIGWINCH, si_code=SI_KERNEL} ---

and lots of these lines

wait4(161989, 0x7ffd28c2c358, 0, NULL)  = ? ERESTARTSYS (To be restarted if SA_RESTART is set)
--- SIGWINCH {si_signo=SIGWINCH, si_code=SI_KERNEL} ---
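The wait4() loop above is just the parent process waiting on its mount.nfs child; strace has to follow forks to show what mount.nfs itself is doing. A sketch, wrapped in a function so it is only run deliberately (as root, with the same paths as in the question):

```shell
# Sketch: strace without -f only shows the parent's wait4() loop.
# Call trace_nfs_mount as root on the client to trace mount.nfs itself.
trace_nfs_mount() {
    strace -f -e trace=%network,%file \
        mount -vvvv server:/homea /mounts/homea -o rw
}
```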

It doesn’t make a difference whether the mount is done via autofs or manually.
On other server/client combinations the mounts work perfectly well.

What can I do to make this work?

The clients with the issue: did they ever work, or are they new (first try)?

Do all clients have the same /etc/nfsmount.conf, with everything commented out?

strace isn’t going to tell you much, since the actual mount is happening in the kernel. What does the dmesg command show when this mount is attempted?
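One way to capture exactly what the kernel logs during the attempt, wrapped in a function so it is only run deliberately (as root; a sketch using the mount command from the question):

```shell
# Sketch (run as root): retry the mount with a timeout, then show
# what the kernel logged during the attempt.
mount_and_check_dmesg() {
    dmesg --clear                                      # start from an empty ring buffer
    timeout 30 mount -vvvv server:/homea /mounts/homea -o rw
    dmesg                                              # messages produced by the attempt
}
```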

The client hardware did work before with earlier Rocky versions. I can’t recall specific events after which the NFS mounts failed. It might have to do with a move to NFSv4 mounts that I recently introduced. NFSv4 works on ~100 clients (Rocky 9.3 or 9.4) except on ~4 clients. I did a fresh install on 3 of those 4 clients - they are still unable to mount.
All clients (both those that fail to mount and those that don’t) use the default NFS configuration files (/etc/nfs.conf, /etc/nfsmount.conf, /etc/nfsmount.conf.d/10-nfsv4.conf).
/etc/nfsmount.conf is fully commented out. In /etc/nfs.conf the defaults for use-gss-proxy, rdma, and rdma-port are not commented out.
/etc/nfsmount.conf.d/10-nfsv4.conf has the setting “Nfsvers=4”. The mounts don’t work even if this is commented out.
The NFS related installed packages are libnfsidmap, nfs-utils, sssd-nfs-idmap, libnfs, pcp-pmda-nfsclient, nfs-utils-coreos, nfs4-acl-tools, nfsv4-client-utils.

dmesg results in a few errors that I don’t see on a “working” machine:

...
[    1.452165] Warning: Deprecated Driver is detected: qla4xxx will not be maintained in a future major release and may be disabled
...
[    2.312091] ACPI BIOS Error (bug): Could not resolve symbol [\_SB.PCI0.SAT0.SPT0._GTF.DSSP], AE_NOT_FOUND (20221020/psargs-330)
[    2.312129] ACPI Error: Aborting method \_SB.PCI0.SAT0.SPT0._GTF due to previous error (AE_NOT_FOUND) (20221020/psparse-529)
[    2.312440] ata1.00: ATA-9: INTEL SSDSC2CT120A3, 300i, max UDMA/133
[    2.312871] ata1.00: 234441648 sectors, multi 16: LBA48 NCQ (depth 32), AA
[    2.313601] ACPI BIOS Error (bug): Could not resolve symbol [\_SB.PCI0.SAT0.SPT2._GTF.DSSP], AE_NOT_FOUND (20221020/psargs-330)
[    2.313643] ACPI Error: Aborting method \_SB.PCI0.SAT0.SPT2._GTF due to previous error (AE_NOT_FOUND) (20221020/psparse-529)
[    2.313681] ata3.00: ATAPI: TSSTcorp CDDVDW SH-224BB, SB00, max UDMA/100
[    2.314560] ACPI BIOS Error (bug): Could not resolve symbol [\_SB.PCI0.SAT0.SPT2._GTF.DSSP], AE_NOT_FOUND (20221020/psargs-330)
[    2.314599] ACPI Error: Aborting method \_SB.PCI0.SAT0.SPT2._GTF due to previous error (AE_NOT_FOUND) (20221020/psparse-529)
[    2.314635] ata3.00: configured for UDMA/100
[    2.322052] ACPI BIOS Error (bug): Could not resolve symbol [\_SB.PCI0.SAT0.SPT0._GTF.DSSP], AE_NOT_FOUND (20221020/psargs-330)
[    2.322089] ACPI Error: Aborting method \_SB.PCI0.SAT0.SPT0._GTF due to previous error (AE_NOT_FOUND) (20221020/psparse-529)
...
[    4.492714] ACPI Warning: SystemIO range 0x0000000000000428-0x000000000000042F conflicts with OpRegion 0x0000000000000400-0x000000000000047F (\PMIO) (20221020/utaddress-204)
[    4.494458] ACPI: OSL: Resource conflict; ACPI support missing from driver?
[    4.496103] ACPI Warning: SystemIO range 0x0000000000000540-0x000000000000054F conflicts with OpRegion 0x0000000000000500-0x0000000000000563 (\GPIO) (20221020/utaddress-204)
[    4.497887] ACPI: OSL: Resource conflict; ACPI support missing from driver?
[    4.499338] ACPI Warning: SystemIO range 0x0000000000000530-0x000000000000053F conflicts with OpRegion 0x0000000000000500-0x0000000000000563 (\GPIO) (20221020/utaddress-204)
[    4.501085] ACPI: OSL: Resource conflict; ACPI support missing from driver?
[    4.503050] ACPI Warning: SystemIO range 0x0000000000000500-0x000000000000052F conflicts with OpRegion 0x0000000000000500-0x0000000000000563 (\GPIO) (20221020/utaddress-204)
[    4.504978] ACPI: OSL: Resource conflict; ACPI support missing from driver?

There are a few NFS-related lines at the end of the dmesg output, which might give a hint:

[   11.537057] NFSD: Using nfsdcld client tracking operations.
[   11.537060] NFSD: no clients to reclaim, skipping NFSv4 grace period (net f0000000)
[   11.735194] block dm-0: the capability attribute has been deprecated.
[   14.038379] FS-Cache: Loaded
[   14.084389] Key type dns_resolver registered
[   14.232213] NFS: Registering the id_resolver key type
[   14.232219] Key type id_resolver registered
[   14.232220] Key type id_legacy registered

The ACPI warnings aren’t related to anything around NFS. The NFS-related lines also don’t look too strange.

  1. Can you try forcing vers=3 in the mount command?
  2. Post the rpcinfo output.
  3. You can increase the verbosity of the NFS client using rpcdebug -m nfs -s all, then try the mount and observe the journalctl output. You can disable the NFS client logging again with rpcdebug -m nfs -c all.
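The steps in item 3 can be sketched as a single function (run as root on the client; module and flag names per rpcdebug):

```shell
# Sketch of item 3: turn on NFS client debugging, retry the mount,
# inspect the kernel log, and turn debugging off again (run as root).
debug_nfs_mount() {
    rpcdebug -m nfs -s all                       # enable all NFS client debug flags
    timeout 30 mount -t nfs server:/homea /mounts/homea
    journalctl -k --no-pager | tail -n 100       # kernel messages from the attempt
    rpcdebug -m nfs -c all                       # switch debugging off again
}
```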

ad 1. NFSv3

Excellent news: vers=3 mounts do work!
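For reference, the working v3 variant of the original command looks like this (a sketch, wrapped in a function so it is only run deliberately):

```shell
# Sketch: the original mount command, forced to NFS v3.
mount_v3() {
    mount -vvvv -o vers=3,rw server:/homea /mounts/homea
}
```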

ad 2: rpcinfo output before change to vers=3

[root@fh09 ~]# rpcinfo
   program version netid     address                service    owner
    100000    4    tcp6      ::.0.111               portmapper superuser
    100000    3    tcp6      ::.0.111               portmapper superuser
    100000    4    udp6      ::.0.111               portmapper superuser
    100000    3    udp6      ::.0.111               portmapper superuser
    100000    4    tcp       0.0.0.0.0.111          portmapper superuser
    100000    3    tcp       0.0.0.0.0.111          portmapper superuser
    100000    2    tcp       0.0.0.0.0.111          portmapper superuser
    100000    4    udp       0.0.0.0.0.111          portmapper superuser
    100000    3    udp       0.0.0.0.0.111          portmapper superuser
    100000    2    udp       0.0.0.0.0.111          portmapper superuser
    100000    4    local     /run/rpcbind.sock      portmapper superuser
    100000    3    local     /run/rpcbind.sock      portmapper superuser
    100005    1    udp       0.0.0.0.78.80          mountd     superuser
    100005    1    tcp       0.0.0.0.78.80          mountd     superuser
    100005    1    udp6      ::.78.80               mountd     superuser
    100005    1    tcp6      ::.78.80               mountd     superuser
    100005    2    udp       0.0.0.0.78.80          mountd     superuser
    100005    2    tcp       0.0.0.0.78.80          mountd     superuser
    100005    2    udp6      ::.78.80               mountd     superuser
    100005    2    tcp6      ::.78.80               mountd     superuser
    100005    3    udp       0.0.0.0.78.80          mountd     superuser
    100005    3    tcp       0.0.0.0.78.80          mountd     superuser
    100005    3    udp6      ::.78.80               mountd     superuser
    100005    3    tcp6      ::.78.80               mountd     superuser
    100024    1    udp       0.0.0.0.152.55         status     29
    100024    1    tcp       0.0.0.0.164.149        status     29
    100024    1    udp6      ::.161.227             status     29
    100024    1    tcp6      ::.134.107             status     29
    100003    3    tcp       0.0.0.0.8.1            nfs        superuser
    100003    4    tcp       0.0.0.0.8.1            nfs        superuser
    100227    3    tcp       0.0.0.0.8.1            nfs_acl    superuser
    100003    3    tcp6      ::.8.1                 nfs        superuser
    100003    4    tcp6      ::.8.1                 nfs        superuser
    100227    3    tcp6      ::.8.1                 nfs_acl    superuser
    100021    1    udp       0.0.0.0.153.53         nlockmgr   superuser
    100021    3    udp       0.0.0.0.153.53         nlockmgr   superuser
    100021    4    udp       0.0.0.0.153.53         nlockmgr   superuser
    100021    1    tcp       0.0.0.0.155.241        nlockmgr   superuser
    100021    3    tcp       0.0.0.0.155.241        nlockmgr   superuser
    100021    4    tcp       0.0.0.0.155.241        nlockmgr   superuser
    100021    1    udp6      ::.134.121             nlockmgr   superuser
    100021    3    udp6      ::.134.121             nlockmgr   superuser
    100021    4    udp6      ::.134.121             nlockmgr   superuser
    100021    1    tcp6      ::.150.179             nlockmgr   superuser
    100021    3    tcp6      ::.150.179             nlockmgr   superuser
    100021    4    tcp6      ::.150.179             nlockmgr   superuser
[root@fh09 ~]#

ad 3: system log after "rpcdebug -m nfs -s all"

I get lots and lots of these lines:

Oct 25 17:05:13 client kernel: nfs4_handle_reclaim_lease_error: handled error -10008 for server server
Oct 25 17:05:13 client kernel: --> nfs4_proc_create_session clp=000000005d6b5f82 session=00000000cdc79772
Oct 25 17:05:13 client kernel: nfs4_init_channel_attrs: Fore Channel : max_rqst_sz=1049620 max_resp_sz=1049480 max_ops=8 max_reqs=64
Oct 25 17:05:13 client kernel: nfs4_init_channel_attrs: Back Channel : max_rqst_sz=4096 max_resp_sz=4096 max_resp_sz_cached=0 max_ops=2 max_reqs=16

NFSv3 works: okay, that narrows it down a bit.

Error -10008 is NFS4ERR_DELAY. A cursory search brings me to:
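Error -10008 is the kernel's negated form of NFSv4 status 10008, which RFC 3530 names NFS4ERR_DELAY. A quick shell sketch of the decoding (only a few codes listed; the full set is in the kernel's include/linux/nfs4.h):

```shell
# Map a kernel-logged NFSv4 error (negated) back to its RFC 3530 name.
decode_nfs4_err() {
    case "${1#-}" in                   # strip the leading minus sign
        10006) echo NFS4ERR_SERVERFAULT ;;
        10008) echo NFS4ERR_DELAY ;;
        10013) echo NFS4ERR_GRACE ;;
        *)     echo "unknown ($1)" ;;
    esac
}
decode_nfs4_err -10008    # prints NFS4ERR_DELAY
```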

(Sorry, walled behind a free-for-developers subscription)

Per NFS4 spec (RFC 3530):

  NFS4ERR_DELAY         The server initiated the request, but was not
                         able to complete it in a timely fashion. The
                         client should wait and then try the request
                         with a new RPC transaction ID.  For example,
                         this error should be returned from a server
                         that supports hierarchical storage and receives
                         a request to process a file that has been
                         migrated. In this case, the server should start
                         the immigration process and respond to client
                         with this error.  This error may also occur
                         when a necessary delegation recall makes
                         processing a request in a timely fashion
                         impossible.

Per link:

  • The NFS server may be congested and unable to handle the incoming calls from the NFS client. The NFS server vendor should be contacted for investigation of this issue.
  • The issue may depend on delegation, combined with a networking issue where the NFS client may not be able to receive callbacks such as CB_RECALL from the NFS server, or the NFS server may not be able to receive calls such as DELEGRETURN from the NFS client.

I am about to increase the memory on the file server with the most exports to make it respond faster to client requests, and I have reverted most of the mounts back to v3. But v4 is still desirable. The question is, what to do? Our data center admins only allow access to ports 111 and 2049 of one of our servers for security reasons, so NFS v4 is unavoidable.
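As a side note on the firewall constraint: NFS v3 can be made to work through a restrictive firewall by pinning its auxiliary services to fixed ports on the server, which the admins could then open alongside 111 and 2049. A sketch of the relevant /etc/nfs.conf sections; the port numbers here are arbitrary example choices, not values from this setup:

```ini
# /etc/nfs.conf (server side): pin the v3 auxiliary services to fixed ports
[mountd]
port=20048

[lockd]
port=32803
udp-port=32769

[statd]
port=662
```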

We have run into issues with NFS mounts on some of our machines.

Did the NFS client machines that worked use NFS v4 (before you switched to v3)? If so, are they using the same kernel version?

The question is, what to do?

Keep troubleshooting. Can you post the full dmesg with rpcdebug -m nfs -s all from the start of the mount? The snippet you sent doesn’t show which NFS message the server is responding to with NFS4ERR_DELAY.

Yes, NFS v4 worked well on almost all clients (on ~95 of about 100 machines).

Most clients run on 5.14.0-427.37.1.el9_4.x86_64 or 5.14.0-427.40.1.el9_4.x86_64. The problematic server (with 13 exported file systems) runs on 5.14.0-427.22.1.el9_4.x86_64.

[57635.144682] Key type dns_resolver registered
[57635.343061] NFS: Registering the id_resolver key type
[57635.343074] Key type id_resolver registered
[57635.343076] Key type id_legacy registered
[57804.474928] nfs4_handle_reclaim_lease_error: handled error -10008 for server ourserver.ourdomain.org
[57804.474941] --> nfs4_proc_create_session clp=000000006864147e session=000000002f930198
[57804.474947] nfs4_init_channel_attrs: Fore Channel : max_rqst_sz=1049620 max_resp_sz=1049480 max_ops=8 max_reqs=64
[57804.474950] nfs4_init_channel_attrs: Back Channel : max_rqst_sz=4096 max_resp_sz=4096 max_resp_sz_cached=0 max_ops=2 max_reqs=16
[57805.498939] nfs4_handle_reclaim_lease_error: handled error -10008 for server ourserver.ourdomain.org
[57805.498953] --> nfs4_proc_create_session clp=000000006864147e session=000000002f930198
[57805.498960] nfs4_init_channel_attrs: Fore Channel : max_rqst_sz=1049620 max_resp_sz=1049480 max_ops=8 max_reqs=64
[57805.498963] nfs4_init_channel_attrs: Back Channel : max_rqst_sz=4096 max_resp_sz=4096 max_resp_sz_cached=0 max_ops=2 max_reqs=16
[57806.522953] nfs4_handle_reclaim_lease_error: handled error -10008 for server ourserver.ourdomain.org
[57806.522963] --> nfs4_proc_create_session clp=000000006864147e session=000000002f930198
[57806.522968] nfs4_init_channel_attrs: Fore Channel : max_rqst_sz=1049620 max_resp_sz=1049480 max_ops=8 max_reqs=64
[57806.522972] nfs4_init_channel_attrs: Back Channel : max_rqst_sz=4096 max_resp_sz=4096 max_resp_sz_cached=0 max_ops=2 max_reqs=16
[57807.546964] nfs4_handle_reclaim_lease_error: handled error -10008 for server ourserver.ourdomain.org
[57807.546979] --> nfs4_proc_create_session clp=000000006864147e session=000000002f930198
[57807.546986] nfs4_init_channel_attrs: Fore Channel : max_rqst_sz=1049620 max_resp_sz=1049480 max_ops=8 max_reqs=64
[57807.546990] nfs4_init_channel_attrs: Back Channel : max_rqst_sz=4096 max_resp_sz=4096 max_resp_sz_cached=0 max_ops=2 max_reqs=16
[57808.570978] nfs4_handle_reclaim_lease_error: handled error -10008 for server ourserver.ourdomain.org
[57808.570988] --> nfs4_proc_create_session clp=000000006864147e session=000000002f930198
[57808.570993] nfs4_init_channel_attrs: Fore Channel : max_rqst_sz=1049620 max_resp_sz=1049480 max_ops=8 max_reqs=64
[57808.570996] nfs4_init_channel_attrs: Back Channel : max_rqst_sz=4096 max_resp_sz=4096 max_resp_sz_cached=0 max_ops=2 max_reqs=16
[57809.594991] nfs4_handle_reclaim_lease_error: handled error -10008 for server ourserver.ourdomain.org
[57809.595005] --> nfs4_proc_create_session clp=000000006864147e session=000000002f930198
[57809.595012] nfs4_init_channel_attrs: Fore Channel : max_rqst_sz=1049620 max_resp_sz=1049480 max_ops=8 max_reqs=64
[57809.595016] nfs4_init_channel_attrs: Back Channel : max_rqst_sz=4096 max_resp_sz=4096 max_resp_sz_cached=0 max_ops=2 max_reqs=16
[57810.619002] nfs4_handle_reclaim_lease_error: handled error -10008 for server ourserver.ourdomain.org
[57810.619012] --> nfs4_proc_create_session clp=000000006864147e session=000000002f930198
[57810.619017] nfs4_init_channel_attrs: Fore Channel : max_rqst_sz=1049620 max_resp_sz=1049480 max_ops=8 max_reqs=64
[57810.619021] nfs4_init_channel_attrs: Back Channel : max_rqst_sz=4096 max_resp_sz=4096 max_resp_sz_cached=0 max_ops=2 max_reqs=16
[57811.643015] nfs4_handle_reclaim_lease_error: handled error -10008 for server ourserver.ourdomain.org
[57811.643029] --> nfs4_proc_create_session clp=000000006864147e session=000000002f930198
[57811.643036] nfs4_init_channel_attrs: Fore Channel : max_rqst_sz=1049620 max_resp_sz=1049480 max_ops=8 max_reqs=64
[57811.643039] nfs4_init_channel_attrs: Back Channel : max_rqst_sz=4096 max_resp_sz=4096 max_resp_sz_cached=0 max_ops=2 max_reqs=16
[57812.667029] nfs4_handle_reclaim_lease_error: handled error -10008 for server ourserver.ourdomain.org
[57812.667039] --> nfs4_proc_create_session clp=000000006864147e session=000000002f930198
[57812.667044] nfs4_init_channel_attrs: Fore Channel : max_rqst_sz=1049620 max_resp_sz=1049480 max_ops=8 max_reqs=64
[57812.667047] nfs4_init_channel_attrs: Back Channel : max_rqst_sz=4096 max_resp_sz=4096 max_resp_sz_cached=0 max_ops=2 max_reqs=16
[57813.691039] nfs4_handle_reclaim_lease_error: handled error -10008 for server ourserver.ourdomain.org
[57813.691054] --> nfs4_proc_create_session clp=000000006864147e session=000000002f930198
[57813.691060] nfs4_init_channel_attrs: Fore Channel : max_rqst_sz=1049620 max_resp_sz=1049480 max_ops=8 max_reqs=64
[57813.691064] nfs4_init_channel_attrs: Back Channel : max_rqst_sz=4096 max_resp_sz=4096 max_resp_sz_cached=0 max_ops=2 max_reqs=16
[57814.715051] nfs4_handle_reclaim_lease_error: handled error -10008 for server ourserver.ourdomain.org
[57814.715061] --> nfs4_proc_create_session clp=000000006864147e session=000000002f930198
[57814.715066] nfs4_init_channel_attrs: Fore Channel : max_rqst_sz=1049620 max_resp_sz=1049480 max_ops=8 max_reqs=64
[57814.715069] nfs4_init_channel_attrs: Back Channel : max_rqst_sz=4096 max_resp_sz=4096 max_resp_sz_cached=0 max_ops=2 max_reqs=16
[57815.739048] nfs4_handle_reclaim_lease_error: handled error -10008 for server ourserver.ourdomain.org
[57815.739058] --> nfs4_proc_create_session clp=000000006864147e session=000000002f930198
[57815.739064] nfs4_init_channel_attrs: Fore Channel : max_rqst_sz=1049620 max_resp_sz=1049480 max_ops=8 max_reqs=64
[57815.739067] nfs4_init_channel_attrs: Back Channel : max_rqst_sz=4096 max_resp_sz=4096 max_resp_sz_cached=0 max_ops=2 max_reqs=16
[57816.763076] nfs4_handle_reclaim_lease_error: handled error -10008 for server ourserver.ourdomain.org
[57816.763090] --> nfs4_proc_create_session clp=000000006864147e session=000000002f930198
[57816.763096] nfs4_init_channel_attrs: Fore Channel : max_rqst_sz=1049620 max_resp_sz=1049480 max_ops=8 max_reqs=64
[57816.763100] nfs4_init_channel_attrs: Back Channel : max_rqst_sz=4096 max_resp_sz=4096 max_resp_sz_cached=0 max_ops=2 max_reqs=16
[57817.787075] nfs4_handle_reclaim_lease_error: handled error -10008 for server ourserver.ourdomain.org
[57817.787082] --> nfs4_proc_create_session clp=000000006864147e session=000000002f930198
[57817.787086] nfs4_init_channel_attrs: Fore Channel : max_rqst_sz=1049620 max_resp_sz=1049480 max_ops=8 max_reqs=64
[57817.787087] nfs4_init_channel_attrs: Back Channel : max_rqst_sz=4096 max_resp_sz=4096 max_resp_sz_cached=0 max_ops=2 max_reqs=16
[57818.811101] nfs4_handle_reclaim_lease_error: handled error -10008 for server ourserver.ourdomain.org
[57818.811115] --> nfs4_proc_create_session clp=000000006864147e session=000000002f930198
[57818.811122] nfs4_init_channel_attrs: Fore Channel : max_rqst_sz=1049620 max_resp_sz=1049480 max_ops=8 max_reqs=64
[57818.811125] nfs4_init_channel_attrs: Back Channel : max_rqst_sz=4096 max_resp_sz=4096 max_resp_sz_cached=0 max_ops=2 max_reqs=16
[57819.835112] nfs4_handle_reclaim_lease_error: handled error -10008 for server ourserver.ourdomain.org
[57819.835122] --> nfs4_proc_create_session clp=000000006864147e session=000000002f930198
[57819.835127] nfs4_init_channel_attrs: Fore Channel : max_rqst_sz=1049620 max_resp_sz=1049480 max_ops=8 max_reqs=64
[57819.835131] nfs4_init_channel_attrs: Back Channel : max_rqst_sz=4096 max_resp_sz=4096 max_resp_sz_cached=0 max_ops=2 max_reqs=16
[57820.859126] nfs4_handle_reclaim_lease_error: handled error -10008 for server ourserver.ourdomain.org
[57820.859140] --> nfs4_proc_create_session clp=000000006864147e session=000000002f930198
[57820.859146] nfs4_init_channel_attrs: Fore Channel : max_rqst_sz=1049620 max_resp_sz=1049480 max_ops=8 max_reqs=64
[57820.859150] nfs4_init_channel_attrs: Back Channel : max_rqst_sz=4096 max_resp_sz=4096 max_resp_sz_cached=0 max_ops=2 max_reqs=16
[57821.883136] nfs4_handle_reclaim_lease_error: handled error -10008 for server ourserver.ourdomain.org
[57821.883147] --> nfs4_proc_create_session clp=000000006864147e session=000000002f930198
[57821.883151] nfs4_init_channel_attrs: Fore Channel : max_rqst_sz=1049620 max_resp_sz=1049480 max_ops=8 max_reqs=64
[57821.883155] nfs4_init_channel_attrs: Back Channel : max_rqst_sz=4096 max_resp_sz=4096 max_resp_sz_cached=0 max_ops=2 max_reqs=16
[57822.907149] nfs4_handle_reclaim_lease_error: handled error -10008 for server ourserver.ourdomain.org
[57822.907163] --> nfs4_proc_create_session clp=000000006864147e session=000000002f930198
[57822.907169] nfs4_init_channel_attrs: Fore Channel : max_rqst_sz=1049620 max_resp_sz=1049480 max_ops=8 max_reqs=64
[57822.907173] nfs4_init_channel_attrs: Back Channel : max_rqst_sz=4096 max_resp_sz=4096 max_resp_sz_cached=0 max_ops=2 max_reqs=16
[57823.931162] nfs4_handle_reclaim_lease_error: handled error -10008 for server ourserver.ourdomain.org
[57823.931172] --> nfs4_proc_create_session clp=000000006864147e session=000000002f930198
[57823.931177] nfs4_init_channel_attrs: Fore Channel : max_rqst_sz=1049620 max_resp_sz=1049480 max_ops=8 max_reqs=64
[57823.931180] nfs4_init_channel_attrs: Back Channel : max_rqst_sz=4096 max_resp_sz=4096 max_resp_sz_cached=0 max_ops=2 max_reqs=16
[57824.955174] nfs4_handle_reclaim_lease_error: handled error -10008 for server ourserver.ourdomain.org
[57824.955189] --> nfs4_proc_create_session clp=000000006864147e session=000000002f930198
[57824.955196] nfs4_init_channel_attrs: Fore Channel : max_rqst_sz=1049620 max_resp_sz=1049480 max_ops=8 max_reqs=64
[57824.955199] nfs4_init_channel_attrs: Back Channel : max_rqst_sz=4096 max_resp_sz=4096 max_resp_sz_cached=0 max_ops=2 max_reqs=16
[57825.979185] nfs4_handle_reclaim_lease_error: handled error -10008 for server ourserver.ourdomain.org
[57825.979196] --> nfs4_proc_create_session clp=000000006864147e session=000000002f930198
[57825.979201] nfs4_init_channel_attrs: Fore Channel : max_rqst_sz=1049620 max_resp_sz=1049480 max_ops=8 max_reqs=64
[57825.979204] nfs4_init_channel_attrs: Back Channel : max_rqst_sz=4096 max_resp_sz=4096 max_resp_sz_cached=0 max_ops=2 max_reqs=16
[57827.003198] nfs4_handle_reclaim_lease_error: handled error -10008 for server ourserver.ourdomain.org
[57827.003213] --> nfs4_proc_create_session clp=000000006864147e session=000000002f930198
[57827.003219] nfs4_init_channel_attrs: Fore Channel : max_rqst_sz=1049620 max_resp_sz=1049480 max_ops=8 max_reqs=64
[57827.003223] nfs4_init_channel_attrs: Back Channel : max_rqst_sz=4096 max_resp_sz=4096 max_resp_sz_cached=0 max_ops=2 max_reqs=16
[57828.027210] nfs4_handle_reclaim_lease_error: handled error -10008 for server ourserver.ourdomain.org
[57828.027220] --> nfs4_proc_create_session clp=000000006864147e session=000000002f930198
[57828.027225] nfs4_init_channel_attrs: Fore Channel : max_rqst_sz=1049620 max_resp_sz=1049480 max_ops=8 max_reqs=64
[57828.027229] nfs4_init_channel_attrs: Back Channel : max_rqst_sz=4096 max_resp_sz=4096 max_resp_sz_cached=0 max_ops=2 max_reqs=16
[57829.051222] nfs4_handle_reclaim_lease_error: handled error -10008 for server ourserver.ourdomain.org
[57829.051237] --> nfs4_proc_create_session clp=000000006864147e session=000000002f930198
[57829.051243] nfs4_init_channel_attrs: Fore Channel : max_rqst_sz=1049620 max_resp_sz=1049480 max_ops=8 max_reqs=64
[57829.051247] nfs4_init_channel_attrs: Back Channel : max_rqst_sz=4096 max_resp_sz=4096 max_resp_sz_cached=0 max_ops=2 max_reqs=16
[57830.075229] nfs4_handle_reclaim_lease_error: handled error -10008 for server ourserver.ourdomain.org
[57830.075239] --> nfs4_proc_create_session clp=000000006864147e session=000000002f930198
[57830.075244] nfs4_init_channel_attrs: Fore Channel : max_rqst_sz=1049620 max_resp_sz=1049480 max_ops=8 max_reqs=64
[57830.075248] nfs4_init_channel_attrs: Back Channel : max_rqst_sz=4096 max_resp_sz=4096 max_resp_sz_cached=0 max_ops=2 max_reqs=16
[57831.099241] nfs4_handle_reclaim_lease_error: handled error -10008 for server ourserver.ourdomain.org
[57831.099255] --> nfs4_proc_create_session clp=000000006864147e session=000000002f930198
[57831.099262] nfs4_init_channel_attrs: Fore Channel : max_rqst_sz=1049620 max_resp_sz=1049480 max_ops=8 max_reqs=64
[57831.099266] nfs4_init_channel_attrs: Back Channel : max_rqst_sz=4096 max_resp_sz=4096 max_resp_sz_cached=0 max_ops=2 max_reqs=16
[57832.123258] nfs4_handle_reclaim_lease_error: handled error -10008 for server ourserver.ourdomain.org
[57832.123268] --> nfs4_proc_create_session clp=000000006864147e session=000000002f930198
[57832.123273] nfs4_init_channel_attrs: Fore Channel : max_rqst_sz=1049620 max_resp_sz=1049480 max_ops=8 max_reqs=64
[57832.123277] nfs4_init_channel_attrs: Back Channel : max_rqst_sz=4096 max_resp_sz=4096 max_resp_sz_cached=0 max_ops=2 max_reqs=16
[57833.147272] nfs4_handle_reclaim_lease_error: handled error -10008 for server ourserver.ourdomain.org
[57833.147286] --> nfs4_proc_create_session clp=000000006864147e session=000000002f930198
[57833.147293] nfs4_init_channel_attrs: Fore Channel : max_rqst_sz=1049620 max_resp_sz=1049480 max_ops=8 max_reqs=64
[57833.147297] nfs4_init_channel_attrs: Back Channel : max_rqst_sz=4096 max_resp_sz=4096 max_resp_sz_cached=0 max_ops=2 max_reqs=16

Again, error -10008

I did this again (enable rpcdebug, try the mount, check dmesg), which resulted in a somewhat different output.
The file system on oursecondserver is mounted via NFS v3; the one on ourserver via NFS v4.

[  765.396319] NFS: sending MNT request for oursecondserver.ourdomain.org:/export/path
[  765.398365] NFS: received 1 auth flavors
[  765.398382] NFS:   auth flavor[0]: 1
[  765.398401] NFS: MNT request succeeded
[  765.398402] NFS: attempting to use auth flavor 1
[  765.399694] NFS call  fsinfo
[  765.399897] NFS reply fsinfo: 0
[  765.399900] NFS call  pathconf
[  765.400096] NFS reply pathconf: 0
[  765.400100] NFS call  getattr
[  765.400315] NFS reply getattr: 0
[  765.400317] Server FSID: 65c3c268e5dab2ca:0
[  765.400583] do_proc_get_root: call  fsinfo
[  765.400793] do_proc_get_root: reply fsinfo: 0
[  765.400982] do_proc_get_root: reply getattr: 0
[  765.400996] NFS: nfs_fhget(0:42/16895408 fh_crc=0x478ae16e ct=1)
[  765.403083] NFS: revalidating (0:42/16895408)
[  765.403089] NFS call  getattr
[  765.403472] NFS reply getattr: 0
[  765.403476] NFS: nfs_update_inode(0:42/16895408 fh_crc=0x478ae16e ct=2 info=0x27e7f)
[  765.403489] NFS: (0:42/16895408) revalidation complete
[  765.403491] NFS: nfs_weak_revalidate: inode 16895408 is valid
[  765.403495] NFS call  access
[  765.403830] NFS: nfs_update_inode(0:42/16895408 fh_crc=0x478ae16e ct=2 info=0x27e7f)
[  765.403835] NFS reply access: 0
[  765.403837] NFS: permission(0:42/16895408), mask=0x24, res=0
[  765.403840] NFS: open dir(/)
[  765.403873] NFS: readdir(/) starting at cookie 0
[  765.403882] NFS call  readdirplus 0
[  765.404188] NFS: nfs_update_inode(0:42/16895408 fh_crc=0x478ae16e ct=2 info=0x27e7f)
[  765.404193] NFS reply readdirplus: 652
[  765.404203] NFS: nfs_fhget(0:42/33644731 fh_crc=0x97693b95 ct=1)
[  765.404206] NFS: dentry_delete(/bin, 20080c)
[  765.404210] NFS: nfs_fhget(0:42/51149943 fh_crc=0x93fa5bdb ct=1)
[  765.404212] NFS: dentry_delete(/src, 20080c)
[  765.404215] NFS: nfs_do_filldir() filling ended @ cookie 512
[  765.404217] NFS: readdir(/) returns 0
[  765.404225] NFS: permission(0:42/16895408), mask=0x81, res=0
[  765.404230] NFS: dentry_delete(/bin, 28080c)
[  765.404235] NFS: permission(0:42/16895408), mask=0x81, res=0
[  765.404240] NFS: dentry_delete(/bin, 28084c)
[  765.404244] NFS: permission(0:42/16895408), mask=0x81, res=0
[  765.404246] NFS call getacl
[  765.404498] NFS reply getacl: 0
[  765.404500] NFS: nfs_update_inode(0:42/33644731 fh_crc=0x97693b95 ct=1 info=0x27e7f)
[  765.404503] NFS: dentry_delete(/bin, 28084c)
[  765.404507] NFS: permission(0:42/16895408), mask=0x81, res=0
[  765.404510] NFS: dentry_delete(/bin, 28084c)
[  765.404521] NFS: permission(0:42/16895408), mask=0x81, res=0
[  765.404523] NFS: dentry_delete(/src, 28080c)
[  765.404527] NFS: permission(0:42/16895408), mask=0x81, res=0
[  765.404529] NFS call getacl
[  765.404758] NFS reply getacl: 0
[  765.404760] NFS: nfs_update_inode(0:42/51149943 fh_crc=0x93fa5bdb ct=1 info=0x27e7f)
[  765.404763] NFS: dentry_delete(/src, 28084c)
[  765.404767] NFS: permission(0:42/16895408), mask=0x81, res=0
[  765.404769] NFS: dentry_delete(/src, 28084c)
[  765.404773] NFS: readdir(/) starting at cookie 512
[  765.404783] NFS: readdir(/) returns 0
[  765.417529] Key type dns_resolver registered
[  765.611426] NFS: Registering the id_resolver key type
[  765.611437] Key type id_resolver registered
[  765.611439] Key type id_legacy registered
[  765.612209] --> nfs4_try_get_tree()
[  765.618213] nfs_callback_create_svc: service created
[  765.618219] NFS: create per-net callback data; net=f0000000
[  765.618285] nfs_callback_up: service started
[  765.618287] NFS: nfs4_discover_server_trunking: testing 'ourserver.ourdomain.org'
[  765.746186] --> nfs4_proc_create_session clp=00000000d4a0c6ba session=00000000e4767ae4
[  765.746194] nfs4_init_channel_attrs: Fore Channel : max_rqst_sz=1049620 max_resp_sz=1049480 max_ops=8 max_reqs=64
[  765.746195] nfs4_init_channel_attrs: Back Channel : max_rqst_sz=4096 max_resp_sz=4096 max_resp_sz_cached=0 max_ops=2 max_reqs=16
[  766.750060] nfs4_handle_reclaim_lease_error: handled error -10008 for server ourserver.ourdomain.org
[  766.750073] --> nfs4_proc_create_session clp=00000000d4a0c6ba session=00000000e4767ae4
[  766.750080] nfs4_init_channel_attrs: Fore Channel : max_rqst_sz=1049620 max_resp_sz=1049480 max_ops=8 max_reqs=64
[  766.750083] nfs4_init_channel_attrs: Back Channel : max_rqst_sz=4096 max_resp_sz=4096 max_resp_sz_cached=0 max_ops=2 max_reqs=16
[...the same three messages repeat about once per second...]
[  775.966164] nfs4_handle_reclaim_lease_error: handled error -10008 for server ourserver.ourdomain.org
[  775.966178] --> nfs4_proc_create_session clp=00000000d4a0c6ba session=00000000e4767ae4
[  775.966184] nfs4_init_channel_attrs: Fore Channel : max_rqst_sz=1049620 max_resp_sz=1049480 max_ops=8 max_reqs=64
[  775.966188] nfs4_init_channel_attrs: Back Channel : max_rqst_sz=4096 max_resp_sz=4096 max_resp_sz_cached=0 max_ops=2 max_reqs=16

Okay, so the create_session requests are returning NFS4ERR_DELAY (error -10008). This usually indicates that the server cannot accept new sessions, potentially due to a lack of resources (usually memory) on the server. This is specific to NFS v4, because it is a stateful protocol that requires the server to track state for each client; NFS v3, in contrast, is stateless. That difference could explain why NFS v3 works while NFS v4 does not.
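If the server really is resource-starved, that should be visible while the client sits in the retry loop. A rough sketch of what to look at on the server (standard EL9 paths; /proc/slabinfo needs root and the nfsd entries only exist while the module is loaded):

```shell
# On the NFS server, while the client is stuck in the retry loop:
free -h                                                  # overall memory/swap pressure
grep -iE 'nfsd|rpc' /proc/slabinfo 2>/dev/null | head    # NFS-related slab usage (needs root)
# Established nfsd connections (nfsd listens on TCP 2049):
ss -tn state established '( sport = :2049 )' 2>/dev/null || true
```

If `free` shows the box deep into swap while these mounts are retried, that alone would explain the DELAY responses.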

You mentioned:

It might have to do with a move to NFSv4 mounts that I recently introduced.

The increased resource requirements on NFS server when using NFS v4 could explain the situation. What are the NFS server specs, resource-wise?

At any rate, troubleshooting should now focus on the NFS server to determine why it is sending the delay response. Do you see any NFS- or RPC-related messages in the journal/dmesg on the NFS server side when this is happening?
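One way to get more detail on the server side is to temporarily raise the kernel's NFS server debug level with rpcdebug (part of nfs-utils; it is very verbose, so remember to switch it off again). A sketch:

```shell
# Enable nfsd debug messages, reproduce the mount from the client,
# then read the kernel log and turn debugging off again (needs root):
rpcdebug -m nfsd -s proc || true
dmesg | grep -iE 'nfs|rpc' | tail -n 50
rpcdebug -m nfsd -c all 2>/dev/null || true
```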

EDIT: Can you also check that contents of /etc/machine-id is unique and different between all NFS clients?
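For reference, a quick way to compare them across clients (the hostnames below are placeholders for your actual clients; any output from the pipeline means duplicate IDs):

```shell
# Collect /etc/machine-id from each client and flag duplicates.
# "clientA clientB" are placeholder hostnames - substitute your own.
for h in clientA clientB; do
    printf '%s ' "$h"
    ssh -o BatchMode=yes -o ConnectTimeout=3 "$h" cat /etc/machine-id 2>/dev/null \
        || echo unreachable
done | awk '{print $2}' | sort | uniq -d   # prints nothing if all IDs differ
```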

Not only did I revert to NFS v3, but I also increased the RAM on the NFS server (last Saturday) from initially 32 GB to now 128 GB. That made a huge difference: with 32 GB I saw a lot of swapping, which I don’t see anymore. It was particularly problematic that those ~60 GB of data are saved in nightly backups, which regularly ran until the afternoon. The server has 2 AMD EPYC 7252 8-core processors.

This is a production system, so I cannot revert the server to the state where NFS v4 was still used. Unfortunately there is no persistent journal.

[root@ourserver log]# journalctl -k -b -1
Specifying boot ID or boot offset has no effect, no persistent journal was found.

They are all different. I don’t use images for installation.

That made a huge difference. With 32GB I saw a lot of swapping, which I don’t see anymore. It was particularly problematic that those ~60 GB of data are saved in nightly backups, which regularly ran until afternoon.

If you saw a lot of swapping with 32GB, then NFSv4 may have been hitting the concurrent session limits due to insufficient memory.

The server has 2 AMD EPYC 7252 8-Core processors.

Could be relevant as well. Another Rocky user complained of high memory usage on idle workloads, and apparently traced it to use of AMD EPYC. See Rocky Linux 8 6 GB memory in noncache used by kernel - #7 by fkaluza

Yes, I also stumbled over slabtop, but didn’t really understand the output.

What surprised me was that even with 64 GB of RAM I saw swapping, because most of the RAM was used by some kind of caching (cache/slab_cache). This is what munin reports:

munin evaluates /proc/meminfo, /proc/slabinfo, /sys/kernel/mm/ksm/run and /sys/kernel/mm/ksm/pages_sharing.

With all of that caching, the 128 GB are now almost used up again.

You could try installing kernel-lt or kernel-ml from the ELRepo repository and see if that makes a difference. I’ve seen similar issues, for example with Ryzen, where the 6.x kernels that kernel-lt and kernel-ml provide helped.
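A sketch of how that install looks on EL9 (the release-RPM URL is ELRepo’s documented one, but verify it against elrepo.org before running this on a production server; requires root and network access):

```shell
# Add the ELRepo repository and install the mainline kernel,
# then reboot into it. Guarded so it is a no-op on non-EL systems.
if command -v dnf >/dev/null 2>&1; then
    rpm --import https://www.elrepo.org/RPM-GPG-KEY-elrepo.org || true
    dnf install -y https://www.elrepo.org/elrepo-release-9.el9.elrepo.noarch.rpm || true
    dnf --enablerepo=elrepo-kernel install -y kernel-ml || true
else
    echo "dnf not available here - run this on the EL9 server itself"
fi
```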

Check sysctl vm.swappiness
The Dynamic System Tuning Daemon (tuned.service), if running, may adjust that (at least with some profiles).
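A quick way to check both at once (tuned-adm only exists where tuned is installed):

```shell
# Current swappiness and the active tuned profile, if any:
sysctl vm.swappiness 2>/dev/null || cat /proc/sys/vm/swappiness
tuned-adm active 2>/dev/null || echo "tuned not installed or not running"
```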

I was also unable to mount from a CentOS Stream client; the mount hung. All Rocky clients work. My solution: with the mount option vers=4.1, the CentOS Stream client can connect.
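For anyone else hitting this, pinning the NFS version looks like this (server:/homea and the mount point are taken from the original post; run as root on the client):

```shell
# One-off mount with the NFS version pinned to 4.1
# (fails harmlessly here if the server is unreachable):
mount -t nfs -o vers=4.1,rw server:/homea /mounts/homea || true
# Or persistently via /etc/fstab:
# server:/homea  /mounts/homea  nfs  vers=4.1,rw  0 0
```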

Check sysctl vm.swappiness
The Dynamic System Tuning Daemon (tuned.service), if running, may adjust that (at least with some profiles).

Swappiness on my server is 60 (the default). Luckily, the server no longer uses swap space after the memory increase, so I’ll leave swappiness unchanged for the time being.