NFS clients unresponsive, nfsiostat hangs

I checked the 15 clients that I set up with fresh installations of Oracle Linux 8 and Rocky Linux 8 over the last 2 years, and they all have unique ID hashes. These currently make up 100% of the affected NFS clients.

Benjamin Coddington claims a very similar problem was fixed on 2024-03-04 with a patch titled “NFSv4: fairly test all delegations on a SEQ4_ revocation”. This patch found its way to Oracle Linux 9 kernel-5.14.0-427.13.1.el9_4 via RHEL-7976 and to Oracle Linux 8 kernel-4.18.0-552.3.1.el8_10 via RHEL-34912. So in Rocky Linux it should be in kernel-4.18.0-553.el8_10.x86_64.rpm from 24-May-2024, but I don’t know where to find confirmation.

The problem occurred yesterday on an Oracle Linux 8 NFS client with kernel 4.18.0-553.22.1.el8_10.x86_64 and today on a Rocky Linux 8 client with kernel 4.18.0-553.16.1.el8_10.x86_64.

After reading the bug report more carefully, next time I will look at the test_stateid value in the output of nfsstat -c -4 -n.

Crude attempt to check if commit is in Rocky X.y

https://patchwork.kernel.org/project/linux-nfs/patch/20231019155922.6549-1-bcodding@redhat.com/#25562075

Can’t make sense of it, but one of the changes is

a/include/linux/nfs_fs_sb.h

I look on Rocky 9.4, and I see (line 237)

/usr/src/kernels/5.14.0-427.16.1.el9_4.x86_64/include/linux/nfs_fs_sb.h

struct list_head    ss_copies;
unsigned long       delegation_gen;
unsigned long       mig_gen;

I look in Rocky 8.9 and I see

struct list_head        ss_copies;
unsigned long           mig_gen;

The delegation thing is missing??
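The header comparison above can be scripted as a rough check. This is only a sketch: has_delegation_gen is a made-up helper name, and it assumes that the delegation_gen field seen in the 9.4 header is a reliable marker for the patch, which this thread has not confirmed.

```shell
# Hypothetical helper: report whether a kernel header mentions delegation_gen.
# Argument: path to nfs_fs_sb.h, e.g. from an installed kernel-devel package.
has_delegation_gen() {
    if grep -q 'delegation_gen' "$1"; then
        echo "present"
    else
        echo "missing"
    fi
}

# Example, assuming kernel-devel for the running kernel is installed:
# has_delegation_gen "/usr/src/kernels/$(uname -r)/include/linux/nfs_fs_sb.h"
```

Note that a header path that does not exist also prints "missing" (after a grep error message), so check the path first.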

I finally found the Rocky Linux kernel sources, including the ones for old versions, e.g. 8.9.
To see if the patch is applied you have to:

  • download the kernel source rpm, e.g. kernel-4.18.0-553.22.1.el8_10.src.rpm
  • extract the tar.gz kernel file from the rpm, e.g. linux-4.18.0-553.22.1.el8_10.tar.gz
  • extract the 3 patched files from the tar.gz archive:
    • /fs/nfs/delegation.c
    • /fs/nfs/delegation.h
    • /include/linux/nfs_fs_sb.h
  • compare the lines of the patch with those in the 3 files.
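The steps above can be sketched in shell roughly as follows. extract_patched_files is a hypothetical helper, not something from the thread, and it assumes GNU tar (for --wildcards) and a kernel tarball with a single top-level directory. The rpm2cpio step is shown only as a comment and uses the file name from the list above.

```shell
# Step 1-2 (not run here): unpack the source rpm into the current directory.
# rpm2cpio kernel-4.18.0-553.22.1.el8_10.src.rpm | cpio -idm

# Hypothetical helper: pull only the three patched files out of the kernel
# tarball. $1 = kernel tarball, $2 = destination directory.
extract_patched_files() {
    tarball=$1
    dest=$2
    mkdir -p "$dest" &&
    tar -xf "$tarball" -C "$dest" --strip-components=1 --wildcards \
        '*/fs/nfs/delegation.c' \
        '*/fs/nfs/delegation.h' \
        '*/include/linux/nfs_fs_sb.h'
}

# Example:
# extract_patched_files linux-4.18.0-553.22.1.el8_10.tar.gz ./nfs-check
# grep -n delegation_gen ./nfs-check/include/linux/nfs_fs_sb.h
```

The final comparison against the patch still has to be done by eye, or with grep for lines the patch adds, as in the commented example.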

The result is: the patch is not in kernel-4.18.0-553.el8_10, but it is in all later kernels. So the clients should not be affected at all, but they are. Maybe the patch did not fix all of the problem.

A quick search on the NFS clients which are used as servers, with nfsstat -c -4 -n -l | grep test_stateid, resulted in 20 clients showing 0-10, 2 with ~70, 2 with ~5,000, 1 each with ~15,000, ~60,000 and ~170,000, and 1 with ~300,000,000. So far people have only complained about the last one being unresponsive.
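To collect just the number instead of the whole grep line, something like the following might work. It is a sketch under an assumption: that nfsstat -l prints one counter per line with the count as the last field (e.g. "nfs v4 client  test_stateid:  0"). test_stateid_count is a hypothetical name.

```shell
# Hypothetical helper: read `nfsstat -c -4 -n -l` output on stdin and print
# only the test_stateid counter (assumed to be the last field of its line).
test_stateid_count() {
    awk '/test_stateid:/ { print $NF; exit }'
}

# Example:
# nfsstat -c -4 -n -l | test_stateid_count
```

If the output format differs on your version of nfs-utils, the awk pattern will need adjusting.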

So I think you’re saying you have a later version of kernel 8.10 on your clients, but they are still affected? Was there anything in the later 8.10 kernel changelog saying this patch had been included?

I can’t see the server details in the original post, is it also running latest 8.10?