NFS clients unresponsive, nfsiostat hangs

I checked the 15 clients that I set up as fresh installations of Oracle Linux 8 and Rocky Linux 8 over the last 2 years, and they all have unique id hashes. These currently make up 100% of the affected NFS clients.

Benjamin Coddington claims a very similar problem was fixed on 2024-03-04 with a patch titled “NFSv4: fairly test all delegations on a SEQ4_ revocation”. This patch found its way to Oracle Linux 9 kernel-5.14.0-427.13.1.el9_4 via RHEL-7976 and to Oracle Linux 8 kernel-4.18.0-552.3.1.el8_10 via RHEL-34912. So in Rocky Linux it should be in kernel-4.18.0-553.el8_10.x86_64.rpm from 24-May-2024, but I don’t know where to find confirmation.

Yesterday the problem occurred on an Oracle Linux 8 NFS client with kernel 4.18.0-553.22.1.el8_10.x86_64, and today on a Rocky Linux 8 client with kernel 4.18.0-553.16.1.el8_10.x86_64.

After reading the bug report more carefully, next time I will look at the test_stateid value in the output of nfsstat -c -4 -n.

Crude attempt to check if commit is in Rocky X.y

https://patchwork.kernel.org/project/linux-nfs/patch/20231019155922.6549-1-bcodding@redhat.com/#25562075

Can’t make sense of it, but one of the changes is

a/include/linux/nfs_fs_sb.h

I look on Rocky 9.4, and I see (line 237)

/usr/src/kernels/5.14.0-427.16.1.el9_4.x86_64/include/linux/nfs_fs_sb.h

struct list_head    ss_copies;
unsigned long       delegation_gen;
unsigned long       mig_gen;

I look in Rocky 8.9 and I see

struct list_head        ss_copies;
unsigned long           mig_gen;

The delegation thing is missing??

I finally found the Rocky Linux kernel sources, including the ones for old versions, e.g. 8.9.
To see if the patch is applied you have to:

  • download the kernel source rpm, e.g. kernel-4.18.0-553.22.1.el8_10.src.rpm
  • extract the tar.gz kernel file from the rpm, e.g. linux-4.18.0-553.22.1.el8_10.tar.gz
  • extract the 3 patched files from the tar.gz archive:
    • /fs/nfs/delegation.c
    • /fs/nfs/delegation.h
    • /include/linux/nfs_fs_sb.h
  • compare the lines of the patch with those in the 3 files.
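
The last step can be scripted. Below is a minimal Python sketch, assuming the three files have already been extracted and read into strings. The function names are my own, and the example hunk is only modelled on the nfs_fs_sb.h excerpts quoted earlier in this thread; it is not copied from the real patch. It only checks that every line the patch adds appears somewhere in the matching file, ignoring whitespace differences, so it is a rough indicator rather than proof.

```python
from collections import defaultdict


def _norm(text):
    """Collapse runs of whitespace so tab/space differences do not matter."""
    return " ".join(text.split())


def added_lines_by_file(patch_text):
    """Map each patched file path to the non-blank lines the patch adds."""
    added = defaultdict(list)
    current = None
    for line in patch_text.splitlines():
        if line.startswith("+++ "):
            # "+++ b/fs/nfs/delegation.c" -> "fs/nfs/delegation.c"
            path = line[4:].split("\t")[0].strip()
            current = path[2:] if path.startswith("b/") else path
        elif line.startswith("+") and current is not None:
            text = _norm(line[1:])
            if text:
                added[current].append(text)
    return dict(added)


def missing_lines(patch_text, sources):
    """Return {path: [added lines that are not found in sources[path]]}."""
    report = {}
    for path, lines in added_lines_by_file(patch_text).items():
        present = {_norm(l) for l in sources.get(path, "").splitlines()}
        absent = [l for l in lines if l not in present]
        if absent:
            report[path] = absent
    return report


# Illustrative hunk only (hunk header numbers are made up).
PATCH = """\
--- a/include/linux/nfs_fs_sb.h
+++ b/include/linux/nfs_fs_sb.h
@@ -1,2 +1,3 @@
 \tstruct list_head\tss_copies;
+\tunsigned long\t\tdelegation_gen;
 \tunsigned long\t\tmig_gen;
"""

HEADER = "include/linux/nfs_fs_sb.h"
PATCHED = {HEADER: "struct list_head    ss_copies;\n"
                   "unsigned long       delegation_gen;\n"
                   "unsigned long       mig_gen;\n"}
UNPATCHED = {HEADER: "struct list_head        ss_copies;\n"
                     "unsigned long           mig_gen;\n"}

print(missing_lines(PATCH, PATCHED))
print(missing_lines(PATCH, UNPATCHED))
```

An empty result suggests the patch is present. Any listed line should be checked by hand, because a backport may reword a line slightly.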

The result is: the patch is not in kernel-4.18.0-553.el8_10, but it is in all later kernels. So the clients should not be affected at all, but they are. Maybe the patch did not fix all of the problem.
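
To check a whole list of clients against that result, the kernel strings can be compared numerically. This is a rough Python sketch with function names of my own. It assumes both kernels come from the same el8_10 stream and it is not a full RPM version comparison.

```python
import re


def release_key(kernel):
    """Turn '4.18.0-553.22.1.el8_10.x86_64' into ((4, 18, 0), (553, 22, 1))."""
    m = re.match(r"(?:kernel-)?(\d+(?:\.\d+)*)-(\d+(?:\.\d+)*)\.el", kernel)
    if m is None:
        raise ValueError(f"unrecognised kernel string: {kernel!r}")
    version, release = m.groups()
    return (tuple(int(x) for x in version.split(".")),
            tuple(int(x) for x in release.split(".")))


def is_later_than(kernel, baseline):
    """True if `kernel` sorts strictly after `baseline`."""
    return release_key(kernel) > release_key(baseline)


# Per the check above: this kernel lacks the patch, all later ones have it.
BASELINE = "kernel-4.18.0-553.el8_10"

for k in ("4.18.0-553.22.1.el8_10.x86_64", "4.18.0-553.16.1.el8_10.x86_64"):
    print(k, is_later_than(k, BASELINE))
```

Both kernels from the affected clients sort after the baseline, which matches the conclusion that they should already contain the patch.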

A quick check with nfsstat -c -4 -n -l | grep test_stateid on the NFS clients that are used as servers showed 20 clients with values of 0-10, 2 with ~70, 2 with ~5,000, one each with ~15,000, ~60,000 and ~170,000, and 1 with ~300,000,000. So far people have only complained about the last one being unresponsive.
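
For repeating this survey, here is a small Python sketch that pulls the test_stateid counter out of the list-style output and ranks hosts by it. It assumes the -l output has a line ending in "test_stateid:" followed by the count. The sample text, its numbers and the host names are made up for illustration, and the function names are my own.

```python
import re
import subprocess

_COUNTER = re.compile(r"\btest_stateid:?\s+(\d+)")


def parse_test_stateid(nfsstat_output):
    """Return the test_stateid count from nfsstat text, or None if absent."""
    m = _COUNTER.search(nfsstat_output)
    return int(m.group(1)) if m else None


def read_local():
    """Run the command from this thread on the local machine."""
    result = subprocess.run(["nfsstat", "-c", "-4", "-n", "-l"],
                            capture_output=True, text=True, check=True)
    return parse_test_stateid(result.stdout)


def rank_clients(outputs):
    """outputs: {hostname: nfsstat text}. Return (host, count), highest first."""
    counts = ((h, parse_test_stateid(t)) for h, t in outputs.items())
    return sorted(((h, c) for h, c in counts if c is not None),
                  key=lambda hc: hc[1], reverse=True)


# Made-up sample in the assumed list format.
SAMPLE = """\
nfs v4 client        total:   812345
------------- ------------- --------
nfs v4 client         read:    40211
nfs v4 client test_stateid:     5012
"""

print(parse_test_stateid(SAMPLE))
```

Sorting the clients this way makes the outliers stand out before users start complaining.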

So I think you’re saying you have a later 8.10 kernel on your clients, but they are still affected? Was there anything in the later 8.10 kernel changelog saying this patch had been included?

I can’t see the server details in the original post. Is it also running the latest 8.10?
