I checked the 15 clients I set up with Oracle Linux 8 and Rocky Linux 8 as fresh installations over the last 2 years, and they all have unique ID hashes. These are currently 100% of the affected NFS clients.
Benjamin Coddington claims a very similar problem was fixed on 2024-03-04 with a patch titled “NFSv4: fairly test all delegations on a SEQ4_ revocation”. This patch found its way into Oracle Linux 9 kernel-5.14.0-427.13.1.el9_4 via RHEL-7976 and into Oracle Linux 8 kernel-4.18.0-552.3.1.el8_10 via RHEL-34912. So in Rocky Linux it should be in kernel-4.18.0-553.el8_10.x86_64.rpm from 24-May-2024, but I don’t know where to find confirmation.
The problem occurred yesterday on an Oracle Linux 8 NFS client with kernel 4.18.0-553.22.1.el8_10.x86_64 and today on a Rocky Linux 8 client with kernel 4.18.0-553.16.1.el8_10.x86_64.
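Both affected kernels look newer than the build that is supposed to carry the fix. A rough way to check that ordering is sketched below. Note that `kernel_key` and `is_newer` are names made up for this sketch, and this is a simplified numeric comparison, not rpm's real version comparison algorithm, so treat it only as a sanity check.

```python
import re


def kernel_key(version):
    """Split a kernel version such as '4.18.0-553.22.1.el8_10.x86_64'
    into a sortable pair: ((4, 18, 0), (553, 22, 1)).

    Only the leading numeric parts of the version and release are used;
    the '.el8_10' and architecture suffixes are ignored.
    """
    match = re.match(r"(\d+(?:\.\d+)*)-(\d+(?:\.\d+)*)", version)
    if match is None:
        raise ValueError(f"unrecognised kernel version: {version!r}")
    base, release = match.groups()
    return (
        tuple(int(part) for part in base.split(".")),
        tuple(int(part) for part in release.split(".")),
    )


def is_newer(candidate, reference):
    """Return True if candidate sorts strictly after reference."""
    return kernel_key(candidate) > kernel_key(reference)


if __name__ == "__main__":
    fixed_in = "4.18.0-552.3.1.el8_10"
    for affected in ("4.18.0-553.22.1.el8_10.x86_64",
                     "4.18.0-553.16.1.el8_10.x86_64"):
        print(affected, "newer than", fixed_in, "->", is_newer(affected, fixed_in))
```

Tuples compare element by element, so a release of `553` sorts before `553.16.1`, which matches how the later errata builds are numbered.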
After reading the bug report more carefully, next time I will look at the test_stateid value in the output of nfsstat -c -4 -n.
Crude attempt to check if commit is in Rocky X.y
Can’t make sense of it, but one of the changes is
a/include/linux/nfs_fs_sb.h
I look on Rocky 9.4, and I see (line 237)
/usr/src/kernels/5.14.0-427.16.1.el9_4.x86_64/include/linux/nfs_fs_sb.h
struct list_head ss_copies;
unsigned long delegation_gen;
unsigned long mig_gen;
I look in Rocky 8.9 and I see
struct list_head ss_copies;
unsigned long mig_gen;
The delegation thing is missing??
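The same crude check can be scripted. The sketch below only tests whether a header declares a given field; `header_declares` is a hypothetical helper name, and the presence of `delegation_gen` is just an indicator taken from the diff above, not proof that the whole patch was backported.

```python
import re


def header_declares(header_text, field):
    """Return True if the header text contains a declaration ending in
    '<field>;', for example 'unsigned long delegation_gen;'."""
    pattern = r"\b" + re.escape(field) + r"\s*;"
    return re.search(pattern, header_text) is not None


if __name__ == "__main__":
    import sys

    # Usage: python check_header.py /usr/src/kernels/<version>/include/linux/nfs_fs_sb.h
    with open(sys.argv[1], encoding="utf-8", errors="replace") as handle:
        text = handle.read()
    print("delegation_gen present:", header_declares(text, "delegation_gen"))
```

The header path under /usr/src/kernels only exists when the matching kernel-devel package is installed.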
I finally found the Rocky Linux kernel sources, including the ones for old versions, e.g. 8.9.
To see if the patch is applied you have to:
- download the kernel source rpm, e.g. kernel-4.18.0-553.22.1.el8_10.src.rpm
- extract the tar.gz kernel file from the rpm, e.g. linux-4.18.0-553.22.1.el8_10.tar.gz
- extract the 3 patched files from the tar.gz archive:
- /fs/nfs/delegation.c
- /fs/nfs/delegation.h
- /include/linux/nfs_fs_sb.h
- compare the lines of the patch with those in the 3 files.
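The last two steps can be roughly automated. This is a minimal sketch, assuming you have already pulled the kernel tarball out of the source rpm and saved the upstream patch as a unified diff; the function names are invented for the sketch. It only checks that every line the patch adds appears somewhere in the file, so short lines such as a lone brace match trivially, and a backport that rewords the change would be reported as missing.

```python
import tarfile


def normalise(line):
    """Collapse runs of whitespace so tabs vs. spaces do not matter."""
    return " ".join(line.split())


def added_lines(patch_text):
    """Return the non-empty lines a unified diff adds, without the '+'."""
    result = []
    for line in patch_text.splitlines():
        if line.startswith("+") and not line.startswith("+++"):
            cleaned = normalise(line[1:])
            if cleaned:
                result.append(cleaned)
    return result


def missing_from_source(patch_text, source_text):
    """Return the added patch lines that do not occur in the source file."""
    present = {normalise(line) for line in source_text.splitlines()}
    return [line for line in added_lines(patch_text) if line not in present]


def read_member(tarball_path, suffix):
    """Return the text of the first regular file in the tarball whose
    name ends with suffix, or None if there is no such file."""
    with tarfile.open(tarball_path, "r:*") as tar:
        for member in tar:
            if member.isfile() and member.name.endswith(suffix):
                data = tar.extractfile(member).read()
                return data.decode("utf-8", errors="replace")
    return None
```

For example, `read_member(tarball, "include/linux/nfs_fs_sb.h")` gives the header text, and an empty list from `missing_from_source(patch_for_that_file, header_text)` suggests the hunk is there. The patch text has to be split per file first, since the upstream patch touches three files.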
The result is: the patch is not in kernel-4.18.0-553.el8_10, but in all kernels afterwards. So the clients should not be affected at all, but they are. Maybe the patch did not fix all of the problem.
A quick check with nfsstat -c -4 -n -l | grep test_stateid on the NFS clients that are used as servers showed 20 clients with 0-10, 2 with ~70, 2 with ~5,000, one each with ~15,000, ~60,000 and ~170,000, and 1 with ~300,000,000. So far people have only complained about the last one being unresponsive.
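To collect that counter from many clients, the grep can be replaced by a small parser. This is only a sketch: `parse_test_stateid` is a made-up name, and it assumes the list output of nfsstat prints the operation name followed by a colon and the count on one line, which you should verify against the output on your own clients.

```python
import re


def parse_test_stateid(nfsstat_output):
    """Return the test_stateid call count found in nfsstat list output,
    or None if no such line is present.

    Assumed line layout (not verified here):
        nfs v4 client   test_stateid:      1234
    """
    for line in nfsstat_output.splitlines():
        match = re.search(r"\btest_stateid:?\s+(\d+)", line)
        if match:
            return int(match.group(1))
    return None
```

The input would be the captured text of nfsstat -c -4 -n -l from each client, for example gathered over ssh and passed to the function as a string.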
So I think you’re saying you have a later version of the 8.10 kernel on your clients, but they are still affected? Was there anything in the later 8.10 kernel changelog saying this patch had been included?
I can’t see the server details in the original post, is it also running latest 8.10?