NFS clients unresponsive, nfsiostat hangs

Setup: ~50 NFS clients connected to an NFS server; user login and automount are configured via freeipa with -rw,soft,acl,intr,fstype=nfs,vers=4.2.
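For reference, a hypothetical freeipa automount entry carrying these options might look like the following (the location "default", map name "auto.home" and the server path are placeholders, not taken from this setup):

# hypothetical example only - location, map name and export path are assumptions
ipa automountkey-add default auto.home --key='*' \
    --info='-rw,soft,acl,intr,fstype=nfs,vers=4.2 nfsserver:/export/home/&'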

The clients worked flawlessly for several months, then 2 weeks ago one client became incredibly slow when accessing NFS shares. Not all users on that machine were affected every time. In the end we tried dnf update -y && reboot and the problem seemed to go away. But 2 days later it came back on 3 other clients, and two days after these were updated and rebooted it came back on the first client.
The ~46 other clients don’t have any symptoms. The first (and currently affected) client shows no badcalls in nfsstat -c, but nfsiostat hangs (for hours), just like any service which relies on the NFS shares. The syslog is full of messages from services timing out because of the blocking NFS, but I don’t see any messages about hardware failures. On all other clients nfsstat -c and nfsiostat do work, and there does not seem to be anything suspicious on the server in the output of nfsstat -s. The affected clients seem to be the ones running with higher CPU load; their NFS 4.2 usage is only average.
Until spring, the setup worked without any problems for some years with CentOS 7, NFS 3 and NIS, and before that for some years with CentOS 6, NFS 3 and NIS.

After 7.5 hours, nfsiostat finally returned to the command line prompt:

20:41:46 root@client1 # nfsiostat
^CFailed to import the site module
Traceback (most recent call last):
  File "/usr/lib64/python3.6/site.py", line 564, in <module>
    main()
  File "/usr/lib64/python3.6/site.py", line 551, in main
    known_paths = addsitepackages(known_paths)
  File "/usr/lib64/python3.6/site.py", line 339, in addsitepackages
    addsitedir(sitedir, known_paths)
  File "/usr/lib64/python3.6/site.py", line 207, in addsitedir
    addpackage(sitedir, name, known_paths)
  File "/usr/lib64/python3.6/site.py", line 159, in addpackage
    f = open(fullname, "r")
  File "<frozen importlib._bootstrap>", line 966, in _find_and_load
KeyboardInterrupt
04:17:02 root@client1 #

But did it actually “return”? It looks like someone pressed CTRL+C.

I pressed CTRL+C after approximately half an hour, but the prompt took 7 more hours to appear. It looks like those 7 hours were spent processing the CTRL+C.

Yesterday logins (to /home on the NFS share) were handled by the client again, but slowly. Running commands like top still took half an hour.

I tried echo 3 > /proc/sys/vm/drop_caches, which took 30 minutes to 1 hour, and when it finished the client was back to normal, all symptoms gone! I am waiting for the next client to develop problems to verify the fix and to test whether clearing only the pagecache (echo 1 > ...) or only the slab objects (echo 2 > ...) helps on its own. My money is on the second, since this feels like an inode problem. Anyone got an idea about the reason?
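For the next occurrence, a minimal test sequence might look like this (trying echo 2 on its own before falling back to echo 3 is my suggested order, not something already verified):

sync                                              # flush dirty pages first
time sh -c 'echo 2 > /proc/sys/vm/drop_caches'    # slab objects only (dentries/inodes)
free -hm                                          # did buff/cache shrink, are symptoms gone?
# only if still slow:
time sh -c 'echo 3 > /proc/sys/vm/drop_caches'    # pagecache + dentries + inodes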

Looks like a hang in the kernel’s NFS client. This usually happens if the client can’t reach the server, but the drop_caches effect is strange (it could be a coincidence, e.g. the network got unblocked after 30 minutes).

Please provide the following from the affected NFS client hosts:

cat /etc/rocky-release
uname -a

In case of NFS client hangs there are usually signs of trouble in the kernel ring buffer, so let’s try that first. When you notice problems again, log in as root on the affected client and:

  1. run dmesg. The output should show a bunch of errors from the kernel (“server did not respond” or similar).
  2. check for blocked kernel tasks:
    echo w > /proc/sysrq-trigger
    dmesg
    

Send outputs of both dmesg commands.
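For example (the file names are arbitrary), the capture could look like:

dmesg -T > /root/dmesg_before.txt
echo w > /proc/sysrq-trigger      # dumps blocked (uninterruptible) tasks into the ring buffer
dmesg -T > /root/dmesg_after.txt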

I found nothing useful in hours of searching, and neither did 3 other people.

Thanks, I will do that.

RH closed it as “wontfix”.
This is old, but looks important.
No idea if it’s still an issue.
I could see this causing all kinds of problems if set incorrectly.
https://bugzilla.redhat.com/show_bug.cgi?id=1582186

Good find, but:

Steps to Reproduce:

two nfs clients, with the same IP address, each behind a system/device providing NAT
on both clients, mount with nfs v4 from the same nfs server

Our clients are directly connected to the server and have different names and IP addresses.
Although some of them might have been cloned for faster installation, they have all been rebooted after being renamed and given different IP addresses, so the client string for EXCHANGE_ID should be different.

I would guess this might have been patched in RHEL 8, but the Red Hat bug for RHEL 8 is not accessible to us, so I can’t say.

RHEL8 bug Bug Access Denied

Even if it’s not this, it’s worth checking the uniqueness of all clients. The original post sounds like what you would expect if the server thinks it needs to “lock” a client.

Does it make a difference if the clients are accessing different files, and then they try to access the same file at the same time?

Hi @DM-Admin,

In the context of NFS4+, can you test dropping the intr option and changing soft to hard, to see if this makes any difference?

In the context of NFS4+, can you test dropping the intr option and changing soft to hard, to see if this makes any difference?

The intr option was already removed earlier this week. Changing to hard would be possible, but as long as we cannot reproduce the error I would not like to run the ~50 clients with it for several days or months, possibly causing more severe problems.

Even if it’s not this, it’s worth checking the uniqueness of all clients. The original post sounds like what you would expect if the server thinks it needs to “lock” a client.

What else besides the hostname and IP address has to be checked for uniqueness?

Meanwhile we got a bit closer and can divide the NFS symptoms in 3 different problems:

  1. Very slow access to the NFS mounts, which is what this thread is about. Probably due to some issue with the slab objects cache.
  2. autofs suddenly stops mounting the automounts from freeipa. This has happened 3 times in the last week for unknown reasons, mostly in the middle of the night and always on all Linux 8 clients.
    The CentOS 7 and Oracle Linux 7 clients (9 clients using NIS for autofs) were not affected, nor were the Oracle and Rocky Linux 9 clients (10 clients using freeipa for autofs).
    Everything that was mounted at the time of the autofs “crash” stayed mounted, and the problem could be resolved (until the next “crash”) with: systemctl restart sssd && systemctl restart autofs.
  3. On two newly installed Rocky Linux 8 clients, logging in as a freeipa user via ssh would not succeed. At debug3: send packet: type 50 it would simply disconnect instead of continuing with: debug3: receive packet: type 52 Authenticated to client1 ([192.168.1.1]:22) using "publickey".
    Setting UsePAM no in /etc/ssh/sshd_config “patched” the issue. There must be some PAM setting which had a different default in former installations, which I will have to find when I have more time (see the comparison sketch below).
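One thing that might help with problem 3 when I have time (hostnames are placeholders): compare the active authselect profile and features between an older, working client and one of the new installs.

diff <(ssh old-client 'authselect current') <(ssh new-client 'authselect current')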

Back to the original “Very slow access to the nfs mounts” problem of the thread.
On affected clients perf top shows ~50% utilization in nfs_server_return_marked_delegations when accessing something on the NFS. After echo 3 > /proc/sys/vm/drop_caches, the slow-to-inaccessible NFS symptom and nfs_server_return_marked_delegations both vanish. echo 1 > ... is not sufficient. When trying echo 2 > ... I made a mistake, so I cannot confirm 100% whether it is enough, but I am now 99% sure it is.
There is a Fedora 39 bug report that looks a bit similar, claiming a kernel problem has been fixed upstream. I would guess the fix should also be in the 4.18.0-553.16.1.el8_10.x86_64 kernel we use, but I am not sure.
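A rough way to watch delegation activity from the client side while the slowdown is active (the exact nfsstat column layout varies, so this is only a crude grep) would be to watch the NFSv4 delegreturn counter:

watch -d -n 5 "nfsstat -c -4 | grep -i -A 1 deleg"    # highlights the changing delegreturn counts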

Side observation
When the problem occurred:

17:30:00 root@client1 # free -hm
              total        used        free      shared  buff/cache   available
Mem:          503Gi        19Gi        45Gi        17Mi       437Gi       480Gi
Swap:         127Gi       1.0Mi       127Gi

Some hours after echo 3 > /proc/sys/vm/drop_caches

22:43:26 root@client1 # free -hm
              total        used        free      shared  buff/cache   available
Mem:          503Gi        10Gi       466Gi        16Mi        25Gi       488Gi
Swap:         127Gi       1.0Mi       127Gi

I already tried setting /proc/sys/vm/vfs_cache_pressure to values up to 10000, but this did not seem to have any effect.

This would help narrow down the SSH authentication failure:

ssh -vvv user@hostname

For inspecting the NFS access slowdown:

  • slabtop
  • cat /proc/slabinfo
  • df -i (to see whether the filesystems are running out of inodes)

Although you mentioned “3” helped, it would be worth noting which of these provides a temporary fix for the slowdown in all occurrences:

  • echo 1 > /proc/sys/vm/drop_caches (pagecache)
  • echo 2 > /proc/sys/vm/drop_caches (dentries and inodes)
  • echo 3 > /proc/sys/vm/drop_caches (pagecache, dentries, and inodes)

Mount options to try (not ideal, but they may help narrow down the behaviour; see the example map entry after the list):

  • nfs4_disable_delegation=1
  • noac (disable attribute caching)
  • actimeo=0 (disable attribute cache timeout)
  • acregmin=0, acregmax=0, acdirmin=0, acdirmax=0 (disable file/directory attribute caching)
  • sync (synchronous writes)
  • lookupcache=none (disable directory lookup caching - dentries)
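If changing the options for all ~50 clients is too risky, one hypothetical way to limit the test is a dedicated autofs map entry for a single test user, so only that key carries the changed options (key, options and export path below are placeholders):

# indirect autofs map entry: key, options, location
testuser  -rw,hard,acl,vers=4.2,lookupcache=none,actimeo=0  nfsserver:/export/home/testuser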

I do use autofs or systemd’s automount. Both can unmount unused filesystems, so there are fewer “active” mounts at any given time.

ssh -vvvv is how I got debug3: send packet: type 50 in the first place. At the moment there are no clients with the problem, so I cannot get the very verbose log of it.

The output of slabtop and cat /proc/slabinfo is totally cryptic to me; I don’t know what to look for. That’s why I was happy about perf top being a good indicator.
I thought a high nfs_inode_cache in slabtop would be a sign of the problem, but currently the client is working flawlessly, there is no sign of nfs_server_return_marked_delegations in perf top, and slabtop starts with:

  OBJS ACTIVE  USE OBJ SIZE  SLABS OBJ/SLAB CACHE SIZE NAME
5642280 5642151  99%    0.19K 134340       42   1074720K dentry
5540800 5540558  99%    0.06K  86575       64    346300K lsm_inode_cache
5158607 5158607 100%    1.07K 177883       29   5692256K nfs_inode_cache

So high lsm_inode_cache and nfs_inode_cache values are not necessarily an indicator of the problem.

I have been waiting for an opportunity to test this for several days now and, as written above, botched the echo 2 > ... test. So now I have to wait for the next occurrence of the problem. As mentioned, echo 1 > ... did not help in the test, echo 3 > ... did.

For that I would have to create 6 additional autofs setups, assign them to different groups and then wait some days, weeks or months to see what happens. And if nothing happens, I would never be sure whether it’s due to the option or just luck.

We use automount to reduce the filesystem tree size on the clients. Several years ago this was an advantage compared to having the whole filesystem of several TB constantly mounted. I don’t know whether this is still the case nowadays with NFS 4.2 and some GB more of RAM compared to a decade ago. But why did you quote the “hard” mount option? What does the advantage of mounting/unmounting have to do with the blocking behaviour of the “hard” mount option?

OK, interesting development. We had another slow client with NFS problems. I first wanted to try /proc/sys/vm/zone_reclaim_mode from the AMD HPC tuning guide (page 25, section 5.2.1); /proc/sys/vm/drop_caches is the next bullet point on the next page.
echo 2 > drop_caches seemed to be the only thing that helped. But on client3, despite 22 CPUs and not much workload, the computer still feels slow, though not as unusably slow as before.
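For reference (from the kernel’s vm sysctl documentation), zone_reclaim_mode is a bitmask, so writing 3 enables the first two behaviours:

# 1 = zone reclaim on
# 2 = zone reclaim writes dirty pages out
# 4 = zone reclaim swaps pages
sysctl -w vm.zone_reclaim_mode=3     # equivalent to echo 3 > /proc/sys/vm/zone_reclaim_mode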

root@client3 # cat /proc/sys/vm/zone_reclaim_mode
0
root@client3 # free -hm
              total        used        free      shared  buff/cache   available
Mem:          377Gi        22Gi       133Gi        17Mi       222Gi       352Gi
Swap:         127Gi          0B       127Gi

root@client3 # echo 3 > /proc/sys/vm/zone_reclaim_mode
root@client3 # free -hm
              total        used        free      shared  buff/cache   available
Mem:          377Gi        22Gi       132Gi        17Mi       222Gi       352Gi
Swap:         127Gi          0B       127Gi

root@client3 # perf top
Samples: 233K of event 'cycles', 4000 Hz, Event count (approx.): 68469211948 lost: 0/0 drop: 0/1024737
Overhead  Shared Object                          Symbol
  64.18%  [kernel]                               [k] nfs_server_reap_expired_delegations
  29.91%  [kernel]                               [k] nfs_mark_test_expired_all_delegations
   1.07%  [kernel]                               [k] nfs_server_return_marked_delegations
...

after ~2 minutes

root@client3 # perf top
Samples: 6M of event 'cycles', 4000 Hz, Event count (approx.): 884299419691 lost: 0/0 drop: 0/1024737
Overhead  Shared Object                          Symbol
  93.64%  [kernel]                               [k] nfs_mark_test_expired_all_delegations
   5.02%  [kernel]                               [k] nfs_server_reap_expired_delegations
...
root@client3 # echo 1 > /proc/sys/vm/zone_reclaim_mode

No change.

root@client3 # echo 2 > /proc/sys/vm/drop_caches

< 30 sec

root@client3 # free -hm
              total        used        free      shared  buff/cache   available
Mem:          377Gi        18Gi       329Gi        17Mi        28Gi       356Gi
Swap:         127Gi          0B       127Gi

root@client3 # perf top
Samples: 12M of event 'cycles', 4000 Hz, Event count (approx.): 84375363950 lost: 0/0 drop: 0/1135939
Overhead  Shared Object                            Symbol
  58.94%  [kernel]                                 [k] nfs_server_reap_expired_delegations
  26.88%  [kernel]                                 [k] nfs_mark_test_expired_all_delegations
   0.92%  [kernel]                                 [k] nfs_mark_test_expired_delegation.isra.16.part.17
...

Accessible again, but not very fast.

root@client3 # echo 1 > /proc/sys/vm/drop_caches

< 10 sec

root@client3 # free -hm
              total        used        free      shared  buff/cache   available
Mem:          377Gi        18Gi       356Gi        14Mi       1.8Gi       356Gi
Swap:         127Gi          0B       127Gi

root@client3 # perf top
Samples: 15M of event 'cycles', 4000 Hz, Event count (approx.): 59105133403 lost: 0/0 drop: 0/1135939
Overhead  Shared Object                          Symbol
  83.75%  [kernel]                               [k] nfs_server_reap_expired_delegations
   8.15%  [kernel]                               [k] nfs_mark_test_expired_all_delegations
...

Accessible again, but not very fast.

root@client3 # echo 3 > /proc/sys/vm/drop_caches
< 5 sec
root@client3 # free -hm
              total        used        free      shared  buff/cache   available
Mem:          377Gi        18Gi       357Gi        14Mi       1.2Gi       356Gi
Swap:         127Gi          0B       127Gi

after ~2 minutes

root@client3 # perf top
Samples: 17M of event 'cycles', 4000 Hz, Event count (approx.): 31067710225 lost: 0/0 drop: 0/1135939
Overhead  Shared Object                             Symbol
  18.84%  [kernel]                                  [k] nfs_server_reap_expired_delegations
...

Accessible again, but not very fast.

root@client3 # numactl -H
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21
node 0 size: 192841 MB
node 0 free: 178847 MB
node 1 cpus: 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43
node 1 size: 193485 MB
node 1 free: 179523 MB
node distances:
node   0   1 
  0:  10  21 
  1:  21  10 

Forgot to execute numactl after every step.

Reboot. Still not blazing fast.

root@client3 # free -hm
              total        used        free      shared  buff/cache   available
Mem:          377Gi       4.1Gi       358Gi        16Mi        14Gi       370Gi
Swap:         127Gi          0B       127Gi

root@client3 # numactl -H
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21
node 0 size: 192841 MB
node 0 free: 178360 MB
node 1 cpus: 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43
node 1 size: 193485 MB
node 1 free: 188788 MB
node distances:
node   0   1 
  0:  10  21 
  1:  21  10

root@client3 # perf top
Samples: 3K of event 'cycles', 4000 Hz, Event count (approx.): 841703479 lost: 0/0 drop: 0/0
Overhead  Shared Object                          Symbol
   4.15%  [kernel]                               [k] entry_SYSCALL_64
...
root@client3 # cat /proc/sys/vm/zone_reclaim_mode
0
root@client3 # cat /proc/sys/vm/drop_caches
0                                   

It looks like the error can have different intensities, but once it starts, it won’t go away until the next reboot. We are still searching for a way to reliably reproduce the error. Some clients seem to have a greater tendency to get the problem (after ~days instead of ~weeks). And it looks like compiling code can trigger the problem, since one of the most affected NFS clients is a compile-only server, which does not execute the code it compiles.
The pattern is: the client runs fine until perf top shows these 4 symbols:

   x.xx%  [k] nfs_mark_test_expired_all_delegations
   x.xx%  [k] nfs_server_reap_expired_delegations
   x.xx%  [k] nfs_server_return_marked_delegations
   x.xx%  [k] nfs_expire_unreferenced_delegations

And NFS access gets slower. When the percentage for nfs_mark_test_expired_all_delegations is around ~10% the system is a little sluggish; it can get better. echo 3 > /proc/sys/vm/drop_caches always helps, echo 2 > /proc/sys/vm/drop_caches helps in most cases, echo 1 > /proc/sys/vm/drop_caches helps in some cases. But once nfs_mark_test_expired_all_delegations has shown up, it does not go away completely. Only a reboot removes it, and only for some days.
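Since we cannot reliably reproduce it yet, a small watchdog sketch could at least tell us when a client enters this state (perf must be installed; the 10% threshold and the paths are arbitrary choices):

#!/bin/bash
# Sample system-wide CPU cycles for 10 seconds and log a warning when the
# delegation-related NFS symbols dominate the kernel profile.
perf record -a -o /tmp/nfs_deleg.data -- sleep 10 >/dev/null 2>&1
pct=$(perf report -i /tmp/nfs_deleg.data --stdio 2>/dev/null | awk '
  /nfs_(mark_test_expired_all|server_reap_expired|server_return_marked|expire_unreferenced)_delegations/ {
      gsub("%", "", $1); sum += $1
  }
  END { print int(sum) }')
if [ "${pct:-0}" -ge 10 ]; then
    logger -t nfs-deleg-watch "delegation symbols at ${pct}% of kernel samples - NFS slowdown likely"
fi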

If we could reproduce the problem, we could test whether it started with a certain kernel, whether a newer one would fix the issue, and what the cause behind it is. Like: 2176575 – intermittent severe NFS client performance drop via nfs_server_reap_expired_delegations looping?

Some details about the kernel symbols.

nfs_mark_test_expired_all_delegations:

Percent│      nop
       │      push  %r13
       │      push  %r12
       │      push  %rbp
       │      push  %rbx
       │      mov   0xc8(%rdi),%rax
       │      lea   0xc8(%rdi),%r13
       │      cmp   %rax,%r13
       │    ↓ je    5c
       │      lea   -0x8(%rax),%rbp
       │22:   mov   0x300(%rbp),%rbx
  0.04 │      lea   0x300(%rbp),%r12
       │      cmp   %r12,%rbx
       │    ↓ je    4f
       │35:   mov   0x30(%rbx),%eax
 98.61 │      test  %eax,%eax
       │    ↓ je    47
       │      mov   %rbx,%rsi
       │      mov   %rbp,%rdi
  0.01 │    → callq nfs_mark_test_expired_delegation.isra.16.part.17
       │47:   mov   (%rbx),%rbx
  1.29 │      cmp   %r12,%rbx
  0.02 │    ↑ jne   35
       │4f:   mov   0x8(%rbp),%rax
  0.04 │      lea   -0x8(%rax),%rbp
       │      cmp   %rax,%r13
       │    ↑ jne   22
       │5c:   pop   %rbx
       │      pop   %rbp
       │      pop   %r12

nfs_server_reap_expired_delegations:

Percent│       lea    0x300(%rdi),%r13
       │       push   %r12
       │       push   %rbx
       │       and    $0xfffffffffffffff0,%rsp
       │       sub    $0x30,%rsp
       │       mov    %gs:0x28,%rax
       │       mov    %rax,0x28(%rsp)
       │       xor    %eax,%eax
       │       mov    0x320(%rdi),%rax
  0.01 │       lea    0x1(%rax),%r12
       │       mov    %r12,0x320(%rdi)
       │ 46:   mov    0x300(%r14),%rbx
  0.03 │       cmp    %r13,%rbx
       │     ↓ je     168
       │ 56:   mov    0x50(%rbx),%rax
 80.48 │       test   $0x80,%al
  0.03 │     ↓ jne    15c
  0.39 │       mov    0x50(%rbx),%rax
  2.03 │       test   $0x10,%al
  0.32 │     ↓ jne    15c
       │       mov    0x50(%rbx),%rax
  1.07 │       test   $0x40,%al
  0.17 │     ↓ je     15c
  0.03 │       cmp    0x48(%rbx),%r12
  0.28 │     ↓ je     15c
  0.04 │       mov    %rbx,%rdi
       │     → callq  nfs_delegation_grab_inode
       │       mov    %rax,%r15
       │       test   %rax,%rax
       │     ↑ je     46

nfs_server_return_marked_delegations:

Percent│       pop    %r14
       │       pop    %r15
       │     → jmpq   0xffffffffb5626840
       │ b3:   lea    0x5c(%rbx),%rax
       │       mov    %rdx,0x10(%rsp)
       │       mov    %rax,%rdi
       │       mov    %rax,0x8(%rsp)
       │     → callq  _raw_spin_lock
       │       mov    0x18(%rbx),%rsi
  0.53 │       xor    %r8d,%r8d
       │       mov    0x8(%rsp),%rax
       │       mov    0x10(%rsp),%rdx
       │       test   %rsi,%rsi
       │     ↓ je     ee
       │       mov    -0x78(%rsi),%rdi
 41.26 │       sub    $0x78,%rsi
       │       cmp    %rdi,%rsi
       │       sete   %r8b
       │ ee:   mov    %rax,%rdi
       │       movb   $0x0,(%rdi)
       │       nop
       │       test   %r8b,%r8b
       │     ↑ je     70
       │101:   lock   andb   $0xfb,(%rdx)
  5.37 │       mov    0x50(%rbx),%rax
  1.81 │       mov    $0x1,%r15d
       │       test   $0x10,%al
       │     ↑ jne    7c
       │117:   mov    0x50(%rbx),%rax
       │       test   $0x1,%ah

nfs_expire_unreferenced_delegations:

Percent│
       │    Disassembly of section load0:
       │
       │    ffffffffc0c3b200 <load0>:
       │      nop
       │      mov  0xc8(%rdi),%rax
       │      lea  0xc8(%rdi),%r8
       │      lea  -0x8(%rax),%rdx
       │      cmp  %rax,%r8
       │    ↓ je   5d
       │1c:   mov  0x300(%rdx),%rax
       │      lea  0x300(%rdx),%rsi
       │      cmp  %rsi,%rax
       │    ↓ je   50
       │2f:   lock btrq $0x3,0x50(%rax)
 78.58 │   ┌──jb   48
       │   │  lock orb  $0x4,0x50(%rax)
  9.85 │   │  mov  (%rdx),%rcx
       │   │  lock orb  $0x20,0x130(%rcx)
 10.14 │48:└─→mov  (%rax),%rax
  1.43 │      cmp  %rsi,%rax
       │    ↑ jne  2f
       │50:   mov  0x8(%rdx),%rax
       │      lea  -0x8(%rax),%rdx
       │      cmp  %rax,%r8
       │    ↑ jne  1c
       │5d:   mov  0x130(%rdi),%rax
       │      test $0x20,%al
       │    ↓ jne  6d
       │    → jmpq 0xffffffffb5626840

In Red Hat Bugzilla – Bug 2176575, in comment 16 and comment 17, does either of those sound like what you’re seeing? And where is this “upstream” commit anyway? Is it in the Linux kernel, in Fedora, on the server or on the client? Has it been backported into your Rocky versions?

Definitely. And so far it seems to be the only post on the whole web that narrows down a similar problem.

The bug report in Bugzilla is for Fedora 39. The upstream fix mentioned is a commit to kernel.org by someone from Red Hat. The Fedora bug is closed because Fedora 40 should ship a newer kernel and no longer have this bug.
The backport to RHEL 8 does not seem to have taken place, or the bug was not the source of my problem. One would have to investigate the source code a little further to see if the patch is in it, or where it is missing.
And the problem occurs on Rocky Linux 8 as well as on Oracle Linux 8, so I would guess it is a Red Hat problem.
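A quick (and non-definitive) check whether something delegation-related was changed in the running kernel package would be to grep its changelog:

rpm -q --changelog kernel-$(uname -r) | grep -iE 'delegation|nfs4' | head -n 20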

Thinking further about when the symptoms started, we now believe it began much earlier last year with the switch away from CentOS 7 / NFS 3; something just made it worse in the past weeks. It rose from a nagging issue and a reboot once a month to completely blocking users every week.
The affected clients seem to have been running IDEs with code analysis more often than the less affected clients, so the next guess would be that the code analysis causes high I/O on the NFS, which causes the unusable state.
Today client1 had the symptoms again, and writing 2, 1 and finally 3 into drop_caches for the first time not only did not help, but made the client unusable even for the local user. After the reboot I switched its autofs configuration from NFS 4.2 to NFS 3. Let’s see whether the problem also occurs on NFS 3.

I know you already ruled it out, but can you check
cat /etc/machine-id
on every client, and confirm they are all unique?
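For example (assuming hostnames client1..client50 and root ssh access, adjust to your naming), something like this would show any duplicates:

for h in client{1..50}; do
    printf '%s ' "$h"
    ssh -o ConnectTimeout=5 "$h" cat /etc/machine-id
done | awk '{print $2}' | sort | uniq -d        # prints machine-ids that appear more than once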