DNS Responses Being Blocked?

I think I have a DNS problem.

I just set up a new Rocky Linux 9.5 system with an autofs-managed, NFS-mounted home directory. This machine is referred to below as nfsclient.white.lan.

It sporadically hangs for 20 seconds trying to open a file or write or do NFS stuff.

It looks like it occurs when autofs re-mounts after the mount is “Deactivated successfully”.
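For reference, the “Deactivated successfully” message is visible in the journal; tailing it on the client with something like the following is enough to see the remount cycle (the grep pattern is just a guess at what’s relevant):

journalctl -f | grep -Ei 'autofs|automount|deactivated'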

I have another client that mounts the server just fine.

I think the problem has something to do with the fact that the DNS server is not authoritative for the NFS server or client.

The machines in question are:

nfsclient.white.lan: 10.10.10.201
nfsserver.white.lan: 10.10.10.17
dnsserver.black.lan: 10.10.10.72
dnsserver.white.lan: 10.10.10.1

So the DNS server being used is only authoritative for black.lan and has to forward to 10.10.10.1.

I don’t see why this should be a problem.

If I do dig nfsserver.white.lan, it is instantly successful.
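To be specific, that is with /etc/resolv.conf pointed at 10.10.10.72, so a plain dig nfsserver.white.lan goes through the black.lan server. Querying each server directly, for both record types, is also worth doing; that comparison is what ends up mattering (see the updates below):

dig @10.10.10.72 nfsserver.white.lan A
dig @10.10.10.72 nfsserver.white.lan AAAA
dig @10.10.10.1 nfsserver.white.lan AAAA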

But … I got a capture, which shows the following:

DNS: 10.10.10.201 > 10.10.10.72: A nfsserver.white.lan
DNS: 10.10.10.201 > 10.10.10.72: AAAA nfsserver.white.lan
DNS: 10.10.10.72 > 10.10.10.201: A nfsserver.white.lan success: 10.10.10.17

5 seconds later …

ARP: Who has 10.10.10.201 tell 10.10.10.72
ARP: 10.10.10.201 is at ..a9
DNS: 10.10.10.201 > 10.10.10.72: A nfsserver.white.lan
DNS: 10.10.10.201 > 10.10.10.72: AAAA nfsserver.white.lan
DNS: 10.10.10.72 > 10.10.10.201: A nfsserver.white.lan success: 10.10.10.17
ARP: Who has 10.10.10.72 tell 10.10.10.201
ARP: 10.10.10.72 is at ..6d

4 seconds later …

DNS: 10.10.10.72 > 10.10.10.201: "Server failure": AAAA nfsserver.white.lan
DNS: 10.10.10.201 > 10.10.10.72: A nfsserver.white.lan
DNS: 10.10.10.201 > 10.10.10.72: AAAA nfsserver.white.lan
DNS: 10.10.10.72 > 10.10.10.201: A nfsserver.white.lan success: 10.10.10.17

and two more retries …
after 20 seconds in total …

ICMP: 10.10.10.201 > 10.10.10.72: Destination unreachable
NFS: SYN / ACK ... success
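For reference, a capture like this can be reproduced on the client with something along these lines (eth0 is a placeholder for the actual interface; port 53 is DNS, 2049 is NFS):

tcpdump -ni eth0 'port 53 or port 2049 or arp or icmp'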

So it looks like there are a few odd things going on here:

ICMP is blocked (presumably firewalled but I don’t see why this should be an issue)

AAAA lookups are failing (normal - no IPv6 here yet)

The ARP traffic and DNS retransmissions seem to indicate that DNS responses are not being received by .201?

Could this be an SELinux thing?

Note: I can’t just change DNS to dnsserver.white.lan, because dnsserver.black.lan is Windows DNS, which has the SRV records and such needed for doing Kerberos with Windows KDCs.
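For example, Kerberos and AD discovery depend on SRV lookups along these lines, which only the Windows DNS answers (the exact record names here are illustrative):

dig @10.10.10.72 _kerberos._tcp.black.lan SRV
dig @10.10.10.72 _ldap._tcp.black.lan SRV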

UPDATE 1:

After disabling SELinux and turning off firewalld, the problem was not resolved.

However, after changing /etc/auto.nfs from:

user1  -fstype=nfs,rw  nfsserver.white.lan:/d0/user1

to reference the server by IP address:

user1  -fstype=nfs,rw  10.10.10.17:/d0/user1

Now it works. Not ideal. But technically it’s working.
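For completeness, applying the map change is just a matter of reloading autofs and re-triggering the mount; something like this (the mount path is only an example):

systemctl reload autofs
ls /home/user1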

Something is blocking NFS from getting DNS responses. Presumably it’s some kind of obscure security feature of NFS or some RPC service or maybe autofs is somehow influencing things or …

shrug
I’ve always just put that sort of thing into /etc/hosts
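In this case that would just be a line on the client like:

10.10.10.17   nfsserver.white.lan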

No, you shouldn’t need to change it; it won’t be down to the fact that white.lan is managed elsewhere. All domains on the internet are spread across various nameservers, so your setup is no different really.

I’m guessing about how DNS is configured on the host where you have the auto.nfs mount. Can you check what the contents of /etc/resolv.conf are? Is it using something like systemd-resolved or dnsmasq? Usually, if you see a 127.0.0.1-type address in resolv.conf, it suggests queries are going to something else first, like systemd-resolved or dnsmasq, before they hit the servers that you configured.

If resolv.conf does have a 127.0.0.1-type address in there, you can edit it manually and put in nameserver entries that point directly at the DNS servers in question. That is an easy way to check whether that is the problem.
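Something like this shows what is actually in play (the resolvectl command only applies if systemd-resolved is running):

cat /etc/resolv.conf
resolvectl status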

Another potential solution is to install something like unbound on the machine and configure it to cache DNS responses locally (that way, once a name has been resolved, the timeout should no longer appear). I would only tend to do that as a last resort, though if you have to do it for multiple machines you can install and configure unbound with Ansible, so it would be quick to deploy on a bigger scale. You could also add entries to /etc/hosts, but if you do that, then what’s the point of DNS? :)
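A minimal unbound sketch of what I mean, assuming it listens only on loopback and keeps forwarding everything to 10.10.10.72 (adjust to taste):

# /etc/unbound/unbound.conf fragment
server:
    interface: 127.0.0.1
    access-control: 127.0.0.0/8 allow
    domain-insecure: "lan."     # private zones, no DNSSEC expected
forward-zone:
    name: "."
    forward-addr: 10.10.10.72

Then point /etc/resolv.conf at nameserver 127.0.0.1.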

If you are having problems with a DNS server resolving the FQDN ‘nfsserver.white.lan’, but using the IP address ‘10.10.10.17’ works, then this does suggest a DNS problem. Using the FQDN requires your DNS server to resolve that FQDN to an IP address, whereas using the IP goes straight to it.

It looks like your DNS server ‘dnsserver.black.lan’ isn’t forwarding correctly. It should go: “I do not know where nfsserver.white.lan is, but dnsserver.white.lan will, so I will forward the query to that forwarder.”

I think I know what the problem is and ultimately I have to suspect that it is actually a bug in the DNS client retry logic (and in practice an “imperfection” in DNS itself).

BEHAVIOR:

If the NFS client’s DNS is set to the “bad” DNS server, with /etc/resolv.conf like:

nameserver 10.10.10.72

the DNS queries retry and time out after about 13 seconds, ending with the AAAA queries failing with reply code Server failure (2).

If I change DNS to the “good” DNS server, with /etc/resolv.conf:

nameserver 10.10.10.1

the DNS queries do not retry; the AAAA query fails immediately with reply code Not implemented (4), and the NFS client immediately uses the A record’s IP.
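The difference is easy to see by asking each server directly; based on the behavior above, the first should eventually come back with status SERVFAIL and the second should come back immediately with status NOTIMP:

dig @10.10.10.72 nfsserver.white.lan AAAA
dig @10.10.10.1 nfsserver.white.lan AAAA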

ANALYSIS:

The Not implemented (4) is correct in that the upstream DNS server does not have AAAA records for the hostname.

However, the “bad” DNS server apparently retries, times out and returns Server failure (2).

This looks very much like the NFS client has logic that interprets Server failure (2) as a retry condition.

Technically I would say that this is a flaw in how the downstream DNS server (which is Windows DNS in this case) is reporting upstream errors. The downstream DNS server should probably just relay the upstream Not implemented error to the client immediately. Instead, it apparently retries the query (otherwise it would have returned quickly) and returns Server failure.

With this behavior, it is not clear how clients could distinguish between a failing individual query and a server that is actually having a more general failure.

Furthermore, I suspect that if there were multiple DNS servers listed in the NFS client’s /etc/resolv.conf, the Server failure might actually trigger the client to fail over to the next DNS server. Meaning one failed upstream query returning Server failure instead of Not implemented could cause the client to fail over to the next DNS server, which would also fail and then probably take even longer to time out.

Alternatively, the downstream DNS server could return a different reply code along the lines of Upstream failure (?). Aside from it being hard to get that train to change tracks, the basic DNS reply code field is only 4 bits and the values are essentially spoken for.
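As a client-side mitigation, the worst case can at least be bounded with resolver options in /etc/resolv.conf; a sketch (the values are arbitrary):

nameserver 10.10.10.72
options timeout:2 attempts:2 single-request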

I need one of those tee shirts that reads “It was DNS”.

UPDATE 2:

From RFC 4074:

4.3. Return Other Erroneous Codes

Other authoritative servers return a response with erroneous response
codes other than RCODE 3 (“Name Error”). One such RCODE is 4 (“Not
Implemented”), indicating that the servers do not support the
requested type of query.

These cases are less harmful than the previous one; if the stub
resolver falls back to querying for an A RR, the caching server will
process the query correctly and return an appropriate response.

However, these can still cause a serious effect. There was an
authoritative server implementation that returned RCODE 2 (“Server
failure”) to queries for AAAA RRs. One widely deployed mail server
implementation with a certain type of resolver library interpreted
this result as an indication of retry and did not fall back to
queries for A RRs, causing message delivery failure.

If the caching server receives a response with these response codes,
it does not cache the fact that the queried name has no AAAA RR,
resulting in redundant queries for AAAA RRs in the future. The
behavior will waste network bandwidth and increase the load of the
authoritative server.

Using RCODE 1 (“Format error”) would cause a similar effect, though
the authors have not seen such implementations yet.

At this point I have to guess that the real problem is ultimately that the upstream DNS (which is a conventional Internet router) should not return Not implemented (4) for AAAA queries at all; since the name exists but simply has no AAAA records, it should return NoError (0) with an empty answer section.

Meaning my router is garbage.