Connectivity lost after sssd_nss failure

I frequently lose all SSH connectivity to my AWS EC2 instance running Rocky Linux v8.6:

# cat /etc/os-release
NAME="Rocky Linux"
VERSION="8.6 (Green Obsidian)"
ID_LIKE="rhel centos fedora"
PRETTY_NAME="Rocky Linux 8.6 (Green Obsidian)"

When this happens, it sometimes heals itself after an extended period (at least 30 minutes). I can generally get things going again by doing a shutdown and restart, but that’s very inconvenient.

This may be a symptom of some other issue: it always happens while I’m doing remote debugging using VSCode and JavaScript/Node.js. I know that VSCode currently leaves dozens, and sometimes hundreds, of ports in TIME_WAIT, and I’m in communication with the VSCode team about addressing that.
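For anyone who wants to put a number on the TIME_WAIT buildup, here is a minimal sketch that counts such sockets by reading /proc/net/tcp and /proc/net/tcp6, where the kernel reports TIME_WAIT as state code 06. The function name `count_time_wait` is my own, and I'm assuming a Linux host where those files are readable.

```python
from pathlib import Path

# Hex state code the Linux kernel uses for TIME_WAIT in /proc/net/tcp.
TIME_WAIT = "06"


def count_time_wait(proc_net_tcp_text: str) -> int:
    """Count TIME_WAIT sockets in the text of /proc/net/tcp (or tcp6)."""
    count = 0
    for line in proc_net_tcp_text.splitlines()[1:]:  # skip the header row
        fields = line.split()
        # Column order: sl, local_address, rem_address, st, ...
        if len(fields) > 3 and fields[3] == TIME_WAIT:
            count += 1
    return count


if __name__ == "__main__":
    total = 0
    for name in ("/proc/net/tcp", "/proc/net/tcp6"):
        path = Path(name)
        if path.exists():  # tcp6 is absent when IPv6 is disabled
            total += count_time_wait(path.read_text())
    print(f"sockets in TIME_WAIT: {total}")
```

Running this every few seconds during a debugging session should show whether the count climbs just before the SSH lockout.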

Nevertheless, I don’t think this should kill ALL ssh access to the affected system.

I’ve found entries in /var/log and /var/log/sssd that apparently correspond to the failure. In /var/log/messages, I see the following complaint with a timestamp that matches when I observed the failure:

Jan 24 17:10:53 byron sssd[886]: Child [902] ('nss':'nss') was terminated by own WATCHDOG. Consult corresponding logs to figure out the reason.

In /var/log/sssd/sssd.log, again at or about the same timestamp, I see the following:

(2023-01-24 17:10:50): [sssd] [svc_child_info] (0x0020): Child [902] ('nss':'nss') was terminated by own WATCHDOG
   *  (2023-01-24 17:10:50): [sssd] [mt_svc_exit_handler] (0x1000): SIGCHLD handler of service nss called
   *  (2023-01-24 17:10:50): [sssd] [svc_child_info] (0x0020): Child [902] ('nss':'nss') was terminated by own WATCHDOG
********************** BACKTRACE DUMP ENDS HERE *********************************
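To check how often this happens, the log lines above can be matched with a short script. This is only a sketch: the regular expression is written against the exact message text shown in this post, and the names `WATCHDOG_RE` and `find_watchdog_kills` are my own.

```python
import re

# Matches the message text seen in both /var/log/messages and sssd.log.
WATCHDOG_RE = re.compile(
    r"Child \[(?P<pid>\d+)\] \('(?P<service>[^']+)':'[^']+'\) "
    r"was terminated by own WATCHDOG"
)


def find_watchdog_kills(log_text):
    """Return a (pid, service) tuple for each watchdog termination line."""
    hits = []
    for line in log_text.splitlines():
        match = WATCHDOG_RE.search(line)
        if match:
            hits.append((int(match.group("pid")), match.group("service")))
    return hits


if __name__ == "__main__":
    sample = (
        "Jan 24 17:10:53 byron sssd[886]: Child [902] ('nss':'nss') was "
        "terminated by own WATCHDOG. Consult corresponding logs to figure "
        "out the reason."
    )
    print(find_watchdog_kills(sample))  # prints [(902, 'nss')]
```

Feeding it the full contents of /var/log/messages would list every service the watchdog has killed, which helps tell whether only nss is affected.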

I see an apparent reference to this issue in a redhat bugzilla, but the thread doesn’t seem helpful.

I found an alleged fix but can’t read it because I don’t have an enterprise account at redhat.

So far as I can tell, there is no /etc/sssd.conf on my system.

I appreciate any guidance the community can offer about a fix or workaround. I haven’t found any source that describes the origin or originator of this “own WATCHDOG” termination.

In short, it suggests increasing the timeout value in sssd.conf under the domain section.


Cool. This may be a symptom rather than a problem.

According to find, there is just one sssd.conf. I presume it needs to be copied to /etc/sssd in order for any changes to take effect.

Also, in the sssd man pages I find the following:

Options usable in SERVICE and DOMAIN sections
timeout (integer)

Timeout in seconds between heartbeats for this service. This is used to ensure that the process is alive and capable of answering requests. Note that after three missed heartbeats the process will terminate itself.
Default: 10
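To make the arithmetic concrete, here is a rough sketch that reads an sssd.conf-style file and reports, per section, how long a process could go silent before three missed heartbeats add up. The sample configuration, the domain name `example`, and the function name `watchdog_windows` are all hypothetical. I'm also assuming the window is simply three times the timeout, and I skip the [sssd] section because the excerpt only mentions SERVICE and DOMAIN sections.

```python
import configparser

DEFAULT_TIMEOUT = 10   # seconds, the default quoted in the man page excerpt
MISSED_HEARTBEATS = 3  # "after three missed heartbeats the process will terminate itself"


def watchdog_windows(conf_text):
    """Map each service/domain section to timeout * MISSED_HEARTBEATS seconds."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(conf_text)
    windows = {}
    for section in parser.sections():
        if section == "sssd":  # assumption: the monitor section is not covered
            continue
        timeout = parser.getint(section, "timeout", fallback=DEFAULT_TIMEOUT)
        windows[section] = timeout * MISSED_HEARTBEATS
    return windows


# Hypothetical configuration, not taken from my system.
SAMPLE_CONF = """
[sssd]
services = nss, pam
domains = example

[nss]

[domain/example]
timeout = 30
"""

if __name__ == "__main__":
    for section, seconds in watchdog_windows(SAMPLE_CONF).items():
        print(f"[{section}] self-terminates after about {seconds}s of missed heartbeats")
```

With the default of 10, the nss responder would have to be unresponsive for roughly 30 seconds before the watchdog fires.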

The man pages make me wonder if this sssd issue is a problem or a symptom.

In the thread-starter, I mentioned an ongoing issue with VSCode – it leaves a large number of ports in TIME_WAIT. I notice the following entry in the relevant VSCode log:

[20:14:12] No ptyHost heartbeat after 6 seconds
[20:14:26] No ptyHost heartbeat after 6 seconds
[20:14:46] No ptyHost heartbeat after 6 seconds

I think I’ll wait until the VSCode team releases an upgrade for VSCode before diving into changes to sssd.conf.

I’m reluctant to change a configuration setting this deeply wired into sssd without stronger evidence that the default is the source of the issues I’m seeing. I don’t understand why anything should cause a delay of even 18 seconds (as I see in the VSCode log entry above), never mind 30 seconds (three times the sssd default).

I appreciate your attention – I hope this issue is addressed by next week’s update to VSCode.