Connectivity lost after sssd_nss failure

I frequently lose all SSH connectivity to my AWS EC2 instance running Rocky Linux v8.6:

# cat /etc/os-release
NAME="Rocky Linux"
VERSION="8.6 (Green Obsidian)"
ID_LIKE="rhel centos fedora"
PRETTY_NAME="Rocky Linux 8.6 (Green Obsidian)"

When this happens, it sometimes heals itself after an extended period (at least 30 minutes). I can generally get things going again by doing a shutdown and restart, but that’s very inconvenient.

This may be a symptom of some other issue – it always happens while I’m doing remote debugging using VSCode and Javascript/nodejs. I know that VSCode currently leaves at least dozens and sometimes hundreds of ports in TIME_WAIT – I’m in communication with the VSCode team about addressing that.

Nevertheless, I don’t think this should kill ALL ssh access to the affected system.
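For what it's worth, the TIME_WAIT buildup is easy to quantify. Here is a rough check (my own, not from any VSCode tooling) that counts TIME_WAIT entries directly in /proc/net/tcp, where hex state 06 is the kernel's code for TIME_WAIT:

```shell
# Count TCP sockets currently in TIME_WAIT (kernel state code 06).
# Field 4 of each /proc/net/tcp* row is the connection state in hex;
# FNR > 1 skips the header line of each file.
awk 'FNR > 1 && $4 == "06"' /proc/net/tcp /proc/net/tcp6 2>/dev/null | wc -l
```

Watching that number climb during a debug session would tie the VSCode behaviour to the connectivity loss more directly than the logs alone.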

I’ve found entries in /var/log and /var/log/sssd that apparently correspond to the failure. In /var/log/messages, I see the following complaint with a timestamp that matches when I observed the failure:

Jan 24 17:10:53 byron sssd[886]: Child [902] ('nss':'nss') was terminated by own WATCHDOG. Consult corresponding logs to figure out the reason.

In /var/log/sssd/sssd.log, again at or about the same timestamp, I see the following:

(2023-01-24 17:10:50): [sssd] [svc_child_info] (0x0020): Child [902] ('nss':'nss') was terminated by own WATCHDOG
   *  (2023-01-24 17:10:50): [sssd] [mt_svc_exit_handler] (0x1000): SIGCHLD handler of service nss called
   *  (2023-01-24 17:10:50): [sssd] [svc_child_info] (0x0020): Child [902] ('nss':'nss') was terminated by own WATCHDOG
********************** BACKTRACE DUMP ENDS HERE *********************************

I see an apparent reference to this issue in a Red Hat Bugzilla entry, but the thread doesn’t seem helpful.

I found an alleged fix, but I can’t read it because I don’t have a Red Hat enterprise account.

So far as I can tell, there is no sssd.conf under /etc on my system.

I appreciate any guidance the community can offer about a fix or workaround. I haven’t found any source that describes the origin or originator of this “own WATCHDOG” termination.

In short, the alleged fix suggests increasing the timeout value in sssd.conf under the domain section.
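For anyone finding this later, the suggested change would look roughly like this (the domain name below is a placeholder; on a stock install the live file is /etc/sssd/sssd.conf):

```ini
[domain/example.com]
# Seconds between watchdog heartbeats; per the man page the service
# terminates itself after three missed heartbeats, so 30 here allows
# up to 90 seconds of unresponsiveness before the watchdog fires.
timeout = 30
```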


Cool. This may be a symptom rather than a problem.

According to find, there is just one sssd.conf. I presume it needs to be copied to /etc/sssd/sssd.conf for any changes to take effect.
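A sketch of what that copy would involve, assuming the packaged default lives under /usr/lib64/sssd/conf/ as it does here. This runs against a scratch directory so it works unprivileged; on the real system the destination is /etc/sssd/sssd.conf, owned by root:

```shell
# Stand-in for /etc/sssd so this sketch can run without root; on the
# real system the destination would be /etc/sssd/sssd.conf.
ETC_SSSD=$(mktemp -d)

# A minimal config standing in for the packaged default at
# /usr/lib64/sssd/conf/sssd.conf.
printf '[sssd]\nservices = nss, pam\n' > "$ETC_SSSD/sssd.conf"

# sssd refuses to start if the config is not mode 0600.
chmod 600 "$ETC_SSSD/sssd.conf"
stat -c '%a' "$ETC_SSSD/sssd.conf"   # prints 600
```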

Also, in the sssd man pages I find the following:

Options usable in SERVICE and DOMAIN sections
timeout (integer)

Timeout in seconds between heartbeats for this service. This is used to ensure that the process is alive and capable of answering requests. Note that after three missed heartbeats the process will terminate itself.
Default: 10

The man pages make me wonder if this sssd issue is a problem or a symptom.

In the thread-starter, I mentioned an ongoing issue with VSCode – it leaves a large number of ports in TIME_WAIT. I notice the following entry in the relevant VSCode log:

[20:14:12] No ptyHost heartbeat after 6 seconds
[20:14:26] No ptyHost heartbeat after 6 seconds
[20:14:46] No ptyHost heartbeat after 6 seconds

I think I’ll wait until the VSCode team releases an upgrade for VSCode before diving into changes to sssd.conf.

I’m reluctant to change a configuration setting this deeply wired into sssd without stronger evidence that the default is the source of the issues I’m seeing. I don’t understand why anything should cause a delay of even 18 seconds (three missed 6-second heartbeats, as in the VSCode log above), never mind 30 seconds (three missed heartbeats at the sssd default of 10).

I appreciate your attention – I hope this issue is addressed by next week’s update to VSCode.

I’ve now upgraded RockyLinux to v8.7 and installed the latest VSCode update (v1.75.0, just released today) and continue to see the same issue.

When this issue occurs, ALL connectivity is lost for a period ranging from a few minutes to much longer. The system sometimes restores itself and sometimes needs a manual restart.

Here is the RockyLinux status of the affected system:

# cat /etc/os-release
NAME="Rocky Linux"
VERSION="8.7 (Green Obsidian)"
ID_LIKE="rhel centos fedora"
PRETTY_NAME="Rocky Linux 8.7 (Green Obsidian)"

Here is the version info for VSCode:

Version: 1.75.0 (system setup)
Commit: e2816fe719a4026ffa1ee0189dc89bdfdbafb164
Date: 2023-02-01T15:23:45.584Z
Electron: 19.1.9
Chromium: 102.0.5005.194
Node.js: 16.14.2
OS: Windows_NT x64 10.0.18363
Sandboxed: No

I use the “remote SSH” and “Javascript debugger” extensions of VSCode, so the actual issue is happening on the remote system.

I found RHBA-2020:4569 - Bug Fix Advisory, which appears to be relevant. I don’t have access to the Red Hat portal, and I don’t know enough about Rocky Linux to tell whether or how this advisory applies to RL v8.7.

I just now had a hang from around 23:00 UTC to 23:04 UTC.

Here is the relevant excerpt from /var/log/messages:

Feb  3 23:00:01 byron systemd[1]: Started system activity accounting tool.
Feb  3 23:02:46 byron sssd[902]: Child [923] ('nss':'nss') was terminated by own WATCHDOG. Consult corresponding logs to figure out the reason.
Feb  3 23:02:47 byron systemd[1]: session-55.scope: Succeeded.
Feb  3 23:02:47 byron systemd-logind[924]: Session 55 logged out. Waiting for processes to exit.
Feb  3 23:02:47 byron systemd-logind[924]: Removed session 55.
Feb  3 23:02:47 byron sssd_nss[22624]: Starting up
Feb  3 23:02:48 byron systemd[1]: Started Process Core Dump (PID 22647/UID 0).
Feb  3 23:02:48 byron systemd-coredump[22648]: Resource limits disable core dumping for process 22068 (node).
Feb  3 23:02:48 byron systemd-coredump[22648]: Process 22068 (node) of user 1001 dumped core.
Feb  3 23:02:48 byron systemd[1]: systemd-coredump@1-22647-0.service: Succeeded.
Feb  3 23:04:35 byron systemd[1]: Started Session 61 of user root.
Feb  3 23:04:35 byron systemd-logind[924]: New session 61 of user root.

I lost connectivity at 23:00. It appears to me that the “WATCHDOG” shut down sssd_nss after several failed heartbeats – I think that’s what the second entry (Feb 3 23:02:46) is telling us.

The successful signin (“Session 61”) of user “root” exactly corresponds to when my SSH client was able to reconnect.

Can someone confirm that the bug fixes and updates mentioned in the above redhat advisory are in RL v8.7?

I made one attempt to copy sssd.conf from /usr/lib64/sssd/conf/sssd.conf to /etc/default/sysconfig/sssd.conf – the place where an earlier log said sssd was looking for the file. I made the change suggested upthread. With that change in place, systemctl was unable to launch sssd.

I did see a complaint in a VSCode log earlier this week – just once – with the following:

2023-02-02 17:40:22.918 [error] Error: spawn ENOMEM
        at ChildProcess.spawn (node:internal/child_process:413:11)
        at Object.spawn (node:child_process:700:9)
        at t.Git.spawn (/home/tms/.vscode-server/bin/97dec172d3256f8ca4bfb2143f3f76b503ca0534/extensions/git/dist/main.js:2:2017628)
        at (/home/tms/.vscode-server/bin/97dec172d3256f8ca4bfb2143f3f76b503ca0534/extensions/git/dist/main.js:2:2016385)
        at (/home/tms/.vscode-server/bin/97dec172d3256f8ca4bfb2143f3f76b503ca0534/extensions/git/dist/main.js:2:2020758)
        at P.getStatus (/home/tms/.vscode-server/bin/97dec172d3256f8ca4bfb2143f3f76b503ca0534/extensions/git/dist/main.js:2:2034051)
        at j.db (/home/tms/.vscode-server/bin/97dec172d3256f8ca4bfb2143f3f76b503ca0534/extensions/git/dist/main.js:2:2101461)
        at (/home/tms/.vscode-server/bin/97dec172d3256f8ca4bfb2143f3f76b503ca0534/extensions/git/dist/main.js:2:2100726)
        at async j.ab (/home/tms/.vscode-server/bin/97dec172d3256f8ca4bfb2143f3f76b503ca0534/extensions/git/dist/main.js:2:2100238)
        at async j.W (/home/tms/.vscode-server/bin/97dec172d3256f8ca4bfb2143f3f76b503ca0534/extensions/git/dist/main.js:2:2099522)
        at async j.status (/home/tms/.vscode-server/bin/97dec172d3256f8ca4bfb2143f3f76b503ca0534/extensions/git/dist/main.js:2:208574

This complaint suggests that vscode-server (the nodejs process that runs VSCode on the remote machine) used all the memory on the server.
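A quick way to test that theory the next time a hang starts (nothing VSCode-specific, just procps):

```shell
# Top memory consumers by resident set size (RSS, in KiB). If a node
# process from vscode-server sits at the top with RSS near total RAM,
# that supports the ENOMEM theory.
ps -eo pid,rss,comm --sort=-rss | head -n 10
```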

I’ve had multiple hangs per day (at least a dozen), and this is the only log entry that contains “ENOMEM”.

That timestamp (2023-02-02 17:40:22) corresponded to a hang I experienced yesterday.

I’m seeing this multiple times a day; it’s a real hassle.

I invite the guidance of this community about how best to proceed.

How much memory does this instance have? It is very likely the system is running out of memory and processes are being killed as a result.
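A couple of quick checks for that (the log path assumes the RHEL-family syslog location used earlier in this thread):

```shell
# Total / used / available memory in MiB.
free -m

# Any evidence the kernel OOM killer has fired; on Rocky the kernel
# log lands in /var/log/messages (dmesg works too). `|| true` keeps
# the check from failing when there are no matches.
grep -iE 'out of memory|oom-killer' /var/log/messages 2>/dev/null || true
```

Small EC2 instances often have no swap configured, so a memory spike in one process can push the whole system into OOM territory very quickly.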