I frequently lose all SSH connectivity to my AWS EC2 instance running Rocky Linux v8.6:
# cat /etc/os-release
NAME="Rocky Linux"
VERSION="8.6 (Green Obsidian)"
ID="rocky"
ID_LIKE="rhel centos fedora"
VERSION_ID="8.6"
PLATFORM_ID="platform:el8"
PRETTY_NAME="Rocky Linux 8.6 (Green Obsidian)"
ANSI_COLOR="0;32"
CPE_NAME="cpe:/o:rocky:rocky:8:GA"
HOME_URL="https://rockylinux.org/"
BUG_REPORT_URL="https://bugs.rockylinux.org/"
ROCKY_SUPPORT_PRODUCT="Rocky Linux"
ROCKY_SUPPORT_PRODUCT_VERSION="8"
REDHAT_SUPPORT_PRODUCT="Rocky Linux"
REDHAT_SUPPORT_PRODUCT_VERSION="8"
When this happens, it sometimes heals itself after an extended period (at least 30 minutes). I can generally get things going again by doing a shutdown and restart, but that’s very inconvenient.
This may be a symptom of some other issue: it always happens while I’m doing remote debugging using VSCode and JavaScript/Node.js. I know that VSCode currently leaves dozens, and sometimes hundreds, of ports in TIME_WAIT; I’m in communication with the VSCode team about addressing that.
Nevertheless, I don’t think this should kill ALL ssh access to the affected system.
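For anyone who wants to check the TIME_WAIT count on their own system while reproducing this, here is a minimal Python sketch. It assumes the standard Linux /proc/net/tcp layout, where the fourth whitespace-separated column is the socket state in hex and 06 means TIME_WAIT. The function names are my own, for illustration only.

```python
from pathlib import Path

TIME_WAIT = "06"  # hex state code used for TIME_WAIT in /proc/net/tcp


def count_time_wait(proc_text: str) -> int:
    """Count TIME_WAIT sockets in text formatted like /proc/net/tcp."""
    count = 0
    for line in proc_text.splitlines()[1:]:  # skip the header row
        fields = line.split()
        # fields: slot, local address, remote address, state, ...
        if len(fields) > 3 and fields[3] == TIME_WAIT:
            count += 1
    return count


def count_time_wait_on_host() -> int:
    """Sum TIME_WAIT sockets over the IPv4 and IPv6 tables, if present."""
    total = 0
    for name in ("/proc/net/tcp", "/proc/net/tcp6"):
        path = Path(name)
        if path.exists():
            total += count_time_wait(path.read_text())
    return total


if __name__ == "__main__":
    print(count_time_wait_on_host())
```

Running this in a loop during a debugging session, and again when SSH stops responding, would show whether the lockups line up with a spike in TIME_WAIT sockets.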
I’ve found entries in /var/log and /var/log/sssd that apparently correspond to the failure. In /var/log/messages, I see the following complaint with a timestamp that matches when I observed the failure:
Jan 24 17:10:53 byron sssd[886]: Child [902] ('nss':'nss') was terminated by own WATCHDOG. Consult corresponding logs to figure out the reason.
In /var/log/sssd/sssd.log, again at or about the same timestamp, I see the following:
(2023-01-24 17:10:50): [sssd] [svc_child_info] (0x0020): Child [902] ('nss':'nss') was terminated by own WATCHDOG
********************** PREVIOUS MESSAGE WAS TRIGGERED BY THE FOLLOWING BACKTRACE:
<elided>
* (2023-01-24 17:10:50): [sssd] [mt_svc_exit_handler] (0x1000): SIGCHLD handler of service nss called
* (2023-01-24 17:10:50): [sssd] [svc_child_info] (0x0020): Child [902] ('nss':'nss') was terminated by own WATCHDOG
********************** BACKTRACE DUMP ENDS HERE *********************************
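To line these terminations up with the times SSH became unreachable, a small Python sketch like the following could pull every watchdog entry out of a log file. It assumes the message wording shown in the excerpts above. The regular expression and the function name are mine, for illustration, and are not part of sssd.

```python
import re

# Matches lines such as:
#   Child [902] ('nss':'nss') was terminated by own WATCHDOG
WATCHDOG_RE = re.compile(
    r"Child \[(?P<pid>\d+)\] \('(?P<service>[^']+)':'[^']+'\) "
    r"was terminated by own WATCHDOG"
)


def find_watchdog_kills(log_text):
    """Return (pid, service, full_line) for each watchdog termination found."""
    hits = []
    for line in log_text.splitlines():
        match = WATCHDOG_RE.search(line)
        if match:
            hits.append(
                (int(match.group("pid")), match.group("service"), line.strip())
            )
    return hits


if __name__ == "__main__":
    # Reading /var/log/messages usually requires root.
    with open("/var/log/messages", errors="replace") as handle:
        for pid, service, line in find_watchdog_kills(handle.read()):
            print(pid, service, line)
```

The same function should also work on /var/log/sssd/sssd.log, since those entries use the same wording.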
I see an apparent reference to this issue in a Red Hat Bugzilla report, but the thread doesn’t seem helpful.
I found an alleged fix but can’t read it because I don’t have an enterprise account at Red Hat.
So far as I can tell, there is no /etc/sssd.conf on my system.
I appreciate any guidance the community can offer about a fix or workaround. I haven’t found any source that describes the origin or originator of this “own WATCHDOG” termination.