Sshd core dumps at login attempts after upgrade to Rocky 8.7

GregTourte · November 24, 2022, 5:00pm

Ever since the upgrade to 8.7, sshd dumps core after a connection attempt from any other computer on the network.
From the user’s point of view on computer B, all you get is

kex_exchange_identification: read: Connection reset by peer

followed by a

ssh: connect to host computerA port 22: Connection refused

if further connections are attempted before systemctl restarts the daemon.

At first I thought it was due to corruption during the update, as the machine was being hammered by users and ran out of memory, so I reinstalled openssh-server and all its dependencies, but it didn’t change anything.
I haven’t been able to find the coredump file (or enable its creation), but I am not even sure I would be able to interpret it anyway.
Any pointers would be appreciated.

gerry666uk · November 24, 2022, 7:13pm

I’m assuming there are two computers, but who is running what, and who got upgraded?

Can you rewrite this with terms like “Computer A” and “Computer B”, and which one is dumping core, and if you can’t find the core dump, how did you know it happened?

GregTourte · November 24, 2022, 8:19pm

The server (computer A) is the one that got upgraded from 8.6 to 8.7, and it is the one whose sshd listener crashes every time a connection is attempted from any other machine on the network.

The coredump is mentioned by systemctl status sshd on computer A. I looked in /var/lib/systemd/coredump/ but that is empty.

jlehtone · November 24, 2022, 9:06pm

Is there anything else before the coredump message?
sudo journalctl -u sshd.service shows more lines from the log than systemctl status does.

gerry666uk · November 24, 2022, 10:28pm

Ok, the post is clear now.

This sounds serious, assuming you are running only official Rocky packages.

In addition to journalctl, it’s also worth checking /var/log/messages, as there may be a warning about the core dump not being written because of limits x, y, z.

In my case, I’ve upgraded Rocky on real hardware (and virtual machine) from 8.6 to 8.7, and have used ssh at least 20 times, and have not seen this…

GregTourte · November 25, 2022, 1:05am

journalctl doesn’t really offer much info to be honest:

Nov 24 16:56:11 computerA systemd[1]: Starting OpenSSH server daemon...
Nov 24 16:56:11 computerA systemd[1]: Started OpenSSH server daemon.
Nov 24 19:18:02 computerA systemd-coredump[2358]: Process 2261 (sshd) of user 0 dumped core.
Nov 24 19:18:02 computerA systemd[1]: sshd.service: Main process exited, code=dumped, status=11/SEGV
Nov 24 19:18:02 computerA systemd[1]: sshd.service: Failed with result 'core-dump'.
Nov 24 19:18:44 computerA systemd[1]: sshd.service: Service RestartSec=42s expired, scheduling restart.
Nov 24 19:18:44 computerA systemd[1]: sshd.service: Scheduled restart job, restart counter is at 3.
Nov 24 19:18:44 computerA systemd[1]: Stopped OpenSSH server daemon.

/var/log/messages does mention limits, and I have added a line in limits.d/coredump.conf to set it to unlimited, but it probably wasn’t enough.

felfert · November 26, 2022, 10:31am

Regarding handling of the actual coredump (regardless where it is):

There is a utility named coredumpctl which should be installed by default as it is part of systemd.
Invoked without any parameters, it lists any core dumps on the system. If there are no core dumps, it tells you that as well.
You probably want to read its manpage coredumpctl.1

Note: For some of its features, you also need gdb installed.
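
For illustration, a minimal sketch of how coredumpctl might be used to look at the most recent sshd crash. The function name latest_sshd_core is made up for this example; the list and info verbs and the -1 option are the ones described in the coredumpctl manpage, and what they print depends on the system.

```shell
# Hypothetical helper: list sshd core dumps and show details of the newest one.
latest_sshd_core() {
    # Bail out early if coredumpctl is not installed on this machine.
    if ! command -v coredumpctl >/dev/null 2>&1; then
        echo "coredumpctl not found" >&2
        return 127
    fi
    # Match core dumps by executable path; fails if none are recorded.
    coredumpctl list /usr/sbin/sshd || return 1
    # -1 limits the output to the most recent matching core dump.
    coredumpctl -1 info /usr/sbin/sshd || return 1
}
```

Run latest_sshd_core as root on the server that crashes; if nothing is listed, the core dump was probably never stored.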

GregTourte · November 28, 2022, 1:51pm

Thanks for the pointer to coredumpctl. After more googling (installing abrt, starting abrtd and adding a few lines in sysctl.conf), I finally got a core dump file.

coredumpctl returns

           PID: 8334 (sshd)
           UID: 0 (root)
           GID: 0 (root)
        Signal: 11 (SEGV)
     Timestamp: Mon 2022-11-28 13:24:39 GMT (23min ago)
  Command Line: /usr/sbin/sshd -D -E /var/log/sshd_error.log
    Executable: /usr/sbin/sshd
 Control Group: /system.slice/sshd.service
          Unit: sshd.service
         Slice: system.slice
       Boot ID: 142a290ffbc342d581f032731029833e
    Machine ID: c4f796fe98e64eaf8ed6caf1a22776ea
      Hostname: computerA
       Storage: /var/lib/systemd/coredump/core.sshd.0.142a290ffbc342d581f032731029833e.8334.1669641879000000.lz4
       Message: Process 8334 (sshd) of user 0 dumped core.
                
                Stack trace of thread 8334:
                #0  0x00007f1461f1caff raise (libc.so.6)
                #1  0x000055c3dee21677 sshbuf_len (sshd)
                #2  0x000055c3dee2a2c6 sshbuf_put_stringb (sshd)
                #3  0x000055c3dedf0c7f send_rexec_state (sshd)
                #4  0x000055c3dedef126 main (sshd)
                #5  0x00007f1461f08d85 __libc_start_main (libc.so.6)
                #6  0x000055c3dedf074e _start (sshd)

Running gdb on the file, it complains that there aren’t any debug symbols in the binary and says to run:
yum debuginfo-install openssh-server
which of course I then did; however, this comes back with no package found.
Still, running bt full doesn’t return anything more than coredumpctl.

felfert · November 28, 2022, 2:10pm

Try enabling the relevant debug repository like this:
dnf --enablerepo baseos-debug debuginfo-install openssh-server

=> If this still is not enough, you might need to enable even more debug repositories
=> You can list an overview of all repos using dnf repolist --all

GregTourte · November 28, 2022, 2:58pm

Ah indeed, the previous command automatically enabled the epel debuginfo repo, so I (wrongly) assumed it was enough.

After also installing all the other debuginfo packages that it eventually requested, I got the backtrace output. As I suspected, it doesn’t mean much to me though:

GNU gdb (GDB) Red Hat Enterprise Linux 8.2-19.el8
Copyright (C) 2018 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from /usr/sbin/sshd...Reading symbols from /usr/lib/debug/usr/sbin/sshd-8.0p1-16.el8.x86_64.debug...done.
done.
[New LWP 8677]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `/usr/sbin/sshd -D -E /var/log/sshd_error.log'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  __GI_raise (sig=sig@entry=11) at ../sysdeps/unix/sysv/linux/raise.c:50
50	  return ret;
#0  __GI_raise (sig=sig@entry=11) at ../sysdeps/unix/sysv/linux/raise.c:50
        set = {__val = {0, 139663504466158, 6, 139663504466280, 0, 11407368291814832640, 6, 94225421477760, 1, 9, 0, 7, 0, 0, 
            0, 0}}
        pid = <optimized out>
        tid = <optimized out>
        ret = <optimized out>
#1  0x000055b2909d9677 in sshbuf_check_sanity (buf=0x0) at sshbuf.c:46
No locals.
#2  sshbuf_check_sanity (buf=0x0) at sshbuf.c:32
No locals.
#3  sshbuf_len (buf=buf@entry=0x0) at sshbuf.c:258
No locals.
#4  0x000055b2909e22c6 in sshbuf_put_stringb (buf=0x55b290d052b0, v=0x0) at sshbuf-getput-basic.c:374
No locals.
#5  0x000055b2909a8c7f in send_rexec_state (fd=9, conf=0x55b290cfa260) at sshd.c:945
        m = 0x55b290cfc0b0
        inc = 0x55b290d052b0
        item = 0x55b290cfb770
        r = <optimized out>
        __func__ = "send_rexec_state"
#6  0x000055b2909a7126 in server_accept_loop (ssh=0x0, config_s=0x7ffdfd591820, newsock=<synthetic pointer>, 
    sock_out=<synthetic pointer>, sock_in=<synthetic pointer>) at sshd.c:1381
        fdset = <optimized out>
        j = <optimized out>
        maxfd = 7
        listening = <optimized out>
        from = {ss_family = 2, 
          __ss_padding = "\204&\254\023 \313\000\000\000\000\000\000\000\000\002\000\000\000\000\000\000\000\311A", '\000' <repeats 22 times>, "\006\000\000\000\000\000\000\000\000\020", '\000' <repeats 14 times>, ")\217\177c\000\000\000\000\223*~\r\000\000\000\000@\354\064c", '\000' <repeats 12 times>, "\210\205\177c\000\000\000\000@\020\316!\000\000\000", __ss_align = 0}
        pid = 8679
        rnd = "::\000.0.0\000\210\037\000\362\005\177\000\000\360\032Y\375\375\177\000\000\a\000\000\000\000\000\000\000\360\024\000\362\005\177\000\000\a\000\000\000\b\000\000\000\060\372\"\362\005\177\000\000\314\065\001\362\005\177\000\000\000\000\000\000\000\000\000\000xG\001\362\005\177\000\000\377\377\377\377\000\000\000\000P\245!\362\005\177\000\000 q\236\357\005\177\000\000\000\000\000\000\000\000\000\000\300\032Y\375\375\177\000\000\001\000\000\000\000\000\000\000(\037\000\362\005\177\000\000\000\000\000\000\000\000\000\000\030\227!\362\005\177\000\000(\037\000\362\005\177\000\000\300\032Y\375\375\177\000\000`\027\000\362\005\177\000\000x\005#\362\005\177\000\000\000\020\000\362\005\177\000\000\274\031\000\362\005\177\000\000"...
        ret = <optimized out>
        startups = 1
        fromlen = 16
        i = 0
        lameduck = 0
        startup_p = {7, 8}
        c = 0 '\000'
        fdset = <optimized out>
        i = <optimized out>
        j = <optimized out>
        ret = <optimized out>
        maxfd = <optimized out>
        startups = <optimized out>
        listening = <optimized out>
        lameduck = <optimized out>
        startup_p = <optimized out>
        c = <optimized out>
        from = <optimized out>
        fromlen = <optimized out>
        pid = <optimized out>
        rnd = <optimized out>
        __func__ = "server_accept_loop"
        laddr = <optimized out>
        raddr = <optimized out>
#7  main (ac=<optimized out>, av=<optimized out>) at sshd.c:2082
        ssh = 0x0
        r = <optimized out>
        opt = <optimized out>
        on = 1
        already_daemon = <optimized out>
        remote_port = <optimized out>
        sock_in = -1
        sock_out = -1
        newsock = 6
        remote_ip = <optimized out>
        rdomain = <optimized out>
        fp = <optimized out>
        line = <optimized out>
        laddr = <optimized out>
        logfile = <optimized out>
        config_s = {9, 10}
        i = <optimized out>
        j = <optimized out>
        ibytes = 139663513882112
        obytes = 16
        new_umask = <optimized out>
        key = 0x55b290d05300
        pubkey = 0x55b290d054c0
        keytype = <optimized out>
        authctxt = <optimized out>
        connection_info = <optimized out>
        __func__ = "main"

gerry666uk · November 28, 2022, 8:19pm

Can you confirm that this was originally a clean install of Rocky 8.x (not upgraded or migrated), and that you don’t have any packages installed, other than those from the official Rocky repos? It’s almost as if one of the libraries is the wrong version (because otherwise lots of people would be seeing this same issue on 8.7)

Can you also check as root

rpm -q openssh-server
rpm --verify openssh-server
rpm -q glibc
rpm --verify glibc

gerry666uk · November 28, 2022, 8:43pm

Can you also check using a new key pair (just for now), and also clear out any known_hosts files. Clean up the keys on both client and server.
When you then try to re-connect, it should throw up a message asking if you trust the host. Check to see if it crashes before or after that message, then it will move on to verifying the new keys (and might crash at that point instead).

GregTourte · November 29, 2022, 1:20am

I did regenerate all the host keys before I opened this topic. It crashes before the message about trusting the new host.
Just to check again, I removed the entries in known_hosts on my client, and it does the same. After attempting connection, there are no new keys added in known_hosts (not that I was expecting any but I thought I’d check anyway).
As for other repos, only epel and mongodb repos are enabled outside of the official Rocky repos.
dnf also confirms that both openssh-server and glibc are installed from the baseos repo.

FrankCox · November 29, 2022, 3:21am

I don’t see where you have posted any relevant lines from /var/log/secure or /var/log/messages, or the output from ssh -vvv when you try to login.

felfert · November 29, 2022, 11:15am

Looking at the source, your backtrace suggests that some of the sshd config might be corrupt:

The code from frame #5 looks like this:

        /* pack includes into a string */
        TAILQ_FOREACH(item, &includes, entry) {
                if ((r = sshbuf_put_cstring(inc, item->selector)) != 0 ||
                    (r = sshbuf_put_cstring(inc, item->filename)) != 0 ||
                    (r = sshbuf_put_stringb(inc, item->contents)) != 0)
                        fatal("%s: buffer error: %s", __func__, ssh_err(r));
        }

Then in frame #4, sshbuf_put_stringb is invoked with a NULL pointer (v=0x0)

The &includes in the code above refers to any config file that might be included from sshd_config
So I would check /etc/ssh/sshd_config for corruption, or perhaps let it be recreated with the following steps:

mv /etc/ssh/sshd_config /etc/ssh/sshd_config.broken
dnf reinstall openssh-server

Note: Since it is a config-file, it would not be overwritten by dnf reinstall if it already exists, hence the mv

felfert · November 29, 2022, 11:45am

Perhaps, some more explanation is in order:

Whenever the main sshd process gets an incoming connection, it forks a child which will
handle that connection. Then, the main process communicates several config options, states etc
down to the child via a local pipe. The function send_rexec_state takes care of this communication.
Also, since this happens at a very early stage of the connection, using ssh -vvv on the client side
would not reveal much except that the connection was closed on the server side.

The actual coredump is triggered programmatically in sshbuf_check_sanity (frames #1 and #2 in your gdb backtrace; the raise call it makes is frame #0):

static inline int
sshbuf_check_sanity(const struct sshbuf *buf)
{
        SSHBUF_TELL("sanity");
        if (__predict_false(buf == NULL ||
            (!buf->readonly && buf->d != buf->cd) ||
            buf->refcount < 1 || buf->refcount > SSHBUF_REFS_MAX ||
            buf->cd == NULL ||
            buf->max_size > SSHBUF_SIZE_MAX ||
            buf->alloc > buf->max_size ||
            buf->size > buf->alloc ||
            buf->off > buf->size)) {
                /* Do not try to recover from corrupted buffer internals */
                SSHBUF_DBG(("SSH_ERR_INTERNAL_ERROR"));
                signal(SIGSEGV, SIG_DFL);
                raise(SIGSEGV);
                return SSH_ERR_INTERNAL_ERROR;
        }
        return 0;
}

GregTourte · November 29, 2022, 3:17pm

OK, I fixed it last night after looking up some of the error messages I got. Some forum was pointing to a faulty Match statement, which was not applicable in my case, but…

It turns out I had an include /etc/ssh/sshd_config.d/*.conf at the end of the config file, with that folder being empty. This same sshd_config (with the empty folder) was running fine before the update so it looks to me like a regression.

I have tested that same configuration on a vanilla openssh server version 8.8p1 on slackware64 and an empty include doesn’t trigger a segfault in the listener.
Additionally, checking the configuration with sshd -t also doesn’t trigger an error on rocky (or the vanilla slackware).
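
Since sshd -t does not flag this, a rough pre-flight check could look for Include patterns that currently match nothing. This is only a sketch: find_empty_includes is a made-up helper, it only handles simple whitespace-separated Include lines, and it resolves relative patterns against the config file’s directory (an assumption that matches the default /etc/ssh layout).

```shell
# Hypothetical helper: print every Include pattern in an sshd_config-style
# file that does not match any existing file right now.
find_empty_includes() {
    config=$1
    config_dir=$(dirname "$config")
    # Select non-comment Include lines (the keyword is case-insensitive).
    grep -iE '^[[:space:]]*include[[:space:]]' "$config" |
    while read -r _keyword patterns; do
        set -f                          # split into words without globbing yet
        for pattern in $patterns; do
            set +f                      # re-enable globbing for the match test
            case $pattern in
                /*) ;;                                  # absolute path: use as is
                *) pattern=$config_dir/$pattern ;;      # assumption: relative to config dir
            esac
            matched=no
            for path in $pattern; do    # unquoted on purpose: expands the glob
                [ -e "$path" ] && matched=yes && break
            done
            [ "$matched" = yes ] || printf '%s\n' "$pattern"
            set -f
        done
        set +f
    done
}
```

With the configuration described above, find_empty_includes /etc/ssh/sshd_config should print /etc/ssh/sshd_config.d/*.conf for as long as that folder is empty, and nothing once a matching file exists.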

Anyway, for now I have created an empty.conf file in that folder and updated my ansible role to do so as well, which completely fixes the problem for now. (Somehow, having a file that matches that include glob works, even if that file is empty.)
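
Outside of Ansible, the same workaround might be scripted roughly like this. The function name ensure_include_placeholder is made up for the example, and the default directory is the one from my Include line; adjust it if yours differs.

```shell
# Hypothetical helper: make sure the Include glob has at least one match by
# creating an empty placeholder file when the directory has no *.conf files.
ensure_include_placeholder() {
    dir=${1:-/etc/ssh/sshd_config.d}
    mkdir -p "$dir" || return 1
    for f in "$dir"/*.conf; do
        # If the glob matched a real file, there is nothing to do.
        [ -e "$f" ] && return 0
    done
    # No match: create an empty file so the glob expands to something.
    : > "$dir/empty.conf"
}
```

Run it as root and then restart sshd. It leaves the directory alone if any *.conf file is already there.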

Is this something that should be reported as a bug somewhere?

felfert · November 29, 2022, 3:48pm

Reliably reproducible here.

I would say yes.
Where to report depends on testing this on a real RHEL. If it segfaults there too,
then report it at bugzilla.redhat.com otherwise it would be some glitch in rocky and
then it should be reported at bugs.rockylinux.org.

olista · November 29, 2022, 4:06pm

The bug can be reproduced in RHEL 8.6 and 8.7.

jlehtone · November 29, 2022, 4:20pm

If one does not have RHEL 8, then reproducing in CentOS Stream 8 allows bug reports to RH bugzilla.

I was under the impression that openssh in el8 does not use /etc/ssh/sshd_config.d/
(I only noticed that configuration approach when el9 appeared, and definitely at that time
the el8 had only the /etc/ssh/ssh_config.d/ for ssh client.)

I’d guess that the change which added /etc/ssh/sshd_config.d/ to openssh also added a “dir can be empty” handler, and that the previous include implementation expects the glob to match “real files”.

In el9 the default installation does place a 50-*.conf there, and the include is the first setting in /etc/ssh/sshd_config.
Everything in /etc/ssh/sshd_config.d/ therefore supersedes the content of /etc/ssh/sshd_config.

The 50-*.conf further includes the system wide security policy.

If you opt in the installer to allow root ssh login with a password, then the installer adds another file to /etc/ssh/sshd_config.d/ that contains:

PermitRootLogin yes
