Unable to ssh (client_loop: send disconnect: Broken pipe) or login to Rocky Linux via console

Hi guys, have a very strange situation. I have about a dozen VMs running Rocky 8.5 in my lab and only one of them has this issue. The VMs run an Elasticsearch cluster with Fleet server and the problem VM runs Kibana. These are all running via VirtualBox.

For many months I was able to log into this VM normally, as with all of the other VMs. Recently I discovered that I cannot ssh into the server from any of 3 different clients. I have tried key auth and also user/pass. ssh -vvv shows that it connects, authenticates successfully, and then disconnects with this error:

debug1: Authentication succeeded (publickey).
Authenticated to 192.168.1.190 ([192.168.1.190]:22).
debug1: channel 0: new [client-session]
debug3: ssh_session2_open: channel_new: 0
debug2: channel 0: send open
debug3: send packet: type 90
debug1: Requesting no-more-sessions@openssh.com
debug3: send packet: type 80
debug1: Entering interactive session.
debug1: pledge: network
debug3: send packet: type 1
client_loop: send disconnect: Broken pipe

I have found some similar posts via Google search where this issue was resolved by either updating the OS or tweaking the ssh config. However, I am also unable to log into the VM on the console. I have tried the root user and a non-root user, using correct passwords verified on my base OS image, but none of the users can log in. Therefore, I am unable to update the OS or modify the ssh config. I also can’t view any logs on the system.

I do happen to have the Elastic agent on this VM and it is collecting auditd logs. Can anyone help me figure out how to recover this VM’s login capability?

Note: I have already followed a process for resetting the root password. That didn’t resolve it. I have also tried connecting with this ssh modification, but it didn’t work either.
robert@Ubuntu:~$ ssh -o ServerAliveInterval=600 robert@192.168.2.190
client_loop: send disconnect: Broken pipe

Last info to add, it’s possible this started after a dnf update. I updated all my VMs a week or two ago including this one, and this problem seems to fit within that timeframe.

Thanks,
Robert

Edit: I guess it’s also possible this could be related to SELinux. I was toggling SELinux from enforcing to disabled on a couple VMs trying to trigger an Elastic Agent rule. This VM could have been one of the ones I was using for that test.

Also, I did boot into the last 2 older kernels and the rescue option, but wasn’t able to log in with any of them.

I think I might have found the issue. I used a LiveCD to boot in the VirtualBox VM and was able to look at the /var/log/secure file. I noticed something

The screenshot shows login attempts for the 3 users that should work, and that do work on other VMs built from the same base OS image. The log says “No such file or directory”. So I checked in /home, and there are in fact no user directories there, unlike on the other VMs.

If I try to recreate the directories by hand, can anyone help me out with the minimum set of files needed to allow a password login?

Thanks for any help.

Final update. I realized while using the LiveCD I could copy my kibana configuration and related files (certs) off the VM. The quickest resolution for me was to just spin up a new VM and install kibana and move the config/certs into place and start it up.

I have kept the original VM and might try to do some more digging as time permits. I’m really curious how all the users’ home directories got removed. If anyone has any suggestions on how to approach getting logged into that VM, please let me know.

Thanks and Regards

Going out on a limb here as I haven’t set up a VM in a long time but I’m thinking that within the VM there is a separate “/home” volume that for some reason is now not getting mounted.

Thanks for the suggestion. I don’t seem to be able to locate a /home dir anywhere on/off the system. I did copy the user folder from a working VM (remember they came from the same base OS image), but that hasn’t worked either. I am seeing an error in /var/log/secure that says the tmpfs for the user(s) doesn’t exist. Maybe that’s a related cause? In the screenshot, the left side is the problem VM’s secure log, showing it can’t find /run/user/1000 or /run/user/0 (root). On the right is a working VM that lists tmpfs locations for those 2 users, who are logged in.

Any ideas if I could try anything related to that?

My original assessment might have been wrong, since usually if a device in “/etc/fstab” is not found the boot will hang and not complete. Since that is not the case here and the VM does boot, something else is wrong.

This doesn’t work because “/home” in the root volume is just a mount point for the “home volume”, as seen in the right-hand image. So when the system starts, anything you copied to “/home” on the root volume is hidden underneath the mounted volume. You need to figure out how to find and mount the “/home” volume while using the install disk rescue utilities so you can see what is wrong with the users’ folders.
The other thing to point out is that the output from the “secure” log is a symptom of the primary problem, so it really isn’t going to help in the solution. It would be good if you had a Rocky install disk instead of the Ubuntu one, just so there is no compatibility problem. To get at more helpful diagnostic information you need to follow the instructions to “chroot mount” the root partition. Then use the tool journalctl -b to read the boot messages. If selinux is running by default then you need to look at its log output also.
Another approach when starting this VM is to disable selinux at boot by appending “selinux=0” to the kernel command line. Then see if you can ssh in to your user.
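
A rough sketch of the “chroot mount” step mentioned in this reply, assuming it is run as root from a rescue or live environment. The volume names (rl-root, rl-home) and /dev/sda1 are taken from the lsblk output posted later in this thread; the mount point /mnt/sysroot and the function names are only illustrative. By default it just prints the commands, and APPLY=1 makes it actually run them. One caveat: journalctl can only show earlier boots of the installed system if persistent journal storage is enabled there, so the sketch reads the journal directory directly and also looks at /var/log/secure.

```shell
#!/bin/sh
# Dry-run helper: print each command unless APPLY=1 is set in the environment.
run() {
    if [ "${APPLY:-0}" = "1" ]; then
        "$@"
    else
        echo "+ $*"
    fi
}

# Mount the installed system under a target directory.
# The default target is an assumption, not something from the thread.
rescue_mount() {
    target=${1:-/mnt/sysroot}
    run vgchange -ay rl                            # activate the "rl" volume group
    run mkdir -p "$target"
    run mount /dev/mapper/rl-root "$target"        # root filesystem
    run mount /dev/mapper/rl-home "$target/home"   # separate /home volume
    run mount /dev/sda1 "$target/boot"
    for fs in dev proc sys; do
        run mount --bind "/$fs" "$target/$fs"      # needed for a usable chroot
    done
}

# Look at the installed system's logs and home directories without booting it.
rescue_logs() {
    target=${1:-/mnt/sysroot}
    # This directory only exists if persistent journald storage is enabled on the VM.
    run journalctl -D "$target/var/log/journal" --list-boots
    run tail -n 50 "$target/var/log/secure"
    run ls -l "$target/home"
}
```

Calling rescue_mount and then rescue_logs prints the plan; once the device names look right, the same calls with APPLY=1 do the work, and chroot /mnt/sysroot then gives a shell inside the installed system.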

Thanks for the additional information jbkt23. I will try to get a rocky rescue disk going and look into mounting the /home as you mention.

Did you try my last suggestion, which was to append “selinux=0” to the kernel command line from the grub menu? You do this by typing the letter ‘e’ on the selected kernel entry, then going to the end of the line that begins with “linux” and typing a space followed by “selinux=0”. After you have done that, press Ctrl+x to start. Then try to log in.
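
After booting with the edited entry, one way to confirm the parameter actually took effect is to look at /proc/cmdline. The helper below is only a sketch and its name is made up. It also recognises enforcing=0, a kernel parameter that boots SELinux in permissive mode instead of switching it off completely, which may be a gentler first test.

```shell
#!/bin/sh
# Report which SELinux override, if any, is on the kernel command line.
# Reads /proc/cmdline by default; a different file can be passed for testing.
selinux_boot_mode() {
    cmdline=$(cat "${1:-/proc/cmdline}")
    # Pad with spaces so only whole parameters match.
    case " $cmdline " in
        *" selinux=0 "*)   echo "disabled (selinux=0)" ;;
        *" enforcing=0 "*) echo "permissive (enforcing=0)" ;;
        *)                 echo "no SELinux override on the kernel command line" ;;
    esac
}
```

Running selinux_boot_mode with no arguments after logging in shows whether the one-off grub edit was picked up for the current boot.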

I made some progress. Modifying the boot loader line as you indicated allowed me to log into the system. All user directories were in /home/

I tested selinux a couple times and if set to disabled I can log in, but if set to enforcing (as it is on all the other systems in the cluster) I cannot log in. Is it correct to say this is a selinux related issue? Can it or does it need to be reset somehow?
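
One possible explanation, not confirmed anywhere in this thread: files created while SELinux was disabled carry no labels, and enforcing mode can then block logins until the filesystem has been relabelled. The sketch below walks through the usual relabel procedure, assuming root access (for example after booting with selinux=0). It prints the commands unless APPLY=1 is set, and the function names are illustrative.

```shell
#!/bin/sh
# Dry-run helper: print each command unless APPLY=1 is set in the environment.
run() {
    if [ "${APPLY:-0}" = "1" ]; then
        "$@"
    else
        echo "+ $*"
    fi
}

# Preview which labels under a path would change, without modifying anything.
preview_labels() {
    run restorecon -Rnv "${1:-/home}"
}

# Schedule a full relabel for the next boot, then reboot.
schedule_relabel() {
    run fixfiles -F onboot     # creates /.autorelabel so the next boot relabels everything
    run systemctl reboot
}

# After the relabel boot: report the current SELinux mode.
check_mode() {
    run getenforce
    run sestatus
}
```

The relabel only runs if SELinux is enabled on that boot, so /etc/selinux/config should say permissive or enforcing and the selinux=0 parameter must not be on the kernel line. Relabelling a whole disk can take a while.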

==========================
Edit, now I’m thinking disabling selinux just gets me into the system. I haven’t done this low level type of rescue before so am discovering along the way… I’m downloading the rocky boot iso to see if that’s a liveCD/rescue type option.

With selinux disabled, the disk info looks like this:

[robert@lab2-ki02 ~]$ lsblk
NAME        MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
sda           8:0    0   60G  0 disk 
├─sda1        8:1    0    1G  0 part /boot
└─sda2        8:2    0   59G  0 part 
  ├─rl-root 253:0    0   37G  0 lvm  /
  ├─rl-swap 253:1    0    4G  0 lvm  [SWAP]
  └─rl-home 253:2    0 18.1G  0 lvm  /home
sr0          11:0    1 1024M  0 rom  
[robert@lab2-ki02 ~]$ df -h
Filesystem           Size  Used Avail Use% Mounted on
devtmpfs             1.9G     0  1.9G   0% /dev
tmpfs                1.9G     0  1.9G   0% /dev/shm
tmpfs                1.9G  8.5M  1.9G   1% /run
tmpfs                1.9G     0  1.9G   0% /sys/fs/cgroup
/dev/mapper/rl-root   37G  5.6G   32G  16% /
/dev/sda1           1014M  333M  682M  33% /boot
/dev/mapper/rl-home   19G  1.7G   17G  10% /home
tmpfs                374M     0  374M   0% /run/user/1000

I just did a search for “selinux error log” and came up with a number of hits, two of them being RH tutorials. Here’s a link to one:
Selinux Troubleshooting

I don’t use selinux so I can’t be of further help. But I believe it has its place in multi-user serving environments. Most of the professionals who provide their assistance here on this forum advocate its use and knowledge thereof.
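
On the log side, SELinux denials are normally recorded in /var/log/audit/audit.log as lines containing “avc:  denied”. The function below is a small sketch (the name is hypothetical) that counts denials by command, target context and object class, which can make a mislabelled directory tree stand out. ausearch -m AVC is the more complete tool if it is installed.

```shell
#!/bin/sh
# Summarise SELinux AVC denials from an audit log.
# Output columns: count, command, target context, object class.
avc_summary() {
    log=${1:-/var/log/audit/audit.log}
    grep 'avc: .*denied' "$log" |
        sed -n 's/.*comm="\([^"]*\)".*tcontext=\([^ ]*\).*tclass=\([^ ]*\).*/\1 \2 \3/p' |
        sort | uniq -c | sort -rn
}
```

Run as root, avc_summary reads the default audit log; a path can be passed instead, for example a copy of audit.log taken from the rescue environment.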

Sorry for the delay. Thanks for the additional information. I’ll look into that.
Really appreciate the information/advice.