Rocky Server Crashed This Morning

My personal server crashed this morning at about 5:30 AM. I was running a dnf update and it failed partway through, saying it couldn’t write a file. The whole OS drive was offline and dead. It was a 120GB SSD that was several years old. This server has been upgraded from CentOS 7 to 8.2, 8.3, CentOS Stream, and then Rocky, so it was time for a good nuke and pave anyway. Luckily I keep copies of my smb.conf, iptables, main.cf, and crontab on my backup drive, so restoring it to operation was a breeze. The main RAID survived unharmed and mounted right back up, so I didn’t have to rsync 8TB from my offsite backup server, which was nice!


It sounds like your SSD failed.

It’s good that you had backups.

By the way, if you put your SSDs in RAID, such as an mdadm RAID1 across two SSDs for ‘/boot’, ‘/boot/efi’, and ‘/’, the system will keep running after one SSD fails.

Yes, but it’s worth making sure you keep the logs from just before the “crash” and note the exact error messages. If it’s a failed SSD, there should be a huge number of block I/O messages (unless the logs were on the same physical drive).
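If the logs do survive, a quick tally of block I/O errors per device can show which drive was failing. Here is a minimal Python sketch of that idea. The function name `count_io_errors` and the regular expression are my own assumptions, and the exact wording of kernel I/O error lines varies between kernel versions, so treat the pattern as a starting point rather than a complete matcher.

```python
import re
from collections import Counter

# Assumed pattern: matches kernel lines containing text such as
# "I/O error, dev sda, sector ..." or "Buffer I/O error on dev sda1, ...".
# Other kernel versions may word these messages differently.
IO_ERROR = re.compile(r"I/O error(?:,| on) dev(?:ice)? ([\w-]+)")


def count_io_errors(lines):
    """Return a Counter mapping device name -> number of I/O error lines."""
    counts = Counter()
    for line in lines:
        match = IO_ERROR.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts
```

One way to use it would be to save the kernel log to a file (for example with `journalctl -k > kernel.log`), then call `count_io_errors(open("kernel.log"))` and print the result of `.most_common()`. A device with hundreds of hits is a strong hint of where the problem is.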

Yeah, /var/log was on the boot drive as well. I confirmed the drive was bad by hooking it up to my Windows laptop with a USB adapter and running some diagnostics on it. I keep really good backups of my data. A nightly script copies all of my configuration files to /data on the RAID drives, which is a LUKS-encrypted volume, and another script rsyncs that to an external drive, also LUKS-encrypted. I rotate the external drive out once per week to a safety deposit box.
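For anyone wanting to do something similar, the config-copy step might look roughly like the Python sketch below. This is not the actual script described above. The file paths in `CONFIG_FILES` are assumed default locations for the files named earlier in the thread, and the function name `backup_configs` is hypothetical. Adjust both to your own system.

```python
import shutil
from pathlib import Path

# Assumed locations; these are guesses at typical defaults, not the
# poster's actual paths.
CONFIG_FILES = [
    "/etc/samba/smb.conf",
    "/etc/sysconfig/iptables",
    "/etc/postfix/main.cf",
    "/etc/crontab",
]


def backup_configs(files, dest_root):
    """Copy each existing file under dest_root, keeping its full path.

    Returns the list of destination paths that were written.
    """
    dest_root = Path(dest_root)
    copied = []
    for name in files:
        src = Path(name)
        if not src.is_file():
            continue  # skip anything not present on this machine
        # "/etc/crontab" becomes "<dest_root>/etc/crontab"
        dest = dest_root / src.relative_to(src.anchor)
        dest.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src, dest)  # copy2 also preserves timestamps and mode
        copied.append(dest)
    return copied
```

Calling `backup_configs(CONFIG_FILES, "/data/configs")` from a nightly cron job would cover the first step (the `configs` subdirectory is just an example name). The rsync to the external drive is a separate step and is not shown here.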

The external drive came in handy about 5 years ago when the feds kicked down my door, took my family out in handcuffs at gunpoint, and took all my computer equipment. Back then I didn’t do encryption, but I do today. I was never charged (because I never did anything illegal), but I never got any of my equipment back. Luckily I had my backup drive in my safety deposit box so I could rebuild everything.

I have found software RAID1 on the boot partition to be problematic at best; it never works quite right. I would rather use a single drive, or a hardware RAID controller if I had one that worked, but my motherboard was built for Windows and no Linux driver was ever created for its built-in RAID. So… I just keep a document with lots of notes on what I need to install, and I copy all of my config files over, so I can be back up and running in a few hours, which is acceptable for a home server.

That is great. Config management systems (e.g. Ansible, Chef, Puppet, Salt) are a machine-actionable version of those notes. Worth a look.