Rocky 9.5: I tried to change the order entry function to accommodate the tax rate changes here. Even after the link was changed to point to the new rate table and http was restarted, the old rates were still being served. I assumed this was a cache problem, so I rebooted, and the system is now unbootable. It gets as far as “A start job is running for /dev/disk/by-uuid/ [long string]”. Looking back through the boot messages, all of the hard disks do show up in the listing.
I have seen something similar before, and it was a bad disk, so I booted from the install media and ran smartctl -a on each disk; all of them show “passed”. I ran mdadm --detail on the md arrays: md124 to md127 all show as degraded with one disk removed, but with the array active and clean. md123 was on sdb and sdd before, and it now shows as not existing. There was a hot spare for all of the RAID arrays except md127 (boot), but it doesn’t seem to have been used.
The configuration is four 1 TB SSDs forming two groups of RAID 1 arrays: one pair holds a single array for the http software and scripts (on /usr), and the other two disks hold one md array for each system partition (/boot, /uefi, /home, etc.). There is a 3 TB HDD that holds the spare RAID partitions.
Succinctly: how can I interrupt the “start job is running…” so boot can finish and I can figure out which disk is or might be defective and fix it? I looked at lsblk, and it lists all of the partitions on all of the disks, but only the high-numbered md devices show up. The rest show as “partition”. When I did the December 1 backup I checked all disks and all md devices, and all were showing as ‘passed’ in smartctl and ‘active clean’ in mdadm. In the boot messages there was something about changing the md device sizes from 0 to some large number on several of the RAID arrays, but I did nothing that I would have expected to change sizes. The message scrolled too fast to read.
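For a quick overview of which arrays are degraded, /proc/mdstat can be scanned for an underscore in the member map (for example [U_]) instead of running mdadm --detail on each array. A minimal sketch; the function name degraded_arrays is made up for illustration:

```shell
# Hypothetical helper: read /proc/mdstat-style text on stdin and print
# the name of every array whose member map (e.g. [U_]) shows a missing disk.
degraded_arrays() {
    awk '
        # remember which array the following status line belongs to
        /^md[0-9]+ :/      { name = $1 }
        # an underscore in the member map marks a missing or failed member
        /\[[U_]*_[U_]*\]/  { print name }
    '
}

# Usage on a live or rescue system:
#   degraded_arrays < /proc/mdstat
```

An array that is not assembled at all (as md123 seems to be here) will not appear in /proc/mdstat, so it will not be listed by this check either.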
“RHCSA v8: Boot Targets, Systemd Targets and root Password Reset” (Victor's Blog) lists some ways to boot without starting all the normal services.
Thanks. I’m going to have to read this in detail; it’s all quite new to me.
I cannot get this to work. I set a target and boot says “booting a script”, but it always terminates at the same place with “A start job is running…”. I booted from the install media again; the network comes up and I can connect to the test server, but I can’t copy some important info (like the SSL key) to the test server so it can act as a backup for the unbootable one. It seems the mounted filesystem (/mnt/sysroot, apparently) is not the real filesystem on this machine. All of the disks still show as OK and active, but the machine still won’t boot. Any ideas where to go next?
Your problem sounds a bit similar to the issues described in this topic:
If the described solutions do not work for you, try booting into emergency.target and commenting out every entry in /etc/fstab besides /. Then reboot. If it works, the problem lies somewhere in fstab.
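As a rough sketch of the fstab step: the helper below prints a copy of an fstab file with every active entry except the root filesystem commented out. The function name comment_out_fstab and the .bak file name are only illustrative, not something from this thread. In the emergency shell the root filesystem is usually mounted read-only, so it will probably need remounting read-write first.

```shell
# Hypothetical helper: print FILE with every active fstab entry
# except the one mounted at / commented out.
comment_out_fstab() {
    awk '
        # keep comments and blank lines as they are
        /^[ \t]*#/ || NF == 0 { print; next }
        # keep the root filesystem entry (second field is the mount point)
        $2 == "/"             { print; next }
        # disable every other entry
                              { print "#" $0 }
    ' "$1"
}

# Possible use from the emergency shell (file names are illustrative):
#   mount -o remount,rw /
#   cp /etc/fstab /etc/fstab.bak
#   comment_out_fstab /etc/fstab.bak > /etc/fstab
```

Once the machine boots, the entries can be re-enabled one at a time to find the one that hangs.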
I did a journalctl -xb and, assuming that red is fatal, orange is error, and green is OK, there are several thousand lines of orange and hundreds of red; I stopped writing them down after several pages. None of the lines make any sense to me, BUT at one point there are a number of errors of the form “DeviceDisappeared event detected on md device md12x”. After several pages ALL the md devices had disappeared. But the hot spare is still present and active, so why does it not rebuild here? (The existing md devices are SSDs; the spare is an HDD.) All of the active md devices are shown as degraded, but there are also a lot of errors about missing files, such as “/sys/devices/virtual/block/md127/md/degraded”.
Since the data is apparently there, even if degraded, what should I do to recover data that had not yet been backed up when all this started? I need to get direct access to one of the mirrors, and they don’t seem to be mounted when the system is booted from the recovery media. I have been unable to make it boot to the emergency or rescue shell. Also, ssh logins are disabled (“connection rejected”) even though the machine seems to be on the network at #101 (its normal fixed address is #5).
If you want to access the data, you can mount the disks manually after booting from the install media. You have access to all the tools and disks there.
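A minimal sketch of what mounting manually can look like for one half of a RAID 1 mirror. The device name /dev/sdb2, the array name /dev/md123 and the mount point /mnt/recover are placeholders I made up; check lsblk and mdadm --examine for the real ones. With DRY_RUN=1 the helper only prints the commands, so the plan can be reviewed first.

```shell
# Run a command, or only print it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Hypothetical helper: start a mirror from a single member and mount it read-only.
# If the array is already listed as inactive, it may need "mdadm --stop ARRAY" first.
recover_mirror() {
    member=$1
    array=$2
    mountpoint=$3
    # show the RAID superblock to confirm this partition is an array member
    run mdadm --examine "$member" &&
    # start the array with fewer members than expected, read-only
    run mdadm --assemble --run --readonly "$array" "$member" &&
    run mkdir -p "$mountpoint" &&
    # mount the filesystem without writing to it
    run mount -o ro "$array" "$mountpoint"
}

# Review the plan first; drop DRY_RUN=1 to actually run it as root:
#   DRY_RUN=1 recover_mirror /dev/sdb2 /dev/md123 /mnt/recover
```

If the filesystem is XFS and was not cleanly unmounted, a plain read-only mount may be refused because the log cannot be replayed on a read-only device. In that case mount -o ro,norecovery is one option, with the caveat that the most recent changes may not be visible.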
If you cannot boot into emergency.target, check whether you can even boot into the initramfs (How to recover a root password in Red Hat-based Linux systems). If even this fails, that means /boot is somehow corrupt or inaccessible.
I was completely unable to bring the disks online and I have run out of time; this is just too complicated for a one-person shop. I replaced all of the disks with new WD Red HDDs. I would not recommend using SSDs in a server.
If someone on this thread can spare some time to provide advice: I had set things up on RAID 1 (mirror) with a hot spare, so why would I have such a complete failure without any recovery on any of the drives? Also, I bought a USB SATA reader, but even using that I was unable to bring any of the disks online on my workstation. Apparently the underlying format is not one of the standard ones, so the data is inaccessible. Would that be expected on a mirror?
Also, thanks to jlehtone and hs303 for attempting to help; it was appreciated.