Reboot leaves machine in emergency state - many start-limit-hit

I set the BIOS of my Rocky Linux 8.5 server (fresh install) to auto-reboot after a power-loss/power-on sequence, and I pulled the power cord to test the setup. There was no normal startup; I ended up in emergency mode.
After logging in, I ran # journalctl -xb and checked all the red lines in the (first part of the) logfile. There are a whole bunch of errors related to “start-limit-hit”, see below.
What is the real cause of the emergency state after the reboot?
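Side note for anyone reproducing this: instead of hunting for red lines, the failures can be listed directly. A minimal sketch, assuming a standard systemd setup:

journalctl -b -p err        # only error-priority messages from the current boot
systemctl --failed          # list every unit that ended up in the failed state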

mokvar: EFI MOKvar config table is not in EFI runtime memory
(I know this one is not a serious issue and is being worked on by Rocky; I show it only so as not to omit anything.)

-- The start-up result is done.
apr 03 17:49:02 server.hartings.se systemd[1]: mdadm-last-resort@md122.timer: Succeeded.
-- Subject: Unit succeeded
-- Defined-By: systemd
--
-- The unit mdadm-last-resort@md122.timer has successfully entered the 'dead' state.
apr 03 17:49:02 server.hartings.se systemd[1]: Stopped Timer to wait for more drives before activating degraded array md122…
-- Subject: Unit mdadm-last-resort@md122.timer has finished shutting down
-- Defined-By: systemd
--
-- Unit mdadm-last-resort@md122.timer has finished shutting down.
apr 03 17:49:02 server.hartings.se systemd[1]: mdadm-last-resort@md121.timer: Succeeded.
-- Subject: Unit succeeded
-- Defined-By: systemd
--
-- The unit mdadm-last-resort@md121.timer has successfully entered the 'dead' state.
apr 03 17:49:02 server.hartings.se systemd[1]: Stopped Timer to wait for more drives before activating degraded array md121…
-- Subject: Unit mdadm-last-resort@md121.timer has finished shutting down
-- Defined-By: systemd
--
-- Unit mdadm-last-resort@md121.timer has finished shutting down.
apr 03 17:49:02 server.hartings.se systemd[1]: mdadm-last-resort@md120.timer: Succeeded.
-- Subject: Unit succeeded
-- Defined-By: systemd
--
-- The unit mdadm-last-resort@md120.timer has successfully entered the 'dead' state.
apr 03 17:49:02 server.hartings.se systemd[1]: Stopped Timer to wait for more drives before activating degraded array md120…
-- Subject: Unit mdadm-last-resort@md120.timer has finished shutting down
-- Defined-By: systemd
--
-- Unit mdadm-last-resort@md120.timer has finished shutting down.
apr 03 17:49:02 server.hartings.se systemd[1]: systemd-binfmt.service: Start request repeated too quickly.
apr 03 17:49:02 server.hartings.se systemd[1]: systemd-binfmt.service: Failed with result 'start-limit-hit'.
-- Subject: Unit failed
-- Defined-By: systemd
--
-- The unit systemd-binfmt.service has entered the 'failed' state with result 'start-limit-hit'.
apr 03 17:49:02 server.hartings.se systemd[1]: Failed to start Set Up Additional Binary Formats.
-- Subject: Unit systemd-binfmt.service has failed
-- Defined-By: systemd
--
-- Unit systemd-binfmt.service has failed.
--
-- The result is failed.
apr 03 17:49:02 server.hartings.se systemd[1]: ostree-remount.service: Start request repeated too quickly.
apr 03 17:49:02 server.hartings.se systemd[1]: ostree-remount.service: Failed with result 'start-limit-hit'.
-- Subject: Unit failed
-- Defined-By: systemd
--
-- The unit ostree-remount.service has entered the 'failed' state with result 'start-limit-hit'.
apr 03 17:49:02 server.hartings.se systemd[1]: Failed to start OSTree Remount OS/ Bind Mounts.
-- Subject: Unit ostree-remount.service has failed
-- Defined-By: systemd
--
-- Unit ostree-remount.service has failed.
--
-- The result is failed.
--
-- Unit rpc-statd.service has begun starting up.
apr 03 17:49:02 server.hartings.se systemd[1]: Reached target Sockets.
-- Subject: Unit sockets.target has finished start-up
-- Defined-By: systemd
--
-- Unit sockets.target has finished starting up.
--
-- The start-up result is done.
apr 03 17:49:02 server.hartings.se systemd[1]: systemd-firstboot.service: Start request repeated too quickly.
apr 03 17:49:02 server.hartings.se systemd[1]: systemd-firstboot.service: Failed with result 'start-limit-hit'.
-- Subject: Unit failed
-- Defined-By: systemd
--
-- The unit systemd-firstboot.service has entered the 'failed' state with result 'start-limit-hit'.
apr 03 17:49:02 server.hartings.se systemd[1]: Failed to start First Boot Wizard.
-- Subject: Unit systemd-firstboot.service has failed
-- Defined-By: systemd
--
-- Unit systemd-firstboot.service has failed.
--
-- The result is failed.
apr 03 17:49:02 server.hartings.se systemd[1]: iscsi-onboot.service: Start request repeated too quickly.
apr 03 17:49:02 server.hartings.se systemd[1]: iscsi-onboot.service: Failed with result 'start-limit-hit'.
-- Subject: Unit failed
-- Defined-By: systemd
--
-- The unit iscsi-onboot.service has entered the 'failed' state with result 'start-limit-hit'.
apr 03 17:49:02 server.hartings.se systemd[1]: Failed to start Special handling of early boot iSCSI sessions.
-- Subject: Unit iscsi-onboot.service has failed
-- Defined-By: systemd
--
-- Unit iscsi-onboot.service has failed.
--
-- The result is failed.
apr 03 17:49:02 server.hartings.se systemd[1]: systemd-hwdb-update.service: Start request repeated too quickly.
apr 03 17:49:02 server.hartings.se systemd[1]: systemd-hwdb-update.service: Failed with result 'start-limit-hit'.
-- Subject: Unit failed
-- Defined-By: systemd
--
-- The unit systemd-hwdb-update.service has entered the 'failed' state with result 'start-limit-hit'.
apr 03 17:49:02 server.hartings.se systemd[1]: Failed to start Rebuild Hardware Database.
-- Subject: Unit systemd-hwdb-update.service has failed
-- Defined-By: systemd
--
-- Unit systemd-hwdb-update.service has failed.
--
-- The result is failed.
apr 03 17:49:02 server.hartings.se systemd[1]: loadmodules.service: Start request repeated too quickly.
apr 03 17:49:02 server.hartings.se systemd[1]: loadmodules.service: Failed with result 'start-limit-hit'.
-- Subject: Unit failed
-- Defined-By: systemd
--
-- The unit loadmodules.service has entered the 'failed' state with result 'start-limit-hit'.
apr 03 17:49:02 server.hartings.se systemd[1]: Failed to start Load legacy module configuration.
-- Subject: Unit loadmodules.service has failed
-- Defined-By: systemd
--
-- Unit loadmodules.service has failed.
--
-- The result is failed.
apr 03 17:49:02 server.hartings.se systemd[1]: systemd-sysusers.service: Start request repeated too quickly.
apr 03 17:49:02 server.hartings.se systemd[1]: systemd-sysusers.service: Failed with result 'start-limit-hit'.
-- Subject: Unit failed
-- Defined-By: systemd
--
-- The unit systemd-sysusers.service has entered the 'failed' state with result 'start-limit-hit'.
apr 03 17:49:02 server.hartings.se systemd[1]: Failed to start Create System Users.
-- Subject: Unit systemd-sysusers.service has failed
-- Defined-By: systemd
--
-- Unit systemd-sysusers.service has failed.
--
-- The result is failed.
apr 03 17:49:02 server.hartings.se systemd[1]: systemd-ask-password-console.path: Start request repeated too quickly.
apr 03 17:49:02 server.hartings.se systemd[1]: systemd-ask-password-console.path: Failed with result 'start-limit-hit'.
-- Subject: Unit failed
-- Defined-By: systemd
--
-- The unit systemd-ask-password-console.path has entered the 'failed' state with result 'start-limit-hit'.
apr 03 17:49:02 server.hartings.se systemd[1]: Failed to start Dispatch Password Requests to Console Directory Watch.
-- Subject: Unit systemd-ask-password-console.path has failed
-- Defined-By: systemd
--
-- Unit systemd-ask-password-console.path has failed.
--
-- The result is failed.

apr 03 17:49:03 server.hartings.se kernel: XFS (md127): Ending clean mount
apr 03 17:49:03 server.hartings.se kernel: XFS (md124): Starting recovery (logdev: internal)
apr 03 17:49:03 server.hartings.se systemd[1]: Mounted /boot.

End of lines obtained from “journalctl -xb”.

Any clues as to what the initial problem is?
Many thanks for your input!!

The initial problem is that you pulled the plug. Sudden power loss can leave many things in an inconsistent state that may or may not be fixable afterward; that’s why there’s a shutdown command.
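For the record, a clean stop on a systemd distro is just one of these (both forms are equivalent):

shutdown -h now     # orderly halt/power-off; or: systemctl poweroff
shutdown -r now     # orderly reboot; or: systemctl reboot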

You can’t really tell in advance what might be affected when you do a crash stop like that; it’s probably different every time and you might even get away with it occasionally with no issues at all. Or you could lose the whole works. Or any degree in between.

To avoid issues like this in the future, buy yourself a UPS and don’t be pulling the plug without running a shutdown first.

Thanks FrankCox for your reply.
I have run CentOS since 2007 and have had several outages since that time, perhaps once every couple of years. None of them led to these problems. I tested all my previous servers and CentOS installations for this; the disks nicely resynchronized afterwards and the server auto-restarted without issues.
I am sure something is wrong here. If Rocky Linux cannot handle power interruptions, that is a serious issue, I think. Stability and rock-solid operation have always been the key for an OS like CentOS, and hopefully still are for Rocky Linux today.

If you have power interruptions, you should be investing in a UPS (uninterruptible power supply). Just because you were lucky with one operating system doesn’t mean problems couldn’t occur with it as well; you were purely lucky. Pulling the power while data is being written to disk is going to be a problem with any operating system. Not only that, but some filesystems handle such scenarios better than others: XFS, for example, does not like power outages and is more fragile there than ext3/ext4. But even those can have issues and require manual intervention to recover and get the filesystem mounting again, as sketched below.
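When that manual intervention is needed, it looks roughly like this; a sketch only, the md device names are examples, and the volume must be unmounted first:

fsck.ext4 -f /dev/md125      # ext4: force a full check/repair of an unmounted volume
xfs_repair /dev/md126        # XFS: repair an unmounted volume
xfs_repair -L /dev/md126     # XFS last resort: zero a corrupt log (can lose data)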

Neither Rocky, nor CentOS, nor Red Hat (nor any other operating system) is responsible for your power supply or lack thereof. If you have such problems, then that is what a UPS was made for.

I would never ever run a serious system where power outages are possible without a UPS. And I practice this: I have a number of devices connected to a 2000VA UPS at home, plus a 600VA UPS powering my internet router, firewall, Raspberry Pi and an 8-port switch. I even have 3 x 400VA UPSs powering my Wi-Fi mesh network (3 devices). A UPS also protects against power spikes and surges that could otherwise destroy your equipment.

Thanks iwalker for your frank reply. Appreciated!
My previous servers all had ext4 filesystems, which may have helped when a power interruption occurred in the past.
I do have a voltage protection device connected to the broadband fiber connection, router, server, switch and backup server, but no UPS (yet). Perhaps this is what I should invest in, as you suggest. :smiley:
I did a controlled reboot of the system after the power interruption, and the server booted as it should.
Thanks for all comments!

I’m currently using Green Cell UPSs; they are cheap, so you get what you pay for, but they do the job for me. I’ve also used Belkin UPSs in the past. There are also APC UPSs, which are more expensive than the ones I’ve mentioned but better quality (theoretically, I guess).

Where I live, in a village, they sometimes like to cut the power without informing anyone that there will be an outage, and I don’t really want my Synology NAS, Fortigate firewall and Wi-Fi mesh network destroyed. The worst type is when the power flicks off and back on within the space of a second or less. Whilst the UPSs I use are cheap(ish), at least they protect against that kind of issue and are most certainly worth it, especially considering the price of the hardware connected to them (and the data stored on it).

With the exception of laptops, I have all of my expensive electronic equipment on UPS’s.

Including my piano. :slight_smile:

I figure it’s cheap insurance. In addition to power failure protection the UPS will also act as a kind of a giant fuse – if there’s something bad coming down the power line the UPS will give its life first to protect what you have plugged into it.

It depends on the type of UPS. Cheaper “offline” models run on wall power by default and switch to battery when external power drops. “Online” models feed the load from the batteries all the time and merely recharge them. The finest models completely insulate the batteries from wall power (even while recharging) and ensure clean, steady power, with no ripple passing through.

All of mine are line-interactive UPSs, so supposedly better than offline but not quite as good as online. So far good enough for me anyway; they have kept everything running 24x7, and the connected equipment doesn’t draw a huge amount of power, so I get about 2-3 hours of running time. So far I haven’t had a power outage longer than that.

It’s clear that using a UPS is very good practice. For PC applications it’s obvious. But imagine that you have multiple servers located in a datacenter with energy backup systems, alternate generators, etc., and for some reason all power to the servers is cut. There is more than one scenario where this can happen, and if you have an OS that cannot recover properly from those scenarios, then we have a huge potential problem.

Yup, and that’s common to all operating systems, going back to when smartdrv was allowed to use delayed writes under MS-DOS (and probably before that too, but that’s when I personally remember becoming aware of this as a potential issue).

Basically, anything that uses delayed writes (and that’s pretty much everything these days) is susceptible to this.

Databases like postgres and oracle have specific code for things like rollbacks and transaction completion to help avoid these issues, but like anything else it’s not 100% bulletproof in all possible scenarios.

Regular, recent and verified backups are the only real complete defence against this, along with things like multiple servers in geographically separated datacenters and the like.

There are a great many papers that have been written about data recovery and procedural hardening and anyone interested in this subject could spend a great deal of time researching it.

A year ago an OVH datacenter in France burned badly, and the UPSs were burning too. Electrical fires are nasty. UPSs are batteries: they keep supplying power to the fire, unlike a wall outlet that you might be able to disconnect from outside.

The actual thing you wanted to test, whether the system reboots automatically after a power outage, worked! Your system rebooted.

It just wouldn’t boot properly into your normal OS setup. But as has already been mentioned above, that can be expected. Although often nothing happens to a system that was forced off, it isn’t guaranteed. The most common issue is filesystem corruption, which can usually be fixed from emergency mode with fsck, which replays the filesystem’s transaction log. The other common problem is the RAID array: it may be necessary to get the disks in the array to resync. Depending on the RAID level used, that may not be possible, for instance with something like RAID 0, which has no redundancy, or if more than the tolerable number of disks dropped out during the power outage. If you use a good hardware RAID controller with a battery-backed cache, you are safer from such problems; but if you don’t have that, or if you are using Linux software RAID (as it seems you are), you don’t have that recovery option. A rough sketch of checking a software array follows below.
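The sketch; the md and member device names are examples, not taken from the logs above:

cat /proc/mdstat                      # overall state of all md arrays
mdadm --detail /dev/md127             # per-array detail: clean, degraded, resyncing
mdadm /dev/md127 --re-add /dev/sdb1   # try to re-attach a member that was dropped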

Hi rindi,

Many thanks for your reply! Yes, you are correct. The server did auto-reboot upon power restore. I got that answer from the test, which is good as you say!

Regarding the filesystem: yes, I used RAID 1, so there is a chance that such a reboot leads to a corrupt filesystem. I think this is what I need to accept with my setup.

I tried, in a separate thread, to find out whether I can access the server from outside when it is in the emergency state after a power restore, to, for instance, run fsck and/or trigger a default boot. For this I would need ssh to work in the emergency state, but that doesn’t seem to be possible?

/Ralf

I must say I’m a little disappointed by this thread. A UPS is not a solution, it’s a band-aid. Who here runs a UPS monitor daemon that gracefully shuts down the system when the UPS runs down?
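For those who do want that: with an APC unit, apcupsd can trigger a clean shutdown automatically (NUT is the vendor-neutral alternative). A sketch of the relevant config, with purely illustrative thresholds:

# /etc/apcupsd/apcupsd.conf (excerpt)
UPSCABLE usb
UPSTYPE usb
DEVICE
# shut down when charge drops below 10% or estimated runtime falls below 5 minutes
BATTERYLEVEL 10
MINUTES 5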

Databases with atomic properties record writes into a separate space and then update a relatively small value (maybe just one integer in an index) to commit the change. This GREATLY minimizes the chances of a failure. I would expect any filesystem to have this behavior. It sounds like XFS may not be as good at this as EXT4.
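In userspace, the same commit idea is the classic write-then-rename pattern; a shell sketch, where the target path and $new_contents are hypothetical:

tmp=$(mktemp /data/app.conf.XXXXXX)     # stage the new version on the same filesystem
printf '%s\n' "$new_contents" > "$tmp"
sync "$tmp"                             # flush it to disk first (coreutils 8.24+)
mv -f "$tmp" /data/app.conf             # rename() is atomic within one filesystem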

I live in an area where the transformers are always blowing up; I can hear them pop sometimes. Usually the power flickers but keeps going, but sometimes it does not, and it can easily take longer to come back than my UPS will survive. I have always just let it go down. I suppose I should hit the power button to signal a shutdown; I can’t do a regular shutdown because the KVM and monitor are not on the UPS, since I want to divert all battery power to just the two servers for as long as possible.

It so happens that when I set up my new server last week, I chose to make all filesystems EXT4. Not sure why I did, but apparently my instinct was correct. I will not be using XFS at all unless it’s in a datacenter with high-availability power.

Some 15 years ago, XFS (with less official Linux support at that time) did not tolerate less-than-graceful outages. It was, after all, made for SGI IRIX, for “real servers”. A year or two ago we unplugged an XFS volume by mistake, and it was fine when replugged. I’d bet that Red Hat put some hours into XFS when they chose to use it by default.

A FAQ (or rather, a common complaint) about XFS on forums is the fact that you can’t shrink it. If you need a smaller volume, it’s the dump-remove-create-restore drill.
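That drill looks roughly like this; the paths and device name are examples, and you would shrink the underlying LV/partition between the umount and mkfs steps:

xfsdump -l 0 -f /backup/data.dump /srv/data    # level-0 dump of the mounted filesystem
umount /srv/data
mkfs.xfs -f /dev/vg0/data                      # recreate the (now smaller) volume
mount /dev/vg0/data /srv/data
xfsrestore -f /backup/data.dump /srv/data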

Thanks for this valuable input!

My previous servers all ran ext4 filesystems. I wish this thread had been created a couple of weeks ago, before I configured my new server; I would certainly have chosen ext4 instead of the default XFS offered during the installation process.

I guess there is no way to change filesystems on an existing installation, right?

/Ralf

I want to restart this thread, as the problem is no longer related to a power interruption…
I get the same problem and response when I properly reboot, or shut down and restart, the server.
The server ends up in the emergency state, and when I just press Ctrl-D and start a normal boot after that, the server boots up properly and everything seems fine. I checked with mdadm and all RAID arrays are clean. But why do I run into these issues and an emergency boot after a normal reboot? Can anyone point me to the real cause of ending up in the emergency state?
Just to add more background info: a shutdown, or taking the system down for a reboot, takes ages… about 3-4 minutes!!
One example of the failures (there are many…) is the failure of the “Rebuild Hardware Database” service, related to “start-limit-hit” (most problems seem to have this in common):

[root@server ~]# systemctl status systemd-hwdb-update
● systemd-hwdb-update.service - Rebuild Hardware Database
   Loaded: loaded (/usr/lib/systemd/system/systemd-hwdb-update.service; static; vendor preset: disabled)
   Active: failed (Result: start-limit-hit)
Condition: start condition failed at Thu 2022-04-14 20:06:07 CEST; 3h 54min ago
           ├─ ConditionNeedsUpdate=/etc was not met
           └─ ConditionDirectoryNotEmpty=|/etc/udev/hwdb.d was not met
     Docs: man:hwdb(7)
           man:systemd-hwdb(8)

apr 14 20:05:44 server.hartings.se systemd[1]: systemd-hwdb-update.service: Start request repeated too quickly.
apr 14 20:05:44 server.hartings.se systemd[1]: systemd-hwdb-update.service: Failed with result 'start-limit-hit'.
apr 14 20:05:44 server.hartings.se systemd[1]: Failed to start Rebuild Hardware Database.
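A couple of stock systemd tools that may help narrow down both the failing units and the slow shutdown; nothing here is specific to this box:

systemd-analyze blame                           # units ranked by start-up time
systemd-analyze critical-chain                  # what gates reaching the default target
journalctl -b -1 -n 200                         # tail of the previous boot, incl. its shutdown
journalctl -b -u systemd-hwdb-update.service    # one unit's full history for this boot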

Without waiting days for us to ask the right questions and for you to provide the supporting info, one approach in this case would be to reinstall all the software via dnf:

dnf reinstall '*'

I’ve done it in the past to clean up my own mistakes and it has worked for me. It will take a couple of hours of your time. Another approach might be to use rpm to verify the installed software; I think the command is rpm -Va, but I would have to read the man page to be sure. You would want to redirect the output to a file for review when you have a few hours to browse through it and interpret the coding of the rpm output, which is also described in the man page. Maybe others here have a more structured way to help you solve this. The journalctl tool has a number of options for determining blame for certain failures, but I’m not as familiar with that tool.
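For reference, the verify pass spelled out; the output path is just an example:

rpm -Va > /root/rpm-verify.txt 2>&1    # verify all installed packages against the rpm db
less /root/rpm-verify.txt              # the flagged attribute codes are explained in man rpm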

I used to run CentOS 8 and experienced three accidental power outages; the server could always start normally afterwards. Then I moved to Rocky. The only unexpected power failure occurred after that move, and the system hung completely; the only fix was to reinstall the system… I don’t know why.