Kernel booting failures

The kernel has some issues during booting:


I think it may also have corrupted the file system somehow (especially with the earlier 4.x kernel version).

I would:

  1. Boot the installer in rescue mode (or a “live” distro) and check with smartctl whether the drive(s) report errors.
  2. Wipe everything and reinstall (if the hardware looks OK).
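
As a rough sketch of step 1: the attributes that most directly point at failing media are usually the reallocated, pending and uncorrectable sector counts (IDs 5, 187, 197, 198). The helper name smart_media_alerts is made up for illustration; it only filters the attribute table that smartctl -A prints and does not talk to the drive itself.

```shell
# Hypothetical helper: read `smartctl -A` output on stdin and print the
# media-related attributes (IDs 5, 187, 197, 198) whose raw value is non-zero.
# Column 10 of the attribute table is RAW_VALUE.
smart_media_alerts() {
    awk '$1 ~ /^(5|187|197|198)$/ && $10 + 0 > 0 { print $2 " raw=" $10 }'
}

# Typical use from a rescue or live environment (as root; /dev/sda is an example):
#   smartctl -H /dev/sda                       # overall health verdict
#   smartctl -A /dev/sda | smart_media_alerts  # prints nothing if all are zero
#   smartctl -t long /dev/sda                  # start an extended self-test
#   smartctl -l selftest /dev/sda              # read the result afterwards
```

No output from the filter only means those counters are zero; it does not rule out cabling, controller or power problems.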

If I check this from the dual-boot Linux Mint system, it gives:

smartctl 7.1 2019-12-30 r5022 [x86_64-linux-5.11.0-38-generic] (local build)
Copyright (C) 2002-19, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Seagate Barracuda Pro Compute
Device Model:     ST1000LM049-2GH172
Serial Number:    WGS5BY78
LU WWN Device Id: 5 000c50 0c05ad3a0
Firmware Version: RXM3
User Capacity:    1,000,204,886,016 bytes [1.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    7200 rpm
Form Factor:      2.5 inches
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-3 T13/2161-D revision 3b
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Wed Nov 10 20:10:48 2021 PST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00)	Offline data collection activity
					was never started.
					Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0)	The previous self-test routine completed
					without error or no self-test has ever 
					been run.
Total time to complete Offline 
data collection: 		(    0) seconds.
Offline data collection
capabilities: 			 (0x51) SMART execute Offline immediate.
					No Auto Offline data collection support.
					Suspend Offline collection upon new
					command.
					No Offline surface scan supported.
					Self-test supported.
					No Conveyance Self-test supported.
					Selective Self-test supported.
SMART capabilities:            (0x0003)	Saves SMART data before entering
					power-saving mode.
					Supports SMART auto save timer.
Error logging capability:        (0x01)	Error logging supported.
					General Purpose Logging supported.
Short self-test routine 
recommended polling time: 	 (   2) minutes.
Extended self-test routine
recommended polling time: 	 ( 118) minutes.
SCT capabilities: 	       (0x303d)	SCT Status supported.
					SCT Error Recovery Control supported.
					SCT Feature Control supported.
					SCT Data Table supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   067   056   006    Pre-fail  Always       -       131905737
  3 Spin_Up_Time            0x0027   099   099   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   098   098   000    Old_age   Always       -       2508
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002f   086   060   045    Pre-fail  Always       -       372995992
  9 Power_On_Hours          0x0032   081   081   000    Old_age   Always       -       17515 (88 193 0)
 10 Spin_Retry_Count        0x0033   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       540
183 Runtime_Bad_Block       0x0032   100   100   000    Old_age   Always       -       0
184 End-to-End_Error        0x003b   100   100   097    Pre-fail  Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   099   000    Old_age   Always       -       4
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   053   049   040    Old_age   Always       -       47 (Min/Max 41/47)
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       1
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       153
193 Load_Cycle_Count        0x0032   096   096   000    Old_age   Always       -       9779
194 Temperature_Celsius     0x0022   047   051   000    Old_age   Always       -       47 (0 20 0 0 0)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   100   100   000    Old_age   Always       -       0
254 Free_Fall_Sensor        0x0032   100   100   000    Old_age   Always       -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Aborted by host               80%     15388         -
# 2  Short offline       Aborted by host               70%     14997         -
# 3  Extended offline    Completed without error       00%     13324         -
# 4  Short offline       Completed without error       00%     13321         -
# 5  Extended offline    Aborted by host               90%     13321         -
# 6  Short offline       Completed without error       00%     11988         -
# 7  Short offline       Aborted by host               60%     11956         -
# 8  Short offline       Completed without error       00%     10522         -
# 9  Short offline       Completed without error       00%      9304         -
#10  Short offline       Aborted by host               40%      9303         -
#11  Short offline       Completed without error       00%      8595         -
#12  Short offline       Completed without error       00%      8375         -
#13  Short offline       Completed without error       00%      8375         -
#14  Short offline       Completed without error       00%      5609         -
#15  Short offline       Completed without error       00%      5609         -
#16  Short offline       Completed without error       00%      5505         -
#17  Short offline       Completed without error       00%      5505         -
#18  Short offline       Completed without error       00%      5505         -
#19  Short offline       Completed without error       00%      3524         -
#20  Short offline       Completed without error       00%      3355         -
#21  Short offline       Completed without error       00%      3350         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

It looks fully fixed by e2fsck, but I had to reinstall some software that was corrupted.

The drive looks OK (but a temperature of 47 degrees Celsius feels quite warm).

[EDIT]
Run rpm -Va to see which files do not match what the packages installed.
Then, if needed, run dnf reinstall on the affected packages.
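
A minimal sketch of that workflow, assuming the usual rpm -Va line layout (a nine-character flag string, an optional one-letter file-type marker such as c for config files, then the path). The function name rpm_damaged_files is invented here; it deliberately skips config, doc and ghost files, since those are often changed on purpose.

```shell
# Hypothetical helper: read `rpm -Va` output on stdin and print the paths of
# unmarked (non-config, non-doc) files that are missing or whose digest
# differs ("5" in the third position of the flag string).
rpm_damaged_files() {
    awk '
        NF == 2 && $1 == "missing"          { print $2; next }
        NF == 2 && substr($1, 3, 1) == "5"  { print $2 }
    '
}

# Possible use (as root):
#   rpm -Va 2>/dev/null | rpm_damaged_files \
#       | xargs -r rpm -qf --qf '%{NAME}\n' | sort -u
#   dnf reinstall <package names printed above>
```

Paths containing spaces would be missed by this simple field count, so treat the list as a starting point rather than a complete report.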

[jari@cosmo ~]$ hddtemp /dev/sda
/dev/sda: ST1000LM049-2GH172: 46°C

It could be due to the hot weather here, as the room temperature is 32 °C.

It was fixed by e2fsck during boot.

The installation files are on a different drive/partition (on the SSD).
This HDD partition is for /home.
Some software, such as LibreOffice, is installed under /home.

It is also possible that the boot problem is caused by a shutdown problem.
It looks like shutdown takes 5-10 minutes, and the machine should not be hardware-reset
during the shutdown, or the disks will still be mounted and may get corrupted.

Shutdown tends to be quick (unless there are stuck NFS mounts, etc).

Is there anything revealing in the logs (in /var/log/) written during the shutdown?

I’m not sure what to look for here. I can see some errors in /var/log/messages, such as:

Nov  8 17:59:44 unassigned journal[7326]: Corrupted message received
Nov  8 17:59:44 unassigned journal[7326]: Ignoring device due to initialization error: unsupported firmware version
Nov  8 18:00:26 unassigned kernel: EXT4-fs (sda2): warning: mounting fs with errors, running e2fsck is recommended
Nov  8 18:00:26 unassigned kernel: EXT4-fs (sda2): mounted filesystem with ordered data mode. Opts: (null)
Nov  8 18:01:46 unassigned kernel: EXT4-fs error (device sda2): ext4_validate_block_bitmap:384: comm kworker/u16:6: bg 1104: bad block bitmap checksum
Nov  8 18:01:46 unassigned kernel: EXT4-fs error (device sda2): ext4_validate_block_bitmap:384: comm kworker/u16:6: bg 1105: bad block bitmap checksum
Nov  9 02:07:19 cosmo kernel: pcieport 0000:00:1d.4: error containment capabilities: Int Msg #0, RPExt+ PoisonedTLP+ SwTrigger+ RP PIO Log 4, DL_ActiveErr+
Nov  8 18:07:23 cosmo kernel: hp_accel: probe of HPQ6007:00 failed with error -22
Nov  8 18:07:24 cosmo kernel: hp_wmi: query 0xd returned error 0x5
Nov  8 18:11:03 cosmo kernel: EXT4-fs error (device sda2): ext4_lookup:1582: inode #9045816: comm updatedb: iget: bad extra_isize 28338 (inode size 256)
Nov  8 18:11:04 cosmo kernel: EXT4-fs error (device sda2): ext4_lookup:1582: inode #9045864: comm updatedb: iget: bad extra_isize 28338 (inode size 256)
Nov  8 18:13:55 cosmo kernel: EXT4-fs (sda2): error count since last fsck: 16808
Nov  8 18:13:55 cosmo kernel: EXT4-fs (sda2): initial error at time 1636312711: ext4_validate_block_bitmap:390
Nov  8 18:13:55 cosmo kernel: EXT4-fs (sda2): last error at time 1636366289: ext4_lookup:1582: inode 9045851
Nov  9 03:43:12 cosmo journal[2118]: gsd-clipboard: Fatal IO error 11 (Resource temporarily unavailable) on X server :1024.
Nov  9 03:43:12 cosmo journal[1662]: Connection to xwayland lost
Nov  9 03:43:12 cosmo cupsd[1111]: REQUEST localhost - - "POST / HTTP/1.1" 200 151 Cancel-Subscription client-error-not-found
Nov  9 03:43:12 cosmo journal[2125]: gsd-media-keys: Fatal IO error 11 (Resource temporarily unavailable) on X server :1024.
Nov  9 03:43:12 cosmo journal[2119]: gsd-color: Fatal IO error 11 (Resource temporarily unavailable) on X server :1024.
Nov  9 03:43:12 cosmo journal[2124]: gsd-keyboard: Fatal IO error 11 (Resource temporarily unavailable) on X server :1024.
Nov  9 03:43:12 cosmo journal[2138]: gsd-power: Fatal IO error 11 (Resource temporarily unavailable) on X server :1024.
Nov  9 03:43:12 cosmo journal[2114]: gsd-xsettings: Fatal IO error 11 (Resource temporarily unavailable) on X server :1024.
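
A small sketch for reading the EXT4-fs lines above. The "initial error at time" and "last error at time" values are Unix epoch seconds, so they can be converted to a readable date, and the error lines can be counted per device. Both function names are made up for illustration, and the date call assumes GNU date.

```shell
# Convert an epoch value from an "EXT4-fs ... error at time N" line to UTC.
# Assumes GNU date, which accepts the `-d @N` form.
epoch_to_utc() {
    date -u -d "@$1" '+%Y-%m-%d %H:%M:%S UTC'
}

# Read syslog text on stdin and count "EXT4-fs error" lines per device.
ext4_errors_by_device() {
    sed -n 's/.*EXT4-fs error (device \([^)]*\)).*/\1/p' \
        | sort | uniq -c | awk '{ print $2 ": " $1 }'
}

# Possible use:
#   epoch_to_utc 1636312711
#   ext4_errors_by_device < /var/log/messages
```

For the two values in the log, epoch_to_utc 1636312711 gives 2021-11-07 19:18:31 UTC and epoch_to_utc 1636366289 gives 2021-11-08 10:11:29 UTC. If the machine's local time is UTC+8 (an assumption on my part, based on the roughly eight-hour jumps between some timestamps), the last error lines up with the ext4_lookup messages logged around 18:11.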

But I cannot tell which errors happened during shutdown.
I will try to search more.
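
One way to separate shutdown messages, sketched under the assumption that every boot in /var/log/messages starts with the kernel's "Linux version" banner: the lines just before each banner are the last ones that reached the disk before the previous shutdown or reset. The function name tail_before_each_boot is hypothetical.

```shell
# Hypothetical helper: read syslog text on stdin and, for every kernel boot
# banner, print the N lines that were logged immediately before it
# (N defaults to 20). A small ring buffer keeps only the last N lines.
tail_before_each_boot() {
    awk -v n="${1:-20}" '
        / kernel: Linux version / {
            print "--- last lines before boot at: " $1, $2, $3
            start = (count > n) ? count - n + 1 : 1
            for (i = start; i <= count; i++) print buf[i % n]
            count = 0
            next
        }
        { count++; buf[count % n] = $0 }
    '
}

# Possible use:
#   tail_before_each_boot 30 < /var/log/messages
# If the systemd journal is persistent (an assumption; /var/log/journal must
# exist), the end of the previous boot can also be viewed with:
#   journalctl -b -1 -e
```

If the last lines before a banner show no unmount or "Reached target" style shutdown messages at all, that would hint at a hard reset rather than a completed shutdown.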

Also, during startup with the 5.13 kernel it cannot find /dev/sda2 at all:

[ ***  ] A start job is running for dev-sda2.device (1min 26s / 1min 30s)
[ TIME ] Timed out waiting for device dev-sda2.device.
[DEPEND] Dependency failed for File System Check on /dev/sda2.
[DEPEND] Dependency failed for /home.
[DEPEND] Dependency failed for Local File Systems.
[DEPEND] Dependency failed for Mark the need to relabel after reboot.

[FAILED] Failed to start Crash recovery kernel arming.
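
The unit name dev-sda2.device suggests that /home is referenced in /etc/fstab by the kernel name /dev/sda2 rather than by UUID or label. Kernel names are not guaranteed to stay the same across boots or kernel versions, so one possible (unconfirmed) explanation is that the HDD shows up under a different name with the 5.13 kernel. A minimal sketch for spotting such entries; the function name fstab_kernel_names is made up:

```shell
# Hypothetical helper: read an fstab on stdin and list the entries that refer
# to a device by kernel name (/dev/sdXN, /dev/nvme..., etc.) instead of
# UUID= or LABEL=.
fstab_kernel_names() {
    awk '$1 !~ /^#/ && $1 ~ /^\/dev\/(sd|hd|vd|nvme)/ { print $1 " -> " $2 }'
}

# Possible use:
#   fstab_kernel_names < /etc/fstab
#   lsblk -f            # shows which device currently carries which UUID
#   blkid /dev/sda2     # prints the UUID to use in a UUID=... fstab entry
```

If the /home entry turns out to use a kernel name, switching it to UUID= would at least rule this cause out.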

I think a ‘shutdown.log’ is needed as well, to show that the problem already occurs before the reboot.

If I shut down, I can see only a one-line message on the screen:

In /var/log/messages it says:

messages:Nov  9 01:48:12 localhost kernel: NMI watchdog: Enabled. Permanently consumes one hw-PMU counter.
messages:Nov  8 17:48:23 localhost dracut[1783]: *** Including module: watchdog-modules ***
messages:Nov  9 01:58:45 unassigned kernel: NMI watchdog: Enabled. Permanently consumes one hw-PMU counter.
messages:Nov  8 17:58:55 unassigned dracut[1723]: *** Including module: watchdog-modules ***
messages:Nov  9 02:07:19 cosmo kernel: NMI watchdog: Enabled. Permanently consumes one hw-PMU counter.
messages:Nov  8 18:07:30 cosmo dracut[1892]: *** Including module: watchdog-modules ***
messages:Nov  8 18:15:22 cosmo dracut[17855]: *** Including module: watchdog-modules ***
messages:Nov  9 11:42:42 cosmo kernel: NMI watchdog: Enabled. Permanently consumes one hw-PMU counter.
messages:Nov  9 11:54:25 cosmo kernel: NMI watchdog: Enabled. Permanently consumes one hw-PMU counter.
messages:Nov  9 12:34:14 cosmo kernel: NMI watchdog: Enabled. Permanently consumes one hw-PMU counter.
messages:Nov 10 19:07:14 cosmo kernel: NMI watchdog: Enabled. Permanently consumes one hw-PMU counter.
messages:Nov 10 19:16:21 cosmo kernel: NMI watchdog: Enabled. Permanently consumes one hw-PMU counter.
messages:Nov 10 20:43:36 cosmo kernel: NMI watchdog: Enabled. Permanently consumes one hw-PMU counter.
messages:Nov 10 20:50:45 cosmo journal[4460]: watchdog: enabled [pulse: 90s]
messages:Nov 12 13:36:53 cosmo kernel: NMI watchdog: Enabled. Permanently consumes one hw-PMU counter.
messages:Nov 12 14:01:48 cosmo kernel: NMI watchdog: Enabled. Permanently consumes one hw-PMU counter.

I cannot find the on-screen message “watchdog: watchdog0: watchdog did not stop!” in any of the log files, so it apparently appears on the screen only.
But it looks like the shutdown stalls while waiting for this ‘watchdog’ to stop, so I should wait for the watchdog to stop before the disks can be safely unmounted. I am not 100% sure what is going on there.
Also, the on-screen message appears only if I press CTRL+ALT+DEL after waiting for the shutdown to happen.

Also other messages that appear during shutdown:

The shutdown is always slow, and it should not be interrupted or the HDD might not get unmounted.

  1. When logged into a GUI session, one probably has options for “Logout”, “Restart”, and “Shutdown”.
    The system might behave differently if you log out first and then choose restart or poweroff on the login screen.
  2. The kernel command line has the options “rhgb” and “quiet” by default. I tend to remove them (from /etc/default/grub, and then update grub.cfg & co. with grub2-mkconfig).
    That lets more messages be seen, at least during boot, perhaps during shutdown too?
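
Point 2 could be sketched roughly as follows. The function name strip_rhgb_quiet is made up, and it assumes the GRUB_CMDLINE_LINUX value is a single double-quoted string without embedded quotes. The grub.cfg path shown is the usual BIOS location and differs on UEFI systems.

```shell
# Hypothetical helper: read /etc/default/grub on stdin and write it to stdout
# with the words "rhgb" and "quiet" removed from GRUB_CMDLINE_LINUX.
# All other lines are passed through unchanged.
strip_rhgb_quiet() {
    awk '
        /^GRUB_CMDLINE_LINUX=/ {
            split($0, parts, "\"")            # parts[2] is the quoted value
            m = split(parts[2], words, " ")
            out = ""
            for (i = 1; i <= m; i++)
                if (words[i] != "rhgb" && words[i] != "quiet")
                    out = out (out == "" ? "" : " ") words[i]
            print parts[1] "\"" out "\""
            next
        }
        { print }
    '
}

# Possible use (as root, keeping a backup):
#   cp /etc/default/grub /etc/default/grub.bak
#   strip_rhgb_quiet < /etc/default/grub.bak > /etc/default/grub
#   grub2-mkconfig -o /boot/grub2/grub.cfg
```

On systems that ship grubby, grubby --update-kernel=ALL --remove-args="rhgb quiet" should achieve a similar result for the installed boot entries; which of the two applies best depends on how the boot entries are managed.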

I usually restart from the user GUI session.
This would work on some Linux distributions.

Yes. The point is that if you can restart when there are no GUI sessions but get into trouble when restarting from a GUI session, then we have narrowed down the list of suspects a bit.

I updated the configuration with grub2-mkconfig, but should I also reinstall GRUB now?
This looks more complicated, but let me see if this will reveal something.
Certainly the unmount should come last, so that all log events get written to the disk,
but in this case there is a risk that the system hangs before the unmount occurs.
If the shutdown log were written to the /boot partition and all other partitions were
unmounted first during the shutdown, that could help.
That also looks risky; perhaps the log files should have their own partition to avoid issues with the data partition, or the shutdown logs could be sent to a network drive in Rocky Linux.
It could be a good idea to send the logs, and especially any errors, to the Linux
distribution maintainers. It might or might not help them check the development status.
Here is what is visible after enabling full kernel logging: