Kernel has some issues during booting:
I think it may also have corrupted the file system somehow (especially with the earlier 4.x kernel version).
I would use smartctl to check whether the drive(s) report errors.
If I check this in the dual-boot system (Linux Mint), it gives:
smartctl 7.1 2019-12-30 r5022 [x86_64-linux-5.11.0-38-generic] (local build)
Copyright (C) 2002-19, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Model Family: Seagate Barracuda Pro Compute
Device Model: ST1000LM049-2GH172
Serial Number: WGS5BY78
LU WWN Device Id: 5 000c50 0c05ad3a0
Firmware Version: RXM3
User Capacity: 1,000,204,886,016 bytes [1.00 TB]
Sector Sizes: 512 bytes logical, 4096 bytes physical
Rotation Rate: 7200 rpm
Form Factor: 2.5 inches
Device is: In smartctl database [for details use: -P show]
ATA Version is: ACS-3 T13/2161-D revision 3b
SATA Version is: SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is: Wed Nov 10 20:10:48 2021 PST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
General SMART Values:
Offline data collection status: (0x00) Offline data collection activity
was never started.
Auto Offline Data Collection: Disabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: ( 0) seconds.
Offline data collection
capabilities: (0x51) SMART execute Offline immediate.
No Auto Offline data collection support.
Suspend Offline collection upon new
command.
No Offline surface scan supported.
Self-test supported.
No Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 2) minutes.
Extended self-test routine
recommended polling time: ( 118) minutes.
SCT capabilities: (0x303d) SCT Status supported.
SCT Error Recovery Control supported.
SCT Feature Control supported.
SCT Data Table supported.
SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x002f 067 056 006 Pre-fail Always - 131905737
3 Spin_Up_Time 0x0027 099 099 000 Pre-fail Always - 0
4 Start_Stop_Count 0x0032 098 098 000 Old_age Always - 2508
5 Reallocated_Sector_Ct 0x0033 100 100 036 Pre-fail Always - 0
7 Seek_Error_Rate 0x002f 086 060 045 Pre-fail Always - 372995992
9 Power_On_Hours 0x0032 081 081 000 Old_age Always - 17515 (88 193 0)
10 Spin_Retry_Count 0x0033 100 100 097 Pre-fail Always - 0
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 540
183 Runtime_Bad_Block 0x0032 100 100 000 Old_age Always - 0
184 End-to-End_Error 0x003b 100 100 097 Pre-fail Always - 0
187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0
188 Command_Timeout 0x0032 100 099 000 Old_age Always - 4
189 High_Fly_Writes 0x003a 100 100 000 Old_age Always - 0
190 Airflow_Temperature_Cel 0x0022 053 049 040 Old_age Always - 47 (Min/Max 41/47)
191 G-Sense_Error_Rate 0x0032 100 100 000 Old_age Always - 1
192 Power-Off_Retract_Count 0x0032 100 100 000 Old_age Always - 153
193 Load_Cycle_Count 0x0032 096 096 000 Old_age Always - 9779
194 Temperature_Celsius 0x0022 047 051 000 Old_age Always - 47 (0 20 0 0 0)
196 Reallocated_Event_Count 0x0032 100 100 000 Old_age Always - 0
197 Current_Pending_Sector 0x0032 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0030 100 100 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x0032 100 100 000 Old_age Always - 0
254 Free_Fall_Sensor 0x0032 100 100 000 Old_age Always - 0
SMART Error Log Version: 1
No Errors Logged
SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Short offline Aborted by host 80% 15388 -
# 2 Short offline Aborted by host 70% 14997 -
# 3 Extended offline Completed without error 00% 13324 -
# 4 Short offline Completed without error 00% 13321 -
# 5 Extended offline Aborted by host 90% 13321 -
# 6 Short offline Completed without error 00% 11988 -
# 7 Short offline Aborted by host 60% 11956 -
# 8 Short offline Completed without error 00% 10522 -
# 9 Short offline Completed without error 00% 9304 -
#10 Short offline Aborted by host 40% 9303 -
#11 Short offline Completed without error 00% 8595 -
#12 Short offline Completed without error 00% 8375 -
#13 Short offline Completed without error 00% 8375 -
#14 Short offline Completed without error 00% 5609 -
#15 Short offline Completed without error 00% 5609 -
#16 Short offline Completed without error 00% 5505 -
#17 Short offline Completed without error 00% 5505 -
#18 Short offline Completed without error 00% 5505 -
#19 Short offline Completed without error 00% 3524 -
#20 Short offline Completed without error 00% 3355 -
#21 Short offline Completed without error 00% 3350 -
SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
It looks fully fixed after running e2fsck, but I had to reinstall some software that was corrupted.
The drive looks OK (but a temperature of 47 degrees Celsius feels quite warm).
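For reading the attribute table above, here is a minimal Python sketch (my own illustration, not part of smartmontools) that pulls the raw values out of the table rows and flags a handful of attributes people commonly watch. The choice of IDs 5, 187, 197, 198 and 199 is an assumption of this sketch, and the sample rows are copied from the report above.

```python
# Attribute IDs often watched for signs of a failing disk. This selection is
# an assumption of this sketch, not an official smartmontools list.
WATCHED_IDS = {5, 187, 197, 198, 199}

def parse_attributes(text):
    """Return {id: (name, raw_value)} from smartctl attribute table rows."""
    attrs = {}
    for line in text.splitlines():
        fields = line.split()
        # A table row has at least 10 columns and starts with a numeric ID.
        if len(fields) >= 10 and fields[0].isdigit():
            attrs[int(fields[0])] = (fields[1], fields[9])
    return attrs

def suspicious(attrs):
    """Return the watched attributes whose raw value is not zero."""
    return {i: v for i, v in attrs.items()
            if i in WATCHED_IDS and v[1] != "0"}

# Rows copied from the smartctl report in this thread.
SAMPLE = """\
5 Reallocated_Sector_Ct 0x0033 100 100 036 Pre-fail Always - 0
187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0
188 Command_Timeout 0x0032 100 099 000 Old_age Always - 4
197 Current_Pending_Sector 0x0032 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0030 100 100 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x0032 100 100 000 Old_age Always - 0
"""

if __name__ == "__main__":
    flagged = suspicious(parse_attributes(SAMPLE))
    print(flagged if flagged else "no watched attribute has a non-zero raw value")
```

For this drive none of the watched attributes is non-zero, which matches the impression that the disk itself is not reporting hardware errors.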
[EDIT]
Run rpm -Va to see which files do not match what packages have.
Then, if needed, dnf reinstall packages.
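A rough Python sketch (my own illustration, not from the posts above) of how the rpm -Va output could be narrowed down to files whose content changed or that are missing. It assumes the usual verify format: a flag string such as S.5....T., an optional one-letter marker (c for config files), then the path. The sample paths are made up, and paths containing spaces are not handled.

```python
def changed_content(rpm_va_output):
    """Return paths that are missing or whose digest differs ('5' flag).

    Config files (marker 'c') are skipped, since they are expected to change.
    """
    paths = []
    for line in rpm_va_output.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue
        flags, path = fields[0], fields[-1]
        # Three fields means an attribute marker sits between flags and path.
        if len(fields) == 3 and fields[1] == "c":
            continue
        if flags == "missing" or "5" in flags:
            paths.append(path)
    return paths

# Hypothetical output for illustration only; these are not real findings.
SAMPLE = """\
S.5....T.  c /etc/example.conf
..5....T.    /usr/lib64/libexample.so.1
.......T.    /usr/share/doc/example/README
missing      /usr/bin/example-tool
"""

if __name__ == "__main__":
    for path in changed_content(SAMPLE):
        print(path)
```

Each remaining path can then be mapped to its owning package with rpm -qf before deciding what to dnf reinstall.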
[jari@cosmo ~]$ hddtemp /dev/sda
/dev/sda: ST1000LM049-2GH172: 46°C
It could be due to the hot weather here, as the room temperature is 32 °C.
It was fixed by e2fsck during boot.
The installation files are on a different drive/partition (on the SSD).
This HDD partition is for /home.
Some software, such as LibreOffice, is installed under /home.
It is also possible that the boot problem is caused by a shutdown problem.
Shutdown seems to take 5-10 minutes, and it should not be interrupted by a hardware reset
during that time, or the disks will not be unmounted cleanly and may get corrupted.
Shutdown tends to be quick (unless there are stuck NFS mounts, etc).
Is there anything revealing in the logs (in /var/log/) written during the shutdown?
I’m not sure what to look for here, but I can see some errors in /var/log/messages, such as:
Nov 8 17:59:44 unassigned journal[7326]: Corrupted message received
Nov 8 17:59:44 unassigned journal[7326]: Ignoring device due to initialization error: unsupported firmware version
Nov 8 18:00:26 unassigned kernel: EXT4-fs (sda2): warning: mounting fs with errors, running e2fsck is recommended
Nov 8 18:00:26 unassigned kernel: EXT4-fs (sda2): mounted filesystem with ordered data mode. Opts: (null)
Nov 8 18:01:46 unassigned kernel: EXT4-fs error (device sda2): ext4_validate_block_bitmap:384: comm kworker/u16:6: bg 1104: bad block bitmap checksum
Nov 8 18:01:46 unassigned kernel: EXT4-fs error (device sda2): ext4_validate_block_bitmap:384: comm kworker/u16:6: bg 1105: bad block bitmap checksum
Nov 9 02:07:19 cosmo kernel: pcieport 0000:00:1d.4: error containment capabilities: Int Msg #0, RPExt+ PoisonedTLP+ SwTrigger+ RP PIO Log 4, DL_ActiveErr+
Nov 8 18:07:23 cosmo kernel: hp_accel: probe of HPQ6007:00 failed with error -22
Nov 8 18:07:24 cosmo kernel: hp_wmi: query 0xd returned error 0x5
Nov 8 18:11:03 cosmo kernel: EXT4-fs error (device sda2): ext4_lookup:1582: inode #9045816: comm updatedb: iget: bad extra_isize 28338 (inode size 256)
Nov 8 18:11:04 cosmo kernel: EXT4-fs error (device sda2): ext4_lookup:1582: inode #9045864: comm updatedb: iget: bad extra_isize 28338 (inode size 256)
Nov 8 18:13:55 cosmo kernel: EXT4-fs (sda2): error count since last fsck: 16808
Nov 8 18:13:55 cosmo kernel: EXT4-fs (sda2): initial error at time 1636312711: ext4_validate_block_bitmap:390
Nov 8 18:13:55 cosmo kernel: EXT4-fs (sda2): last error at time 1636366289: ext4_lookup:1582: inode 9045851
Nov 9 03:43:12 cosmo journal[2118]: gsd-clipboard: Fatal IO error 11 (Resource temporarily unavailable) on X server :1024.
Nov 9 03:43:12 cosmo journal[1662]: Connection to xwayland lost
Nov 9 03:43:12 cosmo cupsd[1111]: REQUEST localhost - - "POST / HTTP/1.1" 200 151 Cancel-Subscription client-error-not-found
Nov 9 03:43:12 cosmo journal[2125]: gsd-media-keys: Fatal IO error 11 (Resource temporarily unavailable) on X server :1024.
Nov 9 03:43:12 cosmo journal[2119]: gsd-color: Fatal IO error 11 (Resource temporarily unavailable) on X server :1024.
Nov 9 03:43:12 cosmo journal[2124]: gsd-keyboard: Fatal IO error 11 (Resource temporarily unavailable) on X server :1024.
Nov 9 03:43:12 cosmo journal[2138]: gsd-power: Fatal IO error 11 (Resource temporarily unavailable) on X server :1024.
Nov 9 03:43:12 cosmo journal[2114]: gsd-xsettings: Fatal IO error 11 (Resource temporarily unavailable) on X server :1024.
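The "initial error at time" and "last error at time" values in the EXT4-fs lines above are Unix epoch seconds. A small Python sketch to turn them into readable timestamps (UTC is used here to avoid guessing the machine's time zone):

```python
from datetime import datetime, timezone

def epoch_to_utc(seconds):
    """Format Unix epoch seconds as a UTC timestamp string."""
    stamp = datetime.fromtimestamp(seconds, tz=timezone.utc)
    return stamp.strftime("%Y-%m-%d %H:%M:%S UTC")

if __name__ == "__main__":
    # Values taken from the EXT4-fs (sda2) log lines above.
    for label, value in (("initial error", 1636312711),
                         ("last error", 1636366289)):
        print(f"{label}: {epoch_to_utc(value)}")
```

In UTC these come out as 2021-11-07 19:18:31 for the initial error and 2021-11-08 10:11:29 for the last one, so the first recorded error predates the boots shown in this log excerpt.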
But I cannot tell which of these errors happened during shutdown.
I will keep searching.
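One possible way to separate shutdown-time errors is to note the clock time when the shutdown is started and then keep only the log lines inside that window. A minimal Python sketch, assuming the plain "Mon D HH:MM:SS" stamps used in /var/log/messages (which carry no year, so the year is passed in); the window used in the example is hypothetical and the sample lines are copied from the log above.

```python
from datetime import datetime

def parse_time(line, year):
    """Parse the leading 'Mon D HH:MM:SS' stamp; return None if absent."""
    parts = line.split(None, 3)
    if len(parts) < 4:
        return None
    try:
        return datetime.strptime(
            f"{year} {parts[0]} {parts[1]} {parts[2]}", "%Y %b %d %H:%M:%S")
    except ValueError:
        return None

def lines_between(lines, start, end, year=2021):
    """Return the log lines whose timestamp falls within [start, end]."""
    selected = []
    for line in lines:
        stamp = parse_time(line, year)
        if stamp is not None and start <= stamp <= end:
            selected.append(line)
    return selected

# Lines copied from the /var/log/messages excerpt in this thread.
SAMPLE = [
    "Nov 8 18:07:24 cosmo kernel: hp_wmi: query 0xd returned error 0x5",
    "Nov 8 18:11:03 cosmo kernel: EXT4-fs error (device sda2): ext4_lookup:1582: inode #9045816: comm updatedb: iget: bad extra_isize 28338 (inode size 256)",
    "Nov 8 18:13:55 cosmo kernel: EXT4-fs (sda2): error count since last fsck: 16808",
    "Nov 9 03:43:12 cosmo journal[1662]: Connection to xwayland lost",
]

if __name__ == "__main__":
    # Hypothetical window; replace with the time the shutdown was started.
    start = datetime(2021, 11, 8, 18, 10, 0)
    end = datetime(2021, 11, 8, 18, 15, 0)
    for line in lines_between(SAMPLE, start, end):
        print(line)
```

This only helps if the messages actually reach the disk before the file system is unmounted, so anything printed very late in the shutdown may still be missing.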
Also, during startup with the 5.13 kernel it cannot find /dev/sda2 at all:
[ *** ] A start job is running for dev-sda2.device (1min 26s / 1min 30s)
[ TIME ] Timed out waiting for device dev-sda2.device.
[DEPEND] Dependency failed for File System Check on /dev/sda2.
[DEPEND] Dependency failed for /home.
[DEPEND] Dependency failed for Local File Systems.
[DEPEND] Dependency failed for Mark the need to relabel after reboot.
[FAILED] Failed to start Crash recovery kernel arming.
I think a ‘shutdown.log’ is needed as well, to show that the problem already occurs before the reboot.
If I shut down, I can see only a one-line message on the screen:
watchdog: watchdog0: watchdog did not stop!
In /var/log/messages it says:
messages:Nov 9 01:48:12 localhost kernel: NMI watchdog: Enabled. Permanently consumes one hw-PMU counter.
messages:Nov 8 17:48:23 localhost dracut[1783]: *** Including module: watchdog-modules ***
messages:Nov 9 01:58:45 unassigned kernel: NMI watchdog: Enabled. Permanently consumes one hw-PMU counter.
messages:Nov 8 17:58:55 unassigned dracut[1723]: *** Including module: watchdog-modules ***
messages:Nov 9 02:07:19 cosmo kernel: NMI watchdog: Enabled. Permanently consumes one hw-PMU counter.
messages:Nov 8 18:07:30 cosmo dracut[1892]: *** Including module: watchdog-modules ***
messages:Nov 8 18:15:22 cosmo dracut[17855]: *** Including module: watchdog-modules ***
messages:Nov 9 11:42:42 cosmo kernel: NMI watchdog: Enabled. Permanently consumes one hw-PMU counter.
messages:Nov 9 11:54:25 cosmo kernel: NMI watchdog: Enabled. Permanently consumes one hw-PMU counter.
messages:Nov 9 12:34:14 cosmo kernel: NMI watchdog: Enabled. Permanently consumes one hw-PMU counter.
messages:Nov 10 19:07:14 cosmo kernel: NMI watchdog: Enabled. Permanently consumes one hw-PMU counter.
messages:Nov 10 19:16:21 cosmo kernel: NMI watchdog: Enabled. Permanently consumes one hw-PMU counter.
messages:Nov 10 20:43:36 cosmo kernel: NMI watchdog: Enabled. Permanently consumes one hw-PMU counter.
messages:Nov 10 20:50:45 cosmo journal[4460]: watchdog: enabled [pulse: 90s]
messages:Nov 12 13:36:53 cosmo kernel: NMI watchdog: Enabled. Permanently consumes one hw-PMU counter.
messages:Nov 12 14:01:48 cosmo kernel: NMI watchdog: Enabled. Permanently consumes one hw-PMU counter.
I cannot find the on-screen message “watchdog: watchdog0: watchdog did not stop!” in any of the log files, so it seems to appear on the screen only.
But it looks like the shutdown stalls while waiting for this watchdog to stop, so I should wait for the watchdog to shut down before the disks can be safely unmounted. I’m not 100% sure what is going on there.
Also, the on-screen message appears only if I press CTRL+ALT+DEL after waiting for the shutdown to occur.
There are also other messages that appear during shutdown:
The shutdown is always slow, and it should not be interrupted or the HDD might not get unmounted.
/etc/default/grub and then update grub.cfg & co. with grub2-mkconfig)
I usually restart from the user GUI session.
This would work on some Linux distributions.
Yes. The point is that if you can restart when there are no GUI sessions but get into trouble when restarting from a GUI session, then we have narrowed down the list of suspects a bit.
I regenerated grub.cfg with grub2-mkconfig, but should I also reinstall GRUB now?
This looks more complicated, but let me see if this will reveal something.
Certainly the unmount should come last so that all log events get written to the disk,
but that is risky in this case if the system hangs before the unmount occurs.
If the shutdown log were written to the /boot partition and all other partitions were
unmounted first during the shutdown, that could help.
That also looks risky; the log files should probably have their own partition to avoid issues with the data partition, or the shutdown logs could be sent to a network drive on Rocky Linux.
It could be a good idea to send the logs, and especially any errors, to the maintainers
of the Linux distribution. It might or might not help them check the development status.
Here is what is visible after enabling full kernel logging:
Fixing Boot Failure
Use the set command with no arguments to view the environment variables.
The ls command lists the available partitions on the disk.
Set the boot partition as the value of the root variable.
Load the normal boot mode.
Start the normal boot mode.
Load the Linux kernel using the linux command.
Regards,
Rachel Gomez