Very short disk life expectancy on Rocky 9

I think this is off-topic for a Rocky forum, but Western Digital support suggested seeing if anyone has an answer. I built a new server with four new WD Blue SSDs in two RAID 1 clusters, and in less than 16 months three of the four SSDs have failed. I notice that they are constantly being accessed, judging by the “disk” light on the server. Is there any known reason why I should fall back to standard spinning HDDs? Our old server was built in 2016, and of its six HDDs (five WD and one Seagate at the moment) only two have failed in all that time. The problem is that I don’t trust the new machine enough to put it online as our company web server (it handles mail and FTP only at the moment).

Do you have a swap partition on them? Actively swapping on SSDs can wear them out fairly quickly. Also, are you using software RAID or a RAID controller?
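If it helps to check, something like the following should show whether swap actually lives on the SSDs and how the arrays are put together (standard tools on a stock Rocky 9 install; a sketch, not a diagnosis):

swapon --show      # lists active swap devices/files; a /dev/sdX or /dev/mdX entry means swap is on the SSDs
cat /proc/mdstat   # shows the mdadm arrays, their member disks, and their sync state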

Yes, there is swap on one of them (not the entire group), but it is rarely active because the machine has far more memory than its current workload needs. I am using the standard (mdadm) software RAID.

I encountered additional problems here. To get the server back online I added another disk to the md cluster and removed the bad one logically, but left it physically connected. When I tried to physically remove the disk, the server would not boot and never came online. I put the defective disk back and everything works as expected. The disk should have no active files on it, and I used "swapoff -a" to make sure the system was not looking for the swap partition that used to be there. The disk put itself back as active sometime last week, so I am able to get some diagnostics. Here is the gdisk partition info:

Number  Start (sector)    End (sector)  Size       Code  Name
   1            2048      1048840191   500.1 GiB   FD00  
   2      1048840192      1468532735   200.1 GiB   FD00  
   3      1468532736      1783367679   150.1 GiB   FD00  
   4      1783367680      1783615487   121.0 MiB   FD00  
   5      1783615488      1785714687   1.0 GiB     FD00  
   6      1785714688      1819269119   16.0 GiB    8200
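For reference, the “logical” removal and replacement described above is normally done with mdadm along these lines (the array and partition names here are placeholders, not necessarily the ones on this server):

mdadm /dev/md126 --fail /dev/sdb1      # mark the failing member as faulty
mdadm /dev/md126 --remove /dev/sdb1    # drop it from the array
mdadm /dev/md126 --add /dev/sdc1       # add the replacement partition and let it resync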

What do I have to do to physically remove the defective disk and have the server boot properly?
I plugged in a screen and there is a message about “A start job is running for /dev/disk/by-uuid…” (more or less; I can’t get a hard copy since boot has not completed).

How does your system boot?

  • Legacy BIOS reads sector 0 from one disk – it does not know about software RAID
  • UEFI loads the bootloader from an ESP partition – it does not know about software RAID either. The default boot entry in UEFI holds the identifier of a specific bootloader on a specific drive

If either looks for that drive …
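For example, on a UEFI machine you can see which disk and ESP each boot entry points to with efibootmgr (assuming the package is installed; run as root):

efibootmgr -v    # -v prints the device path (disk GUID / partition) behind each boot entry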


GRUB does not know about software RAID either. It is the kernel & initramfs that set those up.


At what stage of the boot process does the “action” cease?

@jlehtone: Thanks for the prompt response, very much appreciated.

This machine has UEFI and I think GRUB as the boot loader, whatever comes standard with Rocky 9.4. Everything is up to date as of last Sunday.
Both the /boot and /boot/efi partitions are on RAID 1 clusters: /boot is on /dev/md126 and /boot/efi is on /dev/md124.
When the old disk is unplugged boot gets stuck at:

starting dracut premount hook...
finished dracut premount hook
A start job is running for /dev/disk/by-uuid/1493f55...ab5c <time>/no limit

When I plug the defective disk back in it boots correctly within a minute.

Ok,

The UEFI loads GRUB from one of the drives that make up “md124”. It can do that because a RAID 1 member (at least with some metadata versions) looks like a plain “filesystem on a partition”.

Likewise, GRUB loads the kernel and initramfs image from one member of “md126” (unless GRUB has support for mdadm RAID 1 and the wrapper stub /boot/efi/EFI/rocky/grub.cfg enables that support – the “real” grub.cfg is in /boot/grub2/).

Kernel loads and initializes some device support according to instructions within the initramfs.


Most likely something there refers explicitly to the UUID of the removed drive.

(One can list the files of the initramfs with sudo lsinitrd.)

One thing to try is to regenerate the initramfs. I have not bothered to learn how; I usually (re)install another kernel (version) – and after successfully booting into it, reinstall the latest.
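For completeness, a sketch of both routes, assuming the stock Rocky 9 tooling:

dracut -f --kver $(uname -r)             # rebuild the initramfs for the running kernel in place
dnf reinstall kernel-core-$(uname -r)    # or reinstall the current kernel package, which should regenerate its initramfs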

I did “lsinitrd | grep uuid” and found no explicit references to any uuid. Not sure if they may be referenced indirectly.

Result of:

grep uuid /boot/efi/EFI/rocky/grub.cfg
search --no-floppy --root-dev-only --fs-uuid --set=dev b8df5ed2-e789-49e2-9ed5-23e58a274594

so it looks like the extension may be present to boot off the RAID 1 cluster.

I’m not sure how to move between kernels. Can I do this using dnf? One problem is that I can only take this server offline on weekends, so I don’t want to do anything that might prevent it from getting back online by tomorrow AM.

lsblk shows the drive as present and partitioned but not allocated to anything. Did I miss a command to make it “disappear” from the available devices? How can I be sure that none of these partitions is active somewhere, even though they appear not to be allocated to a directory?

sdb         8:16   0 931.5G  0 disk  
├─sdb1      8:17   0 500.1G  0 part  
├─sdb2      8:18   0 200.1G  0 part  
├─sdb3      8:19   0 150.1G  0 part  
├─sdb4      8:20   0   121M  0 part  
├─sdb5      8:21   0     1G  0 part  
└─sdb6      8:22   0    16G  0 part  
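For what it’s worth, a few checks that cover the usual ways a partition can still be “in use” (assuming the drive is /dev/sdb as in the listing above):

findmnt | grep sdb     # any of its filesystems still mounted?
swapon --show          # any of its partitions still active as swap?
cat /proc/mdstat       # still a member of an mdadm array?
lsof /dev/sdb*         # any process holding the device nodes open?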

After playing with this for a while I looked at what /dev/disk/by-uuid/1493f155-fdfa-4a56-a339-9863fd12ab5c actually points to, and I get

��?��U��JV�9�c��\SWAPSPACE2

I did a swapoff -a but this link still shows as active. I think my problem is how to make the system let go of the swap space. Note that “top” only shows enough total swap for the proper partition, that is, the one on the replacement disk.

@jlehtone: I found out how to re-initialize the initramfs (from a CentOS post)

dracut -f /boot/initramfs-5.14.0-427.16.1.el9_4.x86_64.img $(uname -r)

and I ran it, but even with swap turned off the defective disk still shows up in /dev/disk/by-uuid (now /dev/sde6) and I still can’t remove the disk. Same problem as before: it is looking for the swap partition on this disk. Any further suggestions? I’m concerned that the disk may go offline again and make the server unbootable.

The disk is bad, so clearing its partition table should be ok?
I would look at gdisk to see if it has an option for that.
With no partitions in the table there should be no hint that there is a “swap partition” within.
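A sketch of what that zap could look like; double-check the device name first, since this is destructive. gdisk’s expert menu has a ‘z’ (zap) command, and its scriptable sibling sgdisk can do it in one line (/dev/sdX stands for the failed disk only):

sgdisk --zap-all /dev/sdX    # wipe the GPT (and protective MBR) on the failed disk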

@jlehtone: No joy. I had thought of that, but I was afraid of exactly what happened: the server is now completely unbootable. The original message about a start job running for the uuid has reappeared, and since there is no longer an entry in the partition table the machine won’t start at all. No way that I can find to get a screen up to re-create the disk. Help!
(I did try to boot off the install media and that might work, but I have a lot of time invested in configuring this server and I don’t want to lose it if possible.) Is there any way to recreate the missing uuid on another disk, or better yet just bypass the start job message?

“Is there any way to recreate the missing uuid on another disk…”

Yes: “tune2fs”. Here’s a link to an article about retrieving and changing uuids:
https://linuxconfig.org/how-to-retrieve-and-change-partitions-universally-unique-identifier-uuid-on-linux
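A sketch of what that looks like (the device name is a placeholder; the UUID is the one from this thread). Note that tune2fs applies to ext2/3/4 filesystems; for a swap partition the analogous step would be re-running mkswap with -U:

tune2fs -U 1493f155-fdfa-4a56-a339-9863fd12ab5c /dev/sdXN    # set an ext filesystem's UUID
mkswap -U 1493f155-fdfa-4a56-a339-9863fd12ab5c /dev/sdXN     # or recreate a swap signature with the old UUID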
Hope this helps.

(P.S. Based on my research, the primary factor destroying electronics is heat. IMHO, although they don’t have moving parts, SSDs suffer the same Joule heating effect as processors or memory chips.)

@pilotFerdi: Thanks for the quick response. Since you have in-depth knowledge of disk systems, maybe the following is a better question to ask: boot is hung looking for a defective disk with the given uuid. I know the uuid involved, but a better solution would be to remove that requirement from the boot process, since the disk partition no longer exists. I can install another unused disk to put the uuid on if necessary, but can I safely delete the uuid from somewhere so boot doesn’t look for it? That way boot will not get inconsistent information. Can I do this from the install media, since the underlying system will not boot? I DON’T want to accidentally reformat!

As far as I understand how Linux boots: your swap partition is not used by grub (the Linux boot loader). The only disk partition grub accesses during boot is the root partition. After that happens, it reads the /etc/fstab file to establish the inodes for the drives. It is the record (or records) in that file that’s causing your problem.

As far as I know, you do not need to have a swap partition; a swap file can be created on the root partition in its absence. Consequently, I think you could potentially solve the problem by:

  1. boot from a live USB stick;
  2. edit the /etc/fstab file on your server’s root partition, and comment out (not delete!) the entry for the swap partition (a sketch of such a line follows this list);
  3. reboot.
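A sketch of what step 2 amounts to; the UUID here is the one from this thread, and the exact fields in your fstab may differ:

# UUID=1493f155-fdfa-4a56-a339-9863fd12ab5c  none  swap  defaults  0 0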

Once you have it working, you could move to the next steps: install a new drive, change its UUID to match the UUID of the damaged drive, and un-comment the swap partition in /etc/fstab.

I’ve done this a couple of times, but a long time ago, when Linux was using ‘lilo’ instead of ‘grub’ as the boot loader. Hence, my recollection might be a little off. Please let me know whether this helps. Good luck!

Not quite correct: the swap device can be referenced in GRUB’s kernel command line with resume=.

Erm no, the fstab is used for mounting partitions, not for establishing inodes. Inodes are part of the filesystem and have nothing to do with fstab.

I booted from the install media and found that /etc/fstab did not have a reference to the defective disk, i.e. that uuid does not exist in the list. It also does not exist in “ls -l /dev/disk/by-uuid”. This is expected because the defective disk is no longer plugged in.

The ONLY entries in fstab running under the rescue kernel are /, /boot, /boot/efi, /home, and /usr. The swap entry (“none”) is commented out because I did that before it became unbootable. [SWAP] is shown on zram0.

I DID remember to chroot to the real root: “chroot /mnt/sysroot”.

mdadm shows the /boot and /boot/efi RAID 1 clusters as clean so it is unlikely that there is a backup fstab somewhere created by the rescue system.

Trying to boot normally still hangs on the start job for the deleted disk message as before.

Does GRUB_CMDLINE_LINUX in /etc/default/grub reference the failed drive?
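A quick way to check, assuming the default file location:

grep ^GRUB_CMDLINE_LINUX /etc/default/grub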

Related to Ianmor’s post above: what is listed on the “options” line for the selected kernel in /boot/loader/entries/<machine-id>-<kernel-version>.conf?
If the old swap is listed there, you will need to use the grubby command to remove the unwanted parameter.
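A sketch of the grubby usage for that (inspect first; removing from ALL entries assumes no entry should keep a resume= pointing at the old swap):

grubby --info=ALL | grep -E '^kernel|resume'        # list each boot entry and any resume= argument
grubby --update-kernel=ALL --remove-args="resume"   # strip the stale resume= parameter from every entry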


Thanks to all of the people who replied. The uuid was in BOTH locations, as the parameter “resume=…”. I had a bit of trouble figuring out why grubby wouldn’t work until I discovered that “uname -r” was returning the wrong kernel version; it seemed quite old. I typed in the correct id and removed the resume option, did a power off and restart, and the server rebooted three times and then came up properly. Now I know much more about the boot process than I ever wanted to! Thanks again.
