NVMe Drives Go Out In The Weeds For No Reason

Well I’m back. I have a SMALL problem and hope you can guide me in fixing it: My NVMe drives – Corsair MP600 Gen4 PCIe – periodically go out in the weeds. What do I mean?? Occasionally, for no apparent reason, I am greeted by a screen spitting out an endless string of ERROR messages that keep on scrolling by without pause. If I REBOOT the machine one of two things happens:

  1. I can not even get into BIOS without removing the NVMe drives. Eventually I get to the point where it says “Press F1 to get into BIOS”. I do that, then hit F7, and I am at last on my way to BOOTING the machine. Only this time I am greeted by the following message: “No NVMe device/s found.”

  2. IF I am LUCKY the NVMe drive IS found, and then I am on my way… almost: Now a DIFFERENT problem occurs. I boot Rocky Linux 8.7, the Menu comes up, the Kernel (the most recent) comes up, and then comes the PROBLEM: The Rocky logo comes up, and the SPINNING CIRCLE comes up (normal prior to a Login Page) but it sits there SPINNING, and I NEVER get to the LOGIN PAGE!! If I wait until it times out, then I get back to a scrolling page of ERROR MESSAGES once again, and if I am unlucky I can’t even get into BIOS. (See 1).

Removing the drive/s and reinstalling one or both of them sometimes fixes the problem; sometimes it does not. Usually it does not. Usually what happens is that if I REBOOT the machine without waiting for it to time out, I have no problem with the BIOS and Rocky Linux 8.7 comes up, I get the MENU, and then I get the Rocky Logo, and the Spinning Circle… which keeps on spinning, and I never get to the Login Page.

The problem thus seems to occur BETWEEN the GRUB Menu and the LOGIN PAGE. I have looked everywhere for some config file but can’t find one. /boot yields nothing – at least to my eyes – and combing /etc shows nothing.

WORKING THEORY:

SPEED KILLS!! I do NOT have this problem with another copy of Rocky 8.7 that is located on a S-L-O-W SSD, only with the 8.7 copy on the NVMe Gen4 drive. INXI tells me that a NORMAL SSD runs at 6.0 Gb/s, while my NVMe Gen4 drive runs at 63.2 Gb/s – some 10 times faster. Thus some small ERROR gets replicated 10X faster, resulting in a cascading set of errors.

POSSIBLE FIX:

Since I know that the problem occurs between the GRUB Menu and the LOGIN Page (Where the Rocky Linux Logo appears with the Spinning Circle to be exact), the SOLUTION would be to FIND and COPY the CONFIG file that would allow you to LOGIN. Now WHERE that file is I do not have a clue. In the event this problem were to raise its ugly head again, the SOLUTION would be to boot the NORMAL SSD where the COPY of the file is located and then COPY OVER the CORRUPT COPY on the NVMe drive, then REBOOT the Machine and select the NVMe drive to BOOT from and IN THEORY the NVMe drive blazing away at 63.2 Gb/s would pop up.

If this is a Config file that periodically gets corrupted I need the NAME and WHERE IT IS LOCATED. Any help would be GREATLY APPRECIATED.

Thanks.

D’ Cat

P.S. Just in case someone is wondering, I am posting this from ocelot where the NVMe drive is located and – miracles of miracles – I was able to LOGIN to Rocky 8.7. I suspect that the next time I reboot the machine – or before – while the machine is sitting overnight – I’m going to get the scrolling ERROR MESSAGES, which will force me to shut down the machine, and take my chances. If I am LUCKY once again I’ll be able to REBOOT the machine, the BIOS will pop up, the NVMe drive will be found, and then it will get to the LOGIN PAGE one more time. Am I counting on that? NO!!

D’Cat

The reason you don’t get the login page is not a corrupt config file. Rather, you don’t see it because all the ERROR messages being generated cause high I/O and a slow response/timeout of the login screen. Since you waited and it bombed out showing the error messages, this generally hints at one of the following:

  1. Perhaps your NVMe drives need firmware updates?
  2. Perhaps there is a hardware problem with your NVMe drives?
  3. Perhaps the Rocky Linux 8.7 kernel doesn’t support that hardware because it is too new. Perhaps you need an up-to-date kernel, like the one provided by installing Rocky Linux 9, or the ELREPO kernels that give a 6.x kernel.
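
Before changing anything, it helps to capture what those scrolling errors actually say. A minimal sketch, assuming the messages come from the kernel and mention nvme or PCIe error reporting; the helper name nvme_errors and the grep pattern are my own choices, not anything Rocky ships:

```shell
# Hypothetical helper: keep only kernel log lines that look NVMe/PCIe related.
# Reads log text on stdin; "|| true" keeps it from failing when nothing matches.
nvme_errors() {
    grep -iE 'nvme|pcieport|aer:' || true
}

# Typical use (run as root on the affected install):
#   dmesg | nvme_errors
#   journalctl -k -b -1 --no-pager | nvme_errors
# The second form reads the previous boot and only works if the journal
# is persistent (i.e. /var/log/journal exists).
```

Posting a few of the matching lines here would make it much easier to tell a firmware problem from a kernel one.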

Rocky 8 has a 4.18 kernel.
Rocky 9 has a 5.14 kernel.
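
To confirm which kernel a given install is actually running, something like this works. The function name kernel_series is just for illustration, and the release strings in the comments only show the general format, not exact versions you should expect:

```shell
# Hypothetical helper: reduce a kernel release string to "major.minor".
# e.g. "4.18.0-425.el8.x86_64" -> "4.18"
kernel_series() {
    release="$1"
    major="${release%%.*}"   # text before the first dot
    rest="${release#*.}"     # text after the first dot
    minor="${rest%%.*}"      # text before the next dot
    printf '%s.%s\n' "$major" "$minor"
}

# Show the series of the kernel that is running right now.
kernel_series "$(uname -r)"
```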

You can debug it easily enough by:

  1. Installing Rocky 9 on this machine and seeing if it behaves better with a 5.14 kernel instead of Rocky 8 with the 4.18 kernel.
  2. Installing kernel-ml from ELREPO on the Rocky 8 machine that you have this issue with (probably far quicker than doing a new Rocky 9 install - assuming of course you can actually boot the machine and log in to it).

Also check to see if there are firmware updates for the Corsair NVMe drives that you have. If there are firmware updates available, then apply those updates to the drives - it may fix issues that you are experiencing.
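
To see which firmware the drives currently report, the nvme-cli package (dnf install nvme-cli) can be used. A rough sketch, assuming the first drive shows up as /dev/nvme0; the helper name firmware_rev is made up for this example, and it simply pulls the "fr" (firmware revision) field out of the id-ctrl output:

```shell
# Hypothetical helper: print the firmware revision from "nvme id-ctrl" output
# supplied on stdin. Matches only the "fr" field, not e.g. "frmw".
firmware_rev() {
    awk -F: '$1 ~ /^fr[ \t]*$/ {
        gsub(/^[ \t]+|[ \t]+$/, "", $2)   # trim surrounding whitespace
        print $2
    }'
}

# Typical use (as root; /dev/nvme0 is an assumption, check "nvme list" first):
#   nvme list
#   nvme id-ctrl /dev/nvme0 | firmware_rev
#   nvme smart-log /dev/nvme0     # health counters, media errors
#   nvme error-log /dev/nvme0     # errors recorded by the controller itself
```

Compare the reported revision against whatever Corsair lists for the MP600. If the three drives were bought at different times, they may well not all be on the same firmware.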

THANK YOU iwalker. At least I now have a few things to check out. I have 3 Corsair MP600 Gen4 PCIe NVMe Drives, all bought at different times, usually ON SALE (!!) One of the 3 is going to be dedicated to openSUSE 15.5 when it comes out in about 3 weeks; that leaves 2 drives to play with. Rocky Linux is going on ONE of them. I think my “other drive” (NOT this one) is going to play “guinea pig” and I’ll install Rocky Linux 9.1 (? or is it now 9.2?).

BTW since I already have 8.7 up AND RUNNING, now would be a good time to try and install “kernel-ml” from ELREPO. Is that kernel-ml with a lowercase letter L at the end, or an uppercase letter i, or the digit 1 (m1)? I will ASSUME that this is some kernel which can be found in the ElRepo directory, downloaded, and somehow installed? Yes?

But again THANK YOU for your help. This has been a problem that has been driving me CRAZY! At least I know where to START trying to solve the problem.

D’ Cat

On Rocky 8 you can do:

dnf install elrepo-release
dnf install kernel-ml --enablerepo=elrepo-kernel
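
After the install and a reboot, it is worth checking that you really ended up on the new kernel. A small sketch, assuming ELRepo kernels carry "elrepo" in their release string; the function name is_elrepo_kernel is only for illustration:

```shell
# Hypothetical helper: succeed if a kernel release string looks like an
# ELRepo build, e.g. something of the form "6.x.y-1.el8.elrepo.x86_64".
is_elrepo_kernel() {
    case "$1" in
        *elrepo*) return 0 ;;
        *)        return 1 ;;
    esac
}

if is_elrepo_kernel "$(uname -r)"; then
    echo "Running an ELRepo kernel: $(uname -r)"
else
    echo "Still on a stock kernel: $(uname -r)"
fi

# If the new kernel is not picked by default, select it in the GRUB menu,
# or inspect/change the default with grubby (as root):
#   grubby --default-kernel
#   grubby --set-default=/boot/vmlinuz-<version>
```

The old 4.18 kernel stays installed, so if kernel-ml misbehaves you can still pick the original entry from the GRUB menu.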