POST hangs occasionally, forcing OS reinstall

I have experienced four occurrences of a POST hang, either at install or weeks after install. The servers are used for testing NVMe drives. The hang occurs on Supermicro and Dell servers, single and dual processor, all Intel.
Each server hangs during POST with the same messages, and the vendors point to either Intel or the server vendor.
POST messages:
1- x86/cpu: SGX disabled by BIOS.
2- sgx: There are zero EPC sections.

The server will not advance beyond this point, and recovery to date has required a Rocky reinstall.

My first thought would be either bad RAM or overheating.

Or maybe even bad RAM caused by overheating.

You could try running memtest overnight and see what that tells you.

I will run memtest.

Memtest passed with no issues on all servers, and no over-temperature condition was found.
I reinstalled the OS and will post again if I find a resolution.

Hummmmm. I hope you are right. I get the SAME THING… It also has to do with my NVMe drive (a 1 TB Corsair MP600, 4th Gen). The machine may be sitting idle in screen-locker mode, and when I come back to ocelot to work on it I am greeted by a scrolling screen of error messages. I try to reboot the machine, only now I am told that the NVMe drive does not exist!!!

I can usually get it working again by booting Knoppix 9.1, becoming su, then first running fdisk -l to see if it sees the drive (50% of the time it does), and then running gparted. If gparted sees the NVMe drive it is usually a safe bet, so I shut down Knoppix, remove it from the tray, then go into the BIOS to see if the NVMe drive is back. If it is, I simply go over to boot priority, select the NVMe drive, and away I go.

This “hanging” seems to happen ONLY with RL (or CentOS 8.x, or…). It does NOT happen with openSUSE 15.3 Leap, but that is installed on a separate SATA drive and not on my NVMe drive, which holds only RL 8.5. This could be a bug in RHEL 8.5 and its offspring, or it may have to do with the handling of NVMe drives. My buddy suggests that I run “Bonnie Plus” on the NVMe drive while I am running openSUSE 15.3, which is located on a SATA drive.
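
For what it's worth, the "does the system still see the drive" check can also be scripted from a running Linux system. This is only a minimal sketch, assuming the usual sysfs layout where NVMe namespaces appear under /sys/block with names like nvme0n1; the function name nvme_namespaces is made up for illustration.

```python
from pathlib import Path


def nvme_namespaces(sys_block: str = "/sys/block") -> list[str]:
    """Return the NVMe block devices the kernel currently knows about.

    Assumes NVMe namespaces are listed under `sys_block` with names
    starting with "nvme" (for example nvme0n1).
    """
    root = Path(sys_block)
    if not root.is_dir():
        return []
    return sorted(p.name for p in root.iterdir() if p.name.startswith("nvme"))


if __name__ == "__main__":
    found = nvme_namespaces()
    print("NVMe devices visible:", ", ".join(found) if found else "none")
```

An empty result after the drive has "disappeared" would at least confirm that the kernel, and not just the BIOS boot menu, has lost sight of it.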

I wonder if that drive somehow goes to sleep after an idle period and never comes back again.

Maybe openSUSE does something that gives it a kick once in a while and prevents it from going to sleep that way.

Maybe there’s a firmware setting to tell it not to do that? Or maybe you could set up a cron job to read from or write to it every so often to keep it alive?
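
A rough sketch of what such a keep-alive job might look like, in case it helps. It rewrites a tiny timestamp file and calls fsync so the write actually reaches the device instead of sitting in the page cache. The function name touch_drive and the file paths are placeholders, and whether this really keeps the drive from dropping out is untested.

```python
import os
import sys
import time


def touch_drive(path: str) -> int:
    """Rewrite a small timestamp file at `path` and force it to the device.

    Returns the number of bytes written.
    """
    payload = f"{time.time()}\n".encode()
    # O_TRUNC keeps the file tiny: every call overwrites the previous stamp.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        written = os.write(fd, payload)
        os.fsync(fd)  # flush past the page cache so the drive sees real I/O
    finally:
        os.close(fd)
    return written


# Only acts when exactly one path argument is given on the command line.
if __name__ == "__main__" and len(sys.argv) == 2:
    touch_drive(sys.argv[1])
```

A crontab entry such as `*/5 * * * * /usr/bin/python3 /usr/local/bin/nvme_keepalive.py /var/tmp/nvme-keepalive` (both paths hypothetical) would run it every five minutes. The target file has to live on a filesystem on the NVMe drive for this to do anything useful.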

Hummmmm. One thing I can try is to install RL on a spare SATA drive and see if the problem replicates itself. If it does not… then the problem is with the NVMe drive. It could be that 4th Gen NVMe drives like the Corsair MP600 are simply “Too NEW”. First I have to FIND a spare SATA drive. Since the move to my TEMPORARY digs I’m lucky if I can find ANYTHING. It took 2 weeks to find a pair of sunglasses I had been searching for.

Thank you for the suggestions. Right now I’m running on openSUSE; after another few days I’ll try to reboot into RL and see if it will come up or if it has once again disappeared from the BIOS. The other thing is, as you just suggested, it could be a firmware problem that is restricted to the Corsair MP600. Maybe for my 70th birthday I can get a cheap @$$ NVMe drive, stick it into the 2nd NVMe slot, and see if I still have the same problem. The other thing I could do is simply update the BIOS, as I know there have been at least 2-3 updates since I updated it sometime last year.

Again, Thank You for your suggestions. If nothing else it “kick-started” my brain.

D’ Cat

It occurs to me that just doing an occasional read from the drive won’t accomplish your goal. The data will just be cached and it won’t hit the drive again after that.

I think you’ll have to do an occasional write to the drive in order to get some action out of it, or maybe you could just read a random sector on the drive. (Do modern drives even have sectors any more?)
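
Here is a minimal sketch of the random-read idea, under a couple of assumptions. It picks a random block-aligned offset and first asks the kernel, via posix_fadvise, to drop any cached pages for that range so the read is more likely to reach the device. That call is only advisory; opening with O_DIRECT would be stricter but needs aligned buffers. The 4096-byte block size and the name read_random_block are assumptions for illustration.

```python
import os
import random

BLOCK = 4096  # assumed read size; adjust to the drive's logical block size


def read_random_block(path: str, block: int = BLOCK) -> int:
    """Read one block from a random aligned offset of `path`.

    `path` may be a regular file or a block device (the latter needs root).
    Returns the number of bytes read.
    """
    fd = os.open(path, os.O_RDONLY)
    try:
        size = os.lseek(fd, 0, os.SEEK_END)  # total size in bytes
        if size < block:
            raise ValueError("target is smaller than one block")
        offset = random.randrange(size // block) * block
        # Advisory hint: discard cached pages for this range before reading.
        os.posix_fadvise(fd, offset, block, os.POSIX_FADV_DONTNEED)
        data = os.pread(fd, block, offset)
    finally:
        os.close(fd)
    return len(data)
```

Because the offset changes on every call, repeated runs are unlikely to keep landing on the same cached data even if the hint is ignored.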

I know spinning rust does, but SATA and “memory” drives?!? Not sure, though there has to be some way they access data.

"Toto, I don’t think we are in Kansas anymore."

D’ Cat