Hi everyone,
We are currently migrating our large physical database servers to a new VMware infrastructure with multiple new Rocky 9 VMs. Most of the VMs are stable, but we have seen unexplained filesystem crashes on, so far, two different VMs running on different ESXi servers. These crashes have only ever appeared on the new VM database servers, never on any other server.
Current setup (the same issues also appeared earlier with Rocky 9.3):
ESXi host: VMware ESXi 8.0.3, build 24022510
OS version: Rocky Linux 9.4 (Blue Onyx)
Kernel version: 5.14.0-427.22.1.el9_4.x86_64
open-vm-tools: 12.3.5.46049 (build-22544099)
CPU: AMD EPYC 9274F 24-Core Processor
Cores: 48 physical / 48 logical
RAM: 257 GB
We do not see any particular increase in resource usage, either in VMware or in our monitoring system.
These crashes occur quite frequently on one server, always somewhere between 6 and 10 days of uptime. No system logs are available, as the server seems to lose access to the / partition and, about 15 minutes later, to the /data partition. The server has already been cloned and recreated as a test, and also moved to a different ESXi host.
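One idea to at least capture the messages from the moment of the crash: forward syslog to another machine, since an already-running rsyslogd has a decent chance of still sending over the network even when / becomes unreadable. A minimal sketch, assuming rsyslog is running on the VM; the collector address 192.0.2.10 is a placeholder:

    # /etc/rsyslog.d/90-remote.conf  (collector address is a placeholder)
    # forward everything via TCP to a log collector on another machine
    *.* @@192.0.2.10:514

    # then reload rsyslog
    systemctl restart rsyslog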
The ESXi log shows these lines:
scsi0:0: aborting cmd 0x335
[...]
scsi0:1: aborting cmd 0x3fa
Unfortunately, Broadcom support has not been able to give us any answer regarding these problems.
Console output on the screen of the crashed machine: see the attached screenshot.
The servers are connected to a Fibre Channel storage solution via the ESXi hosts. The same storage is used by a lot of servers with multiple OSes (Rocky 9, SL 6.4, as well as some Alma 8) without any issues on any of those other servers.
Has anyone ever experienced something like this? These are some of our most important servers; a lot of jobs run on them outside office hours, when nobody is available to restart a crashed server. We are really running out of ideas at this point.
On the face of it, this looks more like a "hardware" error, as if someone unplugged the disk(s).
The screenshot might not be that helpful; it shows things "after the fact". What you really want is the hardware log, e.g. the log of the storage system showing the disk going offline.
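If you can get at the VMware side, the vmkernel log on the ESXi host the VM was running on is worth a look too, filtered by the device ID of the affected LUN. Just a sketch; the naa ID is a placeholder:

    # on the ESXi host, around the time of the crash
    grep naa.600a0b80deadbeef /var/log/vmkernel.log | less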
I would agree with that, if it weren't one storage system serving 100 machines, and only the new database servers among them currently have these problems.
The two servers where it actually happened are the ones with the biggest filesystems, though.
We have found some bug reports about VMFS 6 datastores with devices larger than 2.5 TB, which would be the case for some of these servers. VMware/Broadcom doesn't say anything, though; after 8 weeks they start again from the beginning and don't seem to understand their own bug reports.
I half have a mind to use two 2 TB virtual disks instead, create the LVM volume across both disks, and just hope for the best (rough sketch below).
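Roughly what I mean, as a minimal sketch; /dev/sdb, /dev/sdc and the volume group / logical volume names are placeholders, XFS simply because it is the Rocky default:

    # combine two 2 TB virtual disks into one logical volume for /data
    pvcreate /dev/sdb /dev/sdc
    vgcreate data_vg /dev/sdb /dev/sdc
    lvcreate -l 100%FREE -n data_lv data_vg
    mkfs.xfs /dev/data_vg/data_lv
    mount /dev/data_vg/data_lv /data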
Another idea.
You say you can’t access the system logs at the point of the crash, because the filesystems stopped working (makes sense).
BUT.
What happens if you reboot the crashed server?
Do the old filesystems come back?
If so:
Do they pass filesystem integrity checks?
Do they have the data you expect to see on them?
If so:
You should be able to obtain the system logs by going back in time to the point of the crash and seeing what the log says (sketch below).
If the filesystems are gone completely and the server can't reboot, then what does VMware say about the missing filesystems? They can't just be gone.
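A rough sketch of the checks I mean for the case where the filesystems do come back; the device path is a placeholder, XFS is assumed since that is the Rocky default, and the journal part only helps if the journal is persistent (otherwise check /var/log/messages, which rsyslog writes if it is running):

    # read-only integrity check of the recovered /data filesystem (device is a placeholder)
    umount /data
    xfs_repair -n /dev/sdb1

    # logs from the previous boot, i.e. the one that crashed
    journalctl --list-boots
    journalctl -b -1 -p warning

    # or grep the classic syslog files around the crash time
    grep -iE "scsi|i/o error" /var/log/messages*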
Really, you need to forget about the OS, go into the VMware storage logging, and then into the storage vendor's own logs. You should also check the exact mapping from the OS to VMware, and then from VMware to the storage system, i.e. note every fibre connection and every device ID all the way to the real disk (a sketch of the commands is below). Also try connecting the same server to different storage and then try to make it crash.
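For the mapping itself, something along these lines; only a sketch, and the exact output and namespaces depend on the versions involved:

    # inside the guest: which virtual SCSI targets (e.g. scsi0:0, scsi0:1) back / and /data
    lsblk -o NAME,HCTL,SIZE,SERIAL,MOUNTPOINT
    lsscsi

    # on the ESXi host (SSH): datastore extents, LUNs and FC paths down to the array
    esxcli storage vmfs extent list
    esxcli storage core device list
    esxcli storage core path list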