Production Risk: GCP VMs with Rocky Linux 9.5 Shutting Down Unexpectedly

We recently migrated a few GCP instances to Rocky Linux 9.5 (Blue Onyx) and are facing a critical issue: the instances shut down on their own. After restarting, some instances fail to boot due to filesystem check (fsck) errors. To recover, I’ve had to attach the boot disk to another VM, repair the filesystem manually, remount it, and then restart the original instance.
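For reference, the recovery steps look roughly like this (instance, disk, and zone names below are placeholders for ours):

```bash
# Detach the boot disk from the broken (stopped) instance and attach it to a
# healthy rescue VM as a secondary disk. Names and zone are placeholders.
gcloud compute instances detach-disk broken-vm --disk=broken-vm-boot --zone=us-central1-a
gcloud compute instances attach-disk rescue-vm --disk=broken-vm-boot --zone=us-central1-a

# On the rescue VM: find the damaged root partition and repair it.
# Our Rocky 9 images use XFS for the root filesystem, so xfs_repair; use
# e2fsck -fy instead if the disk is ext4. Adjust the device name to match lsblk.
lsblk
sudo xfs_repair /dev/sdb4

# Move the disk back and boot the original instance again.
gcloud compute instances detach-disk rescue-vm --disk=broken-vm-boot --zone=us-central1-a
gcloud compute instances attach-disk broken-vm --disk=broken-vm-boot --boot --zone=us-central1-a
gcloud compute instances start broken-vm --zone=us-central1-a
```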

The biggest concern is why the system shuts down in the first place. I don’t see any OOM kills or obvious crash traces in the logs. It just stops unexpectedly. I’m wondering if GCP is triggering a stop due to some setting or underlying issue.

I’ve been running CentOS 7 VMs on GCP for years without problems, but this is making me hesitant about using Rocky Linux 9 in production. I’m not even sure which logs would be useful to paste here, because nothing looks out of place.

Please advise—how can I prevent these VMs from shutting down? Is there something I should disable or configure differently in GCP or in Rocky Linux?

I assume you’re not just booting these instances and leaving them idle. You haven’t said what programs and/or jobs you have them running.

Obviously, the first place to look for a problem is in the software that you’re running on the base operating system.

Thanks for the response. To clarify, the application we’re running has been stable and unchanged for years on CentOS 7 VMs without any such issues. We’re not doing anything that would explicitly trigger a system shutdown (no shutdown, reboot, or related commands in our code or scripts), and the application runs with standard user privileges — not root.

We’ve also reviewed our system logs (journalctl, /var/log/messages, serial console output on GCP), and there’s no trace of a crash, OOM, kernel panic, or shutdown command. The system just stops, and after reboot some instances don’t come back due to filesystem corruption that requires manual fsck repair.
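For completeness, this is roughly what I’ve been checking after each incident (instance name and zone are placeholders):

```bash
# List recorded boots, then read the tail of the previous boot for any sign of
# a clean shutdown, ACPI power event, kernel panic, or OOM kill.
journalctl --list-boots
journalctl -b -1 -e
journalctl -b -1 | grep -iE 'shutdown|power|acpi|panic|oom'

# Serial console output that GCP captured for the instance.
gcloud compute instances get-serial-port-output my-vm --zone=us-central1-a
```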

This feels more like a kernel-level crash, storage layer failure, or some system component behaving differently in Rocky 9.5 — especially since Rocky uses a newer kernel/systemd version compared to CentOS 7.

Is there any known issue with system stability on Rocky Linux 9.5 on GCP (especially around disk I/O, ACPI events, or default kernel/systemd behavior)? Any additional diagnostics you’d recommend?

How much CPU and RAM are allocated to the instance?

The combination of nothing in the logs and filesystem corruption almost sounds like a power-off by GCP (the host), so it would be good to get host logs. Do we know what kind of hypervisor GCP runs: VMware, something else? Can you ask them for the logs? Is there a web interface where you can deliberately force a power-off without a clean shutdown?
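I’m not a GCP expert, but the Cloud Audit system-event log is supposed to record host-initiated actions (host errors, live migrations, preemptions), so something along these lines might show whether the platform stopped the instance. Treat the filter as a starting point rather than gospel; the instance name is a placeholder:

```bash
# Host-side events for a given instance over the last week.
gcloud logging read '
  logName:"cloudaudit.googleapis.com%2Fsystem_event"
  resource.type="gce_instance"
  protoPayload.resourceName:"broken-vm"
' --freshness=7d --format=json

# On my last question: a hard power cycle with no clean shutdown can apparently
# be forced from the CLI (there is an equivalent Reset button in the console).
gcloud compute instances reset broken-vm --zone=us-central1-a
```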

@vadirajks I just happened to see another post from you. I don’t know whether it’s related, but I suspect it is: your application is Java-based and you used a customized RPM to install the JDK from Oracle. Maybe that’s where you need to start checking, whether the app is causing the crash or the Java VM is requesting the paths you mentioned in the other post.

Just throwing ideas here :sweat_smile:

Yes, I think that’s a good spot to look at. If these nodes are part of some sort of auto-scaling group (or whatever Google calls that), they may be failing health checks and being replaced automatically. The logs in Google should be fairly descriptive about this if that’s what is happening.
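If they are in a managed instance group, something along these lines should show whether the group has been recreating instances (group name and zone are placeholders):

```bash
# Recent errors and the current action the group is taking on each instance.
gcloud compute instance-groups managed list-errors my-mig --zone=us-central1-a
gcloud compute instance-groups managed list-instances my-mig --zone=us-central1-a
```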

My Rocky 9 crash is different, but may be relevant: running Rocky 9.5 natively on my Lenovo laptop, it runs fine for days and then reboots with no warning. Like the OP, I find no error messages in the logs. For a while it only happened during Zoom calls in Firefox, but the latest time I was writing an email in Evolution, with nothing else running except (maybe) background stuff in the MATE GUI. My point is: if this is the same problem the OP has, it suggests the Rocky kernel, not GCP or the OP’s software, is the place to look. I will watch this thread for any suggestions on how to debug it.
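In the meantime, this is what I’m setting up to catch the next one. As far as I know both pieces are available on stock Rocky 9, but double-check the package and service names:

```bash
# Keep the journal across reboots so messages from the crashed boot survive.
sudo mkdir -p /var/log/journal
sudo systemctl restart systemd-journald

# Enable kdump so a kernel panic leaves a vmcore under /var/crash for analysis.
sudo dnf install -y kexec-tools
sudo systemctl enable --now kdump

# If no crashkernel= reservation is on the kernel command line yet, kexec-tools
# ships a helper to set it (takes effect after a reboot).
sudo kdumpctl reset-crashkernel
```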

Probably better to make a new thread.