"I have a concern about Rocky Linux’s kernel management policy. The distribution automatically removes old kernels when new ones are released, citing security vulnerabilities in older versions. However, this creates a significant challenge in HPC environments.
In our HPC cluster, critical components like parallel filesystems and high-speed network drivers are tightly coupled with specific kernel versions. Rocky Linux’s approach means I would need to reprovision our cluster of 400+ compute nodes every 3-5 months, resulting in downtime and reduced computing capacity.
I’m interested in learning how other HPC administrators who use Rocky Linux address this issue. What strategies do they employ to balance security updates with system stability and uptime requirements?"
"The distribution automatically removes old kernels when new ones are released, citing security vulnerabilities in older versions. However, this creates a significant challenge in HPC environments."
Which part is the challenge for you?
1. that RL removes old kernels from RPM repositories (so you can’t provision new nodes using existing kernels), or
2. that you want to be on top of security updates for the kernel?
If 1., and the kernel packages are your only restriction, you can rsync kernel packages (binary and source) from the RPM repos (basically collecting all kernel versions that were ever released) and use createrepo to build an internal RPM repository on top of that. This lets you keep the kernel versions you need well after they have been removed by RL. (Context: not an HPC shop, but we do this for all RL {source,x64} packages to maintain reproducible, version-locked VM images well after the RPM packages disappear from the official RL repos.)
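A minimal sketch of that approach, assuming a mirror that exposes rsync; the mirror URL and local paths are placeholders, not official values:

```bash
#!/usr/bin/env bash
# Accumulate kernel RPMs locally so they outlive the official repos.
set -euo pipefail

# Placeholder mirror URL; Rocky 9 shards Packages/ by first letter.
MIRROR="rsync://mirror.example.org/rocky/9/BaseOS/x86_64/os/Packages/k/"
REPO="/srv/repos/rocky9-kernels"

mkdir -p "$REPO/Packages"

# Copy only kernel packages. No --delete, so versions that vanish
# upstream are kept here indefinitely.
rsync -av --include='kernel*' --exclude='*' "$MIRROR" "$REPO/Packages/"

# Rebuild the repository metadata over everything collected so far.
createrepo_c --update "$REPO"
```

Run it on a schedule and point your provisioning tooling’s repo definitions at the resulting directory.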
If 2., I don’t see how that’s Rocky-related; all distros have the same issue.
"I’m interested in learning how other HPC administrators who use Rocky Linux address this issue. What strategies do they employ to balance security updates with system stability and uptime requirements?"
Again, not an HPC shop, but we have related requirements with respect to stability, security, and uptime:
For stability, we do rigorous testing in separate environments to root out regressions.
For security, we scan our deployments for vulnerable packages and use that as input for deciding when to upgrade (see the sketch after this list).
For uptime, we automated the provisioning to make updates fast across our entire fleet.
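For instance, on EL-family hosts that scan can lean on dnf’s updateinfo data; the host names below are placeholders:

```bash
# List pending security advisories per node; non-empty output is the
# input for deciding when to schedule an upgrade.
for host in node001 node002; do
    echo "== $host =="
    ssh "$host" 'dnf -q updateinfo list --security'
done
```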
On an HPC cluster one can presumably reserve an individual node for maintenance, deploy updates, and immediately return the node to production. Automation ought to cut the delays from that process. Yes, every node will be down occasionally, but that won’t stop the entire cluster unless a job needs every node.
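As a sketch, assuming Slurm as the scheduler (the node name and update commands are placeholders), that cycle could look like:

```bash
NODE=node001

# Drain: let running jobs finish, accept no new ones.
scontrol update NodeName="$NODE" State=DRAIN Reason="kernel update"

# Wait until the node is fully drained, then update and reboot it.
while [ "$(sinfo -h -n "$NODE" -o '%T')" != "drained" ]; do sleep 60; done
ssh "$NODE" 'dnf -y update && reboot'

# Return the node to production once it is back up.
scontrol update NodeName="$NODE" State=RESUME
```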
There might be related packages beyond the kernel, so mirroring everything into a local repository is the trivial way. If one cannot afford downtime, then one must be able to afford storage for local repos.
Yes. When you have shown that you can deploy to test hosts without errors, then you can deploy to the production cluster.
My problem is 1.:
RL removes old kernels from the RPM repositories. This doesn’t create any issue on modern clusters because of the golden-image / container logic, but some centers are still using the Foreman / Ansible pair to provision nodes. Whenever the kernel is old and has been moved from pub to vault, manual intervention is needed.
Yes, creating a local repository and pointing Foreman at it would fix this. I just wanted to be sure that this is the only way for these kinds of systems before starting work on it.
Thank you for your answers.
Yes, when a new Rocky version is released, e.g. when 9.4 was released, 9.3 was moved to the vault. This is normal for Rocky Linux, since Rocky doesn’t offer what RHEL has with version pinning, etc.
You can just adjust the URLs in any Ansible playbook, or whatever is being used to deploy the servers, and point them at the vault URLs. Or maintain your own Foreman server with local repos that hold all the packages, and bootstrap from that. Although that can also suffer the same situation when packages are removed from the Rocky repositories, since the older ones could then disappear; in that case, for previous versions, you’ll probably need to point at the vault URLs in Foreman as well.
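As a hedged example, pointing a node that must stay on 9.3 at the vault can be a single repo definition; verify the exact vault path for your release, 9.3 here is only illustrative:

```bash
# Add a repo backed by the vault so 9.3-era packages stay installable.
dnf config-manager --add-repo \
    https://dl.rockylinux.org/vault/rocky/9.3/BaseOS/x86_64/os/
```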
Obviously it’s easier when using RHEL and version pinning, but then that requires a paid subscription.
Here is what I did: I created a local repo by pulling the official 9.4 ISO and using reposync for the latest devel packages, including the latest 9.4 kernel updates. I then removed all the kernel-related updates from the devel repo so that we stay locked to the 9.4 ISO kernel, updated the repodata, and used xCAT to create the stateless/statelite image with the devel updates, filesystem RPMs, and InfiniBand drivers in it. The image then stays for about half a year. Since the kernel updates almost every week with minor patches, we can’t keep up with that, as the filesystem and InfiniBand driver updates are presumably slower than the kernel (the LTS might be even slower). Next time we build an updated image, say for 9.5, we follow the above scheme. Userland updates usually work.
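A rough sketch of that repo build; the repo ids and paths below are my assumptions, not the poster’s exact commands:

```bash
DEVEL_REPO=/srv/repos/rocky94-devel   # updates repo, minus kernels

# Pull the current updates next to the repo built from the 9.4 ISO.
dnf reposync --repoid=baseos --repoid=appstream \
    --download-path="$DEVEL_REPO" --newest-only

# Drop kernel-related updates so images stay on the ISO kernel.
find "$DEVEL_REPO" -name 'kernel*.rpm' -delete

# Rebuild the metadata so the removed packages leave the repodata too.
createrepo_c --update "$DEVEL_REPO"
```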
Unfortunately, I can’t use xCAT in some infrastructures and didn’t want to spend the time maintaining a local repository.
What I did instead is add the following commands to the kickstart %post section, so that the required kernel and packages get installed.
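Those commands aren’t reproduced above. Purely as a hypothetical illustration of the idea (installing a pinned kernel from the vault inside %post), with a placeholder URL, kernel version, and pinning method:

```bash
# %post sketch (hypothetical, not the poster's actual commands).
# Install a specific kernel from the vault inside the %post chroot;
# import the Rocky GPG key and enable gpgcheck in production.
dnf -y --repofrompath=vault,https://dl.rockylinux.org/vault/rocky/9.3/BaseOS/x86_64/os/ \
    --enablerepo=vault --setopt=vault.gpgcheck=0 \
    install kernel-5.14.0-362.24.1.el9_3   # placeholder version

# Keep later "dnf update" runs from replacing the pinned kernel.
echo 'exclude=kernel*' >> /etc/dnf/dnf.conf
```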