Yesterday, on September 29th, and again this morning on the 30th, there were a total of three separate incidents of instability/outage on the mirrorlist services, affecting mirrors.rockylinux.org.
The first of these was caused by the expiration of Let’s Encrypt’s DST Root CA X3. The intermediate certificate used by the proxy middleware in front of our services expired at 2021-09-29T19:21:40Z. This affected most of our services, which are currently routed through the same middleware. The outage lasted approximately 35 minutes, and access to mirror services and downloads was restored by 2021-09-29T19:55:00Z. This event was not detected by our present monitoring (more on this below).
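For operators wanting to catch this class of failure before it bites, a minimal sketch using standard openssl tooling (the `check_cert_expiry` helper is hypothetical, not part of our stack):

```shell
# Minimal sketch: warn when a certificate file expires within N days.
# Assumes openssl is installed; check_cert_expiry is a hypothetical helper.
check_cert_expiry() {
  local cert="$1" days="${2:-14}"
  # openssl x509 -checkend returns non-zero if the certificate
  # expires within the given number of seconds
  if openssl x509 -in "$cert" -noout -checkend "$((days * 86400))"; then
    echo "OK: $cert valid for at least $days more days"
  else
    echo "WARN: $cert expires within $days days"
    return 1
  fi
}

# To see the chain (and expiry dates) a live server actually presents:
#   openssl s_client -connect mirrors.rockylinux.org:443 -showcerts </dev/null
```

A check like this, run periodically against the certificates the proxy actually serves, would have flagged the expiring intermediate well ahead of time.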
The second and third outages are closely related and occurred around 2021-09-30T08:13:00Z and again at 2021-09-30T12:40:00Z.
The first of this set of incidents was caused by an increase in the frequency at which mirrors are crawled by the mirror host. Though this change initially seemed stable when it was deployed last night, overnight the system ran out of memory and swap space, causing it to become unresponsive. This server was the only system serving the mirrorlist service, making it a single point of failure. While the Rocky Linux mirrorlist service is behind a content delivery network, caching was disabled for mirrorlist URLs because it is difficult to know when cached responses should be expired. This incident was resolved overnight around 2021-09-30T09:38:00Z by the Infrastructure team.
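In hindsight, one cheap guardrail for this failure mode is a memory cap on the crawler process itself, so the kernel reins in the crawler before it drags the whole mirrorlist host down. A sketch as a systemd drop-in (the unit name `mirror-crawler.service` and the limits are placeholders for illustration, not our actual configuration):

```
# /etc/systemd/system/mirror-crawler.service.d/memory.conf
# (hypothetical unit name and limits, shown for illustration only)
[Service]
MemoryHigh=3G       # soft limit: throttle and reclaim above this
MemoryMax=4G        # hard cap: the crawler is killed before the host is
Restart=on-failure  # restart the crawler rather than leave it down
```

With a cap like this in place, the worst case is a restarted crawler rather than an unresponsive host.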
It was initially unclear whether the problem was widespread, possibly due to the time of day and limited visibility across all platforms. Once the issue was determined to be widespread, it was swiftly tracked down and mitigated by the Infrastructure team.
The final outage was due to a host reboot that was necessary to increase the resources on the server. The reboot and resize took approximately 5 minutes.
All times in UTC
- 2021-09-29 19:21 - Certificate expires for Mirror and Download frontend services due to Let’s Encrypt’s Root CA expiration process (Still under investigation for root cause)
- 2021-09-29 19:22 - Issue is noticed and Infrastructure team begins working to identify cause.
- 2021-09-29 19:30 - Cause is identified and a workaround is attempted and verified for chat.rockylinux.org. Download and Mirror services are next.
- 2021-09-29 19:55 - Download and Mirror services are fully operational and validated to be online and servicing requests. Incident 1 Complete
- 2021-09-30 08:10 - Mirrorlist server runs out of memory and swap space, causing it to be unable to service requests.
- 2021-09-30 08:13 - Chats, posts, and notifications begin coming in from users reporting that the service is unstable/unresponsive. Team members raise flags with the Infrastructure team
- 2021-09-30 08:30 - Infrastructure team begins investigating initial reports and tracking down source of problem
- 2021-09-30 09:10 - Infrastructure team works to restore services, ultimately the server needs to be rebooted as it is unresponsive
- 2021-09-30 09:30 - Service is restored
- 2021-09-30 12:40 - Mirrorlist server is shut down to resize it, as memory usage is growing again
- 2021-09-30 12:45 - Mirrorlist server is back online
As mentioned earlier, none of these incidents were detected by monitoring; they were instead reported in chat (Mattermost and IRC), on the forums, and on Reddit and Twitter by our community. This is genuinely appreciated, and we are grateful that this obvious hole in our monitoring has been identified.
This event has also underscored the importance of giving community team leads clear methods for escalating problems such as these between teams, and of making them feel empowered to wake someone up when something is going wrong.
Work is underway on integrating some tooling to help with this, but in the meantime, the following immediate actions have been taken or are planned:
- Mirrorlist server resources increased
- Investigate and fix pathways for host status alerts to generate page-out alerts
- Content Delivery Network configuration will be modified to serve stale content for a period of time in the event of backend errors. Care will be taken to ensure caching is performed in a manner which maintains uniqueness per netblock, ASN, query string parameter, etc.
- Establish escalation chains and clear, documented pathways for team leads to get hold of someone to fix active problems with infrastructure when there are any (though we’ll try to keep these to a minimum if we can!)
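The stale-serving behavior described above maps onto standard HTTP cache semantics (RFC 5861). A sketch of the kind of response header the mirrorlist origin could emit (the lifetimes shown are placeholders, not our final values):

```
Cache-Control: s-maxage=300, stale-if-error=3600
```

Here `s-maxage` lets shared caches reuse a response briefly, and `stale-if-error` permits the CDN to keep serving that cached copy for up to an hour if the origin starts returning errors, which would have bridged an outage like this one.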
Team Lead - Infrastructure