[RFO] [2023-04-17] Public-facing services outage

On April 17, 2023, we performed planned maintenance that was intended to be transparent to users. As part of our ongoing work to stand up new infrastructure, we needed to migrate the servers in our FreeIPA domain (which provides DNS and authentication for our internal and external services) from Rocky Linux 8 to 9.

The plan was to essentially:

  • Remove one node
  • Add a new node, configured as needed via Ansible
  • Repeat the above until all nodes are migrated

As FreeIPA handles the internal DNS and responds to internal requests from our external services where necessary, this should have been fairly routine. Most internal services do not notice IPA servers coming and going, because the SRV records are updated and clients carry on as normal.
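To illustrate why SRV-based clients tolerate this kind of node replacement, here is a minimal Python sketch of how a client might order the servers it learns from SRV records. The hostnames, the `SrvRecord` type, and the `order_srv_targets` helper are hypothetical and are not taken from our configuration. The ordering is also simplified: RFC 2782 calls for weighted random selection within a priority, while this sketch sorts deterministically.

```python
from typing import NamedTuple


class SrvRecord(NamedTuple):
    """The four fields carried by a DNS SRV record."""
    priority: int
    weight: int
    port: int
    target: str


def order_srv_targets(records: list[SrvRecord]) -> list[str]:
    """Return "host:port" strings in the order a simple client would try them.

    Lower priority values are tried first. Within one priority this sketch
    prefers higher weights and breaks ties by name; a full RFC 2782 client
    would pick randomly in proportion to weight instead.
    """
    ranked = sorted(records, key=lambda r: (r.priority, -r.weight, r.target))
    return [f"{r.target}:{r.port}" for r in ranked]


# Hypothetical hostnames, for illustration only.
before = [
    SrvRecord(0, 100, 389, "ipa-old1.example.internal"),
    SrvRecord(0, 100, 389, "ipa-old2.example.internal"),
]
after = [
    SrvRecord(0, 100, 389, "ipa-old2.example.internal"),
    SrvRecord(0, 100, 389, "ipa-new1.example.internal"),
]

print(order_srv_targets(before))
print(order_srv_targets(after))
```

A client that re-resolves the SRV records simply sees the new list and moves on. Anything that pins server addresses statically does not get this benefit.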

The process is mostly transparent when following the documented procedure for FreeIPA migrations between major releases of Enterprise Linux. However, during this work, our internal firewall and unbound DNS cache were not updated to reflect the new IPA systems as we added and removed nodes. This is where the problems started, and they eventually cascaded to our external services. The one remaining Rocky Linux 8 node was the only server still responding to requests from our haproxy and unbound. As a result, the following issues occurred:

  • The CDN would fall back to the appropriate parameters to ensure mirror manager would still respond with at least one mirror (dl.rockylinux.org)
  • The CDN would briefly detect our services as being back up
  • Some users would succeed, while others would time out
  • Anyone hitting mirrors.rockylinux.org would eventually time out
  • The CDN would detect our services as down again and try to fall back
  • The above would loop endlessly
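The underlying cause, a resolver and firewall that still listed removed IPA nodes and did not list the new ones, is a form of configuration drift. The Python sketch below shows one way such drift could be detected. It is an illustration under stated assumptions, not the check or correction we actually deployed: the function names are hypothetical, and the addresses come from the 192.0.2.0/24 documentation range.

```python
def forwarder_drift(configured: set[str], active: set[str]) -> dict[str, list[str]]:
    """Compare the addresses a resolver forwards to against the live IPA nodes.

    "stale" lists configured addresses that no longer belong to a live node.
    "missing" lists live nodes the resolver has not been told about.
    """
    return {
        "stale": sorted(configured - active),
        "missing": sorted(active - configured),
    }


def usable_forwarders(configured: set[str], active: set[str]) -> list[str]:
    """Return the configured addresses that can still answer queries."""
    return sorted(configured & active)


# Hypothetical state partway through a rolling migration:
# two old nodes removed, two new nodes added, resolver config untouched.
configured = {"192.0.2.11", "192.0.2.12", "192.0.2.13"}
active = {"192.0.2.13", "192.0.2.21", "192.0.2.22"}

print(forwarder_drift(configured, active))
print(usable_forwarders(configured, active))
```

In this hypothetical state only one configured address is still usable, which mirrors the situation above where a single remaining node was answering every request.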

This unfortunately also prevented us from logging in to our VPN by normal means. The appropriate infrastructure contacts were notified to help us gain access and fix the internal resolver so that all services could be brought back online. Once all services were back online, we migrated the final Rocky Linux 8 node to 9 without further impact to our infrastructure or our users.

Corrections have been made to the infrastructure configuration to provide better fault tolerance in the case of DNS outages like this.