I’d like to open up the monitoring question to the floor. I’ll add that I do not include Logging in this, as I see that as a separate problem that needs a separate solution.
I’ve already put this in an issue over on GitHub (https://github.com/rocky-linux/infrastructure/issues/22#issuecomment-743741249), but I’ll repeat it here to get started.
Personally, I would run as far away as I possibly can from anything that even resembles or includes Nagios. We have a collective Stockholm syndrome with that piece of junk and it should just burn. Zabbix may be marginally better, but not by much. My suggestion would be Prometheus.
Prometheus implies a certain number of other tools:
- Consul (or some other form of Service Discovery)
- Prometheus itself
- The exporters
- Grafana for visualisation
- Long-term storage - Thanos seems to have the community mind share
- Alertmanager connected to something to send the alerts - perhaps approach PagerDuty, or similar, for sponsorship
I believe there are mature Ansible roles for all of that.
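For anyone who hasn't used Prometheus, here is a rough sketch of what "the exporters" boil down to: a small HTTP endpoint that serves plain-text metrics for Prometheus to scrape. This is only an illustration using the Python standard library. The metric name `demo_load1`, the port 9999, and the helper names are all made up for the example, and a real deployment would more likely use an existing exporter or the official client library.

```python
import os
from http.server import BaseHTTPRequestHandler, HTTPServer


def render_metrics(samples):
    """Render {name: (help, type, value)} as Prometheus text exposition format."""
    lines = []
    for name, (help_text, metric_type, value) in sorted(samples.items()):
        lines.append(f"# HELP {name} {help_text}")
        lines.append(f"# TYPE {name} {metric_type}")
        lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


def collect():
    """Gather current values; demo_load1 is a hypothetical metric name."""
    load1, _, _ = os.getloadavg()  # Unix-only standard library call
    return {"demo_load1": ("1-minute load average.", "gauge", load1)}


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Prometheus scrapes /metrics by default.
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(collect()).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=9999):
    """Serve metrics on localhost; the port is an arbitrary example value."""
    HTTPServer(("127.0.0.1", port), MetricsHandler).serve_forever()
```

Calling `serve()` and then fetching `http://127.0.0.1:9999/metrics` should return the `# HELP`, `# TYPE`, and sample lines. Prometheus would then be pointed at that address, either statically or through service discovery such as Consul.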