I’d like to open up the monitoring question to the floor. I’ll add that I do not include Logging in this, as I see that as a separate problem that needs a separate solution.
Personally I would run as far away from anything that even resembles/includes Nagios as I possibly can. We have a collective Stockholm syndrome with that piece of junk and it should just burn. Zabbix is maybe marginally better, but not by much. My suggestion would be Prometheus.
Prometheus implies a certain number of other tools:
Consul (or some other form of Service Discovery)
Prometheus itself
The exporters
Grafana for visualisation
Long-term storage - Thanos seems to have the community mind share
Alertmanager connected to something to send the alerts - perhaps approach PagerDuty, or similar, for sponsorship
I believe there are mature Ansible roles for all of that.
I agree with Prometheus, though going full service discovery right at the start sounds like overengineering. Ansible can probably configure the list of scraped endpoints just as well.
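To illustrate the static approach: Prometheus can read its targets from a JSON file via `file_sd_configs`, so whatever manages the inventory only has to write that file. A rough sketch of the idea in Python rather than Ansible, assuming node_exporter on its default port 9100; the host names, the `env` label, and the `build_file_sd` helper are made up for illustration:

```python
import json


def build_file_sd(hosts, port=9100, labels=None):
    """Build a file_sd document: a list of target groups,
    each with "targets" (host:port strings) and optional "labels"."""
    targets = [f"{host}:{port}" for host in sorted(hosts)]
    return [{"targets": targets, "labels": dict(labels or {})}]


# Hypothetical inventory; in practice this would come from the Ansible inventory.
hosts = ["mirror01.example.org", "build01.example.org"]
groups = build_file_sd(hosts, labels={"env": "production"})

# The JSON below is what a file referenced from file_sd_configs would contain.
print(json.dumps(groups, indent=2))
```

Prometheus re-reads such files when they change, so adding a host is just a matter of rewriting the file, no restart and no Consul needed at this stage.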
More importantly, Prometheus gives metrics in addition to just monitoring, and that will help with all kinds of things.
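For anyone who has not looked at it yet: what an exporter exposes is plain text with one numeric sample per line, which is why the same data works for graphs and trends as well as for up/down alerts. A toy sketch of that text format, with a made-up metric name and values, and no escaping of label values (a real exporter would use a client library instead):

```python
def render_gauge(name, help_text, samples):
    """Render one gauge in the Prometheus text exposition format.

    samples maps a tuple of (label_name, label_value) pairs to a number.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} gauge"]
    for labels, value in samples.items():
        label_str = ",".join(f'{k}="{v}"' for k, v in labels)
        lines.append(f"{name}{{{label_str}}} {value}")
    return "\n".join(lines) + "\n"


# Hypothetical metric: fraction of disk space used per mount point.
text = render_gauge(
    "disk_used_ratio",
    "Fraction of disk space in use",
    {
        (("mountpoint", "/"),): 0.42,
        (("mountpoint", "/var"),): 0.87,
    },
)
print(text)
```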
Anyone ever heard of checkmk? It's based in Germany but gaining quite a bit of momentum.
It has a fully open source version, though there is also an enterprise version. If that might be interesting, I could reach out to them to ask for a sponsorship.
Personally I love that solution. It works really well for infrastructure monitoring, and it can also integrate with, for example, Prometheus and ntopng, just to name two.
Was suggesting Loki+Prometheus+Grafana as opposed to Elastic+Logstash+Kibana.
From my understanding, ELK is open core while LPG is open source. Or maybe I’m wrong. If I’m correct, we could use the full power of the tools on Rocky infra without legal restrictions.
I think later down the road, ELK could be more useful to us, but in the interest of sticking as close to FOSS as we can, Loki is likely a better solution.
We’ve had this discussion with countless other solutions. Obviously we favour open source software, but we can’t rule out proprietary solutions if they also provide the correct feature set and do it in a nicer way, which I would argue is the case for the Elastic stack.
LibreNMS works quite well for monitoring. It is a very actively maintained fork of the Observium NMS that is 100% free.
You can feed the data it collects into Prometheus, InfluxDB, etc. Plus, with the database backend, making nice status dashboards in things like Grafana is easy.
It will monitor services, can differentiate VMs running in VMware (and I believe other HVs too), has a bunch of different “plugins” and can be easily extended. And if you really want, you can always load up the Nagios monitoring plugins for service monitoring in LibreNMS, as it will run them directly.
SNMP is the main method for monitoring. It can also do just plain ICMP (ping) monitoring to see if something is up.
And the nice thing is there are a ton of different notification options available: Slack, email, Pushbullet, PagerDuty, etc. (too many to list here). A listing of “transports” for alerts can be found here: https://docs.librenms.org/Alerting/Transports/
I may be wrong, but don’t Observium and LibreNMS both specialize in network device monitoring? I mean, sure, you can monitor classical operating systems using SNMP, but you really don’t want to do that.
I think we need a more agnostic tool here.
Or maybe even specialized tools after all, as we will have all sorts of monitoring needs (application, infrastructure, performance).