Monitoring Infrastructure

I’d like to open up the monitoring question to the floor. I’ll add that I do not include Logging in this, as I see that as a separate problem that needs a separate solution.

I’ve already put this in an issue over on GitHub (https://github.com/rocky-linux/infrastructure/issues/22#issuecomment-743741249), but I’ll repeat it here to get started.

Personally I would run as far away from anything that even resembles/includes Nagios as I possibly can. We have a collective Stockholm syndrome with that piece of junk and it should just burn. Zabbix may be marginally better, but not by much. My suggestion would be Prometheus.

Prometheus implies a certain amount of other tools:

  • Consul (or some other form of Service Discovery)
  • Prometheus itself
  • The exporters
  • Grafana for visualisation
  • Long-term storage - Thanos seems to have the community mind share
  • Alertmanager connected to something to send the alerts - perhaps approach PagerDuty, or similar, for sponsorship

I believe there are mature Ansible roles for all of that.
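
For anyone who hasn't seen what an exporter actually hands to Prometheus, here is a rough sketch of the plain-text exposition format that gets scraped. The helper `render_metric` and the metric name `mirror_sync_age_seconds` are made up for illustration and are not part of any existing exporter. It also skips label-value escaping, which a real exporter (or a client library) would handle.

```python
def render_metric(name, help_text, metric_type, samples):
    """Render one metric family in the Prometheus text exposition format.

    `samples` is a list of (labels_dict, value) pairs. Hypothetical helper:
    label values are not escaped here, which a real exporter must do.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} {metric_type}"]
    for labels, value in samples:
        if labels:
            # Sort labels so the output is stable between scrapes.
            label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
            lines.append(f"{name}{{{label_str}}} {value}")
        else:
            lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical gauge; the name and label are assumptions for this example.
    print(render_metric(
        "mirror_sync_age_seconds",
        "Seconds since last sync.",
        "gauge",
        [({"mirror": "example"}, 42)],
    ), end="")
```

In practice the existing exporters and client libraries produce this output for you; the point is only that a scrape target is a plain HTTP endpoint serving text like this.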

We need an incident response system as well (e.g. Statuspage).
I believe UptimeRobot was mentioned in Slack too.

I agree with Prometheus, though going full service discovery right at the start sounds like overengineering. Ansible can probably configure the list of scraped endpoints just as well :slight_smile:

More importantly, Prometheus gives metrics in addition to just monitoring, and that will help with all kinds of things.

As I said over in Slack, I agree that a full Service Discovery set up is probably not necessary for now. I am 99% sure we can use Ansible to do it.
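
To make the "no service discovery, just Ansible" idea concrete, here is a minimal sketch of generating a target file that Prometheus can read through `file_sd_configs`. It assumes the hosts run node_exporter on its default port 9100; the function name `build_file_sd`, the hostnames, and the `env` label are placeholders, not anything we have agreed on. An Ansible template could write the same JSON.

```python
import json


def build_file_sd(hosts, port=9100, labels=None):
    """Build the JSON body of a Prometheus file-based service discovery file.

    The format is a list of target groups, each with "targets" and "labels".
    Port 9100 is node_exporter's default; everything else here is hypothetical.
    """
    targets = [f"{host}:{port}" for host in sorted(hosts)]
    group = {"targets": targets, "labels": dict(labels or {})}
    return json.dumps([group], indent=2)


if __name__ == "__main__":
    # Placeholder inventory; in practice this list would come from Ansible.
    print(build_file_sd(
        ["b.example.org", "a.example.org"],
        labels={"env": "prod"},
    ))
```

Prometheus re-reads files referenced by `file_sd_configs` when they change, so regenerating this file from the inventory should be enough to add or remove scrape targets without a full service discovery stack.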

Prometheus sounds good, any quick ref? I’m so old I use MRTG directly XD

Come on, there is no shame in it! :slight_smile: I love my old school MRTG graphs, although I like rrdtool better, especially with an XKCD font rendering them. :smiley:

The main site is prometheus.io

Becoming an expert in Prometheus in 3, 2, 1…
It looks nice, something other than Nagios/Cacti/Zabbix…

Thumbs up for Prometheus+Grafana+Loki. Although I have no experience with it, it seems to be the most prominent option.

What about Nagios/Icinga at the end of 2020?

Has anyone ever heard of checkmk? It’s based in Germany but gaining quite a bit of momentum.
It has a fully open source version, though there is also an enterprise version. If that might be interesting, I could reach out to them to ask about sponsorship.

Personally I love that solution. It works really well for infrastructure monitoring, and it can also integrate with other tools, Prometheus and ntopng to name just two.

For what purpose are you suggesting Loki?

For log collection, it’s not that good compared to ELK or Graylog, as Grafana isn’t really geared towards log viewing.

Also, I recommend VictoriaMetrics for long-term storage. It’s quite easy to set up, supports PromQL natively, and performs well.

Prometheus is just for collecting. Check Grafana for viewing.
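
Since PromQL came up, here is a small sketch of what querying it looks like outside Grafana. It builds an instant-query URL for the Prometheus HTTP API (`/api/v1/query`) and parses the JSON a vector query returns. The base URL and the helper names are assumptions for the example, and it deliberately makes no network call.

```python
import json
from urllib.parse import urlencode


def instant_query_url(base_url, promql):
    """Return the URL for an instant query against the Prometheus HTTP API."""
    query_string = urlencode({"query": promql})
    return f"{base_url.rstrip('/')}/api/v1/query?{query_string}"


def parse_vector(body):
    """Extract (labels, value) pairs from an instant-vector JSON response."""
    payload = json.loads(body)
    if payload.get("status") != "success":
        raise ValueError("query failed: " + str(payload.get("error")))
    data = payload["data"]
    if data["resultType"] != "vector":
        raise ValueError("expected a vector, got " + data["resultType"])
    # Each sample's "value" is [unix_timestamp, "value as a string"].
    return [(s["metric"], float(s["value"][1])) for s in data["result"]]


if __name__ == "__main__":
    # Hypothetical server address; `up == 0` lists targets that failed a scrape.
    print(instant_query_url("http://localhost:9090", "up == 0"))
```

Grafana does the equivalent of this behind every panel, so whichever long-term store we pick mostly needs to answer the same kind of PromQL query.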

Yes, I was thinking about Loki for logs.

Was suggesting Loki+Prometheus+Grafana as opposed to Elastic+Logstash+Kibana.

From my understanding, ELK is open core while LPG is open source. Or maybe I’m wrong. If I’m correct, we could use the full power of the tools on Rocky infra without legal restrictions.

I think later down the road, ELK could be more useful to us, but in the interest of sticking as close to FOSS as we can, Loki is likely a better solution.

We’ve had this discussion with countless other solutions. Obviously we favour open source software, but we can’t rule out proprietary solutions if they also provide the correct feature set, and do it in a nicer way. Which I would argue is the case for the Elastic stack.

For logging, ELK has already been chosen; see:

LibreNMS works quite well for monitoring. It is a very actively maintained fork of the Observium NMS that is 100% free.

You can feed the data it collects into Prometheus, InfluxDB, etc. Plus, with the database backend, it is easy to make nice status dashboards in things like Grafana.

It will monitor services, can differentiate VMs running in VMware (and I believe other HVs too), has a bunch of different “plugins” and can be easily extended. And if you really want, you can always load up the nagios monitoring plugins for service monitoring in LibreNMS, as it will run them directly.

SNMP is the main method for monitoring. It can also do just plain ICMP (ping) monitoring to see if something is up.

And the nice thing is there are a ton of different notification options available: Slack, email, Pushbullet, PagerDuty, etc. (too many to list here). A listing of “transports” for alerts can be found here: https://docs.librenms.org/Alerting/Transports/

I may be wrong, but don’t Observium and LibreNMS both specialize in network device monitoring? I mean sure you can monitor classical operating systems using SNMP but you really don’t want to do that.

I think we need a more agnostic tool here.

Or maybe even specialized tools after all, as we will have all sorts of monitoring needs (application, infrastructure, performance).

There is no longer any debate on the subject. Prometheus (and its ecosystem) is the road we are going down.

Is Thanos going to be the long-term storage solution?
