Monitoring Infrastructure

chriscowley · December 12, 2020, 11:31am

I’d like to open up the monitoring question to the floor. I’ll add that I do not include Logging in this, as I see that as a separate problem that needs a separate solution.

I’ve already put this in an issue over on GitHub (https://github.com/rocky-linux/infrastructure/issues/22#issuecomment-743741249), but I’ll repeat it here to get started.

Personally I would run as far away from anything that even resembles/includes Nagios as I possibly can. We have a collective Stockholm syndrome with that piece of junk and it should just burn. Zabbix is maybe marginally better, but not by much. My suggestion would be Prometheus.

Prometheus implies a certain number of other tools:

Consul (or some other form of Service Discovery)
Prometheus itself
The exporters
Grafana for visualisation
Long-term storage - Thanos seems to have the community mind share
Alertmanager connected to something to send the alerts - perhaps approach PagerDuty, or similar, for sponsorship

I believe there are mature Ansible roles for all of that.
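
For readers new to the stack listed above: an exporter is essentially an HTTP endpoint that serves metrics as plain text, which Prometheus then scrapes. Below is a minimal sketch of that text format in Python. The function name `render_gauge` and the metric `mirror_up` are made up for illustration, and label-value escaping is deliberately left out; a real exporter would normally use an existing client library instead.

```python
def render_gauge(name, help_text, samples):
    """Render one gauge in the Prometheus text exposition format.

    samples is a list of (labels_dict, value) pairs.
    NOTE: illustrative only; label values are not escaped here.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} gauge"]
    for labels, value in samples:
        # Sort labels so the output is stable between runs.
        pairs = ",".join(f'{key}="{val}"' for key, val in sorted(labels.items()))
        if pairs:
            lines.append(name + "{" + pairs + "} " + str(value))
        else:
            lines.append(name + " " + str(value))
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical metric: whether a mirror answered its last check.
    print(render_gauge("mirror_up", "Whether the mirror responded",
                       [({"host": "mirror1"}, 1)]), end="")
```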

amit177 · December 12, 2020, 11:42am

We need an incident response system as well (e.g. statuspage).
I believe UptimeRobot was mentioned in Slack too.

oranenj · December 12, 2020, 11:45am

I agree with Prometheus, though going full service discovery right at the start sounds like overengineering. Ansible can probably configure the list of scraped endpoints just as well.

More importantly, Prometheus gives metrics in addition to just monitoring, and that will help with all kinds of things.

chriscowley · December 12, 2020, 2:14pm

As I said over in Slack, I agree that a full Service Discovery set up is probably not necessary for now. I am 99% sure we can use Ansible to do it.
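
As a rough illustration of the "static list instead of service discovery" idea: Prometheus can read scrape targets from a JSON file (file-based service discovery), so whatever manages the host list only has to write that file. The sketch below is in Python rather than Ansible, purely to show the shape of the data. The hostnames and the `env` label are hypothetical, and port 9100 is assumed because it is the usual node_exporter default.

```python
import json


def build_file_sd(hosts, port=9100, labels=None):
    """Return file_sd-style JSON for a static list of hosts.

    hosts  -- iterable of hostnames (hypothetical examples below)
    port   -- exporter port; 9100 is assumed (node_exporter default)
    labels -- optional dict of labels attached to every target
    """
    group = {"targets": [f"{host}:{port}" for host in hosts]}
    if labels:
        group["labels"] = dict(labels)
    # file_sd expects a list of target groups.
    return json.dumps([group], indent=2)


if __name__ == "__main__":
    print(build_file_sd(["mirror1.example.org", "build1.example.org"],
                        labels={"env": "prod"}))
```

An Ansible template rendering the same structure from the inventory would achieve the same result; the point is only that the target list can be generated rather than discovered.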

criptos · December 13, 2020, 3:18am

Prometheus sounds good, any quick ref? I’m so old I use mrtg directly XD

preachermanx · December 13, 2020, 3:58am

Come on, there is no shame in it! I love my old-school MRTG graphs, although I like rrdtool even better, especially with the XKCD font rendering them.

chriscowley · December 13, 2020, 1:59pm

The main site is prometheus.io

criptos · December 13, 2020, 3:27pm

becoming expert in prometheus in 3,2,1…
It looks nice, something apart from nagios/cacti/zabbix…

llsousa · December 14, 2020, 10:19am

Thumbs up for Prometheus+Grafana+Loki. Although I have no experience with it, this stack seems to be the most prominent.

matufas · December 14, 2020, 10:49am

What about Nagios/Icinga at the end of 2020?

thorian93 · December 16, 2020, 9:25pm

Anyone ever heard of checkmk? It’s based in Germany but gaining quite a bit of momentum.
It has a fully open source version, though there is also an enterprise version. If that might be interesting, I could reach out to them to ask for a sponsorship.

Personally I love that solution. It works really well for infrastructure monitoring, and it can also integrate with, for example, Prometheus and ntopng, just to name two.

mekster · December 18, 2020, 1:19am

For what purpose are you suggesting Loki?

For log collecting, it’s not that good compared to ELK or graylog as Grafana isn’t really catered for log viewing.

Also I recommend VictoriaMetrics for long term storage. It’s quite easy to set up and supports PromQL by itself, so it’s performant.
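
To make the PromQL point concrete: a backend that speaks the Prometheus HTTP query API can be queried at `/api/v1/query` with a PromQL expression. Here is a small Python sketch that builds such a URL and parses an instant-vector response. The base URL `http://localhost:8428` is an assumption (a local single-node VictoriaMetrics on its usual port), and the helper names are made up for illustration.

```python
import json
from urllib.parse import urlencode


def instant_query_url(base_url, promql):
    """Build the URL for an instant query against a Prometheus-compatible API."""
    params = urlencode({"query": promql})
    return base_url.rstrip("/") + "/api/v1/query?" + params


def parse_vector(body):
    """Extract (labels, value) pairs from an instant-vector JSON response."""
    payload = json.loads(body)
    if payload.get("status") != "success":
        raise ValueError("query failed: " + str(payload.get("error")))
    # Each result carries its labels and a [timestamp, "value"] pair.
    return [(item["metric"], float(item["value"][1]))
            for item in payload["data"]["result"]]


if __name__ == "__main__":
    # Assumed endpoint; adjust to wherever the backend actually listens.
    print(instant_query_url("http://localhost:8428", "up"))
```

Fetching the URL (for example with `urllib.request.urlopen`) and passing the body to `parse_vector` would give the current value of `up` per target.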

mekster · December 18, 2020, 1:22am

Prometheus is just for collecting. Check Grafana for viewing.

llsousa · December 18, 2020, 10:06am

Yes, I was thinking about Loki for logs.

Was suggesting Loki+Prometheus+Grafana as opposed to Elastic+Logstash+Kibana.

From my understanding, ELK is open core while LPG is open source. Or maybe I’m wrong. If I’m correct, we could use the full power of the tools on Rocky infra without legal restrictions.

hbjy · December 18, 2020, 10:52am

I think later down the road, ELK could be more useful to us, but in the interest of sticking as close to FOSS as we can, Loki is likely a better solution.

We’ve had this discussion with countless other solutions – obviously we favour OSS software, but we can’t rule out proprietary solutions if they also provide the correct feature set, and do it in a nicer way. Which I would argue is the case for the Elastic stack.

RaceAap · December 18, 2020, 8:38pm

For logging, ELK has already been chosen, see:

solivas · December 20, 2020, 5:01am

LibreNMS works quite well for monitoring. It is a very actively maintained fork of the Observium NMS that is 100% free.

You can feed the data it collects into Prometheus, InfluxDB, etc. Plus, with the database backend, making nice status dashboards in things like Grafana is easy to do.

It will monitor services, can differentiate VMs running in VMware (and I believe other HVs too), has a bunch of different “plugins” and can be easily extended. And if you really want, you can always load up the nagios monitoring plugins for service monitoring in LibreNMS, as it will run them directly.

SNMP is the main method for monitoring. It can also do just plain ICMP (ping) monitoring to see if something is up.

And the nice thing is there are a ton of different notification options available: Slack, email, Pushbullet, PagerDuty, etc. (too many to list here). A listing of “transports” for alerts can be found here: https://docs.librenms.org/Alerting/Transports/

thorian93 · December 20, 2020, 4:50pm

I may be wrong, but don’t Observium and LibreNMS both specialize in network device monitoring? I mean sure you can monitor classical operating systems using SNMP but you really don’t want to do that.

I think we need a more agnostic tool here.

Or maybe even specialized tools after all, as we will have all sorts of monitoring needs (application, infrastructure, performance).

chriscowley · December 20, 2020, 6:17pm

There is no longer any debate on the subject. Prometheus (and its ecosystem) is the road we are going down.

mekster · December 20, 2020, 10:07pm

Is Thanos going to be the long-term storage solution?
