I think we should write down the general requirement alongside the currently agreed-upon recommendation, rather than only listing the products we have recommended, so we know why each item was chosen. Then we can build out how that architecture works.
For monitoring, definitely look into Prometheus; it’s pretty simple to set up and quite powerful.
Sourcehut even has a neat public instance of their infrastructure monitoring:
I’m personally meh on Zabbix. If people are into it, we can use it, but something more modern feels better: a Grafana/Prometheus/InfluxDB type stack.
While we’re all chatting: if we hypothetically had access to dedicated hosts, how would we want to use them? As hypervisors or clustered workloads (OpenStack or other), or just as bare metal?
Nothing is set in stone, and we don’t know what or how many of anything we might have.
I’d definitely be in favour of running virtualization on top of hardware hosts where possible. It just makes life simpler.
Having a cloud system to use would also be useful, but I don’t want to inflict OpenStack on anyone, so I vote for putting it in the “maybe later” pile. I do have experience with it if needed, though…
Agreed on the Prometheus/InfluxDB stack instead of Zabbix. That list will make more sense in a minute with my next two posts. We will definitely need some sort of virtualisation, but more importantly we will need some sort of IPMI regardless of the stack, since the team will be so distributed.
Pagure would be my vote for a git forge, with Gitea as a close second. It may also be worth investigating mirrored repos on GitHub/GitLab in the future.
Also have a look at Icinga for host/service health monitoring and alerting. It integrates well with Graphite, has a C-style DSL for configuring host checks, and is fully compatible with existing Nagios checks. Its functionality can also be extended with other tools via modules and plugins (e.g. Director, Reporting, etc.).
Grafana+Prometheus is another good option. Prometheus also has built-in alerting functionality via Alertmanager, though that has to be configured separately. Grafana also has some pretty interesting “satellite” projects, like Loki for log exploration.
My two cents for NetBox. It’s a pretty neat and simple DCIM tool with all the features you probably need to organize your infra information.
We have been running all of these on prem (single-node setups) for quite some time now without any major issues. Something that has to be considered carefully is the storage needs of Prometheus (if you would like long data retention policies), and that you need a separate project like Cortex or Thanos in order to have scalable, highly available underlying storage for it. Additionally, Icinga has built-in HA functionality, but it takes some searching and testing to make it work without issues with the Icinga Web 2 GUI.
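To put a rough number on the Prometheus storage concern, here is a minimal sketch based on the sizing rule of thumb from the Prometheus storage docs (disk space ≈ retention seconds × ingested samples per second × bytes per sample, with roughly 1 to 2 bytes per sample). The function names and the fleet size in the example (20 hosts, 1000 series each, 15 s scrape interval) are made-up assumptions for illustration, not figures from this thread.

```python
def samples_per_second(hosts, series_per_host, scrape_interval_seconds):
    """Approximate ingestion rate: every series yields one sample per scrape."""
    if scrape_interval_seconds <= 0:
        raise ValueError("scrape_interval_seconds must be positive")
    return hosts * series_per_host / scrape_interval_seconds


def estimate_prometheus_disk_bytes(retention_days, ingest_rate, bytes_per_sample=2.0):
    """Rule-of-thumb local storage estimate in bytes.

    bytes_per_sample defaults to the pessimistic end of the ~1-2 byte
    range quoted in the Prometheus documentation.
    """
    if retention_days < 0 or ingest_rate < 0 or bytes_per_sample < 0:
        raise ValueError("inputs must be non-negative")
    retention_seconds = retention_days * 24 * 60 * 60
    return retention_seconds * ingest_rate * bytes_per_sample


if __name__ == "__main__":
    # Hypothetical fleet, purely illustrative.
    rate = samples_per_second(hosts=20, series_per_host=1000, scrape_interval_seconds=15)
    size = estimate_prometheus_disk_bytes(retention_days=365, ingest_rate=rate)
    print(f"{rate:.0f} samples/s, about {size / 1e9:.1f} GB for one year of retention")
```

This only covers a single node’s local TSDB; it says nothing about replication or the object storage that Thanos or Cortex would add on top.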
Is it fair to assume everything on this list would benefit from being logically networked together? I’m thinking in terms of Zabbix really only being effective if it’s in pretty constant contact with these servers.
Though if that were the main concern, there are likely hosted alternatives we could reach out to, like New Relic and Datadog.
For monitoring: I’m a developer of openITCOCKPIT, which is basically Nagios on steroids. It comes with Graphite, Grafana, a web UI and an API. I could assist with setup.
I don’t want to advertise here, so please check out the website / GitHub repo for more information.
Also, I’m not sure what the specs are for the wiki, but I really like WikiJS. It’s still heavily under development but looks really promising. It also has some nice features, such as syncing to a Git repository (pull, push or two-way).