Migrating a legacy monitoring infrastructure from Nagios to Icinga2 is not always straightforward, but it's not rocket science either.
In this talk we want to present our experience with one challenging migration: what we had to deal with, how we solved some legacy compatibility requirements, and the lessons we learnt.
Tales from a legacy world: migrating (part of) a Nagios infrastructure - Icinga Camp Milan 2019
1. Tales from a legacy world
migrating (part of) a Nagios infrastructure
Luca Lesinigo - Solutions Architect
Codeway SA
26/09/2019 - Icinga Camp Milan
2. $ whoami
Tales from a legacy world - Migrating (part of) a Nagios infrastructure
Luca Lesinigo
Solutions Architect
Codeway SA
(Chiasso, CH)
● *nix sysadmin since high school
● now cfg mgmt, monitoring,
devops, CI/CD and some other
buzzwords
● 💙 Icinga ;-)
3. Migrating a legacy monitoring infrastructure from Nagios to Icinga2 is not
always straightforward, but it's not rocket science either.
In this talk we want to present our experience with one challenging
migration: what we had to deal with, how we solved some legacy
compatibility requirements, and the lessons we learnt.
what's up? (what I'm supposed to tell you)
4. ● this will be boring: no new shiny technology or revolutionary tools
○ it’s just Icinga, IcingaWeb2, InfluxDB, Grafana, Nginx, PostgreSQL,
GlusterFS, GitLab, Python, Puppet, Docker 😅
● hackish stuff ahead, but it is Enterprise Production Ready Stuff™
○ don’t try this at home
disclaimers (sorry about them)
5. Our customer is the Italian branch of a worldwide enterprise, with >2100
hosts and >13000 services (a mix of servers, Linux VMs, Windows VMs,
network gear, UPSes, etc.). A lot of them live in on-premise DCs, but
there's also equipment spread across their Italian locations and in the cloud.
This was being monitored with a classic Nagios instance running on SLES
systems, with some additional tools including NagVis and PNP4Nagios.
Configurations are wholly managed with NagioSQL.
what's the starting point? (what they said they had)
6. Nagios was deployed on a pair of nodes managed by heartbeat
(active/standby), with the relevant file systems (configurations, check
plugins, performance metrics) mounted over NFS.
Then we found that the NFS server was a single Linux VM…
...but everything was on vSphere HA
high availability (sort of)
7. Changes to the monitoring configurations were managed with NagioSQL;
a couple of people were responsible for them, but 99% of the time it was
the same person managing everything.
There was zero automation: provisioning monitoring for hosts and
services and keeping it up to date was handled by email and/or verbal
communication. Decommissioning, when it happened, was manual too.
managing configs (empowering users with modern tools and methodologies… oh wait, wrong slide)
8. Plugins lived in one of the NFS-mounted shares; some of them were simply
copies of the standard ones, and many were in-house developments in
various stages of documentation, maturity, error handling, and portability.
We quickly found out that parameters were often hardcoded in the
plugin scripts. Nagios' way of handling commands & arguments also didn't
encourage any better structure.
monitoring plugins (I thought it simply was some standard library + some customizations, wasn't it?)
9. Users of the system employed mainly three interfaces:
● the classic Nagios web interface
● NagStaMon on their desktops
● NagVis views, both from their workstations and in rotation on big
wall-mount displays
Authentication went to the A/D servers for all of the above
user experience (their daily dose)
10. The monitoring system alerts users through email notifications.
Lots of emails.
Many users simply didn't use them as their first resource but relied
heavily on NagStaMon to surface new criticals; they would then go
through the emails or look at the web interface to get more details.
notifications (lots of)
11. ● replace Nagios with a better, modern tool: Icinga2
● better performance / scaling
● better reliability
● be in a better position for future evolutions
● keep both the old and new systems online for a little while
● decommission the old system after a couple months
project goals (hoping for a better future)
12. ● keep Nagios & Icinga2 configurations perfectly in sync
○ and do that while also keeping NagioSQL
● keep the current checks, agents, protocols, ….
○ no new agents or configurations on monitored nodes
● “and that’s all, really, nothing else”
○ I knew things would get interesting sooner or later
project requirements (I said it wasn't straightforward)
13. ● build a PoC, quick ’n dirty, show something running
● gather comments and suggestions
● build the solution in an automated, maintainable way
● deploy it, do smoke tests, see what happens
● cross-check with Nagios
● if everything’s ok, celebrate 🍻
● after some time, retire Nagios and celebrate again 🍻
the plan (how we'd save the world and be the heroes)
14. basic design is more or less the usual one, with HA:
● two Icinga2 nodes in HA (master zone)
● InfluxDB for performance metrics, Grafana for visualization
● IcingaWeb2 with Grafana module 😎 and customized theme
● Icinga ClassicUI 😕
● PostgreSQL database for IDO-DB and Grafana configs
let's do it! (we will rock you)
15. ● the ClassicUI came up as a new requirement 😱
● there is little to no documentation about the old system
● oh, they’re also migrating from SLES11 to CentOS7
● and, they’re also migrating from the good old VM cluster to a new one
○ yes, the freshly installed and cool hyperconverged one
● then there’s the new VLANs and subnets
● and lots of other hidden hacks, er, gems
but then... (knock knock. who's there? REALITY!)
16. ● small new requirements kept forcing the project to change course en route
● customer staff is super friendly and helpful, but they’re also super busy
with $dayjob and don’t always have lots of time to follow the new
monitoring project and listen to our existential problems
● the project changed ownership multiple times
● sometimes internal silos between customer teams made it difficult for
“monitoring systems people” to get the attention of “monitored
systems people”
… and also ... (never enough)
17. ● turns out there’s a Puppet server we can use
○ I’m a Puppet guy as much as I’m an Icinga guy
○ it made a perfect fit for this project
● ops gave us base CentOS7 systems with just the agent pre-installed
● the whole stack is completely installed & configured from Puppet
○ “bash zero” mantra
● useful to keep configurations the same between systems / environments
● useful to do test deployments and/or re-deploy stuff when needed
automate all the things (obviously)
18. Many plugins had been taken from SLES11 packages (and possibly other
sources); most of them didn't even run on CentOS7.
Solution: leverage Puppet to switch back to upstream versions
standard plugins (yeah… but from a years-old standard)
$replace_system_checks = [ 'check_http', 'check_nrpe', 'check_snmp' ] # and many more...
$replace_system_checks.each |$plugin| {
file { "/opt/monitoring-plugins/${plugin}":
ensure => link,
target => "/usr/lib64/nagios/plugins/${plugin}",
require => Staging::Deploy['custom-monitoring-plugins.tgz'],
}
}
19. Some plugins hardcoded stuff that shouldn't be there and/or had
incompatibilities with the newer OS (e.g. Perl, Python, shared libs, …).
Solution: we set up a new internal GitLab repo with a branch to track the
“legacy version” (from the old systems) and another one with the patched
versions (for the new systems). Some hardcoded values were moved to
environment variables which we’d populate at runtime from Icinga2
host/service/global vars.
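As an illustration of that pattern (the command name, plugin path and variable names below are hypothetical, not from the customer's setup), an Icinga2 CheckCommand can export host/service vars to a plugin through its env attribute:

```icinga2
object CheckCommand "check-legacy-app" {
  // assumed example path for the patched in-house plugin
  command = [ "/opt/monitoring-plugins/check_legacy_app" ]

  // values that used to be hardcoded in the script now arrive via the
  // environment, resolved per host/service at check execution time
  env = {
    "APP_PORT"    = "$host.vars.app_port$"
    "APP_TIMEOUT" = "$service.vars.app_timeout$"
  }
}
```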
custom plugins (too much custom, sometimes)
20. Bonus point #1: for the first time ever there was a versioning of the plugins
used on the legacy systems (who did what, when and why?)
Bonus point #2: we could leverage GitLab’s issue tracker also for bugs
affecting the older systems, instead of simply patching in place on the NFS
server (much more streamlined process and better collaboration)
Bonus point #3: also leverage GitLab CI to run tests inside dockerized
GitLab runners
custom plugins repo (this is the good stuff)
21. ● there’s the usual problem with stateful checks when the cluster
re-allocates checks from one node to the other: plugin expects files in
/tmp from previous run, but doesn’t find them (example: SNMP
network interface checks that save traffic counters to derive average
bandwidth on the next check run)
● supporting ClassicUI meant we needed an older Icinga2 version
active/active HA is cool (but… there's always a "but…")
22. We already had plugins in a GitLab repo with automated deployments (on
the new systems), so we opened issues whenever we detected a stateful
plugin that needed files from the previous run.
Solution (part 1): patch the plugin to use a custom path from an
environment variable instead of hardcoding /tmp.
Solution (part 2): mount a shared GlusterFS filesystem at the custom path
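A minimal sketch of what the state handling in such a patched plugin could look like (hypothetical code in the spirit of the fix, not the customer's plugin): the state directory comes from an environment variable, which on the HA pair would point at the shared GlusterFS mount.

```python
#!/usr/bin/env python3
"""Sketch of a cluster-friendly stateful check helper (hypothetical)."""
import json
import os
import time

def rate_since_last_run(key, counter, now=None, state_dir=None):
    """Persist the current counter sample and return the per-second
    delta since the previous run, or None on the first run."""
    # CHECK_STATE_DIR is an assumed variable name, populated e.g. via
    # the Icinga2 CheckCommand "env" attribute; /tmp stays the fallback
    state_dir = state_dir or os.environ.get("CHECK_STATE_DIR", "/tmp")
    now = time.time() if now is None else now
    path = os.path.join(state_dir, "%s.json" % key)
    previous = None
    if os.path.exists(path):
        with open(path) as f:
            previous = json.load(f)
    # write-then-rename keeps the state file update atomic
    with open(path + ".new", "w") as f:
        json.dump({"t": now, "counter": counter}, f)
    os.replace(path + ".new", path)
    if previous is None or now <= previous["t"]:
        return None
    return (counter - previous["counter"]) / (now - previous["t"])
```

This is exactly the SNMP interface-counter case from the previous slide: the second run derives average bandwidth from the saved counter, wherever the check happens to execute.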
stateful plugins (I wish more plugin authors would care about this)
23. A long, long time ago there was the ClassicUI, and it hasn't been supported
for ages. But it became a requirement after the project started…
Workaround: let Puppet pin an older Icinga2 version to avoid bringing in a
newer one at each system update, live with it for some time.
At the same time, encourage users to try IcingaWeb2 and implement an
internal “help desk” where they can get help to start using it.
ClassicUI (ahh, the good ol' times)
24. We needed to keep the old and new system configurations “in sync” and let
people continue using NagioSQL to manage everything.
Various reasons prevented us from migrating to Director at the time.
Solution (in an ideal world): don’t. Just don’t do that.
Solution (in our world): too long, see next slides :-)
configuration sync (this is where it gets weird)
25. There’s not a 1:1 mapping between all Nagios configurations and Icinga2
ones. Once upon a time the nice Icinga folks had a migration tool but it has
been discontinued, and we also had the NagioSQL part in our equation.
We initially tried to follow the same route, somehow going from the Nagios
config directly to Icinga2 config, but we quickly found out why the migration
tool has not become a universal solution to the world problems.
Tales from a legacy world - Migrating (part of) a Nagios infrastructure
nagios.conf → icinga2.conf(nice try)
26. We chose some kind of brute-force approach: instead of starting from the
Nagios config, we would start from the Nagios objects precache.
It basically is the whole config with all things already “exploded” (ie: no
templates), thus leveraging Nagios itself for a lot of the config parsing logic
and getting a much simpler (albeit much bigger) thing to digest.
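For illustration, the precache's fully expanded define <type> { … } blocks can be parsed with surprisingly little code (a simplified sketch, nothing like the real translator):

```python
import re
from collections import defaultdict

def parse_precache(text):
    """Parse Nagios objects.precache / objects.cache style definitions
    ("define <type> {" followed by one key/value per line) into a dict
    of object type -> list of attribute dicts."""
    objects = defaultdict(list)
    for match in re.finditer(r"define\s+(\w+)\s*\{(.*?)\n\s*\}", text, re.S):
        obj_type, body = match.groups()
        attrs = {}
        for line in body.strip().splitlines():
            line = line.strip()
            if not line or line.startswith("#"):
                continue
            # keys and values are tab-separated in the cache files;
            # fall back to a whitespace split just in case
            key, _, value = line.partition("\t")
            if not value:
                key, _, value = line.partition(" ")
            attrs[key.strip()] = value.strip()
        objects[obj_type].append(attrs)
    return dict(objects)
```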
objects.precache → icinga2.conf (the dark side of the migration)
27. We wrote a Python translation tool that would get the whole Nagios
objects precache, parse it, map everything to Icinga2 configuration objects
and configurations, apply custom logic for this specific customer, apply (a
lot of) manual overrides, apply some more magic and finally emit
everything as one big Icinga2 configuration file.
I swear it’s the worst code I’ve ever written, by any standard, but its key
feature is: It Gets the Job Done™
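The mapping stage can be sketched like this (a hypothetical, heavily simplified per-object emitter; the hostalive fallback and the attribute choices are illustrative only, not the customer's actual mapping logic):

```python
def emit_icinga2_host(attrs):
    """Map one flattened Nagios host definition (a dict of attributes)
    to an Icinga2 Host object as configuration text."""
    lines = [f'object Host "{attrs["host_name"]}" {{']
    if "address" in attrs:
        lines.append(f'  address = "{attrs["address"]}"')
    if "check_interval" in attrs:
        # Nagios intervals count in interval_length units (60s by
        # default, assumed here); Icinga2 wants plain seconds
        lines.append(f'  check_interval = {int(attrs["check_interval"]) * 60}')
    # placeholder mapping; a real translator would derive this from
    # the Nagios check_command definition
    lines.append('  check_command = "hostalive"')
    lines.append("}")
    return "\n".join(lines)
```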
enter the BFT9000 ("T" is for "Translator")
28. It’s partly a parser, partly a lexer, partly a macro processor, partly a
translator/remapper. The whole thing is ugly and I won’t show it to you to
avoid melting your eyes. (also because it’s proprietary customer’s IP)
But I can tell you it’s ~ 1000 lines of code for a dozen classes and takes
something like one or two minutes to translate the Nagios objects
precache in a >400.000 lines Icinga config file. Yes that’s over four hundred
thousand lines, but there are some empty lines and even some
auto-generated comments :-)
Tales from a legacy world - Migrating (part of) a Nagios infrastructure
what does it look like?(thanks mom for paying my CS degree)
29. ● users modify stuff in NagioSQL
● NagioSQL writes new Nagios config files and reloads Nagios
● a cron job runs every 2 minutes to watch for config file changes
● when a change is detected, it runs nagios -p to regenerate
objects.precache
● ...and then uses rsync to copy it to the Icinga2 server
○ nice detail: rsync ensures an atomic update of the copied file
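The steps above can be sketched as a single script run from cron (all paths, the digest-state location and the target host name are assumptions; `nagios -p` is the real flag that precaches objects):

```python
#!/usr/bin/env python3
"""Sketch of the Nagios-side change watcher (hypothetical paths)."""
import hashlib
import os
import subprocess

CONFIG_DIR = "/etc/nagios/conf.d"          # assumed config location
STATE_FILE = "/var/tmp/nagios-cfg.digest"  # assumed state location

def dir_digest(root):
    """Stable content hash of every file under root."""
    digest = hashlib.sha256()
    for dirpath, _dirs, files in sorted(os.walk(root)):
        for name in sorted(files):
            path = os.path.join(dirpath, name)
            digest.update(path.encode())
            with open(path, "rb") as f:
                digest.update(f.read())
    return digest.hexdigest()

def main():
    current = dir_digest(CONFIG_DIR)
    previous = open(STATE_FILE).read() if os.path.exists(STATE_FILE) else None
    if current == previous:
        return  # nothing changed since the last cron run
    # regenerate the flattened object cache and ship it; rsync renames
    # into place on the receiver, so the Icinga2 side never sees a
    # half-written objects.precache
    subprocess.run(["nagios", "-p", "/etc/nagios/nagios.cfg"], check=True)
    subprocess.run(["rsync", "-a", "/var/nagios/objects.precache",
                    "icinga2-master:/var/incoming/"], check=True)
    with open(STATE_FILE, "w") as f:
        f.write(current)

# only act on hosts where Nagios actually lives
if __name__ == "__main__" and os.path.isdir(CONFIG_DIR):
    main()
```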
how is it actually used? (Nagios side)
30. ● a systemd Path unit detects changes to the objects.precache
file rsync'ed from Nagios and triggers a Service unit
● the Service unit runs a local bash script
● the script copies the file into the workdir of a local git repo; if there is
no diff then nothing has changed and it simply quits
● if there is any change, it runs the translator script to get a new Icinga2
config, saving it in the local git workdir
● then runs icinga2 daemon -C to validate the generated config
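The trigger in the first bullets might look like this pair of units (unit names, paths and the script name are hypothetical):

```ini
# /etc/systemd/system/precache-watch.path
[Path]
# fires the matching .service when the rsync'ed file is replaced;
# PathChanged also reacts to a rename into place
PathChanged=/var/incoming/objects.precache

[Install]
WantedBy=multi-user.target

# /etc/systemd/system/precache-watch.service
[Service]
Type=oneshot
ExecStart=/usr/local/bin/translate-and-deploy.sh
```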
how is it actually used? (Icinga2 side, receiving the changes)
31. ● if validation is successful, both the incoming objects.precache
and the generated Icinga2 cfg get committed to the git repo
● the new config is deployed in /etc/icinga2/zones.d/master
● Icinga2 gets reloaded
Bonus point: there is now a log of what happens through NagioSQL… just
browse the git history of the objects precache to see a nice diff of the
changes.
how is it actually used? (Icinga2 side, deploying the changes)
32. This mechanism has been in production since mid-2018 and has
performed flawlessly since then. There are hundreds of automatic git
commits with the whole history of both the incoming
objects.precache and the resulting Icinga2 config file.
The whole process takes a couple of minutes between NagioSQL writing a
new config and Icinga2 being reloaded with the translated version, keeping
the two systems' configurations in sync.
does it really work? (no kidding)
33. Often, the same configurations and check plugins still gave different
results on the two systems: hosts and services ended up in different
states between Nagios and Icinga2.
Sometimes there were “external” causes: ~90% of them were due to the new
VLANs and networks not being allowed in hosts ACLs or firewall rules.
Other times there were “internal” causes, most of them being plugin scripts
behaving differently on the two systems.
are we done yet? (you could bet we're not)
34. ● false alarms that aren’t false
○ Icinga keeps saying something is CRIT, Nagios says OK
○ after triple-checking the Icinga side of things, you discover that
those years-old OKs in Nagios were actually false
● random asymmetric network lockups
○ network is fine between vm A and B, also between B and C
○ but there is exactly 100% packet loss between A and C
○ GlusterFS & Icinga IDO-DB don’t like this
■ ask me how I know
funny failures ("funny" applies only as long as you're not in production)
35. ● check plugins that leak temporary / state files, exhausting inodes
○ makes for a good opportunity to explain why you can get
“filesystem full” even if df shows lots of free space
● SMTP black holes: people complain they’re not getting notifications,
logs show the smtp server accepted them successfully
○ hardcoded filters in the MTA dropping messages after having
received and enqueued them
● random DNS issues, different servers giving different answers, ...
○ [cit.] everything is a freaking dns problem
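That first bullet is easy to demonstrate: df looks at blocks, while leaked state files exhaust inodes. A quick stdlib-only way to watch both at once:

```python
import os

def fs_usage(path):
    """Report block and inode usage for the filesystem holding `path`.
    A filesystem can be "full" on inodes (f_favail == 0) while df still
    shows plenty of free blocks - exactly the leaked-state-files case."""
    st = os.statvfs(path)
    return {
        "blocks_free_pct": 100.0 * st.f_bavail / st.f_blocks if st.f_blocks else 0.0,
        "inodes_free_pct": 100.0 * st.f_favail / st.f_files if st.f_files else 0.0,
    }
```

(`df -i` gives the same inode view from the shell.)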
funny failures, vol. 2 (sometimes they come back)
36. With over 13,000 services we needed to quickly detect and identify
problems. We wrote another Python script that would fetch LiveStatus
data from Nagios and Icinga2 and cross-check between the two.
This helped us quickly identify different results, also grouping them in
broad categories:
● OK on Nagios and non-OK on Icinga2 (the most common one)
● the other way around (rare)
● non-OK in both, but different states (eg. warning vs unknown)
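The cross-checker could be sketched along these lines (function names and the bucketing are illustrative; the query itself follows the standard MK Livestatus protocol with its default semicolon-separated CSV output):

```python
import socket

def livestatus_states(socket_path):
    """Fetch (host, service) -> state from a Livestatus UNIX socket."""
    query = b"GET services\nColumns: host_name description state\n\n"
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as s:
        s.connect(socket_path)
        s.sendall(query)
        s.shutdown(socket.SHUT_WR)  # signal that the query is complete
        data = b""
        while True:
            chunk = s.recv(4096)
            if not chunk:
                break
            data += chunk
    states = {}
    for line in data.decode().splitlines():
        host, description, state = line.split(";")
        states[(host, description)] = int(state)
    return states

def classify_differences(nagios, icinga):
    """Bucket mismatches into the three broad categories above."""
    buckets = {"ok_on_nagios_only": [], "ok_on_icinga_only": [],
               "both_nonok_different": []}
    for key in nagios.keys() & icinga.keys():
        n, i = nagios[key], icinga[key]
        if n == i:
            continue
        if n == 0:
            buckets["ok_on_nagios_only"].append(key)   # most common
        elif i == 0:
            buckets["ok_on_icinga_only"].append(key)   # rare
        else:
            buckets["both_nonok_different"].append(key)
    return buckets
```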
verifying results (who monitors the monitoring systems?)
37. Some parts of the configuration are not synchronized between the two
systems using the translation tool, by design. For example:
● notification configurations: defined new NotificationCommands
and used apply rules for Notifications
● surrounding tools like performance metrics: started from a blank
sheet and adopted InfluxDB and Grafana
We are also still finding small surprises here and there, fixing them
as we discover them :-)
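The notification part follows the standard Icinga2 pattern, roughly like this (object names and the assign condition are illustrative, not the customer's actual rules):

```icinga2
object NotificationCommand "mail-service-notification" {
  // assumed script location, in the style of the Icinga2 sample configs
  command = [ SysconfDir + "/icinga2/scripts/mail-service-notification.sh" ]
}

apply Notification "mail-ops" to Service {
  command = "mail-service-notification"
  user_groups = [ "ops" ]
  // hypothetical condition; real rules match the customer's needs
  assign where host.vars.notification_mail == true
}
```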
final notes (cleaning up the last bits)
38. ● the two systems are still running side by side
● more than 99.5% of host and service checks, and their results, are
perfectly in sync between the two systems
○ (and this metric is improving)
● a better process is in place to manage the whole life cycle of custom
check plugins, which the legacy systems also benefit from
● currently in the process of switching to Icinga2 as primary and Nagios
as a reference backup
what now? (state of the union)
39. ● dockerize NagioSQL + Nagios and keep that container around to let
people keep using NagioSQL after the Nagios system is retired
and decommissioned
○ that instance will run only to generate objects.precache
● move more and more people to IcingaWeb2 and get rid of the
ClassicUI as soon as possible
○ this will unlock upgrades to newer Icinga2 versions
what tomorrow? (short term)
40. We don’t have a specific roadmap (yet), but the next big thing should be /
will be / Ihopeitwillbe removing NagioSQL. Did anyone say Director? 😉
We will also be investigating the possibility of creating satellite zones to
delegate some workload and segregate check plugins.
Also, more automation to keep monitoring in sync with the real systems:
● better KPIs in the customer CMDB for monitoring coverage
● integrations with A/D, Azure, VMware and other authoritative sources
a bright future ahead (hope to get there soon)
41. This project has been a (weird|extreme|nontypical) scenario of legacy
compatibility.
There wasn’t opportunity to introduce disruptive new technologies but it
sure was a step forward for the customer to get ready for future
improvements and a learning opportunity for everyone involved.
What follows is a short collection of some (mostly obvious?) takeaways
from this experience, in no specific order.
lessons learned (is there anything useful here?)
42. ● a project like this has many crossing points on the boundary between
customer and provider / consultant, requiring strict collaboration
● good project management and internal communications are a must
● identify all the users of the system and all the owners of the
monitored entities: they’re often the only ones that can tell you what’s
wrong and what’s correct
● try to get feedback as soon as possible in the project, use it to drive
your process and decisions
people are critical for success (it's not just an IT thing)
43. ● monitoring is not the kind of system that you want to run on your
shiny new piece of technology before you have proven it is more than
stable
● ideally choose some reliable piece of your infrastructure that you
know inside out and that’s as “out of band” as possible from the
monitored systems
● ask yourself who’s monitoring the monitoring system
aim for boring (the more boring, the better)
44. ● we had many non-obvious cases of wrong results, which IMHO is the
worst thing you can get from a monitoring system: if you can't trust it,
then it's useless
● scripting generic sanity checks and cross-checks with the system
being replaced helped us quickly identify the vast majority of wrong
results
double check your checks (the devil is in the details)
45. ● I've seen one too many check plugins that behaved "funny" every
time they went outside of the "everything's ok" code path
● acknowledge that nowadays we can have HA clusters
○ provide for a way to store state cluster-wide, if needed
● watch out for everything that can go down the timeout route
○ most checks complete in < 1 second when OK or with simple
failures (eg. tcp port closed), but when they have to hang around
waiting for long timeouts there’s a huge impact on monitoring
systems workload
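A sketch of the timeout point (a hypothetical plugin, not from the talk): make the timeout explicit and short, and report it quickly instead of hanging around.

```python
#!/usr/bin/env python3
"""Sketch of a TCP check with an explicit, short timeout."""
import socket
import sys

OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

def check_tcp(host, port, timeout=3.0):
    """Return (exit_code, message) in monitoring-plugins style."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return OK, "TCP OK - %s:%d reachable" % (host, port)
    except socket.timeout:
        # fail fast and say so, instead of hanging for minutes and
        # tying up check slots on the monitoring node
        return CRITICAL, "TCP CRITICAL - %s:%d timed out after %ss" % (host, port, timeout)
    except OSError as exc:
        return CRITICAL, "TCP CRITICAL - %s:%d - %s" % (host, port, exc)

if __name__ == "__main__" and len(sys.argv) == 3:
    code, message = check_tcp(sys.argv[1], int(sys.argv[2]))
    print(message)
    sys.exit(code)
```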
write good check plugins (and test them)
46. ● integrating visual tools in your monitoring allows people to see at a
glance the current situation and how it evolved over time
● IcingaWeb2 grafana/graphite/etc… modules are a perfect example
● tools like Grafana let users build their own dashboards and drill down
to the data they are interested in, without having to ask a sysadmin to
provision custom configurations on the monitoring systems
● and they can also be hooked up to legacy systems if need be, providing
a nice improvement when you can’t change the whole system
visualizations (a picture is worth a thousand words)
47. ● enables testing stuff in local environments (eg. Vagrant) and ensures
consistency between deployments (staging/prod)
● removes human error from repetitive tasks (or at least guarantees
consistent repetitive errors ;)
● allows quicker and simpler processes (eg. click in NagioSQL → Icinga2
running a translated, validated, versioned configuration)
● come on, it’s 2019: there’s no question about automation, just do it™.
embrace automation (let the machine work for you)
48. ● there are lots of integrations and contact points between different
tools and different systems, lots of moving parts
● having everything documented is crucial
● if you are using any kind of configuration management (like Puppet in
our case) the code is itself a form of documentation
● you should be able to point at any single component (a cron job, a
script, a configuration choice) and answer the five Ws in a matter of
minutes
documentation (I wish you were here)
49. Q & A