Tales from a legacy world
migrating (part of) a Nagios infrastructure
Luca Lesinigo - Solutions Architect
Codeway SA
26/09/2019 - Icinga Camp Milan
$ whoami
Tales from a legacy world - Migrating (part of) a Nagios infrastructure
Luca Lesinigo
Solutions Architect
Codeway SA
(Chiasso, CH)
● *nix sysadmin since high school
● now cfg mgmt, monitoring,
devops, CI/CD and some other
buzzwords
● 💙 Icinga ;-)
Migrating a legacy monitoring infrastructure from Nagios to Icinga2 is not
always straightforward but it's not rocket science either.
In this talk we want to present our experience with one challenging
migration, what we had to deal with, how we solved some legacy
compatibility requirements and the lessons we learnt.
Tales from a legacy world - Migrating (part of) a Nagios infrastructure
what’s up?(what I’m supposed to tell you)
● this will be boring: no new shiny technology or revolutionary tools
○ it’s just Icinga, IcingaWeb2, InfluxDB, Grafana, Nginx, PostgreSQL,
GlusterFS, GitLab, Python, Puppet, Docker 😅
● hackish stuff ahead, but it is Enterprise Production Ready Stuff™
○ don’t try this at home
Tales from a legacy world - Migrating (part of) a Nagios infrastructure
disclaimers(sorry about them)
Our customer is the Italian branch of a worldwide enterprise, with >2100
hosts and >13000 services (a mix of servers, Linux VMs, Windows VMs,
network gear, UPSes, etc…). A lot of them are based in on-premise DCs, but
there’s also equipment spread across their Italian locations and in the cloud.
This was being monitored with a classic Nagios instance running on SLES
systems with some additional tools including NagVis and PNP4Nagios.
Configurations were wholly managed with NagioSQL.
Tales from a legacy world - Migrating (part of) a Nagios infrastructure
what’s the starting point?(what they said they had)
Nagios was deployed on a pair of nodes managed by heartbeat (active /
standby) with the relevant file systems (configurations, check plugins,
performance metrics) mounted over NFS.
Then we found that the NFS server was a single Linux VM…
...but everything was on vSphere HA
Tales from a legacy world - Migrating (part of) a Nagios infrastructure
high availability(sort of)
Changes to the monitoring configurations were managed with NagioSQL;
a couple of people were responsible for them, but 99% of the time it was the
same person managing everything.
There was zero automation: provisioning monitoring for hosts and
services and keeping it up-to-date was handled by email and/or verbal
communication. Decommissioning, when it happened, was manual too.
Tales from a legacy world - Migrating (part of) a Nagios infrastructure
managing configs(empowering users with modern tools and methodologies… oh wait, wrong slide)
Plugins were in one of the NFS-mounted shares; some of them were simply
copies of the standard ones and many were in-house developments in
various stages of documentation, maturity, error handling, portability.
We quickly found out that parameters were often hardcoded in the plugin
scripts. The Nagios way of handling commands & arguments didn’t
encourage any better structure either.
Tales from a legacy world - Migrating (part of) a Nagios infrastructure
monitoring plugins(I thought it simply was some standard library + some customizations, wasn’t it?)
Users of the system relied mainly on three interfaces:
● the classic Nagios web interface
● NagStaMon on their desktops
● NagVis views, both from their workstations and in rotation on big
wall-mount displays
Authentication went to the A/D servers for all of the above
Tales from a legacy world - Migrating (part of) a Nagios infrastructure
user experience(their daily dose)
The monitoring system alerts users through email notifications.
Lots of emails.
Many users didn’t treat them as their first resource but relied
heavily on NagStaMon to surface new criticals; they would then go
through the emails or look at the web interface to get more details.
Tales from a legacy world - Migrating (part of) a Nagios infrastructure
notifications(lots of)
● replace Nagios with a better, modern tool: Icinga2
● better performance / scaling
● better reliability
● be in a better position for future evolutions
● keep both the old and new systems online for a little while
● decommission the old system after a couple months
Tales from a legacy world - Migrating (part of) a Nagios infrastructure
project goals(hoping for a better future)
● keep Nagios & Icinga2 configurations perfectly in sync
○ and do that while also keeping NagioSQL
● keep the current checks, agents, protocols, ….
○ no new agents or configurations on monitored nodes
● “and that’s all, really, nothing else”
○ I knew things would get interesting sooner or later
Tales from a legacy world - Migrating (part of) a Nagios infrastructure
project requirements(I said it wasn’t straightforward)
● build a PoC, quick ’n dirty, show something running
● gather comments and suggestions
● build the solution in an automated, maintainable way
● deploy it, do smoke tests, see what happens
● cross-check with Nagios
● if everything’s ok, celebrate 🍻
● after some time, retire Nagios and celebrate again 🍻
Tales from a legacy world - Migrating (part of) a Nagios infrastructure
the plan(how we’d save the world and be the heroes)
basic design is more or less the usual one, with HA:
● two Icinga2 nodes in HA (master zone)
● InfluxDB for performance metrics, Grafana for visualization
● IcingaWeb2 with Grafana module 😎 and customized theme
● Icinga ClassicUI 😕
● PostgreSQL database for IDO-DB and Grafana configs
Tales from a legacy world - Migrating (part of) a Nagios infrastructure
let’s do it!(we will rock you)
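For reference, a minimal sketch of what the two-node master zone definition looks like (hostnames are made up, not the customer’s actual nodes):

// zones.conf sketch for the two Icinga2 nodes in the HA master zone
object Endpoint "icinga2-a.example.com" { }
object Endpoint "icinga2-b.example.com" { }

object Zone "master" {
  endpoints = [ "icinga2-a.example.com", "icinga2-b.example.com" ]
}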
● the ClassicUI came up as a new requirement 😱
● there is little to no documentation about the old system
● oh, they’re also migrating from SLES11 to CentOS7
● and, they’re also migrating from the good old VM cluster to a new one
○ yes, the freshly installed and cool hyperconverged one
● then there’s the new VLANs and subnets
● and lots of other hidden “gems” (read: hacks)
Tales from a legacy world - Migrating (part of) a Nagios infrastructure
but then...(knock knock. who’s there? REALITY!)
● small new requirements kept forcing the project to change course en route
● customer staff is super friendly and helpful, but they’re also super busy
with $dayjob and don’t always have lots of time to follow the new
monitoring project and listen to our existential problems
● the project changed ownership multiple times
● sometimes internal silos between customer teams made it difficult for
“monitoring systems people” to get the attention of “monitored
systems people”
Tales from a legacy world - Migrating (part of) a Nagios infrastructure
… and also ...(never enough)
● turns out there’s a Puppet server we can use
○ I’m a Puppet guy as much as I’m an Icinga guy
○ it made a perfect fit for this project
● ops gave us base CentOS7 systems with just the agent pre-installed
● the whole stack is completely installed & configured from Puppet
○ “bash zero” mantra
● useful to keep configurations the same between systems / environments
● useful to do test deployments and/or re-deploy stuff when needed
Tales from a legacy world - Migrating (part of) a Nagios infrastructure
automate all the things(obviously)
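A minimal sketch of how such a node is described in Puppet (class, package and file names are invented for illustration; the real manifests obviously cover the whole stack):

# profile sketch: the node is fully described in code, nothing installed by hand
class profile::icinga2_master {
  package { ['icinga2', 'icingaweb2', 'nagios-plugins-all']:
    ensure => installed,
  }

  # every config file comes from the repo; changes trigger a reload
  file { '/etc/icinga2/zones.conf':
    ensure => file,
    source => 'puppet:///modules/profile/icinga2/zones.conf',
    notify => Service['icinga2'],
  }

  service { 'icinga2':
    ensure  => running,
    enable  => true,
    require => Package['icinga2'],
  }
}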
Many plugins were taken from SLES11 packages (and possibly others), and
most of them didn’t even run on CentOS7.
Solution: leverage Puppet to switch back to upstream versions
Tales from a legacy world - Migrating (part of) a Nagios infrastructure
standard plugins(yeah… but from a years-old standard)
# plugins to take from the distro packages instead of the old NFS-era copies
$replace_system_checks = [ 'check_http', 'check_nrpe', 'check_snmp' ] # and many more...

$replace_system_checks.each |$plugin| {
  # replace the copy shipped in the custom tarball with a symlink to the packaged plugin
  file { "/opt/monitoring-plugins/${plugin}":
    ensure  => link,
    target  => "/usr/lib64/nagios/plugins/${plugin}",
    require => Staging::Deploy['custom-monitoring-plugins.tgz'],
  }
}
Some plugins hardcoded stuff that shouldn’t be there and/or had
incompatibilities with the newer OS (eg. Perl, Python, shared libs, …).
Solution: we set up a new internal GitLab repo with a branch to track the
“legacy version” (from the old systems) and another one with the patched
versions (for the new systems). Some hardcoded values were moved to
environment variables which we’d populate at runtime from Icinga2
host/service/global vars.
Tales from a legacy world - Migrating (part of) a Nagios infrastructure
custom plugins(too much custom, sometimes)
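For illustration, this is roughly how an Icinga2 CheckCommand can populate those environment variables from host custom vars (the plugin name, variable names and addresses here are hypothetical):

object CheckCommand "check-legacy-app" {
  command = [ "/opt/monitoring-plugins/check_legacy_app" ]   // hypothetical patched plugin
  // values the plugin used to hardcode, now injected via the environment
  env.APP_ENDPOINT  = "$legacy_app_endpoint$"
  env.APP_STATE_DIR = "$plugin_state_dir$"
}

object Host "app01.example.com" {
  check_command = "hostalive"
  address = "10.0.0.10"
  vars.legacy_app_endpoint = "https://app01.example.com/status"
}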
Bonus point #1: for the first time ever there was versioning of the plugins
used on the legacy systems (who did what, when and why?)
Bonus point #2: we could also leverage GitLab’s issue tracker for bugs
affecting the older systems, instead of simply patching in place on the NFS
server (a much more streamlined process and better collaboration)
Bonus point #3: we could leverage GitLab CI to run tests inside dockerized
GitLab runners
Tales from a legacy world - Migrating (part of) a Nagios infrastructure
custom plugins repo(this is the good stuff)
● there’s the usual problem with stateful checks when the cluster
re-allocates checks from one node to the other: the plugin expects files in
/tmp from the previous run, but doesn’t find them (example: SNMP
network interface checks that save traffic counters to derive average
bandwidth on the next check run)
● supporting ClassicUI meant we needed an older Icinga2 version
Tales from a legacy world - Migrating (part of) a Nagios infrastructure
active/active HA is cool(but… there’s always a “but…”)
We already had plugins in a GitLab repo with automated deployments (on
the new systems), so we opened issues whenever we detected a stateful
plugin that needed files from the previous run.
Solution (part 1): patch the plugin to use a custom path from an
environment variable instead of hardcoding /tmp.
Solution (part 2): mount a shared GlusterFS filesystem at the custom path
Tales from a legacy world - Migrating (part of) a Nagios infrastructure
stateful plugins(I wish more plugin authors would care about this)
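The patch itself boils down to a few lines; a sketch in shell, with the variable and file names made up:

# before: STATE_FILE="/tmp/check_iface_${HOSTADDRESS}.state"
# after: honour an environment variable set by the Icinga2 CheckCommand,
#        fall back to /tmp so the unpatched behaviour still works on Nagios
STATE_DIR="${PLUGIN_STATE_DIR:-/tmp}"     # on the Icinga2 nodes this points at the GlusterFS mount
STATE_FILE="${STATE_DIR}/check_iface_${HOSTADDRESS}.state"

With PLUGIN_STATE_DIR exported from the CheckCommand environment, both cluster nodes read and write the same state on the shared filesystem.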
A long, long time ago there was the ClassicUI, and it hasn’t been supported
for ages. But it became a requirement after the project started…
Workaround: let Puppet pin an older Icinga2 version to avoid bringing in a
newer one at each system update, and live with it for some time.
At the same time, encourage users to try IcingaWeb2 and implement an
internal “help desk” where they can get help to start using it.
Tales from a legacy world - Migrating (part of) a Nagios infrastructure
ClassicUI(ahh, the good ol’ times)
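A sketch of what the pin looks like in Puppet (the version string is purely illustrative; in practice you would also guard against manual yum updates, e.g. with a versionlock):

# keep icinga2 at the release known to still work with the ClassicUI setup
$icinga2_pinned = '2.8.4-1.el7'   # illustrative version string
package { ['icinga2', 'icinga2-ido-pgsql']:
  ensure => $icinga2_pinned,
}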
We needed to keep the old and new system configurations “in sync” and let
people continue using NagioSQL to manage everything.
Various reasons prevented us from migrating to Director at the time.
Solution (in an ideal world): don’t. Just don’t do that.
Solution (in our world): too long, see next slides :-)
Tales from a legacy world - Migrating (part of) a Nagios infrastructure
configuration sync(this is where it gets weird)
There’s not a 1:1 mapping between all Nagios configurations and Icinga2
ones. Once upon a time the nice Icinga folks had a migration tool, but it has
since been discontinued, and we also had the NagioSQL part in our equation.
We initially tried to follow the same route, going from the Nagios config
directly to the Icinga2 config, but we quickly found out why the migration
tool never became a universal solution to the world’s problems.
Tales from a legacy world - Migrating (part of) a Nagios infrastructure
nagios.conf → icinga2.conf(nice try)
We chose a brute-force approach: instead of starting from the
Nagios config, we would start from the Nagios objects precache.
It is basically the whole config with everything already “exploded” (i.e. no
templates), thus leveraging Nagios itself for a lot of the config parsing logic
and getting a much simpler (albeit much bigger) thing to digest.
Tales from a legacy world - Migrating (part of) a Nagios infrastructure
objects.precache → icinga2.conf(the dark side of the migration)
We wrote a Python translation tool that would get the whole Nagios
objects precache, parse it, map everything to Icinga2 configuration objects
and configurations, apply custom logic for this specific customer, apply (a
lot of) manual overrides, apply some more magic and finally emit
everything as one big Icinga2 configuration file.
I swear it’s the worst code I’ve ever written, by any standard, but its key
feature is: It Gets the Job Done™
Tales from a legacy world - Migrating (part of) a Nagios infrastructure
enter the BFT9000(“T” is for “Translator”)
It’s partly a parser, partly a lexer, partly a macro processor, partly a
translator/remapper. The whole thing is ugly and I won’t show it to you to
avoid melting your eyes. (also because it’s the customer’s proprietary IP)
But I can tell you it’s ~ 1000 lines of code for a dozen classes and takes
something like one or two minutes to translate the Nagios objects
precache into a >400,000-line Icinga config file. Yes, that’s over four hundred
thousand lines, but there are some empty lines and even some
auto-generated comments :-)
Tales from a legacy world - Migrating (part of) a Nagios infrastructure
what does it look like?(thanks mom for paying my CS degree)
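To give an idea of the parsing half, a stripped-down sketch of how objects.precache blocks can be read in Python (the real tool does far more, and the key/value splitting assumes the usual tab-separated precache layout):

import re

BLOCK_RE = re.compile(r"define\s+(\w+)\s*\{(.*?)\}", re.S)

def parse_precache(text):
    """Yield (object_type, attributes) for every 'define ... { ... }' block."""
    for obj_type, body in BLOCK_RE.findall(text):
        attrs = {}
        for line in body.splitlines():
            line = line.strip()
            if not line or line.startswith("#"):
                continue
            key, _, value = line.partition("\t")   # precache lines are key<TAB>value
            attrs[key.strip()] = value.strip()
        yield obj_type, attrs

# the mapping step then turns e.g. ('service', {...}) into 'object Service ...' text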
● users modify stuff in NagioSQL
● NagioSQL writes new Nagios config files and reloads Nagios
● a cron job runs every 2 minutes to watch for config file changes
● when a change is detected, it runs nagios -p to regenerate
objects.precache
● ...and then uses rsync to copy it to the Icinga2 server
○ nice detail: rsync ensures an atomic update of the copied file
Tales from a legacy world - Migrating (part of) a Nagios infrastructure
how is it actually used?(Nagios side)
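A hedged sketch of that cron job (the paths, the checksum-based change detection and the destination host are all assumptions):

#!/bin/bash
# runs from cron every 2 minutes on the Nagios node
set -euo pipefail

CFG=/etc/nagios/nagios.cfg
STAMP=/var/lib/nagios/.last-config-checksum
CUR=$(find /etc/nagios -name '*.cfg' -exec md5sum {} + | sort | md5sum)

# nothing to do if the NagioSQL-generated config did not change
[ -f "$STAMP" ] && [ "$CUR" = "$(cat "$STAMP")" ] && exit 0

nagios -pv "$CFG"                                  # verify config and write objects.precache
rsync -a /var/lib/nagios/objects.precache \
      icinga2-master:/var/lib/nagios-sync/         # rsync renames into place, so the copy is atomic
echo "$CUR" > "$STAMP"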
● a SystemD Path unit detects changes to the objects.precache
file rsync’ed from Nagios and triggers a Service unit
● the Service unit runs a local bash script
● the script copies the file into the workdir of a local git repo; if there is no
diff then nothing has changed and it simply quits
● if there is any change, it runs the translator script to get a new Icinga2
config, saving it in the local git workdir
● then it runs icinga2 daemon -C to validate the generated config
Tales from a legacy world - Migrating (part of) a Nagios infrastructure
how is it actually used?(Icinga2 side, receiving the changes)
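For illustration, the two units could look roughly like this (unit names and paths are made up; systemd triggers the service with the same base name as the path unit):

# /etc/systemd/system/nagios-precache.path
[Unit]
Description=Watch the objects.precache rsync'ed from Nagios

[Path]
PathChanged=/var/lib/nagios-sync/objects.precache

[Install]
WantedBy=multi-user.target

# /etc/systemd/system/nagios-precache.service  (triggered by the Path unit above)
[Unit]
Description=Translate objects.precache into Icinga2 configuration

[Service]
Type=oneshot
ExecStart=/usr/local/bin/translate-and-deploy.sh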
● if validation is successful, both the incoming objects.precache
and the generated Icinga2 cfg get committed to the git repo
● the new config is deployed in /etc/icinga2/zones.d/master
● Icinga2 gets reloaded
Bonus point: there is now a log of what happens through NagioSQL… just
browse the git history of the objects precache to see a nice diff of the
changes.
Tales from a legacy world - Migrating (part of) a Nagios infrastructure
how is it actually used?(Icinga2 side, deploying the changes)
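One way the validate → commit → deploy → reload sequence could be wired together; a rough sketch, with the repo layout, file names and the translator entry point all assumed:

#!/bin/bash
# called by the systemd Service unit whenever a new objects.precache arrives
set -euo pipefail
cd /opt/nagios2icinga            # local git repo holding precache + generated config

cp /var/lib/nagios-sync/objects.precache .
git diff --quiet -- objects.precache && exit 0          # no change, nothing to do

./translate.py objects.precache > generated.conf        # hypothetical translator entry point

# stage the candidate, validate the whole config, roll back if validation fails
install -m 0644 generated.conf /etc/icinga2/zones.d/master/from-nagios.conf
if icinga2 daemon -C; then
    git add objects.precache generated.conf
    git commit -m "sync from NagioSQL $(date -Is)"
    systemctl reload icinga2
else
    # restore the last committed (known good) version and bail out
    git checkout -- objects.precache
    git show HEAD:generated.conf > /etc/icinga2/zones.d/master/from-nagios.conf
    exit 1
fi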
This mechanism has been in production since mid-2018 and has performed
flawlessly ever since. There are hundreds of automatic git
commits with the whole history of both the incoming
objects.precache and the resulting Icinga2 config file.
The whole process takes a couple of minutes between NagioSQL writing a
new config and Icinga2 being reloaded with the translated version, keeping
the two systems’ configurations in sync.
Tales from a legacy world - Migrating (part of) a Nagios infrastructure
does it really work?(no kidding)
Oftentimes the same configurations and check plugins still gave different
results on the two systems: hosts and services ended up being in different
states between Nagios and Icinga2.
Sometimes there were “external” causes: ~90% of them were due to the new
VLANs and networks not being allowed in host ACLs or firewall rules.
Other times there were “internal” causes, most of them being plugin scripts
behaving differently on the two systems.
Tales from a legacy world - Migrating (part of) a Nagios infrastructure
are we done yet?(you could bet we’re not)
● false alarms that aren’t false
○ Icinga keeps saying something is CRIT, Nagios says OK
○ after triple-checking the Icinga side of things, you discover that
those years-old OKs in Nagios were actually false
● random asymmetric network lockups
○ network is fine between vm A and B, also between B and C
○ but there is exactly 100% packet loss between A and C
○ GlusterFS & Icinga IDO-DB don’t like this
■ ask me how I know
Tales from a legacy world - Migrating (part of) a Nagios infrastructure
funny failures(“funny” applies only as long as you’re not in production)
● check plugins that leak temporary / state files, exhausting inodes
○ makes for a good opportunity to explain why you can get
“filesystem full” even if df shows lots of free space
● SMTP black holes: people complained they weren’t getting notifications,
while the logs showed the SMTP server had accepted them successfully
○ hardcoded filters in the MTA dropping messages after having
received and enqueued them
● random DNS issues, different servers giving different answers, ...
○ [cit.] everything is a freaking dns problem
Tales from a legacy world - Migrating (part of) a Nagios infrastructure
funny failures, vol. 2(sometimes they come back)
With over 13,000 services there was a need to quickly detect and identify
problems. We wrote another Python script that would fetch LiveStatus
data from Nagios and Icinga2 and cross check between the two.
This helped us quickly identify different results, also grouping them in
broad categories:
● OK on Nagios and non-OK on Icinga2 (the most common one)
● the other way around (rare)
● non-OK in both, but different states (eg. warning vs unknown)
Tales from a legacy world - Migrating (part of) a Nagios infrastructure
verifying results(who monitors the monitoring systems?)
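A stripped-down sketch of such a Livestatus cross-check (the socket path and TCP endpoint are assumptions; the real script also buckets the differences into the categories above):

#!/usr/bin/env python3
import json
import socket

QUERY = b"GET services\nColumns: host_name description state\nOutputFormat: json\n\n"

def fetch_states(address):
    """Query a Livestatus endpoint and return {(host, service): state}."""
    family = socket.AF_INET if isinstance(address, tuple) else socket.AF_UNIX
    with socket.socket(family, socket.SOCK_STREAM) as sock:
        sock.connect(address)
        sock.sendall(QUERY)
        sock.shutdown(socket.SHUT_WR)
        raw = b"".join(iter(lambda: sock.recv(65536), b""))
    return {(host, svc): state for host, svc, state in json.loads(raw)}

nagios = fetch_states("/var/spool/nagios/cmd/live")            # assumed unix socket
icinga = fetch_states(("icinga2-master.example.com", 6558))    # assumed TCP endpoint

for key in sorted(nagios.keys() & icinga.keys()):
    if nagios[key] != icinga[key]:
        print(f"{key[0]}!{key[1]}: nagios={nagios[key]} icinga2={icinga[key]}")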
By design, some parts of the configuration are not synchronized between the
two systems via the translation tool. For example:
● notification configurations: defined new NotificationCommands
and used apply rules for Notifications
● surrounding tools like performance metrics: started from a blank
sheet and adopted InfluxDB and Grafana
We are also still finding small surprises here and there and fixing them
as we discover them :-)
Tales from a legacy world - Migrating (part of) a Nagios infrastructure
final notes(cleaning up the last bits)
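Roughly what that looks like on the Icinga2 side; a sketch where the script path, user group and assign rule are all invented:

object NotificationCommand "mail-service-notification" {
  command = [ "/etc/icinga2/scripts/mail-service-notification.sh" ]   // assumed notification script
  env.NOTIFICATIONTYPE = "$notification.type$"
  env.SERVICENAME      = "$service.name$"
  env.SERVICESTATE     = "$service.state$"
  env.HOSTNAME         = "$host.name$"
}

apply Notification "mail-ops" to Service {
  command     = "mail-service-notification"
  user_groups = [ "ops" ]
  states      = [ OK, Warning, Critical, Unknown ]
  types       = [ Problem, Recovery ]
  assign where host.vars.notifications == "mail"
}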
● the two systems are still running side by side
● more than 99.5% of host and service checks, and their results,
perfectly in sync between the two systems
○ (and this metric is improving)
● a better process in place to manage the whole life cycle of custom
plugin checks, which the legacy systems also benefit from
● currently in the process of switching to Icinga2 as primary and Nagios
as a reference backup
Tales from a legacy world - Migrating (part of) a Nagios infrastructure
what now?(state of the union)
● dockerize NagioSQL + Nagios and keep that container around to let
people keep using NagioSQL after the Nagios system is retired
and decommissioned
○ that instance will run only to generate objects.precache
● move more and more people to IcingaWeb2 and get rid of the
ClassicUI as soon as possible
○ this will unlock upgrades to newer Icinga2 versions
Tales from a legacy world - Migrating (part of) a Nagios infrastructure
what tomorrow?(short term)
We don’t have a specific roadmap (yet), but the next big thing should be /
will be / Ihopeitwillbe removing NagioSQL. Did anyone say Director? 😉
We will also be investigating the possibility of creating satellite zones to
delegate some workload and segregate check plugins.
Also, more automation to keep monitoring in sync with the real systems:
● better KPIs in the customer CMDB for monitoring coverage
● integrations with A/D, Azure, VMware and other authoritative sources
Tales from a legacy world - Migrating (part of) a Nagios infrastructure
a bright future ahead(hope to get there soon)
This project has been a (weird|extreme|atypical) scenario of legacy
compatibility.
There wasn’t an opportunity to introduce disruptive new technologies, but it
sure was a step forward for the customer in getting ready for future
improvements, and a learning opportunity for everyone involved.
What follows is a short collection of some (mostly obvious?) takeaways
from this experience, in no specific order.
Tales from a legacy world - Migrating (part of) a Nagios infrastructure
lessons learned(is there anything useful here?)
● a project like this has many crossing points on the boundary between
customer and provider / consultant, requiring close collaboration
● good project management and internal communications are a must
● identify all the users of the system and all the owners of the
monitored entities: they’re often the only ones that can tell you what’s
wrong and what’s correct
● try to get feedback as soon as possible in the project, use it to drive
your process and decisions
Tales from a legacy world - Migrating (part of) a Nagios infrastructure
people are critical for success(it’s not just an IT thing)
● monitoring is not the kind of system that you want to run on your
shiny new piece of technology before you have proven it is more than
stable
● ideally choose some reliable piece of your infrastructure that you
know inside out and that’s as “out of band” as possible from the
monitored systems
● ask yourself who’s monitoring the monitoring system
Tales from a legacy world - Migrating (part of) a Nagios infrastructure
aim for boring(the more boring, the better)
● we had many non-obvious cases of wrong results, which IMHO is the
worst thing you can get from a monitoring system: if you can’t trust it,
then it’s useless
● scripting generic sanity checks and cross-checks with the system
being replaced helped us quickly identify the vast majority of wrong
results
Tales from a legacy world - Migrating (part of) a Nagios infrastructure
double check your checks(the devil is in the details)
● I’ve seen one too many plugin checks that behaved “funny” every
time they went outside of the “everything’s ok” code path
● acknowledge that nowadays we can have HA clusters
○ provide for a way to store state cluster-wide, if needed
● watch out for everything that can go down the timeout route
○ most checks complete in < 1 second when OK or with simple
failures (eg. tcp port closed), but when they have to hang around
waiting for long timeouts there’s a huge impact on the monitoring
system’s workload
Tales from a legacy world - Migrating (part of) a Nagios infrastructure
write good check plugins(and test them)
● integrating visual tools in your monitoring lets people get a quick
glance at the current situation and how it has evolved over time
● IcingaWeb2 grafana/graphite/etc… modules are a perfect example
● tools like Grafana let users build their own dashboards and drill down
to the data they are interested in, without having to ask a sysadmin to
provision custom configurations on the monitoring systems
● and they can also be hooked up to legacy systems if need be, providing
a nice improvement when you can’t change the whole system
Tales from a legacy world - Migrating (part of) a Nagios infrastructure
visualizations(a picture is worth a thousand words)
● enables testing stuff in local environments (eg. Vagrant) and ensures
consistency between deployments (staging/prod)
● removes human error from repetitive tasks (or at least guarantees
consistent repetitive errors ;)
● allows quicker and simpler processes (eg. click in NagioSQL → Icinga2
running a translated, validated, versioned configuration)
● come on, it’s 2019: there’s no question about automation, just do it™.
Tales from a legacy world - Migrating (part of) a Nagios infrastructure
embrace automation(let the machine work for you)
● there are lots of integrations and contact points between different
tools and different systems, lots of moving parts
● having everything documented is crucial
● if you are using any kind of configuration management (like Puppet in
our case) the code is itself a form of documentation
● you should be able to point at any single component (a cron job, a
script, a configuration choice) and answer the five Ws in a matter of
minutes
Tales from a legacy world - Migrating (part of) a Nagios infrastructure
documentation(I wish you were here)
Q & A
Tales from a legacy world - Migrating (part of) a Nagios infrastructure