Andy Brist's presentation on High Availability and Failover Solutions for Nagios XI. The presentation was given during the Nagios World Conference North America held Oct 13th - Oct 16th, 2014 in Saint Paul, MN. For more information on the conference (including photos and videos), visit: http://go.nagios.com/conference
1. Failover and High Availability
Solutions for Nagios XI
Andy Brist
abrist@nagios.com
2. Introduction
• Who am I?
• Nagios Support Team Manager
• Team Lead for Nagios-Plugins
(github.com/nagios-plugins)
3. Disclaimer
• Every environment is different
• Failover/HA is, by nature, a customized
solution
• My case studies are not your production
environments
• I know Nagios/XI, not your SLA
• Test in a lab. First.
4. Agenda
● Short overview of the different failback/failover solutions
● Nagios XI Data Locations and other files/services relevant to
failover scenarios.
● Snapback
● Failback
● Failover
● HA vs. Failover
● Observations, Considerations
5. Backup (snapback)
● Restore VM snapshot or spin up a new instance and restore a
backup
● Most common implementation
● Easiest of all options
● Most potential downtime of scenarios
● Maximum historical and configuration data lost = the interval
between snapshots
● Requires manual intervention
6. Automated XI Backups
● XI provides a method for scheduled backups through the
"Scheduled Backups Component"
– ssh
– ftp
– local fs
● Useful for remote backups or manual failback
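The component's local-fs target can also be mirrored with a plain cron entry calling the stock XI backup script. A minimal sketch (the script path is as shipped on typical XI installs; verify it on yours):

```shell
# /etc/cron.d/xi-backup -- nightly local backup at 02:00.
# backup_xi.sh is the script the Scheduled Backups component wraps;
# confirm its location on your install before relying on this.
0 2 * * * root /usr/local/nagiosxi/scripts/backup_xi.sh
```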
8. Failback
● Secondary is periodically updated from an XI backup.
● The nagios process is started by hand when the master has an issue.
● A cron job on the secondary restores the newest backup once a day.
● If unconcerned with historical data and mrtg performance data, just
push/restore the object configs and sql dumps (if not offloaded)
● Not to be confused with snapback as this is a separate, different
instance/image, not just a previous state of the failed instance.
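The once-a-day restore job can be sketched as a small script that picks the newest pushed archive and hands it to XI's stock restore script. The backup directory below is a placeholder, and the restore_xi.sh path should be verified on your install:

```shell
#!/bin/sh
# Failback sketch: restore the newest backup archive pushed to the
# secondary. BACKUP_DIR is a placeholder; adjust to where your
# primary pushes its scheduled backups.
BACKUP_DIR="${BACKUP_DIR:-/store/backups/nagiosxi}"

newest_backup() {
    # Newest .tar.gz by modification time; empty if none exist
    ls -t "$1"/*.tar.gz 2>/dev/null | head -n 1
}

archive=$(newest_backup "$BACKUP_DIR")
# Uncomment to actually restore, then schedule this script from cron
# (e.g. 30 3 * * * root /usr/local/sbin/xi-failback.sh):
# [ -n "$archive" ] && /usr/local/nagiosxi/scripts/restore_xi.sh "$archive"
echo "would restore: ${archive:-none found}"
```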
9. Additional Considerations
● Easy to implement with the “Scheduled Backups” XI
component.
● Agents must maintain 2+ allowed hosts
● SNMP traps must be configured to push to 2+ hosts
● May experience substantial downtime if the backup is large
and the primary fails during a data restore on the secondary.
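For NRPE-based agents, the "2+ allowed hosts" point usually comes down to one line in each agent's config. A sketch, with placeholder IPs standing in for the primary and secondary XI servers:

```shell
# nrpe.cfg on each monitored host (path varies by install):
# allow checks from localhost plus both XI servers
allowed_hosts=127.0.0.1,192.0.2.10,192.0.2.11
```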
10. Failover
● Difficult to get right
● Demanding on i/o resources and network speed
● Very little to no loss of historical data
● Minimal downtime
● Fully automated
● Can provide minimal clustering for XI services through “High
Availability”
12. Nagios XI
● Object Configuration
● Check Status
● Object State
● Program State
● Historical State Data
● Performance Data
13. Nagios XI - Services
nagios – Monitoring engine
mysql – Object configuration and ndo historical data
ndo2db – Writes historical data to mysql database
postgresql – Nagios XI settings/user database
npcd – Performance data daemon
crond – Task scheduler
httpd – Web server
14. XI Data and Redundancy
Absolute minimum redundant data required for any failover
scenario:
● (Working) Object configuration
● mysql 'nagiosql' database
● postgresql 'nagiosxi' database
15. Full Check Redundancy
Additional requirements for full check redundancy:
● mrtg config and RRDs (for bandwidth checks)
● nagios libexec folder (plugins)
Any additional dependencies for plugins.
For example:
● VMWare SDK
● Oracle Perl Library
● Java JRE
16. Runtime State Redundancy
Additional requirements for runtime state redundancy:
● retention.dat (state, runtime options, acknowledgments,
notification depth)
● NDO mysql database "nagios"
17. Historical Redundancy
Additional Data required for complete historical redundancy:
● nagios.log and archives directory
● perfdata RRDs
● mrtg config and RRDs
● NDO mysql database "nagios"
18. XI Data Summary
Logs/archives
Perfdata
Mrtg/configs
Databases
Object configs
Plugins
19. XI Data Summary
/usr/local/nagios/var/nagios.log
/usr/local/nagios/var/archives/
/usr/local/nagios/share/perfdata/
/var/lib/mrtg/
/etc/mrtg/
/var/lib/pgsql/
/var/lib/mysql/
/usr/local/nagios/etc/
/usr/local/nagios/libexec/
/usr/local/nagiosxi/
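For the file-based data above, a periodic rsync push toward a standby is one minimal sketch. The hostname "xi-standby" is a placeholder, and the pgsql/mysql directories are better handled with dumps or replication than a live-file copy, so they are omitted here:

```shell
#!/bin/sh
# Print the rsync commands that would mirror XI's file-based data
# to a standby; drop the "echo" to actually run them.
sync_paths() {
    standby="$1"
    for dir in \
        /usr/local/nagios/var/ \
        /usr/local/nagios/share/perfdata/ \
        /var/lib/mrtg/ \
        /etc/mrtg/ \
        /usr/local/nagios/etc/ \
        /usr/local/nagios/libexec/ \
        /usr/local/nagiosxi/
    do
        # -a preserves perms/ownership, --delete keeps an exact mirror
        echo rsync -a --delete "$dir" "root@$standby:$dir"
    done
}

sync_paths xi-standby
```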
20. High Availability?
1. Elimination of single points of failure.
2. Reliable crossover/failover.
3. Detection of failures as they occur.
21. High Availability?
Why would you need it?
● Least amount of downtime
● (limited) Service clustering
● Shared volumes solve the issues with syncing historical data in
redundant configurations
23. Shared Storage
● DRBD – block level replication, part of the linux kernel, well supported and
understood. Works well for all XI data types (including RRDs/DBs)
● NFS – A fine option; just make sure the NFS share does not have an i/o latency issue
or your checks WILL get behind. Do not mount the volume on more than one
server at a time, to avoid multiple servers writing check results during a partial failover.
● Replicated DBs – Fine solution, clusters well. Use DNS or virtual ips to control access
to the databases.
● rsync – Not immediate replication, but close. Easy to implement.
● GlusterFS – More problematic to set up, but good for offloaded mrtg/RRDs
24. DRBD
● Active/passive suggested
● Low latency storage
● Active mount should move with the vip
● Refer to Jeremy Rust's presentation notes for more
information
26. HA Failover Management
● Pacemaker/Heartbeat (the HA stack)
● uCarp scripts
● keepalived scripts
Custom Scripts:
● nagios itself – Event handler driven
● cron – Job that checks the master for connectivity. Reuse the
check_icmp or check_http plugins for this purpose.
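The cron-driven variant can be sketched as a tiny decision helper on the secondary. The check_icmp invocation, IP, and thresholds in the comment are placeholders to adapt:

```shell
#!/bin/sh
# Failover decision sketch, run from cron on the secondary:
# probe the primary and start the local nagios engine only when
# the probe fails; stay passive while the primary is healthy.
decide_failover() {
    # $1: health-check command; prints the action to take
    if $1 >/dev/null 2>&1; then
        echo "service nagios stop"   # primary healthy: stay passive
    else
        echo "service nagios start"  # primary down: go active
    fi
}

# Real cron usage (placeholder IP and thresholds), e.g. every minute:
# decide_failover "/usr/local/nagios/libexec/check_icmp -H 192.0.2.10 -w 500,20% -c 1000,60%"
decide_failover true    # demo: primary reachable
decide_failover false   # demo: primary unreachable
```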
27. Extra Considerations
● STONITH
● Clustering?
● DRBD/Shared Storage
● High Latency HA
● NDO/Databases
● Recovery
28. STONITH
(shoot the other node in the head)
● Mechanism by which a failing
server is guaranteed to be
removed from the cluster
● Not required, but advised
● Hardware (including ups) and
software (vmware stonith “device”
and shell scripts)
● Only failing over when the primary
is unreachable is safest
● Beware of overzealous failover
conditions as they can lead to a . .
29. Deathmatch!
No, really. Stonith gives your servers the ability to
KILL THEMSELVES and FRIENDS
● Beware of services whose init
actions/failures should not cause
failover/stonith
● Any actions requiring a shared
volume in active/passive mode
should not immediately cause
failover due to potential latency
during volume mounts
● Test, test, test the disaster
scenarios in a LAB first or the
fragfest may include your job!
30. Clustering/Fencing
● Several parts of Nagios Core and Nagios XI are clusterable.
Processes that can potentially be clustered:
– offloaded postgresql
– offloaded mysql/ndo2db
– offloaded mrtg
● Services that are dependent on the core monitoring engine and
filesystem and should not be clustered:
– nagios, npcd, cronjobs
– httpd
– snmptrapd, snmptt
31. DUAL DRBD Primary
● Disconnecting from the master before mounting the shared
volume during failover is no longer needed.
● Careful implementation allows multiple servers to
concurrently access the shared volume. Potentially useful for
ambitious clusters and shared historical records.
● Slower, as the “secondary” can lock blocks.
● More prone to “split-brains”
● Usually requires clustered file systems
32. High Latency HA
● Problematic if the HA solution was not designed for potential
high latency
● Will potentially cause i/o wait issues
● It may be better to push checks to a central server(s) with
NRDP/outbound checks/etc, keeping HA solutions local, or to
pay for a faster pipe.
● DRBD Proxy – A good solution if high latency HA is a must –
uses an asynchronous buffer for block writes to the secondary
volumes (does not support dual primary)
33. NDO Considerations
● Enforce single ndo instance access to mysql
● If multiple ndo processes connecting to a single ndo db is required,
consider using ndo db instances
● You can control ndo's access to the mysql server through iptables
and the vip.
● Offload ndo2db to the offloaded mysql server
● Configure ndomod to connect through a TCP socket. This
can potentially decrease load on the nagios server.
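A TCP-socket ndomod setup comes down to a few directives in ndomod.cfg. A sketch, where 192.0.2.20 is a placeholder for the offloaded mysql/ndo2db server and 5668 is the conventional ndo2db port:

```shell
# ndomod.cfg on the monitoring server: send NDO data over TCP to
# the offloaded ndo2db host instead of a local unix socket
output_type=tcpsocket
output=192.0.2.20
tcp_port=5668
```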
34. Database Considerations
● Initiating failover due to crashed DBs may cause a deathmatch
as all nodes will fail (due to their shared nature)
● Offload both postgresql and mysql databases. Requires a
virtual ip or careful management of DNS.
● XI has scripts to repair the databases, use them!
35. Recovering from Failover
● Degraded ex-primaries should not be added back to the cluster
automatically. Doing so may cause split brains.
● Split brains REQUIRE manual intervention if preservation of historical
data is desired.
● Stonith Deathmatches – Have a primary image/instance without
stonith enabled for recovery
● Maintain an ultimate disaster recovery server instance/image
outside of the cluster pool for when all else has failed.
36. A Plea from Nagios Support
● Failover/HA != backups
● Test, test, TEST! Use your lab please.
● Document. Everything. The biggest hurdle for support is an
unknown, undocumented, non-standard configuration; Failover/HA
deployments definitely qualify.
37. Final Comparisons
● Snapback: Easy. Slow recovery. Requires manual intervention.
Highest potential historical loss.
● Failback: Intermediate. Moderate recovery. Can be automated. Less
historical loss.
● Failover: Difficult. Fast recovery. Fully automated. Nearly no historical
loss.
● High Availability: Difficult. Fast recovery. Automated. Redundancy
across WAN links. Limited clustering. Least potential downtime.
Multiple potential issues with split-brain, stonith/deathmatches and
latency, so care should be given, and scenarios tested.
38. Food for thought . . . .
● HA in a federated model . . . . . . . .
39. Final Questions For You
● How much of Nagios XI, or Core, can truly be
set up to be "HA"? Do you care? :P
● Do you need HA/failover, or will
failback/snapback suffice?
● Is the time trade off in your environment
worth it?
40. Questions for Me?
Any questions?
(common/critical answers noted below for the sake of efficiency)
● 11 meters/sec (unladen European swallow)
● 42
● The Prime Directive
● 3 Times
● The Categorical Imperative/Pragmatism (choose 1)
● No.*
● Evasive Subjunctive
● . . . Yes?