Failover and High Availability 
Solutions for Nagios XI 
Andy Brist 
abrist@nagios.com
Introduction 
• Who am I? 
• Nagios Support Team Manager 
• Team Lead for Nagios-Plugins 
(github.com/nagios-plugins)
Disclaimer 
• Every environment is different 
• Failover/HA is, by nature, a customized 
solution 
• My case studies are not your production 
environments 
• I know Nagios/XI, not your SLA 
• Test in a lab. First.
Agenda 
● Short overview of the different failback/failover solutions 
● Nagios XI Data Locations and other files/services relevant to 
failover scenarios. 
● Snapback 
● Failback 
● Failover 
● HA? Failover 
● Observations, Considerations
Backup (snapback) 
● Restore VM snapshot or spin up a new instance and restore a 
backup 
● Most common implementation 
● Easiest of all options 
● Most potential downtime of all the scenarios 
● Maximum historical and configuration data lost = the interval 
between snapshots 
● Requires manual intervention
Automated XI Backups 
● XI provides a method for scheduled backups through the 
"Scheduled Backups Component" 
– ssh 
– ftp 
– local fs 
● Useful for remote backups or manual failback
Failback
Failback 
● Secondary is periodically updated from an XI backup. 
● The nagios process is started by hand when the master has an issue. 
● Cronjob on the secondary restores newest backup once a day. 
● If unconcerned with historical data and mrtg performance data, just 
push/restore the object configs and sql dumps (if not offloaded) 
● Not to be confused with snapback as this is a separate, different 
instance/image, not just a previous state of the failed instance.
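
The daily restore job on the secondary could be sketched roughly as below. This is a minimal Python sketch, not the XI tooling itself: the backup directory, the `*.tar.gz` naming and the restore script path are all assumptions to adjust for your install, and the helper names are hypothetical. The functions only pick the newest backup and build the command, so the result can be reviewed before a cron job is allowed to run it.

```python
from pathlib import Path
from typing import List, Optional

# Assumed locations; adjust both to match your own installation.
BACKUP_DIR = Path("/store/backups/nagiosxi")
RESTORE_SCRIPT = "/usr/local/nagiosxi/scripts/restore_xi.sh"


def newest_backup(backup_dir: Path) -> Optional[Path]:
    """Return the most recently modified .tar.gz in backup_dir, or None."""
    candidates = [p for p in backup_dir.glob("*.tar.gz") if p.is_file()]
    if not candidates:
        return None
    return max(candidates, key=lambda p: p.stat().st_mtime)


def restore_command(backup_dir: Path = BACKUP_DIR) -> Optional[List[str]]:
    """Build (but do not run) the restore command for the newest backup."""
    backup = newest_backup(backup_dir)
    if backup is None:
        return None
    return [RESTORE_SCRIPT, str(backup)]
```

A cron wrapper would call `restore_command()` and hand the list to `subprocess.run` only when it is not `None`. Keeping selection separate from execution makes the job easy to dry-run in a lab.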
Additional Considerations 
● Easy to implement with the “Scheduled Backups” XI 
component. 
● Agents must maintain 2+ allowed hosts 
● SNMP traps must be configured to push to 2+ hosts 
● May experience substantial downtime if the backup is large 
and the primary fails during a data restore on the secondary.
Failover 
● Difficult to get right 
● Demanding on i/o resources and network speed 
● Very little to no loss of historical data 
● Minimal downtime 
● Fully automated 
● Can provide minimal clustering for XI services through “High 
Availability”
Failover
Nagios XI 
● Object Configuration 
● Check Status 
● Object State 
● Program State 
● Historical State Data 
● Performance Data
Nagios XI - Services 
nagios – Monitoring engine 
mysql – Object configuration and ndo historical data 
ndo2db – Writes historical data to mysql database 
postgresql – Nagios XI settings/user database 
npcd – Performance data daemon 
crond – Task scheduler 
httpd – Web server
XI Data and Redundancy 
Absolute minimum redundant data required for any failover 
scenario: 
● (Working) Object configuration 
● Mysql 'nagiosql' database 
● Postgresql 'nagiosxi' database
Full Check Redundancy 
Additional requirements for full check redundancy: 
● mrtg config and RRDs (for bandwidth checks) 
● nagios libexec folder (plugins) 
Any additional dependencies for plugins. 
For example: 
● VMWare SDK 
● Oracle Perl Library 
● Java JRE
Runtime State Redundancy 
Additional requirements for runtime state redundancy: 
● retention.dat (state, runtime options, acknowledgments, 
notification depth) 
● NDO mysql database "nagios"
Historical Redundancy 
Additional Data required for complete historical redundancy: 
● nagios.log and archives directory 
● perfdata RRDs 
● mrtg config and RRDs 
● NDO mysql database "nagios"
XI Data Summary 
Logs/archives 
Perfdata 
Mrtg/configs 
Databases 
Object configs 
Plugins
XI Data Summary 
/usr/local/nagios/var/nagios.log 
/usr/local/nagios/var/archives/ 
/usr/local/nagios/share/perfdata/ 
/var/lib/mrtg/ 
/etc/mrtg/ 
/var/lib/pgsql/ 
/var/lib/mysql/ 
/usr/local/nagios/etc/ 
/usr/local/nagios/libexec/ 
/usr/local/nagiosxi/
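
One way to put the path list above to work is to generate rsync commands from it. The Python sketch below groups the paths by the categories on the previous slide; the grouping, the helper name `rsync_commands`, the example hostname and the choice of plain `rsync -a` are assumptions for illustration. The database directories are left out by default, because copying the files of a running database server can produce an inconsistent copy; dumps or replication are the safer route for those.

```python
from typing import Dict, Iterable, List

# Paths from the summary above, grouped by category (grouping is an assumption).
XI_DATA_PATHS: Dict[str, List[str]] = {
    "logs": ["/usr/local/nagios/var/nagios.log", "/usr/local/nagios/var/archives/"],
    "perfdata": ["/usr/local/nagios/share/perfdata/"],
    "mrtg": ["/var/lib/mrtg/", "/etc/mrtg/"],
    "databases": ["/var/lib/pgsql/", "/var/lib/mysql/"],
    "object_configs": ["/usr/local/nagios/etc/"],
    "plugins": ["/usr/local/nagios/libexec/"],
    "nagiosxi": ["/usr/local/nagiosxi/"],
}

# Everything except the live database directories.
DEFAULT_CATEGORIES = tuple(k for k in XI_DATA_PATHS if k != "databases")


def rsync_commands(
    secondary: str, categories: Iterable[str] = DEFAULT_CATEGORIES
) -> List[List[str]]:
    """Build one 'rsync -a' command per path, mirroring it to the same path
    on the secondary host. Commands are returned, not executed."""
    commands = []
    for category in categories:
        for path in XI_DATA_PATHS[category]:
            commands.append(["rsync", "-a", path, f"{secondary}:{path}"])
    return commands
```

For example, `rsync_commands("xi-secondary.example.com", ["plugins"])` yields a single command that copies the libexec folder. Syncing only the categories a given redundancy level needs keeps the i/o cost down.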
High Availability? 
1. Elimination of single points of failure. 
2. Reliable crossover/failover. 
3. Detection of failures as they occur.
High Availability? 
Why would you need it? 
● Least amount of downtime 
● (limited) Service clustering 
● Shared volumes solve the issues with syncing historical data in 
redundant configurations
High Availability/Failover 
Major components: 
● Shared storage 
● Virtual IP 
● Management applications/scripts
Shared Storage 
● DRBD – block level replication, part of the linux kernel, well supported and 
understood. Works well for all XI data types (including RRDs/DBs) 
● NFS – Fine option, just make sure the NFS share does not have an i/o latency issue 
or your checks WILL get behind. Do not mount the volume on more than one 
server at a time, to avoid writing multiple checks in the case of a partial failover. 
● Replicated DBs – Fine solution, clusters well. Use DNS or virtual ips to control access 
to the databases. 
● rsync – Not immediate replication, but close. Easy to implement. 
● GlusterFS – More problematic to set up, but good for offloaded mrtg/RRDs
DRBD 
● Active/passive suggested 
● Low latency storage 
● Active mount should move with the vip 
● Refer to Jeremy Rust's presentation notes for more 
information
Virtual IP 
● pacemaker vip script 
● Custom ifconfig/ip shell scripts 
● uCarp Scripts 
● keepalived
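
For the custom `ip` script option, the core of moving a virtual IP is one `ip addr add` on the node taking over and one `ip addr del` on the node giving it up. The Python sketch below only builds those commands; the helper name, the example address and the interface name are assumptions. A real script would also have to refresh neighbours' ARP caches, which tools such as keepalived handle for you.

```python
from typing import List


def vip_command(action: str, vip_cidr: str, interface: str) -> List[str]:
    """Build the iproute2 command that adds or removes a virtual IP.

    action    -- "add" on the node taking over, "del" on the node releasing
    vip_cidr  -- address with prefix length, e.g. "192.0.2.10/24"
    interface -- network device that should carry the address, e.g. "eth0"
    """
    if action not in ("add", "del"):
        raise ValueError("action must be 'add' or 'del'")
    return ["ip", "addr", action, vip_cidr, "dev", interface]
```

`vip_command("add", "192.0.2.10/24", "eth0")` produces the argument list for `ip addr add 192.0.2.10/24 dev eth0`. Running it requires root, and the active DRBD mount should move together with the address.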
HA Failover Management 
● Pacemaker/Heartbeat (the HA stack) 
● uCarp scripts 
● keepalived scripts 
Custom Scripts: 
● nagios itself – Event handler driven 
● cron – Job that checks the master for connectivity. Reuse the 
check_icmp or check_http plugins for this purpose.
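
The cron-driven check could look something like the Python sketch below, which reuses `check_icmp` and relies on the standard plugin exit codes (2 is CRITICAL, 3 is UNKNOWN). The plugin path, the three-attempt threshold, the pause between attempts and the function names are assumptions, not part of XI. The primary is only reported as down when every consecutive attempt is CRITICAL, so a single dropped ping or a plugin that fails to run does not start a failover.

```python
import subprocess
import time
from typing import Callable, List

# Assumed plugin path (the libexec folder of a source install).
CHECK_ICMP = "/usr/local/nagios/libexec/check_icmp"
CRITICAL = 2  # standard Nagios plugin exit code for CRITICAL
UNKNOWN = 3   # standard Nagios plugin exit code for UNKNOWN


def run_plugin(argv: List[str]) -> int:
    """Run a plugin and return its exit code; treat run errors as UNKNOWN."""
    try:
        return subprocess.run(argv, capture_output=True, timeout=30).returncode
    except (OSError, subprocess.TimeoutExpired):
        return UNKNOWN


def primary_is_down(
    host: str,
    attempts: int = 3,
    pause_seconds: float = 5.0,
    runner: Callable[[List[str]], int] = run_plugin,
) -> bool:
    """Return True only if all `attempts` consecutive checks are CRITICAL."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        if runner([CHECK_ICMP, "-H", host]) != CRITICAL:
            return False  # OK, WARNING or UNKNOWN: do not fail over
        if attempt < attempts - 1:
            time.sleep(pause_seconds)
    return True
```

The `runner` parameter exists so the decision logic can be exercised in a lab with canned exit codes before it is ever pointed at a real primary.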
Extra Considerations 
● STONITH 
● Clustering? 
● DRBD/Shared Storage 
● High Latency HA 
● NDO/Databases 
● Recovery
STONITH 
(shoot the other node in the head) 
● Mechanism by which a failing 
server is guaranteed to be 
removed from the cluster 
● Not required, but advised 
● Hardware (including ups) and 
software (vmware stonith “device” 
and shell scripts) 
● Only failing over when the primary 
is unreachable is safest 
● Beware of overzealous failover 
conditions as they can lead to a . .
Deathmatch! 
No, really. Stonith gives your servers the ability to 
KILL THEMSELVES and FRIENDS 
● Beware of services whose init 
actions/failures should not cause 
failover/stonith 
● Any actions requiring a shared 
volume in active/passive mode 
should not immediately cause 
failover due to potential latency 
during volume mounts 
● Test, test, test the disaster 
scenarios in a LAB first or the 
fragfest may include your job!
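
A conservative failover condition along these lines might be sketched as follows. It is a hypothetical Python policy function, not something Pacemaker or XI provides: the `PrimaryStatus` fields and the 60-second grace period are assumptions. Failed services on a reachable primary never trigger failover on their own, and an unreachable primary must stay unreachable for the whole grace period, which leaves room for slow volume mounts.

```python
from dataclasses import dataclass
from typing import FrozenSet


@dataclass(frozen=True)
class PrimaryStatus:
    """Hypothetical snapshot of what the secondary knows about the primary."""
    reachable: bool
    seconds_unreachable: float = 0.0
    failed_services: FrozenSet[str] = frozenset()


def should_failover(status: PrimaryStatus, grace_seconds: float = 60.0) -> bool:
    """Fail over only when the primary has been unreachable long enough.

    Service failures on a reachable primary are deliberately ignored here;
    they call for an alert and a human, not for STONITH.
    """
    if status.reachable:
        return False
    return status.seconds_unreachable >= grace_seconds
```

With this policy, `PrimaryStatus(reachable=True, failed_services=frozenset({"mysql"}))` does not cause a failover, which is the behaviour you want when a shared database would be just as broken on the other node.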
Clustering/Fencing 
● A number of portions of Nagios Core and Nagios XI are clusterable. 
Processes that can potentially be clustered: 
– offloaded postgresql 
– offloaded mysql/ndo2db 
– offloaded mrtg 
● Services that are dependent on the core monitoring engine and 
filesystem and should not be clustered: 
– nagios, npcd, cronjobs 
– httpd 
– snmptrapd, snmptt
DUAL DRBD Primary 
● Disconnecting from the master before mounting of the shared 
volume during failover is no longer needed. 
● Careful implementation allows multiple servers to 
concurrently access the shared volume. Potentially useful for 
ambitious clusters and shared historical records. 
● Slower, as the “secondary” can lock blocks. 
● More prone to “split-brains” 
● Usually requires clustered file systems
High Latency HA 
● Problematic if the HA solution was not designed for potential 
high latency 
● Will potentially cause i/o wait issues 
● It may be better to push checks to a central server(s) with 
NRDP/outbound checks/etc, keeping HA solutions local, or to 
pay for a faster pipe. 
● DRBD Proxy – A good solution if high latency HA is a must – 
uses an asynchronous buffer for block writes to the secondary 
volumes (does not support dual primary)
NDO Considerations 
● Enforce single ndo instance access to mysql 
● If multiple ndo processes connecting to a single ndo db is required, 
consider using ndo db instances 
● You can control ndo's access to the mysql server through iptables 
and the vip. 
● Offload ndo2db to the offloaded mysql server 
● Configure ndomod to connect through a tcp socket. This 
can potentially decrease load on the nagios server.
Database Considerations 
● Initiating failover due to crashed DBs may cause a deathmatch 
as all nodes will fail (due to their shared nature) 
● Offload both postgresql and mysql databases. Requires a 
virtual ip or careful management of DNS. 
● XI has scripts to repair the databases, use them!
Recovering from Failover 
● Degraded ex-primaries should not be added back to the cluster 
automatically. Doing so may cause split brains. 
● Split brains REQUIRE manual intervention if preservation of historical 
data is desired. 
● Stonith Deathmatches – Have a primary image/instance without 
stonith enabled for recovery 
● Maintain an ultimate disaster recovery server instance/image 
outside of the cluster pool for when all else has failed.
A Plea from Nagios Support 
● Failover/HA != backups 
● Test, test, TEST! Use your lab please. 
● Document. Everything. The biggest hurdles for support are unknown, 
undocumented, non-standard configurations. 
Failover/HA deployments definitely qualify.
Final Comparisons 
● Snapback: Easy. Slow recovery. Requires manual intervention. 
Highest potential historical loss. 
● Failback: Intermediate. Moderate recovery. Can be automated. Less 
historical loss. 
● Failover: Difficult. Fast recovery. Fully automated. Nearly no historical 
loss. 
● High Availability: Difficult. Fast recovery. Automated. Redundancy 
across WAN links. Limited clustering. Least potential downtime. 
Multiple potential issues with split-brain, stonith/deathmatches and 
latency, so care should be given, and scenarios tested.
Food for thought . . . . 
● HA in a federated model . . . . . . . .
Final Questions For You 
● How much of Nagios XI, or Core, can truly be 
set up to be "HA"? Do you care? :P 
● Do you need HA/failover, or will 
failback/snapback suffice? 
● Is the time trade off in your environment 
worth it?
Questions for Me? 
Any questions? 
(common/critical answers noted below for the sake of efficiency) 
● 11 meters/sec (unladen European swallow) 
● 42 
● The Prime Directive 
● 3 Times 
● The Categorical Imperative/Pragmatism (choose 1) 
● No.* 
● Evasive Subjunctive 
● . . . Yes?
The End 
Andy Brist 
abrist@nagios.com

Nagios Conference 2014 - Andy Brist - Nagios XI Failover and HA Solutions

  • 1. Failover and High Availability Solutions for Nagios XI Andy Brist abrist@nagios.com
  • 2. Introduction • Who am I? • Nagios Support Team Manager • Team Lead for Nagios-Plugins (github.com/nagios-plugins
  • 3. Disclaimer • Every environment is different • Failover/HA by nature, is a customized solution • My case studies are not your production environments • I know Nagios/XI, not your SLA • Test in a lab. First.
  • 4. Agenda ● Short overview of the different failback/failover solutions ● Nagios XI Data Locations and other files/services relevant to failover scenarios. ● Snapback ● Failback ● Failover ● HA? Failover ● Observations, Considerations
  • 5. Backup (snapback) ● Restore VM snapshot or spin up a new instance and restore a backup ● Most common implementation ● Easiest of all options ● Most potential downtime of scenarios ● Maximum historical and configuration data lost = the interval between snapshots ● Requires manual intervention
  • 6. Automated XI Backups ● XI provides a method for scheduled backups through the "Scheduled Backups Component" – ssh – ftp – local fs ● Useful for remote backups or manual failback
  • 8. Failback ● Secondary is periodically updated from an XI backup. ● The nagios process is started by hand when the master has an issue. ● Cronjob on the secondary restores newest backup once a day. ● If unconcerned with historical data and mrtg performance data, just push/restore the object configs and sql dumps (if not offloaded) ● Not to be confused with snapback as this is a separate, different instance/image, not just a previous state of the failed instance.
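The daily restore on the secondary (slide 8) first has to pick the newest backup. A minimal sketch of that selection step follows. It assumes the backups land in a single directory as `*.tar.gz` archives; the directory and file pattern are illustrative assumptions, not something the slides specify, and the restore itself is left out.

```python
from pathlib import Path


def newest_backup(backup_dir, pattern="*.tar.gz"):
    """Return the most recently modified backup archive, or None if none exist.

    backup_dir and pattern are assumptions for illustration only.
    """
    candidates = [p for p in Path(backup_dir).glob(pattern) if p.is_file()]
    # max() with default=None avoids an exception on an empty directory.
    return max(candidates, key=lambda p: p.stat().st_mtime, default=None)


def is_stale(backup_path, max_age_seconds, now):
    """True if the backup is older than the allowed data-loss window."""
    return now - backup_path.stat().st_mtime > max_age_seconds
```

A cron job on the secondary could call `newest_backup()` once a day and skip the restore (or alert) when `is_stale()` reports that the newest archive is older than the expected backup interval.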
  • 9. Additional Considerations ● Easy to implement with the “Scheduled Backups” XI component. ● Agents must maintain 2+ allowed hosts ● SNMP traps must be configured to push to 2+ hosts ● May experience substantial downtime if the backup is large and the primary fails during a data restore on the secondary.
  • 10. Failover ● Difficult to get right ● Demanding on i/o resources and network speed ● Very little to no loss of historical data ● Minimal downtime ● Fully automated ● Can provide minimal clustering for XI services through “High Availability”
  • 12. Nagios XI ● Object Configuration ● Check Status ● Object State ● Program State ● Historical State Data ● Performance Data
  • 13. Nagios XI - Services ● nagios – Monitoring engine ● mysql – Object configuration and ndo historical data ● ndo2db – Writes historical data to mysql database ● postgresql – Nagios XI settings/user database ● npcd – Performance data daemon ● crond – Task scheduler ● httpd – Web server
  • 14. XI Data and Redundancy Absolute minimum redundant data required for any failover scenario: ● (Working) Object configuration ● Mysql 'nagiosql' database ● Postgresql 'nagiosxi' database
  • 15. Full Check Redundancy Additional requirements for full check redundancy: ● mrtg config and RRDs (for bandwidth checks) ● nagios libexec folder (plugins) Any additional dependencies for plugins. For example: ● VMWare SDK ● Oracle Perl Library ● Java JRE
  • 16. Runtime State Redundancy Additional requirements for runtime state redundancy: ● retention.dat (state, runtime options, acknowledgments, notification depth) ● NDO mysql database "nagios"
  • 17. Historical Redundancy Additional Data required for complete historical redundancy: ● nagios.log and archives directory ● perfdata RRDs ● mrtg config and RRDs ● NDO mysql database "nagios"
  • 18. XI Data Summary ● Logs/archives ● Perfdata ● Mrtg/configs ● Databases ● Object configs ● Plugins
  • 19. XI Data Summary ● /usr/local/nagios/var/nagios.log ● /usr/local/nagios/var/archives/ ● /usr/local/nagios/share/perfdata/ ● /var/lib/mrtg/ ● /etc/mrtg/ ● /var/lib/pgsql/ ● /var/lib/mysql/ ● /usr/local/nagios/etc/ ● /usr/local/nagios/libexec/ ● /usr/local/nagiosxi/
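The redundancy levels from slides 14 to 17 and the paths from slide 19 can be combined into a small lookup that produces one rsync command per path. This is only a sketch: the paths come from the slide, but the grouping into tiers, the tier names, and the `xi-secondary` host name in the usage note are assumptions made for illustration.

```python
# Paths are from the "XI Data Summary" slide. Which path belongs to which
# redundancy level is an assumption inferred from slides 14-17.
TIERS = {
    "minimum": ["/usr/local/nagios/etc/", "/var/lib/mysql/", "/var/lib/pgsql/"],
    "checks": ["/usr/local/nagios/libexec/", "/etc/mrtg/", "/var/lib/mrtg/"],
    "history": [
        "/usr/local/nagios/var/nagios.log",
        "/usr/local/nagios/var/archives/",
        "/usr/local/nagios/share/perfdata/",
        "/etc/mrtg/",
        "/var/lib/mrtg/",
    ],
    "xi_application": ["/usr/local/nagiosxi/"],
}


def paths_for(*tiers):
    """Return the paths for the requested tiers, in order, without duplicates."""
    seen = []
    for tier in tiers:
        for path in TIERS[tier]:
            if path not in seen:
                seen.append(path)
    return seen


def rsync_commands(paths, secondary_host):
    """Build one 'rsync -a' argv per path, mirroring it to the same location.

    Note: copying live database directories (/var/lib/mysql/, /var/lib/pgsql/)
    at the file level is unsafe while the databases are running. The slides
    mention sql dumps and replicated DBs as alternatives.
    """
    return [["rsync", "-a", path, f"{secondary_host}:{path}"] for path in paths]
```

For example, `rsync_commands(paths_for("checks", "history"), "xi-secondary")` yields the commands for plugins, mrtg data, logs, and perfdata, with the mrtg paths listed only once.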
  • 20. High Availability? 1. Elimination of single points of failure. 2. Reliable crossover/failover. 3. Detection of failures as they occur.
  • 21. High Availability? Why would you need it? ● Least amount of downtime ● (limited) Service clustering ● Shared volumes solve the issues with syncing historical data in redundant configurations
  • 22. High Availability/Failover Major components: ● Shared storage ● Virtual IP ● Management applications/scripts
  • 23. Shared Storage ● DRBD – block level replication, part of the Linux kernel, well supported and understood. Works well for all XI data types (including RRDs/DBs) ● NFS – Fine option, just make sure the NFS share does not have an i/o latency issue or your checks WILL get behind. Do not mount the volume on more than one server at a time, to avoid writing multiple checks in the case of a partial failover. ● Replicated DBs – Fine solution, clusters well. Use DNS or virtual IPs to control access to the databases. ● rsync – Not immediate replication, but close. Easy to implement. ● GlusterFS – More problematic to set up, but good for offloaded mrtg/RRDs
  • 24. DRBD ● Active/passive suggested ● Low latency storage ● Active mount should move with the vip ● Refer to Jeremy Rust's presentation notes for more information
  • 25. Virtual IP ● pacemaker vip script ● Custom ifconfig/ip shell scripts ● uCarp Scripts ● keepalived
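For the "custom ifconfig/ip scripts" option, the core of such a script is adding the virtual IP on the node taking over and removing it on the node standing down. A minimal sketch, assuming the iproute2 `ip` tool is available; the address, prefix length, and interface name in the usage note are placeholders.

```python
import ipaddress


def vip_command(vip, prefix_len, interface, acquire):
    """Build the iproute2 argv that adds (acquire=True) or removes the VIP.

    Raises ValueError if the address or prefix length is not valid.
    """
    address = ipaddress.ip_interface(f"{vip}/{prefix_len}")
    action = "add" if acquire else "del"
    return ["ip", "addr", action, str(address), "dev", interface]
```

`vip_command("192.0.2.10", 24, "eth0", acquire=True)` returns the argv for `ip addr add 192.0.2.10/24 dev eth0`, which would be run as root (for instance with `subprocess.run(cmd, check=True)`). Pacemaker and keepalived handle this step themselves, so a script like this is only relevant for the hand-rolled approach.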
  • 26. HA Failover Management ● Pacemaker/Heartbeat (the HA stack) ● uCarp scripts ● keepalived scripts Custom Scripts: ● nagios itself – Event handler driven ● cron – Job that checks the master for connectivity. Reuse the check_icmp or check_http plugins for this purpose.
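The cron-driven option can be sketched as a probe plus a small counter, so that a single missed check does not trigger failover. The sketch below relies on the standard Nagios plugin exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN); the threshold of three consecutive failures, the choice to count only CRITICAL results, and the primary's address in the usage note are assumptions for illustration.

```python
import subprocess

CRITICAL = 2  # standard Nagios plugin exit code for CRITICAL
UNKNOWN = 3   # standard Nagios plugin exit code for UNKNOWN


def probe(command, timeout=30):
    """Run a Nagios plugin and return its exit code.

    If the plugin cannot be run or times out, report UNKNOWN rather than
    CRITICAL, so a broken probe never counts as a failed primary.
    """
    try:
        return subprocess.run(command, capture_output=True, timeout=timeout).returncode
    except (OSError, subprocess.TimeoutExpired):
        return UNKNOWN


def consecutive_failures(previous_count, exit_code):
    """Increment the counter on CRITICAL; any other result resets it to zero."""
    return previous_count + 1 if exit_code == CRITICAL else 0


def should_fail_over(count, threshold=3):
    """Fail over only after `threshold` consecutive CRITICAL probes."""
    return count >= threshold
```

A cron job on the secondary could run `probe(["/usr/local/nagios/libexec/check_icmp", "-H", "192.0.2.1"])`, persist the counter between runs (for example in a small state file), and start its failover actions only once `should_fail_over()` returns true.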
  • 27. Extra Considerations ● STONITH ● Clustering? ● DRBD/Shared Storage ● High Latency HA ● NDO/Databases ● Recovery
  • 28. STONITH (shoot the other node in the head) ● Mechanism by which a failing server is guaranteed to be removed from the cluster ● Not required, but advised ● Hardware (including UPS) and software (vmware stonith “device” and shell scripts) ● Only failing over when the primary is unreachable is safest ● Beware of overzealous failover conditions as they can lead to a . .
  • 29. Deathmatch! No, really. Stonith gives your servers the ability to KILL THEMSELVES and FRIENDS ● Beware of services whose init actions/failures should not cause failover/stonith ● Any actions requiring a shared volume in active/passive mode should not immediately cause failover due to potential latency during volume mounts ● Test, test, test the disaster scenarios in a LAB first or the fragfest may include your job!
  • 30. Clustering/Fencing ● A number of portions of Nagios Core and Nagios XI are clusterable. Processes that can potentially be clustered: – offloaded postgresql – offloaded mysql/ndo2db – offloaded mrtg ● Services that are dependent on the core monitoring engine and filesystem and should not be clustered: – nagios, npcd, cronjobs – httpd – snmptrapd, snmptt
  • 31. DUAL DRBD Primary ● Disconnecting from the master before mounting of the shared volume during failover is no longer needed. ● Careful implementation allows multiple servers to concurrently access the shared volume. Potentially useful for ambitious clusters and shared historical records. ● Slower, as the “secondary” can lock blocks. ● More prone to “split-brains” ● Usually requires clustered file systems
  • 32. High Latency HA ● Problematic if the HA solution was not designed for potential high latency ● Will potentially cause i/o wait issues ● It may be better to push checks to a central server(s) with NRDP/outbound checks/etc, keeping HA solutions local, or to pay for a faster pipe. ● DRBD Proxy – A good solution if high latency HA is a must – uses an asynchronous buffer for block writes to the secondary volumes (does not support dual primary)
  • 33. NDO Considerations ● Enforce single ndo instance access to mysql ● If multiple ndo processes connecting to a single ndo db is required, consider using ndo db instances ● You can control ndo's access to the mysql server through iptables and the vip. ● Offload ndo2db to the offloaded mysql server ● Configure ndomod to connect through a TCP socket. This can potentially decrease load on the nagios server.
  • 34. Database Considerations ● Initiating failover due to crashed DBs may cause a deathmatch as all nodes will fail (due to their shared nature) ● Offload both postgresql and mysql databases. Requires a virtual ip or careful management of DNS. ● XI has scripts to repair the databases, use them!
  • 35. Recovering from Failover ● Degraded ex-primaries should not be added back to the cluster automatically. Doing so may cause split brains. ● Split brains REQUIRE manual intervention if preservation of historical data is desired. ● Stonith Deathmatches – Have a primary image/instance without stonith enabled for recovery ● Maintain an ultimate disaster recovery server instance/image outside of the cluster pool for when all else has failed.
  • 36. A Plea from Nagios Support ● Failover/HA != backups ● Test, test, TEST! Use your lab please. ● Document. Everything. The biggest barrier and largest hurdle for support are unknown, undocumented, non-standard configurations. Failover/HA deployments definitely qualify.
  • 37. Final Comparisons ● Snapback: Easy. Slow recovery. Requires manual intervention. Highest potential historical loss. ● Failback: Intermediate. Moderate recovery. Can be automated. Less historical loss. ● Failover: Difficult. Fast recovery. Fully automated. Nearly no historical loss. ● High Availability: Difficult. Fast recovery. Automated. Redundancy across WAN links. Limited clustering. Least potential downtime. Multiple potential issues with split-brain, stonith/deathmatches and latency, so care should be given, and scenarios tested.
  • 38. Food for thought . . . . ● HA in a federated model . . . . . . . .
  • 39. Final Questions For You ● How much of Nagios XI, or Core, can truly be set up to be "HA"? Do you care? :P ● Do you need HA/failover, or will failback/snapback suffice? ● Is the time trade off in your environment worth it?
  • 40. Questions for Me? Any questions? (common/critical answers noted below for the sake of efficiency) ● 11 meters/sec (unladen European swallow) ● 42 ● The Prime Directive ● 3 Times ● The Categorical Imperative/Pragmatism (choose 1) ● No.* ● Evasive Subjunctive ● . . . Yes?
  • 41. The End Andy Brist abrist@nagios.com