Time to say goodbye to your Nagios-based setup. Discover the new tools out there for more efficient monitoring. A talk given at OSMC 2014.
https://www.youtube.com/watch?v=_BAWi9Zhmic
2. About me
❖ System admin and architect
❖ Co-founder of « Communauté Francophone de la Supervision Libre »
❖ Writer of the book « Nagios 3 au coeur de la supervision Open Source »
❖ Co-founder of Check my Website, a SaaS service for remote monitoring of websites and applications (current)
3. Content
❖ Why switch off? The good and maybe not-so-good reasons to do so!
❖ Which way to take ?
❖ Building a monitoring solution without Nagios :
❖ Tools available
❖ A personal work in progress
❖ Migrating from Nagios to this kind of solution
4. Some reasons to switch off…
❖ Is the godfather of OSS monitoring dead as an Open Source project?
❖ Can’t do better with it
❖ Cool new kids out there
❖ Better « cloud » support
❖ Clear distinction between monitoring states, metrics and messages
❖ Better charting solution
❖ Near realtime monitoring
❖ Routing, aggregation, correlation…
❖ YOUR reasons ;)
5. Which way to take ?
❖ The « four musketeers »
❖ Naemon
❖ Icinga 2
❖ Shinken
❖ Centreon
❖ Reboot from building blocks
❖ Collect
❖ Store
❖ Visualize
❖ Alert
7. Tools : External collecting
❖ End user perspective
❖ Checks run as close as possible to the end user
❖ Application behavior
❖ Real User Monitoring
❖ Webpagetest
❖ Selenium
❖ PhantomasJS
❖ Boomerang
❖ Bucky
8. Tools : Routing metrics and messages
❖ Messages : Logstash, Flume, Fluentd
❖ Metrics : StatsD
❖ Metrics : Carbon Relay NG
One or more messages can fire an event
9. Tools : Databases
❖ Graphite : The most used.
❖ OpenTSDB : HBase
❖ KairosDB : Cassandra
❖ InfluxDB : The most promising ?
❖ Elasticsearch : Index database
11. Tools : Alerting
❖ Seyren : Alerting dashboard for Graphite.
❖ Cabot : Get alerted when services go down or metrics go crazy
❖ Bosun : An advanced, open-source monitoring and alerting system
❖ Skyline : Real-time anomaly detection system
❖ Oculus : Anomaly correlation component of Etsy's Kale system
❖ Esper : Complex Event Processing
12. The French Monitoring Community Xperience
❖ Reboot from building blocks
❖ Collect
❖ Store
❖ Visualize
❖ Alert
16. Collecting @ network level : Packetbeat
❖ Dedicated agent
❖ Collects traffic for
❖ HTTP
❖ MySQL
❖ PostgreSQL
❖ Redis
17. Routing messages : Logstash
❖ Inputs
❖ Codecs/filters
❖ Outputs
input {
  udp {
    port => 10514
    codec => "json"
    type => "syslog"
  }
}
filter {
  # This replaces the host field with the host that generated the message (sysloghost)
  if [sysloghost] {
    mutate {
      replace => [ "host", "%{sysloghost}" ]
      remove_field => "sysloghost"
    }
  }
}
output {
  elasticsearch { host => localhost }
}
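To exercise a pipeline like the one above, a client only has to send one JSON document per UDP datagram to port 10514. The following is a minimal Python sketch, assuming Logstash listens on localhost. The helper names `build_event` and `send_event` and the sample field values are hypothetical; `sysloghost` is the field the filter copies into `host`.

```python
import json
import socket


def build_event(message, sysloghost):
    """Serialize one event as a JSON datagram for the udp input (codec json)."""
    # Field names other than "sysloghost" are illustrative assumptions.
    event = {"message": message, "sysloghost": sysloghost}
    return json.dumps(event).encode("utf-8")


def send_event(payload, host="127.0.0.1", port=10514):
    """Send one datagram. UDP is fire-and-forget: there is no delivery guarantee."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload, (host, port))
```

Calling `send_event(build_event("disk almost full", "web01"))` would, under these assumptions, produce a document in Elasticsearch whose `host` field is `web01`.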
18. Routing metrics : StatsD
❖ Now a protocol, with implementations in most languages
❖ InfluxDB plugin
❖ Collectd can act as a StatsD daemon (plugin)
❖ Very easy to push metrics
echo "foo:1|c" | nc -u -w0 127.0.0.1 8125
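The same push can be done from application code. Here is a minimal Python sketch of the `name:value|type` line used in the shell example, assuming a StatsD daemon on 127.0.0.1:8125; the function names `format_metric` and `send_metric` are hypothetical and not part of any StatsD client library.

```python
import socket


def format_metric(name, value, metric_type):
    """Build one StatsD line: <name>:<value>|<type>, e.g. foo:1|c."""
    return f"{name}:{value}|{metric_type}"


def send_metric(line, host="127.0.0.1", port=8125):
    """Push one metric line over UDP, like the nc one-liner."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(line.encode("ascii"), (host, port))
```

For example, `send_metric(format_metric("foo", 1, "c"))` increments the counter `foo` by one, and the `ms` type would report a timer instead.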
19. Storing metrics : InfluxDB
❖ Make it behave like Graphite
❖ graphite-api
❖ carbon-relay-ng
❖ graphite-influxdb
❖ Cluster, cluster, cluster
❖ Designed for events and metrics
20. Storing messages : Elasticsearch
❖ Index database
❖ Cluster, cluster, cluster
❖ Full text search
21. Visualizing @ network level : Packetbeat
❖ Modified version of Kibana 3
❖ Dashboards ready out of the box
22. Visualizing metrics : Grafana
❖ Compatible with
❖ Graphite
❖ InfluxDB
❖ OpenTSDB
❖ Built on Kibana 3
24. What's missing ? Wishes
❖ Alerting
❖ External monitoring
❖ Repository for dashboards…
❖ Giving meaning to metrics and messages
25. Alerting reboot
❖ Alert only on end-user problems, from an end-user perspective
❖ IRC, chat channel…
❖ Alert thresholds based on history vs static thresholds
❖ Statistics functions
❖ Boolean conditions
❖ Dynamic thresholds
❖ Anomaly detection
❖ Standard deviation
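As an illustration of a history-based threshold, the sketch below flags a value that strays more than a few standard deviations from the recent mean. It is a simplified assumption of how such a check could look, not the method used by Skyline or any other tool listed in this talk; the function name `is_anomalous` and the default of three deviations are hypothetical.

```python
import statistics


def is_anomalous(history, latest, n_sigmas=3.0):
    """Return True when `latest` is more than n_sigmas sample standard
    deviations away from the mean of `history`."""
    if len(history) < 2:
        return False  # not enough data to estimate a deviation
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)
    if stdev == 0:
        # Perfectly flat history: any change at all is a deviation.
        return latest != mean
    return abs(latest - mean) > n_sigmas * stdev
```

The threshold moves with the data: a noisy metric tolerates larger swings than a normally flat one, which is the point of dynamic over static thresholds.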
26. Coming from Nagios
❖ Graphios injects perfdata into Graphite or InfluxDB
❖ Check_graphite can query the Graphite API from Nagios to alert based on history
❖ Logstash can send events to NSCA
❖ Nagios logs in Kibana with Grok %{NAGIOSLINE}
❖ Keep Nagios for states ?
Three years older since the last time you saw me here!
Still that typical French accent in English.
So these reasons, or your own, have convinced you to switch off. Welcome to the club!
Graph from the dataloop.io blog.
A question I asked myself.
You can continue with the same godfather logic, but with much more community-friendly and powerful software.
Athos, Porthos, Aramis, d’Artagnan
heka, Riemann.io : Metrics and messages
Sensu
I wanted to collect metrics, states and messages on all possible levels.
The standard is the Graphite metric format.
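For reference, the Graphite plaintext format is one sample per line: a dotted metric path, a value, and a Unix timestamp, separated by spaces. Below is a minimal Python sketch of a formatter; the function name `graphite_line` and the sample metric path are hypothetical.

```python
import time


def graphite_line(path, value, timestamp=None):
    """Format one sample as '<metric.path> <value> <unix timestamp>\\n'."""
    if timestamp is None:
        timestamp = int(time.time())  # default to "now", in seconds
    return f"{path} {value} {int(timestamp)}\n"
```

Any collector able to write such lines to a Carbon plaintext listener, or to a relay that speaks the same format, can feed the rest of the stack.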
SQL powered operating system instrumentation and analytics
I wanted to monitor the monitoring-fr.org infrastructure in a new way. This is the classic website stack.
Let’s now see what I wanted to do and the tools that might fit.
Internal & External monitoring
Alert me when it’s not available or degraded for end-users
Push first, poll if needed
Relative monitoring
Better « Cloud » support
Trends made 3 years ago
Data-driven architecture
In fact, I also use a Packetbeat agent.
php5-fpm metrics, nginx metrics
Packetbeat reads and parses network protocols on the fly, correlating the requests with the responses and measuring the response times.
A network daemon that runs on the Node.js platform and listens for statistics, like counters and timers, sent over UDP or TCP and sends aggregates to one or more pluggable backend services (e.g., Graphite).
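To make the "listen, aggregate, flush" idea concrete, here is a toy in-memory aggregator in Python. It is only a sketch of the principle and not how StatsD itself is implemented: the class name `MiniAggregator` and the output key suffixes are assumptions, sample rates are ignored, and only counters (`c`) and timers (`ms`) are handled.

```python
from collections import defaultdict


class MiniAggregator:
    """Toy aggregator for StatsD-style counters ('c') and timers ('ms')."""

    def __init__(self):
        self.counters = defaultdict(float)
        self.timers = defaultdict(list)

    def ingest(self, line):
        """Parse one '<name>:<value>|<type>' line and record it."""
        name, rest = line.strip().split(":", 1)
        # Anything after the type (e.g. a '|@0.1' sample rate) is ignored here.
        value, metric_type = rest.split("|")[:2]
        if metric_type == "c":
            self.counters[name] += float(value)
        elif metric_type == "ms":
            self.timers[name].append(float(value))

    def flush(self):
        """Return the aggregates for the interval and reset the state."""
        result = {f"{name}.count": total for name, total in self.counters.items()}
        for name, samples in self.timers.items():
            result[f"{name}.mean"] = sum(samples) / len(samples)
            result[f"{name}.upper"] = max(samples)
        self.counters.clear()
        self.timers.clear()
        return result
```

A real daemon would call `flush()` on a timer and forward the result to a backend such as Graphite; here the caller decides when the interval ends.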
I wanted to collect metrics, states and events on all possible levels.
The standard is the Graphite metric format.
http://dieter.plaetinck.be/on-graphite-whisper-and-influxdb.html
Having a repository to share dashboards, Logstash configurations…
• Alert thresholds based on history vs static thresholds (or both combined)
• Statistics functions: Min, Percentile, Median, Deviations, Forecasting. You can change the duration these are evaluated over (e.g. 5 minutes, 1 hour, 1 week)
• Scope-aware: How should components in your environment be grouped? By host, subsystem, cluster, or a combination of those
• Boolean conditions: The interaction of multiple components
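A rule combining a statistics function with a boolean condition could be sketched as follows in Python. Everything here is an illustrative assumption: the function names, the nearest-rank percentile method, and the thresholds of 500 ms and 5 % are hypothetical and do not come from any of the tools mentioned.

```python
import math
import statistics


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list (pct between 0 and 100)."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def should_alert(response_times_ms, error_rates, max_p95_ms=500, max_error_rate=0.05):
    """Alert only when two components misbehave together over the window."""
    slow = percentile(response_times_ms, 95) > max_p95_ms
    failing = statistics.median(error_rates) > max_error_rate
    return slow and failing
```

The window is whatever slice of history the caller passes in, so the same rule can be evaluated over 5 minutes or over a week.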
Not convinced you should switch off?
Run both side by side to explore.