Broken into three core sections, this presentation explains why you monitor your software, platforms and products; what you can look for, and how best to get that information out of your code; and finally how Sensu can be used to achieve this in a scalable platform.
7. .WTF/is/monitoring?why
● Client - Are they down, or is it just me?
● CEO - Are we making money?
● Manager - Are we meeting SLAs?
● Engineer - Am I woken up for the right reasons?
● Developer - Did my deploy work?
● Everyone...
○ What’s happening in our environment?
8. .WTF/is/monitoring?why_tomorrow
● Client - Is maintenance going to happen soon?
● CEO - Are we going to keep making money?
● Manager - Can we meet new SLAs?
● Engineer - Why might I get woken up tonight?
● Developer - When do I need to optimise?
● Everyone...
○ What’s going to happen in our environment?
11. .WTF/is/monitoring?principles
Focus on your customers
Use a couple of monitoring systems
De-couple your checks from your code
Remember workflow events
Many simple checks > Fewer clever checks
Don’t wake me up if it can wait
19. .WTF/is/monitoring?practise_clever
● Spike in traffic
● Failure counts above thresholds
● Response sizes are curiously large
● Lots of (valid) API Auth requests
[Diagram: Web Load Balancer in front of Web01 (nginx, php, cron) and Web02 (nginx, php)]
20. .WTF/is/monitoring?what
Your users matter
Know when they’re in pain
Develop a standardised app status page
Conventional checks are used more frequently
Check lots of small things
Scales better and helps to isolate incidents quickly
24. .WTF/is/sensu?why
● Tries to do one thing well: handle events
● Compatible with existing check scripts
● Large, active open-source community
● Scales effectively
25. .WTF/is/sensu?experience
● Replaced Nagios, crons etc
● Raised visibility of monitoring
● Devolved control to development
● 340 (ish) hosts, VMs, switches, firewalls etc
● Managed exclusively through Puppet
● Developed custom plugins and extensions
27. .WTF/is/sensu?how
The Sensu Standalone Check Process:
a. Sensu-Client runs a script with 1 line of output and an exit code
b. Sensu-Client converts the event into JSON and puts it on RabbitMQ
c. Sensu-Server reads the event and sends it to handlers
d. Handlers process the event, performing some action
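Step (a) relies on the Nagios-style convention of a single line of output plus an exit code (0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN). A minimal check-script sketch, where the disk-usage figure and the 80%/90% thresholds are illustrative assumptions rather than anything from the talk:

```ruby
#!/usr/bin/env ruby
# Sketch of a Sensu-compatible check: one line of output plus a
# Nagios-style exit code (0 = OK, 1 = WARNING, 2 = CRITICAL).
# The 80%/90% thresholds are illustrative.

def check_disk_usage(used_pct)
  if used_pct >= 90
    ["CheckDisk CRITICAL: #{used_pct}% used", 2]
  elsif used_pct >= 80
    ["CheckDisk WARNING: #{used_pct}% used", 1]
  else
    ["CheckDisk OK: #{used_pct}% used", 0]
  end
end

message, status = check_disk_usage(85)
puts message # the single line of output sensu-client captures
# A real check would finish with: exit status
```

sensu-client captures both the line and the exit code, so the same script works unchanged under Nagios-compatible tooling.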
29. .WTF/is/sensu?standalone_check
● Describes
  ○ what check to run
  ○ how to handle events
● Runs at a given interval (default 60s)
● sensu-client handles output and emits events over message brokers
● Can include custom configuration which is included in the event sent to handlers

sensu::checks:
  'sensu-server':
    command: 'check-procs.rb -p bin/sensu-server -c 1'
    handlers: ['high', 'pagerduty']
    custom:
      runbook: 'https://wiki.ftr.com/x/4oqq'
      tip: 'Check /var/log/sensu-server.log'
      slack:
        channels:
          - '#craggyisland'
30. .WTF/is/sensu?runbook
URI to page summary of
Impacted services
Troubleshooting
Common problems
How to fix
Who to talk to
References to other information
34. .WTF/is/sensu?handler
● Process events
● Perform some (or no) action
● Typically used to send alerts or emails

sensu::handler:
  slack:
    type: 'pipe'
    command: 'slack.rb'
    config:
      webhook_token: 'SECRET/KEY'
      bot_name: 'sensu'
      channel: '#alerts'
  pagerduty:
    type: 'pipe'
    command: 'pagerduty.rb'
    severities: ['ok', 'critical']
    config:
      api_key: SECRET_TOKEN_HERE
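A pipe handler such as slack.rb receives the whole event as JSON on STDIN when sensu-server forks it. A stripped-down sketch of that shape; the field names follow Sensu's classic event format, but the formatting logic and the sample event are illustrative, not the real plugin:

```ruby
#!/usr/bin/env ruby
# Sketch of a Sensu 'pipe' handler: sensu-server forks the handler and
# pipes the event to it as JSON on STDIN. Field names follow the classic
# Sensu event format; the formatting logic is a placeholder.
require 'json'

def handle(event)
  client = event['client']['name']
  check  = event['check']['name']
  status = event['check']['status'] # 0 = OK, 1 = WARNING, 2 = CRITICAL
  output = event['check']['output']
  "#{client}/#{check} (status #{status}): #{output}"
end

# In production this would be: event = JSON.parse(STDIN.read)
event = JSON.parse('{"client":{"name":"web01"},
                     "check":{"name":"sensu-server","status":2,
                              "output":"CheckProcs CRITICAL: 0 found"}}')
puts handle(event)
```

Because the handler is just a process reading STDIN, it can be written in any language and tested by piping a saved event JSON into it.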
35. .WTF/is/sensu?standalone_metrics
● The same as checks but...
● handlers: ['metrics']
  ○ A special handler for this kind of result
● type: metric
  ○ Tells sensu to always send the output to the handler

sensu::checks:
  cpu-pcnt-usage-metrics:
    command: 'cpu-pcnt-usage-metrics.rb'
    handlers: ['metrics']
    type: metric
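Metric checks conventionally emit Graphite's plaintext format, one "dotted.metric.path value unix_timestamp" triple per line, which is how many lines of output can be aggregated into a single result. A sketch; the host prefix and metric names are illustrative, not the real cpu-pcnt-usage-metrics.rb:

```ruby
#!/usr/bin/env ruby
# Sketch of a metrics-type check: unlike a status check it emits multiple
# lines of output, conventionally in Graphite plaintext format
# ("dotted.metric.path value unix_timestamp"). Names are illustrative.

def cpu_metrics(prefix, samples, timestamp)
  samples.map { |name, value| "#{prefix}.cpu.#{name} #{value} #{timestamp}" }
end

now = 1_400_000_000 # fixed timestamp so the example is deterministic
puts cpu_metrics('web01', { 'user' => 12.5, 'system' => 3.1 }, now)
# Exit code 0: with type: metric the handler receives every result anyway.
```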
38. .WTF/is/sensu?issues
● Uchiwa isn’t perfect
● Sensu-API can crash sometimes
● No maintained history (over 20 events)
● Check dependencies are handled on clients
● Redis for datastore
  ○ Redundancy is a little harder (for me at least)
41. .WTF/whois?q=
self:
  author: 'Toby Jackson <toby.jackson@futurenet.com>'
  role: 'Operations Engineer'
  twitter: '@warmfusion'
  github: 'github.com/warmfusion'
  employer: 'www.futureplc.com/yourfuturejob/'
Any Questions…?
Editor's Notes
Originally going to talk about Sensu, but thought there was more value in sharing some general observations about monitoring
Still going to discuss Sensu, but going to start with WTF is monitoring
Mixture of physical servers and virtual machines, docker containers (prototype)
basic checks of hardware such as switches, routers, firewalls.
Broken up into three core sections
Going to discuss why you monitor
and what you can look for, and how to best get that information out of your code
and finally how Sensu can be used to achieve this in a scalable platform
Hold questions to the end - my timekeeping isn’t great, so try to avoid distracting me
There are quite a few people that (should?) be interested in your environment and they each have their own motivations;
Status pages are a great method of communicating with your more technical clients such as API users or perhaps business-to-business clients
Easy to provide basic feedback to initially, but you may need to consider how you want to communicate with your paying customers
Some further reading at the end might be useful
CEOs and Managers are infrequent users of monitoring, but often ask harder questions about trends or aggregate values - don’t worry about these users right away.
Engineers and Developers are our initial target audience and can be easily pleased
Devs can create monitors for their own needs
Engineers can quickly get grumpy and demand better alerts
But monitoring can provide more information than what’s going on with your systems right now...
Simply ask yourselves: “What impact does this service have if it breaks?”
Customers can mean clients, advertisers, other developers, internal staff
When developing checks for your systems consider how to recognise impact to their workflows
Don’t rely on a single monitoring system - use a blended approach that provides different features and fault tolerance
Pingdom or StatusCake to monitor from outside your site, use Sensu internally,
but perhaps run some crons with simple messaging for dead-man-switches on core components
Your platform probably has some asynchronous actions, maybe with humans involved
Keep an eye on whether the things that should happen are actually happening
Try to provide back pressure on queues to prevent overloading downstream systems
Don’t run straight to PagerDuty to wake someone up when one node of your 8-node cluster fails or if a server is starting to run low on disk
You don’t make friends with your on-call team like that
So where can you start?
Walk before you can run by starting with the big obvious failure modes
My server’s gone offline
The website’s throwing 500 errors
Jon just logged onto the live servers
Think about how your systems have behaved in the past
you’ve hopefully fixed the issues, but what symptoms can you look for in the future
Who can help diagnose different parts of your infrastructure
Some code is simply more fragile than others
That clever broker mesh used to distribute content around the world - it’ll fail
The shared filesystem that you use to lock the cluster - it’ll timeout
That legacy website using obsolete libraries and obscure databases - it’ll crash
It’s only a failure if you put your head in the sand and pretend it’ll all be fine
Watch out for checks eating up CPU creating false positives
Uptime - Do you like to reboot boxes every so often, or perhaps want to know if a machine just restarted
Configuration Management state can be useful - for example, at Future we use Puppet to manage our servers, but we sometimes forget to change environments back after testing or phased deployment. Our monitoring system alerts us after a period of time to bring it back onto production.
Avoid coupling your application stack to your monitoring software by trying to push messages directly
Develop a convention for status pages that can be used by your team for all services
Use json formatting - easy to parse, human friendly, lightweight
HTTP Status codes are great for indicating basic state for simpler upstream systems - eg HAProxy
Keep your checks lightweight - Don’t take your services offline by having a heavy status page
Caching results is acceptable, but it’s probably wise to allow a cache clear or indication of status age in your API
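The conventions in these notes (JSON body, meaningful HTTP status code for simple upstream systems, an indication of status age) could be sketched like this; the component names and the 200/503 mapping are illustrative assumptions, not a prescribed format:

```ruby
#!/usr/bin/env ruby
# Sketch of a conventional app status payload: JSON body for humans and
# richer checks, plus a single HTTP status code for simple upstream
# consumers such as HAProxy. Component names are invented for illustration.
require 'json'

def status_page(components, generated_at)
  healthy = components.values.all? { |state| state == 'ok' }
  body = {
    'status'       => healthy ? 'ok' : 'degraded',
    'components'   => components,
    'generated_at' => generated_at # lets callers spot stale cached results
  }
  [healthy ? 200 : 503, JSON.generate(body)]
end

code, body = status_page({ 'database' => 'ok', 'queue' => 'ok' },
                         '2015-06-01T12:00:00Z')
puts "HTTP #{code}: #{body}"
```

The single status code means HAProxy only needs an HTTP check against the page, while richer monitors can parse the JSON for per-component detail.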
Consider what your cluster
Check processes are running
External checks against balancers ensure your site is available for customers
Internal checks against each server tell you when you have a partial failure
Don’t wake me up.. probably
Easy to implement - could probably do this without anything more complex than cronjobs and config management
But in a scalable system you probably want to know a little more about your environment
Balancers should have at least 50% capacity available
web servers should be not only returning pages, but doing so in a sensible time
cronjobs are running as required
Check with inference in most cases, but ideally use a dead-man’s switch
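A dead-man's switch inverts the usual check: the cron job records a heartbeat on every successful run, and a separate check alerts when that heartbeat goes stale. A sketch; the 2-hour staleness window is an assumption for illustration:

```ruby
#!/usr/bin/env ruby
# Sketch of a dead-man's-switch check: the cron job touches a heartbeat
# (e.g. a file mtime) on every successful run, and this check goes
# CRITICAL when the heartbeat is older than a staleness window.
# The 2-hour window is illustrative.

MAX_AGE_SECONDS = 2 * 60 * 60

def heartbeat_status(last_beat, now)
  age = now - last_beat
  if age > MAX_AGE_SECONDS
    ["CheckHeartbeat CRITICAL: last run #{age.to_i}s ago", 2]
  else
    ["CheckHeartbeat OK: last run #{age.to_i}s ago", 0]
  end
end

now = Time.at(1_400_000_000)
puts heartbeat_status(now - 600, now).first    # recent beat -> OK
puts heartbeat_status(now - 10_800, now).first # 3h of silence -> CRITICAL
```

The advantage over inference is that a dead cron host, a stuck job and a broken scheduler all trigger the same alert: silence.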
But why only focus on the now...
What about trends
harder to measure
starting to become relevant to managers
To highlight the important points
and now for some technology...
Spoken about why
spoken about what you can look for
and now for a little bit about how you can use sensu to achieve this
Installation process is straightforward, but involves a few steps
RabbitMQ or Redis brokers need to be setup and configured as message transport
Redis is needed as a key-value store
Sensu itself is simple to install on clients - a ruby or jvm daemon
Often excellent documentation, though sometimes features are better described in older versions for some reason
IRC community very supportive - I can often be found there helping to answer questions, or asking some of my own
Events are created in one of three ways:
Checks running at a defined interval managed by the client
Checks run on demand from the master controller
Passively accepting events in json format from arbitrary sources
Compatible with the de facto standards introduced by Nagios - exit codes and single-line message output
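The third route, passively accepting JSON events, goes through the classic sensu-client's local input socket (127.0.0.1:3030 by default) with at least the Nagios-style name/output/status fields. A sketch, guarded so it degrades gracefully when no client is listening; the check name is invented:

```ruby
#!/usr/bin/env ruby
# Sketch of submitting a passive event to the classic sensu-client input
# socket (127.0.0.1:3030 by default). The payload is a JSON blob with the
# Nagios-style name/output/status fields; the check name is illustrative.
require 'json'
require 'socket'

def passive_event(name, output, status)
  JSON.generate('name' => name, 'output' => output, 'status' => status)
end

payload = passive_event('deploy-finished', 'release 1.2.3 deployed OK', 0)
puts payload

begin
  TCPSocket.open('127.0.0.1', 3030) { |sock| sock.puts(payload) }
rescue StandardError => e
  puts "No sensu-client listening locally (#{e.class}); payload not sent"
end
```

This is what lets deploy scripts or one-off jobs raise events without being scheduled checks themselves.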
Can scale by introducing additional sensu-masters to your environment
Allows you to devolve control of monitoring checks to your development team
The previous monitoring system at Future PLC used a very clever auto-configuring Nagios system
Checks created on the fly based on our metrics feeds
No-one really understood how it worked and even fewer wanted to
A replacement was needed…
We looked at a few options, but Sensu stood out as it shared some of our existing technologies (Ruby, RabbitMQ, Nagios checks)
aligned with our goals of moving towards a hybrid scalable platform of potentially short lived services.
We could make the technology scale - that’s comparatively easy - throw in a few web balancers, broker meshes, virtual IPs and you’re pretty much done.
More difficult is scaling the creation of new checks, and management of existing ones.
At Future, anyone can create new check scripts and add them into our monitoring system without assistance using Puppet.
They can even raise alerts to out-of-hours support from their own configuration - but I’ve recommended having a conversation before that particular power is used.
We’ve created and adapted existing checks, and worked on some bespoke plugins and extensions to solve some specific use cases we have.
Working on an SNMPTrap extension - find it in my GitHub repository.
Sensu runs a client agent on your nodes (or in a side-car container)
Described as json configuration files
Shown here in yaml - taken from our puppet config
Assume the brokers have sent the message, and the sensu-servers are ready to consume those events.
The sensu-server process will read the event, and if the status is deemed to require handling, the event is passed to a handler script.
Typically use ‘pipe’ type handlers, where events are piped to forked processes via STDIN
Can also use TCP, UDP or transport types.
handlers can define some configuration, but as shown in our example check, some of those values can be extended or modified by the check itself
Allows for flexibility as you can allow developers to define the chat rooms to notify, or the nuances of your handler script from their clients without having to modify servers
but another form of event includes a ‘metrics’ event
Sensu can also be used to collect quantitative data from your platform, submitting those events as ‘metrics’ types
The only real difference between metrics and normal check types is that the handlers are invoked for every check result, and the output of the event is always sent to the handler
The output of a metrics type can also include multiple lines of output - all of which are sent to the handlers. This lets you aggregate lots of key/value pairs into a single response
Uchiwa has some quirks
Doesn’t like being load-balanced - lists change ordering depending on which sensu-master you hit
Sensu-API is a little odd sometimes
When Sensu-Server restarts, the API sometimes doesn’t realise and needs a kicking itself - only affects Uchiwa really; monitoring itself is unaffected
No real history of events
Sensu is a ‘router’ of events; if you want reports of the last 7 days’ uptime, you need to look towards another tool
Perhaps an ELK stack (it’s what we do) collecting sensu events
Dependencies between services are checked when a result is generated, not when it is handled - so a backend server being in maintenance may still trigger alerts from the haproxy cluster, because setting up dependencies isn’t simple.
Basically easy to live with, but for the enthusiastic, look at my reading list for programmatic alert correlation
Alert correlation appears to be a complicated process of grouping related events and alerts from the alert stream into one or more incidents.
This presentation describes the concepts more fully, and gives some suggestions on how to start correlating events
Effective incident communication is from one of the founders of StatusPage.io and gives some really interesting pointers on how to properly communicate service events with your customers, or even internal staff.
Hiring:
2 Junior Ops - Devs looking to get closer to infrastructure and architecture decisions and management
PHP Devs - Working on our front and backend systems delivering content to millions of daily users
Other roles - check the link here