Broken into three core sections, this presentation explains why you monitor your software, platforms and products; what you can look for, and how best to get that information out of your code; and finally how Sensu can be used to achieve this in a scalable platform.
7. .WTF/is/monitoring?why
● Client - Are they down, or is it just me?
● CEO - Are we making money?
● Manager - Are we meeting SLAs?
● Engineer - Am I woken up for the right reasons?
● Developer - Did my deploy work?
● Everyone...
○ What’s happening in our environment?
8. .WTF/is/monitoring?why_tomorrow
● Client - Is maintenance going to happen soon?
● CEO - Are we going to keep making money?
● Manager - Can we meet new SLAs?
● Engineer - Why might I get woken up tonight?
● Developer - When do I need to optimise?
● Everyone...
○ What’s going to happen in our environment?
11. .WTF/is/monitoring?principles
Focus on your customers
Use a couple of monitoring systems
De-couple your checks from your code
Remember workflow events
Many simple checks > Fewer clever checks
Don’t wake me up if it can wait
19. .WTF/is/monitoring?practise_clever
● Spike in traffic
● Failure counts above thresholds
● Response sizes are curiously large
● Lots of (valid) API Auth requests
[Diagram: Web Load Balancer in front of Web01 (nginx, php, cron) and Web02 (nginx, php)]
20. .WTF/is/monitoring?what
Your users matter
Know when they’re in pain
Develop a standardised app status page
Conventional checks are used more frequently
Check lots of small things
Scales better and helps to isolate incidents quickly
24. .WTF/is/sensu?why
● Tries to do one thing well: handle events
● Compatible with existing check scripts
● Large, active open-source community
● Scales effectively
25. .WTF/is/sensu?experience
● Replaced Nagios, crons etc
● Raised visibility of monitoring
● Devolved control to development
● 340 (ish) hosts, VMs, switches, firewalls etc
● Managed exclusively through Puppet
● Developed custom plugins and extensions
27. .WTF/is/sensu?how
The Sensu Standalone Check Process:
a. Sensu-Client runs a script with 1 line of output and an exit code
b. Sensu-Client converts the event into JSON and puts it on RabbitMQ
c. Sensu-Server reads the event and sends it to handlers
d. Handlers process the event, performing some action
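Step (a) relies on the Nagios-style convention of a single line of output plus an exit code (0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN). A minimal check-script sketch, where the disk-usage figure and the 80%/90% thresholds are illustrative assumptions rather than anything from the talk:

```ruby
#!/usr/bin/env ruby
# Sketch of a Sensu-compatible check: one line of output plus a
# Nagios-style exit code (0 = OK, 1 = WARNING, 2 = CRITICAL).
# The 80%/90% thresholds are illustrative.

def check_disk_usage(used_pct)
  if used_pct >= 90
    ["CheckDisk CRITICAL: #{used_pct}% used", 2]
  elsif used_pct >= 80
    ["CheckDisk WARNING: #{used_pct}% used", 1]
  else
    ["CheckDisk OK: #{used_pct}% used", 0]
  end
end

message, status = check_disk_usage(85)
puts message # the single line of output sensu-client captures
# A real check would finish with: exit status
```

sensu-client captures both the line and the exit code, so the same script works unchanged under Nagios-compatible tooling.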
29. .WTF/is/sensu?standalone_check
● Describes
  ○ what check to run
  ○ how to handle events
● Runs at a given interval (default 60s)
● sensu-client handles output and emits events over message brokers
● Can include custom configuration which is included in the event sent to handlers

sensu::checks:
  'sensu-server':
    command: 'check-procs.rb -p bin/sensu-server -c 1'
    handlers: ['high', 'pagerduty']
    custom:
      runbook: 'https://wiki.ftr.com/x/4oqq'
      tip: 'Check /var/log/sensu-server.log'
      slack:
        channels:
          - '#craggyisland'
30. .WTF/is/sensu?runbook
URI to page summary of
Impacted services
Troubleshooting
Common problems
How to fix
Who to talk to
References to other information
34. .WTF/is/sensu?handler
● Process events
● Perform some (or no) action
● Typically used to send alerts or emails

sensu::handler:
  slack:
    type: 'pipe'
    command: 'slack.rb'
    config:
      webhook_token: 'SECRET/KEY'
      bot_name: 'sensu'
      channel: '#alerts'
  pagerduty:
    type: 'pipe'
    command: 'pagerduty.rb'
    severities: ['ok', 'critical']
    config:
      api_key: SECRET_TOKEN_HERE
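A pipe handler such as slack.rb receives the whole event as JSON on STDIN when sensu-server forks it. A stripped-down sketch of that shape; the field names follow Sensu's classic event format, but the formatting logic and the sample event are illustrative, not the real plugin:

```ruby
#!/usr/bin/env ruby
# Sketch of a Sensu 'pipe' handler: sensu-server forks the handler and
# pipes the event to it as JSON on STDIN. Field names follow the classic
# Sensu event format; the formatting logic is a placeholder.
require 'json'

def handle(event)
  client = event['client']['name']
  check  = event['check']['name']
  status = event['check']['status'] # 0 = OK, 1 = WARNING, 2 = CRITICAL
  output = event['check']['output']
  "#{client}/#{check} (status #{status}): #{output}"
end

# In production this would be: event = JSON.parse(STDIN.read)
event = JSON.parse('{"client":{"name":"web01"},
                     "check":{"name":"sensu-server","status":2,
                              "output":"CheckProcs CRITICAL: 0 found"}}')
puts handle(event)
```

Because the handler is just a process reading STDIN, it can be written in any language and tested by piping a saved event JSON into it.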
35. .WTF/is/sensu?standalone_metrics
● The same as checks but...
● handlers: ['metrics']
  ○ A special handler for this kind of result
● type: metric
  ○ Tells sensu to always send the output to the handler

sensu::checks:
  cpu-pcnt-usage-metrics:
    command: 'cpu-pcnt-usage-metrics.rb'
    handlers: ['metrics']
    type: metric
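Metric checks conventionally emit Graphite's plaintext format, one "dotted.metric.path value unix_timestamp" triple per line, which is how many lines of output can be aggregated into a single result. A sketch; the host prefix and metric names are illustrative, not the real cpu-pcnt-usage-metrics.rb:

```ruby
#!/usr/bin/env ruby
# Sketch of a metrics-type check: unlike a status check it emits multiple
# lines of output, conventionally in Graphite plaintext format
# ("dotted.metric.path value unix_timestamp"). Names are illustrative.

def cpu_metrics(prefix, samples, timestamp)
  samples.map { |name, value| "#{prefix}.cpu.#{name} #{value} #{timestamp}" }
end

now = 1_400_000_000 # fixed timestamp so the example is deterministic
puts cpu_metrics('web01', { 'user' => 12.5, 'system' => 3.1 }, now)
# Exit code 0: with type: metric the handler receives every result anyway.
```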
38. .WTF/is/sensu?issues
● Uchiwa isn’t perfect
● Sensu-API can crash sometimes
● No maintained history (over 20 events)
● Check dependencies are handled on clients
● Redis for datastore
  ○ Redundancy is a little harder (for me at least)
41. .WTF/whois?q=
self:
  author: 'Toby Jackson <toby.jackson@futurenet.com>'
  role: 'Operations Engineer'
  twitter: '@warmfusion'
  github: 'github.com/warmfusion'
  employer: 'www.futureplc.com/yourfuturejob/'
Any Questions…?
Editor's Notes
Originally going to talk about Sensu, but thought there was more value in sharing some general observations about monitoring
Still going to discuss Sensu, but going to start with WTF is monitoring
Mixture of physical servers and virtual machines, docker containers (prototype)
basic checks of hardware such as switches, routers, firewalls.
Broken up into three core sections
Going to discuss why you monitor
and what you can look for, and how to best get that information out of your code
and finally how Sensu can be used to achieve this in a scalable platform
Hold questions to the end - my timekeeping isn’t great, so try to avoid distracting me
There are quite a few people that (should?) be interested in your environment and they each have their own motivations;
Status pages are a great method of communicating with your more technical clients such as API users or perhaps business-to-business clients
Easy to provide basic feedback to initially, but you may need to consider how you want to communicate with your paying customers
Some further reading at the end might be useful
CEOs and Managers are infrequent users of monitoring, but often ask harder questions about trends or aggregate values - don’t worry about these users right away.
Engineers and Developers are our initial target audience and can be easily pleased
Devs can create monitors for their own needs
Engineers can quickly get grumpy and demand better alerts
But monitoring can provide more information than what’s going on with your systems right now...
Simply ask yourselves: “What impact does this service have if it breaks?”
Customers can mean clients, advertisers, other developers, internal staff
When developing checks for your systems consider how to recognise impact to their workflows
Don’t rely on a single monitoring system - use a blended approach that provides different features and fault tolerance
Pingdom or StatusCake to monitor from outside your site, use Sensu internally,
but perhaps run some crons with simple messaging for dead-man-switches on core components
Your platform probably has some asynchronous actions, maybe with humans involved
Keep an eye on whether the things that should happen are actually happening
Try to provide back pressure on queues to prevent overloading downstream systems
Don’t run straight to PagerDuty to wake someone up when one node of your 8-node cluster fails or if a server is starting to run low on disk
You don’t make friends with your on-call team like that
So where can you start?
Walk before you can run by starting with the big obvious failure modes
My server’s gone offline
The website’s throwing 500 errors
Jon just logged onto the live servers
Think about how your systems have behaved in the past
you’ve hopefully fixed the issues, but what symptoms can you look for in the future
Who can help diagnose different parts of your infrastructure
Some code is simply more fragile than others
That clever broker mesh used to distribute content around the world - it’ll fail
The shared filesystem that you use to lock the cluster - it’ll timeout
That legacy website using obsolete libraries and obscure databases - it’ll crash
It’s only a failure if you put your head in the sand and pretend it’ll all be fine
Watch out for checks eating up CPU creating false positives
Uptime - Do you like to reboot boxes every so often, or perhaps want to know if a machine just restarted
Configuration Management state can be useful - for example, at Future we use Puppet to manage our servers, but we sometimes forget to change environments back after testing or phased deployment. Our monitoring system alerts us after a period of time to bring it back onto production.
Avoid coupling your application stack to your monitoring software by trying to push messages directly
Develop a convention for status pages that can be used by your team for all services
Use json formatting - easy to parse, human friendly, lightweight
HTTP Status codes are great for indicating basic state for simpler upstream systems - eg HAProxy
Keep your checks lightweight - Don’t take your services offline by having a heavy status page
Caching results is acceptable, but it’s probably wise to allow a cache clear or indication of status age in your API
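The conventions in these notes (JSON body, meaningful HTTP status code for simple upstream systems, an indication of status age) could be sketched like this; the component names and the 200/503 mapping are illustrative assumptions, not a prescribed format:

```ruby
#!/usr/bin/env ruby
# Sketch of a conventional app status payload: JSON body for humans and
# richer checks, plus a single HTTP status code for simple upstream
# consumers such as HAProxy. Component names are invented for illustration.
require 'json'

def status_page(components, generated_at)
  healthy = components.values.all? { |state| state == 'ok' }
  body = {
    'status'       => healthy ? 'ok' : 'degraded',
    'components'   => components,
    'generated_at' => generated_at # lets callers spot stale cached results
  }
  [healthy ? 200 : 503, JSON.generate(body)]
end

code, body = status_page({ 'database' => 'ok', 'queue' => 'ok' },
                         '2015-06-01T12:00:00Z')
puts "HTTP #{code}: #{body}"
```

The single status code means HAProxy only needs an HTTP check against the page, while richer monitors can parse the JSON for per-component detail.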
Consider what your cluster
Check processes are running
External checks against balancers ensure your site is available for customers
Internal checks against each server tell you when you have a partial failure
Don’t wake me up.. probably
Easy to implement - could probably do this without anything more complex than cronjobs and config management
But in a scalable system you probably want to know a little more about your environment
Balancers should have at least 50% capacity available
web servers should be not only returning pages, but doing so in a sensible time
cronjobs are running as required
Check with inference in most cases, but ideally use a dead-man’s switch
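A dead-man's switch inverts the usual check: the cron job records a heartbeat on every successful run, and a separate check alerts when that heartbeat goes stale. A sketch; the 2-hour staleness window is an assumption for illustration:

```ruby
#!/usr/bin/env ruby
# Sketch of a dead-man's-switch check: the cron job touches a heartbeat
# (e.g. a file mtime) on every successful run, and this check goes
# CRITICAL when the heartbeat is older than a staleness window.
# The 2-hour window is illustrative.

MAX_AGE_SECONDS = 2 * 60 * 60

def heartbeat_status(last_beat, now)
  age = now - last_beat
  if age > MAX_AGE_SECONDS
    ["CheckHeartbeat CRITICAL: last run #{age.to_i}s ago", 2]
  else
    ["CheckHeartbeat OK: last run #{age.to_i}s ago", 0]
  end
end

now = Time.at(1_400_000_000)
puts heartbeat_status(now - 600, now).first    # recent beat -> OK
puts heartbeat_status(now - 10_800, now).first # 3h of silence -> CRITICAL
```

The advantage over inference is that a dead cron host, a stuck job and a broken scheduler all trigger the same alert: silence.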
But why only focus on the now...
What about trends
harder to measure
starting to become relevant to managers
To highlight the important points
and now for some technology...
Spoken about why
spoken about what you can look for
and now for a little bit about how you can use sensu to achieve this
Installation process is straightforward, but involves a few steps
RabbitMQ or Redis brokers need to be setup and configured as message transport
Redis is needed as a key-value store
Sensu itself is simple to install on clients - a ruby or jvm daemon
Often excellent documentation, though sometimes features are better described in older versions for some reason
IRC community very supportive - I can often be found there helping to answer questions, or asking some of my own
Events are created in one of three ways:
Checks running at a defined interval managed by the client
Checks run on demand from the master controller
Passively accepting events in json format from arbitrary sources
Compatible with the de facto standards introduced by Nagios - exit codes and single-line message output
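The third route, passively accepting JSON events, goes through the classic sensu-client's local input socket (127.0.0.1:3030 by default) with at least the Nagios-style name/output/status fields. A sketch, guarded so it degrades gracefully when no client is listening; the check name is invented:

```ruby
#!/usr/bin/env ruby
# Sketch of submitting a passive event to the classic sensu-client input
# socket (127.0.0.1:3030 by default). The payload is a JSON blob with the
# Nagios-style name/output/status fields; the check name is illustrative.
require 'json'
require 'socket'

def passive_event(name, output, status)
  JSON.generate('name' => name, 'output' => output, 'status' => status)
end

payload = passive_event('deploy-finished', 'release 1.2.3 deployed OK', 0)
puts payload

begin
  TCPSocket.open('127.0.0.1', 3030) { |sock| sock.puts(payload) }
rescue StandardError => e
  puts "No sensu-client listening locally (#{e.class}); payload not sent"
end
```

This is what lets deploy scripts or one-off jobs raise events without being scheduled checks themselves.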
Can scale by introducing additional sensu-masters to your environment
Allows you to devolve control of monitoring checks to your development team
The previous monitoring system at Future PLC used a very clever auto-configuring Nagios system
Checks created on the fly based on our metrics feeds
No-one really understood how it worked and even fewer wanted to
A replacement was needed…
We looked at a few options, but Sensu stood out as it shared some of our existing technologies (Ruby, RabbitMQ, Nagios checks)
aligned with our goals of moving towards a hybrid scalable platform of potentially short lived services.
We could make the technology scale - that’s comparatively easy - throw in a few web balancers, broker meshes, virtual IPs and you’re pretty much done.
More difficult is scaling the creation of new checks, and management of existing ones.
At Future, anyone can create new check scripts and add them into our monitoring system without assistance using Puppet.
They can even raise alerts to out-of-hours support from their own configuration - but I’ve recommended having a conversation before that particular power is used.
We’ve created and adapted existing checks, and worked on some bespoke plugins and extensions to solve some specific use cases we have.
Working on an SNMPTrap extension - find it in my GitHub repository.
Sensu runs a client agent on your nodes (or in a side-car container)
Described as json configuration files
Shown here in yaml - taken from our puppet config
Assume the brokers have sent the message, and the sensu-servers are ready to consume those events.
The sensu-server process will read the event, and if the status is deemed to require handling, the event is passed to a handler script.
Typically use ‘pipe’ type handlers, where events are piped to forked processes via STDIN
Can also use TCP, UDP or transport types.
handlers can define some configuration, but as shown in our example check, some of those values can be extended or modified by the check itself
Allows for flexibility as you can allow developers to define the chat rooms to notify, or the nuances of your handler script from their clients without having to modify servers
but another form of event includes a ‘metrics’ event
Sensu can also be used to collect quantitative data from your platform, submitting those events as ‘metrics’ types
The only real difference between metrics and normal check types is that the handlers are invoked for every check result, and the output of the event is always sent to the handler
The output of a metrics type can also include multiple lines of output - all of which are sent to the handlers. This lets you aggregate lots of key/value pairs into a single response
Uchiwa has some quirks
Doesn’t like being load-balanced - lists change ordering depending on which sensu-master you hit
Sensu-API is a little odd sometimes
When Sensu-Server restarts, the API sometimes doesn’t realise and needs a kicking itself - only affects Uchiwa really; monitoring itself is unaffected
No real history of events
Sensu is a ‘router’ of events; if you want reports of the last 7 days’ uptime, you need to look towards another tool
Perhaps an ELK stack (it’s what we do) collecting sensu events
Dependencies between services are checked when a result is generated, not when it is handled - so a backend server being in maintenance may still trigger alerts from the haproxy cluster, because setting up dependencies isn’t simple.
Basically easy to live with, but for the enthusiastic, look at my reading list for programmatic alert correlation
Alert correlation appears to be a complicated process of grouping related events and alerts from the alert stream into one or more incidents.
This presentation describes the concepts more fully, and gives some suggestions on how to start correlating events
Effective incident communication is from one of the founders of StatusPage.io and gives some really interesting pointers on how to properly communicate service events with your customers, or even internal staff.
Hiring:
2 Junior Ops - Devs looking to get closer to infrastructure and architecture decisions and management
PHP Devs - Working on our front and backend systems delivering content to millions of daily users
Other roles - check the link here