Sensu @ Yelp - A Guided Tour
Kyle Anderson
https://github.com/solarkennedy
Disclaimer
I’m just a dude.
I know that when I watch a presentation by a company that I
recognize, I think to myself, “Hmm, $company, I’ve heard of them.
They probably have their stuff together. Let’s see what they do…”
I’m here to describe, not persuade. I may not have everything
together. Just because I have things with “Unit Tests”, doesn’t mean
I’m “Right”.
Especially with a “framework” like Sensu, there can be more than one
way to do things. The trick is figuring out what works for you. I hope
by giving a real concrete example, you might be inspired to step up
your monitoring game?
Outline
1. Overall Architecture
2. Sensu Server Setup
a. Custom Base Handler
3. Client Configuration
a. Sensu Check Puppet Wrapper
4. Yelp SOA Checks
5. AWS/Cloudwatch Checks
6. Dealing with Ephemeral EC2 Servers
7. Cron Job Monitoring
8. Future Work
Overall Architecture
● profile::sensu_client
○ Sensu clients connect to RabbitMQ on one of
the servers (DNS Round Robin)
● profile::sensu_server
○ Base HAProxy install
○ RabbitMQ in Mirror Mode, load balanced via
HAProxy
○ Redis in Master/slave mode, load balanced via
HAProxy. (only master passes healthcheck)
○ Sensu Server installed, subscribes on RabbitMQ
○ API Load balanced via HAProxy
○ Dashboard Load balanced by HAProxy
Logical Diagram
Puppet Modules in Use
puppetlabs/rabbitmq
puppetlabs/haproxy
kyleanderson/redis_sentinel
arioch/redis
sensu/sensu
Addressing Complexity
“Sensu has so many moving parts that I
wouldn’t be able to sleep at night unless I
set up a Nagios instance to make sure they
were all running.”
Laurie Denness
https://laur.ie/blog/2014/02/why-ill-be-letting-nagios-live-on-a-bit-longer-thank-you-very-much/
Addressing Complexity
“I will be honest; I haven’t used Sensu, because I’m in
a happy place right now, but just the architectural
diagram of how it works scares the shit out of me.
When you need 7 arrow colours to describe where
data is going in a monitoring system, I’m starting to
fear it slightly. But hey, if it works, good on you guys.
It just looks a lot like this. Nothing wrong with that, if
you can make it stable and reliable.”
Laurie Denness
https://laur.ie/blog/2014/02/why-ill-be-letting-nagios-live-on-a-bit-longer-thank-you-very-much/
First Principle: Single Point of Truth
Pop Quiz: Determine what Servers are
Puppetmasters?
• A: Puppet manifests (include puppetmaster)
• B: DNS (puppet.local A 10.5.x.x)
• C: update-live script (for Server in ….)
• D: The servers that have had the puppetmaster bootstrap script run on them
• E: What MCollective says (mco find -C puppetmaster)
Answer: All / None of the above!
Sensu Server Detection
# Use DNS to detect if this server is a sensu server
$local_sensu_server_array = gethostbyname2array("sensu.local-${::habitat}.yelpcorp.com")
$ip_address_array = split($::all_ipaddresses, ',')
validate_array($local_sensu_server_array)
validate_array($ip_address_array)
$array_intersection = intersection($ip_address_array, $local_sensu_server_array)
# If our ipaddresses are in the dns entries, we must be a sensu server!
if size($array_intersection) > 0 {
  $is_sensu_server = true
} else {
  $is_sensu_server = false
}
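For readers who do not use Puppet, the same idea can be sketched in Ruby: resolve the Sensu DNS name, then see whether any local address is in the answer. The helper names and the sample addresses below are made up for illustration; only `Resolv.getaddresses` is a real standard-library call.

```ruby
require 'resolv'

# Hypothetical helper: true when any local address appears in the DNS answer.
def sensu_server?(local_addresses, dns_addresses)
  !(local_addresses & dns_addresses).empty?
end

# Hypothetical lookup wrapper. Resolv.getaddresses returns an empty array
# when the name does not resolve, so an unknown name means "not a server".
def resolve_sensu_servers(hostname)
  Resolv.getaddresses(hostname)
end

# Made-up addresses standing in for the DNS answer and the local interfaces.
dns_answer = ['10.5.0.11', '10.5.0.12']
puts sensu_server?(['127.0.0.1', '10.5.0.12'], dns_answer)
puts sensu_server?(['127.0.0.1', '10.5.0.99'], dns_answer)
```

The point is that DNS stays the single source of truth: nothing else has to be told which hosts are Sensu servers.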
HAProxy
• Every server in the sensu cluster runs its own HAProxy
• HAProxy listens on the “standard” ports, individual
instances listen on standard + 1
• Having an array of sensu servers from DNS allows us to
grow the backends
• If HAProxy dies, clients will re-resolve, and reconnect.
RabbitMQ
• Every server in the sensu cluster runs a rabbitmq
server in mirror mode (with autoheal for AP)
• Lots of individual clusters, not doing shoveling.
• Client authentication via SSL client certs (controlled by
puppet)
• Load balanced by haproxy
• Sensu-clients automatically reconnect on failure
Redis
• Redis is the persistent store used by Sensu to keep
track of heartbeats, what alerts are silenced, how
many times a check has failed, etc
• Redis is set up in a cluster mode, with redis-sentinel
doing automatic master/slave promotion. (Kinda CP)
• We use the redis-role haproxy master pattern
suggestion from http://failshell.io/sensu/high-availability-sensu/
Sensu API + Dashboard
• sensu-api provides a REST API with JSON output for
integration.
• sensu-cli is provided for easy command line interactive
use
• Both the API and Dashboard use basic auth internally
(shared secret), and then LDAP+SSL auth externally.
• sensu-dashboard uses this api, and is behind our
external facing apache for authentication.
Sensu Servers:
• Automatically does master election, good. Build for 3.
• Connects to RabbitMQ, pulls events off and acts on
them
• Runs “handlers” on the event data
• That’s kinda it
• Which leads to handlers….
Sensu Timing Tunables Before/After
Custom check definition key-values
Custom key-values can be added to a check definition, which
will be included in event data, enabling handler creativity.
Common custom check definitions:
• interval: How frequently (in seconds) the check will be
executed
• occurrences: Number of event occurrences before the
handler should take action
• refresh: Number of seconds handlers should wait before
taking second action. Relies on sensu-plugin.
Yelp Monitoring Check Definition Key
Values
The custom base handler interprets these values:
• check_every = '5m',
• alert_after = '0s',
• realert_every = '1',
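The slides do not show how values like '5m' become seconds, so here is one possible sketch. The function name, the supported suffixes (s, m, h, d) and the rule that a bare number means seconds are all assumptions, not Yelp's implementation.

```ruby
# Hypothetical parser for human-readable durations such as '5m' or '0s'.
UNIT_SECONDS = { 's' => 1, 'm' => 60, 'h' => 3600, 'd' => 86_400 }.freeze

def human_time_to_seconds(value)
  match = /\A(\d+)([smhd])?\z/.match(value.to_s.strip)
  raise ArgumentError, "unrecognised time value: #{value.inspect}" unless match

  # A missing suffix is treated as seconds (an assumption of this sketch).
  match[1].to_i * UNIT_SECONDS.fetch(match[2] || 's')
end

%w[5m 0s 4h].each do |value|
  puts "#{value} is #{human_time_to_seconds(value)} seconds"
end
```

Rejecting anything the pattern does not match keeps a typo like '5 mins' from silently turning into a zero-second interval.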
Custom Base Handler
def filter_repeated
  interval      = @event['check']['interval'] || 0
  alert_after   = @event['check']['alert_after'] || 0
  realert_every = @event['check']['realert_every'] || 1
  # How long the check has been failing, in seconds.
  failing_for = @event['occurrences'].to_i * @event['check']['interval'].to_i
  if failing_for < alert_after
    bail "Only failing for #{failing_for}, less than #{alert_after}. Not performing any action yet."
  elsif interval > 0 and @event['action'] == 'create'
    # Occurrences used up by the alert_after grace period.
    initial_failing_occurrences = alert_after.fdiv(interval).to_i
    number_of_failed_attempts = @event['occurrences'] - initial_failing_occurrences
    unless number_of_failed_attempts == 0 || number_of_failed_attempts % realert_every == 0
      bail 'only handling every ' + realert_every.to_s + ' occurrences'
    end
  end
end
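To make the arithmetic easier to follow, the same decision can be written as a standalone function and run over a sequence of occurrences. This is only a sketch: `should_alert?` is a hypothetical name, all values are assumed to be integers in seconds, and the check on the event action is left out.

```ruby
# Hypothetical restatement of the filter: true when a handler should act.
def should_alert?(occurrences:, interval:, alert_after:, realert_every:)
  failing_for = occurrences * interval
  return false if failing_for < alert_after  # still inside the grace period
  return true unless interval > 0            # no interval, nothing to throttle

  initial_failing_occurrences = alert_after.fdiv(interval).to_i
  attempts = occurrences - initial_failing_occurrences
  attempts.zero? || (attempts % realert_every).zero?
end

# A check that runs every 60s, alerts after 10 minutes, re-alerts every 5th run.
(8..16).each do |n|
  decision = should_alert?(occurrences: n, interval: 60,
                           alert_after: 600, realert_every: 5)
  puts "occurrence #{n}: #{decision ? 'alert' : 'stay quiet'}"
end
```

With these numbers the first alert fires on occurrence 10 (600 seconds of failure), and the next one on occurrence 15.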
Other Handlers In Use
● IRC (Triaged by who is “on-point”)
● Email (not a thing)
● Pagerduty (Handled by “on-call”)
● OpsGenie (trialing)
● aws_prune (only on ec2 nodes)
● motd (sensu-report, not really a handler. Used for situation
awareness)
Future Handlers
● JIRA (auto create/close a ticket after a while?)
● Flapjack?
Sensu Clients
• Almost every server @yelp runs the sensu client (thank
you omnibus packages!)
• They connect to the Round-Robin dns entry local to
their zone.
• All checks are standalone, configured by puppet
Monitoring Check Puppet Wrapper
define monitoring_check (
  $command,
  $runbook,
  $check_every   = '5m',
  $alert_after   = '0s',
  $realert_every = '1',
  $irc_channels  = undef,
  $tip           = false,
  $page          = false,
  $wake          = true,
  $needs_sudo    = false,
  $sudo_user     = 'root',
  $team          = 'operations',
  $ensure        = 'present',
  $dependencies  = [],
  $sensu_custom  = {},
) {
……
Lots of validation. Lots of tests.
mandatory runbook!
Human readable time units!
Easy to add sudo rules!
TIP: The one line runbook for lazy humans!
Team defaults to ops for convenience.
Usually set to $::profile::server::team
Monitoring Check Puppet Wrapper Example
# Make sure apt-mirroring is working by checking the age of the NEW file left over.
monitoring_check { 'apt-mirror':
  check_every => '4h',
  team        => 'operations',
  page        => false,
  runbook     => 'y/rb-package-mirroring',
  tip         => 'Talk to kwa. Check /var/spool/apt-mirror/var/cron.log, then /nail/apt-mirror/var/apt-mirror.lock.',
  command     => '/usr/lib/nagios/plugins/check_file_age /nail/apt-mirror/var/NEW -w 86400 -c 172800',
}
Why Not Use The Native Puppet Type?
● The wrapper reduces the boilerplate and gives good
defaults
● Enforces site-specific policies and validation (team names,
mandatory runbooks)
● Allows us to modify all puppet-controlled sensu checks in
the future from a single spot.
● Custom tests
● Allows us to be backend agnostic (maybe)
Yelp SOA Checks
• How do we (Yelp) empower our developers to monitor
their services?
• How can we safely and conveniently allow devs to
define checks within our SOA framework?
• How can Devs not be blocked by Ops for service
deployment?
Define the Meta Check
# Defined on all hosts that run yelp SOA infrastructure
monitoring_check { 'check-yelp_soa':
  check_every => '1m',
  alert_after => '10m',
  page        => true,
  runbook     => 'http://y/rb-check-yelpsoa',
  tip         => 'Run /etc/sensu/plugins/check-yelp_soa.rb --debug to see what is wrong?',
  command     => '/etc/sensu/plugins/check-yelp_soa.rb',
  require     => Class['::yelp_soa']
}
check-yelp_soa.rb redux
def run
  # TODO: Parallelize?
  configs.each do | service, config |
    next unless services_that_run_here.include?(service)
    $log.debug "Processing #{service} as apparently it runs here"
    srv_configs = read_srv_configs(service)
    next unless srv_configs.include?('monitoring_check')
    monitoring_check = srv_configs['monitoring_check']
    if numeric?(config['port'])
      ...
      if command == 'check_http'
        url = monitoring_check['check_url'] || '/status'
        $log.debug "Making a http check for #{service}, team: #{team}, warn_timeout: #{warn_timeout}, crit_timeout: #{crit_timeout}"
        output, status = check_http(port,url,http_expect,warn_timeout,crit_timeout)
      elsif monitoring_check['command'] == 'check_tcp'
        $log.debug "Making a tcp check for #{service}, team: #{team}, warn_timeout: #{warn_timeout}, crit_timeout: #{crit_timeout}"
        output, status = check_tcp(port,warn_timeout,crit_timeout)
      else
        $log.debug "Not spawning a check for #{service} because I don't know how to run #{command}"
        next
      end
      send_result_to_sensu(service, status, output, team, runbook, tip, page, alert_after, realert_every, irc_channels)
      services_checked << service
    end # End port check
  end # End for loop
  ok "Finished run. Ran checks on #{services_checked}"
end
What was that?
Iterate through the SOA services that are configured to run
on a server.
Determine if that service has monitoring metadata defined
by the authors
Operate on that metadata to check it (usually check_http)
Send the results of the check to the localhost:3030 socket
as a *Different* check (“soa_$servicename”)
See https://gist.github.com/joemiller/5806570 for another example
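The last step, handing a result to the local Sensu client as a different check, can be sketched as below. The slides only show the call to `send_result_to_sensu`, not its body, so both helper names and the pass-through of custom fields such as team and page are assumptions. What is standard is that the client socket takes a JSON document with name, output and status.

```ruby
require 'json'
require 'socket'

# Hypothetical helper: build the result for a per-service check.
# Extra keys (team, page, ...) are assumed to ride along as custom fields.
def build_result(service, status, output, extra = {})
  { 'name' => "soa_#{service}", 'status' => status, 'output' => output }.merge(extra)
end

# Hypothetical helper: write the JSON document to the local client socket.
def send_result(result, host: '127.0.0.1', port: 3030)
  TCPSocket.open(host, port) { |socket| socket.write(JSON.generate(result)) }
end

result = build_result('request_blocking', 0, 'HTTP OK',
                      'team' => 'infra', 'page' => false)
puts JSON.generate(result)
# On a host running sensu-client, the result would be delivered with:
#   send_result(result)
```

Because the name differs from the meta check ('check-yelp_soa'), each service gets its own event, team and paging policy.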
An example service (request_blocking)
# from request_blocking.yaml
monitoring_check:
  team: 'infra'
  alert_after: 2m
  realert_every: 2
  irc_channels: 'infra'
  url: '/status'
  tip: "no tips yet"
  warn_timeout: 2.0
  crit_timeout: 5.0
AWS/Cloudwatch Checks
• Pretty much the same thing, except:
• Checks are executed on special monitoring hosts in
the AZ (not on the ephemeral node)
• Runs graphite/check_data.rb against the provided
metric name
• Written in Python this time! (https://pypi.python.org/pypi/sensu)
Dealing with Ephemeral EC2 Nodes
• Yelp lives in a hybrid world, we have lots of
“ephemeral” EC2 nodes that are baked and do NOT run
puppet. Can Sensu still work on them?
• How do we prevent ourselves from being spammed
when hosts go away “normally”?
• How do we know what a host is without logging into
it? (EC2 metadata)
• Baking………..
EC2 Considerations
• We use puppet to bake AMIs for ELBs, so we can
control (via puppet) how Sensu is configured at bake
time.
• We can query the AWS API to know if a host has gone
away, and prune it from the Queue to squelch alerts.
• Using custom client metadata, we can add things like
puppet cert name, AMI_ID, etc at runtime with a
special init script.
For Non-Ephemeral Instances
if str2bool($::is_ec2) == true {
  $client_custom = {
    'instance_id' => $::ec2_instanceid,
    'keepalive'   => {
      'handlers' => [ 'aws_prune', 'default' ],
      'team'     => $team,
      'page'     => true
    }
  }
} else {
  $client_custom = {
    'team' => $team,
    'page' => true
  }
}
Only EC2 Servers need the special aws_prune handler
A Fact! Embed it for easy troubleshooting
For Ephemeral (baked) Instances
description "Fix Sensu clientinfo on startup for baked ec2 instances"
author "Kyle Anderson <kwa@yelp.com>"
start on starting sensu-client
task
script
  ADDRESS=$(curl -s http://169.254.169.254/latest/meta-data/local-ipv4)
  AMI_ID=$(curl -s http://169.254.169.254/latest/meta-data/ami-id)
  INSTANCE_ID=$(curl -s http://169.254.169.254/latest/meta-data/instance-id)
  /usr/bin/jq ".client.name = \"$(/usr/local/sbin/puppet-certname)\" | .client.address = \"$ADDRESS\" | .client.instance_id = \"$INSTANCE_ID\" | .client.ami_id = \"$AMI_ID\"" /etc/sensu/conf.d/client.json > /etc/sensu/conf.d/newclient.json
  mv /etc/sensu/conf.d/client.json /etc/sensu/conf.d/client.json.old
  mv /etc/sensu/conf.d/newclient.json /etc/sensu/conf.d/client.json
end script
Only run once, right before sensu-client
Real data. Can’t lie.
Overwrite what we were
baked with. It is wrong.
jq FTW
Pruning Terminated EC2 Nodes
● Modification of https://github.com/sensu/sensu-community-plugins/blob/master/handlers/other/ec2_node.rb
● Instead we use a cron job to cache the results of the api call
into json so we can be nice to AWS
● Then we can have *every* check use this handler, as it is
easy to just check on disk if the instance_id is active.
● Use the instance_id from the client data to figure out who
you are. (which should be correct from the above)
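The on-disk lookup might look roughly like the sketch below. The cache layout (a JSON array of objects with an 'id' key) and both function names are assumptions, and the instance ID in the example is made up. One extra design choice in this sketch is to refuse to prune when the cache cannot be read, so a broken cron job does not make every node look terminated.

```ruby
require 'json'

CACHE_PATH = '/var/cache/instance_list.json'

# Returns the cached instance list, or nil when the cache is missing,
# unreadable, or not valid JSON.
def load_instances_cache(path = CACHE_PATH)
  JSON.parse(File.read(path))
rescue SystemCallError, JSON::ParserError
  nil
end

# True only when a usable cache exists and the instance is absent from it.
def safe_to_prune?(instances, instance_id)
  return false if instances.nil?

  instances.none? { |instance| instance['id'] == instance_id }
end

instances = load_instances_cache
puts safe_to_prune?(instances, 'i-0123abcd')
```

Reading a local file on every event is cheap, which is what makes it reasonable to attach the handler to every check.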
What Does It Look Like?
file { '/etc/sensu/plugins/cache_instance_list.rb':
  owner  => 'root',
  group  => 'root',
  mode   => '0500',
  source => 'puppet:///modules/profile/sensu/handlers/cache_instance_list.rb',
} ->
cron::d { 'cache_instance_list':
  minute  => '*',
  user    => 'root',
  command => "/etc/sensu/plugins/cache_instance_list.rb -a ${access_key} -r ${region} -k ${secret_key}",
} ->
monitoring_check { 'cache_instance_list-staleness':
  check_every => '10m',
  alert_after => '1h',
  team        => 'test',
  runbook     => 'y/rb-aws-prune',
  command     => "/usr/lib/nagios/plugins/check_file_age /var/cache/instance_list.json -w 1800 -c 3600",
  page        => false,
}
The Handler (puppet)
$access_key = hiera('sensu::aws_key')
$secret_key = hiera('sensu::aws_secret')
$aws_config_hash = {
  access_key           => $access_key,
  secret_key           => $secret_key,
  region               => $region,
  blacklist_name_array => [ 'bake_soa_ami', 'Packer Builder' ]
}
sensu::handler { 'aws_prune':
  type    => 'pipe',
  source  => 'puppet:///modules/profile/sensu/handlers/aws_prune.rb',
  config  => $aws_config_hash,
  require => [ Package['rubygem-fog'], Package['rubygem-sensu-plugin'], Package['rubygem-unf'] ],
}
}
The Handler (Ruby)
def ec2_node_exists?
  running_instances = load_instances_cache
  instance_ids = running_instances.collect { |s| Hash[ 'id', s['id'], 'tags', s['tags'] ]}
  my_instance_id = @event['client']['instance_id']
  instance_ids.each do |instance|
    # Only the entry for this client matters; skip everything else.
    next unless my_instance_id == instance['id']
    # YELP SPECIFIC CODE
    instance_name = instance['tags']['Name'].to_s
    # Yelp specific: pretend that the node does not exist if we are in our blacklist
    return false if blacklist_name_array.include?(instance_name)
    return true
  end
  return false # no match found, node doesn't exist
end
Cron Job Monitoring
• I believe cron sending emails is an anti-pattern and not
*web-scale*
• Let’s use Sensu to monitor our cron jobs!
• Use a combination of a cron puppet type wrapper and
my Sensu-Shell-Helper
• Modified sensu-shell-helper includes fields for team
and page for yelp-specific things:
https://github.com/solarkennedy/sensu-shell-helper
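As a rough illustration of the wrapper idea (not the actual sensu-shell-helper, which is a shell script), a command can be run and its exit status turned into a check result. The function name, the mapping of any non-zero exit to status 2 (critical), and the team/page fields are assumptions of this sketch.

```ruby
require 'json'
require 'open3'
require 'rbconfig'

# Hypothetical wrapper: run a command and describe the outcome as a check result.
def run_as_check(name, command, team:, page:)
  # capture2e returns combined stdout/stderr plus a Process::Status.
  output, process_status = Open3.capture2e(*command)
  {
    'name'   => name,
    'status' => process_status.success? ? 0 : 2, # assumed: any failure is critical
    'output' => output,
    'team'   => team,
    'page'   => page
  }
end

# Stand-in job: a one-line Ruby script instead of a real cron command.
job = [RbConfig.ruby, '-e', 'puts "permissions fixed"']
puts JSON.generate(run_as_check('fix-packages-permissions', job,
                                team: 'operations', page: false))
```

The resulting hash would then be written to the local Sensu client socket, so a failed cron job shows up as an alert for the owning team instead of an email.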
What does it look like?
$command = 'chgrp -R admin /nail/packages/'
cron::d { 'fix-packages-permissions':
  mailto  => '',
  minute  => '10',
  user    => 'root',
  comment => 'Make permissions group writable for collaboration purposes',
  command => "sensu-shell-helper -n fix-packages-permissions -p false -t operations ${command}",
  ensure  => 'present'
}
See https://github.com/torrancew/puppet-cron#cronjob for related work.
Future Work
● battle-test more of the pagerduty stuff (blocked on bogus aws nodes still)
● sort out AWS pruning, harder (#61626)
● make tools that work on nagios *and* sensu?
● really monitor the sensu instances in nagios with alerts (#60164)
● enable self-serve sensu alerts for services (#62201)
● make a library for sending passive checks (#62440)
● set up infrastructure for “aggregate” checks (cluster checks)
● better test the alerting tunables we have (#61628)
● enable sensu alerts for Asgardy services (#57450)
● set up easy to use metric based alerting (like horsefly, blocked on #67000)
● write my sensu-downtime tool
● write a super-dashboard (hackathon)
● write the sensu archive service (sensu-db?)
Thanks!

More Related Content

What's hot

Stop using Nagios (so it can die peacefully)
Stop using Nagios (so it can die peacefully)Stop using Nagios (so it can die peacefully)
Stop using Nagios (so it can die peacefully)Andy Sykes
 
How Yelp uses Mesos to Power its SOA Infrastructure
How Yelp uses Mesos to Power its SOA InfrastructureHow Yelp uses Mesos to Power its SOA Infrastructure
How Yelp uses Mesos to Power its SOA InfrastructureEvanKrall
 
How Yelp does Service Discovery
How Yelp does Service DiscoveryHow Yelp does Service Discovery
How Yelp does Service DiscoveryJohn Billings
 
Superb Supervision of Short-lived Servers with Sensu
Superb Supervision of Short-lived Servers with SensuSuperb Supervision of Short-lived Servers with Sensu
Superb Supervision of Short-lived Servers with SensuPaul O'Connor
 
Puppet Development Workflow
Puppet Development WorkflowPuppet Development Workflow
Puppet Development WorkflowJeffery Smith
 
Security Testing with OWASP ZAP in CI/CD - Simon Bennetts - Codemotion Amster...
Security Testing with OWASP ZAP in CI/CD - Simon Bennetts - Codemotion Amster...Security Testing with OWASP ZAP in CI/CD - Simon Bennetts - Codemotion Amster...
Security Testing with OWASP ZAP in CI/CD - Simon Bennetts - Codemotion Amster...Codemotion
 
Grafana and MySQL - Benefits and Challenges
Grafana and MySQL - Benefits and ChallengesGrafana and MySQL - Benefits and Challenges
Grafana and MySQL - Benefits and ChallengesPhilip Wernersbach
 
Steamlining your puppet development workflow
Steamlining your puppet development workflowSteamlining your puppet development workflow
Steamlining your puppet development workflowTomas Doran
 
Vladimir Ulogov - Large Scale Simulation | ZabConf2016 Lightning Talk
Vladimir Ulogov - Large Scale Simulation | ZabConf2016 Lightning TalkVladimir Ulogov - Large Scale Simulation | ZabConf2016 Lightning Talk
Vladimir Ulogov - Large Scale Simulation | ZabConf2016 Lightning TalkZabbix
 
Performance Tuning Your Puppet Infrastructure - PuppetConf 2014
Performance Tuning Your Puppet Infrastructure - PuppetConf 2014Performance Tuning Your Puppet Infrastructure - PuppetConf 2014
Performance Tuning Your Puppet Infrastructure - PuppetConf 2014Puppet
 
Rihards Olups - Zabbix at Nokia - Case Study
Rihards Olups - Zabbix at Nokia - Case StudyRihards Olups - Zabbix at Nokia - Case Study
Rihards Olups - Zabbix at Nokia - Case StudyZabbix
 
Monitoring a billion kilometers of monthly ride sharing at BlaBlaCar - Zabbix...
Monitoring a billion kilometers of monthly ride sharing at BlaBlaCar - Zabbix...Monitoring a billion kilometers of monthly ride sharing at BlaBlaCar - Zabbix...
Monitoring a billion kilometers of monthly ride sharing at BlaBlaCar - Zabbix...Jean Baptiste Favre
 
Cf summit-2016-monitoring-cf-sensu-graphite
Cf summit-2016-monitoring-cf-sensu-graphiteCf summit-2016-monitoring-cf-sensu-graphite
Cf summit-2016-monitoring-cf-sensu-graphiteJeff Barrows
 
Prometheus meets Consul -- Consul Casual Talks
Prometheus meets Consul -- Consul Casual TalksPrometheus meets Consul -- Consul Casual Talks
Prometheus meets Consul -- Consul Casual TalksSatoshi Suzuki
 
Puppet Camp Denver 2015: Developing and Testing with Enhanced Oscar
Puppet Camp Denver 2015: Developing and Testing with Enhanced OscarPuppet Camp Denver 2015: Developing and Testing with Enhanced Oscar
Puppet Camp Denver 2015: Developing and Testing with Enhanced OscarPuppet
 
Developing and Testing with Enhanced Oscar
Developing and Testing with Enhanced OscarDeveloping and Testing with Enhanced Oscar
Developing and Testing with Enhanced OscarJeff Scelza
 
2017 Codemotion OWASP ZAP in CI/CD
2017 Codemotion OWASP ZAP in CI/CD2017 Codemotion OWASP ZAP in CI/CD
2017 Codemotion OWASP ZAP in CI/CDSimon Bennetts
 
Introduction to SaltStack
Introduction to SaltStackIntroduction to SaltStack
Introduction to SaltStackAymen EL Amri
 
Windows Configuration Management: Managing Packages, Services, & Power Shell-...
Windows Configuration Management: Managing Packages, Services, & Power Shell-...Windows Configuration Management: Managing Packages, Services, & Power Shell-...
Windows Configuration Management: Managing Packages, Services, & Power Shell-...Puppet
 

What's hot (20)

Stop using Nagios (so it can die peacefully)
Stop using Nagios (so it can die peacefully)Stop using Nagios (so it can die peacefully)
Stop using Nagios (so it can die peacefully)
 
How Yelp uses Mesos to Power its SOA Infrastructure
How Yelp uses Mesos to Power its SOA InfrastructureHow Yelp uses Mesos to Power its SOA Infrastructure
How Yelp uses Mesos to Power its SOA Infrastructure
 
How Yelp does Service Discovery
How Yelp does Service DiscoveryHow Yelp does Service Discovery
How Yelp does Service Discovery
 
Superb Supervision of Short-lived Servers with Sensu
Superb Supervision of Short-lived Servers with SensuSuperb Supervision of Short-lived Servers with Sensu
Superb Supervision of Short-lived Servers with Sensu
 
Puppet Development Workflow
Puppet Development WorkflowPuppet Development Workflow
Puppet Development Workflow
 
Security Testing with OWASP ZAP in CI/CD - Simon Bennetts - Codemotion Amster...
Security Testing with OWASP ZAP in CI/CD - Simon Bennetts - Codemotion Amster...Security Testing with OWASP ZAP in CI/CD - Simon Bennetts - Codemotion Amster...
Security Testing with OWASP ZAP in CI/CD - Simon Bennetts - Codemotion Amster...
 
Grafana and MySQL - Benefits and Challenges
Grafana and MySQL - Benefits and ChallengesGrafana and MySQL - Benefits and Challenges
Grafana and MySQL - Benefits and Challenges
 
Steamlining your puppet development workflow
Steamlining your puppet development workflowSteamlining your puppet development workflow
Steamlining your puppet development workflow
 
Vladimir Ulogov - Large Scale Simulation | ZabConf2016 Lightning Talk
Vladimir Ulogov - Large Scale Simulation | ZabConf2016 Lightning TalkVladimir Ulogov - Large Scale Simulation | ZabConf2016 Lightning Talk
Vladimir Ulogov - Large Scale Simulation | ZabConf2016 Lightning Talk
 
Performance Tuning Your Puppet Infrastructure - PuppetConf 2014
Performance Tuning Your Puppet Infrastructure - PuppetConf 2014Performance Tuning Your Puppet Infrastructure - PuppetConf 2014
Performance Tuning Your Puppet Infrastructure - PuppetConf 2014
 
Rihards Olups - Zabbix at Nokia - Case Study
Rihards Olups - Zabbix at Nokia - Case StudyRihards Olups - Zabbix at Nokia - Case Study
Rihards Olups - Zabbix at Nokia - Case Study
 
Monitoring a billion kilometers of monthly ride sharing at BlaBlaCar - Zabbix...
Monitoring a billion kilometers of monthly ride sharing at BlaBlaCar - Zabbix...Monitoring a billion kilometers of monthly ride sharing at BlaBlaCar - Zabbix...
Monitoring a billion kilometers of monthly ride sharing at BlaBlaCar - Zabbix...
 
Cf summit-2016-monitoring-cf-sensu-graphite
Cf summit-2016-monitoring-cf-sensu-graphiteCf summit-2016-monitoring-cf-sensu-graphite
Cf summit-2016-monitoring-cf-sensu-graphite
 
Prometheus meets Consul -- Consul Casual Talks
Prometheus meets Consul -- Consul Casual TalksPrometheus meets Consul -- Consul Casual Talks
Prometheus meets Consul -- Consul Casual Talks
 
Ansible Case Studies
Ansible Case StudiesAnsible Case Studies
Ansible Case Studies
 
Puppet Camp Denver 2015: Developing and Testing with Enhanced Oscar
Puppet Camp Denver 2015: Developing and Testing with Enhanced OscarPuppet Camp Denver 2015: Developing and Testing with Enhanced Oscar
Puppet Camp Denver 2015: Developing and Testing with Enhanced Oscar
 
Developing and Testing with Enhanced Oscar
Developing and Testing with Enhanced OscarDeveloping and Testing with Enhanced Oscar
Developing and Testing with Enhanced Oscar
 
2017 Codemotion OWASP ZAP in CI/CD
2017 Codemotion OWASP ZAP in CI/CD2017 Codemotion OWASP ZAP in CI/CD
2017 Codemotion OWASP ZAP in CI/CD
 
Introduction to SaltStack
Introduction to SaltStackIntroduction to SaltStack
Introduction to SaltStack
 
Windows Configuration Management: Managing Packages, Services, & Power Shell-...
Windows Configuration Management: Managing Packages, Services, & Power Shell-...Windows Configuration Management: Managing Packages, Services, & Power Shell-...
Windows Configuration Management: Managing Packages, Services, & Power Shell-...
 

Viewers also liked

Sense and Sensu-bility: Painless Metrics And Monitoring In The Cloud with Sensu
Sense and Sensu-bility: Painless Metrics And Monitoring In The Cloud with SensuSense and Sensu-bility: Painless Metrics And Monitoring In The Cloud with Sensu
Sense and Sensu-bility: Painless Metrics And Monitoring In The Cloud with SensuBethany Erskine
 
An Introduction to Sensu by Bethany Erskine
An Introduction to Sensu by Bethany Erskine An Introduction to Sensu by Bethany Erskine
An Introduction to Sensu by Bethany Erskine Hakka Labs
 
The ultimate container monitoring bake-off - Rancher Online Meetup October 2016
The ultimate container monitoring bake-off - Rancher Online Meetup October 2016The ultimate container monitoring bake-off - Rancher Online Meetup October 2016
The ultimate container monitoring bake-off - Rancher Online Meetup October 2016Shannon Williams
 
Redis in a Multi Tenant Environment–High Availability, Monitoring & Much More!
Redis in a Multi Tenant Environment–High Availability, Monitoring & Much More! Redis in a Multi Tenant Environment–High Availability, Monitoring & Much More!
Redis in a Multi Tenant Environment–High Availability, Monitoring & Much More! Redis Labs
 
PuppetConf 2016: Watching the Puppet Show – Sean Porter, Heavy Water Operations
PuppetConf 2016: Watching the Puppet Show – Sean Porter, Heavy Water OperationsPuppetConf 2016: Watching the Puppet Show – Sean Porter, Heavy Water Operations
PuppetConf 2016: Watching the Puppet Show – Sean Porter, Heavy Water OperationsPuppet
 
Volta: Logging, Metrics, and Monitoring as a Service
Volta: Logging, Metrics, and Monitoring as a ServiceVolta: Logging, Metrics, and Monitoring as a Service
Volta: Logging, Metrics, and Monitoring as a ServiceLN Renganarayana
 
Open Source Monitoring in 2014, from #monitoringssucks to #monitoringlove and...
Open Source Monitoring in 2014, from #monitoringssucks to #monitoringlove and...Open Source Monitoring in 2014, from #monitoringssucks to #monitoringlove and...
Open Source Monitoring in 2014, from #monitoringssucks to #monitoringlove and...Kris Buytaert
 
Time to say goodbye to your Nagios based setup
Time to say goodbye to your Nagios based setupTime to say goodbye to your Nagios based setup
Time to say goodbye to your Nagios based setupCheck my Website
 
Beautiful Monitoring With Grafana and InfluxDB
Beautiful Monitoring With Grafana and InfluxDBBeautiful Monitoring With Grafana and InfluxDB
Beautiful Monitoring With Grafana and InfluxDBleesjensen
 
Building a Global Multi-Tenant Monitoring Platform
Building a Global Multi-Tenant Monitoring PlatformBuilding a Global Multi-Tenant Monitoring Platform
Building a Global Multi-Tenant Monitoring PlatformAmazon Web Services
 
The Top 5 Performance Management Tools
The Top 5 Performance Management ToolsThe Top 5 Performance Management Tools
The Top 5 Performance Management ToolsBernard Marr
 
How to Become a Thought Leader in Your Niche
How to Become a Thought Leader in Your NicheHow to Become a Thought Leader in Your Niche
How to Become a Thought Leader in Your NicheLeslie Samuel
 

Viewers also liked (14)

Sense and Sensu-bility: Painless Metrics And Monitoring In The Cloud with Sensu
Sense and Sensu-bility: Painless Metrics And Monitoring In The Cloud with SensuSense and Sensu-bility: Painless Metrics And Monitoring In The Cloud with Sensu
Sense and Sensu-bility: Painless Metrics And Monitoring In The Cloud with Sensu
 
An Introduction to Sensu by Bethany Erskine
An Introduction to Sensu by Bethany Erskine An Introduction to Sensu by Bethany Erskine
An Introduction to Sensu by Bethany Erskine
 
The ultimate container monitoring bake-off - Rancher Online Meetup October 2016
The ultimate container monitoring bake-off - Rancher Online Meetup October 2016The ultimate container monitoring bake-off - Rancher Online Meetup October 2016
The ultimate container monitoring bake-off - Rancher Online Meetup October 2016
 
Redis in a Multi Tenant Environment–High Availability, Monitoring & Much More!
Redis in a Multi Tenant Environment–High Availability, Monitoring & Much More! Redis in a Multi Tenant Environment–High Availability, Monitoring & Much More!
Redis in a Multi Tenant Environment–High Availability, Monitoring & Much More!
 
PuppetConf 2016: Watching the Puppet Show – Sean Porter, Heavy Water Operations
PuppetConf 2016: Watching the Puppet Show – Sean Porter, Heavy Water OperationsPuppetConf 2016: Watching the Puppet Show – Sean Porter, Heavy Water Operations
PuppetConf 2016: Watching the Puppet Show – Sean Porter, Heavy Water Operations
 
Volta: Logging, Metrics, and Monitoring as a Service
Volta: Logging, Metrics, and Monitoring as a ServiceVolta: Logging, Metrics, and Monitoring as a Service
Volta: Logging, Metrics, and Monitoring as a Service
 
Influxdb and time series data
Influxdb and time series dataInfluxdb and time series data
Influxdb and time series data
 
Open Source Monitoring in 2014, from #monitoringssucks to #monitoringlove and...
Open Source Monitoring in 2014, from #monitoringssucks to #monitoringlove and...Open Source Monitoring in 2014, from #monitoringssucks to #monitoringlove and...
Open Source Monitoring in 2014, from #monitoringssucks to #monitoringlove and...
 
InfluxDB & Grafana
InfluxDB & GrafanaInfluxDB & Grafana
InfluxDB & Grafana
 
Time to say goodbye to your Nagios based setup
Time to say goodbye to your Nagios based setupTime to say goodbye to your Nagios based setup
Time to say goodbye to your Nagios based setup
 
Beautiful Monitoring With Grafana and InfluxDB
Beautiful Monitoring With Grafana and InfluxDBBeautiful Monitoring With Grafana and InfluxDB
Beautiful Monitoring With Grafana and InfluxDB
 
Building a Global Multi-Tenant Monitoring Platform
Building a Global Multi-Tenant Monitoring PlatformBuilding a Global Multi-Tenant Monitoring Platform
Building a Global Multi-Tenant Monitoring Platform
 
The Top 5 Performance Management Tools
The Top 5 Performance Management ToolsThe Top 5 Performance Management Tools
The Top 5 Performance Management Tools
 
How to Become a Thought Leader in Your Niche
How to Become a Thought Leader in Your NicheHow to Become a Thought Leader in Your Niche
How to Become a Thought Leader in Your Niche
 

Sensu @ Yelp!: A Guided Tour

  • 1. Sensu @ Yelp - A Guided Tour Kyle Anderson https://github.com/solarkennedy
  • 2. Disclaimer I’m just a dude. I know that when I watch a presentation by a company that I recognize, I think to myself, “Hmm, $company, I’ve heard of them. They probably have their stuff together. Let’s see what they do…” I’m here to describe, not persuade. I may not have everything together. Just because I have things with “Unit Tests”, doesn’t mean I’m “Right”. Especially with a “framework” like Sensu, there can be more than one way to do things. The trick is figuring out what works for you. I hope by giving a real concrete example, you might be inspired to step up your monitoring game?
  • 3. Outline 1. Overall Architecture 2. Sensu Server Setup a. Custom Base Handler 3. Client Configuration a. Sensu Check Puppet Wrapper 4. Yelp SOA Checks 5. AWS/Cloudwatch Checks 6. Dealing with Ephemeral EC2 Servers 7. Cron Job Monitoring 8. Future Work
  • 4. Overall Architecture ● profile::sensu_client ○ Sensu clients connect to RabbitMQ on one of the servers (DNS Round Robin) ● profile::sensu_server ○ Base HAProxy install ○ RabbitMQ in Mirror Mode, load balanced via HAProxy ○ Redis in Master/slave mode, load balanced via HAProxy. (only master passes healthcheck) ○ Sensu Server installed, subscribes on RabbitMQ ○ API Load balanced via HAProxy ○ Dashboard Load balanced by HAProxy
  • 6. Puppet Modules in Use puppetlabs/rabbitmq puppetlabs/haproxy kyleanderson/redis_sentinel arioch/redis sensu/sensu
  • 7. Addressing Complexity “Sensu has so many moving parts that I wouldn’t be able to sleep at night unless I set up a Nagios instance to make sure they were all running.” Laurie Denness https://laur.ie/blog/2014/02/why-ill-be-letting-nagios-live-on-a-bit-longer-thank-you-very-much/
  • 8. Addressing Complexity “I will be honest; I haven’t used Sensu, because I’m in a happy place right now, but just the architectural diagram of how it works scares the shit out of me. When you need 7 arrow colours to describe where data is going in a monitoring system, I’m starting to fear it slightly. But hey, if it works, good on you guys. It just looks a lot like this. Nothing wrong with that, if you can make it stable and reliable.” Laurie Denness https://laur.ie/blog/2014/02/why-ill-be-letting-nagios-live-on-a-bit-longer-thank-you-very-much/
  • 9. First Principle: Single Point of Truth
  • 10. Pop Quiz: Determine what Servers are Puppetmasters? • A: Puppet manifests (include puppetmaster) • B: DNS (puppet.local A 10.5.x.x) • C: update-live script (for Server in ….) • D: The servers that have had the puppetmaster bootstrap script run on them • E: What MCollective says (mco find -C puppetmaster) Answer: All / None of the above!
  • 11. Sensu Server Detection
      # Use DNS to detect if this server is a sensu server
      $local_sensu_server_array = gethostbyname2array("sensu.local-${::habitat}.yelpcorp.com")
      $ip_address_array = split($::all_ipaddresses, ',')
      validate_array($local_sensu_server_array)
      validate_array($ip_address_array)
      $array_intersection = intersection($ip_address_array, $local_sensu_server_array)
      # If our ipaddresses are in the dns entries, we must be a sensu server!
      if size($array_intersection) > 0 {
        $is_sensu_server = true
      } else {
        $is_sensu_server = false
      }
  • 12. HAProxy • Every server in the sensu cluster runs its own HAProxy • HAProxy listens on the “standard” ports, individual instances listen on standard + 1 • Having an array of sensu servers from DNS allows us to grow the backends • If HAProxy dies, clients will re-resolve, and reconnect.
  • 13. RabbitMQ • Every server in the sensu cluster runs a rabbitmq server in mirror mode (with autoheal for AP) • Lots of individual clusters, not doing shoveling. • Client authentication via SSL client certs (controlled by puppet) • Load balanced by haproxy • Sensu-clients automatically reconnect on failure
  • 14. Redis • Redis is the persistent store used by Sensu to keep track of heartbeats, what alerts are silenced, how many times a check has failed, etc. • Redis is set up in a cluster mode, with redis-sentinel doing automatic master/slave promotion. (Kinda CP) • We use the redis-role haproxy master pattern suggestion from http://failshell.io/sensu/high-availability-sensu/
  • 15. Sensu API + Dashboard • sensu-api provides a rest api with json output for integration. • sensu-cli is provided for easy command line interactive use • Both the API and Dashboard use basic auth internally (shared secret), and then LDAP+SSL auth externally. • sensu-dashboard uses this api, and is behind our external facing apache for authentication.
  • 16. Sensu Servers: • Automatically does master election, good. Build for 3. • Connects to RabbitMQ, pulls events off and acts on them • Runs “handlers” on the event data • That’s kinda it • Which leads to handlers….
  • 17. Sensu Timing Tunables Before/After Custom check definition key-values Custom key-values can be added to a check definition, which will be included in event data, enabling handler creativity. Common custom check definitions: • interval: How frequently (in seconds) the check will be executed • occurrences: Number of event occurrences before the handler should take action • refresh: Number of seconds handlers should wait before taking second action. Relies on sensu-plugin. Yelp Monitoring Check Definition Key Values The custom base handler interprets these values: • check_every = '5m', • alert_after = '0s', • realert_every = '1',
  • 18. Custom Base Handler
      def filter_repeated
        interval = @event['check']['interval'] || 0
        alert_after = @event['check']['alert_after'] || 0
        realert_every = @event['check']['realert_every'] || 1
        # How long the check has been failing, in seconds
        failing_for = @event['occurrences'].to_i * @event['check']['interval'].to_i
        if failing_for < alert_after
          bail "Only failing for #{failing_for}, less than #{alert_after}. Not performing any action yet."
        elsif interval > 0 and @event['action'] == 'create'
          # Occurrences that elapsed before alert_after was reached do not count
          initial_failing_occurrences = alert_after.fdiv(interval).to_i
          number_of_failed_attempts = @event['occurrences'] - initial_failing_occurrences
          unless number_of_failed_attempts == 0 || number_of_failed_attempts % realert_every == 0
            # The slide referenced an undefined `number` here; realert_every is the value meant
            bail 'only handling every ' + realert_every.to_s + ' occurrences'
          end
        end
      end
  • 19. Other Handlers In Use ● IRC (Triaged by who is “on-point”) ● Email (not a thing) ● Pagerduty (Handled by “on-call”) ● OpsGenie (trialing) ● aws_prune (only on ec2 nodes) ● motd (sensu-report, not really a handler. Used for situation awareness) Future Handlers ● JIRA (auto create/close a ticket after a while?) ● Flapjack?
  • 20. Sensu Clients • Almost every server @yelp runs the sensu client (thank you omnibus packages!) • They connect to the Round-Robin dns entry local to their zone. • All checks are standalone, configured by puppet
  • 21. Monitoring Check Puppet Wrapper
      define monitoring_check (
        $command,
        $runbook,
        $check_every = '5m',
        $alert_after = '0s',
        $realert_every = '1',
        $irc_channels = undef,
        $tip = false,
        $page = false,
        $wake = true,
        $needs_sudo = false,
        $sudo_user = 'root',
        $team = 'operations',
        $ensure = 'present',
        $dependencies = [],
        $sensu_custom = {},
      ) { ……
    Lots of validation. Lots of tests. Mandatory runbook! Human readable time units! Easy to add sudo rules! TIP: The one line runbook for lazy humans! Team defaults to ops for convenience. Usually set to $::profile::server::team
  • 22. Monitoring Check Puppet Wrapper Example
      # Make sure apt-mirroring is working by checking the age of the NEW file left over.
      monitoring_check { 'apt-mirror':
        check_every => '4h',
        team        => 'operations',
        page        => false,
        runbook     => 'y/rb-package-mirroring',
        tip         => 'Talk to kwa. Check /var/spool/apt-mirror/var/cron.log, then /nail/apt-mirror/var/apt-mirror.lock.',
        command     => '/usr/lib/nagios/plugins/check_file_age /nail/apt-mirror/var/NEW -w 86400 -c 172800',
      }
  • 23. Why Not Use The Native Puppet Type? ● The wrapper reduces the boilerplate and gives good defaults ● Enforces site-specific policies and validation (team names, mandatory runbooks) ● Allows us to modify all puppet-controlled sensu checks in the future from a single spot. ● Custom tests ● Allows us to be backend agnostic (maybe)
  • 24. Yelp SOA Checks • How do we (Yelp) empower our developers to monitor their services? • How can we safely and conveniently allow devs to define checks within our SOA framework? • How can Devs not be blocked by Ops for service deployment?
  • 25. Define the Meta Check
      # Defined on all hosts that run yelp SOA infrastructure
      monitoring_check { 'check-yelp_soa':
        check_every => '1m',
        alert_after => '10m',
        page        => true,
        runbook     => 'http://y/rb-check-yelpsoa',
        tip         => 'Run /etc/sensu/plugins/check-yelp_soa.rb --debug to see what is wrong?',
        command     => '/etc/sensu/plugins/check-yelp_soa.rb',
        require     => Class['::yelp_soa']
      }
  • 26. check-yelp_soa.rb redux
      def run
        # TODO: Parallelize?
        configs.each do | service, config |
          next unless services_that_run_here.include?(service)
          $log.debug "Processing #{service} as apparently it runs here"
          srv_configs = read_srv_configs(service)
          next unless srv_configs.include?('monitoring_check')
          monitoring_check = srv_configs['monitoring_check']
          if numeric?(config['port'])
            ...
            if command == 'check_http'
              url = monitoring_check['check_url'] || '/status'
              $log.debug "Making a http check for #{service}, team: #{team}, warn_timeout: #{warn_timeout}, crit_timeout: #{crit_timeout}"
              output, status = check_http(port,url,http_expect,warn_timeout,crit_timeout)
            elsif monitoring_check['command'] == 'check_tcp'
              $log.debug "Making a tcp check for #{service}, team: #{team}, warn_timeout: #{warn_timeout}, crit_timeout: #{crit_timeout}"
              output, status = check_tcp(port,warn_timeout,crit_timeout)
            else
              $log.debug "Not spawning a check for #{service} because I don't know how to run #{command}"
              next
            end
            send_result_to_sensu(service, status, output, team, runbook, tip, page, alert_after, realert_every, irc_channels)
            services_checked << service
          end # End port check
        end # End for loop
        ok "Finished run. Ran checks on #{services_checked}"
      end
  • 27. What was that? Iterate through the SOA services that are configured to run on a server. Determine if that service has monitoring metadata defined by the authors Operate on that metadata to check it (usually check_http) Send the results of the check to the localhost:3030 socket as a *Different* check (“soa_$servicename”) See https://gist.github.com/joemiller/5806570 for another example
  • 28. An example service (request_blocking)
      # from request_blocking.yaml
      monitoring_check:
        team: 'infra'
        alert_after: 2m
        realert_every: 2
        irc_channels: 'infra'
        url: '/status'
        tip: "no tips yet"
        warn_timeout: 2.0
        crit_timeout: 5.0
  • 29. AWS/Cloudwatch Checks • Pretty much the same thing, except: • Checks are executed on special monitoring hosts in the AZ (not on the ephemeral node) • Runs graphite/check_data.rb against the provided metric name • Written in python this time! (https://pypi.python.org/pypi/sensu)
  • 30. Dealing with Ephemeral EC2 Nodes • Yelp lives in a hybrid world, we have lots of “ephemeral” EC2 nodes that are baked and do NOT run puppet. Can Sensu still work on them? • How do we prevent ourselves from being spammed when hosts go away “normally”? • How do we know what a host is without logging into it? (EC2 metadata) • Baking………..
  • 31. EC2 Considerations • We use puppet to bake AMIs for ELBs, so we can control (via puppet) how Sensu is configured at bake time. • We can query the AWS API to know if a host has gone away, and prune it from the Queue to squelch alerts. • Using custom client metadata, we can add things like puppet cert name, AMI_ID, etc at runtime with a special init script.
  • 32. For Non-Ephemeral Instances
      if str2bool($::is_ec2) == true {
        $client_custom = {
          'instance_id' => $::ec2_instanceid,
          'keepalive'   => {
            'handlers' => [ 'aws_prune', 'default' ],
            'team'     => $team,
            'page'     => true
          }
        }
      } else {
        $client_custom = {
          'team' => $team,
          'page' => true
        }
      }
    Only EC2 Servers need the special aws_prune handler. The instance id is a Fact! Embed it for easy troubleshooting.
  • 33. For Ephemeral (baked) Instances
      description "Fix Sensu clientinfo on startup for baked ec2 instances"
      author "Kyle Anderson <kwa@yelp.com>"
      start on starting sensu-client
      task
      script
        ADDRESS=$(curl -s http://169.254.169.254/latest/meta-data/local-ipv4)
        AMI_ID=$(curl -s http://169.254.169.254/latest/meta-data/ami-id)
        INSTANCE_ID=$(curl -s http://169.254.169.254/latest/meta-data/instance-id)
        # Inner quotes are escaped so jq receives JSON string literals
        /usr/bin/jq ".client.name = \"$(/usr/local/sbin/puppet-certname)\" | .client.address = \"$ADDRESS\" | .client.instance_id = \"$INSTANCE_ID\" | .client.ami_id = \"$AMI_ID\" " /etc/sensu/conf.d/client.json > /etc/sensu/conf.d/newclient.json
        mv /etc/sensu/conf.d/client.json /etc/sensu/conf.d/client.json.old
        mv /etc/sensu/conf.d/newclient.json /etc/sensu/conf.d/client.json
      end script
    Only run once, right before sensu-client. Real data. Can’t lie. Overwrite what we were baked with. It is wrong. jq FTW
  • 34. Pruning Terminated EC2 Nodes ● Modification of https://github.com/sensu/sensu-community-plugins/blob/master/handlers/other/ec2_node.rb ● Instead we use a cron job to cache the results of the api call into json so we can be nice to AWS ● Then we can have *every* check use this handler, as it is easy to just check on disk if the instance_id is active. ● Use the instance_id from the client data to figure out who you are. (which should be correct from the above)
  • 35. What Does It Look Like?
      file { '/etc/sensu/plugins/cache_instance_list.rb':
        owner  => 'root',
        group  => 'root',
        mode   => '0500',
        source => 'puppet:///modules/profile/sensu/handlers/cache_instance_list.rb',
      } ->
      cron::d { 'cache_instance_list':
        minute  => '*',
        user    => 'root',
        command => "/etc/sensu/plugins/cache_instance_list.rb -a ${access_key} -r ${region} -k ${secret_key}",
      } ->
      monitoring_check { 'cache_instance_list-staleness':
        check_every => '10m',
        alert_after => '1h',
        team        => 'test',
        runbook     => 'y/rb-aws-prune',
        command     => "/usr/lib/nagios/plugins/check_file_age /var/cache/instance_list.json -w 1800 -c 3600",
        page        => false,
      }
  • 36. The Handler (puppet)
      $access_key = hiera('sensu::aws_key')
      $secret_key = hiera('sensu::aws_secret')
      $aws_config_hash = {
        access_key           => $access_key,
        secret_key           => $secret_key,
        region               => $region,
        blacklist_name_array => [ 'bake_soa_ami', 'Packer Builder' ]
      }
      sensu::handler { 'aws_prune':
        type    => 'pipe',
        source  => 'puppet:///modules/profile/sensu/handlers/aws_prune.rb',
        config  => $aws_config_hash,
        require => [ Package['rubygem-fog'], Package['rubygem-sensu-plugin'], Package['rubygem-unf'] ],
      }
      }
  • 37. The Handler (Ruby)
      def ec2_node_exists?
        running_instances = load_instances_cache
        instance_ids = running_instances.collect { |s| Hash[ 'id', s['id'], 'tags', s['tags'] ]}
        my_instance_id = @event['client']['instance_id']
        instance_ids.each do |instance|
          # YELP SPECIFIC CODE
          instance_name = instance['tags']['Name'].to_s
          # Yelp specific: pretend that the node does not exist if we are in our blacklist
          return false if blacklist_name_array.include?(instance_name)
          return true if my_instance_id == instance['id']
        end
        return false # no match found, node doesn't exist
      end
  • 38. Cron Job Monitoring • I believe cron sending emails is an anti-pattern and not *web-scale* • Let’s use Sensu to monitor our cron jobs! • Use a combination of a cron puppet type wrapper and my Sensu-Shell-Helper • Modified sensu-shell-helper includes fields for team and page for yelp-specific things: https://github.com/solarkennedy/sensu-shell-helper
  • 39. What does it look like?
      $command = 'chgrp -R admin /nail/packages/'
      cron::d { 'fix-packages-permissions':
        mailto  => '',
        minute  => '10',
        user    => 'root',
        comment => 'Make permissions group writable for collaboration purposes',
        command => "sensu-shell-helper -n fix-packages-permissions -p false -t operations ${command}",
        ensure  => 'present'
      }
    See https://github.com/torrancew/puppet-cron#cronjob for related work.
  • 40. Future Work ● battle-test more of the pagerduty stuff (blocked on bogus aws nodes still) ● sort out AWS pruning, harder (#61626) ● make tools that work on nagios *and* sensu? ● really monitor the sensu instances in nagios with alerts (#60164) ● enable self-serve sensu alerts for services (#62201) ● make a library for sending passive checks (#62440) ● set up infrastructure for “aggregate” checks (cluster checks) ● better test the alerting tunables we have (#61628) ● enable sensu alerts for Asgardy services (#57450) ● set up easy to use metric based alerting (like horsefly, blocked on #67000) ● write my sensu-downtime tool ● write a super-dashboard (hackathon) ● write the sensu archive service (sensu-db?)
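
The timing tunables from slides 17 and 18 (check_every, alert_after, realert_every) are easier to reason about as a pure function. What follows is a minimal sketch in Ruby, not the handler Yelp actually runs: the names `to_seconds` and `should_alert?` are hypothetical, and the unit suffixes (s, m, h, d) are an assumption based on the values shown in the slides ('5m', '0s', '4h'). The slides do not show where human readable times get converted to seconds, so the conversion is done inline here.

```ruby
# Hypothetical helper: turn 90, '90', '30s', '5m', '4h' or '1d' into seconds.
# The suffix set is an assumption; the slides only show s, m and h values.
UNIT_SECONDS = { 's' => 1, 'm' => 60, 'h' => 3600, 'd' => 86_400 }.freeze

def to_seconds(value)
  return value if value.is_a?(Integer)

  match = /\A(\d+)([smhd]?)\z/.match(value.to_s.strip)
  raise ArgumentError, "unparseable duration: #{value.inspect}" unless match

  # A bare number (empty suffix) is treated as seconds.
  match[1].to_i * UNIT_SECONDS.fetch(match[2], 1)
end

# Hypothetical pure-function version of the filter_repeated logic on slide 18.
# Returns true when a handler should act on this occurrence of a failing check.
def should_alert?(occurrences:, interval:, alert_after: 0, realert_every: 1)
  interval      = to_seconds(interval)
  alert_after   = to_seconds(alert_after)
  realert_every = [realert_every.to_i, 1].max

  # Stay quiet until the check has been failing for at least alert_after.
  failing_for = occurrences * interval
  return false if failing_for < alert_after
  return true if interval.zero?

  # Occurrences that happened before alert_after elapsed do not count.
  initial_failing_occurrences = alert_after.fdiv(interval).to_i
  since_first_alert = occurrences - initial_failing_occurrences

  since_first_alert.zero? || (since_first_alert % realert_every).zero?
end

# Usage: a check run every minute that alerts after 10 minutes of failure
# and then re-alerts on every second occurrence.
settings = { interval: '1m', alert_after: '10m', realert_every: 2 }
should_alert?(occurrences: 5,  **settings) # => false (failing for only 5 minutes)
should_alert?(occurrences: 10, **settings) # => true  (first alert)
should_alert?(occurrences: 11, **settings) # => false (skipped)
should_alert?(occurrences: 12, **settings) # => true  (re-alert)
```

Keeping the decision separate from the handler's `bail` calls makes it straightforward to unit test the tunables, which is one of the items on the Future Work slide.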