Current monitoring tools are clearly reaching the limits of their capabilities. These tools are built on fundamental assumptions that no longer hold, such as the assumption that the underlying system being monitored is relatively static, or that the behavioral limits of these systems can be defined by static rules and thresholds. Interest in applying analytics and machine learning to detect anomalies in dynamic web environments is gaining steam. However, understanding which algorithms can accurately identify and predict anomalies within all the data we generate is not so easy.
This talk builds on an Open Space discussion that was started at DevOps Days Austin. We will begin with a brief definition of the types of anomalies commonly found in dynamic data center environments, and then discuss some of the key elements to consider when thinking about anomaly detection, such as:
Understanding your data and the two main approaches for analyzing operations data: parametric and non-parametric methods
The importance of context
Simple data transformations that can give you powerful results
Beyond the Pretty Charts: Analytics for the rest of us. Toufic Boubez, DevOps Days Silicon Valley, 2013-06-22
1. Beyond The Pretty Charts
Analytics for the rest of us
Toufic Boubez, Ph.D.
Co-Founder, CTO
Metafor Software
2. Toufic intro – who I am
• Co-Founder/CTO Metafor Software
• Co-Founder/CTO Layer 7 Technologies
– API Management
– Acquired by Computer Associates in 2013
• I escaped
• Building large scale software systems for 20
years (I’m older than I look, I know!)
3. Why this talk?
• DevOps Days Austin: Open Space talk
– Blog: http://metaforsoftware.com/beyond-the-pretty-charts-a-report-from-devopsdays-in-austin/
• Five major discussion points/lessons learned
• Note: no labels on charts – on purpose!!
• Note: real data
5. We’ve moved beyond static thresholds
• Most current monitoring tools assume that
the underlying system is relatively static so we
can surround it with static thresholds and
rules. BUT:
– So what if my unicorn usage is at 91%, and has
been stable at 91% for a while?
– I’d much rather know if it’s at 60% and has been
rapidly increasing over the last few hours.
6. Need more better analytics
• Thresholds won’t help you in this case
• Need some more dynamic analytics
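A minimal sketch of the difference, with invented metric values, window sizes, and thresholds: a static threshold fires continuously on a metric that is stable but high, and stays quiet on one that is lower but climbing fast; a simple rate-of-change check behaves the other way around.

```python
# Hypothetical sketch; all function names, samples, and thresholds are
# illustrative, not from any real monitoring tool.

def static_threshold_alert(samples, threshold=90.0):
    """Alert only when the latest sample crosses a fixed threshold."""
    return samples[-1] > threshold

def rate_of_change_alert(samples, window=4, max_rise_per_step=5.0):
    """Alert when the metric has risen faster than max_rise_per_step,
    on average, over the last `window` samples."""
    if len(samples) < window + 1:
        return False
    rise = samples[-1] - samples[-1 - window]
    return rise / window > max_rise_per_step

stable_high = [91.0] * 10                # stable at 91% for a while
rising_fast = [20, 28, 36, 45, 53, 60]   # only at 60%, but climbing rapidly

print(static_threshold_alert(stable_high))   # True  (alert we don't want)
print(static_threshold_alert(rising_fast))   # False (misses the trend)
print(rate_of_change_alert(stable_high))     # False
print(rate_of_change_alert(rising_fast))     # True  (catches the climb)
```

Real dynamic analytics would model the metric's history rather than hard-code a slope, but even this crude check inverts which of the two situations gets the alert.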
7. Context is really important
– Do I really want to be alerted when I know someone is
performing maintenance or backups?
– Is there an event that caused the change in behaviour (e.g. new
deploy)?
– Correlate your event line with your monitoring
Down for maintenance?
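One way to correlate the event line with monitoring is to suppress alerts whose timestamps fall inside a known maintenance or backup window. A minimal sketch (window times and function names are invented for illustration):

```python
# Hypothetical sketch: drop alerts that coincide with known events.
from datetime import datetime

# Event line: known maintenance/backup windows (invented example times)
maintenance_windows = [
    (datetime(2013, 6, 22, 2, 0), datetime(2013, 6, 22, 4, 0)),  # nightly backup
]

def should_alert(anomaly_time, windows=maintenance_windows):
    """Alert only if the anomaly is NOT explained by a known event."""
    return not any(start <= anomaly_time <= end for start, end in windows)

print(should_alert(datetime(2013, 6, 22, 3, 0)))   # False: it's just the backup
print(should_alert(datetime(2013, 6, 22, 12, 0)))  # True: unexplained change
```

The same idea extends to deploy events: an anomaly right after a deploy is context, not necessarily a problem.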
8. Know your data!!
– You need to understand the statistical properties
of your data, and where it comes from, in order to
determine what kind of analytics to use.
• For example, it’s important to know if your data is
normally distributed.
• http://codeascraft.com/2013/06/11/introducing-kale/
• https://github.com/etsy/skyline/blob/master/src/analyzer/algorithms.py
– Three-sigma, Grubbs’, and other such algorithms assume a normal
distribution
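A minimal three-sigma sketch, stdlib only (the data and function name are invented for illustration). Like the stddev-based algorithms in Skyline, it implicitly assumes roughly normal data; on skewed or multimodal data it will over- or under-alert.

```python
# Three-sigma rule: flag points more than 3 standard deviations from the mean.
# Only meaningful if the data is approximately normally distributed.
from statistics import mean, pstdev

def three_sigma_outliers(series):
    """Return the points lying more than 3 population stddevs from the mean."""
    mu, sigma = mean(series), pstdev(series)
    if sigma == 0:
        return []
    return [x for x in series if abs(x - mu) > 3 * sigma]

normal_days = [100, 102, 98, 101, 99] * 6        # steady latency samples
print(three_sigma_outliers(normal_days + [500]))  # -> [500]
print(three_sigma_outliers(normal_days))          # -> []
```

Note that three-sigma has pitfalls beyond the normality assumption: in a short series, a lone extreme point inflates the standard deviation enough to mask itself, which is why tests like Grubbs' exist.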
12. Is all data important to collect?
– Two camps:
• Data is data, let’s collect and analyze everything and
figure out the trends.
• Not all data is important, so let’s figure out what’s
important first and understand the underlying model so
we don’t waste resources on the rest.
– Similar to the very public bun fight between Noam
Chomsky and Peter Norvig
• http://norvig.com/chomsky.html
– Unresolved as far as I know
14. We all want to automate
• Having humans in the way of detecting and
solving DevOps issues doesn’t scale.
• At some point, we need systems that can
detect anomalies before problems become
critical, and take appropriate action.
15. Open Loop Control System: Heating your house – the wrong way!
• Steps:
– Tweak heater input
– Get to ideal temperature
– Lock gas valve
– Hope nothing changes
[Block diagram: Controller (gas valve) → System (heater) → Sensor (thermometer)]
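The open-loop steps above can be contrasted with a closed-loop thermostat in a toy simulation. All dynamics and constants here are invented for illustration; the point is only that the open-loop controller drifts when conditions change, while the closed-loop one reads the sensor and recovers.

```python
# Toy house-heating model (assumed dynamics, not a real physics model):
# the heater adds heat, the walls leak it to a 5 °C outside; halfway
# through, a window opens and the heat loss doubles.

def simulate(closed_loop, steps=60):
    temp, target = 15.0, 21.0
    valve = 0.4  # open-loop setting: tuned once for the ORIGINAL heat loss
    for t in range(steps):
        loss = 0.05 if t < steps // 2 else 0.10   # window opens at midpoint
        if closed_loop:
            # Closed loop: thermostat reads the sensor every step
            valve = 1.0 if temp < target else 0.0
        temp += 2.0 * valve - loss * (temp - 5.0)
    return temp

print(round(simulate(closed_loop=True), 1))   # hovers near the 21 °C target
print(round(simulate(closed_loop=False), 1))  # drifts well below target
```

"Tweak, lock the valve, and hope nothing changes" is exactly the open-loop branch: once the heat loss doubles, nothing corrects for it.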
18. How much data do we need?
• Trend towards higher and higher sampling
rates in data collection
• Reminds me of Jorge Luis Borges’ story about
Funes the Memorious
– Perfect recollection of the slightest details of every
instant of his life, but he lost the capacity for
abstraction
• Our brain works on abstraction
– We notice patterns BECAUSE we can abstract
20. So, how much data DO you need?
– You don’t need more resolution than twice your
highest frequency (Nyquist–Shannon sampling
theorem)
– Most of the algorithms for analytics will smooth,
average, filter, and pre-process the data.
– Watch out for correlated metrics (e.g. used vs.
available memory)
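Both points can be sketched in a few lines (the signal, sampling rates, and memory figures are all invented for illustration):

```python
# (1) Nyquist-Shannon: sampling a slow-moving signal far faster than twice
#     its highest frequency adds volume, not information.
# (2) Correlated metrics: used + available memory = total, so collecting
#     both stores the same signal twice.
import math

# A metric that cycles once per hour, sampled once per minute (60x per cycle)
per_minute = [50 + 10 * math.sin(2 * math.pi * t / 60) for t in range(240)]

# Keeping every 10th sample (6x per cycle) still exceeds the Nyquist rate
downsampled = per_minute[::10]
print(len(per_minute), len(downsampled))  # 240 24

# Perfectly (negatively) correlated metrics: one fully determines the other
total_mb = 4096.0
used_mb = [1200.0, 1500.0, 2100.0, 1800.0]
available_mb = [total_mb - u for u in used_mb]  # redundant to collect both
```

The caveat, as the slide notes, is that smoothing, averaging, and filtering in the analytics pipeline already discard resolution, so paying to collect it in the first place buys little.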
21. More?
• I want to talk more about analytics, in more
depth, but time’s up!!
– (Actually John won’t let me)
• Come talk to me during the breaks!
• Thank you!